◆ wave8Transpose_16x4_pipe4()

FL_OPTIMIZE_FUNCTION FL_IRAM void fl::wave8Transpose_16x4_pipe4(
    const u8 (&lanes_a)[16],
    const u8 (&lanes_b)[16],
    const u8 (&lanes_c)[16],
    const u8 (&lanes_d)[16],
    const Wave8ByteExpansionLut &lut,
    u8 (&output_a)[16 * sizeof(Wave8Byte)],
    u8 (&output_b)[16 * sizeof(Wave8Byte)],
    u8 (&output_c)[16 * sizeof(Wave8Byte)],
    u8 (&output_d)[16 * sizeof(Wave8Byte)])

Pipe4: transpose 16-lane × 4-byte-positions (#2548).

Output is bit-identical to four sequential wave8Transpose_16 calls. This variant is the peak of the cross-position ILP curve on RV32 P4: speedup vs baseline is +26% for pipe2, +36% for pipe3, and +41% for pipe4 (9 651 → 6 822 µs/frame). pipe6 and pipe8 regress to 94% because they exceed the 32-GPR budget and the compiler spills registers. The measured 6 822 µs/frame is 11% under the 7 680 µs WS2812B TX target, a comfortable margin for ISR-chunked streaming.
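The quoted percentages can be re-derived from the two frame times and the TX target. The sketch below is only an arithmetic check, not part of the library: the helper names are hypothetical. It also assumes the nominal WS2812B timing of 1.25 µs per bit and 24 bits per pixel, which this page does not state. Under that assumption the 7 680 µs target corresponds to 256 pixels per lane.

```cpp
#include <cassert>
#include <cmath>

// Figures quoted in the description above (microseconds per frame).
constexpr double kBaselineUs = 9651.0;
constexpr double kPipe4Us = 6822.0;
constexpr double kTxTargetUs = 7680.0;

// Throughput gain: how much more work fits in the same time.
constexpr double speedupPercent(double beforeUs, double afterUs) {
    return (beforeUs / afterUs - 1.0) * 100.0;
}

// Headroom: how far the measured time sits below the budget.
constexpr double marginPercent(double measuredUs, double budgetUs) {
    return (1.0 - measuredUs / budgetUs) * 100.0;
}

// ASSUMPTION (not from this page): 24 bits per pixel at 1.25 us per bit.
constexpr double pixelsForBudget(double budgetUs) {
    return budgetUs / (24.0 * 1.25);
}

// Hypothetical helper: confirms the rounded figures match the text.
inline void checkQuotedFigures() {
    assert(std::lround(speedupPercent(kBaselineUs, kPipe4Us)) == 41);
    assert(std::lround(marginPercent(kPipe4Us, kTxTargetUs)) == 11);
    assert(pixelsForBudget(kTxTargetUs) == 256.0);
}
```

Note that +41% is a throughput ratio (9 651 / 6 822). Expressed as a reduction in frame time, the same measurement is about 29%.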

Definition at line 215 of file wave8.cpp.hpp.

{
    Wave8Byte laneWaveformsA[16];
    Wave8Byte laneWaveformsB[16];
    Wave8Byte laneWaveformsC[16];
    Wave8Byte laneWaveformsD[16];
    for (int lane = 0; lane < 16; lane++) {
        detail::wave8_expand_byte(lanes_a[lane], lut, &laneWaveformsA[lane]);
    }
    for (int lane = 0; lane < 16; lane++) {
        detail::wave8_expand_byte(lanes_b[lane], lut, &laneWaveformsB[lane]);
    }
    for (int lane = 0; lane < 16; lane++) {
        detail::wave8_expand_byte(lanes_c[lane], lut, &laneWaveformsC[lane]);
    }
    for (int lane = 0; lane < 16; lane++) {
        detail::wave8_expand_byte(lanes_d[lane], lut, &laneWaveformsD[lane]);
    }
    detail::wave8_transpose_16x4_pipe4(laneWaveformsA, laneWaveformsB,
                                       laneWaveformsC, laneWaveformsD,
                                       output_a, output_b, output_c, output_d);
}

References FL_RESTRICT_PARAM, fl::detail::wave8_expand_byte(), and fl::detail::wave8_transpose_16x4_pipe4().
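The idea behind the pipe4 variant, fusing four independent byte positions into one loop body so the compiler can interleave their dependency chains, can be illustrated with a toy kernel. The sketch below is not the FastLED implementation. `transposeOne`, `transposeFour`, `makeLanes`, and the simplified bit transpose are hypothetical stand-ins, and the LUT expansion step is omitted. It only shows why the fused form can stay bit-identical to four sequential calls.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy types (hypothetical): 16 lanes of one byte in, 8 words of 16 bits out.
using Lanes = std::array<std::uint8_t, 16>;
using Words = std::array<std::uint16_t, 8>;

// Bit `bit` of `value`, moved to bit position `lane`.
inline std::uint16_t laneBit(std::uint8_t value, unsigned bit, unsigned lane) {
    return static_cast<std::uint16_t>(((value >> bit) & 1u) << lane);
}

// Reference: transpose one byte position across 16 lanes.
inline Words transposeOne(const Lanes& lanes) {
    Words out{};
    for (unsigned lane = 0; lane < 16; ++lane)
        for (unsigned bit = 0; bit < 8; ++bit)
            out[bit] |= laneBit(lanes[lane], bit, lane);
    return out;
}

// Fused: four byte positions per iteration. The four updates do not
// depend on each other, so they can be scheduled side by side.
inline void transposeFour(const Lanes& a, const Lanes& b,
                          const Lanes& c, const Lanes& d,
                          Words& outA, Words& outB, Words& outC, Words& outD) {
    outA = Words{}; outB = Words{}; outC = Words{}; outD = Words{};
    for (unsigned lane = 0; lane < 16; ++lane) {
        for (unsigned bit = 0; bit < 8; ++bit) {
            outA[bit] |= laneBit(a[lane], bit, lane);
            outB[bit] |= laneBit(b[lane], bit, lane);
            outC[bit] |= laneBit(c[lane], bit, lane);
            outD[bit] |= laneBit(d[lane], bit, lane);
        }
    }
}

// Deterministic test data (hypothetical helper).
inline Lanes makeLanes(unsigned seed) {
    Lanes lanes{};
    for (unsigned i = 0; i < 16; ++i)
        lanes[i] = static_cast<std::uint8_t>(seed * 37u + i * 11u);
    return lanes;
}

// Checks that the fused form matches four sequential reference calls.
inline void checkFusedMatchesSequential() {
    const Lanes a = makeLanes(1), b = makeLanes(2);
    const Lanes c = makeLanes(3), d = makeLanes(4);
    Words outA, outB, outC, outD;
    transposeFour(a, b, c, d, outA, outB, outC, outD);
    assert(outA == transposeOne(a));
    assert(outB == transposeOne(b));
    assert(outC == transposeOne(c));
    assert(outD == transposeOne(d));
}
```

Each fused position keeps its own accumulator, which is where register pressure comes from. That is in line with the description's note that wider variants (pipe6, pipe8) exceed the register budget and spill.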
