FastLED 3.9.15
FASTLED_FORCE_INLINE FL_IRAM FL_OPTIMIZE_FUNCTION void fl::detail::wave8_transpose_16x4_pipe4(
    const Wave8Byte lane_waves_a[16],
    const Wave8Byte lane_waves_b[16],
    const Wave8Byte lane_waves_c[16],
    const Wave8Byte lane_waves_d[16],
    u8 output_a[16 * sizeof(Wave8Byte)],
    u8 output_b[16 * sizeof(Wave8Byte)],
    u8 output_c[16 * sizeof(Wave8Byte)],
    u8 output_d[16 * sizeof(Wave8Byte)])
Pipe4: transposes 16 lanes × 4 byte positions in one fused call.
Bit-identical to four sequential wave8_transpose_16 calls. Extends the pipe2 idea to 4 byte positions, which was empirically the peak of the curve on the in-order RV32 P4 core (#2548): pipe2 saved 26% over baseline, pipe3 36%, and pipe4 41%, while pipe6 regressed to 94% because of register spills. Pipe4 stays within the 32-GPR budget: 4 × 4 = 16 OR-tree accumulators + ~8 misc GPRs = ~24 live registers; pipe6 would push this past 32.
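The fusion idea can be illustrated with a deliberately simplified sketch. This is not FastLED's implementation: the `toy_*` functions below are hypothetical, work on plain bytes rather than `Wave8Byte`, and use a naive bit loop instead of `spread_transpose16_symbol()`. The sketch only shows the property the description relies on: the four byte positions have independent accumulators, so one fused loop can interleave their work and still yield exactly the output of four sequential calls.

```cpp
#include <cstdint>
#include <cstring>

using std::uint8_t;
using std::uint16_t;
using std::uint32_t;

// Hypothetical toy transpose (assumption, not the wave8 layout):
// bit `lane` of out[bit] is bit `bit` of lanes[lane].
inline void toy_transpose_16(const uint8_t lanes[16], uint16_t out[8]) {
    for (int bit = 0; bit < 8; ++bit) {
        uint32_t acc = 0;
        for (int lane = 0; lane < 16; ++lane) {
            acc |= ((lanes[lane] >> bit) & 1u) << lane;
        }
        out[bit] = static_cast<uint16_t>(acc);
    }
}

// Fused variant: four independent accumulators advance in the same loop,
// giving the core four unrelated dependency chains to interleave.
inline void toy_transpose_16x4_fused(
        const uint8_t a[16], const uint8_t b[16],
        const uint8_t c[16], const uint8_t d[16],
        uint16_t out_a[8], uint16_t out_b[8],
        uint16_t out_c[8], uint16_t out_d[8]) {
    for (int bit = 0; bit < 8; ++bit) {
        uint32_t acc_a = 0, acc_b = 0, acc_c = 0, acc_d = 0;
        for (int lane = 0; lane < 16; ++lane) {
            acc_a |= ((a[lane] >> bit) & 1u) << lane;
            acc_b |= ((b[lane] >> bit) & 1u) << lane;
            acc_c |= ((c[lane] >> bit) & 1u) << lane;
            acc_d |= ((d[lane] >> bit) & 1u) << lane;
        }
        out_a[bit] = static_cast<uint16_t>(acc_a);
        out_b[bit] = static_cast<uint16_t>(acc_b);
        out_c[bit] = static_cast<uint16_t>(acc_c);
        out_d[bit] = static_cast<uint16_t>(acc_d);
    }
}

// Checks that the fused call matches four sequential calls bit for bit.
inline bool toy_fused_matches_sequential() {
    uint8_t in[4][16];
    for (int p = 0; p < 4; ++p) {
        for (int lane = 0; lane < 16; ++lane) {
            // Arbitrary deterministic test pattern.
            in[p][lane] = static_cast<uint8_t>(37 * lane + 101 * p + 13);
        }
    }
    uint16_t seq[4][8];
    uint16_t fused[4][8];
    for (int p = 0; p < 4; ++p) {
        toy_transpose_16(in[p], seq[p]);
    }
    toy_transpose_16x4_fused(in[0], in[1], in[2], in[3],
                             fused[0], fused[1], fused[2], fused[3]);
    return std::memcmp(seq, fused, sizeof(seq)) == 0;
}
```

Each extra fused position adds its own set of live accumulators, which is why widening the pipe eventually runs into the register budget described above.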
Definition at line 273 of file wave8.hpp.
References spread_transpose16_symbol().
Referenced by fl::wave8Transpose_16x4_pipe4().