FastLED 3.9.15

◆ wave8_transpose_16x4_bf1_pipe4()

FASTLED_FORCE_INLINE FL_IRAM FL_OPTIMIZE_FUNCTION void fl::detail::wave8_transpose_16x4_bf1_pipe4 ( const u8 lanes_a[16],
const u8 lanes_b[16],
const u8 lanes_c[16],
const u8 lanes_d[16],
u8 W0,
u8 W1,
u8 output_a[16 *sizeof(Wave8Byte)],
u8 output_b[16 *sizeof(Wave8Byte)],
u8 output_c[16 *sizeof(Wave8Byte)],
u8 output_d[16 *sizeof(Wave8Byte)] )

BF1 + pipe4: 4-position software-pipelined BF1 (#2548 deep-dive).

Combines BF1's algorithmic reduction (one transpose per byte position instead of eight) with pipe4's cross-position instruction-level parallelism (ILP). It was the fastest of all prototypes measured: 1 757 µs/frame versus 9 651 µs/frame for the baseline (5.49× faster).
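To make the buffer sizes in the signature concrete, the sketch below reproduces the index arithmetic that the function body uses when writing each output array (s*16 + p*2 + 0 or 1). The helper name out_index and the constant kBytesPerOutput are hypothetical and exist only for this illustration. That the resulting 128 bytes equal 16 * sizeof(Wave8Byte) would require sizeof(Wave8Byte) to be 8, which is an assumption not stated on this page.

```cpp
#include <cstddef>

// Hypothetical helper: byte offset written for symbol s (0..7),
// pulse p (0..7), and byte h (0 or 1) of the two-byte pair.
// Mirrors the expression s*16 + p*2 + {0,1} from the function body.
constexpr std::size_t out_index(int s, int p, int h) {
    return static_cast<std::size_t>(s * 16 + p * 2 + h);
}

// The largest offset written is for s = 7, p = 7, h = 1, so each
// output buffer must hold at least that many bytes plus one.
constexpr std::size_t kBytesPerOutput = out_index(7, 7, 1) + 1;

static_assert(kBytesPerOutput == 128,
              "8 symbols x 8 pulses x 2 bytes per output buffer");
```

Each of the four output buffers is written densely: consecutive symbols are 16 bytes apart and consecutive pulses are 2 bytes apart, so no offset is written twice.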

Definition at line 464 of file wave8.hpp.

{
    u8 d_mask[8];
    u8 m0_mask[8];
    const u8 D_byte = W0 ^ W1;
    for (int p = 0; p < 8; ++p) {
        const int shift = 7 - p;
        d_mask[p] = ((D_byte >> shift) & 1) ? 0xFFu : 0x00u;
        m0_mask[p] = ((W0 >> shift) & 1) ? 0xFFu : 0x00u;
    }
    u8 cols_a[16], cols_b[16], cols_c[16], cols_d[16];
    spread_transpose16_symbol(lanes_a, cols_a);
    spread_transpose16_symbol(lanes_b, cols_b);
    spread_transpose16_symbol(lanes_c, cols_c);
    spread_transpose16_symbol(lanes_d, cols_d);
    for (int s = 0; s < 8; ++s) {
        const u8 al = cols_a[2*s + 0], ah = cols_a[2*s + 1];
        const u8 bl = cols_b[2*s + 0], bh = cols_b[2*s + 1];
        const u8 cl = cols_c[2*s + 0], ch = cols_c[2*s + 1];
        const u8 dl = cols_d[2*s + 0], dh = cols_d[2*s + 1];
        for (int p = 0; p < 8; ++p) {
            const u8 dm = d_mask[p], mm = m0_mask[p];
            output_a[s*16 + p*2 + 0] = mm ^ (al & dm);
            output_a[s*16 + p*2 + 1] = mm ^ (ah & dm);
            output_b[s*16 + p*2 + 0] = mm ^ (bl & dm);
            output_b[s*16 + p*2 + 1] = mm ^ (bh & dm);
            output_c[s*16 + p*2 + 0] = mm ^ (cl & dm);
            output_c[s*16 + p*2 + 1] = mm ^ (ch & dm);
            output_d[s*16 + p*2 + 0] = mm ^ (dl & dm);
            output_d[s*16 + p*2 + 1] = mm ^ (dh & dm);
        }
    }
}
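The inner loop above relies on a branch-free select: for pulse p, each bit of a transposed column byte picks bit p of W1 where the column bit is 1 and bit p of W0 where it is 0. The following is a minimal standalone sketch of that identity, not part of FastLED. The names bit_to_mask, select_pulse, and select_pulse_reference are hypothetical, and u8 is assumed here to be std::uint8_t.

```cpp
#include <cstdint>

using u8 = std::uint8_t;  // assumption: stands in for the library's u8

// Expand bit p of w (MSB-first, so p = 0 is bit 7) into 0x00 or 0xFF,
// the same way the listing fills d_mask[] and m0_mask[].
inline u8 bit_to_mask(u8 w, int p) {
    return static_cast<u8>(((w >> (7 - p)) & 1) ? 0xFFu : 0x00u);
}

// Branch-free select as in the inner loop: mm ^ (col & dm).
// Where col has a 0 bit the result is W0's bit p; where col has a
// 1 bit the result is W0's bit p XOR (W0 ^ W1)'s bit p = W1's bit p.
inline u8 select_pulse(u8 col, u8 W0, u8 W1, int p) {
    const u8 dm = bit_to_mask(static_cast<u8>(W0 ^ W1), p);
    const u8 mm = bit_to_mask(W0, p);
    return static_cast<u8>(mm ^ (col & dm));
}

// Reference version with an explicit per-bit choice, for checking only.
inline u8 select_pulse_reference(u8 col, u8 W0, u8 W1, int p) {
    u8 out = 0;
    for (int bit = 0; bit < 8; ++bit) {
        const u8 w = ((col >> bit) & 1) ? W1 : W0;   // pick waveform byte
        const u8 v = static_cast<u8>((w >> (7 - p)) & 1);  // its bit p
        out = static_cast<u8>(out | (v << bit));
    }
    return out;
}
```

Because the two masks depend only on W0, W1, and p, they can be computed once up front, which is why the listing fills d_mask[] and m0_mask[] before the symbol loop.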
FASTLED_FORCE_INLINE FL_IRAM FL_OPTIMIZE_FUNCTION void spread_transpose16_symbol(const u8 l[16], u8 out[16])
Transpose one symbol of 16 lanes (16 input bytes) into 16 output bytes: 8 pulses × 2 bytes,...

References spread_transpose16_symbol(), fl::W0, and fl::W1.

Referenced by fl::wave8Transpose_16x4_bf1_pipe4().
