Conversation

@Sentimentron
Contributor

Results from silicon (bpp=3)

| CPU | Baseline | Result | Speedup |
| --- | --- | --- | --- |
| Arm Cortex A520 | 910.9 MiB/s | 2025.9 MiB/s | 122.40% |
| Arm Cortex X4 | 6551.8 MiB/s | 6313 MiB/s | -3.64% |
| Apple Silicon M2 | 5839.2 MiB/s | 5751.8 MiB/s | -1.33% |
| AMD EPYC 7B13 | 3830.6 MiB/s | 5472.9 MiB/s | 42.87% |

Results from silicon (bpp=4)

| CPU | Baseline | Result | Speedup |
| --- | --- | --- | --- |
| Arm Cortex A520 | 607.0 MiB/s | 3226.8 MiB/s | 431.62% |
| Arm Cortex X4 | 6551.8 MiB/s | 6313.0 MiB/s | 105.71% |
| Apple Silicon M2 | 5800.3 MiB/s | 10616.0 MiB/s | 83.03% |
| AMD EPYC 7B13 | 10796.0 MiB/s | 15268.0 MiB/s | 41.42% |

Opened as a draft until #632 is resolved.

@fintelia
Contributor

fintelia commented Dec 2, 2025

I see mixed results on my Zen 5 CPU.

In the bpp=3 case, throughput regresses 13%, from 9.2 GB/s to 8.0 GB/s, with the default target-cpu, and drops further to 6.4 GB/s with target-cpu=native (?).

However, for bpp=4 I see nearly double the performance, going from 10.4 GB/s to 19.4 GB/s. I also found I could get an additional 30% improvement, to 25.1 GB/s, by removing the data dependency between the two unrolled loop iterations. (There's probably a better way to splat the low 4 bytes of the SIMD vector to all the elements.)

 
```diff
             // Process chunk 2
             let mut x_vec2: SimdVector = SimdVector::from_slice(chunk2_slice);
-            let carry_in_vec2 = prev_pixel_val_for_chunk2.resize::<STRIDE_BYTES>(0u8);
-            x_vec2 = x_vec2 + carry_in_vec2;
             x_vec2 = x_vec2 + x_vec2.shift_elements_right::<BPP>(0u8);
             x_vec2 = x_vec2 + x_vec2.shift_elements_right::<{ 2 * BPP }>(0u8);
             x_vec2 = x_vec2 + x_vec2.shift_elements_right::<{ 4 * BPP }>(0u8);
             x_vec2 = x_vec2 + x_vec2.shift_elements_right::<{ 8 * BPP }>(0u8);
+
+            x_vec2 = x_vec2
+                + std::simd::simd_swizzle!(
+                    prev_pixel_val_for_chunk2.resize::<STRIDE_BYTES>(0u8),
+                    [
+                        0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0,
+                        1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1,
+                        2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3,
+                    ]
+                );
+
             prev_pixel_val = x_vec2.extract::<{ STRIDE_BYTES - BPP }, BPP>();
             x_vec2.copy_to_slice(chunk2_slice);
         }
```
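On the splat aside above: one possible alternative is to reinterpret the 4-byte carry pixel as a single u32, splat it across the u32 lanes, and view the result as bytes again. The sketch below is not the PR's code; it assumes BPP = 4, 64-byte vectors, and the nightly `ToBytes` trait in `std::simd`, and whether it beats the `simd_swizzle!` form would need measuring.

```rust
#![feature(portable_simd)]
use std::simd::{Simd, ToBytes};

// Sketch: broadcast one bpp=4 carry pixel to every pixel slot of a 64-byte
// vector by going through a u32 splat instead of a byte-level swizzle.
fn splat_carry(carry: Simd<u8, 4>) -> Simd<u8, 64> {
    // Read the four carry bytes as one u32 in native byte order.
    let pixel = u32::from_ne_bytes(carry.to_array());
    // Splat across sixteen u32 lanes, then view the lanes as 64 bytes again.
    Simd::<u32, 16>::splat(pixel).to_ne_bytes()
}

fn main() {
    let carry = Simd::from_array([1u8, 2, 3, 4]);
    let broadcast = splat_carry(carry);
    assert_eq!(&broadcast.to_array()[..8], &[1u8, 2, 3, 4, 1, 2, 3, 4][..]);
}
```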

Prefix sum improves performance on the A520 and EPYC by 122.40% and 42.87%, respectively.
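For context, the shift-and-add cascade in the diff above is a log-step prefix sum: it evaluates the left-neighbour recurrence raw[i] = filtered[i] + raw[i - bpp] (the shape of PNG's Sub filter) a whole vector at a time, with prev_pixel_val carrying the last pixel between chunks. A minimal scalar sketch of the same recurrence, with an illustrative function name rather than the PR's API:

```rust
// Scalar reference for the recurrence the SIMD code vectorises:
// raw[i] = filtered[i] + raw[i - bpp], i.e. a prefix sum with stride bpp.
fn unfilter_sub_scalar(row: &mut [u8], bpp: usize) {
    for i in bpp..row.len() {
        row[i] = row[i].wrapping_add(row[i - bpp]);
    }
}

fn main() {
    let mut row = [10u8, 20, 30, 40, 1, 1, 1, 1, 2, 2, 2, 2];
    unfilter_sub_scalar(&mut row, 4); // bpp = 4 (e.g. RGBA8)
    assert_eq!(row, [10, 20, 30, 40, 11, 21, 31, 41, 13, 23, 33, 43]);
}
```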
@Sentimentron
Contributor Author

Sentimentron commented Dec 3, 2025

Checking in with the current status on rustc 1.93.0-nightly (2a7c49606 2025-11-25).

Results from silicon (bpp=3)

| CPU | Baseline | Result | Speedup |
| --- | --- | --- | --- |
| Arm Cortex A520 | 950.92 MiB/s | 2109.7 MiB/s | 121.9% |
| Arm Cortex X4 | 5254.9 MiB/s | 6336.1 MiB/s | 20.06% |
| Apple Silicon M2 | 5828.8 MiB/s | 5741.5 MiB/s | -1.53% |
| Apple Silicon M4 Pro | 7740.9 MiB/s | 7518.4 MiB/s | -2.82% |
| AMD EPYC 7B13 | 3828.6 MiB/s | 6655.4 MiB/s | 73.3% |

Results from silicon (bpp=4)

| CPU | Baseline | Result | Speedup |
| --- | --- | --- | --- |
| Arm Cortex A520 | 634.61 MiB/s | 3368.5 MiB/s | 430.8% |
| Arm Cortex X4 | 5084.6 MiB/s | 10974.0 MiB/s | 115.8% |
| Apple Silicon M2 | 5795.6 MiB/s | 1059.7 MiB/s | 82.7% |
| Apple Silicon M4 Pro | 8359.9 MiB/s | 13615.0 MiB/s | 62.1% |
| AMD EPYC 7B13 | 10799.0 MiB/s | 18683.0 MiB/s | 72.95% |

(Both tables use the default Rust target-cpu setting.) I'll try the new simd_swizzle approach next.

Improves performance by around 40% on the EPYC system and 431% on the Cortex-A520.