Conversation

@Sentimentron
Contributor

Results from silicon (bpp=3)

| CPU | Baseline | Result | Speedup |
| --- | --- | --- | --- |
| Arm Cortex A520 | 910.9 MiB/s | 2025.9 MiB/s | 122.40% |
| Arm Cortex X4 | 6551.8 MiB/s | 6313 MiB/s | -3.64% |
| Apple Silicon M2 | 5839.2 MiB/s | 5751.8 MiB/s | -1.33% |
| AMD EPYC 7B13 | 3830.6 MiB/s | 5472.9 MiB/s | 42.87% |

Results from silicon (bpp=4)

| CPU | Baseline | Result | Speedup |
| --- | --- | --- | --- |
| Arm Cortex A520 | 607.0 MiB/s | 3226.8 MiB/s | 431.62% |
| Arm Cortex X4 | 6551.8 MiB/s | 6313.0 MiB/s | 105.71% |
| Apple Silicon M2 | 5800.3 MiB/s | 10616.0 MiB/s | 83.03% |
| AMD EPYC 7B13 | 10796.0 MiB/s | 15268.0 MiB/s | 41.42% |

Opened as a draft until #632 is resolved.

@fintelia
Contributor

fintelia commented Dec 2, 2025

I see mixed results on my Zen 5 CPU.

In the bpp=3 case, throughput regresses 13%, from 9.2 GB/s to 8.0 GB/s, with the default target-cpu, and drops further to 6.4 GB/s with target-cpu=native (?).

However, for bpp=4 I see nearly double the performance, going from 10.4 GB/s to 19.4 GB/s. I also found I could get an additional 30% improvement, to 25.1 GB/s, by removing the data dependency between the two unrolled loop iterations. (There's probably a better way to splat the low 4 bytes of the SIMD vector to all the elements.)

 
```diff
             // Process chunk 2
             let mut x_vec2: SimdVector = SimdVector::from_slice(chunk2_slice);
-            let carry_in_vec2 = prev_pixel_val_for_chunk2.resize::<STRIDE_BYTES>(0u8);
-            x_vec2 = x_vec2 + carry_in_vec2;
             x_vec2 = x_vec2 + x_vec2.shift_elements_right::<BPP>(0u8);
             x_vec2 = x_vec2 + x_vec2.shift_elements_right::<{ 2 * BPP }>(0u8);
             x_vec2 = x_vec2 + x_vec2.shift_elements_right::<{ 4 * BPP }>(0u8);
             x_vec2 = x_vec2 + x_vec2.shift_elements_right::<{ 8 * BPP }>(0u8);
+
+            x_vec2 = x_vec2
+                + std::simd::simd_swizzle!(
+                    prev_pixel_val_for_chunk2.resize::<STRIDE_BYTES>(0u8),
+                    [
+                        0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0,
+                        1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1,
+                        2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3,
+                    ]
+                );
+
             prev_pixel_val = x_vec2.extract::<{ STRIDE_BYTES - BPP }, BPP>();
             x_vec2.copy_to_slice(chunk2_slice);
         }
```
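On the splat aside above: one possible alternative is to reinterpret the 4-byte carry pixel as a single u32, splat it across the u32 lanes, and view the result as bytes again. The sketch below is not the PR's code; it assumes BPP = 4, 64-byte vectors, and the nightly `ToBytes` trait in `std::simd`, and whether it beats the `simd_swizzle!` form would need measuring.

```rust
#![feature(portable_simd)]
use std::simd::{Simd, ToBytes};

// Sketch: broadcast one bpp=4 carry pixel to every pixel slot of a 64-byte
// vector by going through a u32 splat instead of a byte-level swizzle.
fn splat_carry(carry: Simd<u8, 4>) -> Simd<u8, 64> {
    // Read the four carry bytes as one u32 in native byte order.
    let pixel = u32::from_ne_bytes(carry.to_array());
    // Splat across sixteen u32 lanes, then view the lanes as 64 bytes again.
    Simd::<u32, 16>::splat(pixel).to_ne_bytes()
}

fn main() {
    let carry = Simd::from_array([1u8, 2, 3, 4]);
    let broadcast = splat_carry(carry);
    assert_eq!(&broadcast.to_array()[..8], &[1u8, 2, 3, 4, 1, 2, 3, 4][..]);
}
```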

Prefix sum improves performance on the A520 and EPYC by 122.40% and 42.87%, respectively.
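For context, the shift-and-add cascade in the diff above is a log-step prefix sum: it evaluates the left-neighbour recurrence raw[i] = filtered[i] + raw[i - bpp] (the shape of PNG's Sub filter) a whole vector at a time, with prev_pixel_val carrying the last pixel between chunks. A minimal scalar sketch of the same recurrence, with an illustrative function name rather than the PR's API:

```rust
// Scalar reference for the recurrence the SIMD code vectorises:
// raw[i] = filtered[i] + raw[i - bpp], i.e. a prefix sum with stride bpp.
fn unfilter_sub_scalar(row: &mut [u8], bpp: usize) {
    for i in bpp..row.len() {
        row[i] = row[i].wrapping_add(row[i - bpp]);
    }
}

fn main() {
    let mut row = [10u8, 20, 30, 40, 1, 1, 1, 1, 2, 2, 2, 2];
    unfilter_sub_scalar(&mut row, 4); // bpp = 4 (e.g. RGBA8)
    assert_eq!(row, [10, 20, 30, 40, 11, 21, 31, 41, 13, 23, 33, 43]);
}
```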
@Sentimentron
Contributor Author

Sentimentron commented Dec 3, 2025

Checking in with the current status on rustc 1.93.0-nightly (2a7c49606 2025-11-25).

Results from silicon (bpp=3)

| CPU | Baseline | Result | Speedup |
| --- | --- | --- | --- |
| Arm Cortex A520 | 950.92 MiB/s | 2109.7 MiB/s | 121.9% |
| Arm Cortex X4 | 5254.9 MiB/s | 6336.1 MiB/s | 20.06% |
| Apple Silicon M2 | 5828.8 MiB/s | 5741.5 MiB/s | -1.53% |
| Apple Silicon M4 Pro | 7740.9 MiB/s | 7518.4 MiB/s | -2.82% |
| AMD EPYC 7B13 | 3828.6 MiB/s | 6655.4 MiB/s | 73.3% |

Results from silicon (bpp=4)

| CPU | Baseline | Result | Speedup |
| --- | --- | --- | --- |
| Arm Cortex A520 | 634.61 MiB/s | 3368.5 MiB/s | 430.8% |
| Arm Cortex X4 | 5084.6 MiB/s | 10974.0 MiB/s | 115.8% |
| Apple Silicon M2 | 5795.6 MiB/s | 1059.7 MiB/s | 82.7% |
| Apple Silicon M4 Pro | 8359.9 MiB/s | 13615.0 MiB/s | 62.1% |
| AMD EPYC 7B13 | 10799.0 MiB/s | 18683.0 MiB/s | 72.95% |

(Both tables use the default Rust target-cpu setting.) I'll try the new simd_swizzle approach next.

Improves performance by around 40% on the EPYC system and 431% on the Cortex-A520.