
Conversation

@Sentimentron
Contributor

@Sentimentron Sentimentron commented Aug 28, 2025

With --features=unstable, this implementation will decode up to 16 pixels at once.

This reduces the number of cycles required to decode a 768px row by 20-30% on Arm systems, and by around 50-70% on x64 systems (see discussion).
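
For context, the core of the approach is a branch-free SIMD Paeth predictor. Below is a minimal sketch in portable_simd, not the exact code from this PR: the lane width, function name, and module paths (which vary across nightly versions) are illustrative only.

    #![feature(portable_simd)]
    use std::simd::{cmp::SimdPartialOrd, num::SimdInt, i16x16};

    // Illustrative branch-free Paeth predictor over lanes widened to i16 so
    // the intermediate a + b - c cannot overflow. The formulation and lane
    // count actually used in this PR may differ.
    fn paeth_predictor_simd(a: i16x16, b: i16x16, c: i16x16) -> i16x16 {
        // With p = a + b - c, the three distances simplify to:
        let pa = (b - c).abs();         // |p - a|
        let pb = (a - c).abs();         // |p - b|
        let pc = (a + b - c - c).abs(); // |p - c|
        let use_a = pa.simd_le(pb) & pa.simd_le(pc);
        let use_b = pb.simd_le(pc);
        use_a.select(a, use_b.select(b, c))
    }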

Results on an AMD EPYC 7B13 system

    $  taskset -c 64-68 cargo +nightly bench --features=benchmarks -- filter=Paeth/bpp=3
    unfilter/filter=Paeth/bpp=3
        time:   [18.987 µs 18.994 µs 19.001 µs]
        thrpt:  [616.74 MiB/s 616.96 MiB/s 617.19 MiB/s]

    $ taskset -c 64-68 cargo +nightly bench --features=benchmarks,unstable -- filter=Paeth/bpp=3
        unfilter/filter=Paeth/bpp=3
            time:   [15.355 µs 15.361 µs 15.367 µs]
            thrpt:  [762.59 MiB/s 762.89 MiB/s 763.18 MiB/s]
        change:
            time:   [-19.180% -19.087% -19.007%] (p = 0.00 < 0.05)
            thrpt:  [+23.468% +23.589% +23.732%]

@Sentimentron
Contributor Author

AI disclosure: I wrote an original sliding-window 3bpp Paeth portable_simd implementation and optimized it for best performance on the Cortex A520. I then used the Gemini family of LLMs to automatically refine the code to achieve the best possible code-generation and performance across all other micro-architectures. The PR is derived from that work, but includes documentation and other cleanups.

@Sentimentron Sentimentron changed the title from "perf: portable_simd 3bpp paeth filter" to "perf: portable_simd 3bpp paeth unfiltering" on Aug 28, 2025
Contributor

@fintelia fintelia left a comment


These are quite significant improvements! On my Zen 3 machine, I see a ~11% speed improvement compiling with the default x86-64 target, and a ~30% improvement when compiling for x86-64-v2 or x86-64-v3.

@197g
Member

197g commented Aug 29, 2025

So portable_simd means we only get instructions for the compile-time target architecture, with no runtime dispatch, hence the need for different targets? That's unfortunate. But for ~11% I'm willing to make a lot of temporary and less temporary compromises.

@Shnatsel
Member

That is correct, portable_simd does not do runtime feature selection.

It can be added as a layer on top of portable_simd but that will require some unsafe in the implementation. The multiversion crate provides a safe wrapper around that using a proc macro.

I suggest exploring multiversioning separately from this PR, since it is a mostly orthogonal concern.
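
For illustration, a rough sketch of what that could look like with the multiversion proc macro; the function name, signature, and target list below are hypothetical and not part of this PR:

    use multiversion::multiversion;

    // Compiles one clone of the function per listed target and picks the best
    // one at runtime via CPU feature detection. Targets and body are
    // illustrative only; the real unfiltering loop would live in the body.
    #[multiversion(targets("x86_64+avx2", "x86_64+sse4.1", "aarch64+neon"))]
    fn unfilter_paeth_row(prev: &[u8], current: &mut [u8]) {
        let _ = (prev, current); // placeholder for the portable_simd loop
    }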

@Shnatsel
Member

We never really got 3bpp Paeth to vectorize, so it makes sense that large gains are possible here.

On a desktop Zen4 the difference is under 5%, but at least it seems to be stable and not noise. "My CPU is too good for optimization work" is a weird problem to have.

@Shnatsel
Member

Bad news: this is a 10% regression on an Apple M4 laptop.

Measured with cargo +nightly bench --features=benchmarks,unstable -- filter=Paeth/bpp=3

@Sentimentron
Contributor Author

Sentimentron commented Aug 30, 2025

Yes, looks like I get a similar result on my personal M2 laptop:

unfilter/filter=Paeth/bpp=3
                        time:   [19.663 µs 19.667 µs 19.671 µs]
                        thrpt:  [595.75 MiB/s 595.87 MiB/s 595.98 MiB/s]
                 change:
                        time:   [+11.301% +11.404% +11.494%] (p = 0.00 < 0.05)
                        thrpt:  [-10.309% -10.236% -10.154%]
                        Performance has regressed.

I don't know if the LLVM cost model for the Apple Silicon CPUs is in any way accurate (if it is, I might be able to fix it), but I think we can gate this off on macOS for now.

@Shnatsel
Member

Apple uses LLVM for all their C/C++/Objective-C/Swift compilation, so I'd be very much surprised if the LLVM cost model for Apple was way off.

I am curious what your microarchitecture simulator (llvm-mca?) has to say about this code on Apple systems. If it still shows a performance improvement on Apple, that's a major red flag for its reliability.

@Sentimentron
Contributor Author

Looking at the public LLVM cost model, it seems like all the Apple cores are modelled using this SchedCyclone definition:

https://github.com/llvm/llvm-project/blame/main/llvm/lib/Target/AArch64/AArch64SchedCyclone.td

It doesn't seem to have received any significant updates in the last 11 years or so. It's possible that they do this properly in their proprietary Clang, but I don't think the LLVM MCA-based simulator that I can run will accurately model this processor. This work is mostly about lifting up the Arm low-end (Cortex A520/A35/A53), which are all relatively simple to model - any extra targets that benefit are just a bonus!

@Shnatsel
Member

I wonder if this is at all beneficial in practice on lower-end ARM, since older SoCs are heavily bottlenecked by memory bandwidth, so microarchitectural optimizations might be entirely irrelevant.

It's a hassle to get benchmarks from Android, but maybe we could get this benchmarked on a Raspberry Pi or some such?

@Sentimentron
Contributor Author

So I got it running on an Android system (specifically a Pixel 10), and here's what came back. I'll start with the old code on the little core:

$ simpleperf stat -e instructions,cpu-cycles,stalled-cycles-backend,bus-cycles --per-core taskset 1 ./unfilter-841f*
unfilter/filter=Paeth/bpp=3
                        time:   [68.486 µs 68.523 µs 68.562 µs]
                        thrpt:  [170.92 MiB/s 171.02 MiB/s 171.11 MiB/s]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

Performance counter statistics:

# cpu           count  event_name               # count / runtime
  0    35,301,571,758  instructions             # 2.737 G/sec  
  2        15,500,501  instructions             # 1.877 G/sec  
  0    28,964,228,993  cpu-cycles               # 2.245909 GHz 
  2        10,455,860  cpu-cycles               # 1.266378 GHz 
  0                 0  stalled-cycles-backend   # 0.000 /sec   
  2         3,113,045  stalled-cycles-backend   # 377.041 M/sec
  0    28,964,229,818  bus-cycles               # 2.246 G/sec  
  2        10,455,852  bus-cycles               # 1.266 G/sec  

And here's the new code on the little core:

$ simpleperf stat -a taskset 1 ./unfilter-d7ac32ab9f97a889 --bench filter=Paeth/bpp=3
unfilter/filter=Paeth/bpp=3
                        time:   [52.192 µs 52.221 µs 52.252 µs]
                        thrpt:  [224.27 MiB/s 224.40 MiB/s 224.53 MiB/s]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

Performance counter statistics:

# cpu           count  event_name               # count / runtime              
  0    26,091,209,126  instructions             # 1.006585 cycles per instruction 
  1        29,086,319  instructions             # 6.070105 cycles per instruction 
  2        72,566,321  instructions             # 1.039818 cycles per instruction 
  3        68,670,799  instructions             # 1.122647 cycles per instruction 
  4        69,968,132  instructions             # 1.105196 cycles per instruction 
  5        21,288,155  instructions             # 1.590875 cycles per instruction 
  6        20,176,375  instructions             # 1.784563 cycles per instruction 
  7           372,543  instructions             # 11.952097 cycles per instruction
  0    26,263,015,945  cpu-cycles               # 2.238589 GHz                    
  1       176,557,011  cpu-cycles               # 0.015049 GHz                    
  2        75,455,791  cpu-cycles               # 0.006432 GHz                    
  3        77,093,081  cpu-cycles               # 0.006571 GHz                    
  4        77,328,509  cpu-cycles               # 0.006591 GHz                    
  5        33,866,802  cpu-cycles               # 0.002887 GHz                    
  6        36,006,015  cpu-cycles               # 0.003069 GHz                    
  7         4,452,670  cpu-cycles               # 0.000380 GHz                    
  0                 0  stalled-cycles-backend   # 0.000 /sec                      
  1                 0  stalled-cycles-backend   # 0.000 /sec                      
  2        19,623,562  stalled-cycles-backend   # 1.673 M/sec                     
  3        18,591,696  stalled-cycles-backend   # 1.585 M/sec                     
  4        21,507,071  stalled-cycles-backend   # 1.833 M/sec                     
  5        12,880,399  stalled-cycles-backend   # 1.098 M/sec                     
  6        15,129,010  stalled-cycles-backend   # 1.289 M/sec                     
  7         2,473,048  stalled-cycles-backend   # 210.772 K/sec                   
  0    26,261,083,597  bus-cycles               # 2.238 G/sec                     
  1       176,557,056  bus-cycles               # 15.047 M/sec                    
  2        74,951,477  bus-cycles               # 6.388 M/sec                     
  3        75,946,519  bus-cycles               # 6.472 M/sec                     
  4        76,761,850  bus-cycles               # 6.542 M/sec                     
  5        33,879,962  bus-cycles               # 2.887 M/sec                     
  6        35,922,982  bus-cycles               # 3.061 M/sec                     
  7         4,511,325  bus-cycles               # 384.453 K/sec  

And now the big core, first the old code:

Total test time: 12.928474 seconds.
$ simpleperf stat -e instructions,cpu-cycles,stalled-cycles-backend,bus-cycles -a taskset 80 ./unfilter-841f40*
unfilter/filter=Paeth/bpp=3
                        time:   [16.658 µs 16.664 µs 16.669 µs]
                        thrpt:  [703.01 MiB/s 703.25 MiB/s 703.48 MiB/s]
                 change:
                        time:   [-75.813% -75.733% -75.680%] (p = 0.00 < 0.05)
                        thrpt:  [+311.19% +312.08% +313.44%]
                        Performance has improved.
# cpu            count  event_name               # count / runtime             
  0         28,787,130  instructions             # 8.334599 cycles per instruction
  1         10,111,115  instructions             # 3.952444 cycles per instruction
  2         58,626,139  instructions             # 0.956735 cycles per instruction
  3         53,388,940  instructions             # 1.069962 cycles per instruction
  4         46,872,058  instructions             # 1.129856 cycles per instruction
  5         19,913,000  instructions             # 1.609684 cycles per instruction
  6         46,021,462  instructions             # 3.058840 cycles per instruction
  7    122,111,879,028  instructions             # 0.313492 cycles per instruction
  0        239,929,199  cpu-cycles               # 0.023653 GHz                   
  1         39,963,619  cpu-cycles               # 0.003940 GHz                   
  2         56,089,706  cpu-cycles               # 0.005529 GHz                   
  3         57,124,140  cpu-cycles               # 0.005631 GHz                   
  4         52,958,675  cpu-cycles               # 0.005221 GHz                   
  5         32,053,640  cpu-cycles               # 0.003160 GHz                   
  6        140,772,280  cpu-cycles               # 0.013877 GHz                   
  7     38,281,114,095  cpu-cycles               # 3.773429 GHz                   
  0                  0  stalled-cycles-backend   # 0.000 /sec                     
  1                  0  stalled-cycles-backend   # 0.000 /sec                     
  2         14,569,942  stalled-cycles-backend   # 1.436 M/sec                    
  3         13,572,410  stalled-cycles-backend   # 1.338 M/sec                    
  4         12,873,788  stalled-cycles-backend   # 1.269 M/sec                    
  5         11,380,351  stalled-cycles-backend   # 1.122 M/sec                    
  6         55,503,126  stalled-cycles-backend   # 5.471 M/sec                    
  7     11,467,846,471  stalled-cycles-backend   # 1.130 G/sec                    
  0        242,286,605  bus-cycles               # 23.884 M/sec                   
  1         40,022,922  bus-cycles               # 3.945 M/sec                    
  2         56,147,545  bus-cycles               # 5.535 M/sec                    
  3         57,267,840  bus-cycles               # 5.645 M/sec                    
  4         53,019,129  bus-cycles               # 5.226 M/sec                    
  5         32,028,206  bus-cycles               # 3.157 M/sec                    
  6        141,003,152  bus-cycles               # 13.900 M/sec                   
  7     38,281,309,463  bus-cycles               # 3.774 G/sec 

And now the new:

Total test time: 11.718414 seconds.
-a taskset 80 ./unfilter-d7ac32ab9f97a889 --bench filter=Paeth/bpp=3
unfilter/filter=Paeth/bpp=3
                        time:   [18.041 µs 18.042 µs 18.043 µs]
                        thrpt:  [649.50 MiB/s 649.54 MiB/s 649.58 MiB/s]
                 change:
                        time:   [-65.676% -65.535% -65.445%] (p = 0.00 < 0.05)
                        thrpt:  [+189.39% +190.15% +191.34%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  1 (1.00%) high mild
  3 (3.00%) high severe

Performance counter statistics:

# cpu           count  event_name               # count / runtime             
  0        40,163,354  instructions             # 6.407979 cycles per instruction
  1        14,727,533  instructions             # 3.544385 cycles per instruction
  2        52,998,332  instructions             # 1.031884 cycles per instruction
  3        87,045,464  instructions             # 0.801187 cycles per instruction
  4        39,933,075  instructions             # 1.164127 cycles per instruction
  5        37,973,836  instructions             # 1.799052 cycles per instruction
  6        15,332,612  instructions             # 7.761318 cycles per instruction
  7    72,310,059,395  instructions             # 0.546846 cycles per instruction
  0       257,365,924  cpu-cycles               # 0.024565 GHz                   
  1        52,200,045  cpu-cycles               # 0.004982 GHz                   
  2        54,688,142  cpu-cycles               # 0.005220 GHz                   
  3        69,739,711  cpu-cycles               # 0.006657 GHz                   
  4        46,487,170  cpu-cycles               # 0.004437 GHz                   
  5        68,316,906  cpu-cycles               # 0.006521 GHz                   
  6       119,001,272  cpu-cycles               # 0.011358 GHz                   
  7    39,542,496,810  cpu-cycles               # 3.774136 GHz                   
  0                 0  stalled-cycles-backend   # 0.000 /sec                     
  1                 0  stalled-cycles-backend   # 0.000 /sec                     
  2        13,694,794  stalled-cycles-backend   # 1.307 M/sec                    
  3        17,640,685  stalled-cycles-backend   # 1.684 M/sec                    
  4        12,133,632  stalled-cycles-backend   # 1.158 M/sec                    
  5        30,350,764  stalled-cycles-backend   # 2.897 M/sec                    
  6        45,669,009  stalled-cycles-backend   # 4.359 M/sec                    
  7    16,945,062,179  stalled-cycles-backend   # 1.617 G/sec                    
  0       258,026,474  bus-cycles               # 24.627 M/sec                   
  1        52,237,438  bus-cycles               # 4.986 M/sec                    
  2        55,064,781  bus-cycles               # 5.256 M/sec                    
  3        69,774,618  bus-cycles               # 6.660 M/sec                    
  4        46,100,141  bus-cycles               # 4.400 M/sec                    
  5        68,377,583  bus-cycles               # 6.526 M/sec                    
  6       118,919,675  bus-cycles               # 11.350 M/sec                   
  7    39,542,552,563  bus-cycles               # 3.774 G/sec                    

What I think this could be indicating is that the penalty from unaligned loads/stores might be a bit higher on the big cores: there's an uplift in backend-stalled cycles on the big core. Just a theory.

Sometimes, when doing this kind of optimization, we have to make the decision about whether to raise the ceiling or raise the floor: a 30% uplift on a small core (usually around 1/3rd the speed of a big core) might be a lot more noticeable than an 8% loss on a big core.

@Shnatsel
Copy link
Member

Does PNG decoding get scheduled onto little cores often? I would imagine that interactive workloads like web browsing would run mostly on the big cores, with the little ones mostly getting non-interactive background processes.

@fintelia
Copy link
Contributor

The last time we had a portable SIMD unfiltering implementation, we did a bunch of benchmarking before adding it. But then the scalar implementations improved, new versions of rustc and LLVM came out, and we discovered that it no longer seemed to help. Since our measurements pretty much just come from whichever CPUs contributors happen to test on, part of the difference may simply have been which processors were tested.

Is there a way we can avoid repeating this process?

@awxkee

awxkee commented Aug 30, 2025

Making effective NEON code for a series as small as [u8; 3] is incredibly complicated even with direct intrinsics; I have almost never had success with this. NEON instructions are power hungry, so on big cores they consume roughly twice the energy of cheap scalar instructions and produce more heat (and the number of instructions generated by this generic SIMD is unlikely to be half that of the scalar code). As a result, this can lead to throttling.

Does PNG decoding get scheduled onto little cores often? I would imagine that interactive workloads like web browsing would run mostly on the big cores, with the little ones mostly getting non-interactive background processes.

It starts on the little cores, and if the workload does not decrease, it switches to the big cores (if not pinned). After that, it may throttle. That is a rough sketch of how this works; however, it also depends on the OS, the device, and other factors.

@197g
Member

197g commented Aug 30, 2025

I'm starting to wonder: what if we shipped the benchmark loops together with the library? Then we could have an interface that determines the optimal implementation based on your actual target environment, and we wouldn't need to invest as much guesswork into excluding some implementations from the library. Especially with the vast divide from the big-little effect, it seems impossible to determine this at compile time; there's a lot of interplay between core selection and thermal/power throttling that will depend a lot even on the SMB design. That's something you can only optimize at a system level.

Of course, we still want to ship good code, and this wouldn't resolve how to answer #632 (comment), i.e. how the static default option should be determined. It could mean, however, that users can be more flexible and maybe report their own benchmarks back to us (the Chromium scale experiment was quite enlightening, i.e. surprisingly different, when compared to the rationales we got from our internal results).

@Sentimentron
Contributor Author

Couple of observations:

  • Looks like I can get 2-3% back on Apple Silicon by increasing the number of pixels per loop from 16 to 21; it's not yet known what effect that has on other micro-architectures.
  • Apple Silicon in particular seems ultra-sensitive to the formulation of paeth_predictor_simd, with this PR containing the fastest version currently known (adopting STBI's formulation, FPNGE's, and some fairly logical tweaks were all worse). This indicates to me that we're both memory and compute bound at different parts of the calculation.

@Shnatsel
Member

Shnatsel commented Aug 31, 2025

I'm curious what the microarchitecture simulator has to say about #635 which significantly improves benchmarks for x86_64, Apple M3 and gives an especially huge boost to Graviton 3. I misread the benchmarks, it actually regresses ARM a lot. Never mind.

@Sentimentron
Contributor Author

I'm curious what the microarchitecture simulator has to say about #635 which significantly improves benchmarks for x86_64, Apple M3 and gives an especially huge boost to Graviton 3. I misread the benchmarks, it actually regresses ARM a lot. Never mind.

I was curious too. Testing #635, the micro-architecture simulator indicates this:

| Filter | Arm Cortex A520 | Arm Cortex X4 | Intel Arrow Lake | AMD Zen 5 | Arm Cortex A520 (baseline cycles) | Arm Cortex X4 (baseline cycles) | Intel Arrow Lake (baseline cycles) | AMD Zen 5 (baseline cycles) |
|---|---|---|---|---|---|---|---|---|
| Paeth (3bpp) | -13.92% | -18.52% | 1.35% | 0.20% | 1073543 | 520265 | 1158373 | 505037 |
| Paeth (4bpp) | 9.53% | -7.49% | 6.28% | -4.25% | 885251 | 288701 | 413955 | 259222 |

The PR doesn't say which CPU it was measured on, but the results indicate a 6% uplift on both bpp=3 and bpp=4. I've measured it on my EPYC system and it yields about a 2% uplift. It's not surprising that the micro-arch simulation is a little off, for the reasons I've discussed (it assumes an essentially "perfect" path to main memory with no latency): it reflects what LLVM's cost model thinks the code will cost to execute. A theoretically worse solution can be better in practice as long as it decreases stalled cycles.

@Shnatsel
Member

Thank you! Do I understand correctly that on the EPYC 7B13 the portable_simd formulation in this PR is still worthwhile even after #635?

Also, sorry about the conflicts that the merge-and-revert has caused 😞

@Sentimentron
Contributor Author

From what I can tell, yes it's still worth it for bpp=3:

unfilter/filter=Paeth/bpp=3
                        time:   [15.227 µs 15.239 µs 15.256 µs]
                        thrpt:  [768.15 MiB/s 768.99 MiB/s 769.61 MiB/s]
                 change:
                        time:   [−15.119% −15.050% −14.977%] (p = 0.00 < 0.05)
                        thrpt:  [+17.615% +17.716% +17.812%]

@awxkee

awxkee commented Aug 31, 2025

FWIW
To make this PR effective on NEON, it's necessary to figure out a way to utilize the pipeline and instruction scheduling to compensate for the introduced latency. On AArch64, you can't just generate nearly the same amount of code with more power-hungry instructions and expect guaranteed performance improvements. Some gains might appear sometimes, but they often come at the cost of shifting work from one bottleneck to another.
For example, instead of processing pixel(row, 0), pixel(row, 1), pixel(row, 2), call process_pixels(row0, row1, row2, row3, 0), process_pixels(row0, row1, row2, row3, 1) to give LLVM a chance to schedule instructions on the NEON pipeline, since multiple rows will be processed within the same latency (as they do not block each other) if there is more than one SIMD unit on the chip.

@Sentimentron
Contributor Author

Yes, you've very succinctly highlighted the main problem with optimizing anything for AArch64: there are some wildly different micro-architectures, and it's not always possible to get an uplift on everything in all circumstances. There are very big out-of-order cores like the X4 where instruction scheduling doesn't matter so much, and some that are effectively in-order (though dual-issue) like the Cortex-A53/Cortex-A35/Cortex-A520 (little family), where instruction scheduling matters an awful lot. The little cores tend to be 30-40% of the performance of the big cores. I'm intrigued by the idea of breaking the data dependency vertically, but I think it would probably work best for really small images due to the data locality.

Disclosure: I worked at Arm for around 10 years on AArch64 software optimization. I currently work in Google's Pixel software division on optimizing Chrome's performance on AArch64. This contribution is on behalf of my current employer.

@Sentimentron
Contributor Author

Although, @awxkee's comment does have some truth to it. Switching off SLP auto-vectorization entirely (i.e. no NEON generated):

$ RUSTFLAGS="-C llvm-args=--slp-threshold=30000" cargo +nightly bench --features=benchmarks -- filter=Paeth/bpp=3
unfilter/filter=Paeth/bpp=3
                        time:   [14.587 µs 14.619 µs 14.675 µs]
                        thrpt:  [798.57 MiB/s 801.59 MiB/s 803.35 MiB/s]
                 change:
                        time:   [-17.228% -17.138% -16.971%] (p = 0.00 < 0.05)
                        thrpt:  [+20.440% +20.683% +20.813%]
                        Performance has improved.

This indicates that... yes, someone should definitely try and model Apple Silicon a bit better in upstream LLVM: it's a very unusual CPU.

@Sentimentron
Contributor Author

OK, some more results with and without auto-vectorization:

[image: benchmark results with and without auto-vectorization]

I'd say this is low-quality evidence so far, but it does definitely indicate that SLP auto-vectorization is not an automatic win over merely using scalar instructions for this filter.

@awxkee

awxkee commented Sep 2, 2025

To have reliable improvements with SIMD, where after launch you get improvements in almost all cases, it is necessary to satisfy one or more of the following points (roughly):

  1. Process significantly more data with approximately the same number of instructions to overcome penalties (technically, accounting for their latencies and throughput, but since we aren't writing direct intrinsics here, just "instruction count").
  2. Utilize the pipeline more effectively than before.
  3. Break data dependencies and process in parallel data chunks that would otherwise be sequential.
  4. Use data preparation and DSP instructions specific to the task: pmulhrsw, sdot, udot, pmaddubsw, sqrdmlah, etc.
  5. Process more data with fewer instructions.

I believe we all see that point 2 is covered for ARM little cores. You've tried to unroll it, but it didn't work out for other platforms.

I don't believe any of the baseline points are reliably covered in the other cases at the current state, so it is debatable whether it should work. Without some reliable baseline plan, it produces mixed results due to compiler and platform differences, the OS, and other factors. At the moment, it tries to do the same thing that LLVM does, and LLVM appears to be competitive in this area.

@Sentimentron
Contributor Author

@awxkee thank you for the discussion.

Doing this (or not) is a strategic trade-off. The Paeth filter's unbreakable top-to-bottom, left-to-right data dependency and its many formulations make it really challenging to optimize (especially the 3bpp version). With that constraint in mind, this implementation succeeds in its primary goal: delivering a significant performance improvement for low-end Arm cores and most x64 systems, which will benefit many users. By explicitly writing a SIMD-style solution, it delivers reasonable performance irrespective of whether LLVM's SLP auto-vectorization is working effectively, without the maintenance overhead of architecture-specific intrinsics.

To handle the regressions, my preferred solution is to gate the new code behind a cfg flag and switch it off on platforms like Apple Silicon. Let me know your thoughts on this approach.

@awxkee

awxkee commented Sep 2, 2025

To handle the regressions, my preferred solution is to gate the new code behind a cfg flag and switch it off on platforms like Apple Silicon. Let me know your thoughts on this approach.

It needs sign-off from one of the maintainers. I’m rather interested to see if we can make it more reliable :)

From my side, I'd prefer this in intrinsics, where there are more opportunities to do neat work and better chances of making something reliable for the long run.

However, at the moment I believe it is worth it for budget ARM CPUs (if Chrome compiles it using nightly with this feature, of course). It's a bit unfortunate that in most cases high-end devices will likely see a small performance penalty, but considering that the number of low-end devices is significantly higher than that of high-end ones, I think it's OK. I also think that if someone compiles the crate with AVX-512 support, I'd expect a regression there on some CPUs/LLVM/Rust versions (this is just a thought).

Those trade-offs are better discussed with the maintainers, to decide whether they want to support those configs.

@Sentimentron
Contributor Author

Checking in on the 21-pixel-stride idea to reduce the regression on Apple Silicon: it seems to decrease the gain on the Cortex A520 (from 224 MiB/s to 214 MiB/s) and is near-identical on the X4. Conclusion: probably not worth it; 16 pixels seems to be the optimal stride.

@197g
Member

197g commented Sep 2, 2025

To handle the regressions, my preferred solution is to gate the new code behind a cfg flag and switch it off on platforms like Apple Silicon. Let me know your thoughts on this approach.

Features in particular would not be a good solution for this. A cfg is a bit better but I have some reservations about setting CI up for something like that. We can have quite a few filter implementations live in the crate simultaneously (just gated by architecture), even some that are not active in any configuration except for testing.

Given the micro-architectural discussions above and that we don't know what environment everyone is using image with, some form of configuration mechanism to set up the filter function pointers at runtime would be appealing anyway. Then we merely use the existing unstable feature flag to make some options available through the public API.
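
For what it's worth, a rough sketch of that runtime-configuration idea; all of the names below are hypothetical and not part of the png crate's actual API:

    /// Unfilter function signature shared by all implementations.
    type UnfilterFn = fn(prev: &[u8], cur: &mut [u8]);

    fn paeth_bpp3_scalar(_prev: &[u8], _cur: &mut [u8]) {
        // default, auto-vectorized scalar implementation
    }

    fn paeth_bpp3_portable_simd(_prev: &[u8], _cur: &mut [u8]) {
        // portable_simd implementation gated behind the `unstable` feature
    }

    /// Table of filter function pointers that a caller (or a shipped
    /// benchmark loop) could override for its actual environment.
    pub struct FilterDispatch {
        pub paeth_bpp3: UnfilterFn,
    }

    impl Default for FilterDispatch {
        fn default() -> Self {
            FilterDispatch { paeth_bpp3: paeth_bpp3_scalar }
        }
    }

    /// Pick the faster variant based on a measurement taken at runtime.
    pub fn configure(simd_is_faster: bool) -> FilterDispatch {
        let mut dispatch = FilterDispatch::default();
        if simd_is_faster {
            dispatch.paeth_bpp3 = paeth_bpp3_portable_simd;
        }
        dispatch
    }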

@Sentimentron
Contributor Author

@197g that's an interesting idea, and it brings up an important point about how we should handle performance features.

The unstable feature flag, in my opinion, should serve as a well-managed "expert mode" for sophisticated users who are willing to use a nightly compiler to get bleeding-edge performance. It's not a guarantee of universal improvement, but it's an opportunity to gain access to something potentially faster. However, we shouldn't put something in that mode if it's definitely known to regress a given platform: it breaks the implicit good-faith promise that the unstable feature makes. I think that cfg(not(any(target_os="ios", target_os="macos"))) is OK for this: the code can still be tested on macOS in this config, it just won't be used when decoding images.

Should I proceed with gating the code off for Apple Silicon?

@Shnatsel
Member

Shnatsel commented Sep 3, 2025

Apple Silicon also has big and little cores. I've only benchmarked with the built-in harness which runs in a loop for a long time, so I assume it gets placed on the big core. It's possible that on the little cores this would be beneficial on Apple Silicon; I think we should benchmark with core pinning before gating it off.

@Sentimentron Sentimentron force-pushed the portable_simd-paeth-3bpp branch from e980731 to f6ba382 on September 3, 2025 20:55
@Sentimentron
Contributor Author

I found a couple of ways to try to measure this; the first is using the taskpolicy command:

$ taskpolicy -b cargo +nightly bench --features=benchmarks -- filter=Paeth/bpp=3
unfilter/filter=Paeth/bpp=3
                        time:   [70.397 µs 70.568 µs 70.719 µs]
                        thrpt:  [165.71 MiB/s 166.06 MiB/s 166.47 MiB/s]
                 change:
                        time:   [+0.0841% +2.6611% +5.2874%] (p = 0.04 < 0.05)
                        thrpt:  [−5.0219% −2.5921% −0.0841%]
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  9 (9.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

$ taskpolicy -b cargo +nightly bench --features=benchmarks,unstable -- filter=Paeth/bpp=3
unfilter/filter=Paeth/bpp=3
                        time:   [83.191 µs 83.257 µs 83.331 µs]
                        thrpt:  [140.63 MiB/s 140.75 MiB/s 140.87 MiB/s]
                 change:
                        time:   [+18.000% +19.620% +21.775%] (p = 0.00 < 0.05)
                        thrpt:  [−17.881% −16.402% −15.254%]
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) low severe

That's at background QoS level (so barely idle) - it's actually quite comparable to the performance of a Cortex A520, despite being a vastly more powerful chip. Playing with the latency and throughput tiers did not reveal any interesting behaviour. The other option seems to be putting the device in low power mode:

$ taskpolicy -a cargo +nightly bench --features=benchmarks -- filter=Paeth/bpp=3
                        time:   [31.426 µs 31.440 µs 31.459 µs]
                        thrpt:  [372.51 MiB/s 372.73 MiB/s 372.90 MiB/s]
                 change:
                        time:   [−0.9927% −0.6168% −0.3670%] (p = 0.00 < 0.05)
                        thrpt:  [+0.3683% +0.6206% +1.0026%]
                        Change within noise threshold.

$ taskpolicy -a cargo +nightly bench --features=benchmarks,unstable -- filter=Paeth/bpp=3
unfilter/filter=Paeth/bpp=3
                        time:   [35.013 µs 35.018 µs 35.024 µs]
                        thrpt:  [334.59 MiB/s 334.65 MiB/s 334.70 MiB/s]
                 change:
                        time:   [+11.368% +11.424% +11.485%] (p = 0.00 < 0.05)
                        thrpt:  [−10.302% −10.253% −10.208%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

I'm not sure whether this means background QoS = efficiency core at the lowest frequency and low power mode = efficiency core at the higher frequency, or low power mode = performance core at the lowest frequency.

But basically, it seems that whatever issue is affecting Apple Silicon is worse on the efficiency core.

@kornelski
Contributor

kornelski commented Sep 6, 2025

This can be automatically hidden from Apple Silicon with #[cfg(target_vendor = "apple")] without requiring any Cargo features or custom cfgs. Luckily Apple doesn't use any other ARM CPUs. LLVM's auto-vectorization changes more often than CPU architectures, so a cfg(target_vendor = "apple") shouldn't be more of a problem long-term than LLVM already is.
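
A sketch of how that gating could look; the function names and module layout below are hypothetical and do not match the PR's actual code:

    // Use the portable_simd path everywhere except Apple targets, where the
    // auto-vectorized scalar path benchmarked faster. Names are illustrative.
    #[cfg(all(feature = "unstable", not(target_vendor = "apple")))]
    fn unfilter_paeth_bpp3(prev: &[u8], cur: &mut [u8]) {
        // portable_simd, 16-pixels-per-iteration implementation
        let _ = (prev, cur);
    }

    #[cfg(any(not(feature = "unstable"), target_vendor = "apple"))]
    fn unfilter_paeth_bpp3(prev: &[u8], cur: &mut [u8]) {
        // scalar implementation left to LLVM's auto-vectorizer
        let _ = (prev, cur);
    }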

Benchmarking reveals that the portable_simd implementation is
slower than the auto-vectorized code it replaces. Switch it off
for now.
@Sentimentron
Contributor Author

Quick update on Apple Silicon on the latest nightly (rustc 1.91.0-nightly (12eb345e5 2025-09-07))

$ cargo +nightly bench --features=benchmarks -- filter=Paeth/bpp=3
unfilter/filter=Paeth/bpp=3
                        time:   [17.630 µs 17.635 µs 17.641 µs]
                        thrpt:  [664.30 MiB/s 664.51 MiB/s 664.69 MiB/s]
                 change:
                        time:   [−0.1722% −0.0070% +0.1132%] (p = 0.94 > 0.05)
                        thrpt:  [−0.1131% +0.0070% +0.1725%]
                        No change in performance detected.

$ cargo +nightly bench --features=benchmarks,unstable -- filter=Paeth/bpp=3
unfilter/filter=Paeth/bpp=3
                        time:   [19.746 µs 19.945 µs 20.187 µs]
                        thrpt:  [580.50 MiB/s 587.54 MiB/s 593.48 MiB/s]
                 change:
                        time:   [+11.566% +12.253% +13.085%] (p = 0.00 < 0.05)
                        thrpt:  [−11.571% −10.915% −10.367%]

Regression looks durable so I still think that the cfg gate is the best way to proceed for now.

@Sentimentron
Contributor Author

The cfg gate has been implemented; the PR is ready for the next round of review.

Member

@197g 197g left a comment


I've also tested on a Raspberry Pi 3b:

model name	: ARMv7 Processor rev 4 (v7l)
BogoMIPS	: 38.40
Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32

Apart from difficulties in getting benchmarks given the heavy power/heat throttling, it looks to gain ~15-20% on the relevant filters. It's not entirely clear to me why some end-to-end test images that do not use Paeth 3bpp also seem affected, so I'm going to try again with better methodology for the final call. But overall, pretty positive.

@197g 197g merged commit f33b850 into image-rs:master Sep 12, 2025
24 checks passed
@Sentimentron
Contributor Author

Thanks for the reviews all!

@Sentimentron Sentimentron deleted the portable_simd-paeth-3bpp branch September 12, 2025 19:25