
Conversation

@Sentimentron
Contributor

@Sentimentron Sentimentron commented Aug 28, 2025

With --features=unstable, this implementation will decode up to 16 pixels at once.

This reduces the number of cycles required to decode a 768px row by 20-30% on Arm systems, and by around 50-70% on x64 systems (see discussion).
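
For context, the core of the approach is a branch-free SIMD Paeth predictor. Below is a minimal sketch in portable_simd, not the exact code from this PR: the lane width, function name, and module paths (which vary across nightly versions) are illustrative only.

    #![feature(portable_simd)]
    use std::simd::{cmp::SimdPartialOrd, num::SimdInt, i16x16};

    // Illustrative branch-free Paeth predictor over lanes widened to i16 so
    // the intermediate a + b - c cannot overflow. The formulation and lane
    // count actually used in this PR may differ.
    fn paeth_predictor_simd(a: i16x16, b: i16x16, c: i16x16) -> i16x16 {
        // With p = a + b - c, the three distances simplify to:
        let pa = (b - c).abs();         // |p - a|
        let pb = (a - c).abs();         // |p - b|
        let pc = (a + b - c - c).abs(); // |p - c|
        let use_a = pa.simd_le(pb) & pa.simd_le(pc);
        let use_b = pb.simd_le(pc);
        use_a.select(a, use_b.select(b, c))
    }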

Results on an AMD EPYC 7B13 system

    $  taskset -c 64-68 cargo +nightly bench --features=benchmarks -- filter=Paeth/bpp=3
    unfilter/filter=Paeth/bpp=3
        time:   [18.987 µs 18.994 µs 19.001 µs]
        thrpt:  [616.74 MiB/s 616.96 MiB/s 617.19 MiB/s]

    $ taskset -c 64-68 cargo +nightly bench --features=benchmarks,unstable -- filter=Paeth/bpp=3
        unfilter/filter=Paeth/bpp=3
            time:   [15.355 µs 15.361 µs 15.367 µs]
            thrpt:  [762.59 MiB/s 762.89 MiB/s 763.18 MiB/s]
        change:
            time:   [-19.180% -19.087% -19.007%] (p = 0.00 < 0.05)
            thrpt:  [+23.468% +23.589% +23.732%]

@Sentimentron
Contributor Author

AI disclosure: I wrote an original sliding-window 3bpp Paeth portable_simd implementation and optimized it for best performance on the Cortex A520. I then used the Gemini family of LLMs to automatically refine the code to achieve the best possible code-generation and performance across all other micro-architectures. The PR is derived from that work, but includes documentation and other cleanups.

@Sentimentron Sentimentron changed the title from "perf: portable_simd 3bpp paeth filter" to "perf: portable_simd 3bpp paeth unfiltering" on Aug 28, 2025
Contributor

@fintelia fintelia left a comment


These are quite significant improvements! On my Zen 3 machine, I see a ~11% speed improvement compiling with the default x86-64 target, and a ~30% improvement when compiling for x86-64-v2 or x86-64-v3.

@197g
Member

197g commented Aug 29, 2025

So portable_simd means we only get instructions for the compile-time target architecture, with no runtime dispatch, hence the need for different targets? That's unfortunate. But for ~11% I'm willing to make a lot of temporary and less temporary compromises.

@Shnatsel
Member

That is correct, portable_simd does not do runtime feature selection.

It can be added as a layer on top of portable_simd but that will require some unsafe in the implementation. The multiversion crate provides a safe wrapper around that using a proc macro.

I suggest exploring multiversioning separately from this PR, since it is a mostly orthogonal concern.
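
For illustration, a rough sketch of what that could look like with the multiversion proc macro; the function name, signature, and target list below are hypothetical and not part of this PR:

    use multiversion::multiversion;

    // Compiles one clone of the function per listed target and picks the best
    // one at runtime via CPU feature detection. Targets and body are
    // illustrative only; the real unfiltering loop would live in the body.
    #[multiversion(targets("x86_64+avx2", "x86_64+sse4.1", "aarch64+neon"))]
    fn unfilter_paeth_row(prev: &[u8], current: &mut [u8]) {
        let _ = (prev, current); // placeholder for the portable_simd loop
    }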

@Shnatsel
Member

We never really got 3bpp Paeth to vectorize, so it makes sense that large gains are possible here.

On a desktop Zen4 the difference is under 5%, but at least it seems to be stable and not noise. "My CPU is too good for optimization work" is a weird problem to have.

@Shnatsel
Member

Bad news: this is a 10% regression on an Apple M4 laptop.

Measured with cargo +nightly bench --features=benchmarks,unstable -- filter=Paeth/bpp=3

@Sentimentron
Contributor Author

Sentimentron commented Aug 30, 2025

Yes, looks like I get a similar result on my personal M2 laptop:

unfilter/filter=Paeth/bpp=3
                        time:   [19.663 µs 19.667 µs 19.671 µs]
                        thrpt:  [595.75 MiB/s 595.87 MiB/s 595.98 MiB/s]
                 change:
                        time:   [+11.301% +11.404% +11.494%] (p = 0.00 < 0.05)
                        thrpt:  [-10.309% -10.236% -10.154%]
                        Performance has regressed.

I don't know if the LLVM cost model for the Apple Silicon CPUs is in any way accurate (if it is, I might be able to fix it), but I think we can gate this off on macOS for now.

@Shnatsel
Member

Apple uses LLVM for all their C/C++/Objective-C/Swift compilation, so I'd be very much surprised if the LLVM cost model for Apple was way off.

I am curious what your microarchitecture simulator (llvm-mca?) has to say about this code on Apple systems. If it still shows a performance improvement on Apple, that's a major red flag for its reliability.

@Sentimentron
Contributor Author

Looking at the public LLVM cost model, it seems like all the Apple cores are modelled using this SchedCyclone definition:

https://github.com/llvm/llvm-project/blame/main/llvm/lib/Target/AArch64/AArch64SchedCyclone.td

It doesn't seem to have received any significant updates in the last 11 years or so. It's possible that they do this properly in their proprietary Clang, but I don't think the LLVM MCA-based simulator that I can run will accurately model this processor. This work is mostly about lifting up the Arm low-end (Cortex A520/A35/A53), which are all relatively simple to model - any extra targets that benefit are just a bonus!

@Shnatsel
Member

I wonder if this is at all beneficial in practice on lower-end ARM, since older SoCs are heavily bottlenecked by memory bandwidth, so microarchitectural optimizations might be entirely irrelevant.

It's a hassle to get benchmarks from Android, but maybe we could get this benchmarked on a Raspberry Pi or some such?

@Sentimentron
Contributor Author

So I got it running on an Android system (specifically a Pixel 10), and here's what came back. I'll start with the old code on the little core:

$ simpleperf stat -e instructions,cpu-cycles,stalled-cycles-backend,bus-cycles --per-core taskset 1 ./unfilter-841f*
unfilter/filter=Paeth/bpp=3
                        time:   [68.486 µs 68.523 µs 68.562 µs]
                        thrpt:  [170.92 MiB/s 171.02 MiB/s 171.11 MiB/s]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

Performance counter statistics:

# cpu           count  event_name               # count / runtime
  0    35,301,571,758  instructions             # 2.737 G/sec  
  2        15,500,501  instructions             # 1.877 G/sec  
  0    28,964,228,993  cpu-cycles               # 2.245909 GHz 
  2        10,455,860  cpu-cycles               # 1.266378 GHz 
  0                 0  stalled-cycles-backend   # 0.000 /sec   
  2         3,113,045  stalled-cycles-backend   # 377.041 M/sec
  0    28,964,229,818  bus-cycles               # 2.246 G/sec  
  2        10,455,852  bus-cycles               # 1.266 G/sec  

And here's the new code on the little core:

$ simpleperf stat -a taskset 1 ./unfilter-d7ac32ab9f97a889 --bench filter=Paeth/bpp=3
unfilter/filter=Paeth/bpp=3
                        time:   [52.192 µs 52.221 µs 52.252 µs]
                        thrpt:  [224.27 MiB/s 224.40 MiB/s 224.53 MiB/s]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

Performance counter statistics:

# cpu           count  event_name               # count / runtime              
  0    26,091,209,126  instructions             # 1.006585 cycles per instruction 
  1        29,086,319  instructions             # 6.070105 cycles per instruction 
  2        72,566,321  instructions             # 1.039818 cycles per instruction 
  3        68,670,799  instructions             # 1.122647 cycles per instruction 
  4        69,968,132  instructions             # 1.105196 cycles per instruction 
  5        21,288,155  instructions             # 1.590875 cycles per instruction 
  6        20,176,375  instructions             # 1.784563 cycles per instruction 
  7           372,543  instructions             # 11.952097 cycles per instruction
  0    26,263,015,945  cpu-cycles               # 2.238589 GHz                    
  1       176,557,011  cpu-cycles               # 0.015049 GHz                    
  2        75,455,791  cpu-cycles               # 0.006432 GHz                    
  3        77,093,081  cpu-cycles               # 0.006571 GHz                    
  4        77,328,509  cpu-cycles               # 0.006591 GHz                    
  5        33,866,802  cpu-cycles               # 0.002887 GHz                    
  6        36,006,015  cpu-cycles               # 0.003069 GHz                    
  7         4,452,670  cpu-cycles               # 0.000380 GHz                    
  0                 0  stalled-cycles-backend   # 0.000 /sec                      
  1                 0  stalled-cycles-backend   # 0.000 /sec                      
  2        19,623,562  stalled-cycles-backend   # 1.673 M/sec                     
  3        18,591,696  stalled-cycles-backend   # 1.585 M/sec                     
  4        21,507,071  stalled-cycles-backend   # 1.833 M/sec                     
  5        12,880,399  stalled-cycles-backend   # 1.098 M/sec                     
  6        15,129,010  stalled-cycles-backend   # 1.289 M/sec                     
  7         2,473,048  stalled-cycles-backend   # 210.772 K/sec                   
  0    26,261,083,597  bus-cycles               # 2.238 G/sec                     
  1       176,557,056  bus-cycles               # 15.047 M/sec                    
  2        74,951,477  bus-cycles               # 6.388 M/sec                     
  3        75,946,519  bus-cycles               # 6.472 M/sec                     
  4        76,761,850  bus-cycles               # 6.542 M/sec                     
  5        33,879,962  bus-cycles               # 2.887 M/sec                     
  6        35,922,982  bus-cycles               # 3.061 M/sec                     
  7         4,511,325  bus-cycles               # 384.453 K/sec  

And now the big core, first the old code:

Total test time: 12.928474 seconds.
$ simpleperf stat -e instructions,cpu-cycles,stalled-cycles-backend,bus-cycles -a taskset 80 ./unfilter-841f40*
unfilter/filter=Paeth/bpp=3
                        time:   [16.658 µs 16.664 µs 16.669 µs]
                        thrpt:  [703.01 MiB/s 703.25 MiB/s 703.48 MiB/s]
                 change:
                        time:   [-75.813% -75.733% -75.680%] (p = 0.00 < 0.05)
                        thrpt:  [+311.19% +312.08% +313.44%]
                        Performance has improved.
# cpu            count  event_name               # count / runtime             
  0         28,787,130  instructions             # 8.334599 cycles per instruction
  1         10,111,115  instructions             # 3.952444 cycles per instruction
  2         58,626,139  instructions             # 0.956735 cycles per instruction
  3         53,388,940  instructions             # 1.069962 cycles per instruction
  4         46,872,058  instructions             # 1.129856 cycles per instruction
  5         19,913,000  instructions             # 1.609684 cycles per instruction
  6         46,021,462  instructions             # 3.058840 cycles per instruction
  7    122,111,879,028  instructions             # 0.313492 cycles per instruction
  0        239,929,199  cpu-cycles               # 0.023653 GHz                   
  1         39,963,619  cpu-cycles               # 0.003940 GHz                   
  2         56,089,706  cpu-cycles               # 0.005529 GHz                   
  3         57,124,140  cpu-cycles               # 0.005631 GHz                   
  4         52,958,675  cpu-cycles               # 0.005221 GHz                   
  5         32,053,640  cpu-cycles               # 0.003160 GHz                   
  6        140,772,280  cpu-cycles               # 0.013877 GHz                   
  7     38,281,114,095  cpu-cycles               # 3.773429 GHz                   
  0                  0  stalled-cycles-backend   # 0.000 /sec                     
  1                  0  stalled-cycles-backend   # 0.000 /sec                     
  2         14,569,942  stalled-cycles-backend   # 1.436 M/sec                    
  3         13,572,410  stalled-cycles-backend   # 1.338 M/sec                    
  4         12,873,788  stalled-cycles-backend   # 1.269 M/sec                    
  5         11,380,351  stalled-cycles-backend   # 1.122 M/sec                    
  6         55,503,126  stalled-cycles-backend   # 5.471 M/sec                    
  7     11,467,846,471  stalled-cycles-backend   # 1.130 G/sec                    
  0        242,286,605  bus-cycles               # 23.884 M/sec                   
  1         40,022,922  bus-cycles               # 3.945 M/sec                    
  2         56,147,545  bus-cycles               # 5.535 M/sec                    
  3         57,267,840  bus-cycles               # 5.645 M/sec                    
  4         53,019,129  bus-cycles               # 5.226 M/sec                    
  5         32,028,206  bus-cycles               # 3.157 M/sec                    
  6        141,003,152  bus-cycles               # 13.900 M/sec                   
  7     38,281,309,463  bus-cycles               # 3.774 G/sec 

And now the new:

Total test time: 11.718414 seconds.
-a taskset 80 ./unfilter-d7ac32ab9f97a889 --bench filter=Paeth/bpp=3
unfilter/filter=Paeth/bpp=3
                        time:   [18.041 µs 18.042 µs 18.043 µs]
                        thrpt:  [649.50 MiB/s 649.54 MiB/s 649.58 MiB/s]
                 change:
                        time:   [-65.676% -65.535% -65.445%] (p = 0.00 < 0.05)
                        thrpt:  [+189.39% +190.15% +191.34%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  1 (1.00%) high mild
  3 (3.00%) high severe

Performance counter statistics:

# cpu           count  event_name               # count / runtime             
  0        40,163,354  instructions             # 6.407979 cycles per instruction
  1        14,727,533  instructions             # 3.544385 cycles per instruction
  2        52,998,332  instructions             # 1.031884 cycles per instruction
  3        87,045,464  instructions             # 0.801187 cycles per instruction
  4        39,933,075  instructions             # 1.164127 cycles per instruction
  5        37,973,836  instructions             # 1.799052 cycles per instruction
  6        15,332,612  instructions             # 7.761318 cycles per instruction
  7    72,310,059,395  instructions             # 0.546846 cycles per instruction
  0       257,365,924  cpu-cycles               # 0.024565 GHz                   
  1        52,200,045  cpu-cycles               # 0.004982 GHz                   
  2        54,688,142  cpu-cycles               # 0.005220 GHz                   
  3        69,739,711  cpu-cycles               # 0.006657 GHz                   
  4        46,487,170  cpu-cycles               # 0.004437 GHz                   
  5        68,316,906  cpu-cycles               # 0.006521 GHz                   
  6       119,001,272  cpu-cycles               # 0.011358 GHz                   
  7    39,542,496,810  cpu-cycles               # 3.774136 GHz                   
  0                 0  stalled-cycles-backend   # 0.000 /sec                     
  1                 0  stalled-cycles-backend   # 0.000 /sec                     
  2        13,694,794  stalled-cycles-backend   # 1.307 M/sec                    
  3        17,640,685  stalled-cycles-backend   # 1.684 M/sec                    
  4        12,133,632  stalled-cycles-backend   # 1.158 M/sec                    
  5        30,350,764  stalled-cycles-backend   # 2.897 M/sec                    
  6        45,669,009  stalled-cycles-backend   # 4.359 M/sec                    
  7    16,945,062,179  stalled-cycles-backend   # 1.617 G/sec                    
  0       258,026,474  bus-cycles               # 24.627 M/sec                   
  1        52,237,438  bus-cycles               # 4.986 M/sec                    
  2        55,064,781  bus-cycles               # 5.256 M/sec                    
  3        69,774,618  bus-cycles               # 6.660 M/sec                    
  4        46,100,141  bus-cycles               # 4.400 M/sec                    
  5        68,377,583  bus-cycles               # 6.526 M/sec                    
  6       118,919,675  bus-cycles               # 11.350 M/sec                   
  7    39,542,552,563  bus-cycles               # 3.774 G/sec                    

What I think this could be indicating is that the penalty from unaligned loads/stores might be a bit higher on the big cores: there's an uplift in backend-stalled cycles on the big core. Just a theory.

Sometimes, when doing this kind of optimization, we have to make the decision about whether to raise the ceiling or raise the floor: a 30% uplift on a small core (usually around 1/3rd the speed of a big core) might be a lot more noticeable than an 8% loss on a big core.

@Shnatsel
Copy link
Member

Does PNG decoding get scheduled onto little cores often? I would imagine that interactive workloads like web browsing would run mostly on the big cores, with the little ones mostly getting non-interactive background processes.

@fintelia
Copy link
Contributor

The last time we had a portable SIMD unfiltering implementation, we did a bunch of benchmarking before adding it. But then the scalar implementations improved, new versions of rustc and LLVM came out, and we discovered that it no longer seemed to help. Since our measurements pretty much just come from whichever CPUs contributors happen to test on, part of the difference may simply have been which processors were tested.

Is there a way we can avoid repeating this process?

@awxkee

awxkee commented Aug 30, 2025

Making effective NEON code for a series as small as [u8; 3] is incredibly complicated even with direct intrinsics; I have almost never had success with this. NEON instructions are power hungry, so on big cores they consume roughly twice the energy of cheap scalar instructions and produce more heat (and the number of instructions generated by this generic SIMD is unlikely to be half that of the scalar code). As a result, this can lead to throttling.

Does PNG decoding get scheduled onto little cores often? I would imagine that interactive workloads like web browsing would run mostly on the big cores, with the little ones mostly getting non-interactive background processes.

It starts on the little cores, and if the workload does not decrease, it switches to the big cores (if not pinned). After that, it may throttle. That is a rough sketch of how this works; however, it also depends on the OS, the device, and other factors.

@197g
Member

197g commented Aug 30, 2025

I'm starting to wonder: what if we shipped the benchmark loops together with the library? Then we could have an interface that determines the optimal implementation based on your actual target environment, and we wouldn't need to invest as much guesswork into excluding some implementations from the library. Especially with the vast divide from the big-little effect, it seems impossible to determine this at compile time; there's a lot of interplay between core selection and thermal/power throttling that will depend a lot even on the SMB design. That's something you can only optimize at a system level.

Of course, we still want to ship good code, and this wouldn't resolve how to answer #632 (comment), i.e. how the static default option should be determined. It could mean, however, that users can be more flexible and maybe report their own benchmarks back to us (the Chromium scale experiment was quite enlightening, i.e. surprisingly different, when compared to the rationales we got from our internal results).

@Sentimentron
Contributor Author

Couple of observations:

  • Looks like I can get 2-3% back on Apple Silicon by increasing the number of pixels per loop from 16 to 21; it's not yet known what effect that has on other micro-architectures.
  • Apple Silicon in particular seems ultra-sensitive to the formulation of paeth_predictor_simd, with this PR containing the fastest version currently known (adopting STBI's formulation, FPNGE's, and some fairly logical tweaks were all worse). This indicates to me that we're both memory and compute bound at different parts of the calculation.

@Shnatsel
Member

Shnatsel commented Aug 31, 2025

I'm curious what the microarchitecture simulator has to say about #635 which significantly improves benchmarks for x86_64, Apple M3 and gives an especially huge boost to Graviton 3. I misread the benchmarks, it actually regresses ARM a lot. Never mind.

@Sentimentron
Contributor Author

I'm curious what the microarchitecture simulator has to say about #635 which significantly improves benchmarks for x86_64, Apple M3 and gives an especially huge boost to Graviton 3. I misread the benchmarks, it actually regresses ARM a lot. Never mind.

I was curious too. Testing #635, the micro-architecture simulator indicates this:

| Filter | Arm Cortex A520 | Arm Cortex X4 | Intel Arrow Lake | AMD Zen 5 | Arm Cortex A520 (baseline cycles) | Arm Cortex X4 (baseline cycles) | Intel Arrow Lake (baseline cycles) | AMD Zen 5 (baseline cycles) |
|---|---|---|---|---|---|---|---|---|
| Paeth (3bpp) | -13.92% | -18.52% | 1.35% | 0.20% | 1073543 | 520265 | 1158373 | 505037 |
| Paeth (4bpp) | 9.53% | -7.49% | 6.28% | -4.25% | 885251 | 288701 | 413955 | 259222 |

The PR doesn't say which CPU it was measured on, but the results indicate a 6% uplift on both bpp=3 and bpp=4. I've measured it on my EPYC system and it yields about a 2% uplift. It's not surprising that the micro-arch simulation is a little off, for the reasons I've discussed (it assumes an essentially "perfect" path to main memory with no latency): it reflects what LLVM's cost model thinks the code will cost to execute. A theoretically worse solution can be better in practice as long as it decreases stalled cycles.

@Shnatsel
Member

Thank you! Do I understand correctly that on the EPYC 7B13 the portable_simd formulation in this PR is still worthwhile even after #635?

Also, sorry about the conflicts that the merge-and-revert has caused 😞

@Sentimentron
Contributor Author

From what I can tell, yes it's still worth it for bpp=3:

unfilter/filter=Paeth/bpp=3
                        time:   [15.227 µs 15.239 µs 15.256 µs]
                        thrpt:  [768.15 MiB/s 768.99 MiB/s 769.61 MiB/s]
                 change:
                        time:   [−15.119% −15.050% −14.977%] (p = 0.00 < 0.05)
                        thrpt:  [+17.615% +17.716% +17.812%]

@awxkee

awxkee commented Aug 31, 2025

FWIW
To make this PR effective on NEON, it's necessary to figure out a way to utilize the pipeline and instruction scheduling to compensate for the introduced latency. On AArch64, you can't just generate nearly the same amount of code with more power-hungry instructions and expect guaranteed performance improvements. Some gains might appear sometimes, but they often come at the cost of shifting work from one bottleneck to another.
For example, instead of processing pixel(row, 0), pixel(row, 1), pixel(row, 2), call process_pixels(row0, row1, row2, row3, 0), process_pixels(row0, row1, row2, row3, 1) to give LLVM a chance to schedule instructions on the NEON pipeline, since multiple rows will be processed within the same latency (as they do not block each other) if there is more than one SIMD unit on the chip.

@Sentimentron
Contributor Author

Yes, you've very succinctly highlighted the main problem with optimizing anything for AArch64: there are some wildly different micro-architectures, and it's not always possible to get an uplift on everything in all circumstances. There are very big out-of-order cores like the X4 where instruction scheduling doesn't matter so much, and some that are effectively in-order (though dual-issue) like the Cortex-A53/Cortex-A35/Cortex-A520 (little family), where instruction scheduling matters an awful lot. The little cores tend to be 30-40% of the performance of the big cores. I'm intrigued by the idea of breaking the data dependency vertically, but I think it would probably work best for really small images due to the data locality.

Disclosure: I worked at Arm for around 10 years on AArch64 software optimization. I currently work in Google's Pixel software division on optimizing Chrome's performance on AArch64. This contribution is on behalf of my current employer.

@Sentimentron
Contributor Author

Although, @awxkee's comment does have some truth to it. Switching off SLP auto-vectorization entirely (i.e. no NEON generated):

$ RUSTFLAGS="-C llvm-args=--slp-threshold=30000" cargo +nightly bench --features=benchmarks -- filter=Paeth/bpp=3
unfilter/filter=Paeth/bpp=3
                        time:   [14.587 µs 14.619 µs 14.675 µs]
                        thrpt:  [798.57 MiB/s 801.59 MiB/s 803.35 MiB/s]
                 change:
                        time:   [-17.228% -17.138% -16.971%] (p = 0.00 < 0.05)
                        thrpt:  [+20.440% +20.683% +20.813%]
                        Performance has improved.

This indicates that... yes, someone should definitely try and model Apple Silicon a bit better in upstream LLVM: it's a very unusual CPU.

@Sentimentron
Contributor Author

OK, some more results with and without auto-vectorization:

[image: benchmark results with and without auto-vectorization]

I'd say this is low-quality evidence so far, but it does definitely indicate that SLP auto-vectorization is not an automatic win over merely using scalar instructions for this filter.

@awxkee

awxkee commented Sep 2, 2025

To have reliable improvements with SIMD, where after launch you get improvements in almost all cases, it is necessary to satisfy one or more of the following points (roughly):

  1. Process significantly more data with approximately the same number of instructions to overcome penalties (technically, accounting for their latencies and throughput, but since we aren't writing direct intrinsics here, just "instruction count").
  2. Utilize the pipeline more effectively than before.
  3. Break data dependencies and process in parallel data chunks that would otherwise be sequential.
  4. Use data preparation and DSP instructions specific to the task: pmulhrsw, sdot, udot, pmaddubsw, sqrdmlah, etc.
  5. Process more data with fewer instructions.

I believe we all see that point 2 is covered for ARM little cores. You've tried to unroll it, but it didn't work out for other platforms.

I don't believe any of the baseline points are reliably covered in the other cases at the current state, so it is debatable whether it should work. Without some reliable baseline plan, it produces mixed results due to compiler and platform differences, the OS, and other factors. At the moment, it tries to do the same thing that LLVM does, and LLVM appears to be competitive in this area.

@Sentimentron
Contributor Author

@awxkee thank you for the discussion.

Doing this (or not) is a strategic trade-off. The Paeth filter's unbreakable top-to-bottom, left-to-right data dependency and its many formulations make it really challenging to optimize (especially the 3bpp version). With that constraint in mind, this implementation succeeds in its primary goal: delivering a significant performance improvement for low-end Arm cores and most x64 systems, which will benefit many users. By explicitly writing a SIMD-style solution, it delivers reasonable performance irrespective of whether LLVM's SLP auto-vectorization is working effectively, without the maintenance overhead of architecture-specific intrinsics.

To handle the regressions, my preferred solution is to gate the new code behind a cfg flag and switch it off on platforms like Apple Silicon. Let me know your thoughts on this approach.

@awxkee

awxkee commented Sep 2, 2025

To handle the regressions, my preferred solution is to gate the new code behind a cfg flag and switch it off on platforms like Apple Silicon. Let me know your thoughts on this approach.

It needs sign-off from one of the maintainers. I’m rather interested to see if we can make it more reliable :)

From my side, I'd prefer this in intrinsics, where there are more opportunities to do neat work and better chances of making something reliable for the long run.

However, at the moment I believe it is worth it for budget ARM CPUs (if Chrome compiles it using nightly with this feature, of course). It's a bit unfortunate that in most cases high-end devices will likely see a small performance penalty, but considering that the number of low-end devices is significantly higher than that of high-end ones, I think it's OK. I also think that if someone compiles the crate with AVX-512 support, I'd expect a regression there on some CPUs/LLVM/Rust versions (this is just a thought).

Those trade-offs are better discussed with the maintainers, to decide whether they want to support those configs.

@Sentimentron
Contributor Author

Checking in on the 21-pixel-stride idea to reduce the regression on Apple Silicon: it seems to decrease the gain on the Cortex A520 (from 224 MiB/s to 214 MiB/s) and is near-identical on the X4. Conclusion: probably not worth it; 16 pixels seems to be the optimal stride.

@197g
Member

197g commented Sep 2, 2025

To handle the regressions, my preferred solution is to gate the new code behind a cfg flag and switch it off on platforms like Apple Silicon. Let me know your thoughts on this approach.

Features in particular would not be a good solution for this. A cfg is a bit better but I have some reservations about setting CI up for something like that. We can have quite a few filter implementations live in the crate simultaneously (just gated by architecture), even some that are not active in any configuration except for testing.

Given the micro-architectural discussions above and that we don't know what environment everyone is using image with, some form of configuration mechanism to set up the filter function pointers at runtime would be appealing anyway. Then we merely use the existing unstable feature flag to make some options available through the public API.
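
For what it's worth, a rough sketch of that runtime-configuration idea; all of the names below are hypothetical and not part of the png crate's actual API:

    /// Unfilter function signature shared by all implementations.
    type UnfilterFn = fn(prev: &[u8], cur: &mut [u8]);

    fn paeth_bpp3_scalar(_prev: &[u8], _cur: &mut [u8]) {
        // default, auto-vectorized scalar implementation
    }

    fn paeth_bpp3_portable_simd(_prev: &[u8], _cur: &mut [u8]) {
        // portable_simd implementation gated behind the `unstable` feature
    }

    /// Table of filter function pointers that a caller (or a shipped
    /// benchmark loop) could override for its actual environment.
    pub struct FilterDispatch {
        pub paeth_bpp3: UnfilterFn,
    }

    impl Default for FilterDispatch {
        fn default() -> Self {
            FilterDispatch { paeth_bpp3: paeth_bpp3_scalar }
        }
    }

    /// Pick the faster variant based on a measurement taken at runtime.
    pub fn configure(simd_is_faster: bool) -> FilterDispatch {
        let mut dispatch = FilterDispatch::default();
        if simd_is_faster {
            dispatch.paeth_bpp3 = paeth_bpp3_portable_simd;
        }
        dispatch
    }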

@Sentimentron
Contributor Author

@197g that's an interesting idea, and it brings up an important point about how we should handle performance features.

The unstable feature flag, in my opinion, should serve as a well-managed "expert mode" for sophisticated users who are willing to use a nightly compiler to get bleeding-edge performance. It's not a guarantee of universal improvement, but it's an opportunity to gain access to something potentially faster. However, we shouldn't put something in that mode if it's definitely known to regress a given platform: it breaks the implicit good-faith promise that the unstable feature makes. I think that cfg(not(any(target_os="ios", target_os="macos"))) is OK for this: the code can still be tested on macOS in this config, it just won't be used when decoding images.

Should I proceed with gating the code off for Apple Silicon?

@Shnatsel
Member

Shnatsel commented Sep 3, 2025

Apple Silicon also has big and little cores. I've only benchmarked with the built-in harness which runs in a loop for a long time, so I assume it gets placed on the big core. It's possible that on the little cores this would be beneficial on Apple Silicon; I think we should benchmark with core pinning before gating it off.

@Sentimentron Sentimentron force-pushed the portable_simd-paeth-3bpp branch from e980731 to f6ba382 on September 3, 2025 20:55
@Sentimentron
Contributor Author

I found a couple of ways to try to measure this; the first is using the taskpolicy command:

$ taskpolicy -b cargo +nightly bench --features=benchmarks -- filter=Paeth/bpp=3
unfilter/filter=Paeth/bpp=3
                        time:   [70.397 µs 70.568 µs 70.719 µs]
                        thrpt:  [165.71 MiB/s 166.06 MiB/s 166.47 MiB/s]
                 change:
                        time:   [+0.0841% +2.6611% +5.2874%] (p = 0.04 < 0.05)
                        thrpt:  [−5.0219% −2.5921% −0.0841%]
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  9 (9.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

$ taskpolicy -b cargo +nightly bench --features=benchmarks,unstable -- filter=Paeth/bpp=3
unfilter/filter=Paeth/bpp=3
                        time:   [83.191 µs 83.257 µs 83.331 µs]
                        thrpt:  [140.63 MiB/s 140.75 MiB/s 140.87 MiB/s]
                 change:
                        time:   [+18.000% +19.620% +21.775%] (p = 0.00 < 0.05)
                        thrpt:  [−17.881% −16.402% −15.254%]
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) low severe

That's at background QoS level (so barely idle) - it's actually quite comparable to the performance of a Cortex A520, despite being a vastly more powerful chip. Playing with the latency and throughput tiers did not reveal any interesting behaviour. The other option seems to be putting the device in low power mode:

$ taskpolicy -a cargo +nightly bench --features=benchmarks -- filter=Paeth/bpp=3
                        time:   [31.426 µs 31.440 µs 31.459 µs]
                        thrpt:  [372.51 MiB/s 372.73 MiB/s 372.90 MiB/s]
                 change:
                        time:   [−0.9927% −0.6168% −0.3670%] (p = 0.00 < 0.05)
                        thrpt:  [+0.3683% +0.6206% +1.0026%]
                        Change within noise threshold.

$ taskpolicy -a cargo +nightly bench --features=benchmarks,unstable -- filter=Paeth/bpp=3
unfilter/filter=Paeth/bpp=3
                        time:   [35.013 µs 35.018 µs 35.024 µs]
                        thrpt:  [334.59 MiB/s 334.65 MiB/s 334.70 MiB/s]
                 change:
                        time:   [+11.368% +11.424% +11.485%] (p = 0.00 < 0.05)
                        thrpt:  [−10.302% −10.253% −10.208%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

I'm not sure whether this means background QoS = efficiency core at the lowest frequency and low power mode = efficiency core at the higher frequency, or low power mode = performance core at the lowest frequency.

But basically, it seems that whatever issue is affecting Apple Silicon is worse on the efficiency core.

@kornelski
Contributor

kornelski commented Sep 6, 2025

This can be automatically hidden from Apple Silicon with #[cfg(target_vendor = "apple")] without requiring any Cargo features or custom cfgs. Luckily Apple doesn't use any other ARM CPUs. LLVM's auto-vectorization changes more often than CPU architectures, so a cfg(target_vendor = "apple") shouldn't be more of a problem long-term than LLVM already is.
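
A sketch of how that gating could look; the function names and module layout below are hypothetical and do not match the PR's actual code:

    // Use the portable_simd path everywhere except Apple targets, where the
    // auto-vectorized scalar path benchmarked faster. Names are illustrative.
    #[cfg(all(feature = "unstable", not(target_vendor = "apple")))]
    fn unfilter_paeth_bpp3(prev: &[u8], cur: &mut [u8]) {
        // portable_simd, 16-pixels-per-iteration implementation
        let _ = (prev, cur);
    }

    #[cfg(any(not(feature = "unstable"), target_vendor = "apple"))]
    fn unfilter_paeth_bpp3(prev: &[u8], cur: &mut [u8]) {
        // scalar implementation left to LLVM's auto-vectorizer
        let _ = (prev, cur);
    }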

Benchmarking reveals that the portable_simd implementation is
slower than the auto-vectorized code it replaces. Switch it off
for now.
@Sentimentron
Contributor Author

Quick update on Apple Silicon on the latest nightly (rustc 1.91.0-nightly (12eb345e5 2025-09-07))

$ cargo +nightly bench --features=benchmarks -- filter=Paeth/bpp=3
unfilter/filter=Paeth/bpp=3
                        time:   [17.630 µs 17.635 µs 17.641 µs]
                        thrpt:  [664.30 MiB/s 664.51 MiB/s 664.69 MiB/s]
                 change:
                        time:   [−0.1722% −0.0070% +0.1132%] (p = 0.94 > 0.05)
                        thrpt:  [−0.1131% +0.0070% +0.1725%]
                        No change in performance detected.

$ cargo +nightly bench --features=benchmarks,unstable -- filter=Paeth/bpp=3
unfilter/filter=Paeth/bpp=3
                        time:   [19.746 µs 19.945 µs 20.187 µs]
                        thrpt:  [580.50 MiB/s 587.54 MiB/s 593.48 MiB/s]
                 change:
                        time:   [+11.566% +12.253% +13.085%] (p = 0.00 < 0.05)
                        thrpt:  [−11.571% −10.915% −10.367%]

Regression looks durable so I still think that the cfg gate is the best way to proceed for now.

@Sentimentron
Contributor Author

The cfg gate has been implemented; the PR is ready for the next round of review.

Member

@197g 197g left a comment


I've also tested on a Raspberry Pi 3b:

model name	: ARMv7 Processor rev 4 (v7l)
BogoMIPS	: 38.40
Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32

Apart from difficulties in getting benchmarks given the heavy power/heat throttling, it looks to gain ~15-20% on the relevant filters. It's not entirely clear to me why some end-to-end test images that do not use Paeth 3bpp also seem affected, so I'm going to try again with better methodology for the final call. But overall, pretty positive.

@197g 197g merged commit f33b850 into image-rs:master Sep 12, 2025
24 checks passed
@Sentimentron
Contributor Author

Thanks for the reviews all!

@Sentimentron Sentimentron deleted the portable_simd-paeth-3bpp branch September 12, 2025 19:25