perf: portable_simd 3bpp paeth unfiltering #632
Conversation
|
AI disclosure: I wrote an original sliding-window 3bpp Paeth portable_simd implementation and optimized it for best performance on the Cortex A520. I then used the Gemini family of LLMs to automatically refine the code to achieve the best possible code-generation and performance across all other micro-architectures. The PR is derived from that work, but includes documentation and other cleanups. |
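For context on why this filter is hard to vectorize, here is a minimal scalar sketch of the Paeth recurrence under discussion (illustrative only, not the code in this PR): each decoded byte becomes the "left" input of the pixel three bytes later, so work within a row cannot be parallelized naively.

```rust
// Minimal scalar sketch of 3bpp Paeth unfiltering, per the PNG spec.
// Illustrative only; the PR's portable_simd code reformulates this, but the
// serial left-neighbour dependency shown here is what it has to respect.
fn paeth_predictor(a: u8, b: u8, c: u8) -> u8 {
    let p = a as i16 + b as i16 - c as i16;
    let (pa, pb, pc) = ((p - a as i16).abs(), (p - b as i16).abs(), (p - c as i16).abs());
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}

fn unfilter_paeth_3bpp(prev_row: &[u8], cur_row: &mut [u8]) {
    const BPP: usize = 3; // bytes per pixel
    for i in 0..cur_row.len() {
        let a = if i >= BPP { cur_row[i - BPP] } else { 0 }; // left (already decoded)
        let b = prev_row[i]; // up
        let c = if i >= BPP { prev_row[i - BPP] } else { 0 }; // up-left
        cur_row[i] = cur_row[i].wrapping_add(paeth_predictor(a, b, c));
    }
}
```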
fintelia
left a comment
These are quite significant improvements! On my Zen 3 machine, I see an ~11% speed improvement compiling with the default x86-64 target, and a ~30% improvement when compiling for x86-64-v2 or x86-64-v3.
|
So |
|
That is correct; it can be added as a layer on top. I suggest exploring multiversioning separately from this PR, since it is a mostly orthogonal concern. |
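Purely as illustration of what such a layer on top might look like (a sketch, not anything in this crate; the function names are placeholders), runtime multiversioning on x86-64 could dispatch on a feature check:

```rust
// Hypothetical sketch of multiversioning as a separate layer; the function
// names here are placeholders, not this crate's API.
fn unfilter_paeth_scalar(_prev: &[u8], _cur: &mut [u8]) { /* baseline build */ }
fn unfilter_paeth_wide(_prev: &[u8], _cur: &mut [u8]) { /* e.g. built with target_feature(enable = "avx2") */ }

pub fn unfilter_paeth(prev: &[u8], cur: &mut [u8]) {
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime CPUID check; a real implementation would cache the choice
        // in a function pointer instead of re-testing on every call.
        if std::arch::is_x86_feature_detected!("avx2") {
            return unfilter_paeth_wide(prev, cur);
        }
    }
    unfilter_paeth_scalar(prev, cur)
}
```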
|
We never really got 3bpp Paeth to vectorize, so it makes sense that large gains are possible here. On a desktop Zen4 the difference is under 5%, but at least it seems to be stable and not noise. "My CPU is too good for optimization work" is a weird problem to have. |
|
Bad news: this is a 10% regression on an Apple M4 laptop. Measured with |
|
Yes, it looks like I get a similar result on my personal M2 laptop: I don't know if the LLVM cost model for the Apple Silicon CPUs is in any way accurate (if it is, I might be able to fix it), but I think we can gate this off for macOS for now. |
|
Apple uses LLVM for all their C/C++/Objective-C/Swift compilation, so I'd be very much surprised if the LLVM cost model for Apple was way off. I am curious what your microarchitecture simulator (llvm-mca?) has to say about this code on Apple systems. If it still shows a performance improvement on Apple, that's a major red flag for its reliability. |
|
Looking at the public LLVM cost model, it seems like all the Apple cores are modelled using this SchedCyclone def: It doesn't seem to have received any significant updates in the last 11 years or so. It's possible that they do this properly in their proprietary Clang, but I don't think the LLVM MCA-based simulator that I can run will accurately model this processor. This work is mostly about lifting up the Arm low end (Cortex A520/A35/A53), which are all relatively simple to model; any extra targets that benefit are just a bonus! |
|
I wonder if this is at all beneficial in practice on lower-end ARM, since older SoCs are heavily bottlenecked by memory bandwidth, so microarchitectural optimizations might be entirely irrelevant. It's a hassle to get benchmarks from Android, but maybe we could get this benchmarked on a Raspberry Pi or some such? |
|
So I got it running on an Android system (specifically a Pixel 10), and here's what came back. I'll start with the old code on the little core: And here's the new code on the little core: And now the big, first the old code: And now the new: What I think this could be indicating is that the penalty from unaligned loads/stores might be a bit higher on the big cores: there's an uplift in backend-stalled cycles on the big core. Just a theory. Sometimes, when doing this kind of optimization, we have to make the decision about whether to raise the ceiling or raise the floor: a 30% uplift on a small core (usually around 1/3rd the speed of a big core) might be a lot more noticeable than an 8% loss on a big core. |
|
Does PNG decoding get scheduled onto little cores often? I would imagine that interactive workloads like web browsing would run mostly on the big cores, with the little ones mostly getting non-interactive background processes. |
|
The last time we had a portable SIMD unfiltering implementation, we did a bunch of benchmarking before adding it. But then the scalar implementations improved, new versions of rustc and LLVM came out, and we discovered that it no longer seemed to help. Since our measurements pretty much just come from whichever CPUs contributors happen to test on, part of the difference may simply have been which processors were tested. Is there a way we can avoid repeating this process? |
|
Making effective NEON intrinsics for such a small series like
It starts on the little cores, and if the workload does not decrease, it switches to the big cores (if not pinned). After that, it may throttle. Here is a sketch of how this works. However, this depends on the OS, device, and other factors as well. |
|
I'm starting to wonder: what if we shipped the benchmark loops together with the library? Then we could have an interface that determines the optimal implementation based on the actual target environment, and we wouldn't need to invest as much guesswork into excluding some implementations from the library. Especially with the vast divide caused by the big/little effect, it seems impossible to determine this at compile time; there's a lot of interplay between core selection and thermal/power throttling that will depend a lot even on the SMB design. That's something you can only optimize at a system level. Of course, we still want to ship good code, and this wouldn't resolve how to answer #632 (comment), i.e. how to determine which static default option should be applied. It could mean, however, that users can be more flexible and maybe report their own benchmarks back to us (the Chromium scale experiment was quite enlightening, i.e. surprisingly different, when compared to the rationales we got from our internal results). |
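To make the idea concrete, a rough sketch of "ship the benchmark loop" could look like the following. This is purely illustrative: `UnfilterFn` and the candidate functions are hypothetical, and a real harness would need warm-up, pinning, and more careful measurement.

```rust
use std::time::Instant;

// Hypothetical signature for a row unfilter; not the crate's actual API.
type UnfilterFn = fn(prev: &[u8], cur: &mut [u8]);

/// Time each candidate on a representative row and keep the fastest.
fn pick_fastest(candidates: &[UnfilterFn]) -> UnfilterFn {
    let prev = vec![0x7Fu8; 768 * 3];
    let mut cur = vec![0x35u8; 768 * 3];
    *candidates
        .iter()
        .min_by_key(|f| {
            let start = Instant::now();
            for _ in 0..10_000 {
                f(&prev, &mut cur);
            }
            std::hint::black_box(&cur); // keep the work from being optimized away
            start.elapsed()
        })
        .expect("at least one candidate implementation")
}
```

This sidesteps compile-time guessing, at the cost of a startup measurement whose result can still shift with core migration and throttling, as noted above.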
|
Couple of observations:
|
|
|
I was curious too. Testing #635, the micro-architecture simulator indicates this:
The PR doesn't say which CPU it was measured on, but the results indicate a 6% uplift on both bpp=3 and bpp=4. I've measured it on my EPYC system and it yields about a 2% uplift. It's not surprising that the micro-arch simulation is a little off, for the reasons I've discussed (it assumes an essentially "perfect" path to main memory with no latency); it reflects what LLVM's cost model thinks the code will cost to execute. A theoretically worse solution can be better in practice as long as it decreases stalled cycles. |
|
Thank you! Do I understand correctly that on EPYC 7B13 the portable_simd formulation in this PR is still worthwhile even after #635? Also, sorry about the conflicts that the merge-and-revert has caused 😞 |
|
From what I can tell, yes it's still worth it for bpp=3: |
|
FWIW |
|
Yes, you've very succinctly highlighted the main problem with optimizing anything for AArch64: there are some wildly different micro-architectures, and it's not always possible to get an uplift on everything in all circumstances. There are very big out-of-order cores like the X4, where instruction scheduling doesn't matter so much, and some that are effectively in-order (though dual-issue) like the Cortex-A53/Cortex-A35/Cortex-A520 (the little family), where instruction scheduling matters an awful lot. The little cores tend to be 30-40% of the performance of the big cores. I'm intrigued by the idea of breaking the data dependency vertically, but I think it would probably work best for really small images due to the data locality. Disclosure: I worked at Arm for around 10 years on AArch64 software optimization. I currently work in Google's Pixel software division on optimizing Chrome's performance on AArch64. This contribution is on behalf of my current employer. |
|
Although, @awxkee's comment does have some truth to it. Switching off SLP auto-vectorization entirely (i.e. no NEON generated): This indicates that... yes, someone should definitely try to model Apple Silicon a bit better in upstream LLVM: it's a very unusual CPU. |
|
To get reliable improvements from SIMD, where after launch you see gains in almost all cases, you need to satisfy one or more of the following points (roughly):
I believe we all see that point 2 is covered for ARM little cores. You've tried to unroll it, but that didn't work out for other platforms. I don't believe any of the baseline points are reliably covered in the other cases at the moment, so it is debatable whether it should work. Without some reliable baseline plan, it produces mixed results due to compiler, platform, OS, and other differences. At the moment, it tries to do the same thing that LLVM does, and LLVM appears to be competitive in this area. |
|
@awxkee thank you for the discussion. Doing this (or not) is a strategic trade-off. The Paeth filter's unbreakable top-to-bottom, left-to-right data dependency and its many formulations make it really challenging to optimize (especially the 3bpp version). With that constraint in mind, this implementation succeeds in its primary goal: delivering a significant performance improvement for low-end Arm cores and most x64 systems, which will benefit many users. By explicitly writing a SIMD-style solution, it delivers reasonable performance irrespective of whether LLVM's SLP auto-vectorization is working effectively, without the maintenance overhead of architecture-specific intrinsics. To handle the regressions, my preferred solution is to gate the new code behind a |
It needs sign-off from one of the maintainers. I'm rather interested to see if we can make it more reliable :) From my side, I'd prefer this done with intrinsics, where there are more opportunities to do neat work and better chances of making something reliable for the long run. However, at the moment I believe it is worth it for budget ARM CPUs (if Chrome compiles it using nightly with this feature, of course). It's a bit unfortunate that in most cases high-end devices will likely see a small performance penalty, but considering that the number of low-end devices is significantly higher than high-end ones, I think it's ok. I also think that if someone compiles the crate with AVX-512 support, I'd expect a regression there on some CPU/LLVM/Rust versions (just a thought). Those trade-offs are better discussed with the maintainers, to decide whether they want to support those configurations. |
|
Checking in on the 21-pixel stride idea to reduce the regression on Apple Silicon: it seems that it decreases the gain on the Cortex A520 (from 224 MiB/s to 214 MiB/s) and is near-identical on the X4. Conclusion: probably not worth it; 16 pixels seems to be the optimal stride. |
Features in particular would not be a good solution for this. Given the micro-architectural discussions above, and that we don't know what environment everyone is using image with, some form of configuration mechanism to set up the filter function pointers at runtime would be appealing anyway. Then we merely use the existing |
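The shape of such a configuration hook might be something like the following. This is a sketch only: none of these names exist in the crate, and a real design would have to decide on thread-safety and when overrides are allowed.

```rust
use std::sync::OnceLock;

// Hypothetical runtime-configurable filter entry point; the names are
// invented for illustration and are not part of the png crate's API.
type UnfilterFn = fn(prev: &[u8], cur: &mut [u8]);

fn unfilter_paeth_default(_prev: &[u8], _cur: &mut [u8]) { /* compiled-in default */ }

static PAETH_OVERRIDE: OnceLock<UnfilterFn> = OnceLock::new();

/// Let an application that has benchmarked its own target install a
/// different implementation before any decoding happens.
pub fn set_paeth_unfilter(f: UnfilterFn) -> Result<(), UnfilterFn> {
    PAETH_OVERRIDE.set(f)
}

fn unfilter_paeth(prev: &[u8], cur: &mut [u8]) {
    let f = PAETH_OVERRIDE.get().copied().unwrap_or(unfilter_paeth_default);
    f(prev, cur)
}
```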
|
@197g that's an interesting idea, and it brings up an important point about how we should handle performance features. Should I proceed with gating the code off for Apple Silicon? |
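If the maintainers do want that, the gating could be spelled roughly like this. Illustrative only: `target_vendor = "apple"` is just one possible cfg predicate, and the module names are placeholders rather than the crate's real layout.

```rust
// Sketch of gating: keep the portable_simd path behind the existing
// `unstable` feature and off Apple targets; fall back elsewhere.
#[cfg(all(feature = "unstable", not(target_vendor = "apple")))]
mod paeth_simd {
    // portable_simd sliding-window implementation would live here
}

#[cfg(any(not(feature = "unstable"), target_vendor = "apple"))]
mod paeth_simd {
    // scalar / auto-vectorized fallback
}
```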
|
Apple Silicon also has big and little cores. I've only benchmarked with the built-in harness which runs in a loop for a long time, so I assume it gets placed on the big core. It's possible that on the little cores this would be beneficial on Apple Silicon; I think we should benchmark with core pinning before gating it off. |
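For the pinning itself, on Linux this could be done with, for example, the third-party core_affinity crate; the sketch below assumes its get_core_ids/set_for_current API. macOS does not expose hard pinning, which is presumably why the measurements below fall back to QoS classes and low power mode instead.

```rust
// Sketch of pinning the benchmark thread before measuring, assuming the
// third-party `core_affinity` crate (not part of this repo). Which index
// maps to a little vs. big core is platform-specific.
fn run_pinned<F: FnOnce()>(core_index: usize, bench: F) {
    let cores = core_affinity::get_core_ids().expect("could not enumerate cores");
    let core = cores.get(core_index).copied().expect("no such core");
    if !core_affinity::set_for_current(core) {
        eprintln!("warning: pinning to core {core_index} failed; results may mix core types");
    }
    bench();
}
```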
Force-pushed from e980731 to f6ba382
|
I found a couple of ways to try and measure this; the first is using the That's at background QoS level (so barely idle) - it's actually quite comparable to the performance on a Cortex A520, despite being a vastly more powerful chip. Playing with the latency and throughput tiers did not reveal any interesting behaviour. The other option seems to be putting the device in low power mode: I'm not sure whether this means background QoS = efficiency core at the lowest frequency and low power mode = efficiency core at a higher frequency, or low power mode = performance core at the lowest frequency. But basically, it seems that whatever issue is affecting Apple Silicon, it's worse on the efficiency cores. |
|
This can be automatically hidden from Apple Silicon with |
Benchmarking reveals that the portable_simd implementation is slower than the auto-vectorized code it replaces. Switch it off for now.
|
Quick update on Apple Silicon on the latest nightly ( The regression looks durable, so I still think that the |
|
|
197g
left a comment
I've also tested on a Raspberry Pi 3b:
model name : ARMv7 Processor rev 4 (v7l)
BogoMIPS : 38.40
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
Apart from the difficulty of getting benchmarks given the heavy power/heat throttling, it looks to get ~15-20% for the relevant filters. It's not entirely clear to me yet: in end-to-end runs, some test images that do not use Paeth 3bpp seem affected, so I'm going to try again with better methodology for the final call. But overall, pretty positive.
|
Thanks for the reviews all! |

With --features=unstable, this implementation will decode up to 16 pixels at once. Reduces the number of cycles required to decode a 768px row by 20-30% on Arm systems, and around 50-70% on x64 systems (see discussion).
Results on an AMD EPYC 7B13 system