Added scaffolding for Oryon arch as in Snapdragon X Elite #5537
Conversation
theAeon commented Nov 18, 2025 • edited
Using NeoverseN1 kernels for now, with cache info taken from official specs.

All I know is that this builds and works fine with clangarm64 on my laptop. Unsure about performance improvement, but certainly no performance regression.

I am not an assembly wizard, so this still uses the Neoverse kernels. I imagine there is much optimization to be had. Feel free to edit if I missed a spot.

https://www.hwcooling.net/en/oryon-arm-core-in-snapdragon-x-cpus-architecture-analysis/ for cache reference.
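For readers not familiar with how an OpenBLAS target consumes the cache figures: each target gets a block in param.h that fixes the GEMM unroll factors and the P/Q/R blocking sizes, and those blocking sizes are what the L1/L2 numbers ultimately feed into. A rough sketch of what such a block looks like is below; the macro names follow the existing per-target pattern, but the target name and all values are placeholders, not what is in this PR's diff.

```c
/* Rough sketch only, not the actual diff in this PR.  Each OpenBLAS target
 * carries a block like this in param.h; the P/Q/R blocking sizes are chosen
 * so the packed GEMM panels fit the core's caches, which is where the Oryon
 * cache figures from the linked analysis would ultimately matter.
 * All numbers below are placeholders copied in spirit from the ARMv8 targets. */
#if defined(ORYON)                      /* hypothetical target name */
#define SGEMM_DEFAULT_UNROLL_M  16
#define SGEMM_DEFAULT_UNROLL_N   4
#define DGEMM_DEFAULT_UNROLL_M   8
#define DGEMM_DEFAULT_UNROLL_N   4

#define SGEMM_DEFAULT_P  128            /* placeholder: tune against Oryon L1D  */
#define DGEMM_DEFAULT_P  160            /* placeholder                          */
#define SGEMM_DEFAULT_Q  352            /* placeholder: tune against cluster L2 */
#define DGEMM_DEFAULT_Q  128            /* placeholder                          */
#define SGEMM_DEFAULT_R  4096           /* placeholder                          */
#define DGEMM_DEFAULT_R  4096           /* placeholder                          */
#endif
```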
martin-frbg commented Nov 18, 2025
Thanks - do you get markedly better performance with this change, compared to the default approach in 0.3.30 of autodetecting this CPU as a regular NEOVERSEN1? I would prefer to avoid the code and library size explosion from adding any and all arm64 design variants, so unless the exact model-specific cost tables make a serious difference to the compiler output, I'd like to avoid mere duplication.
theAeon commented Nov 18, 2025
I need to do some benchmarking, so I'll report back on that. I have to imagine the significant difference in cache layout here is going to do something.
theAeon commented Nov 18, 2025
Extremely unscientific run-through (benchmark results attached in the original comment):
- Stock Rblas.dll
- OpenBLAS 0.3.30.dev
- OpenBLAS NeoverseN1 kernel w/ Oryon cache sizes
theAeon commented Nov 18, 2025 • edited
Gonna be completely honest here - I can't quite tell. Looks like there are some sizes for which it performs better and some for which it is worse. Any recs for drilling down a bit deeper?

edit: just saw the openblas_loops setting, bear with me.
theAeon commented Nov 18, 2025
0.3.30.dev vs. Oryon-modded cache sizes (results attached in the original comment).
theAeon commented Nov 18, 2025
I think there's definitely something here, judging by the decent improvement at certain matrix sizes, but this is not it, judging by the degraded performance at other matrix sizes. May be worth having it as a full clone of Neoverse N1 (i.e. removing the cache changes I made here) pending further investigation.
theAeon commented Nov 18, 2025
.....I had an idea. This is an 8-wide chip; Neoverse N1 is 5-wide. I wonder what happens if I run the VORTEX target (which is 7-wide and should be otherwise compatible), because I get the feeling the optimization here isn't so much in the cache definitions as much as it's in the kernels.
theAeon commented Nov 18, 2025
Scratch that, it would do nothing, as there's no difference.
martin-frbg commented Nov 18, 2025
Yes, right now VORTEX is also just ARMV8 with a bunch of NEOVERSEN1 kernels on top. Without dedicated kernels, I think the easiest fix would be to put the proper L1 and L2 cache sizes in cpuid_arm64.c when we're on Windows, to guide the block sizes for GEMM etc.
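For context, get_cpuconfig() in cpuid_arm64.c is the piece that writes the cache-size #defines into the generated config header, and those defines are what the blocking parameters end up keyed on. A minimal sketch of the kind of override being suggested might look like the following; this is an illustration rather than the actual patch, and the numbers are placeholders, not the real Oryon figures.

```c
#include <stdio.h>

/* Minimal sketch, not the actual OpenBLAS patch: emit cache-size defines
 * into the generated config, the way get_cpuconfig() in cpuid_arm64.c does
 * for the cores it already knows about.  The numeric values below are
 * placeholders; the real ones would come from the official Oryon spec. */
static void print_oryon_cacheconfig(void)
{
    printf("#define L1_CODE_SIZE %d\n", 64 * 1024);        /* placeholder */
    printf("#define L1_DATA_SIZE %d\n", 64 * 1024);        /* placeholder */
    printf("#define L2_SIZE      %d\n", 1024 * 1024);      /* placeholder */
    printf("#define DTB_DEFAULT_ENTRIES 64\n");
    printf("#define DTB_SIZE 4096\n");
}

int main(void)
{
    print_oryon_cacheconfig();
    return 0;
}
```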
theAeon commented Nov 18, 2025
Yeah - and even if there is optimization here (and there almost certainly is), I don't even know that the cache sizes are an improvement.
martin-frbg commented Nov 18, 2025
Probably needs larger loops to get more stable benchmark results. I do have an Oryon system on loan from Qualcomm, it's just that I'm away from it at the moment, but I'll try to run some experiments myself when I have more time for OpenBLAS again - hopefully soon.
theAeon commented Nov 18, 2025
Oh, I can absolutely just run them on my laptop. How large are we talking?
martin-frbg commented Nov 18, 2025
I'd guess a hundred instead of ten should help.
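For reproducibility, this is roughly what a larger-loop run amounts to. The in-tree benchmark/ programs take their repetition count from the openblas_loops setting mentioned above, but a standalone sketch against the CBLAS interface (hypothetical code, not part of the repo) makes the averaging explicit:

```c
/* Standalone timing sketch (not the in-tree benchmark): average many DGEMM
 * calls at one size so per-call noise and clock ramp-up average out. */
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    int n     = argc > 1 ? atoi(argv[1]) : 2000;   /* matrix size  */
    int loops = argc > 2 ? atoi(argv[2]) : 100;    /* repetitions  */

    double *a = malloc((size_t)n * n * sizeof(double));
    double *b = malloc((size_t)n * n * sizeof(double));
    double *c = malloc((size_t)n * n * sizeof(double));
    for (long i = 0; i < (long)n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    /* warm-up call so threads are spun up before timing starts */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < loops; i++)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%d x %d DGEMM, %d loops: %.1f GFLOPS\n",
           n, n, loops, 2.0 * n * n * (double)n * loops / secs / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```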
theAeon commented Nov 18, 2025
Will report back. With bonus ArmPL for comparison.
theAeon commented Nov 19, 2025
So, it turns out the issue was mostly that running BLAS on 12 cores well exceeds the thermal capacity of my laptop. Fixed that one.

Anyway: seems that there's a thousand to a few thousand megaflops difference in favor of the cache-tuned build at all sizes, which is more what I would have expected. Funnily enough, ArmPL seems to be on par with the N1 build and similarly behind the tuned build. Guess that does make sense; they did optimize for their own cores. Do we know if QC has an optimized implementation?

Results attached in the original comment: N1, Oryon, and ArmPL for comparison.
martin-frbg commented Nov 19, 2025
Hmm. I'm still not that convinced - looks like there is still a lot of noise in the data, and where it looks like there is an improvement from using the correct cache sizes, it is around 2 percent at most?
theAeon commented Nov 19, 2025 • edited
By noise do you mean the fluctuating MFlops as size increases? That's actually fairly reproducible. And yes, around 2%. I think the bottleneck here isn't so much cache locality as much as it is the difference in execution pipeline size (5-wide vs 8-wide).

edit: looking at the block diagrams, it appears the correct way of looking at it is 2 NEON/FP units on the N1 and 4 on Oryon.
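(Back-of-envelope check, taking those unit counts at face value and assuming 128-bit NEON FMA pipes: 4 pipes × 2 double-precision lanes × 2 flops per FMA ≈ 16 DP flops per cycle per core on Oryon, versus ≈ 8 on Neoverse N1 with its 2 pipes - so the FP issue width alone is roughly a factor of two in peak throughput, independent of any cache blocking.)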
abhishek-iitmadras commented Nov 20, 2025
Hi @theAeon, out of curiosity, are you going to add/modify/optimize any kernels for this arch in the future?
theAeon commented Nov 20, 2025
Unfortunately this is not exactly my strong suit, so while I will take a look, I am... not expecting to, no.
martin-frbg commented Nov 20, 2025
I can add a small hack to the CPU detection code to put the correct cache sizes in the config file, as that bit of performance gain is low-hanging (if fairly small) fruit. But frankly I expect the upcoming X2 Elite CPU with its SVE+SME capability to be a markedly more attractive platform for any kind of numerical workload, and it should be quite adequately covered by the ARMV9SME target already.
theAeon commented Nov 20, 2025
That sounds like the way to go. |