Spent 1h today trying to implement an equivalent of vpermilps (_mm_permutevar_ps) in SSE, only to find that my "solution" used a per-lane shift (vpsrlvd)… which is only available in AVX2 🙄 SIMD on Intel is really the Swiss cheese of APIs: it's so difficult to do anything without extensive knowledge of all the quirks and holes. In the end the correct solution was to use pshufb (SSSE3), which is probably obvious if you’re familiar enough with SIMD but still requires jumping through hoops. #simd #sse
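For readers wondering what the pshufb route can look like, here is a minimal sketch in Rust. It is not the poster's actual code. The function names (`permutevar`, `permutevar_ssse3`, `permutevar_scalar`) are made up for illustration, and the approach assumes the 128-bit semantics of vpermilps, where each output lane i takes the input lane selected by the two low bits of control lane i. The idea is to turn each 32-bit selector into four byte indices (4*sel + 0..3) and feed those to pshufb.

```rust
/// Scalar reference: out[i] = a[ctrl[i] & 3], i.e. the 128-bit
/// vpermilps / _mm_permutevar_ps behaviour this sketch tries to match.
pub fn permutevar_scalar(a: [f32; 4], ctrl: [i32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        out[i] = a[(ctrl[i] & 3) as usize];
    }
    out
}

/// Hypothetical SSSE3 version built on pshufb (_mm_shuffle_epi8).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn permutevar_ssse3(a: [f32; 4], ctrl: [i32; 4]) -> [f32; 4] {
    use std::arch::x86_64::*;
    unsafe {
        let va = _mm_loadu_ps(a.as_ptr());
        let vc = _mm_loadu_si128(ctrl.as_ptr() as *const __m128i);
        // Keep the two low bits of each selector and scale by 4, so the low
        // byte of every 32-bit lane holds a byte offset of 0, 4, 8 or 12.
        let byte_base = _mm_slli_epi32::<2>(_mm_and_si128(vc, _mm_set1_epi32(3)));
        // Broadcast that low byte to all four bytes of its own lane.
        let spread = _mm_shuffle_epi8(
            byte_base,
            _mm_setr_epi32(0x0000_0000, 0x0404_0404, 0x0808_0808, 0x0C0C_0C0C),
        );
        // Add 0, 1, 2, 3 inside each lane to address the four bytes of a float.
        let mask = _mm_add_epi8(spread, _mm_set1_epi32(0x0302_0100));
        let shuffled = _mm_shuffle_epi8(_mm_castps_si128(va), mask);
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), _mm_castsi128_ps(shuffled));
        out
    }
}

/// Uses the SSSE3 path when the CPU reports it, otherwise the scalar one.
pub fn permutevar(a: [f32; 4], ctrl: [i32; 4]) -> [f32; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("ssse3") {
            // Safe to call: the feature was just detected at run time.
            return unsafe { permutevar_ssse3(a, ctrl) };
        }
    }
    permutevar_scalar(a, ctrl)
}
```

Calling `permutevar([10.0, 20.0, 30.0, 40.0], [3, 2, 1, 0])` reverses the four lanes. The same mask construction maps directly onto the C intrinsics of the same names.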
High-performance C++ hash table using grouped SIMD metadata scanning
https://github.com/Cranot/grouped-simd-hashtable
#HackerNews #HighPerformance #C++ #HashTable #SIMD #MetadataScanning #TechnologyOptimization #GitHub
SIMD City: Auto-Vectorisation
https://xania.org/202512/20-simd-city
#HackerNews #SIMD #City #Auto-Vectorisation #SIMD #City #Vectorisation #Tech #News #Programming #Insights
Hey all! 👋🏻
I’m looking for a shader-like pipeline/#rendering system, library, or framework for 1-bit graphics with a double #framebuffer (actual and previous frame) and #blitting built on #SIMD and #SWAR. CPU-only, mostly targeting ARM32/64/Thumb1.
I understand that this is rare and unlikely to exist, so I really just need source-based guidance or hints on old-school/demoscene tricks and algorithms that I don’t know yet (I know a lot already, I’m 40), and of course I can port them.
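As a starting point for the kind of SWAR trick asked about, here is a small sketch in Rust, not taken from any existing library. It assumes 32 pixels packed per `u32` with bit 0 as the leftmost pixel. The names `dirty_words`, `changed_pixels` and `blit_word_or` are hypothetical. XOR between the actual and previous frame finds changed words 32 pixels at a time, and an unaligned blit is two shifts plus two ORs per source word.

```rust
/// Indices of the 32-pixel words that differ between the two frames.
pub fn dirty_words(actual: &[u32], previous: &[u32]) -> Vec<usize> {
    let mut out = Vec::new();
    for (i, (a, p)) in actual.iter().zip(previous.iter()).enumerate() {
        // XOR is non-zero exactly when at least one of the 32 pixels changed.
        if (a ^ p) != 0 {
            out.push(i);
        }
    }
    out
}

/// Number of individual pixels that changed between the two frames.
pub fn changed_pixels(actual: &[u32], previous: &[u32]) -> u32 {
    actual
        .iter()
        .zip(previous.iter())
        .map(|(a, p)| (a ^ p).count_ones())
        .sum()
}

/// OR-blit one 32-pixel source word into a row at pixel offset `x`.
/// Pixels that would land past the end of the row are clipped.
pub fn blit_word_or(dst_row: &mut [u32], x: usize, src: u32) {
    let word = x / 32;
    let shift = (x % 32) as u32;
    if word < dst_row.len() {
        // Low part: source pixels that still fit in the first word.
        dst_row[word] |= src << shift;
    }
    if shift != 0 && word + 1 < dst_row.len() {
        // High part: source pixels that spill into the next word.
        dst_row[word + 1] |= src >> (32 - shift);
    }
}
```

The same word-at-a-time structure should carry over to C on ARM32 or Thumb1, where a `u32` matches the register width. On NEON the XOR and OR steps widen to 128 bits, but the cross-word shift needs extra care.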
The state of SIMD in Rust in 2025
https://shnatsel.medium.com/the-state-of-simd-in-rust-in-2025-32c263e5f53d
#HackerNews #SIMD #Rust #2025 #RustProgramming #Technology #Trends #FutureDevelopment
A story about never ever giving up...❤️🔥
After several weeks of questioning my life choices, I've finally figured out why my #Whisper #SpeechToText system had been so slow on #Windows:
Apparently the #Rust-FFI wrapped #CPlusPlus code (Whisper.cpp) wasn't compiled with AVX and AVX2 enabled ( #SIMD!). I tried it on two Windows machines (both AVX-capable). On one of them, running #Linux instead, it detected AVX/AVX2 successfully and ran fast.
1/?
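A quick way to separate "the CPU lacks AVX" from "the build did not enable AVX" on the Rust side is sketched below. The function name `avx_report` is hypothetical. Note the limitation: `cfg!(target_feature = ...)` only reflects how the Rust crate itself was compiled, not which flags the C++ build of Whisper.cpp received, so this can confirm the hardware but not the C++ build.

```rust
/// For each feature: (name, enabled at compile time for this Rust crate,
/// detected on the running CPU). Empty on non-x86 targets.
pub fn avx_report() -> Vec<(&'static str, bool, bool)> {
    #[allow(unused_mut)]
    let mut rows: Vec<(&'static str, bool, bool)> = Vec::new();
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        rows.push((
            "avx",
            cfg!(target_feature = "avx"),
            is_x86_feature_detected!("avx"),
        ));
        rows.push((
            "avx2",
            cfg!(target_feature = "avx2"),
            is_x86_feature_detected!("avx2"),
        ));
        rows.push((
            "fma",
            cfg!(target_feature = "fma"),
            is_x86_feature_detected!("fma"),
        ));
    }
    rows
}
```

Printing the rows on both machines would show whether the CPUs really report the same capabilities. If they do, the difference has to come from how the native code was built.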
Hmm... 🤔
My suspicion why it's "not working" is:
Even though I run `cargo run --release`, I noticed while investigating the compile-failure nightmare above that it puts artifacts into a `Debug` folder.
So it might be that the program (Whisper.cpp to be precise) runs as a debug build and is just _terribly_ slow. 🐌
Oh boy, the struggle continues... 🤸
This might be related:
https://codeberg.org/tazz4843/whisper-rs/issues/226
I decided to share my Arm NEON optimizations for the FFmpeg Cinepak encoder. On Apple Silicon / RPI / NEON 32/64-bit, it gets a 250-300% speedup for encoding: