Spent 1h today trying to implement an equivalent of vpermilps (_mm_permutevar_ps) in SSE, only to find that my "solution" used a per-lane shift (vpsrlvd)… which is only available in AVX2 🙄 SIMD on Intel is really the Swiss cheese of APIs; so difficult to do anything without an extensive knowledge of all the quirks and holes in the API. In the end the correct solution was to use pshufb, which is probably obvious if you’re familiar enough with SIMD but requires jumping through hoops. #simd #sse