When one uses intrinsics, it is supposed to be almost assembly, compiler should only do [optimal] register allocations and deal with input and output variables. However, compilers now do much more than that, especially the optimizations they perform on top of intrinsics are often very non-obvious.
In case of the code I examined, ICC produced ~20% faster code for a scalar part, and ~10% faster code for one vector basic block, but ~2x slower for another vector basic block. When combined, these resulted in ~1.5x difference.
It is not the first time I noticed the compiler optimizations significantly alter SIMD intrinsics flow, but before it was always something small, like fusing 2 instructions when a SIMD equivalent exists, etc. In this case it just produced 2x+ instructions! Alas, just changing to -O0 won't make it back to the planned flow, as O0 adds too many intermediates... So if I am certain that my instruction flow is better than compilers, I'll have to do inline asm rather than intrinsics.