Description
I have been floating the idea of exposing and using SIMD instructions in a generic way. @JeffBezanson has suggested that this is quite feasible, and @StefanKarpinski suggested VecInt and VecFloat bits types in this thread:
https://groups.google.com/forum/?fromgroups=#!topic/julia-dev/h8EGtQvdq9U
To refine this further, we may need separate VecInt32, VecInt64, VecFloat32, and VecFloat64 bits types. Arrays of these Vec types could then simply be reinterpretations of arrays of Int32, Int64, Float32, and Float64.
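To make the reinterpretation idea concrete, here is a minimal sketch in C using the GCC/Clang vector extension rather than Julia, since the machine-level idea is language-independent. The names `vec_float32` and `scale_as_vectors` are hypothetical. The sketch views an aligned `float` array as an array of 4-lane vectors without copying anything, which is the relationship proposed above between VecFloat32 arrays and Float32 arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C stand-in for a VecFloat32 bits type: four Float32 lanes
   in one 16-byte value (GCC/Clang vector extension). may_alias allows a
   plain float array to be viewed through this type. */
typedef float vec_float32 __attribute__((vector_size(16), may_alias));

#define LANES (sizeof(vec_float32) / sizeof(float)) /* 4 lanes */

/* Scale n floats by k by reinterpreting the array as n / LANES vectors.
   No data is copied; each iteration multiplies four floats at once.
   Assumes n is a multiple of LANES and x is 16-byte aligned. */
static void scale_as_vectors(float *x, size_t n, float k) {
    assert(n % LANES == 0);
    assert((uintptr_t)x % sizeof(vec_float32) == 0);

    vec_float32 *v = (vec_float32 *)x; /* the reinterpretation */
    vec_float32 kv = {k, k, k, k};     /* k broadcast to every lane */
    for (size_t i = 0; i < n / LANES; i++)
        v[i] *= kv;
}
```

The same pattern would apply to the integer and Float64 variants by changing the element type; only the lane count changes.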
We already align memory allocations to 16 bytes, but AVX instructions may need 32-byte alignment. Wider alignment could waste memory for small arrays, but there could be real performance gains for 2x2 and 3x3 matrix operations. As @StefanKarpinski suggested in the thread above, we may need to handle the unaligned head and tail portions of an array separately.
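The head/tail strategy could look roughly like this, again as a hedged C sketch with hypothetical names (`vec8f`, `add_arrays`). It assumes 32-byte (AVX-width) vectors of eight Float32 lanes: it runs scalar iterations until the output pointer reaches a 32-byte boundary, processes the aligned middle eight lanes at a time, and finishes the leftover tail with scalar code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 8 Float32 lanes = 32 bytes, the width of an AVX register. */
typedef float vec8f __attribute__((vector_size(32), may_alias));

#define VEC_BYTES sizeof(vec8f)
#define VEC_LANES (VEC_BYTES / sizeof(float))

/* y[i] = a[i] + b[i] for i in [0, n). */
static void add_arrays(float *y, const float *a, const float *b, size_t n) {
    size_t i = 0;

    /* Head: one element at a time until y + i is 32-byte aligned. */
    while (i < n && (uintptr_t)(y + i) % VEC_BYTES != 0) {
        y[i] = a[i] + b[i];
        i++;
    }

    /* Body: whole vectors. Stores to y are aligned; a and b may have a
       different misalignment, so they are loaded via memcpy, which
       compilers typically turn into unaligned vector loads. */
    for (; i + VEC_LANES <= n; i += VEC_LANES) {
        vec8f va, vb;
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        assert((uintptr_t)(y + i) % VEC_BYTES == 0);
        *(vec8f *)(y + i) = va + vb;
    }

    /* Tail: fewer than VEC_LANES elements remain. */
    for (; i < n; i++)
        y[i] = a[i] + b[i];
}
```

Only the output is aligned here; the inputs use unaligned loads so that arrays with different alignment offsets still work. If all arrays were guaranteed to share the same offset, all three could use aligned accesses.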
Once we have this, we would probably want to implement all the vectorized operations in the base library using SIMD. That would give a nice performance boost until we have some kind of threading or fork/shmem-based parallelism for Array objects.
There are also trigonometric and other math library functions. SIMD versions of these are unlikely to be as accurate as openlibm, but they could be a whole lot faster. Perhaps they should not be used by default, but be available as vecsin etc.
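To illustrate the accuracy-for-speed trade, here is a sketch of what a hypothetical `vecsin` might compute, written in C with assumed names (`approx_sin`, `vecsin`). It is not openlibm's algorithm. It reduces the argument to [-π/2, π/2] and evaluates a degree-9 Taylor polynomial, giving an absolute error of roughly 1e-5 or better for moderate |x|. The loop body has no library calls, so GCC or Clang may auto-vectorize it at -O3, though that is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>

/* Approximate sin(x). Assumes |x| is small enough that x / (2*pi) fits
   in an int; accuracy degrades slowly as |x| grows. */
static inline float approx_sin(float x) {
    const float two_pi = 6.28318531f;
    const float pi = 3.14159265f;
    const float half_pi = 1.57079633f;

    /* Range reduction to [-pi, pi]: subtract the nearest multiple of
       2*pi, rounding by biased truncation to avoid a libm call. */
    float q = x * (1.0f / two_pi);
    int n = (int)(q + (q >= 0.0f ? 0.5f : -0.5f));
    float r = x - (float)n * two_pi;

    /* Fold into [-pi/2, pi/2] using sin(pi - r) = sin(r). */
    if (r > half_pi) r = pi - r;
    else if (r < -half_pi) r = -pi - r;

    /* Odd Taylor polynomial up to r^9, evaluated with Horner's rule. */
    float r2 = r * r;
    return r * (1.0f + r2 * (-1.0f / 6 + r2 * (1.0f / 120
             + r2 * (-1.0f / 5040 + r2 * (1.0f / 362880)))));
}

/* Elementwise approximate sin over an array. */
static void vecsin(float *y, const float *x, size_t n) {
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] = approx_sin(x[i]);
}
```

A production version would more likely use minimax coefficients, which spread the error more evenly than Taylor coefficients for the same polynomial degree.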