-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Optimize Cython code. Use scipy blas function pointers. #1524
Conversation
just to understand: this is a rewrite of #1069 where you replaced the lapack/blas part by using the scipy pointers ? Do we have anyone to test on Apple? (given Pauli's comments about "funny" linear algebra libraries) |
Yes, it's a rewrite. We don't use any of the problematic headers from Apple. They're mostly only for single precision and we use double everywhere. I made a note about it at https://github.com/ChadFulton/pykalman_filter/issues/1 |
sounds good, one worry less. However it would still be good if an Apple user could test this before merge. |
We don't use sdot or any of the other problematic functions that Pauli pointed out. AFAIK, it's a non-issue if you're using scipy >= 0.13.x too or a non-accelerate BLAS. If we want to support older scipy and use these problematic functions, then we'll need to do something or tell people not to use the borked Accelerate BLAS. |
We need to bump the Cython version on Travis. Cf. #1523. |
Ubuntu 14.04 LTS will have SciPy 0.13.1, which makes it a little easier to choose a newer version. |
The current issue is that CObject does not exist in Python >= 3.2. They switched to PyCapsule. PyCapsule exists in Python >= 2.7, but as long as we target 2.6, then we can't use the easy fix. We need some kind of conditional header file. It looks like NumPy provides this in |
(Probably irrelevant) might want to use dsymm instead of dgemm here and there |
Yes, P is a symmetric matrix but it not yet handled as such in the code. |
Would save some operations. I think I'll have one more round of optimization / profiling before merging this, provided we get the build crap pinned down. We may not need to use memoryviews here all that much either. We might be better off without buffer acquisition here and using the numpy C api more. |
Ignorance here: what is the reason for not using scipy blas wrappers as is? xSYMM wrappers are available from 0.13 onwards. |
Well, we are using the scipy blas wrappers. We're using the C function pointer to the fblas function directly from Cython. We don't want to involve any Python function calls in the C code. Our dependency for scipy right now is 0.11.0 or 0.12.0. I don't remember offhand. |
Ah, I think I got this. Was trying to be too cute. Just wrote a header with the proper py3k macros. |
All green. Debating whether to try to squeeze a little more performance out of this. |
I'm leaning towards merging this and starting a new issue for further micro-optimizations. This is already a big speed-up even if we can get more. No reason to wait until I have some time to do further profiling. |
ENH: Optimize Cython code. Use scipy blas function pointers.
I'll be porting these scipy-blas shenanigans soon too, for dual py2/3 compatibility in gensim. How has the "capsule" thing panned out since being merged, Skipper? Any issues? |
Works fine AFAICT. No complaints on Travis. See Pauli's reply here [1] for potential issues on Apple's Accelerate with some functions. You can find all the known issues here [2]. There's also some workaround code in Theano that tries to detect problems and conditionally write the headers [3]. I'm fairly certain that's the same sdot issue, but I don't know theano all that well. [1] http://scipy-user.10969.n7.nabble.com/SciPy-User-scipy-linalg-blas-before-0-12-0-td19274.html |
Yeah I noticed, I reported these Apple problems some month ago: https://groups.google.com/d/msg/cython-users/V_DR1xi5Ang/48OamQSX7TwJ and numpy/numpy#4007 General response was "not our problem", so my solution in gensim was to detect whether the Will check out the theano code too, thanks for the links and heads up, Skipper. |
Oh right. I remember reading that thread now. Fun stuff and small world. |
ENH: Optimize Cython code. Use scipy blas function pointers.
Didn't have to change any of the build setup which is nice. Tested building on windows for Python 2.7 32- and 64-bit without problems.
The old mailing list example from 9/13 is now 10-30x faster.
Since the
Z
matrix in the univariate ARMA case is always [[1, 0, ..., 0]]. We now avoid writing general KF code and just use slicing to select what we want.There's probably a bit more performance that can be squeezed out here. We can us PyArray_DATA rather than memoryviews for the arrays that don't need to be sliced often. There are also still a few cases where we use
dgemm
when we don't need to. E.g., we're taking the outer product of twor x 1
vectors, but I left these asdgemm
. The initial variance estimation is also still pure Python because I didn't want to rewrite it. This might gain us the most, but it's only called once per hit. There might also be some more ways to avoid the temporary arrays, but my thinking in BLAS linear algebra needs some work.Need to add a note to the release about this.