
Generation: use the new top-k implementation in Accelerate #7

Closed
pcuenca opened this issue Aug 8, 2023 · 7 comments
Labels
good first issue Good for newcomers

Comments

@pcuenca
Member

pcuenca commented Aug 8, 2023

Reference: https://developer.apple.com/documentation/accelerate/bnns#4164142

We should use it when permitted by the underlying OS, and compare its performance with the current implementation: d2917b2
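For reference, a baseline top-k is typically a sort-based selection. Here is a minimal sketch of that approach (an illustration, not the actual code from the linked commit):

```swift
import Foundation

// Minimal sketch of a sort-based top-k selection, roughly what a
// baseline implementation looks like (not the code from the commit).
// O(n log n) in the vocabulary size, so the sort dominates for the
// 32k-65k token vocabularies typical of these models.
func topK(_ scores: [Float], k: Int) -> (indices: [Int], values: [Float]) {
    let best = scores.enumerated()
        .sorted { $0.element > $1.element }
        .prefix(k)
    return (best.map(\.offset), best.map(\.element))
}

// Example: topK([0.1, 0.9, 0.5], k: 2) yields indices [1, 2], values [0.9, 0.5]
```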

@pcuenca pcuenca added the good first issue Good for newcomers label Aug 8, 2023
@brow

brow commented Aug 11, 2023

Do you have a sense of what range of values for k and arr.count we want to be optimizing for? If not, I can try running swift-chat with a few models and see what values are typical.

@jkrukowski
Contributor

Hi, I've created a small benchmark repo to compare the different implementations: https://github.com/jkrukowski/topk

It looks like BNNS.NearestNeighbors is the slowest option here (though I might be using it wrong).

@pcuenca
Member Author

pcuenca commented Nov 20, 2023

@jkrukowski Very cool, thanks for that! 🙌

Actually, I was confused when I read the release notes: the implementation I linked in this issue is a nearest-neighbor selection, not top-k. The top-k functions in BNNS are described here.

From your analysis, it looks like the current implementation is faster for small numbers, but degrades for larger values of k. The usual scenario for these models is a vocabulary of a few tens of thousands of tokens: Llama 2 uses 32,000 tokens, for example, while Falcon uses about 65,000.

For typical values of k, I'd say most of the time it's a small number (10 or less), and I think it'd be rare for it to be larger than 100. If you have time, could you maybe run your benchmark with values within those ranges? I'd also suggest removing the softmax operation from your benchmark, as it would be constant across all the implementations.
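A hedged sketch of what calling the BNNS top-k API might look like (requires macOS 14 / iOS 17 or later; the descriptor setup and exact argument labels are my reading of the Accelerate documentation and should be checked against the SDK before use):

```swift
import Accelerate

// Sketch of top-k via BNNS.applyTopK. All labels and descriptor
// construction are assumptions based on the Accelerate docs;
// verify against the SDK headers.
func bnnsTopK(_ scores: [Float], k: Int) throws -> (indices: [Int32], values: [Float]) {
    var input = scores
    var bestValues = [Float](repeating: 0, count: k)
    var bestIndices = [Int32](repeating: 0, count: k)

    try input.withUnsafeMutableBufferPointer { inputPtr in
        try bestValues.withUnsafeMutableBufferPointer { valuesPtr in
            try bestIndices.withUnsafeMutableBufferPointer { indicesPtr in
                let inputDesc = BNNSNDArrayDescriptor(data: inputPtr, shape: .vector(scores.count))!
                let valuesDesc = BNNSNDArrayDescriptor(data: valuesPtr, shape: .vector(k))!
                let indicesDesc = BNNSNDArrayDescriptor(data: indicesPtr, shape: .vector(k))!
                try BNNS.applyTopK(
                    k: k,
                    input: inputDesc,
                    bestValues: valuesDesc,
                    bestIndices: indicesDesc,
                    axis: 0,
                    batchSize: 1,
                    filterParameters: nil)
            }
        }
    }
    return (bestIndices, bestValues)
}
```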

@jkrukowski
Contributor


I've adjusted https://github.com/jkrukowski/topk, please take a look. I added an additional XCTest performance benchmark. The Google Benchmark results are not very conclusive, but according to XCTest, BNNS.applyTopK seems to be the best performer.

@jkrukowski
Contributor

@pcuenca given the above, do you think the implementation at https://github.com/jkrukowski/topk/blob/f2c164e5c360a6a055b785db8830b78c38936e12/Sources/TopK/Accelerate.swift#L7 would be OK? If so, I could create a PR.

@pcuenca
Member Author

pcuenca commented Nov 23, 2023

@jkrukowski Sounds great!

I ran it on my system (M1 Max) and observed similar results, but wondered why Google Benchmark shows such different numbers. I moved the array creation inside the measurement loop to invalidate any potential caching that might be in place. The differences are smaller (as they are dominated by the overhead of array creation), but the Accelerate implementation is still faster, so it looks safe to go ahead!

I'd just recommend adding a few top-k tests to the repo, if possible, to ensure nothing changes before and after – and because it'd be awesome to have them! 😇

Thanks a lot for working on this, it's great to bring performance closer to the greedy case!
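The benchmarking tweak described above can be sketched with XCTest's `measure` API; `topKSelection` here is a stand-in for whichever implementation is under test, not the repo's actual benchmark:

```swift
import XCTest

final class TopKPerformanceTests: XCTestCase {
    // Stand-in for whichever top-k implementation is being benchmarked.
    func topKSelection(_ scores: [Float], k: Int) -> [Int] {
        Array(scores.enumerated().sorted { $0.element > $1.element }.prefix(k).map(\.offset))
    }

    func testTopKPerformance() {
        measure {
            // Create the input inside the measured block so cached
            // results can't flatter any one implementation.
            let logits = (0..<32_000).map { _ in Float.random(in: -10 ... 10) }
            _ = topKSelection(logits, k: 10)
        }
    }
}
```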

@jkrukowski
Contributor

@pcuenca added here #21

@pcuenca pcuenca closed this as completed Nov 26, 2023