
Generation: use the new top-k implementation in Accelerate #7

Closed
pcuenca opened this issue Aug 8, 2023 · 7 comments
Labels
good first issue Good for newcomers

Comments

@pcuenca
Member

pcuenca commented Aug 8, 2023

Reference: https://developer.apple.com/documentation/accelerate/bnns#4164142

We should use it when permitted by the underlying OS, and compare its performance with the current implementation: d2917b2
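For reference, a baseline top-k is typically a sort-based selection. Here is a minimal sketch of that approach (an illustration, not the actual code from the linked commit):

```swift
import Foundation

// Minimal sketch of a sort-based top-k selection, roughly what a
// baseline implementation looks like (not the code from the commit).
// O(n log n) in the vocabulary size, so the sort dominates for the
// 32k-65k token vocabularies typical of these models.
func topK(_ scores: [Float], k: Int) -> (indices: [Int], values: [Float]) {
    let best = scores.enumerated()
        .sorted { $0.element > $1.element }
        .prefix(k)
    return (best.map(\.offset), best.map(\.element))
}

// Example: topK([0.1, 0.9, 0.5], k: 2) yields indices [1, 2], values [0.9, 0.5]
```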

@pcuenca pcuenca added the good first issue Good for newcomers label Aug 8, 2023
@brow

brow commented Aug 11, 2023

Do you have a sense of what range of values for k and arr.count we want to be optimizing for? If not, I can try running swift-chat with a few models and see what values are typical.

@jkrukowski
Contributor

Hi, I've created a small benchmark repo to compare the different implementations: https://github.com/jkrukowski/topk

It looks like BNNS.NearestNeighbors is the slowest option here (though I might be using it wrong).

@pcuenca
Member Author

pcuenca commented Nov 20, 2023

@jkrukowski Very cool, thanks for that! 🙌

Actually, I was confused when I read the release notes: the implementation I linked in this issue is a nearest-neighbor selection, not top-k. The top-k functions in BNNS are described here.

From your analysis, it looks like the current implementation is faster for small numbers, but degrades for larger values of k. The usual scenario for these models is a vocabulary of a few tens of thousands of tokens: Llama 2 uses 32,000 tokens, for example, while Falcon uses about 65,000.

For typical values of k, I'd say most of the time it's a small number (10 or less), and I think it'd be rare for it to be larger than 100. If you have time, could you maybe run your benchmark with values within those ranges? I'd also suggest removing the softmax operation from your benchmark, as it would be constant across all the implementations.
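A hedged sketch of what calling the BNNS top-k API might look like (requires macOS 14 / iOS 17 or later; the descriptor setup and exact argument labels are my reading of the Accelerate documentation and should be checked against the SDK before use):

```swift
import Accelerate

// Sketch of top-k via BNNS.applyTopK. All labels and descriptor
// construction are assumptions based on the Accelerate docs;
// verify against the SDK headers.
func bnnsTopK(_ scores: [Float], k: Int) throws -> (indices: [Int32], values: [Float]) {
    var input = scores
    var bestValues = [Float](repeating: 0, count: k)
    var bestIndices = [Int32](repeating: 0, count: k)

    try input.withUnsafeMutableBufferPointer { inputPtr in
        try bestValues.withUnsafeMutableBufferPointer { valuesPtr in
            try bestIndices.withUnsafeMutableBufferPointer { indicesPtr in
                let inputDesc = BNNSNDArrayDescriptor(data: inputPtr, shape: .vector(scores.count))!
                let valuesDesc = BNNSNDArrayDescriptor(data: valuesPtr, shape: .vector(k))!
                let indicesDesc = BNNSNDArrayDescriptor(data: indicesPtr, shape: .vector(k))!
                try BNNS.applyTopK(
                    k: k,
                    input: inputDesc,
                    bestValues: valuesDesc,
                    bestIndices: indicesDesc,
                    axis: 0,
                    batchSize: 1,
                    filterParameters: nil)
            }
        }
    }
    return (bestIndices, bestValues)
}
```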

@jkrukowski
Contributor


I've adjusted https://github.com/jkrukowski/topk, please take a look. I added an additional XCTest performance benchmark. The Google Benchmark results are not very conclusive, but according to XCTest, BNNS.applyTopK seems to be the best performer.

@jkrukowski
Contributor

@pcuenca given the above, do you think the implementation at https://github.com/jkrukowski/topk/blob/f2c164e5c360a6a055b785db8830b78c38936e12/Sources/TopK/Accelerate.swift#L7 would be OK? If so, I could create a PR.

@pcuenca
Member Author

pcuenca commented Nov 23, 2023

@jkrukowski Sounds great!

I ran it on my system (M1 Max) and observed similar results, but wondered why Google Benchmark shows such different numbers. I moved the array creation inside the measurement loop to invalidate any potential caching that might be in place. The differences are smaller (as they are dominated by the overhead of array creation), but the Accelerate implementation is still faster, so it looks safe to go ahead!

I'd just recommend adding a few top-k tests to the repo, if possible, to ensure nothing changes before and after – and because it'd be awesome to have them! 😇

Thanks a lot for working on this, it's great to bring performance closer to the greedy case!
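The benchmarking tweak described above can be sketched with XCTest's `measure` API; `topKSelection` here is a stand-in for whichever implementation is under test, not the repo's actual benchmark:

```swift
import XCTest

final class TopKPerformanceTests: XCTestCase {
    // Stand-in for whichever top-k implementation is being benchmarked.
    func topKSelection(_ scores: [Float], k: Int) -> [Int] {
        Array(scores.enumerated().sorted { $0.element > $1.element }.prefix(k).map(\.offset))
    }

    func testTopKPerformance() {
        measure {
            // Create the input inside the measured block so cached
            // results can't flatter any one implementation.
            let logits = (0..<32_000).map { _ in Float.random(in: -10 ... 10) }
            _ = topKSelection(logits, k: 10)
        }
    }
}
```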

@jkrukowski
Contributor

@pcuenca added here #21

@pcuenca pcuenca closed this as completed Nov 26, 2023