-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Memory leak with table.update on non numbers? #1723
Comments
|
Thanks for the report and detailed repros @nickpholden! I am pretty sure you are describing two unrelated issues though - The second is most certainly a bug, The first issue is a bit more complicated (the effects of the above issue aside) - I'm aware of the this behavior already but it needed to be documented. String values in a Perspective column are always dictionary-encoded (coincidentally near identical to the Apache Arrow Dictionary type). This is great for category data when there are a limited number of potential strings - equivalence can be checked just by comparing indices, duplicate strings are conserved, and a bunch of other easy performance wins. However, there is no checking whatsoever when a string's index is no longer referenced by the column itself, and hence when you create a column with unique random strings the table will grow indefinitely. This isn't exactly what I think of as a memory leak, in that the memory is still tracked - calling There are a couple of options to address this, but no magic solution that does not impact either wall-clock performance, in-memory size, or API friendliness (or all 3). We could add a new string column type that does not dictionary encode, which would be slower than the current string column type, take up more memory and require developer/user intervention to inform the engine to treat the column differently - but would not grow unbounded with new unique string types. We could add a per-update or timer-based unused string collection check, but like any GC-ed system this may impose performance penalties where you do not want or need them. My preference for now is to add a "vacuum()" method where the developer could explicitly opt-in to invoking a dictionary column cleanup when they choose to. |
Thanks v much for the info, good to know about that string optimisation, will help a lot with the issue I am seeing (can get rid of some junk strings and avoid things like string timestamps). FYI I originally though table.clear might flush all data, delete is a bit painful because would have to destroy views, but vacuum would give more flexibility |
I could also make |
@texodus this sounds good, its a small performance penalty if you load strings after but it seems like |
I did expect that clear would totally wipe out any data so from that point of view I think it would make it easier to understand, however in practice when you have a long running table gathering data - I suppose it's usually more difficult to totally wipe it out and repopulate it from scratch - like the only complete source might be the data in the table itself (if stuff is getting added and removed all the time), I considered things like exporting the bytes, clear, and then updating it back in which seems a bit odd, so in that case vacuum would be easier. Having said all that either would be greatly appreciated! |
@nickroci is this still and issue, should it be closed? If it is still an issue I have some time on hand and I could take it on. |
I'd like to see the vacuum function that we discussed, yes please! FYI for my use case (lots of rows and columns with strings + deletes) ended up reimplementing a lot of stuff (added zstd compression, delete protocol support, ui, end to end arrow) but kept the in browser DB bit which i don't think has any competition (duckdb is pretty static and far behind in terms of features for instance). |
Believe that memory should be freed on updates with the same index val. Seems to be ok for float updates but for str it leaks - not sure this is expected?
The text was updated successfully, but these errors were encountered: