
Memory leak with table.update on non numbers? #1723

Open · nickroci opened this issue Feb 22, 2022 · 8 comments
Labels: bug (Concrete, reproducible bugs), C++

Comments

nickroci (Contributor) commented Feb 22, 2022

I believe that memory should be freed on updates with the same index value. It seems to be OK for float updates, but for str it leaks - not sure if this is expected?

import perspective
import pyarrow as pa
import os
import psutil
import random

# work with pa bytes for performance - seems to happen with python types too tho


def get_bytes(tbl):
    stream = pa.BufferOutputStream()
    writer = pa.RecordBatchStreamWriter(stream, tbl.schema)
    writer.write_table(tbl)
    writer.close()
    return stream.getvalue().to_pybytes()


process = psutil.Process(os.getpid())

cols = ['col%s' % i for i in range(50)]
schema = pa.schema([(x, pa.string()) for x in cols])
pylist = [{col: str(random.random()) for col in cols} for _ in range(50000)]

table = perspective.Table(schema.serialize().to_pybytes(), index='col0')

max = process.memory_info().rss

while True:
    # leave index the same
    [row.update({col: str(random.random()) for col in cols[1:]}) for row in pylist]

    tbl = pa.Table.from_pylist(pylist, schema=schema)
    update = get_bytes(tbl)
    table.update(update)

    mem = process.memory_info().rss

    if mem > max:
        max = mem
        print(max)
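
For comparison, here is the same loop with float64 columns instead of strings - a sketch that is not part of the original report, reusing the setup above (the imports, get_bytes() and process). Per the report, RSS should plateau here, which is what points at the string column path specifically.

cols = ['col%s' % i for i in range(50)]
schema = pa.schema([(x, pa.float64()) for x in cols])
pylist = [{col: random.random() for col in cols} for _ in range(50000)]

table = perspective.Table(schema.serialize().to_pybytes(), index='col0')

max_rss = process.memory_info().rss
while True:
    # leave the index column (col0) unchanged, update the rest
    [row.update({col: random.random() for col in cols[1:]}) for row in pylist]
    tbl = pa.Table.from_pylist(pylist, schema=schema)
    table.update(get_bytes(tbl))
    mem = process.memory_info().rss
    if mem > max_rss:
        max_rss = mem
        print(max_rss)
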
nickroci (Contributor, Author) commented Feb 23, 2022

This also seems to leak, maybe related

import perspective
import pyarrow as pa
import os
import psutil
import random

# work with pa bytes for performance - seems to happen with python types too


def get_bytes(tbl):
    stream = pa.BufferOutputStream()
    writer = pa.RecordBatchStreamWriter(stream, tbl.schema)
    writer.write_table(tbl)
    writer.close()
    return stream.getvalue().to_pybytes()


process = psutil.Process(os.getpid())

cols = ['col%s' % i for i in range(50)]
pylist = [{col: str(random.random()) for col in cols} for _ in range(5000)]

max = process.memory_info().rss

while True:
    schema = pa.schema([(x, pa.string()) for x in cols])
    [row.update({col: str(random.random()) for col in cols[1:]}) for row in pylist]

    table = perspective.Table(schema.serialize().to_pybytes(), index='col0')
    tbl = pa.Table.from_pylist(pylist, schema=schema)
    update = get_bytes(tbl)
    table.update(update)
    table.delete()

    mem = process.memory_info().rss

    if mem > max:
        max = mem
        print(max)

texodus (Member) commented Feb 25, 2022

Thanks for the report and detailed repros @nickpholden! I am pretty sure you are describing two unrelated issues though -

The second is most certainly a bug: Table objects are simply not collected at all in Python. This appears to be due to a pybind circular reference. I believe #1724 will address it, and I've added some leak tests in that PR based on your repro as well.
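
One way to observe the "never collected" symptom directly is with a weakref - a minimal sketch, not from the thread, assuming the perspective-python API used in the repros above; whether this exact check trips depends on where the reference cycle lives (the Python wrapper vs. the pybind object it holds).

import gc
import weakref

import perspective
import pyarrow as pa

schema = pa.schema([('col0', pa.string())])
table = perspective.Table(schema.serialize().to_pybytes(), index='col0')
ref = weakref.ref(table)

table.delete()
del table
gc.collect()

# With the circular reference described above, the weakref can stay alive;
# once the cycle is broken (see #1724) this should print True.
print(ref() is None)
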

The first issue is a bit more complicated (the effects of the above issue aside) - I'm aware of this behavior already, but it needed to be documented. String values in a Perspective column are always dictionary-encoded (coincidentally near-identical to the Apache Arrow Dictionary type). This is great for categorical data where there are a limited number of potential strings - equivalence can be checked just by comparing indices, storage for duplicate strings is conserved, and there are a bunch of other easy performance wins. However, there is no check whatsoever for when a string's index is no longer referenced by the column itself, and hence when you create a column with unique random strings the table will grow indefinitely. This isn't exactly what I think of as a memory leak, in that the memory is still tracked - calling .delete() on the table itself, for example, should reclaim this memory (disregarding your second example - see above).
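
To illustrate the mechanism described above, here is a toy dictionary-encoded column - a sketch only, not Perspective's actual data structure: rows store only indices into a dictionary of distinct strings, so overwriting a row never removes the old string, and a stream of unique values grows the dictionary without bound.

import random


class DictColumn:
    def __init__(self):
        self.dictionary = []   # every distinct string ever seen
        self.index_of = {}     # string -> slot in the dictionary
        self.rows = []         # per-row indices into the dictionary

    def set(self, row, value):
        idx = self.index_of.get(value)
        if idx is None:
            idx = len(self.dictionary)
            self.dictionary.append(value)
            self.index_of[value] = idx
        # Overwriting a row drops the old index, but the old string stays in
        # the dictionary - nothing tracks that it is now unused.
        if row < len(self.rows):
            self.rows[row] = idx
        else:
            self.rows.append(idx)


col = DictColumn()
for _ in range(3):
    for row in range(1000):
        col.set(row, str(random.random()))     # unique strings every pass
    print(len(col.rows), len(col.dictionary))  # rows stay at 1000, dictionary keeps growing
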

There are a couple of options to address this, but no magic solution that does not impact either wall-clock performance, in-memory size, or API friendliness (or all 3). We could add a new string column type that does not dictionary-encode: it would be slower than the current string column type, take up more memory, and require developer/user intervention to tell the engine to treat the column differently, but it would not grow unbounded with new unique string values. We could add a per-update or timer-based check for unused strings, but like any GC-ed system this may impose performance penalties where you do not want or need them. My preference for now is to add a "vacuum()" method, where the developer could explicitly opt in to invoking a dictionary-column cleanup when they choose to.
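
Continuing the toy sketch above, a vacuum()-style cleanup would have to find the dictionary slots still referenced, rebuild the dictionary, and rewrite every row index into the compacted dictionary (hypothetical, not Perspective's API):

def vacuum(col):
    live = sorted(set(col.rows))                        # slots still in use
    remap = {old: new for new, old in enumerate(live)}
    col.dictionary = [col.dictionary[old] for old in live]
    col.index_of = {s: i for i, s in enumerate(col.dictionary)}
    col.rows = [remap[idx] for idx in col.rows]


vacuum(col)
print(len(col.rows), len(col.dictionary))  # dictionary shrinks back to the live rows
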

texodus linked a pull request Feb 25, 2022 that will close this issue
texodus removed a link to a pull request Feb 25, 2022
nickroci (Contributor, Author) commented

Thanks very much for the info, good to know about that string optimisation - it will help a lot with the issue I am seeing (I can get rid of some junk strings and avoid things like string timestamps).

FYI, I originally thought table.clear might flush all data. delete is a bit painful because I would have to destroy views, but vacuum would give more flexibility.

texodus (Member) commented Feb 25, 2022

I could also make clear() reset the internal dictionary - this is quite a bit easier than implementing vacuum() as the latter must read and then update all index columns that refer to the dictionary.
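
In the same toy sketch, a clear() that also resets the dictionary is trivial by comparison, because no row indices survive and nothing needs to be remapped (again hypothetical, not Perspective's implementation):

def clear(col):
    col.rows = []
    col.dictionary = []
    col.index_of = {}
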

timkpaine (Member) commented

@texodus this sounds good - it's a small performance penalty if you load strings afterwards, but it seems like clear should properly clear.

nickroci (Contributor, Author) commented

I did expect that clear would totally wipe out any data, so from that point of view I think it would make things easier to understand. However, in practice, when you have a long-running table gathering data, it's usually more difficult to totally wipe it out and repopulate it from scratch - the only complete source might be the data in the table itself (if stuff is getting added and removed all the time). I considered things like exporting the bytes, clearing, and then updating it back in, which seems a bit odd, so in that case vacuum would be easier. Having said all that, either would be greatly appreciated!
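
For reference, a sketch of the export / clear / re-load round-trip mentioned above, assuming perspective-python's Table.view() / View.to_arrow() / Table.clear() / Table.update() APIs; note it only reclaims the string dictionary if clear() is changed to reset it as proposed above, and the helper name is hypothetical.

def compact_via_clear(table):
    view = table.view()
    snapshot = view.to_arrow()   # full copy of the current rows as Arrow bytes
    view.delete()
    table.clear()                # would also need to drop the internal dictionary
    table.update(snapshot)       # reload; the dictionary is rebuilt from live strings
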


romeott commented Jul 11, 2023

@nickroci is this still an issue, or should it be closed? If it is still an issue, I have some time on hand and could take it on.

nickroci (Contributor, Author) commented

I'd like to see the vacuum function that we discussed, yes please!

FYI, for my use case (lots of rows and columns with strings + deletes) I ended up reimplementing a lot of stuff (added zstd compression, delete protocol support, UI, end-to-end Arrow), but kept the in-browser DB bit, which I don't think has any competition (DuckDB is pretty static and far behind in terms of features, for instance).
