
Memory leak with table.update on non numbers? #1723

Open · nickroci opened this issue Feb 22, 2022 · 8 comments
Labels: bug (Concrete, reproducible bugs), C++

Comments

nickroci (Contributor) commented Feb 22, 2022

I believe that memory should be freed on updates with the same index value. It seems to be OK for float updates, but for str it leaks - not sure if this is expected?

import perspective
import pyarrow as pa
import os
import psutil
import random

# work with pa bytes for performance - seems to happen with python types too tho


def get_bytes(tbl):
    stream = pa.BufferOutputStream()
    writer = pa.RecordBatchStreamWriter(stream, tbl.schema)
    writer.write_table(tbl)
    writer.close()
    return stream.getvalue().to_pybytes()


process = psutil.Process(os.getpid())

cols = ['col%s' % i for i in range(50)]
schema = pa.schema([(x, pa.string()) for x in cols])
pylist = [{col: str(random.random()) for col in cols} for _ in range(50000)]

table = perspective.Table(schema.serialize().to_pybytes(), index='col0')

max = process.memory_info().rss

while True:
    # leave index the same
    [row.update({col: str(random.random()) for col in cols[1:]}) for row in pylist]

    tbl = pa.Table.from_pylist(pylist, schema=schema)
    update = get_bytes(tbl)
    table.update(update)

    mem = process.memory_info().rss

    if mem > max:
        max = mem
        print(max)
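
For comparison, here is the same loop with float64 columns instead of strings - a sketch that is not part of the original report, reusing the setup above (the imports, get_bytes() and process). Per the report, RSS should plateau here, which is what points at the string column path specifically.

cols = ['col%s' % i for i in range(50)]
schema = pa.schema([(x, pa.float64()) for x in cols])
pylist = [{col: random.random() for col in cols} for _ in range(50000)]

table = perspective.Table(schema.serialize().to_pybytes(), index='col0')

max_rss = process.memory_info().rss
while True:
    # leave the index column (col0) unchanged, update the rest
    [row.update({col: random.random() for col in cols[1:]}) for row in pylist]
    tbl = pa.Table.from_pylist(pylist, schema=schema)
    table.update(get_bytes(tbl))
    mem = process.memory_info().rss
    if mem > max_rss:
        max_rss = mem
        print(max_rss)
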
nickroci (Contributor, Author) commented Feb 23, 2022

This also seems to leak, maybe related

import perspective
import pyarrow as pa
import os
import psutil
import random

# work with pa bytes for performance - seems to happen with python types too


def get_bytes(tbl):
    stream = pa.BufferOutputStream()
    writer = pa.RecordBatchStreamWriter(stream, tbl.schema)
    writer.write_table(tbl)
    writer.close()
    return stream.getvalue().to_pybytes()


process = psutil.Process(os.getpid())

cols = ['col%s' % i for i in range(50)]
pylist = [{col: str(random.random()) for col in cols} for _ in range(5000)]

max = process.memory_info().rss

while True:
    schema = pa.schema([(x, pa.string()) for x in cols])
    [row.update({col: str(random.random()) for col in cols[1:]}) for row in pylist]

    table = perspective.Table(schema.serialize().to_pybytes(), index='col0')
    tbl = pa.Table.from_pylist(pylist, schema=schema)
    update = get_bytes(tbl)
    table.update(update)
    table.delete()

    mem = process.memory_info().rss

    if mem > max:
        max = mem
        print(max)

texodus (Member) commented Feb 25, 2022

Thanks for the report and detailed repros @nickpholden! I am pretty sure you are describing two unrelated issues though -

The second is most certainly a bug: Table objects are simply not collected at all in Python. This appears to be due to a pybind circular reference. I believe #1724 will address it, and I've added some leak tests in that PR based on your repro as well.
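
One way to observe the "never collected" symptom directly is with a weakref - a minimal sketch, not from the thread, assuming the perspective-python API used in the repros above; whether this exact check trips depends on where the reference cycle lives (the Python wrapper vs. the pybind object it holds).

import gc
import weakref

import perspective
import pyarrow as pa

schema = pa.schema([('col0', pa.string())])
table = perspective.Table(schema.serialize().to_pybytes(), index='col0')
ref = weakref.ref(table)

table.delete()
del table
gc.collect()

# With the circular reference described above, the weakref can stay alive;
# once the cycle is broken (see #1724) this should print True.
print(ref() is None)
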

The first issue is a bit more complicated (the effects of the above issue aside) - I'm aware of this behavior already, but it needed to be documented. String values in a Perspective column are always dictionary-encoded (coincidentally near-identical to the Apache Arrow Dictionary type). This is great for categorical data where there are a limited number of potential strings - equivalence can be checked just by comparing indices, storage for duplicate strings is conserved, and there are a bunch of other easy performance wins. However, there is no check whatsoever for when a string's index is no longer referenced by the column itself, and hence when you create a column with unique random strings the table will grow indefinitely. This isn't exactly what I think of as a memory leak, in that the memory is still tracked - calling .delete() on the table itself, for example, should reclaim this memory (disregarding your second example - see above).
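
To illustrate the mechanism described above, here is a toy dictionary-encoded column - a sketch only, not Perspective's actual data structure: rows store only indices into a dictionary of distinct strings, so overwriting a row never removes the old string, and a stream of unique values grows the dictionary without bound.

import random


class DictColumn:
    def __init__(self):
        self.dictionary = []   # every distinct string ever seen
        self.index_of = {}     # string -> slot in the dictionary
        self.rows = []         # per-row indices into the dictionary

    def set(self, row, value):
        idx = self.index_of.get(value)
        if idx is None:
            idx = len(self.dictionary)
            self.dictionary.append(value)
            self.index_of[value] = idx
        # Overwriting a row drops the old index, but the old string stays in
        # the dictionary - nothing tracks that it is now unused.
        if row < len(self.rows):
            self.rows[row] = idx
        else:
            self.rows.append(idx)


col = DictColumn()
for _ in range(3):
    for row in range(1000):
        col.set(row, str(random.random()))     # unique strings every pass
    print(len(col.rows), len(col.dictionary))  # rows stay at 1000, dictionary keeps growing
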

There are a couple of options to address this, but no magic solution that does not impact either wall-clock performance, in-memory size, or API friendliness (or all 3). We could add a new string column type that does not dictionary-encode: it would be slower than the current string column type, take up more memory, and require developer/user intervention to tell the engine to treat the column differently, but it would not grow unbounded with new unique string values. We could add a per-update or timer-based check for unused strings, but like any GC-ed system this may impose performance penalties where you do not want or need them. My preference for now is to add a "vacuum()" method, where the developer could explicitly opt in to invoking a dictionary-column cleanup when they choose to.
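
Continuing the toy sketch above, a vacuum()-style cleanup would have to find the dictionary slots still referenced, rebuild the dictionary, and rewrite every row index into the compacted dictionary (hypothetical, not Perspective's API):

def vacuum(col):
    live = sorted(set(col.rows))                        # slots still in use
    remap = {old: new for new, old in enumerate(live)}
    col.dictionary = [col.dictionary[old] for old in live]
    col.index_of = {s: i for i, s in enumerate(col.dictionary)}
    col.rows = [remap[idx] for idx in col.rows]


vacuum(col)
print(len(col.rows), len(col.dictionary))  # dictionary shrinks back to the live rows
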

texodus linked a pull request Feb 25, 2022 that will close this issue
texodus removed a link to a pull request Feb 25, 2022
nickroci (Contributor, Author) commented

Thanks very much for the info, good to know about that string optimisation - it will help a lot with the issue I am seeing (I can get rid of some junk strings and avoid things like string timestamps).

FYI, I originally thought table.clear might flush all data. delete is a bit painful because I would have to destroy views, but vacuum would give more flexibility.

texodus (Member) commented Feb 25, 2022

I could also make clear() reset the internal dictionary - this is quite a bit easier than implementing vacuum() as the latter must read and then update all index columns that refer to the dictionary.
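
In the same toy sketch, a clear() that also resets the dictionary is trivial by comparison, because no row indices survive and nothing needs to be remapped (again hypothetical, not Perspective's implementation):

def clear(col):
    col.rows = []
    col.dictionary = []
    col.index_of = {}
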

timkpaine (Member) commented

@texodus this sounds good - it's a small performance penalty if you load strings afterwards, but it seems like clear should properly clear.

nickroci (Contributor, Author) commented

I did expect that clear would totally wipe out any data, so from that point of view I think it would make things easier to understand. However, in practice, when you have a long-running table gathering data, it's usually more difficult to totally wipe it out and repopulate it from scratch - the only complete source might be the data in the table itself (if stuff is getting added and removed all the time). I considered things like exporting the bytes, clearing, and then updating it back in, which seems a bit odd, so in that case vacuum would be easier. Having said all that, either would be greatly appreciated!
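
For reference, a sketch of the export / clear / re-load round-trip mentioned above, assuming perspective-python's Table.view() / View.to_arrow() / Table.clear() / Table.update() APIs; note it only reclaims the string dictionary if clear() is changed to reset it as proposed above, and the helper name is hypothetical.

def compact_via_clear(table):
    view = table.view()
    snapshot = view.to_arrow()   # full copy of the current rows as Arrow bytes
    view.delete()
    table.clear()                # would also need to drop the internal dictionary
    table.update(snapshot)       # reload; the dictionary is rebuilt from live strings
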


romeott commented Jul 11, 2023

@nickroci is this still an issue, or should it be closed? If it is still an issue, I have some time on hand and could take it on.

nickroci (Contributor, Author) commented

I'd like to see the vacuum function that we discussed, yes please!

FYI, for my use case (lots of rows and columns with strings + deletes) I ended up reimplementing a lot of stuff (added zstd compression, delete protocol support, UI, end-to-end Arrow), but kept the in-browser DB bit, which I don't think has any competition (DuckDB is pretty static and far behind in terms of features, for instance).
