Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizing evidence representation #998

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

Conversation

bgyori
Copy link
Member

@bgyori bgyori commented Nov 5, 2019

This PR implements two optimizations to the representation of evidences that significantly decrease memory usage when manipulating large sets of INDRA Statements. The bulk of memory used by INDRA Statements is attributable to the Evidence objects (incl. evidence text) that are attached to them. One approach to decrease memory usage is to define the __slots__ attribute of Evidence to make sure the set of attributes it can have is pre-defined (rather than variable via a __dict__ attribute). This seemed to make a minor difference in memory usage. Much larger memory savings can be achieved if lists of Evidences attached to a Statement are stored in a serialized, compressed form, and only decompressed and deserialized when being accessed. Based on some experiments, a Statement with 100 pieces of Evidence uses 75% less memory using this PR. On some large assembled corpora that I tried, which have Statements with a mixture of number of Evidences, 80% lower memory usage is typical.

Not much of this affects the way INDRA Statements are used, however there is one important difference: when accessing a Statement's evidence (i.e., stmt.evidence) one gets a view of the list evidences rather than a reference to them. So directly manipulating stmt.evidence will not result in persistent changes to the Statement. Rather, one has to do something like:

evs = stmt.evidence
for ev in evs:
    # Make some changes to each ev object
stmt.evidence = evs

to make changes to a Statement's list of Evidences. Some specialized code dealing with Evidence manipulation, as well as some tests needed to be updated. I am still ambivalent about whether this change will cause confusion later, and therefore not sure yet if this PR should be merged.

@cthoyt
Copy link
Collaborator

cthoyt commented Nov 12, 2019

From my point of view, this new API is pretty confusing. It's unclear why saving in a variable solves this problem

@bgyori
Copy link
Member Author

bgyori commented Nov 12, 2019

Well, users of INDRA would never really notice any change, it's only during internal development (of e.g., pre-assembly algorithms or input processors) that one could make a mistake by attempting to change a view of a list of Evidences rather than the actual evidence attribute of a Statement. Saving into a variable is not really necessary, the key is just to always set evidences as stmt.evidence = [...] to update the actual evidence list attribute rather than attempt to iterate over and manipulate stmt.evidence[idx] directly, which with this change would just change a view of the evidences. I agree it is somewhat confusing hence my ambivalence about the change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants