-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimizing evidence representation #998
base: master
Are you sure you want to change the base?
Conversation
From my point of view, this new API is pretty confusing. It's unclear why saving in a variable solves this problem |
Well, users of INDRA would never really notice any change, it's only during internal development (of e.g., pre-assembly algorithms or input processors) that one could make a mistake by attempting to change a view of a list of Evidences rather than the actual evidence attribute of a Statement. Saving into a variable is not really necessary, the key is just to always set evidences as |
This PR implements two optimizations to the representation of evidences that significantly decrease memory usage when manipulating large sets of INDRA Statements. The bulk of memory used by INDRA Statements is attributable to the Evidence objects (incl. evidence text) that are attached to them. One approach to decrease memory usage is to define the
__slots__
attribute of Evidence to make sure the set of attributes it can have is pre-defined (rather than variable via a__dict__
attribute). This seemed to make a minor difference in memory usage. Much larger memory savings can be achieved if lists of Evidences attached to a Statement are stored in a serialized, compressed form, and only decompressed and deserialized when being accessed. Based on some experiments, a Statement with 100 pieces of Evidence uses 75% less memory using this PR. On some large assembled corpora that I tried, which have Statements with a mixture of number of Evidences, 80% lower memory usage is typical.Not much of this affects the way INDRA Statements are used, however there is one important difference: when accessing a Statement's evidence (i.e.,
stmt.evidence
) one gets a view of the list evidences rather than a reference to them. So directly manipulatingstmt.evidence
will not result in persistent changes to the Statement. Rather, one has to do something like:to make changes to a Statement's list of Evidences. Some specialized code dealing with Evidence manipulation, as well as some tests needed to be updated. I am still ambivalent about whether this change will cause confusion later, and therefore not sure yet if this PR should be merged.