-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move implementation of the metadata pool to core #280
Move implementation of the metadata pool to core #280
Conversation
The type of an object is defined on the object set, not on the actual object, thus an instances of basic_object must remain linked to it's set in order to have a full representation of the object. The concept of a object set is not semantically intresting after the objects in the set are parsed. But because of the type, the set has to be carried through all the way to python. By setting the type directly on the basic_object when it's parsed the need to keep it close to the set goes away. It also fasilitates the posibility of removing object sets from the C++ API all together.
e1b9db5
to
b7e9d96
Compare
Logical Record Segment Headers contain useful information when determining the content of the record. Previously we had to extract the entire record in order to retreive information about it's content - often requiring our control flow to extract more records than needed. The lrsh are already being read when indexing so the only cost here is an increase in the size of the index.
When index_entry contains attribute and code it makes sense that the sanity checking of these header fields are done when creating the index. Also these checks help validate the index, which is useful if we want to allow an incomplete index to propagate further.
For lookup it's not actually interesting to know which object orders the lookup. We need barely the reference to logical file itself (which contains object pool). With the idea of moving tests to files and stopping their on-fly creation, situation where object will not have a logical file assigned is not supposed to happen anymore.
Creation of objects on fly isn't so easy anymore as we need to supply more and more values, and implementation keeps changing. Hence rewriting some tests to be based on actual files instead. This commit doesn't touch however on-fly tests which refered to singular objects and didn't need logical_file property set as for now they don't seem to be affected.
This is a small step in a larger quest of moving the pool of metadata objects to c++ which will give a performance boost. There is a nice side-effect to this change: Because attic is now a c++ type, the values of object-attributes are not casted to python types until the attribute is access. Thus UnicodeErrors that were previously emitted on object construction is now postpone until the attribute is accessed.
Creating a back-reference from channels to their parent frame is rather awkward. Ideally we would completely remove it, but that would require some notable changes to the api. So it stays for now. The c++ api doesn't currently take part in these back-references between objects - and we want it to stay that way. The loading routine will gradually be moved to c++, hence the referencing is moved so that the porting doesn't require c++ to do back-referencing.
Experience with the python implementation of parsing objects on-demand has shown that the complexity of bookkeeping the state of each object-set quickly becomes complicated. To combat some of that complexity, the object-parsing is moved to the set itself. The state of the set (i.e. parsed or unparsed) then becomes an implementation detail of the set. Thus freeing any logical file implementation of bookkeeping the state of all its object-sets.
As the object_set is given a more prominent role, add method at that makes it a bit easier to access objects in the set.
Neither explicit nor implicit encrypted records can be processed by dlisio's parsing routines, hence they are single out straight away and stored separately. This seams cleaner than implicitly dropping them in sub-sequent parsing routines. The checks for encryptions are preserved in the parsing routines in case someone tries to do some cleverness on their own.
dl::pool is intended to relieve python's dlis object of the responsibility of caching raw record, parsed objects and the tedious task of parsing records at the correct time. The c++ counterpart is conceptually simpler as each object_set are responsible for parsing. That means dl::pool is a simple container with some convenient functions for querying objects.
A common user pattern is to querry for the available object types. This is a cheap querry as no objects must be parsed in order to return the types.
Rather then hardcoding a regex matcher in dl::pool's implementation, let the caller pass along their own user. I.e. we dependancy inject a matcher function into dl::pool::match. This also nable us to go back to using python's re.
It the current type of an value_vector is mpark::monostate, mpark::visit picks the rvalue overload in variant_equal, which makes sense. However it also picks that overload when both lhs and rhs are monostate, resulting in a false negative for the comparison. A new overload with the same typename for both arguments would also resolve the problem, but I'm afraid that it would swallow more than it should.
Use dl::pool as the main driver for metadata within the dlis-class. Objects are no longer cached in python and will be read from the c++ pool at each query.
Due to the intricate structure of a dlis-file, dlisio over-parse when a certain pieces of information is queried. This would typically makes any warnings or errors pre-mature and might result in the query failing due to an error in some (from a user-perspective) unrelated data. To combat this issue, any errors, inconsistencies or actions taken by parse_object is communicated through enums set on the parsed objects. The consumer of the objects are then responsible for investigating the this information before using the object. The intention is to follow up on with the same approach basic_object and object_set.
Reap the benefits of dl::parsing_info's logging capabilities by moving invariant into dl::attribute.info.
Moving the object-pool to c++ generally postpone any UnicodeWarnings from f.load() to whenever the broken string is accessed. And as a side effect of using dl::basic_object as attic we get better encoding support for strings. That is due to the fact that our string encoding is flawed: decode_str only applies to our custom type_caster for dl::ident etc. While pybind11's default caster for string-like objects eludes decode_str.
b7e9d96
to
ee36163
Compare
Rather than filter the list for tells in python, pass the whole index to findfdata.
Because the memory layout of python lists and std::vector differs, pybind needs to do a copy operation during conversion. The index is passed back and forth between c++ and python, which results in at least two copies during load(). As the content of the index is never actually examined in python, these copies are completely unnecessary - and costly as the index may be quite large. Rendering them opaque means they can be passed by reference by pybind.
Objects are no longer cached in Python, so calling dlis.__getitem__ creates an unnecessary amount of python objects compared to dlis.object.
Creating the map directly in findfdata reduces the amount of strings that has to be returned to python. On my machine, the amount of time it takes to create the fdata index is reduced with about 6%.
Having an overload for just type means we are not restricted to use a regex matcher when just querying for type. It's rather common query, and it's faster to to with an exact matcher, compared to regex match on name which we don't care about anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 👍
Spoilers:
- Great hard work with moving stuff to C++ 👍
- Looks like we totally disagree about what must be defined as info/warn/etc
- I am probably not in favor of using std:error after all
- I don't want to see matcher internals, only main match method!
- Including consistency commit in this PR might have been a grave mistake :D
- Probably smaller PRs would work for us better in future than this one behemoth
* In short this is a container for the Logical Record Segment Header | ||
* information and the position of the records in the file. | ||
*/ | ||
struct index_entry { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a formal complain: I still believe this must be a part of the record now. You even repeat stuff like is_explicit and is_encrypted.
But I am fine if that's planned, doesn't have to be in this PR.
@@ -67,6 +67,7 @@ class stream { | |||
struct stream_offsets { | |||
std::vector< index_entry > explicits; | |||
std::vector< index_entry > implicits; | |||
std::vector< index_entry > encrypted; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙀
Didn't we agree on something nice that separates explicit and encrypted property?
struct index_entry { | ||
long long tell; | ||
std::uint8_t code; // Logical Record Type - Appendix A | ||
std::uint8_t attributes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mention explicitly that these are attributes of the very first segment, because they won't be the same across all the segments, but you still store them all. Basically explicit and encrypted are the only ones that make sense from the outside as they are (supposed to be) consistent.
@@ -687,6 +687,7 @@ const noexcept (true) { | |||
&& this->count == o.count | |||
&& this->reprc == o.reprc | |||
&& this->units == o.units | |||
&& this->info == o.info |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Questionable.
Info just means information found during processing. Attribute information may come from template, from object, from mix of object and template, from whatever.
If one attribute has empty value and in another it was set by dlisio, does it make these attributes different?
I do not think so.
std::string fp = tmp.fingerprint("FRAME"); | ||
|
||
/* Although index.size() often is fairly large, the number of | ||
* unique frames are typically just an handful. Hence the repeated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
number of unique frames (is) typically (small)
or
there are just (a) handful of unique frames?
number is a handful sounds strange to me
|
||
/* Although index.size() often is fairly large, the number of | ||
* unique frames are typically just an handful. Hence the repeated | ||
* calls to find() are not so bad as it might seem at first glace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also want glace!!!! But probably we have to be satisfied with gla(n)ce 😿
No description provided.