Move implementation of the metadata pool to core #280

ErlendHaa · 2020-07-13T12:17:35Z

No description provided.

The type of an object is defined on the object set, not on the actual object, thus an instances of basic_object must remain linked to it's set in order to have a full representation of the object. The concept of a object set is not semantically intresting after the objects in the set are parsed. But because of the type, the set has to be carried through all the way to python. By setting the type directly on the basic_object when it's parsed the need to keep it close to the set goes away. It also fasilitates the posibility of removing object sets from the C++ API all together.

Logical Record Segment Headers contain useful information when determining the content of the record. Previously we had to extract the entire record in order to retreive information about it's content - often requiring our control flow to extract more records than needed. The lrsh are already being read when indexing so the only cost here is an increase in the size of the index.

When index_entry contains attribute and code it makes sense that the sanity checking of these header fields are done when creating the index. Also these checks help validate the index, which is useful if we want to allow an incomplete index to propagate further.

For lookup it's not actually interesting to know which object orders the lookup. We need barely the reference to logical file itself (which contains object pool). With the idea of moving tests to files and stopping their on-fly creation, situation where object will not have a logical file assigned is not supposed to happen anymore.

Creation of objects on fly isn't so easy anymore as we need to supply more and more values, and implementation keeps changing. Hence rewriting some tests to be based on actual files instead. This commit doesn't touch however on-fly tests which refered to singular objects and didn't need logical_file property set as for now they don't seem to be affected.

This is a small step in a larger quest of moving the pool of metadata objects to c++ which will give a performance boost. There is a nice side-effect to this change: Because attic is now a c++ type, the values of object-attributes are not casted to python types until the attribute is access. Thus UnicodeErrors that were previously emitted on object construction is now postpone until the attribute is accessed.

Creating a back-reference from channels to their parent frame is rather awkward. Ideally we would completely remove it, but that would require some notable changes to the api. So it stays for now. The c++ api doesn't currently take part in these back-references between objects - and we want it to stay that way. The loading routine will gradually be moved to c++, hence the referencing is moved so that the porting doesn't require c++ to do back-referencing.

Experience with the python implementation of parsing objects on-demand has shown that the complexity of bookkeeping the state of each object-set quickly becomes complicated. To combat some of that complexity, the object-parsing is moved to the set itself. The state of the set (i.e. parsed or unparsed) then becomes an implementation detail of the set. Thus freeing any logical file implementation of bookkeeping the state of all its object-sets.

As the object_set is given a more prominent role, add method at that makes it a bit easier to access objects in the set.

Neither explicit nor implicit encrypted records can be processed by dlisio's parsing routines, hence they are single out straight away and stored separately. This seams cleaner than implicitly dropping them in sub-sequent parsing routines. The checks for encryptions are preserved in the parsing routines in case someone tries to do some cleverness on their own.

dl::pool is intended to relieve python's dlis object of the responsibility of caching raw record, parsed objects and the tedious task of parsing records at the correct time. The c++ counterpart is conceptually simpler as each object_set are responsible for parsing. That means dl::pool is a simple container with some convenient functions for querying objects.

A common user pattern is to querry for the available object types. This is a cheap querry as no objects must be parsed in order to return the types.

Rather then hardcoding a regex matcher in dl::pool's implementation, let the caller pass along their own user. I.e. we dependancy inject a matcher function into dl::pool::match. This also nable us to go back to using python's re.

It the current type of an value_vector is mpark::monostate, mpark::visit picks the rvalue overload in variant_equal, which makes sense. However it also picks that overload when both lhs and rhs are monostate, resulting in a false negative for the comparison. A new overload with the same typename for both arguments would also resolve the problem, but I'm afraid that it would swallow more than it should.

Use dl::pool as the main driver for metadata within the dlis-class. Objects are no longer cached in python and will be read from the c++ pool at each query.

Due to the intricate structure of a dlis-file, dlisio over-parse when a certain pieces of information is queried. This would typically makes any warnings or errors pre-mature and might result in the query failing due to an error in some (from a user-perspective) unrelated data. To combat this issue, any errors, inconsistencies or actions taken by parse_object is communicated through enums set on the parsed objects. The consumer of the objects are then responsible for investigating the this information before using the object. The intention is to follow up on with the same approach basic_object and object_set.

Reap the benefits of dl::parsing_info's logging capabilities by moving invariant into dl::attribute.info.

Moving the object-pool to c++ generally postpone any UnicodeWarnings from f.load() to whenever the broken string is accessed. And as a side effect of using dl::basic_object as attic we get better encoding support for strings. That is due to the fact that our string encoding is flawed: decode_str only applies to our custom type_caster for dl::ident etc. While pybind11's default caster for string-like objects eludes decode_str.

Rather than filter the list for tells in python, pass the whole index to findfdata.

Because the memory layout of python lists and std::vector differs, pybind needs to do a copy operation during conversion. The index is passed back and forth between c++ and python, which results in at least two copies during load(). As the content of the index is never actually examined in python, these copies are completely unnecessary - and costly as the index may be quite large. Rendering them opaque means they can be passed by reference by pybind.

Objects are no longer cached in Python, so calling dlis.__getitem__ creates an unnecessary amount of python objects compared to dlis.object.

Creating the map directly in findfdata reduces the amount of strings that has to be returned to python. On my machine, the amount of time it takes to create the fdata index is reduced with about 6%.

Having an overload for just type means we are not restricted to use a regex matcher when just querying for type. It's rather common query, and it's faster to to with an exact matcher, compared to regex match on name which we don't care about anyway.

achaikou

👍 👍
Spoilers:

Great hard work with moving stuff to C++ 👍
Looks like we totally disagree about what must be defined as info/warn/etc
I am probably not in favor of using std:error after all
I don't want to see matcher internals, only main match method!
Including consistency commit in this PR might have been a grave mistake :D
Probably smaller PRs would work for us better in future than this one behemoth

achaikou · 2020-07-13T13:52:17Z

lib/extension/dlisio/ext/types.hpp

+ * In short this is a container for the Logical Record Segment Header
+ * information and the position of the records in the file.
+ */
+struct index_entry {


Just as a formal complain: I still believe this must be a part of the record now. You even repeat stuff like is_explicit and is_encrypted.
But I am fine if that's planned, doesn't have to be in this PR.

lib/extension/dlisio/ext/types.hpp

achaikou · 2020-07-13T15:47:49Z

lib/extension/dlisio/ext/io.hpp

@@ -67,6 +67,7 @@ class stream {
 struct stream_offsets {
    std::vector< index_entry > explicits;
    std::vector< index_entry > implicits;
+    std::vector< index_entry > encrypted;


🙀
Didn't we agree on something nice that separates explicit and encrypted property?

python/tests/test_logical_file.py

achaikou · 2020-07-23T10:14:49Z

lib/extension/dlisio/ext/types.hpp

+struct index_entry {
+    long long    tell;
+    std::uint8_t code;       // Logical Record Type - Appendix A
+    std::uint8_t attributes;


Mention explicitly that these are attributes of the very first segment, because they won't be the same across all the segments, but you still store them all. Basically explicit and encrypted are the only ones that make sense from the outside as they are (supposed to be) consistent.

achaikou · 2020-07-27T15:53:00Z

lib/src/parse.cpp

@@ -687,6 +687,7 @@ const noexcept (true) {
        && this->count == o.count
        && this->reprc == o.reprc
        && this->units == o.units
+        && this->info  == o.info


Questionable.
Info just means information found during processing. Attribute information may come from template, from object, from mix of object and template, from whatever.
If one attribute has empty value and in another it was set by dlisio, does it make these attributes different?
I do not think so.

python/dlisio/plumbing/linkage.py

achaikou · 2020-07-27T16:07:46Z

lib/src/io.cpp

+        std::string fp = tmp.fingerprint("FRAME");
+
+        /* Although index.size() often is fairly large, the number of
+         * unique frames are typically just an handful. Hence the repeated


number of unique frames (is) typically (small)
or
there are just (a) handful of unique frames?

number is a handful sounds strange to me

achaikou · 2020-07-27T16:09:38Z

lib/src/io.cpp

+
+        /* Although index.size() often is fairly large, the number of
+         * unique frames are typically just an handful. Hence the repeated
+         * calls to find() are not so bad as it might seem at first glace.


I also want glace!!!! But probably we have to be satisfied with gla(n)ce 😿

python/tests/test_object_structures.py

ErlendHaa marked this pull request as draft July 13, 2020 12:18

ErlendHaa force-pushed the cpp-implementation-of-metadata-pool branch from e1b9db5 to b7e9d96 Compare July 15, 2020 14:28

ErlendHaa and others added 18 commits July 20, 2020 16:10

Add dl::object_set::at

29f6559

As the object_set is given a more prominent role, add method at that makes it a bit easier to access objects in the set.

Add dl::pool:types

bed2b74

A common user pattern is to querry for the available object types. This is a cheap querry as no objects must be parsed in order to return the types.

Let dl::pool::match use external matchers

71189aa

Rather then hardcoding a regex matcher in dl::pool's implementation, let the caller pass along their own user. I.e. we dependancy inject a matcher function into dl::pool::match. This also nable us to go back to using python's re.

Add an regex_matcher that uses python's re module

8715bbb

Let dl::pool be the driver for metadata in dlis

f6b0442

Use dl::pool as the main driver for metadata within the dlis-class. Objects are no longer cached in python and will be read from the c++ pool at each query.

Move object_attribute.invariant to attribute.info

63b8500

Reap the benefits of dl::parsing_info's logging capabilities by moving invariant into dl::attribute.info.

ErlendHaa force-pushed the cpp-implementation-of-metadata-pool branch from b7e9d96 to ee36163 Compare July 23, 2020 09:23

ErlendHaa added 8 commits July 23, 2020 12:37

fixup! Set infocodes on dl::object_attribute when parsing

3e6582e

fixup! Let dl::pool::match use external matchers

f0b7402

Pass std::vector< dl::index_entry >& to findfdata

785a1db

Rather than filter the list for tells in python, pass the whole index to findfdata.

fixup! Set infocodes on dl::object_attribute when parsing

4a4c4af

Use dlis.object when creating obj-to-obj links

b99f75f

Objects are no longer cached in Python, so calling dlis.__getitem__ creates an unnecessary amount of python objects compared to dlis.object.

Create fdata_index in findfdata

12c69a1

Creating the map directly in findfdata reduces the amount of strings that has to be returned to python. On my machine, the amount of time it takes to create the fdata index is reduced with about 6%.

achaikou reviewed Jul 27, 2020

View reviewed changes

ErlendHaa closed this Oct 26, 2020

ErlendHaa deleted the cpp-implementation-of-metadata-pool branch October 27, 2020 07:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move implementation of the metadata pool to core #280

Move implementation of the metadata pool to core #280

ErlendHaa commented Jul 13, 2020

achaikou left a comment

achaikou Jul 13, 2020

achaikou Jul 13, 2020

achaikou Jul 23, 2020

achaikou Jul 27, 2020

achaikou Jul 27, 2020

achaikou Jul 27, 2020

Move implementation of the metadata pool to core #280

Move implementation of the metadata pool to core #280

Conversation

ErlendHaa commented Jul 13, 2020

achaikou left a comment

Choose a reason for hiding this comment

achaikou Jul 13, 2020

Choose a reason for hiding this comment

achaikou Jul 13, 2020

Choose a reason for hiding this comment

achaikou Jul 23, 2020

Choose a reason for hiding this comment

achaikou Jul 27, 2020

Choose a reason for hiding this comment

achaikou Jul 27, 2020

Choose a reason for hiding this comment

achaikou Jul 27, 2020

Choose a reason for hiding this comment