Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move implementation of the metadata pool to core #280

Closed

Conversation

ErlendHaa
Copy link
Contributor

No description provided.

The type of an object is defined on the object set, not on the actual
object, thus an instances of basic_object must remain linked to it's
set in order to have a full representation of the object.

The concept of a object set is not semantically intresting after
the objects in the set are parsed. But because of the type, the set has
to be carried through all the way to python.

By setting the type directly on the basic_object when it's parsed the
need to keep it close to the set goes away. It also fasilitates the
posibility of removing object sets from the C++ API all together.
@ErlendHaa ErlendHaa marked this pull request as draft July 13, 2020 12:18
@ErlendHaa ErlendHaa force-pushed the cpp-implementation-of-metadata-pool branch from e1b9db5 to b7e9d96 Compare July 15, 2020 14:28
ErlendHaa and others added 18 commits July 20, 2020 16:10
Logical Record Segment Headers contain useful information when
determining the content of the record. Previously we had to extract the
entire record in order to retreive information about it's content -
often requiring our control flow to extract more records than needed.

The lrsh are already being read when indexing so the only cost here is
an increase in the size of the index.
When index_entry contains attribute and code it makes sense that the
sanity checking of these header fields are done when creating the index.
Also these checks help validate the index, which is useful if we want to
allow an incomplete index to propagate further.
For lookup it's not actually interesting to know which object orders the
lookup. We need barely the reference to logical file itself (which
contains object pool).

With the idea of moving tests to files and stopping their on-fly
creation, situation where object will not have a logical file assigned
is not supposed to happen anymore.
Creation of objects on fly isn't so easy anymore as we need to supply
more and more values, and implementation keeps changing.
Hence rewriting some tests to be based on actual files instead.

This commit doesn't touch however on-fly tests which refered to singular
objects and didn't need logical_file property set as for now they don't
seem to be affected.
This is a small step in a larger quest of moving the pool of metadata
objects to c++ which will give a performance boost.

There is a nice side-effect to this change: Because attic is now a c++
type, the values of object-attributes are not casted to python types
until the attribute is access. Thus UnicodeErrors that were previously
emitted on object construction is now postpone until the attribute is
accessed.
Creating a back-reference from channels to their parent frame is rather
awkward. Ideally we would completely remove it, but that would require
some notable changes to the api. So it stays for now.

The c++ api doesn't currently take part in these back-references between
objects - and we want it to stay that way. The loading routine will
gradually be moved to c++, hence the referencing is moved so that the
porting doesn't require c++ to do back-referencing.
Experience with the python implementation of parsing objects
on-demand has shown that the complexity of bookkeeping the state of
each object-set quickly becomes complicated.

To combat some of that complexity, the object-parsing is moved to the
set itself. The state of the set (i.e. parsed or unparsed) then becomes
an implementation detail of the set. Thus freeing any logical file
implementation of bookkeeping the state of all its object-sets.
As the object_set is given a more prominent role, add method at that
makes it a bit easier to access objects in the set.
Neither explicit nor implicit encrypted records can be processed by
dlisio's parsing routines, hence they are single out straight away
and stored separately.  This seams cleaner than implicitly dropping them
in sub-sequent parsing routines.

The checks for encryptions are preserved in the parsing routines in
case someone tries to do some cleverness on their own.
dl::pool is intended to relieve python's dlis object of the
responsibility of caching raw record, parsed objects and the tedious
task of parsing records at the correct time.

The c++ counterpart is conceptually simpler as each object_set
are responsible for parsing. That means dl::pool is a simple
container with some convenient functions for querying objects.
A common user pattern is to querry for the available object types. This
is a cheap querry as no objects must be parsed in order to return the
types.
Rather then hardcoding a regex matcher in dl::pool's implementation, let
the caller pass along their own user. I.e. we dependancy inject a
matcher function into dl::pool::match. This also nable us to go back to
using python's re.
It the current type of an value_vector is mpark::monostate, mpark::visit
picks the rvalue overload in variant_equal, which makes sense. However
it also picks that overload when both lhs and rhs are monostate,
resulting in a false negative for the comparison.

A new overload with the same typename for both arguments would also
resolve the problem, but I'm afraid that it would swallow more than it
should.
Use dl::pool as the main driver for metadata within the dlis-class.
Objects are no longer cached in python and will be read from the c++
pool at each query.
Due to the intricate structure of a dlis-file, dlisio over-parse when a
certain pieces of information is queried. This would typically makes any
warnings or errors pre-mature and might result in the query failing due
to an error in some (from a user-perspective) unrelated data.

To combat this issue, any errors, inconsistencies or actions taken by
parse_object is communicated through enums set on the parsed objects.
The consumer of the objects are then responsible for investigating the
this information before using the object.

The intention is to follow up on with the same approach basic_object and
object_set.
Reap the benefits of dl::parsing_info's logging capabilities by moving
invariant into dl::attribute.info.
Moving the object-pool to c++ generally postpone any UnicodeWarnings
from f.load() to whenever the broken string is accessed. And as a side
effect of using dl::basic_object as attic we get better encoding support
for strings. That is due to the fact that our string encoding is flawed:
decode_str only applies to our custom type_caster for dl::ident etc.
While pybind11's default caster for string-like objects eludes
decode_str.
@ErlendHaa ErlendHaa force-pushed the cpp-implementation-of-metadata-pool branch from b7e9d96 to ee36163 Compare July 23, 2020 09:23
Rather than filter the list for tells in python, pass the whole index to
findfdata.
Because the memory layout of python lists and std::vector differs,
pybind needs to do a copy operation during conversion.

The index is passed back and forth between c++ and python, which results
in at least two copies during load(). As the content of the index is
never actually examined in python, these copies are completely
unnecessary - and costly as the index may be quite large.

Rendering them opaque means they can be passed by reference by pybind.
Objects are no longer cached in Python, so calling dlis.__getitem__
creates an unnecessary amount of python objects compared to
dlis.object.
Creating the map directly in findfdata reduces the amount of strings
that has to be returned to python. On my machine, the amount of time it
takes to create the fdata index is reduced with about 6%.
Having an overload for just type means we are not restricted to use a
regex matcher when just querying for type. It's rather common query,
and it's faster to to with an exact matcher, compared to regex match on
name which we don't care about anyway.
Copy link
Contributor

@achaikou achaikou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 👍
Spoilers:

  • Great hard work with moving stuff to C++ 👍
  • Looks like we totally disagree about what must be defined as info/warn/etc
  • I am probably not in favor of using std:error after all
  • I don't want to see matcher internals, only main match method!
  • Including consistency commit in this PR might have been a grave mistake :D
  • Probably smaller PRs would work for us better in future than this one behemoth

* In short this is a container for the Logical Record Segment Header
* information and the position of the records in the file.
*/
struct index_entry {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a formal complain: I still believe this must be a part of the record now. You even repeat stuff like is_explicit and is_encrypted.
But I am fine if that's planned, doesn't have to be in this PR.

lib/extension/dlisio/ext/types.hpp Show resolved Hide resolved
@@ -67,6 +67,7 @@ class stream {
struct stream_offsets {
std::vector< index_entry > explicits;
std::vector< index_entry > implicits;
std::vector< index_entry > encrypted;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙀
Didn't we agree on something nice that separates explicit and encrypted property?

python/tests/test_logical_file.py Show resolved Hide resolved
struct index_entry {
long long tell;
std::uint8_t code; // Logical Record Type - Appendix A
std::uint8_t attributes;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mention explicitly that these are attributes of the very first segment, because they won't be the same across all the segments, but you still store them all. Basically explicit and encrypted are the only ones that make sense from the outside as they are (supposed to be) consistent.

@@ -687,6 +687,7 @@ const noexcept (true) {
&& this->count == o.count
&& this->reprc == o.reprc
&& this->units == o.units
&& this->info == o.info
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Questionable.
Info just means information found during processing. Attribute information may come from template, from object, from mix of object and template, from whatever.
If one attribute has empty value and in another it was set by dlisio, does it make these attributes different?
I do not think so.

python/dlisio/plumbing/linkage.py Show resolved Hide resolved
std::string fp = tmp.fingerprint("FRAME");

/* Although index.size() often is fairly large, the number of
* unique frames are typically just an handful. Hence the repeated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

number of unique frames (is) typically (small)
or
there are just (a) handful of unique frames?

number is a handful sounds strange to me


/* Although index.size() often is fairly large, the number of
* unique frames are typically just an handful. Hence the repeated
* calls to find() are not so bad as it might seem at first glace.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also want glace!!!! But probably we have to be satisfied with gla(n)ce 😿

python/tests/test_object_structures.py Show resolved Hide resolved
@ErlendHaa ErlendHaa closed this Oct 26, 2020
@ErlendHaa ErlendHaa deleted the cpp-implementation-of-metadata-pool branch October 27, 2020 07:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants