Variable OCCURS clause #239
Variable-size OCCURS produces variable-size records, and variable-size records are only supported if records have an RDW header or a record length field.
The idea of backtracking is interesting and might be the way to overcome the limitation. At first glance, it seems doable. I will think about the details next week.
The issue to solve is processing such files at scale. Currently a sparse index is generated based on RDW headers. That makes it possible to split a variable record length file between partitions. But the index is generated before records are parsed, so actual array sizes are unknown at that stage.
To implement the backtracking, the reader needs to:
- Gather all information on the locations of array element sizes and array lengths, and on the ordering and nesting of arrays.
- Use this information when a sparse index is generated so that files are split properly between partitions.
- Implement the backtracking algorithm during the record decoding stage.
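As a rough illustration of the sparse index idea (not Cobrix's actual implementation), the sketch below walks a buffer of length-prefixed records and remembers the byte offset of every Nth record, so that each partition can start at a known record boundary. The 2-byte big-endian length prefix is a simplified stand-in for an RDW header, and the names `build_sparse_index` and `records_per_entry` are assumptions made for the example.

```python
import struct


def build_sparse_index(data: bytes, records_per_entry: int) -> list[tuple[int, int]]:
    """Return (byte_offset, record_number) for every Nth record.

    Assumes each record is a 2-byte big-endian payload length followed
    by the payload (a simplified stand-in for an RDW header).
    """
    index = []
    offset = 0
    record_no = 0
    while offset + 2 <= len(data):
        (payload_len,) = struct.unpack_from(">H", data, offset)
        if record_no % records_per_entry == 0:
            index.append((offset, record_no))
        # Skip the header and the payload to reach the next record.
        offset += 2 + payload_len
        record_no += 1
    return index


# Three records with payloads of 3, 1 and 2 bytes.
sample = b"\x00\x03abc" + b"\x00\x01x" + b"\x00\x02yz"
index = build_sparse_index(sample, records_per_entry=2)
```

Without a length prefix, the `payload_len` lookup has nothing to read, which is why the array sizes would have to be resolved while the index is being built.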
|
That's exactly what I was thinking. I did something similar for a custom
parser I built in the past for files without RDW headers. What it does is
the following (recursively):
1. If the group or subgroups contain no OCCURS, the size is known from the copybook
directly (i.e. a group of fixed length).
2. Otherwise, determine the size of each subgroup and gather all the depended-on
fields. The group size is the sum, over subgroups, of the element size multiplied
by its occurrence count.
It's not as efficient as reading the RDW, but the result is the record size
that can be used to build the index. I can implement this if you think this
is the way to go.
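The recursive rule above could be sketched as follows. This is a minimal illustration, not the parser mentioned in the comment: the `Field` and `Group` classes and the `values` dictionary of already-decoded depended-on counts are hypothetical, and each depended-on field is assumed to have a single value per record.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class Field:
    name: str
    size: int  # size in bytes, known from the copybook


@dataclass
class Group:
    name: str
    children: list = field(default_factory=list)
    occurs: int = 1                      # fixed OCCURS count
    depending_on: Optional[str] = None   # field that holds the actual count


def size_of(node: Union[Field, Group], values: dict) -> int:
    """Size in bytes of one node, given decoded depended-on values."""
    if isinstance(node, Field):
        return node.size
    # The size of one occurrence is the sum of the children sizes.
    element = sum(size_of(child, values) for child in node.children)
    # Use the actual count when the group has a variable OCCURS.
    count = values[node.depending_on] if node.depending_on else node.occurs
    return element * count


record = Group("REC", [
    Field("CNT", 2),
    Group("ITEMS", [Field("ID", 4), Field("AMT", 6)], occurs=10, depending_on="CNT"),
])
size = size_of(record, {"CNT": 3})  # 2 + 3 * (4 + 6)
```

A nested array whose inner count differs per outer element would need the counts read from the record bytes during the walk rather than from a flat dictionary.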
Backtracking would be hard to implement with partitions. It works fine
serially but unless we know the exact start of the first record in the
partition we are kind of stuck.
…On Sat, Jan 18, 2020, 06:39 Ruslan Yushchenko ***@***.***> wrote:
To implement the backtracking the reader needs to
- Gather all information on the locations of array element sizes and
array lengths and the ordering and nesting of arrays,
- This info needs to be used when a sparse index is generated so that
files are split properly between partitions.
- During the record decoding stage the backtracking algorithm needs to
be implemented.
|
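In fact, something like this could even simplify the code, as there is no need for a distinction between fixed- and variable-length records; length would be determined by:
1. the existence of an RDW block (using the same header parser we use today),
2. a field within the record [size], or
3. the record itself, by parsing the OCCURS dependents.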
|
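Yes, it seems like a very good idea to have a special custom record header parser for the situation when variable_size_occurs=true and there are no RDWs and no record length field.
- That parser could fetch the input stream up to the first OCCURS size field, calculate the next chunk, fetch data up to the next OCCURS size field, and so forth.
- Finally, it could fetch the rest of the data up to the end of the record.
- Since record header parsers are used by the index builder automatically, no other logic needs to be updated.
Points 1-2 are how it already works. Yes, we may not distinguish between fixed record length files and variable record length files. But the simplest fixed record length files use Spark's binaryRecords() API and are still more efficient than any of the cases where record sizes are variable. |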
Maybe we could change the logic of computing BinaryProperties to also keep
track of offsets of dependedOn fields.
What I meant was that we wouldn't have to tell cobrix explicitly whether
the record is fixed or not. If there is no rdw, no length record, and no
variable occurs, then we have fixed length records, otherwise variable.
I guess I'm proposing eliminating the variable records option and letting the
parser decide based on the above.
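The decision proposed here can be written down in a few lines. The function name `length_strategy` and the returned labels are hypothetical and do not correspond to any existing Cobrix option; the order of the checks follows the list of length sources discussed in this thread.

```python
def length_strategy(has_rdw: bool, has_length_field: bool, has_variable_occurs: bool) -> str:
    """Pick how record length is determined, without a user-supplied flag."""
    if has_rdw:
        return "rdw"            # length comes from the RDW header
    if has_length_field:
        return "length_field"   # length comes from a field within the record
    if has_variable_occurs:
        return "occurs"         # length is derived from the OCCURS dependents
    return "fixed"              # nothing varies, so records are fixed length


mode = length_strategy(has_rdw=False, has_length_field=False, has_variable_occurs=True)
```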
…On Mon, Jan 20, 2020, 02:49 Ruslan Yushchenko ***@***.***> wrote:
Yes, it seems like a very good idea to have a special custom record header
parsers for a situation when variable_size_occurs=true and no RDWs, and
no record length field.
- That parser could fetch the input stream up to a first occurs size
field, calculate the next chunk, fetch data up to the next occurs size
field, and so forth.
- And finally, it could fetch the rest of the data up to the end of
the record.
- Since record header parsers are used by the index builder
automatically, no other logic needs to be updated.
In fact, something like this could even simplify the code as there is no
need for a distinction between fixed- and variable-length records; length
would be determined by:
1. existence of an RDW block (using the same header parser we use
today)
2. a field within the record [size]
3. figured out from the record itself by parsing the OCCURS dependents
The 1-2 is how it already works. Yes, we may not distinguish between fixed
record length files and variable record length files. But the simplest
fixed record length files use Spark's binaryRecords() API and are still
more efficient than any of the instances where record sizes are variable.
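The chunk-by-chunk reading described in the quoted reply (read up to an OCCURS size field, compute the next chunk, repeat, then read the tail) might look roughly like this. The `layout` description, the big-endian binary count fields and the function name `read_record` are assumptions for the sake of the example; real copybook fields would need proper decoding.

```python
import io


def read_record(stream, layout):
    """Read one record by alternating between size fields and the chunks they describe.

    layout items (all hypothetical):
      ("fixed", n)  - n bytes of fixed-size fields
      ("count", n)  - an n-byte big-endian integer holding an array length
      ("array", n)  - an array of n-byte elements, sized by the last count read
    Returns the record bytes, or None if the input ends before the record does.
    """
    chunks = []
    count = 0
    for kind, n in layout:
        to_read = count * n if kind == "array" else n
        chunk = stream.read(to_read)
        if len(chunk) < to_read:
            return None  # end of input or truncated record
        if kind == "count":
            count = int.from_bytes(chunk, "big")
        chunks.append(chunk)
    return b"".join(chunks)


layout = [("fixed", 2), ("count", 1), ("array", 3), ("fixed", 1)]
data = io.BytesIO(b"AA\x02xxxyyyZ" + b"BB\x00Z")
first = read_record(data, layout)
second = read_record(data, layout)
third = read_record(data, layout)
```

Because the reader only ever moves forward, this variant needs no seeking, and the length of each returned record is what an index builder would consume.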
|
Yes, might be helpful. I'll look into this issue in a week or two. There are some high priority things I need to finish.
Yes, this is exactly how it works now. The |
Trying to figure out if support for simple cases is sufficient or we need a more generic implementation. |
Yeah, I have a few files that have multiple variable occurs, with some of those nested! What I implemented in the past that I can do here too is a recursive size method to give the size of group given a record. The output would be something akin to Option[int, struct of offsets and datatypes] that could be applied to a record. If there are no variable occurs then it's just an int and we can skip the record. |
The way the custom record header interface is implemented it has the following limitations:
I'm wondering why such complicated files do not contain RDWs or a record length field? |
In order to implement this use case in a scalable and generic way, we need a sparse index builder that takes an AST and an instance of [...]. A very helpful thing from your side would be a unit test that uses a copybook with several variable-size arrays and a data file with various array sizes, including |
These are files that we receive from external vendors and have no control over. They structure the files such that there's a 10-byte header that defines the segment. From there we can determine the max size of the record, but we will only know as we read it. |
To make sure, we're doing |
Makes sense. Thanks! I now have a sketch of a solution in mind. I will proceed by creating another interface and call it something like [...]. In the future, this solution will provide an easy way to extend it with support for 'custom record extractors'. I was thinking that, if implemented properly, this 'custom extractor' interface might be easier to use for custom record formats than 'custom record header parsers'. |
If I understand correctly, this is exactly what I did for my Python parser + decoder. In fact, I have something similar to that interface and the following different extractors since the files we receive have a few different formats (!!):
Even though I don't have files with RDW headers, I implemented it anyway as a particular case of the LeadingBytesSegmentExtractor. In particular, I only have one type of reader, which calls the length extractor before execution. If the file happens to be fixed len, the extractor code skips pre-parsing the file and just returns a range of offsets for each record. Otherwise, the file gets pre-parsed to create the offsets. |
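A minimal sketch of such an extractor interface, assuming in-memory data and hypothetical class names (`RecordLengthExtractor`, `FixedLengthExtractor`, `LeadingBytesExtractor`), is shown below. It is not the Python parser described in the comment. The fixed-length variant derives the ranges from the file size alone, while the leading-bytes variant has to pre-parse the data; the header here is assumed to be a big-endian payload length.

```python
from abc import ABC, abstractmethod


class RecordLengthExtractor(ABC):
    @abstractmethod
    def offsets(self, data: bytes) -> list[tuple[int, int]]:
        """Return (start, end) byte ranges, one per record."""


class FixedLengthExtractor(RecordLengthExtractor):
    def __init__(self, record_size: int):
        self.record_size = record_size

    def offsets(self, data):
        # No pre-parsing: the ranges follow from the data size alone.
        last_start = len(data) - self.record_size
        return [(start, start + self.record_size)
                for start in range(0, last_start + 1, self.record_size)]


class LeadingBytesExtractor(RecordLengthExtractor):
    def __init__(self, header_size: int):
        self.header_size = header_size

    def offsets(self, data):
        # Pre-parse: each header holds the length of the payload after it.
        ranges, pos = [], 0
        while pos + self.header_size <= len(data):
            payload = int.from_bytes(data[pos:pos + self.header_size], "big")
            start = pos + self.header_size
            ranges.append((start, start + payload))
            pos = start + payload
        return ranges


fixed_ranges = FixedLengthExtractor(4).offsets(b"aaaabbbb")
prefixed_ranges = LeadingBytesExtractor(1).offsets(b"\x02hi\x01x")
```

A single reader can then call `offsets` first and decode each range independently, whatever the file format.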
Yes, this makes perfect sense. Thank you!
For now, I'll focus on the specific record extractor for fixed-record-length files with variable size OCCURS. I'm just wondering. You mentioned the other day that you don't use Spark, but just the Cobol parser to achieve the mainframe file conversion. Do the feature requests you created mean that you are planning to use Spark after all? |
I'm using cobrix with pyspark for all the files I can use it on, which at the moment are the fixed len ebcdic files. My current setup for the other files is suboptimal -- I use my custom parser/decoder in Python to output json files and then load them into Spark dataframes. I can help with whatever you need, I'm quite motivated to transition everything to cobrix. |
The variable size occurs feature shouldn't take long. |
I actually have a test for my decoder that tests that format. I'll implement it on a separate branch here (that will of course fail). It would be useful having a list of tasks that need to be completed (such as adding tests, adding a certain class or method, etc). I'll gladly take whatever task I can help with. |
Great, Thanks!
Thanks! I'll keep that in mind. For now, it seems that there are no big enough tasks to develop in parallel. |
Here's a unit test: |
Cool, thanks a lot! I hope to finish the implementation soon. |
While working on this issue I've found a bug in the support for nested OCCURS. The support for variable OCCURS for fixed record length files is deployed as a snapshot. I plan to do more testing on this before releasing it. You can check if it works for you if you want. |
Describe the bug
Variable OCCURS fails if we don't specify variable record lengths. This may be very similar to the discussion of #156, but I think the issue there is slightly different.
To Reproduce
Run the following with pyspark
Expected behavior
All examples above should return the same
but the second one gives only the first record
and the third one fails with a file size check.
Additional context
This issue seems to be a consequence of the fact that the files I have to process do not have a leading RDW block or a field specifying record lengths. Essentially, the variable OCCURS clause means that in my case the record needs to be read in full, but then we need to backtrack the number of bytes that were not needed before reading the following record.
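The backtracking described above can be sketched for a seekable stream as follows. The function `read_with_backtracking`, the `actual_size` callback and the sample layout (a 1-byte count followed by that many 2-byte elements, at most three) are all hypothetical, and the approach assumes the stream supports `tell` and `seek`.

```python
import io


def read_with_backtracking(stream, max_size, actual_size):
    """Read max_size bytes, keep what the record really uses, rewind the rest.

    actual_size is a hypothetical callback that inspects the buffer and
    returns the true record length.
    """
    start = stream.tell()
    buffer = stream.read(max_size)
    if not buffer:
        return None  # end of input
    used = actual_size(buffer)
    if used > len(buffer):
        raise ValueError("truncated record")
    # Backtrack: position the stream at the start of the next record.
    stream.seek(start + used)
    return buffer[:used]


def size_from_count(buffer):
    # Assumed layout: 1-byte count, then that many 2-byte elements.
    return 1 + 2 * buffer[0]


data = io.BytesIO(b"\x01ab" + b"\x03ccddee" + b"\x00")
records = []
while (rec := read_with_backtracking(data, max_size=7, actual_size=size_from_count)) is not None:
    records.append(rec)
```

This works when the file is read serially from its beginning; a partition that starts at an arbitrary offset has no known record boundary to backtrack from.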