Implement variable length arrays (OCCURS) #156

Closed
jpdev-sb opened this issue Aug 15, 2019 · 3 comments

Comments

@jpdev-sb

@yruslan - Thanks so much for your help on #147. We were able to re-export the data to include the RDW. However, we're still facing some issues.

Background

I'm reading a single file with two records that use the same copybook. However, when I try to save the dataframe to JSON, I see four records, and the sections of the JSON that should include repeating values (i.e., the sections of the copybook that use OCCURS...DEPENDING ON) are empty.

The relevant sections of the copybook are here.

           02 FI-IP-SNF-CLM-REC.
             04 FI-IP-SNF-CLM-FIX-GRP.
               06 CLM-REC-IDENT-GRP.
                 08 REC-LNGTH-CNT          PIC S9(5) COMP-3.
           ...
               06 IP-REV-CNTR-CD-I-CNT     PIC 99.
           ...
               06 CLM-REV-CNTR-GRP      OCCURS 0 TO 45 TIMES
                          DEPENDING ON IP-REV-CNTR-CD-I-CNT
                          OF FI-IP-SNF-CLM-REC.
           ...

Cobrix logs the following.

-------- FIELD LEVEL/NAME --------- --ATTRIBS--    FLD  START     END  LENGTH

FI_IP_SNF_CLM_REC                                            1  31656  31656
  4 FI_IP_SNF_CLM_FIX_GRP                           244      1   2058   2058
    6 CLM_REC_IDENT_GRP                               7      1      8      8
      8 REC_LNGTH_CNT                                 3      1      3      3
  ...
    6 IP_REV_CNTR_CD_I_CNT             D            153   1249   1250      2
  ...
    6 CLM_REV_CNTR_GRP                 []           360   4384  31653  27270
  ...

Here's my code:

    val inpk_df = spark
      .read
      .format("cobol")
      .option("copybook", "data/UTLIPSNK.txt")
      .option("generate_record_id", true)
      .option("is_record_sequence", "true")
      .option("is_rdw_big_endian", "true")
      .load("data/in/file1")
    inpk_df.write.json("data/out/file1")

This produces JSON that looks like this.

{
  "File_Id": 0,
  "Record_Id": 0,
  "FI_IP_SNF_CLM_REC": {...}
}
{
  "File_Id": 0,
  "Record_Id": 1,
  "FI_IP_SNF_CLM_REC": {...}
}
{
  "File_Id": 0,
  "Record_Id": 2,
  "FI_IP_SNF_CLM_REC": {...}
}
{
  "File_Id": 0,
  "Record_Id": 3,
  "FI_IP_SNF_CLM_REC": {...}
}

First Question

So, the first question is: why is it creating four records and not two? If I omit .option("is_rdw_big_endian", "true"), I see this error.

java.lang.IllegalStateException: RDW headers should never be zero (64,7,0,0). Found zero size record at 4.
	at za.co.absa.cobrix.cobol.parser.headerparsers.RecordHeaderParserRDW.processRdwHeader(RecordHeaderParserRDW.scala:82)
...

Now, the REC_LNGTH_CNT field should contain the actual record length. Its values for the two records are 16,387 and 13,950, respectively. I tried to use that rather than the RDW, as follows.

...
//      .option("is_record_sequence", "true")
//      .option("is_rdw_big_endian", "true")
      .option("record_length_field", "FI-IP-SNF-CLM-REC.FI-IP-SNF-CLM-FIX-GRP.CLM-REC-IDENT-GRP.REC-LNGTH-CNT")
...

But I got this error.

java.lang.IllegalStateException: Record length value of the field REC_LNGTH_CNT must be an integral type.
	at za.co.absa.cobrix.spark.cobol.reader.varlen.iterator.VarLenNestedIterator.fetchRecordUsingRecordLengthField(VarLenNestedIterator.scala:143)
...

Is that because this field is defined as PIC S9(5) COMP-3 in the copybook?

I'm guessing there is a mismatch between what the RDW is indicating and the actual data. Do you have some pointers for troubleshooting that and working around it?
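
For reference, here is my rough understanding of the encoding (a hand-rolled sketch in plain Scala, not anything from Cobrix): COMP-3 packs two decimal digits per byte and stores the sign in the low nibble of the last byte, so PIC S9(5) COMP-3 occupies the 3 bytes shown in the layout above.

    // Hand-decode a packed-decimal (COMP-3) field, e.g. to cross-check
    // REC_LNGTH_CNT against the RDW length. Sign nibble: 0xC = +, 0xD = -,
    // 0xF = unsigned.
    def decodeComp3(bytes: Array[Byte]): Long = {
      val nibbles = bytes.flatMap(b => Array((b >> 4) & 0x0F, b & 0x0F))
      val sign    = if (nibbles.last == 0x0D) -1L else 1L
      sign * nibbles.dropRight(1).foldLeft(0L)((acc, d) => acc * 10 + d)
    }

    // 16,387 should be packed as the bytes 0x16 0x38 0x7C:
    assert(decodeComp3(Array(0x16, 0x38, 0x7C).map(_.toByte)) == 16387L)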

Second Question

The second question is: why isn't the nested JSON array populated for the variable-length field values?

The value of the IP-REV-CNTR-CD-I-CNT field in the JSON for the first record looks like this:

...
"IP_REV_CNTR_CD_I_CNT": 23,
...

So I expect 23 elements to be populated. The value of the "CLM_REV_CNTR_GRP" key is indeed an array of 23 elements, but they are all empty: the first 20 are objects in which every key has an empty value, and the last three are just empty objects.

Any ideas?

Thanks so much for your help!!!

@yruslan
Collaborator

yruslan commented Aug 19, 2019

Thanks for providing so much context. It's still hard to tell what exactly went wrong, but here are some ideas:

  • When RDWs are used, the number of records is determined by them. So either the RDWs are correct and the file really contains a sequence of 4 records, or the RDWs are wrong or biased and need adjustment. You can verify it this way: if each of the records starts with valid (properly parsed/decoded) values, then the RDWs are likely correct and there actually are 4 records (see the next idea). The sketch after this list shows how to inspect the RDW bytes directly.
  • If your data is hierarchical, it might be that you have 2 root records, each with a child record. Although from a logical perspective that might be considered 2 records (a child record being part of its root record), from the file layout perspective the file might contain 4 records (just an idea, I'm not suggesting this is the case).
  • An example RDW (from the error message) says the record size is 64*256 + 7 = 16391, but the size of the copybook is 31653. Your segments redefine each other in the copybook, right?
  • The fact that the parsed data contains 23 empty elements of an array might be a sign that the copybook doesn't completely match the data.
  • The "REC_LNGTH_CNT must be an integral type" issue is interesting, since REC_LNGTH_CNT is definitely integral. We are going to release a 1.0.0-SNAPSHOT with the rewritten parser soon. I'm wondering if the issue is still present in that version.
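
For the first idea, here is a rough sketch for peeking at the first RDW and comparing the two byte orders (plain JVM I/O, nothing Cobrix-specific; the local path is only an example):

    import java.io.FileInputStream

    // An RDW is a 4-byte record header; which byte pair holds the length
    // depends on the byte order. For the header (64,7,0,0) from your error:
    //   big-endian reads the leading pair:   64*256 + 7 = 16391
    //   little-endian reads the trailing pair: (0,0) -> 0, hence the
    //   "zero size record" error when is_rdw_big_endian is omitted.
    val in  = new FileInputStream("data/in/file1") // example local copy of the file
    val rdw = new Array[Byte](4)
    in.read(rdw)
    in.close()

    val b = rdw.map(_ & 0xFF)
    val bigEndian    = (b(0) << 8) | b(1)
    val littleEndian = (b(3) << 8) | b(2)
    println(s"RDW bytes: ${b.mkString(",")}  BE=$bigEndian  LE=$littleEndian")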

@yruslan yruslan added accepted Accepted for implementation enhancement New feature or request labels Aug 29, 2019
@yruslan yruslan changed the title Seemingly spurious records and missing variable length data Implement variable length arrays (OCCURS) Aug 29, 2019
@yruslan
Collaborator

yruslan commented Aug 29, 2019

Implementation details are discussed in #172.

We might implement it as an option for spark-cobol, e.g. .option(...).

@yruslan
Collaborator

yruslan commented Sep 2, 2019

Please try this snapshot and let me know if it works for you:

<dependency>
    <groupId>za.co.absa.cobrix</groupId>
    <artifactId>spark-cobol</artifactId>
    <version>1.0.1-SNAPSHOT</version>
</dependency>

You also need to use this option:

.option("variable_size_occurs", "true")
