Implement variable length arrays (OCCURS) #156

Closed
jpdev-sb opened this issue Aug 15, 2019 · 3 comments

Comments

@jpdev-sb

@yruslan - Thanks so much for your help on #147. We were able to re-export the data to include the RDW. However, we're still facing some issues.

Background

I'm reading a single file with two records that use the same copybook. However, when I try to save the dataframe to JSON, I see four records, and the sections of the JSON that should include repeating values (i.e., the sections of the copybook that use OCCURS...DEPENDING ON) are empty.

The relevant sections of the copybook are here.

           02 FI-IP-SNF-CLM-REC.
             04 FI-IP-SNF-CLM-FIX-GRP.
               06 CLM-REC-IDENT-GRP.
                 08 REC-LNGTH-CNT          PIC S9(5) COMP-3.
           ...
               06 IP-REV-CNTR-CD-I-CNT     PIC 99.
           ...
               06 CLM-REV-CNTR-GRP      OCCURS 0 TO 45 TIMES
                          DEPENDING ON IP-REV-CNTR-CD-I-CNT
                          OF FI-IP-SNF-CLM-REC.
           ...

Cobrix logs the following.

-------- FIELD LEVEL/NAME --------- --ATTRIBS--    FLD  START     END  LENGTH

FI_IP_SNF_CLM_REC                                            1  31656  31656
  4 FI_IP_SNF_CLM_FIX_GRP                           244      1   2058   2058
    6 CLM_REC_IDENT_GRP                               7      1      8      8
      8 REC_LNGTH_CNT                                 3      1      3      3
  ...
    6 IP_REV_CNTR_CD_I_CNT             D            153   1249   1250      2
  ...
    6 CLM_REV_CNTR_GRP                 []           360   4384  31653  27270
  ...

Here's my code:

    val inpk_df = spark
      .read
      .format("cobol")
      .option("copybook", "data/UTLIPSNK.txt")
      .option("generate_record_id", true)
      .option("is_record_sequence", "true")
      .option("is_rdw_big_endian", "true")
      .load("data/in/file1")
    inpk_df.write.json("data/out/file1")

This produces JSON that looks like this.

{
  "File_Id": 0,
  "Record_Id": 0,
  "FI_IP_SNF_CLM_REC": {...}
}
{
  "File_Id": 0,
  "Record_Id": 1,
  "FI_IP_SNF_CLM_REC": {...}
}
{
  "File_Id": 0,
  "Record_Id": 2,
  "FI_IP_SNF_CLM_REC": {...}
}
{
  "File_Id": 0,
  "Record_Id": 3,
  "FI_IP_SNF_CLM_REC": {...}
}

First Question

So, the first question is: why is it creating four records and not two? If I omit .option("is_rdw_big_endian", "true"), I see this error.

java.lang.IllegalStateException: RDW headers should never be zero (64,7,0,0). Found zero size record at 4.
	at za.co.absa.cobrix.cobol.parser.headerparsers.RecordHeaderParserRDW.processRdwHeader(RecordHeaderParserRDW.scala:82)
...

Now, the REC_LNGTH_CNT field should contain the actual record length. Its values for the two records are 16,387 and 13,950, respectively. I tried to use that rather than the RDW, as follows.

...
//      .option("is_record_sequence", "true")
//      .option("is_rdw_big_endian", "true")
      .option("record_length_field", "FI-IP-SNF-CLM-REC.FI-IP-SNF-CLM-FIX-GRP.CLM-REC-IDENT-GRP.REC-LNGTH-CNT")
...

But I got this error.

java.lang.IllegalStateException: Record length value of the field REC_LNGTH_CNT must be an integral type.
	at za.co.absa.cobrix.spark.cobol.reader.varlen.iterator.VarLenNestedIterator.fetchRecordUsingRecordLengthField(VarLenNestedIterator.scala:143)
...

Is that because this field is defined as PIC S9(5) COMP-3 in the copybook?

I'm guessing there is a mismatch between what the RDW is indicating and the actual data. Do you have some pointers for troubleshooting that and working around it?
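
For reference, here is my rough understanding of the encoding (a hand-rolled sketch in plain Scala, not anything from Cobrix): COMP-3 packs two decimal digits per byte and stores the sign in the low nibble of the last byte, so PIC S9(5) COMP-3 occupies the 3 bytes shown in the layout above.

    // Hand-decode a packed-decimal (COMP-3) field, e.g. to cross-check
    // REC_LNGTH_CNT against the RDW length. Sign nibble: 0xC = +, 0xD = -,
    // 0xF = unsigned.
    def decodeComp3(bytes: Array[Byte]): Long = {
      val nibbles = bytes.flatMap(b => Array((b >> 4) & 0x0F, b & 0x0F))
      val sign    = if (nibbles.last == 0x0D) -1L else 1L
      sign * nibbles.dropRight(1).foldLeft(0L)((acc, d) => acc * 10 + d)
    }

    // 16,387 should be packed as the bytes 0x16 0x38 0x7C:
    assert(decodeComp3(Array(0x16, 0x38, 0x7C).map(_.toByte)) == 16387L)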

Second Question

The second question is: why isn't the nested JSON array populated for the variable-length field values?

The value of the IP-REV-CNTR-CD-I-CNT field in the JSON for the first record looks like this:

...
"IP_REV_CNTR_CD_I_CNT": 23,
...

So I expect 23 elements to be populated. The value of the "CLM_REV_CNTR_GRP" key is indeed an array of 23 elements, but they are all empty: the first 20 are objects in which every key has an empty value, and the last three are just empty objects.

Any ideas?

Thanks so much for your help!!!

@yruslan
Collaborator

yruslan commented Aug 19, 2019

Thanks for providing so much context. It's still hard to tell what exactly went wrong, but here are some ideas:

  • When RDWs are used, the number of records is determined by them. So either the RDWs are correct and the file really contains a sequence of 4 records, or the RDWs are wrong or biased and need adjustment. You can verify it this way: if each of the records starts with valid (properly parsed/decoded) values, then the RDWs are likely correct and there actually are 4 records (see the next idea). The sketch after this list shows how to inspect the RDW bytes directly.
  • If your data is hierarchical, it might be that you have 2 root records, each with a child record. Although from a logical perspective that might be considered 2 records (a child record being part of its root record), from the file layout perspective the file might contain 4 records (just an idea, I'm not suggesting this is the case).
  • An example RDW (from the error message) says the record size is 64*256 + 7 = 16391, but the size of the copybook is 31653. Your segments redefine each other in the copybook, right?
  • The fact that the parsed data contains 23 empty elements of an array might be a sign that the copybook doesn't completely match the data.
  • The "REC_LNGTH_CNT must be an integral type" issue is interesting, since REC_LNGTH_CNT is definitely integral. We are going to release a 1.0.0-SNAPSHOT with the rewritten parser soon. I'm wondering if the issue is still present in that version.
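
For the first idea, here is a rough sketch for peeking at the first RDW and comparing the two byte orders (plain JVM I/O, nothing Cobrix-specific; the local path is only an example):

    import java.io.FileInputStream

    // An RDW is a 4-byte record header; which byte pair holds the length
    // depends on the byte order. For the header (64,7,0,0) from your error:
    //   big-endian reads the leading pair:   64*256 + 7 = 16391
    //   little-endian reads the trailing pair: (0,0) -> 0, hence the
    //   "zero size record" error when is_rdw_big_endian is omitted.
    val in  = new FileInputStream("data/in/file1") // example local copy of the file
    val rdw = new Array[Byte](4)
    in.read(rdw)
    in.close()

    val b = rdw.map(_ & 0xFF)
    val bigEndian    = (b(0) << 8) | b(1)
    val littleEndian = (b(3) << 8) | b(2)
    println(s"RDW bytes: ${b.mkString(",")}  BE=$bigEndian  LE=$littleEndian")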

@yruslan yruslan added accepted Accepted for implementation enhancement New feature or request labels Aug 29, 2019
@yruslan yruslan changed the title Seemingly spurious records and missing variable length data Implement variable length arrays (OCCURS) Aug 29, 2019
@yruslan
Collaborator

yruslan commented Aug 29, 2019

Implementation details are discussed in #172.

We might implement it as an option for spark-cobol, e.g. .option(...).

@yruslan
Collaborator

yruslan commented Sep 2, 2019

Please try this snapshot and let me know if it works for you:

<dependency>
    <groupId>za.co.absa.cobrix</groupId>
    <artifactId>spark-cobol</artifactId>
    <version>1.0.1-SNAPSHOT</version>
</dependency>

You also need to use this option:

.option("variable_size_occurs", "true")
