The 'with_input_file_name_col' option doesn't work with File offsets #252

Closed
yruslan opened this issue Feb 21, 2020 · 2 comments
Labels
accepted Accepted for implementation bug Something isn't working

Comments

@yruslan
Collaborator

yruslan commented Feb 21, 2020

Describe the bug

The 'with_input_file_name_col' option doesn't work with File offsets.

See #221.

@yruslan yruslan added the bug Something isn't working label Feb 21, 2020
@yruslan yruslan self-assigned this Feb 21, 2020
@yruslan yruslan added the accepted Accepted for implementation label Feb 21, 2020
@bart-at-qqdatafruits

Hi @yruslan

The chosen approach, as recommended by and with the assistance of the client's mainframe specialist (Pascale), was to adapt the copybook.

Using the information in the README and the input from prior issues #153 and #72, I managed to successfully extract the desired data and omit the header and footer.

```scala
import org.apache.spark.sql.functions._

// import org.apache.spark.sql.SparkSession

// Adapted the copybook as recommended by the client's mainframe specialist (Pascale).
//
// Used a redefine on the 2nd level (non-root level).
//
// Approach based on the following input:
// https://github.com/AbsaOSS/cobrix#automatic-segment-redefines-filtering
// #153
// #72

// Keep only the last path component of the input file name.
spark.udf.register("get_file_name", (path: String) => path.split("/").last)

val cobolDataframe = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("schema_retention_policy", "collapse_root")
  .option("segment_field", "REC_GSH_STUB_IDENT")
  .option("segment_id_level0", "G")
  .option("segment_id_level1", "2")
  .option("segment_id_level2", "C")
  .option("redefine_segment_id_map:0", "REC-GSH-STUB => G")
  .option("redefine_segment_id_map:1", "REC-GSH => C")
  .option("redefine_segment_id_map:2", "REC-GSH-STUB => 2")
  .option("pedantic", "true")
  .option("copybook", "file:///home/jovyan/data/BRAND/COPYBOOK_redefine_on_level_2.txt")
  .load("file:///home/jovyan/data/BRAND/initial_transformed")
  // Add the source file name as a column, since 'with_input_file_name_col' is not used here.
  .withColumn("DPSource", callUDF("get_file_name", input_file_name()))
```
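To make the `get_file_name` step easier to check outside Spark, here is a minimal sketch of the same logic as a plain Scala function. The object name `FileNames` and the `getFileNameSafe` variant are assumptions added for illustration; only the `path.split("/").last` expression comes from the snippet above.

```scala
object FileNames {
  // Same logic as the registered UDF: keep everything after the last '/'.
  def getFileName(path: String): String = path.split("/").last

  // Hypothetical safer variant: split("/") yields an empty array for a path
  // made up only of slashes, in which case .last would throw.
  def getFileNameSafe(path: String): String =
    path.split("/").lastOption.getOrElse("")
}
```

For a path such as `file:///home/jovyan/data/BRAND/initial_transformed/part1.dat` (the file name `part1.dat` is made up), `getFileName` returns only the trailing file name, which is what ends up in the `DPSource` column.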
A simplified version of the copybook is:

           01  REC-GSH-GLOBAL.
          *
               03  REC-GSH-STUB.
                   05  REC-GSH-STUB-IDENT   PIC X(1).
                   05  REC-GSH-STUB-REST    PIC X(599).
          *
               03  REC-GSH REDEFINES REC-GSH-STUB.
                   05  REC-GSH-REAL-IDENT   PIC X(1).
                   05  REC-GSH-REAL-REST    PIC X(599).
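As a rough illustration of how the segment ids relate to the two redefines in this copybook, the sketch below routes a decoded record to a redefine group by its first character. It only mirrors the `redefine_segment_id_map` options from the snippet above; it is not how Cobrix implements segment filtering, and the names `SegmentRouting`, `segmentToRedefine`, and `redefineFor` are hypothetical.

```scala
object SegmentRouting {
  // Mirrors the redefine_segment_id_map options: segment id -> redefine group.
  val segmentToRedefine: Map[String, String] = Map(
    "G" -> "REC-GSH-STUB",
    "C" -> "REC-GSH",
    "2" -> "REC-GSH-STUB"
  )

  // Record length implied by the simplified copybook: PIC X(1) + PIC X(599).
  val recordLength: Int = 1 + 599

  // Assumes the record has already been decoded to a String; the segment id
  // is its first character. Unknown ids and empty records map to None.
  def redefineFor(record: String): Option[String] =
    record.headOption.map(_.toString).flatMap(segmentToRedefine.get)
}
```

Under these assumptions, a record starting with `C` maps to `REC-GSH`, while records starting with `G` or `2` map to `REC-GSH-STUB`.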

Thank you very much for the assistance. I recommend closing the issue.

@kriswijnants

The entire adapted copybook will be shared with you.

Thanks in advance,

Bart Debersaques,

@yruslan
Collaborator Author

yruslan commented Feb 24, 2020

Glad you've found a workaround. Nevertheless, `.option("with_input_file_name_col", "DPSource")` could still be used with `.option("file_start_offset", 100)` or `.option("file_end_offset", 100)` after this fix is released.
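A minimal sketch of how these options might be combined once the fix is available, collecting them into a map that can be passed to Spark's `DataFrameReader.options`. The helper name `offsetReaderOptions` and the object `CobrixOptions` are hypothetical, the offset value 100 is just the placeholder used above, and whether the combination works depends on the Cobrix release that contains the fix.

```scala
object CobrixOptions {
  // Hypothetical helper: gathers the reader options discussed above in one place.
  // Offsets are byte counts; 100 is only the placeholder value from this thread.
  def offsetReaderOptions(copybook: String,
                          fileNameCol: String,
                          startOffset: Int,
                          endOffset: Int): Map[String, String] = Map(
    "copybook" -> copybook,
    "with_input_file_name_col" -> fileNameCol,
    "file_start_offset" -> startOffset.toString,
    "file_end_offset" -> endOffset.toString
  )
}

// Usage sketch (needs a SparkSession named `spark` and spark-cobol on the classpath):
// val df = spark.read
//   .format("za.co.absa.cobrix.spark.cobol.source")
//   .options(CobrixOptions.offsetReaderOptions(
//     "file:///home/jovyan/data/BRAND/COPYBOOK_redefine_on_level_2.txt", "DPSource", 100, 100))
//   .load("file:///home/jovyan/data/BRAND/initial_transformed")
```

With this approach the `get_file_name` UDF and the extra `withColumn` call would no longer be needed, although the generated column may then contain the full path rather than only the file name.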

@yruslan yruslan closed this as completed Feb 25, 2020