spark input_file_name() not working in cobrix #221
Comments
Thanks for reporting the issue! Looks interesting. Will take a look. |
I can confirm the issue. Indeed, input_file_name() does not work for variable-record-size files. It will take a while to fix this properly (we probably need to create a custom RDD). But we can add a workaround that generates a column with the input file name for each record. That's what we are going to do first. It would look like this:
|
Just a double check. Which Spark version are you using? We are planning to release Cobrix 2.0.0 first and all further changes will be made there. But it will support Spark 2.4 or above. |
Great! Cobrix 2.0.0 is planned to be released this week. And the workaround for this issue can be expected sometime next week. |
This should be fixed in the latest snapshot. Please try
<dependency>
  <groupId>za.co.absa.cobrix</groupId>
  <artifactId>spark-cobol_2.11</artifactId>
  <version>2.0.1-SNAPSHOT</version>
</dependency>
and let me know if the issue is fixed. |
Forgot to mention. In order to get input file names for each record of a variable record length file a workaround is used. In your case the option looks like this:
.option("with_input_file_name_col", "ISN_Source")
I'd also recommend using
.option("pedantic", "true")
so that unrecognized options cause errors. |
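For reference, here is a minimal sketch of how the two options above might fit into a complete read. It assumes a Cobrix version that includes the workaround (2.0.1 or later, going by this thread) is on the classpath. The object name, method name, and path parameters are hypothetical; the option names and the "cobol" format string are the ones used in this thread.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object InputFileNameWorkaround {
  // Hypothetical helper: reads a variable-record-length file and asks Cobrix
  // to add the source file name as a column instead of using input_file_name().
  def readWithFileName(spark: SparkSession, copybookPath: String, dataPath: String): DataFrame =
    spark.read
      .format("cobol")
      .option("copybook", copybookPath)
      .option("is_record_sequence", "true")
      // Name of the generated column holding the input file name.
      .option("with_input_file_name_col", "ISN_Source")
      // Fail on unrecognized options rather than silently ignoring them.
      .option("pedantic", "true")
      .load(dataPath)
}
```

If the workaround behaves as described, the returned DataFrame should carry an ISN_Source column, so the original `.withColumn("ISN_Source", input_file_name)` call would no longer be needed.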
Hi Ruslan,
Apologies for the late reply. I get an error when I try to install the new version via Maven.
So for the moment we are still using version 1.0.2.
Once I get the Maven package running I'll try it.
But I take your word for it that it's fixed.
Thanks for having a look into this!
Regards,
Kris
Kris Wijnants
On Wednesday, 18 December 2019 8:43, Ruslan Yushchenko wrote:
Forgot to mention. In order to get input file names for each record of a variable record length file a workaround is used. In your case the option looks like this:
.option("with_input_file_name_col", "ISN_Source")
I'd also recommend using
.option("pedantic", "true")
So that unrecognized options cause errors.
|
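Hi Kris, Snapshot version linking requires additional configuration in .m2/settings.xml. It might be even harder for managed clusters. Try setting the version to 2.0.1, which was released today. And please let me know if it worked for you. Thank you, Ruslan |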
Hi Ruslan,
I just tried, and it works perfectly!
It now shows the file name of EBCDIC files when using the option is_record_sequence = true.
Thanks a lot for your efforts!
Regards,
Kris
Kris Wijnants
On Friday, 20 December 2019 13:53, Ruslan Yushchenko wrote:
Hi Kris,
Snapshot version linking requires additional configuration in .m2/settings.xml. It might be even harder for managed clusters.
Try setting the version to 2.0.1 which was released today.
And please let me know if it worked for you.
Thank you,
Ruslan
|
H2. Environment
docker: jupyter/all-spark-notebook:latest + Apache Toree - Scala
H2. Issue
When using .option("file_start_offset", "600"), input_file_name() no longer works.
H3. Anonymized extract
import org.apache.spark.sql.SparkSession
spark.udf.register("get_file_name", (path: String) => path.split("/").last)
|
Hi Ruslan,
Hope you are doing well.
I’m also involved in the project Bart Debersaque is working on.
So you can reach out to either Bart or myself for testing, screenshots, etc.
With best regards,
Kris
Kris Wijnants
On Thursday, 20 February 2020 15:55, bart-at-qqdatafruits wrote:
H2. Environment
docker: jupyter/pyspark-notebook:latest + Apache Toree - Scala
H2. Issue
When using
.option("file_start_offset", "600")
.option("file_end_offset", "600")
input_file_name() no longer works.
H3. Anonymized extract
%AddDeps za.co.absa.cobrix spark-cobol_2.11 2.0.3 --transitive

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val sparkBuilder = SparkSession.builder().appName("Example")
val spark = sparkBuilder.getOrCreate()

// Keep only the last path segment, i.e. the file name.
spark.udf.register("get_file_name", (path: String) => path.split("/").last)

val cobolDataframe = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("pedantic", "true")
  .option("copybook", "file:///home/jovyan/data/BRAND/COPYBOOK.txt")
  .option("file_start_offset", "600")
  .option("file_end_offset", "600")
  .load("file:///home/jovyan/data/BRAND/initial_transformed/FILEPATTERN*")
  .withColumn("DPSource", callUDF("get_file_name", input_file_name()))

cobolDataframe
  //.filter("RECORD.ID % 2 = 0") // filter the even values of the nested field 'RECORD_LENGTH'
  .take(20)
  .foreach(v => println(v))
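To make the role of the "get_file_name" UDF concrete, here is a small standalone sketch of the same path-splitting logic outside Spark. The object name, the method name, and the sample file name FILE1 are hypothetical. It also illustrates that an empty path stays empty, so a blank value from input_file_name() would show up as a blank DPSource rather than being hidden by the UDF.

```scala
object FileNameExample {
  // Same logic as the "get_file_name" UDF above: keep the last path segment.
  def fileNameOf(path: String): String = path.split("/").last

  def main(args: Array[String]): Unit = {
    // Hypothetical file under the directory used in the extract above.
    println(fileNameOf("file:///home/jovyan/data/BRAND/initial_transformed/FILE1"))
    // An empty path yields an empty file name.
    println(fileNameOf("").isEmpty)
  }
}
```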
|
Hi Kris,
|
Hi Ruslan,
"with_input_file_name_col" seems to be intended for "is_record_sequence = true" only. In this case I have a copybook (fixed length) that does not mention the header and footer. Possible actions I should take are:
I value your opinion. Mainframe code can be messy. It is a trade-off between handling source particularities out of the box and keeping the Cobrix code maintainable.
Thanks in advance,
Regards,
Bart
A test of your suggestion:
import org.apache.spark.sql.SparkSession
spark.udf.register("get_file_name", (path: String) => path.split("/").last)
val cobolDataframe = spark
The result:
|
Interesting. I will take a look. I think this can be easily fixed so that |
Opened #252 to continue the discussion there. Since the incompatibility between |
Hi,
Thank you for creating and maintaining Cobrix. It's a tool we discovered recently, and we plan to use it in our cloud data platform for our Mainframe project.
Just a small question. We noticed that Spark's input_file_name() function always returns blanks when using Cobrix in combination with option("is_record_sequence", "true").
spark.read.format("cobol")
  .option("copybook", "/mnt/inputMDP/BIWA_GUTEX/Copybooks/" + dbutils.widgets.get("version") + "/GAGUSECO_20070115.txt")
  .option("is_record_sequence", "true")
  .load("/mnt/inputMDP/BIWA_GUTEX/Datafiles/" + dbutils.widgets.get("version") + "/GA-GA324001*")
  .withColumn("ISN_Source", input_file_name)
  .createOrReplaceTempView("vw_gutex_GA")
Do you notice the same behaviour? Is there any chance to get this working?
Keep up the good work!
Regards,
Kris
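One way to check for the behaviour described above on any DataFrame is to count the rows for which input_file_name() comes back empty. This is only a sketch that uses standard Spark SQL functions; the object name, the method name, and the "src_file" column name are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, input_file_name}

object BlankFileNameCheck {
  // Hypothetical helper: number of rows whose input file name is an empty string.
  def countBlankFileNames(df: DataFrame): Long =
    df.withColumn("src_file", input_file_name())
      .filter(col("src_file") === "")
      .count()
}
```

A result equal to the total row count would match the report above, where every record comes back with a blank file name.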