Unable to parse the Mainframe copybook which has a COBOL datatype of BBBB which means empty spaces #734

suryagits · 2024-12-24T07:28:27Z

Describe the bug

We are using CoBrix with PySpark and executing it on AWS EMR.
We have the EBCDIC file and its corresponding copybook in an AWS S3 bucket. While trying to parse the EBCDIC file using the copybook, we get an error.

Error message :
py4j.protocol.Py4jJavaError : An error occurred while calling o2021.loa : za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException : Syntax error in the copybook at line 29 : Invalid input 'BBBB' at position 29:45

Code snippet that caused the issue

import logging

try:
    file_path = f's3://{s3_bucket}/{ebcdic_file_path}'
    # Parentheses let the chained reader calls span several lines
    df = (
        spark.read
        .format("cobol")
        .option("copybook_contents", copybook)
        .option("encoding", "ebcdic")
        .option("schema_retention_policy", "collapse_root")
        .option("generate_record_id", True)
        .load(file_path)
    )
except Exception as e:
    log_message = f'spark job failed with error : {e}'
    logging.error(log_message)
    raise e

Expected behavior

We expected Cobrix to successfully parse the EBCDIC file record column using the copybook, which has this datatype of 'BBBB'.

Context

PySpark Jar dependencies :

cobol-parser_2.12-2.6.7.jar
hadoop-lzo-0.4.3.jar
scodec-bits_2.12-1.1.12.jar
scodec-core_2.12-1.11.4.jar
spark-cobol_2.12-2.6.7.jar
Operating system: AWS EMR (Linux Image)

Copybook (if possible)

                    15 EL02-267-COLNAME-A
                      20 EL02-267-COLNAME-B
                                                       PIC X(19).
                      .........
                      .........
                      .........
                      20 EL02-267-COLNAME-C  REDEFINES
                                    EL02-267-COLNAME-D
                                                       PIC 9(06)BBBB. (This is what is causing the issue we suppose)
GP5WHB        20 FILLER                 pic X(285).                      CLEAN-UP

Attach a small data file that can help reproduce the issue, if possible : Need to check the feasibility due to confidentiality of the data. Will get back.


yruslan · 2024-12-27T12:29:50Z

Hi,

Yes, 'BBBB' is something Cobrix does not support mainly because we are not sure at the moment how to properly handle it.
This might be a relevant issue: #505

Does it work if you remove 'BBBB'? Does it produce the expected output in this case?

suryagits · 2024-12-29T09:19:47Z

Hi @yruslan ,

Thank you so much for your response!

As advised, I will try removing the 'BBBB' from my copybook file, rerun the Cobrix program, and get back to you asap.

Thank you

suryagits · 2024-12-31T05:43:28Z

Hi @yruslan ,

One query: could you advise on what could be a replacement for 'BBBB'? I mean, is there any other COBOL datatype definition that is analogous to the use case of 'BBBB' and works with Cobrix too?

Please note, I am yet to try out your advice on removing the 'BBBB'. Sorry for the delay, will get back on that asap!

Thank you

yruslan · 2024-12-31T21:16:19Z

Hi @suryagits,

Since 'B' means just inserting spaces in the data representation of the number, and because Cobrix converts numbers to Spark native binary formats, 'B' should not need a replacement. We may eventually implement it so Cobrix ignores all 'B' in numbers. We haven't done it yet since we haven't encountered such PICs in our organization so we can't confirm that ignoring 'B's would be an expected behavior.

Once you confirm that removing 'B's from PICs produces correct output in numeric fields, we are going to implement support for 'B' natively.
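
To try the experiment described above without editing the copybook by hand, something along these lines might work. It is only a minimal sketch, not part of Cobrix: `strip_b_insertion` is a hypothetical helper name, and the regular expressions assume the picture string is the run of non-space characters following `PIC` or `PICTURE`. Note that each 'B' occupies one character position in a COBOL picture, so removing it also shortens the declared field; the output still needs to be checked against known values.

```python
import re

# Hypothetical helper, not part of Cobrix.
# Assumption: a picture string is the run of non-space characters after PIC [IS].
# The lookbehind skips field names such as "EL02-PIC" that merely end in PIC.
_PIC_RE = re.compile(r"((?<![\w-])PIC(?:TURE)?\s+(?:IS\s+)?)(\S+)", re.IGNORECASE)

# A 'B' insertion character, optionally with a repeat count such as B(4).
# The lookbehind leaves the 'DB' (debit) symbol untouched.
_B_RE = re.compile(r"(?<!D)B(?:\(\d+\))?", re.IGNORECASE)


def strip_b_insertion(copybook: str) -> str:
    """Return the copybook text with 'B' insertion characters removed from PICs."""
    def _fix(match):
        prefix, picture = match.group(1), match.group(2)
        return prefix + _B_RE.sub("", picture)

    return _PIC_RE.sub(_fix, copybook)


sample = (
    "   20 FLD-C  REDEFINES FLD-D  PIC 9(06)BBBB.\n"
    "   20 FILLER                 pic X(285).\n"
)
cleaned = strip_b_insertion(sample)
print(cleaned)
```

The cleaned text could then be passed to the reader through the same option used in the report, for example `.option("copybook_contents", strip_b_insertion(copybook))`.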

suryagits added the bug Something isn't working label Dec 24, 2024
