Unable to parse a mainframe copybook that uses the COBOL picture character 'BBBB' (space insertion) #734

Open
suryagits opened this issue Dec 24, 2024 · 4 comments
Labels
bug Something isn't working

@suryagits

Describe the bug

We are using CoBrix with PySpark and executing it on AWS EMR.
We have the EBCDIC file and its corresponding copybook in an AWS S3 bucket. When we try to parse the EBCDIC file using the copybook, we get an error.

Error message :
py4j.protocol.Py4jJavaError : An error occurred while calling o2021.loa : za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException : Syntax error in the copybook at line 29 : Invalid input 'BBBB' at position 29:45

Code snippet that caused the issue

import logging

try:
    file_path = f's3://{s3_bucket}/{ebcdic_file_path}'
    # Parentheses let the chained reader calls span several lines.
    df = (
        spark.read
        .format("cobol")
        .option("copybook_contents", copybook)
        .option("encoding", ebcdic)
        .option("schema_retention_policy", "collapse_root")
        .option("generate_record_id", True)
        .load(file_path)
    )
except Exception as e:
    log_message = f'spark job failed with error : {e}'
    logging.error(log_message)
    # Re-raise with the original traceback intact.
    raise

Expected behavior

We expected Cobrix to successfully parse the EBCDIC file records using the copybook, which contains a field with the picture 'BBBB'.

Context

PySpark Jar dependencies :

  • cobol-parser_2.12-2.6.7.jar
  • hadoop-lzo-0.4.3.jar
  • scodec-bits_2.12-1.1.12.jar
  • scodec-core_2.12-1.11.4.jar
  • spark-cobol_2.12-2.6.7.jar
  • Operating system: AWS EMR (Linux Image)

Copybook (if possible)

                    15 EL02-267-COLNAME-A
                      20 EL02-267-COLNAME-B
                                                       PIC X(19).
                      .........
                      .........
                      .........
                      20 EL02-267-COLNAME-C  REDEFINES
                                    EL02-267-COLNAME-D
                                                       PIC 9(06)BBBB. (This is what is causing the issue we suppose)
GP5WHB        20 FILLER                 pic X(285).                      CLEAN-UP

Attach a small data file that can help reproduce the issue, if possible : Need to check the feasibility due to confidentiality of the data. Will get back.

@suryagits suryagits added the bug Something isn't working label Dec 24, 2024
@yruslan
Collaborator

yruslan commented Dec 27, 2024

Hi,

Yes, 'BBBB' is something Cobrix does not support mainly because we are not sure at the moment how to properly handle it.
This might be a relevant issue: #505

Does it work if you remove 'BBBB'? Does it produce the expected output in this case?
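
A minimal sketch for trying this out without editing the copybook by hand, assuming the copybook is already loaded as a Python string. The helper name `strip_b_insertions` is hypothetical and not part of Cobrix; it only rewrites the text before it is passed to `copybook_contents`. It removes 'B' insertion characters (written either as `BBBB` or as `B(4)`) from numeric PIC strings and leaves `X`/`A` pictures and the `DB` debit symbol untouched.

```python
import re

# "PIC" or "PICTURE" (optionally followed by "IS"), then the picture string.
# The lookbehind avoids matching data names that end in "-PIC".
_PIC_RE = re.compile(r"((?<![\w-])PIC(?:TURE)?\s+(?:IS\s+)?)(\S+)", re.IGNORECASE)

# A 'B' insertion character, optionally with a repeat count such as "B(4)".
# The lookbehind keeps the two-character "DB" (debit) symbol intact.
_B_RE = re.compile(r"(?<!D)B(?:\(\d+\))?", re.IGNORECASE)


def strip_b_insertions(copybook: str) -> str:
    """Return the copybook text with 'B' removed from numeric PIC strings."""

    def fix(match: re.Match) -> str:
        prefix, picture = match.group(1), match.group(2)
        # Only touch numeric pictures: they contain a 9 and no X or A.
        if "9" in picture and not re.search(r"[XA]", picture, re.IGNORECASE):
            picture = _B_RE.sub("", picture)
        return prefix + picture

    return _PIC_RE.sub(fix, copybook)
```

For example, `strip_b_insertions(copybook)` would turn `PIC 9(06)BBBB.` into `PIC 9(06).` while leaving `pic X(285).` as it is. The result can then be passed to `.option("copybook_contents", ...)` in place of the original text.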

@suryagits
Author

Hi @yruslan ,

Thank you so much for your response!

As advised, I will try removing the 'BBBB' from my copybook file, rerun the Cobrix program, and get back to you asap.

Thank you

@suryagits
Author

suryagits commented Dec 31, 2024

Hi @yruslan ,

One question: could you advise on a possible replacement for 'BBBB'? That is, is there another COBOL data type definition that is analogous to 'BBBB' and also works with Cobrix?

Please note, I have yet to try your suggestion of removing the 'BBBB'. Sorry for the delay, I will get back on that asap!

Thank you

@yruslan
Collaborator

yruslan commented Dec 31, 2024

Hi @suryagits,

Since 'B' means just inserting spaces in the data representation of the number, and because Cobrix converts numbers to Spark native binary formats, 'B' should not need a replacement. We may eventually implement it so Cobrix ignores all 'B' in numbers. We haven't done it yet since we haven't encountered such PICs in our organization so we can't confirm that ignoring 'B's would be an expected behavior.

Once you confirm that removing 'B's from PICs produces correct output in numeric fields, we are going to implement support for 'B's natively.
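
One thing that may be worth checking while running that test: in standard COBOL, each 'B' in a PICTURE string is an insertion character that occupies one character position, so `PIC 9(06)BBBB` would presumably describe a 10-character DISPLAY field rather than a 6-character one. The sketch below is a rough way to compare the declared width before and after removing the 'B's. The helper name `picture_display_width` is hypothetical and not part of Cobrix, and it assumes DISPLAY usage without SIGN SEPARATE (it does not handle COMP fields).

```python
import re


def picture_display_width(picture: str) -> int:
    """Estimate how many character positions a DISPLAY picture occupies."""
    pic = picture.rstrip(".").upper()
    # Expand repeat counts: "9(06)" -> "999999", "B(4)" -> "BBBB".
    expanded = re.sub(
        r"(.)\((\d+)\)", lambda m: m.group(1) * int(m.group(2)), pic
    )
    # S (sign), V (implied decimal point) and P (scaling) take no storage
    # under the assumptions stated above; every other symbol counts as one.
    return sum(1 for ch in expanded if ch not in "SVP")
```

Under these assumptions, `picture_display_width("9(06)BBBB.")` gives 10 and `picture_display_width("9(06).")` gives 6. If the widths differ, it may be worth verifying that the fields after the modified one still line up with the data. In the copybook above the field is a REDEFINES, so the effect depends on the size of the redefined item.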
