Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for COMP-3 numbers without the sign nibble #701

Open
harendradh opened this issue Aug 7, 2024 · 34 comments
Open

Add support for COMP-3 numbers without the sign nibble #701

harendradh opened this issue Aug 7, 2024 · 34 comments
Labels
question Further information is requested

Comments

@harendradh
Copy link

harendradh commented Aug 7, 2024

We are using cobrix to convert the mainframe EBCDIC file. Below are the problematic data fields:

XXX-TRANSACTION-AMOUNT PIC S9(15) V99 COMP-3

We are not able to convert the fields correctly. I suspect due to sign field we are running into issues and coming as NULL.
Rest all fields are coming correctly.

cobolDataframe = spark.read.format("za.co.absa.cobrix.spark.cobol.source")
.option("copybook", "dbfs:/FileStore/Optis Test/copybook.txt")
.option("record_format", "D")
.option("is_rdw_big_endian", "true")
.option("rdw_adjustment", -4)
.load("dbfs:/FileStore/Optis Test/inputfile.txt")

thanks for the help

Background [Optional]

A clear explanation of the reason for raising the question.
This gives us a better understanding of your use cases and how we might accommodate them.

Question

A clear and concise inquiry

@harendradh harendradh added the question Further information is requested label Aug 7, 2024
@yruslan
Copy link
Collaborator

yruslan commented Aug 13, 2024

Could you add

.option("debug", "true")

and send the HEX value of the field that is incorrectly decoded, and I'll take a look

@harendradh
Copy link
Author

  1. Does spark has ability to handle signed packed decimal data. I see for unsigned, you suggested to use COMP-3U, instead of COMP-3.
  2. I am using Databricks Spark on Azure. This Debug statement didn't make any change.
  3. Can we read packed decimal fields using Spark read command from fixed length EBCDIC file, I suppose it will read it as garbage. We would need the file with these fields in Hexadecimal form. Is my understanding correct.

Sorry for so many questions, but we have been trying since long.

@yruslan
Copy link
Collaborator

yruslan commented Aug 14, 2024

Yes, Cobrix supports packed decimal data. 'debug' does not suppose to change anything, it just creates debug columns. I'm asking you to to send an example of HEX values that Cobrix didn't convert properly.
For instance, 0x12345C is 12345 as a packed decimal. You are saying that Cobrix didn't convert field values properly. But I need an example, which exact numbers were not properly converted.

@harendradh
Copy link
Author

harendradh commented Aug 14, 2024

Transaction code is coming as 601, while in mainframe I see the value as 791. The field type is PIC S9(3)V COMP-3.

Another is date field, which is PIC S9(7)V COMP-3, it is coming as null in dataframe, while it should come as 240802 actually in mainframe.

Cobrix version installed is spark_cobol_2_12_2_7_3_bundle.jar. We have spark version 2.12 in Databricks.

@yruslan
Copy link
Collaborator

yruslan commented Aug 14, 2024

When debug=true you should see columns with '_debug' suffix. How do they look like for the above fields?

E.g.

field1 = 601, field1_debug=?
field2 = 240802, field2_debug=?
?

@harendradh
Copy link
Author

Got it thanks.
Transaction code = 601, debug = 601C
Date : 2020020220 , debug = 2020020220

@yruslan
Copy link
Collaborator

yruslan commented Aug 15, 2024

Makes sense. Yes, 601C = 601 is supported. 2020020220 = 2020020220 is not supported.
What are the definitions for these fields in the copybook?

I think we can add support for 2020020220, just need to understand more about type, size, and layout.

@yruslan
Copy link
Collaborator

yruslan commented Aug 15, 2024

This are specs for COMP-3 that describe how these numbers are currently parsed in Cobrix: http://www.3480-3590-data-conversion.com/article-packed-fields.html

@harendradh
Copy link
Author

harendradh commented Aug 16, 2024

Transaction code is defined as PIC S9(3)V COMP-3. If I see this value in mainframe is 791, while in dataframe coming as 601. This is also coming incorrect :-( .

Thanks for the quick revert back. Date is defined as below.
'XXX'-TRAN-DT is the field we are printing
02 'XXX'-TRAN-DATE-TIME.
03 'XXX'-TRAN-DT PIC S9(9)V COMP-3.
03 'XXX'-TRAN-TM PIC S9(7)V COMP-3.

@yruslan yruslan added enhancement New feature or request accepted Accepted for implementation labels Aug 16, 2024
@yruslan yruslan changed the title Issue with coversion of Packed signed fields Add support for COMP-3 numbers without the sign nibble Aug 16, 2024
@yruslan
Copy link
Collaborator

yruslan commented Aug 16, 2024

Thanks for the field definition. We can add support for COMP-3 numbers without a sign nibble.

Just keep in mind that this definition:

03 'XXX'-TRAN-DT PIC S9(9)V COMP-3.

implies 9 digits. while 2020020220 has 10 digits - e.g. is in breach of the field definition.

@yruslan
Copy link
Collaborator

yruslan commented Aug 16, 2024

Transaction code is defined as PIC S9(3)V COMP-3. If I see this value in mainframe is 791, while in dataframe coming as 601. This is also coming incorrect :-( .

:(
Please, send the value of the _debug field that you are claiming to be 791

@yruslan yruslan removed enhancement New feature or request accepted Accepted for implementation labels Aug 16, 2024
@yruslan
Copy link
Collaborator

yruslan commented Aug 16, 2024

Checked - parsing of 0x2020020220 is already supported. Use COMP-3U for fields that might not contain the mandatory sign nibble, like 'XXX'-TRAN-DT

@yruslan yruslan closed this as completed Aug 20, 2024
@harendradh
Copy link
Author

Transaction code is defined as PIC S9(3)V COMP-3. If I see this value in mainframe is 791, while in dataframe coming as 601. This is also coming incorrect :-( .

:( Please, send the value of the _debug field that you are claiming to be 791

When i tried this with keeping the field in copybook unchanged i.e. PIC S9(3)V COMP-3, in debug it was coming as 601C.

After changing the data from COMP-3 to COMP-3U,
now this field (Transaction code ) comes as null, in debug value is coming as 601C.
Date 1 : 20000, Debug date 1 : 204E PIC S9(4)V COMP-9
Date 2 (transaction date): 2020020220 , same data in debug field PIC S9(9)V COMP-3U
Transaction Time : null, Debug field : EFBFBD2A PIC S9(7)V COMP-3U

@yruslan yruslan reopened this Aug 20, 2024
@yruslan
Copy link
Collaborator

yruslan commented Aug 20, 2024

  • The column that contains 601C in the _debug field is a correct COMP-3, so you need to use COMP-3 there. You didn't have to switch to COMP-3U.
  • Date that has 2020020220 in the _debug column is COMP-3U since it does not contain the sign nibble.
  • Transaction Time that has EFBFBD2A in the _debug column is not correct BCD format, so neither COMP-3 nor COMP-3U will work.

Maybe you can do something like

df.select("failure_field1", "failure_field1_debug").show(false)

and send here the table, for each field that is failing for you.

@harendradh
Copy link
Author

I will double check the 601 with user, if he is sending me wrong snapshots. I can't upload the table due to data privacy, here are the values. I printed first 10, all are coming as below

Trans_code Acct_open_dt Tran_date Trans_time
|601 |601C |20000 |204E |2020020220 |2020020220 |NULL |EFBFBD2A |
|601 |601C |20000 |204E |2020020220 |2020020220 |NULL |EFBFBD2A |
|601 |601C |20000 |204E |2020020220 |2020020220 |NULL |EFBFBD2A

Tran Date and time are together under a level field called trans-date-time, if it makes any difference.

  • The column that contains 601C in the _debug field is a correct COMP-3, so you need to use COMP-3 there. You didn't have to switch to COMP-3U.
  • Date that has 2020020220 in the _debug column is COMP-3U since it does not contain the sign nibble.
  • Transaction Time that has EFBFBD2A in the _debug column is not correct BCD format, so neither COMP-3 nor COMP-3U will work.

Maybe you can do something like

df.select("failure_field1", "failure_field1_debug").show(false)

and send here the table, for each field that is failing for you.

I will double check the 601 with user, if he is sending me wrong snapshots. I can't upload the table due to data privacy, here are the values. I printed first 10, all are coming as below

Trans_code Acct_open_dt Tran_date Trans_time
|601 |601C |20000 |204E |2020020220 |2020020220 |NULL |EFBFBD2A |
|601 |601C |20000 |204E |2020020220 |2020020220 |NULL |EFBFBD2A |
|601 |601C |20000 |204E |2020020220 |2020020220 |NULL |EFBFBD2A

Tran Date and time are together under a level field called trans-date-time, if it makes any difference.

@harendradh
Copy link
Author

I will double check the 601 with user, if he is sending me wrong snapshots. I can't upload the table due to data privacy, here are the values. I printed first 10, all are coming as below

Trans_code Acct_open_dt Tran_date Trans_time
|601 |601C |20000 |204E |2020020220 |2020020220 |NULL |EFBFBD2A |
|601 |601C |20000 |204E |2020020220 |2020020220 |NULL |EFBFBD2A |
|601 |601C |20000 |204E |2020020220 |2020020220 |NULL |EFBFBD2A

Tran Date and time are together under a level field called trans-date-time, if it makes any difference.

@yruslan
Copy link
Collaborator

yruslan commented Aug 20, 2024

Looks good. The only issue left then is Trans_time, right?

@harendradh
Copy link
Author

Looks good. The only issue left then is Trans_time, right?
Trans_date value looks incorrect i.e. 2020020220 . Similarly Acct open date doesnt look good 20000.
Other values I will double check in mainframe.

@harendradh
Copy link
Author

Other than the wrong values above,
02 'XXX'-TIME-HHMMSS PIC X(06) , this field is taking values from one byte PICX field prior and post to that field.

Value: 00829 , Debug: 203030383239
If I increase the size of the fiels to PIC X(8), the values comes perfectly as 0082918 . I thought PIC X field will be straightforward using Cobrix.
Value: 0082918 , Debug: 2030303832393138

@yruslan
Copy link
Collaborator

yruslan commented Aug 21, 2024

Value: 0082918 , Debug: 2030303832393138

I realized for this example that your data is ASCII, not EBCDIC. Ebcdic encoding for 0082918 is F0F0F8F2F9F1F8
I never saw COMP-3 to be used with EBCDIC. Usually, it is converted automatically to DISPLAY.

I thought PIC X field will be straightforward using Cobrix.

It is straightforward. One character is 1 byte.

@noumanhk34
Copy link

Hi @yruslan
I am working on ascii file which have packed fields like PIC S9(09) COMP-3 in copybook and data is not getting un pack how we do through java spark

@yruslan
Copy link
Collaborator

yruslan commented Jan 7, 2025

Hi @noumanhk34 , please provide a small example of such a file with a copybook. Even one field copybook is sufficient.

@noumanhk34
Copy link

noumanhk34 commented Jan 7, 2025

AMXTCB02-CB-OPEN -DTE PIC S9(09) COMP-3 @yruslan
This is a date field which is coming null for some records and integer values for some records it should come as date

@yruslan
Copy link
Collaborator

yruslan commented Jan 8, 2025

Hi @noumanhk34 , how the data looks like for this field?

You can use

.option("debug", "true")

as above to extract HEX dump of raw values. Please, send a couple of examples where values are coming as null, and where values are coming as non-null.

@noumanhk34
Copy link

noumanhk34 commented Jan 8, 2025

@yruslan Please find below code and output
Dataset df = spark.read().format("cobol").option("copybook","copybookpath").option("pedantic","true").option("encoding","ascii")
.option("ascii_charset","US-ASCII").option("record_format","D").option("debug","true").load("filepath")

field_1 field_1_debug
null 3030303030
null 001CA6713C
null 2020202020
1194003 001194003C
8001713 008001713C

@yruslan
Copy link
Collaborator

yruslan commented Jan 8, 2025

I see.

Indeed,

  • 3030303030 2020202020 as not valid COMP-3 numbers.
  • 001CA6713C is also not a valid COMP-3 number because of 'CA' in the middle.

Also, Cobrix does not support date fields since there is no date type in copybooks. Dates in copybooks are represented by numbers and each mainframe system stores dates in a different way. What you can do is post-processing. Once you have the dataframe, you can convert numeric fields to dates by applying the date format pattern.

It looks like your copybook and your data file do not match. I you want, you can attach a data file and the copybook and I can take a look. Keep in mind that this is a public chat so the data you attach is going to be visible to everyone.

@noumanhk34
Copy link

No I can not attach

@noumanhk34
Copy link

its date in 09282005 format

@yruslan
Copy link
Collaborator

yruslan commented Jan 8, 2025

its date in 09282005 format

You can convert numbers to dates in the format specified using a post-processing such as:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

import spark.implicits._

// Let's assume we get these numbers from a file (column is 'n')
val dfInput = List(9282005, 10282005).toDF("n")

// Let's convert the column 'n' to date and output it to the column 'd'
val dfOutput = dfInput.withColumn("d", to_date(lpad(col("n"), 8, "0"), "MMddyyyy"))

dfOutput.show()

For this example the output is

+--------+----------+
|       n|         d|
+--------+----------+
| 9282005|2005-09-28|
|10282005|2005-10-28|
+--------+----------+

@noumanhk34
Copy link

its date in 09282005 format

You can convert numbers to dates in the format specified using a post-processing such as:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

import spark.implicits._

// Let's assume we get these numbers from a file (column is 'n')
val dfInput = List(9282005, 10282005).toDF("n")

// Let's convert the column 'n' to date and output it to the column 'd'
val dfOutput = dfInput.withColumn("d", to_date(lpad(col("n"), 8, "0"), "MMddyyyy"))

dfOutput.show()

For this example the output is

+--------+----------+
|       n|         d|
+--------+----------+
| 9282005|2005-09-28|
|10282005|2005-10-28|
+--------+----------+

@yruslan i mean to say the packed field contain date in this format

@noumanhk34
Copy link

@yruslan Please find below code and output Dataset df = spark.read().format("cobol").option("copybook","copybookpath").option("pedantic","true").option("encoding","ascii") .option("ascii_charset","US-ASCII").option("record_format","D").option("debug","true").load("filepath")

field_1 field_1_debug null 3030303030 null 001CA6713C null 2020202020 1194003 001194003C 8001713 008001713C

@yruslan this debug data is a date in given format might be it in string data type

@noumanhk34
Copy link

@yruslan I have a ascii file while have multiple segments like HEADER,BASE SEGMENT,J1 SEGEMENT,J2 SEGMENT,K4 SEGMENT,L1 SEGMENT and TRAILER,
Here until BASE SEGEMNT is fixed length 362 but after that it is variable length.
record can have multiple j2 segment and no segments
Can you help how we can read these kind of data.
Can we pass multiple layouts.?

@noumanhk34
Copy link

noumanhk34 commented Jan 8, 2025

@yruslan I have a ascii file which have multiple segments like HEADER,BASE SEGMENT,J1 SEGEMENT,J2 SEGMENT,K4 SEGMENT,L1 SEGMENT and TRAILER, Here until BASE SEGEMNT is fixed length 362 but after that it is variable length. record can have multiple j2 segment and no segments Can you help how we can read these kind of data. Can we pass multiple layouts.?
Can we pass multiple copybooks?

@yruslan
Copy link
Collaborator

yruslan commented Jan 9, 2025

Yes, multi-segment ASCII files are supported. And yes, you can use multiple copybooks. But in practice since copybooks usually have arts common across segments, and segment-specific parts, we usually create a single copybook for parting multi-segment files. In the multi-segment copybooks each segment is a GROUP, and each segment GROUP uses REDEFINE to ensure only one group is used per record. You can take a look at multi-segment examples in README:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants