Duplicates created on copy to S3 using parquet format #57

Open

dowvini opened this issue Dec 31, 2024 · 0 comments
dowvini commented Dec 31, 2024

SUMMARY

This command creates duplicates:

COPY (SELECT *
FROM dsp_staging.schema.table)
TO 's3://bucket_path'
(FORMAT PARQUET, ROW_GROUP_SIZE 2_000_000, ROW_GROUPS_PER_FILE 2);

Setup

As a setup, I install the postgres and httpfs extensions:
INSTALL postgres;
LOAD postgres;

CREATE SECRET dsp_secret (
TYPE POSTGRES,
HOST '{dsp_db_secret['host']}',
PORT {dsp_db_secret['port']},
DATABASE '{dsp_db_secret['database']}',
USER '{dsp_db_secret['user']}',
PASSWORD '{dsp_db_secret['password']}'
);

ATTACH '' AS dsp_staging (TYPE POSTGRES, SECRET dsp_secret);
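
A simple sanity check against the attached connection (table name redacted like everything else in this report) would be to capture the source row count before the copy, so it can be compared against the Parquet output afterwards:

SELECT count(*) FROM dsp_staging.schema.table;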

duckdb.sql(f"""
INSTALL httpfs; LOAD httpfs;
SET s3_session_token='{aws_session_token}';
SET s3_endpoint='s3.us-gov-west-1.amazonaws.com';
""")

duckdb.sql(f"""
CREATE SECRET secret1 (
TYPE S3,
KEY_ID '{aws_access_key_id}',
SECRET '{aws_secret_access_key}',
REGION '{region}'
);
""")

Result

Then, when using the command (I apologize if some variable names are incorrect; I had to redact all the actual names):

COPY (SELECT *
FROM dsp_staging.schema.table)
TO 's3://bucket_path'
(FORMAT PARQUET, ROW_GROUP_SIZE 2_000_000, ROW_GROUPS_PER_FILE 2);

When I go to check the table and verify that it is an exact copy of postgres_db.schema.table, a few duplicate rows have been added. The table is ~60 million records. This process runs daily and creates duplicates about 20% of the time, and only about 2 - 4 duplicates each time. The table has a primary key that is enforced by Postgres, and we have also run this process with Spark and have never had the validation fail.
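
To confirm the duplicates are really in the Parquet output rather than an artifact of the validation tooling, the written files can be read back with DuckDB and grouped on the key column, for example (bucket path redacted as above):

SELECT eid, count(*) AS n
FROM read_parquet('s3://bucket_path/*.parquet')
GROUP BY eid
HAVING count(*) > 1;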

Validation logs from Great Expectations:
VALIDATION expect_column_values_to_be_unique FAILED
ON BATCH XXXXXX-XXX ON COLUMN(S) eid
WITH 2 ROWS FAILING
WHICH IS 0.0 PERCENT OF THE DATA
SAMPLE OF FAILING ROWS
[1015013856, 1015013856]
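
To inspect a failing key, a query like the following against the same Parquet output should return both copies of the duplicated row:

SELECT *
FROM read_parquet('s3://bucket_path/*.parquet')
WHERE eid = 1015013856;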

This looks like a bug, but perhaps there is also a configuration that would make it less likely that this process creates duplicates. Thank you for your help.
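
For what it's worth, one configuration that might be worth testing (I have not verified whether the duplication is related to parallel scanning of the Postgres table) is forcing a single-threaded copy via DuckDB's core threads setting; this only reduces parallelism and does not address the root cause:

SET threads = 1;
COPY (SELECT *
FROM dsp_staging.schema.table)
TO 's3://bucket_path'
(FORMAT PARQUET, ROW_GROUP_SIZE 2_000_000, ROW_GROUPS_PER_FILE 2);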
