Duplicates created on copy to S3 using parquet format #57

Open

dowvini opened this issue Dec 31, 2024 · 0 comments
dowvini commented Dec 31, 2024

SUMMARY

This command creates duplicates:

COPY (SELECT *
FROM dsp_staging.schema.table)
TO 's3://bucket_path'
(FORMAT PARQUET, ROW_GROUP_SIZE 2_000_000, ROW_GROUPS_PER_FILE 2);

Setup

As a setup, I install the postgres and httpfs extensions:
INSTALL postgres;
LOAD postgres;

CREATE SECRET dsp_secret (
TYPE POSTGRES,
HOST '{dsp_db_secret['host']}',
PORT {dsp_db_secret['port']},
DATABASE '{dsp_db_secret['database']}',
USER '{dsp_db_secret['user']}',
PASSWORD '{dsp_db_secret['password']}'
);

ATTACH '' AS dsp_staging (TYPE POSTGRES, SECRET dsp_secret);
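
A simple sanity check against the attached connection (table name redacted like everything else in this report) would be to capture the source row count before the copy, so it can be compared against the Parquet output afterwards:

SELECT count(*) FROM dsp_staging.schema.table;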

duckdb.sql(f"""
INSTALL httpfs; LOAD httpfs;
SET s3_session_token='{aws_session_token}';
SET s3_endpoint='s3.us-gov-west-1.amazonaws.com';
""")

duckdb.sql(f"""
CREATE SECRET secret1 (
TYPE S3,
KEY_ID '{aws_access_key_id}',
SECRET '{aws_secret_access_key}',
REGION '{region}'
);
""")

Result

Then, when using the command (I apologize if some variable names are incorrect; I had to redact all the actual names):

COPY (SELECT *
FROM dsp_staging.schema.table)
TO 's3://bucket_path'
(FORMAT PARQUET, ROW_GROUP_SIZE 2_000_000, ROW_GROUPS_PER_FILE 2);

When I go to check the table and verify that it is an exact copy of postgres_db.schema.table, a few duplicate rows have been added. The table is ~60 million records. This process runs daily and creates duplicates about 20% of the time, and only about 2 - 4 duplicates each time. The table has a primary key that is enforced by Postgres, and we have also run this process with Spark and have never had the validation fail.
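
To confirm the duplicates are really in the Parquet output rather than an artifact of the validation tooling, the written files can be read back with DuckDB and grouped on the key column, for example (bucket path redacted as above):

SELECT eid, count(*) AS n
FROM read_parquet('s3://bucket_path/*.parquet')
GROUP BY eid
HAVING count(*) > 1;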

Validation logs from Great Expectations:
VALIDATION expect_column_values_to_be_unique FAILED
ON BATCH XXXXXX-XXX ON COLUMN(S) eid
WITH 2 ROWS FAILING
WHICH IS 0.0 PERCENT OF THE DATA
SAMPLE OF FAILING ROWS
[1015013856, 1015013856]
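
To inspect a failing key, a query like the following against the same Parquet output should return both copies of the duplicated row:

SELECT *
FROM read_parquet('s3://bucket_path/*.parquet')
WHERE eid = 1015013856;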

This looks like a bug, but perhaps there is also a configuration that would make it less likely that this process creates duplicates. Thank you for your help.
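
For what it's worth, one configuration that might be worth testing (I have not verified whether the duplication is related to parallel scanning of the Postgres table) is forcing a single-threaded copy via DuckDB's core threads setting; this only reduces parallelism and does not address the root cause:

SET threads = 1;
COPY (SELECT *
FROM dsp_staging.schema.table)
TO 's3://bucket_path'
(FORMAT PARQUET, ROW_GROUP_SIZE 2_000_000, ROW_GROUPS_PER_FILE 2);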
