Summary
This command creates duplicates:
COPY (SELECT *
FROM dsp_staging.schema.table)
TO 's3://bucket_path'
(FORMAT PARQUET, ROW_GROUP_SIZE 2_000_000, ROW_GROUPS_PER_FILE 2);
Setup
As setup, I install the postgres and httpfs extensions and configure the connection secrets:
duckdb.sql(f"""
INSTALL postgres; LOAD postgres;
CREATE SECRET dsp_secret (
TYPE POSTGRES,
HOST '{dsp_db_secret['host']}',
PORT {dsp_db_secret['port']},
DATABASE '{dsp_db_secret['database']}',
USER '{dsp_db_secret['user']}',
PASSWORD '{dsp_db_secret['password']}'
);
ATTACH '' AS dsp_staging (TYPE POSTGRES, SECRET dsp_secret);
""")
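As a quick sanity check before copying (a sketch, using the same redacted placeholder names as above), the attachment and the source row count can be verified like this:
duckdb.sql("SHOW DATABASES;").show()
# dsp_staging.schema.table is the redacted placeholder name from above
duckdb.sql("SELECT count(*) AS n FROM dsp_staging.schema.table;").show()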
duckdb.sql(f"""
INSTALL httpfs; LOAD httpfs;
SET s3_session_token='{aws_session_token}';
SET s3_endpoint='s3.us-gov-west-1.amazonaws.com';
""")
duckdb.sql(f"""
CREATE SECRET secret1 (
TYPE S3,
KEY_ID '{aws_access_key_id}',
SECRET '{aws_secret_access_key}',
REGION '{region}'
);
""")
Result
Then, when using the command (I apologize if some variable names are incorrect; I had to redact all the actual names):
COPY (SELECT *
FROM dsp_staging.schema.table)
TO 's3://bucket_path'
(FORMAT PARQUET, ROW_GROUP_SIZE 2_000_000, ROW_GROUPS_PER_FILE 2);
When I check the output to verify that it is an exact copy of postgres_db.schema.table, a few duplicate rows have been added. The table is ~60 million records. This process runs daily and produces duplicates about 20% of the time, and each affected run only creates about 2-4 duplicates. The table has a primary key that is enforced by Postgres, and we have also done this same process using Spark without the validation ever failing.
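The duplicates can also be confirmed with DuckDB directly against the written Parquet files (a sketch: s3://bucket_path and eid are the redacted names used above, and the glob assumes COPY wrote multiple files into that path):
duckdb.sql("""
SELECT eid, count(*) AS copies
FROM read_parquet('s3://bucket_path/*.parquet')
GROUP BY eid
HAVING count(*) > 1;
""").show()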
Validation logs from Great Expectations:
VALIDATION expect_column_values_to_be_unique FAILED
ON BATCH XXXXXX-XXX ON COLUMN(S) eid
WITH 2 ROWS FAILING
WHICH IS 0.0 PERCENT OF THE DATA
SAMPLE OF FAILING ROWS
[1015013856, 1015013856]
This looks like a bug, but perhaps there is also a configuration that would make it less likely for this process to create duplicates. Thank you for your help.
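In case it helps, two stopgaps I can think of (assumptions on my part, not confirmed fixes): deduplicating at write time, which masks the symptom since the duplicated rows are full copies of each other, and disabling the postgres extension's CTID-range parallel scan via its pg_use_ctid_scan setting, in case the duplicates come from overlapping parallel scan ranges:
-- Stopgap 1 (assumption): dedup at write time; hides the symptom rather than fixing it.
COPY (SELECT DISTINCT *
FROM dsp_staging.schema.table)
TO 's3://bucket_path'
(FORMAT PARQUET, ROW_GROUP_SIZE 2_000_000, ROW_GROUPS_PER_FILE 2);
-- Stopgap 2 (assumption): disable parallel CTID-range scans of the Postgres table.
SET pg_use_ctid_scan = false;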