full dump reporting #1148

Open · shelleydoljack opened this issue Aug 22, 2024 · 2 comments
Labels
data export Related to Data Export out of FOLIO to external vendors

Comments

@shelleydoljack (Contributor)

We need better reporting on the full dump: how many unique records it contains, how many records are in each file, how many files make up the full dump, how many duplicates (if any) are in the materialized view, etc. This would help us troubleshoot issues such as POD only receiving 5.4 million unique records and duplicates occurring across files.

With the previous full dump from Symphony, we'd get an email to the sul-unicorn-devs list with the counts. Maybe consider doing something similar when the full dump selection DAG finishes.

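A duplicate check along these lines could feed that report; here is a minimal sketch in Python, assuming a PostgreSQL connection and a placeholder name for the materialized view (the DSN and data_export_marc_ids are illustrative, not the real names):

import psycopg2

# Placeholder DSN and view name; substitute the actual FOLIO database and materialized view.
conn = psycopg2.connect("dbname=folio host=localhost user=folio")
with conn, conn.cursor() as cur:
    # Total rows vs. distinct ids shows how many duplicates the view contains.
    cur.execute(
        "SELECT count(*) AS total_rows, count(DISTINCT id) AS unique_ids "
        "FROM data_export_marc_ids"
    )
    total_rows, unique_ids = cur.fetchone()
    print(f"rows: {total_rows:,}  unique: {unique_ids:,}  duplicates: {total_rows - unique_ids:,}")
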
shelleydoljack added the data export label Aug 22, 2024

@shelleydoljack (Contributor, Author)

In the logs for the number_of_records task, we could pull this out into an email:

[2024-08-17, 02:04:30 UTC] {full_dump_marc.py:45} INFO - Record count: 9785445
[2024-08-17, 02:04:30 UTC] {python.py:202} INFO - Done. Returned value was: 9785445

In the calculate_start_stop tasks, the logs in the mapped tasks show:

[2024-08-17, 02:09:41 UTC] {full_dump_retrieval.py:103} INFO - Output in calculate_start_stop {'start': 0, 'stop': 2000000}
[2024-08-17, 02:09:41 UTC] {python.py:202} INFO - Done. Returned value was: {'start': 0, 'stop': 2000000}
[2024-08-17, 02:04:40 UTC] {full_dump_retrieval.py:103} INFO - Output in calculate_start_stop {'start': 2000000, 'stop': 4000000}
[2024-08-17, 02:04:40 UTC] {python.py:202} INFO - Done. Returned value was: {'start': 2000000, 'stop': 4000000}

☝️ Could this be the cause of some duplication? Shouldn't the next mapped task have start: 2000001?
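
A minimal sketch of how non-overlapping batches could be generated, assuming the start/stop values are used as a half-open window (start <= n < stop); the function name mirrors the task but is illustrative, not the DAG's actual code:

def calculate_start_stop(total: int, batch_size: int = 2_000_000):
    # Half-open [start, stop) windows: the stop of one batch equals the start of the next,
    # so nothing is selected twice unless the query treats stop as inclusive.
    for start in range(0, total, batch_size):
        yield {"start": start, "stop": min(start + batch_size, total)}

# Example with the record count from the log above:
for window in calculate_start_stop(9_785_445):
    print(window)  # {'start': 0, 'stop': 2000000}, {'start': 2000000, 'stop': 4000000}, ...

If the underlying query uses an inclusive upper bound, the row at 2000000 would land in both batches, which would match the duplication suspected above.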

I see in the logs for the transform_marc_records_add_holdings task, we get output like this that might be good to pull out for a reporting email:

[2024-08-17, 04:11:40 UTC] {transformer.py:111} INFO - Writing 4,537 modified MARC records to /folio-data-export-prod/data-export-files/full-dump/marc-files/5000_10000.mrc
[2024-08-17, 04:11:57 UTC] {transformer.py:111} INFO - Writing 4,519 modified MARC records to /folio-data-export-prod/data-export-files/full-dump/marc-files/10000_15000.mrc

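One way a report could pull these out is to scrape the task logs; a rough sketch (the regex is based on the log lines above, the helper function is illustrative):

import re

# Matches e.g. "Writing 4,537 modified MARC records to /folio-data-export-prod/.../5000_10000.mrc"
WRITING_RE = re.compile(r"Writing ([\d,]+) modified MARC records to (\S+)")

def per_file_counts(log_lines):
    counts = {}
    for line in log_lines:
        match = WRITING_RE.search(line)
        if match:
            counts[match.group(2)] = int(match.group(1).replace(",", ""))
    return counts  # e.g. {'.../marc-files/5000_10000.mrc': 4537, ...}
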
For the transform_marc_records_clean_serialize task, this output in the logs might also be good to add to an email report. For the serializing and removing-fields logging, maybe add the filename, and skip pulling the smart_open calls out of the log.

[2024-08-17, 05:27:29 UTC] {s3.py:1054} INFO - smart_open.s3.MultipartWriter('folio-data-export-prod', 'data-export-files/full-dump/marc-files/5000_10000.mrc'): uploading part_num: 1, 8711374 bytes (total 0.008GB)
[2024-08-17, 05:27:30 UTC] {transforms.py:131} INFO - Serializing 4537 MARC records as xml
[2024-08-17, 05:27:35 UTC] {s3.py:1054} INFO - smart_open.s3.MultipartWriter('folio-data-export-prod', 'data-export-files/full-dump/marc-files/5000_10000.xml'): uploading part_num: 1, 22124796 bytes (total 0.021GB)
[2024-08-17, 05:27:38 UTC] {transforms.py:102} INFO - Removing MARC fields using AWS S3 with path: /folio-data-export-prod/data-export-files/full-dump/marc-files/10000_15000.mrc
[2024-08-17, 05:27:40 UTC] {transforms.py:107} INFO - Removing MARC fields for 4,519 records
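
Putting it together, the counts could be mailed out when the DAG finishes, similar to the old Symphony email; a minimal sketch assuming Airflow's send_email helper is configured for SMTP and using a placeholder recipient:

from airflow.utils.email import send_email

def send_full_dump_report(total_records, file_counts, recipients=None):
    # Summarize the total record count, number of files, and per-file counts in one email.
    recipients = recipients or ["<sul-unicorn-devs list address>"]  # placeholder recipient
    lines = [f"Total records: {total_records:,}", f"Files: {len(file_counts)}"]
    lines += [f"{name}: {count:,}" for name, count in sorted(file_counts.items())]
    send_email(to=recipients, subject="FOLIO full dump report", html_content="<br>".join(lines))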

@jgreben (Contributor) commented Aug 27, 2024

In the meantime, I have been running this script I wrote around marcli:

#!/opt/homebrew/bin/bash

# First run `brew install marcli`
# Usage: ./marc_record_counts.sh [development|test|stage|prod] [xml.gz|mrc]

# List the full-dump MARC files in the bucket, filtering out entries by size
# and keeping only the filename column.
for F in $(aws s3 ls folio-data-export-${1}/data-export-files/full-dump/marc-files/ | awk '{print $3" "$4}' | egrep -v '^0|^102|^112' | awk '{print $2}')
do
    if [[ $F == *.$2 ]]; then
        aws s3 cp --quiet s3://folio-data-export-${1}/data-export-files/full-dump/marc-files/${F} /tmp/${F}
        if [[ $2 == xml.gz ]]; then
            gunzip /tmp/${F}
            F=${F%.gz}   # gunzip strips the .gz suffix, so point at the unzipped file
        fi
        # Count all 001 (HRID) fields, then the unique ones, to spot duplicates within a file.
        echo -n "Num HRIDs: "; marcli -file /tmp/${F} -fields 001 | grep -v '^[[:space:]]*$' | awk '{print $2}' | wc -l
        echo -n "Uniq HRIDs: "; marcli -file /tmp/${F} -fields 001 | grep -v '^[[:space:]]*$' | awk '{print $2}' | sort -u | wc -l
        rm /tmp/${F}
    fi
done
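
For example, ./marc_record_counts.sh prod mrc reports the total and unique 001 (HRID) counts for each .mrc file in the prod bucket.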
