full dump reporting #1148

Open · shelleydoljack opened this issue Aug 22, 2024 · 2 comments
Labels
data export Related to Data Export out of FOLIO to external vendors

Comments

@shelleydoljack (Contributor)

We need better reporting on the full dump: how many unique records it contains, how many records are in each file, how many files make up the full dump, how many duplicates (if any) are in the materialized view, etc. This would help us troubleshoot issues such as POD only receiving 5.4 million unique records and duplicates occurring across files.

With the previous full dump from Symphony, we'd get an email to the sul-unicorn-devs list with the counts. Maybe consider doing something similar when the full dump selection DAG finishes.

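A duplicate check along these lines could feed that report; here is a minimal sketch in Python, assuming a PostgreSQL connection and a placeholder name for the materialized view (the DSN and data_export_marc_ids are illustrative, not the real names):

import psycopg2

# Placeholder DSN and view name; substitute the actual FOLIO database and materialized view.
conn = psycopg2.connect("dbname=folio host=localhost user=folio")
with conn, conn.cursor() as cur:
    # Total rows vs. distinct ids shows how many duplicates the view contains.
    cur.execute(
        "SELECT count(*) AS total_rows, count(DISTINCT id) AS unique_ids "
        "FROM data_export_marc_ids"
    )
    total_rows, unique_ids = cur.fetchone()
    print(f"rows: {total_rows:,}  unique: {unique_ids:,}  duplicates: {total_rows - unique_ids:,}")
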
shelleydoljack added the data export label Aug 22, 2024

@shelleydoljack (Contributor, Author)

In the logs for the number_of_records task, we could pull this out into an email:

[2024-08-17, 02:04:30 UTC] {full_dump_marc.py:45} INFO - Record count: 9785445
[2024-08-17, 02:04:30 UTC] {python.py:202} INFO - Done. Returned value was: 9785445

In the calculate_start_stop tasks, the logs in the mapped tasks show:

[2024-08-17, 02:09:41 UTC] {full_dump_retrieval.py:103} INFO - Output in calculate_start_stop {'start': 0, 'stop': 2000000}
[2024-08-17, 02:09:41 UTC] {python.py:202} INFO - Done. Returned value was: {'start': 0, 'stop': 2000000}
[2024-08-17, 02:04:40 UTC] {full_dump_retrieval.py:103} INFO - Output in calculate_start_stop {'start': 2000000, 'stop': 4000000}
[2024-08-17, 02:04:40 UTC] {python.py:202} INFO - Done. Returned value was: {'start': 2000000, 'stop': 4000000}

☝️ Could this be the cause of some duplication? Shouldn't the next mapped task have start: 2000001?
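
A minimal sketch of how non-overlapping batches could be generated, assuming the start/stop values are used as a half-open window (start <= n < stop); the function name mirrors the task but is illustrative, not the DAG's actual code:

def calculate_start_stop(total: int, batch_size: int = 2_000_000):
    # Half-open [start, stop) windows: the stop of one batch equals the start of the next,
    # so nothing is selected twice unless the query treats stop as inclusive.
    for start in range(0, total, batch_size):
        yield {"start": start, "stop": min(start + batch_size, total)}

# Example with the record count from the log above:
for window in calculate_start_stop(9_785_445):
    print(window)  # {'start': 0, 'stop': 2000000}, {'start': 2000000, 'stop': 4000000}, ...

If the underlying query uses an inclusive upper bound, the row at 2000000 would land in both batches, which would match the duplication suspected above.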

I see in the logs for the transform_marc_records_add_holdings task, we get output like this that might be good to pull out for a reporting email:

[2024-08-17, 04:11:40 UTC] {transformer.py:111} INFO - Writing 4,537 modified MARC records to /folio-data-export-prod/data-export-files/full-dump/marc-files/5000_10000.mrc
[2024-08-17, 04:11:57 UTC] {transformer.py:111} INFO - Writing 4,519 modified MARC records to /folio-data-export-prod/data-export-files/full-dump/marc-files/10000_15000.mrc

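One way a report could pull these out is to scrape the task logs; a rough sketch (the regex is based on the log lines above, the helper function is illustrative):

import re

# Matches e.g. "Writing 4,537 modified MARC records to /folio-data-export-prod/.../5000_10000.mrc"
WRITING_RE = re.compile(r"Writing ([\d,]+) modified MARC records to (\S+)")

def per_file_counts(log_lines):
    counts = {}
    for line in log_lines:
        match = WRITING_RE.search(line)
        if match:
            counts[match.group(2)] = int(match.group(1).replace(",", ""))
    return counts  # e.g. {'.../marc-files/5000_10000.mrc': 4537, ...}
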
For the transform_marc_records_clean_serialize task, this output in the logs might also be good to add to an email report. For the serializing and removing-fields logging, maybe add the filename, and skip pulling the smart_open calls out of the log.

[2024-08-17, 05:27:29 UTC] {s3.py:1054} INFO - smart_open.s3.MultipartWriter('folio-data-export-prod', 'data-export-files/full-dump/marc-files/5000_10000.mrc'): uploading part_num: 1, 8711374 bytes (total 0.008GB)
[2024-08-17, 05:27:30 UTC] {transforms.py:131} INFO - Serializing 4537 MARC records as xml
[2024-08-17, 05:27:35 UTC] {s3.py:1054} INFO - smart_open.s3.MultipartWriter('folio-data-export-prod', 'data-export-files/full-dump/marc-files/5000_10000.xml'): uploading part_num: 1, 22124796 bytes (total 0.021GB)
[2024-08-17, 05:27:38 UTC] {transforms.py:102} INFO - Removing MARC fields using AWS S3 with path: /folio-data-export-prod/data-export-files/full-dump/marc-files/10000_15000.mrc
[2024-08-17, 05:27:40 UTC] {transforms.py:107} INFO - Removing MARC fields for 4,519 records
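
Putting it together, the counts could be mailed out when the DAG finishes, similar to the old Symphony email; a minimal sketch assuming Airflow's send_email helper is configured for SMTP and using a placeholder recipient:

from airflow.utils.email import send_email

def send_full_dump_report(total_records, file_counts, recipients=None):
    # Summarize the total record count, number of files, and per-file counts in one email.
    recipients = recipients or ["<sul-unicorn-devs list address>"]  # placeholder recipient
    lines = [f"Total records: {total_records:,}", f"Files: {len(file_counts)}"]
    lines += [f"{name}: {count:,}" for name, count in sorted(file_counts.items())]
    send_email(to=recipients, subject="FOLIO full dump report", html_content="<br>".join(lines))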

@jgreben (Contributor) commented Aug 27, 2024

In the meantime, I have been running this script I wrote around marcli:

#!/opt/homebrew/bin/bash

# First run `brew install marcli`
# Usage: ./marc_record_counts.sh [development|test|stage|prod] [xml.gz|mrc]

# List the full-dump MARC files in the bucket, filtering out entries by size
# and keeping only the filename column.
for F in $(aws s3 ls folio-data-export-${1}/data-export-files/full-dump/marc-files/ | awk '{print $3" "$4}' | egrep -v '^0|^102|^112' | awk '{print $2}')
do
    if [[ $F == *.$2 ]]; then
        aws s3 cp --quiet s3://folio-data-export-${1}/data-export-files/full-dump/marc-files/${F} /tmp/${F}
        if [[ $2 == xml.gz ]]; then
            gunzip /tmp/${F}
            F=${F%.gz}   # gunzip strips the .gz suffix, so point at the unzipped file
        fi
        # Count all 001 (HRID) fields, then the unique ones, to spot duplicates within a file.
        echo -n "Num HRIDs: "; marcli -file /tmp/${F} -fields 001 | grep -v '^[[:space:]]*$' | awk '{print $2}' | wc -l
        echo -n "Uniq HRIDs: "; marcli -file /tmp/${F} -fields 001 | grep -v '^[[:space:]]*$' | awk '{print $2}' | sort -u | wc -l
        rm /tmp/${F}
    fi
done
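
For example, ./marc_record_counts.sh prod mrc reports the total and unique 001 (HRID) counts for each .mrc file in the prod bucket.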
