Incorrect MD5 checksums being used for some files #331

Open
tanaes opened this issue Jan 9, 2025 · 0 comments
Labels
bug Something isn't working

tanaes commented Jan 9, 2025

Description of the bug

I've been having an issue with some files failing at checksum in some studies. Upon investigation, for at least some of these failing samples, it appears to be due to the pipeline not picking the correct MD5 value from the metadata.

For example, manually downloading this file completes and yields a checksum beginning 7b7e0:

(aspera) jonsan@nf-head:~/fetchngs/EA_pharma/fetchngs_exec/test$ ascp     -QT -l 300m -P33001     -i $CONDA_PREFIX/etc/aspera/aspera_bypass_dsa.pem     [email protected]:/vol1/fastq/SRR170/000/SRR17001000/SRR17001000.fastq.gz     SRX13191258_SRR17001000_1.fastq.gz
SRX13191258_SRR17001000_1.fastq.gz                                                                                                                 100%   26MB 15.3Mb/s    00:10
Completed: 26745K bytes transferred in 11 seconds
 (19634K bits/sec), in 1 file.
(aspera) jonsan@nf-head:~/fetchngs/EA_pharma/fetchngs_exec/test$ md5sum SRX13191258_SRR17001000_1.fastq.gz
7b7e0af5429bcb54b2c232489ea8212b  SRX13191258_SRR17001000_1.fastq.gz

However, looking at the command.sh file for this operation, the pipeline is comparing against a checksum beginning 3fcee:

#!/bin/bash -euo pipefail
ascp \
    -QT -l 300m -P33001 \
    -i $CONDA_PREFIX/etc/aspera/aspera_bypass_dsa.pem \
    [email protected]:/vol1/fastq/SRR170/000/SRR17001000/SRR17001000.fastq.gz \
    SRX13191258_SRR17001000_1.fastq.gz

echo "3fcee2e72a2ec6221cac142538aff092  SRX13191258_SRR17001000_1.fastq.gz" > SRX13191258_SRR17001000_1.fastq.gz.md5
md5sum -c SRX13191258_SRR17001000_1.fastq.gz.md5
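To see which published checksum a downloaded file actually corresponds to, the comparison can be done against every entry of the fastq_md5 field rather than a single expected value. The following is only a minimal sketch: `md5_of` and `match_checksum` are hypothetical helper names, not part of the pipeline.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def match_checksum(path, fastq_md5_field):
    """Return the 0-based position of the entry in a semicolon-separated
    fastq_md5 field that matches the file, or None if nothing matches."""
    observed = md5_of(path)
    candidates = fastq_md5_field.split(";")
    return candidates.index(observed) if observed in candidates else None
```

Given the md5sum output from the manual download, calling `match_checksum` on SRX13191258_SRR17001000_1.fastq.gz with this run's fastq_md5 value would be expected to return 0, the first entry, rather than the position of the checksum used in command.sh.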

If we look at the metadata downloaded for this run, both checksums appear, but in different places: 7b7e0... is the first entry of fastq_md5, while 3fcee... is the second entry and is also the value of md5_1:

fastq_md5	7b7e0af5429bcb54b2c232489ea8212b;3fcee2e72a2ec6221cac142538aff092;383df08e03e1cd1ee071fd67c16b085b
fastq_bytes	27387589;1445187226;1481254395
fastq_ftp	ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_2.fastq.gz
fastq_galaxy	ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_2.fastq.gz
fastq_aspera	fasp.sra.ebi.ac.uk:/vol1/fastq/SRR170/000/SRR17001000/SRR17001000.fastq.gz;fasp.sra.ebi.ac.uk:/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_1.fastq.gz;fasp.sra.ebi.ac.uk:/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_2.fastq.gz
fastq_1	ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_1.fastq.gz
fastq_2	ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_2.fastq.gz
md5_1	3fcee2e72a2ec6221cac142538aff092
md5_2	383df08e03e1cd1ee071fd67c16b085b

It appears that there are three FASTQ files for this run, and the workflow is grabbing the first one, SRR17001000.fastq.gz (maybe an index read? At 27387589 bytes it's much smaller than the other two), renaming it to _1.fastq.gz, and then comparing it against the MD5 of the real _1.fastq.gz file (md5_1). I haven't looked in the code yet to find the logic that splits reads 1 and 2, but it might be making too liberal an assumption about the structure of the fastq_ftp field.
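One way to avoid the mismatch would be to pair each URL with its checksum by position within the semicolon-separated fields, and then pick reads 1 and 2 by file name rather than by position. This is only a sketch of that idea, not the pipeline's actual code: `pair_reads` is a hypothetical function, and the `<run>_1.fastq.gz` / `<run>_2.fastq.gz` naming is an assumption based on the record shown above.

```python
from pathlib import PurePosixPath


def split_field(value):
    """Split a semicolon-separated metadata field, dropping empty items."""
    return [item for item in value.split(";") if item]


def pair_reads(run_accession, fastq_ftp, fastq_md5):
    """Map read labels ('1', '2', 'single') to (url, md5) by file name.

    Assumes fastq_ftp and fastq_md5 list their entries in the same order.
    """
    urls = split_field(fastq_ftp)
    md5s = split_field(fastq_md5)
    if len(urls) != len(md5s):
        raise ValueError("fastq_ftp and fastq_md5 have different lengths")

    reads = {}
    for url, md5 in zip(urls, md5s):
        name = PurePosixPath(url).name
        if name == f"{run_accession}_1.fastq.gz":
            reads["1"] = (url, md5)
        elif name == f"{run_accession}_2.fastq.gz":
            reads["2"] = (url, md5)
        elif name == f"{run_accession}.fastq.gz":
            reads["single"] = (url, md5)
    return reads
```

With the fastq_ftp and fastq_md5 values shown for SRR17001000, this keeps the 3fcee... checksum attached to SRR17001000_1.fastq.gz and the 7b7e0... checksum attached to the unsuffixed SRR17001000.fastq.gz, so the file that is downloaded and the checksum it is verified against always come from the same entry.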

Maybe related to issue #260?

Either way, this is leading to failed downloads, so it seems like it should properly be considered a bug.

Command used and terminal output

Relevant files

No response

System information

No response

tanaes added the bug (Something isn't working) label Jan 9, 2025