
Slurm error - more processors requested than permitted #10

Open
SophieS9 opened this issue Nov 27, 2024 · 9 comments
Labels: question (Further information is requested)

@SophieS9

Hi PacBio Team,

I've just downloaded the workflow and I'm trying to run it for the first time on the test data on a Slurm compute cluster. I'm getting the following error for both the call-align_all_bams-0 and call-split_contigs-0 tasks:

srun: error: Unable to create step for job 58474048: More processors requested than permitted

I can't see anywhere to adjust the number of tasks requested in a job. Can you advise?

@proteinosome
Collaborator

Hi @SophieS9, thanks for your interest in using the pipeline. May I know how many CPUs there are on your cluster?

You're running it with miniwdl, right? There's a quirk whereby if you run miniwdl on a compute node, it is limited to the resources of that node's allocation (the node where you ran the miniwdl run command). So if a step requires 32 CPUs but the allocation you launched miniwdl from only has 8 CPUs, srun cannot launch that step: it inherits the SLURM_* environment variables from your allocation and tries to create a job step inside it, which gives the "More processors requested than permitted" error. I believe newer Slurm versions may get around this, but a trick I usually use when running miniwdl on a compute node is to unset all the relevant Slurm environment variables first, for example:

for param in $(printenv | grep '^SLURM' | sed 's|=.*||'); do unset "${param}"; done
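A variant of the same trick, sketched under the assumption that you are using bash: wrap the launch in a subshell so the Slurm variables are removed only for the miniwdl process, while your interactive shell keeps them. The function name run_without_slurm_env is hypothetical, and the miniwdl arguments in the usage line are placeholders for your own workflow and inputs files.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command with every variable whose name starts
# with "SLURM" (SLURM_JOB_ID, SLURMD_NODENAME, ...) removed from its environment.
run_without_slurm_env() {
  (
    # "${!SLURM@}" expands to the names of all shell variables beginning with "SLURM".
    # The unsets happen inside this subshell, so the caller's environment is untouched.
    for var in "${!SLURM@}"; do
      unset "$var"
    done
    "$@"
  )
}
```

Usage: run_without_slurm_env miniwdl run <workflow.wdl> -i <inputs.json>. Matching on the start of the variable name (rather than grepping the whole printenv line) avoids accidentally unsetting something like PATH just because its value happens to contain "SLURM".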

proteinosome self-assigned this on Nov 28, 2024
proteinosome added the question (Further information is requested) label on Nov 28, 2024
@SophieS9
Author

Thanks @proteinosome, that fixed the issue!

However, I'm now having a second issue with the test data. Alignment is failing for both the tumour and normal samples, but there's no error message at all.

The stderr simply says:

+ echo 'Aligning /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/COLO829.30X.SV_region.bam and /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta for patient1.tumor using pbmm2 into COLO829.30X.SV_region.aligned.bam'
+ pbmm2 --version
+ pbmm2 align /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/COLO829.30X.SV_region.bam COLO829.30X.SV_region.aligned.bam --sample patient1.tumor --sort -j 8 --unmapped --preset HIFI --log-level INFO --log-file pbmm2.log -A 2

And the stdout simply says:

Aligning /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/COLO829.30X.SV_region.bam and /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta for patient1.tumor using pbmm2 into COLO829.30X.SV_region.aligned.bam
pbmm2 1.14.99

Using:
  pbmm2    : 1.14.99 (commit v1.13.1-7-g864413e)
  pbbam    : 2.5.99 (commit v2.5.0-32-g4bb7db2)
  pbcopper : 2.5.99 (commit v2.4.0-70-gce23130)
  boost    : 1.81
  htslib   : 1.17
  minimap2 : 2.26
  zlib     : 1.2.13

So there's no indication at all of why this isn't working! Also, if I try to run the pbmm2 command outside the workflow, it complains that it can't find pbmm2.

@proteinosome
Collaborator

@SophieS9 Sorry for the late reply, I've been away for a bit.

Did you see anything in the work folder in the pbmm2 output directory? I'd also suggest checking the Slurm jobs to see if there's anything that could have caused your jobs to be terminated.
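One way to check the Slurm jobs is Slurm's accounting command, sacct. The sketch below assumes job accounting is enabled on your cluster; it filters sacct's pipe-delimited output down to jobs or steps whose State is not COMPLETED (for example FAILED, TIMEOUT or OUT_OF_MEMORY). The function name flag_unfinished_steps is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read pipe-delimited sacct rows on stdin and print every
# job or step whose State column is anything other than COMPLETED.
# Expected column order: JobID|JobName|State|ExitCode|MaxRSS|Elapsed
flag_unfinished_steps() {
  awk -F'|' '$3 != "COMPLETED" {
    printf "JobID=%s Name=%s State=%s ExitCode=%s MaxRSS=%s Elapsed=%s\n", $1, $2, $3, $4, $5, $6
  }'
}
```

Usage: sacct -j <jobid> --parsable2 --noheader --format=JobID,JobName,State,ExitCode,MaxRSS,Elapsed | flag_unfinished_steps. A step that ran out of memory or hit its time limit would show up here even when the task's own stderr looks clean.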

Re: pbmm2 not being found outside the workflow: the WDL workflow runs each task inside a container (Docker or Singularity), so the tools are only available from inside that container, not in your regular shell.
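To try pbmm2 by hand, one option is to run it from the same container image the task uses. A minimal sketch, assuming Singularity (or Docker) is installed on the node; the image reference is a placeholder you would take from the workflow's task definition, and run_in_container is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command inside a container image, preferring
# Singularity (common on HPC clusters) and falling back to Docker.
# The current directory is mounted so relative input/output paths still work.
run_in_container() {
  local image="$1"
  shift
  if command -v singularity >/dev/null 2>&1; then
    singularity exec --bind "$PWD" "docker://${image}" "$@"
  elif command -v docker >/dev/null 2>&1; then
    docker run --rm -v "$PWD:$PWD" -w "$PWD" "$image" "$@"
  else
    echo "run_in_container: neither singularity nor docker is on PATH" >&2
    return 127
  fi
}
```

Usage: run_in_container <pbmm2_image> pbmm2 --version. Returning 127 when no container runtime is found mirrors the shell's usual "command not found" status.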

@SophieS9
Author

@proteinosome the work directory contains a _miniwdl_inputs/ directory with the three expected files (the BAM, the reference FASTA and its .fai index) and a log file. The log file reports this error:

pbmm2 align ERROR: Could not determine read input type(s). Please do not mix data types, such as BAM+FASTQ, File of files may only contain BAMs or datasets.

So the input files seem to be the problem! Does pbmm2 need a file-type flag? I ran the workflow exactly as specified here: https://github.com/PacificBiosciences/HiFi-somatic-WDL/blob/main/docs/step-by-step.md#run-the-workflow

@proteinosome
Collaborator

@SophieS9 in the work directory there's also a file named command and another one called inputs.json. Can you share those two?

Thanks.

@SophieS9
Author

Sure! These are in the following directory:

<PATH-TO>/COLO829_demo/20241129_102934_hifisomatic/call-align_all_bams-0/call-NormalAlign-0

The command file contains:

set -euxo pipefail

echo "Aligning /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/COLO829BL.30X.SV_region.bam and /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta for patient1.normal using pbmm2 into COLO829BL.30X.SV_region.aligned.bam"

pbmm2 --version

pbmm2 align \
  /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta \
  /mnt/miniwdl_task_container/work/_miniwdl_inputs/0/COLO829BL.30X.SV_region.bam \
  COLO829BL.30X.SV_region.aligned.bam \
  --sample patient1.normal \
  --sort -j 8 \
  --unmapped \
  --preset HIFI \
  --log-level INFO --log-file pbmm2.log \
   \
  -A 2

And the inputs.json contains:

{
  "additional_args": "-A 2",
  "bam_file": "/scratch/c.medss21/COLO829BL.30X.SV_region.bam",
  "ref_fasta": "/scratch/c.medss21/hifisomatic_resources/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta",
  "ref_fasta_index": "/scratch/c.medss21/hifisomatic_resources/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.fai",
  "sample_name": "patient1.normal",
  "strip_kinetics": false,
  "threads": 8
}

@proteinosome
Collaborator

That is really strange. I'll see if I can reproduce it, but in the meantime, can you check whether the BAM file is well-formed, e.g. using samtools quickcheck (http://www.htslib.org/doc/samtools-quickcheck.html)?
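Alongside samtools quickcheck, a cheap sanity check is whether the file even starts like a BAM (for example, to rule out a truncated or mis-downloaded file). BAM files are BGZF-compressed, so the raw file begins with the gzip magic bytes 1f 8b, and the decompressed stream begins with the four bytes "BAM\1". The hypothetical helper looks_like_bam below checks both using only standard tools (head, od, gzip); it does not validate anything beyond those first bytes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: check the BGZF and BAM magic bytes of a file.
looks_like_bam() {
  local f="$1" magic header
  # The raw file must start with the gzip/BGZF magic bytes 1f 8b.
  magic="$(head -c 2 "$f" | od -An -tx1 | tr -d ' \n')"
  if [ "$magic" != "1f8b" ]; then
    echo "$f: not gzip/BGZF compressed (first bytes: ${magic:-none})"
    return 1
  fi
  # The decompressed stream must start with "BAM" followed by byte 0x01.
  # "|| true" ignores the SIGPIPE gzip gets when head stops reading early.
  header="$(gzip -dc "$f" 2>/dev/null | head -c 4 | od -An -tx1 | tr -d ' \n' || true)"
  if [ "$header" != "42414d01" ]; then
    echo "$f: compressed, but the decompressed data does not start with BAM\\1"
    return 1
  fi
  echo "$f: BAM magic OK"
}
```

Usage: looks_like_bam COLO829.30X.SV_region.bam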

Sorry for the issue.

@SophieS9
Author

It seems to be OK, although the samtools version I have doesn't have the -u option for unmapped reads:

[c.medss21@cl2(hawk) HiFi_Workflow]$ samtools quickcheck *.bam
COLO829.30X.SV_region.bam had no targets in header.
COLO829BL.30X.SV_region.bam had no targets in header.

I've also run ValidateSamFile from Picard, for your info:

[c.medss21@cl2(hawk) HiFi_Workflow]$ java -jar $PICARD ValidateSamFile --INPUT COLO829.30X.SV_region.bam
10:11:03.997 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/apps/genomics/picard/2.27.5/bin/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Dec 18 10:11:04 GMT 2024] ValidateSamFile --INPUT COLO829.30X.SV_region.bam --MODE VERBOSE --MAX_OUTPUT 100 --IGNORE_WARNINGS false --VALIDATE_INDEX true --INDEX_VALIDATION_STRINGENCY EXHAUSTIVE --IS_BISULFITE_SEQUENCED false --MAX_OPEN_TEMP_FILES 8000 --SKIP_MATE_VALIDATION false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Wed Dec 18 10:11:04 GMT 2024] Executing as c.medss21@cl2 on Linux 3.10.0-1160.119.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.27.5
WARNING 2024-12-18 10:11:04     ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
ERROR::MISSING_READ_GROUP:Read groups is empty
SAMFormatException on record 01
ERROR   2024-12-18 10:11:04     ValidateSamFile SAMFormatException on record 01
[Wed Dec 18 10:11:04 GMT 2024] picard.sam.ValidateSamFile done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2058354688
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
[c.medss21@cl2(hawk) HiFi_Workflow]$ java -jar $PICARD ValidateSamFile --INPUT COLO829BL.30X.SV_region.bam
10:11:25.389 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/apps/genomics/picard/2.27.5/bin/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Dec 18 10:11:25 GMT 2024] ValidateSamFile --INPUT COLO829BL.30X.SV_region.bam --MODE VERBOSE --MAX_OUTPUT 100 --IGNORE_WARNINGS false --VALIDATE_INDEX true --INDEX_VALIDATION_STRINGENCY EXHAUSTIVE --IS_BISULFITE_SEQUENCED false --MAX_OPEN_TEMP_FILES 8000 --SKIP_MATE_VALIDATION false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Wed Dec 18 10:11:25 GMT 2024] Executing as c.medss21@cl2 on Linux 3.10.0-1160.119.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.27.5
WARNING 2024-12-18 10:11:25     ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
ERROR::MISSING_READ_GROUP:Read groups is empty
SAMFormatException on record 01
ERROR   2024-12-18 10:11:25     ValidateSamFile SAMFormatException on record 01
[Wed Dec 18 10:11:25 GMT 2024] picard.sam.ValidateSamFile done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2058354688
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

@proteinosome
Collaborator

I've just done a run and was not able to reproduce this. samtools quickcheck won't work on this BAM without the -u option, because an unaligned BAM has no reference sequences (targets) in its header. Your ValidateSamFile output suggests the read group (RG) is missing, but the RG should be there. Can you re-download the BAM or try running the workflow on a real sample (if you have one)?
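To see whether a read group is actually present, you can print the BAM header with samtools view -H and look for @RG lines. The hypothetical helper below reads SAM header text on stdin, prints the ID of each @RG line it finds, and returns non-zero if there are none; it only inspects the header, not the per-read RG tags.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list read-group IDs from SAM header lines on stdin.
# Exits with status 1 when the header has no @RG line at all.
list_read_groups() {
  awk -F'\t' '
    $1 == "@RG" {
      found = 1
      # Header fields are TAG:VALUE pairs; print the value of the ID field.
      for (i = 2; i <= NF; i++) if ($i ~ /^ID:/) print substr($i, 4)
    }
    END { if (!found) exit 1 }'
}
```

Usage: samtools view -H COLO829.30X.SV_region.bam | list_read_groups || echo "no @RG line in header"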

FYI, I ran Picard (via GATK) on the same file and this is the output (no errors):

gatk ValidateSamFile --INPUT /dept/bifx/kpin/downstream/cancer/2023-3-15_COLO829_revio/all_cells/COLO829/COLO829.30X.SV_region.bam
Using GATK jar /home/kpin/softwares/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/kpin/softwares/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar ValidateSamFile --INPUT /dept/bifx/kpin/downstream/cancer/2023-3-15_COLO829_revio/all_cells/COLO829/COLO829.30X.SV_region.bam
17:27:38.702 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/kpin/softwares/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Dec 19 17:27:38 SGT 2024] ValidateSamFile --INPUT /dept/bifx/kpin/downstream/cancer/2023-3-15_COLO829_revio/all_cells/COLO829/COLO829.30X.SV_region.bam --MODE VERBOSE --MAX_OUTPUT 100 --IGNORE_WARNINGS false --VALIDATE_INDEX true --INDEX_VALIDATION_STRINGENCY EXHAUSTIVE --IS_BISULFITE_SEQUENCED false --MAX_OPEN_TEMP_FILES 8000 --SKIP_MATE_VALIDATION false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Dec 19 17:27:38 SGT 2024] Executing as [email protected] on Linux 5.14.0-284.25.1.el9_2.x86_64 amd64; OpenJDK 64-Bit Server VM 17.0.2+8-86; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.4.0.0
WARNING 2024-12-19 17:27:38     ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
No errors found
[Thu Dec 19 17:29:00 SGT 2024] picard.sam.ValidateSamFile done. Elapsed time: 1.36 minutes.
Runtime.totalMemory()=1224736768
Tool returned:
0
