Gconcepcion/yak changes #9

gconcepcion · 2023-09-25T17:33:37Z

Add settings for yak. The idea is that singleton k-mers are more likely to be errors. Therefore, use a bloom filter (-b37) when we have a lot of data, and no bloom filter when either parent has low coverage.
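A minimal sketch of the selection logic described above, not the code in this PR. The workflow name, the input names, and the `min_depth_for_bloom` cutoff are hypothetical; the PR description does not state the threshold it uses. `-b37` is the setting named above, and `-b0` assumes yak's documented behaviour that a bloom filter size of 0 disables the filter.

```wdl
version 1.0

workflow choose_yak_params_sketch {
	input {
		Float father_depth
		Float mother_depth
		# Hypothetical cutoff; the value used in the PR is not stated here
		Float min_depth_for_bloom = 10.0
	}

	# Decide once for both parents: a single low-coverage parent disables the filter
	Boolean use_bloom = father_depth >= min_depth_for_bloom && mother_depth >= min_depth_for_bloom

	# With the bloom filter on, singleton k-mers (likely errors) are dropped
	String bloom_param = if use_bloom then "-b37" else "-b0"

	output {
		String yak_bloom_param = bloom_param
	}
}
```

Making the decision jointly keeps the two parental k-mer databases comparable, which matches the later commit "determine yak settings for both parents rather than independently".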

Also added support for optional alignment to multiple references

workflows/input_template.json

workflows/de_novo_assembly_sample/de_novo_assembly_sample.wdl

williamrowell · 2023-09-28T04:51:07Z

workflows/assemble_genome/assemble_genome.wdl

Given that alignments are taking less than 24h now, we can probably reconsider forcing GCP to use on-demand.

parameter_meta needs an update.

https://github.com/PacificBiosciences/wdl-humanassembly/blob/6c0232749f49dc4451c67a1911b2b5166e958b51/workflows/assemble_genome/assemble_genome.wdl#L101

If the first FASTA in the array is abnormally small, this could result in requesting too little disk space. We've been converting these to something more like:

Int disk_size = ceil(size(reads_fastas, "GB") * 4 + 20)

https://github.com/PacificBiosciences/wdl-humanassembly/blob/6c0232749f49dc4451c67a1911b2b5166e958b51/workflows/assemble_genome/assemble_genome.wdl#L205

Int disk_size = ceil((size(query_sequences, "GB") + size(reference, "GB")) * 2 + 20)

https://github.com/PacificBiosciences/wdl-humanassembly/blob/6c0232749f49dc4451c67a1911b2b5166e958b51/workflows/assemble_genome/assemble_genome.wdl#L221

-@ 3

https://github.com/PacificBiosciences/wdl-humanassembly/blob/6c0232749f49dc4451c67a1911b2b5166e958b51/workflows/assemble_genome/assemble_genome.wdl#L238

memory: threads * 8 + " GB"

Or something like this. Maybe define mem_gb in inputs depending on threads, and use memory: mem_gb + " GB"
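A rough sketch of the `mem_gb` suggestion, assuming a placeholder task; the task name, the default thread count, and the `echo` command are illustrative only and are not taken from assemble_genome.wdl.

```wdl
version 1.0

task align_sketch {
	input {
		Int threads = 16
		# Default scales with the thread count, but callers can still override it
		Int mem_gb = threads * 8
	}

	command <<<
		# Placeholder command; the real task would run the aligner here
		echo "threads=~{threads} mem_gb=~{mem_gb}"
	>>>

	output {
		String summary = read_string(stdout())
	}

	runtime {
		cpu: threads
		memory: mem_gb + " GB"
	}
}
```

Declaring `mem_gb` as an input keeps the 8 GB per thread ratio as the default while letting a caller raise or lower memory without editing the runtime block.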

williamrowell · 2023-09-28T04:52:53Z

workflows/assemble_genome/assemble_genome.wdl

+			"data_index": align_hifiasm.asm_bam_index
+		}
+
+		Pair[ReferenceData,IndexData] align_data = (ref, sample_aligned_bam)


I'd be interested in seeing what this looks like in outputs.json.

workflows/de_novo_assembly_trio/de_novo_assembly_trio.wdl

- updated parameter_meta
- updated inputs.json
- cleaned up some whitespace
- added comments
- using fasta filesize to estimate depth rather than a separate task; based on Greg's experiments, an uncompressed 10x FASTA is ~60GB
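A minimal sketch of the file-size depth estimate, assuming the ~60GB per 10x figure quoted above (roughly 6 GB of uncompressed FASTA per 1x). The workflow name and the `gb_per_1x` input are hypothetical, and the estimate is only meaningful for uncompressed FASTA inputs.

```wdl
version 1.0

workflow estimate_depth_sketch {
	input {
		Array[File] reads_fastas
		# ~60 GB of uncompressed FASTA per 10x of coverage, so ~6 GB per 1x
		Float gb_per_1x = 6.0
	}

	# size() over the whole array sums every FASTA, so one small file does not skew the estimate
	Float estimated_depth = size(reads_fastas, "GB") / gb_per_1x

	output {
		Float depth = estimated_depth
	}
}
```

For example, 180 GB of uncompressed parental FASTA would come out as roughly 30x under this assumption.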

williamrowell · 2023-09-28T20:53:59Z

We'll need to update the README as well.

Use FASTA file size to estimate depth for yak count parameters.

…of aligned bam outputs

gconcepcion added 5 commits August 16, 2023 11:53

Calculate total bases input for each parent to set yak params on the fly

9c6c96e

less than not greater than

c9f7092

Adding multi-reference alignment option

56833d4

add yak bloom filter condition

a40d2a2

fix coverage

fcf9ffe

gconcepcion requested a review from williamrowell September 25, 2023 17:33

gconcepcion and others added 4 commits September 25, 2023 11:03

determine yak settings for both parents rather than independently

d462924

fix tests and remove some debug comments I missed

af88a08

update wdl-ci config file after successful tests

137d6a9

update wdl-ci config file after successful tests

6c02327