cactus-hal2maf generates incomplete maf result #277

Open
xiaoyezao opened this issue Jun 29, 2023 · 6 comments
@xiaoyezao

Hello HAL designer,

I am using cactus-hal2maf to generate a MAF file. When I use --dupeMode all, the result looks fine:
cactus-hal2maf ./js $hal $maf --refGenome $reference --dupeMode all --chunkSize 1000000 --noAncestors --raw

But when I use --dupeMode single, the resulting MAF contains only two species (the reference species and the root species):
cactus-hal2maf ./js $hal $maf --refGenome $reference --dupeMode single --chunkSize 1000000 --noAncestors

Am I doing something wrong?

The HAL file passes halValidate, and the halStats output is:

hal v2.2
(T.koksaghyz:1.44628,(L.virosa:1.71614,(L.saligna:1,(L.serriola:1.02684,L.sativa:1.34239)Anc3:2.11001)Anc2:0.44435)Anc1:1.44628)Anc0;

GenomeName	 NumChildren	 Length	 NumSequences	 NumTopSegments	 NumBottomSegments
Anc0	 2	 52565806	 383	 0	 1696018
T.koksaghyz	 0	 1101786192	 9	 2840974	 0
Anc1	 2	 269438758	 4946	 1981150	 6136743
L.virosa	 0	 3446681275	 5855	 12167702	 0
Anc2	 2	 442941005	 5148	 5478814	 11292458
L.saligna	 0	 2165762035	 10	 12501915	 0
Anc3	 2	 1804676494	 4384	 12944598	 21077141
L.serriola	 0	 2495061189	 10	 23803612	 0
L.sativa	 0	 2590130143	 10	 24324535	 0
@glennhickey
Collaborator

glennhickey commented Jun 29, 2023

The "." character in the genome names is probably the issue. UCSC uses the convention that MAF lines are of the format <species>.<contig>. If your species names all begin with "L.", then the single-species filter will assume they are all the same species.
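
To make the naming problem concrete, here is a small sketch of how a `<species>.<contig>` name is read if the species is taken to be everything before the first ".". The helper `species_of` is hypothetical and only illustrates the convention; it is not the filter's actual code.

```shell
# Hypothetical helper: treat everything before the first "." as the species,
# following the <species>.<contig> convention described above.
species_of() {
    printf '%s\n' "${1%%.*}"
}

species_of "L.sativa.chr1"   # prints "L"
species_of "L.virosa.chr2"   # prints "L" (indistinguishable from L.sativa)
species_of "L_sativa.chr1"   # prints "L_sativa"
```

With the dotted names, every Lactuca genome collapses to the same apparent species, which would match the symptom of most species being dropped by the filter.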

The only way to fix this, I think, is to rename the species in the HAL file with halRenameGenomes. If you, for example, replace all "."s with "_"s, then --dupeMode single should work as intended.
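
A minimal sketch of building the rename list, assuming halRenameGenomes takes a tab-separated file with the existing name in the first column and the new name in the second (check `halRenameGenomes --help` to confirm). The function name `make_rename_map` and the file name `rename.tsv` are made up for illustration.

```shell
# Hypothetical helper: read genome names from stdin (one per line) and print
# "old<TAB>new" for each name that contains a ".", replacing "." with "_".
make_rename_map() {
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        new=$(printf '%s' "$name" | tr '.' '_')
        if [ "$new" != "$name" ]; then
            printf '%s\t%s\n' "$name" "$new"
        fi
    done
    return 0
}

# Possible usage (assumes halStats --genomes prints names separated by spaces):
#   halStats --genomes "$hal" | tr ' ' '\n' | make_rename_map > rename.tsv
#   halRenameGenomes "$hal" rename.tsv
```

Names without a "." (such as Anc0) are left out of the map, so they would keep their current names. It is probably worth working on a copy of the HAL file in case the rename modifies it in place.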

@xiaoyezao
Author

Thank you so much. After renaming the species, --dupeMode single worked.

But when I use --dupeMode consensus, the job fails with no clear clues. I attached the log file. Could you please take a look?

nohup_cactus-hal2maf_dupeModeCon.txt

@glennhickey
Collaborator

Yeah, not much to go on -- the error logging of cactus-hal2maf isn't so great.

I only recently added --dupeMode consensus and have only tried it on a few genomes. It uses maf_stream, which is a tool I'm not super familiar with.

If you want to get a more meaningful error, maybe you can run --dupeMode all, then use [maf_stream](https://github.com/ComparativeGenomicsToolkit/maf_stream) (included in cactus) on the maf to see what happens. Doing it this way would give the same result as if --dupeMode consensus worked.

If you want to share the hal, I will also try to take a look here at what's going wrong.

@xiaoyezao
Author

Yes, that would be great if you could have a look at the hal. It's 4.5Gb, what's the best way to send it to you? Maybe we can subset it?

@glennhickey
Collaborator

You can send it however you like. If that's too hard, you can try making a maf without --dupeMode consensus and then running maf_stream on it yourself to try to find the error.

@glennhickey
Collaborator

I think I can reproduce this -- it's maf_stream not being able to handle empty files. This should be fairly straightforward to fix.

And it also means that, as a workaround before the next release, you can create the maf with --dupeMode all and then run maf_stream merge_dups consensus on it yourself, and it will work.
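
A rough sketch of that workaround with a guard for MAF files that contain no alignment blocks, since empty input is the failure case described above. The function names are hypothetical, and the way maf_stream takes its input and output (stdin and stdout here) is an assumption; check `maf_stream merge_dups --help` for the real form.

```shell
# Return success if the MAF file has at least one alignment block.
# In the MAF format, each block starts with a line whose first field is "a".
maf_has_blocks() {
    grep -q -E '^a([[:space:]]|$)' "$1"
}

# Hypothetical wrapper: merge duplicates only when there is something to merge.
merge_dups_if_nonempty() {
    in_maf=$1
    out_maf=$2
    if maf_has_blocks "$in_maf"; then
        # ASSUMPTION: maf_stream reads the MAF on stdin and writes to stdout.
        maf_stream merge_dups consensus < "$in_maf" > "$out_maf"
    else
        # No blocks: pass the (header-only or empty) file through unchanged.
        cp "$in_maf" "$out_maf"
    fi
}

# Possible usage, after creating all.maf with --dupeMode all:
#   merge_dups_if_nonempty all.maf consensus.maf
```

For a whole-genome MAF the guard will rarely trigger; it is mainly useful if you run the merge on per-chunk files, where a chunk can have no blocks.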
