How is the annotation performance in the large genome (>10G) #3

Open
haoyongchao opened this issue Jul 9, 2024 · 9 comments

@haoyongchao

I would like to use the pipeline on a large plant genome. Would it be better to run it separately on each chromosome or directly on the entire genome? Are there any CPU or RAM requirements? Have you ever tested it on a large genome? Thanks!

@CSU-KangHu
Owner

Hi @haoyongchao ,

Thank you for using HiTE. While it is possible to run HiTE on the entire genome or on individual chromosomes separately, I recommend running HiTE on the entire genome. Running it on single chromosomes may miss TEs that are distributed across different chromosomes.

Previously, we ran the older version of HiTE on a 4.9 GB wheat genome using 40 CPU cores, which took 2-3 days. In tests with the new version of HiTE, it took 25 hours to process a 2.1 GB maize genome and 10 hours for a 2.6 GB mouse genome. Memory is generally not a limiting factor, but we suggest having 100 GB or more. We haven't tested HiTE on large plant genomes over 10 GB, but you are welcome to try it out. Additionally, if you encounter any issues during the process, we are happy to assist.

Best,
Kang

@haoyongchao
Author

Thank you for your prompt reply. I am running the pipeline on a 10G plant genome using 100 CPUs.

@wjq1981

wjq1981 commented Aug 16, 2024

Hi, thanks for developing such great software.
When I run it on a 9G genome, nothing ever seems to come out. It has been running since July 30th and the log is still stuck at:

2024-07-30 02:18:12,685 - main.py[line:389] - INFO: cd /HiTE/module && python3 /HiTE/module/judge_LTR_transposons.py -g /dev/hdd/wangjq/genome/Ago/09.repeat/HiTE/Ago.fasta --ltrharvest_home /HiTE/bin/LTR_HARVEST_parallel --ltrfinder_home /HiTE/bin/LTR_FINDER_parallel-master -t 24 --tmp_output_dir /dev/hdd/genome/Ago/repeat/HiTE --recover 1 --miu 7e-09 --use_NeuralTE 1 --is_wicker 0 --NeuralTE_home /HiTE/bin/NeuralTE --TEClass_home /HiTE/classification

Can you suggest anything?

@CSU-KangHu
Owner

Hi @wjq1981,

Thank you for using HiTE, and I apologize for the long runtime. Previously, I didn't recommend splitting contigs because I was concerned it might break TE sequences. However, given the efficiency needed for extremely large genomes, some trade-off in annotation performance may be necessary.

From your output, it appears that HiTE is still stuck in the first stage of LTR search. I suspect your genome contains some particularly long contigs. Since the LTR module processes each contig separately, with one process handling one contig, a very long contig could cause that process to run for an extended period, leaving other threads idle.
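
To check whether a few very long contigs really dominate the run, it can help to look at the contig length distribution first. The following is a minimal sketch and not part of HiTE: the function names `contig_lengths` and `imbalance_ratio` are hypothetical, and it assumes, as a rough approximation only, that the LTR search time for a contig grows with its length.

```python
import io
from typing import Dict, Iterable, Optional, TextIO


def contig_lengths(handle: TextIO) -> Dict[str, int]:
    """Return {contig name: length} for a FASTA file, streaming line by line."""
    lengths: Dict[str, int] = {}
    name: Optional[str] = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first word of the header as the contig name.
            name = header[0] if header else f"unnamed_{len(lengths) + 1}"
            lengths[name] = 0
        elif line and name is not None:
            lengths[name] += len(line)
    return lengths


def imbalance_ratio(lengths: Iterable[int], n_procs: int) -> float:
    """Longest contig divided by the ideal per-process share (total / n_procs).

    With one process per contig, a ratio well above 1 suggests that the
    longest contig keeps one process busy long after the others finish.
    """
    sizes = list(lengths)
    total = sum(sizes)
    if not sizes or total == 0 or n_procs < 1:
        raise ValueError("need at least one non-empty contig and n_procs >= 1")
    return max(sizes) / (total / n_procs)


if __name__ == "__main__":
    # Tiny in-memory FASTA used only for demonstration.
    demo = io.StringIO(">chr1\nACGTACGT\n>chr2\nAC\n")
    sizes = contig_lengths(demo)
    print(sizes, imbalance_ratio(sizes.values(), n_procs=2))
```

On a real assembly, one would open the genome FASTA and pass the file handle to `contig_lengths`, using the same process count as the `-t` value given to HiTE.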

Therefore, I suggest splitting your genome into contigs of more balanced lengths so that the runtime is distributed more evenly across processes. Since LTR elements typically span up to around 20 kb, the pieces should be long enough that few elements are broken at the cut points, perhaps around 10 Mb. This should put the currently idle processes to work.
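
The splitting itself could look roughly like the sketch below. It is not a HiTE utility: the function names and the `_partN` naming scheme are hypothetical, and it assumes plain fixed-size chunks with no overlap, so an element that straddles a cut point is still broken.

```python
import io
from typing import Iterable, Iterator, List, Optional, TextIO, Tuple

Record = Tuple[str, str]  # (contig name, sequence)


def read_fasta(handle: TextIO) -> Iterator[Record]:
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name: Optional[str] = None
    parts: List[str] = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            header = line[1:].split()
            name = header[0] if header else "unnamed"
            parts = []
        elif line and name is not None:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def split_contigs(records: Iterable[Record],
                  chunk_size: int = 10_000_000) -> Iterator[Record]:
    """Cut contigs longer than chunk_size into consecutive pieces."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    for name, seq in records:
        if len(seq) <= chunk_size:
            yield name, seq  # short contigs pass through unchanged
            continue
        for i, start in enumerate(range(0, len(seq), chunk_size), start=1):
            yield f"{name}_part{i}", seq[start:start + chunk_size]


def write_fasta(records: Iterable[Record], handle: TextIO,
                width: int = 60) -> None:
    """Write records as FASTA with fixed-width sequence lines."""
    for name, seq in records:
        handle.write(f">{name}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")
```

A typical call would chain the three functions, for example `write_fasta(split_contigs(read_fasta(src)), dst)` with `src` and `dst` being open file handles. Each contig is held in memory as one string, which should be acceptable for chromosome-scale sequences on a machine with the suggested amount of RAM.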

I hope you find this suggestion helpful.

Best,
Kang

@wjq1981

wjq1981 commented Aug 16, 2024

Thank you for your prompt response. I will give it a try.

@CSU-KangHu
Owner

Hello @haoyongchao and @wjq1981,

I hope you’re well. We’ve noticed that the current version of HiTE might not be ideal for handling large genomes, as it tends to require extensive runtime. To address this, we are optimizing the LTR module to enhance both performance and speed. Could you please share the links to the 10GB and 9GB genomes you used previously? This will help us assess whether the improvements have indeed sped up the LTR module.

Best regards,
Kang

@wjq1981

wjq1981 commented Sep 5, 2024

Sorry, school started today and I'm only just seeing this. Here is the link:

https://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Alisma_plantago-aquatica/all_assembly_versions/GCA_963693085.1_laAliPlan1.1/GCA_963693085.1_laAliPlan1.1_genomic.fna.gz

@xiekunwhy

Hi,

Can I break the genome sequences into small pieces (~10 M), use HiTE to annotate each piece independently, and then combine the results?

Best,
Kun

@CSU-KangHu
Owner

Hi @xiekunwhy,

Yes, that's possible. However, I suggest dividing the genome into somewhat larger segments, such as 200M or 400M. After merging the TE libraries from the different segments, you can cluster them to remove redundancy. A straightforward approach is to use cd-hit-est with -aS 0.95 -aL 0.95 and -c set to either 0.8 or 0.95: 0.8 tolerates more divergence, while 0.95 is stricter.
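
As an illustration of the clustering step, the sketch below builds a cd-hit-est command line with those parameters. It is only a sketch under stated assumptions: the helper names, the file names, and the thread count are hypothetical, the per-segment libraries are assumed to be already concatenated into one FASTA file, and the `-n` word sizes follow the ranges I recall from the CD-HIT user guide, so please check them against your installed version.

```python
import subprocess
from typing import List


def word_size_for(identity: float) -> int:
    """Pick a cd-hit-est word size (-n) suited to the identity threshold (-c).

    Assumption: ranges as listed in the CD-HIT user guide.
    """
    for threshold, word_size in ((0.95, 10), (0.90, 8), (0.88, 7),
                                 (0.85, 6), (0.80, 5), (0.75, 4)):
        if identity >= threshold:
            return word_size
    raise ValueError("identity below 0.75 is outside the documented -n ranges")


def cd_hit_est_command(merged_fasta: str, out_fasta: str,
                       identity: float = 0.95, threads: int = 8) -> List[str]:
    """Build the argument list; nothing is executed here."""
    return [
        "cd-hit-est",
        "-i", merged_fasta,          # concatenated TE libraries
        "-o", out_fasta,             # non-redundant library
        "-c", str(identity),         # 0.8 (looser) or 0.95 (stricter)
        "-aS", "0.95",               # alignment coverage of shorter sequence
        "-aL", "0.95",               # alignment coverage of longer sequence
        "-n", str(word_size_for(identity)),
        "-T", str(threads),
        "-M", "0",                   # no memory limit
        "-d", "0",                   # keep full sequence names in .clstr
    ]


def run_cd_hit_est(merged_fasta: str, out_fasta: str, **kwargs) -> None:
    """Run cd-hit-est; requires the binary to be on PATH."""
    subprocess.run(cd_hit_est_command(merged_fasta, out_fasta, **kwargs),
                   check=True)
```

Keeping the command construction separate from the call makes it easy to print and inspect the exact command before committing to a long clustering run.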

Best,
Kang
