How is the annotation performance in the large genome (>10G) #3

haoyongchao · 2024-07-09T12:56:50Z

I would like to use the pipeline on a large plant genome. Would it be better to run it separately on each chromosome or directly on the entire genome? Are there any CPU or RAM requirements? Have you ever tested it on a large genome? Thanks!

CSU-KangHu · 2024-07-09T13:52:38Z

Hi @haoyongchao ,

Thank you for using HiTE. While it is possible to run HiTE on the entire genome or on individual chromosomes separately, I recommend running HiTE on the entire genome. Running it on single chromosomes may miss TEs that are distributed across different chromosomes.

Previously, we ran the older version of HiTE on a 4.9 GB wheat genome using 40 CPU cores, which took 2-3 days. In tests with the new version of HiTE, it took 25 hours to process a 2.1 GB maize genome and 10 hours for a 2.6 GB mouse genome. Memory is generally not a limiting factor, but we suggest having 100 GB or more. We haven't tested HiTE on large plant genomes over 10 GB, but you are welcome to try it out. Additionally, if you encounter any issues during the process, we are happy to assist.

Best,
Kang

haoyongchao · 2024-07-09T14:01:33Z

Thank you for your prompt reply. I am running the pipeline on a 10G plant genome using 100 CPUs.

wjq1981 · 2024-08-16T02:09:51Z

Hi, thanks for developing such great software.
I am running it on a 9G genome, but it does not seem to produce any output. It has been running since July 30th and the log has been stuck at this line ever since: “2024-07-30 02:18:12,685 - main.py[line:389] - INFO: cd /HiTE/module && python3 /HiTE/module/judge_LTR_transposons.py -g /dev/hdd/wangjq/genome/Ago/09.repeat/HiTE/Ago.fasta --ltrharvest_home /HiTE/bin/LTR_HARVEST_parallel --ltrfinder_home /HiTE/bin/LTR_FINDER_parallel-master -t 24 --tmp_output_dir /dev/hdd/genome/Ago/repeat/HiTE --recover 1 --miu 7e-09 --use_NeuralTE 1 --is_wicker 0 --NeuralTE_home /HiTE/bin/NeuralTE --TEClass_home /HiTE/classification”. Can you suggest anything?

CSU-KangHu · 2024-08-16T03:13:22Z

Hi @wjq1981,

Thank you for using HiTE, and I apologize for the long runtime. Previously, I didn’t recommend splitting contigs because I was concerned it might break the TE sequences. However, considering the efficiency needed for extremely large genomes, some trade-offs in performance might be necessary.

From your output, it appears that HiTE is still stuck in the first stage of LTR search. I suspect your genome contains some particularly long contigs. Since the LTR module processes each contig separately, with one process handling one contig, a very long contig could cause that process to run for an extended period, leaving other threads idle.

Therefore, I suggest splitting your genome into contigs with more balanced lengths to ensure the runtime is more evenly distributed across processes. Since LTRs typically span up to around 20 kb, you should aim to make the contigs long enough to avoid breaking LTRs—perhaps around 10 Mb? This should help fully utilize the remaining idle processes.

I hope you find this suggestion helpful.

Best,
Kang
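As an illustration of the splitting step suggested above, here is a minimal sketch in plain Python. It is not part of HiTE: the function names, the `name_start_end` naming scheme for the pieces, and the 60-column output width are all assumptions made for this example. Sequences longer than `max_len` are cut into consecutive pieces; the cuts are hard, so an element straddling a boundary is broken, which is the trade-off mentioned in the comment.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a FASTA stream."""
    name, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            # Keep only the first word of the header as the sequence name.
            name, parts = (line[1:].split() or ["unnamed"])[0], []
        else:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def split_sequence(name: str, seq: str,
                   max_len: int = 10_000_000) -> Iterator[Tuple[str, str]]:
    """Yield pieces of at most max_len bases.

    Piece names carry 1-based start and end coordinates (hypothetical
    naming scheme) so that hits can be mapped back to the original contig.
    """
    if len(seq) <= max_len:
        yield name, seq
        return
    for start in range(0, len(seq), max_len):
        chunk = seq[start:start + max_len]
        yield f"{name}_{start + 1}_{start + len(chunk)}", chunk


def split_fasta(src: TextIO, dst: TextIO,
                max_len: int = 10_000_000, width: int = 60) -> None:
    """Write src to dst with every sequence cut to at most max_len bases."""
    for name, seq in read_fasta(src):
        for chunk_name, chunk in split_sequence(name, seq, max_len):
            dst.write(f">{chunk_name}\n")
            for i in range(0, len(chunk), width):
                dst.write(chunk[i:i + width] + "\n")


if __name__ == "__main__":
    # Tiny in-memory demo; for a real genome, pass open file handles instead.
    demo = io.StringIO(">chr1\nACGTACGTAC\n>chr2\nACG\n")
    out = io.StringIO()
    split_fasta(demo, out, max_len=4)
    print(out.getvalue(), end="")
```

For a real run, `max_len` would stay at the default of 10,000,000 (the roughly 10 Mb suggested above) and the two handles would be the input genome and a new output FASTA file.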

wjq1981 · 2024-08-16T03:36:56Z

Thank you for your prompt response. I will give it a try.

CSU-KangHu · 2024-09-05T08:43:40Z

Hello @haoyongchao and @wjq1981,

I hope you’re well. We’ve noticed that the current version of HiTE might not be ideal for handling large genomes, as it tends to require extensive runtime. To address this, we are optimizing the LTR module to enhance both performance and speed. Could you please share the links to the 10GB and 9GB genomes you used previously? This will help us assess whether the improvements have indeed sped up the LTR module.

Best regards,
Kang

wjq1981 · 2024-09-05T13:10:08Z

Sorry, school started today and I am only just seeing this. Here is the link to the genome:

https://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Alisma_plantago-aquatica/all_assembly_versions/GCA_963693085.1_laAliPlan1.1/GCA_963693085.1_laAliPlan1.1_genomic.fna.gz

xiekunwhy · 2024-11-02T02:51:07Z

Hi,

Can I break the genome sequences into small pieces (~10 Mb), use HiTE to annotate each piece independently, and then combine the results?

Best,
Kun

CSU-KangHu · 2024-11-02T10:37:06Z

Hi @xiekunwhy,

Yes, that's possible. However, I suggest using somewhat larger pieces, such as 200 Mb or 400 Mb. After annotating each piece, merge the resulting TE libraries, then cluster them to remove redundancy across the libraries. A straightforward approach is to use cd-hit-est. I recommend the parameters -aS 0.95 -aL 0.95 with -c set to either 0.8 or 0.95: 0.8 allows more divergence, while 0.95 is stricter.

Best,
Kang
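The merge-and-deduplicate step described above could be scripted roughly as follows. This is a sketch under stated assumptions, not HiTE's own procedure: the file names, the piece labels, and the `label|` header prefix are hypothetical, and `-T 0` (all CPUs) and `-M 0` (no memory limit) are extra cd-hit-est options not mentioned in the comment. The word size `-n` is set from the identity threshold following the ranges given in the CD-HIT user guide, because the default word size is intended for high thresholds and a lower `-c` such as 0.8 needs a smaller one.

```python
import shlex
from typing import Dict, Iterable, Iterator, List


def merge_libraries(libs: Dict[str, Iterable[str]]) -> Iterator[str]:
    """Concatenate per-piece TE libraries, prefixing each header with a
    piece label so that identical IDs from different runs do not collide."""
    for label, lines in libs.items():
        for line in lines:
            line = line.rstrip("\n")
            if line.startswith(">"):
                yield f">{label}|{line[1:]}"
            elif line:
                yield line


def word_size_for(identity: float) -> int:
    """Pick the cd-hit-est word size (-n) for an identity threshold,
    following the ranges listed in the CD-HIT user guide."""
    if identity < 0.75:
        raise ValueError("cd-hit-est identity thresholds start at 0.75")
    if identity >= 0.95:
        return 10
    if identity >= 0.90:
        return 8
    if identity >= 0.88:
        return 7
    if identity >= 0.85:
        return 6
    if identity >= 0.80:
        return 5
    return 4


def cd_hit_est_command(merged_lib: str, out_lib: str, identity: float = 0.95,
                       coverage: float = 0.95, threads: int = 0) -> List[str]:
    """Build the cd-hit-est argument list; run it with subprocess.run(cmd)."""
    return [
        "cd-hit-est",
        "-i", merged_lib,
        "-o", out_lib,
        "-c", str(identity),                # 0.8 is lenient, 0.95 is strict
        "-n", str(word_size_for(identity)),
        "-aS", str(coverage),               # coverage of the shorter sequence
        "-aL", str(coverage),               # coverage of the longer sequence
        "-T", str(threads),                 # assumption: 0 means all CPUs
        "-M", "0",                          # assumption: no memory limit
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = cd_hit_est_command("merged_TE.fa", "merged_TE.nr.fa", identity=0.8)
    print(shlex.join(cmd))
```

The command is returned as a list rather than executed, so it can be inspected first and then passed to `subprocess.run(cmd, check=True)` once cd-hit-est is on the `PATH`.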
