How is the annotation performance in the large genome (>10G) #3
Comments
Hi @haoyongchao, Thank you for using HiTE. While it is possible to run HiTE on the entire genome or on individual chromosomes separately, I recommend running HiTE on the entire genome. Running it on single chromosomes may miss TEs that are distributed across different chromosomes. Previously, we ran the older version of HiTE on a Best,
Thank you for your prompt reply. I am running the pipeline on a 10G plant genome using 100 CPUs.
Hi, thanks for developing such great software.
Hi @wjq1981, Thank you for using HiTE, and I apologize for the long runtime. Previously, I didn’t recommend splitting contigs because I was concerned it might break the TE sequences. However, considering the efficiency needed for extremely large genomes, some trade-offs in performance might be necessary. From your output, it appears that HiTE is still stuck in the first stage of the LTR search. I suspect your genome contains some particularly long contigs. Since the LTR module processes each contig separately, with one process handling one contig, a very long contig could cause that process to run for an extended period, leaving the other processes idle. Therefore, I suggest splitting your genome into contigs with more balanced lengths so that the runtime is distributed more evenly across processes. Since LTRs typically span up to around 20 kb, you should aim to make the contigs long enough to avoid breaking LTRs, perhaps around 10 Mb? This should help fully utilize the remaining idle processes. I hope you find this suggestion helpful. Best,
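The splitting suggested above could be sketched roughly as follows. This is a minimal illustration in plain Python, not part of HiTE: the function names (`read_fasta`, `split_record`, `split_fasta`), the `_partN_startM` naming scheme, and the default of 10,000,000 bp are all assumptions made for the example. Chunks are cut at fixed offsets with no overlap, so a TE that spans a cut point will still be broken.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a FASTA handle."""
    name, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            # Keep only the first word of the header as the record name.
            name, parts = line[1:].split()[0], []
        else:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def split_record(name: str, seq: str, chunk_size: int) -> Iterator[Tuple[str, str]]:
    """Cut one sequence into consecutive chunks of at most chunk_size bases."""
    if len(seq) <= chunk_size:
        yield name, seq
        return
    for i, start in enumerate(range(0, len(seq), chunk_size)):
        # Hypothetical naming scheme: the 0-based start offset is kept so
        # coordinates can be mapped back to the original contig later.
        yield f"{name}_part{i}_start{start}", seq[start:start + chunk_size]


def split_fasta(src: TextIO, dst: TextIO,
                chunk_size: int = 10_000_000, width: int = 60) -> None:
    """Write a copy of src in which no record is longer than chunk_size."""
    for name, seq in read_fasta(src):
        for chunk_name, chunk in split_record(name, seq, chunk_size):
            dst.write(f">{chunk_name}\n")
            for j in range(0, len(chunk), width):
                dst.write(chunk[j:j + width] + "\n")


# Tiny in-memory demonstration with an artificially small chunk size.
demo_in = io.StringIO(">chr1 desc\nACGTACGTAC\nGT\n>chr2\nAC\n")
demo_out = io.StringIO()
split_fasta(demo_in, demo_out, chunk_size=5)
print(demo_out.getvalue(), end="")
```

For a real genome, `src` and `dst` would be file handles opened with `open(...)`, and the resulting FASTA file would be passed to HiTE as the genome input.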
Thank you for your prompt response. I will give it a try.
Hello @haoyongchao and @wjq1981, I hope you’re well. We’ve noticed that the current version of HiTE might not be ideal for handling large genomes, as it tends to require extensive runtime. To address this, we are optimizing the LTR module to enhance both performance and speed. Could you please share the links to the 10GB and 9GB genomes you used previously? This will help us assess whether the improvements have indeed sped up the LTR module. Best regards,
Sorry, school started today and I'm just now seeing this. The link to it is here.
Hi, Can I break the genome sequences into small pieces (~10 M) and use HiTE to annotate each piece independently, then combine the results? Best,
Hi @xiekunwhy, Yes, that's possible. However, I suggest dividing the segments into slightly larger parts, such as 200M or 400M. After merging, you can cluster and remove redundancies across different TE libraries. A straightforward approach is to use Best,
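The tool named in the reply above is cut off in this copy of the thread, so it is not reproduced here. As a much simpler stand-in, the sketch below merges several TE libraries and drops only exact duplicates, treating a sequence and its reverse complement as the same entry. It does not perform similarity-based clustering, which is what the reply actually recommends. The names `canonical` and `merge_libraries` are hypothetical and not part of HiTE.

```python
from typing import Iterable, List, Tuple

# Translation table for complementing DNA bases (N is left as N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq: str) -> str:
    """Strand-independent key: the smaller of the sequence and its reverse complement."""
    upper = seq.upper()
    return min(upper, reverse_complement(upper))


def merge_libraries(
    libraries: Iterable[Iterable[Tuple[str, str]]]
) -> List[Tuple[str, str]]:
    """Concatenate (name, sequence) libraries, keeping the first copy of each exact duplicate."""
    seen = set()
    merged = []
    for library in libraries:
        for name, seq in library:
            key = canonical(seq)
            if key in seen:
                continue  # exact duplicate on either strand
            seen.add(key)
            merged.append((name, seq))
    return merged


# Toy libraries, as if produced by two separate runs on different genome parts.
lib_a = [("TE1", "ACGTT"), ("TE2", "GGGCC")]
lib_b = [("TE3", "AACGT"), ("TE4", "TTTTA")]  # TE3 is the reverse complement of TE1
merged = merge_libraries([lib_a, lib_b])
print([name for name, _ in merged])
```

Near-identical copies that differ by even one base are kept by this sketch, so a dedicated sequence-clustering tool is still needed to reduce redundancy in a real merged library.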
I would like to use the pipeline on a large plant genome. Would it be better to run it separately on each chromosome or directly on the entire genome? Are there any requirements for CPUs and RAM? Have you ever tested it on a large genome? Thanks!!