Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Another hardcoded path -- to linenum.R #8

Open
rsharris opened this issue May 24, 2018 · 9 comments
Open

Another hardcoded path -- to linenum.R #8

rsharris opened this issue May 24, 2018 · 9 comments

Comments

@rsharris
Copy link

Line 439 in processRead.sh.
Rscript --vanilla ~/replong/linenum.R $file

Lines 390 and 461 get it right (I guess), as do ten other uses of R scripts.

@rsharris
Copy link
Author

rsharris commented May 24, 2018

After fixing that, I finally get a non-empty result.fa.

Would it be possible for you to run replong on my data so that I know what the answer from the working replong is? My data is in https://github.com/makovalab-psu/FakeTRGenome , the genome size is exactly 100K, and used -c true.

@bzvew
Copy link
Collaborator

bzvew commented May 25, 2018

Thank you for fixing that! I try your data, and new.fa and result.fa is the result before merging similar reads and after merging similar reads. The result.fa has only 4 reads at first. The merging process needs to be improved, and I change some parameter in edge_1.py to have more reads in the result.fa (threshold1 = 250, threshold2 = 600, threshold3 = 0.8 now). The result.fa is now 80k, with 12 reads, and I can see there are still redundancy in the result. I will try to improve that(I'm busy and not sure if I can finish that in this week). You can try to modify three threshold in edge_1.py to test the result if you have interest in that. I attached the result_old.fa(with original parameters) and result.fa(with modified parameters).
result.zip

@bzvew
Copy link
Collaborator

bzvew commented May 25, 2018

new.fa.zip

@rsharris
Copy link
Author

Thanks for running that.

The reason I asked is because I'm not convinced that replong is running correctly for me. I was hoping you would get the same result that I got, but you didn't. My result.fa has 2 sequences totaling ≈14K bp, whereas your first attempt (which I believe would be using the same parameters I used) has 4 sequences and 26K.

Am I correct to assume that they should be the same result, since I have not tweaked any parameters (I fetched the source code on May/22)? I can't see any simple means to figure out where/how/why differences occurred. Maybe my version of igraph gives different behavior? Or maybe some tool in this pipeline is non-deterministic? Is canu deterministic?

I do have some questions about the output -- I refer here to your first result, result_old.fa.

Of the 26K, about 15K is the tandem repeats I embedded in my fake genome. For example, sequence 1 103..2150 is a 96% match to GGAAT*, and 3452..4208 is a 97% match to TAGTCGAGGAATTATGGATGGCGAAGATAAAT*. Those are things I expect to find.

However, what's in between those two, 2150..4208, shouldn't be a repeat in my genome. Because, when I generated my random genome, I filled in between TRs with just random sequence (sequence that appears nowhere else in the genome). What I find is that replong's sequence 1, in its entirety, is a 98.5% match for my genome segment 49871..57728. Only 5K of this is TR (per the truth table), the rest is random, so 35% of that reported sequence is a false positive ((7900-5080)/7900).

I guess that happens because, presumably, that region was over-represented in the reads. Average coverage is around 20X but that region ranged from 29X to 35X. Coverage skew is probably high because the genome length, 100K, is so short relative to the average read length, around 8K, and is likely to be higher in the center of the genome. I don't know if that kind of skew would be present in real pacbio data, but I recall that in short read sequencing there are often regions highly over-represented.

I need to look back at your manuscript to see what it says about false positive rate.

@rsharris
Copy link
Author

I pondered
Or maybe some tool in this pipeline is non-deterministic?
Looks like the answer is yes, or it still isn't working right on my machine.

I ran it twice in a row and the number reported at line 410 was slightly different between the runs. Is it igraph?

I don't know if that's a problem, but it does raise a lot of questions.

@bzvew
Copy link
Collaborator

bzvew commented May 28, 2018

Yes, the igraph community detection process is not a deterministic process. It's a heuristic algorithm. How about the coverage of the 5080bp sequences in the genome?

@rsharris
Copy link
Author

rsharris commented May 28, 2018

How about the coverage of the 5080bp sequences in the genome?

I'm not sure exactly what you're asking. In that 7900 bp piece of the genome, there are five TRs. Three are GGAAT* of lengths 2081, 614, and 581, and two are a 32-mer (the same as each other), lengths 751 and 1053. There are many other TRs with the same or similar sequence in the genome, but no other repeat content. Each has a different length. These total about 55% of the genome. (This is all recorded in FakeTRGenome/FakeTRGenome.truth.dat).

The GGAAT* repeats have about 4800 copies of GGAAT. The 32-mer repeats have about 800 copies of the same 32-mer.

@rsharris
Copy link
Author

I've come to a conclusion about RepLong and TRs -- correct me if I'm wrong. RepLong doesn't view TRs differently than any other sequence, and can be expected to report a TR only if another instance of it, with the same length, is duplicated in the genome.

In other words, if I have GGAAT* of lengths 600bp and 2000bp, RepLong considers these as distinct elements. Right?

@bzvew
Copy link
Collaborator

bzvew commented May 30, 2018

Definitely, RepLong does not have any prior structure information, and only output all the sequences which has more coverages in genome. I think there is much potential to improve it, and for your case it outputs a longer sequence than just the repeat region. The reason can be those output reads are in the network community which appears to be dense. If only parts of those sequences are repeats, there are still some possibility to favor the output of a longer sequence. I think it is a issue needs to be corrected. I create one issue #9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants