Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible bug handling reads with clipping when converting mapped reads to insertion plot #135

Open
emgemg opened this issue Jan 25, 2024 · 1 comment

Comments

@emgemg
Copy link

emgemg commented Jan 25, 2024

I recently installed biotradis using conda

Name Version Build Channel
biotradis 1.4.5 hdfd78af_5 bioconda

I noticed that I had an issue with reads with clipping (which I think you've seen before) where reads with soft clipping at the 5' end take the first mapped nucleotide as the insertion site, resulting in an artificially inflated "unique insertions" count. I can see the fix in the Cigar.pm script,
Screenshot 2024-01-25 at 3 32 41 pm
but found that this fix wasn't in the conda-installed hdfd78af_5 build.
I tried a few conda envs with alternate versions with no success, but possibly some user error on my part somewhere, so in any case I downloaded Bio-Tradis-master directly from github to test again.

This time with the correct Cigar.pm, I have a different error. When I view the plot file in artemis, it looks like all insertion sites have been artificially shifted by a consistent factor, while the bam file viewed in IGV is unchanged Screenshot 2024-01-25 at 3 40 17 pm
I think (but I'm not familiar enough with perl to be sure I've understood the code correctly) the issue is I have reads mapping across the annotation origin, such that their cigar is something like "77H32M". Because these are the first reads encountered, they seem to offset the rest of the data by "77". I wondered if the issue is here

$current_coordinate += $number;

In any case, I wondered if there might be an option to exclude all reads with clipping (hard and soft)? I know it's very conservative, but even when I don't have the issue with annotation-origin-spanning reads, I found that the following

                                $current_coordinate -= $number if($results{start} == 0);
                                $results{start} = $current_coordinate if($results{start} == 0);
				$current_coordinate += $number;
                                $results{end} = $current_coordinate -1 if($results{end} < $current_coordinate);

reported false insertion sites for "chimeric reads". As in, if a read had the cigar 40S71M, the insertion coordinate would be adjusted by 40 to give a tn-insertion event 40bp upstream, yet when I look at the individual read it's not a real transposon-junction, and I'm still getting artificially inflated insertion reporting. (This particular library is intentionally not very diverse, so doesn't need much sequencing coverage, but as a consequence is more susceptible to "noise")

I hope this makes sense, I'd be happy to try and explain some more. Maybe this is just specific to my data(?) but posting here just in case. Thanks

@lbarquist
Copy link
Contributor

Hi,

So, there are some issues with handling soft-clipping, and this has been exacerbated by the switch to BWA. I would try using the original default mapper, smalt (--smalt), and setting --smalt_y 1; this should filter out anything that doesn't have an exact match in the genome and will hopefully resolve the problem you're having. It may be worth doing some adapter/quality trimming if you're getting low mapping numbers.

-Lars

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants