Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different length in bedgraph output #12

Open
laugon10 opened this issue May 19, 2021 · 7 comments
Open

Different length in bedgraph output #12

laugon10 opened this issue May 19, 2021 · 7 comments

Comments

@laugon10
Copy link

Hi,
I have run the pipeline with paired end data. I have bedgraphs with different column lengths for each sample. Should not be equal since I have used the same reference GATC.gff?

thank you very much,
Laura

@owenjm
Copy link
Owner

owenjm commented May 19, 2021

Hi Laura,

Do you mean different row lengths, rather than column lengths? (i.e. the output of "wc -l" for each file is different?)

If so, yes, all files should certainly have the same number of GATC fragments and thus the same number of rows. Can you tell me how extensive the difference is?

Also, were there any errors or issues noted in the log files?

Thanks,
Owen

@laugon10
Copy link
Author

Hi Owen,
That is what I meant, sorry.
This is an example of wc -l from three replicates:
cat rep1/.bedgraph | wc -l
394070
cat rep2/
.bedgraph | wc -l
394149
cat rep3/*.bedgraph | wc -l
394109

I have tried both with my own made GATC file and with the one provided by you for Dmel_BDGP6.GATC.gff.gz.
I have notice that I have some warnings in the log file with both GATC files. "Warning: alignment contains chromosome identities not found in GATC file:" I am attaching the log of one of the samples with mi own GATC file.

Thank you very much!

damidseq_pipeline v1.4.6
Command-line options: --gatc_frag_file=/mnt/reposervicesdg/repo_servicios/DamIDseq/damidseq_pipeline/files_TaDA/dmel_r6.36.GATC.gff --bowtie2_genome_dir=/mnt/reposervicesdg/repo_servicios/DamIDseq/damidseq_pipeline/files_TaDA/Dm_r6.36/dmel_r6.36 --bowtie2_path=/mnt/nfs_comun/dreamgenics/apps/bowtie2/2.2.3/ --samtools_path=/mnt/nfs_comun/dreamgenics/apps/samtools/1.11/ --bedtools_path=/mnt/nfs_comun/dreamgenics/apps/bedtools/ --paired --threads=40 --out_name=Alarm-1

*** Reading data files ***

Using samtools version 1.11

*** Aligning files with bowtie2 ***
Now working on Alarm ...
27760163 reads; of these:
  27760163 (100.00%) were paired; of these:
    3021003 (10.88%) aligned concordantly 0 times
    23809193 (85.77%) aligned concordantly exactly 1 time
    929967 (3.35%) aligned concordantly >1 times
    ----
    3021003 pairs aligned concordantly 0 times; of these:
      1861497 (61.62%) aligned discordantly 1 time
    ----
    1159506 pairs aligned 0 times concordantly or discordantly; of these:
      2319012 mates make up the pairs; of these:
        1620967 (69.90%) aligned 0 times
        523947 (22.59%) aligned exactly 1 time
        174098 (7.51%) aligned >1 times
97.08% overall alignment rate

Now working on Dam ...
28150072 reads; of these:
  28150072 (100.00%) were paired; of these:
    2177565 (7.74%) aligned concordantly 0 times
    23650699 (84.02%) aligned concordantly exactly 1 time
    2321808 (8.25%) aligned concordantly >1 times
    ----
    2177565 pairs aligned concordantly 0 times; of these:
      1078138 (49.51%) aligned discordantly 1 time
    ----
    1099427 pairs aligned 0 times concordantly or discordantly; of these:
      2198854 mates make up the pairs; of these:
        1394253 (63.41%) aligned 0 times
        515389 (23.44%) aligned exactly 1 time
        289212 (13.15%) aligned >1 times
97.52% overall alignment rate


*** Reading GATC file ***
*** Calculating bins ***
Now working on Alarm ...
  Generating bins from Alarm.bam ...
  24739162 mapped fragments
  Paired-end reads detected in bamfile
  Converting to GATC resolution ...                 
  Warning: alignment contains chromosome identities not found in GATC file:
  211000022278711, Unmapped_Scaffold_17_D1756_D1775, 211000022278890, 211000022278935, 211000022278254, 211000022279743, 211000022279095, 211000022278151, 211000022280626, 211000022280014, 211000022279479, 211000022279376, 2Cen_mapped_Scaffold_10_D1684, 211000022278313, 211000022279286, 211000022280211, Unmapped_Scaffold_37_D1608, Y_mapped_Scaffold_26_D1717, 3Cen_mapped_Scaffold_31_D1643_D1653_D1791, Unmapped_Scaffold_24_D1707, 211000022279308, 211000022279050, Unmapped_Scaffold_11_D1754, 2Cen_mapped_Scaffold_43_D1668, 211000022279695, 211000022279711, 211000022279693, 211000022278993, Y_mapped_Scaffold_20_D1762_D1719, 211000022280198, 211000022279403, 211000022278164, 211000022279310, 211000022278241, 211000022278338, 211000022278132, Unmapped_Scaffold_28_D1723, 3Cen_mapped_Scaffold_50_D1686, Unmapped_Scaffold_44_D1670, 211000022278571, 211000022278210, 3Cen_mapped_Scaffold_27_D1777, Unmapped_Scaffold_52_D1739, 211000022278370, 211000022279057, Unmapped_Scaffold_29_D1705, 211000022278888, 211000022280196, Unmapped_Scaffold_8_D1580_D1567, Y_mapped_Scaffold_30_D1720, 211000022278311, 211000022278521, 211000022280033, 211000022278178, 211000022279336, Y_mapped_Scaffold_53_D1765, 211000022280596, 3Cen_mapped_Scaffold_1_D1896_D1895, 211000022280570, 211000022278091, 211000022280644, 211000022279117, Unmapped_Scaffold_45_D1673, 211000022280294, 211000022279692, 211000022280112, Y_mapped_Scaffold_15_D1727, Y_mapped_Scaffold_5_D1748_D1610, Unmapped_Scaffold_4_D1555_D1692, Y_mapped_Scaffold_9_D1573, mitochondrion_genome, Unmapped_Scaffold_35_D1599, 2R2_mapped_Scaffold_56_D1828, 211000022279235, 211000022279116, 211000022278453, 211000022279051, 211000022278672, 211000022278673, 211000022279805, 211000022278503, 211000022280292, 211000022280124, 211000022279842, 211000022278712, XY_mapped_Scaffold_42_D1648, 211000022278859, 211000022278161, 211000022278891, 211000022280176, 211000022278253, Unmapped_Scaffold_46_D1675, 211000022279782, 211000022280336, 211000022278501, 211000022279096, 211000022279694, Y_mapped_Scaffold_12_D1771, Y_mapped_Scaffold_23_D1638, Y_mapped_Scaffold_21_D1683_D1693, 211000022280212, Unmapped_Scaffold_60_D1601, Unmapped_Scaffold_38_D1625, 211000022280192, Unmapped_Scaffold_22_D1753, 211000022278860, Unmapped_Scaffold_48_D1678, 211000022278889, 3Cen_mapped_Scaffold_41_D1641, Y_mapped_Scaffold_18_D1698, 211000022278522, X3X4_mapped_Scaffold_6_D1712, 3Cen_mapped_Scaffold_36_D1605, 211000022279575, 211000022279089, X3X4_mapped_Scaffold_14_D1732, Y_mapped_Scaffold_34_D1584, 211000022280193, 211000022278611, 211000022279373, 211000022280071, Unmapped_Scaffold_54_D1776, 211000022278224, 211000022278456, Unmapped_Scaffold_58_D1862, 211000022279097, XY_mapped_Scaffold_7_D1574, Unmapped_Scaffold_13_D1782, 211000022278309, 211000022280297, 211000022278349, 211000022279158, Unmapped_Scaffold_51_D1697, 211000022279134, Unmapped_Scaffold_32_D1773, 211000022279159


Now working on Dam ...
  Generating bins from Dam.bam ...
  25972509 mapped fragments
  Paired-end reads detected in bamfile
  Converting to GATC resolution ...                 
  Warning: alignment contains chromosome identities not found in GATC file:
  211000022278711, Unmapped_Scaffold_17_D1756_D1775, 211000022278890, 211000022280177, 211000022278935, 211000022278254, 211000022279743, 211000022278151, 211000022279095, 211000022280014, 211000022279376, 2Cen_mapped_Scaffold_10_D1684, 211000022278313, 211000022279286, 211000022280211, Unmapped_Scaffold_37_D1608, Y_mapped_Scaffold_26_D1717, 3Cen_mapped_Scaffold_31_D1643_D1653_D1791, 211000022278652, Unmapped_Scaffold_24_D1707, 211000022279308, 211000022279050, Unmapped_Scaffold_11_D1754, 211000022279604, 2Cen_mapped_Scaffold_43_D1668, 211000022279695, 211000022279711, 211000022279693, 211000022278993, Y_mapped_Scaffold_20_D1762_D1719, 211000022280198, 211000022279403, 211000022278164, 211000022279310, 211000022278241, 211000022278338, Unmapped_Scaffold_28_D1723, 211000022278132, 3Cen_mapped_Scaffold_50_D1686, Unmapped_Scaffold_44_D1670, 211000022278210, 3Cen_mapped_Scaffold_27_D1777, Unmapped_Scaffold_52_D1739, 211000022279057, Unmapped_Scaffold_29_D1705, 211000022278888, 211000022280196, Unmapped_Scaffold_8_D1580_D1567, Y_mapped_Scaffold_30_D1720, 211000022278521, 211000022278311, 211000022280033, 211000022278178, 211000022279336, Y_mapped_Scaffold_53_D1765, 211000022280596, 3Cen_mapped_Scaffold_1_D1896_D1895, 211000022280570, 211000022278091, 211000022280644, 211000022279117, Unmapped_Scaffold_45_D1673, 211000022280294, 211000022279692, 211000022280112, Y_mapped_Scaffold_15_D1727, Y_mapped_Scaffold_5_D1748_D1610, Unmapped_Scaffold_4_D1555_D1692, Y_mapped_Scaffold_9_D1573, mitochondrion_genome, Unmapped_Scaffold_35_D1599, 2R2_mapped_Scaffold_56_D1828, 211000022279235, 211000022279116, 211000022278453, 211000022279051, 211000022278672, 211000022278862, 211000022278673, 211000022279805, 211000022278503, 211000022278416, 211000022280292, 211000022279842, 211000022280124, 211000022278712, XY_mapped_Scaffold_42_D1648, 211000022278859, 211000022278161, 211000022278891, 211000022280176, Unmapped_Scaffold_46_D1675, 211000022279782, 211000022280336, 211000022278501, 211000022279165, 211000022279096, 211000022279694, Y_mapped_Scaffold_12_D1771, Y_mapped_Scaffold_23_D1638, Y_mapped_Scaffold_21_D1683_D1693, 211000022280212, Unmapped_Scaffold_60_D1601, Unmapped_Scaffold_38_D1625, 211000022280192, Unmapped_Scaffold_22_D1753, 211000022278860, Unmapped_Scaffold_48_D1678, 211000022278889, 3Cen_mapped_Scaffold_41_D1641, Y_mapped_Scaffold_18_D1698, 211000022278522, X3X4_mapped_Scaffold_6_D1712, 3Cen_mapped_Scaffold_36_D1605, 211000022279575, 211000022279089, X3X4_mapped_Scaffold_14_D1732, Y_mapped_Scaffold_34_D1584, 211000022278611, 211000022279373, 211000022280071, Unmapped_Scaffold_54_D1776, 211000022278224, 211000022278456, Unmapped_Scaffold_58_D1862, 211000022279097, XY_mapped_Scaffold_7_D1574, Unmapped_Scaffold_13_D1782, 211000022278309, 211000022280297, 211000022278349, 211000022279158, Unmapped_Scaffold_51_D1697, 211000022279134, Unmapped_Scaffold_32_D1773, 211000022279159


*** Calculating quantiles ***
Now working on Alarm ...
Sorting ...
   Quantile 0.1: 1.50
   Quantile 0.2: 4.00
   Quantile 0.3: 9.25
   Quantile 0.4: 18.62
   Quantile 0.5: 35.19
   Quantile 0.6: 65.50
   Quantile 0.7: 122.80
   Quantile 0.8: 243.33
   Quantile 0.9: 566.00
   Quantile 1.0: 23237.00

Now working on Dam ...
Sorting ...
   Quantile 0.1: 7.00
   Quantile 0.2: 16.54
   Quantile 0.3: 30.00
   Quantile 0.4: 48.50
   Quantile 0.5: 74.57
   Quantile 0.6: 112.00
   Quantile 0.7: 167.50
   Quantile 0.8: 258.00
   Quantile 0.9: 433.00
   Quantile 1.0: 28701.50

*** Calculating normalisation factor ***
Now working on Alarm ...
  Spearman's correlation: 0.66  
  Norm factor = 2.09 based off 223973 frags (total 394069)

*** Normalising ***
Processing sample: Alarm ...
  ... normalising by 2.09

*** Generating ratios ***
Now working on Alarm ...
  Reading Dam ...
  Reading Alarm ...
  ... adding 130.08 pseudocounts to each sample

All done.

@owenjm
Copy link
Owner

owenjm commented May 20, 2021

Hi Laura,

That's very strange -- this shouldn't happen, and I can't see anything that's obviously wrong.

If you have all your bedgraphs in the same directory, can you run the following one-liner in the directory and send me the output? (This should output any GATC fragments that are not fully shared between the files, which will hopefully help me understand what's going on here a bit more):

perl -e 'for$f(@ARGV){open($IN,"<",$f);while(<$IN>){next if m/^tr/;($c,$s,$e)=split;$g{"$c:$s-$e"}++;}}for(sort{($a=~m/.*?:(\d+)/)[0]<=>($b=~m/.*?:(\d+)/)[0]}keys%g){print"$_\n"if$g{$_}<@ARGV}' *.bedgraph

Thanks,
Owen

@laugon10
Copy link
Author

Hi Owen,
Thank you very much for your attention.
I am attaching the output. I have 4 different samples with 3 replicates each. I hope we can find the problem.

output.txt

@owenjm
Copy link
Owner

owenjm commented May 21, 2021

Ok, that really helps a lot. As you probably noticed, all of those non-matching fragments are coming from unmapped scaffolds in the genome assembly. My pre-built GATC files exclude these, so in theory if you run the pipeline on the BAMs using the GATC file from the damidseq_pipeline repository you should get files with matching lengths. That should, at least, fix the problem for now (and please let me know if it doesn't?) (And can I also confirm that the GATC file you build included the unmapped scaffolds?)

As to why this has happened in the first place -- is there any chance that either the bowtie2 alignment indices or the GATC file may not been the same for all samples? If this isn't the case, then I should check how very small scaffolds are handled if there is no mapping data at all. It's possible that scaffolds that fail to map any reads at all are excluded, which could also explain what has happened.

@laugon10
Copy link
Author

Hi Owen,
Thank you very much. I have run 2 replicates with your pre-built GATC and I get the same number of lines. I can confirm that my GATC contained the un mapped scaffolds, I am quite new to this and I have not realized that could be a problem.

To the second part. Everything was run with the same bowtie2 index and GATC files. I am sure because all was done at the same time and I have the commands stored.

Thank you very much for your attention, I really appreciate! I hope everything goes well now on.
Laura

@owenjm
Copy link
Owner

owenjm commented May 22, 2021

That's great to hear, Laura. It's not that you're new to this, though -- it would appear that there's a bug in my code in dealing with chromosomes without any mapping data (a highly unlikely occurrence with the main chromosomes, but not at all impossible when including the small, often heterochromatic, unmapped scaffolds).

It should be fairly straightforward to ensure any chromosomes without data are still included in the output file, so I'll look into it further and update once fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants