Skip to content

Latest commit

 

History

History
305 lines (245 loc) · 13.5 KB

README.md

File metadata and controls

305 lines (245 loc) · 13.5 KB

Built-in variable added to bioawk

FILENUM added as variable that counts the files being processed. Useful when multiple files processed with different actions for each.

The FNR==NR test is usually sufficient to check that the first file is being processed. However, if the first file is empty and the second file is not the test fails and reports true for the second file not the first as intended. FILENUM==1 for the first file and FILENUM==2 for the second will work even if one of the files has no lines (i.e., records). [Note: gawk special rules BEGINFILE and ENDFILE are not in bioawk currently].

Functions added to bioawk

ctSkennerton's and derivative functions translate gffattr gtfattr samattr

translate gffattr originally added in https://github.com/ctSkennerton/bioawk and bawk script added here is bioawk_cas -c fastx "$@". Optional arg added to this version of gffattr. Also gtfattr to parse the slightly different syntax of gtf attributes and samattr for sam tags.

(1) translate(nucl_str [, table_num]) translates nucleotide string nucl_str. Returns the protein sequence.

bawk '{print ">"$name;print translate($seq)}' seq.fa.gz

can also use different translation tables. Here are the genetic code table numbers. To translate using the bacteria/archaea code table 11:

bawk '{print ">"$name;print translate($seq, 11)}' seq.fa.gz

(2) gffattr( attr_str, arr[, pos_arr]) or gtfattr( attr_str, arr[, pos_arr]) parses the attr_str and puts the values in the arr argument. The attr_str is expected to be in the format of the gff attribute field (gtf attribute field format for gtfattr). This field has subfields delimited by semi-colons where each subfield has a name and a value after an equal sign (space for gtfattr). Quotes removed around value for gtfattr.

Returns number of subfields, which is same as length(arr). Optional pos_array will put the first key name in pos_arr[1], second in pos_arr[2], etc. The pos_arr option added in bioawk_cas. For example, if this is the first line of the gff file:

NC_000085.6     Gnomon  gene    3064929 3075825 .       -       .       ID=gene-Gm38527;Dbxref=GeneID:102640308,MGI:MGI:5621412;Name=Gm38527;gbkey=Gene;gene=Gm38527;gene_biotype=lncRNA
bioawk_cas 'NR==1{
   gffattr($NF, arr)
   print arr["ID"] " is the ID\n"
   
   for(subfield in arr)
      print subfield " = " arr[subfield]
      
   print "\nOutput in same order as attr string"
   flds = gffattr($NF, arr, pos)
   for(f = 1; f <= flds; f++)
      printf "%d\t%s\t%s\n", f, pos[f], arr[ pos[f] ]
}' example.gff

outputs

gene-Gm38527 is the ID

gbkey = Gene
ID = gene-Gm38527
gene_biotype = lncRNA
Dbxref = GeneID:102640308,MGI:MGI:5621412
gene = Gm38527
Name = Gm38527

Output in same order as attr string
1	ID	gene-Gm38527
2	Dbxref	GeneID:102640308,MGI:MGI:5621412
3	Name	Gm38527
4	gbkey	Gene
5	gene	Gm38527
6	gene_biotype	lncRNA

Remaining functions have been added in bioawk_cas

(3) samattr(sam_line, arr[, pos_arr]) similar to gffattr and gtfattr except since tag fields are tab delimited, the entire line is provided to the function using the $0 variable or you can use fldcat(12,NF,"\t") as shown below. The type char is also in pos_arr with 'T' prefixed position, pos_arr["T1"], pos_arr["T2"], etc. For example, if this was the sam line:

ref1_grp1_p004  99      ref1    13      6       10M     =       37      34      CCGGGGATCC      ''''''''''      fa:f:1.38e-23   za:Z:xRG:Z:grp2 RG:Z:grp1       NM:i:0  MD:Z:10

then

bioawk_cas 'BEGIN{OFS="\t"}
!/^@/ {
    tot = samattr($0, ar, pos)
    for(t=1; t<=tot; t++) {
      key = pos[t]; val = ar[key]
      typ = pos["T" t]
      print t, key,  val, typ
    }
} ' oneliner_example.sam

outputs

1       fa      1.38e-23    f
2       za      xRG:Z:grp2  Z
3       RG      grp1    Z
4       NM      0   i
5       MD      10  Z

Miscellaneous functions systime md5 fldcat

(4) systime() returns the number of seconds since the Linux epoch. This is already in most other awk versions. Useful for timing.

(5) md5(str) returns the md5 code of the string argument. For example:

$ echo "example string to check" | bioawk_cas '{print; print "md5:", md5($0)}'
example string to check
md5: 59471d22e23e4198fd170cc7e4a58cbb

(6) fldcat(start_fldno, end_fldno[, separator]) return fields from start_fldno to end_fldno with separator given or OFS if no 3rd arg. This is useful to print fields starting from one field to the end of fields using NF as end_fldno. Or, for example you might give just 12th to final field to samattr instead of $0.

tag_flds = fldcat(12, NF, "\t")
tot_tags = samattr(tag_flds, ar, pos)
...

Character functions modstr setat charcount applytochars

A few of the functions were added since it is difficult or slow using substr() to modify or access the characters of a string.

(7) modstr takes 3 to 5 arguments modstr(str, start, length, [mod_type, str_length]) and is used for in-place string variable case modification. mod_type 0 to lowercase, 1 to uppercase (default 0). Optional str_length faster for multiple calls, so the length isn't recalculated every call.

(8) setat takes 3 or 4 arguments setat(str,pos,replacement[, optional repeat_count]) and does an in-place overwrite of a string with another string. Often used with a single character replacement string. The modified string is not changed in length.

(9) charcount(str, arr) fills arr with count of each character in str. returns number of different chars.

bioawk_cas 'BEGIN{OFS=":"
   exmp="AaGBCNdEfaGHNINNJ"
   charcount(exmp, arr)
   for(c in arr)
      print c, arr[c]
}' |  sort -t ":" -k2,2nr | tr "\n" " "

gives

N:4 G:2 a:2 A:1 B:1 C:1 E:1 H:1 I:1 J:1 d:1 f:1

(10) applytochars(str, stmt_or_func [,...]) call the 2nd and other arguments for each character in str with CHAR and ORD variables set.

bioawk_cas '
   function apply(){if(CHAR=="N")Ns++}
   function prt(s){print s}
   BEGIN{
      str="AaNBCNdEfNGHNINNJ"

      # applytochars(str, apply())  # example just using the function to count the Ns
      applytochars(str, Ns += CHAR=="N", prt(CHAR ":" ORD ":" or(ORD, 32))) # example using a statement and a function

      # CHAR and ORD still set to last char in string
      print "\nLast CHAR is:", CHAR, ORD, "or32", or(ORD, 32)
      print "tot Ns:", Ns
}'

gives

A:65:97
a:97:97
N:78:110
B:66:98
C:67:99
N:78:110
d:100:100
E:69:101
f:102:102
N:78:110
G:71:103
H:72:104
N:78:110
I:73:105
N:78:110
N:78:110
J:74:106

Last CHAR is: J 74 or32 106
tot Ns: 6

Search functions hamming edit_dist end_adapter_pos

(11) hamming( pattern, text [, text_pos: (1_indexed)default 1 [, case_sensitive: true [, N_wildcard: false] ]] ) compares the pattern of the first argument to the characters in the text of the second argument for the length of the pattern (up to any remaining characters in the text string).

Returns the number of mismatches.

Comparison starts at position of the optional third argument, text_pos. Default is to start at text string beginning which is text_pos 1. Optional arguments 4 and 5 allow for case insensitive comparisons and the ability to treat the N character as a wildcard. To use either of these a text_pos must be provided.

(12) edit_dist( max_editdist, str1, str1_match_len, str2[, str2_len [, mode: default 1 [, flags]]] )

       max_editdist: -1 means no max set. setting a max_editdist speeds up the search.   

       mode: 0 complete match, 1 prefix match, 2 infix match (add 10 or 20 for CIGAR). Can use string len -1 for full length.
       
       flags: 1 N matches ACTG; 2 Y matches CT, R matches AG; 3 both.
       
       Returns string starting with edit distance, -1 for no match 0 for exact match, and other values.

Example below looks for a PacBio amplification adapter in CCS reads. Search mode is 22: 2 for InFix (HW) search plus 20 to add the extended CIGAR match to the output. The edit_dist() call is in the adapterMatch function which is called for forward and for reverse compliment of the adapter.

bawk '
   function int_ceil(f) { add = !(f==int(f)); return int(f)+add }
   function adapterMatch(Adapter, AdapStr) {
      matchInfo = edit_dist(max_miss, Adapter, alen, $seq, slen, mode)
      if(matchInfo >= 0) {
         printf "%s %s %s", prefix, AdapStr, matchInfo
         prefix = "\t"
      }
   }

   BEGIN{InFix = 2; ExtCIGAR = 20; RegCIGAR = 10; extmode = InFix + ExtCIGAR; ComputeLen = -1

         adap="AAGCAGTGGTATCAACGCAGAGTACT"
         adap_rc=adap; revcomp(adap_rc)
         alen=length(adap)
         max_miss = int_ceil(alen/10) + 1 # little less than 90% match at worst

         #mode = InFix
         mode = extmode
   }
   # check for amplification adapter
   {  slen = length($seq)
      prefix = $name "\t"

      adapterMatch(adap,    "ampF")
      adapterMatch(adap_rc, "ampR")

      if(prefix == "\t")
         print "\t rdlen " slen
 }' subreads_ccs.fasta

This outputs in part

m64044_201011_075919/1/ccs       ampF 0 26= 1 26         ampR 0 26= 11994 12019  rdlen 12019
m64044_201011_075919/2/ccs       ampF 0 26= 1 26         ampR 0 26= 17194 17219  rdlen 17219
m64044_201011_075919/3/ccs       ampF 0 26= 1 26         ampR 2 12=1D7=1D7= 14871 14898  rdlen 14898
m64044_201011_075919/7/ccs       ampF 1 5=1I20= 1 25     ampR 3 8=1I4=1X5=1X6= 19327 19351       rdlen 19351
m64044_201011_075919/8/ccs       ampF 0 26= 1 26         ampR 1 25=1I 10536 10560        rdlen 10560
m64044_201011_075919/48/ccs      ampF 0 26= 1 26         ampR 0 26= 11256 11281  rdlen 11281
m64044_201011_075919/49/ccs      ampF 1 5=1I20= 1 25     ampR 2 19=1X5=1I 8912 8936 8912 8937 8912 8938  rdlen 8938
m64044_201011_075919/51/ccs      ampF 1 10=1D16= 1 27    ampR 0 26= 11678 11703  rdlen 11703
m64044_201011_075919/118/ccs     ampF 2 17=1X5=1D3= 1 27         ampR 0 26= 12642 12667  rdlen 12667
m64044_201011_075919/120/ccs     ampF 0 26= 1 26         ampR 1 4=1D22= 17923 17949      rdlen 17949
m64044_201011_075919/124/ccs     ampF 0 26= 1 26         ampR 0 26= 13041 13066  rdlen 13066

(13) end_adapter_pos checks the last 16 nt of the sequence against the first 16 nt of the adapter seq, then the last 15 nt of the read for the first 15 nt of the adapter and so-on until the last 4 nt of the read is checked with first 4 adapter nt.

Considered matched when an attempt has an acceptable hamming distance: 4 mismatches at 16 nt, 3 starting at 12 nt then 2 starting at 8 nt, 1 mismatch at 6 and 5 nt, no mismatch at 4 nt.

end_adapter_pos("", adapter) to set adapter, subsequently end_adapter_pos(seq) to check seq suffix against adapter prefix.

Returns string with 3 numbers: position of match, len, mismatches. If no match, position is set to -1.

For example

bawk 'BEGIN{end_adapter_pos("", "GATCGGAAGAGCACAC")} # set to check first 16 bases of the TruSeq Indexed Adapter

   (rslt=end_adapter_pos($seq)) > 0 {  # check the last 16 to last 4 bases of $seq against the adapter head for close match
   
      split(rslt, ar, " ")
      pos = ar[1]; mlen = ar[2]; mismatches = ar[3]
      
      printf "%s\trec %s:  \t%s %s %s\n", $name, NR, pos, mlen, mismatches
      
 }' reads_L4_R1.fq.gz

outputs in part

E00489:558:H7N5LCCX2:4:1101:24718:1344  rec 94:         139 11 3
E00489:558:H7N5LCCX2:4:1101:27600:1344  rec 103:        143 7 0
E00489:558:H7N5LCCX2:4:1101:23531:1397  rec 165:        146 4 0
E00489:558:H7N5LCCX2:4:1101:26951:1415  rec 189:        134 16 3
E00489:558:H7N5LCCX2:4:1101:23429:1432  rec 217:        135 15 0
E00489:558:H7N5LCCX2:4:1101:25733:1450  rec 245:        143 7 2
E00489:558:H7N5LCCX2:4:1101:24698:1485  rec 287:        145 5 0

Top line tells us record 94 has the first 11 bases of the adapter matching with 3 errors starting at position 139 of the read.

I have used this with libraries, especially mate-pair, after trimming with Trimmomatic to do an additional clean-up at the end of reads when the matched adapter length is below the horizon of the other tool.

(14) find_codons searches a nucleotide string for codon equivalents of an amino acid string. It places the positions where the matched codons occur into an array as keys and the codons as the value. It finds all such matches in the input nucleotide string. Returns number of matches found and placed into the array. Optional hamming threshold argument.

The amino acid sequence can contain a dot "." to match any run of three nucleotides, including a stop codon. You can use translate and index to check for this. Optional hamming threshold allows the input number of codon mismatches to occur and still consider the AA sequence matched. You can revcomp the sequence and use find_codons again to search in both directions.

find_codons(nucleotides_to_search, AA_pattern, result_arr[, optional hamming threshold, defaults to 0])

bawk '
{
   n = find_codons($seq, "M.PSVLLLL", aCodons)
   for (key in aCodons) {
      val = aCodons[key]
      print $name, key, val, translate(val)
   }
}' GRCm39_genomic.fna | sort -k1,1V -k2,2n

outputs

NC_000068.8	103773480	ATGCTCCCTTCAGTGTTGCTCCTTCTT	MLPSVLLLL
NC_000073.7	25597110	ATGGAGCCCAGTGTCCTGCTCCTCCTT	MEPSVLLLL
NC_000073.7	25649678	ATGGAGCCCAGTGTTCTACTCCTCCTT	MEPSVLLLL
NC_000073.7	25760945	ATGGATCCTAGTGTCCTGCTCCTCCTT	MDPSVLLLL
NC_000073.7	25872859	ATGGATCCTAGTGTCCTGCTCCTCCTT	MDPSVLLLL
NC_000076.7	51816292	ATGTGGCCCTCAGTACTCCTGCTTCTG	MWPSVLLLL
NC_000078.7	45856902	atgATGCcatcagttcttttattattg	MMPSVLLLL
NC_000078.7	105618399	ATGAGCCCCtccgtcctcctcctcctg	MSPSVLLLL