Masked sequences #5
Comments
This is great and it would fit in perfectly with this project. Maybe with the 6 extra codes afforded by the 4-bit representation of [...] I suppose there could be something clever to do with [...]
If you want to add this codec in then a pull request is certainly welcome. The one API thing to consider is that maybe we want a [...]
Hey @J-Wall, I've made a commit that has a masked encoding module. Here's what usage looks like:
    #[cfg(test)]
    mod tests {
        use crate::codec::masked;
        use crate::prelude::*;

        #[test]
        fn mask_sequence() {
            let seq = Seq::<masked::Dna>::try_from("A.TCGCgtcataN--A").unwrap();
            assert_eq!(seq.mask().to_string(), "a.tcgcGTCATAn--a".to_string());
        }

        #[test]
        fn masked_revcomp() {
            let seq = Seq::<masked::Dna>::try_from("A.TCGCgtcataN--A").unwrap();
            assert_eq!(seq.revcomp().to_string(), "T--NtatgacGCGA.T".to_string());
        }
    }
If instead of sequential bit sequences we choose these patterns, we get efficient [...]
It's quite lucky: there are 2 groups of bit sequences with the symmetry that satisfies the square relationships we need for 4 nucleotides, one maskable group that's its own complement,
    N = 0b0000,
    #[display('n')]
    NMasked = 0b1111,
    #[display('-')]
    #[alt(0b0011)]
    Gap = 0b1100,
    #[display('.')]
    #[alt(0b0101)]
    Pad = 0b1010,
which leaves us with two extra symbols that complement to themselves but mask to each other:
    #[display('?')]
    Unknown1 = 0b0110,
    #[display('!')]
    Unknown2 = 0b1001,
Any ideas what we could use them for? I tried to have a think about masked IUPAC encodings but I got a headache. I think the 5-bit version makes the most sense. Have you tried it out?
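A minimal standalone sketch of how such a choice of bit patterns could work, assuming complement is a reversal of the 4 bits and toggling the mask is a bitwise NOT. The codes for N, n, gap, pad and the two unknown symbols are the ones quoted above; the nucleotide codes and every function name here are assumptions for illustration, not the crate's actual implementation.

```rust
/// Complement as a 4-bit reversal (assumed operation).
fn complement(code: u8) -> u8 {
    code.reverse_bits() >> 4
}

/// Toggle the soft mask by flipping all 4 bits (assumed operation).
fn toggle_mask(code: u8) -> u8 {
    !code & 0b1111
}

/// Hypothetical code assignment. Only N, n, '-' and '.' come from the thread.
fn encode(c: char) -> Option<u8> {
    Some(match c {
        'A' => 0b1000, 'C' => 0b0100, 'G' => 0b0010, 'T' => 0b0001,
        'a' => 0b0111, 'c' => 0b1011, 'g' => 0b1101, 't' => 0b1110,
        'N' => 0b0000, 'n' => 0b1111,
        '-' => 0b1100, '.' => 0b1010,
        _ => return None,
    })
}

/// Gap and pad each accept an alternate pattern, so they survive both operations.
fn decode(code: u8) -> Option<char> {
    Some(match code {
        0b1000 => 'A', 0b0100 => 'C', 0b0010 => 'G', 0b0001 => 'T',
        0b0111 => 'a', 0b1011 => 'c', 0b1101 => 'g', 0b1110 => 't',
        0b0000 => 'N', 0b1111 => 'n',
        0b1100 | 0b0011 => '-',
        0b1010 | 0b0101 => '.',
        0b0110 => '?', 0b1001 => '!',
        _ => return None,
    })
}

/// Invert the soft mask of every symbol; None if a character is not in the alphabet.
fn mask(s: &str) -> Option<String> {
    s.chars().map(|c| decode(toggle_mask(encode(c)?))).collect()
}

/// Reverse complement; None if a character is not in the alphabet.
fn revcomp(s: &str) -> Option<String> {
    s.chars().rev().map(|c| decode(complement(encode(c)?))).collect()
}
```

Under these assumptions, the two strings from the tests above come out the same way: masking "A.TCGCgtcataN--A" gives "a.tcgcGTCATAn--a" and its reverse complement is "T--NtatgacGCGA.T".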
Hi @jeff-k, this looks really cool.
I tried to look online for any standard or commonly used characters. I think [...]
Regarding the API, I think maybe rename [...]
I wrote the relationships out, and with a 5-bit representation, there are 6 "square-relationship groups" and 4 "maskable self-complementing groups". Unfortunately the IUPAC alphabet would require 7 "square-relationship groups" and (at least) 2 "maskable self-complementing groups" ([...]
I also wrote out the 6-bit relationships. There you have 13 "square groups" and 6 "maskable self-complement groups". So you can easily fit it into that, with enough spare to also include [...]
Reduced IUPAC
This exercise got me thinking about whether there is an even more compact representation of the masked IUPAC alphabet. This is a bit lossy and might be a bit specific to my application, but bear with me: I think some software treats the 3-fold degenerate bases ([...]
Yeah, I guess inverting the mask isn't actually as useful as setting it or unsetting it over a region. So we'll want to implement [...]
What's appealing about the 5-bit alphabet is that the leading bit sets the mask, so implementing [...]
I really like the idea of the degenerate encodings. Actually, I've been playing with a 1-bit encoding of [...]
Similarly for amino acids. If we encode amino acids into 8 categories (hydrophobic/positive charge etc.) that's also 1 bit per base. Less than 1 bit with 4 categories. I think some crystallography data only resolves residues at that level of detail.
Anyway, a big motivation for this crate is to be able to swap out encodings at leisure so I'm all for showing off the possibilities.
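As a rough sketch of setting and unsetting the mask over a region when the leading bit of a 5-bit code is the mask flag: the functions below work on unpacked codes, one per byte, which is a simplification. The names `MASK_BIT`, `set_mask`, `unset_mask` and `is_masked` are hypothetical and not part of the crate.

```rust
use std::ops::Range;

/// Assumed layout: bit 4 is the soft-mask flag, bits 3..0 hold the symbol.
const MASK_BIT: u8 = 0b1_0000;

/// Soft-mask every code in `region` by setting the leading bit.
/// Panics if `region` is out of bounds, like any slice index.
fn set_mask(codes: &mut [u8], region: Range<usize>) {
    for code in &mut codes[region] {
        *code |= MASK_BIT;
    }
}

/// Remove the soft mask from every code in `region` by clearing the leading bit.
fn unset_mask(codes: &mut [u8], region: Range<usize>) {
    for code in &mut codes[region] {
        *code &= !MASK_BIT;
    }
}

/// True if the leading (mask) bit is set.
fn is_masked(code: u8) -> bool {
    code & MASK_BIT != 0
}
```

Both operations are idempotent, unlike inverting the mask, so applying `set_mask` to an already masked region leaves it unchanged.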
For the 5-bit masked IUPAC encoding, if we put the mask bit in the centre of the sequence we'll get a cheap reverse complement operation:
although I guess the appropriate thing to do would be to set up a proper benchmark suite before going further into this kind of optimisation
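One way the centre-mask idea could look, as a sketch rather than the code from the comment above: assume the five bits are laid out as A, C, mask, G, T, with each IUPAC symbol being the set of bases it allows. Reversing the five bits then swaps A with T and C with G while leaving the mask bit where it is. The layout, the `IUPAC` table and all function names are assumptions for illustration.

```rust
// Assumed layout, high bit to low bit: A, C, mask, G, T.
const A: u8 = 0b10000;
const C: u8 = 0b01000;
const MASK: u8 = 0b00100;
const G: u8 = 0b00010;
const T: u8 = 0b00001;

/// Each IUPAC symbol as the set of bases it stands for.
const IUPAC: [(char, u8); 15] = [
    ('A', A), ('C', C), ('G', G), ('T', T),
    ('R', A | G), ('Y', C | T), ('S', C | G), ('W', A | T),
    ('K', G | T), ('M', A | C),
    ('B', C | G | T), ('D', A | G | T), ('H', A | C | T), ('V', A | C | G),
    ('N', A | C | G | T),
];

/// Lowercase input sets the centre mask bit.
fn encode(c: char) -> Option<u8> {
    let upper = c.to_ascii_uppercase();
    let &(_, bits) = IUPAC.iter().find(|(sym, _)| *sym == upper)?;
    Some(if c.is_ascii_lowercase() { bits | MASK } else { bits })
}

fn decode(code: u8) -> Option<char> {
    let &(sym, _) = IUPAC.iter().find(|(_, bits)| *bits == code & !MASK)?;
    Some(if code & MASK != 0 { sym.to_ascii_lowercase() } else { sym })
}

/// Complement by reversing the 5 low bits; the centre bit maps to itself.
fn complement(code: u8) -> u8 {
    code.reverse_bits() >> 3
}

/// Reverse complement; None if a character is not an IUPAC symbol.
fn revcomp(s: &str) -> Option<String> {
    s.chars().rev().map(|c| decode(complement(encode(c)?))).collect()
}
```

In a packed buffer, reversing all of the bits would in principle reverse the symbol order and complement each symbol in one pass, ignoring alignment padding. Whether that is actually faster is the sort of thing a benchmark would have to show.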
Thanks for sharing this crate with the community Jeff. It looks really clean!
I have some suggestions for additional codecs to support masked sequences. Of course, I can easily just implement these in my own application thanks to your Codec derive macro, but following the philosophy of this crate, it might make sense to just implement them once in case others want to use them.
Often, we deal with DNA sequences which can be soft-masked, represented by lowercase, e.g. [...] or hard-masked, represented by N/n. Sometimes both at the same time.
Therefore we have an alphabet of 10 characters: A, C, G, T, N, a, c, g, t, and n.
2⁴ would cover that with 6 spare, so you could throw in gap (-) and padding (.) characters: [...]
The obvious extension (and what I actually have a use-case for) is masked IUPAC sequences. Something like [...]
Finally, one could add 3-bit representations of HardMaskedDna (which allows N, and maybe -/. but not lowercase letters), and 3-bit SoftMaskedDna (which allows acgt, but not N or -/.).
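The bit counts above can be checked with a couple of helper functions. This is only the arithmetic, with hypothetical names `bits_needed` and `spare_codes`; it says nothing about whether a given alphabet can also satisfy the complement and mask symmetries discussed in the comments.

```rust
/// Smallest number of bits that can give each of `symbols` symbols a distinct code.
fn bits_needed(symbols: usize) -> u32 {
    if symbols <= 1 {
        0
    } else {
        usize::BITS - (symbols - 1).leading_zeros()
    }
}

/// Codes left unused at that bit width.
fn spare_codes(symbols: usize) -> usize {
    (1usize << bits_needed(symbols)) - symbols
}
```

For the 10-character alphabet ACGTNacgtn this gives 4 bits with 6 spare codes. ACGTN plus gap and padding is 7 symbols, which fits in 3 bits with 1 spare, and ACGTacgt is 8 symbols, which fills 3 bits exactly.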