-
-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fasta2 package #337
Add fasta2 package #337
Conversation
io/fasta2/fasta.go
Outdated
// Writes the Record []byte representation to the passed io.Writer. | ||
func (r Record) Write(w io.Writer) error { | ||
recBytes := r.Bytes() | ||
_, err := w.Write(recBytes) | ||
if err != nil { | ||
return fmt.Errorf("error writing record to io.Writer: %w", err) | ||
} | ||
|
||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Write transforms the bytes into memory (in the bytes buffer) before actually writing them. Why not just directly write to the io Writer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah and also call it WriteTo
and implement the io.WriterTo
interface.
I mostly tried to avoid too many allocations since io.Writer
takes a []byte
as arguments so it need to be serialized fist in a []byte
, and there was already a way to get the record a []byte
.
I was actually just looking at it right now, i might send a fixup later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also writing to byte.Buffer
returns almost always nil
errors so it's just simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sent a small fixup, I think at this point the Record.Write()
method can be removed, as the user has a way of getting a string
and []byte
representation of the Record
.
For bulk writing the Write()
function is now reusing the same buffer (and buffering the I/O is usually a goodthing™).
It would be cool to have a generic Record.Write()
method that takes either a bytes.Buffer
or a bufio.Writer
so we can just decide to pass the buffer or wrap the writer. I unfortunately am traveling tomorrow until next week so no time this weekend 🙁
Generally, I really like the idea of changing the name of fasta.Fasta to fasta.Record. I think we should implement this in I don't really understand those "gzipReaderFn, openFn, and buildFn functions" either - if there are two of us, probably others wouldn't either, and this would decrease readability. I think we should remove them.
This is something I'd really like to do! Though I think it would probably be a generic of some kind? Basically, I think for every io (genbank, fasta, fastq, slow5, sam, pileup) we should implement the following standard functions: type Parser[T genbank | fasta | fastq | slow5 | sam | pileup, TH any] struct { // TH should have their respective headers
Header() (TH, error)
Next() (T, error)
MaxLineCount() int64
Reset() error
} or maybe something like that - I am not sure what is most readable. Of course, the records / alignments / reads themselves would implement the Thoughts? happy to help prototype this this weekend.
GZIP is so widely used by fastq at least, a reasonable default would be to read gz files, not raw text, if it wasn't for the fact that it wouldn't make as much sense to programmers (but make perfect sense to end users - you pretty much only push around sequencing as fastq). Still, it think this should probably be implemented on top of or somewhere else. It's kinda ridiculous that we have to implement functions like reading gz 6 times. If we had this in one place, it'd be a lot easier to build functions that could compress in may different ways. For the io.Writer stuff, I have a small PR doing that here - https://github.com/TimothyStiles/poly/pull/337/files . There is one important bug fix in there too - https://github.com/TimothyStiles/poly/blob/5cde1b2232331be07625786b6201c7c0dce1717e/io/fastq/fastq.go#L179-L185 - which I have no idea why it helped but it did, but only for my very large files. I think it has something to do with memory, so it was hard to reproduce in an individual test. |
Aight, we will include that then. Hope that works @folago ! I appreciate your contributions! |
Hey, |
This is exploratory work and does not have to be merged in this state.
There are for sure some functionalities missing as well as docs and examples, but this more or less the direction I would go about a fasta package.
I called it
fasta2
just to have it side by side and maybe try both API, but in case some of this package is liked we can decide how much of it to keep and integrate or discard.Here is a list of things in no particular order I do not like about the status quo that I'm trying to fix in the
fasta2
package:fasta.Fasta
stutters so I would rename it tofasta.Record
also theName
field should be calledIdentifier
as in thefastq
package orHeader
as mentioned in the BioStar handbook.gzipReaderFn
,openFn
, andbuildFn
functions.Build()
does not sound vary idiomatic for serialization in bytes I usedBytes()
as a method ofRecord
.testdata
directory.ParseConcurrent()
is IMHO the nicest parser implementation, so I extracted the state in a struct and added some methods to get it to be the only parsing code to maintain.Some other thoughts:
Parser
interface in theio
package and have implementation of it in each sub package?Please let me know what you think.