
Add fasta2 package #337

Closed
wants to merge 3 commits into from

Conversation

@folago (Contributor) commented Aug 23, 2023

This is exploratory work and does not have to be merged in this state.

There are for sure some functionalities missing, as well as docs and examples, but this is more or less the direction I would go with a fasta package.

I called it fasta2 just to have it side by side with the existing package and maybe try both APIs; if some of this package is liked, we can decide how much of it to keep and integrate or discard.

Here is a list of things in no particular order I do not like about the status quo that I'm trying to fix in the fasta2 package:

  • fasta.Fasta stutters, so I would rename it to fasta.Record. The Name field should also be called Identifier, as in the fastq package, or Header, as mentioned in the BioStar Handbook.


  • I do not understand the gzipReaderFn, openFn, and buildFn functions.
  • Build() does not sound very idiomatic for serialization; to serialize to bytes I used Bytes() as a method of Record.
  • Test data should be in the testdata directory.
  • ParseConcurrent() is IMHO the nicest parser implementation, so I extracted its state into a struct and added some methods, making it the only parsing code to maintain.
  • As for the rest of the concurrent functions: having channels in the API is frowned upon; it is better to provide a clean API that is easy to use with channels.
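To make the first two bullets concrete, here is a minimal sketch of what the renamed type could look like. The field and method names are my reading of the proposal above, not the actual fasta2 code:

```go
package main

import (
	"bytes"
	"fmt"
)

// Record is a sketch of the rename proposed above: fasta.Record instead of
// the stuttering fasta.Fasta, with Identifier instead of Name. Field names
// here are illustrative assumptions, not the actual fasta2 API.
type Record struct {
	Identifier string
	Sequence   string
}

// Bytes serializes the record in FASTA format, replacing the old Build().
func (r Record) Bytes() []byte {
	var buf bytes.Buffer
	buf.WriteByte('>')
	buf.WriteString(r.Identifier)
	buf.WriteByte('\n')
	buf.WriteString(r.Sequence)
	buf.WriteByte('\n')
	return buf.Bytes()
}

func main() {
	rec := Record{Identifier: "seq1", Sequence: "ATGC"}
	fmt.Printf("%q\n", rec.Bytes()) // ">seq1\nATGC\n"
}
```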

Some other thoughts:

  • Does it make sense to have a Parser interface in the io package and have implementations of it in each subpackage?
  • Is it necessary to have the gzip functions? I understand that they can be convenient, but then should we also add other compression algorithms?

Please let me know what you think.

Comment on lines 48 to 57
// Writes the Record []byte representation to the passed io.Writer.
func (r Record) Write(w io.Writer) error {
	recBytes := r.Bytes()
	_, err := w.Write(recBytes)
	if err != nil {
		return fmt.Errorf("error writing record to io.Writer: %w", err)
	}

	return nil
}
Contributor

Write materializes the bytes in memory (in the bytes buffer) before actually writing them. Why not just write directly to the io.Writer?

Contributor Author

Yeah, and also call it WriteTo and implement the io.WriterTo interface.
I mostly tried to avoid too many allocations: since io.Writer takes a []byte as argument, the record needs to be serialized into a []byte first, and there was already a way to get the record as a []byte.

I was actually just looking at it right now; I might send a fixup later.

Contributor Author

Also, writing to a bytes.Buffer almost always returns a nil error, so it's just simpler.

@folago (Contributor Author) commented Aug 24, 2023

I sent a small fixup. I think at this point the Record.Write() method can be removed, as the user has a way of getting both a string and a []byte representation of the Record.
For bulk writing, the Write() function now reuses the same buffer (and buffering the I/O is usually a good thing™).

It would be cool to have a generic Record.Write() method that takes either a bytes.Buffer or a bufio.Writer, so we can just decide to pass the buffer or wrap the writer. Unfortunately I am traveling tomorrow until next week, so no time this weekend 🙁

@Koeng101 (Contributor) commented Aug 24, 2023

Generally, I really like the idea of changing the name of fasta.Fasta to fasta.Record. I think we should implement this in fasta.

I don't really understand those "gzipReaderFn, openFn, and buildFn functions" either. If there are two of us, others probably won't either, and this decreases readability. I think we should remove them.

Does it make sense to have a Parser interface in the io package and have implementation of it in each sub package?

This is something I'd really like to do! Though I think it would probably be a generic of some kind? Basically, I think for every io (genbank, fasta, fastq, slow5, sam, pileup) we should implement the following standard functions:

type Parser[T genbank | fasta | fastq | slow5 | sam | pileup, TH any] interface { // TH should be the respective header type
	Header() (TH, error)
	Next() (T, error)
	MaxLineCount() int64
	Reset() error
}

or maybe something like that - I am not sure what is most readable. Of course, the records / alignments / reads themselves would implement the Write interface, which honestly I think we should just standardize to take an io.Writer. Generics should at least allow us to build on top of that io.Writer in a way that we aren't rebuilding the same function six times.

Thoughts? Happy to help prototype this this weekend.

Is it necessary to have the gzip functions? I understand that it can be convenient but then should we add some other compression algorithms?

GZIP is so widely used for fastq that a reasonable default would be to read .gz files rather than raw text, if it weren't for the fact that it wouldn't make as much sense to programmers (though it would make perfect sense to end users: you pretty much only push sequencing data around as fastq).

Still, I think this should probably be implemented on top of the parsers, or somewhere else. It's kind of ridiculous that we have to implement functions like reading gz six times. If we had this in one place, it'd be a lot easier to build functions that could compress in many different ways.

For the io.Writer stuff, I have a small PR doing that here: https://github.com/TimothyStiles/poly/pull/337/files . There is one important bug fix in there too: https://github.com/TimothyStiles/poly/blob/5cde1b2232331be07625786b6201c7c0dce1717e/io/fastq/fastq.go#L179-L185 . I have no idea why it helped, but it did, though only for my very large files. I think it has something to do with memory, so it was hard to reproduce in an individual test.

@carreter (Collaborator)

@Koeng101 Is there anything in this PR that wasn't included in #339?

@Koeng101 (Contributor)

Everything here is implemented in #339, I think, but #339 started because of this PR. I'd like to give some credit somehow, even if there isn't significant code overlap.

@TimothyStiles (Collaborator) commented Sep 16, 2023

We should be able to do that @Koeng101. We just need to add @folago's GitHub email in a certain way to the bottom of the PR when we merge it.

@carreter carreter added the duplicate This issue or pull request already exists label Sep 16, 2023
@Koeng101 (Contributor)

Aight, we will include that then. Hope that works, @folago! I appreciate your contributions!

@Koeng101 Koeng101 closed this Sep 17, 2023
@folago (Contributor Author) commented Sep 18, 2023

Hey,
sure all good on my side.
Thanks a lot 🙂

4 participants