Notes about a potential virus database
The backend server should be installable standalone, so that users can set up a server of their own for local use, either directly on their own machine or via a Docker image. It should also be able to scale to user needs.
Current planned backend setup: Cloudflare Workers (or just npm run dev for local environments), connected to Supabase (with a PostgreSQL database), with file storage on Cloudflare R2 (likely the local file system or some kind of mock S3 / R2 system for local environments). Look into libraries that help manage code complexity / abstract platform-specific syntax away.
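One way to keep platform-specific storage code out of the rest of the backend is a small storage interface with one implementation per environment. The sketch below is only an illustration: BlobStore, MemoryBlobStore, saveSequenceFile, and the key layout are hypothetical names, not part of any library, and an R2-backed or file-system-backed class would still need to be written against the same interface.

```typescript
// Hypothetical storage interface: application code depends only on this,
// not on R2, S3, or the local file system directly.
interface BlobStore {
  put(key: string, data: Uint8Array): Promise<void>;
  get(key: string): Promise<Uint8Array | null>;
  delete(key: string): Promise<void>;
}

// In-memory stand-in, intended for local development and tests only.
class MemoryBlobStore implements BlobStore {
  private blobs = new Map<string, Uint8Array>();

  async put(key: string, data: Uint8Array): Promise<void> {
    // Store a copy so later changes to the caller's buffer do not leak in.
    this.blobs.set(key, data.slice());
  }

  async get(key: string): Promise<Uint8Array | null> {
    const found = this.blobs.get(key);
    return found ? found.slice() : null;
  }

  async delete(key: string): Promise<void> {
    this.blobs.delete(key);
  }
}

// Example caller: the key layout ("sequences/<accession>.cram") is an assumption.
async function saveSequenceFile(
  store: BlobStore,
  accession: string,
  bytes: Uint8Array
): Promise<string> {
  const key = `sequences/${accession}.cram`;
  await store.put(key, bytes);
  return key;
}
```

Swapping environments would then mean passing a different BlobStore implementation into the same functions.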
CRAM format: a highly compressed, lossless, reference-based format for sequence data. Compression and decompression may carry higher processing overhead than less compressed formats.
For a possibly even more compact form (perhaps for importing / exporting data), look into AGC (Assembled Genomes Compressor): https://github.com/refresh-bio/agc.
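To make "reference-based" concrete: instead of storing every base of a sequence, only its position on a reference and the bases that differ are stored. The toy sketch below illustrates that idea only. It is not the CRAM encoding, handles substitutions but no insertions or deletions, and all names (EncodedRead, encodeRead, decodeRead) are hypothetical.

```typescript
// Toy reference-based encoding: a read is stored as its start position,
// its length, and the list of bases that differ from the reference.
interface EncodedRead {
  start: number;                  // 0-based offset on the reference
  length: number;                 // number of bases in the read
  diffs: Array<[number, string]>; // [offset within read, base in read]
}

function encodeRead(reference: string, read: string, start: number): EncodedRead {
  if (start < 0 || start + read.length > reference.length) {
    throw new Error("read does not fit on the reference at this position");
  }
  const diffs: Array<[number, string]> = [];
  for (let i = 0; i < read.length; i++) {
    if (read.charAt(i) !== reference.charAt(start + i)) {
      diffs.push([i, read.charAt(i)]);
    }
  }
  return { start, length: read.length, diffs };
}

function decodeRead(reference: string, encoded: EncodedRead): string {
  // Start from the reference slice, then re-apply the stored differences.
  const bases = reference.slice(encoded.start, encoded.start + encoded.length).split("");
  for (const [offset, base] of encoded.diffs) {
    bases[offset] = base;
  }
  return bases.join("");
}
```

The trade-off this hints at is the one noted above: the stored form is small when reads closely match the reference, but decoding needs the reference at hand and costs extra work.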
Querying By Sequence
Querying By Metadata
Live-Sequence Searching (?)
Searching For Mutations
Read Mapping
Multi-Sequence Alignment
Consensus Sequence Generation
Variant Calling
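As a minimal sketch of what "Searching For Mutations" could look like at its simplest, the function below lists point substitutions between a reference and a sample. It assumes the two sequences are already aligned to the same length, ignores insertions and deletions, and skips ambiguous N bases. The names Substitution and findSubstitutions are hypothetical.

```typescript
// One point substitution relative to the reference.
interface Substitution {
  position: number; // 1-based position on the reference
  ref: string;      // base in the reference
  alt: string;      // base in the sample
}

// Lists substitutions between two pre-aligned, equal-length sequences.
function findSubstitutions(reference: string, sample: string): Substitution[] {
  if (reference.length !== sample.length) {
    throw new Error("sequences must be aligned to the same length");
  }
  const found: Substitution[] = [];
  for (let i = 0; i < reference.length; i++) {
    const refBase = reference.charAt(i).toUpperCase();
    const altBase = sample.charAt(i).toUpperCase();
    // Skip ambiguous calls: an N carries no information about a mutation.
    if (refBase === "N" || altBase === "N") continue;
    if (refBase !== altBase) {
      found.push({ position: i + 1, ref: refBase, alt: altBase });
    }
  }
  return found;
}
```

A real pipeline would get the aligned input from the read mapping or multi-sequence alignment steps listed above, rather than assuming it.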
Will likely use PostgreSQL initially because it is an industry standard, has a comprehensive ecosystem, and supports full-text search (among other features). I (Daniel) am also most familiar with this database, so it will be the easiest way for me to get some kind of MVP out. It could later be augmented with Redis, a graph database, or a search engine database.
https://medium.com/codex/turn-neo4j-into-a-genome-browser-e94c7311dfab
https://medium.com/geekculture/analyzing-genomes-in-a-graph-database-27a45faa0ae8
https://neo4j.com/blog/geneweaver-building-a-graph-to-map-variants-to-genes-using-neo4j-4-x-and-bulk-import/
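For the PostgreSQL full-text search option mentioned above, a metadata query might be assembled roughly as follows. This is a sketch under assumptions: the table virus_metadata and its columns accession, description, and host are hypothetical, as is the helper buildMetadataSearch. Only to_tsvector, plainto_tsquery, and the @@ match operator are actual PostgreSQL features. The function returns the SQL text and its parameter values separately so that user input is never spliced into the SQL string.

```typescript
// A parameterized query: SQL text plus the values for $1, $2, ...
interface SqlQuery {
  text: string;
  values: unknown[];
}

// Builds a full-text search over a hypothetical virus_metadata table,
// optionally filtered by host organism.
function buildMetadataSearch(terms: string, host?: string, limit: number = 20): SqlQuery {
  const values: unknown[] = [terms];
  let where = "to_tsvector('english', description) @@ plainto_tsquery('english', $1)";

  if (host !== undefined) {
    values.push(host);
    where += ` AND host = $${values.length}`;
  }

  values.push(limit);
  const text =
    "SELECT accession, description FROM virus_metadata " +
    `WHERE ${where} ORDER BY accession LIMIT $${values.length}`;

  return { text, values };
}
```

For anything beyond an MVP, a stored tsvector column with a GIN index would likely be preferable to calling to_tsvector on every row at query time.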
Possible speedup when finding sequences in a large database.
Possible speedup when finding similar sequences (represented as vectors that point in a similar direction). Could be useful for speeding up variant calling.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870668 (Look into DNA sequence embedding models to turn sequences into vectors).
GeNemo: https://pubmed.ncbi.nlm.nih.gov/27098038/
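To illustrate the "vectors that point in a similar direction" idea without committing to any embedding model, the sketch below uses plain k-mer counts as a stand-in vector and compares two sequences by cosine similarity. This is an assumption made for illustration only: a learned embedding model or a vector database would replace kmerVector, and both function names are hypothetical.

```typescript
const BASES = "ACGT";

// Counts every k-mer in the sequence; the vector has 4^k entries,
// one per possible k-mer. Windows containing other characters are skipped.
function kmerVector(sequence: string, k: number): number[] {
  const size = Math.pow(4, k);
  const vec: number[] = [];
  for (let i = 0; i < size; i++) vec.push(0);

  const seq = sequence.toUpperCase();
  for (let i = 0; i + k <= seq.length; i++) {
    let index = 0;
    let valid = true;
    for (let j = 0; j < k; j++) {
      const code = BASES.indexOf(seq.charAt(i + j));
      if (code < 0) { valid = false; break; }
      index = index * 4 + code; // base-4 encoding of the k-mer
    }
    if (valid) vec[index] = (vec[index] ?? 0) + 1;
  }
  return vec;
}

// Cosine similarity: 1 means same direction, 0 means no shared k-mers.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

Comparing a query against every stored vector is still a linear scan, so any real speedup would come from an approximate nearest-neighbour index over such vectors.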
See Data Operations.
Svelte (likely).