Commit

📝 Polish
falexwolf committed Nov 24, 2024
1 parent 926a52a commit a8accfb
Showing 2 changed files with 19 additions and 13 deletions.
14 changes: 7 additions & 7 deletions docs/glossary.md
@@ -9,13 +9,16 @@ GUI
Graphical user interface, for instance, a browser-based data catalog.
feature
A feature is an individual measurable property of a phenomenon [[Wikipedia](https://en.wikipedia.org/wiki/Feature_(machine_learning))], a measured event like a microscopy image or transcriptomic readout of a biological system.
A feature is a property of a measurement [[Wikipedia](https://en.wikipedia.org/wiki/Feature_(machine_learning))]. It's equivalent to a {term}`variable` in statistics and is typically equated with a dimension of a dataset.
It's equivalent to the term "independent {term}`variable`" in statistics, but is the preferred term to denote dimensions of "feature spaces" in machine learning.
LaminDB comes with a {class}`~lamindb.Feature` registry to organize dataset dimensions and equates them with statistical variables.
label
A label refers to a descriptor or tag that is assigned to something to describe, identify, or categorize it.
lakehouse
A data lakehouse combines the flexibility and cost-effectiveness of a data lake with the data management and ACID transaction support of a data warehouse, enabling both structured and unstructured data analytics in a single framework. Some of the early lakehouse frameworks were Databricks' [Delta Lake](https://delta.io/), Google's [BigLake](https://cloud.google.com/biglake), Amazon's [Lake Formation](). Later examples include [Dremio](https://www.dremio.com/), [Starburst](https://www.starburst.io/) and others. Here is a [blog post](https://cloud.google.com/blog/products/data-analytics/unify-data-lakes-and-warehouses-with-biglake-now-generally-available) from Google, a [blog post](https://aws.amazon.com/blogs/big-data/build-a-lake-house-architecture-on-aws/) from AWS, a [glossary entry](https://www.databricks.com/glossary/data-lakehouse) and a [paper](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf) from Databricks.
ORM
Object-relational mapper. In LaminDB every sub-class of `Record` (every instance of `Registry`) is an ORM that corresponds to a SQL table in the underlying metadata database [wikipedia](https://en.wikipedia.org/wiki/Object%E2%80%93relational_mapping).
@@ -25,12 +28,9 @@ observation
In biology, an observation typically corresponds to measuring (reading out) a set of properties from a biological sample.
record
A record is a data structure that consists in [fields](https://en.wikipedia.org/wiki/Field_(computer_science)), typically of different types but in a fixed sequence [[Wikipedia](https://en.wikipedia.org/wiki/Record_(computer_science))].
Importantly, we refer to instances of [Registry](https://lamin.ai/docs/lamindb.core.registry) as records. Once a record is inserted into a database table, it becomes a row in that table.
Every `Registry` class (in LaminDB) has a 1:1 correspondence with a database table and a django [model](https://docs.djangoproject.com/en/4.2/topics/db/models/), every row in a database table has a 1:1 correspondence with a record.
A record is a data structure that consists in a sequence of typed [fields](https://en.wikipedia.org/wiki/Field_(computer_science)) that store values [[Wikipedia](https://en.wikipedia.org/wiki/Record_(computer_science))].
A record often stores jointly measured {term}`variables <variable>` in its fields, but in general allows updating fields when more information becomes available or changes.
In LaminDB, a metadata record is modeled as a {class}`~lamindb.Record`.
sample
In biology, a sample is an instance or part of a biological system.
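
Reviewer's note on the `record` and `ORM` entries in this hunk: the relationship they describe (a record is a sequence of typed fields, and a record stored in a database becomes a row in a table) can be illustrated with a minimal sketch that uses only the Python standard library. This is not LaminDB's implementation; `SampleRecord`, its fields, and the `insert` helper are hypothetical names chosen for illustration.

```python
import sqlite3
from dataclasses import astuple, dataclass, fields
from typing import Optional


@dataclass
class SampleRecord:  # hypothetical record type, not a LaminDB class
    name: str
    tissue: Optional[str] = None
    n_cells: Optional[int] = None


def insert(conn, record):
    """Store a record as one row in a table named after its class."""
    table = type(record).__name__
    columns = [f.name for f in fields(record)]
    # One table per record class, one column per field.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"INSERT INTO {table} VALUES ({placeholders})", astuple(record))


record = SampleRecord(name="biopsy-1")
record.tissue = "liver"  # fields can be updated as more information arrives

conn = sqlite3.connect(":memory:")
insert(conn, record)
row = conn.execute("SELECT name, tissue, n_cells FROM SampleRecord").fetchone()
```

Here the class plays the role of the table definition and each instance plays the role of a row, which is the correspondence the `ORM` entry attributes to sub-classes of `Record`.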
18 changes: 12 additions & 6 deletions docs/introduction.ipynb
@@ -16,21 +16,27 @@
"# Introduction\n",
"\n",
"LaminDB is an open-source data framework to make computational biology more robust, scalable, and understandable.\n",
"It provides a queryable data lakehouse, tracks data sources & transformations, offers data curation & transfer tools, and helps managing experiments and ontologies.\n",
"It provides a queryable {term}`lakehouse`, tracks data sources & transformations, offers data curation & transfer tools, and helps managing experimental metadata and ontologies.\n",
"Datasets, models, and code in LaminDB are findable, accessible, interoperable, and reusable (FAIR).\n",
"\n",
":::{dropdown} Why?\n",
"\n",
"<img src=\"https://lamin-site-assets.s3.amazonaws.com/.lamindb/BunYmHkyFLITlM5MYQck.svg\" width=\"350px\" style=\"background: transparent\" align=\"right\">\n",
"\n",
"Biological data are often poorly organized. \n",
"In many organizations, biological datasets are mismanaged, not queryable, and not standardized.\n",
"It's often difficult to reproduce analytical results or understand how a dataset was processed.\n",
"And it's typically hard to train models on historical data, orthogonal assays, or datasets generated by other teams.\n",
"It's typically hard to train models on historical data, orthogonal assays, or datasets generated by other teams.\n",
"\n",
"Datasets have so far been managed with plain file systems, data objects for individual datasets (`DataFrame`, `AnnData`, etc.), {term}`GUI`-focused community platforms, structure-less data lakes, rigorous data warehouses (SQL, monolithic arrays), and data lakehouses that only know tabular data.\n",
"Datasets have so far been managed with versioned storage systems (file systems, object storage, git, dvc), {term}`GUI`-focused community platforms, structure-less data lakes, rigorous data warehouses (SQL, monolithic arrays), and data lakehouses for tabular data.\n",
"\n",
"LaminDB provides a lakehouse framework that models biological data objects in the rich context that collaborative research requires.\n",
"It provides enough structure to enable queries and enough freedom to keep the pace of R&D high.\n",
"LaminDB goes beyond these systems with a lakehouse that models biological data objects beyond tables with enough structure to enable queries and enough freedom to keep the pace of R&D high.\n",
"\n",
"For data objects like `DataFrame`, `AnnData`, `.zarr`, `.tiledbsoma`, etc., LaminDB tracks and provides the rich context that collaborative biological research requires:\n",
"\n",
"- data lineage: data sources and transformations; scientists and machine learning models\n",
"- domain knowledge and experimental metadata: the features and labels derived from domain entities\n",
"- data curation: validation, standardization, and annotation\n",
"- data transfer: simple sharing of datasets with their metadata context\n",
"\n",
"In this [blog post](https://lamin.ai/blog/problems), we discuss a breadth of data management problems of the field.\n",
"\n",