📝 Polish

laminlabs · Nov 24, 2024 · a8accfb
1 parent 926a52a
commit a8accfb
Showing 2 changed files with 19 additions and 13 deletions.
diff --git a/docs/glossary.md b/docs/glossary.md
@@ -9,13 +9,16 @@ GUI
     Graphical user interface, for instance, a browser-based data catalog.
 
 feature
-    A feature is an individual measurable property of a phenomenon [[Wikipedia](https://en.wikipedia.org/wiki/Feature_(machine_learning))], a measured event like a microscopy image or transcriptomic readout of a biological system.
+    A feature is a property of a measurement [[Wikipedia](https://en.wikipedia.org/wiki/Feature_(machine_learning))]. It's equivalent to a {term}`variable` in statistics and is typically equated with a dimension of a dataset.
 
-    It's equivalent to the term "independent {term}`variable`" in statistics, but is the preferred term to denote dimensions of "feature spaces" in machine learning.
+    LaminDB comes with a {class}`~lamindb.Feature` registry to organize dataset dimensions and equates them with statistical variables.
 
 label
     A label refers to a descriptor or tag that is assigned to something to describe, identify, or categorize it.
 
+lakehouse
+    A data lakehouse combines the flexibility and cost-effectiveness of a data lake with the data management and ACID transaction support of a data warehouse, enabling both structured and unstructured data analytics in a single framework. Some of the early lakehouse frameworks were Databricks' [Delta Lake](https://delta.io/), Google's [BigLake](https://cloud.google.com/biglake), and Amazon's [Lake Formation](). Later examples include [Dremio](https://www.dremio.com/), [Starburst](https://www.starburst.io/) and others. Here is a [blog post](https://cloud.google.com/blog/products/data-analytics/unify-data-lakes-and-warehouses-with-biglake-now-generally-available) from Google, a [blog post](https://aws.amazon.com/blogs/big-data/build-a-lake-house-architecture-on-aws/) from AWS, a [glossary entry](https://www.databricks.com/glossary/data-lakehouse) and a [paper](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf) from Databricks.
+
 ORM
     Object-relational mapper. In LaminDB every sub-class of `Record` (every instance of `Registry`) is an ORM that corresponds to a SQL table in the underlying metadata database [wikipedia](https://en.wikipedia.org/wiki/Object%E2%80%93relational_mapping).
 
@@ -25,12 +28,9 @@ observation
     In biology, an observation typically corresponds to measuring (reading out) a set of properties from a biological sample.
 
 record
-    A record is a data structure that consists in [fields](https://en.wikipedia.org/wiki/Field_(computer_science)), typically of different types but in a fixed sequence [[Wikipedia](https://en.wikipedia.org/wiki/Record_(computer_science))].
-
-    Importantly, we refer to instances of [Registry](https://lamin.ai/docs/lamindb.core.registry) as records. Once a record is inserted into a database table, it becomes a row in that table.
-    Every `Registry` class (in LaminDB) has a 1:1 correspondence with a database table and a django [model](https://docs.djangoproject.com/en/4.2/topics/db/models/), every row in a database table has a 1:1 correspondence with a record.
+    A record is a data structure that consists of a sequence of typed [fields](https://en.wikipedia.org/wiki/Field_(computer_science)) that store values [[Wikipedia](https://en.wikipedia.org/wiki/Record_(computer_science))].
 
-    A record often stores jointly measured {term}`variables <variable>` in its fields, but in general allows updating fields when more information becomes available or changes.
+    In LaminDB, a metadata record is modeled as a {class}`~lamindb.Record`.
 
 sample
     In biology, a sample is an instance or part of a biological system.

diff --git a/docs/introduction.ipynb b/docs/introduction.ipynb
@@ -16,21 +16,27 @@
     "# Introduction\n",
     "\n",
     "LaminDB is an open-source data framework to make computational biology more robust, scalable, and understandable.\n",
-    "It provides a queryable data lakehouse, tracks data sources & transformations, offers data curation & transfer tools, and helps managing experiments and ontologies.\n",
+    "It provides a queryable {term}`lakehouse`, tracks data sources & transformations, offers data curation & transfer tools, and helps manage experimental metadata and ontologies.\n",
     "Datasets, models, and code in LaminDB are findable, accessible, interoperable, and reusable (FAIR).\n",
     "\n",
     ":::{dropdown} Why?\n",
     "\n",
     "<img src=\"https://lamin-site-assets.s3.amazonaws.com/.lamindb/BunYmHkyFLITlM5MYQck.svg\" width=\"350px\" style=\"background: transparent\" align=\"right\">\n",
     "\n",
-    "Biological data are often poorly organized. \n",
+    "In many organizations, biological datasets are mismanaged, not queryable, and not standardized.\n",
     "It's often difficult to reproduce analytical results or understand how a dataset was processed.\n",
-    "And it's typically hard to train models on historical data, orthogonal assays, or datasets generated by other teams.\n",
+    "It's typically hard to train models on historical data, orthogonal assays, or datasets generated by other teams.\n",
     "\n",
-    "Datasets have so far been managed with plain file systems, data objects for individual datasets (`DataFrame`, `AnnData`, etc.), {term}`GUI`-focused community platforms, structure-less data lakes, rigorous data warehouses (SQL, monolithic arrays), and data lakehouses that only know tabular data.\n",
+    "Datasets have so far been managed with versioned storage systems (file systems, object storage, git, dvc), {term}`GUI`-focused community platforms, structure-less data lakes, rigorous data warehouses (SQL, monolithic arrays), and data lakehouses for tabular data.\n",
     "\n",
-    "LaminDB provides a lakehouse framework that models biological data objects in the rich context that collaborative research requires.\n",
-    "It provides enough structure to enable queries and enough freedom to keep the pace of R&D high.\n",
+    "LaminDB goes beyond these systems with a lakehouse that models biological data objects beyond tables with enough structure to enable queries and enough freedom to keep the pace of R&D high.\n",
+    "\n",
+    "For data objects like `DataFrame`, `AnnData`, `.zarr`, `.tiledbsoma`, etc., LaminDB tracks and provides the rich context that collaborative biological research requires:\n",
+    "\n",
+    "- data lineage: data sources and transformations; scientists and machine learning models\n",
+    "- domain knowledge and experimental metadata: the features and labels derived from domain entities\n",
+    "- data curation: validation, standardization, and annotation\n",
+    "- data transfer: simple sharing of datasets with their metadata context\n",
     "\n",
     "In this [blog post](https://lamin.ai/blog/problems), we discuss a breadth of data management problems of the field.\n",
     "\n",