datajoint · kushalbakshi · Dec 27, 2024 · Jan 10, 2025 · Jan 23, 2025
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -17,5 +17,5 @@
     "[dockercompose]": {
         "editor.defaultFormatter": "disable"
     },
-    "files.autoSave": "off"
+    "files.autoSave": "afterDelay"
 }
diff --git a/docs/src/concepts/data-model.md b/docs/src/concepts/data-model.md
@@ -2,11 +2,23 @@
 
 ## What is a data model?
 
-A **data model** refers to a conceptual framework for thinking about data and about 
-operations on data.
-A data model defines the mental toolbox of the data scientist; it has less to do with 
-the architecture of the data systems, although architectures are often intertwined with 
-data models.
+A **data model** is a conceptual framework that defines how data is organized,
+represented, and transformed. It supplies the building blocks for designing the
+structure and operations of data management systems, promoting consistency and
+efficiency in data handling.
+
+Data management systems are built to accommodate these models, allowing us to manage
+data according to the principles laid out by the model. If you’re studying data science
+or engineering, you’ve likely encountered different data models, each providing a unique
+approach to organizing and manipulating data.
+
+A data model is characterized by its answers to the following questions:
+
++ What are the fundamental elements used to structure the data?
++ What operations are available for defining, creating, and manipulating the data?
++ What mechanisms exist to enforce the structure and rules governing valid data interactions?
+
+## Types of data models
 
 Among the most familiar data models are those based on files and folders: data of any 
 kind are lumped together into binary strings called **files**, files are collected into 
@@ -24,17 +36,16 @@ objects in memory with properties and methods for transformations of such data.
 ## Relational data model
 
 The **relational model** is a way of thinking about data as sets and operations on sets.
-Formalized almost a half-century ago 
-([Codd, 1969](https://dl.acm.org/citation.cfm?doid=362384.362685)), the relational data 
-model provides the most rigorous approach to structured data storage and the most 
-precise approach to data querying.
-The model is defined by the principles of data representation, domain constraints, 
-uniqueness constraints, referential constraints, and declarative queries as summarized 
-below.
+Formalized more than half a century ago ([Codd,
+1969](https://dl.acm.org/citation.cfm?doid=362384.362685)), the relational data model is
+one of the most powerful and precise ways to store and manage structured data. At its
+core, this model organizes all data into tables (representing mathematical relations),
+where each table consists of rows (representing mathematical tuples) and columns (often
+called attributes).
 
 ### Core principles of the relational data model
 
-**Data representation**
+**Data representation:**
   Data are represented and manipulated in the form of relations.
   A relation is a set (i.e. an unordered collection) of entities of values for each of 
   the respective named attributes of the relation.
@@ -43,26 +54,26 @@ below.
   A collection of base relations with their attributes, domain constraints, uniqueness 
   constraints, and referential constraints is called a schema.
 
-**Domain constraints**
-  Attribute values are drawn from corresponding attribute domains, i.e. predefined sets 
-  of values.
-  Attribute domains may not include relations, which keeps the data model flat, i.e. 
-  free of nested structures.
+**Domain constraints:** 
+  Each attribute (column) in a table is associated with a specific attribute domain (or
+  datatype, a set of possible values), ensuring that the data entered is valid.
+  Attribute domains may not include relations, which keeps the data model
+  flat, i.e. free of nested structures.
 
-**Uniqueness constraints**
+**Uniqueness constraints:**
   Entities within relations are addressed by values of their attributes.
   To identify and relate data elements, uniqueness constraints are imposed on subsets 
   of attributes.
   Such subsets are then referred to as keys.
   One key in a relation is designated as the primary key used for referencing its elements.
 
-**Referential constraints**
+**Referential constraints:**
   Associations among data are established by means of referential constraints with the 
   help of foreign keys.
   A referential constraint on relation A referencing relation B allows only those 
   entities in A whose foreign key attributes match the key attributes of an entity in B.
 
-**Declarative queries**
+**Declarative queries:**
   Data queries are formulated through declarative, as opposed to imperative, 
   specifications of sought results.
   This means that query expressions convey the logic for the result rather than the 
@@ -86,32 +97,76 @@ Similar to spreadsheets, relations are often visualized as tables with *attribut
 corresponding to *columns* and *entities* corresponding to *rows*.
 In particular, SQL uses the terms *table*, *column*, and *row*.
 
-## DataJoint is a refinement of the relational data model
-
-DataJoint is a conceptual refinement of the relational data model offering a more 
-expressive and rigorous framework for database programming 
-([Yatsenko et al., 2018](https://arxiv.org/abs/1807.11104)).
-The DataJoint model facilitates clear conceptual modeling, efficient schema design, and 
-precise and flexible data queries.
-The model has emerged over a decade of continuous development of complex data pipelines 
-for neuroscience experiments 
-([Yatsenko et al., 2015](https://www.biorxiv.org/content/early/2015/11/14/031658)).
-DataJoint has allowed researchers with no prior knowledge of databases to collaborate 
-effectively on common data pipelines sustaining data integrity and supporting flexible 
-access.
-DataJoint is currently implemented as client libraries in MATLAB and Python.
-These libraries work by transpiling DataJoint queries into SQL before passing them on 
-to conventional relational database systems that serve as the backend, in combination 
-with bulk storage systems for storing large contiguous data objects.
+## The DataJoint Model
+
+DataJoint is a conceptual refinement of the relational data model offering a more
+expressive and rigorous framework for database programming ([Yatsenko et al.,
+2018](https://arxiv.org/abs/1807.11104)). The DataJoint model facilitates conceptual
+clarity, efficiency, workflow management, and precise and flexible data
+queries. By enforcing entity normalization,
+simplifying dependency declarations, offering a rich query algebra, and visualizing
+relationships through schema diagrams, DataJoint makes relational database programming
+more intuitive and robust for complex data pipelines. 
+
+The model has emerged over a decade of continuous development of complex data
+pipelines for neuroscience experiments ([Yatsenko et al.,
+2015](https://www.biorxiv.org/content/early/2015/11/14/031658)). DataJoint has allowed
+researchers with no prior knowledge of databases to collaborate effectively on common
+data pipelines while sustaining data integrity and supporting flexible access. DataJoint is
+currently implemented as client libraries in MATLAB and Python. These libraries work by
+transpiling DataJoint queries into SQL before passing them on to conventional relational
+database systems that serve as the backend, in combination with bulk storage systems for
+storing large contiguous data objects.
 
 DataJoint comprises:
 
-- a schema [definition](../design/tables/declare.md) language
-- a data [manipulation](../manipulation/index.md) language
-- a data [query](../query/principles.md) language
-- a [diagramming](../design/diagrams.md) notation for visualizing relationships between 
++ a schema [definition](../design/tables/declare.md) language
++ a data [manipulation](../manipulation/index.md) language
++ a data [query](../query/principles.md) language
++ a [diagramming](../design/diagrams.md) notation for visualizing relationships between 
 modeled entities
 
 The key refinement of DataJoint over other relational data models and their 
 implementations is DataJoint's support of 
 [entity normalization](../design/normalization.md).
+
+### Core principles of the DataJoint model
+
+**Entity Normalization:**
+  DataJoint enforces entity normalization, ensuring that every entity set (table) is
+  well-defined, with each element belonging to the same type, sharing the same
+  attributes, and distinguished by the same primary key. This principle reduces
+  redundancy and avoids data anomalies, much as Boyce-Codd Normal Form does, but is
+  more intuitive to apply than normalization in traditional SQL schema design.
+
+**Simplified Schema Definition and Dependency Management:**
+  DataJoint introduces a schema definition language that is more expressive and less
+  error-prone than SQL. Dependencies are explicitly declared using arrow notation
+  (`->`), making referential constraints easier to understand and visualize. The
+  dependency structure is enforced as a directed acyclic graph, which simplifies
+  workflows by preventing circular dependencies.
+
+**Integrated Query Operators Producing a Relational Algebra:**
+  DataJoint introduces five query operators (restrict, join, project, aggregate, and
+  union) with algebraic closure, allowing them to be combined seamlessly. These
+  operators are designed to maintain operational entity normalization, ensuring query
+  outputs remain valid entity sets.
+
+**Diagramming Notation for Conceptual Clarity:**
+  DataJoint’s schema diagrams simplify the representation of relationships between
+  entity sets compared to ERM diagrams. Relationships are expressed as dependencies
+  between entity sets, which are visualized using solid or dashed lines for primary
+  and secondary dependencies, respectively.
+
+**Unified Logic for Binary Operators:**
+  DataJoint simplifies binary operations by requiring attributes involved in joins or
+  comparisons to be homologous (i.e., sharing the same origin). This avoids the
+  ambiguity and pitfalls of natural joins in SQL, ensuring more predictable query
+  results.
+
+**Optimized Data Pipelines for Scientific Workflows:**
+  DataJoint treats the database as a data pipeline where each entity set defines a
+  step in the workflow. This makes it ideal for scientific experiments and complex
+  data processing, such as in neuroscience. Its MATLAB and Python libraries transpile
+  DataJoint queries into SQL, bridging the gap between scientific programming and
+  relational databases.
diff --git a/docs/src/concepts/data-pipelines.md b/docs/src/concepts/data-pipelines.md
@@ -157,10 +157,10 @@ with external groups.
 ## Summary of DataJoint features
 
 1. A free, open-source framework for scientific data pipelines and workflow management
-1. Data hosting in cloud or in-house
-1. MySQL, filesystems, S3, and Globus for data management
-1. Define, visualize, and query data pipelines from MATLAB or Python
-1. Enter and view data through GUIs
-1. Concurrent access by multiple users and computational agents
-1. Data integrity: identification, dependencies, groupings
-1. Automated distributed computation
+2. Data hosting in cloud or in-house
+3. MySQL, filesystems, S3, and Globus for data management
+4. Define, visualize, and query data pipelines from MATLAB or Python
+5. Enter and view data through GUIs
+6. Concurrent access by multiple users and computational agents
+7. Data integrity: identification, dependencies, groupings
+8. Automated distributed computation
diff --git a/docs/src/concepts/teamwork.md b/docs/src/concepts/teamwork.md
@@ -5,10 +5,9 @@
 Science labs organize their projects as a sequence of activities of experiment design, 
 data acquisition, and processing and analysis.
 
-<figure markdown>
-     ![data science in a science lab](../images/data-science-before.png){: style="width:520px; align:center"}
-     <figcaption>Workflow and dataflow in a common findings-centered approach to data science in a science lab.</figcaption>
-</figure>
+![data science in a science lab](../images/data-science-before.png){: style="width:510px; display:block; margin: 0 auto;"}
+
+<figcaption style="text-align: center;">Workflow and dataflow in a common findings-centered approach to data science in a science lab.</figcaption>
 
 Many labs lack a uniform data management strategy that would span longitudinally across 
 the entire project lifecycle as well as laterally across different projects.
@@ -29,10 +28,9 @@ This approach requires formulating a general data science plan and upfront inves
 for setting up resources and processes and training the teams.
 The team uses DataJoint to build data pipelines to support multiple projects.
 
-<figure markdown>
-     ![data science in a science lab](../images/data-science-after.png){: style="width:510px; align:center"}
-     <figcaption>Workflow and dataflow in a data pipeline-centered approach.</figcaption>
-</figure>
+![data science in a science lab](../images/data-science-after.png){: style="width:510px; display:block; margin: 0 auto;"}
+
+<figcaption style="text-align: center;">Workflow and dataflow in a data pipeline-centered approach.</figcaption>
 
 Data pipelines support project data across their entire lifecycle, including the 
 following functions
@@ -55,42 +53,41 @@ data integrity.
 The adoption of a uniform data management framework allows separation of roles and 
 division of labor among team members, leading to greater efficiency and better scaling.
 
-<figure markdown>
-     ![data science vs engineering](../images/data-engineering.png){: style="width:350px; align:center"}
-     <figcaption>Distinct responsibilities of data science and data engineering.</figcaption>
-</figure>
+![data science vs engineering](../images/data-engineering.png){: style="width:510px; display:block; margin: 0 auto;"}
+
+<figcaption style="text-align: center;">Distinct responsibilities of data science and data engineering.</figcaption>
 
-Scientists
+### Scientists
 
-    design and conduct experiments, collecting data.
-    They interact with the data pipeline through graphical user interfaces designed by 
-    others.
-    They understand what analysis is used to test their hypotheses.
+Scientists design and conduct experiments, collecting data.
+They interact with the data pipeline through graphical user interfaces designed by 
+others.
+They understand what analysis is used to test their hypotheses.
 
-Data scientists
+### Data scientists
 
-    have the domain expertise and select and implement the processing and analysis 
-    methods for experimental data.
-    Data scientists are in charge of defining and managing the data pipeline using 
-    DataJoint's data model, but they may not know the details of the underlying 
-    architecture.
-    They interact with the pipeline using client programming interfaces directly from 
-    languages such as MATLAB and Python.
+Data scientists have the domain expertise to select and implement the processing and
+analysis methods for experimental data.
+Data scientists are in charge of defining and managing the data pipeline using 
+DataJoint's data model, but they may not know the details of the underlying 
+architecture.
+They interact with the pipeline using client programming interfaces directly from 
+languages such as MATLAB and Python.
 
-    The bulk of this manual is written for working data scientists, except for System 
-    Administration.
+The bulk of this manual is written for working data scientists, except for System 
+Administration.
 
-Data engineers
+### Data engineers
 
-    work with the data scientists to support the data pipeline.
-    They rely on their understanding of the DataJoint data model to configure and 
-    administer the required IT resources such as database servers, data storage 
-    servers, networks, cloud instances, [Globus](https://globus.org) endpoints, etc.
-    Data engineers can provide general solutions such as web hosting, data publishing, 
-    interfaces, exports and imports.
+Data engineers work with the data scientists to support the data pipeline.
+They rely on their understanding of the DataJoint data model to configure and 
+administer the required IT resources such as database servers, data storage 
+servers, networks, cloud instances, [Globus](https://globus.org) endpoints, etc.
+Data engineers can provide general solutions such as web hosting, data publishing, 
+interfaces, exports and imports.
 
-    The System Administration section of this tutorial contains materials helpful in 
-    accomplishing these tasks.
+The System Administration section of this manual contains materials helpful in 
+accomplishing these tasks.
 
 DataJoint is designed to delineate a clean boundary between **data science** and **data 
 engineering**.

diff --git a/docs/src/design/alter.md b/docs/src/design/alter.md
@@ -1 +1,53 @@
 # Altering Populated Pipelines
+
+Tables can be altered after they have been declared and populated. This is useful when
+you want to add new secondary attributes or change the data type of existing attributes.
+Users can use the `definition` property to update a table's attributes and then use
+`alter` to apply the changes in the database. Currently, `alter` does not support
+changes to primary key attributes.
+
+Let's say we have a table `Student` with the following attributes:
+
+```python
+@schema
+class Student(dj.Manual):
+    definition = """
+    student_id: int
+    ---
+    first_name: varchar(40)
+    last_name: varchar(40)
+    home_address: varchar(100)
+    """
+```
+
+We can modify the table to include a new attribute `email`:
+
+```python
+Student.definition = """
+student_id: int
+---
+first_name: varchar(40)
+last_name: varchar(40)
+home_address: varchar(100)
+email: varchar(100)
+"""
+Student.alter()
+```
+
+The `alter` method will update the table in the database to include the new attribute
+`email` added by the user in the table's `definition` property.
+
+Similarly, you can modify the data type or length of an existing attribute. For example,
+to alter the `home_address` attribute to have a length of 200 characters:
+
+```python
+Student.definition = """
+student_id: int
+---
+first_name: varchar(40)
+last_name: varchar(40)
+home_address: varchar(200)
+email: varchar(100)
+"""
+Student.alter()
+```
diff --git a/docs/src/design/integrity.md b/docs/src/design/integrity.md
@@ -1,6 +1,6 @@
 # Data Integrity
 
-The term **data integrity** describes  guarantees made by the data management process 
+The term **data integrity** describes guarantees made by the data management process 
 that prevent errors and corruption in data due to technical failures and human errors 
 arising in the course of continuous use by multiple agents.
 DataJoint pipelines respect the following forms of data integrity: **entity