Add pre-commit hook for codespell #307

Open · wants to merge 6 commits into base: main
4 changes: 4 additions & 0 deletions .pre-commit-config.yaml
@@ -1,6 +1,10 @@
 # See https://pre-commit.com for more information
 # See https://pre-commit.com/hooks.html for more hooks
 repos:
+  - repo: https://github.com/codespell-project/codespell
+    rev: v2.3.0
+    hooks:
+      - id: codespell
   - repo: https://github.com/pre-commit/pre-commit-hooks
     rev: v4.6.0
     hooks:
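For anyone reviewing this locally, a minimal sketch of exercising the new hook (assuming `pre-commit` is available in the environment):

```bash
# Register the git hook defined in .pre-commit-config.yaml
pip install pre-commit
pre-commit install

# Run only the codespell hook against every file in the repository
pre-commit run codespell --all-files
```

Once installed, the hook also runs automatically on staged files at each commit.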
4 changes: 2 additions & 2 deletions docs/clay-v0/clay-v0-location-embeddings.ipynb
@@ -223,15 +223,15 @@
 "id": "3384f479-ef84-420d-a4e9-e3b038f05497",
 "metadata": {},
 "source": [
-"> Latitude & Longitude map to 768 dimentional vector"
+"> Latitude & Longitude map to 768 dimensional vector"
 ]
 },
 {
 "cell_type": "markdown",
 "id": "9e419fc9-e7d3-49de-a8ea-72912c365510",
 "metadata": {},
 "source": [
-"## Preform PCA over the location embeddings to visualize them in 2 dimension"
+"## Perform PCA over the location embeddings to visualize them in 2 dimension"
 ]
 },
 {
2 changes: 1 addition & 1 deletion docs/clay-v0/data_labels.md
@@ -26,7 +26,7 @@ then followed the same `datacube` creation logic to generate datacubes with
 Sentinel-1 VV and VH and the Copernicus Digital Elevation Model (DEM). We also
 ensured that the Sentinel-1 data was within a +/- 3 day interval of each
 reference Sentinel-2 scene (same method used by the benchmark dataset authors)
-and that the Sentinel-1 data was indeed already included in the bechmark
+and that the Sentinel-1 data was indeed already included in the benchmark
 dataset's list of granules. The datacubes generated have all three inputs
 matching the exact specs of the Foundation model's training data, at 512x512
 pixels.
6 changes: 3 additions & 3 deletions docs/clay-v0/model_finetuning.md
@@ -2,7 +2,7 @@

 Fine-tuning refers to a process in machine learning where a pre-trained model
 is further trained on a specific dataset to adapt its parameters to a
-downstream task characterized by a relevent domain. It's distinct from training
+downstream task characterized by a relevant domain. It's distinct from training
 a model from scratch using the downstream task dataset exclusively.

 Related to finetuning in the field of training Foundation models is linear
@@ -21,7 +21,7 @@ the Foundation model both during its pre-training and afterwards.
 Let's take a look at how we are finetuning on the benchmark datacube-adapted
 [Cloud to Street - Microsoft Flood Dataset](https://beta.source.coop/repositories/c2sms/c2smsfloods).
 As a reminder, that is a downstream
-segmentation task for identifiying water pixels in recorded flood events. It's
+segmentation task for identifying water pixels in recorded flood events. It's
 a binary segmentation problem, specifically.

 We process the datacubes into batches formatted in the way the pretrained Clay
@@ -150,7 +150,7 @@ segmentation problem, and on the predictions, we run sigmoid and max functions
 to obtain final segmentation results.

 The way we measure relative performance between the finetuned and
-"from scratch" model variants happens through calculation of evalution metrics
+"from scratch" model variants happens through calculation of evaluation metrics
 common for segmentation, such as Dice coefficient, Intersection over Union, F1
 score, precision and recall.

2 changes: 1 addition & 1 deletion docs/clay-v0/patch_level_cloud_cover.ipynb
@@ -698,7 +698,7 @@
 "id": "bd3d1cc1-9d79-4059-a1f6-4ac8cf4d2e51",
 "metadata": {},
 "source": [
-"#### Set up filtered searchs"
+"#### Set up filtered searches"
 ]
 },
 {
4 changes: 2 additions & 2 deletions docs/clay-v0/run_region.md
@@ -3,7 +3,7 @@
 This section shows in a few simple steps how the clay model can be run for
 custom AOIs and over custom date ranges.

-## Prepare folder strucutre for data
+## Prepare folder structure for data

 ```bash
 # Move into the model repository
@@ -87,7 +87,7 @@ outside of the AOI specified.

 To speed up processing in the example below, we use the subset argument to
 reduce each MGRS tile to a small pixel window. When subsetting, the script
-will only download a fraction of each MGRS tile. This will lead to discontinous
+will only download a fraction of each MGRS tile. This will lead to discontinuous
 datasets and should not be used in a real use case. Remove the subset argument
 when using the script for a real world application, where all the data should
 be downloaded for each MGRS tile.
2 changes: 1 addition & 1 deletion docs/clay-v0/specification-v0.md
@@ -19,7 +19,7 @@ The model was trained on AWS on 4 NVIDIA A10G GPUs for 25 epochs (~14h per epoch

 Model weights are available on HuggingFace [here](https://huggingface.co/made-with-clay/Clay/).

-We also generated embeddings for all trainning data, which can be found on Source Cooperative [here](https://source.coop/).
+We also generated embeddings for all training data, which can be found on Source Cooperative [here](https://source.coop/).

 ## Model Architecture

4 changes: 2 additions & 2 deletions docs/finetune/classify.md
@@ -20,7 +20,7 @@ The `Classifier` class is designed for classification tasks, utilizing the Clay

 In this example, we will use the `Classifier` class to classify images from the [EuroSAT MS dataset](https://github.com/phelber/EuroSAT). The implementation includes data preprocessing, data loading, and model training workflow using [PyTorch Lightning](https://lightning.ai/) & [TorchGeo](https://github.com/microsoft/torchgeo).

-In this example we freeze the Clay encoder and only train a very simple 2 layer MLP head for classification. The MLP head recieves as input the Clay class token embedding, which already contains the essence of the image as seen by Clay. The model for classification can then be kept very simple while still guaranteeing high quality results.
+In this example we freeze the Clay encoder and only train a very simple 2 layer MLP head for classification. The MLP head receives as input the Clay class token embedding, which already contains the essence of the image as seen by Clay. The model for classification can then be kept very simple while still guaranteeing high quality results.

 Notice that the EuroSAT dataset comes without date stamps or location information. The Clay model requires encoded versions of a date stamp and a latitude and longitude information. These values can be set to zero if they are not available, which is what we are doing in the datamodule script.

@@ -72,7 +72,7 @@ data/ds
 ```


-### Training the Classifcation Head
+### Training the Classification Head

 The model can be run via LightningCLI using configurations in `finetune/classify/configs/classify_eurosat.yaml`.

2 changes: 1 addition & 1 deletion docs/finetune/finetune-on-embeddings.ipynb
@@ -365,7 +365,7 @@
 "\n",
 "### Choose your example\n",
 "\n",
-"In the following cell, choose which set of training points to use. The input shoudl be a point dataset\n",
+"In the following cell, choose which set of training points to use. The input should be a point dataset\n",
 "with a `class` column, containing `1` for positive examples, and `0` for negative examples.\n",
 "\n",
 "Use your own dataset or use one of the two provided ones."
2 changes: 1 addition & 1 deletion docs/finetune/regression.md
@@ -157,7 +157,7 @@ Compressed: 729766400

 This will take the average of all timesteps available for each tile.
 The time steps for Sentinel-2 are not complete, not all months are
-provided for all tiles. In addtion, the Clay model does not take time
+provided for all tiles. In addition, the Clay model does not take time
 series as input. So aggregating the time element is simplifying but
 ok for the purpose of this example.

2 changes: 1 addition & 1 deletion docs/release-notes/changelog-v1.0.md
@@ -37,7 +37,7 @@
 * Shorten comment line length by @yellowcap in https://github.com/Clay-foundation/model/pull/261
 * Refactor docs by moving v0 docs into separate section by @yellowcap in https://github.com/Clay-foundation/model/pull/262
 * Docs v1 continued by @yellowcap in https://github.com/Clay-foundation/model/pull/263
-* Documented metadata file for normalization and wavelenghts by @yellowcap in https://github.com/Clay-foundation/model/pull/266
+* Documented metadata file for normalization and wavelengths by @yellowcap in https://github.com/Clay-foundation/model/pull/266
 * [small change] add source.coop link by @brunosan in https://github.com/Clay-foundation/model/pull/137
 * Segmentation on Clay by @srmsoumya in https://github.com/Clay-foundation/model/pull/257

6 changes: 3 additions & 3 deletions docs/tutorials/clay-v1-wall-to-wall.ipynb
@@ -15,7 +15,7 @@
 "3. Load the model checkpoint\n",
 "4. Prepare data into a format for the model\n",
 "5. Run the model on the imagery\n",
-"6. Analyise the model embeddings output using PCA\n",
+"6. Analyse the model embeddings output using PCA\n",
 "7. Train a Support Vector Machines fine tuning head"
 ]
 },
@@ -333,7 +333,7 @@
 "source": [
 "### Prepare band metadata for passing it to the model\n",
 "\n",
-"This is the most technical part so far. We will take the information in the stack of imagery and convert it into the formate that the model requires. This includes converting the lat/lon and the date of the imagery into normalized values.\n",
+"This is the most technical part so far. We will take the information in the stack of imagery and convert it into the format that the model requires. This includes converting the lat/lon and the date of the imagery into normalized values.\n",
 "\n",
 "The Clay model will accept any band combination in any order, from different platforms. But for this the model needs to know the wavelength of each band that is passed to it, and normalization parameters for each band as well. It will use that to normalize the data and to interpret each band based on its central wavelength.\n",
 "\n",
@@ -374,7 +374,7 @@
 "source": [
 "### Convert the band pixel data in to the format for the model\n",
 "\n",
-"We will take the information in the stack of imagery and convert it into the formate that the model requires. This includes converting the lat/lon and the date of the imagery into normalized values."
+"We will take the information in the stack of imagery and convert it into the format that the model requires. This includes converting the lat/lon and the date of the imagery into normalized values."
 ]
 },
 {
2 changes: 1 addition & 1 deletion finetune/classify/classify.py
@@ -19,7 +19,7 @@
 # %%
 def cli_main():
     """
-    Command-line inteface to run Clasifier model with EuroSATDataModule.
+    Command-line interface to run Clasifier model with EuroSATDataModule.
     """
     cli = LightningCLI(EuroSATClassifier, EuroSATDataModule)
     return cli
2 changes: 1 addition & 1 deletion finetune/regression/regression.py
@@ -19,7 +19,7 @@
 # %%
 def cli_main():
     """
-    Command-line inteface to run Regression with BioMastersDataModule.
+    Command-line interface to run Regression with BioMastersDataModule.
     """
     cli = LightningCLI(
         BioMastersClassifier,
2 changes: 1 addition & 1 deletion finetune/segment/segment.py
@@ -19,7 +19,7 @@
 # %%
 def cli_main():
     """
-    Command-line inteface to run Segmentation Model with ChesapeakeDataModule.
+    Command-line interface to run Segmentation Model with ChesapeakeDataModule.
     """
     cli = LightningCLI(ChesapeakeSegmentor, ChesapeakeDataModule)
     return cli
11 changes: 11 additions & 0 deletions pyproject.toml
@@ -0,0 +1,11 @@
+[tool.codespell]
+ignore-words-list = [
+"linz",
+"socio-economic",
+"therefrom",
+]
+skip = [
+"docs/clay-v0/tutorial_digital_earth_pacific_patch_level.ipynb",
+"docs/clay-v0/partial-inputs.ipynb",
+"docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb",
+]
Comment on lines +7 to +11
Contributor Author


Codespell catches some false positive misspellings in Jupyter Notebook binary outputs (see also codespell-project/codespell#2138), so these files are skipped now that the true positive misspellings have been fixed:

docs/clay-v0/tutorial_digital_earth_pacific_patch_level.ipynb:245: iNH ==> in
docs/clay-v0/tutorial_digital_earth_pacific_patch_level.ipynb:687: te ==> the, be, we, to
docs/clay-v0/tutorial_digital_earth_pacific_patch_level.ipynb:916: WEe ==> we
docs/clay-v0/tutorial_digital_earth_pacific_patch_level.ipynb:985: Nd ==> And, 2nd
docs/clay-v0/tutorial_digital_earth_pacific_patch_level.ipynb:985: FO ==> OF, FOR, TO, DO, GO
docs/clay-v0/tutorial_digital_earth_pacific_patch_level.ipynb:985: OT ==> TO, OF, OR, NOT, IT
docs/clay-v0/tutorial_digital_earth_pacific_patch_level.ipynb:985: bu ==> by, be, but, bug, bun, bud, buy, bum
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: oly ==> only
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: teH ==> the
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: FO ==> OF, FOR, TO, DO, GO
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: tRU ==> through, true
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: AAs ==> ass, as
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: ALo ==> also
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: ALo ==> also
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: ALo ==> also
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: ALo ==> also
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: ALo ==> also
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: ALo ==> also
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: ALo ==> also
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: ALo ==> also
docs/clay-v0/partial-inputs.ipynb:394: oT ==> to, of, or, not, it
docs/clay-v0/partial-inputs.ipynb:394: fPr ==> for, far, fps
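
To double-check the ignore and skip lists outside of pre-commit, codespell can also be invoked directly; a minimal sketch, assuming the `toml` extra so that the `[tool.codespell]` table in `pyproject.toml` is picked up (the extra pulls in `tomli`, which is only needed on Python < 3.11):

```bash
# Install codespell with TOML support so it reads [tool.codespell] from pyproject.toml
pip install "codespell[toml]>=2.3"

# Run from the repository root; with no arguments codespell scans the current directory
# and applies the ignore-words-list and skip settings from pyproject.toml
codespell
```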

2 changes: 1 addition & 1 deletion trainer.py
@@ -19,7 +19,7 @@
 # %%
 def cli_main():
     """
-    Command-line inteface to run ClayMAE with ClayDataModule.
+    Command-line interface to run ClayMAE with ClayDataModule.
     """
     cli = LightningCLI(save_config_kwargs={"overwrite": True})
     return cli