Add pre-commit hook for codespell #307

Open · wants to merge 6 commits into base: main
4 changes: 4 additions & 0 deletions .pre-commit-config.yaml
@@ -1,6 +1,10 @@
 # See https://pre-commit.com for more information
 # See https://pre-commit.com/hooks.html for more hooks
 repos:
+  - repo: https://github.com/codespell-project/codespell
+    rev: v2.3.0
+    hooks:
+      - id: codespell
   - repo: https://github.com/pre-commit/pre-commit-hooks
     rev: v4.6.0
     hooks:
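For anyone reviewing this locally, a minimal sketch of exercising the new hook (assuming `pre-commit` is available in the environment):

```bash
# Register the git hook defined in .pre-commit-config.yaml
pip install pre-commit
pre-commit install

# Run only the codespell hook against every file in the repository
pre-commit run codespell --all-files
```

Once installed, the hook also runs automatically on staged files at each commit.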
4 changes: 2 additions & 2 deletions docs/clay-v0/clay-v0-location-embeddings.ipynb
@@ -223,15 +223,15 @@
 "id": "3384f479-ef84-420d-a4e9-e3b038f05497",
 "metadata": {},
 "source": [
-"> Latitude & Longitude map to 768 dimentional vector"
+"> Latitude & Longitude map to 768 dimensional vector"
 ]
 },
 {
 "cell_type": "markdown",
 "id": "9e419fc9-e7d3-49de-a8ea-72912c365510",
 "metadata": {},
 "source": [
-"## Preform PCA over the location embeddings to visualize them in 2 dimension"
+"## Perform PCA over the location embeddings to visualize them in 2 dimension"
 ]
 },
 {
2 changes: 1 addition & 1 deletion docs/clay-v0/data_labels.md
@@ -26,7 +26,7 @@ then followed the same `datacube` creation logic to generate datacubes with
 Sentinel-1 VV and VH and the Copernicus Digital Elevation Model (DEM). We also
 ensured that the Sentinel-1 data was within a +/- 3 day interval of each
 reference Sentinel-2 scene (same method used by the benchmark dataset authors)
-and that the Sentinel-1 data was indeed already included in the bechmark
+and that the Sentinel-1 data was indeed already included in the benchmark
 dataset's list of granules. The datacubes generated have all three inputs
 matching the exact specs of the Foundation model's training data, at 512x512
 pixels.
6 changes: 3 additions & 3 deletions docs/clay-v0/model_finetuning.md
@@ -2,7 +2,7 @@

 Fine-tuning refers to a process in machine learning where a pre-trained model
 is further trained on a specific dataset to adapt its parameters to a
-downstream task characterized by a relevent domain. It's distinct from training
+downstream task characterized by a relevant domain. It's distinct from training
 a model from scratch using the downstream task dataset exclusively.

 Related to finetuning in the field of training Foundation models is linear
@@ -21,7 +21,7 @@ the Foundation model both during its pre-training and afterwards.
 Let's take a look at how we are finetuning on the benchmark datacube-adapted
 [Cloud to Street - Microsoft Flood Dataset](https://beta.source.coop/repositories/c2sms/c2smsfloods).
 As a reminder, that is a downstream
-segmentation task for identifiying water pixels in recorded flood events. It's
+segmentation task for identifying water pixels in recorded flood events. It's
 a binary segmentation problem, specifically.

 We process the datacubes into batches formatted in the way the pretrained Clay
@@ -150,7 +150,7 @@ segmentation problem, and on the predictions, we run sigmoid and max functions
 to obtain final segmentation results.

 The way we measure relative performance between the finetuned and
-"from scratch" model variants happens through calculation of evalution metrics
+"from scratch" model variants happens through calculation of evaluation metrics
 common for segmentation, such as Dice coefficient, Intersection over Union, F1
 score, precision and recall.

2 changes: 1 addition & 1 deletion docs/clay-v0/patch_level_cloud_cover.ipynb
@@ -698,7 +698,7 @@
 "id": "bd3d1cc1-9d79-4059-a1f6-4ac8cf4d2e51",
 "metadata": {},
 "source": [
-"#### Set up filtered searchs"
+"#### Set up filtered searches"
 ]
 },
 {
4 changes: 2 additions & 2 deletions docs/clay-v0/run_region.md
@@ -3,7 +3,7 @@
 This section shows in a few simple steps how the clay model can be run for
 custom AOIs and over custom date ranges.

-## Prepare folder strucutre for data
+## Prepare folder structure for data

 ```bash
 # Move into the model repository
@@ -87,7 +87,7 @@ outside of the AOI specified.

 To speed up processing in the example below, we use the subset argument to
 reduce each MGRS tile to a small pixel window. When subsetting, the script
-will only download a fraction of each MGRS tile. This will lead to discontinous
+will only download a fraction of each MGRS tile. This will lead to discontinuous
 datasets and should not be used in a real use case. Remove the subset argument
 when using the script for a real world application, where all the data should
 be downloaded for each MGRS tile.
2 changes: 1 addition & 1 deletion docs/clay-v0/specification-v0.md
@@ -19,7 +19,7 @@ The model was trained on AWS on 4 NVIDIA A10G GPUs for 25 epochs (~14h per epoch

 Model weights are available on HuggingFace [here](https://huggingface.co/made-with-clay/Clay/).

-We also generated embeddings for all trainning data, which can be found on Source Cooperative [here](https://source.coop/).
+We also generated embeddings for all training data, which can be found on Source Cooperative [here](https://source.coop/).

 ## Model Architecture

4 changes: 2 additions & 2 deletions docs/finetune/classify.md
@@ -20,7 +20,7 @@ The `Classifier` class is designed for classification tasks, utilizing the Clay

 In this example, we will use the `Classifier` class to classify images from the [EuroSAT MS dataset](https://github.com/phelber/EuroSAT). The implementation includes data preprocessing, data loading, and model training workflow using [PyTorch Lightning](https://lightning.ai/) & [TorchGeo](https://github.com/microsoft/torchgeo).

-In this example we freeze the Clay encoder and only train a very simple 2 layer MLP head for classification. The MLP head recieves as input the Clay class token embedding, which already contains the essence of the image as seen by Clay. The model for classification can then be kept very simple while still guaranteeing high quality results.
+In this example we freeze the Clay encoder and only train a very simple 2 layer MLP head for classification. The MLP head receives as input the Clay class token embedding, which already contains the essence of the image as seen by Clay. The model for classification can then be kept very simple while still guaranteeing high quality results.

 Notice that the EuroSAT dataset comes without date stamps or location information. The Clay model requires encoded versions of a date stamp and a latitude and longitude information. These values can be set to zero if they are not available, which is what we are doing in the datamodule script.

@@ -72,7 +72,7 @@ data/ds
 ```


-### Training the Classifcation Head
+### Training the Classification Head

 The model can be run via LightningCLI using configurations in `finetune/classify/configs/classify_eurosat.yaml`.

2 changes: 1 addition & 1 deletion docs/finetune/finetune-on-embeddings.ipynb
@@ -365,7 +365,7 @@
 "\n",
 "### Choose your example\n",
 "\n",
-"In the following cell, choose which set of training points to use. The input shoudl be a point dataset\n",
+"In the following cell, choose which set of training points to use. The input should be a point dataset\n",
 "with a `class` column, containing `1` for positive examples, and `0` for negative examples.\n",
 "\n",
 "Use your own dataset or use one of the two provided ones."
2 changes: 1 addition & 1 deletion docs/finetune/regression.md
@@ -157,7 +157,7 @@ Compressed: 729766400

 This will take the average of all timesteps available for each tile.
 The time steps for Sentinel-2 are not complete, not all months are
-provided for all tiles. In addtion, the Clay model does not take time
+provided for all tiles. In addition, the Clay model does not take time
 series as input. So aggregating the time element is simplifying but
 ok for the purpose of this example.

2 changes: 1 addition & 1 deletion docs/release-notes/changelog-v1.0.md
@@ -37,7 +37,7 @@
 * Shorten comment line length by @yellowcap in https://github.com/Clay-foundation/model/pull/261
 * Refactor docs by moving v0 docs into separate section by @yellowcap in https://github.com/Clay-foundation/model/pull/262
 * Docs v1 continued by @yellowcap in https://github.com/Clay-foundation/model/pull/263
-* Documented metadata file for normalization and wavelenghts by @yellowcap in https://github.com/Clay-foundation/model/pull/266
+* Documented metadata file for normalization and wavelengths by @yellowcap in https://github.com/Clay-foundation/model/pull/266
 * [small change] add source.coop link by @brunosan in https://github.com/Clay-foundation/model/pull/137
 * Segmentation on Clay by @srmsoumya in https://github.com/Clay-foundation/model/pull/257

6 changes: 3 additions & 3 deletions docs/tutorials/clay-v1-wall-to-wall.ipynb
@@ -15,7 +15,7 @@
 "3. Load the model checkpoint\n",
 "4. Prepare data into a format for the model\n",
 "5. Run the model on the imagery\n",
-"6. Analyise the model embeddings output using PCA\n",
+"6. Analyse the model embeddings output using PCA\n",
 "7. Train a Support Vector Machines fine tuning head"
 ]
 },
@@ -333,7 +333,7 @@
 "source": [
 "### Prepare band metadata for passing it to the model\n",
 "\n",
-"This is the most technical part so far. We will take the information in the stack of imagery and convert it into the formate that the model requires. This includes converting the lat/lon and the date of the imagery into normalized values.\n",
+"This is the most technical part so far. We will take the information in the stack of imagery and convert it into the format that the model requires. This includes converting the lat/lon and the date of the imagery into normalized values.\n",
 "\n",
 "The Clay model will accept any band combination in any order, from different platforms. But for this the model needs to know the wavelength of each band that is passed to it, and normalization parameters for each band as well. It will use that to normalize the data and to interpret each band based on its central wavelength.\n",
 "\n",
@@ -374,7 +374,7 @@
 "source": [
 "### Convert the band pixel data in to the format for the model\n",
 "\n",
-"We will take the information in the stack of imagery and convert it into the formate that the model requires. This includes converting the lat/lon and the date of the imagery into normalized values."
+"We will take the information in the stack of imagery and convert it into the format that the model requires. This includes converting the lat/lon and the date of the imagery into normalized values."
 ]
 },
 {
2 changes: 1 addition & 1 deletion finetune/classify/classify.py
@@ -19,7 +19,7 @@
 # %%
 def cli_main():
     """
-    Command-line inteface to run Clasifier model with EuroSATDataModule.
+    Command-line interface to run Clasifier model with EuroSATDataModule.
     """
     cli = LightningCLI(EuroSATClassifier, EuroSATDataModule)
     return cli
2 changes: 1 addition & 1 deletion finetune/regression/regression.py
@@ -19,7 +19,7 @@
 # %%
 def cli_main():
     """
-    Command-line inteface to run Regression with BioMastersDataModule.
+    Command-line interface to run Regression with BioMastersDataModule.
     """
     cli = LightningCLI(
         BioMastersClassifier,
2 changes: 1 addition & 1 deletion finetune/segment/segment.py
@@ -19,7 +19,7 @@
 # %%
 def cli_main():
     """
-    Command-line inteface to run Segmentation Model with ChesapeakeDataModule.
+    Command-line interface to run Segmentation Model with ChesapeakeDataModule.
     """
     cli = LightningCLI(ChesapeakeSegmentor, ChesapeakeDataModule)
     return cli
11 changes: 11 additions & 0 deletions pyproject.toml
@@ -0,0 +1,11 @@
+[tool.codespell]
+ignore-words-list = [
+"linz",
+"socio-economic",
+"therefrom",
+]
+skip = [
+"docs/clay-v0/tutorial_digital_earth_pacific_patch_level.ipynb",
+"docs/clay-v0/partial-inputs.ipynb",
+"docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb",
+]
Comment on lines +7 to +11
Contributor Author


Codespell catches some false positive misspellings in Jupyter Notebook binary outputs (see also codespell-project/codespell#2138), so these files are skipped now that the true positive misspellings have been fixed:

docs/clay-v0/tutorial_digital_earth_pacific_patch_level.ipynb:245: iNH ==> in
docs/clay-v0/tutorial_digital_earth_pacific_patch_level.ipynb:687: te ==> the, be, we, to
docs/clay-v0/tutorial_digital_earth_pacific_patch_level.ipynb:916: WEe ==> we
docs/clay-v0/tutorial_digital_earth_pacific_patch_level.ipynb:985: Nd ==> And, 2nd
docs/clay-v0/tutorial_digital_earth_pacific_patch_level.ipynb:985: FO ==> OF, FOR, TO, DO, GO
docs/clay-v0/tutorial_digital_earth_pacific_patch_level.ipynb:985: OT ==> TO, OF, OR, NOT, IT
docs/clay-v0/tutorial_digital_earth_pacific_patch_level.ipynb:985: bu ==> by, be, but, bug, bun, bud, buy, bum
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: oly ==> only
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: teH ==> the
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: FO ==> OF, FOR, TO, DO, GO
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: tRU ==> through, true
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: AAs ==> ass, as
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: ALo ==> also
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: ALo ==> also
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: ALo ==> also
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: ALo ==> also
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: ALo ==> also
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: ALo ==> also
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: ALo ==> also
docs/tutorials/v1-inference-simsearch-naip-stacchip.ipynb:708: ALo ==> also
docs/clay-v0/partial-inputs.ipynb:394: oT ==> to, of, or, not, it
docs/clay-v0/partial-inputs.ipynb:394: fPr ==> for, far, fps
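
To double-check the ignore and skip lists outside of pre-commit, codespell can also be invoked directly; a minimal sketch, assuming the `toml` extra so that the `[tool.codespell]` table in `pyproject.toml` is picked up (the extra pulls in `tomli`, which is only needed on Python < 3.11):

```bash
# Install codespell with TOML support so it reads [tool.codespell] from pyproject.toml
pip install "codespell[toml]>=2.3"

# Run from the repository root; with no arguments codespell scans the current directory
# and applies the ignore-words-list and skip settings from pyproject.toml
codespell
```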

2 changes: 1 addition & 1 deletion trainer.py
@@ -19,7 +19,7 @@
 # %%
 def cli_main():
     """
-    Command-line inteface to run ClayMAE with ClayDataModule.
+    Command-line interface to run ClayMAE with ClayDataModule.
     """
     cli = LightningCLI(save_config_kwargs={"overwrite": True})
     return cli