Deployed 5b273d4 with MkDocs version: 1.5.3

centre-for-humanities-computing · Jul 22, 2024 · commit 078e232
Showing 113 changed files with 52,791 additions and 0 deletions.
diff --git a/.nojekyll b/.nojekyll
diff --git a/404.html b/404.html
diff --git a/Adding_a_new_dataset b/Adding_a_new_dataset
@@ -0,0 +1,42 @@
+# Adding a new dataset
+
+This guide is intended for internal collaborators. If you want to add a new dataset to the DFM but are not a collaborator, please open an issue in this repository or contact us using the contact form on the [website](https://www.foundationmodels.dk/#join-us).
+
+1) Add a datasheet to the `docs/datasheets` folder, e.g. [`nordjylland_news.md`](https://github.com/centre-for-humanities-computing/danish-foundation-models/blob/main/docs/datasheets/nordjylland_news.md). The datasheet should be named `<dataset_name>.md`, be written in Markdown, and have front matter following the [Hugging Face dataset card template](https://huggingface.co/docs/datasets/en/upload_dataset#create-a-dataset-card). It should have the following attributes:
+   1) A license (if the license is not a standard one allowed by Hugging Face, please use "other" and specify the license in the datasheet)
+   2) Languages (e.g. "da" for Danish)
+2) Add a dataset of the same name to `/danish-foundation-models (193701)/dfm-data/pre-training/` on UCloud, using the following folder structure:
+
+```
+pre-training
+│
+└── dataset_name
+    │
+    ├── documents
+    │   ├── part1.jsonl.gz
+    │   ├── part2.jsonl.gz
+    │   └── ...
+    │
+    └── attributes   # OPTIONAL: folder containing annotations from dataset cleaning
+```
+
+3) Validate the dataset using the `data-processing/scripts/dataset_validator.py` script. The script checks whether the dataset is in the correct format and whether the metadata in the datasheet matches the dataset. See the docstring in the script for more information on how to use it.
+
+## JSONL Schema
+
+An entry in the dataset should adhere to the Document schema (defined below).
+
+```
+{
+    "id": "...",                      # MANDATORY: source-specific identifier
+    "text": "foo",                    # MANDATORY: textual content of the document
+    "source": "...",                  # MANDATORY: source of the data, such as peS2o, common-crawl, etc.
+    "added": "...",                   # MANDATORY: timestamp when we acquired this data (time the file was created),
+                                      # specified as YYYY-MM-DD, e.g. 2021-04-13
+    "created": "...",                 # MANDATORY: timestamp when the original document was created (best guess if not
+                                      # available), specified as a range: "YYYY-MM-DD, YYYY-MM-DD"
+    "metadata": {                     # OPTIONAL: source-specific metadata
+         ...
+     }
+}
diff --git a/CNAME b/CNAME
@@ -0,0 +1 @@
+www.foundationmodels.dk
diff --git a/_static/collab.png b/_static/collab.png
diff --git a/_static/dev_container.png b/_static/dev_container.png
diff --git a/_static/icon.png b/_static/icon.png
diff --git a/_static/logo.png b/_static/logo.png
diff --git a/_static/munin-data-pipeline-da-simplified.drawio.png b/_static/munin-data-pipeline-da-simplified.drawio.png
diff --git a/_static/munin-data-pipeline-da.drawio.png b/_static/munin-data-pipeline-da.drawio.png
diff --git a/_static/structure.png b/_static/structure.png
diff --git a/assets/images/favicon.png b/assets/images/favicon.png
diff --git a/assets/javascripts/bundle.d7c377c4.min.js b/assets/javascripts/bundle.d7c377c4.min.js
diff --git a/assets/javascripts/bundle.d7c377c4.min.js.map b/assets/javascripts/bundle.d7c377c4.min.js.map
diff --git a/assets/javascripts/lunr/min/lunr.ar.min.js b/assets/javascripts/lunr/min/lunr.ar.min.js
diff --git a/assets/javascripts/lunr/min/lunr.da.min.js b/assets/javascripts/lunr/min/lunr.da.min.js
diff --git a/assets/javascripts/lunr/min/lunr.de.min.js b/assets/javascripts/lunr/min/lunr.de.min.js
diff --git a/assets/javascripts/lunr/min/lunr.du.min.js b/assets/javascripts/lunr/min/lunr.du.min.js