Commit

Deployed 5b273d4 with MkDocs version: 1.5.3
Unknown committed Jul 22, 2024
0 parents commit 078e232
Showing 113 changed files with 52,791 additions and 0 deletions.
Empty file added .nojekyll
Empty file.
651 changes: 651 additions & 0 deletions 404.html

Large diffs are not rendered by default.

42 changes: 42 additions & 0 deletions Adding_a_new_dataset
@@ -0,0 +1,42 @@
# Adding a new dataset

This guide is intended for internal collaborators. If you want to add a new dataset to the DFM but are not a collaborator, please open an issue in this repository or contact us using the contact form on the [website](https://www.foundationmodels.dk/#join-us).

1) Add a datasheet to the `docs/datasheets` folder, e.g. [`nordjylland_news.md`](https://github.com/centre-for-humanities-computing/danish-foundation-models/blob/main/docs/datasheets/nordjylland_news.md). Name the datasheet `<dataset_name>.md`, write it in Markdown, and give it a front matter that follows the [Hugging Face dataset card template](https://huggingface.co/docs/datasets/en/upload_dataset#create-a-dataset-card). The front matter should have the following attributes:
    1) A license (if the license is not a standard one allowed by Hugging Face, please use "other" and specify the license in the datasheet)
    2) Languages (e.g. "da" for Danish)
2) Add a dataset of the same name to `/danish-foundation-models (193701)/dfm-data/pre-training/` on UCloud, using the following folder structure:

```
pre-training
└── dataset_name
    ├── documents
    │   ├── part1.jsonl.gz
    │   ├── part2.jsonl.gz
    │   └── ...
    └── attributes  # OPTIONAL: folder containing annotations from dataset cleaning
```

3) Validate the dataset using the `data-processing/scripts/dataset_validator.py` script. The script checks whether the dataset is in the correct format and whether the metadata in the datasheet matches the dataset. See the docstring in the script for more information on how to use it.
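
As an illustration of the folder structure in step 2, the following is a minimal sketch of how a batch of documents could be written to a gzip-compressed JSONL shard. The helper names `write_shard` and `read_shard`, and the local `root` path, are assumptions made for this example and are not part of the repository; uploading to UCloud is not covered.

```python
import gzip
import json
from pathlib import Path


def write_shard(root: Path, dataset_name: str, docs: list[dict], part: int = 1) -> Path:
    """Write docs to <root>/<dataset_name>/documents/part<part>.jsonl.gz.

    Hypothetical helper: one JSON object per line, gzip-compressed.
    """
    out_dir = root / dataset_name / "documents"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"part{part}.jsonl.gz"
    with gzip.open(out_path, "wt", encoding="utf-8") as f:
        for doc in docs:
            # ensure_ascii=False keeps Danish characters (æ, ø, å) readable
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")
    return out_path


def read_shard(path: Path) -> list[dict]:
    """Read a shard back into a list of dicts, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling `write_shard(Path("pre-training"), "dataset_name", docs)` would create `pre-training/dataset_name/documents/part1.jsonl.gz` locally; reading the shard back with `read_shard` is a quick way to confirm that each line parses as JSON before running the validator.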

## JSONL Schema

An entry in the dataset should adhere to the Document schema (defined below).

```
{
    "id": "...",        # MANDATORY: source-specific identifier
    "text": "foo",      # MANDATORY: textual content of the document
    "source": "...",    # MANDATORY: source of the data, such as peS2o, common-crawl, etc.
    "added": "...",     # MANDATORY: timestamp we acquired this data (time file was created), specified as
                        #            YYYY-MM-DD, e.g. 2021-04-13
    "created": "...",   # MANDATORY: timestamp when orig document was created (best-guess if not available),
                        #            should be specified as a range;
                        #            "YYYY-MM-DD, YYYY-MM-DD"
    "metadata": {       # OPTIONAL: source-specific metadata
        ...
    }
}
```
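
To make the schema concrete, here is a minimal sketch of a per-entry check based only on the fields and date formats listed above. It is not the project's `dataset_validator.py`, which may check more or behave differently; the function name `check_document` and the wording of the messages are assumptions made for this example.

```python
import re
from datetime import date

MANDATORY_FIELDS = ("id", "text", "source", "added", "created")
_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def _is_iso_date(value: str) -> bool:
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not _DATE_RE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def check_document(doc: dict) -> list[str]:
    """Return a list of problems for one entry; an empty list means none were found."""
    problems = [
        f"missing or non-string field: {field}"
        for field in MANDATORY_FIELDS
        if not isinstance(doc.get(field), str)
    ]
    if problems:
        return problems  # the date checks below need the string fields
    if not _is_iso_date(doc["added"]):
        problems.append("added is not YYYY-MM-DD")
    parts = [p.strip() for p in doc["created"].split(",")]
    if len(parts) != 2 or not all(_is_iso_date(p) for p in parts):
        problems.append('created is not "YYYY-MM-DD, YYYY-MM-DD"')
    if "metadata" in doc and not isinstance(doc["metadata"], dict):
        problems.append("metadata is not an object")
    return problems
```

For example, an entry with `"added": "2021-04-13"` and `"created": "2020-01-01, 2020-12-31"` passes, while an entry whose `created` holds a single date is reported. Returning a list of messages rather than raising on the first problem makes it easier to report everything wrong with an entry at once.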
1 change: 1 addition & 0 deletions CNAME
@@ -0,0 +1 @@
www.foundationmodels.dk
Binary file added _static/collab.png
Binary file added _static/dev_container.png
Binary file added _static/icon.png
Binary file added _static/logo.png
Binary file added _static/munin-data-pipeline-da.drawio.png
Binary file added _static/structure.png
Binary file added assets/images/favicon.png
29 changes: 29 additions & 0 deletions assets/javascripts/bundle.d7c377c4.min.js

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions assets/javascripts/bundle.d7c377c4.min.js.map

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions assets/javascripts/lunr/min/lunr.ar.min.js

Large diffs are not rendered by default.

18 changes: 18 additions & 0 deletions assets/javascripts/lunr/min/lunr.da.min.js

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

18 changes: 18 additions & 0 deletions assets/javascripts/lunr/min/lunr.de.min.js

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

18 changes: 18 additions & 0 deletions assets/javascripts/lunr/min/lunr.du.min.js

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
