Sample data should be pushed to the data branch of the ParlaMint repository, directly into the parliament folder (Data/ParlaMint-XX), as a flat structure of files.
- Create a GitHub account if you don't have one.
- Fork the ParlaMint repository into your organization or personal account.
- Open a terminal, navigate to the folder where the local clone of the ParlaMint repository should be placed, and clone your fork:
# replace <USER-ORG> with your GitHub user or organization name
git clone git@github.com:<USER-ORG>/ParlaMint.git
- Set the data branch in your repository to be synchronized with the data branch in the ParlaMint repository:
cd ParlaMint
git remote add upstream https://github.com/clarin-eric/ParlaMint.git
git fetch upstream
git checkout -b data upstream/data
git push -u origin data
- Check that you are on the data branch:
git status
# switch to the data branch:
git checkout data
- Update your local git repository from your remote repository:
git pull
- Add new data to your local git repository:
# replace XX with your country code
git add Data/ParlaMint-XX/*.xml
git commit -m 'XX' Data/ParlaMint-XX/ParlaMint-XX*.xml
- Add common content (tagUsages, word extents, version):
  - edit the files and save them in the Data/ParlaMint-XX/add-common-content/ParlaMint-XX/ folder by running:
make add-common-content-XX
  - check that the modified files are OK
  - replace the Data/ParlaMint-XX/*.xml files with the content of Data/ParlaMint-XX/add-common-content/ParlaMint-XX/
  - commit the changes:
git commit -m 'XX add common content' Data/ParlaMint-XX/ParlaMint-XX*.xml
- Push data to your fork:
git push
- Update your repository with new content from the ParlaMint repository:
  - create a pull request: https://github.com/USER-ORG/ParlaMint/compare/data...clarin-eric:data
  - check the changes
  - merge the pull request
- Update the ParlaMint repository with the data in your repository:
  - create a pull request: https://github.com/clarin-eric/ParlaMint/compare/data...USER-ORG:data
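The two compare URLs in the last steps differ only in where USER-ORG appears, which is easy to mix up. The following is a minimal sketch of a helper that prints both for a given account name; the function name parlamint_pr_urls is hypothetical and not part of the repository.

```shell
# Hypothetical helper (not part of the ParlaMint repository).
# Prints the two compare URLs used in the steps above.
# $1 is your GitHub user or organization name (USER-ORG).
parlamint_pr_urls() {
    org="$1"
    if [ -z "$org" ]; then
        echo "usage: parlamint_pr_urls USER-ORG" >&2
        return 1
    fi
    # Pull request that brings new ParlaMint content into your fork
    echo "https://github.com/${org}/ParlaMint/compare/data...clarin-eric:data"
    # Pull request that offers your data to the ParlaMint repository
    echo "https://github.com/clarin-eric/ParlaMint/compare/data...${org}:data"
}
```

Calling parlamint_pr_urls with your account name prints the fork-update URL first and the contribution URL second.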
You can check whether all prerequisites are installed with the command make check-prereq
If every check succeeds, the output is:
Saxon: OK
Jing: OK
UD tools: OK
INFO: Maximum java heap size (saxon needs 5-times more than the size of processed xml file)
1.80469 GB
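The INFO line gives a rule of thumb: Saxon needs about five times the size of the processed XML file as Java heap. As a rough sketch of that arithmetic, the function below multiplies a file's size in bytes by five; the name saxon_heap_estimate_bytes is hypothetical, and the result is only an estimate based on the stated factor.

```shell
# Hypothetical helper: estimate the Java heap Saxon may need for one XML file,
# using the factor of five stated in the INFO line above.
saxon_heap_estimate_bytes() {
    file="$1"
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    # Reading from stdin makes wc print only the byte count
    size=$(wc -c < "$file")
    echo $(( size * 5 ))
}
```

By the same rule, the 1.80469 GB heap reported above would cover a single file of roughly 0.36 GB.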
Saxon is expected to be at this location in your system: /usr/share/java/saxon.jar
You need superuser privileges to install it there.
# download the Saxon jar into the /opt folder
sudo wget 'https://search.maven.org/remotecontent?filepath=net/sf/saxon/Saxon-HE/10.6/Saxon-HE-10.6.jar' -O /opt/saxon.jar
# create a symbolic link to the correct location
sudo ln -s /opt/saxon.jar /usr/share/java/saxon.jar
Important note: the Jing archive below also contains Saxon, but that version of Saxon does not support all the features that are needed.
Jing is expected to be at this location in your system: /usr/share/java/jing.jar
You need superuser privileges to install it there.
# download jing into the /tmp folder
wget https://github.com/relaxng/jing-trang/releases/download/V20181222/jing-20181222.zip -O /tmp/jing-20181222.zip
# extract jing into /opt (the pattern is quoted so that unzip, not the shell, expands it)
sudo unzip /tmp/jing-20181222.zip 'jing-20181222/bin/*' -d /opt
# create a symbolic link to the correct location
sudo ln -s /opt/jing-20181222/bin/jing.jar /usr/share/java/jing.jar
rm /tmp/jing-20181222.zip
- Change directory to the Scripts folder:
cd Scripts
- Clone the UD tools repository:
git clone https://github.com/UniversalDependencies/tools.git
- Install Python regex library:
pip3 install --user regex
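Once Saxon, Jing and the UD tools are in place, the expected locations can also be spot-checked by hand. The sketch below is not what make check-prereq does internally; it only tests whether each given path exists, and the function name check_paths is hypothetical.

```shell
# Hypothetical helper: report whether each given path exists.
# Returns non-zero if at least one path is missing.
check_paths() {
    missing=0
    for p in "$@"; do
        # -e follows symbolic links, so a dangling link counts as missing
        if [ -e "$p" ]; then
            echo "$p: OK"
        else
            echo "$p: MISSING"
            missing=1
        fi
    done
    return "$missing"
}
```

From the repository root, it could be called as check_paths /usr/share/java/saxon.jar /usr/share/java/jing.jar Scripts/tools, where Scripts/tools is the directory created by cloning the UD tools repository inside Scripts.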
Running make help in the repository root folder lists the available make targets with their descriptions.
Once the set-up has been done, the corpus for country XX can be validated with the validate-parlamint-XX command. For the linguistically annotated version, make conllu-XX should also be run.
Once the samples have been validated and incorporated into the ParlaMint GitHub repository, the complete corpus can be processed and submitted.
Please note that the samples in GitHub use a flat directory structure, while the complete corpus is structured differently. First, the linguistically non-annotated corpus should be stored in the directory named ParlaMint-XX.TEI/, while the linguistically annotated corpus should be stored separately, in the directory named ParlaMint-XX.TEI.ana/. Second, the component files should be stored in subdirectories, one for each year. This is explained in the Section on Filenames and directory structure of the Guidelines.
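The layout described above can be sketched as a small helper that creates both corpus directories with one subdirectory per year. This is only an illustration: the function name make_corpus_layout is hypothetical, and naming each subdirectory by its four-digit year is an assumption that should be checked against the Guidelines.

```shell
# Hypothetical helper: create the two corpus directories in the current
# directory, each with one subdirectory per year.
# $1 = country code (XX), remaining arguments = years.
# Assumption: year subdirectories are named by the four-digit year.
make_corpus_layout() {
    code="$1"
    shift
    for year in "$@"; do
        mkdir -p "ParlaMint-${code}.TEI/${year}" \
                 "ParlaMint-${code}.TEI.ana/${year}"
    done
}
```

For example, make_corpus_layout XX 2019 2020 creates ParlaMint-XX.TEI/2019, ParlaMint-XX.TEI/2020 and the matching ParlaMint-XX.TEI.ana directories.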
Once the corpus is stored in the recommended way, it can be validated locally. The complete TEI and TEI.ana versions of the corpus should then be compressed (either .zip or .tgz) into two files and placed where the ParlaMint editors can access them, preferably on a web (HTTP) server or any other location from which the files can be downloaded via the command line. If this is not possible, the corpus can also be made available via a cloud service, WeTransfer, or similar. The editors (@TomazErjavec and @matyaskopp) should then be sent an email with instructions on how to download the corpus; they will reply with feedback on whether the corpus passed validation and provide the validation and conversion log file.
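The compression step can be sketched as follows, assuming the .tgz option and that both corpus directories sit in the current directory. The function name pack_corpus and the archive names (directory name plus .tgz) are hypothetical choices, not a naming scheme required by the editors.

```shell
# Hypothetical helper: compress the TEI and TEI.ana directories into two
# .tgz archives. $1 = country code (XX).
# Run from the directory that holds both corpus folders.
pack_corpus() {
    code="$1"
    for variant in TEI TEI.ana; do
        dir="ParlaMint-${code}.${variant}"
        [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
        # One archive per corpus version, named after its directory
        tar -czf "${dir}.tgz" "$dir" || return 1
    done
}
```

Running pack_corpus XX would produce ParlaMint-XX.TEI.tgz and ParlaMint-XX.TEI.ana.tgz, which can then be placed on a server for the editors to download.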