CONTRIBUTING in ParlaMint

Git and GitHub

Sample data should be pushed to the data branch of the ParlaMint repository, directly into the parliament folder (Data/ParlaMint-XX), in a flat structure of files.

Setup

  • Create a GitHub account if you don't have one.
  • Fork the ParlaMint repository into your organization or personal account.
  • Open a terminal on your computer, navigate to the folder where you want the local clone of the ParlaMint repository to be placed, and clone your fork:
# replace <USER-ORG> with your GitHub user or organization name
git clone git@github.com:<USER-ORG>/ParlaMint.git
  • Set the data branch in your repository to be synchronized with the data branch in the ParlaMint repository:
cd ParlaMint
git remote add upstream https://github.com/clarin-eric/ParlaMint.git
git fetch upstream
git checkout -b data upstream/data
git push -u origin data
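
After these commands the clone should have two remotes: origin (your fork) and upstream (the ParlaMint repository). A minimal sketch for checking this, assuming the remote names used above; check_remote is a hypothetical helper name, not a script from the ParlaMint repository:

```shell
# Hypothetical helper (not part of the ParlaMint repository):
# succeed if the remote named $1 exists and its URL ends with $2.
check_remote() {
    url=$(git remote get-url "$1" 2>/dev/null) || {
        echo "remote '$1' is not configured" >&2
        return 1
    }
    case "$url" in
        *"$2") echo "$1: OK ($url)" ;;
        *) echo "$1: unexpected URL $url" >&2; return 1 ;;
    esac
}

# Usage, from inside the clone:
#   check_remote upstream clarin-eric/ParlaMint.git
#   check_remote origin ParlaMint.git
```

If either check fails, git remote -v shows what is currently configured.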

Adding new data into your remote repository (Fork)

  • Check that you are on the data branch:
git status
# switch to the data branch if needed:
git checkout data
  • Update your local git repository from your remote repository (fork):
git pull
  • Add new data to your local git repository:
# replace XX with your country code
git add Data/ParlaMint-XX/*.xml
git commit -m 'XX' Data/ParlaMint-XX/ParlaMint-XX*.xml
  • Add common content (tagUsages, word extents, version):

    • run make add-common-content-XX, which edits the files and saves them in the Data/ParlaMint-XX/add-common-content/ParlaMint-XX/ folder
    • check that the modified files are OK
    • replace the Data/ParlaMint-XX/*.xml files with the content of Data/ParlaMint-XX/add-common-content/ParlaMint-XX/
    • commit the changes: git commit -m 'XX add common content' Data/ParlaMint-XX/ParlaMint-XX*.xml
  • Push data to your Fork:

git push
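
Since every step above assumes the data branch is checked out, a small guard can catch commits made on the wrong branch. This is only a sketch; on_data_branch is a hypothetical helper name and is not provided by the ParlaMint repository:

```shell
# Hypothetical guard: succeed only when the current branch is "data".
on_data_branch() {
    branch=$(git symbolic-ref --short HEAD 2>/dev/null) \
        || branch="(detached HEAD or not a repository)"
    if [ "$branch" = "data" ]; then
        return 0
    fi
    echo "current branch is $branch, expected data" >&2
    return 1
}

# Usage (replace XX with your country code):
#   on_data_branch && git add Data/ParlaMint-XX/*.xml
```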

Synchronize your remote repository with the ParlaMint repository

Install prerequisites

You can check whether all prerequisites are installed with the command make check-prereq. If all checks succeed, the output is:

Saxon: OK
Jing: OK
UD tools: OK
INFO: Maximum java heap size (saxon needs 5-times more than the size of processed xml file)
  1.80469 GB
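
The INFO line states that Saxon needs about five times the size of the processed XML file as Java heap. A minimal sketch of that rule of thumb follows; saxon_heap_estimate is a hypothetical helper name, not a target or script from the repository:

```shell
# Hypothetical helper: print 5 x the size of file $1 in bytes,
# following the rule of thumb quoted in the check-prereq output.
saxon_heap_estimate() {
    size=$(wc -c < "$1") || return 1
    echo $((size * 5))
}

# Usage (the file name is a placeholder):
#   saxon_heap_estimate Data/ParlaMint-XX/ParlaMint-XX.xml
```

Read the other way round, the 1.80469 GB heap in the sample output would cover XML files of up to roughly 0.36 GB.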

Saxon

Saxon is expected to be at this location in your system: /usr/share/java/saxon.jar. You need superuser privileges to install it there.

# download saxon file into /opt folder
sudo wget 'https://search.maven.org/remotecontent?filepath=net/sf/saxon/Saxon-HE/10.6/Saxon-HE-10.6.jar' -O /opt/saxon.jar
# create a symbolic link to the correct location
sudo ln -s /opt/saxon.jar /usr/share/java/saxon.jar

Important note: the Jing archive below also contains Saxon, but that version of Saxon does not support all the features that are needed.

Jing

Jing is expected to be at this location in your system: /usr/share/java/jing.jar. You need superuser privileges to install it there.

# download jing into tmp folder
wget https://github.com/relaxng/jing-trang/releases/download/V20181222/jing-20181222.zip -O /tmp/jing-20181222.zip
# extract jing into /opt (the pattern is quoted so that unzip, not the shell, expands it)
sudo unzip /tmp/jing-20181222.zip 'jing-20181222/bin/*' -d /opt
# create a symbolic link to the correct location
sudo ln -s /opt/jing-20181222/bin/jing.jar /usr/share/java/jing.jar
rm /tmp/jing-20181222.zip
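
Before running make check-prereq, the two symbolic links can be checked by hand. The sketch below only tests that a regular file is reachable at each expected path; it is not what make check-prereq does internally, and check_jar is a hypothetical helper name:

```shell
# Hypothetical helper: report whether a jar is reachable at path $2.
# [ -f ] follows symbolic links, so a dangling link counts as missing.
check_jar() {
    if [ -f "$2" ]; then
        echo "$1: OK"
    else
        echo "$1: missing ($2)"
        return 1
    fi
}

# Usage:
#   check_jar Saxon /usr/share/java/saxon.jar
#   check_jar Jing /usr/share/java/jing.jar
```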

UD tools

  • Change directory to the Scripts folder: cd Scripts
  • Clone the UD tools repository: git clone https://github.com/UniversalDependencies/tools.git
  • Install the Python regex library: pip3 install --user regex

Local validation

Running make help in the repository root folder lists the available make targets with their descriptions. Once the setup is done, the corpus for country XX can be validated with the validate-parlamint-XX command. For the linguistically annotated version, make conllu-XX should also be run.
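
The two validation steps can be chained so that the second runs only if the first succeeds. This sketch assumes that validate-parlamint-XX is invoked through make in the same way as conllu-XX (make help should confirm this); validate_corpus is a hypothetical wrapper name:

```shell
# Hypothetical wrapper: validate corpus $1, then its annotated version.
# Assumption: validate-parlamint-XX is a make target, like conllu-XX.
# The MAKE variable can override the make command (useful for a dry run).
validate_corpus() {
    cc=$1
    "${MAKE:-make}" "validate-parlamint-$cc" || return 1
    "${MAKE:-make}" "conllu-$cc"
}

# Usage, from the repository root (replace XX with your country code):
#   validate_corpus XX
# Dry run that only prints the target names:
#   MAKE=echo validate_corpus XX
```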

Submitting the completed corpora

Once the samples have been validated and incorporated into the ParlaMint GitHub repository, the complete corpus can be processed and submitted.

Please note that the samples in GitHub use a flat directory structure, while the complete corpus is structured differently. First, the linguistically non-annotated corpus should be stored in the directory named ParlaMint-XX.TEI/, while the linguistically annotated corpus should be stored separately, in the directory named ParlaMint-XX.TEI.ana/. Second, the component files should be stored in subdirectories, one for each year. This is explained in the section on Filenames and directory structure of the Guidelines.
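
The layout described here can be sketched as follows. The helper name make_corpus_layout and the years in the usage line are illustrative assumptions; the Guidelines remain the authoritative description of the structure:

```shell
# Hypothetical helper: create the per-year subdirectories for corpus $1
# under the current directory; the remaining arguments are the years.
make_corpus_layout() {
    cc=$1
    shift
    for year in "$@"; do
        mkdir -p "ParlaMint-$cc.TEI/$year" "ParlaMint-$cc.TEI.ana/$year" || return 1
    done
}

# Usage (replace XX with your country code; years are examples):
#   make_corpus_layout XX 2019 2020
# which creates:
#   ParlaMint-XX.TEI/2019/      ParlaMint-XX.TEI/2020/
#   ParlaMint-XX.TEI.ana/2019/  ParlaMint-XX.TEI.ana/2020/
```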

Once the corpus is stored in the recommended way, it can be validated locally. The complete TEI and TEI.ana versions of the corpus should then be compressed (either .zip or .tgz) into two files and put somewhere the ParlaMint editors can access them, preferably a web (HTTP) server or another location from which the files can be downloaded via the command line. If this is not possible, the corpus can also be made available through a cloud service, WeTransfer or similar. Then send the editors (@TomazErjavec and @matyaskopp) an email with instructions on how to download the corpus; they will reply with feedback on whether the corpus passed validation and provide the validation and conversion log file.
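
A minimal sketch of the compression step, using the .tgz option. The archive names (directory name plus .tgz) and the helper name pack_corpus are assumptions; the text above only requires two compressed files:

```shell
# Hypothetical helper: compress the two corpus directories for country
# code $1 into two .tgz archives in the current directory.
pack_corpus() {
    cc=$1
    for dir in "ParlaMint-$cc.TEI" "ParlaMint-$cc.TEI.ana"; do
        [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
        tar -czf "$dir.tgz" "$dir" || return 1
    done
}

# Usage, from the folder that contains both corpus directories:
#   pack_corpus XX
# Listing an archive without extracting it:
#   tar -tzf ParlaMint-XX.TEI.tgz
```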