From c6ec4fe698d7c277b5d8f968821ee0d15a8c5f4e Mon Sep 17 00:00:00 2001
From: nreimers
Date: Fri, 30 Oct 2015 12:40:22 +0100
Subject: [PATCH] Documentation for v0.4.0

---
 code/pom.xml | 2 +-
 doc/user-guide.adoc | 55 +++++++++++++++++++++++++++++------
 doc/user-guide.html | 70 ++++++++++++++++++++++++++++++++++-----------
 3 files changed, 101 insertions(+), 26 deletions(-)

diff --git a/code/pom.xml b/code/pom.xml
index e5828a9..9702ba1 100644
--- a/code/pom.xml
+++ b/code/pom.xml
@@ -208,7 +208,7 @@
 maven-shade-plugin
- ${project.build.directory}/${artifactId}-${version}-standalone.jar
+ ${project.build.directory}/wrapper-${version}.jar
diff --git a/doc/user-guide.adoc b/doc/user-guide.adoc
index 7d327d7..8ff6c23 100644
--- a/doc/user-guide.adoc
+++ b/doc/user-guide.adoc
@@ -12,7 +12,7 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
-:version: 0.3.6
+:version: 0.4.0
 = DARIAH-DKPro-Wrapper v{version}
 :Author: DARIAH2 - Cluster 5, Use Case 1 Team
@@ -37,13 +37,13 @@ The pipeline requires required *Java 1.8* or higher. You can download Java from
 After downloading and unzipping the files, execute in your command line the following code:
 ****
-+java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-{version}-standalone.jar -input file.txt -output folder+
++java -Xmx4g -jar wrapper-{version}.jar -input file.txt -output folder+
 ****
 You can change the language by specifying the language parameter for the pipeline. Support for the following languages are include in the current version of the DARIAH-DKPro-Wrapper: German (de), English (en), Spanish (es), and French (fr).
 To run the pipeline for English, execute the following command:
 ****
-+java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-{version}-standalone.jar -language en -input file.txt -output folder+
++java -Xmx4g -jar wrapper-{version}.jar -language en -input file.txt -output folder+
 ****
 == Run the full pipeline
@@ -52,6 +52,45 @@ By default, the pipeline runs in a light mode, the memory and time intensive com
 If you like to use them, feel free to enable them in the `default.properties` or create a new `.properties`-File and pass the path to this file via the `config`-parameter.
+== File Reader
+
+You can process either a single file or all files inside a directory. Patterns can be used to select the specific files that should be processed.
+
+=== XML Reader
+
+The DARIAH-DKPro-Wrapper implements two base readers, a text reader and an XML reader. You can specify which reader to use with the `-reader` parameter. By default, the text reader is used. To use the XML reader, run the pipeline in the following way:
+
+****
++java -Xmx4g -jar wrapper-{version}.jar -language en -reader xml -input file.xml -output folder+
+****
+
+The XML reader skips the XML tags and processes only the text inside them. The XPath to each tag is preserved and stored in the *SectionId* column of the output format.
+
+=== Reading Directories
+
+You can also pass a directory instead of a file to the *-input* argument. If you run the pipeline in the following way:
+****
++java -Xmx4g -jar wrapper-{version}.jar -language en -input folder/With/Files/ -output folder+
+****
+
+the pipeline will process all files with a _.txt_ extension when the text reader is used. With the XML reader, it will process all files with a _.xml_ extension.
+
+You can also specify patterns to read only certain files or files with a certain extension. For example, to read only _.xmi_ files with the XML reader, start the pipeline in the following way:
+****
++java -Xmx4g -jar wrapper-{version}.jar -language en -reader xml -input "folder/With/Files/*.xmi" -output folder+
+****
+
+*Note:* If you use patterns (i.e. paths containing an *), you must put the path in quotes to prevent shell globbing.
+
+To read all files in all subfolders, you can use a pattern like this:
+****
++java -Xmx4g -jar wrapper-{version}.jar -language en -input "folder/With/Subfolders/**/*.txt" -output folder+
+****
+
+This reads all _.txt_ files in all subfolders. Note that the subfolder path is not maintained in the output folder.
+
+
+
 == Write your own config files
 The pipeline can be configurated via properties-files that are stored in the `configs` folder. In this folder you find a `default.properties`, the most basic configuration file. For the different supported languages, you can find further properties-files, for example `default_de.properties` for German, `default_es.properties` for English and so on.
@@ -59,17 +98,17 @@ The pipeline can be configurated via properties-files that are stored in the `co
 If you like to write your own config file, just create your own `.properties` file. You can run the pipeline with your `.properties`-file by setting the command argument.
 ****
-+java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-{version}-standalone.jar -config /path/to/my/config/myconfigfile.properties -language en -input file.txt -output folder+
++java -Xmx4g -jar wrapper-{version}.jar -config /path/to/my/config/myconfigfile.properties -language en -input file.txt -output folder+
 ****
 In case you store your `myconfigfile.properties` in the `configs` folder, you can run the pipeline via:
 ****
-+java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-{version}-standalone.jar -config myconfigfile.properties -language en -input file.txt -output folder+
++java -Xmx4g -jar wrapper-{version}.jar -config myconfigfile.properties -language en -input file.txt -output folder+
 ****
 You can split your config file into different parts and pass them all to the pipeline by seperating the paths using comma or semicolons. The pipeline examines all passed config files and derives the final configuration from all files. The config-file passed as last arguments has the highest priority, i.e. it can overwrite the values for all previous config files:
 ****
-+java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-{version}-standalone.jar -config myfile1.properties,myconfig2.properties,myfile3.properties -language en -input file.txt -output folder+
++java -Xmx4g -jar wrapper-{version}.jar -config myfile1.properties,myconfig2.properties,myfile3.properties -language en -input file.txt -output folder+
 ****
 *Note:* The system always uses the default.properties and default_[langcode].properties as basic configuration files. All further config files are added on top of these files.
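The layering rule described in this hunk (base files first, then each file passed to `-config` in order, with later files overwriting earlier ones) can be illustrated with a minimal sketch. This is not the wrapper's actual implementation: the class `ConfigLayering`, its methods, and the key `examplePosModel` are hypothetical names chosen for illustration, and only the last-file-wins behavior is taken from the text above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, NOT part of the DARIAH-DKPro-Wrapper code base.
// It only illustrates the "last config file wins" rule from the user guide.
class ConfigLayering {

    /** Merges the layers in order; a key in a later layer overwrites earlier ones. */
    static Properties merge(List<Properties> layers) {
        Properties merged = new Properties();
        for (Properties layer : layers) {
            merged.putAll(layer);
        }
        return merged;
    }

    /** Builds a Properties object from alternating key/value strings. */
    static Properties of(String... keyValues) {
        Properties p = new Properties();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            p.setProperty(keyValues[i], keyValues[i + 1]);
        }
        return p;
    }

    public static void main(String[] args) {
        // Stand-ins for default.properties, default_[langcode].properties
        // and one user file; the keys and values are assumptions.
        Properties defaults = of("useLemmatizer", "true");
        Properties language = of("examplePosModel", "some-model");
        Properties user = of("useLemmatizer", "false");

        Properties merged = merge(Arrays.asList(defaults, language, user));
        // The user file was passed last, so its value for useLemmatizer is kept.
        System.out.println(merged.getProperty("useLemmatizer"));
        System.out.println(merged.getProperty("examplePosModel"));
    }
}
```

Keys that appear in only one layer survive unchanged, which is why a small file such as `myPOSTagger.properties` only needs to contain the settings it overrides.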
@@ -77,7 +116,7 @@ You can split your config file into different parts and pass them all to the pip
 In case you like to use the _full_-version and also want to change the POS-tagger, you can run the pipeline in the following way:
 ****
-+java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-{version}-standalone.jar -config myFullVersion.properties,myPOSTagger.properties -language en -input file.txt -output folder+
++java -Xmx4g -jar wrapper-{version}.jar -config myFullVersion.properties,myPOSTagger.properties -language en -input file.txt -output folder+
 ****
 In `myPOSTagger.properties` you just add the configuration for the different POS-tagger.
@@ -135,7 +174,7 @@ useLemmatizer = false
 Change the paths for the parameter _executablePath_ and _modelLocation_ to the correct paths on your machine. You can then use Treetagger in your pipeline using the `-config` argument:
 ****
-+java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-{version}-standalone.jar -config treetagger-example.properties -language de -input file.txt -output folder+
++java -Xmx4g -jar wrapper-{version}.jar -config treetagger-example.properties -language de -input file.txt -output folder+
 ****
 Check the output of the pipeline that Treetagger is used. The output of your pipeline should look something like this:
diff --git a/doc/user-guide.html b/doc/user-guide.html
index 2e68072..ae6b4a9 100644
--- a/doc/user-guide.html
+++ b/doc/user-guide.html
@@ -4,7 +4,7 @@
-DARIAH-DKPro-Wrapper v0.3.6
+DARIAH-DKPro-Wrapper v0.4.0