ConsoleRegexTurtle (Console Application)
ConsoleRegexTurtle is a Java command line application that provides an interface to the RegexGenerator GP engine. It can load annotated datasets and save the evolution results; both kinds of file are represented in JSON, in the formats described in an appendix to this document. For a text extraction task, please note that the provided application works with fully annotated datasets, where fully annotated means that every character of every example belongs to either a match or an unmatch snippet; by default, consecutive unannotated characters are merged into a single unmatch snippet. For a text classification task (flagging), the provided application requires a dataset in which each positive example contains a single match that spans the entire example string and each negative example contains a single unmatch that spans the entire example string.
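To make the notion of a fully annotated example concrete, the following minimal sketch computes the unmatch ranges that remain once the match ranges of an example are known, merging each run of consecutive unannotated characters into one range. The class name `AnnotationSketch`, the method `unmatchRanges`, and the half-open `{start, end}` representation are assumptions made for illustration; they are not the engine's API and do not reflect the JSON dataset format described in the appendix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of RegexGenerator.
class AnnotationSketch {

    // Each range is a two-element array {start, end}, with end exclusive.
    // Returns the gaps left by the match ranges, one merged unmatch range per gap.
    static List<int[]> unmatchRanges(int length, List<int[]> matches) {
        List<int[]> sorted = new ArrayList<>(matches);
        sorted.sort(Comparator.comparingInt((int[] r) -> r[0]));

        List<int[]> gaps = new ArrayList<>();
        int cursor = 0; // first character not yet covered by any range
        for (int[] match : sorted) {
            if (match[0] > cursor) {
                gaps.add(new int[] {cursor, match[0]});
            }
            cursor = Math.max(cursor, match[1]);
        }
        if (cursor < length) {
            gaps.add(new int[] {cursor, length});
        }
        return gaps;
    }

    public static void main(String[] args) {
        String example = "call 555-1234 now";
        List<int[]> matches = new ArrayList<>();
        matches.add(new int[] {5, 13}); // the phone number is the only match

        for (int[] gap : unmatchRanges(example.length(), matches)) {
            System.out.println(Arrays.toString(gap) + " -> \""
                    + example.substring(gap[0], gap[1]) + "\"");
        }
    }
}
```

Under the same representation, a positive flagging example corresponds to a single match covering `{0, length}`, for which this sketch returns no unmatch ranges at all.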
ConsoleRegexTurtle can be executed in two ways:
java -jar ConsoleRegexTurtle.jar [parameters]
consoleregexturtle.sh [parameters]
The script works only on Unix/Linux systems and automatically sets the Java virtual machine maximum memory size in order to take advantage of all the system memory.
Executing ConsoleRegexTurtle without parameters, or with -h, displays the following concise usage guide:
Usage:
java -jar ConsoleRegexTurtle.jar -t 4 -p 500 -g 1000 -e 20.0 -c "interesting evolution" -x true -d dataset.json
-o ./outputfolder/
On linux you can invoke this tool using the alternative script:
consoleregexturtle.sh -t 4 -p 500 -g 1000 -e 20.0 -c "interesting evolution" -d dataset.json
-o ./outputfolder/
Parameters:
-t number of threads, default is 2
-p population size, default is 500
-g maximum number of generations, per Job, default is 1000
-e percentage of the number of generations, defines a threshold for the completion criteria: when the best
doesn't change for the provided % of generations, the Job evolution ends. Default is 100%, in other words
the completion criteria is disabled.
-d path of the dataset json file containing the examples, this parameter is mandatory.
-o name of the output folder, results.json is saved into this folder; default is '.'
-x boolean, populates an extra field in results file, when 'true' adds all dataset examples in the results
file 'examples' field, default is 'false'
-s boolean, when 'true' enables dataset striping, striping is an experimental feature, default is disabled:
'false'
-c adds an optional comment string to the results file
-h visualizes this help message
-f enables the flagging mode: solves a flagging problem with a separate-and-conquer strategy
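The interaction between -g and -e can be illustrated with a small sketch. It assumes that the percentage is applied to the maximum number of generations and that the result is rounded up; the rounding rule, the class name `TerminationSketch`, and both method names are hypothetical and are not taken from the engine's source.

```java
// Hypothetical illustration of the -e completion threshold; not the engine's code.
class TerminationSketch {

    // Number of consecutive generations without a change in the best
    // individual after which a Job would end (rounding up is an assumption).
    static int stagnationLimit(int maxGenerations, double thresholdPercent) {
        return (int) Math.ceil(maxGenerations * thresholdPercent / 100.0);
    }

    // True when the best has been unchanged for at least the computed limit.
    static boolean shouldStop(int generationsWithoutChange,
                              int maxGenerations,
                              double thresholdPercent) {
        return generationsWithoutChange >= stagnationLimit(maxGenerations, thresholdPercent);
    }

    public static void main(String[] args) {
        // -g 500 -e 20.0
        System.out.println(stagnationLimit(500, 20.0));
        // -g 1000 with the default -e of 100: the limit equals the generation cap
        System.out.println(stagnationLimit(1000, 100.0));
    }
}
```

With the default of 100% the limit coincides with the maximum number of generations, which is consistent with the help text describing the criterion as disabled.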
For example:
consoleregexturtle.sh -t 3 -p 200 -g 500 -e 20.0 -c "first test" -d /stuff/dataset.json
-o ./outputfolder/first/
This command line runs the search on the dataset /stuff/dataset.json and saves the evolution outcome in a file named results-YYMMDDhhmmss.json, where YYMMDDhhmmss is the current formatted date, in the folder ./outputfolder/first/. The population size is 200, the maximum number of generations is 500, the threshold for the termination criteria is 20%, the comment stored in results-YYMMDDhhmmss.json is "first test", and the execution uses three threads.
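When post-processing results, it can help to reproduce the timestamped file name. The sketch below builds a name of the form results-YYMMDDhhmmss.json with `java.time`; it assumes that hh denotes the 24-hour clock and that the local time zone is used, neither of which is stated above. The class name `ResultsNameSketch` and the method `resultsFileName` are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper that mirrors the results-YYMMDDhhmmss.json naming scheme.
class ResultsNameSketch {

    // Two-digit year, month, day, hour (24-hour clock, an assumption), minute, second.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyMMddHHmmss");

    static String resultsFileName(LocalDateTime when) {
        return "results-" + when.format(STAMP) + ".json";
    }

    public static void main(String[] args) {
        // Name that a run finishing right now would be expected to produce.
        System.out.println(resultsFileName(LocalDateTime.now()));
    }
}
```

For instance, under these assumptions a run that finishes on 7 March 2024 at 14:05:09 would map to results-240307140509.json.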
ConsoleRegexTurtle also runs on Windows systems, but the runtime output is less readable there due to some glitches; the final results are nevertheless saved correctly.