gary_tools
Documentation for the tools found in the libpython/gary directory.
The basic usage of these tools is to create an Entry object and filters that operate on the fields of that entry.
An Entry can be created with language fields given as parameters. Field names should be valid Python identifiers. For example:
Entry('eng', 'fin')
or Entry('eng_000', 'fin_000')
or Entry('English', 'Finnish')
An entry can have multiple fields within a language column for adding POS or other metadata.
entry = Entry('eng:pos:mcs')
This creates a single column called eng with two subfields called pos and mcs. The fields can be accessed like this:
entry.eng.text = 'cat'
entry.eng.pos = 'noun'
entry.eng.mcs = 'type:animal'
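As a rough mental model only (this is not gary's actual implementation), a column spec such as 'eng:pos:mcs' can be read as a column name followed by subfield names, with a text subfield always present. The helpers `parse_column_spec` and `make_entry` below are hypothetical stand-ins built on `types.SimpleNamespace`, chosen purely for illustration.

```python
from types import SimpleNamespace


def parse_column_spec(spec):
    """Split 'eng:pos:mcs' into a column name and its subfield names.

    Hypothetical helper: gary's real Entry class may parse specs differently.
    """
    name, *subfields = spec.split(':')
    # Assumption: every column carries an implicit 'text' subfield.
    return name, ['text'] + subfields


def make_entry(*specs):
    """Build a nested namespace that mimics entry.<column>.<subfield> access."""
    entry = SimpleNamespace()
    for spec in specs:
        name, subfields = parse_column_spec(spec)
        column = SimpleNamespace(**{field: '' for field in subfields})
        setattr(entry, name, column)
    return entry


entry = make_entry('eng:pos:mcs', 'fin')
entry.eng.text = 'cat'
entry.eng.pos = 'noun'
entry.eng.mcs = 'type:animal'
print(entry.eng)
```

The nested access pattern matches the one used by the full example script later in this document (entry.eng.text, entry.eng.pos).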
An entry can be filtered with different kinds of filters (SimpleFilter, SynonymFilter, etc.). A filter is a runnable object that takes an entry as its first parameter and runs a text filter on the specified columns. To create a filter object, specify the function to use and the columns to operate on. If no subfield is specified, the text field is assumed.
def text_filter(some_text):
    # modify text here
    return some_text

filter_object = SimpleFilter(text_filter, 'eng', 'eng.pos')
This example filters the eng.text field and the eng.pos field: eng is given without a subfield, so its text field is used. The SimpleFilter object makes no assumptions about the structure of the text.
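To make the column resolution rule concrete, here is a minimal sketch of a filter-like callable. `SketchFilter` and the dict-based entry are hypothetical and are not part of gary; they only illustrate how 'eng' could resolve to the text subfield while 'eng.pos' names its subfield explicitly.

```python
class SketchFilter:
    """Hypothetical stand-in for a SimpleFilter-style object (not gary's code)."""

    def __init__(self, func, *columns):
        self.func = func
        # 'eng' -> ('eng', 'text'); 'eng.pos' -> ('eng', 'pos')
        self.targets = [self._resolve(column) for column in columns]

    @staticmethod
    def _resolve(column):
        name, _, subfield = column.partition('.')
        return name, subfield or 'text'

    def __call__(self, entry):
        # The entry is the first (and here, only) parameter.
        for name, subfield in self.targets:
            entry[name][subfield] = self.func(entry[name][subfield])
        return entry


def text_filter(some_text):
    # Example modification: trim surrounding whitespace.
    return some_text.strip()


# Assumption: a plain nested dict stands in for a gary Entry.
entry = {'eng': {'text': '  cat ', 'pos': ' noun '}}
filter_object = SketchFilter(text_filter, 'eng', 'eng.pos')
filter_object(entry)
print(entry)
```

The function passed in only ever sees a string, which is why the same text function can be reused across columns and subfields.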
#!/usr/bin/env python3
from collections import namedtuple
import re

from gary import Entry, increment_fileid, run_filters, ignore_parens, SYNDELIM
from gary.text_filter import pre_process
# PanlexSynonymFilter is assumed to live alongside the other filter classes
from gary.entry_filter import (SimpleFilter, SynonymFilter, ExtractorFilter,
                               PanlexSynonymFilter)

Record = namedtuple('Record', ['eng', 'pos', 'fin', 'jpn', 'deu', 'spa'])


def create_entry(record: Record):
    # fin is given a pos subfield so that pos_filter below has a field to fill
    entry = Entry('eng:pos', 'fin:pos', 'jpn', 'deu', 'spa')
    entry.eng.text = record.eng
    entry.eng.pos = record.pos
    entry.fin.text = record.fin
    entry.jpn.text = record.jpn
    entry.deu.text = record.deu
    entry.spa.text = record.spa
    return entry


@ignore_parens
def delimit_synonyms(text):
    # replace commas (and surrounding whitespace) with the synonym delimiter
    return re.sub(r'\s*,\s*', SYNDELIM, text)


def remove_period(text):
    return text.rstrip('.')


def extract_pos(text, pos):
    # split "kissa (s)" into the text "kissa" and the POS tag "s"
    match = re.search(r'^(.*)\s+\((.*)\)$', text)
    if match:
        text = match.group(1)
        pos = match.group(2)
    return text, pos


def do_nothing(text):
    return text


preprocess_filter = SimpleFilter(pre_process, 'eng', 'eng.pos', 'spa', 'fin', 'deu', 'jpn')
synonym_filter = SimpleFilter(delimit_synonyms, 'eng', 'spa')
period_filter = SynonymFilter(remove_period, 'eng')
pos_filter = ExtractorFilter(extract_pos, 'fin.text', 'fin.pos')
dummy_filter = PanlexSynonymFilter(do_nothing, 'fin')

all_filters = [preprocess_filter, synonym_filter, period_filter, pos_filter, dummy_filter]

record_list = [Record('cat.,feline.', 'noun', 'kissa (s)', '猫', 'Katze', 'gato (por exemplo, animal)')]

if __name__ == '__main__':
    for record in record_list:
        entry = create_entry(record)
        run_filters(all_filters, entry)
        print(entry)
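The text functions in the script are ordinary Python functions, so they can be tried on the sample record without gary installed. The sketch below reimplements three of them; it assumes that SYNDELIM is the ‣ character named in the filter descriptions, and it leaves out the ignore_parens decorator, so it says nothing about what the full script prints.

```python
import re

SYNDELIM = '‣'  # assumption: stands in for gary.SYNDELIM


def delimit_synonyms(text):
    # Undecorated version: commas inside parentheses would be converted too.
    return re.sub(r'\s*,\s*', SYNDELIM, text)


def remove_period(text):
    return text.rstrip('.')


def extract_pos(text, pos):
    match = re.search(r'^(.*)\s+\((.*)\)$', text)
    if match:
        text = match.group(1)
        pos = match.group(2)
    return text, pos


print(delimit_synonyms('cat.,feline.'))   # cat.‣feline.
print(remove_period('cat.'))              # cat
print(extract_pos('kissa (s)', ''))       # ('kissa', 's')
print(extract_pos('Katze', ''))           # ('Katze', '')
```

Note that extract_pos returns its pos argument unchanged when the text has no trailing parenthesised tag.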
- SimpleFilter - used when the text has no structure; useful for removing illegal characters or otherwise modifying text. Not recommended for operations on the beginning or end of the text if the text is divided into synonyms.
- SynonymFilter - used when the text is divided into synonyms with the synonym delimiter (‣) or into separate senses with the sense delimiter (⁋).
- ExtractorFilter - (takes two field name parameters) used to extract information from the first field and move it to the second (e.g. extracting a POS tag from the text).
- PanlexSynonymFilter - used when data is already formatted in the serialized PanLex format (⫷ex⫸ and ⫷df⫸) or when tags need to be added. If there are no ex tags, they will be added.
- PanlexExtractorFilter - same as ExtractorFilter, but uses PanLex-style tags. Note: when the PanlexExtractorFilter is used, it will not replace values in the extracted field. Instead, it will append them as synonyms. To remove data from the extracted field, use the ExtractorFilter.
- ignore_parens decorator - add to a function to keep it from processing text inside parentheses (e.g. when converting commas to the synonym delimiter).
- increment_fileid(filename) - takes a filename in the format XXX-0.txt and increments the number (does not need to be a .txt file)
- pretag_ex(text[, uid]) - tags the text with the ⫷ex⫸ tag
- pretag_df(text) - tags the text with the ⫷df⫸ tag
- default_str(text) - converts None to an empty string; otherwise returns the same value
- get_plx_fields(text) - extracts a triplet of ex-tag, text, property or classification data
- append_synonym(text, new_element) - appends a new item to a list of items delimited by ‣
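The difference between SimpleFilter and SynonymFilter, the effect of ignore_parens, and the filename convention of increment_fileid can be illustrated without gary. The sketch below is an approximation under stated assumptions: `apply_per_synonym`, `ignore_parens_sketch`, and `increment_fileid_sketch` are hypothetical names, ‣ is taken as the synonym delimiter from the list above, and the real implementations may behave differently (for example with nested parentheses or the ⁋ sense delimiter).

```python
import re
from functools import wraps

SYNDELIM = '‣'  # assumption: the synonym delimiter named in the list above


def apply_per_synonym(func, text):
    """Run func on each synonym separately, as a SynonymFilter-style filter might."""
    return SYNDELIM.join(func(part) for part in text.split(SYNDELIM))


def ignore_parens_sketch(func):
    """Apply func only to text outside (non-nested) parentheses."""
    @wraps(func)
    def wrapper(text):
        # The capturing group keeps parenthesised chunks in the split result at
        # odd indexes; even indexes hold the text outside parentheses.
        chunks = re.split(r'(\([^()]*\))', text)
        return ''.join(
            chunk if index % 2 else func(chunk)
            for index, chunk in enumerate(chunks)
        )
    return wrapper


def increment_fileid_sketch(filename):
    """Turn 'XXX-0.txt' into 'XXX-1.txt'; any extension is accepted."""
    def bump(match):
        return '-{}{}'.format(int(match.group(1)) + 1, match.group(2))
    return re.sub(r'-(\d+)(\.[^.]+)$', bump, filename)


def remove_period(text):
    return text.rstrip('.')


@ignore_parens_sketch
def delimit_synonyms(text):
    return re.sub(r'\s*,\s*', SYNDELIM, text)


# Whole-text filtering only strips the final period ...
print(remove_period('cat.‣feline.'))                      # cat.‣feline
# ... while per-synonym filtering strips one from every synonym.
print(apply_per_synonym(remove_period, 'cat.‣feline.'))   # cat‣feline
# The comma inside the parentheses is left alone.
print(delimit_synonyms('gato, felino (por exemplo, animal)'))
print(increment_fileid_sketch('XXX-0.txt'))               # XXX-1.txt
```

The first two calls show why SimpleFilter is not recommended for operations on the beginning or end of synonym-delimited text: a function applied to the whole string only sees the outermost edges.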