Gary edited this page May 5, 2017 · 5 revisions

Gary's Toolset

Documentation for the tools found in the directory libpython/gary.

Getting Started

Entry objects

Basic usage is to create an Entry object and then apply filters that operate on its fields. An Entry is created with language field names given as parameters. Field names should be valid Python identifiers. For example:

Entry('eng', 'fin') or Entry('eng_000', 'fin_000') or Entry('English', 'Finnish')

An entry can have multiple fields within a language column for adding POS or other metadata.

entry = Entry('eng:pos:mcs')

This creates a single column called eng with two subfields called pos and mcs, in addition to the default text field. The fields are accessed through the column:

entry.eng.text = 'cat'
entry.eng.pos = 'noun'
entry.eng.mcs = 'type:animal'
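
To make the field-spec notation concrete, here is a minimal sketch of how a spec such as 'eng:pos:mcs' could map onto nested attributes. The build_entry function is a hypothetical stand-in written for illustration; it is not the library's Entry class, and the real implementation may differ. It follows the column.subfield access used in the tutorial.

```python
from types import SimpleNamespace

def build_entry(*specs):
    """Hypothetical stand-in for Entry: one namespace per column, each with
    a 'text' field plus any subfields listed after the colons."""
    entry = SimpleNamespace()
    for spec in specs:
        column, *subfields = spec.split(':')
        # Every column gets 'text'; subfields start out empty (an assumption).
        fields = {name: '' for name in ['text', *subfields]}
        setattr(entry, column, SimpleNamespace(**fields))
    return entry

entry = build_entry('eng:pos:mcs', 'fin')
entry.eng.text = 'cat'
entry.eng.pos = 'noun'
entry.eng.mcs = 'type:animal'
```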

Filter objects

An entry can be filtered with different kinds of filters (SimpleFilter, SynonymFilter, etc.). A filter is a callable object that takes an entry as its first parameter and runs a text filter on the specified columns. To create a filter object, specify the function to use and the columns to operate on. If no subfield is specified, the text field is assumed.

def text_filter(some_text):
    # modify text here
    return some_text

filter_object = SimpleFilter(text_filter, 'eng', 'eng.pos')

This example filters the eng.text and eng.pos fields. SimpleFilter makes no assumptions about the structure of the text.
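
The sketch below illustrates the idea of a filter object: a callable that applies one text function to a list of fields, treating a bare column name as shorthand for its text field. ColumnFilter is a hypothetical name used only here; it is an assumption about the general shape, not the library's SimpleFilter.

```python
from types import SimpleNamespace

class ColumnFilter:
    """Hypothetical stand-in for SimpleFilter, not the library's code."""

    def __init__(self, func, *fields):
        self.func = func
        # 'eng' is shorthand for 'eng.text'; 'eng.pos' is kept as given.
        self.fields = [f if '.' in f else f + '.text' for f in fields]

    def __call__(self, entry):
        for field in self.fields:
            column, subfield = field.split('.')
            target = getattr(entry, column)
            setattr(target, subfield, self.func(getattr(target, subfield)))
        return entry

# A plain namespace plays the role of an entry with one column.
entry = SimpleNamespace(eng=SimpleNamespace(text='  cat ', pos=' noun '))
strip_filter = ColumnFilter(str.strip, 'eng', 'eng.pos')
strip_filter(entry)
print(entry.eng.text, entry.eng.pos)  # prints "cat noun"
```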

Tutorial

#!/usr/bin/env python3
from collections import namedtuple
import re
from gary import Entry, increment_fileid, run_filters, ignore_parens, SYNDELIM
from gary.text_filter import pre_process
from gary.entry_filter import SimpleFilter, SynonymFilter, ExtractorFilter, PanlexSynonymFilter

Record = namedtuple('Record', ['eng', 'pos', 'fin', 'jpn', 'deu', 'spa'])

def create_entry(record: Record):
    # fin needs a pos subfield because pos_filter below extracts into fin.pos
    entry = Entry('eng:pos', 'fin:pos', 'jpn', 'deu', 'spa')
    entry.eng.text = record.eng
    entry.eng.pos = record.pos
    entry.fin.text = record.fin
    entry.jpn.text = record.jpn
    entry.deu.text = record.deu
    entry.spa.text = record.spa
    return entry

@ignore_parens
def delimit_synonyms(text):
    # Replace comma separators with the synonym delimiter.
    return re.sub(r'\s*,\s*', SYNDELIM, text)

def remove_period(text):
    return text.rstrip('.')

def extract_pos(text, pos):
    # Split 'kissa (s)' into the text 'kissa' and the POS tag 's'.
    match = re.search(r'^(.*)\s+\((.*)\)$', text)
    if match:
        text = match.group(1)
        pos = match.group(2)
    return text, pos

def do_nothing(text):
    return text

preprocess_filter = SimpleFilter(pre_process, 'eng', 'eng.pos', 'spa', 'fin', 'deu', 'jpn')
synonym_filter = SimpleFilter(delimit_synonyms, 'eng', 'spa')
period_filter = SynonymFilter(remove_period, 'eng')
pos_filter = ExtractorFilter(extract_pos, 'fin.text', 'fin.pos')
dummy_filter = PanlexSynonymFilter(do_nothing, 'fin')

all_filters = [preprocess_filter, synonym_filter, period_filter, pos_filter, dummy_filter]
record_list = [Record('cat.,feline.', 'noun', 'kissa (s)', '猫', 'Katze', 'gato (por ejemplo, animal)')]

if __name__ == '__main__':
    for record in record_list:
        entry = create_entry(record)
        run_filters(all_filters, entry)
        print(entry)
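
The text functions in the tutorial are ordinary Python functions, so they can be tried on their own before being wrapped in filters. The sketch below copies remove_period and extract_pos and runs the comma regex without the ignore_parens decorator. The value of SYNDELIM is an assumption based on the synonym delimiter (‣) shown under "Filter types", and the sample strings are made up for illustration.

```python
import re

SYNDELIM = '‣'  # assumed value; the real constant is imported from gary

def remove_period(text):
    return text.rstrip('.')

def extract_pos(text, pos):
    match = re.search(r'^(.*)\s+\((.*)\)$', text)
    if match:
        text = match.group(1)
        pos = match.group(2)
    return text, pos

with_pos = extract_pos('kissa (s)', '')
without_pos = extract_pos('kissa', '')
print(with_pos)               # ('kissa', 's')
print(without_pos)            # ('kissa', '')
print(remove_period('cat.'))  # cat

# Without @ignore_parens, the comma inside the parentheses is replaced too.
undecorated = re.sub(r'\s*,\s*', SYNDELIM, 'cat, feline (domestic, small)')
print(undecorated)            # cat‣feline (domestic‣small)
```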

Documentation

Filter types

  • SimpleFilter - used when the text has no internal structure; useful for removing illegal characters or otherwise modifying text. Not recommended for operations on the beginning or end of the text if the text is divided into synonyms.
  • SynonymFilter - used when the text is divided into synonyms with the synonym delimiter (‣) or into separate senses with the sense delimiter (⁋).
  • ExtractorFilter - takes two field names; used to extract information from the first field and move it to the second (e.g. extracting a POS tag from the text).
  • PanlexSynonymFilter - used when data is already formatted in the serialized PanLex format (⫷ex⫸ and ⫷df⫸) or when tags need to be added. If there are no ex tags, they will be added.
  • PanlexExtractorFilter - same as ExtractorFilter, but uses PanLex-style tags.

Note: PanlexExtractorFilter does not replace values in the extracted field; it appends them as synonyms. To remove data from the extracted field, use ExtractorFilter.
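
The difference between SimpleFilter and SynonymFilter matters most for functions that touch the start or end of the text. The sketch below contrasts the two behaviours with plain functions. It assumes that a synonym-aware filter splits on the synonym delimiter and applies the function to each item; apply_whole and apply_per_synonym are hypothetical helpers, and the sense delimiter (⁋) is left out for brevity.

```python
SYNDELIM = '‣'  # synonym delimiter named in the list above

def remove_period(text):
    return text.rstrip('.')

def apply_whole(func, text):
    """Run func once on the whole string (SimpleFilter-like behaviour)."""
    return func(text)

def apply_per_synonym(func, text, delim=SYNDELIM):
    """Run func on each synonym separately (assumed SynonymFilter-like behaviour)."""
    return delim.join(func(part) for part in text.split(delim))

text = 'cat.‣feline.'
print(apply_whole(remove_period, text))        # cat.‣feline
print(apply_per_synonym(remove_period, text))  # cat‣feline
```

Applied to the whole string, only the final period is removed; applied per synonym, each item is cleaned.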

Miscellaneous functions

  • ignore_parens decorator - add to a function to keep it from processing text inside parentheses (e.g. when converting commas to the synonym delimiter).
  • increment_fileid(filename) - takes a filename in the format XXX-0.txt and increments the number (the file does not need to be a .txt file).
  • pretag_ex(text[, uid]) - tags the text with the ⫷ex⫸ tag.
  • pretag_df(text) - tags the text with the ⫷df⫸ tag.
  • default_str(text) - converts None to an empty string; otherwise returns the value unchanged.
  • get_plx_fields(text) - extracts a triplet of ex tag, text, and property or classification data.
  • append_synonym(text, new_element) - appends a new item to a list of items delimited by (‣).
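
As an illustration of the behaviour described in the list above, here are sketch re-implementations of four of these helpers. They are written from the descriptions only and are not the library's code. Details such as how an empty list is handled in append_synonym, or that nested parentheses are not supported in ignore_parens, are assumptions.

```python
import functools
import re

SYNDELIM = '‣'

def default_str(text):
    """Convert None to an empty string; return anything else unchanged."""
    return '' if text is None else text

def append_synonym(text, new_element, delim=SYNDELIM):
    """Append new_element to a delimited list (assumption: an empty list
    gets no leading delimiter)."""
    return new_element if not text else text + delim + new_element

def increment_fileid(filename):
    """Increment the number after the last hyphen: 'XXX-0.txt' -> 'XXX-1.txt'."""
    return re.sub(
        r'-(\d+)(\.[^.]*)?$',
        lambda m: '-{}{}'.format(int(m.group(1)) + 1, m.group(2) or ''),
        filename,
        count=1,
    )

def ignore_parens(func):
    """Apply func only to the parts of the text outside parentheses."""
    @functools.wraps(func)
    def wrapper(text):
        # The capturing group keeps each parenthesised chunk in the result,
        # at odd indexes; those chunks are passed through untouched.
        parts = re.split(r'(\([^()]*\))', text)
        return ''.join(part if i % 2 else func(part)
                       for i, part in enumerate(parts))
    return wrapper

@ignore_parens
def delimit(text):
    return re.sub(r'\s*,\s*', SYNDELIM, text)

print(increment_fileid('XXX-0.txt'))             # XXX-1.txt
print(append_synonym('cat', 'feline'))           # cat‣feline
print(delimit('cat, feline (domestic, small)'))  # cat‣feline (domestic, small)
```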