# SystemT Report

Author(s): Zuozhi Wang (zuozhiw AT uci DOT edu)
Reviewer(s): Chen Li (chenli AT gmail DOT com) and Team 6

Conducted by Zuozhi Wang
Advised by Professor Chen Li and PhD student Jamshid Esmaelnezhad
January 2016 - March 2016
University of California, Irvine
SystemT is a text analytics product from IBM for information extraction. Unlike traditional grammar-based or machine-learning-based methods, SystemT takes a new algebraic approach. It provides the Annotation Query Language (AQL), a SQL-like declarative query language, to extract structured data from raw text. It uses relational operators, as well as span extraction and aggregation operators, to build complex models.
During the winter quarter of 2016 at UC Irvine, we learned to use SystemT and evaluated its performance. Furthermore, we experimented with using Lucene and Russ Cox's regex algorithm to pre-process queries.
SystemT provides several ways to access its features:
- Web interface: install IBM BigInsights on VirtualBox; it requires 16 GB of RAM.
- Java API: use the SystemT API in a Java program.
Here are some basic AQL elements: regular expressions, dictionaries, and patterns. For AQL tutorials, please see the References, Tutorials, and Papers sections.
Extract regular expressions.
The Document view is a special view that represents the current document.
```
create view DateFormat as
extract regex /(\d|0\d|1[0-2])\/(\d|[0-2]\d|3[0-1])\/(19\d{2}|2\d{3}|\d{2})/
    on D.text as Date
from Document D;
```
Extract dictionaries.
Here a dictionary is created from a local file. It can also be an inline dictionary.
```
create dictionary symptom_dict
    from file '../../../resources/dictionaries/WebMD_symptoms.txt';

create view Symptoms as
extract dictionary 'symptom_dict'
    on D.text as SymptomName
from Document D;
```
Extract patterns.
This query extracts patterns where disease names and symptom names are close to each other (i.e., 0 to 20 tokens apart).
```
create view DiseaseRelateSymptom as
extract pattern <D.DiseaseNames> <Token>{0,20} <S.SymptomName>
    as match
from Diseases D, Symptoms S;
```
Access the SystemT web page for the latest instructions on how to get a copy of SystemT.
All the code we wrote related to SystemT is available in this GitHub repo.
To use SystemT, check the code in the MedExtraction/src/extractor folder. Extractor.java is a wrapper that makes all the API calls and simplifies the process. MyExtraction.java is a sample program that uses the Extractor. Please see the comments in the code for details.
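As a rough illustration, a driver program built on the wrapper might look like the sketch below. The constructor argument, the script path, and the `extract` method are hypothetical stand-ins for whatever Extractor.java actually exposes; see the repo for the real interface.

```java
// Hypothetical usage sketch of the Extractor wrapper. The constructor
// argument and the extract() method name are illustrative stand-ins;
// see MedExtraction/src/extractor/Extractor.java for the real interface.
import java.util.List;

public class MyExtractionSketch {
    public static void main(String[] args) throws Exception {
        // Compile the AQL script once, then reuse it for every document.
        Extractor extractor = new Extractor("textAnalytics/src/symptoms.aql");

        List<String> docs = List.of("Patient reports fever and headache.");
        for (String doc : docs) {
            // Each returned string is one extracted span, e.g. a symptom name.
            for (String match : extractor.extract(doc)) {
                System.out.println(match);
            }
        }
    }
}
```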
The dataset we used is mostly iPubMed data. The dictionary files were downloaded by Python crawlers we wrote; the code is in the med_crawlers folder in the GitHub repo. Please contact us to get the data and dictionary files.
In the MedExtraction/textAnalytics/src folder, there are some AQL scripts written by me and another student (Fan Mo).
The code in the MedExtraction/src/preprocessor and MedExtraction/src/test folders is not directly related to SystemT. If you are interested, see the Experiments in preprocessing dictionary and regex queries section.
Tests were run on a 2013 MacBook Air with a single thread.
SystemT's running time on extracting a small dictionary of 400 entries:
| # of docs | SystemT execution time (secs) |
|---|---|
| 10K | 9.54 |
| 50K | 35.84 |
| 100K | 97.16 |
| 250K | 255.29 |
| 400K | 397.13 |
SystemT's running time on extracting datetime regexes:
| # of docs | SystemT execution time (secs) |
|---|---|
| 10K | 0.97 |
| 20K | 1.17 |
| 50K | 2.70 |
| 100K | 5.30 |
| 200K | 10.41 |
| 300K | 15.90 |
| 400K | 21.32 |
| 500K | 26.24 |
Please note that this section describes our own experiments on preprocessing dictionary and regex queries; it is not part of SystemT itself.
In this big data era, many information extraction tasks can draw on a huge amount of data. For example, our iPubMed dataset has over 26 million records and is a valuable resource for information extraction. SystemT processes each individual document efficiently, but if we fed all 26 million documents to SystemT, it would take more than 6 hours to perform even a simple extraction task. This led us to the following idea: if we pre-process a query, we only need to feed SystemT the documents relevant to that query.
A SystemT query can be based on dictionaries and regexes; complex extraction models are built on these two fundamental elements. So we ran experiments that filter the whole dataset based on dictionaries and regexes. The code is available in the preprocessor folder.
The first step is to find the dictionaries and regular expressions in the AQL file. The scanner we wrote is relatively simple: it only extracts dictionaries and regexes and does not parse the whole AQL file, so if a query uses anything other than regexes and dictionaries, the preprocessing could be completely wrong. Writing a complete AQL parser would take a lot of effort; for this experiment, the simple scanner is enough.
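To make the idea concrete, here is a minimal sketch of such a scanner, assuming the AQL follows the simple `create dictionary ... from file '...'` and `extract regex /.../ ` forms shown earlier; escaped slashes inside regexes are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A minimal sketch of the simple scanner described above: it pulls
// dictionary file paths and regex literals out of AQL source text with
// two regular expressions, without parsing the rest of the language.
public class AqlScanner {
    // Matches: create dictionary <name> from file '<path>'
    private static final Pattern DICT = Pattern.compile(
        "create\\s+dictionary\\s+\\w+\\s+from\\s+file\\s+'([^']+)'",
        Pattern.CASE_INSENSITIVE);
    // Matches: extract regex /<pattern>/  (escaped slashes not handled)
    private static final Pattern REGEX = Pattern.compile(
        "extract\\s+regex\\s+/([^/]+)/", Pattern.CASE_INSENSITIVE);

    static List<String> scan(String aql, Pattern p) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(aql);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String aql = "create dictionary symptom_dict\n"
                   + "from file 'WebMD_symptoms.txt';";
        System.out.println(scan(aql, DICT));   // [WebMD_symptoms.txt]
        System.out.println(scan(aql, REGEX));  // []
    }
}
```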
I used Lucene to build an index on the whole dataset, found the documents that contain entries from the dictionary, and fed only those filtered documents to SystemT. In the following figures, the yellow line is the original time when all documents are fed into SystemT; the blue line is the total running time of building the index, searching, and feeding the filtered documents to SystemT; and the red line excludes index-building time, since building the index is a one-time effort.
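The sketch below shows the shape of this pipeline, assuming a single stored `text` field and one phrase query per dictionary entry; Lucene versions differ slightly in API details, and the field name and index path here are illustrative.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

public class DictionaryPrefilter {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(Paths.get("lucene-index"));
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // One-time effort: index the text of every document.
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new TextField("text", "patient reports fever and chills",
                              Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Query the index once per dictionary entry; only the matching
        // documents would then be fed to SystemT.
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parser = new QueryParser("text", analyzer);
        for (String entry : new String[] {"fever", "headache"}) {
            for (ScoreDoc hit :
                    searcher.search(parser.parse("\"" + entry + "\""), 1000).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("text"));
            }
        }
        reader.close();
    }
}
```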
Here's the performance result of a small dictionary of 460 entries. Using Lucene for preprocessing is consistently a lot faster than the original SystemT.
Here's the performance result of a large dictionary of over 40,000 entries. Lucene starts to slow down because there is no index on the dictionary itself, so it has to query the document index once for each entry. But it is still faster than SystemT.
With Lucene, we built an index on the input data but not on the dictionary. It could potentially be even faster to build indexes on both the data and the dictionary and use an efficient algorithm to match the two indexes.
Google Code Search was one of the few online search tools that supported regular expressions. The technique it used to perform efficient regular expression matching was long unknown; after the service was shut down in 2011, Russ Cox wrote an article explaining the algorithm in 2012. Google Code Search is now also open source.
We used Google Code Search to build an index and perform regular expression matching. Since it is written in the Go language, we had to make system calls from our Java program to run it.
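A minimal sketch of those system calls is below; it assumes the `cindex` and `csearch` binaries from the codesearch project are on the PATH, and the document directory is illustrative.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Sketch of invoking the Go codesearch tools from Java. `cindex <dir>`
// builds the trigram index (a one-time effort); `csearch -l <regex>`
// prints the names of the files whose contents match the regex.
public class CodeSearchFilter {
    static List<String> matchingFiles(String regex) throws Exception {
        Process p = new ProcessBuilder("csearch", "-l", regex).start();
        List<String> files = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                files.add(line);
            }
        }
        p.waitFor();
        return files;
    }

    public static void main(String[] args) throws Exception {
        // Build the index over the document files (illustrative path).
        new ProcessBuilder("cindex", "docs/").inheritIO().start().waitFor();
        // Only the matching files would then be fed to SystemT.
        System.out.println(matchingFiles("(19\\d{2}|2\\d{3}|\\d{2})"));
    }
}
```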
Here's the performance result of feeding all the data to Google Code Search. It turns out that Google Code Search takes much more time than running SystemT alone. Searching only the filtered documents does cut SystemT's own execution time by roughly a factor of 4, but SystemT is already fast enough at matching regular expressions that the filtering cost dominates.
However, the Google Code Search program builds its index on the file system, so we had to split the one file containing 500K records into 500K small files. We suspected that feeding too many files to the program at once might hurt performance, so we split the dataset into chunks of 50K files and fed Google Code Search one chunk at a time. Here are the performance results: it is much faster, although still slower than SystemT.
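For reference, the splitting step itself is simple. Here is a minimal sketch, assuming one record per line in the input file; the file and directory names are illustrative.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Split one large file (one record per line) into many small files,
// grouped into directories of 50K files each.
public class RecordSplitter {
    public static void main(String[] args) throws Exception {
        final int CHUNK = 50_000;
        int i = 0;
        try (BufferedReader in = new BufferedReader(new FileReader("records.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                Path dir = Paths.get("chunks", "chunk-" + (i / CHUNK));
                Files.createDirectories(dir);
                Files.write(dir.resolve("doc-" + i + ".txt"), line.getBytes());
                i++;
            }
        }
    }
}
```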
Our initial experiments with SystemT have helped us gain insight into this field. Our TextDB project has just started, and I am proudly responsible for the regex matching part. We are doing more research on the Google Code Search program and, more generally, on regular expression matching with indexes. We believe we can find ways to make it much faster. For the latest updates, please visit the CS290 2016S Task: Regex Matcher wiki page.
We want to thank IBM for providing their software package free of charge for educational purposes, and the SystemT team for their great help throughout this project.