-
Notifications
You must be signed in to change notification settings - Fork 75
Regex Matcher
zuozhiw edited this page Aug 29, 2017
·
2 revisions
Author(s): Zuozhi Wang, Shuying Lai
Reviewer(s): Chen Li REVIEWED
Implement an operator to use an index-based method based on gram inverted index to support efficient regular expressions on large datasets. Our implementation is based on Russ Cox's algorithm and the corresponding open source implementation.
As of 6/14/2016: FINISHED
edu.uci.ics.texera.dataflow.regexmatch
com.google.re2j
https://github.com/Texera/texera/issues/30 and https://github.com/Texera/texera/issues/99
Our approach has the following steps:
- Build a gram index for documents using Lucene.
- Translate a regular expression to a boolean expression using Russ Cox's algorithm.
- Run the Boolean query on Lucene to compute candidate documents.
- Postprocess these candidate documents to remove false positives.
[Presentation] (https://docs.google.com/presentation/d/1F3Xboeb_azHSjWbJ2Cl36kGHpIeo_6-lI24XwXjq_hA/edit?usp=sharing) (Team 3)
Machine setting: Macbook Air (mid-2013), Intel Core i7 (4650U), SSD hard drive, 8GB memory.
- Data set: 1 million Medline records, about 1.53 GB
- Regex:
\bmedic(ine|al|ation|are|aid)?\b
("medicine" or "medical" or "medication" or "medicare" or "medicaid") - Performance results (average time reported in seconds):
Index Based | Scan Based | |
---|---|---|
Regex Matcher | 7.43 | 67.65 |
RE2J | 11.88 | 99.05 |
- Data set: 5 million Medline records, about 7.8 GB.
- Regex:
\bmedic(ine|al|ation|are|aid)?\b
("medicine" or "medical" or "medication" or "medicare" or "medicaid") - Performance results (average time reported in seconds):
Index Based | Scan Based | |
---|---|---|
Regex Matcher | 28.11 | 330.754 |
RE2J | 44.72 | 506.34 |
- Design and implement algorithm to compare two boolean expressions. The basic idea is to format boolean expressions into DNF, and check the equivalence of two DNFs. There is a PR for this task: https://github.com/Texera/texera/pull/127 FINISHED
- Use token position information to generate a more selective query to reduce the number of candidate documents.