Xuxi Pan edited this page Jul 18, 2016
Spring 2016, Department of Computer Science, UC Irvine
- Instructor: Prof. Chen Li
- Lecture time: Mondays 4-5:30 pm, DBH 4011
- Office Hours: Mondays 3-4 pm, DBH 2092 (Email confirmation needed)
Goals:
- Gain hands-on experience building a system that manages large amounts of text information
- Study research challenges related to text and data management
- Form teams to do a group project; learn tools and skills to manage a software project.
Schedule
No. | Date | Topics | Todos |
---|---|---|---|
01 | 03/28/2016 | Introduction, SystemT Overview (by Instructor and Zuozhi) | Bid on tasks, form teams, github warmup |
02 | 04/04/2016 | Task assignments, [Lucene Overview](https://docs.google.com/presentation/d/1P9HUFFW72ogqdEZf07r5Y7_gM9JK6Wu8UVgH0bGNkF0/edit?usp=sharing) (by team 1) | Lucene sample program, design phase |
03 | 04/11/2016 | ScanOperator (team 1), Data Store (team 1), Development environment (team 2), progress report (all teams) | Design phase, operator interface, test cases |
04 | 04/18/2016 | Token-based fuzzy operator (Team 5), progress report (all teams) | Operator interface, test cases |
05 | 04/25/2016 | [Stanford NLP](https://docs.google.com/presentation/d/1ek18Zr0OqQ0RONj8D7W2aSGs9sz1etnf9bEnWTEA2ag/edit?usp=sharing) (Team 7), progress report (all teams) | Test cases, Implementation |
06 | 05/02/2016 | [Regex Matching](https://docs.google.com/presentation/d/1F3Xboeb_azHSjWbJ2Cl36kGHpIeo_6-lI24XwXjq_hA/edit#slide=id.g12e478a39d_0_10) (Team 3), progress report (all teams) | Implementation |
07 | 05/09/2016 | [Fuzzy Tokenizer] (Foobar) (Team 2), progress report (all teams) | Implementation, Documentation |
08 | 05/16/2016 | Progress report (all teams) | Finishing Implementation, Starting Documentation |
Course schedule:
- Meet weekly with talks and project discussions;
- Form teams to work together;
- Evaluate existing software packages;
- Design and implement a text-centric data-management system.
Prerequisites:
- Hands-on system-building experience;
- Familiarity with Java and C/C++;
- Desire to learn, read existing software, and build systems;
- Eager to solve open problems;
- (Optional but a big plus) Have taken CS222 or CS221.
Commitment: 10 hours per week, 2 units
Software Tools:
- Java
- Maven
- Git
- Wiki
- Issue tracking
- Jenkins
Tasks (Welcome to propose your own):
- Support dictionary-based search on documents (using Lucene)
- Build gram-based inverted index (using Lucene)
- Support fuzzy search with gram index (using Lucene)
- Support regex search with gram index (using Lucene)
- Develop a query processor
- Write a parser and translator from a SystemT query to a TextDB query
- (Optional) Design a declarative query language TextSQL and write a parser
- (Optional) Include an embedded DB (Derby) and store query results
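To make the gram-index tasks above more concrete, here is a minimal sketch of a gram-based inverted index in plain Java. It is only an illustration under stated assumptions: the class `GramIndexSketch` and its methods are hypothetical names, it uses in-memory collections rather than Lucene, and it is not the project's actual design. It shows the usual two-step pattern: intersect the posting lists of the query's grams to get candidates, then verify each candidate against the original text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of a gram-based inverted index; not TextDB or Lucene code. */
final class GramIndexSketch {
    private final int gramLength;
    private final List<String> documents = new ArrayList<>();
    // Maps each gram to the sorted ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    GramIndexSketch(int gramLength) {
        if (gramLength < 1) {
            throw new IllegalArgumentException("gramLength must be positive");
        }
        this.gramLength = gramLength;
    }

    /** Splits text into its distinct overlapping grams, lower-cased. */
    static Set<String> grams(String text, int n) {
        Set<String> result = new LinkedHashSet<>();
        String lower = text.toLowerCase();
        for (int i = 0; i + n <= lower.length(); i++) {
            result.add(lower.substring(i, i + n));
        }
        return result;
    }

    /** Adds a document to the index and returns its id. */
    int add(String text) {
        int id = documents.size();
        documents.add(text);
        for (String gram : grams(text, gramLength)) {
            postings.computeIfAbsent(gram, g -> new TreeSet<>()).add(id);
        }
        return id;
    }

    /** Returns ids of documents that contain the query as a case-insensitive substring. */
    List<Integer> search(String query) {
        Set<Integer> candidates = null;
        for (String gram : grams(query, gramLength)) {
            Set<Integer> ids = postings.getOrDefault(gram, new TreeSet<>());
            if (candidates == null) {
                candidates = new TreeSet<>(ids);
            } else {
                candidates.retainAll(ids);
            }
        }
        // A query shorter than one gram cannot use the index: fall back to a full scan.
        if (candidates == null) {
            candidates = new TreeSet<>();
            for (int id = 0; id < documents.size(); id++) {
                candidates.add(id);
            }
        }
        // Verification: shared grams need not be contiguous, so confirm each candidate.
        List<Integer> matches = new ArrayList<>();
        String needle = query.toLowerCase();
        for (int id : candidates) {
            if (documents.get(id).toLowerCase().contains(needle)) {
                matches.add(id);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        GramIndexSketch index = new GramIndexSketch(3);
        index.add("Lucene supports keyword search");
        index.add("Flamingo supports fuzzy search");
        index.add("RE2 handles regular expressions");
        System.out.println(index.search("search")); // prints [0, 1]
        System.out.println(index.search("fuzzy"));  // prints [1]
    }
}
```

A fuzzy or regex operator could plausibly reuse the same candidate-then-verify structure, relaxing the intersection (for example, requiring only some of the query grams to match) and replacing the substring check with an edit-distance or regex test.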
Related Projects:
- Lucene on keyword search (Java)
- Flamingo (UCI) on fuzzy search (C++)
- RE2 on index-based regex (C++)
- SystemT (IBM) on information extraction (Java)
- Stanford NLP on natural language processing (Java)
Project Management:
- Form teams to do tasks. Each team has 1 or 2 members;
- Write test cases first;
- If possible, start with the simplest solution (even if it's scan-based), then develop a more advanced one;
- Be prepared to make adjustments during the course of the project.
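As an illustration of the "test cases first, simplest solution first" guidance above, the following sketch shows what a scan-based baseline for dictionary matching might look like. The class `ScanDictionaryMatcher`, the `match` method, and the sample dictionary are hypothetical and exist only for this example; the expectations in `main` stand in for the test cases a team would write before optimizing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical scan-based baseline for dictionary matching; not TextDB code. */
final class ScanDictionaryMatcher {
    /** Returns the dictionary entries that occur in the document, in dictionary order. */
    static List<String> match(String document, List<String> dictionary) {
        String haystack = document.toLowerCase();
        List<String> found = new ArrayList<>();
        for (String entry : dictionary) {
            // Plain substring scan: slow on large inputs, but easy to verify.
            if (!entry.isEmpty() && haystack.contains(entry.toLowerCase())) {
                found.add(entry);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("lucene", "derby", "jenkins");
        // These expectations are written first and act as the initial test cases.
        List<String> result = match("Store Lucene results in Derby", dictionary);
        if (!result.equals(Arrays.asList("lucene", "derby"))) {
            throw new AssertionError("unexpected matches: " + result);
        }
        System.out.println(result); // prints [lucene, derby]
    }
}
```

Once an index-based operator exists, the same expectations can be run against it, with the scan-based version serving as a reference for correct output.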
Project Protocol:
- Do not add large files to Git. Check GitHub's guidance for details.
- Write high-quality code.
- Do high-quality peer reviews.
- Write good documentation on the GitHub wiki. Each wiki page lists its authors and reviewers with their email addresses.
- Drawing diagrams: Use Google Drawings. Add diagram source files to Google Drive and change the ownership to "textdbproject AT gmail.com". Add authors to each diagram, and include the source file link on the wiki. Here is an example.
- Use the "sandbox/" folder on git for your own experiments. Name your folder under "sandbox/" using the format "[firstname]-[lastname]" (all lower case).
- Use GitHub Issues to manage tasks and bugs.
Sandeep Reddy Madugala | Rajesh Yarlagadda | Sudeep Meduri |
Kishore Narendran | Shiladitya Sen |
Zuozhi Wang | Shuying |
Akshay Jain | Prakul Agarwal |
Varun Bharill | Parag Sarogi |
Jinggang Diao | Flavio Bayer | Qing Tang |
Feng Hong | Yang Jiao |