xTraktor

Structured data extractor for the modern World Wide Web.

What is this?

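The assignment covers 3 approaches to structured data extraction: regular expressions, XPath, and a RoadRunner-like approach. Only the first two are implemented here.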

Usage is demonstrated on sample pages from 3 websites: overstock.com, rtvslo.si and avto.net. We gathered two pages from each website.

This is the second assignment in the Web information extraction and retrieval course.

Setup

[Optional] Create a virtualenv and activate it.

$ virtualenv --python=python3 --system-site-packages wiervenv
$ source wiervenv/bin/activate

Install required dependencies.

$ pip3 install -r requirements.txt

Install in dev mode.

$ python3 setup.py develop

Running the parser

implementation/ contains the regular-expression-based (regex.py) and XPath-based (xpath.py) approaches. The RoadRunner-like approach is not implemented. Running either file produces JSON output for the pages in the input/ folder.

Assuming you are inside the implementation/ directory:

$ python3 regex.py
$ python3 xpath.py
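
To illustrate what the two approaches do, here is a minimal sketch that extracts the same records from a tiny page with a regular expression and with XPath, then prints them as JSON. It is not the code from regex.py or xpath.py: the sample page, field names and function names are assumptions made for this example, and the XPath part uses the limited XPath subset of the standard library's xml.etree.ElementTree, which only handles well-formed markup.

```python
import json
import re
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical, well-formed sample page (the real inputs live in input/).
SAMPLE_PAGE = """<html><body>
<div class="item"><h2 class="title">Silver Ring</h2><span class="price">$19.99</span></div>
<div class="item"><h2 class="title">Gold Chain</h2><span class="price">$149.00</span></div>
</body></html>"""

# One pattern per record: named groups become the JSON field names.
ITEM_PATTERN = re.compile(
    r'<h2 class="title">(?P<title>.*?)</h2>\s*'
    r'<span class="price">(?P<price>.*?)</span>',
    re.DOTALL,
)


def extract_with_regex(page: str) -> List[Dict[str, str]]:
    """Scan the raw markup and collect one dict per pattern match."""
    return [match.groupdict() for match in ITEM_PATTERN.finditer(page)]


def extract_with_xpath(page: str) -> List[Dict[str, str]]:
    """Parse the page into a tree and select the fields by path."""
    root = ET.fromstring(page)
    records = []
    for item in root.findall(".//div[@class='item']"):
        records.append({
            "title": item.findtext("h2[@class='title']"),
            "price": item.findtext("span[@class='price']"),
        })
    return records


if __name__ == "__main__":
    print(json.dumps(extract_with_regex(SAMPLE_PAGE), indent=2))
    print(json.dumps(extract_with_xpath(SAMPLE_PAGE), indent=2))
```

Both functions return the same list of records for this page. The regex version depends on the exact textual layout of the markup, while the XPath version depends on the element structure, which is the main trade-off between the two approaches.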

Project structure

.
├── input/               # web pages used to test the approaches
├── output/              # JSON outputs generated by the methods
├── implementation/      # source code of our implemented approaches
└── report.pdf           # final report PDF

2019, Jaka Stavanja, Matej Klemen & Andraž Povše