🚜 Get data from websites in countless ways 🚜

jstavanja/xTraktor

xTraktor

Structured data extractor for the modern world wide web.

What is this?

Three approaches to structured data extraction are covered: regular expressions, XPath, and a RoadRunner-like approach. Only the first two are implemented.

Usage is demonstrated on sample pages from three websites: overstock.com, rtvslo.si and avto.net. We gathered two pages from each website.

This is the second assignment for the Web Information Extraction and Retrieval course.

Setup

[Optional] Create a virtualenv and activate it.

$ virtualenv --python=python3 --system-site-packages wiervenv
$ source wiervenv/bin/activate

Install required dependencies.

$ pip3 install -r requirements.txt

Install in dev mode.

$ python3 setup.py develop

Running the parser

implementation/ contains the regular-expression-based (regex.py) and XPath-based (xpath.py) approaches. The RoadRunner-like approach is not implemented. Running either file produces JSON output for the pages in the input/ folder.

Assuming you are inside the implementation/ directory:

$ python3 regex.py
$ python3 xpath.py
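
To illustrate how the two implemented approaches differ, here is a minimal sketch that extracts the same records from one snippet, first with a regular expression and then with XPath-style queries. It is not the code in regex.py or xpath.py: the sample markup, the field names (title, price) and the function names are hypothetical, and the standard library's xml.etree.ElementTree (which supports only a subset of XPath and requires well-formed markup) stands in for whatever parser the project actually uses.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup. The real pages in input/ are
# full HTML documents and would likely need a more tolerant HTML parser.
SAMPLE = """<html><body>
<div class="item"><h2 class="title">Silver Ring</h2><span class="price">$19.99</span></div>
<div class="item"><h2 class="title">Gold Chain</h2><span class="price">$249.00</span></div>
</body></html>"""

# Regex approach: describe the markup around each field as a text pattern.
ITEM_RE = re.compile(
    r'<h2 class="title">(?P<title>.*?)</h2>\s*'
    r'<span class="price">(?P<price>.*?)</span>',
    re.DOTALL,
)


def extract_with_regex(html):
    """Return one dict per match, keyed by the named groups."""
    return [match.groupdict() for match in ITEM_RE.finditer(html)]


def extract_with_xpath(html):
    """Parse the markup into a tree, then select nodes by path."""
    root = ET.fromstring(html)
    records = []
    for item in root.findall(".//div[@class='item']"):
        records.append({
            "title": item.findtext("h2[@class='title']"),
            "price": item.findtext("span[@class='price']"),
        })
    return records


if __name__ == "__main__":
    # Both approaches emit the same JSON structure for this snippet.
    print(json.dumps(extract_with_regex(SAMPLE), indent=2))
    print(json.dumps(extract_with_xpath(SAMPLE), indent=2))
```

The regex version depends on the exact text of the markup, so any change in attribute order or whitespace can break it. The XPath version depends only on the tree structure, at the cost of needing a parser that accepts the page.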

Project structure

.
├── input/               # web pages used to test the approaches
├── output/              # JSON outputs generated by the methods
├── implementation/      # source code of our implemented approaches
└── report.pdf           # final report (PDF)

2019, Jaka Stavanja, Matej Klemen & Andraž Povše