Repository for the first assignment for Web Data Processing.

Assignment 1a

Alex Antonides - 2693298 - [email protected]

Eoan O'Dea - 2732791 - [email protected]

Tom Corten - 2618068 - [email protected]

Max Wassenberg - 2579797 - [email protected]

Design Choices and Rationale

We started the assignment by focusing on the problems described in the starter code file. We found two ways to clean the HTML. The first, simple one was to remove all HTML tags with a regular expression; however, this left CSS/JS code behind. The second, the one we use now, parses the HTML with the BeautifulSoup library and reads only the <p> and <h1> tags. We could have read anchor links as well, but most of the time these anchors were wrapped in paragraph tags, causing duplicate results.
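A minimal sketch of this cleaning step (the function name and sample HTML are illustrative, not taken from the repository):

```python
from bs4 import BeautifulSoup

def clean_html(html: str) -> str:
    """Keep only the text inside <p> and <h1> tags, dropping CSS/JS.

    Anchor text inside a paragraph is picked up via the paragraph itself,
    which is why reading <a> tags separately would duplicate results.
    """
    soup = BeautifulSoup(html, "html.parser")
    parts = [tag.get_text(" ", strip=True) for tag in soup.find_all(["h1", "p"])]
    return "\n".join(p for p in parts if p)

html = """<html><head><style>body { color: red; }</style></head>
<body><h1>Title</h1><p>First <a href="#">link</a> paragraph.</p>
<script>var x = 1;</script></body></html>"""
print(clean_html(html))  # style/script content is gone, anchor text survives
```

Because `find_all` walks the document in order, the output preserves the reading order of the page.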

For the second problem in the starter code, recognizing the entities within the text, we evaluated several packages: NLTK, spaCy, Stanza, BERT, and Flair. However, we were unhappy with Flair's processing time; furthermore, BERT only returned unigrams, and Stanza returned faulty results.
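As an illustration of the spaCy route, the sketch below extracts entities and drops repeated mentions; the label set, helper names, and the `en_core_web_sm` model name are assumptions, not necessarily what the repository uses:

```python
RELEVANT = {"PERSON", "ORG", "GPE", "LOC"}  # spaCy entity labels kept here

def dedupe_entities(pairs):
    """Keep (text, label) pairs with a relevant label, dropping duplicates
    while preserving order -- the same mention often occurs several times
    on one page."""
    seen, out = set(), []
    for pair in pairs:
        text, label = pair
        if label in RELEVANT and pair not in seen:
            seen.add(pair)
            out.append(pair)
    return out

def extract_entities(text, model="en_core_web_sm"):
    """Run spaCy NER over the cleaned page text."""
    import spacy  # imported lazily so dedupe_entities works without spaCy
    nlp = spacy.load(model)
    return dedupe_entities((ent.text, ent.label_) for ent in nlp(text).ents)
```

The deduplication mirrors the duplicate-results issue mentioned for the HTML cleaning: downstream linking only needs each surface form once.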

For the final problem, we used ElasticSearch and Trident to refine the results, and we added multithreading to speed up processing. The optimal candidate is chosen as the entity with the most objects. We also tried an improvement: creating "formats" for person, location, and organisation Wikidata pages. By choosing 20 random instances of human, human settlement, and enterprise and looking at their overlap, we tried to build a profile of a generic person/organisation/location. For every entity that falls into one of these categories, the results from ElasticSearch are compared to these formats.
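The ranking can be sketched as below. This is a simplification under stated assumptions: the property IDs in `PERSON_FORMAT` are illustrative, and in the real pipeline the object counts and property sets would come from ElasticSearch/Trident queries rather than being passed in directly:

```python
# Illustrative "format": Wikidata properties a sampled set of humans might
# share (instance of, sex or gender, date of birth, given name).
PERSON_FORMAT = {"P31", "P21", "P569", "P735"}

def format_score(props, fmt):
    """Fraction of the format's properties that the candidate page has."""
    return len(props & fmt) / len(fmt)

def best_candidate(candidates, fmt=None):
    """candidates maps wikidata_id -> (object_count, property_set).

    Rank primarily by number of objects; when a format applies, use the
    overlap with the format as a tie-breaker."""
    def key(qid):
        objects, props = candidates[qid]
        return (objects, format_score(props, fmt) if fmt else 0.0)
    return max(candidates, key=key)

candidates = {
    "Q1": (10, {"P31", "P21"}),
    "Q2": (10, {"P31", "P21", "P569", "P735"}),
    "Q3": (3, {"P31"}),
}
```

With the format applied, `Q2` wins the tie against `Q1` because it matches more of the generic-person profile.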

Getting Started

These instructions will get a copy of the project up and running on your local machine for development and testing purposes. See Deployment for notes on how to deploy the project on a live system.

Prerequisites

1. Install the following software:

Installing

A step-by-step series of examples that tells you how to get a development environment running.

1. Update submodules

After cloning the repository, navigate to the project folder and run the following command [Mac and Linux only]:

sh setup.sh

or run the commands manually:

pip3 install --upgrade pip
pip3 install --user -U nltk
pip3 install beautifulsoup4
pip3 install spacy
python3 -m spacy download en
pip3 install stanza
2. Start the ElasticSearch server

Run the following command:

sh start_elasticsearch_server.sh

Deployment

Once the packages have been installed, the project will be ready for deployment.

Local

1. Execute the shell script
sh run_example.sh
