Web structure similarity calculation using three methods:

Movie Pirates of the Caribbean: Exploring Illegal Streaming Cyberlockers
The 12th International AAAI Conference on Web and Social Media (ICWSM)

This repository contains the HTML page similarity algorithms used in the technical paper. As extras, it also includes the correlation matrices and heatmaps.

Web structure similarity calculation using three methods:

To understand what each method does, see Gottron's paper "Clustering Template Based Web Documents", ECIR 2008.

TV algorithm

The tag vector method counts how many times each possible tag appears. This converts a document D into a vector v(D) of fixed dimension N, since the number of possible tags is limited.
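The counting step can be illustrated with a minimal sketch using only the Python standard library. The names `TagCollector`, `tag_vector` and `cosine_similarity`, and the small example vocabulary, are hypothetical and not taken from this repository. Cosine similarity is used here only as one convenient way to compare two vectors; the paper and the code in src may use a different measure.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of opening tags in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_vector(html, vocabulary):
    """Return v(D): one count per tag in the fixed vocabulary."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    counts = Counter(parser.tags)
    return [counts[tag] for tag in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Illustrative vocabulary; a real one would list every possible tag.
VOCABULARY = ["html", "body", "div", "p", "a"]
page_1 = "<html><body><div><p>x</p><p>y</p></div></body></html>"
page_2 = "<html><body><div><a href='#'>z</a></div></body></html>"
v1 = tag_vector(page_1, VOCABULARY)
v2 = tag_vector(page_2, VOCABULARY)
similarity = cosine_similarity(v1, v2)
```

Because the vocabulary is fixed, every document maps to a vector of the same length N, so any two pages can be compared directly regardless of their size.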

LCTS

The Longest Common Tag Subsequence method expresses the distance between two documents in terms of their longest common tag subsequence. Note that we use difflib's SequenceMatcher to find the contiguous matching blocks between the two tag sequences, and from those blocks we calculate the longest sequence of tags that appear in the same order in both original sequences.
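The SequenceMatcher step described above can be sketched as follows. This is an illustration under assumptions, not the repository's code: the function names are hypothetical, the tag sequences are assumed to be already extracted as lists of tag names, and the normalisation by the longer sequence is one plausible choice. Note also that SequenceMatcher's matching blocks give an in-order common subsequence that is not guaranteed to be the true longest one.

```python
import difflib


def lcts_length(tags_a, tags_b):
    """Total size of the in-order matching blocks found by SequenceMatcher."""
    # autojunk=False disables difflib's heuristic that treats very
    # frequent items as junk on long sequences.
    matcher = difflib.SequenceMatcher(None, tags_a, tags_b, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def lcts_distance(tags_a, tags_b):
    """Distance in [0, 1]; 0 means identical tag sequences (assumed normalisation)."""
    longest = max(len(tags_a), len(tags_b))
    if longest == 0:
        return 0.0
    return 1.0 - lcts_length(tags_a, tags_b) / longest


tags_1 = ["html", "body", "div", "p", "p"]
tags_2 = ["html", "body", "div", "a", "p"]
common = lcts_length(tags_1, tags_2)
distance = lcts_distance(tags_1, tags_2)
```

Here the two sequences share "html, body, div" as one block and a single "p" as another, so four of the five tags line up in order.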

Some of the code is taken from https://github.com/TeamHG-Memex/page-compare and adapted to our input format requirements. The crawler (crawler.py) and the JSON/CSV output are by @algarecu; P-R (score-compare-tags.py) is corrected from the original.

CTSS

The Common Tag Sequence Shingles method. It is useful for overcoming the computational cost of the previous distance techniques, because it compares shingles (short runs of consecutive tags) rather than whole tag sequences.
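As a rough sketch of the shingling idea, the snippet below builds the set of k-tag shingles for each document and compares the sets with the Jaccard coefficient. The function names, the shingle length k, and the use of Jaccard overlap are assumptions made for illustration; the paper and the code in src may use different parameters or a different set measure.

```python
def shingles(tags, k=4):
    """Set of all runs of k consecutive tags (as tuples)."""
    if not tags:
        return set()
    if len(tags) < k:
        # A sequence shorter than k yields a single, shorter shingle.
        return {tuple(tags)}
    return {tuple(tags[i:i + k]) for i in range(len(tags) - k + 1)}


def shingle_similarity(tags_a, tags_b, k=4):
    """Jaccard overlap of the two shingle sets (assumed measure), in [0, 1]."""
    set_a, set_b = shingles(tags_a, k), shingles(tags_b, k)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


tags_1 = ["html", "body", "div", "p", "p"]
tags_2 = ["html", "body", "div", "a", "p"]
similarity = shingle_similarity(tags_1, tags_2, k=2)
```

Set operations on shingles avoid aligning the two full sequences, which is where the saving over a subsequence computation comes from.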

