Movie Pirates of the Caribbean: Exploring Illegal Streaming Cyberlockers
The 12th International AAAI Conference on Web and Social Media (ICWSM)
This repository contains the HTML page-similarity algorithms used in the technical paper. As a bonus, it also includes the correlation matrices and heatmaps.
For the precise definition of each method, see the paper by [Gottron], Clustering Template Based Web Documents, ECIR 2008. The methods are:
- The tag vector (label vector) method counts how many times each possible tag appears. This converts a document D into a vector v(D) of fixed dimension N, since the number of possible tags is limited.
- The Longest Common Tag Subsequence (LCTS) method expresses the distance between two documents in terms of their longest common tag subsequence. We use difflib's SequenceMatcher to find the contiguous matching blocks of the two tag sequences, and from them compute the longest sequence of tags that appears in the same order in both original sequences.
- The common tag sequence shingle method compares the sets of tag shingles (short runs of consecutive tags) of two documents. It is useful for avoiding the computational cost of the previous distance techniques.

Some of the code is extracted from https://github.com/TeamHG-Memex/page-compare and adapted to our input format requirements. The crawler (crawler.py) and the JSON/CSV output are by @algarecu; the P-R script (score-compare-tags.py) is corrected from the original.
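
As a rough illustration of the three measures listed above, here is a minimal Python sketch. It is not the code shipped in this repository. The function names (`tag_sequence`, `tag_vector_similarity`, `lcts_similarity`, `shingle_similarity`), the use of cosine similarity for the tag vectors, the normalisation by the longer sequence, and the shingle size `k` are all assumptions made for the example; see the paper for the exact definitions.

```python
import math
from collections import Counter
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Records the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def tag_vector_similarity(tags_a, tags_b):
    """Cosine similarity of the tag-count vectors (the metric is an assumption)."""
    counts_a, counts_b = Counter(tags_a), Counter(tags_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def lcts_similarity(tags_a, tags_b):
    """Tags covered by SequenceMatcher's in-order matching blocks,
    divided by the length of the longer sequence (assumed normalisation).
    The matching blocks approximate the longest common subsequence."""
    if not tags_a or not tags_b:
        return 0.0
    matcher = SequenceMatcher(None, tags_a, tags_b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(tags_a), len(tags_b))


def shingle_similarity(tags_a, tags_b, k=4):
    """Jaccard similarity of the sets of k consecutive tags (k is an assumption)."""
    def shingles(tags):
        return {tuple(tags[i:i + k]) for i in range(len(tags) - k + 1)}

    set_a, set_b = shingles(tags_a), shingles(tags_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    page_a = tag_sequence("<html><body><div><p>x</p><p>y</p></div></body></html>")
    page_b = tag_sequence("<html><body><table><tr><td>z</td></tr></table></body></html>")
    print(tag_vector_similarity(page_a, page_b))
    print(lcts_similarity(page_a, page_b))
    print(shingle_similarity(page_a, page_b, k=2))
```

All three functions return a similarity in [0, 1]; a distance can be obtained as one minus that value. The shingle comparison only needs set operations on fixed-size tuples, which is why it is cheaper than aligning two full tag sequences.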