Parallelized predict_proba() #2

Open: wants to merge 21 commits into base: master
bd39e6f
Fixed #9: save HTML to files for faster re-testing.
ctwardy Mar 6, 2017
c3b7d42
Fixes #53: Switch to lxml for speed.
ctwardy Mar 8, 2017
f630875
Closes #54: Incorporate JPL's webpageclassifier [into App]
ctwardy Mar 10, 2017
f90c457
Add JSON utils to munge URL lists for scraping by HG's page-compare.
ctwardy Mar 17, 2017
f5d83d3
Merge pull request #10 from Sotera/max_scores
ctwardy Mar 17, 2017
59c1832
rm utils/: moved to page-class project.
Mar 24, 2017
7f3220f
Added eval. notebook & .gitignore.
Mar 24, 2017
dbcfd8c
Merge pull request #11 from Sotera/max_scores
ctwardy Mar 24, 2017
2eb8bc8
Refactored fns into wpc_utils. Cleaned up a loop. Started porting in…
Mar 27, 2017
f3d1468
Wrap JPL for scikit-learn.
ctwardy Apr 3, 2017
c31050f
Remove ERROR pages from metrics. Most of these reflect crawling failu…
ctwardy Apr 4, 2017
b69ceb2
Remove 50urls.csv - had MEMEX HT examples not really suitable for
ctwardy Apr 4, 2017
dfb4556
page-class:
ctwardy Apr 17, 2017
af1869d
Modify JPL classifier to use the 'error' category. Working, but not y…
May 16, 2017
dce409e
Merge pull request #13 from Sotera/Wrap_for_scikit-learn
ctwardy May 16, 2017
30a08a3
Fixed #14 Handle ERROR class (issue was really #15).
ctwardy May 19, 2017
733dffe
parallel jpl: Successfully parallelized predict_proba().
ctwardy May 22, 2017
620c712
Merge pull request #19 from Sotera/parallel_jpl
ctwardy May 22, 2017
e16f007
* Tweak parallelize: specifying n_jobs; use numpy vs pandas.
ctwardy May 23, 2017
4382781
Merge remote-tracking branch 'origin/master'
May 23, 2017
2581929
* JPLClassifier is now part of the pagetype system
ctwardy May 30, 2017
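The later commits mention parallelizing `predict_proba()` and passing `n_jobs`, but the classifier code itself is not part of the diff shown here. The following is only a minimal sketch of the general pattern, not the PR's implementation: `score_page`, `CATEGORIES`, and the toy scoring rule are hypothetical, and a standard-library thread pool stands in for whatever backend the PR actually uses.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical category list and scorer; the real JPLClassifier is not shown here.
CATEGORIES = ["blog", "classified", "error", "news", "shopping"]


def score_page(text):
    """Toy scorer: one probability per category, from naive substring counts."""
    counts = [text.lower().count(cat) for cat in CATEGORIES]
    total = sum(counts)
    if total == 0:
        # No evidence at all: fall back to a uniform distribution.
        return [1.0 / len(CATEGORIES)] * len(CATEGORIES)
    return [c / total for c in counts]


def predict_proba(pages, n_jobs=4):
    """Score pages in parallel; row i of the result belongs to pages[i]."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map preserves input order, so rows stay aligned with pages.
        return list(pool.map(score_page, pages))


if __name__ == "__main__":
    for row in predict_proba(["my blog about news", "shopping", ""], n_jobs=2):
        print(row)
```

Threads mainly help when scoring waits on I/O. For CPU-bound scoring, `ProcessPoolExecutor` offers the same `map` interface, provided the scorer and its inputs can be pickled.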
109 changes: 109 additions & 0 deletions .gitignore
@@ -0,0 +1,109 @@
.idea/
scrapers/
Output/
Temp/
Figs/

~*
Scratch.ipynb
scratch.ipynb
metastore_db/
.DS_Store

###
# From github/gitignore/Python.gitignore
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject

# Rope project settings
.ropeproject
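
The Python section of this ignore file relies on glob patterns such as `*.py[cod]`. As a rough illustration of what the bracket and wildcard forms match, the sketch below uses Python's `fnmatch` module as an approximation. Git's own matching has extra rules (slashes, `**`, negation) that `fnmatch` does not model, and the file names are made up for the example.

```python
from fnmatch import fnmatchcase

# A few patterns copied from the ignore file above.
PATTERNS = ["*.py[cod]", "*$py.class", "*.egg", "*.so"]


def ignored_by(name):
    """Return the patterns that match a bare file name (no directories)."""
    return [p for p in PATTERNS if fnmatchcase(name, p)]


if __name__ == "__main__":
    # Illustrative file names only; they are not taken from the repository.
    for name in ["wpc_utils.pyc", "wpc_utils.py", "Foo$py.class", "lib.so"]:
        print(name, ignored_by(name))
```

`[cod]` matches exactly one of `c`, `o`, or `d`, so compiled `.pyc`, `.pyo`, and `.pyd` files are ignored while plain `.py` sources are not.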

791 changes: 791 additions & 0 deletions JPL webpageclassifier Test.ipynb

Large diffs are not rendered by default.

7 changes: 4 additions & 3 deletions blog.txt → Keywords/blog.txt
@@ -1,8 +1,9 @@
blog
.blog.com
.blogspot.com
.wordpress.com
.livejournal.com
.tumblr.com
.xanga.com
.typepad.com
.wordpress.com
.xanga.com
/medium.com/
.blog.com
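
Most entries in the blog list look like URL fragments (`.blogspot.com`, `/medium.com/`). How the classifier consumes them is not shown in this diff, so the sketch below simply assumes a case-insensitive substring match against the page URL. `blog_hits` is a hypothetical helper name.

```python
# Entries taken from the blog keyword list above (lines repeated by the diff dropped).
BLOG_MARKERS = [
    "blog", ".blog.com", ".blogspot.com", ".livejournal.com", ".tumblr.com",
    ".typepad.com", ".wordpress.com", ".xanga.com", "/medium.com/",
]


def blog_hits(url):
    """Return the markers that occur as substrings of the lower-cased URL."""
    lowered = url.lower()
    return [m for m in BLOG_MARKERS if m in lowered]


if __name__ == "__main__":
    for url in ["http://foo.blogspot.com", "https://medium.com/@a/b", "https://example.org/"]:
        print(url, blog_hits(url))
```

Note that the bare `blog` entry also fires for every `.blogspot.com` URL, so under this assumed rule such pages collect two hits rather than one.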
41 changes: 21 additions & 20 deletions classified.txt → Keywords/classified.txt
@@ -1,27 +1,28 @@
absolutely free
best value
brand new
sell
sell
selling
buy
cheap
stuff
second hand
less price
price
good condition
full service
need
is here!
absolutely free
heavy discount
discount
cash
certified
certified
shipping
cash
services
services
classified
cheap
discount
exclusively
best value
full service
good condition
good condition
heavy discount
is here!
less price
need
not used
price
second hand
sell
sell
selling
services
services
shipping
stuff
58 changes: 58 additions & 0 deletions Keywords/error.txt
@@ -0,0 +1,58 @@
404
404
404
404
404
404
404
404
503
503
denied
denied
domain
domain
encountered
encountered
encountered
error
error
errordocument
errordocument
exist
exist
exist
failed
forbidden
forbidden
hosting
http
http
index
index
moved
moved
moved
not found
not found
page
page
page
page
permission
permission
permission
permission
permission
request
request
request
requested
requested
requested
requested
requested
unavailable
using
найдено
ничего
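
The error list repeats many terms (`404` appears eight times) and includes Russian words. The diff does not say whether repetition is meant to act as a weight. The sketch below assumes that it does, purely for illustration: `load_weights` and `error_score` are hypothetical names, and `SAMPLE` is a shortened excerpt with reduced repeat counts, not the real file.

```python
from collections import Counter

# Shortened excerpt in the same one-term-per-line format as Keywords/error.txt.
# Repeat counts here are reduced for brevity and do not mirror the real file.
SAMPLE = """404
404
503
not found
forbidden
forbidden
найдено
"""


def load_weights(text):
    """Collapse repeated lines into {term: repeat count}, skipping blank lines."""
    return Counter(line.strip() for line in text.splitlines() if line.strip())


def error_score(page_text, weights):
    """Sum the weights of every term found in the page (assumed scoring rule)."""
    lowered = page_text.lower()
    return sum(w for term, w in weights.items() if term in lowered)


if __name__ == "__main__":
    weights = load_weights(SAMPLE)
    print(error_score("404 Not Found", weights))
    print(error_score("Ничего не найдено", weights))
```

When reading the real file from disk, passing `encoding="utf-8"` to `open()` keeps the Cyrillic entries intact regardless of the platform default.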
File renamed without changes.
16 changes: 8 additions & 8 deletions news.txt → Keywords/news.txt
@@ -1,14 +1,14 @@
politics
art
religion
life
local
music
weather
news
politics
religion
science
shop
sport
travel
science
world
video
sport
local
news
weather
world
10 changes: 5 additions & 5 deletions shopping.txt → Keywords/shopping.txt
@@ -1,9 +1,9 @@
shop
buy
sell
price
shipping
discount
certified
discount
price
sell
services
shipping
shop
shopping
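
With the keyword lists now grouped under `Keywords/`, one plausible way to load them is a single pass over the directory, keyed by file name. This is a sketch under that assumption; `load_keyword_dir` is a hypothetical helper, and the demo writes its own throwaway files rather than reading the repository.

```python
import tempfile
from pathlib import Path


def load_keyword_dir(directory):
    """Map each category (file stem) to its list of non-empty, stripped lines."""
    keywords = {}
    for path in sorted(Path(directory).glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        keywords[path.stem] = [ln.strip() for ln in lines if ln.strip()]
    return keywords


def demo():
    """Build a throwaway Keywords/ directory so the example runs anywhere."""
    with tempfile.TemporaryDirectory() as tmp:
        kw_dir = Path(tmp) / "Keywords"
        kw_dir.mkdir()
        (kw_dir / "news.txt").write_text("news\nweather\nworld\n", encoding="utf-8")
        (kw_dir / "shopping.txt").write_text("shop\nprice\n", encoding="utf-8")
        return load_keyword_dir(kw_dir)


if __name__ == "__main__":
    print(demo())
```

Keying on the file stem means adding a category is just a matter of dropping a new `.txt` file into the directory.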
1 change: 1 addition & 0 deletions __init__.py
@@ -0,0 +1 @@
# -*- coding: utf-8 -*-