Parallelized predict_proba() #2

Open: wants to merge 21 commits into master
Conversation

@ctwardy ctwardy commented May 22, 2017

Just parallelized predict_proba() using scikit-learn's joblib; that's a handy speedup for testing. This builds on the earlier scikit-learn wrapping. Currently I'm working on improving recognition of "error" pages.

No obligation to pull; just letting you know where this fork stands.
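The parallelization can be sketched roughly like this (a minimal sketch, not the actual code in this fork; `parallel_predict_proba`, `_score_chunk`, and the per-sample `score_one` callable are hypothetical names, and the chunk-then-scatter scheme is an assumption):

```python
import numpy as np
from joblib import Parallel, delayed


def _score_chunk(score_one, chunk):
    # Score each sample sequentially within one worker.
    return [score_one(x) for x in chunk]


def parallel_predict_proba(score_one, X, n_jobs=4):
    """Apply a per-sample probability scorer to X across n_jobs workers.

    score_one: callable returning a probability vector for one sample
    (hypothetical; stands in for whatever scores a single page here).
    """
    # One chunk per worker; array_split preserves sample order.
    chunks = np.array_split(np.asarray(list(X), dtype=object), n_jobs)
    results = Parallel(n_jobs=n_jobs)(
        delayed(_score_chunk)(score_one, chunk) for chunk in chunks
    )
    # Re-assemble per-chunk results into one (n_samples, n_classes) array,
    # skipping empty chunks when n_jobs > n_samples.
    return np.vstack([r for r in results if len(r)])
```

Because joblib returns results in submission order, the stacked output lines up row-for-row with the input.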

ctwardy and others added 21 commits March 6, 2017 12:40
Working #6: Try max(scores) instead of sequential rules.
* Slight accuracy improvement: 33/59 = 56%
* Slight refactoring
* Slight tweaks to urls.csv after viewing pages in browser, e.g.:
    - erratasec.com really should be undecided, not blog
    - blog.erratasec.com should be a blog
* Alphabetized gold words files.
* Tried THRESH=.35 and .45.  So far .40 is better. Tradeoffs.
* Passes tests using simple_test.html.
* Deleted flask stuff
* Deleted generateAnalytics.py (inspiration for featurize.py)
* Wrapped as separate JPL7 model.
* Grabs URL from features, and rereads page. Not ideal, but
  current architecture assumes we extract features first.
* Cleans up output format in fancy page.

TODO:
* Pass URL and HTML separate from features.
* Or include as features, but... eh.
* Try JPL as first stage in other approach. Or such.
(Python3.5) Scripts can be run on 50urls.json (moved to URLs) or on HG's
full_urls.json (5,140 entries).
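The max(scores) rule above could look something like this (illustrative only; the function name and the dict-of-scores interface are assumptions, though THRESH=.40 matches the note):

```python
THRESH = 0.40  # .35 and .45 were tried; .40 performed best per the commit notes


def classify_max_score(scores, thresh=THRESH):
    """Pick the highest-scoring category instead of applying sequential rules.

    scores: dict mapping category name -> similarity score in [0, 1].
    Falls back to 'undecided' when no category clears the threshold.
    """
    if not scores:
        return "undecided"
    best = max(scores, key=scores.get)
    return best if scores[best] >= thresh else "undecided"
```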
Merge in max_scores branch
* Make JPL webpageclassifier a SciKit classifier
* Refactor more webpageclassifier -> wpc_utils
* Move & update test_simple.html
* Add URL/HTML data utils: json_merge.py and
* Add train.json - has HTML. Still need to use for error categs.
* Add Apache license
* Create Notebook to test JPL webclass
* MANY changes in webclassifier.py, including:
  - Refactor much to wpc_utils.py
  - Streamline cosine_sim()
  - Add UNDEF and ERROR vs 'undecided'
  - Use logging module
  - Changed various return types for Scikit
  - Improved logic for error pages: tries bleaching URL, better logging & tallying
  - Uses scikit metrics
* In wpc_utils:
  - Add bleach(), with doctests
  - Better logging
  - Tweaks
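The bleach() helper mentioned above might be sketched like this (an assumption about its semantics: retrying an error page at the site root by dropping path and query; the real version with its doctests lives in wpc_utils):

```python
from urllib.parse import urlsplit


def bleach(url):
    """Strip a URL down to scheme and host, dropping path and query.

    Sketch of the idea of retrying an error page at the site root.

    >>> bleach('http://blog.example.com/post?id=7')
    'http://blog.example.com/'
    """
    parts = urlsplit(url)
    return "{}://{}/".format(parts.scheme, parts.netloc)
```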

```
               precision    recall  f1-score   support

        ERROR       0.00      0.00      0.00         0
    UNDEFINED       0.00      0.00      0.00         0
         blog       0.98      0.63      0.77        63
   classified       0.26      0.06      0.09        89
        forum       0.97      0.57      0.72        54
         news       0.84      0.61      0.70        92
search_engine       0.00      0.00      0.00        81
     shopping       0.35      0.66      0.46        56
         wiki       0.96      0.72      0.82        65

  avg / total       0.59      0.43      0.48       500

Confusion Matrix:
               ERROR: [0 0 0 0 0 0 0 0 0]
           UNDEFINED: [0 0 0 0 0 0 0 0 0]
                blog: [ 1 11 40  1  0  3  0  7  0]
          classified: [ 7 48  0  5  0  0  0 28  1]
               forum: [ 1 14  0  1 31  2  0  5  0]
                news: [ 4 12  1  2  1 56  0 16  0]
       search_engine: [15 45  0  5  0  6  0  9  1]
            shopping: [ 3 12  0  4  0  0  0 37  0]
                wiki: [ 1 12  0  1  0  0  0  4 47]
   µ Info: 0.38
 Accuracy: 0.43
  Total #: 500
#Bleached: 37
  #Errors: 32
```
…res.

Not much change to performance: f1-score = 51%, accuracy=46%.
Types 'classified' and 'search_engine' get the most UNDEFINED.

```
               precision    recall  f1-score   support

    UNDEFINED       0.00      0.00      0.00         0
         blog       0.98      0.65      0.78        62
   classified       0.26      0.06      0.10        82
        forum       0.97      0.58      0.73        53
         news       0.84      0.64      0.72        88
search_engine       0.00      0.00      0.00        66
     shopping       0.35      0.70      0.47        53
         wiki       0.96      0.73      0.83        64

  avg / total       0.61      0.46      0.51       468

Confusion Matrix:
           UNDEFINED: [0 0 0 0 0 0 0 0]
                blog: [11 40  1  0  3  0  7  0]
          classified: [48  0  5  0  0  0 28  1]
               forum: [14  0  1 31  2  0  5  0]
                news: [12  1  2  1 56  0 16  0]
       search_engine: [45  0  5  0  6  0  9  1]
            shopping: [12  0  4  0  0  0 37  0]
                wiki: [12  0  1  0  0  0  4 47]

   µ Info: 0.40
  Total #:  500
  #Errors:   32 	(  37 Bleached)
#Predicted: 468
 Accuracy: 0.46
```
general distribution.

Also, testing now uses the 5K URL file from thh-classifiers.
I'll add it here or to a separate project after cleaning:
about 10% of the pages have expired and are for sale, etc.

On 500 of those cases, accuracy and f1 are still around 50%.
It's clear that the bottlenecks are "classified" (needs work) and
"search_engines" (not even considered by this classifier, yet).
Also, that dataset doesn't have "shopping"?
* Update Crawling notebook for current SiteHound / thh-classifier.
* Rename thh -> sh where appropriate.
…et included in the f1 score.

```
             precision    recall  f1-score   support

  UNCERTAIN       0.00      0.00      0.00         0
       blog       0.82      0.54      0.65        69
 classified       0.44      0.28      0.34        75
      error       0.00      0.00      0.00       240
      forum       0.77      0.80      0.78       337
       news       0.86      0.44      0.59       151
   shopping       0.52      0.70      0.60       155
       wiki       0.84      0.85      0.84        79

avg / total       0.56      0.52      0.53      1106

Confusion Matrix:
           UNCERTAIN:    0,   0,   0,   0,   0,   0,   0,   0
                blog:   15,  37,   4,   0,   3,   1,   9,   0
          classified:   23,   0,  21,   0,   0,   0,  31,   0
               error:  133,   7,   1,   0,  68,   8,  16,   7
               forum:   48,   1,   4,   0, 271,   1,  12,   0
                news:   28,   0,  10,   0,  10,  67,  30,   6
            shopping:   37,   0,   8,   0,   0,   1, 109,   0
                wiki:    7,   0,   0,   0,   2,   0,   3,  67

   µ Info: 0.39
   Total #: 1106
   #Errors:    0 (   0 Bleached)
#Predicted: 1106
  Accuracy: 0.52
```
Merge scikit learn branch back to master.
Fixed #15 Confusion Matrix labels mismatch.
Fixed #16 Simplify JPL_Classifier.
Fixed #17 Cyrillic goldwords fail.
Cleaned up code, names, etc.

```
             precision    recall  f1-score   support
       blog       0.98      0.63      0.77        63
       wiki       0.96      0.72      0.82        65
       news       0.83      0.58      0.68        92
      forum       0.85      0.63      0.72        54
 classified       0.28      0.06      0.09        89
   shopping       0.36      0.62      0.46        56
  UNCERTAIN       0.00      0.00      0.00         0
      error       0.00      0.00      0.00         0

avg / total       0.69      0.51      0.57       419
```
About 3.5x faster on 7 cores, N=500.
Merge "Parallel jpl" for 3.5x speedup.
* Modify error.txt to improve performance.

Showing 3x speedup on 8 cores -- pandas might be faster?
Improved 'error' classification, but still much worse than thh's SVM.
  - Emphasized words found often in error pages
  - Reduced if found in forum & shopping, which were getting confused
  - Great precision, poor recall (21 of 240 found)
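The reweighting idea above could be sketched as follows (purely illustrative: the real weights live in error.txt, and `ERROR_WEIGHTS` / `error_score` are hypothetical names, not the project's API):

```python
# Hypothetical goldword weights: boost words common on error pages,
# down-weight words that also show up on forum and shopping pages,
# which were getting confused with errors.
ERROR_WEIGHTS = {
    "404": 2.0,
    "not found": 2.0,
    "expired": 1.5,
    "domain": 1.0,
    "buy": 0.25,    # shopping overlap -> reduced
    "reply": 0.25,  # forum overlap -> reduced
}


def error_score(text, weights=ERROR_WEIGHTS):
    """Weighted fraction of error goldwords present in the page text."""
    text = text.lower()
    hits = sum(w for word, w in weights.items() if word in text)
    return hits / sum(weights.values())
```

Heavy weights on error-specific phrases give high precision; the trade-off, as noted above, is poor recall when error pages use other wording.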

INFO:root:Creating JPL classifier
INFO:root:Classifier 'training' completed.
INFO:root:TIMING: n_jobs = 8, t = 14:26:22, dt = **49.267s**
N = 1106
```
             precision    recall  f1-score   support

       blog       0.82      0.54      0.65        69
       wiki       0.84      0.85      0.84        79
       news       0.86      0.44      0.59       151
      forum       0.77      0.80      0.78       337
 classified       0.44      0.28      0.34        75
   shopping       0.52      0.70      0.60       155
  UNCERTAIN       0.00      0.00      0.00         0
      error       0.95      0.09      0.16       240

avg / total       0.77      0.54      0.56      1106

Confusion Matrix:
                blog:   37,   0,   1,   3,   4,   9,  15,   0
                wiki:    0,  67,   0,   2,   0,   3,   7,   0
                news:    0,   6,  67,  10,  10,  30,  28,   0
               forum:    1,   0,   1, 271,   4,  12,  47,   1
          classified:    0,   0,   0,   0,  21,  31,  23,   0
            shopping:    0,   0,   1,   0,   8, 109,  37,   0
           UNCERTAIN:    0,   0,   0,   0,   0,   0,   0,   0
               error:    7,   7,   8,  68,   1,  16, 112,  21

   µ Info: 0.40
   Total #: 1106
#Predicted: 1106
  Accuracy: 0.54
```
  - Inherits from SciKitClassifier
* Featurizer: fixed #59 "Throws ValueError" when reading unicode-encoded HTML
* Update DD Crawls notebook: tries thh, JPL, and featurize classifiers
* Add Tables of Contents to some Notebooks