Parallelized predict_proba() #2

Open: wants to merge 21 commits into master
Conversation

@ctwardy ctwardy commented May 22, 2017

Just parallelized predict_proba() using scikit-learn's joblib; that's a handy speedup for testing. This builds on the earlier scikit-learn wrapping. Currently I'm working on improving recognition of "error" pages.

No obligation to pull; just letting you know where this fork stands.
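The parallelization can be sketched roughly like this (a minimal sketch, not the actual code in this fork; `parallel_predict_proba`, `_score_chunk`, and the per-sample `score_one` callable are hypothetical names, and the chunk-then-scatter scheme is an assumption):

```python
import numpy as np
from joblib import Parallel, delayed


def _score_chunk(score_one, chunk):
    # Score each sample sequentially within one worker.
    return [score_one(x) for x in chunk]


def parallel_predict_proba(score_one, X, n_jobs=4):
    """Apply a per-sample probability scorer to X across n_jobs workers.

    score_one: callable returning a probability vector for one sample
    (hypothetical; stands in for whatever scores a single page here).
    """
    # One chunk per worker; array_split preserves sample order.
    chunks = np.array_split(np.asarray(list(X), dtype=object), n_jobs)
    results = Parallel(n_jobs=n_jobs)(
        delayed(_score_chunk)(score_one, chunk) for chunk in chunks
    )
    # Re-assemble per-chunk results into one (n_samples, n_classes) array,
    # skipping empty chunks when n_jobs > n_samples.
    return np.vstack([r for r in results if len(r)])
```

Because joblib returns results in submission order, the stacked output lines up row-for-row with the input.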

ctwardy and others added 21 commits March 6, 2017 12:40
Working #6: Try max(scores) instead of sequential rules.
* Slight accuracy improvement: 33/59 = 56%
* Slight refactoring
* Slight tweaks to urls.csv after viewing pages in browser, e.g.:
    - erratasec.com really should be undecided, not blog
    - blog.erratasec.com should be a blog
* Alphabetized gold words files.
* Tried THRESH=.35 and .45.  So far .40 is better. Tradeoffs.
* Passes tests using simple_test.html.
* Deleted flask stuff
* Deleted generateAnalytics.py (inspiration for featurize.py)
* Wrapped as separate JPL7 model.
* Grabs URL from features, and rereads page. Not ideal, but
  current architecture assumes we extract features first.
* Cleans up output format in fancy page.

TODO:
* Pass URL and HTML separate from features.
* Or include as features, but... eh.
* Try JPL as first stage in other approach. Or such.
(Python3.5) Scripts can be run on 50urls.json (moved to URLs) or on HG's
full_urls.json (5,140 entries).
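The max(scores) rule above could look something like this (illustrative only; the function name and the dict-of-scores interface are assumptions, though THRESH=.40 matches the note):

```python
THRESH = 0.40  # .35 and .45 were tried; .40 performed best per the commit notes


def classify_max_score(scores, thresh=THRESH):
    """Pick the highest-scoring category instead of applying sequential rules.

    scores: dict mapping category name -> similarity score in [0, 1].
    Falls back to 'undecided' when no category clears the threshold.
    """
    if not scores:
        return "undecided"
    best = max(scores, key=scores.get)
    return best if scores[best] >= thresh else "undecided"
```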
Merge in max_scores branch
* Make JPL webpageclassifier a SciKit classifier
* Refactor more webpageclassifier -> wpc_utils
* Move & update test_simple.html
* Add URL/HTML data utils: json_merge.py and
* Add train.json - has HTML. Still need to use for error categs.
* Add Apache license
* Create Notebook to test JPL webclass
* MANY changes in webclassifier.py, including:
  - Refactor much to wpc_utils.py
  - Streamline cosine_sim()
  - Add UNDEF and ERROR vs 'undecided'
  - Use logging module
  - Changed various return types for Scikit
  - Improved logic for error pages: tries bleaching URL, better logging & tallying
  - Uses scikit metrics
* In wpc_utils:
  - Add bleach(), with doctests
  - Better logging
  - Tweaks
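The bleach() helper mentioned above might be sketched like this (an assumption about its semantics: retrying an error page at the site root by dropping path and query; the real version with its doctests lives in wpc_utils):

```python
from urllib.parse import urlsplit


def bleach(url):
    """Strip a URL down to scheme and host, dropping path and query.

    Sketch of the idea of retrying an error page at the site root.

    >>> bleach('http://blog.example.com/post?id=7')
    'http://blog.example.com/'
    """
    parts = urlsplit(url)
    return "{}://{}/".format(parts.scheme, parts.netloc)
```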

```
               precision    recall  f1-score   support

        ERROR       0.00      0.00      0.00         0
    UNDEFINED       0.00      0.00      0.00         0
         blog       0.98      0.63      0.77        63
   classified       0.26      0.06      0.09        89
        forum       0.97      0.57      0.72        54
         news       0.84      0.61      0.70        92
search_engine       0.00      0.00      0.00        81
     shopping       0.35      0.66      0.46        56
         wiki       0.96      0.72      0.82        65

  avg / total       0.59      0.43      0.48       500

Confusion Matrix:
               ERROR: [0 0 0 0 0 0 0 0 0]
           UNDEFINED: [0 0 0 0 0 0 0 0 0]
                blog: [ 1 11 40  1  0  3  0  7  0]
          classified: [ 7 48  0  5  0  0  0 28  1]
               forum: [ 1 14  0  1 31  2  0  5  0]
                news: [ 4 12  1  2  1 56  0 16  0]
       search_engine: [15 45  0  5  0  6  0  9  1]
            shopping: [ 3 12  0  4  0  0  0 37  0]
                wiki: [ 1 12  0  1  0  0  0  4 47]
   µ Info: 0.38
 Accuracy: 0.43
  Total #: 500
#Bleached: 37
  #Errors: 32
```
…res.

Not much change to performance: f1-score = 51%, accuracy=46%.
Types 'classified' and 'search_engine' get the most UNDEFINED.

```
               precision    recall  f1-score   support

    UNDEFINED       0.00      0.00      0.00         0
         blog       0.98      0.65      0.78        62
   classified       0.26      0.06      0.10        82
        forum       0.97      0.58      0.73        53
         news       0.84      0.64      0.72        88
search_engine       0.00      0.00      0.00        66
     shopping       0.35      0.70      0.47        53
         wiki       0.96      0.73      0.83        64

  avg / total       0.61      0.46      0.51       468

Confusion Matrix:
           UNDEFINED: [0 0 0 0 0 0 0 0]
                blog: [11 40  1  0  3  0  7  0]
          classified: [48  0  5  0  0  0 28  1]
               forum: [14  0  1 31  2  0  5  0]
                news: [12  1  2  1 56  0 16  0]
       search_engine: [45  0  5  0  6  0  9  1]
            shopping: [12  0  4  0  0  0 37  0]
                wiki: [12  0  1  0  0  0  4 47]

   µ Info: 0.40
  Total #:  500
  #Errors:   32 	(  37 Bleached)
#Predicted: 468
 Accuracy: 0.46
```
general distribution.

Also, testing now uses the 5K URL file from thh-classifiers.
I'll add it here or to a separate project after cleaning:
about 10% of the pages have expired and are for sale, etc.

On 500 of those cases, accuracy and f1 are still around 50%.
It's clear that the bottlenecks are "classified" (needs work) and
"search_engines" (not even considered by this classifier, yet).
Also, that dataset doesn't have "shopping"?
* Update Crawling notebook for current SiteHound / thh-classifier.
* Rename thh -> sh where appropriate.
…et included in the f1 score.

```
             precision    recall  f1-score   support

  UNCERTAIN       0.00      0.00      0.00         0
       blog       0.82      0.54      0.65        69
 classified       0.44      0.28      0.34        75
      error       0.00      0.00      0.00       240
      forum       0.77      0.80      0.78       337
       news       0.86      0.44      0.59       151
   shopping       0.52      0.70      0.60       155
       wiki       0.84      0.85      0.84        79

avg / total       0.56      0.52      0.53      1106

Confusion Matrix:
           UNCERTAIN:    0,   0,   0,   0,   0,   0,   0,   0
                blog:   15,  37,   4,   0,   3,   1,   9,   0
          classified:   23,   0,  21,   0,   0,   0,  31,   0
               error:  133,   7,   1,   0,  68,   8,  16,   7
               forum:   48,   1,   4,   0, 271,   1,  12,   0
                news:   28,   0,  10,   0,  10,  67,  30,   6
            shopping:   37,   0,   8,   0,   0,   1, 109,   0
                wiki:    7,   0,   0,   0,   2,   0,   3,  67

   µ Info: 0.39
   Total #: 1106
   #Errors:    0 (   0 Bleached)
#Predicted: 1106
  Accuracy: 0.52
```
Merge scikit learn branch back to master.
Fixed #15 Confusion Matrix labels mismatch.
Fixed #16 Simplify JPL_Classifier.
Fixed #17 Cyrillic goldwords fail.
Cleaned up code, names, etc.

```
             precision    recall  f1-score   support
       blog       0.98      0.63      0.77        63
       wiki       0.96      0.72      0.82        65
       news       0.83      0.58      0.68        92
      forum       0.85      0.63      0.72        54
 classified       0.28      0.06      0.09        89
   shopping       0.36      0.62      0.46        56
  UNCERTAIN       0.00      0.00      0.00         0
      error       0.00      0.00      0.00         0

avg / total       0.69      0.51      0.57       419
```
About 3.5x faster on 7 cores, N=500.
Merge "Parallel jpl" for 3.5x speedup.
* Modify error.txt to improve performance.

Showing 3x speedup on 8 cores -- pandas might be faster?
Improved 'error' classification, but still much worse than thh's SVM.
  - Emphasized words found often in error pages
  - Reduced if found in forum & shopping, which were getting confused
  - Great precision, poor recall (21 of 240 found)
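The reweighting idea above could be sketched as follows (purely illustrative: the real weights live in error.txt, and `ERROR_WEIGHTS` / `error_score` are hypothetical names, not the project's API):

```python
# Hypothetical goldword weights: boost words common on error pages,
# down-weight words that also show up on forum and shopping pages,
# which were getting confused with errors.
ERROR_WEIGHTS = {
    "404": 2.0,
    "not found": 2.0,
    "expired": 1.5,
    "domain": 1.0,
    "buy": 0.25,    # shopping overlap -> reduced
    "reply": 0.25,  # forum overlap -> reduced
}


def error_score(text, weights=ERROR_WEIGHTS):
    """Weighted fraction of error goldwords present in the page text."""
    text = text.lower()
    hits = sum(w for word, w in weights.items() if word in text)
    return hits / sum(weights.values())
```

Heavy weights on error-specific phrases give high precision; the trade-off, as noted above, is poor recall when error pages use other wording.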

INFO:root:Creating JPL classifier
INFO:root:Classifier 'training' completed.
INFO:root:TIMING: n_jobs = 8, t = 14:26:22, dt = **49.267s**
N = 1106
```
             precision    recall  f1-score   support

       blog       0.82      0.54      0.65        69
       wiki       0.84      0.85      0.84        79
       news       0.86      0.44      0.59       151
      forum       0.77      0.80      0.78       337
 classified       0.44      0.28      0.34        75
   shopping       0.52      0.70      0.60       155
  UNCERTAIN       0.00      0.00      0.00         0
      error       0.95      0.09      0.16       240

avg / total       0.77      0.54      0.56      1106

Confusion Matrix:
                blog:   37,   0,   1,   3,   4,   9,  15,   0
                wiki:    0,  67,   0,   2,   0,   3,   7,   0
                news:    0,   6,  67,  10,  10,  30,  28,   0
               forum:    1,   0,   1, 271,   4,  12,  47,   1
          classified:    0,   0,   0,   0,  21,  31,  23,   0
            shopping:    0,   0,   1,   0,   8, 109,  37,   0
           UNCERTAIN:    0,   0,   0,   0,   0,   0,   0,   0
               error:    7,   7,   8,  68,   1,  16, 112,  21

   µ Info: 0.40
   Total #: 1106
#Predicted: 1106
  Accuracy: 0.54
```
  - Inherits from SciKitClassifier
* Featurizer: fixed #59 "Throws ValueError" when reading unicode-encoded HTML
* Update DD Crawls notebook: tries thh, JPL, and featurize classifiers
* Add Tables of Contents to some Notebooks