Parallelized predict_proba() #2
Open

ctwardy wants to merge 21 commits into asitang:master from Sotera:master
Conversation
Working #6: Try max(scores) instead of sequential rules.
* Slight accuracy improvement: 33/59 = 56%
* Slight refactoring
* Slight tweaks to urls.csv after viewing pages in browser, e.g.:
  - erratasec.com really should be undecided, not blog
  - blog.erratasec.com should be a blog
* Alphabetized gold words files.
* Tried THRESH=.35 and .45. So far .40 is better. Tradeoffs.
* Passes tests using simple_test.html.
* Deleted flask stuff
* Deleted generateAnalytics.py (inspiration for featurize.py)
* Wrapped as separate JPL7 model.
* Grabs URL from features, and rereads page. Not ideal, but current architecture assumes we extract features first.
* Cleans up output format in fancy page.

TODO:
* Pass URL and HTML separate from features.
* Or include as features, but... eh.
* Try JPL as first stage in other approach. Or such.
(Python 3.5) Scripts can be run on 50urls.json (moved to URLs) or on HG's full_urls.json (5,140 entries).
Merge in max_scores branch
Max scores
…o scikit-learn classifier.
* Make JPL webpageclassifier a SciKit classifier
* Refactor more webpageclassifier -> wpc_utils
* Move & update test_simple.html
* Add URL/HTML data utils: json_merge.py and
* Add train.json - has HTML. Still need to use for error categs.
* Add Apache license
* Create Notebook to test JPL webclass
* MANY changes in webclassifier.py, including:
  - Refactor much to wpc_utils.py
  - Streamline cosine_sim()
  - Add UNDEF and ERROR vs 'undecided'
  - Use logging module
  - Changed various return types for Scikit
  - Improved logic for error pages: tries bleaching URL, better logging & tallying
  - Uses scikit metrics
* In wpc_utils:
  - Add bleach(), with doctests
  - Better logging
  - Tweaks

```
               precision    recall  f1-score   support

        ERROR       0.00      0.00      0.00         0
    UNDEFINED       0.00      0.00      0.00         0
         blog       0.98      0.63      0.77        63
   classified       0.26      0.06      0.09        89
        forum       0.97      0.57      0.72        54
         news       0.84      0.61      0.70        92
search_engine       0.00      0.00      0.00        81
     shopping       0.35      0.66      0.46        56
         wiki       0.96      0.72      0.82        65

  avg / total       0.59      0.43      0.48       500

Confusion Matrix:
        ERROR: [ 0  0  0  0  0  0  0  0  0]
    UNDEFINED: [ 0  0  0  0  0  0  0  0  0]
         blog: [ 1 11 40  1  0  3  0  7  0]
   classified: [ 7 48  0  5  0  0  0 28  1]
        forum: [ 1 14  0  1 31  2  0  5  0]
         news: [ 4 12  1  2  1 56  0 16  0]
search_engine: [15 45  0  5  0  6  0  9  1]
     shopping: [ 3 12  0  4  0  0  0 37  0]
         wiki: [ 1 12  0  1  0  0  0  4 47]

µ Info: 0.38   Accuracy: 0.43   Total #: 500   #Bleached: 37   #Errors: 32
```
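For context, "making the JPL webpageclassifier a SciKit classifier" typically means giving a rule-based scorer the standard estimator interface. The sketch below is illustrative only: `score_page`, `CATEGORIES`, and `JPLPageClassifier` are hypothetical stand-ins, not the repo's actual API, and the scoring is fake.

```python
# Minimal sketch of wrapping a rule-based page scorer as a
# scikit-learn-compatible classifier. All names here are illustrative.
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

CATEGORIES = ["blog", "classified", "forum", "news",
              "search_engine", "shopping", "wiki"]

def score_page(html):
    """Stand-in for the rule-based scorer: one score per category."""
    rng = np.random.default_rng(abs(hash(html)) % (2**32))
    return rng.random(len(CATEGORIES))

class JPLPageClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, thresh=0.40):
        self.thresh = thresh

    def fit(self, X, y=None):
        # Rule-based, so "training" just records the label set.
        self.classes_ = np.array(CATEGORIES + ["UNDEFINED"])
        return self

    def predict_proba(self, X):
        scores = np.array([score_page(html) for html in X])
        # If no category clears the threshold, weight goes to UNDEFINED.
        undef = np.where(scores.max(axis=1) < self.thresh, 1.0, 0.0)
        probs = np.hstack([scores, undef[:, None]])
        return probs / probs.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]
```

Once the estimator exposes `fit`/`predict`/`predict_proba`, the scikit-learn metrics mentioned in the commit (classification report, confusion matrix) work on it directly.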
…res. Not much change to performance: f1-score = 51%, accuracy = 46%. Types 'classified' and 'search_engine' get the most UNDEFINED.

```
               precision    recall  f1-score   support

    UNDEFINED       0.00      0.00      0.00         0
         blog       0.98      0.65      0.78        62
   classified       0.26      0.06      0.10        82
        forum       0.97      0.58      0.73        53
         news       0.84      0.64      0.72        88
search_engine       0.00      0.00      0.00        66
     shopping       0.35      0.70      0.47        53
         wiki       0.96      0.73      0.83        64

  avg / total       0.61      0.46      0.51       468

Confusion Matrix:
    UNDEFINED: [ 0  0  0  0  0  0  0  0]
         blog: [11 40  1  0  3  0  7  0]
   classified: [48  0  5  0  0  0 28  1]
        forum: [14  0  1 31  2  0  5  0]
         news: [12  1  2  1 56  0 16  0]
search_engine: [45  0  5  0  6  0  9  1]
     shopping: [12  0  4  0  0  0 37  0]
         wiki: [12  0  1  0  0  0  4 47]

µ Info: 0.40   Total #: 500   #Errors: 32 (37 Bleached)   #Predicted: 468   Accuracy: 0.46
```
general distribution. Also, testing now uses the 5K URL file from thh-classifiers. I'll add it here or to a separate project after cleaning: about 10% of the pages have expired and are for sale, etc. On 500 of those cases, accuracy and f1 are still around 50%. It's clear the bottlenecks are "classified" (needs work) and "search_engine" (not even considered by this classifier yet). Also, that dataset doesn't have "shopping"?
* Update Crawling notebook for current SiteHound / thh-classifier.
* Rename thh -> sh where appropriate.
…et included in the f1 score.

```
              precision    recall  f1-score   support

   UNCERTAIN       0.00      0.00      0.00         0
        blog       0.82      0.54      0.65        69
  classified       0.44      0.28      0.34        75
       error       0.00      0.00      0.00       240
       forum       0.77      0.80      0.78       337
        news       0.86      0.44      0.59       151
    shopping       0.52      0.70      0.60       155
        wiki       0.84      0.85      0.84        79

 avg / total       0.56      0.52      0.53      1106

Confusion Matrix:
  UNCERTAIN:   0,  0,  0,  0,   0,  0,   0,  0
       blog:  15, 37,  4,  0,   3,  1,   9,  0
 classified:  23,  0, 21,  0,   0,  0,  31,  0
      error: 133,  7,  1,  0,  68,  8,  16,  7
      forum:  48,  1,  4,  0, 271,  1,  12,  0
       news:  28,  0, 10,  0,  10, 67,  30,  6
   shopping:  37,  0,  8,  0,   0,  1, 109,  0
       wiki:   7,  0,  0,  0,   2,  0,   3, 67

µ Info: 0.39   Total #: 1106   #Errors: 0 (0 Bleached)   #Predicted: 1106   Accuracy: 0.52
```
Merge scikit learn branch back to master.
Fixed #15 Confusion Matrix labels mismatch.
Fixed #16 Simplify JPL_Classifier.
Fixed #17 Cyrillic goldwords fail.
Cleaned up code, names, etc.

```
             precision    recall  f1-score   support

       blog       0.98      0.63      0.77        63
       wiki       0.96      0.72      0.82        65
       news       0.83      0.58      0.68        92
      forum       0.85      0.63      0.72        54
 classified       0.28      0.06      0.09        89
   shopping       0.36      0.62      0.46        56
  UNCERTAIN       0.00      0.00      0.00         0
      error       0.00      0.00      0.00         0

avg / total       0.69      0.51      0.57       419
```
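Label-mismatch bugs like #15 commonly come from letting `sklearn.metrics.confusion_matrix` choose its own (sorted) label order while the printed row names use a different order. Passing `labels=` pins both. A small sketch with made-up toy data (the label list mirrors the reports above, but the y vectors are invented):

```python
# Pinning confusion-matrix row/column order with labels=.
from sklearn.metrics import confusion_matrix, classification_report

labels = ["blog", "wiki", "news", "forum",
          "classified", "shopping", "UNCERTAIN", "error"]

# Toy predictions, for illustration only.
y_true = ["blog", "news", "forum", "forum", "wiki"]
y_pred = ["blog", "news", "forum", "blog", "wiki"]

# Without labels=, sklearn sorts the labels alphabetically, so a
# hand-written list of row names can silently disagree with the matrix.
cm = confusion_matrix(y_true, y_pred, labels=labels)
for name, row in zip(labels, cm):
    print(f"{name:>11}: {list(row)}")

# zero_division=0 silences warnings for labels with no true samples.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```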
About 3.5x faster on 7 cores, N=500.
Merge "Parallel jpl" for 3.5x speedup.
* Modify error.txt to improve performance. Showing 3x speedup on 8 cores -- pandas might be faster?
* Improved 'error' classification, but still much worse than thh's SVM.
  - Emphasized words found often in error pages
  - Reduced if found in forum & shopping, which were getting confused
  - Great precision, poor recall (21 of 240 found)

```
INFO:root:Creating JPL classifier
INFO:root:Classifier 'training' completed.
INFO:root:TIMING: n_jobs = 8, t = 14:26:22, dt = **49.267s** N = 1106

             precision    recall  f1-score   support

       blog       0.82      0.54      0.65        69
       wiki       0.84      0.85      0.84        79
       news       0.86      0.44      0.59       151
      forum       0.77      0.80      0.78       337
 classified       0.44      0.28      0.34        75
   shopping       0.52      0.70      0.60       155
  UNCERTAIN       0.00      0.00      0.00         0
      error       0.95      0.09      0.16       240

avg / total       0.77      0.54      0.56      1106

Confusion Matrix:
       blog: 37,  0,  1,   3,  4,   9,  15,  0
       wiki:  0, 67,  0,   2,  0,   3,   7,  0
       news:  0,  6, 67,  10, 10,  30,  28,  0
      forum:  1,  0,  1, 271,  4,  12,  47,  1
 classified:  0,  0,  0,   0, 21,  31,  23,  0
   shopping:  0,  0,  1,   0,  8, 109,  37,  0
  UNCERTAIN:  0,  0,  0,   0,  0,   0,   0,  0
      error:  7,  7,  8,  68,  1,  16, 112, 21

µ Info: 0.40   Total #: 1106   #Predicted: 1106   Accuracy: 0.54
```
- Inherits from SciKitClassifier
* Featurizer: fixed #59 "Throws ValueError" when reading unicode-encoded HTML
* Update DD Crawls notebook: tries thh, JPL, and featurize classifiers
* Add Tables of Contents to some Notebooks
Just parallelized predict_proba() using scikit-learn's joblib. That's a handy speedup for testing.
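The parallelization idea can be sketched as follows. `score_one` is a hypothetical stand-in for the real per-page scorer, and the function names are illustrative, not the PR's actual code:

```python
# Sketch: parallelize per-page scoring inside predict_proba() with joblib.
# `score_one` fakes the expensive rule-based scoring of a single page.
import numpy as np
from joblib import Parallel, delayed

def score_one(html):
    # Pretend scoring: two pseudo-probabilities that sum to 1.
    p = (len(html) % 7) / 7.0
    return [p, 1.0 - p]

def predict_proba_parallel(pages, n_jobs=-1):
    """Score pages across joblib workers, one page per task."""
    rows = Parallel(n_jobs=n_jobs)(
        delayed(score_one)(html) for html in pages
    )
    return np.array(rows)
```

Older scikit-learn versions vendored joblib as `sklearn.externals.joblib`; importing `joblib` directly is the current equivalent. Since the pages are scored independently, this is embarrassingly parallel, which matches the reported ~3.5x speedup on 7 cores.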
Builds on wrapping for scikit-learn.
Currently I'm working on improving recognition of "error" pages.
No obligation to pull, just letting you know where this fork is.