-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathnotes.txt
136 lines (108 loc) · 5.64 KB
/
notes.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
SQL from Ed's email dated 3/16/15 11:57 am - see other emails from this day
for more notes on what they're currently doing too. SALIX uses basic
frequency in a lookup table to guess.
See sql below. Lichen and bryophyte databases definitely has OCR data.
SELECT IFNULL(i.originalurl, i.url) as url, r.rawstr, o.*
FROM images i inner join specprocessorrawlabels r ON i.imgid = r.imgid
INNER JOIN omoccurrences o ON i.occid = o.occid;
Unprocessed specimens:
SELECT IFNULL(i.originalurl, i.url) as url, r.rawstr, o.*
FROM images i inner join specprocessorrawlabels r ON i.imgid = r.imgid
INNER JOIN omoccurrences o ON i.occid = o.occid
WHERE o.processingstatus = "unprocessed";
Processed specimens:
SELECT IFNULL(i.originalurl, i.url) as url, r.rawstr, o.*
FROM images i inner join specprocessorrawlabels r ON i.imgid = r.imgid
INNER JOIN omoccurrences o ON i.occid = o.occid
WHERE o.processingstatus IS NULL OR o.processingstatus != "unprocessed";
Dependancies:
apt-get install python-mysqldb
apt-get install tesseract-ocr
apt-get install python-sklearn python-numpy python-pandas
python-six python-lxml python-chardet python-beautifulsoup python-pil python-bs4
# THis setup fails
git clone https://github.com/concordusapps/python-hocr.git
cd python-hocr
python setup.py install
# Just ended up writing own parser based on bs4
alter database character set = 'utf8'
Correcting hocr_results after finding word_y0 and word_x1 swapped:
alter table hocr_results add column swap INT;
update hocr_results set swap=word_y0;
update hocr_results set word_y0=word_x1;
update hocr_results set word_x1=swap;
alter table hocr_results drop column swap;
SELECT min(word_x0) as min_word_x0, min(word_y0) as min_word_y0 FROM hocr_results
WHERE occid=904652 AND word_x0>10 AND word_y0>10 AND LENGTH(TRIM(text))>2 GROUP BY occid;
AND r.area_x0 > e.min_area_x0
AND r.area_y0 > e.min_area_y0
AND r.line_x0 > e.min_line_x0
AND r.line_y0 > e.min_line_y0
AND r.word_x0 > e.min_word_x0
AND r.word_y0 > e.min_word_y0
Output from run of 25% of no loc on c10node12:
root@c10node12:~/svm# time ./svm.py
Selected 25275 training samples and 75850 testing samples.
Fitting 4 folds for each of 360 candidates, totalling 1440 fits
[Parallel(n_jobs=-1)]: Done 50 jobs | elapsed: 1.6min
[Parallel(n_jobs=-1)]: Done 200 jobs | elapsed: 3.9min
[Parallel(n_jobs=-1)]: Done 450 jobs | elapsed: 16.4min
[Parallel(n_jobs=-1)]: Done 800 jobs | elapsed: 21.2min
[Parallel(n_jobs=-1)]: Done 1 out of 1440 | elapsed: 67.3min remaining: 96903.7min
[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed: 67.3min finished
Fitting 4 folds for each of 360 candidates, totalling 1440 fits
[Parallel(n_jobs=-1)]: Done 50 jobs | elapsed: 1.7min
[Parallel(n_jobs=-1)]: Done 200 jobs | elapsed: 4.1min
[Parallel(n_jobs=-1)]: Done 450 jobs | elapsed: 18.5min
[Parallel(n_jobs=-1)]: Done 800 jobs | elapsed: 25.8min
[Parallel(n_jobs=-1)]: Done 1 out of 1440 | elapsed: 63.6min remaining: 91583.1min
[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed: 63.6min finished
Fitting 4 folds for each of 360 candidates, totalling 1440 fits
[Parallel(n_jobs=-1)]: Done 50 jobs | elapsed: 1.7min
[Parallel(n_jobs=-1)]: Done 200 jobs | elapsed: 4.7min
[Parallel(n_jobs=-1)]: Done 450 jobs | elapsed: 16.4min
[Parallel(n_jobs=-1)]: Done 800 jobs | elapsed: 20.5min
[Parallel(n_jobs=-1)]: Done 1 out of 1440 | elapsed: 64.0min remaining: 92075.9min
[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed: 64.0min finished
Best features: area
Best parameters: {'C': 4096.0, 'gamma': 4.0}
Best score: 0.70433750824
real 325m55.911s
user 11636m2.718s
sys 139m41.175s
1% of all on Elk
root@elk:~/mlproject/svm# time ./svm.py
Selected 2658 training samples and 263224 testing samples.
Fitting 4 folds for each of 360 candidates, totalling 1440 fits
[Parallel(n_jobs=-1)]: Done 50 jobs | elapsed: 1.0min
[Parallel(n_jobs=-1)]: Done 1 jobs | elapsed: 1.3min
[Parallel(n_jobs=-1)]: Done 200 jobs | elapsed: 2.1min
[Parallel(n_jobs=-1)]: Done 450 jobs | elapsed: 2.7min
[Parallel(n_jobs=-1)]: Done 800 jobs | elapsed: 3.0min
[Parallel(n_jobs=-1)]: Done 1250 jobs | elapsed: 3.2min
[Parallel(n_jobs=-1)]: Done 1410 out of 1440 | elapsed: 3.2min remaining: 4.1s
[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed: 3.3min finished
Fitting 4 folds for each of 360 candidates, totalling 1440 fits
[Parallel(n_jobs=-1)]: Done 50 jobs | elapsed: 1.3min
[Parallel(n_jobs=-1)]: Done 1 jobs | elapsed: 2.0min
[Parallel(n_jobs=-1)]: Done 200 jobs | elapsed: 3.0min
[Parallel(n_jobs=-1)]: Done 450 jobs | elapsed: 4.1min
[Parallel(n_jobs=-1)]: Done 800 jobs | elapsed: 4.4min
[Parallel(n_jobs=-1)]: Done 1250 jobs | elapsed: 4.6min
[Parallel(n_jobs=-1)]: Done 1410 out of 1440 | elapsed: 4.6min remaining: 5.9s
[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed: 4.6min finished
Fitting 4 folds for each of 360 candidates, totalling 1440 fits
[Parallel(n_jobs=-1)]: Done 1 jobs | elapsed: 38.4s
[Parallel(n_jobs=-1)]: Done 50 jobs | elapsed: 43.0s
[Parallel(n_jobs=-1)]: Done 200 jobs | elapsed: 1.4min
[Parallel(n_jobs=-1)]: Done 450 jobs | elapsed: 1.9min
[Parallel(n_jobs=-1)]: Done 800 jobs | elapsed: 2.1min
[Parallel(n_jobs=-1)]: Done 1250 jobs | elapsed: 2.3min
[Parallel(n_jobs=-1)]: Done 1410 out of 1440 | elapsed: 2.4min remaining: 3.1s
[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed: 2.4min finished
Best features: area
Best parameters: {'C': 32.0, 'gamma': 4.0}
Best score: 0.72596723703
real 15m9.373s
user 169m28.445s
sys 0m6.562s