ENHANCEMENT:
============
A big task would be to migrate the ADTs to GADTs.
I think it would be worth the effort,
because it would decrease code size (LOC)
and make the source more readable and cleaner.
Should be seen as highest priority.
All later changes would be easier then.
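
A minimal sketch of what the migration could buy. These are not the real any-dl types: the constructor names, the `url` record and the functions `url_of_adt`, `url_of` and `payload` are hypothetical stand-ins. With a plain ADT an accessor needs a runtime failure branch; with a GADT the type index rules out the wrong cases at compile time.

```ocaml
(* Plain ADT: one untyped result type; accessors are partial. *)
type result_adt =
  | String_a of string
  | Url_a of string * string               (* address, referrer *)
  | Url_list_a of (string * string) list

(* Extracting an address needs a runtime failure branch. *)
let url_of_adt = function
  | Url_a (address, _) -> address
  | String_a _ | Url_list_a _ -> failwith "not a Url"

(* GADT: the type parameter records which kind of value is carried. *)
type url = { address : string; referrer : string }

type _ result =
  | String : string -> string result
  | Url : url -> url result
  | Url_list : url list -> url list result

(* No failure branch: only the Url constructor can build a [url result]. *)
let url_of : url result -> string = function
  | Url u -> u.address

(* One total, typed accessor replaces several partial ones. *)
let payload : type a. a result -> a = function
  | String s -> s
  | Url u -> u
  | Url_list l -> l
```

Whether the saving in LOC is as large in the real code base depends on how many such partial accessors exist there.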
Needed feature: gzip-deflate for incoming data!
BUGs:
=====
BETTER STRUCTURE, CLEANER CODE:
===============================
- maybe remove print_string(), because print() does the same now.
- dump is rather a deparse; dump_data is dump-like or deparse-like too;
possibly better names should be used here
- Replace the ..._array and ..._list types by one of these, or by a collection type
that supports both.
But take care not to replace single-entity types by a collection of size 1,
because the distinction between single and multiple values is intentional.
(For example, Url is different from Url_list / Url_array with one element!
And it is meant to be different, because in some situations the special case is
used to distinguish use cases.)
The same holds for the other types that have ..._array types.
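
A small sketch of the last point, with hypothetical names (`value`, `Url_seq`, `of_list`, `of_array`, `describe`) that are not taken from the any-dl sources: the list and array variants are merged into one collection constructor, while the single-entity constructor stays separate and distinguishable.

```ocaml
(* Hypothetical, simplified result type; not the real any-dl definition. *)
type value =
  | Url of string            (* exactly one url: the single-entity case *)
  | Url_seq of string list   (* a collection: may hold 0, 1 or many urls *)

(* Both former representations convert into the one collection type. *)
let of_list (l : string list) : value = Url_seq l
let of_array (a : string array) : value = Url_seq (Array.to_list a)

(* The single/collection distinction stays observable to the evaluator. *)
let describe = function
  | Url u -> "single: " ^ u
  | Url_seq us -> Printf.sprintf "collection of %d" (List.length us)
```

A collection with one element is still a collection here, so code that uses the single case as a special signal keeps working.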
FEATUREs:
=========
a bit of priority
=================
- Parser-Exceptions/-Errors in normal (non-auto-try) mode: should catch exceptions
for each url, and continue with the next url after the parser has been aborted by an exception!
# Get_error: Client_error
# Parser abandoned, because of Get_error ( URL: http://www.altermannblog.de/2016/ )
- grepv: + filters columns by row-match?
+ filters rows by column-match?
+ check sources in evaluate.ml - filter-naming seems not to fit functionality?!
+ Documentation / README: add words about it
- adding XPATH-notation / XPATH-conversion for compatibility with other tools seems to make sense. (really?)
- add the possibility to interpret scraperJSON for compatibility,
possibly convenient way to define scrapers/parsers: https://github.com/ContentMine/scraperjSON
- tagselect: unparsing a table should be possible...
- append_to_matchresult - command that allows collecting of data.
(e.g. extracting data from pages, where the data is spread over many webpages;
the data extracted from each page could then be collected into ONE matchresult
and the whole thing saved then).
- Internal data structure which represents the data better, and to which
the parsetreetypes will be converted.
From Parsetreetypes to an internal type.
This has been in mind for a long time, and it was the reason why the
parsetreetypes were named PARSEtreetype.
Would need preprocessing of the AST, and would give a lot of possibilities
for optimisation of the data processing flow.
- Special variable for the tmpvar? maybe $TMPVAR ? || First exploration: there seems to be a problem with internally called Subst (?)
Must Clean Code there!
- converting HTML to named variables (like it also was planned for XML)
This would make it easier to select certain data from a document / doclist.
(DOM-to-string as varnames)
[ -> not sure if this is necessary anymore, because of tagselect-command ]
- subst: with replacement-*pattern* (instead of just a string-template)
- exceptions?
* exceptions as part of the language?
* try/catch?
* raise/try/catch?
* Or just a function to raise an exception, which then is not handled by the language but by the invoking loop?
- Array2.colmap / Array2.rowmap map-functions for Array-data.
(Should behave non-destructively / functionally of course.)
- Refactoring code, using colmap/rowmap functions
- tagselect: argpairs: different deparsing scheme also would make sense.
maybe giving back matrix instead of array/list?
- tagselect: regexp's for selection-pattern (extraction-pattern also?)
[ -> Patterns instead of strings can also bring problems. Using '^' and '$' all too often, to avoid wrong matches,
can become annoying too.
So, maybe using a cli-switch to activate regexp-patterns? But this might break some parsers, depending on CLI-switches,
which is a bad idea.
So, adding a new command, that also allows patterns?
But this would be bad for tagselect, there should better not be another tagselect-like command.
So, maybe adding another argument to tagselect, that can switch to regexp-patterns?
]
- tagselect / csv_save: things like tagselect( ... | argpairs)
would make sense to be converted to a table of the form
    argname_1    argname_2    argname_3    ...
    argval_1_1   argval_2_1   argval_3_1
    argval_1_2   argval_2_2   argval_3_2
    argval_1_3   argval_2_3   argval_3_3
    argval_1_4   argval_2_4   argval_3_4
    ...          ...          ...
and then saved with csv_save...
So, this scheme would make sense.
But the scheme in use now also makes sense to get a good and fast overview.
So, maybe an option for selecting the deparsing scheme would make sense.
[ -> why not leave it as it is, and in the script then just add a to_matchres? (Would that fit what I have in mind here?) ]
- slurping the URL-list from the command line, getting it into TMPVAR. (as Url_list)
It must be clear that the URL-list is then shortened, or that an exception
is raised to end any-dl after the parser has finished.
(exc. caught and normal exit)
- tagselect: allow Doclist argument as tmpvar to work on
[ -> looks not to be really necessary at the moment ]
- DropCol / DropRow also for other data-structures, like Doc_array, Url_list and so on.
[ -> what do I mean by that? Adding other Parsetree-Types to DropCol/DropRow-command? ]
- possibly an option/pragma, that allows parsers to be *not* invoked
via "-a" CLI switch, because some parsers download anything, not
a certain video-file (because they are more general, like general
parsers that download all pics or so.)
[ -> something like "this parser is intended to be invoked, if a certain command is explicitly looked for.
So, using it with "-a" would not make sense, because the intention of the "-a" switch is just to try out
all available parsers, in case it is not known, which one is able to download the needed data (videos).
-> But if one wants to download videos and try, which parser might work, using a link-extract-parser or a
parser that explicitly downloads pdf-files would not make sense here.
But "-a" means: try ALL parsers. So, why should a pragma destroy the command from cli, saying "try ALL"?
-> Maybe another switch would be necessary, to allow to differentiate here between
"all" and "all but those which don't want to be used"?
-> Maybe "-a" and "-aa" (like "-v" and "-vv") ? Or something like that?
]
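
For the Array2.colmap / Array2.rowmap item in the list above, a rough sketch of how non-destructive versions could look. It assumes a matrix is stored as an array of rows (`'a array array`); the real Array2 module in any-dl may be laid out differently, and the helper `column` is made up for this sketch.

```ocaml
(* Sketch only: a matrix is assumed to be an array of rows. *)

(* rowmap f m: apply f to a copy of every row and return a fresh matrix. *)
let rowmap (f : 'a array -> 'b array) (m : 'a array array) : 'b array array =
  Array.map (fun row -> f (Array.copy row)) m

(* Column j of m as a fresh array; assumes a rectangular matrix. *)
let column (m : 'a array array) (j : int) : 'a array =
  Array.map (fun row -> row.(j)) m

(* colmap f m: apply f to every column and return a fresh matrix.
   f must preserve the column length so the result stays rectangular. *)
let colmap (f : 'a array -> 'b array) (m : 'a array array) : 'b array array =
  if Array.length m = 0 then [||]
  else begin
    let ncols = Array.length m.(0) in
    (* Transform each column once, then transpose back into rows. *)
    let cols = Array.init ncols (fun j -> f (column m j)) in
    Array.init (Array.length m) (fun i ->
      Array.init ncols (fun j -> cols.(j).(i)))
  end
```

Because both functions hand copies to `f` and build new arrays, the input matrix is left untouched even if `f` mutates its argument.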
lesser Priority
===============
- javascript engine
- Logfile for successful and failed downloads
- more sophisticated selection menu? (enhancing, what iselectmatch offers).
(Should be text-only interface, and even ncurses will be overhead. KISS)
[ -> hmhh, I think it's not that bad at the moment... remind KISS-principle... ]
- !!!! XML-deparse -> get values from XML via name :-)
[ -> not that bad, maybe making it higher priority?
->
- matches and assignments -> each match-group one varname... like (url,vidname,foobar) = match("^(.*)xyz(lala).*?(foobar)")
- maybe add a more sophisticated program invocation (fork/exec or popen) ?
- possibly a true stack ( push, pop, dup, exch, ...) might make sense,
enhancing the one-val-stack concept.
- would it make sense to allow get to also read the streams?
Or should there be some other special command for it?
like "get_stream"?
(but how to achieve this? by using fork/exec to the tools used via system() at the moment?
Or using stream-reading libraries for this? Are there any such libs and/or OCaml bindings?)
- possibly switch to ulex instead of ocamllex:
http://alain.frisch.fr/soft.html#ulex
http://caml.inria.fr/pub/ml-archives/caml-list/2006/04/6d31ef03a5a1f9a182a9ed2422d266a4.en.html
... but ocamlnet uses ulex internally? I could just use ocamlnet for reading/parsing files maybe?
- Testsuite would help a lot (example where helpful: linkextract problem in commit c7f3c1ec98d9c3e88d47c93637a50879c0711d05 )
- testing sites as well as testing parsers would be needed.
[ -> maybe using OUnit for unit-testing? ]
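
For the "true stack" item in the list above, a possible shape of the operations as a purely functional sketch. The type `stack`, the exception `Stack_underflow` and all function names are assumptions for illustration; the current one-val-stack would correspond to a stack that never grows beyond one element.

```ocaml
(* Sketch: an immutable stack as a list; the head is the top of the stack. *)
type 'a stack = 'a list

exception Stack_underflow of string

let empty : 'a stack = []

let push (x : 'a) (s : 'a stack) : 'a stack = x :: s

(* pop returns the top value together with the remaining stack. *)
let pop : 'a stack -> 'a * 'a stack = function
  | x :: rest -> (x, rest)
  | [] -> raise (Stack_underflow "pop")

(* dup duplicates the top value. *)
let dup : 'a stack -> 'a stack = function
  | x :: rest -> x :: x :: rest
  | [] -> raise (Stack_underflow "dup")

(* exch swaps the two topmost values. *)
let exch : 'a stack -> 'a stack = function
  | x :: y :: rest -> y :: x :: rest
  | _ -> raise (Stack_underflow "exch")
```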
QUESTIONS:
==========
- how detailed/verbose or how non-verbose should the commands be?
e.g. should print; print only the url or url and referrer?
Should there be more specific commands?
Or should print get parameters?
Or should there be something like
print with referrer;
- the interactive loop for menu-item selection catches failing int_of_string: default-pattern is used.
It could also ask again for correct selection.
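
A sketch of the "ask again" variant for the last question. The function `ask_selection`, its labelled arguments and `stdin_reader` are hypothetical and not part of any-dl; the input source is passed in as a function so the loop can be tried without a terminal. `int_of_string_opt` is from the OCaml standard library (4.05 and later).

```ocaml
(* Sketch: [read_line_opt] is a hypothetical input source that returns None
   at end of input; [n_items] is the number of menu entries (1-based). *)
let rec ask_selection ~(read_line_opt : unit -> string option) ~(n_items : int)
  : int option =
  match read_line_opt () with
  | None -> None                               (* input exhausted: give up *)
  | Some line ->
    (match int_of_string_opt (String.trim line) with
     | Some i when i >= 1 && i <= n_items -> Some i
     | Some _ | None ->
       prerr_endline "Please enter a valid item number.";
       ask_selection ~read_line_opt ~n_items)  (* ask again *)

(* Reader for interactive use: End_of_file is mapped to None. *)
let stdin_reader () = try Some (read_line ()) with End_of_file -> None
```

On `None` the caller could still fall back to the default-pattern, so the current behaviour would remain available as a last resort.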