TODO

ENHANCEMENT:
============

A big piece of work would be migrating the ADTs to GADTs.
I think it would be worth the effort,
because it will decrease code size (LOC)
and make the source more readable and cleaner.
Should be seen as highest priority.
All later changes would be easier then.
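
A minimal sketch of the idea, assuming a much smaller value type than the real one. The tags `single` / `multi` and the functions `length` and `to_strings` are hypothetical and only illustrate how a GADT can record the single-versus-collection distinction in the type:

```ocaml
(* Hypothetical, simplified value type; not the project's actual definitions. *)

(* Tag types that distinguish single values from collections. *)
type single = Single_tag
type multi = Multi_tag

(* GADT: the type parameter records which kind of value a constructor holds. *)
type _ value =
  | Url : string -> single value
  | Url_list : string list -> multi value

(* Accepts only collections; passing a [single value] is rejected at
   compile time, so no runtime "wrong type" branch is needed. *)
let length (v : multi value) : int =
  match v with
  | Url_list l -> List.length l

(* Works on both kinds; the locally abstract type lets each branch refine it. *)
let to_strings : type a. a value -> string list = function
  | Url u -> [u]
  | Url_list l -> l
```

With a plain ADT, `length` would need an extra case for `Url` that raises an error at runtime; whether this really shrinks the code overall would have to be checked against the real types.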


Needed feature: gzip-deflate for incoming data!


BUGs:
=====


BETTER STRUCTURE, CLEANER CODE:
===============================


 - maybe remove print_string(), because print() does the same now.
 - dump is rather a deparse; dump_data is dump-like or deparse-like too;
   possibly better names should be used here

 - Replace the ..._array and ..._list types by one of these, or by a collection type
   which supports both.

   But be careful not to replace single-entity types by a collection with size 1,
   because the distinction between single and multiple values is intentional.
   (For example, Url is different from Url_list / Url_array with one element!
    And it is meant to be different, because in some situations the special case is
    used to distinguish use cases.)
    The same holds for the other types that have ..._array types.


FEATUREs:
=========


 a bit of priority
 =================

 - Parser-Exceptions/-Errors in normal (non-auto-try) mode: should catch exceptions
   for each URL and continue with the next URL after the parser has been disrupted by an exception!

     # Get_error: Client_error
     # Parser abandoned, because of Get_error ( URL: http://www.altermannblog.de/2016/ )


 - grepv: + filters columns by row-match?
          + filters rows by column-match?
          + check sources in evaluate.ml - filter-naming seems not to fit functionality?!
          + Documentation / README: add a few words about it

 - adding XPATH-notation / XPATH-conversion for compatibility with other tools seems to make sense. (really?)

 - add the possibility to interpret scraperJSON for compatibility,
   possibly convenient way to define scrapers/parsers: https://github.com/ContentMine/scraperjSON


 - tagselect: unparsing a table should be possible...

 - append_to_matchresult - command that allows collecting of data.
   (e.g. extracting data from pages, where the data is spread over many webpages;
   the data extracted from each page could then be collected into ONE matchresult
   and the whole thing saved then).

 - Internal data structure which represents the data better, and into which
   the parsetreetypes will be converted.
   From Parsetreetypes to an internal type.
   This has been in mind for a long time, and it was the reason why the
   parsetreetypes were named PARSEtreetype.
   Would need preprocessing of the AST, and would give a lot of possibilities
   for optimisation of the data processing flow.


 - Special variable for the tmpvar? maybe $TMPVAR ?
   [ -> First exploration: there seems to be a problem with internally called Subst (?)
        Must clean code there! ]


 - converting HTML to named variables (like it also was planned for XML)
   This would make it easier to select certain data from a document / doclist.
   (DOM-to-string as varnames)
    [ -> not sure if this is necessary anymore, because of tagselect-command ]


 - subst: with replacement-*pattern* (instead of just a string-template)

 - exceptions?
     * exceptions as part of the language?
     * try/catch?
     * raise/try/catch?
     * Or just a function to raise an exception, which then is not handled by the language but by the invoking loop?


 - Array2.colmap / Array2.rowmap  map-functions for Array-data.
   (Should behave non-destructively / functionally, of course.)

 - Refactoring code, using colmap/rowmap functions

 - tagselect: argpairs: different deparsing scheme also would make sense.
              maybe giving back matrix instead of array/list?

 - tagselect: regexp's for selection-pattern  (extraction-pattern also?)
     [ -> Patterns instead of strings can also bring problems. Using '^' and '$' all too often, to avoid wrong matches,
          can become annoying too.
          So, maybe using a cli-switch to activate regexp-patterns? But this might break some parsers, depending on CLI-switches,
          which is a bad idea.
          So, adding a new command, that also allows patterns?
          But this would be bad for tagselect, there should better not be another tagselect-like command.
          So, maybe adding another argument to tagselect, that can switch to regexp-patterns?
     ]

 - tagselect / csv_save: things like tagselect( ... | argpairs)
                         would make sense to be converted to a table of the form

                           argname_1     argname_2  argname_3 ...
                             argval_1_1  argval_2_1   argval_3_1
                             argval_1_2  argval_2_2   argval_3_2
                             argval_1_3  argval_2_3   argval_3_3
                             argval_1_4  argval_2_4   argval_3_4
                              ...         ...          ...

                            and then saved with csv_save...
                       So, this scheme would make sense.
                       But the scheme in use now also makes sense to get a good and fast overview.
                       So, maybe an option for selecting the deparsing scheme would make sense.
   [ -> why not leave it as it is, and in the script then just add a to_matchres? (Would that fit with what I have in mind here?) ]


 - slurping the URL-list from the command line, getting it into TMPVAR. (as Url_list)
   Must be clear, that the URL-list then is shortened, or that an exception
   for ending any-dl after the parser has finished is called.
   (exc. caught and normal exit)

 - tagselect: allow Doclist argument as tmpvar to work on
     [ -> looks not to be really necessary at the moment ]

 - DropCol / DropRow also for other data-structures, like Doc_array, Url_list and so on.
      [ -> what do I mean by that? Adding other Parsetree-Types to DropCol/DropRow-command? ]

 - possibly an option/pragma, that allows parsers to be *not* invoked
   via "-a" CLI switch, because some parsers download anything, not
   a certain video-file (because they are more general, like general
   parsers that download all pics or so.)
     [ -> something like "this parser is intended to be invoked, if a certain command is explicitly looked for."
          So, using it with "-a" would not make sense, because the intention of the "-a" switch is just to try out
          all available parsers, in case it is not known which one is able to download the needed data (videos).
       -> But if one wants to download videos and try which parser might work, using a link-extract-parser or a
          parser that explicitly downloads pdf-files would not make sense here.
          But "-a" means: try ALL parsers. So, why should a pragma destroy the command from cli, saying "try ALL"?
       -> Maybe another switch would be necessary, to allow to differentiate here between
          "all" and "all but those which don't want to be used"?
       -> Maybe "-a" and "-aa" (like "-v" and "-vv") ? Or something like that?
     ]
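
For the Array2.colmap / Array2.rowmap item above, a minimal non-destructive sketch. It assumes a matrix is a non-empty, rectangular array of rows (`'a array array`); the real data representation may differ, and `transpose` is a helper introduced only for this illustration:

```ocaml
(* Assumption: a matrix is an array of rows, each row an array of cells. *)

(* rowmap f m: apply f to every row, returning a new matrix.
   Each row is copied before f sees it, so m stays unchanged
   even if f mutates its argument. *)
let rowmap (f : 'a array -> 'b array) (m : 'a array array) : 'b array array =
  Array.map (fun row -> f (Array.copy row)) m

(* transpose: swap rows and columns of a rectangular matrix. *)
let transpose (m : 'a array array) : 'a array array =
  if Array.length m = 0 then [||]
  else
    Array.init (Array.length m.(0)) (fun c ->
      Array.init (Array.length m) (fun r -> m.(r).(c)))

(* colmap f m: apply f to every column, returning a new matrix. *)
let colmap (f : 'a array -> 'b array) (m : 'a array array) : 'b array array =
  transpose (rowmap f (transpose m))
```

Building colmap on top of rowmap and transpose keeps the code short at the cost of two extra copies of the matrix; a direct implementation would be faster if that ever matters.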


 lesser Priority
 ===============

 - javascript engine

 - Logfile for successful and failed downloads

 - more sophisticated selection menu? (enhancing what iselectmatch offers).
   (Should be text-only interface, and even ncurses will be overhead. KISS)
   [ -> hmhh, I think it's not that bad at the moment... remember the KISS principle... ]

 - !!!! XML-deparse -> get values from XML via name :-)
   [ -> not that bad, maybe making it higher priority? ]

 - matches and assignments -> each match-group one varname... like   (url,vidname,foobar) = match("^(.*)xyz(lala).*?(foobar)")

 - maybe add a more sophisticated program invocation (fork/exec or popen) ?

 - possibly a true stack ( push, pop, dup, exch, ...) might make sense,
   enhancing the one-val-stack concept.

 - would it make sense to allow get also to read the streams?
   Or should there be some other special command for it?
   like "get_stream"?
   (but how to achieve this? by using fork/exec to the tools used via system() at the moment?
   Or using stream-reading libraries for this? Are there any such libs and/or OCaml bindings?)

 - possibly switch to ulex instead of ocamllex:
     http://alain.frisch.fr/soft.html#ulex
     http://caml.inria.fr/pub/ml-archives/caml-list/2006/04/6d31ef03a5a1f9a182a9ed2422d266a4.en.html
    ... but ocamlnet uses ulex internally? I could just use ocamlnet for reading/parsing files maybe?

 - Testsuite would help a lot (example where helpful: linkextract problem in  commit c7f3c1ec98d9c3e88d47c93637a50879c0711d05 )
     - testing sites as well as testing parsers would be needed.
     [ -> maybe using OUnit for unit-testing? ]
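
For the "true stack" item above, a minimal sketch of how push / pop / dup / exch could look on the OCaml side. The type `stack`, the exception `Stack_underflow` and all function names are hypothetical; how this would be wired to the existing one-val-stack is left open:

```ocaml
(* Hypothetical sketch: an immutable stack as a list, top of stack first. *)
type 'a stack = 'a list

exception Stack_underflow

let push (x : 'a) (s : 'a stack) : 'a stack = x :: s

(* pop returns the top element together with the remaining stack. *)
let pop (s : 'a stack) : 'a * 'a stack =
  match s with
  | x :: rest -> (x, rest)
  | [] -> raise Stack_underflow

(* dup duplicates the top element. *)
let dup (s : 'a stack) : 'a stack =
  match s with
  | x :: _ -> x :: s
  | [] -> raise Stack_underflow

(* exch swaps the two topmost elements. *)
let exch (s : 'a stack) : 'a stack =
  match s with
  | x :: y :: rest -> y :: x :: rest
  | _ -> raise Stack_underflow
```

An immutable list keeps the operations functional, in line with the non-destructive style wanted elsewhere in this file.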


QUESTIONS:
==========

 - how detailed/verbose or how non-verbose should the commands be?
   e.g.  should print; print only the url, or url and referrer?
   Should there be more specific commands?
   Or should print get parameters?
   Or should there be something like
      print with referrer;

 - the interactive loop for menu-item selection catches failing int_of_string: default-pattern is used.
   It could also ask again for correct selection.
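
A minimal sketch of the "ask again" variant, not the current behaviour. The names `ask_selection` and `read_stdin`, the 0-based index and the error message are assumptions for illustration; the input source is passed in as a function so the loop can be tried without a terminal:

```ocaml
(* Hypothetical sketch: keep asking until the input is a valid menu index.
   [read] supplies one line of input; [n_items] is the number of menu entries.
   Returns None when the input ends (read returns None). *)
let rec ask_selection ~(read : unit -> string option) ~(n_items : int) : int option =
  match read () with
  | None -> None
  | Some line ->
    (match int_of_string_opt (String.trim line) with
     | Some i when i >= 0 && i < n_items -> Some i
     | _ ->
       prerr_endline "Invalid selection, please try again.";
       ask_selection ~read ~n_items)

(* Reads one line from stdin; end of input is mapped to None. *)
let read_stdin () : string option =
  try Some (input_line stdin) with End_of_file -> None
```

Interactive use would be `ask_selection ~read:read_stdin ~n_items:(number of entries)`. Using `int_of_string_opt` avoids the exception entirely, so there is nothing to catch in the loop.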