Skip to content

any-dl: generic mediathek-downloader ("generic" means, you also can call it "scrapertool")

License

Notifications You must be signed in to change notification settings

klartext/any-dl

Repository files navigation

any-dl, a tool for downloading Mediathek video-files
====================================================


Overview
========

  The tool any-dl is inspired by, and has it's name derived from tools like
  youtube-dl, arte-dl, dctp-dl, zdf-dl, ...

  These tools are specialized downloading tools for videos
  of youtube, as well as tv-broadcasting companies.

  All these tools do download video files, and for accomplishing this task,
  they need to download and analyze webpages, via which thos cvideos are
  presented to the viewer.

  All these small tools are only programmed to work with certain video archives,
  and a lot of work is going into these kind of tools.

  any-dl is intended to be generic enough to allow downloads of videos from all
  these platforms, and for this case, probviding a Domain Specific Langage
  (DSL) which defines how the videos of a certain server can be downloaded.

  The DSL is designed to allow defining parsers, which say, how to scrape
  the archives.

  This language will explained below.

  But as normal user, you normally will not need to know this parser definition
  language.
  You just need to know how t use any-dl, and this is pretty simple.
  So, before the parser definition language will be explained,
  the usage of any-dl will be explained.

  any-dl provides a program with a certain language,
  that allows doing the parsing stuff of websites with
  focus on video download.

  If you miss a parser for a certain site or if you have written
  one by your own, please let me know.

  As any-dl does delegate the stream-downloads to certain tools,
  it will make sense to have the following tools also installed:

  - rtmpdump


Compilation / Installation / Setup
==================================


  When you read this README file from within the directory
  you downloaded (via git or elsewhere), then you
  already have unpacked it.

  You need to compile and install / setup the tool.

  You will need to have OCaml installed, as well as
  some libraries.

  The library ocamlnet currently (May 2024) is not available for ocaml 5.x.
  So you need to use ocaml 4. The currently used version in development is
  ocaml 4.14.2.

  When using OPAM, you need the following packages:

  - pcre
  - ocamlnet
  - conf-gnutls (gnutls must be installed on your system too)
  - xmlm
  - yojson
  - csv

  If you don't use OPAM but instead the package manager from your Linux installation,
  the packages might have different names.
  On Arch some packages might only be available via AUR.

  At the moment any-dl itself has no OPAM-package.
  It might be added later.


  To compile any-dl, just type "make" at the shell then.

    $  make

  The the file "any-dl" should be in the current directory.
  You can then copy it to $HOME/bin if your PATH-variable
  points to it.
  Or possible you may copy it to /usr/lovcal/bin or /usr/bin
  depending on the Linux-/Unix-system you are using, and
  the filesystem-standard it is using.


  The file "rc-file.adl" does contain needed parser-definitions
  for any-dl to work as expected.

  This file must be available in one of three places:

    - /etc/any-dl.rc

    - $XDG_CONFIG_HOME/any-dl.rc  ( default: $HOME/.config/any-dl.rc )

    - $HOME/.any-dl.rc


  So, please copy from the local dir, where you built any-dl
  the file "rc-file.adl" to one of the three places, mentioned above.

  This command would do it for the default of the XDG_CONFIG_HOME environment variable:

   $ cp rc-file.adl  $HOME/.config/any-dl.rc     # copy the config file to the XDG-default-dir

  But if the XDG_CONFIG_HOME-environment variable is set,
  it's better to use it:

   $ cp rc-file.adl  $XDG_CONFIG_HOME/any-dl.rc      # copy the config file to the XDG-dir

        ( For command-line-newbies:
           The "$" symbol in the mentioned command lines, which you have to
           type do represent the prompt of the shell; don't type it. )


  If you want to add your own parser-definitions, it would make sense
  to save the file "rc-file.adl"  in the  /etc/-directory, as mentioned above
  and then add your own parser-definitions in one of those places,
  where a any-dl config-file can be placed inside your HOME directory.

  This would have the advantage, that the config-file coming with any-dl will
  always be placed in the /etc/ directory, and your local parser-definitions
  will be saved in your local configs in $HOME.

  PLEASE, BE AWARE, THAT ANY-DL READS ALL CONFIG-FILES AS IF THEY WERE
  CONCATENATED INTO ONE BIG FILE.

  So, if you want to add your ADDITIONAL parsers to those that are already
  existing in the file inside "rc-file.adl" ( e.g. copied to /etc/any-dl.rc ),
  this can be done by just editing the config files in your $HOME-dir
  and write your own definitions into these files.
  It's not necessary (and also not recommended) to have there a copy of the
  "rc-file.adl", in which you added your parsers.
  Just write ONLY your own parsers into your local files.

  So, in other words: if you want to add your own parser-definitions place them
  solely in one of the default places fo config files in $HOME.
  Don't copy the stuff from "rc-file.adl" stored in /etc/any-dl.rc.

  ( If you place the config file "rc-file.adl" in $XDG_CONFIG_HOME/any-dl.rc
    you can add your parsers in $HOME/.any-dl.rc )
  

  Of course you can also add your additional parsers in the place,
  where all parsers from "rc-file.adl" are stored... but when you update
  to a newer version of any-dl you maybe by accident overwrite your own
  stuff with the new "rc-file.adl" coming with a newer version of any-dl.)


  IF YOU WISH TO USE OTHER CONFIG-FILES, you can specify them
  with the -f option of any-dl.
  If you use the -f option, you can give a filename-path, which is used
  as config file then.

  BE AWARE: All DEFAULT PLACES of config files WILL THEN BE IGNORED!
  If you wish to add more than one config file, you can do it by just
  using the -f option more than once.



Usage
=====

  You need to provide the url from the video archive,
  and give it to any-dl as a command line argument.
  Very often you have to quote the url inside of " and "
  so that certain symbols are not interpreted by the shell,
  from which you start any-dl.

  For example on ARTE mediathek, there is
  a telecast "Frankreichs mythische Orte", and the URL
  of it is:
    http://videos.arte.tv/de/videos/frankreichs-mythische-orte--7167432.html

  If you want  to download the video of it, at the shell it will look like this:

    $ any-dl "http://videos.arte.tv/de/videos/frankreichs-mythische-orte--7167432.html"

  Then any-dl would download the video. :-)
  That's all :-)

  The same principle holds true for any other archives, for which
  a parser definition already is provided.
  If there is no such parser defined, any-dl will tell you with an
  exception-message.

  You then may ask, if there already is a parser for it available,
  written by the author of any-dl, or by any other persons.
  Or you could learn the parser definition language and program your
  own parser for that archive.
  If you send the parser you wrote to the author of any-dl,
  then in a newer release of any-dl, other people could use it also.

  By the way: there are already also some parser definitions, that are not
  focussed on certain video archives.
  There is the parser "linkextract" as well as "linkextract_xml".
  You can use them to pick out html-hyperreferences (typically called "links"
  or "references") or links in xml-files.

  To pick a certain parser can be done with the command-line switch "p":

    $ any-dl -p linkextract "http://videos.arte.tv/de/videos/frankreichs-mythische-orte--7167432.html"

  will print out all href's of the document (and they should all appear as absolute URLs).

  The names of all defined/available parsers can be displayed with the "l"-switch:

    $ any-dl -l

  If a parser as URLs, on which it will be invoked as default (when not using -p)
  it is also displayed with -l as switch.

  If you want to write your own parser-definitions, you need the list of
  commands. You can get it with the -c switch:

    $ any-dl -c 

  will print a list of all keywords that the lexer/scanner does accept.


  That's enough for an introduction.

  And here now follows a brief introduction into the parser definition language.


Parser-Definition Language: Intro
=================================

  Here is a simple parser definition, that allows to pick out all
  html-hyper-references from a webpage and print them.


  parsername "linkextract": ( "" )
  start
    linkextract;
    print;
  end

  As you can see, the definition allows to give a parsername
  to the definition of the parser, an inbetween of "start" and "end"
  the commands that define the parser, are listed.

  A get-command that downloads the url
  (which is given via the command line) is done
  implicitly.

  Then the commands "linkextract" and "print"
  are executed.
  So, all links from the document, referred to by the URL
  are printed.



  The part with the parantehses and quoting-symbols allows to bind certain
  URL's to this parser, so that a parser can be selected automatically
  via the URL. So, a parser, dedicated to a certain URL will be invoked
  to work on the document, that has a certain URL.

  Via command line arguments, it is possible, to select a different parser,
  to do it differently than using the defaults.


  As an example see at the parser, that does look-up for the
  video-files of the NDR-TV-broadcaster in germany:

    # Example-URL: http://www.ndr.de/fernsehen/sendungen/mein_nachmittag/videos/wochenserie361.html
    #
    parsername "ndr_mediathek_get": ( "http://www.ndr.de" )
    start
      match( "http://.*?mp4" );
      rowselect(0);
      store("url");

      # download the video
      # ------------------
      paste("wget ",  $url );
      system;
    end

  There you can see, that the parser-name is set to
  "ndr_mediathek_get", and the URL, to which this parser is bound by
  default is "http://www.ndr.de".
  This does mean, that any URLs, that start with "http://www.ndr.de"
  will be parsed with the "ndr_mediathek_get" parser.

  If you give an URL like the one in the example (shown above the parser)
  as command line argument to any-dl, then the parser "ndr_mediathek_get"
  is invoked to look for the video file.

  Again, an implicit get is invoked.
  Because the first doeument must be downloaded in any case,
  the first get is done implicitly.
  It's obvious that the first document must be downloaded,
  and it makes writing the parsers easier.


Stack and named variables
-------------------------

  This language is somehow special, that uses a mix of
  a stack-based language and one that allows named variables.
  The stack has a size of one value.
  Most functions use the stack. They can get their argument from there,
  as well as puttin gtheir results to the stack.
  A one-value-stack, which is used to read arguments from and save
  results to, does behave like a pipe in unix-environment.
  Something is written to a pipe by someone, and the same thing is read from a pipe by someone.
  So, the stack emulates something like a pipe.

    (Another analogy would be Perl's built in variable $_
     but a Pipe analogy does fit the picture better. I think.)

  Because this behaviour sometimes is not providing enough complexity,
  any-dl also allows to store data/results in named variables.

The NDR-example explained
-------------------------

  The first command does a MATCH with regular expressions
  on the contents of the first document.
  It does the match on the document, which was downloaded by the implicit
  GET-command. This document was put onto the 1-valued-stack.
  The match command reads the argument (the document) from the stack,
  tries to match for the certain regular expression, and puts the result
  onto the stack.

  Then from the result (a match is a 2D-matrix, meaning an array of an array",
  the first row (index == 0) is selected with ROWSELECT.
  The resulting selection holds an array.
  This selection-result is put to the 1-valued-stack.
  The stack-value (selection-result) is stored in the named variable "url" for later use
  via the STORE-command.

    To come back to the pipe-analogy, it's like a pipe that would look like this
    (pseudocode):
      GET(<start-url>) | MATCH(<regular_expression>) | ROWSELECT( <index> ) | STORE( <varname> ) | ....


  The paste-command pastes the literal string and the contents of the named variable "url"
  together, and places the  result on the 1-valued stack.
  The system command tries to use the system() command (which you may know
  from other programming languages, the shell or the system-API)
  and as argument uses the value from the stack.

  So, if the variable "url" contanins the video-url,
  the system()-call would look like this one:

      system("wget <video-url>");



  That is the parser language explained by example.



  I hope, this example shows you, what there is all about the input language
  (parser definition language).

  It's comparingly easy (IMHO), and in this way it will be possible to have easy access
  to a lot of different video archives, all with the same tool.
  So, it is not necessary to look for tool-updates, when some URLs and how they
  are connected together, on a video-archive-page, do change.

  If something changes in the way a video url is presented on one of these
  video/archives / Mediatheken, then only the according parser-definition
  needs to be updated. The tool any-dl itself does not needed to be changed.

  Also, all the different tools that provide video-download-functionality,
  with all
  their seperated effort of the programmer (many programmers), done to make only
  certain archive be accessed, can be freed to make just the basic analyzing of
  the webpages that provide the videos, and save effort to program a tool.
  So, one tool and many archives, instead of many tools for some archives.

  So, I think the advantage may be obvious to you.

  Now, details about the language will follow.



Language Features:
==================

  Parser-Definitions:

  parsername "<parser-name>": ( <list-of-urls> )
    start
      <command_1>
       ...
      <command_n>
    end

  Example: see above.

  <list-of-urls> is a comma-seperated list of strings.


  Commands all end with a semicolon ( ';' ).
  Commands / functions, that do not have parameters, will be used without
  parenatheses ( '(' and ')' ).
  Only when a command / function will need arguments,
  these will be passed inside parenatheses ( '(' and ')' )
  which follow the name of the command/function.

  Some commands are available with and without parantheses.
  An example is the print-command/function.



  Stringquoting at the moment has three dfferent styles:

    String-Quoting:  "    "
    String-Quoting:  >>>  <<<
    String-Quoting:  _*_  _*_



  The language offers a stack of size 1.
  That means, that results from one command / function
  can be passed as input for the next command/function
  and this is default behaviour.
  Not all commands / functions do need the stack
  for input, and not all do leave something there
  as result (and input for following functions/commands).

  But if there is the need for transfering a result,
  normally no additional variables are needed.
  Most often, the data can be transferred from one function/command
  to the next one via the 1-valued-stack.

  But in certain cases, this is not enough.
  For these cases there are named variables also.

  To store the current data from the default-stack
  under a certain name, the command
    store("<variablename>");
  will be used.

  To copy (restore/recall) the value of the named variable back to the default stack,
  the command 
    recall("<variablename>");
  can be used.

  In the paste()-command/function, it is possible, to access
  named variables via the $-notation, that you might know from
  other programming languages, like Perl for example.

  In the NDR-parser, it looks like this:
      paste("wget ",  $url );

  This does paste together the literal string "wget "
  and the contents of the named variable "url".
  The result of paste is stored at the one-valued stack.
  And the system-command uses this value as it's argument
  (and therefore downloads a file with the wget-tool).


Startup-sequence:
-----------------

The document(-url) given via command line is loaded automatically.
The loaded document is automatically saved as a named variable (name: "BASEDOC").



Command Line Options:
---------------------

    -l list parser-definitions and related URLs
    -p <parsername>  selects a certain parser, to be used for all urls.
                     The names that can be selected can be listed with
                     the -l option, or one can look into the rc-file.
    -f filename for rc-file
    -v  verbose output
    -vv very verbose output
    -c show commands of parserdef-language
    -v verbose     
    -s safe: no download via system invoked
    -i interactive: interactive features enabled

    -a     auto-try: try all parsers
    -as    auto-try-stop: try all parsers; stop after first success

    -u     set the user-agent-string manually
    -ir    set the initial referrer from '-' to custom value 
    -ms    set a sleep-time in a (bulk-) get-command in milli-seconds
            => sleeps only for bulk-get-commands (get that would call a list of documents,
               not for single get-commands)
    -sep   set seperator-string, which is printed between parser-calls

    -help  Display this list of options
    --help  Display this list of options



Examples:
---------

  1.: Print html-links of a webpage:

    If you want to print the href-links of html,
    use any-dl with the predefined parser for link-extraction:

      $ any-dl -p linkextract   <url_list>



List of commands/keywords and a short-description of them:
=========================================================


  appendto
    Appends tmpvar to a named variable.
    If that variable does not exist already, it's internally created as empty match-result
    (which means the appendto-command then creates the varable itself with the new data.)

    "appendto" only works on match-results.

    Two match results will be concatenated this way, and be saved in the
    named variable automatically. (No store-command is needed after append.)

    The itmes will be concatenated as Rows, so adding two matchres'
    will add the second matchres as appending it's rows to the matchres in the
    named variable.


  basename
    creates the basename of an url or filename;
    the leading filename or URL-path is removed


  call
    call a macro.
    The macro is working like textual insertion of the commands of the
    macro at the place where the "call"-command is used.


  csv_read
    csv_read reads in a file as csv-file.
    The result is placed in tmpvar as Match_result.


  csv_save_as
    csv_save_as does save a *match_result* to a csv-file.
    All data is transformed to have equal number of columns in each row.
    Arguments of csv_save_as() are appended into a resulting filename.


  csv_save
    csv_save does save a *match_result* to a csv-file.
    All data is transformed to have equal number of columns in each row.
    The filename is derived from the used STARTURL.
    The charcater set is shrinked down to a subset of ASCII.
    ".csv" is appended automatically.


  colselect
    selects columns from a match-result
        # Example:
        # --------
        colselect(2);


  delete
    deletes / removes a variable.
    It is not accessible anymore then.
    This means: accessing it can result in an error,
    because it's like accessing a variable that was not
    defined at all.


  download
    downloads an entity and storing to a file.
        # Examples:
        # ---------
        download;
        download( $filename );


  dropcol
    drops a column from a match-result


  droprow
    drops a row from a match-result


  dummy
    just a dummy command (something like a NOP of processors)


  dump
    dump a html-page: deparses the tags, prints tags and data
    annotated; data is indented and an underline prepended.
    The underline is a multitude (defaults to 2) of the deepness
    of the nesting in the parse-tree.
    Means: the deeper something is wrapped in tags, the higher the indentation.


  dump_data
    dump a the data-part of a html-page: deparses the tags, prints data part,
    and NOT the tags.
    Works like un-tag html, or like a html-2-text.


  emptydummy
    just a dummy command (something like a NOP of processors),
    but gives back Empty as tmpvar


  end
    end-keyword for the parser-definition


  grep
    extract matching elements from data


  grepv
    extract non-matching elements from data
    (grepv: grep -v)


  exitparse
    exit's a parse of one parser.
    This means, that the URL that is currently tried to be parsed and
    worked on, will not be further investigated.
    But if there are more than one URl given via command-line,
    then the next url will be investigated.
    This means: even if by accident your parser for one url is
    exited (e.g. you are developing the parser for that URL),
    the next one will be worked on.


  get
    gets a document like html or xml page.
    Could also be a file, but not a stream so far.


  htmldecode
    Decodes the HTML-Quotings like &#034; and such stuff
    back into "normal" characters.


  iselectmatch
    this is an interactive selectmatch. ("i" for interactive).
    Without the "-i" switch on the command line, it behaves
    like selectmatch().
    But when the "-i" switch is set via command line,
    then an interactive menue will be displayed, so that
    the user can select an option; this option will allow
    to select the row by the selected column-index interactively.
    The user selects a number (beginning from 0).
    The corresponding column of the selected number
    will be used for selection of the row.
    If the input is not valid, a default value will be used.
    The default value is the value, that is the second arg of
    iselectmatch(). It would be the same as a hard coded selection
    of a selectmatch().
    So, in most cases it would make sense to use iselectmatch()
    instead of selectmatch().
        # Example:
        # --------
        iselectmatch( <col_idx>, <matchpat>, <default_pattern>);


  linkextract
    extracts href-links from html-pages;
    relative links will tried to be converted into absolute links.


  linkextract_xml
    extracting href-items of an xml-document


  list_variables
    displays all named variables.
    Prints variable-name only.
    (show_variables does also print the contents of the variables)


  makeurl
    tries to make an url from a string


  match
    tries to match to the used pattern.
    PCRE-matches are used.
    The result is a matrix, containing of
    rows-of-"column"-elements.

    Please note:
      For real matches: Col 0 is the whole match, all others are the groups of a match.
      For match_results, thatare just "arrays of arrays" (not coming from a match,
      this obviously does not hold.
      If you do a match, and want only the selected groups to appear in your
      result, use
          dropcol(0);
      to kick out the whole-match.

        # examples:
        match("Regexp-String");
        match(>>>another "Regex"-String<<<);


  mselect
    a multiple-select, like select, but the result will be an array
    of items (Strings or URls) not a single element.
        # Example:
        # --------
        mselect(1,2);


  parsername
    this keywords starts the definition of a parser.


  paste
    the paste()-command creates a string from strings and variable-names (-notation).
    paste() accepts a list of items, seperated by commas (",").
        # Example:
        # --------
        paste( "literal string", $varname, "foo", $bar );


  post
    post does make a post-request (instead of get-request) to a webserver.

    The post-data is stored in named variables; the names of the variables
    will be given to post as arguments, e.g.:
      post( "name_1", "name_2" );
    and the values will be looked up internally.
    For that purpose, the post-data has to be stored in named variables,
    before the post-command is called, so that the value for a variable
    can be looked up by the post-command.

    The URL for the post-command is taken from tmpvar.

        # Example:
        # --------
        post("valname_1", "valname_2", "valname_3"); # the values must be set as named variables before.


  print
    print invoked without parantheses prints the value on the one-val-stack.
    print() with parantheses prints strings and variables (denoted by $-notatation),
    which means it accepts the same parameters as paste() but does not change the
    one-val-stack.
    print() used on an empty string does end the line automatically.
    This means, a new line will be used for further commands.
    If you wish to print only a certain string, without line-endlings added,
    you need to use print_string()


  print_string
    accepts only one string-argument and prints it.
    It prints the plain string, and does not add line-ending automatically.


  quote
    wraps the one-val-stack value with '"' and '"'.
    needed for arguments that are given to other tools,
    which will be invoked bia system() (which is invoking
    a shell).


  readline
    reads one line from stdin / console.
    Without arguments, the input is stored in the TMPVAR,
    With argument, the argument is used as variable-name,
    and the input is stored in this named variable.

        # Examples:
        # ---------
        readline;
        readline("VarnameForInputLine");


  recall
    get a named value and store it on the one-val-stack.
        # Example:
        # --------
        recall("varname");


  rowselect
    selects a certain row from a match-result.
        # Example:
        # --------
        rowselect(0);


  save
    saves a document to a file.
    The filename is derived from the url of the document.
    The charcater set is shrinked down to a subset of ASCII.


  save_as
    saves a document to a file with filename as argument.


  select
    selects ONE part of a tmpvar.

    Examples: select(0);
              select(3);

    For rows and columns:

    document:
    ---------
      0 selects the document,
      1 selects the url of the document

      any other value  selects the document too

    document-array:
    ---------------
    selects document with index (starting at 0)
    rows/columns:
    -------------
      selects ONE ELEMENT from a row or a column.
      The row/column must already have been selected with rowselect() or colselect().
      select() does NOT allow matches on match-results (which are a matrix internally).
        # Example:
        # --------
        select(2);


  selectmatch
    allows to select a row from a match-result, by specifiying
    a column-index and a string-matching-pattern for this certain
    element.
    So, this is a more advanced rowselect() with additional matching capabilities.


  show_match
    shows a match-result in a certain way; this command is
    intended to display matchese in a way, wher they can be read easily.
    Most often will be used in parser-development.
    But can of course also be used for informing the user
    on the steps that any-dl has done (e.,g. just be verbose and
    display the matches). But normally, rather developers
    will be interested in these details.


  show_type
    just shows the "type" of the value in the one-val-stack.


  show_variables
    displays all named variables.
    Prints variable-name and contents of the variable.
    (list_variables does only print the names of the variables)


  start
    this keyword indicates the start of the keywords section
    of a parser definition.


  store
    store the value from the one-val-stack as named variable.
    (use recall() for getting it back to the one-val-stack, or
     $-notation in some of the commnds that accept this notatiom).
       # Example:
       store("varname");


  storematch
    Stores the tmpvar (must be match-result) to a named variable,
    with Row- and Column-Indexes as part of the name:
      storematch("MyName"); # stores matchresult as  MyName.(col).(row)
      (for all col's and row's as indexes of the match-result)


  subst
    string-substiturion.
    Uses Pcre.replace internally.
        # Example:
        # --------
        subst("pattern", "subst-string");


  system
    calls the system() command with the string that is hold in
    the one-val-stack.


  table_to_matchres  (expermental feature so far)
    converts a html-table to a match-result.
    This conversion works for single tables.
    So, a selection of a table should be as specific as possible, so that
    only one table will be seleted with tagselect.
    Then the conversion works.
    If more than one table has been extracted by tagselect, then
    they all will becoerced into ONE mathc-result.
    If that's, what is wanted, anything is fine. Otherwise, seperate table-selection will be necessary.

    Use tagselect with "htmlstring"-extractor, like this:

       # Example:
       # --------
         tagselect("table"."id"="foobar" | htmlstring );
         table_to_matchres;
         csv_save;


  tagselect
    selects tags and "subtags" from a document tree and gives back
    data accordingly.
    selection can be a *list* of tags, and optionally the argument
    "args" or the argument "arg" with a key-parameter (of a key-value pair)
    that selects the certain argument.
    See above in the command-examples for syntax details.

    Selection list does do a selection on the firt selector-specification.
    Then the resulting stuff is again selected, and so on.

    Example:
    --------
        tagselect("table", "a", "img"."align"="top"| dump);

      The document is first scanned for table's.
      The outermost match is selected. So if a table is inside a table,
      the outer tag will be selected, and the whole outer table be selected.
      The inner table would just be content of the first one.
      No in-depth selection is done.

      All found table's then are scanned for <a ...> tags, which
      should be the <a href="..."> stuff.
      From the found <a ...>-tags any img-tags inside these a-tags
      will be selected, if they also are top-aligned.

      The result then is dumped to screen/console.

    tagselect selects elements from the document tree,
    so that a selection picks that certain tag and all it's descenmdants.
    That means for example, that a data-slurp-extraction will show all data from the descendants.
    But all other extractors ONLY LOOK UP THE TOPMOST element.
    (And not the desendants)
    The reason is: that the selected element normaly is what needs to be analyzed,
    not necessarily the descendants.

    With the "anytag" selector in tagselect (e.g. 'tagselect( anytags, argpairs );' )
    ANY tag is selected, so ALL tags are TOPMOST tags, because any descendant also
    is edetected as a new tag.
    This is a depth-first selection, with each element being a top-element.
    This way you can access all descendants and analyse them, fr example
    extract all argpairs from all the tags of the whole document.

        # Examples, showing the allowed syntax:
        # -------------------------------------
        tagselect( "a"| dump );    # dumps all <a ...> tags
        tagselect( "br"| dump );   # dumps all <br>-tags
        tagselect( "table", "a"| dump ); # <a ...> inside tables will be dumped
        tagselect( "img"."src"| dump );  # <img src="...">   wil be dumped

        tagselect("table", "a", "img"."align"="top"| dump); # all img-tags with "align"="top" will be selected,
                                                            # if they appear inside a table; the stuff is dumped to screen

        tagselect( "table", "a" | argpairs );      # extract argpairs from the stuff that was selected
        tagselect( "table", "a" | arg("href") );   # extract value for the arg with key/name "href" from the stuff that was selected

        # the pair-extratcors ( "argpairs", "argkeys", "argvals" ) can be used as single-extractor-arguments
        # the other selectros select one item only (not pairs) and can be given as list, like this:

        tagselect("img"."src" | arg("src"), arg("alt") );

        # tagselect used with "anytags"-selector 
        # --------------------------------------
        # the "anytags"-selector selects ANY tags,
        # which means that ALL tags from the document are
        # picked up in depth-first manner.
        # without anytags, a match does pick a tag with all descendants.
        # But these descendants will not be extracted with a extractor-pattern!
        # --------------------------------------
        tagselect( anytags | argpairs ); # shows argpairs of ANY / ALL tags found (depth-first)


  titleextract
    extracts the contents from the <title>-tag of a webpage
    and puts the resutlt to the one-val-stack.


  to_string
    converts the value of the one-val-stack to a string-representation.


  to_matchres
    converts the value of the one-val-stack to a value of the same type,
    that a match-operation gives as result.
    This is an a row-column "array" and show_match-command could show
    details about it.
    Useful for later selecting/rowselecting values from this matchres-typed
    value.


  transpose
    transposition of a Match_result (which is an array of arrays, or a "matrix").
    This exchanges rows and columns.


  sort
    Sorts entries. (only match-result's so far.)


  uniq
    From the tmpvar it removes entries with same contents.
    Works like "uniq" from unix-toolbox, but not limited to neighbouring lines;
    or like "sort -u" without a sort (not changing the order of the entries).

    Please note: on match-results, uniq works on ROWs. This means, that multiple
    rows will be discarded.
    But multiple Columns in a row will not be touched.
    If you want to remove multiple columns of a row, you need to first transpose,
    then use uniq, and then transpose again.

    So, for uniq-ing columns (remove multiple equal columns), you need to do it this way:
      transpose;
      uniq;
      transpose;



New feature (as of december 2014): assignments
=================================

  It's now also possible to use assignments.

      varname = COMMAND(...);

  assigns the tmpvar that was created by the command
  (internally via Store-command) to a named variable.

  Using assignments is not different to calling a command
  and then do store(<varname>);

  It's just syntactic sugar,
  maybe helpful in situations, where many store/recall comands would be
  needed.

  The tmpvar will be placed on the tmpvar-stack,
  so a previous stack-value will be replaced by the new one.



Control Structures:
===================

  if( ... )
  then ...
  else ...
  endif

    Here the "..." stands for statements that can be used there.
    (statement-list / command-list; each command needs to be followed by a semicolon)

    Instead of "if" also "ifnotempty" and "ifne" can be used.
    All three behave the same way, but some people may prefer more
    specific keywords.
    So, if the tmpvar that the statements-list inside the "if"-command (in the paranteheses)
    lefts behind, is not empty, the condition is true.
    (Thats like in otherlanguages like C or Perl).


  while( ... )
  do
    ...
  done

    Here the "..." stands for statements that can be used there.
    (statement-list / command-list; each command needs to be followed by a semicolon)

    Instead of "while" also "whilenotempty" and "whilene" can be used.
    All three behave the same way, but some people may prefer more
    specific keywords.
    So, if the tmpvar that the statements-list inside the "if"-command (in the paranteheses)
    lefts behind, is not empty, the condition is true and the statementlist
    between "do" and "done" will be evaluated.
    This is repeated, as long as the statement-list between the while-parantheses
    evaluates to the empty value.


Prefedined Variables:
=====================

  STARTURL
    The URL that is given via cli and investigated by the corresponding
    parser can access the url via this named variable.
    (recall("STARTURL") or $STARTURL)

  BASEDOC
    The document, that is retrieved via $STARTURL is avaiable as
    names variable BASEDOC.

  COOKIES.RECEIVED
    Cookies received from a webserver will be stored here.

  COOKIES.SEND
    Cookies which should be send to the webserver are stored here.

  NOW
    Unix-Timestamp (seconds 00h00m00s GMT 01.01.1970) as string.


Using Cookies:
==============
  If a server sends cookies, they will be stroed in the named variable
  "COOKIES.RECEIVED".
  If you want to send them back to the server, you manually need to copy
  the contents of "COOKIES.RECEIVED" to "COOKIES.SEND".
  Just add these two commands between the commands which get cookies and which
  should send cookies:

    recall("COOKIES.RECEIVED"); # recall the cookies from named variable and put them on the one-var-stack
    store("COOKIES.SEND");      # store the cookies from the one-var-stack in the variable "COOKIES.SEND"

  This handling seems a bit unconvenient; cookies could be received and sent automagically.
  But this manual handling gives you more control over the process.
  Be aware, that the contents of "COOKIES.SEND" is not changed automatically.
  So, wzhat is stored in that variable will be used again and again in next calls to the webserver.
  So, you need to keep track of the right cookies by using the above mentioned way of
  setting "COOKIES.SEND", or delete that variable.



Macros
======
  A new feature was added in end of march 2015: macros.

  It's now posible to define macros, so that repeating sequences don't need to
  be coded again and again. Instead a macro can be defined that allows
  factoring-out common command sequences into macros, and then call these
  macros with the "call"-command. (See call-command above, in the
  parserdef-language description.)

  Macros can be defined anywhere in a rc-file.
  They don't need to be defined before they are used.


  The syntax of macro definitons can be seen in this example:


  defmacro "FooBar":
  start
    tagselect( "a"."href" | arg("href") );
  end


  This macro would be called with the command
    call("FooBar");



This macro would be called with the command
  call("FooBar");



__END__

About

any-dl: generic mediathek-downloader ("generic" means, you also can call it "scrapertool")

Resources

License

Stars

Watchers

Forks

Packages

No packages published