any-dl: generic mediathek-downloader ("generic" means, you also can call it "scrapertool")
klartext/any-dl
any-dl, a tool for downloading Mediathek video-files
====================================================

Overview
========

The tool any-dl is inspired by, and has its name derived from, tools like youtube-dl, arte-dl, dctp-dl, zdf-dl, ...
These tools are specialized downloading tools for videos from YouTube as well as from TV broadcasting companies.

All these tools download video files, and to accomplish this task they need to download and analyze the webpages through which those videos are presented to the viewer. Each of these small tools is programmed to work only with certain video archives, and a lot of work goes into this kind of tool.

any-dl is intended to be generic enough to allow downloads of videos from all these platforms. For this purpose it provides a Domain Specific Language (DSL) which defines how the videos of a certain server can be downloaded. The DSL is designed for defining parsers, which say how to scrape the archives.

This language will be explained below. But as a normal user, you normally will not need to know this parser definition language. You just need to know how to use any-dl, and this is pretty simple. So, before the parser definition language is explained, the usage of any-dl will be explained.

any-dl provides a program together with a language for parsing websites, with a focus on video download.

If you miss a parser for a certain site, or if you have written one on your own, please let me know.

As any-dl delegates the stream downloads to certain tools, it makes sense to have the following tools installed as well:

  - rtmpdump

Compilation / Installation / Setup
==================================

When you read this README file from within the directory you downloaded (via git or elsewhere), then you already have unpacked it. You need to compile and install / set up the tool. You will need to have OCaml installed, as well as some libraries.
The library ocamlnet currently (May 2024) is not available for OCaml 5.x, so you need to use OCaml 4. The version currently used in development is OCaml 4.14.2.

When using OPAM, you need the following packages:

  - pcre
  - ocamlnet
  - conf-gnutls (gnutls must be installed on your system too)
  - xmlm
  - yojson
  - csv

If you don't use OPAM but instead the package manager of your Linux installation, the packages might have different names. On Arch some packages might only be available via AUR.

At the moment any-dl itself has no OPAM package. It might be added later.

To compile any-dl, just type "make" at the shell:

    $ make

Then the file "any-dl" should be in the current directory. You can then copy it to $HOME/bin if your PATH variable points to it. Or you may copy it to /usr/local/bin or /usr/bin, depending on the Linux/Unix system you are using and the filesystem standard it follows.

The file "rc-file.adl" contains the parser definitions that any-dl needs to work as expected. This file must be available in one of three places:

  - /etc/any-dl.rc
  - $XDG_CONFIG_HOME/any-dl.rc   ( default: $HOME/.config/any-dl.rc )
  - $HOME/.any-dl.rc

So, please copy the file "rc-file.adl" from the local directory where you built any-dl to one of the three places mentioned above.

This command would do it for the default of the XDG_CONFIG_HOME environment variable:

    $ cp rc-file.adl $HOME/.config/any-dl.rc      # copy the config file to the XDG-default-dir

But if the XDG_CONFIG_HOME environment variable is set, it's better to use it:

    $ cp rc-file.adl $XDG_CONFIG_HOME/any-dl.rc   # copy the config file to the XDG-dir

( For command-line newbies: the "$" symbol in the command lines above represents the prompt of the shell; don't type it. )

If you want to add your own parser definitions, it makes sense to save the file "rc-file.adl" in the /etc/ directory, as mentioned above, and then add your own parser definitions in one of the places inside your HOME directory where an any-dl config file can be placed. This has the advantage that the config file coming with any-dl is always placed in the /etc/ directory, while your local parser definitions are saved in your local configs in $HOME.

PLEASE BE AWARE THAT ANY-DL READS ALL CONFIG-FILES AS IF THEY WERE CONCATENATED INTO ONE BIG FILE.

So, if you want to add your ADDITIONAL parsers to those that already exist in "rc-file.adl" ( e.g. copied to /etc/any-dl.rc ), just edit the config files in your $HOME directory and write your own definitions into these files. It's not necessary (and also not recommended) to keep there a copy of "rc-file.adl" to which you added your parsers. Write ONLY your own parsers into your local files.

In other words: if you want to add your own parser definitions, place them solely in one of the default places for config files in $HOME. Don't copy the contents of "rc-file.adl" stored in /etc/any-dl.rc.

( If you place the config file "rc-file.adl" in $XDG_CONFIG_HOME/any-dl.rc, you can add your parsers in $HOME/.any-dl.rc. Of course you can also add your additional parsers in the place where all parsers from "rc-file.adl" are stored, but when you update to a newer version of any-dl you may accidentally overwrite your own definitions with the new "rc-file.adl". )

IF YOU WISH TO USE OTHER CONFIG-FILES, you can specify them with the -f option of any-dl. The -f option takes a filename path, which is then used as config file.
BE AWARE: All DEFAULT PLACES of config files WILL THEN BE IGNORED!
If you wish to use more than one config file, just use the -f option more than once.
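The lookup rules above can be modelled in a few lines of Python. This is only an illustrative sketch, not any-dl's actual (OCaml) implementation: the function names are hypothetical, and the order in which the default files are read is an assumption, since this README only lists the places.

```python
from pathlib import Path

def default_config_candidates(env):
    """Return the three default config locations listed in this README.
    The order is an assumption; the README does not state a read order."""
    home = Path(env["HOME"])
    # XDG_CONFIG_HOME falls back to $HOME/.config when it is unset or empty.
    xdg = Path(env.get("XDG_CONFIG_HOME") or home / ".config")
    return [Path("/etc/any-dl.rc"), xdg / "any-dl.rc", home / ".any-dl.rc"]

def config_files_to_read(env, f_options=()):
    """If -f was given (possibly several times), ONLY those files are used.
    Otherwise every default location that exists as a file is used."""
    if f_options:
        return [Path(p) for p in f_options]
    return [p for p in default_config_candidates(env) if p.is_file()]

def load_definitions(paths):
    """Join all files, mirroring 'as if they were concatenated into one big file'."""
    return "\n".join(p.read_text() for p in paths)
```

For example, `config_files_to_read(os.environ, ["mine.adl"])` (with the hypothetical file name "mine.adl") yields only that one path, which mirrors the rule that -f makes any-dl ignore all default places.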
Usage
=====

You need to provide the URL from the video archive and give it to any-dl as a command line argument. Very often you have to quote the URL inside of " and " so that certain symbols are not interpreted by the shell from which you start any-dl.

For example, on the ARTE mediathek there is a telecast "Frankreichs mythische Orte", and its URL is:

    http://videos.arte.tv/de/videos/frankreichs-mythische-orte--7167432.html

If you want to download the video, it looks like this at the shell:

    $ any-dl "http://videos.arte.tv/de/videos/frankreichs-mythische-orte--7167432.html"

Then any-dl would download the video. :-)

That's all :-)

The same principle holds true for any other archive for which a parser definition is already provided. If no such parser is defined, any-dl will tell you with an exception message. You may then ask whether a parser for it is already available, written by the author of any-dl or by somebody else. Or you could learn the parser definition language and program your own parser for that archive. If you send the parser you wrote to the author of any-dl, other people can use it too in a newer release of any-dl.

By the way: there are also some parser definitions that are not focussed on certain video archives. There is the parser "linkextract" as well as "linkextract_xml". You can use them to pick out HTML hyper-references (typically called "links" or "references") or links in XML files.

A certain parser can be picked with the command-line switch "-p":

    $ any-dl -p linkextract "http://videos.arte.tv/de/videos/frankreichs-mythische-orte--7167432.html"

This will print out all hrefs of the document (and they should all appear as absolute URLs).

The names of all defined/available parsers can be displayed with the "-l" switch:

    $ any-dl -l

If a parser has URLs on which it is invoked by default (when not using -p), they are also displayed by the -l switch.
If you want to write your own parser definitions, you need the list of commands. You can get it with the -c switch:

    $ any-dl -c

This will print a list of all keywords that the lexer/scanner accepts.

That's enough for an introduction to the usage. What follows is a brief introduction to the parser definition language.

Parser-Definition Language: Intro
=================================

Here is a simple parser definition that picks out all HTML hyper-references from a webpage and prints them.

    parsername "linkextract": ( "" )
    start
      linkextract;
      print;
    end

As you can see, the definition gives a parsername to the parser, and in between "start" and "end" the commands that define the parser are listed. A get-command that downloads the URL (which is given via the command line) is done implicitly. Then the commands "linkextract" and "print" are executed. So, all links from the document referred to by the URL are printed.

The part with the parentheses and quoting symbols allows binding certain URLs to this parser, so that a parser can be selected automatically via the URL. A parser dedicated to a certain URL will then be invoked to work on the document that has this URL. Via command line arguments it is possible to select a different parser than the default.

As an example, look at the parser that looks up the video files of the NDR TV broadcaster in Germany:

    # Example-URL: http://www.ndr.de/fernsehen/sendungen/mein_nachmittag/videos/wochenserie361.html
    #
    parsername "ndr_mediathek_get": ( "http://www.ndr.de" )
    start
      match( "http://.*?mp4" );
      rowselect(0);
      store("url");

      # download the video
      # ------------------
      paste("wget ", $url );
      system;
    end

There you can see that the parser name is set to "ndr_mediathek_get", and the URL to which this parser is bound by default is "http://www.ndr.de".
This means that any URL that starts with "http://www.ndr.de" will be parsed with the "ndr_mediathek_get" parser. If you give a URL like the one in the example (shown above the parser) as command line argument to any-dl, then the parser "ndr_mediathek_get" is invoked to look for the video file.

Again, an implicit get is invoked. Because the first document must be downloaded in any case, the first get is done implicitly, which also makes writing the parsers easier.

Stack and named variables
-------------------------

This language is somewhat special in that it mixes a stack-based language with one that allows named variables. The stack has a size of one value. Most functions use the stack: they get their argument from there and put their result onto it.

A one-value stack that is used to read arguments from and save results to behaves like a pipe in a Unix environment. Something is written to a pipe by someone, and the same thing is read from the pipe by someone else. So, the stack emulates something like a pipe. (Another analogy would be Perl's built-in variable $_, but I think the pipe analogy fits the picture better.)

Because this behaviour is sometimes not powerful enough, any-dl also allows storing data/results in named variables.

The NDR-example explained
-------------------------

The first command does a MATCH with a regular expression on the contents of the first document, i.e. the document that was downloaded by the implicit GET-command. This document was put onto the 1-valued stack. The match command reads its argument (the document) from the stack, tries to match the given regular expression, and puts the result onto the stack.

Then, from the result (a match is a 2D matrix, meaning an array of arrays), the first row (index == 0) is selected with ROWSELECT. The resulting selection holds an array.
This selection result is put onto the 1-valued stack. The stack value (the selection result) is stored in the named variable "url" via the STORE-command for later use.

To come back to the pipe analogy, it's like a pipe that would look like this (pseudocode):

    GET(<start-url>) | MATCH(<regular_expression>) | ROWSELECT( <index> ) | STORE( <varname> ) | ....

The paste-command pastes the literal string and the contents of the named variable "url" together and places the result on the 1-valued stack.

The system-command uses the system() call (which you may know from other programming languages, the shell or the system API) with the value from the stack as argument. So, if the variable "url" contains the video URL, the system() call would look like this one:

    system("wget <video-url>");

That is the parser language explained by example.

I hope this example shows you what the input language (parser definition language) is all about. It's comparatively easy (IMHO), and in this way it is possible to have easy access to a lot of different video archives, all with the same tool.

So, it is not necessary to look for tool updates when some URLs on a video archive page, or the way they are connected together, change. If something changes in the way a video URL is presented on one of these video archives / Mediatheken, then only the corresponding parser definition needs to be updated. The tool any-dl itself does not need to be changed.

Also, the programmers of all the separate download tools, each of which takes effort to make only certain archives accessible, are freed to do just the basic analysis of the webpages that provide the videos, and are saved the effort of programming a whole tool.

So, one tool and many archives, instead of many tools for some archives.

I think the advantage may be obvious to you.

Now, details about the language will follow.
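To make the one-value stack model from the NDR example more concrete, here is a small Python emulation. It is a sketch under stated assumptions, not any-dl's real code: the class and method names are hypothetical, Python's "re" module stands in for PCRE, the example document and the example.org URL are made up, and joining the elements of a stored row when pasting is an assumption about how paste treats an array. The system step is deliberately left out, so nothing is downloaded.

```python
import re

class OneValueStack:
    """Hypothetical model: one temporary value plus a table of named variables."""

    def __init__(self, document):
        self.tmpvar = document   # the implicit get has put the document here
        self.variables = {}      # named variables, filled by store()

    def match(self, pattern):
        # A match result is a matrix: one row per match, column 0 is the
        # whole match, the other columns are the groups.
        self.tmpvar = [[m.group(0), *m.groups()]
                       for m in re.finditer(pattern, self.tmpvar)]
        return self

    def rowselect(self, index):
        self.tmpvar = self.tmpvar[index]          # one row, i.e. an array
        return self

    def store(self, name):
        self.variables[name] = self.tmpvar        # tmpvar stays on the stack
        return self

    def paste(self, *items):
        # Items written as "$name" are looked up as named variables here;
        # everything else is taken as a literal string.
        parts = []
        for item in items:
            value = self.variables[item[1:]] if item.startswith("$") else item
            # Assumption: an array value is pasted as its joined elements.
            parts.append("".join(value) if isinstance(value, list) else value)
        self.tmpvar = "".join(parts)
        return self

# Made-up document standing in for the page fetched by the implicit get.
page = '<a href="http://example.org/video/clip.mp4">clip</a>'
stack = OneValueStack(page)
stack.match(r"http://.*?mp4").rowselect(0).store("url").paste("wget ", "$url")
print(stack.tmpvar)   # the string that the system-command would receive
```

Each method returns the object itself so that the calls chain from left to right, which mirrors the pipe analogy used above.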
Language Features:
==================

Parser-Definitions:

    parsername "<parser-name>": ( <list-of-urls> )
    start
      <command_1>
      ...
      <command_n>
    end

Example: see above.

<list-of-urls> is a comma-separated list of strings.

Commands all end with a semicolon ( ';' ).

Commands / functions that do not have parameters are used without parentheses ( '(' and ')' ). Only when a command / function needs arguments are these passed inside parentheses ( '(' and ')' ) which follow the name of the command / function.

Some commands are available with and without parentheses. An example is the print-command/function.

String quoting at the moment has three different styles:

    String-Quoting:  "  "
    String-Quoting:  >>>  <<<
    String-Quoting:  _*_  _*_

The language offers a stack of size 1. That means that the result of one command / function can be passed as input to the next command / function, and this is the default behaviour.

Not all commands / functions need the stack for input, and not all leave something there as result (and input for following functions / commands). But if a result needs to be transferred, normally no additional variables are needed. Most often, the data can be transferred from one function / command to the next one via the 1-valued stack.

In certain cases this is not enough. For these cases there are also named variables.

To store the current data from the default stack under a certain name, the command

    store("<variablename>");

is used.

To copy (restore/recall) the value of the named variable back to the default stack, the command

    recall("<variablename>");

can be used.

In the paste()-command/function, it is possible to access named variables via the $-notation, which you might know from other programming languages, like Perl for example.

In the NDR-parser, it looks like this:

    paste("wget ", $url );

This pastes together the literal string "wget " and the contents of the named variable "url".
The result of paste is stored on the one-valued stack. The system-command uses this value as its argument (and therefore downloads a file with the wget tool).

Startup-sequence:
-----------------

The document(-url) given via command line is loaded automatically. The loaded document is automatically saved as a named variable (name: "BASEDOC").

Command Line Options:
---------------------

    -l               list parser-definitions and related URLs
    -p <parsername>  selects a certain parser, to be used for all urls.
                     The names that can be selected can be listed with the -l option,
                     or one can look into the rc-file.
    -f               filename for rc-file
    -v               verbose output
    -vv              very verbose output
    -c               show commands of parserdef-language
    -s               safe: no download via system invoked
    -i               interactive: interactive features enabled
    -a               auto-try: try all parsers
    -as              auto-try-stop: try all parsers; stop after first success
    -u               set the user-agent-string manually
    -ir              set the initial referrer from '-' to custom value
    -ms              set a sleep-time in a (bulk-) get-command in milli-seconds
                     => sleeps only for bulk-get-commands (get that would call a list
                        of documents, not for single get-commands)
    -sep             set separator-string, which is printed between parser-calls
    -help            Display this list of options
    --help           Display this list of options

Examples:
---------

1.: Print html-links of a webpage:

If you want to print the href-links of html, use any-dl with the predefined parser for link-extraction:

    $ any-dl -p linkextract <url_list>

List of commands/keywords and a short-description of them:
=========================================================

appendto
    Appends tmpvar to a named variable. If that variable does not exist already, it is internally created as an empty match-result (which means the appendto-command then creates the variable itself with the new data). "appendto" only works on match-results. Two match-results will be concatenated this way and saved in the named variable automatically.
    (No store-command is needed after appendto.) The items are concatenated as rows, so appending one match-result to another adds the rows of the second match-result to the match-result in the named variable.

basename
    Creates the basename of a URL or filename; the leading filename path or URL path is removed.

call
    Calls a macro. The macro works like a textual insertion of the commands of the macro at the place where the "call"-command is used.

csv_read
    Reads in a file as csv-file. The result is placed in tmpvar as Match_result.

csv_save_as
    Saves a *match_result* to a csv-file. All data is transformed to have an equal number of columns in each row. The arguments of csv_save_as() are appended into a resulting filename.

csv_save
    Saves a *match_result* to a csv-file. All data is transformed to have an equal number of columns in each row. The filename is derived from the used STARTURL. The character set is reduced to a subset of ASCII. ".csv" is appended automatically.

colselect
    Selects columns from a match-result.

    # Example:
    # --------
    colselect(2);

delete
    Deletes / removes a variable. It is not accessible anymore afterwards. This means: accessing it can result in an error, because it's like accessing a variable that was not defined at all.

download
    Downloads an entity and stores it to a file.

    # Examples:
    # ---------
    download;
    download( $filename );

dropcol
    Drops a column from a match-result.

droprow
    Drops a row from a match-result.

dummy
    Just a dummy command (something like a NOP of processors).

dump
    Dumps an html-page: deparses the tags, prints tags and data annotated; data is indented and an underline is prepended. The length of the underline is a multiple (factor defaults to 2) of the nesting depth in the parse-tree. Means: the deeper something is wrapped in tags, the higher the indentation.

dump_data
    Dumps the data-part of an html-page: deparses the tags, prints the data part, and NOT the tags. Works like un-tagging html, or like an html-2-text.
emptydummy
    Just a dummy command (something like a NOP of processors), but gives back Empty as tmpvar.

end
    End-keyword for the parser-definition.

grep
    Extracts matching elements from data.

grepv
    Extracts non-matching elements from data (grepv: grep -v).

exitparse
    Exits the parse of one parser. This means that the URL that is currently parsed and worked on will not be investigated further. But if more than one URL is given via command line, then the next URL will be investigated. This means: even if your parser for one URL exits by accident (e.g. while you are developing the parser for that URL), the next one will be worked on.

get
    Gets a document like an html or xml page. Could also be a file, but not a stream so far.

htmldecode
    Decodes the HTML-quotings like &quot; and such stuff back into "normal" characters.

iselectmatch
    This is an interactive selectmatch ("i" for interactive). Without the "-i" switch on the command line, it behaves like selectmatch(). But when the "-i" switch is set via command line, an interactive menu is displayed, so that the user can select an option; this option allows selecting the row by the selected column-index interactively. The user selects a number (beginning from 0). The corresponding column of the selected number will be used for selection of the row. If the input is not valid, a default value will be used. The default value is the value that is the second arg of iselectmatch(). It would be the same as a hard-coded selection of a selectmatch(). So, in most cases it makes sense to use iselectmatch() instead of selectmatch().

    # Example:
    # --------
    iselectmatch( <col_idx>, <matchpat>, <default_pattern>);

linkextract
    Extracts href-links from html-pages; any-dl tries to convert relative links into absolute links.

linkextract_xml
    Extracts href-items of an xml-document.

list_variables
    Displays all named variables. Prints the variable-name only.
    (show_variables also prints the contents of the variables.)

makeurl
    Tries to make a URL from a string.

match
    Tries to match the given pattern. PCRE-matches are used. The result is a matrix, consisting of rows of "column"-elements. Please note: for real matches, col 0 is the whole match, all others are the groups of the match. For match_results that are just "arrays of arrays" (not coming from a match), this obviously does not hold. If you do a match and want only the selected groups to appear in your result, use dropcol(0); to kick out the whole match.

    # examples:
    match("Regexp-String");
    match(>>>another "Regex"-String<<<);

mselect
    A multiple-select, like select, but the result will be an array of items (strings or URLs), not a single element.

    # Example:
    # --------
    mselect(1,2);

parsername
    This keyword starts the definition of a parser.

paste
    The paste()-command creates a string from strings and variable-names ($-notation). paste() accepts a list of items, separated by commas (",").

    # Example:
    # --------
    paste( "literal string", $varname, "foo", $bar );

post
    Makes a post-request (instead of a get-request) to a webserver. The post-data is stored in named variables; the names of the variables are given to post as arguments, e.g.: post( "name_1", "name_2" ); and the values are looked up internally. For that purpose, the post-data has to be stored in named variables before the post-command is called, so that the value for each variable can be looked up by the post-command. The URL for the post-command is taken from tmpvar.

    # Example:
    # --------
    post("valname_1", "valname_2", "valname_3"); # the values must be set as named variables before.

print
    print invoked without parentheses prints the value on the one-val-stack. print() with parentheses prints strings and variables (denoted by $-notation), which means it accepts the same parameters as paste() but does not change the one-val-stack. print() used on an empty string ends the line automatically.
    This means a new line will be used for further commands. If you wish to print only a certain string, without line-endings added, you need to use print_string().

print_string
    Accepts only one string-argument and prints it. It prints the plain string and does not add a line-ending automatically.

quote
    Wraps the one-val-stack value with '"' and '"'. Needed for arguments that are given to other tools, which are invoked via system() (which invokes a shell).

readline
    Reads one line from stdin / console. Without arguments, the input is stored in the TMPVAR. With an argument, the argument is used as variable-name, and the input is stored in this named variable.

    # Examples:
    # ---------
    readline;
    readline("VarnameForInputLine");

recall
    Gets a named value and stores it on the one-val-stack.

    # Example:
    # --------
    recall("varname");

rowselect
    Selects a certain row from a match-result.

    # Example:
    # --------
    rowselect(0);

save
    Saves a document to a file. The filename is derived from the url of the document. The character set is reduced to a subset of ASCII.

save_as
    Saves a document to a file with the filename given as argument.

select
    Selects ONE part of a tmpvar.

    Examples:
      select(0);
      select(3);

    The meaning of the argument depends on what the tmpvar holds:

    document:
    ---------
      0 selects the document,
      1 selects the url of the document,
      any other value selects the document too

    document-array:
    ---------------
      selects the document with the given index (starting at 0)

    rows/columns:
    -------------
      selects ONE ELEMENT from a row or a column. The row/column must already have been selected with rowselect() or colselect(). select() does NOT allow matches on match-results (which are a matrix internally).

    # Example:
    # --------
    select(2);

selectmatch
    Allows selecting a row from a match-result by specifying a column-index and a string-matching pattern for this certain element. So, this is a more advanced rowselect() with additional matching capabilities.
show_match
    Shows a match-result in a certain way; this command is intended to display matches in a way in which they can be read easily. It will most often be used in parser development. But it can of course also be used for informing the user about the steps that any-dl has done (e.g. just be verbose and display the matches). Normally, though, it is rather developers who will be interested in these details.

show_type
    Just shows the "type" of the value in the one-val-stack.

show_variables
    Displays all named variables. Prints the variable-name and the contents of the variable. (list_variables only prints the names of the variables.)

start
    This keyword indicates the start of the keywords section of a parser definition.

store
    Stores the value from the one-val-stack as a named variable. (Use recall() for getting it back to the one-val-stack, or the $-notation in those commands that accept this notation.)

    # Example:
    store("varname");

storematch
    Stores the tmpvar (must be a match-result) to named variables, with row- and column-indexes as part of the name:

    storematch("MyName"); # stores matchresult as MyName.(col).(row) (for all col's and row's as indexes of the match-result)

subst
    String-substitution. Uses Pcre.replace internally.

    # Example:
    # --------
    subst("pattern", "subst-string");

system
    Calls the system() command with the string that is held in the one-val-stack.

table_to_matchres
    (Experimental feature so far.) Converts an html-table to a match-result. This conversion works for single tables. So, a selection of a table should be as specific as possible, so that only one table is selected with tagselect. Then the conversion works. If more than one table has been extracted by tagselect, they will all be coerced into ONE match-result. If that is what is wanted, everything is fine. Otherwise, separate table-selection will be necessary.
    Use tagselect with the "htmlstring"-extractor, like this:

    # Example:
    # --------
    tagselect("table"."id"="foobar" | htmlstring );
    table_to_matchres;
    csv_save;

tagselect
    Selects tags and "subtags" from a document tree and gives back data accordingly. The selection can be a *list* of tags, and optionally the argument "args" or the argument "arg" with a key-parameter (of a key-value pair) that selects the certain argument. See the command examples for syntax details.

    A selection list does a selection on the first selector-specification. Then the result is selected from again with the next selector, and so on.

    Example:
    --------
    tagselect("table", "a", "img"."align"="top"| dump);

    The document is first scanned for tables. The outermost match is selected. So if a table is inside a table, the outer tag will be selected, and with it the whole outer table. The inner table would just be content of the outer one. No in-depth selection is done. All found tables are then scanned for <a ...> tags, which should be the <a href="..."> stuff. From the found <a ...>-tags any img-tags inside these a-tags will be selected, if they are also top-aligned. The result is then dumped to screen/console.

    tagselect selects elements from the document tree, so that a selection picks that certain tag and all its descendants. That means, for example, that a data-slurp-extraction will show all data from the descendants. But all other extractors ONLY LOOK UP THE TOPMOST element (and not the descendants). The reason is that the selected element normally is what needs to be analyzed, not necessarily the descendants.

    With the "anytags" selector in tagselect (e.g. 'tagselect( anytags | argpairs );' ) ANY tag is selected, so ALL tags are TOPMOST tags, because any descendant is also detected as a new tag. This is a depth-first selection, with each element being a top-element. This way you can access all descendants and analyse them, for example extract all argpairs from all the tags of the whole document.
    # Examples, showing the allowed syntax:
    # -------------------------------------
    tagselect( "a"| dump );                # dumps all <a ...> tags
    tagselect( "br"| dump );               # dumps all <br>-tags
    tagselect( "table", "a"| dump );       # <a ...> inside tables will be dumped
    tagselect( "img"."src"| dump );        # <img src="..."> will be dumped

    tagselect("table", "a", "img"."align"="top"| dump); # all img-tags with "align"="top" will be selected,
                                                        # if they appear inside a table; the stuff is dumped to screen

    tagselect( "table", "a" | argpairs );    # extract argpairs from the stuff that was selected
    tagselect( "table", "a" | arg("href") ); # extract value for the arg with key/name "href" from the stuff that was selected

    # the pair-extractors ( "argpairs", "argkeys", "argvals" ) can be used as single-extractor-arguments
    # the other selectors select one item only (not pairs) and can be given as list, like this:
    tagselect("img"."src" | arg("src"), arg("alt") );

    # tagselect used with "anytags"-selector
    # --------------------------------------
    # the "anytags"-selector selects ANY tags,
    # which means that ALL tags from the document are
    # picked up in depth-first manner.
    # without anytags, a match does pick a tag with all descendants.
    # But these descendants will not be extracted with an extractor-pattern!
    # --------------------------------------
    tagselect( anytags | argpairs ); # shows argpairs of ANY / ALL tags found (depth-first)

titleextract
    Extracts the contents of the <title>-tag of a webpage and puts the result on the one-val-stack.

to_string
    Converts the value of the one-val-stack to a string-representation.

to_matchres
    Converts the value of the one-val-stack to a value of the same type that a match-operation gives as result. This is a row-column "array", and the show_match-command can show details about it. Useful for later selecting/rowselecting values from this matchres-typed value.

transpose
    Transposition of a Match_result (which is an array of arrays, or a "matrix").
    This exchanges rows and columns.

sort
    Sorts entries (only match-results so far).

uniq
    Removes entries with the same contents from the tmpvar. Works like "uniq" from the Unix toolbox, but not limited to neighbouring lines; or like "sort -u" without a sort (not changing the order of the entries). Please note: on match-results, uniq works on ROWs. This means that duplicate rows will be discarded. But duplicate columns in a row will not be touched. If you want to remove duplicate columns, you need to first transpose, then use uniq, and then transpose again. So, for uniq-ing columns (removing multiple equal columns), you need to do it this way:

    transpose; uniq; transpose;

New feature (as of december 2014): assignments
==============================================

It's now also possible to use assignments.

    varname = COMMAND(...);

assigns the tmpvar that was created by the command (internally via the Store-command) to a named variable. Using an assignment is no different from calling a command and then doing store(<varname>); It's just syntactic sugar, maybe helpful in situations where many store/recall commands would be needed. The tmpvar will be placed on the tmpvar-stack, so a previous stack-value will be replaced by the new one.

Control Structures:
===================

    if( ... ) then ... else ... endif

Here the "..." stands for statements that can be used there (statement-list / command-list; each command needs to be followed by a semicolon).

Instead of "if", also "ifnotempty" and "ifne" can be used. All three behave the same way, but some people may prefer more specific keywords.

So, if the tmpvar that the statement-list inside the "if"-command (in the parentheses) leaves behind is not empty, the condition is true. (That's like in other languages like C or Perl.)

    while( ... ) do ... done

Here the "..." stands for statements that can be used there.
(statement-list / command-list; each command needs to be followed by a semicolon)

Instead of "while" also "whilenotempty" and "whilene" can be used.
All three behave the same way, but some people may prefer more specific keywords.

The condition is true if the tmpvar that the statement-list inside the
parentheses of the "while"-command leaves behind is not empty; in that case
the statement-list between "do" and "done" will be evaluated.
This is repeated as long as the statement-list between the while-parentheses
evaluates to a non-empty value.


Predefined Variables:
=====================

STARTURL
    The URL that is given via cli and investigated by the corresponding parser.
    The parser can access the URL via this named variable.
    (recall("STARTURL") or $STARTURL)

BASEDOC
    The document that is retrieved via $STARTURL is available as the named variable BASEDOC.

COOKIES.RECEIVED
    Cookies received from a webserver will be stored here.

COOKIES.SEND
    Cookies which should be sent to the webserver are stored here.

NOW
    Unix-Timestamp (seconds since 00h00m00s GMT 01.01.1970) as string.


Using Cookies:
==============

If a server sends cookies, they will be stored in the named variable "COOKIES.RECEIVED".
If you want to send them back to the server, you need to manually copy the contents
of "COOKIES.RECEIVED" to "COOKIES.SEND".

Just add these two commands between the commands which get cookies and the commands which should send cookies:

  recall("COOKIES.RECEIVED");   # recall the cookies from the named variable and put them on the one-var-stack
  store("COOKIES.SEND");        # store the cookies from the one-var-stack in the variable "COOKIES.SEND"

This handling seems a bit inconvenient; cookies could be received and sent automagically.
But this manual handling gives you more control over the process.

Be aware that the contents of "COOKIES.SEND" are not changed automatically.
So, what is stored in that variable will be used again and again in next calls to the webserver.
So, you need to keep track of the right cookies by using the above mentioned way
of setting "COOKIES.SEND", or delete that variable.


Macros
======

A new feature was added at the end of March 2015: macros.

It's now possible to define macros, so that repeating sequences don't need to be coded again and again.
Instead, common command sequences can be factored out into a macro,
which is then called with the "call"-command.
(See call-command above, in the parserdef-language description.)

Macros can be defined anywhere in an rc-file.
They don't need to be defined before they are used.

The syntax of macro definitions can be seen in this example:

  defmacro "FooBar":
  start
    tagselect( "a"."href" | arg("href") );
  end

This macro would be called with the command call("FooBar");

__END__
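
Macros combine naturally with the cookie hand-over described under "Using Cookies".
The following is a minimal sketch that uses only commands shown in this README.
The macro name "ResendCookies" is a hypothetical name chosen for illustration;
it is not something any-dl predefines.

```
# hypothetical macro name, for illustration only
defmacro "ResendCookies":
start
  recall("COOKIES.RECEIVED");   # put the received cookies on the one-var-stack
  store("COOKIES.SEND");        # store them as the cookies to send
end
```

A parser could then use call("ResendCookies"); between the command that receives
cookies and the command that should send them, instead of repeating the
recall/store pair each time.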