Skip to content

preaction/ETL-Yertl

Repository files navigation

NAME

ETL::Yertl - ETL with a Shell

VERSION

version 0.044

STATUS

Coverage Status

SYNOPSIS

### On a shell...
# Convert file to Yertl's format
$ yfrom csv file.csv >work.yml
$ yfrom json file.json >work.yml

# Convert file to output format
$ yto csv work.yml
$ yto json work.yml

# Parse HTTP logs into documents
$ ygrok '%{LOG.HTTP_COMMON}' httpd.log

# Read data from a database
$ ysql db_name 'SELECT * FROM employee'

# Write data to a database
$ ysql db_name 'INSERT INTO employee ( id, name ) VALUES ( $.id, $.name )'

### In Perl...
use ETL::Yertl;

# Give everyone a 5% raise
my $xform = file( '<', 'employees.yaml' )
          | transform( sub { $_->{salary} *= 1.05 } )
          >> stdout;
$xform->run;

DESCRIPTION

ETL::Yertl is an ETL (Extract, Transform, Load) for shells. It is designed to accept data from multiple formats (CSV, JSON), manipulate them using simple tools, and then convert them to an output format.

Yertl will have tools for:

  • Extracting data from databases (MySQL, Postgres, MongoDB)
  • Loading data into databases
  • Extracting data from web services
  • Writing data to web services
  • Distributing data through messaging APIs (ZeroMQ)

SUBROUTINES

stdin

my $stdin = stdin( %args );

Get a ETL::Yertl::FormatStream object for standard input. %args is a list of key/value pairs passed to "new" in ETL::Yertl::FormatStream. Useful keys are:

stdout

my $stdout = stdout( %args );

Get a ETL::Yertl::FormatStream object for standard output. %args is a list of key/value pairs passed to "new" in ETL::Yertl::FormatStream. Useful keys are:

  • format

    Specify the format that standard input is. Defaults to yaml or the value of YERTL_FORMAT (see "get_default" in ETL::Yertl::Format.

  • autoflush

    Immediately write documents to standard output to improve responsiveness, instead of queuing them for later writes for efficiency. This defaults to 1 (on). Set it to 0 to turn autoflush off.

transform

my $xform = transform( sub { ... } );
my $xform = transform( 'Local::Transform::Class' => @args );

Create a new ETL::Yertl::Transform object, passing in either a subref to transform documents, or a class to instantiate and arguments to pass to its constructor.

The subref will receive two arguments: The ETL::Yertl::Transform object and the document to transform. $_ will also be set to the document to transform. The subref should return the transformed document (either a new document, or the existing document after being modified).

If given a transform class, it should inherit from ETL::Yertl::Transform. The class will be loaded and an object instantiated using the @args.

file

my $stream = file( $mode, $path, %args );

Create a ETL::Yertl::FormatStream object for the given $path. $mode should be one of < for reading or > for writing. %args are additional arguments to pass to the ETL::Yertl::FormatStream constructor. Useful keys are:

yq

my $xform = yq( $filter );

Create a ETL::Yertl::Transform::Yq object with the given filter. See "SYNTAX" in yq for full filter syntax.

loop

my $loop = loop();

Get the IO::Async::Loop singleton. Use this to add other IO::Async objects to a larger program. This is not needed for simple Yertl streams, and is mostly used internally.

This is not exported by default. You can import it using use ETL::Yertl 'loop'.

SEE ALSO

Yertl Tools

  • yfrom

    Convert incoming data (CSV, JSON) to Yertl documents.

  • yto

    Convert Yertl documents into another format (CSV, JSON).

  • ygrok

    Parse lines of text into Yertl documents.

  • ysql

    Read/write documents from SQL databases.

  • yq

    A powerful mini-language for munging and filtering.

Other Tools

Here are some other tools that can be used with Yertl

  • recs (App::RecordStream)

    A set of tools for manipulating JSON (constrast with Yertl's YAML). For interoperability, set the YERTL_FORMAT environment variable to "json".

  • Catmandu

    A generic data processing toolkit. Convert data between multiple formats, import/export into multiple databases, and manipulate data with a mini-language.

    This project is very much like Yertl, and more mature besides.

  • jq

    A filter for JSON documents. The inspiration for yq. For interoperability, set the YERTL_FORMAT environment variable to "json".

  • jt

    JSON Transformer. Allows multiple ways of manipulating JSON, including JSONPath. For interoperability, set the YERTL_FORMAT environment variable to "json".

  • pv (Pipe Viewer)

    This tool helps examine how fast data is flowing through a shell pipeline. If the size of the data is known, it can even provide a progress bar and an ETA.

  • netcat (nc)

    Netcat allows simple streaming over a network. Using Netcat you can start a Yertl pipeline on one machine and finish it on another machine. For example, you could generate metrics on each client machine, and then write them to a central machine to insert into a database on that machine.

    Netcat does not come with any security, so be careful (use firewalls).

  • socat

    Socat is a multi-purpose relay. It is similar to Netcat but with many more features such as SSL and client verification. Socat has security, so you can use this like Netcat in cases where you must accept data from the Internet.

  • parallel (GNU Parallel)

    GNU Parallel is a shell tool for executing jobs in parallel on one or more computers. Parallel is very similar to xargs, except it will execute the commands on other computers.

  • distribution

    This tool creates charts. Pipe into it from yq to create simple charts from your data.

AUTHOR

Doug Bell [email protected]

CONTRIBUTORS

COPYRIGHT AND LICENSE

This software is copyright (c) 2018 by Doug Bell.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.