Extract/Transform/Load tool for NYC Space/Time Directory data: it loads separate data modules which perform ETL tasks, such as downloading and transforming data to the NYC Space/Time Directory data model.
For more information about the NYC Space/Time Directory project, as well as datasets produced by spacetime-etl, see http://spacetime.nypl.org.
Space/Time's ETL modules are separate Node.js modules which need to be installed individually. Each ETL module represents a NYC Space/Time Directory dataset or data transformation, and defines a set of steps; spacetime-etl loads these modules, and executes the steps they define.
Some examples:
ETL Module | Description |
---|---|
etl-mapwarper | Outlines of maps from Map Warper, NYPL's tool for georectifying historical maps |
etl-group-maps | Map Warper maps, grouped by decade; used by Maps by Decade |
etl-spacetime-graph | Graph of all NYC Space/Time Directory datasets |
etl-oldnyc | Locations of 40,000 geotagged photos from OldNYC |
For more ETL modules, see GitHub.
The data tool is configured in the NYC Space/Time Directory configuration file, under the etl key.
The following configuration options must be specified:
Parameter | Description |
---|---|
moduleDir | Path (absolute, or relative to data tool) where spacetime-etl looks for data modules |
modulePrefix | Directory prefix used to identify data modules (e.g. etl-mapwarper); default is etl- |
outputDir | Directory to which ETL modules write their data |
Example:
etl:
  modulePrefix: "etl-"
  moduleDir: /Users/bertspaan/code/etl-modules
  outputDir: /Users/bertspaan/data/spacetime/etl
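As a rough illustration of how these options fit together, the sketch below checks an etl configuration object, fills in the documented default prefix, and resolves relative paths. The function name `resolveEtlConfig` and its `baseDir` argument are hypothetical and exist only for this example; they are not part of spacetime-etl's API, and the tool's own handling of the configuration may differ.

```javascript
const path = require('path')

// Hypothetical helper (not part of spacetime-etl): checks the `etl` section
// of a configuration object and resolves its directories
function resolveEtlConfig (etlConfig, baseDir) {
  const required = ['moduleDir', 'outputDir']
  const missing = required.filter((key) => !etlConfig || !etlConfig[key])

  if (missing.length) {
    throw new Error('Missing etl configuration: ' + missing.join(', '))
  }

  return {
    // The documented default prefix is `etl-`
    modulePrefix: etlConfig.modulePrefix || 'etl-',
    // Assumption of this sketch: relative paths are resolved against baseDir
    moduleDir: path.resolve(baseDir, etlConfig.moduleDir),
    outputDir: path.resolve(baseDir, etlConfig.outputDir)
  }
}

const resolved = resolveEtlConfig({
  moduleDir: 'etl-modules',
  outputDir: 'data/spacetime/etl'
}, process.cwd())

console.log(resolved)
```

Failing early on a missing moduleDir or outputDir gives a clearer message than a file system error further down the line.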
The separate ETL modules can also be configured in the configuration file. Please see the README of the respective ETL module for more information. Example:
etl:
  modules:
    geonames:
      types:
        PPL: 'st:Place'
        PPLX: 'st:Neighborhood'
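To show how a module might use such a section, here is a minimal sketch that looks up a Space/Time type for a GeoNames feature code. It assumes the module receives the contents of its own section (here etl.modules.geonames) as its `config` object; the helper `typeForFeatureCode` is a hypothetical name, and the real geonames module may use this mapping differently.

```javascript
// Configuration as a module might receive it: the contents of
// etl.modules.geonames from the example above
const config = {
  types: {
    PPL: 'st:Place',
    PPLX: 'st:Neighborhood'
  }
}

// Hypothetical helper: maps a feature code to a Space/Time type,
// returning undefined for codes that are not configured
function typeForFeatureCode (config, featureCode) {
  const types = (config && config.types) || {}
  return Object.prototype.hasOwnProperty.call(types, featureCode)
    ? types[featureCode]
    : undefined
}

console.log(typeForFeatureCode(config, 'PPLX')) // st:Neighborhood
console.log(typeForFeatureCode(config, 'ADM1')) // undefined
```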
To use spacetime-etl to run ETL modules, you first need to install them. Go to the directory specified by the moduleDir configuration option, and clone the ETL modules you need, for example:
git clone https://github.com/nypl-spacetime/etl-nyc-wards.git
git clone https://github.com/nypl-spacetime/etl-mapwarper.git
git clone https://github.com/nypl-spacetime/etl-oldnyc.git
Then, install the dependencies of each module:
cd etl-nyc-wards
npm install
cd ..
cd etl-mapwarper
npm install
cd ..
cd etl-oldnyc
npm install
You can now use spacetime-etl to run the three ETL modules you have just installed: nyc-wards, mapwarper and oldnyc.
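The relation between directory names and module IDs can be sketched as follows: a directory in moduleDir whose name starts with modulePrefix yields a module ID with the prefix removed. The helper `listModuleIds` is hypothetical, and the demo creates a temporary directory instead of touching a real moduleDir; spacetime-etl's own discovery logic may differ.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical helper: lists the IDs of the directories in moduleDir whose
// names start with modulePrefix (e.g. etl-mapwarper => mapwarper)
function listModuleIds (moduleDir, modulePrefix) {
  return fs.readdirSync(moduleDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory() && entry.name.startsWith(modulePrefix))
    .map((entry) => entry.name.slice(modulePrefix.length))
    .sort()
}

// Demo: a temporary directory standing in for moduleDir
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'etl-modules-'))
const demoNames = ['etl-nyc-wards', 'etl-mapwarper', 'etl-oldnyc', 'notes']
demoNames.forEach((name) => fs.mkdirSync(path.join(demoDir, name)))

const moduleIds = listModuleIds(demoDir, 'etl-')
console.log(moduleIds) // [ 'mapwarper', 'nyc-wards', 'oldnyc' ]
```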
Installation:
npm install -g nypl-spacetime/spacetime-etl
Run the data tool without command-line arguments to get a list of the available data modules:
spacetime-etl
To run one or more ETL modules, provide their IDs as command-line arguments:
spacetime-etl mapwarper oldnyc ...
Alternatively, you can select the processing steps you want to run:
spacetime-etl mapwarper.download
By default, all steps are run consecutively.
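The idea of running steps consecutively, or only a selected one, can be sketched like this. Both `runSteps` and `parseArgument` are hypothetical helpers, and the steps here take only a callback to keep the example short (real steps receive config, dirs, tools and callback). This is not how spacetime-etl is implemented, only an illustration of the behavior described above.

```javascript
// Hypothetical sketch: run callback-style steps one after another,
// stopping at the first error
function runSteps (steps, callback) {
  const remaining = steps.slice()

  const next = (err) => {
    if (err || !remaining.length) {
      return callback(err || null)
    }

    const step = remaining.shift()
    step(next)
  }

  next()
}

// Splits an argument such as `mapwarper.download` into module ID and step;
// `step` is undefined when only a module ID is given
function parseArgument (arg) {
  const [moduleId, step] = arg.split('.')
  return { moduleId, step }
}

// Demo steps that only record that they were called
const log = []
const steps = [
  function download (callback) { log.push('download'); callback() },
  function transform (callback) { log.push('transform'); callback() }
]

const { moduleId, step } = parseArgument('mapwarper.download')
const selected = step ? steps.filter((fn) => fn.name === step) : steps

runSteps(selected, (err) => {
  console.log(err ? 'Error: ' + err.message : 'Ran ' + moduleId + ': ' + log.join(', '))
})
```

Passing `mapwarper` instead of `mapwarper.download` leaves `step` undefined, so all steps run in the order in which they are listed.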
Installation:
npm install nypl-spacetime/spacetime-etl
Usage (to run this example, first install etl-mapwarper, see Installing ETL Modules):
const etl = require('spacetime-etl')

// Fetch all installed ETL modules:
const modules = etl.modules()

// Execute all steps:
etl.execute('mapwarper', (err) => {
  if (err) {
    console.error('Error:')
    console.error(err)
  } else {
    console.log('Done!')
  }
})

// Execute a single step:
etl.execute('mapwarper.download', (err) => {
  if (err) {
    console.error('Error:')
    console.error(err)
  } else {
    console.log('Done!')
  }
})
The produced data files are written to a subdirectory of the configured output directory: <outputDir>/<step>/mapwarper.
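As a small illustration of this layout, the following sketch builds the output path for one step of one module. The helper `stepOutputDir` is hypothetical, and the output directory is the one from the configuration example above.

```javascript
const path = require('path')

// Hypothetical helper: builds <outputDir>/<step>/<moduleId>
function stepOutputDir (outputDir, step, moduleId) {
  return path.join(outputDir, step, moduleId)
}

const downloadDir = stepOutputDir(
  '/Users/bertspaan/data/spacetime/etl',
  'download',
  'mapwarper'
)

console.log(downloadDir)
```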
Creating a new ETL module is easy. Let's say we want to write a scraper which, very illegally, reads photos and their metadata from the NYC Municipal Archives Online Gallery.
First, create a directory in spacetime-etl's moduleDir with the following name:
mkdir etl-nyc-municipal-archives
In this directory, create two files:
First, nyc-municipal-archives.dataset.json, holding the metadata of the ETL module and the resulting dataset:
{
  "id": "nyc-municipal-archives",
  "title": "NYC Municipal Archives Online Gallery",
  "license": "CC0",
  "description": "The NYC Municipal Archives Online Gallery provides research access to over 900,000 items digitized from the Municipal Archives' vast holdings, including photographs, maps, motion-pictures and audio recordings",
  "author": "Bert Spaan",
  "website": "http://nycma.lunaimaging.com/luna/servlet"
}
The actual code goes in nyc-municipal-archives.js:
function download (config, dirs, tools, callback) {
  // Download data, write data to output directory;
  // dirs.current contains the path of the
  // output directory of the current step

  // config object contains configuration from
  // this module's section (if available)

  callback()
}

function transform (config, dirs, tools, callback) {
  // Read downloaded data from output directory;
  // dirs.download contains the path of the
  // output directory of the download step

  // Do data transformations, and write the
  // resulting Space/Time objects to disk
  // using tools.writer

  const object = {
    type: 'object',
    obj: {
      id: 1,
      type: 'st:Photo',
      data: {
        title: '',
        collection: ''
      },
      geometry: {
        type: 'Point',
        coordinates: [
          -74.014592,
          40.702211
        ]
      }
    }
  }

  tools.writer.writeObject(object, callback)
}

module.exports.steps = [
  download,
  transform
]
You can now run this ETL module with the following command:
spacetime-etl nyc-municipal-archives
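Before running a new module through spacetime-etl, it can help to call a step directly with stand-in arguments. The sketch below does this for a transform step like the one above. The `fakeTools` object is an assumption made for testing: it only mimics the tools.writer.writeObject(object, callback) call used in this README and collects objects in memory rather than writing them to disk.

```javascript
// Minimal transform step in the style shown above
function transform (config, dirs, tools, callback) {
  const object = {
    type: 'object',
    obj: {
      id: 1,
      type: 'st:Photo',
      data: { title: '', collection: '' },
      geometry: { type: 'Point', coordinates: [-74.014592, 40.702211] }
    }
  }

  tools.writer.writeObject(object, callback)
}

// Stand-in for tools.writer (not spacetime-etl's writer): collects
// objects in memory instead of writing them to disk
const written = []
const fakeTools = {
  writer: {
    writeObject: (object, callback) => {
      written.push(object)
      callback()
    }
  }
}

// Placeholder directories; this transform step does not read from them
const fakeDirs = { current: 'transform-output', download: 'download-output' }

let transformError
transform({}, fakeDirs, fakeTools, (err) => {
  transformError = err
  console.log('Wrote ' + written.length + ' object(s)')
})
```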
Copyright (C) 2015 Waag Society, 2017 The New York Public Library