A filter plugin for Embulk to change timestamp format
- columns: columns to retain (array of hash)
- name: name of column (required)
- type: type to cast, choose one of
string
,timestamp
,long
(unixtimestamp),double
(unixtimestamp) (string, default isstring
) - from_format: specify the format of the input string (array of strings, default is default_from_timestamp_format)
- from_timezone: specify the timezone of the input string (string, default is default_from_timezone)
- to_format: specify the format of the output string (string, default is default_to_timestamp_format)
- to_timezone: specify the timezone of the output string (string, default is default_to_timezone)
- from_unit: specify the time unit of the input unixtimestamp (string, default is default_from_timestamp_unit)
- to_unit: specify the time unit of the output unixtimestamp (string, default is default_to_timestamp_unit)
- default_from_timestamp_format: default timestamp format for the input string (array of strings, default is
["%Y-%m-%d %H:%M:%S.%N %z"]
) - default_from_timezone: default timezone for the input string (string, default is
UTC
) - default_to_timestamp_format: default timestamp format for the output string (string, default is
%Y-%m-%d %H:%M:%S.%N %z
) - default_to_timezone: default timezone for the output string (string, default is
UTC
) - default_from_timestamp_unit: default time unit such as
sec
(for second),ms
(for milli second),us
(for micro second),ns
(for nano second) for the input unixtimestamp (string, default issecond
) - default_to_timestamp_unit: default time unit such as
sec
(for second),ms
(for milli second),us
(for micro second),ns
(for nano second) for the output unixtimestamp (string, default issecond
) - stop_on_invalid_record: stop bulk load transaction if a invalid record is found (boolean, default is
false
) - timestamp_parser (experimental): set
auto_java
to try to convert ruby format to java format to use faster java timestamp parser (string, default isauto
)
Say example.jsonl is as follows (this is a typical format on exporting a BigQuery table):
{"timestamp":"2015-07-12 15:00:00 UTC","nested":{"timestamp":"2015-07-12 15:00:00 UTC"}}
{"timestamp":"2015-07-12 15:00:00.1 UTC","nested":{"timestamp":"2015-07-12 15:00:00.1 UTC"}}
in:
type: file
path_prefix: example/example.jsonl
parser:
type: jsonl # not json parser
columns:
- {name: timestamp, type: string}
- {name: nested, type: json}
filters:
- type: timestamp_format
default_from_timestamp_format: ["%Y-%m-%d %H:%M:%S.%N %z", "%Y-%m-%d %H:%M:%S %z"]
default_to_timezone: "Asia/Tokyo"
default_to_timestamp_format: "%Y-%m-%d %H:%M:%S.%N"
columns:
- {name: timestamp, type: long, to_unit: ms}
- {name: $.nested.timestamp}
out:
type: stdout
Output will be as:
{"timestamp":1436713200000,"nested":{"timestamp":"2015-07-13 00:00:00.0}}
{"timestamp":1436713200100,"nested":{"timestamp":"2015-07-13 00:00:00.1}}
See ./example for more examples.
For type: json
column, you can specify JSONPath for column's name as:
name: $.payload.key1
name: "$.payload.array[0]"
name: "$.payload.array[*]"
name: $['payload']['key1.key2']
Following operators of JSONPath are not supported:
- Multiple properties such as
['name','name']
- Multiple array indexes such as
[1,2]
- Array slice such as
[1:2]
- Filter expression such as
[?(<expression>)]
NEWS: (2017/07/10) embulk 0.8.27 is released with a fast Timestamp jruby parser. This issue should be resolved, so Java Timestamp parser support will be dropped in future releases.
Embulk's timestamp parser originally uses jruby implementation, but it is slow. To improve performance, this plugin also supports Java's Joda-Time DateTimeFormat format as:
in:
type: file
path_prefix: example/example.jsonl
parser:
type: jsonl
columns:
- {name: timestamp, type: string}
- {name: nested, type: json}
filters:
- type: timestamp_format
default_from_timestamp_format: ["yyyy-MM-dd HH:mm:ss.SSS z", "yyyy-MM-dd HH:mm:ss z", "yyyy-MM-dd HH:mm:ss"]
default_to_timezone: "Asia/Taipei"
default_to_timestamp_format: "yyyy-MM-dd HH:mm:ss.SSS Z"
columns:
- {name: timestamp, type: long, to_unit: ms}
- {name: $.nested.timestamp}
out:
type: stdout
If format strings contain %
, jruby parser/formatter is used. Otherwirse, java parser/formatter is used
Automatic Conversion of Ruby Timestamp Format to Java Timestamp Format (experimental)
If you configure timestamp_parser: auto_java
, this plugin tries to convert ruby format into java format automatically to use faster java timestamp parser.
COMPARISON:
Benchmark test sets are available at ./bench. In my environment (Mac Book Pro), for 1000000 timestamps:
- java parser + java formatter: 1.3s
- java parser + jruby formatter: 1.4s
- jruby parser + java formatter: 64.52s
- jruby parser + jruby formatter: 65.06s
JRuby parser is slow, but JRuby formatter is not so slow.
JRuby parser has micro second resolution. Java (Joda-Time) parser has milli second resolution.
Nano second resolution is partially supported by this plugin itself. Use parser format nnnnnnnnn
for Java parser as
yyyy-MM-dd HH:mm:ss.nnnnnnnnn z
This plugin finds places of nano second from texts with regular expression \.(\d+)
.
For formatter, you can use nnnnnnnnn
for nano and nnnnnn
for micro as
yyyy-MM-dd HH:mm:ss.nnnnnnnnn z
yyyy-MM-dd HH:mm:ss.nnnnnn z
FYI: Java8's DateTimeFormatter supports nano second resolution, but we can not use it because embulk supports Java7.
- Write test
Run example:
$ ./gradlew classpath
$ embulk preview -I lib example/example.yml
Run test:
$ ./gradlew test
Run checkstyle:
$ ./gradlew check
Release gem:
$ ./gradlew gemPush