- EMR support now works very well
- A couple of bugfixes. Sorry about that.
- Documentation fixes
Use —run=emr to launch a job onto the Amazon Elastic MapReduce cloud.
- copies the script to s3, as foo-mapper.rb and foo-reducer.rb (removing the need for the —map flag)
- copies the wukong libs up as a .tar/bz2, and extracts it into the cwd
- combines settings from commandline and yaml config, etc to configure and launch job
It’s still way shaky and I don’t think anything but the sample app will run. That sample app runs, tho.
Incompatible changes to option handling and script launching:
- Script doesn’t use extra_options any more. You should relocate them to the initializer or to configliere.
- there is no more default_mapper or default_reducer
- Improvements to the pig conversion methods
hdp-rm
respects the -skipTrash method
- added the
max_(maps|reduces)_per_(node|cluster)
jobconfs. - added jobconfs for io_job_mb and friends.
- added a loadable module to convert output data to pig bags and tuples
- pulled in several methods from active_support, incl. Enumerable#sum
- Scripts to find percentile rank of elements in a dataset
- We are starting to move wukong to a model where streaming is from a generic
source into a generic sink. Several stores have been landed in the code, but
many are in a half- or un-baked state. Please ignore this for the moment.
- made scripts inject a helpful job name using mapred.job.name
- Hash.compact_blank! and HashLike.compact_blank! — eliminate all key-values whoes value is blank?
- Bug in passing commandline args down to map and reduce child processes
Lots more examples:
- examples/stats/avg_value_frequency.rb does an Average Value Frequency histogram
- examples/server_logs has a quite useful apache log file parser
- Made the base streamer use each_record, opening the door for alternative record injection (eg Datamapper!)
- wukong/streamer/counting_reducer.rb is an um reducer and it counts things.
- A HELLA AWESOME working example from retail web analytics by @lenbust
- In
--run=local
mode, you can use ‘-’ alone as a filename to indicate STDIN / STDOUT as input/output respectively. - Minor tweaks to contrib/jeans
- Moved settings management & command line handling over to Configliere (
- Added example script and notes from Fredrik Möllerstrand (@lenbust)