working through editorial feedback
Philip (flip) Kromer committed Aug 14, 2014
1 parent e308ee1 commit a577037
Showing 14 changed files with 462 additions and 144 deletions.
21 changes: 21 additions & 0 deletions 01-intro.asciidoc
@@ -0,0 +1,21 @@


Why Hadoop is a breakthrough tool, and examples of how you can use it to transform, simplify, contextualize, and organize data.

* distributes the data
* context (group)
* matching (cogroup / join)
*
* coordinates to grid cells
* group on location
* count articles
* wordbag
* join wordbags to coordinates
* sum counts
50 changes: 50 additions & 0 deletions 02-feedback_and_response.asciidoc
@@ -0,0 +1,50 @@
==== Introduction Structure



==== Tell readers what the point of this is before you dive into the example. What are you showing them? Why? What will they get out of it? "I'm going to walk you through an example of ___, which will show you _____ so that you'll begin to understand how _____" for example.

[NOTE]
.Initial version
======
Igpay Atinlay translator, actual version is our first Hadoop job, a program that translates plain text files into Igpay Atinlay. It’s written in Wukong, ...
======

Igpay Atinlay translator is our first Hadoop job, a program that translates plain text files into Igpay Atinlay. This is a Hadoop job stripped to its barest minimum, one that does just enough to each record to convince you it was processed, with no distractions. That makes it a convenient example for learning how to launch a job, how to follow its progress, and where Hadoop reports performance metrics such as run time and amount of data moved. What's more, the very fact that it's trivial makes it one of the most important examples to run: for comparable input and output sizes, no regular Hadoop job can outperform this one in practice, so it's a key reference point to keep in mind.

==== Whenever you say "It's best" be sure to include a statement of why it's best.

[NOTE]
.Initial version
======
It’s best to begin developing jobs locally on a subset of data. Run your Wukong script directly from your terminal’s commandline: ...
======


It's best to begin developing jobs locally on a subset of data: local runs are faster and cheaper. To run the Wukong script locally, enter this into your terminal's commandline:

(... a couple paragraphs later ...)

NOTE: There are more reasons to begin developing jobs locally on a subset of data than just speed and cost. Extracting a meaningful subset of tables also forces you to get to know your data and its relationships. And since all the data is local, you're forced into the good practice of first addressing "what would I like to do with this data?" and only then considering "how shall I do so efficiently?". Beginners often want to believe the opposite, but experience has taught us that it's nearly always worth the upfront investment to prepare a subset, and to put off thinking about efficiency until later.

==== Tell them what to expect before they run the job.

[NOTE]
.Initial version
======
First, let’s test on the same tiny little file we used at the commandline.
------
wukong launch examples/text/pig_latin.rb ./data/text/magi.txt ./output/latinized_magi
------
While the script outputs a bunch of happy robot-ese to your screen...
======

First, let's test on the same tiny little file we used at the commandline. This command does not process the data itself; rather, it instructs _Hadoop_ to process the data, and so its output will describe how the job is progressing.

------
wukong launch examples/text/pig_latin.rb ./data/text/magi.txt ./output/latinized_magi.txt
------

While the script outputs a bunch of happy robot-ese to your screen ...
93 changes: 75 additions & 18 deletions 02-hadoop_basics.asciidoc

Large diffs are not rendered by default.

29 changes: 29 additions & 0 deletions 06-analytic_patterns-structural_operations-ordering.asciidoc
@@ -292,6 +292,33 @@ NOTE: We've cheated on the theme of this chapter (pipeline-only operations) -- s
// * (how do `null`s sort?)
// * ASC / DESC: fewest strikeouts per plate appearance

=== Numbering Records in Rank Order

If you supply only the name of the table, RANK acts as a pipeline operation, introducing no extra map/reduce stage. Each split is numbered as a unit: the third line of chunk `part-00000` gets the same rank as the third line of chunk `part-00001`, and so on.

When you give RANK a field to act on, it instead assigns ranks according to that field's sorted order; records that tie on the field receive the same rank. The DENSE option changes whether rank values are skipped after a tie -- compare the last two columns of the table below.

It's important to know that in current versions of Pig, the RANK operator sets parallelism to one,
forcing all data through a single reducer. If your data is unacceptably large for this, you can use the
method from (REF) "Assigning a unique identifier to each line" to get a unique compound index
that matches the total ordering, which might meet your needs. Otherwise, we can offer you no good
workaround -- frankly, your best option may be to pay someone to fix this.
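
For concreteness, here is a minimal sketch of the three variants applied to a hypothetical `gifts` relation (the load path and schema are assumptions); the table that follows shows the kind of ranks each one produces.

------
gifts        = LOAD './data/gifts.tsv' AS (gift_id:chararray, gift:chararray);  -- assumed path and schema
gifts_seq    = RANK gifts;                -- pipeline-only: numbers records within each split
gifts_by_id  = RANK gifts BY gift_id;     -- ranks follow the sorted order of gift_id
gifts_dense  = RANK gifts BY gift DENSE;  -- DENSE changes how ties affect subsequent rank values
------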

------
gift_id  gift            RANK   RANK gift_id   RANK gift DENSE
1        partridge          1        1                1
4a       calling birds      2        4                7
4b       calling birds      3        4                7
2a       turtle dove        4        2                2
4d       calling birds      5        4                7
5        golden rings       6        5               11
2b       turtle dove        7        2                2
3a       french hen         8        3                4
3b       french hen         9        3                4
3c       french hen        10        3                4
4c       calling birds     11        4                7
------

// ==== Rank records in a group using Stitch/Over
//
//
@@ -420,3 +447,5 @@ STORE_TABLE('vals_shuffled', vals_shuffled);
-----

This follows the general plot of 'Assign a Unique ID': enable a hash function UDF; load the files so that each input split has a stable handle; and number each line within the split. The important difference here is that the hash function we generated accepts a seed that we can mix in to each record. If you supply a constant to the constructor (see the documentation) then the records will be put into an effectively random order, but the same random order each time. By supplying the string `'rand'` as the argument, the UDF will use a different seed on each run. What's nice about this approach is that although the ordering is different from run to run, it does not exhibit the anti-pattern of changing from task attempt to task attempt. The seed is generated once and then used everywhere. Rather than creating a new random number for each row, you use the hash to define an effectively random ordering, and the seed to choose which random ordering to apply.
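
For readers skimming past the elided listing above, the shape of the approach is roughly the following. The `HashWithSeed` name and its registration are placeholders for whatever seeded hash UDF you generated in "Assign a Unique ID" (REF), and the input path is made up for illustration:

------
DEFINE HashWithSeed your.udfs.SeededHash('rand');  -- placeholder; 'rand' picks a fresh seed, and thus a fresh ordering, each run

vals          = LOAD '/data/example/vals.tsv' AS (val:chararray);           -- assumed input
vals_keyed    = FOREACH vals GENERATE HashWithSeed(val) AS sort_key, val;   -- mix the run-wide seed into each record
vals_ordered  = ORDER vals_keyed BY sort_key;                               -- effectively random, but stable within the run
vals_shuffled = FOREACH vals_ordered GENERATE val;
------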


28 changes: 0 additions & 28 deletions 10-advanced_patterns.asciidoc
@@ -333,34 +333,6 @@ You'll see a more elaborate version of this
// -- STORE_TABLE(normed_seasons, 'normed_seasons');


=== Numbering Records in Rank Order


If you supply only the name of the table, RANK acts as a pipeline operation, introducing no extra map/reduce stage. Each split is numbered as a unit: the third line of chunk `part-00000` gets rank 2, the third line of chunk `part-00001` gets rank 2, and so on.

When you give rank a field to act on, it

It's important to know that in current versions of Pig, the RANK operator sets parallelism one,
forcing all data to a single reducer. If your data is unacceptably large for this, you can use the
method used in (REF) "Assigning a unique identifier to each line" to get a unique compound index
that matches the total ordering, which might meet your needs. Otherwise, we can offer you no good
workaround -- frankly your best option may be to pay someone to fix this

------
gift            RANK   RANK gift   RANK gift DENSE
partridge          1       1              1
turtle dove        2       2              2
turtle dove        3       2              2
french hen         4       3              4
french hen         5       3              4
french hen         6       3              4
calling birds      7       4              7
calling birds      8       4              7
calling birds      9       4              7
calling birds     10       4              7
golden rings      11       5             11
------


// -- ***************************************************************************
// --
1 change: 0 additions & 1 deletion 10-event_streams.asciidoc
@@ -151,7 +151,6 @@ Unless of course you are trying to test a service for resilience against an adve
flow(:mapper){ input > parse_loglines > elephant_stampede }
----


You must use Wukong's eventmachine bindings to make more than one simultaneous request per mapper.

=== Refs ===
23 changes: 17 additions & 6 deletions 11b-spatial_aggregation-points.asciidoc
@@ -1,11 +1,14 @@

==== Smoothing Pointwise Data Locally (Spatial Aggregation of Points)

Let's start by extending the group-and-aggregate pattern -- introduced in Chapter Six (REF) and ubiquitous since -- to spatial data. We will start, as we always do, by applying patterns that turn Big Data into Much Less Data. Spatial aggregation is a great tool for visualizing a large spatial data set, a great way to summarize it, and one of the first things you'll do to Know Thy Data.
This type of aggregation is a frontline tool of spatial analysis, and it draws on methods you've already learned, giving us a chance to introduce some terminology and necessary details.

// * You want to "wash out" everything but the spatial variation -- even though the data was gathered for each
// * Point measurement of effect with local extent -- for example, the temperature measured at a weather station is understood to be representative of the weather for several surrounding miles.
// *
@@ -58,6 +61,13 @@ Then geonames places -- show lakes and streams (or something nature-y) vs someth
Do that again, but for a variable: airport flight volume -- researching
epidemiology


This would also be of use to an epidemiologist or transportation analyst interested in the large-scale flux of people through the global transportation network, especially when combined with the weather data.



// FAA flight data http://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger/media/cy07_primary_np_comm.pdf

We can plot the number of air flights handled by every airport
@@ -77,8 +87,6 @@ grid_cts = FOREACH (GROUP gridded BY (bin_x, bin_y))
SUM(n_flights) AS tot_flights;
------
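
Filling in the steps elided above, a minimal self-contained sketch of the pattern might look like the following. The load path, schema, and one-degree `FLOOR` binning are assumptions for illustration, not the exact code of the example:

------
flights  = LOAD '/data/geo/airport_flights.tsv'
             AS (airport_id:chararray, lng:double, lat:double, n_flights:long);  -- assumed schema
-- assign each airport to a one-degree grid cell
gridded  = FOREACH flights GENERATE FLOOR(lng) AS bin_x, FLOOR(lat) AS bin_y, n_flights;
-- group on the grid cell, then apply the aggregation function
grid_cts = FOREACH (GROUP gridded BY (bin_x, bin_y)) GENERATE
             FLATTEN(group) AS (bin_x, bin_y),
             SUM(gridded.n_flights) AS tot_flights;
------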

An epidemiologist or transportation analyst interested in knowing the large-scale flux of people could throughout the global transportation network

===== Pattern Recap: Spatial Aggregation of Points

* _Generic Example_ -- group on tile cell, then apply the appropriate aggregation function
@@ -89,8 +97,11 @@ An epidemiologist or transportation analyst interested in knowing the large-scal

=== Matching Points within a Given Distance (Pointwise Spatial Join)

Now that you've learned the spatial equivalent of a `GROUP BY`, you'll probably be interested to
learn the spatial equivalent of `COGROUP` and `JOIN`.
Now that you've learned the spatial equivalent of a `GROUP BY` aggregation -- combining many records within a grid cell into a single summary record -- you'll probably be interested to
learn the spatial equivalent of `COGROUP` and `JOIN`: collecting, for each record in one table, all the records in another table that lie nearby.


In particular, let's demonstrate how to match each point in one table with every point in another table that lies less than a given fixed distance away.

Our reindeer friends would like us to help determine what UFO pilots do while visiting Earth.
35 changes: 24 additions & 11 deletions 11c-geospatial_mechanics.asciidoc
@@ -60,17 +60,31 @@ We'll start with the operations that transform a shape on its own to produce a n

==== Constructing and Converting Geometry Objects

Somewhat related are operations that bring shapes in and out of Pig's control.
Somewhat related are operations that change the data types used to represent a shape.

* `FromWKText(chararray)`, `FromGeoJson(chararray)` -- converts the serialized description of a shape into the corresponding geometry object. We'll cover these data formats a bit later in the chapter. Similarly, `ToWKText(geom)` and `ToGeoJson(geom)` serialize a geometry into a string
Going from shape to coordinates-as-numbers lets you apply general-purpose manipulations. As a concrete example (but without going into the details), to identify patterns of periodic spacing in a set of coordinates footnote:[The methodical rows of trees in an apple orchard will appear as isolated frequency peaks oriented to the orchard plan; an old-growth forest would show little regularity and no directionality.] you'd quite likely want to extract the coordinates of your shapes as a bag of tuples and apply a generic UDF implementing the 2-D FFT (Fast Fourier Transform) algorithm. In the other direction, files in GeoJSON, WKT, or the other geographic formats described later in this chapter (REF) produce records directly as geometry objects.

There are functions to construct `Point`, `MultiPoint`, `LineString`, ... objects from coordinates you supply, and counterparts that extract a shape's coordinates as plain-old-Pig objects; a short sketch follows the list below.


* `Point` / `MultiPoint` / `LineString` / `MultiLineString` / `Polygon` / `MultiPolygon` -- construct given geometry.
* `GeoPoint(x_coord, y_coord)` -- constructs a `Point` from the given coordinates
* `GeoEnvelope( (x_min, y_min), (x_max, y_max) )` -- constructs an `Envelope` object from the numerically lowest and numerically highest coordinates. Note that it takes two tuples as inputs, not naked coordinates.
* `GeoMultiToBag(geom)` -- splits a (multi)geometry into a bag of simple geometries. A `MultiPoint` becomes a bag of `Points`; a `Point` becomes a bag with a single `Point`, and so forth.
* `GeoBagToMulti(geom)` -- combines a bag of geometries into a single multi geometry. For instance, a bag with any mixture of `Point` and `MultiPoint` geometries becomes a single `MultiPoint` object, and similarly for (multi)lines and (multi)polygons. All the elements must have the same dimension -- no mixing (multi)points with (multi)lines, etc.
* `FromWKText(chararray)`, `FromGeoJson(chararray)` -- converts the serialized description of a shape into the corresponding geometry object. We'll cover these data formats a bit later in the chapter. Similarly, `ToWKText(geom)` and `ToGeoJson(geom)` serialize a geometry into a string
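
Here's a minimal sketch of the round trip, using `GeoPoint` and `ToWKText` from the list above (the input path, schema, and relation names are assumptions):

------
sightings    = LOAD '/data/geo/sightings.tsv' AS (id:chararray, lng:double, lat:double);  -- hypothetical table
sighting_pts = FOREACH sightings GENERATE id, GeoPoint(lng, lat) AS pt;       -- coordinates -> geometry object
sighting_wkt = FOREACH sighting_pts GENERATE id, ToWKText(pt) AS pt_wkt;      -- geometry -> WKT string
------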


// * (?name) GetPoints -- extract the collection of points from a geometry. Always returns a MultiPoint no matter what the input geometry.
// * (?name) GetLines -- extract the collection of lines or rings from a geometry. Returns `NULL` for a `Point`/`MultiPoint` input, and otherwise returns a MultiPoint no matter what the input geometry.
// * Point / MultiPoint / LineString / MultiLineString / Polygon / MultiPolygon -- construct given geometry
// - ClosedLineString -- bag of points to linestring, appending the initial point if it isn't identical to the final point
// * ForceMultiness
// * AsBinary, AsText
@@ -86,22 +100,21 @@ Somewhat related are operations that bring shapes in and out of Pig's control.
* `GeoX(point)`, `GeoY(point)` -- X or Y coordinates of a point
* `GeoLength(geom)`
* `GeoLength2dSpheroid(geom)` — Calculates the 2D length of a linestring/multilinestring on an ellipsoid. This is useful if the coordinates of the geometry are in longitude/latitude and a length is desired without reprojection.
* `GeoPerimeter(geom)` -- length measurement of a geometry's boundary
* `GeoDistanceSphere(geom_a, geom_b)` — returns the minimum distance in meters between two lon/lat geometries. Uses a spherical earth and radius of 6370986 meters. Faster than GeoDistanceSpheroid, but less accurate.
* `GeoDistance(geom_a, geom_b)` -- the 2-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units.
* `GeoMinDistance(geom_a, geom_b)`
* `GeoMaxDistance(geom_a, geom_b)` -- the 2-dimensional largest distance between two geometries in projected units
* `GeoDistanceSphere(geom_a, geom_b)` — returns the minimum distance in meters between two lon/lat geometries. Uses a spherical earth and radius of 6370986 meters.
// * `GeoMaxDistance(geom)` -- the 2-dimensional largest distance between two geometries in projected units
// * IsNearby -- if some part of the geometries lie within the given distance apart
// * IsNearbyFully(geom_a, geom_b, distance) -- if all parts of each geometry lies within the given distance of each other.
// * `GeoPerimeter(geom)` -- length measurement of a geometry's boundary
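
As a tiny usage sketch (the `airport_pairs` relation and its fields are hypothetical, and the two-geometry call follows the signatures listed above), these measurements drop straight into a `FILTER`:

------
-- keep only pairs of airports whose great-circle distance is under 100 km
nearby_pairs = FILTER airport_pairs BY GeoDistanceSphere(pt_a, pt_b) <= 100000.0;
------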

There are also a set of meta-operations that report on the geometry objects representing a shape:

* `Dimension(geom)` -- This operation returns zero for Point and MultiPoint; 1 for LineString and MultiLineString; and 2 for Polygon and MultiPolygon, regardless of whether those shapes exist in a 2-D or 3-D space
* `CoordDim(geom)` -- the number of axes in the coordinate system being used: 2 for X-Y geometries, 3 for X-Y-Z geometries, and so on. Points, lines and polygons within a common coordinate system will all have the same value for `CoordDim`
* `GeometryType(geom)` -- string representing the geometry type: `'Point'`, `'LineString'`, ..., `'MultiPolygon'`.
* `IsGeomEmpty(geom)` -- 1 if the geometry contains no actual points.
* `IsLineClosed(line)` -- 1 if the given `LineString`'s end point meets its start point.
* `IsSimple` -- 1 if the geometry has no anomalous geometric aspects, such intersecting or being tangent to itself. A multipoint is 'simple' if none of its points coincide.
* `IsGeoEmpty(geom)` -- 1 if the geometry contains no actual points.
* `IsGeoClosed(line)` -- 1 if the given `LineString`'s end point meets its start point.
* `IsGeoSimple` -- 1 if the geometry has no anomalous geometric aspects, such as intersecting or being tangent to itself. A multipoint is 'simple' if none of its points coincide.
* `IsLineRing` -- 1 if the given `LineString` is a ring -- that is, closed and simple.

* `NumGeometries(geom_collection)`
@@ -155,7 +168,7 @@ The geospatial toolbox has a set of precisely specified spatial relationships. T
* `Contains(geom_a, geom_b)` -- 1 if `geom_a` completely contains `geom_b`: that is, the shapes' interiors intersect, and no part of `geom_b` lies in the exterior of `geom_a`. If two shapes are equal, then it is true that each contains the other. `Contains(A, B)` is exactly equivalent to `Within(B, A)`.
// - `ContainsProperly(geom_a, geom_b)` -- 1 if : that is, the shapes' interiors intersect, and no part of `geom_b` intersects the exterior _or boundary_ of `geom_a`. The result of `Contains(A, A)` is always 1 and the result of `ContainsProperly(A,A) is always 0.
* `Within(geom_a, geom_b)` -- 1 if `geom_a` is completely contained by `geom_b`: that is, the shapes' interiors intersect, and no part of `geom_a` lies in the exterior of `geom_b`. If two shapes are equal, then it is true that each is within the other.
* `Covers(geom_a, geom_b)` -- 1 if no point in `geom_b` is outside `geom_a`. `CoveredBy(geom_a, geom_b)` is sugar for `Covers(geom_b, geom_a)`.
* `Covers(geom_a, geom_b)` -- 1 if no point in `geom_b` is outside `geom_a`. `CoveredBy(geom_a, geom_b)` is sugar for `Covers(geom_b, geom_a)`. (TODO: verify: A polygon covers its boundary but does not contain its boundary.)
* `Crosses(geom_a, geom_b)` -- 1 if the shapes cross: their geometries have some, but not all, interior points in common; and the dimension of the intersection is one less than the higher-dimension of the two shapes. That's a mouthful, so let's just look at the cases in turn:
- A MultiPoint crosses a (multi)line or (multi)polygon as long as at least one of its points lies in the other shape's interior, and at least one of its points lies in the other shape's exterior. Points along the border of the polygon(s) or the endpoints of the line(s) don't matter.
- A Line/MultiLine crosses a Polygon/MultiPolygon only when part of some line lies within the polygon(s)' interior and part of some line lies within the polygon(s)' exterior. Points along the border of a polygon or the endpoints of a line don't matter.