From a5770370548ad17ff9fffd01361aed7201e5e994 Mon Sep 17 00:00:00 2001 From: "Philip (flip) Kromer" Date: Thu, 14 Aug 2014 16:03:58 -0500 Subject: [PATCH] working through editorial feedback --- 01-intro.asciidoc | 21 ++ 02-feedback_and_response.asciidoc | 50 ++++ 02-hadoop_basics.asciidoc | 93 +++++-- ...ns-structural_operations-ordering.asciidoc | 29 ++ 10-advanced_patterns.asciidoc | 28 -- 10-event_streams.asciidoc | 1 - 11b-spatial_aggregation-points.asciidoc | 23 +- 11c-geospatial_mechanics.asciidoc | 35 ++- 11c-spatial_aggregations_on_regions.asciidoc | 2 +- 11e-weather_near_you.asciidoc | 15 + 12-text_analysis.asciidoc | 263 +++++++++++++++--- 25c-references.asciidoc | 1 + 90-style_guide.asciidoc | 7 + book.asciidoc | 38 +-- 14 files changed, 462 insertions(+), 144 deletions(-) create mode 100644 01-intro.asciidoc create mode 100644 02-feedback_and_response.asciidoc create mode 100644 90-style_guide.asciidoc diff --git a/01-intro.asciidoc b/01-intro.asciidoc new file mode 100644 index 0000000..8e17ba8 --- /dev/null +++ b/01-intro.asciidoc @@ -0,0 +1,21 @@ + + +why Hadoop is a breakthrough tool and examples of how you can use it to transform, simplify, contextualize, and organize data. + +* distributes the data +* context (group) +* matching (cogroup / join) +* + + +* coordinates to grid cells +* group on location +* count articles +* wordbag +* join wordbags to coordinates +* sum counts + + + + + diff --git a/02-feedback_and_response.asciidoc b/02-feedback_and_response.asciidoc new file mode 100644 index 0000000..480eecb --- /dev/null +++ b/02-feedback_and_response.asciidoc @@ -0,0 +1,50 @@ +==== Introduction Structure + + + +==== Tell readers what the point of this is before you dive into the example. What are you showing them? Why? What will they get out of it? "I'm going to walk you through an example of ___, which will show you _____ so that you'll begin to understand how _____" for example. + +[NOTE] +.Initial version +====== +Igpay Atinlay translator, actual version is our first Hadoop job, a program that translates plain text files into Igpay Atinlay. It’s written in Wukong, ... +====== + +Igpay Atinlay translator is our first Hadoop job, a program that translates plain text files into Igpay Atinlay. This is a Hadoop job stripped to its barest minimum, one that does just enough to each record that you believe it happened but with no distractions. That makes it convenient to learn how to launch a job; how to follow its progress; and where Hadoop reports performance metrics such as run time and amount of data moved. What's more, the very fact that it's trivial makes it one of the most important examples to run. For comparable input and output size, no regular Hadoop job can out-perform this one in practice, so it's a key reference point to carry in mind. + +==== Whenever you say "It's best" be sure to include a statement of why it's best. + +[NOTE] +.Initial version +====== +It’s best to begin developing jobs locally on a subset of data. Run your Wukong script directly from your terminal’s commandline: ... +====== + + +It's best to begin developing jobs locally on a subset of data: they are faster and cheaper to run. To run the Wukong script locally, enter this into your terminal's commandline: + +(... a couple paragraphs later ...) + +NOTE: There are even more reasons why it's best to begin developing jobs locally on a subset of data than just faster and cheaper. 
What's more, though, extracting a meaningful subset of tables also forces you to get to know your data and its relationships. And since all the data is local, you're forced into the good practice of first addressing "what would I like to do with this data" and only then considering "how shall I do so efficiently". Beginners often want to believe the opposite, but experience has taught us that it's nearly always worth the upfront investment to prepare a subset, and not to think about efficiency from the beginning. + +==== Tell them what to expect before they run the job. + +[NOTE] +.Initial version +====== +First, let’s test on the same tiny little file we used at the commandline. + +------ +wukong launch examples/text/pig_latin.rb ./data/text/magi.txt ./output/latinized_magi +------ + +While the script outputs a bunch of happy robot-ese to your screen... +====== + +First, let's test on the same tiny little file we used at the commandline. This command does not process any data but instead instructs _Hadoop_ to process the data, and so its output will contain information on how the job is progressing. + +------ +wukong launch examples/text/pig_latin.rb ./data/text/magi.txt ./output/latinized_magi.txt +------ + +While the script outputs a bunch of happy robot-ese to your screen ... diff --git a/02-hadoop_basics.asciidoc b/02-hadoop_basics.asciidoc index 7fe5cd5..a73441e 100644 --- a/02-hadoop_basics.asciidoc +++ b/02-hadoop_basics.asciidoc @@ -1,9 +1,40 @@ [[hadoop_basics]] == Hadoop Basics +=== Introduction + +In this chapter, we will equip you with two things: the necessary mechanics of working with Hadoop, and a physical intuition for how data and computation move around the cluster during a job. + +Hadoop is a large and complex beast. It can be bewildering to even begin to use the system, and so in this chapter we're going to purposefully charge through the least you need to know to launch jobs and manage data. If you hit trouble, anything past that is well-covered in Hadoop's excellent and detailed documentation or online. But don't go looking for trouble! For every one of its many modes options and configurations that is essential, there are many more that are distracting or even dangerous. The most important optimizations you can make come from designing efficient workflows, and even moreso from knowing when to spend highly valuable programmer time to reduce compute time. + +The key to doing so is an intuitive, physical understanding of how data moves around a Hadoop cluster. + +how data moves around a hadoop cluster +how much that costs + +The focus of this chapter is on building your intuition on +how much data should be processed and how much that should cost +how much data was processed and how much it did cost. + +how and why Hadoop distributes data across the machines in a cluster +how much it costs to +overhead + +basis for comparing human costs to cluster costs + + +How much data was moved + +This chapter will only look at "embarras + +Shipping data from one machine to another -- even from one location on disk to another -- is outrageously costly +How + +// (If you're already familiar with the basics of using Hadoop and are too anxious to get to the specifics of working with data, skip ahead to Chapter 4) + .Chimpanzee and Elephant Start a Business ****** -A few years back, two friends -- JT, a gruff silverback chimpanzee, and Nanette, a meticulous matriarch elephant -- decided to start a business. 
As you know, Chimpanzees love nothing more than sitting at keyboards processing and generating text. Elephants have a prodigious ability to store and recall information, and will carry huge amounts of cargo with great determination. This combination of skills impressed a local publishing company enough to earn their first contract, so Chimpanzee and Elephant Corporation (C&E Corp for short) was born. +A few years back, two friends -- JT, a gruff silverback chimpanzee, and Nanette, a meticulous matriarch elephant -- decided to start a business. As you know, Chimpanzees love nothing more than sitting at keyboards processing and generating text. Elephants have a prodigious ability to store and recall information, and will carry huge amounts of cargo with great determination. This combination of skills impressed a local publishing company enough to earn their first contract, so Chimpanzee and Elephant, Incorporated (C&E for short) was born. The publishing firm’s project was to translate the works of Shakespeare into every language known to man, so JT and Nanette devised the following scheme. Their crew set up a large number of cubicles, each with one elephant-sized desk and one or more chimp-sized desks, and a command center where JT and Nanette can coordinate the action. @@ -16,24 +47,21 @@ If you were to walk by a cubicle mid-workday, you would see a highly-efficient i The fact that each chimpanzee's work is independent of any other's -- no interoffice memos, no meetings, no requests for documents from other departments -- made this the perfect first contract for the Chimpanzee & Elephant, Inc. crew. JT and Nanette, however, were cooking up a new way to put their million-chimp army to work, one that could radically streamline the processes of any modern paperful office footnote:[Some chimpanzee philosophers have put forth the fanciful conceit of a "paper-less" office, requiring impossibilities like a sea of electrons that do the work of a chimpanzee, and disks of magnetized iron that would serve as scrolls. These ideas are, of course, pure lunacy!]. JT and Nanette would soon have the chance of a lifetime to try it out for a customer in the far north with a big, big problem. ****** -=== Introduction - -Hadoop is a large and complex beast. There's a lot to learn before one can even begin to use the system, much less become meaningfully adept at doing so. While Hadoop has stellar documentation, not everyone is comfortable diving right in to the menagerie of Mappers, Reducers, Shuffles, and so on. For that crowd, we've taken a different route, in the form of a story. - === Map-only Jobs: Process Records Individually === -Having read that short allegory, you've just learned a lot about how Hadoop operates under the hood. We can now use it to walk you through some examples. This first example uses only the _Map_ phase of MapReduce, to take advantage of what some people call an "embarrassingly parallel" problem. +As you'd guess, the way Chimpanzee and Elephant organize their files and workflow corresponds directly with how Hadoop handles data and computation under the hood. We can now use it to walk you through some examples. -We may not be as clever as JT's multilingual chimpanzees, but even we can translate text into a language we'll call _Igpay Atinlay_ footnote:[Sharp-eyed readers will note that this language is really called _Pig Latin._ That term has another name in the Hadoop universe, though, so we've chosen to call it Igpay Atinlay -- Pig Latin for "Pig Latin".]. 
For the unfamiliar, here's how to http://en.wikipedia.org/wiki/Pig_latin#Rules[translate standard English into Igpay Atinlay]: +We may not be as clever as JT's multilingual chimpanzees, but even we can translate text into a language we'll call _Igpay Atinlay_. footnote:[Sharp-eyed readers will note that this language is really called _Pig Latin._ That term has another name in the Hadoop universe, though, so we've chosen to call it Igpay Atinlay -- Pig Latin for "Pig Latin".]. For the unfamiliar, here's how to http://en.wikipedia.org/wiki/Pig_latin#Rules[translate standard English into Igpay Atinlay]: * If the word begins with a consonant-sounding letter or letters, move them to the end of the word adding "ay": "happy" becomes "appy-hay", "chimp" becomes "imp-chay" and "yes" becomes "es-yay". * In words that begin with a vowel, just append the syllable "way": "another" becomes "another-way", "elephant" becomes "elephant-way". -<> is our first Hadoop job, a program that translates plain text files into Igpay Atinlay. It's written in Wukong, a simple library to rapidly develop big data analyses. Like the chimpanzees, it is single-concern: there's nothing in there about loading files, parallelism, network sockets or anything else. Yet you can run it over a text file from the commandline -- or run it over petabytes on a cluster (should you for whatever reason have a petabyte of text crying out for pig-latinizing). +<> is our first Hadoop job, a program that translates plain text files into Igpay Atinlay. This is a Hadoop job stripped to its barest minimum, one that does just enough to each record that you believe it happened but with no distractions. That makes it convenient to learn how to launch a job; how to follow its progress; and where Hadoop reports performance metrics such as run time and amount of data moved. What's more, the very fact that it's trivial makes it one of the most important examples to run. For comparable input and output size, no regular Hadoop job can out-perform this one in practice, so it's a key reference point to carry in mind. +We've written this example in Wukong, a simple library to rapidly develop big data analyses. Like the chimpanzees, it is single-concern: there's nothing in there about loading files, parallelism, network sockets or anything else. Yet you can run it over a text file from the commandline -- or run it over petabytes on a cluster (should you for whatever reason have a petabyte of text crying out for pig-latinizing). [[pig_latin_translator]] -.Igpay Atinlay translator, actual version +.Igpay Atinlay translator ---- CONSONANTS = "bcdfghjklmnpqrstvwxz" UPPERCASE_RE = /[A-Z]/ @@ -77,7 +105,7 @@ end * `yield(latinized)` hands off the `latinized` string for wukong to output **** -It's best to begin developing jobs locally on a subset of data. Run your Wukong script directly from your terminal's commandline: +It's best to begin developing jobs locally on a subset of data, because they are faster and cheaper to run. To run the Wukong script locally, enter this into your terminal's commandline footnote:[If you're not familiar with working from the commandline, Appendix (REF) will give you the bare essentials.]: ------ wu-local examples/text/pig_latin.rb data/magi.txt - @@ -89,15 +117,25 @@ The `-` at the end tells wukong to send its results to standard out (STDOUT) rat Everywhere-way ey-thay are-way isest-way. Ey-thay are-way e-thay agi-may. 
 ------

-That's what it looks like when a `cat` is feeding the program data; let's see how it works when an elephant sets the pace.
+That's what it looks like when a `cat` is feeding the program data. Let's run it on a real Hadoop cluster to see how it works when an elephant is in charge.
+
+NOTE: Faster and cheaper aren't the only reasons to begin developing jobs locally on a subset of data. Extracting a meaningful subset of tables also forces you to get to know your data and its relationships. And since all the data is local, you're pushed into the good practice of first asking "what would I like to do with this data?" and only then "how shall I do so efficiently?". Beginners often want to work in the opposite order, but experience has taught us that it's nearly always worth the upfront investment to prepare a subset rather than worry about efficiency from the start.
+
+
+=== Transferring Data to the Cluster

 [NOTE]
 .Are you running on a cluster?
 ======
-If you've skimmed Hadoop's documentation already, you've probaby seen the terms _fully-distributed,_ _pseudo-distributed,_ and _local,_ bandied about. Those describe different ways to setup your Hadoop cluster, and they're relevant to how you'll run the examples in this chapter.
+If you've skimmed Hadoop's documentation already, you've probably seen the terms _fully-distributed,_ _pseudo-distributed,_ and _local_ bandied about. Those describe different ways to set up your Hadoop cluster, and they're relevant to how you'll run the examples in this chapter.
+If you follow our advice and only use Hadoop through its medium and high-level abstractions,
+you can develop and test a job using your laptop
+during a long-haul flight with no network access, then run the same job, unchanged, on a full cluster.

-In short: if you're running the examples on your laptop, during a long-haul flight, you're likely running in local mode. That means all of the computation work takes place on your machine, and all of your data sits on your local filesystem.
+If you're running the examples on the single machine you're sitting in front of, you're likely running in local mode. That means all of the computation work takes place on your machine, and all of your data sits on your local filesystem.
+
+// we recommend you

 On the other hand, if you have access to a cluster, your jobs run in fully-distributed mode. All the work is farmed out to the cluster machines. In this case, your data will sit in the cluster's filesystem called HDFS.

@@ -110,16 +148,20 @@ hadoop fs -put wukong_example_data/text ./data/

 These commands understand `./data/text` to be a path on the HDFS, not your local disk; the dot `.` is treated as your HDFS home directory (use it as you would `~` in Unix.). The `wu-put` command, which takes a list of local paths and copies them to the HDFS, treats its final argument as an HDFS path by default, and all the preceding paths as being local.

+// If you want to test you
+// You might also see references to 'pseudo-distributed mode'. Our advice is to avoid it.
+
 (Note: if you don't have access to a Hadoop cluster, Appendix 1 (REF) lists resources for acquiring one.)
 ======

 ==== Run the Job ====

-First, let's test on the same tiny little file we used at the commandline.
+First, let's test on the same tiny little file we used at the commandline. This command does not process any data itself; it instructs _Hadoop_ to process the data, and so its output will be information on how the job is progressing.
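+
+Before launching, two quick sanity checks are worth the few seconds they take (a suggested habit, not a requirement): confirm the input really landed on the HDFS, and confirm the output path doesn't already exist. Hadoop refuses to write into an existing output directory, so a leftover `./output/latinized_magi.txt` from an earlier run will make the job fail immediately.
+
+------
+hadoop fs -ls ./data/text/magi.txt
+hadoop fs -ls ./output
+------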
+ // Make sure to notice how much _longer_ it takes this elephant to squash a flea than it took to run without Hadoop. ------ -wukong launch examples/text/pig_latin.rb ./data/text/magi.txt ./output/latinized_magi +wukong launch examples/text/pig_latin.rb ./data/text/magi.txt ./output/latinized_magi.txt ------ // TODO-CODE: something about what the reader can expect to see on screen @@ -186,7 +228,12 @@ Visit the jobtracker console (see sidebar REF). The first thing you'll notice is The most important numbers to note are the number of running tasks (there should be some unless your job is finished or the cluster is congested) and the number of failed tasks (for a healthy job on a healthy cluster, there should never be any). Don't worry about killed tasks; for reasons we'll explain later on, it's OK if a few appear late in a job. We will describe what to do when there are failing attempts later in the section on debugging Hadoop jobs (REF), but in this case, there shouldn't be any. Clicking on the number of running Map tasks will take you to a window that lists all running attempts (and similarly for the other categories). On the completed tasks listing, note how long each attempt took; for the Amazon M3.xlarge machines we used, each attempt took about x seconds (CODE: correct time and machine size). There is a lot of information here, so we will pick this back up in chapter (REF), but the most important indicator is that your attempts complete in a uniform and reasonable length of time. There could be good justification for why task 00001 is still running after five minutes while all other attempts finished in ten seconds, but if that's not what you thought would happen you should dig deeper footnote:[A good justification is that task 00001's input file was compressed in a non-splittable format and is 40 times larger than the rest of the files. It's reasonable to understand how this unfortunate situation unfairly burdened a single mapper task. A bad justification is that task 00001 is trying to read from a failing-but-not-failed datanode, or has a corrupted record that is sending the XML parser into recursive hell. The good reasons you can always predict from the data itself; otherwise assume it's a bad reason]. -You should get in the habit of sanity-checking the number of tasks and the input and output sizes at each job phase for the jobs you write. In this case, the job should ultimately require x Map tasks, no Reduce tasks and on our x machine cluster, it completed in x minutes. For this input, there should be one Map task per HDFS block, x GB of input with the typical one-eighth GB block size, means there should be 8x Map tasks. Sanity checking the figure will help you flag cases where you ran on all the data rather than the one little slice you intended or vice versa; to cases where the data is organized inefficiently; or to deeper reasons that will require you to flip ahead to chapter (REF). +You should get in the habit of sanity-checking the number of tasks and the input and output sizes at each job phase for the jobs you write. + + +Remember, what we're learning to do is 'coach' Hadoop into how to process our data. + +In this case, the job should ultimately require x Map tasks, no Reduce tasks and on our x machine cluster, it completed in x minutes. For this input, there should be one Map task per HDFS block, x GB of input with the typical one-eighth GB block size, means there should be 8x Map tasks. 
Sanity checking the figure will help you flag cases where you ran on all the data rather than the one little slice you intended or vice versa; to cases where the data is organized inefficiently; or to deeper reasons that will require you to flip ahead to chapter (REF). Annoyingly, the Job view does not directly display the Mapper input data, only the cumulative quantity of data per source, which is not always an exact match. Still, the figure for HDFS bytes read should closely match the size given by ‘Hadoop fs -du’ (CODE: add paths to command). @@ -194,6 +241,11 @@ You can also estimate how large the output should be, using the "Gift of the Mag We cannot stress enough how important it is to validate that your scripts are doing what you think they are. The whole problem of Big Data is that it is impossible to see your data in its totality. You can spot-check your data, and you should, but without independent validations like these you're vulnerable to a whole class of common defects. This habit -- of validating your prediction of the job’s execution -- is not a crutch offered to the beginner, unsure of what will occur; it is a best practice, observed most diligently by the expert, and one every practitioner should adopt. +=== Performance Baseline + +This first example is what some people call an "embarrassingly parallel" problem: every record can be processed on its own. As you'll see, that means it uses only the _Map_ phase of Map/Reduce. + + .How a job is born, the thumbnail version ********* Apart from one important detail, the mechanics of how a job is born should never become interesting to a Hadoop user. But since some people's brains won't really believe that the thing actually works unless we dispel some of the mystery, here's a brief synopsis. @@ -205,6 +257,11 @@ As you have gathered, each Hadoop worker runs a tasktracker daemon to coordinate The one important detail to learn in all this is that _task trackers do not run your job, they only launch it_. Your job executes in a completely independent child process with its own Java settings and library dependencies. In fact, if you are using Hadoop streaming programs like Wukong, your job runs in even yet its own process, spawned by the Java child process. We've seen people increase the tasktracker memory sizes thinking it will improve cluster performance -- the only impact of doing so is to increase the likelihood of out-of-memory errors. ******** -=== Outro +// === The Cost of a Job + + + +// === Outro +// +// In the next chapter, you'll learn about map/reduce jobs -- the full power of Hadoop's processing paradigm.. Let's start by joining JT and Nannette with their next client. -In the next chapter, you'll learn about map/reduce jobs -- the full power of Hadoop's processing paradigm.. Let's start by joining JT and Nannette with their next client. diff --git a/06-analytic_patterns-structural_operations-ordering.asciidoc b/06-analytic_patterns-structural_operations-ordering.asciidoc index 417b666..16d4531 100644 --- a/06-analytic_patterns-structural_operations-ordering.asciidoc +++ b/06-analytic_patterns-structural_operations-ordering.asciidoc @@ -292,6 +292,33 @@ NOTE: We've cheated on the theme of this chapter (pipeline-only operations) -- s // * (how do `null`s sort?) // * ASC / DESC: fewest strikeouts per plate appearance +=== Numbering Records in Rank Order + +If you supply only the name of the table, RANK acts as a pipeline operation, introducing no extra map/reduce stage. 
Each split is numbered as a unit: the third line of chunk `part-00000` gets rank 2, the third line of chunk `part-00001` gets rank 2, and so on. + +When you give rank a field to act on, it + +It's important to know that in current versions of Pig, the RANK operator sets parallelism one, +forcing all data to a single reducer. If your data is unacceptably large for this, you can use the +method used in (REF) "Assigning a unique identifier to each line" to get a unique compound index +that matches the total ordering, which might meet your needs. Otherwise, we can offer you no good +workaround -- frankly your best option may be to pay someone to fix this + +------ +gift_id gift RANK RANK gift_id RANK gift DENSE +1 partridge 1 1 1 +4a calling birds 2 4 7 +4b calling birds 3 4 7 +2a turtle dove 4 2 2 +4d calling birds 5 4 7 +5 golden rings 6 5 11 +2b turtle dove 7 2 2 +3a french hen 8 3 4 +3b french hen 9 3 4 +3c french hen 10 3 4 +4c calling birds 11 4 7 +------ + // ==== Rank records in a group using Stitch/Over // // @@ -420,3 +447,5 @@ STORE_TABLE('vals_shuffled', vals_shuffled); ----- This follows the general plot of 'Assign a Unique ID': enable a hash function UDF; load the files so that each input split has a stable handle; and number each line within the split. The important difference here is that the hash function we generated accepts a seed that we can mix in to each record. If you supply a constant to the constructor (see the documentation) then the records will be put into an effectively random order, but the same random order each time. By supplying the string `'rand'` as the argument, the UDF will use a different seed on each run. What's nice about this approach is that although the ordering is different from run to run, it does not exhibit the anti-pattern of changing from task attempt to task attempt. The seed is generated once and then used everywhere. Rather than creating a new random number for each row, you use the hash to define an effectively random ordering, and the seed to choose which random ordering to apply. + + diff --git a/10-advanced_patterns.asciidoc b/10-advanced_patterns.asciidoc index 5cb21fd..398db23 100644 --- a/10-advanced_patterns.asciidoc +++ b/10-advanced_patterns.asciidoc @@ -333,34 +333,6 @@ You'll see a more elaborate version of this // -- STORE_TABLE(normed_seasons, 'normed_seasons'); -=== Numbering Records in Rank Order - - -If you supply only the name of the table, RANK acts as a pipeline operation, introducing no extra map/reduce stage. Each split is numbered as a unit: the third line of chunk `part-00000` gets rank 2, the third line of chunk `part-00001` gets rank 2, and so on. - -When you give rank a field to act on, it - -It's important to know that in current versions of Pig, the RANK operator sets parallelism one, -forcing all data to a single reducer. If your data is unacceptably large for this, you can use the -method used in (REF) "Assigning a unique identifier to each line" to get a unique compound index -that matches the total ordering, which might meet your needs. 
Otherwise, we can offer you no good -workaround -- frankly your best option may be to pay someone to fix this - ------- -gift RANK RANK gift RANK gift DENSE -partridge 1 1 1 -turtle dove 2 2 2 -turtle dove 3 2 2 -french hen 4 3 4 -french hen 5 3 4 -french hen 6 3 4 -calling birds 7 4 7 -calling birds 8 4 7 -calling birds 9 4 7 -calling birds 10 4 7 -K golden rings 11 5 11 ------- - // -- *************************************************************************** // -- diff --git a/10-event_streams.asciidoc b/10-event_streams.asciidoc index 15b375b..119d69b 100644 --- a/10-event_streams.asciidoc +++ b/10-event_streams.asciidoc @@ -151,7 +151,6 @@ Unless of course you are trying to test a service for resilience against an adve flow(:mapper){ input > parse_loglines > elephant_stampede } ---- - You must use Wukong's eventmachine bindings to make more than one simultaneous request per mapper. === Refs === diff --git a/11b-spatial_aggregation-points.asciidoc b/11b-spatial_aggregation-points.asciidoc index 5163211..44ab821 100644 --- a/11b-spatial_aggregation-points.asciidoc +++ b/11b-spatial_aggregation-points.asciidoc @@ -1,11 +1,14 @@ ==== Smoothing Pointwise Data Locally (Spatial Aggregation of Points) +Let's start by extending the group-and-aggregate pattern -- introduced in Chapter Six (REF) and ubiqitous since -- -We will start, as we always do, by applying patterns that turn Big Data into Much a Less Data. In particular, -A great tool for visualizing a large spatial data set + a great way to summarize a large data set, and one of the first things you’ll do to Know Thy Data. +This type of aggregation is a frontline tool of spatial analysis +It draws on methods you’ve already learned, giving us a chance to introduce some terminology and necessary details. + // * You want to "wash out" everything but the spatial variation -- even though the data was gathered for each // * Point measurement of effect with local extent -- for example, the temperature measured at a weather station is understood to be representative of the weather for several surrounding miles. // * @@ -58,6 +61,13 @@ Then geonames places -- show lakes and streams (or something nature-y) vs someth Do that again, but for a variable: airport flight volume -- researching epidemiology + +This would also be +n epidemiologist or transportation analyst interested in knowing the large-scale flux of people could throughout the global transportation network +Combining this with the weather data + + + // FAA flight data http://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger/media/cy07_primary_np_comm.pdf We can plot the number of air flights handled by every airport @@ -77,8 +87,6 @@ grid_cts = FOREACH (GROUP gridded BY (bin_x, bin_y)) SUM(n_flights) AS tot_flights; ------ -An epidemiologist or transportation analyst interested in knowing the large-scale flux of people could throughout the global transportation network - ===== Pattern Recap: Spatial Aggregation of Points * _Generic Example_ -- group on tile cell, then apply the appropriate aggregation function @@ -89,8 +97,11 @@ An epidemiologist or transportation analyst interested in knowing the large-scal === Matching Points within a Given Distance (Pointwise Spatial Join) -Now that you've learned the spatial equivalent of a `GROUP BY`, you'll probably be interested to -learn the spatial equivalent of `COGROUP` and `JOIN`. 
+Now that you've learned the spatial equivalent of a `GROUP BY` aggregation -- combining many records within a grid cell into a single summary record -- you'll probably be interested to +learn the spatial equivalent of `COGROUP` and `JOIN` -- +collecting all records + + In particular, let's demonstrate how to match all points in one table with every point in another table that are less than a given fixed distance apart. Our reindeer friends would like us to help determin what UFO pilots do while visiting Earth. diff --git a/11c-geospatial_mechanics.asciidoc b/11c-geospatial_mechanics.asciidoc index 57e77db..3dac6a6 100644 --- a/11c-geospatial_mechanics.asciidoc +++ b/11c-geospatial_mechanics.asciidoc @@ -60,17 +60,31 @@ We'll start with the operations that transform a shape on its own to produce a n ==== Constructing and Converting Geometry Objects -Somewhat related are operations that bring shapes in and out of Pig's control. +Somewhat related are operations that change the data types used to represent a shape. -* `FromWKText(chararray)`, `FromGeoJson(chararray)` -- converts the serialized description of a shape into the corresponding geometry object. We'll cover these data formats a bit later in the chapter. Similarly, `ToWKText(geom)` and `ToGeoJson(geom)` serialize a geometry into a string +Going from shape to coordinates-as-numbers lets you apply general-purpose manipulations + +As a concrete example (but without going into the details), to identify patterns of periodic spacing in a set of coordinates footnote:[The methodical rows of trees in an apple orchard will appear as isolated frequency peaks oriented to the orchard plan; an old-growth forest would show little regularity and no directionality] +you'd quite likely want to extract the coordinates of your shapes as a bag of tuples, apply +a generic UDF implementing the 2-D FFT (Fast Fourier Transform) algorithm + + +. +The files in GeoJSON, WKT, or the other geographic formats described later in this Chapter (REF) produce records directly as geometry objects, + +There are functions to construct Point, Multipoint, LineString, ... objects from coordinates you supply, and counterparts that extract a shape's coordinates as plain-old-Pig-objects. + + +* `Point` / `MultiPoint` / `LineString` / `MultiLineString` / `Polygon` / `MultiPolygon` -- construct given geometry. * `GeoPoint(x_coord, y_coord)` -- constructs a `Point` from the given coordinates * `GeoEnvelope( (x_min, y_min), (x_max, y_max) )` -- constructs an `Envelope` object from the numerically lowest and numerically highest coordinates. Note that it takes two tuples as inputs, not naked coordinates. * `GeoMultiToBag(geom)` -- splits a (multi)geometry into a bag of simple geometries. A `MultiPoint` becomes a bag of `Points`; a `Point` becomes a bag with a single `Point`, and so forth. * `GeoBagToMulti(geom)` -- combines a bag of geometries into a single multi geometry. For instance, a bag with any mixture of `Point` and `MultiPoint` geometries becomes a single `MultiPoint` object, and similarly for (multi)lines and (multi)polygons. All the elements must have the same dimension -- no mixing (multi)points with (multi)lines, etc. +* `FromWKText(chararray)`, `FromGeoJson(chararray)` -- converts the serialized description of a shape into the corresponding geometry object. We'll cover these data formats a bit later in the chapter. 
Similarly, `ToWKText(geom)` and `ToGeoJson(geom)` serialize a geometry into a string + // * (?name) GetPoints -- extract the collection of points from a geometry. Always returns a MultiPoint no matter what the input geometry. // * (?name) GetLines -- extract the collection of lines or rings from a geometry. Returns `NULL` for a `Point`/`MultiPoint` input, and otherwise returns a MultiPoint no matter what the input geometry. -// * Point / MultiPoint / LineString / MultiLineString / Polygon / MultiPolygon -- construct given geometry // - ClosedLineString -- bag of points to linestring, appending the initial point if it isn't identical to the final point // * ForceMultiness // * AsBinary, AsText @@ -86,22 +100,21 @@ Somewhat related are operations that bring shapes in and out of Pig's control. * `GeoX(point)`, `GeoY(point)` -- X or Y coordinates of a point * `GeoLength(geom)` * `GeoLength2dSpheroid(geom)` — Calculates the 2D length of a linestring/multilinestring on an ellipsoid. This is useful if the coordinates of the geometry are in longitude/latitude and a length is desired without reprojection. -* `GeoPerimeter(geom)` -- length measurement of a geometry's boundary -* `GeoDistanceSphere(geom)` — Returns minimum distance in meters between two lon/lat geometries. Uses a spherical earth and radius of 6370986 meters. Faster than GeoDistanceSpheroid, but less accurate * `GeoDistance(geom)` -- the 2-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units. -* `GeoMinDistance(geom)` -* `GeoMaxDistance(geom)` -- the 2-dimensional largest distance between two geometries in projected units +* `GeoDistanceSphere(geom)` — Returns minimum distance in meters between two lon/lat geometries. Uses a spherical earth and radius of 6370986 meters. +// * `GeoMaxDistance(geom)` -- the 2-dimensional largest distance between two geometries in projected units // * IsNearby -- if some part of the geometries lie within the given distance apart // * IsNearbyFully(geom_a, geom_b, distance) -- if all parts of each geometry lies within the given distance of each other. +// * `GeoPerimeter(geom)` -- length measurement of a geometry's boundary There are also a set of meta-operations that report on the geometry objects representing a shape: * `Dimension(geom)` -- This operation returns zero for Point and MultiPoint; 1 for LineString and MultiLineString; and 2 for Polygon and MultiPolygon, regardless of whether those shapes exist in a 2-D or 3-D space * `CoordDim(geom)` -- the number of axes in the coordinate system being used: 2 for X-Y geometries, 3 for X-Y-Z geometries, and so on. Points, lines and polygons within a common coordinate system will all have the same value for `CoordDim` * `GeometryType(geom)` -- string representing the geometry type: `'Point'`, `'LineString'`, ..., `'MultiPolygon'`. -* `IsGeomEmpty(geom)` -- 1 if the geometry contains no actual points. -* `IsLineClosed(line)` -- 1 if the given `LineString`'s end point meets its start point. -* `IsSimple` -- 1 if the geometry has no anomalous geometric aspects, such intersecting or being tangent to itself. A multipoint is 'simple' if none of its points coincide. +* `IsGeoEmpty(geom)` -- 1 if the geometry contains no actual points. +* `IsGeoClosed(line)` -- 1 if the given `LineString`'s end point meets its start point. +* `IsGeoSimple` -- 1 if the geometry has no anomalous geometric aspects, such intersecting or being tangent to itself. A multipoint is 'simple' if none of its points coincide. 
* `IsLineRing` -- 1 if the given `LineString` is a ring -- that is, closed and simple. * `NumGeometries(geom_collection)` @@ -155,7 +168,7 @@ The geospatial toolbox has a set of precisely specified spatial relationships. T * `Contains(geom_a, geom_b)` -- 1 if `geom_a` completely contains `geom_b`: that is, the shapes' interiors intersect, and no part of `geom_b` lies in the exterior of `geom_a`. If two shapes are equal, then it is true that each contains the other. `Contains(A, B)` is exactly equivalent to `Within(B, A)`. // - `ContainsProperly(geom_a, geom_b)` -- 1 if : that is, the shapes' interiors intersect, and no part of `geom_b` intersects the exterior _or boundary_ of `geom_a`. The result of `Contains(A, A)` is always 1 and the result of `ContainsProperly(A,A) is always 0. * `Within(geom_a, geom_b)` -- 1 if `geom_a` is completely contained by `geom_b`: that is, the shapes' interiors intersect, and no part of `geom_a` lies in the exterior of `geom_b`. If two shapes are equal, then it is true that each is within the other. -* `Covers(geom_a, geom_b)` -- 1 if no point in `geom_b` is outside `geom_a`. `CoveredBy(geom_a, geom_b)` is sugar for `Covers(geom_b, geom_a)`. +* `Covers(geom_a, geom_b)` -- 1 if no point in `geom_b` is outside `geom_a`. `CoveredBy(geom_a, geom_b)` is sugar for `Covers(geom_b, geom_a)`. (TODO: verify: A polygon covers its boundary but does not contain its boundary.) * `Crosses(geom_a, geom_b)` -- 1 if the shapes cross: their geometries have some, but not all, interior points in common; and the dimension of the intersection is one less than the higher-dimension of the two shapes. That's a mouthful, so let's just look at the cases in turn: - A MultiPoint crosses a (multi)line or (multi)polygon as long as at least one of its points lies in the other shape's interior, and at least one of its points lies in the other shape's exterior. Points along the border of the polygon(s) or the endpoints of the line(s) don't matter. - A Line/MultiLine crosses a Polygon/MultiPolygon only when part of some line lies within the polygon(s)' interior and part of some line lies within the polygon(s)' exterior. Points along the border of a polygon or the endpoints of a line don't matter. diff --git a/11c-spatial_aggregations_on_regions.asciidoc b/11c-spatial_aggregations_on_regions.asciidoc index 8d316de..49102e8 100644 --- a/11c-spatial_aggregations_on_regions.asciidoc +++ b/11c-spatial_aggregations_on_regions.asciidoc @@ -2,7 +2,7 @@ While spatially aggregating pointwise data required nothing terribly sophisticated, doing the same for objects with spatial extent will motivate a couple new tricks. As an example, let's take agricultural production for each country footnote:[downloaded from the website of FAOSTAT, The Statistical Division of the Food and Agriculture Organization of the United Nations] and smooth it onto a uniform grid. (We'll find some interesting connections to this data in the final chapter). -Fair warning: though the first version of this script we demonstrate for you is correct as it stands, it will be exceptionally inefficient. But we want to demonstate the problem and then discuss its elegant solution. +Fair warning: though the first version of this script we demonstrate for you is correct as it stands, it will be sub-optimal. But we want to demonstate the problem and then discuss its elegant solution. 
// X points Y gridcells occupied ~ 6000 grid cells (50 x 120) diff --git a/11e-weather_near_you.asciidoc b/11e-weather_near_you.asciidoc index 8f7539c..5094fbc 100644 --- a/11e-weather_near_you.asciidoc +++ b/11e-weather_near_you.asciidoc @@ -144,3 +144,18 @@ The Voronoi diagram covers the plane with polygons, one per point -- I'll call t // // The details: Connect each point with a line to its neighbors, dividing the plane into triangles; there's an efficient alorithm (http://en.wikipedia.org/wiki/Delaunay_triangulation[Delaunay Triangulation]) to do so optimally. If I stand at the midpoint of the edge connecting two locations, and walk perpendicular to the edge in either direction, I will remain equidistant from each point. Extending these lines defines the Voronoi diagram -- a set of polygons, one per point, enclosing the area closer to that point than any other. + + +=== Spatial Smoothing -- Hot-Spot Analysis + +Also known as local auto-correlation + +Gi-star (Gettis-Ord) + +* http://www.utdallas.edu/~briggs/henan/11SAlocal.ppt +* http://www.nefmc.org/tech/cte_mtg_docs/130516/CATT%20Report/Getis_Ords_G-star%20and%20Spatial%20Autocorrelation%20implementation%20in%20ArcView.pdf +* http://www.ucl.ac.uk/jdi/events/int-CIA-conf/ICIAC11_Slides/ICIAC11_3D_SChainey +* http://resources.esri.com/help/9.3/ArcGISEngine/java/Gp_ToolRef/Spatial_Statistics_tools/how_hot_spot_analysis_colon_getis_ord_gi_star_spatial_statistics_works.htm + + + diff --git a/12-text_analysis.asciidoc b/12-text_analysis.asciidoc index 506b94b..05bc680 100644 --- a/12-text_analysis.asciidoc +++ b/12-text_analysis.asciidoc @@ -1,3 +1,161 @@ +* Tokenize Articles +* Wordbag articles +* Corpus term stats + - document+term: (all rate figures are ppb) + - doc_id, bag:{(term, doc-term-n-usages, doc-term-rate, doc-term-rate-squared, range=1)}, doc-n-terms, doc-n-usages + - doc-max-term-usages: usages-ct of term with highest usages-ct + - doc-term-tf_1 = 0.5*(1 + (doc-term-n-usages/doc-max-term-usages)) + - doc-term-tf = 1 + log(doc-term-n-usages) (or 0 if d-t-n-usages == 0) + - doc-term-tf_3 = log(doc-term-n-usages)/(1+log(doc-avg-term-usages)) + - doc-n-singletons: number of terms in document used exactly once + + - removing stopwords + + - all: + - tot-usages: sum(doc-term-n-usages) for all docs&terms + - n-terms = count(distinct terms): count of distinct terms + - n-docs = count(docs): count of docs + - avg-doc-terms: avg(doc-n-terms) -- average number of distinct terms in a doc + - avg-doc-usages = avg(doc-n-usages) = (tot-usages/n-docs): average number of usages in a doc + - sum of doc-term-rate on all docs and terms should be same as docs ct -- see if we get an underflow situation? Do Knuth trick? + + - term: + - term-n-usages: sum of doc-term-n-usages + - term-rate: 1e9*sum(doc-term-n-usages)/tot-usages + - term-range: count of docs with word ie. 
sum of range column + - idf: ln(n-docs/term-range) + - term-avg-doc-rate: avg(doc-term-rate) + - term-sdv-doc-rate: sqrt((n-docs * sum(doc-term-rate-squared) - sum(doc-term-rate)^2)/n-docs + - rank-term-rate + - ?rank of dispersion*term-rate + + - term-doc: + - term + - doc-term-n-usages + - doc-term-rate + - term-rate + - doc-term-tf = 0.5*(1 + (doc-term-n-usages/doc-max-term-usages)) + - doc-term-tf-idf = tf*idf: significance of document for this term in global corpus + - doc-term-rate-sm = (1 - doc-unseen-rate)*(doc-term-n-usages / doc-n-usages) + - doc-n-terms + - doc-n-usages + - doc-unseen-rate = clamp(doc-n-singletons/doc-n-usages, 1/doc-n-usages, 0.5) + - doc-unseen-term-rate = ( doc-unseen-rate * (term-rate/(1 - (doc-n-usages/tot-usages))) ) -- rate of a specific unseen word + - all + - n-grids + - avg-docs-per-grid + - grid + - term + - grid-term-n-usages -- with grid-n-usages, the term rate if longer docs count more. In effect, this treats the whole grid as one document. + - grid-avg-doc-term-rate -- term rate if every document weights the same + - term-rate -- global term rate, for normalization + - grid-term-tf: 0.5*(1 + (doc-term-n-usages/doc-max-term-usages)) + - grid-term-tf-idf = tf*idf: significance of document for this term in global corpus + - avg-doc-term-tf-idf + - grid-n-usages + - grid-n-terms + - grid-n-docs + - n-grids + - n-docs + +The way to think about it is as... You won't believe it... As if there were a chimpanzee at a typewriter sitting at each grid cell littering the floor with documents. Whatever the coherence of their output, they do generally use words at the same rate as normal prose -- but each one has words they've grown particularly fond of. + +=== How to weight tiles + + +We can think of a few ways to look at it: + +10/1000 2/100 1/100 0/1000 0/300 + + + + +- **grid as doc**: use grid PMI, grid tf-idf. +- **average of doc stats**: use average of doc PMI, doc tf-idf. A term used 2 times out of 200 in a smallish document counts the same as a term used 10 times out of 1000 in a larger document. The number of documents mentioning a term doesn't matter if the same average rate is hit: you get the same outcome for a term that appears 1% of the time in both A and B as if it appeared 2% of the time in A and never in B. +- **RMS averaged docs**: use sum-squared average (`sqrt(sum(x^2))/ct`). A term appearing 1% of the time in both A and B contributes about 30% less than if it appeared 2% of the time in A and never in B. +- **thresholded doc stats**: a term is prominent if many docs feature it at some level. Three docs featuring a term at _any_ above-cutoff rate contributes much more than one doc which shouts about it. A grid with three docs which use the term just under threshold doesn't is the same as one that doesn't mention it. +- an ad-hoc mixture of the above. + +=== Count-min-sketch + +* Count-min-sketch filter for bigrams + - hash + - choosing a hash function; watch out for deanonymization + - choosing number, length of hash; using parts of a hash as hash. (2^21 = 2 million, getting three from a 64-bit hash; 2^25 = 33 million, getting five from two 64-bit hashes; 2^26 = 66 million, getting six from five 32-bit hashes / 2 64-bit and a 32-bit; 2^27 = 134 million, getting seven from three 64-bit hashes.) + - http://web.stanford.edu/class/cs166/lectures/12/Small12.pdf + - w is the number of possible hash outcomes. 
Eg for 21-bit hash get 2 million slots; for 25-bit hash get 33 million slots + - dh is the number of hash functions we use + - a_i is the usages count for term a + - U = |a|_1 (the l1 norm of a) is sum(a_i): the total usage count + - |a|_2 (l2 norm) is sqrt(sum(a_i^2)); it is strictly less than |a|_1. + - the naive background error with one hash function will be U/w: scattering the U observations evenly across w outcomes. If I have 4 million articles averaging 100 words each, and 25 bits giving 33 million slots, the expected slop is 12 counts. For 21 bits, it's 400m/2m= 191 counts. + - with many hash functions, if we take + - eps = e * U / w = 2.718 * U / 2^hashbits + - delta = 1/exp(dh) -- with 4 hashes it's 1/55, 5 hashes it's ~1/150, with seven it's ~1/1100 + - then with probability delta, the error in count will be less than eps. + - summary + - w = 2^hbits + - data size = dh*w + - background bin count = U/w + - eps = e * U/w + - delta = 1/exp(dh) + - for 400m usages, 21 bits, + + +* BNC stats: #1 "the": 62k ppm; #10, "was": 9.2k ppm; 36 ppm: "at once", empire, sugar; 17 ppm: "as opposed to", horrible, measurements; 10 ppm "relative to", dirt, sexually + - http://ucrel.lancs.ac.uk/bncfreq/flists.html + - ninety-four: 153/450m, dispersion 0.89; tyrannosaur 134/450m, disp 0.53; combinatorial 123/450m, disp 0.55; fluidly 135/450m, disp 0.90; compassionately 122/450m, disp 0.9 + + +===== notes on figures + +The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. + +Range (Ra) is a simple count of how many segments include the tag in question. Dispersion (Disp) is a statistical coefficient (Juilland’s D) of how evenly distributed a tag is across successive segments of the corpus. This is useful, because many segments and texts are made up of a number of smaller, relatively independent units – for example, sectors and stories in newspapers. It may be that, even within a text, certain word classes are overused in a given part – e.g. the football-reporting sector of a newspaper. Juilland’s D is more sensitive to this degree of variation. It was calculated as follows: V where x is the mean sub-frequency of the tag in the subcorpus (i.e. its frequency in each segment averaged) and s is the standard deviation of these sub-frequencies. We selected Juilland’s D as it has been shown by Lyne (1985) to be the most reliable of the various dispersion coefficients that are available. It varies between 0 and 1, where values closer to 0 show that occurrences are focussed in a small number of segments, and values closer to 1 show a more even distribution across all segments. + D =1- V/sqrt(n-1) + where n is the number of segments in the subcorpus. 
The variation coefficient V is given by: + V = s / x + + +http://www.comp.lancs.ac.uk/~paul/publications/rwl_lc36_2002.pdf +http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf +http://faculty.cs.byu.edu/~ringger/CS479/papers/Gale-SimpleGoodTuring.pdf + + +http://www.linguistics.ucsb.edu/faculty/stgries/research/2010_STG_UsefulStats4CorpLing_MosaicCorpLing.pdf + + +------ + +http://www.linguistics.ucsb.edu/faculty/stgries/research/2008_STG_Dispersion_IJCL.pdf + +Let us assume this is our corpus of length l = 50, where letters represent words and the pipes the division of the corpus into different, here, n=5 equally-sized parts.3 bamnibeupk|basatbewqn|bcagabesta|baghabeaat| ba h a a b e a x a The percentages that each of the parts makes up of the whole corpus — in this case all 0.2 — are denoted as s1 to s5. Let’s assume we are interested in the word a in the corpus. The frequencies with which a occurs in each part are denoted as v1, v2, etc.; as you can see, a occurs once in the first part, twice in the second part, and so on, such that v1 =1, v2 =2, v3 =3, v4 =4, v5 =5. The vector with all observed frequencies (1, 2, 3, 4, 5) is referred to as v and the sum of all observed frequencies, i.e., the number of occurrences of a is referred to as f (f=Σv=15). Note also some other words’ distributions: x occurs only once in the whole corpus; e occurs once in each segment. + +The first are general stats and not specifically geared to the dispersion of linguistic items in texts but more often used as general indices of variation/dispersion: + +* range: number of parts containing term x times) = 5 (x is usually, but need not be, 1) +* max-min difference: max(v) - min(v) = 4 f +* standard deviation sd = sqrt( N*sum(x^2) - sum(x)^2 )/N +* Chi-squared χ : i=1 +* variation coefficient vcn: The coefficient of variation (CV) is defined as the ratio of the standard deviation sig to the mean mu http://en.m.wikipedia.org/wiki/Coefficient_of_variation * chi-squared χ : ∑ n − 1 ≈ 3.33 where expected + Next, a few well-known “classics” from the early 1970s: 1 + * Juilland et al.’s (1971) D: 1 −⎝ i =1⎝ in=1 ⎠ ≈ ⎠0.736 ⎝ ⎛⎝1⎛n−1⎠⎞⎞⎠1 +* Rosengren’s (1971) S: ⎜ ⎜ vcv ⎟ ⎟ ⋅ +* Carroll’s (1970) D : lo⎝g +* idf +* Juillands D adjusted for differing corpus sizes + * = D: : ≈ 7.906 ≈ 7.906 where v= + +To determine the degree of dispersion DP of word a in a corpus with n parts, one needs to take three simple steps. i. Determine the sizes s1−n of each of the n corpus parts, which are normalized against the overall corpus size and correspond to expected percentages which take differently-sized corpus parts into consideration ii. Determine the frequencies v1−n with which a occurs in the n corpus parts, which are normalized against the overall number of occurrences of a and cor- respond to an observed percentage. iii. Compute all n pairwise absolute differences of observed and expected per- centages, sum them up, and divide the result by two. The result is DP, which can theoretically range from approximately 0 to 1, where values close to 0 indicate that a is distributed across the n corpus parts as one would expect given the sizes of the n corpus parts. By contrast, values close to 1 indicate that a is distributed across the n corpus parts exactly the opposite way one would expect given the sizes of the n corpus parts +* Divide DP by 1 − (1/n) to yield DPnorm. 
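+
+To make that recipe concrete, here is a quick worked check (a sketch in Ruby, not part of the chapter's code) for the toy corpus quoted above: five equal-sized parts, so every expected share is 0.2, and the word "a" occurs 1, 2, 3, 4, 5 times out of f=15 total occurrences:
+
+----
+expected = [0.2] * 5                                # equal-sized parts: each is 20% of the corpus
+observed = [1, 2, 3, 4, 5].map { |v| v / 15.0 }     # each part's share of the 15 occurrences of "a"
+dp       = expected.zip(observed).map { |e, o| (e - o).abs }.inject(:+) / 2.0
+dp_norm  = dp / (1 - 1.0/5)
+puts dp.round(3), dp_norm.round(3)                  # => 0.2 and 0.25: close to 0, i.e. spread fairly evenly
+----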
+ + +Minimal DP’s Word DP Broad Dispersion: a 0.08 to 0.103 and 0.106 with 0.155 but 0.158 in 0.159 not 0.165 this 0.166 the 0.168 have 0.178 be 0.207 are 0.223 that 0.227 there 0.243 of 0.249 + Lumpy Dispersion: macari 0.999 10 mamluks 0.998 10 lemar 0.996 10 sem 0.994 10 hathor 0.994 10 tatars 0.989 10 scallop 0.989 10 malins 0.988 10 ft 0.986 102 defender 0.98 10 scudamore 0.98 10 pre 0.945 10 diamond 0.941 102 carl 0.938 102 proclaimed + +Medium Dispersion: includes thousands plain formal anywhere properly excuse hardly er each lot house tell came +------ + [[text_data]] == Text Data @@ -44,44 +202,9 @@ At a high level, what we'll do is this: * Identify prominent (unusually frequent) words on each grid cell * Identify words that are prominent on a large number of grid cells -- these have strong geographic "flavor" - -=== Match Wikipedia Article Text with Article Geolocation - -Let's start by assembling the data we need. The wikipedia dataset has three different tables for each article: the metadata for each page (page id, title, size, last update time, and so); the full text of each article (a very large field); and the article geolocations. Below are snippets from the articles table and of the geolocations table: - -[[wp_lexington_article]] -._Wikipedia article record for "Lexington, Texas"_ ------- -Lexington,_Texas Lexington is a town in Lee County, Texas, United States. ... Snow's BBQ, which Texas Monthly called "the best barbecue in Texas" and The New Yorker named "the best Texas BBQ in the world" is located in Lexington. ------- - - -[[wp_coords]] -._Article coordinates_ ------- -Lexington,_Texas -97.01 30.41 023130130 ------- - -Since we want to place the words in each article in geographic context, our first step is to reunite each article with its geolocation. - ----- -article_text = LOAD ('...'); -article_geolocations = LOAD('...'); -articles = JOIN article_geolocations BY page_id, article_text BY page_id; -articles = FOREACH articles GENERATE article_text::page_id, QUADKEY(lng, lat) as quadcell, text; ----- - -The quadkey field you see is a label for the grid cell containing an article's location; you'll learn all about them in the <> chapter, but for the moment just trust us that it's a clever way to divide up the world map for big data computation. Here's the result: - -[[wp_lexington_wordbag_and_coords]] -._Wordbag with coordinates_ ------- -Lexington,_Texas 023130130 Lexington is a town in Lee County, Texas ... ------- - === Decompose Wikipedia Articles -Next, we will summarize each article by preparing its "word bag" -- a simple count of the terms on its wikipedia page. We've written a Pig UDF (User-Defined Function) to do so: +Next, we will summarize each article by preparing its "word bag" -- a simple count of the terms on its wikipedia page. This is an application of "Exploding One Records into Many" (REF), but in this case we've written a Pig UDF (User-Defined Function) that uses a proper Linguistics toolkit (OpenNLP (?)) to perform the splitting. 
 ----
 REGISTER path/to/udf ...;
@@ -116,9 +239,34 @@ Lexington,_Texas 023130130 tot_usages num_terms 2 "bbq"
 Lexington,_Texas 023130130 tot_usages num_terms 1 "barbecue"
 ------
 
+=== Determining Frequency, Range and Dispersion of Terms within a Corpus
+
+* doc-term-usages-ct
+* doc-term-usages-rate
+* doc-term-frac
+* doc-all-terms-ct, doc-all-usages-ct
+* raw rate in doc; adjusted rate in doc
+* docs-ct -- count of documents
+* terms-ct -- count of terms
+* usages-ct -- count of usages
+* (doc_id, term, usages_ct, doc_term_usages_rate, all_term_usages_freq)
+
+The term "however" crops up at a rate of XXX ppm across XXX articles.
+In contrast, the word "function" accounts for XXX ppm of terms and XXX % of articles -- relatively fewer articles employ the term, but they tend to do so at a concentrated rate.
+(TODO: histogram of usage counts)
+We can get a sense of this "lumpiness" using the dispersion function footnote:[There are several competing formulas to capture the concept we have in mind; this version, "Juilland's D", is easy to calculate.]
+
+=== Map decorate
+
+Group all / count star (corpus-wide totals).
+
+
 === Term Statistics by Grid Cell
 
+Now comes the fun part.
+
 ----
 taf_g = GROUP term_article_freqs BY quadcell, term;
 cell_freqs = FOREACH taf_g GENERATE
@@ -225,6 +373,41 @@ In doing so, we've turned articles that have a geolocation into coarse-grained r
 
 Our next task -- the sprint home -- is to use a few more transforms and pivots to separate the signal from the background and, as far as possible, from the noise.
 
+=== Match Wikipedia Article Text with Article Geolocation
+
+Let's start by assembling the data we need. The wikipedia dataset has three different tables for each article: the metadata for each page (page id, title, size, last update time, and so on); the full text of each article (a very large field); and the article geolocations. Below are snippets from the articles table and from the geolocations table:
+
+[[wp_lexington_article]]
+._Wikipedia article record for "Lexington, Texas"_
+------
+Lexington,_Texas Lexington is a town in Lee County, Texas, United States. ... Snow's BBQ, which Texas Monthly called "the best barbecue in Texas" and The New Yorker named "the best Texas BBQ in the world" is located in Lexington.
+------
+
+
+[[wp_coords]]
+._Article coordinates_
+------
+Lexington,_Texas -97.01 30.41 023130130
+------
+
+Since we want to place the words in each article in geographic context, our first step is to reunite each article with its geolocation: a Direct Inner Join (pattern REF). While we've got the data streaming by, we also attach its quadtile sorting key ("Partition Points onto a Spatial Grid", REF), which labels the grid cell containing the article's location.
+
+----
+article_text = LOAD ('...');
+article_geolocations = LOAD('...');
+articles = JOIN article_geolocations BY page_id, article_text BY page_id;
+articles = FOREACH articles GENERATE article_text::page_id, QUADKEY(lng, lat) as quadcell, text;
+----
+
+Here's the result:
+
+[[wp_lexington_wordbag_and_coords]]
+._Wordbag with coordinates_
+------
+Lexington,_Texas 023130130 Lexington is a town in Lee County, Texas ...
+------
+
+
 ==== Pulling signal from noise
 
 To isolate the signal, we'll pull out a trick called <>. Though it may sound like an insurance holding company, in fact PMI is a simple approach to isolate the noise and background.
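+
+As a concrete preview -- a minimal sketch of our own (illustrative names and made-up numbers, not the book's production script), assuming the standard formulation of PMI as the log of how much more often a term appears on a grid cell than its corpus-wide rate would predict:
+
+------
+# counts of the kind the grouping steps above produce (made-up values)
+cell_term_ct  = 92           # usages of "barbecue" on this grid cell
+cell_total_ct = 18_400       # all term usages on this grid cell
+term_total_ct = 6_200        # usages of "barbecue" across the whole corpus
+all_total_ct  = 120_000_000  # all term usages across the whole corpus
+
+p_term_on_cell = cell_term_ct.to_f  / cell_total_ct
+p_term_overall = term_total_ct.to_f / all_total_ct
+
+pmi = Math.log2(p_term_on_cell / p_term_overall)  # positive: over-represented here
+------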
 It compares the following:
@@ -288,8 +471,6 @@ Next, we'll choose some _exemplars_: familiar records to trace through "Barbeque
 
 * https://github.com/Ganglion/varaha/blob/master/src/main/java/varaha/text/TokenizeText.java
 
-
-
 ------
 stream do |article|
   words = Wukong::TextUtils.tokenize(article.text, remove_stopwords: true)
@@ -301,8 +482,6 @@ end
 
 Reading it as prose the script says "for each article: break it into a list of words; group all occurrences of each word and count them; then output the article id, word and count."
 
-
-
 .Snippet from the Wikipedia article on "Barbecue"
 [quote, wikipedia, http://en.wikipedia.org/wiki/Barbeque]
 ____
@@ -357,7 +536,7 @@ To be lazy, add a 'pseudocount' to each term: pretend you saw it an extra small
 
 Consult a mathematician: for something that is mathematically justifiable, yet still simple enough to be minimally invasive, she will recommend "Good-Turing" smoothing.
 
-In this approach, we expand the dataset to include both the pool of counter for terms we saw, and an "absent" pool of fractional counts, to be shared by all the terms we _didn't_ see. Good-Turing says to count the terms that occurred once, and guess that an equal quantity of things _would_ have occurred once, but didn't. This is handwavy, but minimally invasive; we oughtn't say too much about the things we definitionally can't say much about.
+In this approach, we expand the dataset to include both the pool of counts for terms we saw, and an "absent" pool of fractional counts, to be shared by all the terms we _didn't_ see. Good-Turing says to count the terms that occurred once, and guess that an equal quantity of things _would_ have occurred once, but didn't. This is hand-wavy, but minimally invasive; we oughtn't say too much about the things we definitionally can't say much about.
 
 We then make the following adjustments:
@@ -400,5 +579,5 @@ We then make the following adjustments:
 // end
 //
 
-
+“If one advances confidently in the direction of his dreams, and endeavors to live the life which he has imagined, he will meet with a success unexpected in common hours.” — Henry David Thoreau, Walden: Or, Life in the Woods
diff --git a/25c-references.asciidoc b/25c-references.asciidoc
index 9246d50..ea42d49 100644
--- a/25c-references.asciidoc
+++ b/25c-references.asciidoc
@@ -107,6 +107,7 @@ SQL Cookbook
 * http://cogcomp.cs.illinois.edu/page/software_view/4
 * http://opennlp.apache.org/
+* http://www.movable-type.co.uk/scripts/latlong.html -- Calculate distance, bearing and more between Latitude/Longitude points
 
 ==== Locality-Sensitive Hashing & Sketching Algorithms ====
diff --git a/90-style_guide.asciidoc b/90-style_guide.asciidoc
new file mode 100644
index 0000000..d090fa6
--- /dev/null
+++ b/90-style_guide.asciidoc
@@ -0,0 +1,7 @@
+
+
+* variable naming: doc_term_n_usages, etc.
+* all inputs & outputs have an extension, even on a hadoop cluster: `wukong --run foo.rb ./data/bob.txt ./output/foo_bob.txt` (and so the result will be in `./output/foo_bob.txt/part-00000`)
+* make all file references use the `./` directory (i.e. the cwd in local mode and the home directory in HDFS mode).
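+
+A small illustration (hypothetical names, not a script from the book) of how the naming and path conventions look in practice:
+
+------
+# variables say exactly what they count, e.g. doc_term_n_usages
+doc_term_n_usages   = 92                       # times this term appears in this doc
+doc_n_usages        = 18_400                   # all term usages in this doc
+doc_term_usage_rate = doc_term_n_usages.to_f / doc_n_usages
+
+# inputs and outputs keep their extension and are addressed relative to ./ --
+# locally that is the current directory, on the cluster the HDFS home directory:
+#   input:  ./data/bob.txt
+#   output: ./output/foo_bob.txt  (parts land in ./output/foo_bob.txt/part-00000)
+------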
+
+
diff --git a/book.asciidoc b/book.asciidoc
index 64c4687..912e331 100644
--- a/book.asciidoc
+++ b/book.asciidoc
@@ -1,39 +1,3 @@
 = Big Data for Chimps
-
-include::02-part_one.asciidoc[]
-include::02-hadoop_basics.asciidoc[]
-
-include::03-map_reduce.asciidoc[]
-
-include::04-structural_operations.asciidoc[]
-
-include::05-part_two.asciidoc[]
-
-include::05-analytic_patterns-pipeline_operations.asciidoc[]
-
-include::06-analytic_patterns-structural_operations-grouping.asciidoc[]
-
-include::06-analytic_patterns-structural_operations-joining.asciidoc[]
-
-include::06-analytic_patterns-structural_operations-ordering.asciidoc[]
-
-include::06-analytic_patterns-structural_operations-uniquing.asciidoc[]
-
-include::07-part_three.asciidoc[]
-
-include::11-geographic.asciidoc[]
-
-include::11a-geodata-intro.asciidoc[]
-
-include::11b-spatial_aggregation-points.asciidoc[]
-
-include::11c-spatial_aggregations_on_regions.asciidoc[]
-
-include::11c-geospatial_mechanics.asciidoc[]
-
-include::11e-weather_near_you.asciidoc[]
-
-include::11d-quadtiles.asciidoc[]
-
-include::11f-data_formats.asciidoc[]
+include::11a-geodata-intro.asciidoc[]
\ No newline at end of file