Recently I have been testing hadoopy in my python-and-hadoop project and ran into an issue. My project uses text-format input and output, and I want to use hadoopy to wrap my streaming tasks.

A typical input file is like this:

According to the documentation (http://www.hadoopy.com/en/latest/api.html#task-functions-usable-inside-hadoopy-jobs):
Specification of mapper/reducer/combiner, Input Key/Value Types:

- For TypedBytes/SequenceFileInputFormat, the Key/Value are the decoded TypedBytes.
- For TextInputFormat, the Key is a byte offset (int) and the Value is a line without the newline (string).
So I wrote my map function accordingly, using value.split('\t', 2)[1] to extract the MID for further analysis, since the documentation says the line (without the newline) is passed to map as the value argument.
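The original map function was not preserved in this issue; a minimal sketch of what it presumably looked like, assuming tab-separated input lines of the form `uid\tMID\tpayload` (these field names are hypothetical):

```python
def mapper(key, value):
    # Under TextInputFormat, locally: key is the byte offset (int) and
    # value is the whole line without the trailing newline (str).
    mid = value.split('\t', 2)[1]  # second tab-separated field = MID
    yield mid, 1
```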
The script works well when I test it locally from the command line. However, when I ran it on the hadoop cluster (with launch_frozen), it did not output the MIDs, but a lot of '3's.
After some tracing and code reading, I found the following fragment in the hadoopy/main.pyx file:
```python
def read_in_map(self):
    """Provides the input iterator to use.

    If is_io_typedbytes() is true, then use TypedBytes.
    If is_on_hadoop() is true, then use Text as key\\tvalue\\n.
    Else, then use Text with key as byte offset and value as line (no \\n).

    Returns:
        Iterator that can be called to get KeyValue pairs.
    """
    if self.is_io_typedbytes():
        return KeyValueStream(self.tb.__next__)
    if self.is_on_hadoop():
        return KeyValueStream(self.read_key_value_text)
    return KeyValueStream(self.read_offset_value_text)
```
When the script is_on_hadoop(), key/value pairs split on the first tab are passed to the mapper, but when testing locally, the map function gets the file offset and the entire line.
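A small standalone illustration of that divergence (the sample line and its fields are made up, not taken from my actual data):

```python
# One raw input line, as stored in the text file.
line = "u123\tm456\tsome payload"

# Local testing (TextInputFormat convention): key = byte offset, value = whole line.
local_key, local_value = 0, line

# On Hadoop (is_on_hadoop()): the framework splits the line on the first tab.
hadoop_key, hadoop_value = line.split('\t', 1)

# The same extraction therefore yields different fields:
print(local_value.split('\t', 2)[1])   # -> m456 (the MID, as intended)
print(hadoop_value.split('\t', 2)[1])  # -> some payload (wrong field)
```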
I wonder why the two conditions need to be distinguished. Consistent behavior between the testing and production environments would simplify things a lot.
Currently I have monkey-patched hadoopy.run to wrap the mapper function, making sure the offset and the entire line are sent to map. But I really hope hadoopy can support a consistent map specification out of the box.
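hadoopy.run internals aside, the essence of such a wrapper might look like this (make_offset_line_mapper and my_map are hypothetical names, not hadoopy API):

```python
def make_offset_line_mapper(mapper):
    """Wrap a mapper expecting (byte offset, whole line) so it also works
    when Hadoop has already split the line on the first tab."""
    def wrapped(key, value):
        # Re-join the pre-split key\tvalue pair into the original line and
        # present it with a dummy offset, matching local TextInputFormat behavior.
        line = '%s\t%s' % (key, value)
        return mapper(0, line)
    return wrapped

def my_map(key, value):
    # Expects (byte offset, whole line); extracts the second tab field.
    return [value.split('\t', 2)[1]]
```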
Thanks for your effort on this great project.
Regards,
ftofficer