
Different arguments are sent to mapper if the script is executed on hadoop cluster #72

Open
ftofficer opened this issue Dec 2, 2012 · 0 comments


Recently I have been testing hadoopy in my Python-and-Hadoop project and ran into an issue. My project uses text-format input and output, and I want to use hadoopy to wrap my streaming tasks.

A typical input file is like this:

2012-11-30-19:11:53 deedf4179a041a78ff0947478937a3f6    3   <something> 
2012-11-30-19:14:45 e6106cf8d93540ef453eb1fd73db2754    3   <something>
2012-11-30-19:36:59 98667ae8ab2b3d9db10d4091fde7a35a    3   <something>
2012-11-30-20:02:23 7452bfc1ca3fcd51dffce74f5d0181d0    3   <something>

According to the documentation:
http://www.hadoopy.com/en/latest/api.html#task-functions-usable-inside-hadoopy-jobs

Specification of mapper/reducer/combiner
Input Key/Value Types
For TypedBytes/SequenceFileInputFormat, the Key/Value are the decoded TypedBytes
For TextInputFormat, the Key is a byte offset (int) and the Value is a line without the newline (string)

So I wrote the mapper as follows:

    import sys
    import logging

    import hadoopy

    def map(key, value):
        try:
            mid = value.split('\t', 2)[1]
            yield mid, 1
        except Exception as e:
            logging.error('exception: %s', e, exc_info=1)

    if __name__ == '__main__':
        logging.basicConfig(level=logging.INFO,
                            stream=sys.stderr)
        hadoopy.run(map)

I use value.split('\t', 2)[1] to extract the MID for further analysis, because the documentation says the following line will be passed to map as the value argument:

2012-11-30-19:11:53 deedf4179a041a78ff0947478937a3f6    3   <something> 
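To illustrate what I expect, here is a minimal sketch of that split on a sample line (assuming the fields are tab-separated, as in my input files):

```python
# Hypothetical sample line, assuming tab-separated fields as in my input.
line = '2012-11-30-19:11:53\tdeedf4179a041a78ff0947478937a3f6\t3\t<something>'

# maxsplit=2 keeps everything after the second tab in one piece;
# index 1 is the MID field.
mid = line.split('\t', 2)[1]
print(mid)  # -> deedf4179a041a78ff0947478937a3f6
```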

The script works well when I test locally, using the following command line:

hadoop fs -text /path/to/input/files | python test.py map

However, when I ran the script on the Hadoop cluster (with launch_frozen), it did not output the MIDs but a lot of '3's.

After some tracing and code reading, I found the following code fragment in hadoopy/main.pyx:

    def read_in_map(self):
        """Provides the input iterator to use

        If is_io_typedbytes() is true, then use TypedBytes.
        If is_on_hadoop() is true, then use Text as key\\tvalue\\n.
        Else, then use Text with key as byte offset and value as line (no \\n)

        Returns:
            Iterator that can be called to get KeyValue pairs.
        """
        if self.is_io_typedbytes():
            return KeyValueStream(self.tb.__next__)
        if self.is_on_hadoop():
            return KeyValueStream(self.read_key_value_text)
        return KeyValueStream(self.read_offset_value_text)

When the script is_on_hadoop(), \t-split key-value pairs are passed to the mapper, but when testing locally, the map function gets the file offset and the entire line.
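A minimal sketch of the two parsing behaviors (not using hadoopy itself, just reproducing the two code paths as I understand them) shows why my mapper yields '3' on the cluster:

```python
line = '2012-11-30-19:11:53\tdeedf4179a041a78ff0947478937a3f6\t3\t<something>'

# Local path: key is a byte offset, value is the whole line.
local_key, local_value = 0, line
print(local_value.split('\t', 2)[1])   # the MID, as intended

# On-Hadoop path: the line is split at the first tab into key and value,
# so value starts at the MID and index 1 is the next field instead.
hadoop_key, hadoop_value = line.split('\t', 1)
print(hadoop_value.split('\t', 2)[1])  # '3' -- the count field, not the MID
```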

I wonder why the two conditions need to be distinguished. Consistent behavior between the testing and production environments would simplify things a lot.

Currently I have monkey-patched hadoopy.run to wrap the mapper function, making sure the offset and the entire line are sent to map. But I really hope hadoopy can support a consistent mapper specification out of the box.
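For reference, my workaround looks roughly like this (a hypothetical sketch; `on_hadoop` stands in for whatever check detects the cluster environment, and hadoopy's actual internals may differ):

```python
def make_consistent(mapper, on_hadoop):
    """Wrap a mapper so it always sees (offset, whole_line), even when
    the runtime has already split the input line into key\tvalue."""
    def wrapped(key, value):
        if on_hadoop:
            # Hadoop streaming splits the line at the first tab, so
            # re-joining key and value recovers the original line exactly;
            # a dummy offset matches the local TextInputFormat view.
            key, value = 0, '%s\t%s' % (key, value)
        for out in mapper(key, value):
            yield out
    return wrapped
```

The mapper itself can then stay identical in both environments, e.g. `hadoopy.run(make_consistent(map, on_hadoop))`.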

Thanks for your effort on this great project.

Regards,
ftofficer
