
Different arguments are sent to mapper if the script is executed on hadoop cluster #72

Open
ftofficer opened this issue Dec 2, 2012 · 0 comments


Recently I have been testing hadoopy in my Python-and-Hadoop project and ran into an issue. My project uses text-format input and output, and I want to use hadoopy to wrap my streaming tasks.

A typical input file is like this:

2012-11-30-19:11:53 deedf4179a041a78ff0947478937a3f6    3   <something> 
2012-11-30-19:14:45 e6106cf8d93540ef453eb1fd73db2754    3   <something>
2012-11-30-19:36:59 98667ae8ab2b3d9db10d4091fde7a35a    3   <something>
2012-11-30-20:02:23 7452bfc1ca3fcd51dffce74f5d0181d0    3   <something>

According to the documentation:
http://www.hadoopy.com/en/latest/api.html#task-functions-usable-inside-hadoopy-jobs

Specification of mapper/reducer/combiner
Input Key/Value Types
For TypedBytes/SequenceFileInputFormat, the Key/Value are the decoded TypedBytes
For TextInputFormat, the Key is a byte offset (int) and the Value is a line without the newline (string)

So I wrote the mapper as follows:

    import sys
    import logging

    import hadoopy

    def map(key, value):
        try:
            mid = value.split('\t', 2)[1]
            yield mid, 1
        except Exception as e:
            logging.error('exception: %s', e, exc_info=1)

    if __name__ == '__main__':
        logging.basicConfig(level=logging.INFO,
                            stream=sys.stderr)
        hadoopy.run(map)

I use value.split('\t', 2)[1] to extract the MID for further analysis, because the documentation says the following line will be passed to map as the value argument:

2012-11-30-19:11:53 deedf4179a041a78ff0947478937a3f6    3   <something> 
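To illustrate what I expect, here is a minimal sketch of that split on a sample line (assuming the fields are tab-separated, as in my input files):

```python
# Hypothetical sample line, assuming tab-separated fields as in my input.
line = '2012-11-30-19:11:53\tdeedf4179a041a78ff0947478937a3f6\t3\t<something>'

# maxsplit=2 keeps everything after the second tab in one piece;
# index 1 is the MID field.
mid = line.split('\t', 2)[1]
print(mid)  # -> deedf4179a041a78ff0947478937a3f6
```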

The script works well when I test locally, using the following command line:

hadoop fs -text /path/to/input/files | python test.py map

However, when I ran the script on the Hadoop cluster (with launch_frozen), it did not output the MIDs but a lot of '3's.

After some tracing and code reading, I found the following code fragment in hadoopy/main.pyx:

    def read_in_map(self):
        """Provides the input iterator to use

        If is_io_typedbytes() is true, then use TypedBytes.
        If is_on_hadoop() is true, then use Text as key\\tvalue\\n.
        Else, then use Text with key as byte offset and value as line (no \\n)

        Returns:
            Iterator that can be called to get KeyValue pairs.
        """
        if self.is_io_typedbytes():
            return KeyValueStream(self.tb.__next__)
        if self.is_on_hadoop():
            return KeyValueStream(self.read_key_value_text)
        return KeyValueStream(self.read_offset_value_text)

When the script is_on_hadoop(), \t-split key-value pairs are passed to the mapper, but when testing locally, the map function gets the file offset and the entire line.
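A minimal sketch of the two parsing behaviors (not using hadoopy itself, just reproducing the two code paths as I understand them) shows why my mapper yields '3' on the cluster:

```python
line = '2012-11-30-19:11:53\tdeedf4179a041a78ff0947478937a3f6\t3\t<something>'

# Local path: key is a byte offset, value is the whole line.
local_key, local_value = 0, line
print(local_value.split('\t', 2)[1])   # the MID, as intended

# On-Hadoop path: the line is split at the first tab into key and value,
# so value starts at the MID and index 1 is the next field instead.
hadoop_key, hadoop_value = line.split('\t', 1)
print(hadoop_value.split('\t', 2)[1])  # '3' -- the count field, not the MID
```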

I wonder why the two conditions need to be distinguished. Consistent behavior between the testing and production environments would simplify things a lot.

Currently I have monkey-patched hadoopy.run to wrap the mapper function, making sure the offset and the entire line are sent to map. But I really hope hadoopy can support a consistent mapper specification out of the box.
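For reference, my workaround looks roughly like this (a hypothetical sketch; `on_hadoop` stands in for whatever check detects the cluster environment, and hadoopy's actual internals may differ):

```python
def make_consistent(mapper, on_hadoop):
    """Wrap a mapper so it always sees (offset, whole_line), even when
    the runtime has already split the input line into key\tvalue."""
    def wrapped(key, value):
        if on_hadoop:
            # Hadoop streaming splits the line at the first tab, so
            # re-joining key and value recovers the original line exactly;
            # a dummy offset matches the local TextInputFormat view.
            key, value = 0, '%s\t%s' % (key, value)
        for out in mapper(key, value):
            yield out
    return wrapped
```

The mapper itself can then stay identical in both environments, e.g. `hadoopy.run(make_consistent(map, on_hadoop))`.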

Thanks for your effort on this great project.

Regards,
ftofficer
