
Distributed File System Planning

The goal of the distributed file system is to distribute the mapreduce input files across the cluster and to store the final output. The file system only needs to support writing a file once, as files are never modified during a mapreduce operation, which greatly simplifies the implementation.

The majority of interaction with the distributed file system will be done through a DataAbstractionLayer implementation to avoid having to change large amounts of the worker code.
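
As a rough sketch of what that layer could look like, the trait below reuses the DataAbstractionLayer name from this page, but the method names and signatures are assumptions rather than the project's actual API. The local-disk implementation stands for the current behaviour; a distributed implementation would talk to the master and other workers instead.

```rust
use std::fs;
use std::io::Result;
use std::path::Path;

/// Illustrative trait only; the real trait and method names may differ.
trait DataAbstractionLayer {
    /// Read the whole file at `path` into memory.
    fn read_file(&self, path: &Path) -> Result<Vec<u8>>;
    /// Write `data` as a new file at `path` (write-once, never modified).
    fn write_file(&self, path: &Path, data: &[u8]) -> Result<()>;
    /// Length of the file in bytes, used for splitting without reading it.
    fn file_length(&self, path: &Path) -> Result<u64>;
}

/// Local-disk implementation; a distributed implementation would instead
/// contact the master and fetch chunks from other workers.
struct LocalLayer;

impl DataAbstractionLayer for LocalLayer {
    fn read_file(&self, path: &Path) -> Result<Vec<u8>> {
        fs::read(path)
    }
    fn write_file(&self, path: &Path, data: &[u8]) -> Result<()> {
        fs::write(path, data)
    }
    fn file_length(&self, path: &Path) -> Result<u64> {
        Ok(fs::metadata(path)?.len())
    }
}

fn main() -> Result<()> {
    let layer = LocalLayer;
    layer.write_file(Path::new("/tmp/dfs_demo.txt"), b"hello")?;
    println!("length: {}", layer.file_length(Path::new("/tmp/dfs_demo.txt"))?);
    Ok(())
}
```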

User Interaction

As implementing FUSE would be a lot of work, user interaction will be done through the CLI. For now there will be two commands: one to upload input files to the system and one to download the result files. In the future we could consider more commands, such as the ability to list files in the output directory.

The upload command should block until the file has been replicated to a number of workers. It should support uploading to any file path, and input files should not all have to be uploaded at once.
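
A minimal sketch of that blocking behaviour is below; the chunk size, replication factor, and the `send_chunk` callback are all assumed placeholders for the real values and RPCs.

```rust
const CHUNK_SIZE: u64 = 64 * 1024 * 1024; // assumed chunk size
const REPLICATION_FACTOR: usize = 3;      // assumed number of replicas

/// Split the file into fixed-size chunks and only return once every chunk
/// has been acknowledged by REPLICATION_FACTOR workers. `send_chunk` stands
/// in for the real upload RPC and returns the number of acknowledgements.
fn upload_blocking(file_len: u64, send_chunk: impl Fn(u64, u64) -> usize) -> bool {
    let mut offset = 0;
    while offset < file_len {
        let len = CHUNK_SIZE.min(file_len - offset);
        if send_chunk(offset, len) < REPLICATION_FACTOR {
            return false; // chunk did not reach the required replica count
        }
        offset += len;
    }
    true
}

fn main() {
    // Pretend every chunk was stored on three workers.
    let replicated = upload_blocking(200 * 1024 * 1024, |_offset, _len| 3);
    println!("upload replicated: {}", replicated);
}
```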

File Distribution

Files will be split into chunks and distributed among a number of workers. The master will not store any file data itself; only the workers will. The master will keep track of the location of every chunk and will eventually be responsible for redistributing chunks, but redistribution will not be implemented in the first version.
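
The master's bookkeeping could be as simple as a map from chunk to replica locations. The `ChunkId` and `ChunkDirectory` types below are assumptions for illustration only; the point is that the master stores only locations, never chunk data.

```rust
use std::collections::HashMap;

/// Illustrative chunk identifier: the master tracks where every chunk
/// lives, but never stores the chunk bytes itself.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ChunkId {
    file_index: usize, // index into a file table kept by the master
    chunk_index: u64,  // which chunk of that file
}

#[derive(Default)]
struct ChunkDirectory {
    /// Chunk -> addresses of the workers currently holding a replica.
    locations: HashMap<ChunkId, Vec<String>>,
}

impl ChunkDirectory {
    fn record_replica(&mut self, chunk: ChunkId, worker_addr: String) {
        self.locations.entry(chunk).or_default().push(worker_addr);
    }

    fn workers_for(&self, chunk: ChunkId) -> &[String] {
        self.locations.get(&chunk).map(|v| v.as_slice()).unwrap_or(&[])
    }
}

fn main() {
    let mut directory = ChunkDirectory::default();
    let chunk = ChunkId { file_index: 0, chunk_index: 2 };
    directory.record_replica(chunk, "10.0.0.5:4000".to_string());
    println!("replicas of {:?}: {:?}", chunk, directory.workers_for(chunk));
}
```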

File Access

Workers will contact the master to look up the location of a file they want to read. We could instead pass file locations as part of the input locations, but a file may need to be redistributed if workers go offline. Having workers contact the master adds little overhead and leaves room for file redistribution.
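
A sketch of this lookup-then-fetch step is below, with assumed request and response types and closures standing in for the master and worker RPCs.

```rust
/// Illustrative request/response types; the real system would define these
/// as RPC messages.
struct ChunkLocationRequest {
    file_path: String,
    chunk_index: u64,
}

struct ChunkLocationResponse {
    /// Workers currently holding a replica of the requested chunk.
    worker_addrs: Vec<String>,
}

/// Ask the master where the chunk lives, then fetch the bytes from any one
/// of the listed workers, skipping replicas that fail to respond.
fn fetch_chunk(
    ask_master: impl Fn(&ChunkLocationRequest) -> ChunkLocationResponse,
    fetch_from_worker: impl Fn(&str, &ChunkLocationRequest) -> Option<Vec<u8>>,
    request: &ChunkLocationRequest,
) -> Option<Vec<u8>> {
    let response = ask_master(request);
    response
        .worker_addrs
        .iter()
        .find_map(|addr| fetch_from_worker(addr.as_str(), request))
}

fn main() {
    let request = ChunkLocationRequest {
        file_path: "/input/part-00".to_string(),
        chunk_index: 0,
    };
    let data = fetch_chunk(
        // Stand-ins for the master lookup and the worker-to-worker fetch.
        |_req: &ChunkLocationRequest| ChunkLocationResponse {
            worker_addrs: vec!["10.0.0.5:4000".to_string()],
        },
        |_addr: &str, _req: &ChunkLocationRequest| Some(b"chunk bytes".to_vec()),
        &request,
    );
    println!("fetched {} bytes", data.map(|d| d.len()).unwrap_or(0));
}
```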

Changes Required

File splitting should be changed so that it no longer requires reading the files, as that would consume a lot of unnecessary network bandwidth. Direct file access will no longer be possible, so code that uses the open-file and create-file methods of the data abstraction layer will have to be rewritten.
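
Split boundaries can be computed from the file length alone, so no file contents have to cross the network just to plan the splits. The split size below is an assumed value.

```rust
const SPLIT_SIZE: u64 = 64 * 1024 * 1024; // assumed split size in bytes

/// Compute (offset, length) pairs covering a file of `file_len` bytes,
/// using only its length; the file contents are never read.
fn compute_splits(file_len: u64) -> Vec<(u64, u64)> {
    let mut splits = Vec::new();
    let mut offset = 0;
    while offset < file_len {
        let len = SPLIT_SIZE.min(file_len - offset);
        splits.push((offset, len));
        offset += len;
    }
    splits
}

fn main() {
    // A 150 MiB file yields two full 64 MiB splits and one 22 MiB split.
    for (offset, len) in compute_splits(150 * 1024 * 1024) {
        println!("split at {} with length {}", offset, len);
    }
}
```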

User Flow

  1. User uploads files to the distributed file system.

  2. These files are replicated among a number of workers.

  3. User starts mapreduce job with the path of the files on the filesystem.

  4. Once the job completes, the user downloads the output files from the distributed file system, or runs a new job with these files as input.

Worker Flow

  1. Worker gets Map task.

  2. Worker checks whether it already has the file chunk; if not, it requests the chunk's location from the master (sketched below).

  3. Worker requests the file chunk from one of the other workers that has the chunk.

The worker will also upload reduce output files to the distributed file system in a similar fashion to how the user uploads input files.
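
A sketch of steps 2 and 3 of this flow is below, assuming a simple in-memory chunk store keyed by file path and chunk index, with a placeholder closure standing in for the master lookup and the peer fetch.

```rust
use std::collections::HashMap;

/// Local chunk cache keyed by (file path, chunk index). Illustrative only.
struct LocalChunkStore {
    chunks: HashMap<(String, u64), Vec<u8>>,
}

impl LocalChunkStore {
    /// Return the chunk from the local store if present; otherwise fetch it
    /// (via the master and a peer worker, represented here by `fetch_remote`)
    /// and cache it for later tasks.
    fn get_or_fetch(
        &mut self,
        file_path: &str,
        chunk_index: u64,
        fetch_remote: impl FnOnce(&str, u64) -> Vec<u8>,
    ) -> &[u8] {
        self.chunks
            .entry((file_path.to_string(), chunk_index))
            .or_insert_with(|| fetch_remote(file_path, chunk_index))
    }
}

fn main() {
    let mut store = LocalChunkStore { chunks: HashMap::new() };
    // First access fetches from a peer worker; later accesses are local.
    let bytes = store.get_or_fetch("/input/part-00", 0, |_path: &str, _idx: u64| {
        b"chunk bytes".to_vec()
    });
    println!("chunk has {} bytes", bytes.len());
}
```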
