
Spark and Hadoop

SparkHadoopUtil

Caution
FIXME
Tip

Enable DEBUG logging level for org.apache.spark.deploy.SparkHadoopUtil logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.deploy.SparkHadoopUtil=DEBUG

Refer to Logging.

runAsSparkUser

runAsSparkUser(func: () => Unit)

runAsSparkUser runs the input func function block with Hadoop's UserGroupInformation of the current user.

Internally, it reads the current username (from the SPARK_USER environment variable or, if that is not set, the short user name from Hadoop's UserGroupInformation).

Caution
FIXME How to use SPARK_USER to change the current user name?

You should see the current username printed out in the following DEBUG message in the logs:

DEBUG YarnSparkHadoopUtil: running as user: [user]

It then creates a remote user for the current user (using UserGroupInformation.createRemoteUser), transfers credential tokens and runs the input func function as the privileged user.
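
The steps above can be sketched with plain Hadoop security API calls. The following is a minimal, hypothetical sketch of the pattern, not Spark's exact implementation:

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// A sketch of the runAsSparkUser pattern: SPARK_USER wins,
// otherwise fall back to the short user name Hadoop reports.
def runAsSparkUser(func: () => Unit): Unit = {
  val user = Option(System.getenv("SPARK_USER"))
    .getOrElse(UserGroupInformation.getCurrentUser.getShortUserName)

  // Create a remote UGI for that user and copy the current user's credential tokens over
  val ugi = UserGroupInformation.createRemoteUser(user)
  ugi.addCredentials(UserGroupInformation.getCurrentUser.getCredentials)

  // Run the function block as the (privileged) remote user
  ugi.doAs(new PrivilegedExceptionAction[Unit] {
    override def run(): Unit = func()
  })
}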

transferCredentials

Caution
FIXME

newConfiguration

Caution
FIXME

Creating SparkHadoopUtil Instance (get method)

Caution
FIXME

Hadoop Storage Formats

The currently-supported Hadoop storage formats typically used with HDFS are:

Caution
FIXME What are the differences between the formats and how are they used in Spark.

Introduction to Hadoop

Note
This page is the place to keep more general information about Hadoop that is not related to Spark on YARN or to Using Input and Output (I/O) (HDFS). I don't really know what it could be, though. Perhaps nothing at all. Just saying.

From Apache Hadoop's web site:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

  • Originally, Hadoop was an umbrella term for the following (core) modules:

    • HDFS (Hadoop Distributed File System) is a distributed file system designed to run on commodity hardware. It is a data storage with files split across a cluster.

    • MapReduce - the compute engine for batch processing

    • YARN (Yet Another Resource Negotiator) - the resource manager

  • Currently, the term refers more to the whole ecosystem of solutions that all use the Hadoop infrastructure for their work.

People have reported doing wonders with the software, with Yahoo! saying:

Yahoo has progressively invested in building and scaling Apache Hadoop clusters with a current footprint of more than 40,000 servers and 600 petabytes of storage spread across 19 clusters.

Besides the numbers, Yahoo! reported that:

Deep learning can be defined as first-class steps in Apache Oozie workflows with Hadoop for data processing and Spark pipelines for machine learning.

You can find some preliminary information about Spark pipelines for machine learning in the chapter ML Pipelines.

HDFS provides fast analytics – scanning over large amounts of data very quickly, but it was not built to handle updates. If data changed, it would need to be appended in bulk after a certain volume or time interval, preventing real-time visibility into this data.

  • HBase complements HDFS’ capabilities by providing fast and random reads and writes and supporting updating data, i.e. serving small queries extremely quickly, and allowing data to be updated in place.

When Spark reads a file from HDFS, it creates a single partition for a single input split. The input split is determined by the Hadoop InputFormat used to read the file. For instance, if you use textFile() it is TextInputFormat in Hadoop, which returns a single partition for a single HDFS block (although the boundary between partitions is aligned to line breaks rather than the exact block boundary), unless you have a compressed text file. In the case of a compressed file you get a single partition for the whole file, as compressed text files are not splittable.
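
A small spark-shell illustration of the above (the paths are hypothetical):

val uncompressed = sc.textFile("hdfs:///data/logs/2016-03-01.log")    // splittable text file
val gzipped      = sc.textFile("hdfs:///data/logs/2016-03-01.log.gz") // gzip is not splittable

uncompressed.getNumPartitions  // one partition per HDFS block (input split)
gzipped.getNumPartitions       // 1, i.e. a single partition for the whole compressed file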

If you have a 30GB uncompressed text file stored on HDFS, then with the default HDFS block size setting (128MB) it is stored in roughly 235 blocks, which means that the RDD you read from this file has roughly 235 partitions. When you call repartition(1000), your RDD is marked for repartitioning, but it is actually shuffled to 1000 partitions only when you execute an action on top of this RDD (the lazy evaluation concept).
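
The same in code (assuming such a 30GB file exists at the hypothetical path):

val lines = sc.textFile("hdfs:///data/big-30gb.txt")
lines.getNumPartitions          // ~235, i.e. one per 128MB HDFS block

val repartitioned = lines.repartition(1000)
repartitioned.getNumPartitions  // 1000, but no job has run yet

repartitioned.count()           // the action finally triggers the shuffle to 1000 partitions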

With HDFS you can store any data (regardless of format and size). It can easily handle unstructured data like video or other binary files as well as semi- or fully-structured data like CSV files or databases.

There is also the concept of a data lake, i.e. a huge data repository to support analytics.

HDFS partitions files into so-called splits and distributes them across multiple nodes in a cluster to achieve failover and resiliency.

MapReduce happens in three phases: Map, Shuffle, and Reduce.
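
As a point of reference, here is the classic word count expressed with the same three phases using Spark's API (the input and output paths are hypothetical):

sc.textFile("hdfs:///data/input.txt")              // read the input
  .flatMap(_.split("\\s+"))                        // Map: split lines into words
  .map(word => (word, 1))                          //      and emit (key, value) pairs
  .reduceByKey(_ + _)                              // Shuffle by key, then Reduce: sum the counts
  .saveAsTextFile("hdfs:///data/wordcount-output") // write the result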