
Data Sources in Spark

Spark can access data from many data sources, including Hadoop Distributed File System (HDFS), Cassandra, HBase, S3 and many more.

Spark offers different APIs for reading data, depending on the content of the data and on where it is stored.

By content, data falls into two groups:

binary
text
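The content distinction can be sketched with PySpark's `SparkContext`, which has separate entry points for text and binary files. This is a minimal sketch rather than the book's own code: the helper names `read_text_lines`, `read_binary_files` and `demo`, and the `README.md` path, are hypothetical, and `demo` assumes `pyspark` is installed.

```python
def read_text_lines(spark, path):
    """Text content: returns an RDD with one string element per line."""
    return spark.sparkContext.textFile(path)


def read_binary_files(spark, path):
    """Binary content: returns an RDD of (file path, file bytes) pairs."""
    return spark.sparkContext.binaryFiles(path)


def demo():
    # Imported here so the helpers above stay importable without pyspark.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("content-groups-sketch")  # hypothetical app name
             .getOrCreate())
    try:
        # "README.md" is a placeholder path; use any local file or directory.
        print(read_text_lines(spark, "README.md").count())
        print(read_binary_files(spark, "README.md").keys().collect())
    finally:
        spark.stop()
```

Calling `demo()` on a machine with Spark available should print the line count of the file and the path of each file that was read as binary.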

You can also group data by where it is stored:

files
databases, e.g. Cassandra
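The storage distinction shows up in how a `DataFrameReader` is configured. The sketch below assumes a `SparkSession` named `spark` is passed in. For Cassandra it also assumes the separate Spark Cassandra Connector is on the classpath, since Cassandra support does not ship with Spark itself. The helper names, keyspace and table are hypothetical.

```python
def read_from_file(spark, path):
    """File storage: load a text file as a DataFrame with one 'value' column."""
    return spark.read.text(path)


def read_from_cassandra(spark, keyspace, table):
    """Database storage: load a Cassandra table as a DataFrame.

    Assumption: the Spark Cassandra Connector package is available and
    spark.cassandra.connection.host points at a reachable cluster.
    """
    return (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace=keyspace, table=table)
            .load())
```

Both helpers return a DataFrame, so the code that follows a read does not need to know which kind of storage the data came from.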