Joins

Caution

FIXME

Broadcast Join (aka Map-Side Join)

Caution

FIXME: Review BroadcastNestedLoop.

You can use broadcast function to mark a Dataset to be broadcast when used in a join operator.

Note	According to the article Map-Side Join in Spark, broadcast join is also called a replicated join (in the distributed system community) or a map-side join (in the Hadoop community).

Note	At long last! I have always been wondering what a map-side join is and it appears I am close to uncover the truth!

And later in the article Map-Side Join in Spark, you can find that with the broadcast join, you can very effectively join a large table (fact) with relatively small tables (dimensions), i.e. to perform a star-schema join you can avoid sending all data of the large table over the network.

CanBroadcast object matches a LogicalPlan with output small enough for broadcast join.

Note	Currently statistics are only supported for Hive Metastore tables where the command `ANALYZE TABLE [tableName] COMPUTE STATISTICS noscan` has been run.

It uses spark.sql.autoBroadcastJoinThreshold setting to control the size of a table that will be broadcast to all worker nodes when performing a join.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spark-sql-joins.adoc

spark-sql-joins.adoc

Joins

Broadcast Join (aka Map-Side Join)

Files

spark-sql-joins.adoc

Latest commit

History

spark-sql-joins.adoc

File metadata and controls

Joins

Broadcast Join (aka Map-Side Join)