spark-sql-joins.adoc

Joins

Caution
FIXME

Broadcast Join (aka Map-Side Join)

Caution
FIXME: Review BroadcastNestedLoop.

You can use the broadcast function to mark a Dataset to be broadcast when it is used in a join operator.
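
As a minimal sketch of the hint, the snippet below joins two tiny Datasets and marks the smaller one with broadcast. The table contents, column names, and the local master are assumptions made up for illustration; paste it into spark-shell or run it as a script with Spark SQL on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Local session for experimentation only (assumption: no cluster is needed here).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("broadcast-hint-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: a "large" fact table and a small dimension table.
val facts = Seq((1, 100.0), (2, 250.0), (1, 75.0)).toDF("dim_id", "amount")
val dims  = Seq((1, "books"), (2, "games")).toDF("id", "name")

// broadcast(dims) only marks the Dataset; the planner decides the physical join.
val joined = facts.join(broadcast(dims), $"dim_id" === $"id")

// Print the physical plan to check which join strategy was selected.
joined.explain()
```

With an equi-join like this one, the plan printed by explain should normally contain a broadcast-based join operator, but the exact operator names depend on the Spark version.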

Note
According to the article Map-Side Join in Spark, broadcast join is also called a replicated join (in the distributed system community) or a map-side join (in the Hadoop community).
Note
At long last! I have always wondered what a map-side join is, and it appears I am close to uncovering the truth!

Later, the same article (Map-Side Join in Spark) explains that a broadcast join lets you very efficiently join a large (fact) table with relatively small (dimension) tables, i.e. you can perform a star-schema join without sending all the data of the large table over the network.
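
To make the idea concrete, here is a sketch of the map-side join mechanism in plain Scala collections, without Spark. It is not how Spark implements the join internally; it only illustrates the principle that every partition of the large table looks rows up in a full local copy of the small table, so the large table never has to be shuffled. The Sale case class, the sample data, and the partition size are all hypothetical.

```scala
// Hypothetical fact rows (the "large" table).
case class Sale(productId: Int, amount: Double)
val sales: Seq[Sale] =
  Seq(Sale(1, 100.0), Sale(2, 250.0), Sale(1, 75.0), Sale(3, 10.0))

// Hypothetical dimension table, small enough to copy to every worker.
val products: Map[Int, String] = Map(1 -> "books", 2 -> "games")

// Pretend the fact table is split into partitions of two rows each.
val partitions: Seq[Seq[Sale]] = sales.grouped(2).toSeq

// Each partition joins against its local copy of the small table.
// Rows without a matching product are dropped, i.e. an inner join.
val joinedRows: Seq[(Int, Double, String)] = partitions.flatMap { part =>
  part.flatMap { s =>
    products.get(s.productId).map(name => (s.productId, s.amount, name))
  }
}

joinedRows.foreach(println)
```

The only data that would travel over the network in a distributed setting is the small products map, once per worker.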

The CanBroadcast object matches a LogicalPlan whose output is small enough for a broadcast join.

Note
Currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE [tableName] COMPUTE STATISTICS noscan has been run.

CanBroadcast uses the spark.sql.autoBroadcastJoinThreshold setting to control the maximum size of a table that will be broadcast to all worker nodes when performing a join.
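
The following sketch shows one way to inspect and change the threshold at runtime through the session configuration. The 50 MB value and the local master are arbitrary assumptions for illustration; the setting is expressed in bytes, and -1 disables automatic broadcasting.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation only (assumption).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("broadcast-threshold-sketch")
  .getOrCreate()

val key = "spark.sql.autoBroadcastJoinThreshold"

// Show the value currently in effect (the format differs across Spark versions).
println(spark.conf.get(key))

// Raise the threshold to a hypothetical 50 MB, given in bytes.
spark.conf.set(key, 50L * 1024 * 1024)
val raised = spark.conf.get(key)

// Disable size-based automatic broadcasting altogether.
spark.conf.set(key, -1L)
val disabled = spark.conf.get(key)
```

Disabling the threshold only turns off the automatic, size-based choice; marking a Dataset explicitly with the broadcast function remains possible.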