Caution
|
FIXME |
Caution
|
FIXME: Review BroadcastNestedLoop .
|
Note
|
According to the article Map-Side Join in Spark, broadcast join is also called a replicated join (in the distributed system community) or a map-side join (in the Hadoop community). |
Note
|
At long last! I have always been wondering what a map-side join is and it appears I am close to uncover the truth! |
And later in the article Map-Side Join in Spark, you can find that with the broadcast join, you can very effectively join a large table (fact) with relatively small tables (dimensions), i.e. to perform a star-schema join you can avoid sending all data of the large table over the network.
CanBroadcast
object matches a LogicalPlan with output small enough for broadcast join.
Note
|
Currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE [tableName] COMPUTE STATISTICS noscan has been run.
|
It uses spark.sql.autoBroadcastJoinThreshold setting to control the size of a table that will be broadcast to all worker nodes when performing a join.