Word Count

The purpose of this folder is to present multiple solutions (using DataFrames and RDDs) to the classic word count problem.


"... This book will be a great resource for
both readers looking to implement existing
algorithms in a scalable fashion and readers
who are developing new, custom algorithms
using Spark. ..."

Dr. Matei Zaharia
Original Creator of Apache Spark

FOREWORD by Dr. Matei Zaharia

Word Count using Spark DataFrames

word_count_by_dataframe.log
word_count_by_dataframe.py

word_count_by_dataframe_shorthand.log
word_count_by_dataframe_shorthand.py

Word Count using Spark RDDs

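Solutions are provided using the reduceByKey(), groupByKey(), and combineByKey() reducers. In general, the solutions using reduceByKey() and combineByKey() scale out better than the one using groupByKey(): they combine the values for each key within every partition before the shuffle, whereas groupByKey() sends every (word, 1) pair across the network and aggregates only afterwards.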
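
To make the scaling difference concrete, here is a minimal plain-Python sketch. It is not Spark code and is not taken from the scripts in this folder. It only imitates the two shuffle strategies on a small made-up input. The `partitions` data and both function names are hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical input: each inner list stands in for one partition of words.
partitions = [
    ["fox", "jumped", "fox", "fox"],
    ["fox", "jumped", "dog"],
]

def shuffle_like_groupbykey(parts):
    """Ship every (word, 1) pair, then sum the grouped values afterwards."""
    shuffled = [(word, 1) for part in parts for word in part]
    grouped = defaultdict(list)
    for word, one in shuffled:
        grouped[word].append(one)
    counts = {word: sum(ones) for word, ones in grouped.items()}
    return counts, len(shuffled)

def shuffle_like_reducebykey(parts):
    """Pre-sum inside each partition, then ship one partial count per word per partition."""
    shuffled = []
    for part in parts:
        shuffled.extend(Counter(part).items())
    counts = defaultdict(int)
    for word, partial in shuffled:
        counts[word] += partial
    return dict(counts), len(shuffled)

group_counts, group_shuffled = shuffle_like_groupbykey(partitions)
reduce_counts, reduce_shuffled = shuffle_like_reducebykey(partitions)

print(group_counts == reduce_counts)    # both strategies give the same word counts
print(group_shuffled, reduce_shuffled)  # number of records crossing the simulated shuffle
```

Both strategies produce the same counts, but the pre-combining strategy moves fewer records, and the gap widens as words repeat more often within a partition. combineByKey() follows the same pre-combining idea while letting the per-partition accumulator have a different type from the input values.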

wordcount_by_groupbykey.py
wordcount_by_groupbykey.sh

wordcount_by_groupbykey_shorthand.py
wordcount_by_groupbykey_shorthand.sh

wordcount_by_combinebykey.py
wordcount_by_combinebykey.sh

wordcount_by_reducebykey.py
wordcount_by_reducebykey.sh

wordcount_by_reducebykey_shorthand.py
wordcount_by_reducebykey_shorthand.sh

wordcount_by_reducebykey_with_filter.py
wordcount_by_reducebykey_with_filter.sh

best regards,
Mahmoud Parsian