data
README.md
word_count_by_dataframe.log
word_count_by_dataframe.py
word_count_by_dataframe_shorthand.log
word_count_by_dataframe_shorthand.py
wordcount_by_combinebykey.log
wordcount_by_combinebykey.py
wordcount_by_combinebykey.sh
wordcount_by_groupbykey.py
wordcount_by_groupbykey.sh
wordcount_by_groupbykey_shorthand.py
wordcount_by_groupbykey_shorthand.sh
wordcount_by_reducebykey.py
wordcount_by_reducebykey.sh
wordcount_by_reducebykey_shorthand.py
wordcount_by_reducebykey_shorthand.sh
wordcount_by_reducebykey_with_filter.py
wordcount_by_reducebykey_with_filter.sh

README.md

Word Count

The purpose of this folder is to present multiple solutions (using DataFrames and RDDs) to the classic word count problem.

"... This book will be a great resource for
both readers looking to implement existing
algorithms in a scalable fashion and readers
who are developing new, custom algorithms
using Spark. ..."

Dr. Matei Zaharia
Original Creator of Apache Spark

FOREWORD by Dr. Matei Zaharia

Word Count using Spark DataFrames

word_count_by_dataframe.log
word_count_by_dataframe.py

word_count_by_dataframe_shorthand.log
word_count_by_dataframe_shorthand.py

Word Count using Spark RDDs

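Solutions are provided using the reduceByKey(), groupByKey(), and combineByKey() reducers. In general, the solutions using reduceByKey() and combineByKey() scale out better than the one using groupByKey(), because they combine values within each partition before the shuffle, while groupByKey() sends every (key, value) pair across the network.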
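To make the scaling difference concrete, here is a minimal plain-Python sketch (no Spark required) that imitates the two strategies and counts how many records would cross the shuffle. It is only an illustration: the `partitions` data and the helper names `to_pairs`, `count_like_groupbykey`, and `count_like_reducebykey` are hypothetical and are not taken from the scripts in this folder.

```python
from collections import Counter, defaultdict

# Hypothetical input: each inner list stands for one partition of text lines.
partitions = [
    ["fox jumped", "fox ran"],
    ["fox slept", "dog ran"],
]


def to_pairs(lines):
    """Emit a (word, 1) pair for every word, as the map step would."""
    return [(word, 1) for line in lines for word in line.split()]


def count_like_groupbykey(partitions):
    """Ship every (word, 1) pair across the 'shuffle', then sum per word."""
    shuffled = [pair for lines in partitions for pair in to_pairs(lines)]
    grouped = defaultdict(list)
    for word, one in shuffled:
        grouped[word].append(one)
    counts = {word: sum(ones) for word, ones in grouped.items()}
    return counts, len(shuffled)


def count_like_reducebykey(partitions):
    """Sum inside each partition first, then ship only the partial sums."""
    shuffled = []
    for lines in partitions:
        local = Counter()
        for word, one in to_pairs(lines):
            local[word] += one
        shuffled.extend(local.items())
    counts = Counter()
    for word, partial in shuffled:
        counts[word] += partial
    return dict(counts), len(shuffled)


group_counts, group_shuffled = count_like_groupbykey(partitions)
reduce_counts, reduce_shuffled = count_like_reducebykey(partitions)
print(group_counts == reduce_counts)    # both strategies agree on the counts
print(group_shuffled, reduce_shuffled)  # records crossing the simulated shuffle
```

Both strategies produce the same counts; the gap in shuffled records widens as words repeat more often within a partition, which is the usual case for real text.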

wordcount_by_groupbykey.py
wordcount_by_groupbykey.sh

wordcount_by_groupbykey_shorthand.py
wordcount_by_groupbykey_shorthand.sh

wordcount_by_combinebykey.py
wordcount_by_combinebykey.sh

wordcount_by_reducebykey.py
wordcount_by_reducebykey.sh

wordcount_by_reducebykey_shorthand.py
wordcount_by_reducebykey_shorthand.sh

wordcount_by_reducebykey_with_filter.py
wordcount_by_reducebykey_with_filter.sh
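
Spark's `RDD.combineByKey()` takes three functions: one that creates a combiner from the first value seen for a key, one that merges a further value into a combiner within a partition, and one that merges combiners from different partitions. The plain-Python sketch below imitates that contract for word count. The `combine_by_key` helper and the `pair_partitions` data are hypothetical stand-ins written for illustration; they are not Spark's implementation and not the code in wordcount_by_combinebykey.py.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Imitate the three-function contract of combineByKey on plain lists."""
    # Step 1: build one combiner per key inside each partition.
    partials = []
    for pairs in partitions:
        local = {}
        for key, value in pairs:
            if key in local:
                local[key] = merge_value(local[key], value)
            else:
                local[key] = create_combiner(value)
        partials.append(local)

    # Step 2: merge the per-partition combiners into the final result.
    merged = {}
    for local in partials:
        for key, combiner in local.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], combiner)
            else:
                merged[key] = combiner
    return merged


# Hypothetical input: (word, 1) pairs already split into two partitions.
pair_partitions = [
    [("fox", 1), ("jumped", 1), ("fox", 1)],
    [("fox", 1), ("dog", 1)],
]

word_counts = combine_by_key(
    pair_partitions,
    create_combiner=lambda v: v,          # first 1 seen for a word
    merge_value=lambda c, v: c + v,       # add another 1 in the same partition
    merge_combiners=lambda a, b: a + b,   # add partial sums across partitions
)
print(word_counts)
```

For word count the combiner is just an integer, so all three functions reduce to addition, which is why reduceByKey() is the shorter way to express the same computation.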

best regards,
Mahmoud Parsian
