Find-Something-In-Big-Documents

We are going to identify the top-N most frequent @mention and #hashtag entities. The dataset contains 20 million Turkish tweets and can be downloaded from here.

Your project must be a valid Maven project. Running mvn clean package must produce an executable jar file named trending.jar under the target directory. This can be done with Maven plugins such as the Shade or Assembly plugin.

The following command-line options must be supported.

Option	Description
-n, --number	The number of entities to display. [defaults to 10]
-e, --entity	The name of the entity (e.g., hashtag or mention). [defaults to hashtag]
-r, --reverse	Reverse the comparison (e.g., display the least frequent entities).
-i, --ignore-case	Fold upper case to lower case characters (e.g., collate #AnadoluÜniversitesi and #anadoluÜniversitesi).
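
The option table above could be handled with a small hand-written parser. The following is only a sketch: the class name TrendingOptions, its field names, and the choice to treat every unrecognized argument as an input file are assumptions made for illustration, not part of the assignment. A command-line parsing library would work just as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the parsed options; all names here are illustrative.
final class TrendingOptions {
    int number = 10;             // -n, --number (defaults to 10)
    String entity = "hashtag";   // -e, --entity (defaults to hashtag)
    boolean reverse = false;     // -r, --reverse
    boolean ignoreCase = false;  // -i, --ignore-case
    final List<String> files = new ArrayList<>(); // positional arguments (assumed to be input files)

    static TrendingOptions parse(String[] args) {
        TrendingOptions o = new TrendingOptions();
        for (int i = 0; i < args.length; i++) {
            String a = args[i];
            switch (a) {
                case "-n":
                case "--number":
                    o.number = Integer.parseInt(requireValue(args, ++i, a));
                    break;
                case "-e":
                case "--entity":
                    o.entity = requireValue(args, ++i, a);
                    if (!o.entity.equals("hashtag") && !o.entity.equals("mention")) {
                        throw new IllegalArgumentException("Unknown entity: " + o.entity);
                    }
                    break;
                case "-r":
                case "--reverse":
                    o.reverse = true;
                    break;
                case "-i":
                case "--ignore-case":
                    o.ignoreCase = true;
                    break;
                default:
                    // Assumption: anything else is an input file name.
                    o.files.add(a);
            }
        }
        return o;
    }

    // Returns args[i], or fails with a readable message when the value is missing.
    private static String requireValue(String[] args, int i, String option) {
        if (i >= args.length) {
            throw new IllegalArgumentException("Missing value for " + option);
        }
        return args[i];
    }
}
```

Calling TrendingOptions.parse(args) at the top of main would yield the defaults from the table whenever an option is omitted.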

The result is printed to standard output as two tab-separated columns: the entity and its frequency (entity \t frequency).
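
One possible shape for the counting and output steps is sketched below. Several things are assumptions rather than requirements: the class name TrendingSketch, the regular expressions used to recognize hashtags and mentions, the use of the Turkish locale for case folding (chosen because the dataset is Turkish), and the alphabetical tie-break between entities with equal counts.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class TrendingSketch {
    // Assumed token shapes: '#' followed by letters, digits or '_' (Unicode-aware),
    // and '@' followed by ASCII letters, digits or '_'.
    private static final Pattern HASHTAG = Pattern.compile("#[\\p{L}\\p{N}_]+");
    private static final Pattern MENTION = Pattern.compile("@[A-Za-z0-9_]+");
    // Assumption: fold case with Turkish rules, since the tweets are Turkish.
    private static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Counts every hashtag or mention occurrence in the given lines.
    static Map<String, Long> count(Stream<String> lines, String entity, boolean ignoreCase) {
        Pattern pattern = entity.equals("mention") ? MENTION : HASHTAG;
        Map<String, Long> counts = new HashMap<>();
        lines.forEach(line -> {
            Matcher m = pattern.matcher(line);
            while (m.find()) {
                String token = ignoreCase ? m.group().toLowerCase(TURKISH) : m.group();
                counts.merge(token, 1L, Long::sum);
            }
        });
        return counts;
    }

    // Sorts by frequency (descending unless reverse is set) and keeps the first n entries.
    static List<Map.Entry<String, Long>> top(Map<String, Long> counts, int n, boolean reverse) {
        Comparator<Map.Entry<String, Long>> byCount = Map.Entry.comparingByValue();
        if (!reverse) {
            byCount = byCount.reversed();
        }
        // Tie-break on the entity text so the output is deterministic.
        Comparator<Map.Entry<String, Long>> order =
                byCount.thenComparing(Map.Entry.<String, Long>comparingByKey());
        return counts.entrySet().stream().sorted(order).limit(n).collect(Collectors.toList());
    }

    // Renders one "entity<TAB>frequency" line per entry.
    static String format(List<Map.Entry<String, Long>> rows) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> row : rows) {
            sb.append(row.getKey()).append('\t').append(row.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```

With a file on disk, the lines could come from java.nio.file.Files.lines inside a try-with-resources block, so the 20 million tweets are streamed rather than loaded into memory at once.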

For example, java -jar target/trending.jar -n 20 -e mention -i Tweets.txt displays the top 20 mentions in decreasing order of frequency.

As another example, java -jar target/trending.jar -r Tweets.txt displays 10 hashtags in increasing order of frequency.
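
Because only N entries are ever displayed, the selection step does not have to sort every distinct entity. The sketch below shows one alternative, a bounded heap that keeps at most N candidates at a time. It is a swapped-in technique rather than something the assignment prescribes, and the names TopNHeapSketch, select and byFrequency are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TopNHeapSketch {
    // Returns the n best entries under `order` (best first) without sorting the whole map.
    static List<Map.Entry<String, Long>> select(Map<String, Long> counts, int n,
                                                Comparator<Map.Entry<String, Long>> order) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        int capacity = Math.max(1, Math.min(n, counts.size()));
        // With the reversed comparator, the heap's head is the worst entry kept so far.
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>(capacity, order.reversed());
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (heap.size() < n) {
                heap.add(e);
            } else if (order.compare(e, heap.peek()) < 0) {
                // The new entry beats the current worst one, so it takes its place.
                heap.poll();
                heap.add(e);
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort(order);
        return result;
    }

    // Most frequent first by default; least frequent first when reverse is set.
    // Ties are broken on the entity text (an assumption, to keep the output stable).
    static Comparator<Map.Entry<String, Long>> byFrequency(boolean reverse) {
        Comparator<Map.Entry<String, Long>> byCount = Map.Entry.comparingByValue();
        if (!reverse) {
            byCount = byCount.reversed();
        }
        return byCount.thenComparing(Map.Entry.<String, Long>comparingByKey());
    }
}
```

For M distinct entities this takes roughly O(M log N) time instead of O(M log M) for a full sort. Whether that difference matters for this dataset would need to be measured.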

aytarozgur/Find-Something-In-Big-Documents

Folders and files

Latest commit

History

Repository files navigation

Find-Something-In-Big-Documents

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages