This file describes DocumentCluster, a program for clustering text documents
based on similarity of word frequencies. Document words are first filtered
against a specified stop word list, then stemmed using the classic Porter
stemming algorithm. The resulting data is then converted to Term Frequency -
Inverse Document Frequency values, and normalized so each document is a vector
of length one. The document data is internally represented as a sparse matrix
with collapsed word columns. The vectors are then clustered using the classic
k-means algorithm, using cosine similarity as the distance measure.
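The pipeline above can be sketched in a few lines of Python. This is an illustration only, not the program's actual source: the function names are hypothetical, the TF-IDF weighting (raw count times the natural log of N/df) is one common variant chosen as an assumption, the seeding rule is an assumption, and stop-word filtering and Porter stemming are assumed to have already been applied to the token lists.

```python
import math
from collections import Counter


def tfidf_unit_vectors(docs):
    """Turn token lists into sparse TF-IDF vectors scaled to length one."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Sparse form: only terms present in the document are stored.
        # Assumed weighting: count * ln(N / df).
        vec = {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Terms found in every document get weight zero, so guard the norm.
        vectors.append({t: w / norm for t, w in vec.items()} if norm > 0 else {})
    return vectors


def cosine(a, b):
    """For unit-length vectors, cosine similarity is just the dot product."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans_cosine(vectors, k, iterations=20):
    """Plain k-means; each document joins its most similar centroid."""
    n = len(vectors)
    # Assumption: seed with evenly spaced documents instead of random picks.
    centroids = [dict(vectors[i * n // k]) for i in range(k)]
    labels = []
    for _ in range(iterations):
        new_labels = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            # New centroid: sum of member vectors, rescaled to length one.
            total = {}
            for v, label in zip(vectors, labels):
                if label == c:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            norm = math.sqrt(sum(w * w for w in total.values()))
            if norm > 0:
                centroids[c] = {t: w / norm for t, w in total.items()}
    return labels
```

Calling kmeans_cosine(tfidf_unit_vectors(docs), k) on a list of token lists returns one cluster index per document. Because every vector has length one, ranking centroids by dot product is the same as ranking them by cosine similarity.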
To install:
1. Copy files to a directory
2. Create a subdirectory 'data'
3. Move the supplied stopwords.txt file to the data subdirectory, or create a
custom version. Words within the file must be specified on a single line,
separated by commas.
4. Compile source files
5. Copy the files to cluster to the data subdirectory
6. Run as DocumentCluster [number of clusters] [list of files]
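For anyone writing a custom stop word list (step 3), the expected layout is a single line of comma-separated words. A minimal Python sketch of reading such a file follows. The function names are hypothetical, and trimming whitespace and lowercasing are assumptions about how the words are matched, not documented behavior of DocumentCluster.

```python
def parse_stop_words(line):
    """Split one comma-separated line into a set of stop words."""
    # Assumption: surrounding whitespace is ignored and matching is
    # case-insensitive. Empty entries (e.g. a trailing comma) are dropped.
    return {word.strip().lower() for word in line.split(",") if word.strip()}


def load_stop_words(path="data/stopwords.txt"):
    """Read the stop word list from the first line of the file."""
    with open(path, encoding="utf-8") as handle:
        return parse_stop_words(handle.readline())


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token.lower() not in stop_words]
```

A file containing the line "the,a,of,and" would, under these assumptions, filter those four words out of every document before stemming.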
The number of clusters will be the number specified or half the number of
files, whichever is less. Files that have no word overlap with other files
after stop-word removal will be excluded from the clustering; this happens
rarely in practice.
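The two rules above can be written out as a short Python sketch. The function names are hypothetical, and rounding half the file count down for an odd number of files is an assumption; the text above does not say how that case is handled.

```python
from collections import Counter


def effective_cluster_count(requested, file_count):
    """Use the requested count or half the number of files, whichever is less."""
    # Assumption: half of an odd file count is rounded down.
    return min(requested, file_count // 2)


def files_with_overlap(doc_terms):
    """Return the files that share at least one term with another file.

    doc_terms maps a file name to its set of terms after stop-word removal.
    """
    # Count how many files contain each term.
    doc_freq = Counter(term for terms in doc_terms.values() for term in terms)
    return [
        name
        for name, terms in doc_terms.items()
        if any(doc_freq[term] > 1 for term in terms)
    ]
```

For example, requesting 10 clusters for 8 files gives 4 clusters, while requesting 3 clusters for the same files gives 3.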
Copyright (C) 2013 Ezra Erb
This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License version 3 as published by the Free
Software Foundation.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program. If not, see <http://www.gnu.org/licenses/>.
I'd appreciate a note if you find this program useful or make updates. Please
contact me through LinkedIn or GitHub (my profile also has a link to the code
repository).