# GK-LDA (General Knowledge based LDA)

GK-LDA is an open-source Java package, created by Zhiyuan (Brett) Chen, that implements the algorithm proposed in Chen et al. (CIKM 2013). For more details, please refer to the paper.

If you use this package, please cite the paper: Zhiyuan Chen, Arjun Mukherjee, Bing Liu, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh. Discovering Coherent Topics Using General Knowledge. In Proceedings of CIKM 2013, pages 209-218.

If you have any questions or bug reports, please send them to Zhiyuan (Brett) Chen ([email protected]).

The program accepts the following command-line arguments:

* `-i`: the path of the directory containing the input domains.
* `-know`: the path of the input knowledge file.
* `-o`: the path of the output model directory.
* `-nthreads`: the number of threads to use. The program is multithreaded and runs in parallel.
* `-nTopics`: the number of topics used in the topic model for each domain.

## Input and Output

### Input

The input directory should contain domain files. For each domain, there should be 2 files (can be opened by text editors):

* domain.docs: each line (representing a document) contains a list of word ids.
* domain.vocab: mapping from word id (starting from 0) to word, separated by ":".
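The two input files can be read back into memory with a few lines of Java. The following is only a minimal sketch, not code from the GK-LDA package: the class name `DomainFiles` and its methods are hypothetical, and it assumes that the word ids on each line of domain.docs are separated by whitespace, which this README does not state.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the GK-LDA package.
class DomainFiles {

    // Parses vocab lines of the form "id:word" into an id -> word map.
    static Map<Integer, String> parseVocab(List<String> lines) {
        Map<Integer, String> vocab = new HashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            int sep = line.indexOf(':'); // split on the first ':' only
            int id = Integer.parseInt(line.substring(0, sep).trim());
            vocab.put(id, line.substring(sep + 1).trim());
        }
        return vocab;
    }

    // Parses docs lines into arrays of word ids, one array per document.
    // ASSUMPTION: word ids are separated by whitespace.
    static List<int[]> parseDocs(List<String> lines) {
        List<int[]> docs = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                docs.add(new int[0]); // keep line/document alignment
                continue;
            }
            String[] tokens = trimmed.split("\\s+");
            int[] ids = new int[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                ids[i] = Integer.parseInt(tokens[i]);
            }
            docs.add(ids);
        }
        return docs;
    }

    // Maps the word ids of one document back to words.
    static List<String> decode(int[] doc, Map<Integer, String> vocab) {
        List<String> words = new ArrayList<>();
        for (int id : doc) {
            words.add(vocab.get(id));
        }
        return words;
    }

    // Usage: java DomainFiles <path to domain.docs> <path to domain.vocab>
    public static void main(String[] args) throws IOException {
        List<int[]> docs = parseDocs(
                Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8));
        Map<Integer, String> vocab = parseVocab(
                Files.readAllLines(Paths.get(args[1]), StandardCharsets.UTF_8));
        if (!docs.isEmpty()) {
            System.out.println(decode(docs.get(0), vocab));
        }
    }
}
```

Empty lines in domain.docs are kept as empty documents so that the index of a document still matches its line number in the file.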

The input directory should also contain a knowledge file, in which each line represents a must-set (i.e., a set of words that should appear together under the same topic).
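To illustrate the one-must-set-per-line layout, the sketch below loads must-sets and builds a lookup from each word to the must-sets that contain it. It is not code from the GK-LDA package: the class `MustSets` is hypothetical, and the sketch assumes that the words on a line are separated by whitespace, which should be checked against the actual knowledge file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of the GK-LDA package.
class MustSets {

    // Parses one must-set per non-empty line.
    // ASSUMPTION: words within a line are separated by whitespace.
    static List<Set<String>> parse(List<String> lines) {
        List<Set<String>> mustSets = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            mustSets.add(new LinkedHashSet<>(Arrays.asList(trimmed.split("\\s+"))));
        }
        return mustSets;
    }

    // Builds a word -> indices of the must-sets that contain the word.
    static Map<String, List<Integer>> indexByWord(List<Set<String>> mustSets) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < mustSets.size(); i++) {
            for (String word : mustSets.get(i)) {
                index.computeIfAbsent(word, k -> new ArrayList<>()).add(i);
            }
        }
        return index;
    }

    // Usage: java MustSets <path to knowledge file>
    public static void main(String[] args) throws IOException {
        List<Set<String>> mustSets = parse(
                Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8));
        System.out.println(mustSets.size() + " must-sets loaded");
    }
}
```

A word may belong to more than one must-set, which is why the lookup maps each word to a list of indices rather than to a single one.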

### Output

The output directory contains the topic model results for each learning iteration. LearningIteration 0 is always plain LDA, i.e., without any knowledge. LearningIteration 1 is GK-LDA with the input knowledge. LDA is run first in order to construct the word correlation metric used in GK-LDA.

Under each learning iteration folder, the sub-folder "DomainModels" contains one folder per domain, and each domain folder holds the topic model results for that domain. Each domain folder contains 6 files (can be opened by text editors):

* domain.docs: each line (representing a document) contains a list of word ids.
* domain.param: parameter settings.
* domain.tassign: topic assignment for each word in each document.
* domain.twdist: topic-word distribution.
* domain.twords: top words under each topic. The columns are separated by '\t' where each column corresponds to each topic.
* domain.vocab: mapping from word id (starting from 0) to word.
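Because domain.twords stores one topic per tab-separated column, reading it row by row gives the words in rank order across topics rather than topic by topic. The sketch below transposes the file into one word list per topic. It is an illustration only, not code from the GK-LDA package: the class `TopWords` is hypothetical, and it treats every row the same way, so if the file starts with a header row, that header will appear as the first entry of each list.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the GK-LDA package.
class TopWords {

    // Transposes tab-separated rows into one list of entries per column (topic).
    static List<List<String>> byTopic(List<String> lines) {
        List<List<String>> topics = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            String[] cols = line.split("\t");
            for (int t = 0; t < cols.length; t++) {
                while (topics.size() <= t) {
                    topics.add(new ArrayList<>()); // first time this column is seen
                }
                if (!cols[t].isEmpty()) {
                    topics.get(t).add(cols[t]);
                }
            }
        }
        return topics;
    }

    // Usage: java TopWords <path to domain.twords>
    public static void main(String[] args) throws IOException {
        List<List<String>> topics = byTopic(
                Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8));
        for (int t = 0; t < topics.size(); t++) {
            System.out.println("Column " + t + ": " + topics.get(t));
        }
    }
}
```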

## Contact Information

* Author: Zhiyuan (Brett) Chen
* Affiliation: University of Illinois at Chicago
* Research Area: Text Mining, Machine Learning, Statistical Natural Language Processing, and Data Mining
* Email: [email protected]
* Homepage: http://www.cs.uic.edu/~zchen/
