Skip to content

gregd33/topicnaming

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TopicNaming

TopicNaming is a simple class for smoothing out the process of abstractive cluster description for vector based topic modeling techniques such as top2vec or BERTopic. This is the problem often described as topic representation.

topic modeling pipeline

Techniques for topic modeling such as top2vec or BERTopic work by using the a sequence of four steps:

  1. Embed documents (or other objects) into a semantic space using techniques such as a Sentence Transformer. This initial embedding gives a vector representation of the documents.
  2. Use dimension reduction to get a low dimensional space.
  3. Employ robust clustering techniques to find dense clusters of documents discussing a single concept. As part of this step, it is useful to leverage clustering techniques that are robust to noise (such as hdbscan) to identify these topical clusters.
  4. Represent clusters as topics. This final step is the focus of this topicnaming library.

This style of topic modeling works well for short to medium length homogeneous documents that are about a single topic, but requires extra work such as document segmentation to be effective on long or heterogeneous documents.

Note that using robust clustering techniques in Step 3 can allow for more filtering of background documents that don't have a sufficiently large number of similar documents within your corpus to be considered a topic.

The techniques used in this topicnaming library are broadly similar to the prompt engineering methods described in BERTopic 6B LLM & Generative AI.

The primary differences are:

  • the layered approach we use for clustering our documents into topics is tailored towards hierarchical topic modeling.
  • the cluster sampling strategies that we employ (see EVōC for more details)
  • the prompt engineering used for naming our topics
  • and a final step for dealing with duplicate topics within our hierarchy

As of now this is an early beta version of the library. Things can and will break right now. We welcome feedback, use cases and feature suggestions.

Basic Installation

For now install the latest version of TopicNaming from source you can do so by cloning the repository and running:

git clone https://github.com/TutteInstitute/topicnaming
cd topicnaming
pip install .

Dependency Installation

We will use the LLM inference framework llama.cpp for running our large language models that will name our topics. We are using the python bindings available llama-cpp-python, but have left the installation to the user so it can be installed appropriately for your setup.

Since this library is built on top of C++ it is best installed using conda via conda install -c conda-forge llama-cpp-python.

If you are using pip for installation, there are various command line parameters necessary to help optimize it for your system. Detailed instructions for installing this library via pip can be found here. Basic instructions are found below.

Leveraging a GPU can significantly speed up the process of topic naming and is highly recommended. If you don't have access to a GPU install llama.cpp as follows: If you have:

Linux and Mac no GPU

Linux and Mac with GPU

Model Installation

We will need a large language model downloaded for use with llama.cpp. In our experiments we find that the mistral-7B model gives solid results.

To download this model:

We will use sentence_transformers for embedding out documents (and eventually keywords) into a consistent space. Since sentence_transformers is a a dependency of topicnaming it will be installed by default. Note that sentence_transformers is capable of downloading its own models.

Basic Usage

We will need documents, document vectors and a low dimensional representation of these document vector to construct a represenation. This can be very expensive without a GPU so we recommend storing and reloading these vectors as needed. For faster encoding change device to: "cuda", "mps", "npu" or "cpu" depending on hardware availability. Once we generate document vectors we will need to construct a low dimensional representation. Here we do that via our UMAP library.

Once the low-dimensional representation is available (document_map in this case), we can do the topic naming. Note that you should adjust the parameters passed to Llama based on your hardward configuration as per the api

License

TopicNaming is MIT licensed. See the LICENSE file for details.

Contributing

Contributions are more than welcome! If you have ideas for features of projects please get in touch. Everything from code to notebooks to examples and documentation are all equally valuable so please don't feel you can't contribute. To contribute please fork the project make your changes and submit a pull request. We will do our best to work through any issues with you and get your code merged in.

About

No description, website, or topics provided.

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%