
Comparing changes

base repository: centre-for-humanities-computing/tweetopic
base: fc55dfac5be8189c1e5dd0a02b1f2968f8794cde
head repository: centre-for-humanities-computing/tweetopic
compare: 7eb98f423a091c0f49645e53aed366d51798727d
162 changes: 162 additions & 0 deletions .gitignore
@@ -1,6 +1,7 @@
# python
__pycache__/
*.egg-info/
dist/

# Notebooks
*.ipynb
@@ -17,3 +18,164 @@ __pycache__/

# Other
/tweetopic/_old.py

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
@@ -14,7 +14,7 @@ ci:

repos:
- repo: https://github.com/pycqa/isort
rev: 5.11.4
rev: 5.12.0
hooks:
- id: isort
name: isort (python)
@@ -32,7 +32,7 @@ repos:
args: [--in-place]

- repo: https://github.com/psf/black
rev: 22.12.0
rev: 23.3.0
hooks:
- id: black

47 changes: 17 additions & 30 deletions README.md
@@ -30,19 +30,14 @@ Install from PyPI:
pip install tweetopic
```

If you intend to use the visualization features of PyLDAvis, install the package with optional dependencies:

```bash
pip install tweetopic[viz]
```

## 👩‍💻 Usage ([documentation](https://centre-for-humanities-computing.github.io/tweetopic/))

For easy topic modelling, tweetopic provides you the TopicPipeline class:
Train a topic model on a corpus of short texts:

```python
from tweetopic import TopicPipeline, DMM
from tweetopic import DMM
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Creating a vectorizer for extracting document-term matrix from the
# text corpus.
@@ -52,7 +47,10 @@ vectorizer = CountVectorizer(min_df=15, max_df=0.1)
dmm = DMM(n_components=30, n_iterations=100, alpha=0.1, beta=0.1)

# Creating topic pipeline
pipeline = TopicPipeline(vectorizer, dmm)
pipeline = Pipeline([
    ("vectorizer", vectorizer),
    ("dmm", dmm),
])
```

You may fit the model with a stream of short texts:
@@ -61,34 +59,23 @@ You may fit the model with a stream of short texts:
pipeline.fit(texts)
```

To examine the structure of the topics you can either look at the most frequently occurring words:
To investigate the internal structure of topics and their relations to words and individual documents, we recommend using [topicwizard](https://github.com/x-tabdeveloping/topic-wizard).

```python
pipeline.top_words(top_n=3)
-----------------------------------------------------------------

[
{'vaccine': 1011.0, 'coronavirus': 428.0, 'vaccines': 396.0},
{'afghanistan': 586.0, 'taliban': 509.0, 'says': 464.0},
{'man': 362.0, 'prison': 310.0, 'year': 288.0},
{'police': 567.0, 'floyd': 444.0, 'trial': 393.0},
{'media': 331.0, 'twitter': 321.0, 'facebook': 306.0},
...
{'pandemic': 432.0, 'year': 427.0, 'new': 422.0},
{'election': 759.0, 'trump': 573.0, 'republican': 527.0},
{'women': 91.0, 'heard': 84.0, 'depp': 76.0}
]
Install it from PyPI:

```bash
pip install topic-wizard
```

Or use rich visualizations provided by [pyLDAvis](https://github.com/bmabey/pyLDAvis):
Then visualize your topic model:

```python
pipeline.visualize(texts)
```
import topicwizard

![PyLDAvis visualization](https://github.com/centre-for-humanities-computing/tweetopic/blob/main/docs/_static/pyldavis.png)
topicwizard.visualize(pipeline=pipeline, corpus=texts)
```

> Note: You must install optional dependencies if you intend to use pyLDAvis
![topicwizard visualization](docs/_static/topicwizard.png)
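
With `pipeline.top_words` no longer available on a plain scikit-learn `Pipeline`, the most heavily weighted words per topic can still be listed by hand. The following is a minimal, dependency-free sketch and not part of tweetopic: `top_words` is a hypothetical helper, and the toy `vocabulary` and `components` are made-up stand-ins for what one would presumably read from the fitted vectorizer (`get_feature_names_out()`) and from a topic-word weight matrix on the fitted model (assumed here to be exposed as `components_`, as scikit-learn decomposition models do).

```python
from typing import Dict, List, Sequence


def top_words(
    components: Sequence[Sequence[float]],
    vocabulary: Sequence[str],
    top_n: int = 3,
) -> List[Dict[str, float]]:
    """Return the top_n highest-weighted words for each topic (hypothetical helper)."""
    result = []
    for weights in components:
        # Pair every word with its weight in this topic and sort descending.
        ranked = sorted(
            zip(vocabulary, weights), key=lambda pair: pair[1], reverse=True
        )
        result.append({word: float(weight) for word, weight in ranked[:top_n]})
    return result


# Toy data standing in for a fitted model (made-up numbers, not real output).
# With a real pipeline these would presumably come from something like
# pipeline.named_steps["vectorizer"].get_feature_names_out() and
# pipeline.named_steps["dmm"].components_ (an assumption, see above).
vocabulary = ["vaccine", "police", "trial", "virus"]
components = [
    [10.0, 0.0, 1.0, 7.0],  # topic 0
    [0.0, 9.0, 6.0, 1.0],   # topic 1
]

topics = top_words(components, vocabulary, top_n=2)
print(topics)
# [{'vaccine': 10.0, 'virus': 7.0}, {'police': 9.0, 'trial': 6.0}]
```

Each topic is returned as a dictionary ordered from the heaviest word downwards, which mirrors the shape of the output the removed method used to print.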

## 🎓 References

2 changes: 1 addition & 1 deletion citation.cff
@@ -6,6 +6,6 @@ authors:
given-names: "Márton"
orcid: "https://orcid.org/0000-0001-9652-4498"
title: "tweetopic: Blazing fast topic modelling for short texts."
version: 0.2.0
version: 0.2.2
date-released: 2022-09-21
url: "https://github.com/centre-for-humanities-computing/tweetopic"
Binary file added docs/_static/topicwizard.png
9 changes: 5 additions & 4 deletions docs/index.rst
@@ -9,22 +9,23 @@ This website contains the API reference for tweetopic as well as
a usage guide for getting started with tweetopic.

.. toctree::
:maxdepth: 2
:maxdepth: 1
:caption: Getting started

installation

.. toctree::
:maxdepth: 3
:maxdepth: 1
:caption: Usage

using_tweetopic.dmm
using_tweetopic.btm
using_tweetopic.pipeline
using_tweetopic.visualization
using_tweetopic.model_persistence

.. toctree::
:maxdepth: 3
:maxdepth: 1
:caption: API reference

tweetopic.dmm
@@ -34,4 +35,4 @@ a usage guide for getting started with tweetopic.

.. toctree::

GitHub Repository <https://github.com/centre-for-humanities-computing/tweetopic>
GitHub Repository <https://github.com/centre-for-humanities-computing/tweetopic>
8 changes: 0 additions & 8 deletions docs/installation.rst
@@ -5,11 +5,3 @@ tweetopic can be simply installed by installing the PyPI package.
.. code-block::
pip install tweetopic
If you would like to get the rich visualizations provided by PyLDAvis,
it is advisable to install the "viz" dependencies of the package.


.. code-block::
pip install tweetopic[viz]