Skip to content

Commit

Permalink
v0.1.1 (#2)
Browse files Browse the repository at this point in the history
* Fix RAM issues
* Update documentation
* Add `ftfy` dependency
* Fix `.visualize_concepts`
* Added `.search_concepts`
  • Loading branch information
MaartenGr authored Nov 1, 2021
1 parent 05031fd commit a390976
Show file tree
Hide file tree
Showing 12 changed files with 214 additions and 70 deletions.
57 changes: 43 additions & 14 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
[![PyPI - PyPi](https://img.shields.io/pypi/v/Concept)](https://pypi.org/project/concept/)
[![docs](https://img.shields.io/badge/docs-Passing-green.svg)](https://maartengr.github.io/concept/)
[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/concept/blob/master/LICENSE)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1XHwQPT2itZXu1HayvGoj60-xAXxg9mqe?usp=sharing)

# Concept

Expand Down Expand Up @@ -30,12 +31,11 @@ example:

```python
import os
import glob
import zipfile
from tqdm import tqdm
from PIL import Image
from sentence_transformers import util


# 25k images from Unsplash
img_folder = 'photos/'
if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
Expand All @@ -49,45 +49,74 @@ if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
with zipfile.ZipFile(photo_filename, 'r') as zf:
for member in tqdm(zf.infolist(), desc='Extracting'):
zf.extract(member, img_folder)
images = [Image.open("photos/"+filepath) for filepath in tqdm(img_names)]
img_names = list(glob.glob('photos/*.jpg'))
```

Next, we only need to pass images to **Concept**:

```python
from concept import ConceptModel
concept_model = ConceptModel()
concepts = concept_model.fit_transform(images)
concepts = concept_model.fit_transform(img_names)
```

The resulting concepts can be visualized through `concept_model.visualize_concepts()`:

<img src="images/concepts_without_topics.jpg" width="100%" height="100%" align="center" />

However, to get the full experience, we need to label the concept clusters with topics. To do this,
we need to create a vocabulary:
we need to create a vocabulary. We are going to feed our model with 50.000 nouns from the English
vocabulary:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)
words = vectorizer.get_feature_names()
words = [words[index] for index in np.argpartition(vectorizer.idf_, -50_000)[-50_000:]]
import random
import nltk
nltk.download("wordnet")
from nltk.corpus import wordnet as wn

all_nouns = [word for synset in wn.all_synsets('n') for word in synset.lemma_names() if "_" not in word]
selected_nouns = random.sample(all_nouns, 50_000)
```

Then, we can pass in the resulting `words` to **Concept**:
Then, we can pass in the resulting `selected_nouns` to **Concept**:

```python
from concept import ConceptModel

concept_model = ConceptModel()
concepts = concept_model.fit_transform(images, docs=words)
concepts = concept_model.fit_transform(img_names, docs=selected_nouns)
```

Again, the resulting concepts can be visualized. This time however, we can also see the generated topics
through `concept_model.visualize_concepts()`:

<img src="images/concepts.jpg" width="100%" height="100%" align="center" />

**NOTE**: Use `Concept(embedding_model="clip-ViT-B-32-multilingual-v1")` to select a model that supports 50+ languages.
**NOTE**: Use `Concept(embedding_model="clip-ViT-B-32-multilingual-v1")` to select a model that supports 50+ languages.

## Search Concepts
We can quickly search for specific concepts by embedding a search term and finding the cluster embeddings
that best represent them. As an example, let us search for the term `beach` and see what we can find.
To do this, we simply run the following:

```python
>>> concept_model.find_concepts("beach")
[(100, 0.277577825349102),
(53, 0.27431058773894657),
(95, 0.25973751319723837),
(77, 0.2560122597417548),
(97, 0.25361988261846297)]
```

Each tuple contains two values, the first is the concept cluster and the second the similarity to the
search term. The top 5 similar topics are returned.

Now, let us visualize those concepts to see how well the search function works:

```python
concept_model.visualize_concepts(concepts=[100, 53, 95, 77, 97])
```

<img src="images/search.jpg" width="100%" height="100%" align="center" />


2 changes: 1 addition & 1 deletion concept/__init__.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
from concept._model import ConceptModel

__version__ = "0.1.0"
__version__ = "0.1.1"

__all__ = [
"ConceptModel",
Expand Down
Loading

0 comments on commit a390976

Please sign in to comment.