GitHub - kausmeows/clothsy: Transformer based search/rec engine to fetch Amazon URLs for similar clothing items given a text description

title	emoji	colorFrom	colorTo	sdk	sdk_version	app_file	pinned
Clothsy	👕	purple	purple	gradio	3.24.1	main.py	false

HF Space Demo

Working Demo

Data Collection

To scrape quality clothing data with a proper description and URL for each product, I used Apify's Amazon Product Scraper. After creating an account and logging into the console, we can input links to Amazon fashion categories, such as Men's Fashion -> Shirts.

I downloaded the scraped data for the various clothing categories into a CSV file with the columns url|title|description.

Apify Console

The full dataset consists of 2900 different men's and women's clothing products. It can be found at data/clothing_similarity_search.csv.

Data Cleaning

I used pandas to clean the data and preprocess the text (removing special characters, lowercasing, etc.), possibly followed by some form of text normalization (such as stemming or lemmatization).
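The cleaning step described above could look roughly like the sketch below. It is a minimal illustration using only the standard library, not the code from this repo: the function name clean_text and the exact character rules are assumptions, and stemming or lemmatization is left out.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, strip special characters, and collapse whitespace (hypothetical helper)."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace into single spaces and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Men's Slim-Fit  Cotton Shirt (Blue)!"))
```

With pandas, a function like this could be applied to a text column via something like df["description"].map(clean_text), after filling missing values with an empty string.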

Making Embeddings

I used sentence-transformers to make embeddings for the cleaned data, with the all-MiniLM-L6-v2 model. The model card can be found on the Hugging Face Hub under sentence-transformers/all-MiniLM-L6-v2.

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings)

I chose this model for its small size and good accuracy, which keeps API responses fast.

The embeddings generated for the whole dataset have been saved to a .npy file at /data/embeddings.npy, which can be loaded and used for similarity search. Because the product embeddings are precomputed, only the query has to be embedded at request time, and the search itself is a fast vector-similarity comparison.

I used cosine similarity between the query embedding and the product embeddings to rank the products.
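To make the ranking step concrete, here is a small sketch of cosine-similarity retrieval with NumPy. It is an illustration rather than the repo's actual code: top_k_similar is a hypothetical name, and the toy 3-dimensional vectors stand in for the real embeddings, which would presumably come from np.load on the saved .npy file and from model.encode for the query.

```python
import numpy as np

def top_k_similar(query_vec, product_vecs, k=5):
    """Return (indices of the k most similar rows, all cosine scores)."""
    # Normalise to unit length so that a dot product equals cosine similarity.
    # Note: an all-zero vector would cause a division by zero here.
    q = query_vec / np.linalg.norm(query_vec)
    p = product_vecs / np.linalg.norm(product_vecs, axis=1, keepdims=True)
    scores = p @ q
    # argsort is ascending, so reverse it and keep the first k indices.
    return np.argsort(scores)[::-1][:k], scores

# Toy vectors standing in for real sentence embeddings.
products = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.9, 0.1, 0.0]])
query = np.array([1.0, 0.0, 0.0])

idx, scores = top_k_similar(query, products, k=2)
print(idx)
```

The returned indices can then be used to look up the matching rows (and their url values) in the original CSV.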

API

I used FastAPI to create the API. It has a single endpoint, /predict, which takes a query string and returns the top 5 most similar products as JSON.

We hit the endpoint http://0.0.0.0:8080/predict with a JSON payload such as:
{
    "query": "Men's winter jacket black and white"
}

This will return
{
  "similar_urls": [
    "https://www.amazon.in/dp/B082L3BGGM",
    "https://www.amazon.in/dp/B08KWFRY6W",
    "https://www.amazon.in/dp/B08Q3VBFPD",
    "https://www.amazon.com/dp/B07S1LMK58",
    "https://www.amazon.in/dp/B0B8YY38VF"
  ]
}
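For reference, a client call to the endpoint might be sketched as follows using only the standard library. The helper names are hypothetical, and the use of POST is an assumption, since the HTTP method is not stated here.

```python
import json
import urllib.request

API_URL = "http://0.0.0.0:8080/predict"  # address quoted above

def build_predict_request(query: str, url: str = API_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON request for the /predict endpoint."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: the HTTP method is not documented
    )

def fetch_similar_urls(query: str) -> list:
    """Send the request and return the "similar_urls" list from the response."""
    with urllib.request.urlopen(build_predict_request(query)) as resp:
        return json.load(resp)["similar_urls"]

# Only build the request here; fetch_similar_urls needs the server to be running.
req = build_predict_request("Men's winter jacket black and white")
print(req.get_method(), req.full_url)
```

If 0.0.0.0 is not reachable from the client machine, swapping in localhost for the host part may be needed.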

Deployment

I used Docker to containerize the API. I tried to deploy the endpoint with Google Cloud Functions but ran into an issue, since it was my first time using GCP:

I wasn't able to load the embeddings.npy file from Cloud Storage into the Cloud Function. Some help on this would be appreciated.
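One possible direction for the loading problem, offered as an untested sketch rather than a confirmed fix: download the object into memory and decode it with np.load through an in-memory buffer. The function names and the bucket and blob arguments are hypothetical, and the sketch assumes the google-cloud-storage client library is installed in the function's environment.

```python
import io
import numpy as np

def npy_from_bytes(raw: bytes) -> np.ndarray:
    """Decode the contents of a .npy file that is already in memory."""
    return np.load(io.BytesIO(raw))

def load_embeddings_from_gcs(bucket_name: str, blob_name: str) -> np.ndarray:
    """Download a .npy object from Cloud Storage and decode it (untested sketch)."""
    # Imported lazily so the decoding helper works without the GCP client installed.
    from google.cloud import storage
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    return npy_from_bytes(blob.download_as_bytes())

# Local round trip that exercises the decoding step without touching GCP.
buf = io.BytesIO()
np.save(buf, np.arange(6, dtype=np.float32).reshape(2, 3))
restored = npy_from_bytes(buf.getvalue())
print(restored.shape)
```

Calling the loader once at module import time, rather than inside the request handler, would avoid downloading the file again on every request.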

Running Locally

Clone the repo
Make a virtual environment
Install the dependencies: pip install -r requirements.txt
Run the server: python main.py
