- CLIP is used for text-to-image comparison
- Unicom is used for image-to-image comparison
- ffmpeg is used to extract keyframes
- ChromaDB is used because I have not found a way to store metadata (image filename, etc.) in FAISS
- For some reason, text-to-image works badly after moving to ChromaDB :(
- Try distance metrics other than the default cosine distance
- Fix text-to-image
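
A small sketch that may help with the distance experiments above. It is plain Python with no ChromaDB, CLIP, or Unicom calls, and the vectors and file names are made-up placeholders. It assumes the usual definitions: cosine distance as `1 - cos`, squared Euclidean distance for L2, and `1 - dot` for inner product. The point it illustrates is that the metric only changes the ranking when embeddings are not unit length; once they are normalized, all three metrics order the results the same way.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    return 1.0 - dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def squared_l2_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a, b):
    return 1.0 - dot(a, b)


METRICS = {
    "cosine": cosine_distance,
    "l2": squared_l2_distance,
    "ip": inner_product_distance,
}


def rank(query, items, metric):
    """Return item names sorted from nearest to farthest under `metric`."""
    dist = METRICS[metric]
    return sorted(items, key=lambda name: dist(query, items[name]))


# Toy 3-d "embeddings" (hypothetical values, not real model output).
query = [1.0, 0.0, 0.0]
items = {
    "frame_a.jpg": [10.0, 1.0, 0.0],  # close in direction, large norm
    "frame_b.jpg": [0.9, 0.5, 0.0],
    "frame_c.jpg": [0.0, 1.0, 0.0],
}

# Raw vectors: L2 disagrees with cosine because frame_a has a large norm.
raw_rankings = {m: rank(query, items, m) for m in METRICS}

# Unit-length vectors: all three metrics produce the same order.
unit_query = normalize(query)
unit_items = {name: normalize(vec) for name, vec in items.items()}
unit_rankings = {m: rank(unit_query, unit_items, m) for m in METRICS}

if __name__ == "__main__":
    for m in METRICS:
        print(m, "raw:", raw_rankings[m], "normalized:", unit_rankings[m])
```

If the rankings from the real index change a lot when the metric is switched, that hints that the stored embeddings are not normalized, which might be worth checking while fixing text-to-image.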