Website | Email | Docs (soon)
This implementation provides a comprehensive framework for integrating a Stable Diffusion-style Diffusion Transformer (DiT) model with a retrieval-based attribution system using PyTorch, Hugging Face Diffusers, CLIP, and InternViT. By encoding and indexing the training dataset's images, the system can both attribute generated images and verify external images against the training data.
This approach promotes transparency and accountability in generative models, addressing concerns related to copyright and artist attribution. It serves as a foundation that can be further refined and expanded based on specific requirements and datasets.
Building a generative model architecture that integrates a Diffusion Transformer model with a retrieval-based attribution system involves several components. The system not only generates images from text prompts but also provides attributions to the artists or data sources that most closely align with the generated content. Additionally, it offers verification capabilities for external images against the training dataset.
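As a rough illustration of the retrieval-based verification path, the sketch below builds a FAISS index over CLIP image embeddings of the training set and checks an external image via nearest-neighbor search. The model ID, the use of FAISS, and the helper names (`clip_embed`, `build_index`, `verify`) are assumptions for illustration, not the project's exact implementation.

```python
import faiss
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_embed(image: Image.Image) -> np.ndarray:
    """Unit-normalized CLIP image embedding, so inner product equals cosine similarity."""
    inputs = clip_processor(images=image.convert("RGB"), return_tensors="pt").to(device)
    feats = clip_model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.cpu().numpy().astype("float32")

def build_index(training_images: list[Image.Image]) -> faiss.IndexFlatIP:
    """Index every training image; in practice the embeddings would be cached to disk."""
    embeddings = np.concatenate([clip_embed(img) for img in training_images], axis=0)
    index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product on unit vectors
    index.add(embeddings)
    return index

def verify(index: faiss.IndexFlatIP, artist_labels: list[str],
           image: Image.Image, k: int = 5) -> list[tuple[str, float]]:
    """Return the k nearest training images' artist labels and similarity scores."""
    scores, ids = index.search(clip_embed(image), k)
    return [(artist_labels[i], float(s)) for i, s in zip(ids[0], scores[0])]
```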
A training pipeline that allows a generative model like FLUX or AuraFlow to output the nearest artist reference text based on CLIP + ViT embeddings and autoencoder (VAE) embeddings involves the following key steps:
- Data Preparation: Organize the dataset with images and their associated artist labels.
- Embedding Extraction:
- CLIP Embeddings: Encode images using the CLIP model.
- InternViT Embeddings: Encode images using the InternViT model.
- Autoencoder (VAE) Embeddings: Extract latent representations from the VAE part of the Diffusion Transformer model.
- Combining Embeddings: Merge CLIP, InternViT, and VAE embeddings into a single representation (see the sketch after this list).
- Label Encoding: Encode artist labels for training.
- Model Training: Train a classifier (e.g., a neural network) to predict artist labels based on combined embeddings.
- Integration with Generation Pipeline: Enhance the image generation process to output artist references alongside generated images.
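A minimal sketch of the embedding-extraction and combining steps, assuming CLIP and InternViT checkpoints from the Hugging Face Hub and the VAE from a FLUX checkpoint; the model IDs, pooling choices, and helper name `combined_embedding` are illustrative rather than the project's exact configuration.

```python
import torch
from PIL import Image
from diffusers import AutoencoderKL
from torchvision import transforms
from transformers import AutoModel, CLIPImageProcessor, CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen pretrained encoders (model IDs are illustrative).
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
vit_model = AutoModel.from_pretrained(
    "OpenGVLab/InternViT-6B-448px-V1-5",  # ~6B parameters; needs substantial GPU memory
    torch_dtype=torch.bfloat16, trust_remote_code=True).to(device).eval()
vit_processor = CLIPImageProcessor.from_pretrained("OpenGVLab/InternViT-6B-448px-V1-5")
# VAE taken from the (fine-tuned) generative checkpoint; here the stock FLUX.1-schnell VAE.
vae = AutoencoderKL.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", subfolder="vae").to(device).eval()

# Resize to 1024x1024 and map to [-1, 1], the range the VAE expects.
vae_transform = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
])

@torch.no_grad()
def combined_embedding(image: Image.Image) -> torch.Tensor:
    """Concatenate CLIP, InternViT, and pooled VAE-latent features into one vector."""
    image = image.convert("RGB")
    clip_feat = clip_model.get_image_features(
        **clip_processor(images=image, return_tensors="pt").to(device))
    vit_tokens = vit_model(
        vit_processor(images=image, return_tensors="pt")
        .pixel_values.to(device, dtype=vit_model.dtype)).last_hidden_state
    vit_feat = vit_tokens.mean(dim=1).float()          # simple mean-pool over tokens
    latents = vae.encode(
        vae_transform(image).unsqueeze(0).to(device)).latent_dist.mean
    vae_feat = latents.mean(dim=(2, 3))                # global-average-pool the latent map
    return torch.cat([clip_feat, vit_feat, vae_feat], dim=-1)
```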
- We fine-tune the generative model on the new image datasets.
- We extract VAE embeddings of the images from the fine-tuned model.
- We extract ViT and CLIP embeddings of the images from the pretrained CLIP and InternViT models.
- We collect the embeddings from both sources and store them in a dataset.
- We train the classifier on the collected embeddings (a training sketch follows this list).
- The classifier has multiple output heads so that it can scale to many artist labels in the future.
- The generative model and the classifier are trained separately (they are two different models but run in the same inference pipeline).
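A hedged sketch of the classifier-training step described above: a small network with a shared trunk and multiple output heads, trained on the collected embeddings with cross-entropy. The dimensions, head layout, and synthetic data are placeholders, not the project's final design.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class ArtistClassifier(nn.Module):
    """Shared trunk over the combined embedding, with multiple output heads so
    additional artist groups can be attached later without retraining old heads."""
    def __init__(self, embed_dim: int, artists_per_head: list[int], hidden_dim: int = 1024):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.GELU())
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, n) for n in artists_per_head)

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        h = self.trunk(x)
        return [head(h) for head in self.heads]

# Placeholder sizes: 3984 = 768 (CLIP) + 3200 (InternViT) + 16 (VAE channels) in the earlier sketch.
embed_dim, num_artists = 3984, 10_000
classifier = ArtistClassifier(embed_dim, artists_per_head=[num_artists])
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# embeddings: (N, embed_dim) combined CLIP/InternViT/VAE vectors; labels: (N,) artist ids.
embeddings, labels = torch.randn(256, embed_dim), torch.randint(0, num_artists, (256,))
loader = DataLoader(TensorDataset(embeddings, labels), batch_size=32, shuffle=True)

for epoch in range(3):
    for x, y in loader:
        loss = criterion(classifier(x)[0], y)   # single head in this toy setup
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```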
Notes:
- The ViT and CLIP parameters are frozen during fine-tuning on the image datasets, but the VAE checkpoint is updated.
- Every image is resized to 1024×1024 for input uniformity to the classifier (see the inference sketch after these notes).
- Approximate total parameter count (assuming the FLUX model, 1024×1024 inputs, and 1,000,000 artist labels): ~688,595,244 (≈688M parameters, excluding the pretrained encoders and the FLUX model).
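To show how the pieces could fit together at inference time (generation at 1024×1024 followed by attribution), here is a hedged sketch that reuses `combined_embedding` and the trained `classifier` from the earlier sketches; the FLUX.1-schnell pipeline call, prompt, and sampling parameters are illustrative.

```python
import torch
from diffusers import FluxPipeline

# Assumes combined_embedding() and a trained `classifier` from the sketches above,
# with the classifier's input size matching combined_embedding's output size.
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to(device)

prompt = "an impressionist harbor at sunset"
image = pipe(prompt, height=1024, width=1024,
             num_inference_steps=4, guidance_scale=0.0).images[0]

with torch.no_grad():
    emb = combined_embedding(image).cpu()        # CLIP + InternViT + VAE features
    logits = classifier(emb)[0]                  # first (and only) output head
    top = torch.topk(torch.softmax(logits, dim=-1), k=3)
    print("Nearest artist references:", top.indices.tolist(), top.values.tolist())
```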
- We published our early website and first article; kindly check out the article Bridging Generative AI and Artistic Integrity: A New Era of Creative Collaboration.
- We published a brief video explanation of the project [Link].
- Experimental implementation with the public WikiArt dataset.
- Integration with the Glaze image-cloaking algorithm for further protection of artists' artwork.
- Add an AuraFlow variant, because FLUX-schnell is a distilled model and FLUX-dev does not come with a commercial license. (completed)
- FLUX.1-schnell implementation rather than Stable Diffusion. (completed)
- Use a bigger ViT model such as InternViT-6B. (completed)
We are looking for a research sponsor/investor. Please email [email protected] if you are interested in sponsoring this project.