- ❗ Requires OpenAI/CLIP
- Generates matching text embeddings / a 'CLIP opinion' about images
- Uses gradient ascent to optimize text embeds for cosine similarity with image embeds (see the sketch after this list)
- Saves 'CLIP opinion' as .txt files [best tokens]
- Saves text-embeds.pt containing [batch_size] embeddings
- Can be used to create an adversarial text-image aligned dataset
- For XAI, adversarial training, etc.; see the 'attack' folder for example images
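A minimal sketch of the core loop, with made-up variable names and hyperparameters (the actual script differs in details such as initialization, learning-rate schedule, and how it tracks the best tokens by loss):

```python
import torch
import clip  # OpenAI/CLIP: pip install git+https://github.com/openai/CLIP.git
from clip.simple_tokenizer import SimpleTokenizer
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)
model = model.float()  # keep everything fp32 for this sketch

# Fixed target: the (normalized) image embedding.
image = preprocess(Image.open("attack/024_attack.png")).unsqueeze(0).to(device)
with torch.no_grad():
    img_embed = model.encode_image(image)
    img_embed = img_embed / img_embed.norm(dim=-1, keepdim=True)

# Optimized variable: a batch of "soft" token embeddings.
batch_size, n_ctx, width = 13, 77, model.token_embedding.weight.shape[1]
soft_tokens = (0.01 * torch.randn(batch_size, n_ctx, width, device=device)).requires_grad_(True)
optimizer = torch.optim.Adam([soft_tokens], lr=0.1)

for step in range(300):
    # Re-run CLIP's text tower on the soft embeddings (mirrors model.encode_text,
    # with the EOT position fixed to the last slot for simplicity).
    x = soft_tokens + model.positional_embedding
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x)
    txt_embed = x[:, -1, :] @ model.text_projection
    txt_embed = txt_embed / txt_embed.norm(dim=-1, keepdim=True)

    loss = -(txt_embed @ img_embed.T).mean()  # gradient *ascent* on cosine similarity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The "CLIP opinion": nearest real vocabulary tokens to the optimized soft embeddings.
with torch.no_grad():
    vocab = model.token_embedding.weight  # [vocab_size, width]
    nearest_ids = torch.cdist(soft_tokens[0], vocab).argmin(dim=-1)
    print(SimpleTokenizer().decode(nearest_ids.tolist()))
```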
- Usage: Single image:

```
python gradient-ascent.py --use_image attack/024_attack.png
```

- Usage: Batch process:

```
python gradient-ascent.py --img_folder attack
```

- 🆕 Load custom model:

```
python gradient-ascent-unproj_flux1.py --model_name "path/to/myCLIP.safetensors"
```
- Args
- `--model_name` now accepts a model name (default `ViT-L/14`), OR a `"/path/to/model.pt"`.
- If it ends in `.safetensors`, the script will assume 'ViT-L/14' (CLIP-L) and load the state_dict (see the sketch below this list). ✅
- ⚠️ It must nevertheless be in the original "OpenAI/CLIP" format. HuggingFace-converted models will NOT work.
- My HF: zer0int/CLIP-GmP-ViT-L-14. Its `model.safetensors` will NOT work (it's for diffusers / HF).
- Instead, download the full model .safetensors [text encoder AND vision encoder]; direct link:
- My GmP-BEST-smooth and GmP-Text-detail and 🆕 SAE-GmP will work with this code.
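For reference, a minimal sketch of loading a full OpenAI-format state_dict from a .safetensors file into the stock ViT-L/14 architecture; the path is a placeholder and the script's actual loading logic may differ:

```python
import torch
import clip
from safetensors.torch import load_file

device = "cuda" if torch.cuda.is_available() else "cpu"

# Instantiate the stock ViT-L/14 architecture, then overwrite its weights with the
# .safetensors state_dict (keys must be in the original OpenAI/CLIP format,
# NOT a HuggingFace / diffusers conversion).
model, preprocess = clip.load("ViT-L/14", device=device)
state_dict = load_file("path/to/myCLIP.safetensors")
model.load_state_dict(state_dict)
model.eval()
```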
- 🆕 Added `gradient-ascent-unproj_flux1.py`. Usage is the same; however, in addition to projected embeddings:
- Saves `pinv` and `inv` versions of the pre-projection embeddings (see the sketch after this list).
- 👉 Flux.1-dev uses these embeddings (`pinv` seems best for Flux.1-dev).
- Recommended samplers: HEUN, Euler.
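A sketch of the pinv/inv idea: if the saved post-projection embeddings are `pre @ text_projection`, the pre-projection ("unprojected") features can be recovered by multiplying with the (pseudo-)inverse of CLIP's text projection matrix. The file path below is a placeholder; the script saves these versions for you:

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14", device=device)
proj = model.text_projection.float()  # [width, embed_dim]; square (768x768) for ViT-L/14

# Post-projection embeddings saved by the script (placeholder path), shape [batch, embed_dim]
projected = torch.load("embeds/text-embeds.pt", map_location=device).float()

# projected = pre @ proj  =>  pre ≈ projected @ pinv(proj)
unproj_pinv = projected @ torch.linalg.pinv(proj)  # pseudo-inverse ("pinv"; reportedly best for Flux.1-dev)
unproj_inv = projected @ torch.linalg.inv(proj)    # exact inverse ("inv"; valid only if proj is non-singular)
```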
Example "worst portrait ever" generated by Flux.1-dev with pure CLIP guidance (no T5!) as CLIP apparently tried to encode the facial expression of the cat 😂; plus, the usual CLIP text gibberish of something 'cat' and 'shoe' mashed-up:
Command-line arguments:

```
--batch_size, default=13, type=int, help="Reduce batch_size if you have OOM issues"
--model_name, default='ViT-L/14', help="CLIP model to use"
--tokens_to, default="texts", help="Save CLIP opinion texts path"
--embeds_to, default="embeds", help="Save CLIP embeddings path"
--use_best, default="True", help="If True, use best embeds (loss); if False, just saves last step (not recommended)"
--img_folder, default=None, help="Path to folder with images, for batch embeddings generation"
--use_image, default=None, help="Path to a single image"
```

Further processing example code snippets:
```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
text_embeddings = torch.load("path/to/embeds.pt").to(device)

# loop over all batches of embeds and do a thing
num_embeddings = text_embeddings.size(0)  # e.g. batch_size 13 -> idx 0 to 12
for selected_embedding_idx in range(num_embeddings):
    print(f"Processing embedding index: {selected_embedding_idx}")
    # do your thing here!

# select a random batch from embedding and do a thing
selected_embedding_idx = torch.randint(0, text_embeddings.size(0), (1,)).item()
selected_embedding = text_embeddings[selected_embedding_idx:selected_embedding_idx + 1]

# or just manually select one
selected_embedding_idx = 3
selected_embedding = text_embeddings[selected_embedding_idx:selected_embedding_idx + 1]
```
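As one concrete example of "do your thing here": score the saved text embeds against an image embedding to check text-image alignment (paths are placeholders):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

text_embeddings = torch.load("path/to/embeds.pt", map_location=device).float()
image = preprocess(Image.open("attack/024_attack.png")).unsqueeze(0).to(device)

with torch.no_grad():
    img = model.encode_image(image).float()
    img = img / img.norm(dim=-1, keepdim=True)
    txt = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
    cos = (txt @ img.T).squeeze(-1)  # cosine similarity per saved embedding
    best = cos.argmax().item()
    print(f"Best-matching embedding index: {best} (cosine similarity {cos[best].item():.3f})")
```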
