
Ville Vesilehto

Posted on • Originally published at ville.dev

Brand Tagging with VLMs

TL;DR

Build a two-stage logo pipeline:

  1. Retrieval - generate image embeddings for small crops and match against a logo dictionary with FAISS cosine search. Use SigLIP-2 (NaFlex) so logos are not distorted and small marks still pop.

  2. Verification - for top matches, ask LLaVA-OneVision-1.5 a strict JSON question ("Is this the X logo?") and accept only high-confidence "yes". It's a good model sir.

A bit longer post this time around.

Intro

Brand tagging in real-world video is hard: logos are tiny, partly occluded, moving, and often appear on textured backgrounds. A practical approach is a two-stage pipeline:

  • First retrieve likely logo crops with a fast contrastive image encoder.
  • Then verify each candidate with a vision-language model (VLM) that can read text and reason about shapes and context. Each candidate is a (frame, bbox, brand, retrieval_score) record that we pass to the VLM.
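
For concreteness, here is a minimal sketch of the candidate record passed between the two stages, written as a Python TypedDict. The field names mirror the candidates.jsonl output produced later in the post (score_retr is the retrieval score):

from typing import List, TypedDict

class Candidate(TypedDict):
    frame: str          # frame filename, e.g. "f_000007.jpg"
    bbox: List[int]     # [x, y, w, h] of the crop in frame coordinates
    brand: str          # brand name from the logo dictionary
    score_retr: float   # cosine similarity from Stage 1 retrieval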

This tutorial combines a modern image-text encoder (SigLIP-2, NaFlex variant) for high-recall retrieval with a VLM from the LLaVA family for precise, structured yes/no verification.

You could train a YOLO-style detector for specific brands. This post, however, focuses on a more flexible "embedding + VLM" approach that adapts quickly to new logos without retraining. Some might find this fun.

Research

Contrastive encoders such as CLIP/SigLIP produce embeddings where similar visuals are close. FAISS makes nearest-neighbor search over many reference embeddings fast and cheap, so you can treat logo search as a nearest-neighbor lookup instead of training a custom detector.
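As a minimal, self-contained sketch of that idea (random vectors stand in for real logo embeddings): cosine search is just an inner-product index over L2-normalized vectors.

import faiss
import numpy as np

d = 768  # illustrative embedding dimension; the real value comes from the encoder
refs = np.random.randn(5, d).astype(np.float32)      # pretend logo embeddings
refs /= np.linalg.norm(refs, axis=1, keepdims=True)  # normalize -> inner product == cosine

index = faiss.IndexFlatIP(d)
index.add(refs)

query = refs[2:3].copy()            # pretend a crop that matches reference #2
scores, ids = index.search(query, 3)
print(ids[0], scores[0])            # reference 2 comes back first with cosine ~1.0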

Verification reduces false positives. A VLM can explicitly answer "Is this the Red Bull logo?" and justify the decision, improving precision on lookalikes, partial views, and blur.

Pointers to the underlying research for the initiated:

Setup

You’ll need to be comfortable with basic Python and PyTorch. No prior experience with FAISS or VLMs required.

Hardware/software requirements:

  • An NVIDIA GPU (32 GB+ VRAM recommended)
  • CUDA 12.x, Python 3.10–3.12
  • Disk: 5–10 GB (models + caches)
  • Tools: ffmpeg for frames

This tutorial was built and tested on a single NVIDIA H200.

Prepare the Python environment as follows:

python3 -m venv .venv && source .venv/bin/activate
pip install -U torch torchvision torchaudio \
  "transformers>=4.45" accelerate pillow opencv-python faiss-cpu \
  numpy pydantic polars

For simplicity we'll use faiss-cpu, which is fine at this scale (it's a single logo after all). The GPU is reserved for the models.
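A quick sanity check that the environment is usable (PyTorch sees the GPU and faiss imports cleanly) before downloading any models:

import faiss
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("faiss OK:", faiss.IndexFlatIP(8) is not None)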

Example video (Creative Commons):

  • Title: Red Bull Racing Pit Stop Practice (2015), 44 s, 1920x1080.
  • License: CC BY-SA 4.0 - attribute ProtoplasmaKid / Wikimedia Commons.
  • It has tiny, moving logos, occlusion, uniforms, car bodywork, and a pit rig, which makes it a great stress test.

Architecture

frames (2–4 FPS) ──▶ crops (multi-scale grid)
                          │
                          ▼
        [Stage 1] Retrieval (SigLIP-2 image features + FAISS cosine)
                          │
                          └─ top-K per frame/brand with heuristics
                          ▼
        [Stage 2] VLM verification (LLaVA-OneVision-1.5 JSON verdict)
                          │
                          ▼
                    JSONL evidence

Models used in this post:

  • Retrieval encoder: SigLIP-2, NaFlex variant (google/siglip2-base-patch16-naflex). SigLIP-2 reports better zero-shot and retrieval performance than prior SigLIP/CLIP models on public benchmarks. NaFlex means the encoder resizes each crop into a grid of flexible patches instead of forcing a fixed square, so long thin logos don't get squashed.

  • Verifier: LLaVA-OneVision-1.5 (lmms-lab/LLaVA-OneVision-1.5-8B-Instruct), an open VLM family whose model card reports strong benchmark results against other open models.

Both are Apache-2.0 licensed.

So let's do it!

Quickstart

1) Grab the video:

wget https://upload.wikimedia.org/wikipedia/commons/b/ba/Red_Bull_Racing_Pit_Stop_Practice.webm 

Then extract frames. Two FPS, scaled down to 1280x720:

mkdir -p frames
ffmpeg -i "Red_Bull_Racing_Pit_Stop_Practice.webm" \
  -vf "fps=2,scale=1280:-1:flags=lanczos" -q:v 3 frames/f_%06d.jpg

This yields 88 JPG files.

2) Prepare a logo dictionary

In this example we're only interested in the Red Bull logo. Create a logos directory and grab the logo into it:

mkdir -p logos
wget -P logos https://upload.wikimedia.org/wikipedia/fi/a/a5/Red_Bull_logo.png

Disclaimer: This tutorial is for educational purposes only and is not affiliated with or endorsed by Red Bull. Red Bull is a registered trademark of Red Bull GmbH. It's a decent energy drink though.

3) Build the logo index (FAISS, SigLIP-2)

Save to build_logo_index.py:

import json, faiss, torch
from pathlib import Path
from PIL import Image
import numpy as np
from transformers import AutoModel, AutoProcessor

MODEL_ID = "google/siglip2-base-patch16-naflex"  # NaFlex = native aspect ratio
OUT_DIR = Path("artifacts"); OUT_DIR.mkdir(exist_ok=True, parents=True)

def embed_images(paths, model, proc, batch=16):
    imgs = [Image.open(p).convert("RGB") for p in paths]
    feats = []
    for i in range(0, len(imgs), batch):
        chunk = imgs[i:i+batch]
        inputs = proc(images=chunk, return_tensors="pt")
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        with torch.no_grad():
            f = model.get_image_features(**inputs)  # (B, d)
            f = torch.nn.functional.normalize(f, dim=-1)
        feats.append(f.cpu())
    return torch.cat(feats, dim=0).numpy()

def main():
    model = AutoModel.from_pretrained(MODEL_ID, dtype=torch.float32, device_map="auto")
    proc = AutoProcessor.from_pretrained(MODEL_ID)

    logo_paths = sorted(list(Path("logos").glob("*.*")))
    if not logo_paths:
        raise SystemExit("No logo files found in 'logos' directory.")
    brands = [p.stem for p in logo_paths]

    vecs = embed_images(logo_paths, model, proc)
    d = vecs.shape[1]
    index = faiss.IndexFlatIP(d); index.add(vecs)

    faiss.write_index(index, str(OUT_DIR / "logos.faiss"))
    (OUT_DIR / "logos_meta.json").write_text(
        json.dumps({"brands": brands, "files": [str(p) for p in logo_paths]}, indent=2)
    )
    print(f"Indexed {len(brands)} brands into {OUT_DIR/'logos.faiss'}")

if __name__ == "__main__":
    main()

SigLIP-2 provides get_image_features and NaFlex dynamic resizing to minimize distortion on non-square inputs — useful for narrow/wide logos.

Run the tool:

$ python build_logo_index.py
...
Indexed 1 brands into artifacts/logos.faiss
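
Before moving on, it's worth sanity-checking the index by querying it with the reference logo image itself; the top hit should be the same brand at a cosine score close to 1.0. A minimal sketch reusing the helpers from build_logo_index.py:

import json
from pathlib import Path

import faiss
import torch
from transformers import AutoModel, AutoProcessor

from build_logo_index import MODEL_ID, embed_images  # reuse the helpers defined above

index = faiss.read_index("artifacts/logos.faiss")
meta = json.loads(Path("artifacts/logos_meta.json").read_text())

model = AutoModel.from_pretrained(MODEL_ID, dtype=torch.float32, device_map="auto")
proc = AutoProcessor.from_pretrained(MODEL_ID)

# Query with the first reference logo; it should match itself at cosine ~1.0.
vec = embed_images([meta["files"][0]], model, proc)
scores, ids = index.search(vec, 1)
print(meta["brands"][ids[0][0]], float(scores[0][0]))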


4) Generate crops + Stage-1 retrieval

Save the following to retrieve.py:

import json, math, os, faiss, torch
from pathlib import Path
from PIL import Image, ImageDraw
import numpy as np
from transformers import AutoModel, AutoProcessor

EMB_ID = "google/siglip2-base-patch16-naflex"
ART = Path("artifacts"); ART.mkdir(exist_ok=True)

SIZES = [192, 256, 320]  # square windows
STRIDE = 0.5             # 50% overlap
TOPK = int(os.environ.get("TOPK", "3"))
COSINE_TH = float(os.environ.get("COSINE_TH", "0.7"))  # keep candidates above this

def _metric_type(index):
    try:
        return index.metric_type
    except Exception:
        return None

def _l2_to_cosine(d: np.ndarray) -> np.ndarray:
    # If vectors are L2-normalized, cos = 1 - 0.5 * ||a - b||^2
    return 1.0 - 0.5 * d

def _as_float32(x: np.ndarray) -> np.ndarray:
    if x.dtype != np.float32:
        x = x.astype(np.float32, copy=False)
    return np.ascontiguousarray(x)

def grid_crops(im: Image.Image):
    W, H = im.size
    for s in SIZES:
        step = max(1, int(s * STRIDE))
        for y in range(0, max(1, H - s + 1), step):
            for x in range(0, max(1, W - s + 1), step):
                yield (x, y, s, s)

def embed(model, proc, pil_list, bs=24):
    out = []
    for i in range(0, len(pil_list), bs):
        chunk = pil_list[i:i+bs]
        inp = proc(images=chunk, return_tensors="pt")
        inp = {k: v.to(model.device) for k, v in inp.items()}
        with torch.no_grad():
            f = model.get_image_features(**inp)
            f = torch.nn.functional.normalize(f, dim=-1)
        out.append(f.detach().to(torch.float32).cpu())
    vecs = torch.cat(out, dim=0).numpy()
    return _as_float32(vecs)

def main():
    # load logo index
    index = faiss.read_index(str(ART / "logos.faiss"))
    meta = json.loads((ART / "logos_meta.json").read_text())
    brands = meta["brands"]
    try:
        print(f"[retrieve] index ntotal={index.ntotal}, brands={len(brands)}")
    except Exception:
        pass

    # load embedder
    model = AutoModel.from_pretrained(EMB_ID, dtype=torch.float32, device_map="auto")
    proc = AutoProcessor.from_pretrained(EMB_ID)

    frames = sorted(Path("frames").glob("f_*.jpg"))
    print(f"[retrieve] frames found={len(frames)}")

    out = []
    dump_all = bool(int(os.environ.get("DEBUG_DUMP_ALL", "0")))
    dbg_all = [] if dump_all else None
    debug_draw = dump_all and bool(int(os.environ.get("DEBUG_DRAW", "0")))
    debug_draw_dir = ART / "debug_vis"
    if debug_draw:
        debug_draw_dir.mkdir(parents=True, exist_ok=True)
    debug_draw_th = float(os.environ.get("DEBUG_DRAW_TH", "-1.0"))
    debug_draw_max = int(os.environ.get("DEBUG_DRAW_MAX", "0"))  # 0 = unlimited

    best = {"score": -1.0, "frame": None, "brand": None, "bbox": None}

    for fpath in frames:
        im = Image.open(fpath).convert("RGB")
        boxes, crops = [], []
        for (x, y, w, h) in grid_crops(im):
            boxes.append((x, y, w, h))
            crops.append(im.crop((x, y, x+w, y+h)))
        if not crops:
            continue

        vecs = embed(model, proc, crops)
        D, I = index.search(_as_float32(vecs), TOPK)

        mt = _metric_type(index)
        if mt == getattr(faiss, "METRIC_L2", 1):
            scores_mat = _l2_to_cosine(D)
        else:
            scores_mat = D

        # collect frame-local debug matches as well
        frame_dbg = [] if dump_all else None

        for i, (scores, ids) in enumerate(zip(scores_mat, I)):
            if dump_all:
                for r, (score, idx) in enumerate(zip(scores.tolist(), ids.tolist())):
                    rec = {
                        "frame": fpath.name,
                        "bbox": boxes[i],
                        "rank": int(r),
                        "score": float(score),
                        "brand": brands[idx]
                    }
                    dbg_all.append(rec)
                    if frame_dbg is not None:
                        frame_dbg.append(rec)
            for score, idx in zip(scores, ids):
                if score > best["score"]:
                    best.update({
                        "score": float(score),
                        "frame": fpath.name,
                        "brand": brands[idx],
                        "bbox": boxes[i]
                    })
                if score < COSINE_TH:
                    continue
                bx = boxes[i]
                out.append({
                    "frame": fpath.name,
                    "bbox": bx,
                    "score_retr": float(score),
                    "brand": brands[idx]
                })

        # draw annotations for this frame if requested
        if debug_draw and frame_dbg:
            # sort by score desc
            frame_dbg.sort(key=lambda r: r["score"], reverse=True)
            if debug_draw_max > 0:
                frame_dbg = frame_dbg[:debug_draw_max]
            canvas = im.copy()
            draw = ImageDraw.Draw(canvas)
            for rec in frame_dbg:
                if rec["score"] < debug_draw_th:
                    continue
                x, y, w, h = rec["bbox"]
                x2, y2 = x + w, y + h
                color = (255, 0, 0)
                draw.rectangle([x, y, x2, y2], outline=color, width=2)
                label = f"{rec['brand']} {rec['score']:.3f}#{rec['rank']}"
                # simple text; if background needed, draw a small filled box then text
                draw.text((x + 3, y + 3), label, fill=color)
            out_path = debug_draw_dir / f"{fpath.stem}_debug.jpg"
            canvas.save(out_path, quality=90)
            print(f"[retrieve] wrote debug visualization → {out_path}")

    Path("candidates.jsonl").write_text("\n".join(json.dumps(x) for x in out))
    print(f"wrote {len(out)} retrieval candidates → candidates.jsonl")
    if not out and best["frame"] is not None:
        print(f"[retrieve] no candidates above threshold {COSINE_TH}. "
              f"Best observed: score={best['score']:.3f}, frame={best['frame']}, "
              f"brand={best['brand']}, bbox={best['bbox']} — consider lowering COSINE_TH.")
    if dump_all and dbg_all is not None:
        Path("debug_matches.jsonl").write_text("\n".join(json.dumps(x) for x in dbg_all))
        print(f"[retrieve] wrote {len(dbg_all)} raw matches → debug_matches.jsonl (DEBUG_DUMP_ALL=1)")

if __name__ == "__main__":
    main()

This script does quite a few things, so let's break it down:

  • Crops: multi-scale grid over each frame using SIZES and STRIDE, yielding square patches and their bbox tuples.
  • Embeddings: SigLIP‑2 get_image_features on each crop batch, then L2‑normalize features.
  • Search: FAISS over the logo index; uses inner‑product on normalized vectors (cosine). If the index is L2, we convert to cosine (1 − 0.5·L2²); there's a quick numeric check of this identity after the list.
  • Types: FAISS expects contiguous float32; embeddings are cast/contiguous before index.search.
  • Thresholds: keep top‑TOPK per crop, then filter by COSINE_TH. You can override at runtime:
    • TOPK=10 COSINE_TH=0.65 python retrieve.py
  • Debugging:
    • DEBUG_DUMP_ALL=1 writes every raw match to debug_matches.jsonl (ranked with scores).
    • DEBUG_DRAW=1 also saves artifacts/debug_vis/*_debug.jpg with [brand score#rank] boxes.
    • Optional: DEBUG_DRAW_TH=0.3 (only draw ≥ threshold), DEBUG_DRAW_MAX=200 (cap boxes).
  • Outputs: filtered candidates land in candidates.jsonl and are fed to the verifier stage.
  • Knobs to tune recall/precision: crop sizes, stride, TOPK, COSINE_TH. For higher recall, increase sizes or TOPK; for precision, raise the threshold and later add a margin or temporal smoothing.
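
Quick numeric check of the L2-to-cosine conversion mentioned above: for unit vectors a and b, ||a − b||² = 2 − 2·cos(a, b), so cos = 1 − 0.5·||a − b||².

import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=768).astype(np.float32); a /= np.linalg.norm(a)
b = rng.normal(size=768).astype(np.float32); b /= np.linalg.norm(b)

cos_direct = float(a @ b)
cos_from_l2 = 1.0 - 0.5 * float(np.sum((a - b) ** 2))
print(cos_direct, cos_from_l2)  # the two values agree up to float32 rounding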

Output:

$ python retrieve.py
[retrieve] index ntotal=1, brands=1
[retrieve] frames found=88
wrote 25 retrieval candidates → candidates.jsonl

Depending on the source material, this probably needs tuning for recall/precision, and the debugging knobs come in handy for that. Here's what it looks like on frame #7 of the video:

Example frame with SigLIP-2 retrieval overlay: Red Bull logo detected at cosine 0.70 (rank #0) on the car sidepod

Frame from Red Bull Racing Pit Stop Practice (2015), ProtoplasmaKid / Wikimedia Commons / CC BY-SA 4.0.

5) VLM verification

Save the following to verify.py:

import json, torch, os
from pathlib import Path
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

VLM_ID = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"

SYSTEM = (
    "You are a logo verification API. "
    "Given an image crop and a target brand, answer in strict JSON with no extra text."
)

def build_user_prompt(brand: str):
    return (
        "Task: Verify whether the crop contains the specified brand's logo.\n"
        f"Brand: {brand}\n\n"
        "Return JSON only:\n"
        "{\n"
        ' "verdict": "yes" | "no",\n'
        ' "confidence": 0.0-1.0,\n'
        ' "visual_cues": "short, literal cues proving the verdict (colors/shapes/text)"\n'
        "}\n"
        "Rules: output only the JSON object; no prose before or after. "
        "Be literal; do not speculate; base confidence on how clearly the logo is visible."
    )

def _normalize_quotes(s: str) -> str:
    # Replace smart quotes with ASCII equivalents
    return (
        s.replace("\u201c", '"').replace("\u201d", '"')
         .replace("\u2018", "'").replace("\u2019", "'")
    )

def _extract_json_object(text: str):
    text = _normalize_quotes(text)
    # Find first balanced {...} block; handle braces within strings
    in_str = False
    escape = False
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "\\" and not escape:
            escape = True
            continue
        if ch == '"' and not escape:
            in_str = not in_str
        escape = False
        if in_str:
            continue
        if ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}":
            if depth > 0:
                depth -= 1
                if depth == 0 and start is not None:
                    candidate = text[start:i+1]
                    try:
                        j = json.loads(candidate)
                        if isinstance(j, dict) and "verdict" in j and "confidence" in j:
                            return j
                    except Exception:
                        pass
    return None

def run_once(proc, model, crop: Image.Image, brand: str, max_new=128):
    msgs = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": [{"type": "image"},
                                     {"type": "text", "text": build_user_prompt(brand)}]}
    ]
    text = proc.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    inputs = proc(text=[text], images=[crop], padding=True, return_tensors="pt")
    inputs = {k: (v.to(model.device) if hasattr(v, "to") else v) for k, v in inputs.items()}
    with torch.inference_mode():
        ids = model.generate(**inputs, max_new_tokens=max_new, temperature=0.0, do_sample=False)
    out = proc.batch_decode(ids, skip_special_tokens=True)[0]

    j = _extract_json_object(out)
    if j is None:
        if os.environ.get("DEBUG_VLM", "0") == "1":
            Path("vlm_raw.txt").write_text(out)
        return {"verdict": "no", "confidence": 0.0, "visual_cues": "parse_error"}

    # Coerce fields
    verdict = str(j.get("verdict", "no")).strip().lower()
    if verdict not in ("yes", "no"):
        verdict = "no"
    try:
        conf = float(j.get("confidence", 0.0))
    except Exception:
        conf = 0.0
    cues = str(j.get("visual_cues", ""))
    return {"verdict": verdict, "confidence": conf, "visual_cues": cues}

def main():
    proc = AutoProcessor.from_pretrained(VLM_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        VLM_ID, dtype=torch.float32, device_map="auto", trust_remote_code=True
    )

    results = []
    for line in Path("candidates.jsonl").read_text().splitlines():
        c = json.loads(line)
        im = Image.open(Path("frames") / c["frame"]).convert("RGB")
        x, y, w, h = c["bbox"]
        crop = im.crop((x, y, x+w, y+h))
        j = run_once(proc, model, crop, c["brand"])
        results.append({
            **c,
            "verdict": j.get("verdict", "no"),
            "confidence_vlm": float(j.get("confidence", 0.0)),
            "rationale": j.get("visual_cues", "")
        })

    Path("detections.jsonl").write_text("\n".join(json.dumps(r, ensure_ascii=False) for r in results))
    print("wrote detections.jsonl")

if __name__ == "__main__":
    main()

Again, breaking it down:

  • Inputs: candidates.jsonl entries with frame, bbox, score_retr, brand.
  • Crop: open the frame and im.crop(bbox) per candidate.
  • Prompting: one system + one user message; user contains the target brand and an explicit "JSON only" schema.
  • Generation: temperature=0.0, do_sample=False, max_new_tokens=128 for deterministic outputs.
  • Parsing: balanced-brace JSON extraction with smart-quote normalization; on failure it returns parse_error. Set DEBUG_VLM=1 to dump the raw text to vlm_raw.txt. There's a tiny parsing example after this list.
  • Output: writes detections.jsonl with verdict, confidence_vlm, rationale merged onto each candidate.
  • Cost control: keep retrieval strict (lower TOPK, higher COSINE_TH) to limit VLM calls as that's the main runtime driver.
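
To see why the balanced-brace parser is worth the extra code, here's a tiny usage example (it imports the helper from verify.py above; the module-level code there only defines constants and functions, so the import is cheap):

from verify import _extract_json_object

noisy = (
    'Sure! Here is the result: '
    '{"verdict": "yes", "confidence": 0.9, "visual_cues": "red bull silhouette"} '
    'Hope this helps.'
)
print(_extract_json_object(noisy))  # parsed dict despite the surrounding prose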

Output:

$ python verify.py
wrote detections.jsonl

And what does detections.jsonl look like? It has items like this (one per line):

{ "frame": "f_000007.jpg", "bbox": [ 384, 288, 192, 192 ], "score_retr": 0.7020304203033447, "brand": "Red_Bull_logo", "verdict": "yes", "confidence_vlm": 0.95, "rationale": "red bull charging bull silhouette, red and yellow colors" } 

Here 0.702 is the cosine similarity between the crop embedding and the reference logo embedding.
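
To get the "accept only high-confidence yes" result from the TL;DR, a small post-filter over detections.jsonl is enough. A minimal sketch; TH_VLM is an assumed threshold name (the same knob mentioned in the conclusion), not something verify.py defines:

import json
from pathlib import Path

TH_VLM = 0.8  # assumed confidence cutoff; tune per brand

accepted = []
for line in Path("detections.jsonl").read_text().splitlines():
    det = json.loads(line)
    if det["verdict"] == "yes" and det["confidence_vlm"] >= TH_VLM:
        accepted.append(det)

Path("accepted.jsonl").write_text("\n".join(json.dumps(d, ensure_ascii=False) for d in accepted))
print(f"kept {len(accepted)} high-confidence detections → accepted.jsonl")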

Conclusion

I hope this showcases how a simple two-stage recipe - SigLIP-2 retrieval + VLM verification - can turn a semi-noisy video into reviewable brand evidence. I do want to stress that a pipeline like this is meant as a powerful filter, not an oracle; a human in the loop is still needed.

On this specific 44-second clip, with the thresholds above, I get N true positives, M false positives, and 0 missed clear logos (subjective visual check).

That being said, this is probably not a production-ready detector. We trade some accuracy and runtime for simplicity and transparency. Some improvement ideas I had in mind:

  • float32 dtype could be replaced with lower precision, like bfloat16.
  • Add a margin filter (top1 − top2 ≥ 0.15) and simple temporal smoothing: confirm across 2–3 adjacent frames. A sketch of the smoothing step follows this list.
  • Add a lightweight OCR gate for texty marks to backstop retrieval.
  • Calibrate thresholds per brand, as some logos need higher COSINE_TH or TH_VLM.
  • Expand the logo dictionary and try multi-resolution templates (flat vs curved surfaces).
  • Speed/scale: use FAISS IVF/PQ for larger dictionaries.
  • Quantize the VLM or batch crops.
  • Maybe consider a second encoder for consensus (SigLIP‑2 + CLIP) to reduce lookalikes.
  • For 4K frames, consider either down-scaling first or increasing STRIDE (e.g. 0.75) to avoid generating tens of thousands of crops per frame.
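
Here's a minimal temporal-smoothing sketch over detections.jsonl: keep a "yes" detection only if the same brand was also detected in an adjacent sampled frame. It assumes the frame naming scheme from the ffmpeg step (f_000007.jpg → index 7). A margin filter (top1 − top2 ≥ 0.15) would additionally require retrieve.py to keep the top-2 scores per crop, so it's not shown here.

import json
import re
from collections import defaultdict
from pathlib import Path

def frame_idx(name: str) -> int:
    # "f_000007.jpg" -> 7
    return int(re.search(r"(\d+)", name).group(1))

dets = [json.loads(l) for l in Path("detections.jsonl").read_text().splitlines()]
dets = [d for d in dets if d["verdict"] == "yes"]

seen = defaultdict(set)  # brand -> frame indices with a "yes" verdict
for d in dets:
    seen[d["brand"]].add(frame_idx(d["frame"]))

smoothed = [
    d for d in dets
    if any(frame_idx(d["frame"]) + off in seen[d["brand"]] for off in (-1, 1))
]
print(f"{len(smoothed)} of {len(dets)} detections confirmed by an adjacent frame")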

As always, credits: thanks to the SigLIP-2 authors & maintainers and the LLaVA-OneVision team for open releases. And attribution for the example video: ProtoplasmaKid / Wikimedia Commons / CC BY-SA 4.0. 
