
Strike the Pose

May 2026·3 min read

Branded short-video generator. The headline win: a vision-LLM gate lifts the usable-output rate from 63% to 88%.

One free-form prompt + a brand logo → a 4-second clip of a model walking in branded sneakers. The shipping pipeline decouples logo placement from scene composition, and a vision-LLM likeness gate retries the compose step until the logo reads.
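The gate-and-retry behaviour can be sketched as a bounded loop. This is a minimal illustration, not the project's code: `compose_with_gate`, `GateResult`, and the two callables are hypothetical names, and the cap of three attempts is an assumption. In a real pipeline `compose` would call the image-edit model and `gate` would call the vision LLM.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class GateResult:
    image: Any      # last composed image (accepted or not)
    attempts: int   # how many compose calls were made
    passed: bool    # True if the gate accepted the image


def compose_with_gate(
    compose: Callable[[int], Any],   # hypothetical: attempt number -> image
    gate: Callable[[Any], bool],     # hypothetical: image -> "logo reads?"
    max_attempts: int = 3,           # assumed retry budget
) -> GateResult:
    """Re-run the compose step until the gate passes or the budget is spent."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    image = None
    for attempt in range(1, max_attempts + 1):
        image = compose(attempt)     # e.g. a fresh seed per attempt
        if gate(image):
            return GateResult(image=image, attempts=attempt, passed=True)
    # Budget exhausted: return the last image and let the caller decide.
    return GateResult(image=image, attempts=max_attempts, passed=False)
```

Returning the failed result instead of raising leaves the policy (drop the clip, flag it for review) to the caller.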

navy canvas sneakers, misty harbor at dawn, VHS grain
cream low-top in a minimalist studio, B&W film
all-black runners on wet pavement neon
olive suede runner on a forest trail

Product shot


navy canvas shoe with arcade logo on lateral panel


Lands reliably. Logo likeness, shoe silhouette/colourway, canvas + leather materials. A VLM likeness gate after compose catches the off-likeness logos before they leave the stage.

Doesn't yet. Multi-logo composition, explicit texture control, explicit placement directives ("logo on heel"), suede + complex materials.

Decoupling logo placement from scene composition is the structural win. Earlier iterations tried to do both in one shot and the logo was always the first thing to break.

Scene seed


model wearing the navy canvas shoes on a misty harbor dock


Lands reliably. Scene background, setting, and lighting. The rewriter routes demographic and pose hints into the right prompt slot.

Doesn't yet. Chained branding — putting the logo on a billboard and on the shoe simultaneously breaks both. The gate only watches one surface today; multi-surface gating is the next architectural step.

Video motion


4-second walk video preview
4 s @ 16 fps, 1280×720


Lands reliably. Cross-frame identity, image sharpness, low motion blur.

Doesn't yet. Mid-clip pacing wobbles, hallucinated subtitle overlays (the VLM catches large, obvious overlays but still misses small corner text), camera movement beyond a single subtle push, and aesthetic filters at full strength.

Wan is a motion model, not a style model. Aesthetic directives land best when injected at the still-image step and carried forward.
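One way to picture "inject at the still-image step and carry forward" is to keep style out of the video prompt entirely. The sketch below is an assumption about how the slots could be wired, not the project's rewriter: `PromptSlots`, `stage_prompts`, and the example motion text are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PromptSlots:
    scene: str    # what the seed still should show
    style: str    # aesthetic directive, e.g. "VHS grain"
    motion: str   # what should move in the clip


def stage_prompts(slots: PromptSlots) -> Dict[str, str]:
    """Attach style to the still-image prompt only; the video prompt stays motion-only."""
    still = f"{slots.scene}, {slots.style}" if slots.style else slots.scene
    return {
        "scene_still": still,   # style lands here and rides along in the seed frame
        "video": slots.motion,  # no style text for the motion model
    }


prompts = stage_prompts(
    PromptSlots(
        scene="model wearing the navy canvas shoes on a misty harbor dock",
        style="VHS grain",
        motion="model walks toward the camera",  # hypothetical motion text
    )
)
print(prompts["scene_still"])
print(prompts["video"])
```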

Experiments

Two write-ups from the lab notebook. Each isolates a single load-bearing question.

Report | Headline finding
Likeness-retry gate | VLM gate cuts the logo-failure rate from 50% to 12%. Three retries capture almost all of the win.
Setting an aesthetic | Style directives are invisible to the motion model, visible at the still-image step, and destructive in a separate restyle pass. Stronger paths: style LoRA, video-to-video filter.
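A back-of-envelope model helps show why a handful of retries goes a long way. It assumes each compose attempt fails independently with the same probability, which the write-up does not claim and which real pipelines rarely satisfy exactly (a hard prompt tends to fail repeatedly). `residual_failure` is a hypothetical helper, not part of the project.

```python
def residual_failure(p_fail: float, attempts: int) -> float:
    """Probability that every one of `attempts` independent tries fails."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be in [0, 1]")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return p_fail ** attempts


# Per-attempt failure rate of 50%, taken from the ungated figure above.
for n in range(1, 5):
    print(n, residual_failure(0.5, n))
# 1 0.5
# 2 0.25
# 3 0.125
# 4 0.0625
```

Under this idealised model, three attempts leave 12.5% failures, in the neighbourhood of the reported 12%, and each further attempt removes a smaller absolute share. Whether the report counts attempts this way is not stated.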

Next

  1. Multi-surface logo gating — likeness checks on every chained brand surface, not just the shoe.
  2. OCR pre-check on frames — the VLM misses corner text; OCR will not.
  3. Video-to-video stylization so aesthetic directives actually land at strength.
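The OCR pre-check in item 2 could reduce to a geometric test once text boxes are available. The sketch below assumes some OCR engine has already returned pixel bounding boxes for a frame. No specific OCR library is implied, and `has_corner_text` and the 20% margin are hypothetical choices. The default frame size matches the 1280×720 output above.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # x, y, width, height in pixels


def has_corner_text(
    boxes: Iterable[Box],
    frame_w: int = 1280,
    frame_h: int = 720,
    margin: float = 0.2,   # assumed: a corner is the outer 20% on both axes
) -> bool:
    """Return True if any text box is centred inside one of the four corner regions."""
    mx, my = frame_w * margin, frame_h * margin
    for x, y, w, h in boxes:
        cx, cy = x + w / 2, y + h / 2
        near_left_or_right = cx < mx or cx > frame_w - mx
        near_top_or_bottom = cy < my or cy > frame_h - my
        if near_left_or_right and near_top_or_bottom:
            return True
    return False


print(has_corner_text([(1150, 680, 100, 30)]))  # box in the bottom-right corner
print(has_corner_text([(600, 340, 80, 40)]))    # box near the frame centre
```

A frame flagged this way could be sent back through the same retry path as a failed likeness check.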

Stack: Python · FLUX-dev · Qwen-Image-Edit-2511 · Wan 2.2 14B I2V · gpt-4o-mini (rewriter + likeness gate) · ComfyUI · OWLv2

Repo · Architecture · Deployment

  • video-gen
  • diffusion
  • eval