DEV Community

code performance


Veo3: a more complete and (hypothetical?) JSON prompt

The exact, official, and full schema for Veo3's JSON input has not been published in a single, definitive document, since Veo3 is a rapidly evolving Google model. Even so, we can infer a lot from its capabilities, from what has been shown in demos, and from patterns common to other advanced generative AI models.

Here's a more complete and hypothetical JSON prompt, building on the "Ethiopian man in Kyoto" scenario, incorporating various aspects you might want to control with Veo3.

Important Note: This is an illustrative example based on current understanding of advanced video generation models. The actual parameter names and nested structures in Veo3's API might vary.

{
  "model_id": "veo-3.0-generate-preview",
  "seed": 12345, # Optional: For reproducibility. Use a different seed for variations.
  "prompt": "A traditional Ethiopian man walking the vibrant streets of Kyoto, Japan, creating a beautiful cultural juxtaposition.",
  "negative_prompt": "blurry, low-resolution, cartoonish, unnatural movements, distorted faces, fast cuts, shaky camera, out of focus, modern Japanese attire on the Ethiopian man, non-traditional buildings, excessive tourists, rain, snow, night, neon signs, graffiti",
  "output_config": {
    "duration_seconds": 15,
    "aspect_ratio": "16:9", # Options: "1:1", "9:16", "16:9"
    "resolution": "1080p", # Options: "720p", "1080p", "4K" (if supported)
    "frame_rate": 24, # Common options: 24, 30, 60
    "generate_audio": true,
    "audio_config": {
      "ambient_sound": "gentle rustling of kimonos, distant temple bells, soft murmur of Japanese chatter, occasional bicycle bell",
      "music": {
        "genre": "subtle, peaceful Japanese traditional music (koto, shakuhachi)",
        "intensity": "low",
        "fade_in_duration": 1,
        "fade_out_duration": 1
      },
      "dialogue": null # No dialogue for this scene
    }
  },
  "scene_composition": {
    "time_of_day": "late morning",
    "lighting_mood": "soft, warm daylight, diffused sunlight filtering through trees and traditional architecture",
    "atmosphere": "serene, respectful, culturally rich, slightly nostalgic",
    "weather": "clear, pleasant, light breeze",
    "environment_details": [
      "traditional wooden machiya houses with intricate details",
      "narrow cobblestone pathways",
      "small, manicured gardens visible behind walls",
      "occasional red or orange torii gates in the distance",
      "subtle steam rising from a nearby tea house (optional)",
      "cherry blossom petals gently falling (if season implied)"
    ],
    "elements_to_avoid": [
      "power lines",
      "modern signage",
      "overtly Western elements",
      "large vehicles"
    ]
  },
  "character_details": {
    "main_character": {
      "description": "An elderly Ethiopian man, dignified and wise, with a gentle smile and observant eyes. He wears a meticulously draped white gabi (traditional Ethiopian shawl) and a matching kofia (head covering). His posture is upright and calm.",
      "action": "walking at a leisurely pace, occasionally pausing to observe details (e.g., a garden, a shop window), taking a slow, deliberate sip from a small teacup (if a teahouse is implied), turning his head slowly to take in the surroundings.",
      "emotion": "curiosity, peaceful contemplation, quiet wonder",
      "clothing_details": "authentic, high-quality white gabi with traditional embroidery, clean white kofia, simple leather sandals"
    },
    "background_characters": {
      "density": "sparse to moderate", # Options: "empty", "sparse", "moderate", "dense"
      "description": "local Japanese people in traditional and modern attire, going about their daily lives respectfully. No direct interaction with the main character, but a sense of shared space.",
      "actions": "walking, cycling, quietly observing, tending to shops"
    }
  },
  "camera_and_motion": {
    "primary_shot_type": "medium tracking shot",
    "camera_movement": {
      "type": "smooth Steadicam",
      "direction": "following the man from slightly behind and to his side",
      "speed": "slow, contemplative",
      "pan": "subtle pans to reveal new architectural details",
      "tilt": "minimal, perhaps slight upward tilts to capture rooflines"
    },
    "secondary_shots": [
      {
        "shot_type": "close-up",
        "subject": "intricate embroidery on the gabi",
        "duration_seconds": 2
      },
      {
        "shot_type": "wide shot",
        "subject": "the man framed against a backdrop of traditional Kyoto street, emphasizing contrast",
        "duration_seconds": 3,
        "camera_movement": "slow dolly out"
      }
    ],
    "lens_type": "50mm prime lens (for a classic, slightly compressed look)",
    "film_grain": "subtle, organic film grain for a timeless feel",
    "depth_of_field": "shallow depth of field, with the man in sharp focus and background gently blurred"
  },
  "post_production_style": {
    "color_grading": "warm, natural tones, slightly desaturated to evoke a classic film aesthetic",
    "visual_effects": "minimal, realistic dust motes in sunbeams (if applicable), no CGI elements",
    "transition_style": "seamless cuts, slow dissolves between key shots"
  }
}
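The `#` annotations in the example above are not valid JSON, so a standard parser would reject the prompt as written. Here is a minimal Python sketch of one way to strip them before parsing. It assumes each comment runs to the end of its line (so the prompt must be laid out one key per line), and the helper name `strip_hash_comments` and the sample string `"Street #1 in Kyoto"` are hypothetical, chosen only to show that a `#` inside a string value is left alone.

```python
import json


def strip_hash_comments(text: str) -> str:
    """Remove `#` comments that sit outside JSON string literals."""
    out = []
    in_string = False   # currently inside a "..." literal
    escaped = False     # previous character was a backslash inside a string
    in_comment = False  # currently skipping a comment until end of line
    for ch in text:
        if in_comment:
            if ch == "\n":
                in_comment = False
                out.append(ch)
            continue
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
            out.append(ch)
        elif ch == "#":
            in_comment = True
        else:
            out.append(ch)
    return "".join(out)


# A shortened, annotated prompt in the same style as the article's example.
annotated = '''
{
  "model_id": "veo-3.0-generate-preview",
  "seed": 12345,  # Optional: For reproducibility.
  "prompt": "Street #1 in Kyoto",
  "output_config": {
    "aspect_ratio": "16:9",  # Options: "1:1", "9:16", "16:9"
    "frame_rate": 24
  }
}
'''

prompt = json.loads(strip_hash_comments(annotated))
print(prompt["output_config"]["aspect_ratio"])
```

Keeping the annotations in the authoring copy and stripping them just before serialization lets the notes live next to the fields they describe.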

Breakdown of the Comprehensive JSON Structure:

  1. model_id: Specifies which version of Veo you want to use.
  2. seed: A numerical seed for reproducibility. The same seed with the same prompt should ideally produce very similar results.
  3. prompt: The core natural language description of the scene.
  4. negative_prompt: Crucial for guiding the AI on what not to include, or what visual artifacts to avoid.
  5. output_config: Controls the technical aspects of the video output.
    • duration_seconds: How long the video should be.
    • aspect_ratio: Video dimensions (e.g., cinematic, vertical for social).
    • resolution: Quality of the output.
    • frame_rate: Smoothness of motion.
    • generate_audio: Boolean to enable/disable audio.
    • audio_config: Detailed control over sound.
      • ambient_sound: Background noises.
      • music: Type, intensity, and fade.
      • dialogue: If characters speak, what they say.
  6. scene_composition: Defines the environment and overall mood.
    • time_of_day: Lighting conditions based on time.
    • lighting_mood: Specific qualities of light (soft, harsh, golden).
    • atmosphere: The general feeling or emotion of the scene.
    • weather: Specific weather conditions.
    • environment_details: Specific elements that should be present in the setting.
    • elements_to_avoid: Specific things not to include in the environment.
  7. character_details: In-depth description of subjects.
    • main_character:
      • description: Physical appearance, age, general demeanor.
      • action: What the character is doing.
      • emotion: The feeling the character conveys.
      • clothing_details: Specifics about their attire.
    • background_characters:
      • density: How many people in the background.
      • description: What they look like, if relevant.
      • actions: What they are doing.
  8. camera_and_motion: Dictates cinematography.
    • primary_shot_type: Main type of shot (e.g., medium, wide, close-up).
    • camera_movement:
      • type: How the camera moves (Steadicam, handheld, dolly, crane).
      • direction: The path of movement.
      • speed: Pace of movement.
      • pan, tilt: Specific camera controls (a zoom field could be added in the same way; the example does not set one).
    • secondary_shots: For more complex videos, you might define a sequence of shots.
    • lens_type: Simulates different camera lenses.
    • film_grain: Adds a specific visual texture.
    • depth_of_field: Controls focus.
  9. post_production_style: Affects the final visual treatment.
    • color_grading: Overall color aesthetic.
    • visual_effects: Any added effects (e.g., lens flares, mist).
    • transition_style: How cuts between scenes appear.

This comprehensive JSON structure is meant to give you a high degree of control: you act as a virtual director and cinematographer, specifying nearly every element of your desired video.
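Since the schema is hypothetical, it can help to check a prompt against the option lists noted in the example before sending it anywhere. The sketch below validates only the `output_config` block. The allowed values are taken from the comments in the example JSON, not from any official Veo3 documentation, and the function name `validate_output_config` is an assumption made for illustration.

```python
# Option lists copied from the comments in the example prompt (not official).
ALLOWED_ASPECT_RATIOS = {"1:1", "9:16", "16:9"}
ALLOWED_RESOLUTIONS = {"720p", "1080p", "4K"}
COMMON_FRAME_RATES = {24, 30, 60}


def validate_output_config(cfg: dict) -> list:
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    if cfg.get("aspect_ratio") not in ALLOWED_ASPECT_RATIOS:
        problems.append(f"unexpected aspect_ratio: {cfg.get('aspect_ratio')!r}")
    if cfg.get("resolution") not in ALLOWED_RESOLUTIONS:
        problems.append(f"unexpected resolution: {cfg.get('resolution')!r}")
    if cfg.get("frame_rate") not in COMMON_FRAME_RATES:
        problems.append(f"unexpected frame_rate: {cfg.get('frame_rate')!r}")
    duration = cfg.get("duration_seconds")
    if isinstance(duration, bool) or not isinstance(duration, (int, float)) or duration <= 0:
        problems.append("duration_seconds must be a positive number")
    # audio_config only makes sense when audio generation is switched on.
    if cfg.get("audio_config") and not cfg.get("generate_audio", False):
        problems.append("audio_config is present but generate_audio is not true")
    return problems


good = {
    "duration_seconds": 15,
    "aspect_ratio": "16:9",
    "resolution": "1080p",
    "frame_rate": 24,
    "generate_audio": True,
    "audio_config": {"dialogue": None},
}
bad = dict(good, aspect_ratio="4:3", generate_audio=False)

print(validate_output_config(good))
print(validate_output_config(bad))
```

Returning a list of messages instead of raising on the first issue makes it easier to fix several fields in one pass.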

Another Example:

(Inspiration for this comes from a post by @IamEmily2050 on X)

Based on publicly available information about Veo3's capabilities (native audio, improved prompt adherence, realism, character consistency, camera control, etc.) and common patterns in other generative AI APIs, here's another JSON prompt, tailored to what Veo3 is likely to respond well to.

{
  "model_id": "veo-3.0-generate", # Or "veo-3.0-fast-generate" for quicker, potentially lower res results
  "seed": 42069, # A different seed for a new generation, for unique variations
  "api_key": "YOUR_VEO_API_KEY", # Placeholder for your actual API key
  "global_settings": {
    "output_resolution": "1080p", # Options: "720p", "1080p", "4K" (if supported)
    "aspect_ratio": "16:9", # Options: "1:1", "9:16", "16:9"
    "frame_rate": 24, # Common options: 24, 30, 60
    "max_video_duration_seconds": 15, # Overall max duration for the complete video
    "enable_audio_generation": true
  },
  "character_definitions": {
    "FoxyBrown": {
      "gender": "female",
      "age": 27,
      "height_cm": 173,
      "build": "lean, athletic, swimmer’s shoulders",
      "skin_tone": "deep bronze with a subtle sun-kissed glow",
      "hair_style": "jet-black, shoulder-length, slicked straight back and dripping wet",
      "eye_color": "almond-shaped hazel with faint gold flecks",
      "unique_features": [
        "tiny star tattoo tucked behind her right ear",
        "gold stud in upper left helix"
      ],
      "personality_traits": [
        "playfully self-assured",
        "confident",
        "sarcastic (subtly implied through smirk)"
      ],
      "default_attire": {
        "description": "metallic-coral bikini, mirrored sunglasses, gold hoop earrings",
        "color_palette": ["metallic coral", "gold", "dark mirror tint"]
      },
      "facial_expression_preference": {
        "mouth_shape": "smirk",
        "eye_contact_directness": 0.7 # 0.0 to 1.0, 1.0 being direct eye contact
      }
    }
  },
  "video_segments": [
    {
      "segment_id": "S1_SplashCash_Intro",
      "description": "FoxyBrown at a rooftop infinity pool overlooking a neon-tropic city. She is leaning on the pool edge, exuding playful self-assurance. Water glistens on her skin and hair.",
      "duration_seconds": 8, # Duration for this specific segment
      "negative_prompt_segment": "unnatural water ripples, blurry reflections, stiff posture, dull colors",
      "scene_details": {
        "location_description": "Rooftop infinity pool, high above a vibrant, neon-lit city skyline (daytime view).",
        "time_of_day": "mid-day",
        "weather_conditions": "clear, sunny",
        "environment_elements": [
          "sunlit pool water with dynamic reflections and shifting patterns",
          "floating dollar-sign inflatables (subtly visible)",
          "modern, futuristic city architecture in the background"
        ],
        "lighting_conditions": {
          "type": "high-key",
          "intensity": "bright",
          "qualities": ["specular highlights on wet skin", "natural sunlight"]
        },
        "color_palette_override": ["hot-pink", "aqua", "tangerine", "electric blue", "bright yellow"],
        "atmosphere_mood": ["vibrant", "playful", "luxurious", "confident", "energetic"]
      },
      "camera_details": {
        "type": "smooth gimbal",
        "lens_focal_length_mm": 35,
        "shot_composition": "medium close-up",
        "camera_motion": {
          "type": "dolly-in",
          "distance_cm": 60,
          "speed": "slow"
        },
        "film_grain_intensity": 0.05, # 0.0 (no grain) to 1.0 (heavy grain)
        "depth_of_field": "shallow", # Options: "shallow", "medium", "deep"
        "cinematic_effects": ["subtle lens flare from the sun", "reflections on water"]
      },
      "character_actions": [
        {
          "character_ref": "FoxyBrown",
          "action_description": "leans confidently on the pool edge",
          "timing": "start of clip"
        },
        {
          "character_ref": "FoxyBrown",
          "action_description": "on beat four of the audio, she cheekily fans her hand towards the camera, causing water droplets to sparkle",
          "timing": "sync to audio beat 4"
        }
      ],
      "audio_track": {
        "main_element": "music",
        "music_details": {
          "genre": "trap-pop rap",
          "tempo_bpm": 145,
          "rhythm_style": "swung hats, prominent sub-bass",
          "lyrics": "Splash-cash, bling-blap—pool water pshh! Charts skrrt! like my wave, hot tropics whoosh!",
          "vocal_delivery": {
            "emotion": "confident, playful, tongue-in-cheek",
            "flow_style": "double-time for first bar, brief half-time tag",
            "voice_gender": "female",
            "voice_tone": "sassy, clear"
          },
          "instrumentation_hint": "synthesizers, 808s, crisp percussion"
        },
        "ambient_sound_details": {
          "description": "subtle splashing sounds, distant city hum, faint tropical birds",
          "volume_level": "low"
        },
        "sfx_details": [
          {"sound": "water pshh sound effect", "timing": "synchronized with hand fan action"},
          {"sound": "cash register 'cha-ching' (stylized)", "timing": "synchronized with 'bling-blap' lyric"}
        ]
        # "custom_audio_url": "https://example.com/custom_nyx_rap.wav", # Option for external audio
        # "audio_base64_encoded": "base64_string_of_audio_data" # Option for inline audio
      }
    }
    # You could add more clip objects here for a longer video with different scenes
    # {
    #   "segment_id": "S2_DiveIn",
    #   "description": "FoxyBrown dives into the pool...",
    #   "duration_seconds": 5,
    #   ...
    # }
  ]
}

Key Learning Points:

  1. model_id: Explicitly states which Veo model version to use. This is crucial for API calls.
  2. global_settings: Consolidates parameters that apply to the entire video, ensuring consistency.
    • Added max_video_duration_seconds at this level, since the duration_seconds field inside each segment sets only that segment's duration.
    • enable_audio_generation: A clear boolean for toggling audio.
  3. character_definitions: A top-level object that holds multiple named character profiles (e.g., FoxyBrown). This scales well as more characters are added.
    • Each character then has a character_ref in the character_actions section, allowing you to easily reuse defined characters across multiple clips without repeating their full profile.
    • Added gender and more specific personality_traits to help the model capture subtle nuances.
    • facial_expression_preference: Holds mouth_shape and eye_contact_directness, renamed slightly from mouth_shape_intensity and eye_contact_ratio to make their intent clearer. These are advanced controls and worth including.
  4. video_segments (instead of clips): Renamed to make clear that each entry is a distinct part of a larger video.
    • Each segment has its own negative_prompt_segment because what you want to avoid might be specific to that particular shot.
  5. Enhanced scene_details:
    • location_description: More detailed narrative for location.
    • weather_conditions: Explicitly stating weather.
    • lighting_conditions: Broke down lighting into type, intensity, and qualities for finer control.
    • color_palette_override: Sets the color palette for a specific segment, taking precedence over any palette defined at a higher level (this example defines no global palette).
    • atmosphere_mood: A list of descriptive words to convey the overall feeling.
  6. Detailed camera_details:
    • lens_focal_length_mm: Explicitly setting a focal length helps the AI understand the field of view and compression.
    • cinematic_effects: For things like lens flares, smoke, etc.
  7. character_actions: Made this an array of objects to allow multiple characters and complex action sequences within a single segment, with timing hints.
  8. Comprehensive audio_track:
    • Separated music_details, ambient_sound_details, and sfx_details for granular control.
    • Added vocal_delivery within music_details for rap/dialogue nuances (emotion, flow, voice type).
    • Included placeholders for custom_audio_url and audio_base64_encoded as common API patterns for uploading specific audio.
  9. Removed Redundancy (where practical): Reduced repeated character descriptions within each segment by using a character reference.
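The character_ref indirection described above only works if every reference points at a profile that actually exists in character_definitions. As a sketch, assuming the hypothetical schema from this article, the check below lists every reference that fails to resolve; the function name `unresolved_character_refs` and the misspelled `Foxy_Brown` reference are made up for the demonstration.

```python
def unresolved_character_refs(prompt: dict) -> list:
    """Return (segment_id, character_ref) pairs that have no matching definition."""
    defined = set(prompt.get("character_definitions", {}))
    missing = []
    for segment in prompt.get("video_segments", []):
        for action in segment.get("character_actions", []):
            ref = action.get("character_ref")
            if ref not in defined:
                missing.append((segment.get("segment_id", "?"), ref))
    return missing


prompt = {
    "character_definitions": {"FoxyBrown": {"age": 27}},
    "video_segments": [
        {
            "segment_id": "S1_SplashCash_Intro",
            "character_actions": [
                {"character_ref": "FoxyBrown", "timing": "start of clip"},
                # Deliberate typo to show what the check catches.
                {"character_ref": "Foxy_Brown", "timing": "sync to audio beat 4"},
            ],
        }
    ],
}

print(unresolved_character_refs(prompt))
```

Running a check like this before submitting a prompt catches naming drift early, which matters more as the number of segments and characters grows.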

This JSON is a more structured, API-focused prompt that aims to use the kind of control a model like Veo3 is designed to offer. It can serve as a blueprint for building complex AI-generated video narratives.
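One consistency rule implied by the structure is that per-segment duration_seconds values should fit within the global max_video_duration_seconds. The sketch below assumes the global value is a cap on the sum of all segment durations, which the example suggests but does not define. The function name `duration_report` is hypothetical, and the second segment uses the 5-second S2_DiveIn placeholder from the commented-out part of the example.

```python
def duration_report(prompt: dict) -> tuple:
    """Return (total_segment_seconds, budget_seconds, fits_within_budget)."""
    budget = prompt["global_settings"]["max_video_duration_seconds"]
    total = sum(seg["duration_seconds"] for seg in prompt["video_segments"])
    return total, budget, total <= budget


prompt = {
    "global_settings": {"max_video_duration_seconds": 15},
    "video_segments": [
        {"segment_id": "S1_SplashCash_Intro", "duration_seconds": 8},
        {"segment_id": "S2_DiveIn", "duration_seconds": 5},
    ],
}

total, budget, fits = duration_report(prompt)
print(f"{total}s of {budget}s used, fits: {fits}")
```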
