V4 Expressive Visual Avatars – Tech Specs

D-ID’s ultra-fast diffusion model powers real-time visual agents that combine human-like delivery with expressive emotional intelligence.

    • Conversational latency (end-to-end): < 500 ms
    • Model latency: < 120 ms
    • Rendering performance: 200+ FPS diffusion pipeline
    • Lip-sync accuracy: 5.7 LSE-D* (17% better than the closest competitor and 44% better than D-ID’s V3 avatar model)

      * LSE-D (lip-sync error distance) is a metric, computed using the SyncNet model, that measures how well the lip movements in a video align with the corresponding speech audio. The lower the value, the better the audio-lip alignment; values under 6 are generally considered excellent, near-perfect sync under common evaluation conventions (a sketch of the computation follows this list).
    • Support for multiple sentiments with EQ-based sentiment control
    • Dynamic, context-sensitive facial expressions
    • Natural speech delivery aligned with emotional tone
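
For context, the sketch below shows how an LSE-D style score can be computed once per-window SyncNet audio and lip embeddings are in hand: the mean Euclidean distance between paired embeddings is evaluated across a range of temporal offsets, and the minimum is taken as the score. The function name, offset range, and windowing assumptions are illustrative; this is not D-ID's evaluation code.

```python
import numpy as np

def lse_d(video_emb: np.ndarray, audio_emb: np.ndarray, max_offset: int = 15) -> float:
    """Approximate LSE-D: mean audio-lip embedding distance at the
    best temporal offset. Lower is better.

    video_emb, audio_emb: (num_windows, dim) SyncNet embeddings for
    aligned 5-frame video windows and 0.2 s audio windows.
    """
    scores = []
    for offset in range(-max_offset, max_offset + 1):
        # Shift audio relative to video and keep the overlapping windows.
        if offset >= 0:
            v, a = video_emb[offset:], audio_emb[:len(audio_emb) - offset]
        else:
            v, a = video_emb[:offset], audio_emb[-offset:]
        n = min(len(v), len(a))
        if n == 0:
            continue
        # Mean Euclidean distance between paired embeddings at this offset.
        scores.append(np.linalg.norm(v[:n] - a[:n], axis=1).mean())
    return float(min(scores))  # distance at the best-aligned offset
```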


    • High-resolution output: up to 4K 
    • Sharper facial motion and improved lip synchronization
    • Consistent avatar identity across long sessions


    • Generative UI: agents can fetch and render media assets dynamically as interactive screen components
    • Eyesight: a vision-enabled LLM analyzes frames from the video stream, allowing the Agent to interpret facial expressions, gestures, objects, and scene context, and respond naturally in conversation (a sketch of this pattern follows this list)
    • MCP apps: enable direct interaction with the D-ID API ecosystem
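
As a rough illustration of the Eyesight pattern, the sketch below captures a single frame from a video stream and asks a vision-enabled LLM to interpret it. D-ID's actual Eyesight pipeline is internal; the OpenAI client, model name, and prompt here are stand-in assumptions.

```python
import base64
import cv2  # pip install opencv-python
from openai import OpenAI  # stand-in vision LLM, not D-ID's internal model

client = OpenAI()

def describe_frame(stream_url: str) -> str:
    """Grab one frame from a video stream and ask a vision LLM to
    interpret expressions, gestures, objects, and scene context."""
    cap = cv2.VideoCapture(stream_url)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError("could not read a frame from the stream")
    # Encode the frame as JPEG and embed it as a base64 data URL.
    _, jpg = cv2.imencode(".jpg", frame)
    b64 = base64.b64encode(jpg.tobytes()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of vision model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the user's facial expression, gestures, "
                         "and any relevant objects or scene context."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```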


    • Optimized GPU utilization with a small compute footprint (3.5 GB GPU RAM at 4 concurrent sessions; see the arithmetic sketch after this list)
    • Significant cost advantage compared to diffusion-based AI video generation
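
To make the footprint figure concrete, here is back-of-envelope arithmetic derived only from the published number (3.5 GB at 4 concurrent sessions, roughly 0.875 GB per session). The helper function and the assumption that memory scales linearly with session count are illustrative; real capacity also depends on model weights, activation headroom, and framework overhead.

```python
def max_sessions(gpu_vram_gb: float, per_4_sessions_gb: float = 3.5) -> int:
    """Rough upper bound on concurrent sessions per GPU, assuming
    memory scales linearly from the published 3.5 GB / 4 sessions."""
    per_session = per_4_sessions_gb / 4  # ~0.875 GB per session
    return int(gpu_vram_gb // per_session)

# e.g. a 24 GB GPU: max_sessions(24) -> 27 sessions, before overhead
```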


    • 15+ avatars available at launch
    • Custom avatars available for Enterprise users
    • Accessible in D-ID Studio and via the D-ID API (a minimal request sketch follows)
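
For orientation, the sketch below issues a request against D-ID's documented Talks endpoint with the Python `requests` library. The image URL is a placeholder, and the parameters for selecting a V4 avatar specifically are not covered here; consult the current API reference at https://docs.d-id.com before relying on this shape.

```python
import requests

API_KEY = "YOUR_D_ID_API_KEY"  # issued in the D-ID Studio account settings

# Request shape follows D-ID's documented Talks endpoint; the fields
# for choosing a specific V4 avatar may differ from this sketch.
resp = requests.post(
    "https://api.d-id.com/talks",
    headers={"Authorization": f"Basic {API_KEY}"},
    json={
        "source_url": "https://example.com/avatar.jpg",  # placeholder image
        "script": {"type": "text", "input": "Hello from a visual avatar."},
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["id"])  # talk ID to poll for the rendered result
```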