V4 Expressive Visual Avatars – Tech Specs

D-ID’s ultra-fast diffusion model powers real-time visual agents that combine human-like delivery with expressive emotional intelligence.

Lip-sync quality

D-ID ranks #1 in lip-sync accuracy among the leading real-time avatar platforms tested.

Lip-sync quality was evaluated using SyncNet, the industry-standard academic benchmark for audio-visual synchronization. SyncNet is a model that was trained to detect even subtle misalignment between speech audio and mouth movements.
The evaluation covered a large set of real-world conversational scenarios and was applied consistently across real-world network connections, providing a fair and unbiased comparison.

Results

  • #1 on both SyncNet metrics (LSE-D and LSE-C)
  • Consistently outperformed competing real-time avatar platforms
  • Demonstrated a 10–15% performance advantage over competitors across lip-sync quality measurements

In practice, this translates to more accurate, natural, and reliable speech synchronization during live conversations.
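For readers unfamiliar with the two SyncNet metrics, the sketch below shows how they are commonly derived in the lip-sync literature: LSE-D is the lowest mean audio-visual embedding distance across candidate time offsets (lower is better), and LSE-C is the gap between the median and that minimum (higher is better). It is a minimal illustration, not D-ID's evaluation pipeline; the function name `sync_scores` and the input layout are assumptions, and the numbers in the usage lines are made up for the arithmetic only.

```python
from statistics import mean, median


def sync_scores(distances_by_offset):
    """Return (lse_d, lse_c, best_offset_index) for one video.

    distances_by_offset[k][t] is the distance between the audio and
    video embeddings of window t when the audio is shifted by the k-th
    candidate offset. This input layout is an assumption of the sketch.
    """
    # Average the distance over all windows for each candidate offset.
    mean_per_offset = [mean(row) for row in distances_by_offset]

    # LSE-D: the smallest mean distance (lower means tighter sync).
    lse_d = min(mean_per_offset)

    # LSE-C: how far the best offset stands out from the typical one
    # (higher means the model is more confident about the alignment).
    lse_c = median(mean_per_offset) - lse_d

    best_offset_index = mean_per_offset.index(lse_d)
    return lse_d, lse_c, best_offset_index


# Illustrative values only: three candidate offsets, two windows each.
example = [[10.0, 12.0], [6.0, 8.0], [11.0, 13.0]]
lse_d, lse_c, best = sync_scores(example)
# lse_d = 7.0, lse_c = 4.0, best = 1
```

Per-video scores computed this way are typically averaged over the whole test set before platforms are compared.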

We compared real-time avatar streaming from D-ID, HeyGen, Anam, and Tavus using a fully synthetic, reproducible benchmark. The test included 5 avatars per company, varied scripts, and real-world network scenarios, covering hundreds of videos and tens of thousands of frames.
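A comparison like this usually comes down to averaging per-video scores for each platform and expressing the gap as a percentage. The sketch below shows one plausible way to compute such a relative advantage; the formula, the function names, and the placeholder platforms `platform_a` and `platform_b` are assumptions, and the scores are synthetic values chosen for easy arithmetic, not benchmark results.

```python
from statistics import mean


def platform_mean(per_video_scores):
    """Average one metric over all videos recorded for a platform."""
    return mean(per_video_scores)


def relative_advantage(ours, theirs, lower_is_better):
    """Fractional advantage of `ours` over `theirs` for one metric.

    For LSE-D (a distance) lower is better; for LSE-C (a confidence)
    higher is better. The direction must be passed explicitly.
    """
    if lower_is_better:
        return (theirs - ours) / theirs
    return (ours - theirs) / theirs


# Synthetic per-video LSE-D scores for two hypothetical platforms.
platform_a_lse_d = platform_mean([7.0, 7.5, 6.5])  # 7.0
platform_b_lse_d = platform_mean([8.0, 8.5, 7.5])  # 8.0

advantage = relative_advantage(
    platform_a_lse_d, platform_b_lse_d, lower_is_better=True
)
# advantage = 0.125, i.e. a 12.5% lower (better) LSE-D in this toy example
```

Keeping the metric direction explicit avoids a common mistake: reporting a higher distance as an improvement.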

Performance

Conversational latency: End-to-end response time stays below 500 ms, keeping avatar conversations fast, fluid, and natural.
Model latency: The core model runs below 120 ms, giving the system the speed needed for real-time interaction.
Rendering performance: A 200+ FPS diffusion pipeline generates expressive avatar frames faster than real time, supporting smooth motion and consistent visual quality.
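The figures above can be put into rough perspective with some simple arithmetic. The sketch below treats the stated limits (500 ms, 120 ms, 200 FPS) as round numbers and assumes a 25 FPS playback rate, which the specs do not state; it also assumes the model latency is counted inside the end-to-end figure. The function names are illustrative.

```python
END_TO_END_BUDGET_MS = 500.0   # stated upper bound for a full response
MODEL_LATENCY_MS = 120.0       # stated upper bound for the core model
PIPELINE_FPS = 200.0           # stated lower bound for frame generation
ASSUMED_PLAYBACK_FPS = 25.0    # assumption: not given in the specs


def frame_time_ms(fps):
    """Time available (or spent) per frame at a given frame rate."""
    return 1000.0 / fps


def realtime_factor(generation_fps, playback_fps):
    """How many times faster than playback the frames are generated."""
    return generation_fps / playback_fps


def remaining_budget_ms(end_to_end_ms, model_ms):
    """Latency left for everything outside the core model."""
    return end_to_end_ms - model_ms


per_frame = frame_time_ms(PIPELINE_FPS)                        # 5.0 ms
headroom = realtime_factor(PIPELINE_FPS, ASSUMED_PLAYBACK_FPS)  # 8.0x
others = remaining_budget_ms(END_TO_END_BUDGET_MS, MODEL_LATENCY_MS)  # 380.0 ms
```

Under these assumptions, each frame takes about 5 ms to generate against a 40 ms playback interval, and roughly 380 ms of the response budget remains for the other stages of the conversation loop and network transport.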

Expressiveness

Sentiment control: Expressive V4 supports multiple sentiments with EQ-based control, helping avatars respond with the right emotional tone for the situation.
Context-sensitive expressions: Facial expressions adapt dynamically to the conversation, creating more responsive and natural interactions.
Emotionally aligned speech: Voice delivery matches the intended tone, so the avatar’s speech, expression, and message feel consistent.

Visual quality

High-resolution output: Expressive V4 supports up to 4K output, preserving facial detail and visual clarity for polished, customer-facing experiences.
Sharper facial motion: More precise facial movement and improved lip synchronization make speech feel better timed and more natural.
Consistent avatar identity: Avatars maintain a stable appearance across longer sessions, reducing visual drift and keeping interactions reliable.

Interactive capabilities

Generative UI: Agents can fetch and display media assets as interactive screen components, such as images, documents, forms, or buttons, directly within the conversation.
Optional eyesight: A vision-enabled LLM analyzes frames from the video stream, helping the agent understand expressions, gestures, objects, and scene context, then respond more naturally.
MCP apps: Agents can interact directly with the D-ID API ecosystem, making it easier to connect conversations with tools, workflows, and API-based actions.

Efficiency

Efficient GPU usage: Expressive V4 is optimized for a small compute footprint, using around 3.5 GB of GPU memory for 4 concurrent sessions. This supports scalable real-time avatar deployment without heavy infrastructure requirements.
Lower generation costs: Compared to diffusion-based AI video generation, the system offers a significant cost advantage by delivering expressive real-time avatars with more efficient compute usage.
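For capacity planning, the memory figure above can be turned into a rough sessions-per-GPU estimate. The sketch below assumes memory grows linearly with the number of sessions, which is a simplification: fixed costs such as model weights are likely shared, so real scaling may differ. The 24 GB card and the `reserved_gb` parameter are hypothetical.

```python
import math

MEASURED_MEMORY_GB = 3.5   # stated: around 3.5 GB of GPU memory
MEASURED_SESSIONS = 4      # stated: for 4 concurrent sessions


def memory_per_session_gb():
    """Average memory per session implied by the stated figure."""
    return MEASURED_MEMORY_GB / MEASURED_SESSIONS


def estimated_sessions(gpu_memory_gb, reserved_gb=0.0):
    """Rough session count for a GPU, assuming linear memory scaling.

    reserved_gb is memory set aside for the driver and other processes
    (a hypothetical parameter for this sketch).
    """
    usable = gpu_memory_gb - reserved_gb
    if usable <= 0:
        return 0
    return math.floor(usable / memory_per_session_gb())


per_session = memory_per_session_gb()                       # 0.875 GB
on_24gb_card = estimated_sessions(24.0)                     # 27
with_reserve = estimated_sessions(24.0, reserved_gb=2.0)    # 25
```

Treat the result as an upper-bound estimate for memory only; compute throughput and latency targets may limit concurrency well before memory does.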

Availability

15+ avatars at launch: Expressive V4 launches with more than 15 ready-to-use avatars, helping teams start quickly with a variety of visual options.
Custom Enterprise avatars: Enterprise users can create branded or use-case-specific avatars tailored to their audience and experience.
Studio and API access: Expressive V4 is available in D-ID Studio and through the D-ID API, supporting both no-code creation and developer-led integration.