V4 Expressive Visual Avatars – Tech Specs

D-ID’s ultra-fast diffusion model powers real-time visual agents that combine human-like delivery with expressive emotional intelligence.

    • Conversational latency (end-to-end): < 500 ms
    • Model latency: < 120 ms
    • Rendering performance: 200+ FPS diffusion pipeline
    • Lip-sync accuracy: 5.7 LSE-D* (17% better than the closest competitor and 44% better than D-ID’s V3 avatar model)

      * LSE-D (lip-sync error distance) is a metric, computed using the SyncNet model, that measures how well the lip movements in a video align with the corresponding speech audio. The lower the value, the better the audio-lip alignment; values under 6 are generally considered excellent, near-perfect sync under common evaluation conventions (a sketch of the computation follows this list).
    • Support for multiple sentiments with EQ-based sentiment control
    • Dynamic, context-sensitive facial expressions
    • Natural speech delivery aligned with emotional tone
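
For context, the sketch below shows how an LSE-D style score can be computed once per-window SyncNet audio and lip embeddings are in hand: the mean Euclidean distance between paired embeddings is evaluated across a range of temporal offsets, and the minimum is taken as the score. The function name, offset range, and windowing assumptions are illustrative; this is not D-ID's evaluation code.

```python
import numpy as np

def lse_d(video_emb: np.ndarray, audio_emb: np.ndarray, max_offset: int = 15) -> float:
    """Approximate LSE-D: mean audio-lip embedding distance at the
    best temporal offset. Lower is better.

    video_emb, audio_emb: (num_windows, dim) SyncNet embeddings for
    aligned 5-frame video windows and 0.2 s audio windows.
    """
    scores = []
    for offset in range(-max_offset, max_offset + 1):
        # Shift audio relative to video and keep the overlapping windows.
        if offset >= 0:
            v, a = video_emb[offset:], audio_emb[:len(audio_emb) - offset]
        else:
            v, a = video_emb[:offset], audio_emb[-offset:]
        n = min(len(v), len(a))
        if n == 0:
            continue
        # Mean Euclidean distance between paired embeddings at this offset.
        scores.append(np.linalg.norm(v[:n] - a[:n], axis=1).mean())
    return float(min(scores))  # distance at the best-aligned offset
```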


    • High-resolution output: up to 4K 
    • Sharper facial motion and improved lip synchronization
    • Consistent avatar identity across long sessions


    • Generative UI: agents can fetch and render media assets dynamically as interactive screen components
    • Eyesight: a vision-enabled LLM analyzes frames from the video stream, allowing the Agent to interpret facial expressions, gestures, objects, and scene context, and respond naturally in conversation (a sketch of this pattern follows this list)
    • MCP apps: enable direct interaction with the D-ID API ecosystem
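
As a rough illustration of the Eyesight pattern, the sketch below captures a single frame from a video stream and asks a vision-enabled LLM to interpret it. D-ID's actual Eyesight pipeline is internal; the OpenAI client, model name, and prompt here are stand-in assumptions.

```python
import base64
import cv2  # pip install opencv-python
from openai import OpenAI  # stand-in vision LLM, not D-ID's internal model

client = OpenAI()

def describe_frame(stream_url: str) -> str:
    """Grab one frame from a video stream and ask a vision LLM to
    interpret expressions, gestures, objects, and scene context."""
    cap = cv2.VideoCapture(stream_url)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError("could not read a frame from the stream")
    # Encode the frame as JPEG and embed it as a base64 data URL.
    _, jpg = cv2.imencode(".jpg", frame)
    b64 = base64.b64encode(jpg.tobytes()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of vision model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the user's facial expression, gestures, "
                         "and any relevant objects or scene context."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```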


    • Optimized GPU utilization with a small compute footprint (3.5 GB GPU RAM at 4 concurrent sessions; see the arithmetic sketch after this list)
    • Significant cost advantage compared to diffusion-based AI video generation
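
To make the footprint figure concrete, here is back-of-envelope arithmetic derived only from the published number (3.5 GB at 4 concurrent sessions, roughly 0.875 GB per session). The helper function and the assumption that memory scales linearly with session count are illustrative; real capacity also depends on model weights, activation headroom, and framework overhead.

```python
def max_sessions(gpu_vram_gb: float, per_4_sessions_gb: float = 3.5) -> int:
    """Rough upper bound on concurrent sessions per GPU, assuming
    memory scales linearly from the published 3.5 GB / 4 sessions."""
    per_session = per_4_sessions_gb / 4  # ~0.875 GB per session
    return int(gpu_vram_gb // per_session)

# e.g. a 24 GB GPU: max_sessions(24) -> 27 sessions, before overhead
```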


    • 15+ avatars available at launch
    • Custom avatars available for Enterprise users
    • Accessible in D-ID Studio and via the D-ID API (a minimal request sketch follows)
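
For orientation, the sketch below issues a request against D-ID's documented Talks endpoint with the Python `requests` library. The image URL is a placeholder, and the parameters for selecting a V4 avatar specifically are not covered here; consult the current API reference at https://docs.d-id.com before relying on this shape.

```python
import requests

API_KEY = "YOUR_D_ID_API_KEY"  # issued in the D-ID Studio account settings

# Request shape follows D-ID's documented Talks endpoint; the fields
# for choosing a specific V4 avatar may differ from this sketch.
resp = requests.post(
    "https://api.d-id.com/talks",
    headers={"Authorization": f"Basic {API_KEY}"},
    json={
        "source_url": "https://example.com/avatar.jpg",  # placeholder image
        "script": {"type": "text", "input": "Hello from a visual avatar."},
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["id"])  # talk ID to poll for the rendered result
```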