- Conversational latency (end-to-end): < 500 ms
- Model latency: < 120 ms
- Rendering performance: 200+ FPS diffusion pipeline
- Lip-sync accuracy: 5.7 LSE-D* (17% better than the closest competitor and 44% better than D-ID’s V3 avatar model)
* LSE-D (lip-sync error distance) is a metric computed with the SyncNet model that measures how well the lip movements in a video align with the corresponding speech audio; lower values mean tighter audio-lip alignment. Under common evaluation conventions, values below 6 are considered excellent, near-perfect sync. A minimal sketch of the computation follows.
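The post doesn't detail the evaluation pipeline, but the LSE-D idea reduces to a distance between SyncNet's audio and lip-crop video embeddings. A minimal sketch, assuming you already have temporally aligned per-window embeddings from a pretrained SyncNet (the function and array names here are illustrative):

```python
import numpy as np

def lse_d(audio_emb: np.ndarray, video_emb: np.ndarray) -> float:
    """Mean Euclidean distance between aligned SyncNet audio and
    lip-crop video embeddings, each of shape (num_windows, dim).
    Lower is better; under 6 is conventionally near-perfect sync."""
    assert audio_emb.shape == video_emb.shape
    dists = np.linalg.norm(audio_emb - video_emb, axis=1)
    return float(dists.mean())
```

Full evaluation pipelines also search over small temporal offsets before averaging; this sketch assumes alignment is already done.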
- Support for multiple sentiments with EQ-based sentiment control
- Dynamic, context-sensitive facial expressions
- Natural speech delivery aligned with emotional tone
- High-resolution output: up to 4K
- Sharper facial motion and improved lip synchronization
- Consistent avatar identity across long sessions
- Generative UI: agents can fetch and render media assets dynamically as interactive screen components
- Eyesight: a vision-enabled LLM analyzes frames from the video stream, allowing the agent to interpret facial expressions, gestures, objects, and scene context, and respond naturally in conversation (see the sketch after this list)
- MCP apps: enable direct interaction with the D-ID API ecosystem
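D-ID doesn't document which vision model powers Eyesight, but the pattern it describes is straightforward: sample a frame from the stream and hand it to a vision-capable LLM. A hedged sketch using OpenCV and a generic vision-enabled chat model (the client, model choice, and prompt are assumptions, not D-ID's pipeline):

```python
import base64

import cv2  # pip install opencv-python
from openai import OpenAI  # stand-in for any vision-capable LLM client

def describe_frame(stream_url: str) -> str:
    """Grab one frame from a video stream and ask a vision LLM to
    interpret expression, gestures, and scene context."""
    cap = cv2.VideoCapture(stream_url)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError("could not read a frame from the stream")
    _, jpeg = cv2.imencode(".jpg", frame)
    b64 = base64.b64encode(jpeg.tobytes()).decode()

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-enabled model works for this sketch
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the user's facial expression, gestures, "
                         "visible objects, and overall scene."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```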
- Optimized GPU utilization with a small compute footprint (3.5 GB of GPU RAM at 4 concurrent sessions; see the capacity sketch below)
- Significant cost advantage compared to diffusion-based AI video generation
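The stated footprint implies roughly 0.875 GB per session, which makes capacity planning simple arithmetic. A back-of-the-envelope sketch (the headroom reserve is an assumption, and this presumes memory scales linearly with session count, which the post doesn't claim):

```python
PER_SESSION_GB = 3.5 / 4  # 0.875 GB, from the stated figure

def max_sessions(gpu_ram_gb: float, headroom_gb: float = 2.0) -> int:
    """Rough upper bound on concurrent sessions for one GPU,
    reserving headroom for driver and framework overhead."""
    return int((gpu_ram_gb - headroom_gb) // PER_SESSION_GB)

print(max_sessions(24.0))  # a 24 GB card -> ~25 sessions by this estimate
```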
- 15+ avatars available at launch
- Custom avatars available for Enterprise users
- Accessible in D-ID Studio and via the D-ID API (a minimal request sketch follows)
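For API access, a minimal sketch of a request against the D-ID REST API; the payload fields follow the publicly documented /talks shape, but treat the exact schema and auth key format as something to verify against the D-ID API reference:

```python
import os

import requests

API_KEY = os.environ["DID_API_KEY"]  # your D-ID API key

resp = requests.post(
    "https://api.d-id.com/talks",
    headers={"Authorization": f"Basic {API_KEY}"},  # key format per your plan
    json={
        "source_url": "https://example.com/avatar.png",  # placeholder image
        "script": {"type": "text", "input": "Hello from a D-ID avatar."},
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```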