Why Everyone Is Doing AI Avatars Wrong

Jun 17, 2026

AI avatars are becoming the new interface for digital experiences. Every app, game, brand, and platform is going to ship one eventually: an agent, an assistant, a companion, or NPC. Most of them exist today as text or voice. That works, but it has a ceiling: you can't build a relationship with a chat box.

The next layer is visual. Users connect with personas they can see, that react to them, that have a face.

There are two ways to bring an AI avatar to life on screen: a real-time character rig system or a video-based generation system. They have very different cost structures, latency profiles, and failure modes. The right choice depends entirely on what you're building. We've spent years on this problem. So here's our take on the tradeoffs…and why we went took the harder approach (so you don't have to).

The two approaches

A real-time character rig system is how games and Hollywood have done 3D animated characters for decades, modernized for AI. The character is a rigged 3D asset that renders on the user's device. The server doesn't send pixels, it sends lightweight animation signals (blend shapes, bone transforms, motion curves - kilobytes, not megabytes), and the client brings the character to life locally.

A video-based system generates the character as video, frame by frame, server-side. The model outputs pixels, and those pixels stream to the user like any other video.

Both have their pros and cons. Many teams default to video. But the defaults are worth digging into.

The case for video…and where it breaks

Video-based systems are popular because they ship fast. There's no custom runtime to build, no SDK to maintain, no rigging pipeline, no specialized hires. Output streams to anything with a video player. And for photorealistic human characters, current generative video quality is starting to get better.

If you're pre-PMF and just looking to validate a visual layer, video is a reasonable way to test the hypothesis for an AI avatar. Generation usually takes several seconds or more per clip. If your interaction model is async, users tolerate that, and video can work in production.

The through-line: video wins when the interaction is one-directional where the user watches rather than converses in real time, and when photorealism of a human matters more than long-term economics. An example of this might be a spokesperson or talking head, whether using for a tik tok or corporate training video.

The problems show up when you try to build a real interactive product experience with it:

Latency. Video generation can't yet support real-time, back-and-forth conversation. If your product is synchronous, that's disqualifying. Most companion, agent, and support experiences are synchronous.
Costs scale on every axis. Every frame renders server-side, with no way to offload to the client. Cost climbs with every user, every minute of conversation, and every step up in quality. This is why the demos you see are short and low-res. Real quality at real length is expensive, which is why the economics get worse precisely when the product succeeds.
Identity drift. Generative video is inconsistent frame to frame and session to session. Your character looks slightly different every time. For products built on character attachment (especially for licensed IP, where the character is the asset) that's a serious problem.
The uncanny valley. Video's photorealism has been getting better for human characters but not as strong for stylized ones. For consumer products, hyper-real human avatars tend to set a threshold of perfection the tech can't consistently clear.
Device and bandwidth load. Streaming generated video drains battery and depends on connection strength.
Multiplayer breaks. Shared, synchronous experiences with generated video don't scale.

The case for a character rig based system…and what they cost you

A rig-based character system inverts almost every one of those tradeoffs.

Because rendering happens on-device, character state can update in under 100ms from live audio, emotion, or conversational context. The character responds to user input, UI state, and gameplay logic directly - it feels part of the product, not a video playing inside it. Identity consistency is structural: a rigged character maintains its form and model across every frame and every session, by design. Character customization and asset systems plug directly into the rig, which means the monetization layer comes built in - think of users customizing their outfits through collection mechanics. And synchronized state, rather than streamed pixels, makes multiplayer actually feasible - two or more personas can interact in real-time. This also enables gaming experiences, something video rendering can’t do.

So why doesn't everyone do it? Because the costs are front-loaded, and they're steep:

High upfront investment. Building a real-time character system that is expansive enough to be highly customizable for different users and use cases while also being easy to integrate can take years, not months.
Specialized talent. Riggers, technical artists, and runtime engineers are rare and expensive.
Cross-platform is a grind. Shipping an SDK across iOS, Android, web, desktop, and game engines is a significant, ongoing engineering lift.

A decision framework

If you're evaluating this for your own product, here's how we'd frame it.

Rig-based makes sense when: your user interaction model is synchronous and real-time responsiveness is core to the product; your user base is at or approaching scale, where per-session costs become untenable; your characters are stylized; your roadmap includes many characters, user-generated avatars, or creator tools; the character needs to exist consistently across multiple surfaces; or you have the technical depth and time horizon to support a real-time runtime.

Video makes sense when: you need to ship quickly; you're pre-PMF and validating; your character count is low and staying that way; your interaction model is async and users tolerate 5+ seconds of latency; photorealistic humans are core to the value proposition; or you have no developers on your team.

In summary: if you believe in real-time interaction between users and AI avatars at scale, a rig-based system is the only realistic approach. That's before you account for the customization and monetization upside that comes structurally with a rig based character system.

"But won't real-time generative video close the gap?"

This is the most common pushback, and it's a fair one. Latency on diffusion-based video is dropping fast. But faster generation doesn't fix the structural problems:

There's still no path to client-side rendering - cost per user, per session, forever, plus a hard dependency on connectivity.
Generative video still drifts. Characters and worlds look different frame to frame, which is disqualifying for products built on identity and attachment.
Style control remains unsolved - holding a specific aesthetic reliably across emotional states is an open generative problem.

Speed was never the only issue. The architecture is.

Why we built the hard one

At Genies, we believed in real-time interaction at scale from the start - which made the rig-based path the only viable one. So we spent the last five years of R&D taking on exactly the downsides that keep most teams away from it:

The upfront investment is behind us. We’ve been building this system.

The talent problem is solved. We built the specialized team of riggers, technical artists, and runtime engineers required to make this work…so you don’t have to.

Generative animation, the genuinely unsolved problem, is where most of our R&D lives. Our Behavior Model converts LLM outputs into expressive, character-consistent behavior in real time - a compelling experience today, and architected to improve as the research does. And our Looks Model lets anyone generate fully rigged, engine-ready characters from text or image prompts, so creating characters no longer requires an entire character pipeline.

Cross-platform complexity is handled by our native avatar framework, which runs across apps, web, mobile, game engines, and XR. Built so platform updates generate their own wrappers, rather than requiring a standing army of engineers per platform.

The result is what we set out to build: the upside of the rig-based approach without its barriers to entry. Partners get sub-100ms responsiveness, structural identity consistency, costs that don't scale linearly with users, and a built-in customization layer - delivered as engine-ready assets, not a multi-year internal project. That's why the most influential IP in entertainment and media has already partnered with us, and it's why the cost math works out to 350x+ efficiency versus video systems at scale.

The bottom line

If your AI avatar is a demo, video gets you there fastest. If your AI avatar is a product (something users come back to, form attachment to, and interact with in real time) the architecture underneath it matters more than the first impression.

We made our bet five years ago.

If you're making yours now, we'd love to talk, whether you're bringing characters to a story-driven platform, an education app, a game, or an enterprise agent. Get in touch with our team here.