AI VTuber technology stack

The AI VTuber technology stack enables the creation of virtual characters capable of real-time interaction, gaming, and streaming. Unlike traditional static avatars, an AI VTuber integrates a Large Language Model (LLM) as a "brain" to drive conversation and decision-making, coordinated by a backend runtime^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md].

Modern implementations typically use a monorepo (e.g., pnpm workspaces) to manage the complexity of supporting multiple deployment targets, including web browsers, desktop apps (Electron), and mobile devices^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md].

Core Components

Avatar Rendering

Visual representation is handled through specialized rendering engines, supporting both 2D and 3D models^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md]:

  • Live2D: Used for 2D models; libraries manage animations, automatic blinking, and gaze tracking^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md].
  • VRM: Used for 3D models; typically rendered via [[Three.js]] to create spatially aware avatars^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md].
  • Cross-Platform UI: The visual interface ("Stage UI") is often built with web frameworks like [[Vue.js]] and [[Vite]], allowing it to run natively on the web or be wrapped in desktop/mobile containers^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md].
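
Automatic blinking, as listed for Live2D above, comes down to driving an eye-openness parameter over time. The following is a minimal sketch of that idea only. `BlinkConfig` and `eyeOpenness` are hypothetical names, the timing values are illustrative, and nothing here is the Live2D or VRM SDK API; a real renderer would write the returned value into whatever eye parameter the model exposes.

```typescript
// Hypothetical blink timing; none of these names come from the Live2D or VRM SDKs.
interface BlinkConfig {
  intervalMs: number; // how long the eyes stay open between blinks
  closeMs: number;    // how long the eyelids take to close
  openMs: number;     // how long the eyelids take to reopen
}

// Returns eye openness in [0, 1] for a given elapsed time (1 = open, 0 = closed).
function eyeOpenness(elapsedMs: number, cfg: BlinkConfig): number {
  const cycle = cfg.intervalMs + cfg.closeMs + cfg.openMs;
  const t = elapsedMs % cycle;
  if (t < cfg.intervalMs) return 1; // resting with eyes open

  const closing = t - cfg.intervalMs;
  if (closing < cfg.closeMs) return 1 - closing / cfg.closeMs; // lids closing

  const opening = closing - cfg.closeMs;
  return opening / cfg.openMs; // lids reopening
}
```

Calling `eyeOpenness` once per animation frame with the time since startup yields a periodic blink. Randomizing `intervalMs` per cycle would make the motion look less mechanical.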

AI Brain & LLM Integration

The "brain" of the VTuber connects to various LLM providers to process text and generate responses^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md]. Key considerations include:

  • Provider Abstraction: Support for 30+ providers (e.g., OpenAI, Anthropic, DeepSeek, Qwen, local models via Ollama) through unified SDKs (similar to [[Vercel AI SDK]])^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md].
  • Self-Hosting: Full self-hosting capabilities are standard, allowing users to keep API keys and data local^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md].
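
One way to picture provider abstraction is a registry that maps a provider id to connection details, so the rest of the application only deals with a `"provider/model"` string. The sketch below assumes each provider exposes an OpenAI-compatible chat completions endpoint, which is not true of every provider. The names `ProviderConfig`, `providers`, and `prepareChatRequest` are hypothetical and do not come from any specific SDK, and the API key is a placeholder.

```typescript
// Hypothetical provider registry; names are illustrative, not taken from any SDK.
interface ProviderConfig {
  baseUrl: string; // root of an OpenAI-compatible HTTP API (an assumption)
  apiKey?: string; // omitted for local servers that need no key
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface PreparedRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

const providers: Record<string, ProviderConfig> = {
  openai: { baseUrl: "https://api.openai.com/v1", apiKey: "<your-key>" },
  // Ollama serves an OpenAI-compatible API locally, so no key leaves the machine.
  ollama: { baseUrl: "http://localhost:11434/v1" },
};

// Turn "provider/model" plus messages into a request that any HTTP client can send.
function prepareChatRequest(modelRef: string, messages: ChatMessage[]): PreparedRequest {
  const slash = modelRef.indexOf("/");
  if (slash < 0) throw new Error(`expected "provider/model", got "${modelRef}"`);
  const providerId = modelRef.slice(0, slash);
  const model = modelRef.slice(slash + 1);

  const cfg = providers[providerId];
  if (!cfg) throw new Error(`unknown provider "${providerId}"`);

  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (cfg.apiKey) headers["Authorization"] = `Bearer ${cfg.apiKey}`;

  return {
    url: `${cfg.baseUrl}/chat/completions`,
    headers,
    body: JSON.stringify({ model, messages }),
  };
}
```

Switching from a hosted model to a self-hosted one is then a change of string, for example from `"openai/<model>"` to `"ollama/<model>"`, with no change to calling code.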

Audio & Voice Pipeline

Real-time voice interaction requires a processing pipeline for speech-to-text and text-to-speech^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md]:

  • Input (ASR): Audio is captured via the [[WebAudio API]] and then transcribed to text by a speech recognition model.
  • Output (TTS): Text responses are converted to speech using services like ElevenLabs^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md].
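
The two stages above bracket the LLM call, so a single conversational turn can be sketched as three chained asynchronous steps. The stage types `Transcribe`, `Respond`, and `Synthesize` and the function `runVoiceTurn` are hypothetical; real ASR and TTS clients would be passed in as arguments, and a production pipeline would likely stream partial results rather than wait for each stage to finish.

```typescript
// Hypothetical stage signatures; real ASR, LLM, and TTS clients are injected by the caller.
type Transcribe = (audio: Float32Array) => Promise<string>;
type Respond = (text: string) => Promise<string>;
type Synthesize = (text: string) => Promise<Uint8Array>;

interface TurnResult {
  heard: string;      // transcript of the user's speech
  reply: string;      // text generated by the LLM
  speech: Uint8Array; // encoded audio returned by the TTS stage
}

// One conversational turn: speech in, speech out.
async function runVoiceTurn(
  audio: Float32Array,
  asr: Transcribe,
  llm: Respond,
  tts: Synthesize,
): Promise<TurnResult> {
  const heard = (await asr(audio)).trim();
  if (heard.length === 0) {
    // Nothing intelligible was said, so skip the LLM and TTS calls.
    return { heard, reply: "", speech: new Uint8Array(0) };
  }
  const reply = await llm(heard);
  const speech = await tts(reply);
  return { heard, reply, speech };
}
```

Injecting the stages as functions keeps the turn logic independent of any one vendor, which matches the provider-agnostic approach used for the LLM.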

Memory System

To maintain context over long sessions or multiple streams, a dedicated memory layer is required^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md]:

  • Client-Side: In-browser databases like DuckDB WASM or pglite allow for local data storage without a backend^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md].
  • Server-Side: For services like Discord or Telegram bots, traditional databases like [[PostgreSQL]] combined with vector extensions (e.g., pgvector) are used for semantic search and long-term memory^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md].
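
Semantic search over memories usually means ranking stored embeddings by similarity to a query embedding. The sketch below does this in memory with cosine similarity, purely to illustrate the idea; in a PostgreSQL deployment the ranking would be done by pgvector (which provides a cosine distance operator, `<=>`) rather than in application code. `MemoryEntry` and `recall` are hypothetical names, and the embeddings are assumed to come from an embedding model that is not shown.

```typescript
interface MemoryEntry {
  text: string;
  embedding: number[]; // produced by an embedding model (not shown here)
}

// Cosine similarity in [-1, 1]; returns 0 when either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k stored memories most similar to the query embedding.
function recall(query: number[], memories: MemoryEntry[], k: number): MemoryEntry[] {
  return memories
    .map((m) => ({ m, score: cosineSimilarity(query, m.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.m);
}
```

The recalled texts would typically be prepended to the LLM prompt so the character can refer back to earlier sessions.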

Gaming & Agent Capabilities

A defining feature of advanced AI VTubers (inspired by Neuro-sama) is the ability to play games^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md]. This requires bridging the LLM with game environments:

  • Minecraft: Implemented using frameworks like Mineflayer to connect the AI to MC servers^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md].
  • Other Games: Support exists for simulation games like Factorio and Kerbal Space Program via custom plugins^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md].
  • Logic Control: The AI interprets game state and executes commands, transforming the VTuber from a chatbot into an autonomous agent^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md].
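
The logic-control step can be illustrated as a validation layer between the LLM's text output and the game API. In this sketch the model is assumed to reply with a small JSON object, and anything malformed falls back to idling instead of being executed. The `GameAction` vocabulary and `parseAction` are hypothetical; a real Minecraft integration would translate validated actions into calls on a Mineflayer bot, which is not shown here.

```typescript
// Hypothetical action vocabulary; a real integration maps these onto the game's own API.
type GameAction =
  | { kind: "chat"; message: string }
  | { kind: "moveTo"; x: number; y: number; z: number }
  | { kind: "idle" };

// Validate the model's JSON output; idle rather than execute anything malformed.
function parseAction(llmOutput: string): GameAction {
  let raw: unknown;
  try {
    raw = JSON.parse(llmOutput);
  } catch {
    return { kind: "idle" }; // the model did not produce valid JSON
  }
  if (typeof raw !== "object" || raw === null) return { kind: "idle" };

  const obj = raw as Record<string, unknown>;
  if (obj.kind === "chat" && typeof obj.message === "string") {
    return { kind: "chat", message: obj.message };
  }
  if (
    obj.kind === "moveTo" &&
    typeof obj.x === "number" &&
    typeof obj.y === "number" &&
    typeof obj.z === "number"
  ) {
    return { kind: "moveTo", x: obj.x, y: obj.y, z: obj.z };
  }
  return { kind: "idle" }; // unknown action or missing fields
}
```

Keeping the action set small and explicitly validated limits what a confused or adversarially prompted model can make the in-game character do.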

Ecosystem & Platform Integration

AI VTubers are often integrated into broader social platforms via specific services^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md]:

  • Chat Platforms: Bots for Discord and Telegram allow the AI character to interact directly in community channels^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md].
  • Plugin Architecture: Extensibility is achieved through plugin systems (e.g., for Bilibili, HomeAssistant, or Claude Code), enabling the character to perform tasks outside of simple chatting^[001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md].
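
A plugin system of the kind described above can be sketched as a registry that offers each command to registered plugins in turn. `CharacterPlugin` and `PluginRegistry` are hypothetical names and do not reflect the plugin API of any particular project.

```typescript
// Hypothetical plugin contract; not the API of any particular project.
interface CharacterPlugin {
  name: string;
  // Return a reply if this plugin handles the command, otherwise undefined.
  handle(command: string, args: string[]): string | undefined;
}

class PluginRegistry {
  private plugins: CharacterPlugin[] = [];

  register(plugin: CharacterPlugin): void {
    if (this.plugins.some((p) => p.name === plugin.name)) {
      throw new Error(`plugin "${plugin.name}" is already registered`);
    }
    this.plugins.push(plugin);
  }

  // Offer the command to each plugin in registration order; the first handler wins.
  dispatch(command: string, args: string[] = []): string | undefined {
    for (const plugin of this.plugins) {
      const result = plugin.handle(command, args);
      if (result !== undefined) return result;
    }
    return undefined;
  }
}
```

A home-automation plugin, for instance, might handle a `lights` command and ignore everything else, leaving unrelated commands to other plugins.
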

Related Concepts

  • [[Neuro-sama]]: The pioneering AI VTuber that popularized this technology stack.
  • [[Three.js]]: The graphics library often used for rendering 3D VTuber models.
  • [[Live2D]]: The technology behind 2D interactive avatars.
  • [[Local LLM]]: Running language models locally for privacy and low latency.

Sources

  • 001-TODO__Project_AIRI_-_开源_AI_VTuber_赛博伴侣.md