AI Engine Layer
Engine Architecture Overview
KOLI adopts a modular engine architecture that decouples speech recognition, language processing, and speech synthesis into separate components. It comprises three main engine types:
ASR (Automatic Speech Recognition): Converts user audio input into text.
LLM (Large Language Model): Understands and processes textual input to generate intelligent responses.
TTS (Text-to-Speech): Converts generated responses into speech for vocal delivery.
This architecture allows KOLI to mix and match different providers for each capability, optimizing for performance, cost, or scalability as needed.

EnginePool & Engine Management

The core of the engine layer is the EnginePool, a centralized singleton class responsible for creating, registering, and managing all engine instances during system initialization. Each engine type (ASR, LLM, TTS) follows a unified pattern for configuration, registration, and invocation.
During system startup, the EnginePool reads from a configuration file (e.g., config.yaml), instantiates all listed engines, and makes them globally accessible via the getEngine method.
# Simplified EnginePool Initialization
def setup(self, config):
    # Create ASR engines
    for asrCfg in config.ASR.SUPPORT_LIST:
        self._pool[EngineType.ASR][asrCfg.NAME] = ASRFactory.create(asrCfg)
    # Create LLM engines
    for llmCfg in config.LLM.SUPPORT_LIST:
        self._pool[EngineType.LLM][llmCfg.NAME] = LLMFactory.create(llmCfg)
    # Create TTS engines
    for ttsCfg in config.TTS.SUPPORT_LIST:
        self._pool[EngineType.TTS][ttsCfg.NAME] = TTSFactory.create(ttsCfg)

This approach provides a clean and scalable way to manage diverse engine capabilities across multiple modalities.
ASR Engines (Speech-to-Text)
ASR engines convert user audio into plain text. All ASR engines in KOLI inherit from a common BaseEngine interface and implement standardized methods such as:
setup(): Initializes credentials, model instances, and any preloading steps.
run(input: AudioMessage): Accepts audio input and returns recognized text.
The system supports integration with multiple ASR providers and uses standard data structures to normalize recognition results, ensuring downstream compatibility.
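A minimal sketch of what a BaseEngine subclass for ASR could look like. The AudioMessage and TextMessage field names and the DummyASREngine class are illustrative assumptions; a real engine would call a provider SDK inside run().

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class AudioMessage:
    data: bytes               # raw or encoded audio payload (assumed field)
    sample_rate: int = 16000  # assumed default

@dataclass
class TextMessage:
    text: str

class BaseEngine(ABC):
    @abstractmethod
    def setup(self): ...
    @abstractmethod
    def run(self, input): ...

class DummyASREngine(BaseEngine):
    """Illustrative ASR engine; a real one would invoke a provider SDK in run()."""
    def setup(self):
        self.ready = True  # e.g. load credentials or a local model here

    def run(self, input: AudioMessage) -> TextMessage:
        # Normalize whatever the provider returns into the shared TextMessage type.
        return TextMessage(text="<recognized transcript>")

engine = DummyASREngine()
engine.setup()
result = engine.run(AudioMessage(data=b"\x00\x01"))
```

Returning a typed TextMessage rather than a raw provider payload is what keeps downstream stages provider-agnostic.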
LLM Engines (Text Understanding & Response)
LLM engines process the text recognized from ASR and generate semantically rich replies. KOLI supports both third-party APIs and custom fine-tuned models trained on blockchain and crypto corpora.
Each LLM engine implements:
setup(): Authenticates API tokens and configures model parameters.
run(input: TextMessage): Sends a structured prompt to the model and returns a formatted response.
Sample Supported LLM Engines
Engine | Description | Provider
OpenAI_API | GPT-based general-purpose LLM | OpenAI
Internal_LLM | KOLI’s custom crypto-native model | In-house
Grok_API | Grok’s LLM platform integration | GrokAI
This abstraction enables rapid switching between models with no code change in the core dialogue pipeline.
TTS Engines (Text-to-Speech)
TTS engines convert textual responses into high-fidelity speech audio. Like ASR and LLM engines, TTS modules adhere to the BaseEngine interface and follow consistent method definitions:
setup(): Initializes voice presets, API credentials, or local models.
run(input: TextMessage): Converts textual replies to audio and returns a streamable output.
Each AI twin can be mapped to a distinct voice signature using TTS parameters, allowing dynamic control over tone, pitch, and emotion based on the character profile.
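A sketch of how a per-character voice signature might be passed into a TTS engine. The VoiceProfile class and its parameter names (voice_id, pitch, emotion) are invented for illustration; the actual TTS parameters depend on the provider.

```python
from dataclasses import dataclass

@dataclass
class VoiceProfile:
    voice_id: str             # which preset voice to use (assumed name)
    pitch: float = 1.0        # relative pitch shift (assumed name)
    emotion: str = "neutral"  # preset label understood by the provider (assumed)

class DummyTTSEngine:
    """Illustrative TTS engine configured with a per-character voice profile."""
    def setup(self, profile: VoiceProfile):
        self.profile = profile

    def run(self, text):
        # A real engine would stream synthesized audio; here we echo the config
        # alongside a placeholder payload to show where the audio would go.
        return {
            "voice": self.profile.voice_id,
            "emotion": self.profile.emotion,
            "audio": b"<pcm bytes>",
        }

tts = DummyTTSEngine()
tts.setup(VoiceProfile(voice_id="koli_host", emotion="upbeat"))
out = tts.run("Hello!")
```

Swapping the VoiceProfile at setup time is enough to give each AI twin a distinct voice without touching the synthesis code itself.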
Message Processing Flow

The pipeline from user input to response is orchestrated across the ASR, LLM, and TTS engines as follows:
User Speech Input: Audio is captured and formatted into an AudioMessage object.
ASR Engine: Audio is passed to an ASR engine and converted into a TextMessage.
LLM Engine: Text is interpreted by an LLM engine to generate a response TextMessage.
TTS Engine: Text is synthesized into an AudioMessage by the TTS engine.
Response Output: The audio output is streamed back to the user for playback.
Each message type is strongly typed and structured to ensure compatibility across stages. This pipeline supports chained or conditional execution logic and can be customized per agent behavior.
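The stages above can be sketched as a single chained function. The process_turn helper and Stub engines are simplified stand-ins: each stub's run() just tags the payload so the stage order is visible in the output.

```python
def process_turn(audio, asr, llm, tts):
    """Chain ASR -> LLM -> TTS, passing typed messages between stages."""
    text_in = asr.run(audio)   # AudioMessage -> TextMessage
    reply = llm.run(text_in)   # TextMessage  -> TextMessage
    return tts.run(reply)      # TextMessage  -> AudioMessage

# Minimal stand-in engines: run() appends a label so we can trace the flow.
class Stub:
    def __init__(self, label):
        self.label = label

    def run(self, msg):
        return f"{msg}->{self.label}"

out = process_turn("audio", Stub("asr"), Stub("llm"), Stub("tts"))
# out traces the stage order: "audio->asr->llm->tts"
```

Because each stage only depends on the message type it receives, conditional logic (e.g. skipping TTS for text-only clients) slots in between the calls without changing the engines.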
Engine API Call Lifecycle
All engines share a unified lifecycle for interacting with external APIs or internal services:
Authentication: Acquire and manage secure tokens.
Request Prep: Format data per engine requirements.
API Call: Execute asynchronous HTTP or SDK-based requests.
Response Parsing: Convert results into standardized message objects.
Error Handling: Retry, fallback, and log failed invocations gracefully.
This pattern minimizes coupling between engines and business logic, allowing centralized logging and observability across the entire stack.
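The retry-and-fallback step of this lifecycle might look like the sketch below. The call_with_retry helper, its parameter names, and the exponential backoff policy are assumptions for illustration, not KOLI's actual error-handling code.

```python
import time

def call_with_retry(request_fn, fallback_fn=None, retries=3, delay=0.0):
    """Run an engine API call, retrying on failure and falling back if exhausted."""
    last_err = None
    for attempt in range(retries):
        try:
            return request_fn()
        except Exception as err:  # in practice: catch provider-specific errors
            last_err = err
            time.sleep(delay * (2 ** attempt))  # exponential backoff between tries
    if fallback_fn is not None:
        return fallback_fn()  # e.g. route to a secondary engine
    raise last_err

# Simulate a provider that fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient provider error")
    return "ok"

result = call_with_retry(flaky)
```

Centralizing retries in one helper is what lets logging and observability hooks see every failed invocation in one place.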
Agent System Integration
KOLI’s dialogue orchestration layer (Agent) integrates tightly with the EnginePool. Each user session spawns an Agent instance from the AgentPool that:
Tracks dialogue state and user context.
Delegates modality-specific tasks to engines via EnginePool.
Controls flow logic such as slot filling, fallback, or command routing.
This separation of concerns allows the system to evolve rapidly—Agents manage intent and interaction goals, while engines focus on capability execution. A single Agent may dynamically mix multiple engines across turns (e.g., LLM-A → LLM-B → TTS-C), enabling hybrid strategies or A/B testing.
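The Agent/engine split described above can be sketched as follows. The handle method, the engines lookup, and the lambda stand-ins for LLM-A and LLM-B are illustrative assumptions; they only show where flow logic lives versus capability execution.

```python
class Agent:
    """Tracks per-session state and delegates modality work to engines."""
    def __init__(self, engines):
        self.engines = engines  # injected lookup, e.g. backed by the EnginePool
        self.history = []       # dialogue state for this session

    def handle(self, text, llm_name):
        # Flow logic (which engine, when) lives here; the engine only executes.
        llm = self.engines[llm_name]
        reply = llm(text)
        self.history.append((text, reply))
        return reply

# Different turns may target different engines (hybrid or A/B strategies).
engines = {"LLM-A": lambda t: t.upper(), "LLM-B": lambda t: t[::-1]}
agent = Agent(engines)
r1 = agent.handle("hello", "LLM-A")  # routed to LLM-A
r2 = agent.handle("hello", "LLM-B")  # same session, different engine
```

Because the Agent resolves an engine by name on every turn, swapping providers mid-conversation requires no change to the Agent itself.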
Adding New Engines
To integrate a new engine type into KOLI:
Create a New Class: Inherit from BaseEngine, implement setup() and run().
Register the Engine: Add it to the appropriate factory (ASR/LLM/TTS).
Update Configuration: Add parameters to config.yaml.
Test and Benchmark: Verify correct message format, latency, and edge-case handling.
Enable in Production: Use feature toggles to activate the new engine per user cohort or task type.
This extensibility ensures KOLI can incorporate emerging AI capabilities (e.g., multi-turn video LLMs, few-shot audio synthesis) while preserving system integrity.
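The first two integration steps (subclass, then register with a factory) could be sketched as below. The decorator-based registry, the My_New_LLM name, and the temperature parameter are hypothetical; the real factories may register engines differently.

```python
class LLMFactory:
    """Maps a config NAME to an engine class; mirrors the registration step above."""
    _registry = {}

    @classmethod
    def register(cls, name):
        # Decorator that records the class under the given name.
        def deco(engine_cls):
            cls._registry[name] = engine_cls
            return engine_cls
        return deco

    @classmethod
    def create(cls, name, **params):
        # Instantiate the registered class with parameters from config.yaml.
        return cls._registry[name](**params)

@LLMFactory.register("My_New_LLM")
class MyNewLLM:
    def __init__(self, **params):
        self.params = params  # values taken from the engine's config entry

    def setup(self):
        ...  # authenticate, load model, etc.

    def run(self, text):
        return f"reply to: {text}"

engine = LLMFactory.create("My_New_LLM", temperature=0.7)
```

With this shape, adding an engine is a new class plus one config entry; the dialogue pipeline never imports the engine class directly.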