Translation Pipeline
An 8-stage neural pipeline purpose-built for Austronesian languages. Our foundation model is heavily pre-trained on trilingual corpora (Indonesian, Malay, Filipino) with domain-adaptive LoRA specialization, achieving native-level fluency that generic machine translators can't match.
Why Our Model is Different
Unlike general-purpose translators that bolt on language support as an afterthought, our core model undergoes a three-phase training regime designed specifically for the Austronesian language family:
Phase 1. Trilingual Pre-training: The foundation Gemma model is continually pre-trained on a curated corpus spanning Indonesian, Malay, and Filipino, building deep cross-lingual representations that capture shared Austronesian morphology, code-switching patterns, and cultural context.
Phase 2. Register-Adaptive LoRA: Three specialized Low-Rank Adaptation layers are trained, one per translation register: Formal (institutional, literary), Casual (colloquial, youth slang), and Neutral (game UI, system text). Each adapter is activated per-line based on contextual metadata, producing translations that match the intended tone precisely.
Phase 3. Game-Specific Fine-tuning: During the pipeline's clustering stage, semantic patterns unique to each game title are automatically identified and collected. This data is used to construct per-game LoRA adapters that capture title-specific terminology, character voice profiles, and narrative conventions, essentially teaching the model to "speak" each game's language.
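As an illustration of the per-line adapter selection described in Phases 2 and 3, the sketch below routes each line to a register adapter and layers a game-specific adapter on top when one exists. The adapter names, metadata fields, and function are hypothetical, not the production implementation:

```python
# Hypothetical per-line adapter routing sketch; names are illustrative only.
REGISTER_ADAPTERS = {
    "formal": "lora-formal",    # institutional, literary
    "casual": "lora-casual",    # colloquial, youth slang
    "neutral": "lora-neutral",  # game UI, system text
}

def select_adapters(line_meta: dict, game_adapters: dict) -> list:
    """Return the adapter stack for one text line: the register adapter first,
    then a game-specific adapter layered on top when one is available."""
    register = line_meta.get("register", "neutral")
    stack = [REGISTER_ADAPTERS.get(register, REGISTER_ADAPTERS["neutral"])]
    game_id = line_meta.get("game_id")
    if game_id in game_adapters:
        stack.append(game_adapters[game_id])
    return stack
```

Unknown registers fall back to the Neutral adapter, so a missing metadata field degrades gracefully instead of failing the line.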
Neural Ensemble
Four specialized models working in tandem through a unified semantic memory layer
Entity Recognition Engine
Entity extraction system that identifies game-world elements (characters, locations, items, factions) without requiring per-title training data. Feeds into our protected-term registry and cross-lingual entity graph.
Semantic Embedding Core
Our in-house embedding engine performs 3-pass semantic analysis (classification → similarity → clustering) to map every text fragment into a high-dimensional narrative space. Powers archetype matching, emotional tone detection, and register classification.
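A minimal sketch of the three passes, assuming embeddings arrive as NumPy arrays and archetypes are represented by centroid vectors; the function names and the similarity threshold are illustrative, not the in-house engine:

```python
import numpy as np

def l2norm(x):
    # Normalize rows so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def classify(frag_emb, centroids):
    # Pass 1: assign a fragment to its nearest archetype centroid.
    sims = l2norm(frag_emb) @ l2norm(centroids).T
    return int(np.argmax(sims)), float(np.max(sims))

def pairwise_similarity(embs):
    # Pass 2: fragment-to-fragment cosine similarity matrix.
    n = l2norm(embs)
    return n @ n.T

def cluster(embs, centroids, threshold=0.5):
    # Pass 3: group fragments by centroid; weak matches are flagged as outliers (-1).
    sims = l2norm(embs) @ l2norm(centroids).T
    labels = np.argmax(sims, axis=1)
    labels[np.max(sims, axis=1) < threshold] = -1
    return labels
```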
Cross-lingual Alignment
High-dimensional cross-lingual retrieval model for semantic quality scoring. Validates translation fidelity across all three target languages simultaneously, providing word-level alignment confidence used by our tag restoration engine.
Trilingual Foundation LLM
Custom Gemma-based model continually pre-trained on Austronesian corpora with register-adaptive LoRA switching (Formal / Casual / Neutral). Per-game adapters are constructed from semantic clustering data, enabling title-specific voice and terminology matching.
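To illustrate how the alignment model's scores can gate a trilingual unit, the sketch below takes one cosine score per target language and fails the unit if any language falls below a threshold. The embedding shapes and the 0.75 threshold are assumptions, not production values:

```python
import numpy as np

def fidelity_gate(src_emb, tgt_embs, threshold=0.75):
    """Score each target-language translation against the source embedding by
    cosine similarity; the minimum across languages decides pass/fail, so a
    single weak language fails the whole trilingual unit."""
    src = src_emb / np.linalg.norm(src_emb)
    tgts = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    scores = tgts @ src  # one cosine score per language (e.g. ID, MS, TL)
    return scores, bool(scores.min() >= threshold)
```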
Pipeline Architecture
Data flows top-to-bottom; related stages run in parallel where possible.
- Format Normalization: extracts translatable strings from native game formats (XML, JSON, binary, custom archives) into a unified intermediate representation with full structural metadata for lossless reconstruction.
- Markup Abstraction: game-specific markup (color tags, control codes, variables) is abstracted into format-agnostic placeholders, preserving the original mapping for pixel-perfect restoration.
- Intelligent Filtering: applies heuristic and NER-based analysis to identify untranslatable content (proper nouns, coded identifiers, engine constants), preventing unnecessary LLM calls and reducing noise.
- Character Profiling: extracts speaker identity from dialogue metadata and initiates archetype generation via the Foundation LLM for voice consistency.
- Deduplication & Stratification: eliminates redundant strings and stratifies content into optimally-sized translation units.
- Narrative Profiling: the Semantic Embedding Core performs tri-pass analysis (classification → similarity → clustering), mapping each text fragment against archetype centroids to determine emotional tone, register, and scene context.
- Knowledge Graph Construction: the Entity Recognition Engine builds a relational graph of game-world entities, tracking character relationships, faction hierarchies, and location networks for context-aware translation.
- Terminology Mining: zero-shot NER + semantic clustering identifies domain-specific vocabulary, which is then refined through LLM-assisted filtering (removing ~95% of noise) to produce a curated protected-term dictionary.
- LoRA Data Collection: cluster patterns and semantic signatures from this stage are automatically harvested to construct game-specific LoRA adapters for the Foundation LLM.
- Register-Based Routing: each text unit is routed to the appropriate LoRA adapter (Formal / Casual / Neutral) based on narrative metadata from Stage B. Game-specific adapters are layered on top when available.
- Context-Augmented Generation: retrieval-augmented prompting injects entity dictionaries, register-specific slang palettes, trigger phrases, and translation memory as few-shot examples for maximum consistency.
- Structured Trilingual Output: generates all three language variants (ID, MS, TL) simultaneously with schema-enforced JSON output, ensuring parallel parity across languages.
- Real-time Integrity Checks: inline validation catches tag corruption, hallucinated content, and length overflow before output is committed, triggering automatic retry with failover routing.
- Adaptive Concurrency: multi-worker pipeline with intelligent rate management and thermal throttling for sustained GPU throughput.
- Anomaly Detection:
- Entity reversion: auto-corrects mistranslated proper nouns
- Hallucination scoring: structural divergence analysis
- Byte-level anomaly detection: overflow/underflow flagging
- LLM-Assisted Repair: unfixable items (~5-10%) are re-processed through the Foundation LLM with corrective prompting.
- Differential Reporting: generates before/after analysis for human review.
- Review Queue: flagged items are presented with contextual diffs and confidence scores for expert review.
- Selective Override: operators can apply corrections with granular merge strategies (overwrite, fill-missing, conditional).
- Feedback Loop: corrections feed back into translation memory and LoRA training data for continuous model improvement.
- Embedding Coherence: verifies that translated text maintains semantic alignment with the source using cached high-dimensional embeddings.
- Adaptive Clustering: unsupervised clustering with dimensionality reduction identifies semantic outliers and distributional anomalies. Cluster data is also harvested for LoRA adapter construction.
- Multi-factor Quality Score: per-line composite score combining semantic fidelity, markup preservation, and length compliance.
- 5-Tier Restoration Cascade:
- Tier 0: Translation memory lookup (instant)
- Tier 1: Fuzzy pattern matching
- Tier 2: Cross-lingual word alignment matrix
- Tier 3: Contextual anchor inference
- Tier 4: Entity-based cross-lingual positional mapping
- Coherence Scoring: sliding-window semantic analysis validates placement quality at every tier.
- Injection Sanitization: detects and removes AI-hallucinated markup not present in the original source.
- Structural Parity Enforcement: guarantees bit-perfect tag synchronization with source material.
- Markup Reconstruction: abstract placeholders are restored to their original game-specific markup using the format registry from Stage A.
- Format Rebuilding: structural manifests guide pixel-perfect reconstruction of native game files across 30+ supported formats.
- Profile-aware Validation: final markup integrity check using tag profiles to catch any remaining discrepancies.
- Distribution Packaging: automated build system generates installer-ready packages with per-language coverage analytics, lexical quality metrics, and auto-generated documentation.
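The tag-parity idea that runs through Markup Abstraction, Real-time Integrity Checks, and Structural Parity Enforcement can be sketched as a simple placeholder comparison between source and translation. The `{{T0}}`-style placeholder syntax here is hypothetical:

```python
import re

# Hypothetical placeholder syntax for abstracted markup: {{T0}}, {{T1}}, ...
PLACEHOLDER = re.compile(r"\{\{T\d+\}\}")

def check_tag_parity(source: str, translated: str) -> dict:
    """Compare abstracted-markup placeholders between source and translation,
    flagging dropped tags, hallucinated tags, and reordering."""
    src_tags = PLACEHOLDER.findall(source)
    tgt_tags = PLACEHOLDER.findall(translated)
    return {
        "missing": sorted(set(src_tags) - set(tgt_tags)),       # dropped by the model
        "hallucinated": sorted(set(tgt_tags) - set(src_tags)),  # injected markup
        "reordered": src_tags != tgt_tags and set(src_tags) == set(tgt_tags),
    }
```

Reordering is reported separately from missing/hallucinated tags because word order legitimately changes across languages, so a reordered tag is a candidate for alignment-based restoration rather than an automatic failure.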
Pipeline Internals
Key data artifacts that flow between stages through the neural pipeline
| Component | Function | Produced | Consumed |
|---|---|---|---|
| Structural Manifests | Lossless structural maps for native format reconstruction | | |
| Markup Registry | Bidirectional mapping between abstract and native markup | | |
| Narrative Metadata | Per-line archetype, emotion, register, and scene classification | | |
| Entity Graph | Relational knowledge graph of game-world entities | | |
| Semantic Memory | Shared high-dimensional embedding cache (fp16) across all neural models | | |
| Quality Scores | Per-line composite quality metrics with cluster assignments | | |
| LoRA Training Data | Automatically harvested semantic patterns for game-specific adapter construction | | |