[#3553576] feat: Add custom XLM-RoBERTa tokenizer with decorator pattern (!1)

What We Did:

Custom Tokenizer Infrastructure:

Added XLM-RoBERTa tokenizer with decorator pattern supporting both LiteLLM (HTTP) and CLI SentencePiece modes
Created admin UI for tokenizer configuration with real-time validation and test functionality
Implemented tokenizer mode switching (LiteLLM/CLI/None) with proper dependency injection
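
A minimal sketch of how the decorator and mode switching could fit together, written in Python for illustration only. The class names, the `count_tokens` method, and the injected `backends` mapping are all hypothetical and are not taken from the merge request; the two backends are stubbed with plain functions rather than real LiteLLM or SentencePiece calls.

```python
from typing import Callable, Dict, Protocol


class Tokenizer(Protocol):
    def count_tokens(self, text: str) -> int: ...


class WhitespaceTokenizer:
    """Stand-in inner tokenizer (hypothetical, used when the mode is 'none')."""

    def count_tokens(self, text: str) -> int:
        return len(text.split())


class XlmRobertaTokenizerDecorator:
    """Wraps an inner tokenizer and delegates to a backend chosen by mode."""

    def __init__(
        self,
        inner: Tokenizer,
        mode: str,
        backends: Dict[str, Callable[[str], int]],
    ) -> None:
        # Backends are injected, so HTTP or CLI details stay outside this class.
        if mode != "none" and mode not in backends:
            raise ValueError(f"unknown tokenizer mode: {mode}")
        self._inner = inner
        self._mode = mode
        self._backends = backends

    def count_tokens(self, text: str) -> int:
        if self._mode == "none":
            # No custom tokenizer configured: fall through to the wrapped one.
            return self._inner.count_tokens(text)
        return self._backends[self._mode](text)


# Stub backends; a real setup would call the HTTP endpoint or the CLI here.
backends = {
    "litellm": lambda text: len(text) // 4,
    "cli": lambda text: len(text) // 3,
}
tokenizer = XlmRobertaTokenizerDecorator(WhitespaceTokenizer(), "cli", backends)
print(tokenizer.count_tokens("hello world, this is a test"))
```

Because the backends are passed in rather than constructed inside the decorator, switching modes is a configuration change and each backend can be replaced by a stub in tests.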

Token-Aware Text Chunking:

Replaced RecursiveTextChunker with TokenAwareTextChunker, which uses density-based probing (3-4 API calls instead of N)
Implemented separator hierarchy chunking (\n\n → \n → . → \t → space) with boundary snapping
Added risk-based validation for worst-case chunks with comprehensive math documentation
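
The probing and snapping ideas above can be sketched roughly as follows. This is an illustrative Python approximation, not the TokenAwareTextChunker code: the function names, the `safety` margin, the probe size, and the assumption that the last separator in the hierarchy is a single space are all hypothetical. The risk-based validation step is not covered here.

```python
from typing import Callable, List

# Separator hierarchy, coarsest first (final entry assumed to be a space).
SEPARATORS = ["\n\n", "\n", ".", "\t", " "]


def estimate_density(
    text: str,
    count_tokens: Callable[[str], int],
    probes: int = 3,
    probe_chars: int = 200,
) -> float:
    """Estimate tokens per character from a few evenly spaced samples."""
    if not text:
        return 0.0
    probe_chars = min(probe_chars, len(text))
    span = len(text) - probe_chars
    total_tokens = 0
    for i in range(probes):
        start = span * i // max(probes - 1, 1)
        # One tokenizer call per probe, regardless of document length.
        total_tokens += count_tokens(text[start:start + probe_chars])
    return total_tokens / (probes * probe_chars)


def snap_to_boundary(text: str, start: int, limit: int) -> int:
    """Pick a cut position in (start, limit], preferring coarser separators."""
    if limit >= len(text):
        return len(text)
    for sep in SEPARATORS:
        pos = text.rfind(sep, start, limit)
        if pos > start:
            return pos + len(sep)  # cut just after the separator
    return limit  # no separator found: hard cut at the character budget


def chunk(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int],
    safety: float = 0.8,
) -> List[str]:
    """Split text into chunks sized by estimated token density."""
    density = estimate_density(text, count_tokens)
    if density <= 0:
        return [text] if text else []
    # Convert the token budget into a character budget, with a safety margin.
    budget_chars = max(1, int(max_tokens * safety / density))
    chunks, start = [], 0
    while start < len(text):
        limit = min(start + budget_chars, len(text))
        end = snap_to_boundary(text, start, limit)
        chunks.append(text[start:end])
        start = end
    return chunks
```

The tokenizer is called only inside `estimate_density`, so the number of calls is fixed by `probes` rather than growing with the number of chunks. The trade-off is that chunk sizes are estimates, which is why a separate check of the worst-case chunks against the real tokenizer is still needed.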

Embedding Strategy:

Created TokenAwareEmbeddingStrategy plugin replacing the old recursive implementation
Integrated with AI Search module using the token-aware chunker