[#3553576] feat: Add custom XLM-RoBERTa tokenizer with decorator pattern
What We Did:
Custom Tokenizer Infrastructure:
- Added XLM-RoBERTa tokenizer with decorator pattern supporting both LiteLLM (HTTP) and CLI SentencePiece modes
- Created admin UI for tokenizer configuration with real-time validation and test functionality
- Implemented tokenizer mode switching (LiteLLM/CLI/None) with proper dependency injection
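As a rough illustration of the decorator pattern and mode switching described above, here is a minimal Python sketch. It is not the code from this change: every name in it (`Tokenizer`, `WhitespaceTokenizer`, `CachingTokenizer`, `build_tokenizer`, the mode strings) is hypothetical, and the whitespace backend stands in for the real LiteLLM (HTTP) and SentencePiece CLI backends. Caching is used only as an example of behaviour a decorator might add.

```python
from typing import Callable, Dict, Optional, Protocol


class Tokenizer(Protocol):
    """Common interface shared by every backend and every decorator."""

    def count_tokens(self, text: str) -> int: ...


class WhitespaceTokenizer:
    """Stand-in backend (assumption): a real one would call LiteLLM over
    HTTP or shell out to a SentencePiece CLI."""

    def count_tokens(self, text: str) -> int:
        return len(text.split())


class CachingTokenizer:
    """Decorator: exposes the same interface as the wrapped tokenizer and
    adds memoized counts on top of it."""

    def __init__(self, inner: Tokenizer) -> None:
        self._inner = inner
        self._cache: Dict[str, int] = {}
        self.backend_calls = 0  # how often the wrapped backend was hit

    def count_tokens(self, text: str) -> int:
        if text not in self._cache:
            self.backend_calls += 1
            self._cache[text] = self._inner.count_tokens(text)
        return self._cache[text]


def build_tokenizer(
    mode: str, backends: Dict[str, Callable[[], Tokenizer]]
) -> Optional[Tokenizer]:
    """Pick a backend by mode; the backend factories are injected by the caller."""
    if mode == "none":
        return None  # tokenization disabled
    if mode not in backends:
        raise ValueError(f"Unknown tokenizer mode: {mode!r}")
    return CachingTokenizer(backends[mode]())
```

Because the decorator and the backends share one interface, callers such as a chunker can accept any `Tokenizer` without knowing which mode is configured.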
Token-Aware Text Chunking:
- Replaced RecursiveTextChunker with TokenAwareTextChunker using density-based probing (3-4 API calls vs N calls)
- Implemented separator hierarchy chunking (\n\n→\n→.→\t→) with boundary snapping
- Added risk-based validation for worst-case chunks with comprehensive math documentation
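The chunking ideas listed above (density probing, a separator hierarchy with boundary snapping, and worst-case validation) can be sketched roughly as follows. This is one possible reading, not the actual TokenAwareTextChunker: the function names, the probe size, the safety factor, and the trailing space in the separator list are all assumptions made for illustration.

```python
from typing import Callable, List

# Strongest to weakest separator. The final space entry is an assumption.
SEPARATORS = ["\n\n", "\n", ".", "\t", " "]


def estimate_density(text: str, count_tokens: Callable[[str], int],
                     probes: int = 3, probe_chars: int = 200) -> float:
    """Estimate tokens per character from a few evenly spaced samples."""
    if not text:
        return 0.0
    if len(text) <= probe_chars:
        return count_tokens(text) / len(text)
    step = (len(text) - probe_chars) // max(probes - 1, 1)
    densities = []
    for i in range(probes):
        sample = text[i * step : i * step + probe_chars]
        densities.append(count_tokens(sample) / len(sample))
    return max(densities)  # use the densest sample to stay conservative


def snap_to_boundary(text: str, start: int, limit: int) -> int:
    """Return a cut index after `start`, preferring the strongest separator."""
    if limit >= len(text):
        return len(text)
    for sep in SEPARATORS:
        pos = text.rfind(sep, start, limit)
        if pos > start:
            return pos + len(sep)  # keep the separator with the left chunk
    return limit  # no separator in range: fall back to a hard cut


def chunk_text(text: str, count_tokens: Callable[[str], int],
               max_tokens: int, safety: float = 0.9) -> List[str]:
    """Split text into chunks sized by a character budget derived from density."""
    density = estimate_density(text, count_tokens)
    if density == 0:
        return [text] if text else []
    budget = max(1, int(max_tokens * safety / density))
    chunks, start = [], 0
    while start < len(text):
        end = snap_to_boundary(text, start, start + budget)
        chunks.append(text[start:end])
        start = end
    return chunks


def validate_worst_case(chunks: List[str], count_tokens: Callable[[str], int],
                        max_tokens: int) -> bool:
    """Spend one real tokenizer call on the longest (riskiest) chunk."""
    if not chunks:
        return True
    return count_tokens(max(chunks, key=len)) <= max_tokens
```

With three probes and one validation call, this sketch makes four tokenizer calls per document regardless of how many chunks come out, which is the kind of saving the "3-4 API calls vs N calls" note refers to. Note that the longest chunk by character count is only a heuristic for the worst case; a stricter check would count every chunk.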
Embedding Strategy:
- Created TokenAwareEmbeddingStrategy plugin replacing the old recursive implementation
- Integrated with AI Search module using the token-aware chunker