[#3553576] feat: Add custom XLM-RoBERTa tokenizer with decorator pattern

What We Did:

Custom Tokenizer Infrastructure:

  • Added XLM-RoBERTa tokenizer with decorator pattern supporting both LiteLLM (HTTP) and CLI SentencePiece modes
  • Created admin UI for tokenizer configuration with real-time validation and test functionality
  • Implemented tokenizer mode switching (LiteLLM/CLI/None) with proper dependency injection
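For reviewers unfamiliar with the pattern, here is a minimal Python sketch of how a decorator can wrap interchangeable tokenizer backends behind one interface, with the mode chosen by injection. It is not the code in this merge request: `WhitespaceTokenizer`, `CachingTokenizer`, and `make_tokenizer` are hypothetical names, the whitespace backend stands in for the real LiteLLM (HTTP) and SentencePiece CLI backends, and caching is only one example of what a decorator might add.

```python
from typing import Callable, Dict, Optional, Protocol


class Tokenizer(Protocol):
    def count_tokens(self, text: str) -> int: ...


class WhitespaceTokenizer:
    """Stand-in backend (assumption): a real backend would call LiteLLM
    over HTTP or shell out to a SentencePiece CLI."""

    def count_tokens(self, text: str) -> int:
        return len(text.split())


class CachingTokenizer:
    """Decorator: exposes the same interface as the wrapped tokenizer
    and adds memoization, so repeated inputs cost no backend call."""

    def __init__(self, inner: Tokenizer) -> None:
        self.inner = inner
        self.cache: Dict[str, int] = {}
        self.backend_calls = 0

    def count_tokens(self, text: str) -> int:
        if text not in self.cache:
            self.backend_calls += 1
            self.cache[text] = self.inner.count_tokens(text)
        return self.cache[text]


def make_tokenizer(
    mode: str, backends: Dict[str, Callable[[], Tokenizer]]
) -> Optional[CachingTokenizer]:
    """Pick a backend by mode; 'none' disables token counting entirely."""
    if mode == "none":
        return None
    if mode not in backends:
        raise ValueError(f"unknown tokenizer mode: {mode}")
    # The backend factory is injected, so callers never construct it directly.
    return CachingTokenizer(backends[mode]())


# Usage: both modes map to the stand-in backend in this sketch.
backends = {"litellm": WhitespaceTokenizer, "cli": WhitespaceTokenizer}
tokenizer = make_tokenizer("cli", backends)
```

Because callers depend only on `count_tokens`, switching between modes does not change any chunking code.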

Token-Aware Text Chunking:

  • Replaced RecursiveTextChunker with TokenAwareTextChunker using density-based probing (3-4 API calls vs N calls)
  • Implemented separator hierarchy chunking ("\n\n", "\n", ". ", "\t", " ") with boundary snapping
  • Added risk-based validation for worst-case chunks with comprehensive math documentation
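The chunking idea above can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the implementation in this merge request: the function names, the probe size, the use of the maximum sampled density, the separator order (one assumed reading of the hierarchy listed above), and the "longest chunk is riskiest" heuristic are all hypothetical.

```python
from typing import Callable, List

# Assumed separator order, coarsest first.
SEPARATORS = ["\n\n", "\n", ". ", "\t", " "]


def estimate_density(text: str, count_tokens: Callable[[str], int],
                     probes: int = 3, probe_chars: int = 200) -> float:
    """Estimate tokens per character from a few evenly spaced samples.
    Costs `probes` tokenizer calls regardless of document length."""
    if not text:
        return 0.0
    size = min(probe_chars, len(text))
    densities = []
    for i in range(probes):
        start = (len(text) - size) * i // max(probes - 1, 1)
        sample = text[start:start + size]
        densities.append(count_tokens(sample) / len(sample))
    return max(densities)  # the densest sample gives the most conservative budget


def snap(text: str, start: int, hard_end: int) -> int:
    """Move the cut back to the coarsest separator inside the window."""
    if hard_end >= len(text):
        return len(text)
    for sep in SEPARATORS:
        pos = text.rfind(sep, start, hard_end)
        if pos > start:
            return pos + len(sep)
    return hard_end  # no separator found: cut at the character budget


def chunk(text: str, count_tokens: Callable[[str], int],
          max_tokens: int, probes: int = 3) -> List[str]:
    density = estimate_density(text, count_tokens, probes)
    if density == 0:
        return [text] if text else []
    budget = max(1, int(max_tokens / density))  # characters per chunk
    chunks, start = [], 0
    while start < len(text):
        end = snap(text, start, start + budget)
        chunks.append(text[start:end])
        start = end
    return chunks


def validate_worst(chunks: List[str], count_tokens: Callable[[str], int],
                   max_tokens: int, k: int = 1) -> bool:
    """Exactly count only the k longest chunks (length as a risk proxy)."""
    riskiest = sorted(chunks, key=len, reverse=True)[:k]
    return all(count_tokens(c) <= max_tokens for c in riskiest)
```

With three probes and one validated chunk, this sketch makes four tokenizer calls for a document of any length, whereas measuring every candidate chunk would need at least one call per chunk.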

Embedding Strategy:

  • Created TokenAwareEmbeddingStrategy plugin replacing the old recursive implementation
  • Integrated with AI Search module using the token-aware chunker
