Three-Layer Chunking Strategy

The Three-Layer Chunking Strategy is a methodology used for processing and segmenting text data within AI systems and knowledge bases. It is designed to optimize information retrieval by applying different parsing logic based on the structural and semantic characteristics of the content^[001-TODO__GBrain_-_AI_Agent_个人知识库与混合检索引擎.md].

This approach recognizes that a single splitting method (e.g., simple character limits) is often insufficient for complex data sources. By layering strategies, the system ensures that search results maintain high semantic relevance while respecting natural text boundaries.

The Three Layers

The strategy typically involves three distinct methods, employed based on the specific type of content being processed^[001-TODO__GBrain_-_AI_Agent_个人知识库与混合检索引擎.md]:

1. Recursive Chunking

This layer is designed for timeline and batch import scenarios^[001-TODO__GBrain_-_AI_Agent_个人知识库与混合检索引擎.md]. It operates by identifying hierarchical separators within the text (e.g., headers, nested lists) to split content into logical sections^[001-TODO__GBrain_-_AI_Agent_个人知识库与混合检索引擎.md].

  • Mechanism: Uses up to 5 levels of separator hierarchy.
  • Parameters: Targets approximately 300 words per chunk with 50 words of overlap to preserve context between segments^[001-TODO__GBrain_-_AI_Agent_个人知识库与混合检索引擎.md].
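
The recursive approach can be sketched as follows. The specific separator list is an illustrative assumption, not taken from GBrain itself; only the 5-level hierarchy, ~300-word target, and 50-word overlap come from the source.

```python
# A minimal sketch of recursive chunking: try the coarsest separator
# first, and recurse to finer separators only when a piece is still
# too large. Separators are dropped on rejoin for simplicity.
SEPARATORS = ["\n\n\n", "\n\n", "\n", ". ", " "]  # 5 levels, coarse -> fine

def recursive_chunk(text, max_words=300, overlap=50, level=0):
    """Split text at the coarsest separator that yields small-enough pieces."""
    if len(text.split()) <= max_words:
        return [text]
    if level >= len(SEPARATORS):
        # Fallback: hard split by word count, keeping `overlap` words
        # of shared context between consecutive chunks.
        words = text.split()
        step = max_words - overlap
        return [" ".join(words[i:i + max_words])
                for i in range(0, len(words), step)]
    chunks = []
    for part in text.split(SEPARATORS[level]):
        chunks.extend(recursive_chunk(part, max_words, overlap, level + 1))
    return chunks
```

In practice the separator hierarchy would mirror the document's structure (e.g., headers before paragraphs before sentences), so splits fall on natural boundaries before the word-count fallback is ever needed.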

2. Semantic Chunking

This layer is tailored for "compiled truth" sections—content where context is dense and meaning is paramount^[001-TODO__GBrain_-_AI_Agent_个人知识库与混合检索引擎.md]. Instead of relying on fixed delimiters, it analyzes the meaning of the text.

  • Mechanism: Calculates the cosine similarity between embeddings of adjacent sentences^[001-TODO__GBrain_-_AI_Agent_个人知识库与混合检索引擎.md].
  • Goal: Identifies natural "topic boundaries" where the semantic meaning shifts significantly, ensuring chunks remain topically coherent^[001-TODO__GBrain_-_AI_Agent_个人知识库与混合检索引擎.md].
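
The boundary-detection logic can be sketched like this. A toy bag-of-words "embedding" stands in for a real embedding model so the sketch is self-contained; the similarity threshold is an assumed parameter, not a value from the source.

```python
import math
from collections import Counter

def embed(sentence):
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunk(sentences, threshold=0.3):
    """Start a new chunk wherever adjacent-sentence similarity drops,
    treating the drop as a topic boundary."""
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(cur)
    chunks.append(" ".join(current))
    return chunks
```

With dense model embeddings the similarity signal is much smoother, so real implementations often compare against a rolling average of recent sentences rather than only the single previous one.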

3. LLM-Guided Chunking

This is an advanced, intelligent layer used for high-value content^[001-TODO__GBrain_-_AI_Agent_个人知识库与混合检索引擎.md]. It utilizes a Large Language Model (LLM) to understand the text dynamically.

  • Mechanism: Employs a sliding window technique (e.g., using Claude Haiku) to analyze the text^[001-TODO__GBrain_-_AI_Agent_个人知识库与混合检索引擎.md].
  • Goal: The model actively identifies and signals topic switches, providing a more nuanced segmentation than static algorithms can achieve^[001-TODO__GBrain_-_AI_Agent_个人知识库与混合检索引擎.md].
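
The sliding-window control flow can be sketched as below. The source mentions Claude Haiku as the model; here a trivial keyword heuristic stands in for the LLM call so the sketch is runnable, and the window size is an assumed parameter.

```python
def llm_says_topic_switch(window, candidate):
    """Stub for an LLM judgment call. The real system would prompt a
    model (the source mentions Claude Haiku) with the recent window and
    the next passage and ask whether the topic has switched."""
    return not set(window.lower().split()) & set(candidate.lower().split())

def llm_guided_chunk(paragraphs, window_size=3):
    """Slide over paragraphs, closing a chunk whenever the model signals
    a topic switch relative to the last `window_size` paragraphs."""
    chunks, current = [], []
    for para in paragraphs:
        window = " ".join(current[-window_size:])
        if current and llm_says_topic_switch(window, para):
            chunks.append("\n".join(current))
            current = []
        current.append(para)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Because each boundary decision costs an LLM call, reserving this layer for high-value content (as the strategy does) keeps the cost proportional to the value of the data.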

Application Context

This strategy is a critical component of the retrieval pipeline in systems like GBrain^[001-TODO__GBrain_-_AI_Agent_个人知识库与混合检索引擎.md]. Once data is processed through these layers, it is fed into a hybrid search engine that combines vector search (HNSW cosine) with keyword search to retrieve relevant knowledge for an AI Agent^[001-TODO__GBrain_-_AI_Agent_个人知识库与混合检索引擎.md].
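
One common way to merge the two result lists is Reciprocal Rank Fusion (RRF), sketched below. The source does not specify GBrain's fusion method, so RRF here is an assumption; only the combination of vector (HNSW cosine) and keyword search comes from the source.

```python
def rrf_fuse(rankings, k=60):
    """Merge several ranked lists of doc IDs with Reciprocal Rank Fusion:
    each list contributes 1 / (k + rank) per document, so items ranked
    highly by either retriever rise to the top of the fused list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The constant k damps the influence of any single retriever's top hit; k = 60 is the value from the original RRF paper and works well without tuning.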

  • [[Hybrid Search]]
  • [[Knowledge Graph]]
  • [[Semantic Search]]

Sources

  • 001-TODO__GBrain_-_AI_Agent_个人知识库与混合检索引擎.md