Version: 2.20-unstable

Preprocessors

Module haystack_experimental.components.preprocessors.embedding_based_document_splitter

EmbeddingBasedDocumentSplitter

Splits documents based on embedding similarity using cosine distances between sequential sentence groups.

This component first splits text into sentences, optionally groups them, calculates embeddings for each group, and then uses cosine distance between sequential embeddings to determine split points. Any distance above the specified percentile is treated as a break point. The component also tracks page numbers based on form feed characters (\f) in the original document.

This component is inspired by 5 Levels of Text Splitting by Greg Kamradt.
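
The break-point logic can be pictured with a short sketch. This is illustrative only, not the component's internal implementation; find_break_points is a hypothetical helper.

python
import numpy as np

def find_break_points(embeddings: np.ndarray, percentile: float = 0.95) -> list:
    """Return the indices after which a split should occur.

    Illustrative sketch: cosine distance is computed between each pair of
    consecutive group embeddings, and any distance above the given percentile
    is treated as a break point.
    """
    a, b = embeddings[:-1], embeddings[1:]
    similarities = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    distances = 1.0 - similarities
    threshold = np.quantile(distances, percentile)
    return [i for i, distance in enumerate(distances) if distance > threshold]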

Usage example

python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack_experimental.components.preprocessors import EmbeddingBasedDocumentSplitter

# Create a document with content that has a clear topic shift
doc = Document(
    content="This is a first sentence. This is a second sentence. This is a third sentence. "
    "Completely different topic. The same completely different topic."
)

# Initialize the embedder to calculate semantic similarities
embedder = SentenceTransformersDocumentEmbedder()

# Configure the splitter with parameters that control splitting behavior
splitter = EmbeddingBasedDocumentSplitter(
    document_embedder=embedder,
    sentences_per_group=2,  # Group 2 sentences before calculating embeddings
    percentile=0.95,  # Split when cosine distance exceeds the 95th percentile
    min_length=50,  # Merge splits shorter than 50 characters
    max_length=1000,  # Further split chunks longer than 1000 characters
)
splitter.warm_up()
result = splitter.run(documents=[doc])

# The result contains a list of Document objects, each representing a semantic chunk
# Each split document includes metadata: source_id, split_id, and page_number
print(f"Original document split into {len(result['documents'])} chunks")
for i, split_doc in enumerate(result['documents']):
    print(f"Chunk {i}: {split_doc.content[:50]}...")

EmbeddingBasedDocumentSplitter.__init__

python
def __init__(*,
             document_embedder: DocumentEmbedder,
             sentences_per_group: int = 3,
             percentile: float = 0.95,
             min_length: int = 50,
             max_length: int = 1000,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True)

Initialize EmbeddingBasedDocumentSplitter.

Arguments:

  • document_embedder: The DocumentEmbedder to use for calculating embeddings.
  • sentences_per_group: Number of sentences to group together before embedding.
  • percentile: Percentile threshold for cosine distance. Distances above this percentile are treated as break points.
  • min_length: Minimum length of splits in characters. Splits below this length will be merged (see the sketch after this list).
  • max_length: Maximum length of splits in characters. Splits above this length will be recursively split.
  • language: Language for sentence tokenization.
  • use_split_rules: Whether to apply additional split rules from SentenceSplitter to the sentence spans during sentence tokenization.
  • extend_abbreviations: If True, the abbreviations used by NLTK's PunktTokenizer are extended by a list of curated abbreviations. Currently supported languages are: en, de. If False, the default abbreviations are used.
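
To picture how min_length is applied, the rough sketch below merges undersized splits into their neighbors. merge_short_splits is a hypothetical helper for illustration only, not part of the component's API, and the actual merging strategy may differ.

python
def merge_short_splits(splits: list, min_length: int) -> list:
    # Fold each split into the previous one while the previous one is still too short
    merged = []
    for split in splits:
        if merged and len(merged[-1]) < min_length:
            merged[-1] += split
        else:
            merged.append(split)
    # A trailing short split is folded back into its predecessor
    if len(merged) > 1 and len(merged[-1]) < min_length:
        last = merged.pop()
        merged[-1] += last
    return merged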

EmbeddingBasedDocumentSplitter.warm_up

python
def warm_up() -> None

Warm up the component by initializing the sentence splitter.

EmbeddingBasedDocumentSplitter.run

python
@component.output_types(documents=List[Document])
def run(documents: List[Document]) -> Dict[str, List[Document]]

Split documents based on embedding similarity.

Arguments:

  • documents: The documents to split.

Raises:

  • RuntimeError: If the component wasn't warmed up (see the example after this list).
  • TypeError: If the input is not a list of Documents.
  • ValueError: If the document content is None or empty.
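
For example, assuming the component raises RuntimeError when run before warm_up (as documented above), a caller can recover like this:

python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack_experimental.components.preprocessors import EmbeddingBasedDocumentSplitter

splitter = EmbeddingBasedDocumentSplitter(document_embedder=SentenceTransformersDocumentEmbedder())
try:
    splitter.run(documents=[Document(content="Some text.")])
except RuntimeError:
    # run() requires warm_up() to have been called first
    splitter.warm_up()
    result = splitter.run(documents=[Document(content="Some text.")])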

Returns:

A dictionary with the following key:

  • documents: List of documents with the split texts. Each document includes:
      • A metadata field source_id to track the original document.
      • A metadata field split_id to track the split number.
      • A metadata field page_number to track the original page number.
      • All other metadata copied from the original document.
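
For example, with result from the usage example above, the documented fields can be read from each split's meta dictionary:

python
for split_doc in result["documents"]:
    meta = split_doc.meta
    print(meta["source_id"], meta["split_id"], meta["page_number"])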

EmbeddingBasedDocumentSplitter.to_dict

python
def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

EmbeddingBasedDocumentSplitter.from_dict

python
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "EmbeddingBasedDocumentSplitter"

Deserializes the component from a dictionary.
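
A typical round trip, reusing the splitter from the usage example (sketch):

python
data = splitter.to_dict()
restored = EmbeddingBasedDocumentSplitter.from_dict(data)
restored.warm_up()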