Preprocessors
Module haystack_experimental.components.preprocessors.embedding_based_document_splitter
EmbeddingBasedDocumentSplitter
Splits documents based on embedding similarity using cosine distances between sequential sentence groups.
This component first splits text into sentences, optionally groups them, calculates embeddings for each group,
and then uses cosine distance between sequential embeddings to determine split points. Any distance above
the specified percentile is treated as a break point. The component also tracks page numbers based on form feed
characters (`\f`) in the original document.
This component is inspired by 5 Levels of Text Splitting by Greg Kamradt.
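The core break-point idea can be sketched in plain Python. This is a simplified illustration of the technique, not the component's actual implementation; `cosine_distance` and `find_break_points` are hypothetical helper names:

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def find_break_points(embeddings, percentile=0.95):
    # Distances between each pair of sequential group embeddings.
    distances = [cosine_distance(embeddings[i], embeddings[i + 1])
                 for i in range(len(embeddings) - 1)]
    # Take the distance at the given percentile as the threshold.
    threshold = sorted(distances)[int(percentile * (len(distances) - 1))]
    # A split occurs after group i whenever the distance exceeds the threshold.
    return [i for i, d in enumerate(distances) if d > threshold]
```

With four toy embeddings where the third points in a very different direction from the second, `find_break_points` flags the large jump between groups 1 and 2 as the single break point.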
Usage example
```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack_experimental.components.preprocessors import EmbeddingBasedDocumentSplitter

# Create a document with content that has a clear topic shift
doc = Document(
    content="This is a first sentence. This is a second sentence. This is a third sentence. "
    "Completely different topic. The same completely different topic."
)

# Initialize the embedder to calculate semantic similarities
embedder = SentenceTransformersDocumentEmbedder()

# Configure the splitter with parameters that control splitting behavior
splitter = EmbeddingBasedDocumentSplitter(
    document_embedder=embedder,
    sentences_per_group=2,  # Group 2 sentences before calculating embeddings
    percentile=0.95,        # Split when cosine distance exceeds the 95th percentile
    min_length=50,          # Merge splits shorter than 50 characters
    max_length=1000,        # Further split chunks longer than 1000 characters
)
splitter.warm_up()
result = splitter.run(documents=[doc])

# The result contains a list of Document objects, each representing a semantic chunk.
# Each split document includes metadata: source_id, split_id, and page_number.
print(f"Original document split into {len(result['documents'])} chunks")
for i, split_doc in enumerate(result['documents']):
    print(f"Chunk {i}: {split_doc.content[:50]}...")
```
EmbeddingBasedDocumentSplitter.__init__
```python
def __init__(*,
             document_embedder: DocumentEmbedder,
             sentences_per_group: int = 3,
             percentile: float = 0.95,
             min_length: int = 50,
             max_length: int = 1000,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True)
```
Initialize EmbeddingBasedDocumentSplitter.
Arguments:
- `document_embedder`: The DocumentEmbedder to use for calculating embeddings.
- `sentences_per_group`: Number of sentences to group together before embedding.
- `percentile`: Percentile threshold for cosine distance. Distances above this percentile are treated as break points.
- `min_length`: Minimum length of splits in characters. Splits below this length will be merged.
- `max_length`: Maximum length of splits in characters. Splits above this length will be recursively split.
- `language`: Language for sentence tokenization.
- `use_split_rules`: Whether to apply the additional split rules from SentenceSplitter to the sentence spans during tokenization.
- `extend_abbreviations`: If True, the abbreviations used by NLTK's PunktTokenizer are extended with a list of curated abbreviations. Currently supported languages are: en, de. If False, the default abbreviations are used.
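The effect of `sentences_per_group` and `min_length` can be illustrated with a small sketch. These helpers (`group_sentences`, `merge_short_splits`) are hypothetical names showing the general idea, not the component's internals:

```python
def group_sentences(sentences, sentences_per_group=3):
    # Join consecutive sentences into fixed-size groups before embedding.
    return [" ".join(sentences[i:i + sentences_per_group])
            for i in range(0, len(sentences), sentences_per_group)]

def merge_short_splits(splits, min_length=50):
    # Merge any split shorter than min_length into the split that follows it.
    merged = []
    for split in splits:
        if merged and len(merged[-1]) < min_length:
            merged[-1] = merged[-1] + " " + split
        else:
            merged.append(split)
    return merged
```

Grouping smooths out sentence-level noise in the embeddings, while the post-hoc merge keeps tiny fragments from surviving as standalone documents.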
EmbeddingBasedDocumentSplitter.warm_up
Warm up the component by initializing the sentence splitter.
EmbeddingBasedDocumentSplitter.run
Split documents based on embedding similarity.
Arguments:
documents: The documents to split.
Raises:
- `RuntimeError`: If the component wasn't warmed up.
- `TypeError`: If the input is not a list of Documents.
- `ValueError`: If the document content is None or empty.
Returns:
A dictionary with the following key:

- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `split_id` to track the split number.
  - A metadata field `page_number` to track the original page number.
  - All other metadata copied from the original document.
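The `page_number` metadata follows the form-feed convention described above: a split's page is determined by how many `\f` characters precede its start offset in the original text. A minimal sketch of that counting logic (the function name is hypothetical):

```python
def page_number_at(text, offset):
    # Pages are 1-indexed; each form feed (\f) starts a new page.
    return text.count("\f", 0, offset) + 1
```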
EmbeddingBasedDocumentSplitter.to_dict
Serializes the component to a dictionary.
EmbeddingBasedDocumentSplitter.from_dict
Deserializes the component from a dictionary.