Document Stores
Module document_store
BM25DocumentStats
A dataclass for managing document statistics for BM25 retrieval.
Arguments:
freq_token: A Counter of token frequencies in the document.
doc_len: Number of tokens in the document.
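To make the two fields concrete, here is a minimal sketch of how such statistics could be computed for one document. DocStats and compute_stats are hypothetical names used only for illustration; they are not the library's class or helper, and lowercasing the text before tokenizing is an assumption of this sketch.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class DocStats:
    """Hypothetical stand-in for BM25DocumentStats; not the library class."""
    freq_token: Counter  # token -> number of occurrences in the document
    doc_len: int         # total number of tokens in the document


def compute_stats(text: str, pattern: str = r"(?u)\b\w\w+\b") -> DocStats:
    # Lowercasing is an assumption of this sketch.
    tokens = re.findall(pattern, text.lower())
    return DocStats(freq_token=Counter(tokens), doc_len=len(tokens))


stats = compute_stats("The cat chased the other cat")
print(stats.doc_len)            # 6
print(stats.freq_token["cat"])  # 2
```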
InMemoryDocumentStore
Stores data in memory. The store is ephemeral: its contents are lost when the process exits unless you write them out with save_to_disk.
InMemoryDocumentStore.__init__
def __init__(bm25_tokenization_regex: str = r"(?u)\b\w\w+\b",
             bm25_algorithm: Literal["BM25Okapi", "BM25L", "BM25Plus"] = "BM25L",
             bm25_parameters: Optional[dict] = None,
             embedding_similarity_function: Literal["dot_product", "cosine"] = "dot_product",
             index: Optional[str] = None,
             async_executor: Optional[ThreadPoolExecutor] = None,
             return_embedding: bool = True)
Initializes the DocumentStore.
Arguments:
bm25_tokenization_regex: The regular expression used to tokenize the text for BM25 retrieval.
bm25_algorithm: The BM25 algorithm to use. One of "BM25Okapi", "BM25L", or "BM25Plus".
bm25_parameters: Parameters for the BM25 implementation, as a dictionary. For example: {'k1':1.5, 'b':0.75, 'epsilon':0.25}. You can learn more about these parameters by visiting https://github.com/dorianbrown/rank_bm25.
embedding_similarity_function: The similarity function used to compare Documents' embeddings. One of "dot_product" (default) or "cosine". To choose the most appropriate function, look for information about your embedding model.
index: A specific index to store the documents. If not specified, a random UUID is used. Using the same index allows you to store documents across multiple InMemoryDocumentStore instances.
async_executor: Optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded executor will be initialized and used.
return_embedding: Whether to return the embedding of the retrieved Documents. Default is True.
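The effect of the default bm25_tokenization_regex is easiest to see by running it with Python's re module. The sketch below uses only the standard library; tokenize is a hypothetical helper, and the second pattern is an illustrative alternative rather than a recommended setting.

```python
import re

DEFAULT_REGEX = r"(?u)\b\w\w+\b"  # default value of bm25_tokenization_regex


def tokenize(text: str, pattern: str = DEFAULT_REGEX) -> list[str]:
    # Hypothetical helper: return every non-overlapping match of the pattern.
    return re.findall(pattern, text)


tokens = tokenize("A cat sat on the mat in 2024.")
print(tokens)  # ['cat', 'sat', 'on', 'the', 'mat', 'in', '2024']

# An alternative pattern that also keeps single-character tokens.
short_tokens = tokenize("A cat", r"(?u)\b\w+\b")
print(short_tokens)  # ['A', 'cat']
```

The default pattern requires at least two word characters per token, so one-letter words such as "A" are dropped.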
InMemoryDocumentStore.__del__
Cleanup when the instance is being destroyed.
InMemoryDocumentStore.shutdown
Explicitly shuts down the executor if this instance owns it.
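The "if we own it" condition can be read as an ownership rule: an executor created internally is shut down, while one passed in by the caller is left running. The following is a sketch of that pattern under this reading; ExecutorOwner and its attributes are hypothetical and do not reflect the library's internals.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


class ExecutorOwner:
    """Hypothetical class illustrating the 'shut down only what we own' pattern."""

    def __init__(self, async_executor: Optional[ThreadPoolExecutor] = None) -> None:
        # Create a single-threaded executor only when the caller did not supply one.
        self._owns_executor = async_executor is None
        self.executor = (
            async_executor if async_executor is not None
            else ThreadPoolExecutor(max_workers=1)
        )

    def shutdown(self) -> None:
        # Caller-supplied executors are left alone; their owner decides when to stop them.
        if self._owns_executor:
            self.executor.shutdown(wait=True)


shared = ThreadPoolExecutor(max_workers=2)
borrower = ExecutorOwner(async_executor=shared)
borrower.shutdown()
still_usable = shared.submit(lambda: 21 * 2).result()  # the shared pool still accepts work
shared.shutdown()

owner = ExecutorOwner()
owner.shutdown()  # the internally created pool is stopped here
```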
InMemoryDocumentStore.storage
Utility property that returns the storage used by this instance of InMemoryDocumentStore.
InMemoryDocumentStore.to_dict
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
InMemoryDocumentStore.from_dict
Deserializes the component from a dictionary.
Arguments:
data: The dictionary to deserialize from.
Returns:
The deserialized component.
InMemoryDocumentStore.save_to_disk
Write the database and its data to disk as a JSON file.
Arguments:
path: The path to the JSON file.
InMemoryDocumentStore.load_from_disk
Load the database and its data from a JSON file on disk.
Arguments:
path: The path to the JSON file.
Returns:
The loaded InMemoryDocumentStore.
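The save and load pair amounts to a JSON round trip. The sketch below shows that idea with the standard library only; save_snapshot, load_snapshot, and the payload layout are hypothetical, and the actual file structure is defined by the library.

```python
import json
import tempfile
from pathlib import Path


def save_snapshot(data: dict, path: str) -> None:
    # Serialize the dictionary and write it as UTF-8 JSON.
    Path(path).write_text(json.dumps(data), encoding="utf-8")


def load_snapshot(path: str) -> dict:
    # Read the file back and parse it into a dictionary.
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical payload; the real layout is defined by the library.
snapshot = {"documents": [{"id": "doc-1", "content": "hello", "meta": {"lang": "en"}}]}

with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "store.json")
    save_snapshot(snapshot, target)
    restored = load_snapshot(target)
```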
InMemoryDocumentStore.count_documents
Returns the number of documents present in the DocumentStore.
InMemoryDocumentStore.filter_documents
Returns the documents that match the filters provided.
For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol documentation.
Arguments:
filters: The filters to apply to the document list.
Returns:
A list of Documents that match the given filters.
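As a rough illustration of what a filter dictionary expresses, the sketch below evaluates a small subset of operators against plain dictionaries. It assumes the field/operator/value and operator/conditions dictionary shape; the matches helper, the flat records, and the supported operators are all simplifications made for this example, and the protocol documentation remains the authoritative specification.

```python
from typing import Any


def matches(filters: dict[str, Any], record: dict[str, Any]) -> bool:
    """Evaluate a small, assumed subset of filter operators against a flat record."""
    op = filters["operator"]
    if op == "AND":
        return all(matches(c, record) for c in filters["conditions"])
    if op == "OR":
        return any(matches(c, record) for c in filters["conditions"])
    actual = record.get(filters["field"])
    expected = filters["value"]
    if op == "==":
        return actual == expected
    if op == "!=":
        return actual != expected
    if op == "in":
        return actual in expected
    raise ValueError(f"Unsupported operator in this sketch: {op}")


records = [
    {"id": "1", "lang": "en", "year": 2023},
    {"id": "2", "lang": "de", "year": 2023},
    {"id": "3", "lang": "en", "year": 2021},
]
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "lang", "operator": "==", "value": "en"},
        {"field": "year", "operator": "in", "value": [2022, 2023]},
    ],
}
selected = [r["id"] for r in records if matches(filters, r)]
print(selected)  # ['1']
```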
InMemoryDocumentStore.write_documents
Refer to the DocumentStore.write_documents() protocol documentation.
If policy is set to DuplicatePolicy.NONE, it defaults to DuplicatePolicy.FAIL.
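The fallback from NONE to FAIL can be sketched with a dictionary-backed store. Policy and write are hypothetical stand-ins for DuplicatePolicy and write_documents; the SKIP and OVERWRITE behaviors shown follow the usual meaning of those names, and ValueError stands in for whatever exception the library raises on a duplicate.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical mirror of DuplicatePolicy, for illustration only."""
    NONE = "none"
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


def write(storage: dict[str, str], docs: dict[str, str],
          policy: Policy = Policy.NONE) -> int:
    """Write id -> content pairs into storage and return how many were written."""
    if policy is Policy.NONE:
        policy = Policy.FAIL  # NONE falls back to FAIL, as described above
    written = 0
    for doc_id, content in docs.items():
        if doc_id in storage:
            if policy is Policy.FAIL:
                raise ValueError(f"ID '{doc_id}' already exists")
            if policy is Policy.SKIP:
                continue  # keep the stored version, do not count it
        storage[doc_id] = content
        written += 1
    return written


storage: dict[str, str] = {}
first = write(storage, {"a": "alpha", "b": "beta"})                  # 2
skipped = write(storage, {"a": "ALPHA", "c": "gamma"}, Policy.SKIP)  # 1, only "c" is new
overwritten = write(storage, {"a": "ALPHA"}, Policy.OVERWRITE)       # 1
```

Calling write(storage, {"a": "again"}) at this point raises, because the default policy resolves to FAIL and "a" already exists.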
InMemoryDocumentStore.delete_documents
Deletes all documents with matching document_ids from the DocumentStore.
Arguments:
document_ids: The IDs of the documents to delete.
InMemoryDocumentStore.bm25_retrieval
def bm25_retrieval(query: str,
                   filters: Optional[dict[str, Any]] = None,
                   top_k: int = 10,
                   scale_score: bool = False) -> list[Document]
Retrieves documents that are most relevant to the query using the BM25 algorithm.
Arguments:
query: The query string.
filters: A dictionary with filters to narrow down the search space.
top_k: The number of top documents to retrieve. Default is 10.
scale_score: Whether to scale the scores of the retrieved documents. Default is False.
Returns:
A list of the top_k documents most relevant to the query.
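For intuition about how BM25 ranks documents, here is a compact sketch of the classic Okapi BM25 formula. Note that it implements the "BM25Okapi" variant with a smoothed, non-negative IDF, not the store's default "BM25L", and it is not the library's implementation. The function name, the pre-tokenized toy corpus, and the k1 and b defaults (taken from the example parameters above) are choices made for this illustration.

```python
import math
from collections import Counter


def bm25_okapi_scores(query: list[str], docs: list[list[str]],
                      k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for d in docs for token in set(d))
    scores = []
    for d in docs:
        freqs = Counter(d)
        score = 0.0
        for term in query:
            if term not in freqs:
                continue
            # Smoothed IDF that stays non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = freqs[term]
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1.0) / (tf + norm)
        scores.append(score)
    return scores


corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "chase", "cats"],
    ["the", "cat", "and", "the", "other", "cat"],
]
scores = bm25_okapi_scores(["cat"], corpus)
top_k = 2
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:top_k]
print(ranking)  # [2, 0]
```

The third document ranks first because it contains the query term twice at the same length as the first document; the second document never mentions "cat" and scores zero.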
InMemoryDocumentStore.embedding_retrieval
def embedding_retrieval(query_embedding: list[float],
                        filters: Optional[dict[str, Any]] = None,
                        top_k: int = 10,
                        scale_score: bool = False,
                        return_embedding: Optional[bool] = False) -> list[Document]
Retrieves documents that are most similar to the query embedding using a vector similarity metric.
Arguments:
query_embedding: Embedding of the query.
filters: A dictionary with filters to narrow down the search space.
top_k: The number of top documents to retrieve. Default is 10.
scale_score: Whether to scale the scores of the retrieved Documents. Default is False.
return_embedding: Whether to return the embedding of the retrieved Documents. If not provided, the value of the return_embedding parameter set at component initialization will be used. Default is False.
Returns:
A list of the top_k documents most relevant to the query.
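The choice between "dot_product" and "cosine" matters when embeddings are not normalized to unit length. The sketch below ranks three toy vectors both ways using plain Python; dot_product, cosine, and top_k_ids are hypothetical helpers written for this illustration and say nothing about how the store computes similarity internally.

```python
import math
from typing import Callable


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the two vector lengths.
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


def top_k_ids(query: list[float], embeddings: dict[str, list[float]],
              similarity: Callable[[list[float], list[float]], float],
              top_k: int = 10) -> list[str]:
    # Sort document IDs by descending similarity and keep the first top_k.
    ranked = sorted(embeddings, key=lambda doc_id: similarity(query, embeddings[doc_id]),
                    reverse=True)
    return ranked[:top_k]


embeddings = {
    "long": [4.0, 3.0],     # large magnitude, roughly the query direction
    "aligned": [1.0, 0.0],  # unit length, exactly the query direction
    "off": [0.0, 1.0],      # orthogonal to the query
}
query = [1.0, 0.0]
by_dot = top_k_ids(query, embeddings, dot_product, top_k=2)
by_cosine = top_k_ids(query, embeddings, cosine, top_k=2)
print(by_dot)     # ['long', 'aligned']
print(by_cosine)  # ['aligned', 'long']
```

Dot product rewards the larger vector, while cosine ignores magnitude and prefers the vector that points in exactly the query's direction.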
InMemoryDocumentStore.count_documents_async
Returns the number of documents present in the DocumentStore.
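Since the constructor accepts a ThreadPoolExecutor for async calls, one plausible way to picture the async methods is as wrappers that run the synchronous method on that executor. The sketch below shows this pattern with asyncio; TinyStore is hypothetical, and the use of run_in_executor is an assumption of this example rather than a statement about the library's code.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


class TinyStore:
    """Hypothetical store; the executor-based dispatch is an assumption of this sketch."""

    def __init__(self) -> None:
        self._docs = {"a": "alpha", "b": "beta"}
        self._executor = ThreadPoolExecutor(max_workers=1)

    def count_documents(self) -> int:
        return len(self._docs)

    async def count_documents_async(self) -> int:
        # Run the synchronous method on the executor so the event loop is not blocked.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, self.count_documents)

    def shutdown(self) -> None:
        self._executor.shutdown(wait=True)


async def main() -> int:
    store = TinyStore()
    try:
        return await store.count_documents_async()
    finally:
        store.shutdown()


count = asyncio.run(main())
print(count)  # 2
```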
InMemoryDocumentStore.filter_documents_async
Returns the documents that match the filters provided.
For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol documentation.
Arguments:
filters: The filters to apply to the document list.
Returns:
A list of Documents that match the given filters.
InMemoryDocumentStore.write_documents_async
async def write_documents_async(documents: list[Document],
                                policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int
Refer to the DocumentStore.write_documents() protocol documentation.
If policy is set to DuplicatePolicy.NONE, it defaults to DuplicatePolicy.FAIL.
InMemoryDocumentStore.delete_documents_async
Deletes all documents with matching document_ids from the DocumentStore.
Arguments:
document_ids: The IDs of the documents to delete.
InMemoryDocumentStore.bm25_retrieval_async
async def bm25_retrieval_async(query: str,
                               filters: Optional[dict[str, Any]] = None,
                               top_k: int = 10,
                               scale_score: bool = False) -> list[Document]
Retrieves documents that are most relevant to the query using the BM25 algorithm.
Arguments:
query: The query string.
filters: A dictionary with filters to narrow down the search space.
top_k: The number of top documents to retrieve. Default is 10.
scale_score: Whether to scale the scores of the retrieved documents. Default is False.
Returns:
A list of the top_k documents most relevant to the query.
InMemoryDocumentStore.embedding_retrieval_async
async def embedding_retrieval_async(query_embedding: list[float],
                                    filters: Optional[dict[str, Any]] = None,
                                    top_k: int = 10,
                                    scale_score: bool = False,
                                    return_embedding: bool = False) -> list[Document]
Retrieves documents that are most similar to the query embedding using a vector similarity metric.
Arguments:
query_embedding: Embedding of the query.
filters: A dictionary with filters to narrow down the search space.
top_k: The number of top documents to retrieve. Default is 10.
scale_score: Whether to scale the scores of the retrieved Documents. Default is False.
return_embedding: Whether to return the embedding of the retrieved Documents. Default is False.
Returns:
A list of the top_k documents most relevant to the query.