Version: 2.19

JSONConverter

Converts JSON files to text documents.

Most common position in a pipeline: Before PreProcessors, or right at the beginning of an indexing pipeline
Mandatory init variables: One of, or both:
  • "jq_schema": A jq filter string to extract content
  • "content_key": A key string to extract document content
Mandatory run variables: "sources": A list of file paths or ByteStream objects
Output variables: "documents": A list of documents
API reference: Converters
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/json.py

Overview

JSONConverter converts one or more JSON files into text documents.

Parameters Overview

To initialize JSONConverter, you must provide the jq_schema parameter, the content_key parameter, or both.

The jq_schema parameter is a jq filter that extracts nested data from JSON files. Refer to the jq documentation for the filter syntax. If it is not set, the entire JSON file is used.
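For readers new to jq, it may help to see what a filter such as `.laureates[]` selects. The snippet below is only a plain-Python analogy, not a call into jq or Haystack: `laureates_items` is a hypothetical helper, and the sample data is made up for illustration.

```python
from typing import Any, Dict, Iterator


def laureates_items(data: Dict[str, Any]) -> Iterator[Any]:
    """Rough Python equivalent of the jq filter `.laureates[]` (hypothetical helper).

    `.laureates` selects the value stored under the "laureates" key,
    and `[]` emits each element of that array as a separate result.
    """
    yield from data["laureates"]


# Illustrative data only.
data = {"laureates": [{"surname": "Fermi"}, {"surname": "Levi-Montalcini"}]}

for item in laureates_items(data):
    print(item)
# {'surname': 'Fermi'}
# {'surname': 'Levi-Montalcini'}
```

Each emitted element is a candidate for its own document, which is why a filter ending in `[]` typically produces several documents from one file.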

The content_key parameter lets you specify which key in the extracted data will be the document's content.

  • If both jq_schema and content_key are set, content_key is looked up in the data extracted by jq_schema. Non-object data is skipped.
  • If only jq_schema is set, the extracted value must be scalar; objects or arrays will be skipped.
  • If only content_key is set, the source must be a JSON object, or it will be skipped.
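
The three cases above can be sketched in plain Python. This is an illustration of the documented rules, not the library's implementation: `select_contents` is a hypothetical helper, its `extract` argument is a plain callable standing in for a compiled jq filter, and skipping objects that lack the key is an assumption of this sketch.

```python
from typing import Any, Callable, Iterable, List, Optional


def select_contents(
    data: Any,
    extract: Optional[Callable[[Any], Iterable[Any]]] = None,
    content_key: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper mirroring the documented skipping rules."""
    if extract is None and content_key is None:
        raise ValueError("Provide extract, content_key, or both.")

    # Without a filter, the whole parsed JSON value is the only candidate.
    items = list(extract(data)) if extract is not None else [data]

    contents: List[str] = []
    for item in items:
        if content_key is not None:
            # With content_key, only objects can supply content.
            # Skipping objects without the key is an assumption of this sketch.
            if not isinstance(item, dict) or content_key not in item:
                continue
            contents.append(str(item[content_key]))
        elif not isinstance(item, (dict, list)):
            # With a filter only, the extracted value must be scalar.
            contents.append(str(item))
    return contents


data = {"laureates": [{"surname": "Fermi"}, "not an object", {"surname": "Levi-Montalcini"}]}

# Filter and key: the non-object element is skipped.
print(select_contents(data, extract=lambda d: d["laureates"], content_key="surname"))
# ['Fermi', 'Levi-Montalcini']

# Filter only: objects are skipped, the scalar string is kept.
print(select_contents(data, extract=lambda d: d["laureates"]))
# ['not an object']

# Key only: the source must be a JSON object.
print(select_contents({"text": "hello"}, content_key="text"))
# ['hello']
print(select_contents(["hello"], content_key="text"))
# []
```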

Check out the API reference for the full list of parameters.

Usage

You need to install the jq package to use this Converter:

shell
pip install jq
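
If you want to confirm that the package is importable before building a pipeline, a standard-library check along these lines should be enough. `jq_available` is a hypothetical helper name, not part of Haystack or jq.

```python
import importlib.util


def jq_available() -> bool:
    """Return True if the `jq` Python package can be imported (hypothetical helper)."""
    return importlib.util.find_spec("jq") is not None


if jq_available():
    print("jq is installed.")
else:
    print("jq is missing; install it with `pip install jq`.")
```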

Example

Here is an example of simple component usage:

python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"}))

converter = JSONConverter(content_key="text")
results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
## 'This is the content of my document'

In the following more complex example, we provide a jq_schema string to filter the JSON source files and extra_meta_fields to extract from the filtered data:

python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by"
            " slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"}
)

results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
## 'for his demonstrations of the existence of new radioactive elements produced by
## neutron irradiation, and for his related discovery of nuclear reactions brought
## about by slow neutrons'

print(documents[0].meta)
## {'firstname': 'Enrico', 'surname': 'Fermi'}

print(documents[1].content)
## 'for their discoveries of growth factors'

print(documents[1].meta)
## {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}