Version: 2.19

JSONConverter

Converts JSON files to text documents.


Most common position in a pipeline	Before PreProcessors, or right at the beginning of an indexing pipeline
Mandatory init variables	At least one of: "jq_schema": A jq filter string to extract content; "content_key": A key string to extract document content
Mandatory run variables	"sources": A list of file paths or ByteStream objects
Output variables	"documents": A list of documents
API reference	Converters
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/json.py

Overview

JSONConverter converts one or more JSON files into text documents.

Parameters Overview

To initialize JSONConverter, you must provide the jq_schema parameter, the content_key parameter, or both.

The jq_schema parameter is a jq filter that extracts nested data from JSON files. Refer to the jq documentation for the filter syntax. If it is not set, the entire JSON file is used.

The content_key parameter lets you specify which key in the extracted data will be the document's content.

If both jq_schema and content_key are set, the content_key is searched in the data extracted by jq_schema. Non-object data will be skipped.
If only jq_schema is set, the extracted value must be scalar; objects or arrays will be skipped.
If only content_key is set, the source must be a JSON object, or it will be skipped.
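
The three rules above can be sketched in plain Python. This is only an illustration, not the converter's implementation: select_contents is a hypothetical helper, the extract callable stands in for a real jq filter, and the handling of missing keys and of non-scalar values under content_key is an assumption.

```python
from typing import Any, Callable, Iterable, List, Optional


def select_contents(
    data: Any,
    extract: Optional[Callable[[Any], Iterable[Any]]] = None,
    content_key: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper that mirrors the three rules; not part of Haystack.

    `extract` is a plain Python callable standing in for a jq filter.
    """
    if extract is None and content_key is None:
        raise ValueError("Provide extract, content_key, or both")

    # Without a filter, the whole parsed JSON value is the only candidate.
    candidates = list(extract(data)) if extract is not None else [data]

    contents: List[str] = []
    for item in candidates:
        if content_key is not None:
            # content_key can only be looked up in objects (dicts).
            # Assumption: objects that lack the key are skipped as well.
            if not isinstance(item, dict) or content_key not in item:
                continue
            item = item[content_key]
        # The final content must be scalar; objects and arrays are skipped.
        # Assumption: this also applies to the value found under content_key.
        if isinstance(item, (dict, list)):
            continue
        contents.append(str(item))
    return contents


data = {"laureates": [{"motivation": "a"}, "plain string", {"motivation": "b"}]}

# Both set: the key is searched in each extracted item; the string is skipped.
both = select_contents(data, extract=lambda d: d["laureates"], content_key="motivation")
# Only the filter: scalars are kept, objects are skipped.
only_filter = select_contents(data, extract=lambda d: d["laureates"])
# Only content_key: the source itself must be an object.
only_key = select_contents({"text": "hello"}, content_key="text")
skipped = select_contents(["hello"], content_key="text")
print(both, only_filter, only_key, skipped)
```

With the real component, skipped items simply produce no document, so a source that matches none of the rules yields an empty "documents" list rather than an error.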

Check out the API reference for the full list of parameters.

Usage

You need to install the jq package to use this Converter:

shell

pip install jq

Example

Here is an example of simple component usage:

python

import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"}))

converter = JSONConverter(content_key="text")
results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'This is the content of my document'

In the following, more complex example, we provide a jq_schema string to filter the JSON source files, a content_key to select each document's content, and extra_meta_fields to copy from the filtered data into each document's metadata:

python

import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
  "laureates": [
    {
      "firstname": "Enrico",
      "surname": "Fermi",
      "motivation": "for his demonstrations of the existence of new radioactive elements produced "
      "by neutron irradiation, and for his related discovery of nuclear reactions brought about by"
      " slow neutrons",
    },
    {
      "firstname": "Rita",
      "surname": "Levi-Montalcini",
      "motivation": "for their discoveries of growth factors",
    },
  ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
  jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"}
)

results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'

print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}

print(documents[1].content)
# 'for their discoveries of growth factors'

print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
