pyspark.sql.DataFrameReader.json
- DataFrameReader.json(path, schema=None, primitivesAsString=None, prefersDecimal=None, allowComments=None, allowUnquotedFieldNames=None, allowSingleQuotes=None, allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None, mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None, multiLine=None, allowUnquotedControlChars=None, lineSep=None, samplingRatio=None, dropFieldIfAllNull=None, encoding=None, locale=None, pathGlobFilter=None, recursiveFileLookup=None, modifiedBefore=None, modifiedAfter=None, allowNonNumericNumbers=None, useUnsafeRow=None)
Loads JSON files and returns the results as a DataFrame.

JSON Lines (newline-delimited JSON) is supported by default. For JSON (one record per file), set the multiLine parameter to true.

If the schema parameter is not specified, this function goes through the input once to determine the input schema.

New in version 1.4.0.
Changed in version 3.4.0: Supports Spark Connect.
Changed in version 4.2.0: Supports DataFrame input.
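
The difference between the default JSON Lines layout and a one-record-per-file layout can be illustrated without Spark. The following is a minimal sketch using only the Python standard library; the helper parse_json_lines and the file names are hypothetical and exist only for this illustration. It shows why a line-oriented parser cannot handle a record spread over several lines, which is the situation the multiLine parameter addresses. How Spark itself treats unparseable lines depends on the mode option and is not modeled here.

```python
import json
import os
import tempfile

records = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]


def parse_json_lines(path):
    """Hypothetical helper: treat every non-empty line as one complete JSON value."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as d:
    # JSON Lines: one complete record per line (the default layout).
    jsonl_path = os.path.join(d, "people.jsonl")
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # One record per file, pretty-printed across several lines.
    multiline_path = os.path.join(d, "person.json")
    with open(multiline_path, "w", encoding="utf-8") as f:
        json.dump(records[0], f, indent=2)

    # Line-by-line parsing works for the JSON Lines file.
    jsonl_rows = parse_json_lines(jsonl_path)

    # The same strategy fails on the pretty-printed file, because no single
    # line of it is a complete JSON value.
    try:
        parse_json_lines(multiline_path)
        line_parse_failed = False
    except json.JSONDecodeError:
        line_parse_failed = True

    # Parsing the whole file as one document recovers the record.
    with open(multiline_path, encoding="utf-8") as f:
        whole_file_record = json.load(f)
```

With a Spark session available, the second file would be the case for passing multiLine=True to spark.read.json, as described above.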
- Parameters
- path : str, list, RDD, or DataFrame
  A string path to the JSON dataset, a list of paths, an RDD of strings storing JSON objects, or a DataFrame with a single string column containing JSON strings.
- schema : pyspark.sql.types.StructType or str, optional
  An optional pyspark.sql.types.StructType for the input schema, or a DDL-formatted string (for example, col0 INT, col1 DOUBLE).
- Other Parameters
- Extra options
For the extra options, refer to Data Source Option for the version you use.
Examples
Example 1: Write a DataFrame into a JSON file and read it back.
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="json1") as d:
...     # Write a DataFrame into a JSON file
...     spark.createDataFrame(
...         [{"age": 100, "name": "Hyukjin"}]
...     ).write.mode("overwrite").format("json").save(d)
...
...     # Read the JSON file as a DataFrame.
...     spark.read.json(d).show()
+---+-------+
|age|   name|
+---+-------+
|100|Hyukjin|
+---+-------+
Example 2: Read JSON from multiple directories
>>> from tempfile import TemporaryDirectory
>>> with TemporaryDirectory(prefix="json2") as d1, TemporaryDirectory(prefix="json3") as d2:
...     # Write one DataFrame as JSON into each directory
...     spark.createDataFrame(
...         [{"age": 30, "name": "Bob"}]
...     ).write.mode("overwrite").format("json").save(d1)
...
...     spark.createDataFrame(
...         [{"age": 25, "name": "Alice"}]
...     ).write.mode("overwrite").format("json").save(d2)
...
...     # Read both directories as a single DataFrame; sort for a stable row order.
...     spark.read.json([d1, d2]).sort("name").show()
+---+-----+
|age| name|
+---+-----+
| 25|Alice|
| 30|  Bob|
+---+-----+
Example 3: Read JSON with a custom schema
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="json4") as d:
...     # Write a DataFrame into a JSON file
...     spark.createDataFrame(
...         [{"age": 30, "name": "Bob"}]
...     ).write.mode("overwrite").format("json").save(d)
...     custom_schema = "name STRING, age INT"
...     spark.read.json(d, schema=custom_schema).show()
+----+---+
|name|age|
+----+---+
| Bob| 30|
+----+---+
Example 4: Parse JSON from a DataFrame with a single string column.
>>> json_df = spark.createDataFrame(
...     [('{"name": "Alice", "age": 25}',), ('{"name": "Bob", "age": 30}',)],
...     schema="value STRING",
... )
>>> spark.read.json(json_df).sort("name").show()
+---+-----+
|age| name|
+---+-----+
| 25|Alice|
| 30|  Bob|
+---+-----+