pysparkformat

This project provides custom data source formats for Apache Spark 4.0+ and Databricks, built on the new PySpark data source V2 API.




Formats

Currently, the following formats are supported:

| Format | Read | Write | Description |
|---|---|---|---|
| http-csv | Yes | No | Reads CSV files in parallel directly from a URL. |
| http-json | Yes | No | Reads JSON Lines in parallel directly from a URL. |
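
For readers new to JSON Lines: each line of such a file is one complete JSON value, so the file can be cut at line boundaries and the pieces parsed independently. The sketch below illustrates this in plain Python. The records and the helper `parse_json_lines` are made up for illustration and are not taken from this package or its test data.

```python
import json

# Hypothetical JSON Lines payload: one complete JSON object per line.
payload = (
    '{"name": "Alice", "wins": [["straight", "7"], ["one pair", "10"]]}\n'
    '{"name": "Bob", "wins": []}\n'
)


def parse_json_lines(text):
    """Parse every non-empty line as an independent JSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_json_lines(payload)
print(records[0]["name"], len(records))
```

Because no record spans more than one line, a reader that knows where the lines start can hand different lines to different workers.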

Installation

# Install PySpark 4.0.0.dev2
pip install pyspark==4.0.0.dev2

# Install the package using pip
pip install pysparkformat

For Databricks, install within a Databricks notebook using:

%pip install pysparkformat

The package has been tested with Databricks Runtime 15.4 LTS and later.

http-csv

The following options can be specified when using the http-csv format:

| Name | Description | Type | Default |
|---|---|---|---|
| header | Indicates whether the CSV file contains a header row. | boolean | false |
| sep | The field delimiter character. | string | `,` |
| encoding | The character encoding of the file. | string | utf-8 |
| quote | The quote character. | string | `"` |
| escape | The escape character. | string | `\` |
| maxLineSize | The maximum length of a line (in bytes). | integer | 10000 |
| partitionSize | The size of each data partition (in bytes). | integer | 1048576 |
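
This README does not describe how the reader works internally. One plausible reading of `partitionSize`, assumed here, is that the remote file is divided into byte ranges of that size and each range is fetched with an HTTP `Range` request. The helpers `plan_byte_ranges` and `range_header` below are hypothetical and are not part of pysparkformat; they only show the arithmetic.

```python
def plan_byte_ranges(content_length, partition_size=1048576):
    """Split [0, content_length) into inclusive (start, end) byte ranges."""
    if partition_size <= 0:
        raise ValueError("partition_size must be positive")
    ranges = []
    for start in range(0, content_length, partition_size):
        # HTTP byte ranges are inclusive, hence the "- 1".
        end = min(start + partition_size, content_length) - 1
        ranges.append((start, end))
    return ranges


def range_header(start, end):
    """Format the value of an HTTP Range header for an inclusive byte range."""
    return f"bytes={start}-{end}"


# A hypothetical 2.5 MiB file with the default partitionSize of 1 MiB.
ranges = plan_byte_ranges(2621440)
print(ranges)
print(range_header(*ranges[0]))
```

Under this assumption, a smaller `partitionSize` yields more, smaller partitions (more parallelism, more HTTP requests), and a larger value yields fewer, larger ones.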

Example

from pyspark.sql import SparkSession
from pysparkformat.http.csv import HTTPCSVDataSource

# Initialize SparkSession (only needed if not running in Databricks)
spark = SparkSession.builder.appName("http-csv-example").getOrCreate()

# You may need to disable format checking depending on your cluster configuration
spark.conf.set("spark.databricks.delta.formatCheck.enabled", False)

# Register the custom data source
spark.dataSource.register(HTTPCSVDataSource)

# URL of the CSV file
url = "https://raw.githubusercontent.com/aig/pysparkformat/refs/heads/master/tests/data/valid-with-header.csv"

# Read the data
df = spark.read.format("http-csv").option("header", True).load(url)

# Display the DataFrame (use `display(df)` in Databricks)
df.show()

http-json

The following options can be specified when using the http-json format:

| Name | Description | Type | Default |
|---|---|---|---|
| maxLineSize | The maximum length of a line (in bytes). | integer | 10000 |
| partitionSize | The size of each data partition (in bytes). | integer | 1048576 |
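
Byte-range partitions rarely end exactly on a line boundary, so a line-oriented reader needs a rule for lines that straddle two partitions. The sketch below shows one common rule, assumed here and not confirmed as this package's implementation: a line belongs to the partition that contains its first byte, and a reader may look up to `maxLineSize` bytes ahead to finish a line. `read_partition_lines` is a hypothetical helper that works on in-memory bytes rather than on an HTTP response.

```python
def read_partition_lines(data, start, end, max_line_size=10000):
    """Return the lines whose first byte lies in the half-open range [start, end)."""
    pos = start
    # If the range begins mid-line, that line belongs to the previous range.
    if start > 0 and data[start - 1:start] != b"\n":
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1

    lines = []
    while pos < min(end, len(data)):
        window_end = pos + max_line_size
        newline = data.find(b"\n", pos, window_end)
        if newline == -1:
            if window_end < len(data):
                raise ValueError("line exceeds max_line_size")
            newline = len(data)  # final line without a trailing newline
        lines.append(data[pos:newline])
        pos = newline + 1
    return lines


data = b'{"a": 1}\n{"a": 22}\n{"a": 333}\n'
partition_size = 12  # deliberately tiny so that lines straddle boundaries
all_lines = [
    line
    for start in range(0, len(data), partition_size)
    for line in read_partition_lines(data, start, start + partition_size)
]
print(all_lines)
```

With this rule every line is returned by exactly one partition, even when a partition contributes no lines at all.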

Example

from pyspark.sql import SparkSession
from pysparkformat.http.json import HTTPJSONDataSource

# Initialize SparkSession (only needed if not running in Databricks)
spark = SparkSession.builder.appName("http-json-example").getOrCreate()

# You may need to disable format checking depending on your cluster configuration
spark.conf.set("spark.databricks.delta.formatCheck.enabled", False)

# Register the custom data source
spark.dataSource.register(HTTPJSONDataSource)

# URL of the JSON Lines file
url = "https://raw.githubusercontent.com/aig/pysparkformat/refs/heads/master/tests/data/valid-nested.jsonl"

# Read the data (the schema must currently be specified explicitly)
json_schema = "name string, wins array<array<string>>"
df = spark.read.format("http-json").schema(json_schema).load(url)

# Display the DataFrame (use `display(df)` in Databricks)
df.show()