pysparkformat
This project provides a collection of custom data source formats for Apache Spark 4.0+ and Databricks, leveraging the new V2 data source PySpark API.
Formats
Currently, the following formats are supported:
| Format | Read | Write | Description |
|---|---|---|---|
http-csv |
Yes | No | Reads CSV files in parallel directly from a URL. |
http-json |
Yes | No | Reads JSON Lines in parallel directly from a URL. |
Installation
# Install PySpark 4.0.0.dev2
pip install pyspark==4.0.0.dev2
# Install the package using pip
pip install pysparkformat
For Databricks, install within a Databricks notebook using:
%pip install pysparkformat
This has been tested with Databricks Runtime 15.4 LTS and later.
http-csv
The following options can be specified when using the http-csv format:
| Name | Description | Type | Default |
|---|---|---|---|
header |
Indicates whether the CSV file contains a header row. | boolean | false |
sep |
The field delimiter character. | string | , |
encoding |
The character encoding of the file. | string | utf-8 |
quote |
The quote character. | string | " |
escape |
The escape character. | string | \ |
maxLineSize |
The maximum length of a line (in bytes). | integer | 10000 |
partitionSize |
The size of each data partition (in bytes). | integer | 1048576 |
Example
from pyspark.sql import SparkSession
from pysparkformat.http.csv import HTTPCSVDataSource
# Initialize SparkSession (only needed if not running in Databricks)
spark = SparkSession.builder.appName("http-csv-example").getOrCreate()
# You may need to disable format checking depending on your cluster configuration
spark.conf.set("spark.databricks.delta.formatCheck.enabled", False)
# Register the custom data source
spark.dataSource.register(HTTPCSVDataSource)
# URL of the CSV file
url = "https://raw.githubusercontent.com/aig/pysparkformat/refs/heads/master/tests/data/valid-with-header.csv"
# Read the data
df = spark.read.format("http-csv").option("header", True).load(url)
# Display the DataFrame (use `display(df)` in Databricks)
df.show()
http-json
The following options can be specified when using the http-json format:
| Name | Description | Type | Default |
|---|---|---|---|
maxLineSize |
The maximum length of a line (in bytes). | integer | 10000 |
partitionSize |
The size of each data partition (in bytes). | integer | 1048576 |
Example
from pyspark.sql import SparkSession
from pysparkformat.http.json import HTTPJSONDataSource
# Initialize SparkSession (only needed if not running in Databricks)
spark = SparkSession.builder.appName("http-json-example").getOrCreate()
# You may need to disable format checking depending on your cluster configuration
spark.conf.set("spark.databricks.delta.formatCheck.enabled", False)
# Register the custom data source
spark.dataSource.register(HTTPJSONDataSource)
# URL of the JSON file
url = "https://raw.githubusercontent.com/aig/pysparkformat/refs/heads/master/tests/data/valid-nested.jsonl"
# Read the data (you must specify the schema at the moment)
json_schema = "name string, wins array<array<string>>"
df = spark.read.format("http-json").schema(json_schema).load(url)
# Display the DataFrame (use `display(df)` in Databricks)
df.show()