Lightweight Python API Wrapper for Apache Spark


A lightweight Python API wrapper for Apache Spark designed to simplify common data-manipulation tasks. It provides a simple interface for instantiating Spark sessions and loading CSV data into Spark DataFrames.

Features

  • Simplified Spark session management
  • Load CSV files as Spark DataFrames with header support

Tech Stack

  • Python
  • Apache Spark (PySpark)

Getting Started

Prerequisites

  • Python 3.6+
  • Apache Spark installed and configured

Installation

Clone the repository:

git clone https://github.com/justin-napolitano/project-spark-api.git
cd project-spark-api

Install PySpark (if not already installed):

pip install pyspark

Usage

Example usage in Python:

from sparkAPI import SparkAPI

spark_api = SparkAPI()
df = spark_api.load_spark_data_from_csv('path/to/your/file.csv')
df.show()

Project Structure

project-spark-api/
├── sparkAPI.py       # Main API wrapper class for Spark session and data loading

Future Work / Roadmap

  • Add support for additional data formats (e.g., JSON, Parquet)
  • Implement data transformation utilities
  • Enable configuration options for Spark session (e.g., app name, master URL)
  • Add error handling and logging
  • Provide unit tests and examples