# SparkAPI

SparkAPI is a lightweight Python utility library providing a collection of functions to simplify working with Apache Spark via PySpark. It aims to streamline common Spark operations such as session management and data loading.
## Features
- Easy instantiation of SparkSession
- Simplified loading of CSV data into Spark DataFrames
## Tech Stack
- Python
- Apache Spark (PySpark)
## Getting Started

### Prerequisites
- Python 3.x
- Apache Spark installed and configured
- PySpark package installed
### Installation

Clone the repository:

```bash
git clone https://github.com/justin-napolitano/SparkAPI.git
cd SparkAPI
```

Install PySpark if not already installed:

```bash
pip install pyspark
```
## Usage

Import and use the SparkAPI class in your Python scripts:

```python
from sparkAPI import SparkAPI

spark_api = SparkAPI()
df = spark_api.load_spark_data_from_csv('path/to/your/file.csv')
df.show()
```
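For orientation, here is a minimal sketch of what the `SparkAPI` class in `sparkAPI.py` might look like, inferred from the usage above. The `app_name` constructor argument and the `header`/`infer_schema` options are illustrative assumptions, not confirmed parts of the actual API:

```python
# Hypothetical sketch of sparkAPI.py; the real implementation may differ.
class SparkAPI:
    """Thin wrapper around SparkSession creation and CSV loading."""

    def __init__(self, app_name="SparkAPI"):
        # Imported lazily so the class can be defined even where
        # PySpark is not yet installed.
        from pyspark.sql import SparkSession

        # getOrCreate() reuses an active session if one already exists.
        self.spark = SparkSession.builder.appName(app_name).getOrCreate()

    def load_spark_data_from_csv(self, path, header=True, infer_schema=True):
        """Read a CSV file into a Spark DataFrame."""
        return self.spark.read.csv(path, header=header, inferSchema=infer_schema)
```

With this shape, `SparkAPI()` starts (or reuses) a session and `load_spark_data_from_csv` returns a standard Spark DataFrame, so all regular DataFrame operations (`show`, `filter`, `select`, and so on) are available on the result.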
## Project Structure

```
SparkAPI/
└── sparkAPI.py   # Core class with Spark session management and data loading
```
## Future Work / Roadmap
- Expand support for additional data sources and formats
- Add utility functions for common Spark transformations and actions
- Implement configuration options for SparkSession builder
- Include error handling and logging mechanisms
- Provide unit tests and example notebooks
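As an illustration of the configuration-options item above, one possible shape is a `config` dict applied to the `SparkSession` builder. Everything here (the class name, the `config` parameter) is a hypothetical sketch, not part of the current library:

```python
# Hypothetical extension: forward builder options through a dict.
class ConfigurableSparkAPI:
    """Like SparkAPI, but with user-supplied SparkSession builder options."""

    def __init__(self, app_name="SparkAPI", config=None):
        from pyspark.sql import SparkSession

        builder = SparkSession.builder.appName(app_name)
        # Apply each key/value pair to the builder, e.g.
        # {"spark.sql.shuffle.partitions": "8"}.
        for key, value in (config or {}).items():
            builder = builder.config(key, value)
        self.spark = builder.getOrCreate()
```

`SparkSession.builder.config(key, value)` is the standard PySpark mechanism for setting session properties, so a thin loop like this keeps the wrapper flexible without hard-coding individual options.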