# SparkAPI

SparkAPI is a lightweight Python utility library providing a collection of functions to simplify working with Apache Spark via PySpark. It aims to streamline common Spark operations such as session management and data loading.
## Features
- Easy instantiation of SparkSession
- Simplified loading of CSV data into Spark DataFrames
## Tech Stack
- Python
- Apache Spark (PySpark)
## Getting Started

### Prerequisites
- Python 3.x
- Apache Spark installed and configured
- PySpark package installed
### Installation

Clone the repository:

```bash
git clone https://github.com/justin-napolitano/SparkAPI.git
cd SparkAPI
```

Install PySpark if not already installed:

```bash
pip install pyspark
```
## Usage

Import and use the SparkAPI class in your Python scripts:

```python
from sparkAPI import SparkAPI

spark_api = SparkAPI()
df = spark_api.load_spark_data_from_csv('path/to/your/file.csv')
df.show()
```
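For orientation, here is a minimal sketch of what the `SparkAPI` class in `sparkAPI.py` might look like, inferred from the usage above. The `app_name` constructor argument and the `header`/`infer_schema` options are illustrative assumptions, not confirmed parts of the actual API:

```python
# Hypothetical sketch of sparkAPI.py; the real implementation may differ.
class SparkAPI:
    """Thin wrapper around SparkSession creation and CSV loading."""

    def __init__(self, app_name="SparkAPI"):
        # Imported lazily so the class can be defined even where
        # PySpark is not yet installed.
        from pyspark.sql import SparkSession

        # getOrCreate() reuses an active session if one already exists.
        self.spark = SparkSession.builder.appName(app_name).getOrCreate()

    def load_spark_data_from_csv(self, path, header=True, infer_schema=True):
        """Read a CSV file into a Spark DataFrame."""
        return self.spark.read.csv(path, header=header, inferSchema=infer_schema)
```

With this shape, `SparkAPI()` starts (or reuses) a session and `load_spark_data_from_csv` returns a standard Spark DataFrame, so all regular DataFrame operations (`show`, `filter`, `select`, and so on) are available on the result.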
## Project Structure

```
SparkAPI/
└── sparkAPI.py   # Core class with Spark session management and data loading
```
## Future Work / Roadmap
- Expand support for additional data sources and formats
- Add utility functions for common Spark transformations and actions
- Implement configuration options for SparkSession builder
- Include error handling and logging mechanisms
- Provide unit tests and example notebooks
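As an illustration of the configuration-options item above, one possible shape is a `config` dict applied to the `SparkSession` builder. Everything here (the class name, the `config` parameter) is a hypothetical sketch, not part of the current library:

```python
# Hypothetical extension: forward builder options through a dict.
class ConfigurableSparkAPI:
    """Like SparkAPI, but with user-supplied SparkSession builder options."""

    def __init__(self, app_name="SparkAPI", config=None):
        from pyspark.sql import SparkSession

        builder = SparkSession.builder.appName(app_name)
        # Apply each key/value pair to the builder, e.g.
        # {"spark.sql.shuffle.partitions": "8"}.
        for key, value in (config or {}).items():
            builder = builder.config(key, value)
        self.spark = builder.getOrCreate()
```

`SparkSession.builder.config(key, value)` is the standard PySpark mechanism for setting session properties, so a thin loop like this keeps the wrapper flexible without hard-coding individual options.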