A lightweight Python API wrapper for Apache Spark designed to simplify data manipulation tasks. This project provides an easy interface to instantiate Spark sessions and load CSV data into Spark DataFrames.
Features
- Simplified Spark session management
- Load CSV files as Spark DataFrames with header support
Tech Stack
- Python
- Apache Spark (PySpark)
Getting Started
Prerequisites
- Python 3.6+
- Apache Spark installed and configured
Installation
Clone the repository:
git clone https://github.com/justin-napolitano/project-spark-api.git
cd project-spark-api
Install PySpark (if not already installed):
pip install pyspark
Usage
Example usage in Python:
from sparkAPI import SparkAPI
spark_api = SparkAPI()
df = spark_api.load_spark_data_from_csv('path/to/your/file.csv')
df.show()
Project Structure
project-spark-api/
├── sparkAPI.py # Main API wrapper class for Spark session and data loading
Future Work / Roadmap
- Add support for additional data formats (e.g., JSON, Parquet)
- Implement data transformation utilities
- Enable configuration options for Spark session (e.g., app name, master URL)
- Add error handling and logging
- Provide unit tests and examples