SparkAPI: Simplified PySpark Utility Library

GitHub repo: https://github.com/justin-napolitano/SparkAPI

SparkAPI is a lightweight Python utility library that simplifies working with Apache Spark through PySpark. It currently covers two common tasks: creating a SparkSession and loading CSV data.

Features

  • Easy instantiation of SparkSession
  • Simplified loading of CSV data into Spark DataFrames

Tech Stack

  • Python
  • Apache Spark (PySpark)

Getting Started

Prerequisites

  • Python 3.x
  • Apache Spark installed and configured
  • PySpark package installed
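
A quick way to confirm these prerequisites is a small check script. The sketch below is not part of SparkAPI; the function name `check_prerequisites` is made up for illustration. It also looks for a `java` executable, since Spark runs on the JVM. The PATH lookup is only a heuristic, because a Java install that is reachable solely through `JAVA_HOME` would be reported as missing.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Return a dict describing which prerequisites appear to be available.

    Hypothetical helper for illustration; not part of the SparkAPI repository.
    """
    return {
        # The README asks for Python 3.x
        "python3": sys.version_info.major == 3,
        # True if the pyspark package can be found in this environment
        "pyspark": importlib.util.find_spec("pyspark") is not None,
        # Spark needs a JVM; this only checks for `java` on PATH
        "java": shutil.which("java") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```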

Installation

Clone the repository:

git clone https://github.com/justin-napolitano/SparkAPI.git
cd SparkAPI

Install PySpark if not already installed:

pip install pyspark

Usage

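Import the SparkAPI class from sparkAPI.py and use it in your Python scripts. Because the library is cloned rather than installed as a package, run your script from the repository root (or add that directory to PYTHONPATH) so the module can be found: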

from sparkAPI import SparkAPI

spark_api = SparkAPI()
df = spark_api.load_spark_data_from_csv('path/to/your/file.csv')
df.show()
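
For readers curious what a wrapper like this does under the hood, here is a minimal sketch built on standard PySpark calls (`SparkSession.builder` and `DataFrameReader.csv`). It is an assumption about the general shape of such a class, not the code in sparkAPI.py: the class name `SparkAPISketch`, the constructor parameters, and the `header`/`infer_schema` defaults are all hypothetical.

```python
class SparkAPISketch:
    """Hypothetical stand-in for a SparkSession wrapper.

    This is NOT the repository's SparkAPI class; names and defaults
    are assumptions made for illustration.
    """

    def __init__(self, app_name="SparkAPI-sketch", master="local[*]"):
        self.app_name = app_name
        self.master = master
        self._spark = None  # the session is created lazily on first use

    @property
    def spark(self):
        """Return the SparkSession, creating it on first access."""
        if self._spark is None:
            # Imported here so that defining the class does not require PySpark
            from pyspark.sql import SparkSession

            self._spark = (
                SparkSession.builder
                .appName(self.app_name)
                .master(self.master)
                .getOrCreate()
            )
        return self._spark

    def load_spark_data_from_csv(self, path, header=True, infer_schema=True):
        """Read a CSV file into a Spark DataFrame."""
        return self.spark.read.csv(path, header=header, inferSchema=infer_schema)
```

Creating the session lazily keeps construction cheap: no JVM is started until the first call that actually needs Spark, such as `load_spark_data_from_csv`.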

Project Structure

SparkAPI/
└── sparkAPI.py       # Core class with Spark session management and data loading

Future Work / Roadmap

  • Expand support for additional data sources and formats
  • Add utility functions for common Spark transformations and actions
  • Implement configuration options for SparkSession builder
  • Include error handling and logging mechanisms
  • Provide unit tests and example notebooks
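
Two of the roadmap items, builder configuration and error handling with logging, could start from something like the sketch below. Nothing here exists in the repository yet: `apply_options`, `load_csv_checked`, and the logger name are hypothetical. The only PySpark calls assumed are `builder.config(key, value)` and `spark.read.csv(...)`. The existence check only makes sense for local paths, not for URIs such as `hdfs://` or `s3a://`.

```python
import logging
import os

logger = logging.getLogger("sparkapi_sketch")  # hypothetical logger name


def apply_options(builder, options):
    """Apply each key/value pair to a SparkSession builder and return it."""
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder


def load_csv_checked(spark, path, **reader_options):
    """Load a local CSV file, logging and raising early if it is missing."""
    if not os.path.exists(path):
        logger.error("CSV file not found: %s", path)
        raise FileNotFoundError(path)
    logger.info("Loading CSV from %s", path)
    return spark.read.csv(path, **reader_options)


# Example wiring (requires PySpark, so it is left commented out here):
# from pyspark.sql import SparkSession
# builder = apply_options(
#     SparkSession.builder.appName("demo"),
#     {"spark.sql.shuffle.partitions": "8"},
# )
# spark = builder.getOrCreate()
# df = load_csv_checked(spark, "path/to/your/file.csv", header=True)
```

Keeping `apply_options` independent of PySpark imports makes it easy to unit test with a stub builder, which also fits the roadmap's testing goal.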