This Java-based data ingestion workflow downloads JSON data from a Google Cloud Storage bucket, parses it, and inserts it into a PostgreSQL database. It handles unique constraint violations gracefully to maintain data integrity.
Features
- Connects to Google Cloud Storage to list and download JSON files.
- Parses JSON data and processes various entities such as Items, Resources, Contributors, Call Numbers, and Subjects.
- Inserts parsed data into PostgreSQL tables with error handling for unique constraint violations.
- Modular processors for different data components to maintain clean separation of concerns.
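The unique-constraint handling mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name `UniqueViolationSketch`, the `items` table, and its columns are hypothetical. The one fixed fact it relies on is that PostgreSQL reports unique constraint violations with SQLSTATE `23505`.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper: class, table, and column names are illustrative assumptions.
final class UniqueViolationSketch {
    // PostgreSQL SQLSTATE code for unique_violation.
    static final String UNIQUE_VIOLATION = "23505";

    // True when the exception represents a unique constraint violation.
    static boolean isUniqueViolation(SQLException e) {
        return UNIQUE_VIOLATION.equals(e.getSQLState());
    }

    // Returns true if the row was inserted, false if it already existed.
    // Any other SQL error is rethrown to the caller.
    static boolean insertItem(Connection conn, String externalId, String title)
            throws SQLException {
        String sql = "INSERT INTO items (external_id, title) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, externalId);
            ps.setString(2, title);
            ps.executeUpdate();
            return true;
        } catch (SQLException e) {
            if (isUniqueViolation(e)) {
                return false; // duplicate row: skip it and keep ingesting
            }
            throw e;
        }
    }
}
```

This pattern assumes the connection runs in autocommit mode. Inside an explicit transaction, PostgreSQL aborts the whole transaction after a failed statement, so an `INSERT ... ON CONFLICT DO NOTHING` statement may be a better fit there.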
Tech Stack
- Java 11
- Maven for build and dependency management
- PostgreSQL as the relational database
- Google Cloud Storage as the data source
- JSON processing with org.json
Getting Started
Prerequisites
- Java 11 or higher installed
- Maven installed
- PostgreSQL running locally or accessible
- Google Cloud Storage bucket with JSON files
- Service account key JSON file for GCS authentication
Installation
- Clone the repository:
git clone https://github.com/justin-napolitano/sup-court-data-ingestion.git
cd sup-court-data-ingestion
- Update the database connection parameters and Google Cloud credentials path in DataIngestionMain.java.
- Build the project using Maven:
mvn clean package
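The fields in DataIngestionMain.java are not shown in this README, so the following is only a sketch of the kind of values the configuration step involves. The class `ConnectionSettingsSketch`, the `DB_HOST`, `DB_PORT`, and `DB_NAME` variable names, and the defaults are all assumptions. The `jdbc:postgresql://host:port/database` URL form and the `GOOGLE_APPLICATION_CREDENTIALS` variable are the standard conventions of the PostgreSQL JDBC driver and the Google Cloud client libraries.

```java
import java.util.Map;

// Hypothetical sketch: names and defaults are assumptions, not the project's code.
final class ConnectionSettingsSketch {
    // Builds a PostgreSQL JDBC URL of the form jdbc:postgresql://host:port/database.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database;
    }

    // Returns the value for key, or the fallback when it is missing or empty.
    static String valueOrDefault(Map<String, String> env, String key, String fallback) {
        String value = env.get(key);
        return (value == null || value.isEmpty()) ? fallback : value;
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        String host = valueOrDefault(env, "DB_HOST", "localhost");
        int port = Integer.parseInt(valueOrDefault(env, "DB_PORT", "5432"));
        String database = valueOrDefault(env, "DB_NAME", "postgres");
        // Path to the service account key; resources/secret.json matches the
        // location listed under Project Structure.
        String keyPath = valueOrDefault(env, "GOOGLE_APPLICATION_CREDENTIALS",
                "resources/secret.json");
        System.out.println(jdbcUrl(host, port, database));
        System.out.println(keyPath);
    }
}
```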
Running
Run the main class using the Maven exec plugin:
mvn exec:java -Dexec.mainClass="com.data_ingestion.DataIngestionMain"
Project Structure
sup-court-data-ingestion/
├── pom.xml
├── readme.md
├── resources/
│   └── secret.json                  # Google Cloud service account key
├── src/
│   ├── main/
│   │   └── java/
│   │       └── com/data_ingestion/
│   │           ├── CallNumbersProcessor.java
│   │           ├── ContributorsProcessor.java
│   │           ├── DataIngestionClient.java
│   │           ├── DataIngestionMain.java
│   │           ├── GCSClient.java
│   │           ├── ItemsProcessor.java
│   │           ├── ResourcesProcessor.java
│   │           └── SubjectsProcessor.java
│   └── test/
│       └── java/
│           └── com/example/AppTest.java
└── target/                          # Maven build output
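One way the per-entity processors listed above could be composed is through a shared contract. The sketch below is an assumption about the design, not a description of the actual classes: `EntityProcessor` and `ProcessorPipelineSketch` are hypothetical names, and a plain `Map` stands in for a parsed JSON record so that the example needs only the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical contract that each entity processor could implement.
interface EntityProcessor {
    // Handles one parsed record and returns the number of rows it produced.
    int process(Map<String, Object> record);
}

// Hypothetical pipeline that passes every record through each registered processor.
final class ProcessorPipelineSketch {
    private final List<EntityProcessor> processors = new ArrayList<>();

    // Registers a processor and returns this pipeline to allow chaining.
    ProcessorPipelineSketch add(EntityProcessor processor) {
        processors.add(processor);
        return this;
    }

    // Runs all processors in registration order and sums their row counts.
    int run(Map<String, Object> record) {
        int total = 0;
        for (EntityProcessor processor : processors) {
            total += processor.process(record);
        }
        return total;
    }
}
```

Keeping each entity type behind the same small interface means a new processor can be registered without changing the code that drives ingestion.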
Future Work / Roadmap
- Add comprehensive unit and integration tests for processors and clients.
- Implement configuration management to externalize DB and GCS credentials.
- Enhance error handling and logging with a structured logging framework.
- Support incremental data ingestion and data update scenarios.
- Containerize the application for easier deployment.
- Add support for parallel processing to improve ingestion speed.
For any questions or contributions, please open an issue or submit a pull request.