Java Data Ingestion from Google Cloud to PostgreSQL

GitHub repo: https://github.com/justin-napolitano/sup-court-data-ingestion

A Java-based data ingestion workflow that downloads JSON data from a Google Cloud Storage bucket, parses it, and inserts it into a PostgreSQL database. Unique constraint violations are handled during insertion so that duplicate records do not compromise data integrity.


Features

  • Connects to Google Cloud Storage to list and download JSON files.
  • Parses JSON data and processes various entities such as Items, Resources, Contributors, Call Numbers, and Subjects.
  • Inserts parsed data into PostgreSQL tables with error handling for unique constraint violations.
  • Uses a separate processor for each data component to keep a clean separation of concerns.
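
The unique-constraint handling listed above could look roughly like the following. This is a minimal sketch, not the repository's actual code: the class UniqueViolationSketch, its method names, and the subjects table with its item_id and subject columns are all hypothetical. The one fixed point is that PostgreSQL reports unique violations with SQLSTATE 23505, which JDBC exposes through SQLException.getSQLState().

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical helper for illustration; not taken from the repository. */
final class UniqueViolationSketch {

    // PostgreSQL's SQLSTATE code for unique_violation.
    static final String UNIQUE_VIOLATION = "23505";

    /** Returns true when the exception carries the unique_violation SQLSTATE. */
    static boolean isUniqueViolation(SQLException e) {
        return UNIQUE_VIOLATION.equals(e.getSQLState());
    }

    /**
     * Inserts one row. Returns true if the row was inserted and false if it
     * was skipped as a duplicate. Table and column names are placeholders.
     */
    static boolean insertSubject(Connection conn, String itemId, String subject)
            throws SQLException {
        String sql = "INSERT INTO subjects (item_id, subject) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, itemId);
            ps.setString(2, subject);
            ps.executeUpdate();
            return true;
        } catch (SQLException e) {
            if (isUniqueViolation(e)) {
                return false; // duplicate row: skip it and keep ingesting
            }
            throw e; // any other database error is still a failure
        }
    }

    public static void main(String[] args) {
        // Classification can be exercised without a database connection.
        System.out.println(isUniqueViolation(new SQLException("duplicate key", "23505")));
        System.out.println(isUniqueViolation(new SQLException("syntax error", "42601")));
    }
}
```

With autocommit disabled, PostgreSQL aborts the whole transaction after a failed statement, so catching and skipping as sketched here only works with autocommit on or with a savepoint around each insert. INSERT ... ON CONFLICT DO NOTHING is an alternative that avoids the exception entirely.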

Tech Stack

  • Java 11
  • Maven for build and dependency management
  • PostgreSQL as the relational database
  • Google Cloud Storage as the data source
  • JSON processing with org.json

Getting Started

Prerequisites

  • Java 11 or higher installed
  • Maven installed
  • PostgreSQL running locally or accessible
  • Google Cloud Storage bucket with JSON files
  • Service account key JSON file for GCS authentication

Installation

  1. Clone the repository:
git clone https://github.com/justin-napolitano/sup-court-data-ingestion.git
cd sup-court-data-ingestion
  2. Update the database connection parameters and Google Cloud credentials path in DataIngestionMain.java.

  3. Build the project using Maven:

mvn clean package

Running

Run the main class using the Maven exec plugin:

mvn exec:java -Dexec.mainClass="com.data_ingestion.DataIngestionMain"

Project Structure

sup-court-data-ingestion/
├── pom.xml
├── readme.md
├── resources/
│   └── secret.json  # Google Cloud service account key
├── src/
│   ├── main/
│   │   ├── java/
│   │   │   └── com/data_ingestion/
│   │   │       ├── CallNumbersProcessor.java
│   │   │       ├── ContributorsProcessor.java
│   │   │       ├── DataIngestionClient.java
│   │   │       ├── DataIngestionMain.java
│   │   │       ├── GCSClient.java
│   │   │       ├── ItemsProcessor.java
│   │   │       ├── ResourcesProcessor.java
│   │   │       └── SubjectsProcessor.java
│   └── test/
│       └── java/
│           └── com/example/AppTest.java
└── target/  # Maven build output

Future Work / Roadmap

  • Add comprehensive unit and integration tests for processors and clients.
  • Implement configuration management to externalize DB and GCS credentials.
  • Enhance error handling and logging with a structured logging framework.
  • Support incremental data ingestion and data update scenarios.
  • Containerize the application for easier deployment.
  • Add support for parallel processing to improve ingestion speed.
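
The configuration item above could start as small as reading settings from environment variables. A sketch under stated assumptions: IngestionConfigSketch and the INGEST_* variable names are invented for illustration and do not exist in the repository. GOOGLE_APPLICATION_CREDENTIALS is the variable the Google Cloud client libraries already consult for a service account key path.

```java
import java.util.Map;

/** Hypothetical settings holder for illustration; not part of the repository. */
final class IngestionConfigSketch {
    final String jdbcUrl;
    final String dbUser;
    final String dbPassword;
    final String gcsKeyPath;

    private IngestionConfigSketch(String jdbcUrl, String dbUser,
                                  String dbPassword, String gcsKeyPath) {
        this.jdbcUrl = jdbcUrl;
        this.dbUser = dbUser;
        this.dbPassword = dbPassword;
        this.gcsKeyPath = gcsKeyPath;
    }

    /** Builds the config from a key/value map such as System.getenv(). */
    static IngestionConfigSketch fromEnv(Map<String, String> env) {
        return new IngestionConfigSketch(
                required(env, "INGEST_DB_URL"),       // assumed name
                required(env, "INGEST_DB_USER"),      // assumed name
                required(env, "INGEST_DB_PASSWORD"),  // assumed name
                required(env, "GOOGLE_APPLICATION_CREDENTIALS"));
    }

    /** Fails fast with a clear message when a setting is absent or blank. */
    private static String required(Map<String, String> env, String key) {
        String value = env.get(key);
        if (value == null || value.isBlank()) {
            throw new IllegalStateException("Missing required setting: " + key);
        }
        return value;
    }

    public static void main(String[] args) {
        // Throws IllegalStateException if any variable is unset.
        IngestionConfigSketch config = fromEnv(System.getenv());
        System.out.println("Connecting to " + config.jdbcUrl + " as " + config.dbUser);
    }
}
```

Taking the map as a parameter rather than calling System.getenv() inside fromEnv keeps the loader easy to unit test, which also fits the testing item on the roadmap.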

For any questions or contributions, please open an issue or submit a pull request.
