This repository contains the LOC Normalizer project, a tool designed to normalize the Library of Congress (LOC) data schema into a structured database format. The normalized data will then be used to construct a knowledge graph focused on Supreme Court law.
## Features
- Extracts and processes JSON data blobs from GCP storage buckets.
- Normalizes complex JSON structures into flat tables suitable for database ingestion (see the sketch after this list).
- Automates workflows using Google Cloud Run jobs and Docker containers.
- Integrates with Google Cloud services such as Cloud Storage, BigQuery, and Cloud Logging.
- Provides reusable GCP client utilities for storage, logging, and BigQuery operations.
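The flattening logic lives in `src/loc_flattener.py`. As a rough illustration of the general approach only (a sketch, not the project's actual implementation; the record fields below are invented for the example), nested LOC item records can be flattened with `pandas.json_normalize`:

```python
import pandas as pd

# A nested record shaped loosely like an LOC API item; the field names
# are illustrative only, not the actual LOC schema.
items = [
    {
        "id": "http://www.loc.gov/item/usrep347483/",
        "title": "Brown v. Board of Education",
        "dates": {"decided": "1954-05-17"},
        "subjects": ["education", "equal protection"],
    }
]

# json_normalize flattens nested dicts into dotted columns
# (e.g. "dates.decided"); list-valued fields stay as Python lists.
flat = pd.json_normalize(items)
print(flat.columns.tolist())
# ['id', 'title', 'subjects', 'dates.decided']
```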
## Tech Stack
- Python (Jupyter Notebooks and scripts)
- Google Cloud Platform (Cloud Storage, BigQuery, Cloud Run, Artifact Registry)
- Docker for containerization
- Bash scripting for automation
- Google Cloud SDK (gcloud CLI)
## Getting Started
### Prerequisites
- Python 3.x
- Docker
- Google Cloud SDK (gcloud) installed and configured
- Access to a GCP project with appropriate permissions
### Installation
- Clone the repository:

  ```bash
  git clone https://github.com/justin-napolitano/loc_normalizer.git
  cd loc_normalizer
  ```

- (Optional) Create and activate a Python virtual environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```

- Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```
### Running Locally
- Use the provided Python scripts in `src/` to interact with GCP buckets and process data.
- Ensure your environment is authenticated with GCP credentials (e.g., set `GOOGLE_APPLICATION_CREDENTIALS` or use `gcloud auth application-default login`).
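For example, once credentials are configured, a JSON blob can be read from a bucket with the official `google-cloud-storage` client. This is a minimal sketch; the bucket and blob names are placeholders, not the project's actual ones:

```python
from google.cloud import storage

# Uses Application Default Credentials (GOOGLE_APPLICATION_CREDENTIALS
# or `gcloud auth application-default login`).
client = storage.Client()

# Placeholder names; substitute the bucket and prefix your project uses.
bucket = client.bucket("my-loc-bucket")
blob = bucket.blob("loc/results/page_1.json")

raw_json = blob.download_as_text()
print(raw_json[:200])
```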
## Building and Deploying the Docker Container
- Build the Docker image:

  ```bash
  ./build.sh
  ```

- Deploy the image to Google Cloud Run:

  ```bash
  ./deploy.sh
  ```
These scripts build, tag, and push the Docker image to Google Artifact Registry, then deploy it as a Cloud Run service.
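The shell scripts wrap `gcloud` commands. If you prefer to trigger the workload from Python instead, the `google-cloud-run` client can execute a Cloud Run job directly. This is a sketch under the assumption that a job has already been created; the job name is a placeholder:

```python
from google.cloud import run_v2

client = run_v2.JobsClient()

# Fully qualified job name is a placeholder; substitute your own
# project ID, region, and job name.
job_name = "projects/my-project/locations/us-central1/jobs/loc-normalizer"

# run_job returns a long-running operation; result() blocks until the
# execution completes.
operation = client.run_job(name=job_name)
execution = operation.result()
print(f"Execution finished: {execution.name}")
```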
## Project Structure
```
loc_normalizer/
├── build.sh                          # Script to build the Docker image
├── cloudbuild.yaml                   # Cloud Build configuration
├── cloudbuildsample.yaml             # Sample Cloud Build config
├── create_deploy_cloud_run_job/      # Possibly deployment-related scripts
├── deploy.sh                         # Deployment script for Cloud Run
├── Dockerfile                        # Dockerfile for the container image
├── dply.sh                           # Additional deployment or utility script
├── execute_job.sh                    # Script to execute jobs
├── index.md                          # Project overview and plan
├── job_create.sh                     # Job creation script
├── post-image.jpeg                   # Image used in documentation
├── readme.md                         # Secondary readme, possibly outdated
├── requirements.txt                  # Python dependencies
├── src/                              # Source code and utilities
│   ├── loc_flattener.py              # JSON normalization logic
│   ├── loc_scraper.py                # Scraper for LOC data
│   ├── gcputils/                     # GCP client utilities (storage, logging, BigQuery, secrets)
│   ├── create_last_page_touched_blob.py  # Example GCS interaction
│   └── ...
└── submit.sh                         # Script to submit jobs
```
## Future Work / Roadmap
- Complete the normalization workflow to fully flatten JSON data and ingest it into BigQuery (see the sketch after this list).
- Expand the scraper to cover more LOC collections and handle pagination robustly.
- Develop the knowledge graph construction using normalized data.
- Improve error handling and logging in scripts.
- Automate CI/CD pipelines using Cloud Build and GitHub Actions.
- Add comprehensive documentation and usage examples.
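For reference, the ingestion step could use the `google-cloud-bigquery` streaming API. A minimal sketch, assuming a flat row schema and a placeholder table ID (neither is defined in this repo yet):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table ID; the project's dataset and table are not yet defined.
table_id = "my-project.loc_dataset.items"

# Rows as produced by the flattener; field names are illustrative.
rows = [
    {
        "item_id": "usrep347483",
        "title": "Brown v. Board of Education",
        "decided": "1954-05-17",
    },
]

# insert_rows_json streams rows into the table and returns a list of
# per-row errors (empty on success).
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"BigQuery insert failed: {errors}")
```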
Note: Some documentation files and scripts indicate ongoing development and may require updates or completion.