LOC Normalizer: Tool for Structuring Library of Congress Data


This repository contains the LOC Normalizer project, a tool that normalizes Library of Congress (LOC) JSON data into a structured database format. The normalized data is then used to construct a knowledge graph focused on Supreme Court law.

Features

  • Extracts and processes JSON data blobs from GCP storage buckets.
  • Normalizes complex JSON structures into flat tables suitable for database ingestion.
  • Automates workflows using Google Cloud Run jobs and Docker containers.
  • Integrates with Google Cloud services such as Cloud Storage, BigQuery, and Cloud Logging.
  • Provides reusable GCP client utilities for storage, logging, and BigQuery operations.
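As a sketch of what "normalizing complex JSON structures into flat tables" can look like in practice, the snippet below flattens a nested LOC item record into a single row of scalar columns. The `flatten` function and the sample record are illustrative only and are not taken from the repository's loc_flattener.py.

```python
# Minimal sketch of JSON flattening: nested dicts become prefixed columns,
# lists are collapsed into delimited strings. Illustrative only.

def flatten(record, parent_key="", sep="_"):
    """Recursively flatten a nested dict into a single flat row."""
    row = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            row.update(flatten(value, full_key, sep))
        elif isinstance(value, list):
            # Collapse lists into a delimited string so the row stays flat.
            row[full_key] = "; ".join(str(v) for v in value)
        else:
            row[full_key] = value
    return row

# Hypothetical LOC-style item record (shape is illustrative).
item = {
    "id": "http://www.loc.gov/item/usrep329001/",
    "title": "U.S. Reports: Example v. Example",
    "dates": ["1946"],
    "partof": {"title": "united states reports"},
}
print(flatten(item))
```

A real pipeline would apply this per blob and write the resulting rows to a table whose columns match the flattened keys.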

Tech Stack

  • Python (Jupyter Notebooks and scripts)
  • Google Cloud Platform (Cloud Storage, BigQuery, Cloud Run, Artifact Registry)
  • Docker for containerization
  • Bash scripting for automation
  • Google Cloud SDK (gcloud CLI)

Getting Started

Prerequisites

  • Python 3.x
  • Docker
  • Google Cloud SDK (gcloud) installed and configured
  • Access to a GCP project with appropriate permissions

Installation

  1. Clone the repository:
git clone https://github.com/justin-napolitano/loc_normalizer.git
cd loc_normalizer
  2. (Optional) Create and activate a Python virtual environment:
python3 -m venv venv
source venv/bin/activate
  3. Install Python dependencies:
pip install -r requirements.txt

Running Locally

  • Use the provided Python scripts in src/ to interact with GCP buckets and process data.
  • Ensure your environment is authenticated with GCP credentials (e.g., set GOOGLE_APPLICATION_CREDENTIALS or use gcloud auth application-default login).
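Before running the scripts, it can help to confirm which credential source the Google client libraries are likely to pick up. `credential_source` below is a hypothetical helper, not part of the repository; it only inspects the two common Application Default Credentials locations.

```python
# Hypothetical pre-flight check for GCP credentials. Illustrative only;
# the client libraries perform their own, more complete ADC lookup.
import os

def credential_source():
    """Report which Application Default Credentials source appears active."""
    path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if path and os.path.isfile(path):
        return f"service-account key file: {path}"
    # Default location written by `gcloud auth application-default login`.
    adc = os.path.expanduser(
        "~/.config/gcloud/application_default_credentials.json"
    )
    if os.path.isfile(adc):
        return f"gcloud application-default credentials: {adc}"
    return "no credentials found; run `gcloud auth application-default login`"

print(credential_source())
```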

Building and Deploying the Docker Container

  1. Build the Docker image:
./build.sh
  2. Deploy the image to Google Cloud Run:
./deploy.sh

These scripts build and tag the Docker image, push it to Google Artifact Registry, and deploy it as a Cloud Run service.

Project Structure

loc_normalizer/
├── build.sh                # Script to build Docker image
├── cloudbuild.yaml         # Cloud Build configuration
├── cloudbuildsample.yaml   # Sample Cloud Build config
├── create_deploy_cloud_run_job/  # Possibly deployment-related scripts
├── deploy.sh               # Deployment script for Cloud Run
├── Dockerfile              # Dockerfile for container image
├── dply.sh                 # Additional deployment or utility script
├── execute_job.sh          # Script to execute jobs
├── index.md                # Project overview and plan
├── job_create.sh           # Job creation script
├── post-image.jpeg         # Image used in documentation
├── readme.md               # Secondary readme, possibly outdated
├── requirements.txt        # Python dependencies
├── src/                    # Source code and utilities
│   ├── loc_flattener.py    # JSON normalization logic
│   ├── loc_scraper.py      # Scraper for LOC data
│   ├── gcputils/           # GCP client utilities (storage, logging, BigQuery, secrets)
│   ├── create_last_page_touched_blob.py  # Example GCS interaction
│   └── ...
└── submit.sh               # Script to submit jobs

Future Work / Roadmap

  • Complete the normalization workflow to flatten JSON data fully and ingest into BigQuery.
  • Expand the scraper to cover more LOC collections and handle pagination robustly.
  • Develop the knowledge graph construction using normalized data.
  • Improve error handling and logging in scripts.
  • Automate CI/CD pipelines using Cloud Build and GitHub Actions.
  • Add comprehensive documentation and usage examples.
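For the "handle pagination robustly" roadmap item, one possible shape is a generator that follows the pagination object returned by the loc.gov JSON API until no next page remains. `iter_results` and the stubbed pages below are illustrative only, not code from loc_scraper.py; a real scraper would fetch each page over HTTP (e.g. `requests.get(url, params={"fo": "json", "sp": page})`).

```python
# Sketch of robust pagination. fetch_page is injected so the loop can be
# exercised without network access; shapes below are assumptions about the
# loc.gov JSON API's "results" and "pagination" fields.

def iter_results(fetch_page, max_pages=1000):
    """Yield result items page by page until the API reports no next page."""
    page = 1
    while page <= max_pages:
        payload = fetch_page(page)
        yield from payload.get("results", [])
        pagination = payload.get("pagination", {})
        if not pagination.get("next"):  # last page: "next" is null/absent
            break
        page += 1

# Example with a stubbed two-page response:
pages = {
    1: {"results": [{"id": "a"}],
        "pagination": {"next": "https://www.loc.gov/collections/?fo=json&sp=2"}},
    2: {"results": [{"id": "b"}], "pagination": {"next": None}},
}
items = list(iter_results(pages.get))
print([it["id"] for it in items])  # → ['a', 'b']
```

The `max_pages` cap guards against a malformed response that always advertises a next page.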

Note: Some documentation files and scripts indicate ongoing development and may require updates or completion.
