A Python-based scraper that extracts US Supreme Court case data from the Library of Congress digital collections. The project runs on Google Cloud Platform (GCP): scraping is packaged as Cloud Run jobs, and results are stored in Cloud Storage buckets for downstream processing.
Features
- Scrapes US Supreme Court case metadata from the Library of Congress public API.
- Converts search results into JSON format.
- Uploads scraped data to Google Cloud Storage buckets.
- Designed as GCP Cloud Run jobs for scalable, cloud-native execution.
- Includes scripts for building Docker images, deploying jobs, and managing workflows.
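The core flow described above (query the LOC API, extract metadata, upload JSON to GCS) can be sketched roughly as follows. This is a minimal illustration, not the repo's actual implementation: the collection slug, the response field names (`title`, `date`, `url`), and the bucket name are assumptions.

```python
# Hedged sketch of the scrape -> JSON -> GCS flow.
# The collection slug, JSON field names, and bucket name are illustrative.
import json
import urllib.parse
import urllib.request

LOC_SEARCH = "https://www.loc.gov/collections/united-states-reports/"


def search_url(page: int) -> str:
    """Build a paginated search URL; fo=json asks the LOC API for JSON."""
    query = urllib.parse.urlencode({"fo": "json", "sp": page})
    return f"{LOC_SEARCH}?{query}"


def extract_case_metadata(item: dict) -> dict:
    """Keep only the fields of interest from one search result.

    The field names here are assumptions about the LOC response shape,
    not verified against the live API.
    """
    return {
        "title": item.get("title"),
        "date": item.get("date"),
        "url": item.get("url"),
    }


def upload_json(bucket_name: str, blob_name: str, payload: list) -> None:
    """Upload scraped records to a GCS bucket (needs google-cloud-storage)."""
    from google.cloud import storage  # lazy import; cloud-only dependency

    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(json.dumps(payload), content_type="application/json")


if __name__ == "__main__":
    with urllib.request.urlopen(search_url(1)) as resp:
        data = json.load(resp)
    records = [extract_case_metadata(r) for r in data.get("results", [])]
    upload_json("my-scraper-bucket", "scotus/page_1.json", records)  # hypothetical bucket
```

In practice the real scraper (`src/loc_scraper.py`) handles pagination, retries, and logging; the sketch only shows the shape of the pipeline.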
Tech Stack
- Python 3
- Google Cloud Platform (Cloud Run, Cloud Storage, Cloud Build)
- Docker
- Bash scripting
- BeautifulSoup for HTML parsing
Getting Started
Prerequisites
- Google Cloud account with billing enabled
- gcloud CLI installed and configured
- Docker installed locally
Installation
- Clone the repository:
git clone https://github.com/justin-napolitano/loc_scraper.git
cd loc_scraper
- Install Python dependencies:
pip install -r requirements.txt
Build and Deploy
- Build the Docker image using Cloud Build:
gcloud builds submit --config cloudbuild.yaml .
- Deploy the scraping job to Cloud Run:
./deploy.sh
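A Cloud Build configuration for this kind of deployment typically looks like the sketch below. The repo's own cloudbuild.yaml is authoritative; the image name here is a placeholder.

```yaml
# Illustrative cloudbuild.yaml — image name is a placeholder.
steps:
  - name: gcr.io/cloud-builders/docker
    args: ["build", "-t", "gcr.io/$PROJECT_ID/loc-scraper", "."]
images:
  - gcr.io/$PROJECT_ID/loc-scraper
```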
Running the Job
Trigger the Cloud Run job manually (for example via execute_job.sh) or on a schedule with Cloud Scheduler to start scraping.
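For reference, a manual run and a scheduled run look roughly like the commands below. The job name, region, schedule, and service account are assumptions; substitute the values used by your deployment.

```
# Run the job once (job name and region are assumptions):
gcloud run jobs execute loc-scraper --region us-central1

# Or schedule it nightly via Cloud Scheduler (hypothetical names):
gcloud scheduler jobs create http loc-scraper-nightly \
  --schedule "0 3 * * *" \
  --uri "https://us-central1-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/PROJECT_ID/jobs/loc-scraper:run" \
  --http-method POST \
  --oauth-service-account-email SA_EMAIL
```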
Project Structure
loc_scraper/
├── build.sh # Build script
├── cloudbuild.yaml # Cloud Build configuration
├── deploy.sh # Deployment script for Cloud Run job
├── Dockerfile # Docker image specification
├── execute_job.sh # Script to execute the scraping job
├── job_create.sh # Script to create GCP job
├── logs/ # Log files directory
├── output_2/ # Sample output JSON files from scraping
├── post-image.jpeg # Featured image for documentation
├── quickstart/ # Quickstart instructions and scripts
├── readme.md # Project README
├── requirements.txt # Python dependencies
├── requirements_cloud.txt # Cloud-specific dependencies
├── run.sh # Run script
└── src/ # Source code
├── loc_scraper.py # Main scraper implementation
├── loc_pdf_downloader.py # PDF downloader utility
└── steps.md # Setup and development steps
Future Work / Roadmap
- Complete integration with chatbot APIs for enhanced data analysis.
- Expand scraping to include more collections or metadata fields.
- Implement automated scheduling and monitoring of scraping jobs.
- Add data processing pipelines to build research tools from scraped data.
- Improve error handling and logging for robustness.
For background and usage notes, see the index.md and readme.md files in the repository.