A Python-based scraper that extracts US Supreme Court case data from the Library of Congress digital collections. The project runs on Google Cloud Platform (GCP): scraping is packaged as Cloud Run jobs, and results are stored in Cloud Storage buckets for downstream processing.
Features
- Scrapes US Supreme Court case metadata from the Library of Congress public API.
- Converts search results into JSON format.
- Uploads scraped data to Google Cloud Storage buckets.
- Designed as GCP Cloud Run jobs for scalable, cloud-native execution.
- Includes scripts for building Docker images, deploying jobs, and managing workflows.
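The core flow described above (query the LOC API, extract metadata, upload JSON to GCS) can be sketched roughly as follows. This is a minimal illustration, not the repo's actual implementation: the collection slug, the response field names (`title`, `date`, `url`), and the bucket name are assumptions.

```python
# Hedged sketch of the scrape -> JSON -> GCS flow.
# The collection slug, JSON field names, and bucket name are illustrative.
import json
import urllib.parse
import urllib.request

LOC_SEARCH = "https://www.loc.gov/collections/united-states-reports/"


def search_url(page: int) -> str:
    """Build a paginated search URL; fo=json asks the LOC API for JSON."""
    query = urllib.parse.urlencode({"fo": "json", "sp": page})
    return f"{LOC_SEARCH}?{query}"


def extract_case_metadata(item: dict) -> dict:
    """Keep only the fields of interest from one search result.

    The field names here are assumptions about the LOC response shape,
    not verified against the live API.
    """
    return {
        "title": item.get("title"),
        "date": item.get("date"),
        "url": item.get("url"),
    }


def upload_json(bucket_name: str, blob_name: str, payload: list) -> None:
    """Upload scraped records to a GCS bucket (needs google-cloud-storage)."""
    from google.cloud import storage  # lazy import; cloud-only dependency

    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(json.dumps(payload), content_type="application/json")


if __name__ == "__main__":
    with urllib.request.urlopen(search_url(1)) as resp:
        data = json.load(resp)
    records = [extract_case_metadata(r) for r in data.get("results", [])]
    upload_json("my-scraper-bucket", "scotus/page_1.json", records)  # hypothetical bucket
```

In practice the real scraper (`src/loc_scraper.py`) handles pagination, retries, and logging; the sketch only shows the shape of the pipeline.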
Tech Stack
- Python 3
- Google Cloud Platform (Cloud Run, Cloud Storage, Cloud Build)
- Docker
- Bash scripting
- BeautifulSoup for HTML parsing
Getting Started
Prerequisites
- Google Cloud account with billing enabled
- gcloud CLI installed and configured
- Docker installed locally
Installation
- Clone the repository:
git clone https://github.com/justin-napolitano/loc_scraper.git
cd loc_scraper
- Install Python dependencies:
pip install -r requirements.txt
Build and Deploy
- Build the Docker image using Cloud Build:
gcloud builds submit --config cloudbuild.yaml .
- Deploy the scraping job to Cloud Run:
./deploy.sh
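A Cloud Build configuration for this kind of deployment typically looks like the sketch below. The repo's own cloudbuild.yaml is authoritative; the image name here is a placeholder.

```yaml
# Illustrative cloudbuild.yaml — image name is a placeholder.
steps:
  - name: gcr.io/cloud-builders/docker
    args: ["build", "-t", "gcr.io/$PROJECT_ID/loc-scraper", "."]
images:
  - gcr.io/$PROJECT_ID/loc-scraper
```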
Running the Job
Trigger the Cloud Run job manually (for example via execute_job.sh) or on a schedule with Cloud Scheduler to start scraping.
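For reference, a manual run and a scheduled run look roughly like the commands below. The job name, region, schedule, and service account are assumptions; substitute the values used by your deployment.

```
# Run the job once (job name and region are assumptions):
gcloud run jobs execute loc-scraper --region us-central1

# Or schedule it nightly via Cloud Scheduler (hypothetical names):
gcloud scheduler jobs create http loc-scraper-nightly \
  --schedule "0 3 * * *" \
  --uri "https://us-central1-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/PROJECT_ID/jobs/loc-scraper:run" \
  --http-method POST \
  --oauth-service-account-email SA_EMAIL
```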
Project Structure
loc_scraper/
├── build.sh # Build script
├── cloudbuild.yaml # Cloud Build configuration
├── deploy.sh # Deployment script for Cloud Run job
├── Dockerfile # Docker image specification
├── execute_job.sh # Script to execute the scraping job
├── job_create.sh # Script to create GCP job
├── logs/ # Log files directory
├── output_2/ # Sample output JSON files from scraping
├── post-image.jpeg # Featured image for documentation
├── quickstart/ # Quickstart instructions and scripts
├── readme.md # Project README
├── requirements.txt # Python dependencies
├── requirements_cloud.txt # Cloud-specific dependencies
├── run.sh # Run script
└── src/ # Source code
├── loc_scraper.py # Main scraper implementation
├── loc_pdf_downloader.py # PDF downloader utility
└── steps.md # Setup and development steps
Future Work / Roadmap
- Complete integration with chatbot APIs for enhanced data analysis.
- Expand scraping to include more collections or metadata fields.
- Implement automated scheduling and monitoring of scraping jobs.
- Add data processing pipelines to build research tools from scraped data.
- Improve error handling and logging for robustness.
For background and usage notes, see the index.md and readme.md files in the repository.