Overview
Loc Prodifier is a Python tool designed to merge data from staging tables into production tables within Google BigQuery while preventing duplicate records. It supports both local execution and deployment on Google Cloud Run, enabling scalable and parallel processing of multiple tables using Google Cloud Workflows.
Features
- Merges data from staging tables into production tables without duplicates.
- Supports parallel execution across multiple tables.
- Can run locally with custom credentials or be deployed on Google Cloud Run.
- Configurable via command-line arguments.
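The duplicate-free merge described above can be sketched as a BigQuery MERGE statement that inserts only the staging rows with no match in production. This is a minimal illustration, not the script's actual query: the function name build_merge_sql and the key column id are hypothetical, and INSERT ROW assumes both tables share the same schema.

```python
def build_merge_sql(dataset_id, staging_table_id, prod_table_id, key_column="id"):
    """Return a MERGE statement that inserts staging rows missing from prod.

    key_column is a hypothetical unique key; the real merge condition
    used by loc_prodifier.py may differ.
    """
    prod = f"`{dataset_id}.{prod_table_id}`"
    staging = f"`{dataset_id}.{staging_table_id}`"
    return (
        f"MERGE {prod} AS T\n"
        f"USING {staging} AS S\n"
        f"ON T.{key_column} = S.{key_column}\n"
        "WHEN NOT MATCHED THEN\n"
        "  INSERT ROW"
    )


# Example with placeholder names; prints the generated SQL.
print(build_merge_sql("your_dataset_id", "your_staging_table_id", "your_prod_table_id"))
```

Rows already present in production are left untouched because the statement has no WHEN MATCHED clause. Duplicate keys inside the staging table itself would still be inserted, so staging data may need deduplicating first.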
Tech Stack
- Python 3.7+
- Google Cloud BigQuery
- Google Cloud Run
- Google Cloud Workflows
- Docker
Getting Started
Prerequisites
- Python 3.7 or higher
- Google Cloud SDK
- Docker
- Google Cloud project with BigQuery, Cloud Run, and Artifact Registry enabled
Installation
- Clone the repository:
git clone https://github.com/justin-napolitano/loc_prodifier.git
cd loc_prodifier
- Install required Python packages:
pip install -r requirements.txt
Running Locally
Ensure you have your Google Cloud credentials JSON file. Run:
python loc_prodifier.py --dataset_id your_dataset_id --staging_table_id your_staging_table_id --prod_table_id your_prod_table_id --local
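For reference, the flags in the command above could be parsed with argparse roughly as follows. The flag names come from the command itself; marking the three IDs as required and treating --local as a boolean switch are assumptions, and the real script may define them differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line flags shown in the usage example."""
    parser = argparse.ArgumentParser(
        description="Merge a BigQuery staging table into a production table."
    )
    # Assumption: all three identifiers are mandatory.
    parser.add_argument("--dataset_id", required=True)
    parser.add_argument("--staging_table_id", required=True)
    parser.add_argument("--prod_table_id", required=True)
    # Assumption: --local is a switch that takes no value.
    parser.add_argument("--local", action="store_true")
    return parser.parse_args(argv)


# Parse an explicit list so the sketch runs without real command-line input.
args = parse_args([
    "--dataset_id", "your_dataset_id",
    "--staging_table_id", "your_staging_table_id",
    "--prod_table_id", "your_prod_table_id",
    "--local",
])
print(args.dataset_id, args.staging_table_id, args.prod_table_id, args.local)
```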
Running with Docker
- Build the Docker image:
docker build -t my-bigquery-script .
- Run the container:
docker run --rm my-bigquery-script --dataset_id your_dataset_id --staging_table_id your_staging_table_id --prod_table_id your_prod_table_id --local
Deploying to Google Cloud Run
- Use the provided cloudbuild.yaml to build and push the Docker image to Artifact Registry.
- Deploy the Cloud Run job using the Cloud Build steps or manually with gcloud commands.
- Use workflow.yaml to orchestrate parallel merges via Google Cloud Workflows.
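Running merges in parallel means each branch receives its own set of script arguments. The sketch below shows one way such per-table argument lists could be generated; it does not reflect the contents of workflow.yaml, and the function name build_job_args and the table names are hypothetical.

```python
def build_job_args(dataset_id, table_pairs):
    """Return one argument list per (staging_table, prod_table) pair.

    Each list mirrors the flags accepted by loc_prodifier.py and could be
    handed to a separate job execution.
    """
    return [
        [
            "--dataset_id", dataset_id,
            "--staging_table_id", staging,
            "--prod_table_id", prod,
        ]
        for staging, prod in table_pairs
    ]


# Hypothetical table names, for illustration only.
pairs = [("items_staging", "items"), ("subjects_staging", "subjects")]
for job_args in build_job_args("your_dataset_id", pairs):
    print(" ".join(job_args))
```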
Project Structure
loc_prodifier/
├── cloudbuild.yaml # Cloud Build configuration for building and deploying
├── Dockerfile # Docker image build instructions
├── gcputils/ # Google Cloud utility submodule (BigQuery, Storage, Logging, Secrets)
│   ├── BigQueryClient.py # BigQuery client wrapper
│   ├── gcpclient.py # Google Cloud Storage client
│   ├── GoogleCloudLogging.py # Cloud Logging client
│   ├── GoogleSecretManager.py # Secret Manager client
│   └── ...
├── loc_prodifier.py # Main script for merging tables
├── readme-prodifier.md # Original README content
├── requirements.txt # Python dependencies
└── workflow.yaml # Cloud Workflows definition for parallel execution
Future Work / Roadmap
- Add support for configurable merge conditions and update clauses.
- Enhance error handling and logging integration.
- Provide more detailed usage examples and automated tests.
- Expand support for additional data sources or cloud providers.
- Implement monitoring and alerting for workflow executions.
Note: Some documentation and features are inferred based on available code and files.