Loc Prodifier: Merging Data in Google BigQuery

GitHub repo: https://github.com/justin-napolitano/loc_prodifier

Overview

Loc Prodifier is a Python tool designed to merge data from staging tables into production tables within Google BigQuery while preventing duplicate records. It supports both local execution and deployment on Google Cloud Run, enabling scalable and parallel processing of multiple tables using Google Cloud Workflows.

Features

  • Merges data from staging tables into production tables without duplicates.
  • Supports parallel execution across multiple tables.
  • Can run locally with custom credentials or be deployed on Google Cloud Run.
  • Configurable via command-line arguments.
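The duplicate-free merge is typically expressed as a single BigQuery MERGE statement that only inserts staging rows whose key is absent from production. As a rough sketch (the helper name `build_merge_sql` and the default key column `id` are illustrative assumptions, not taken from the actual script):

```python
def build_merge_sql(dataset_id: str, staging_table_id: str,
                    prod_table_id: str, key_column: str = "id") -> str:
    """Build a BigQuery MERGE that inserts only rows not already in prod.

    The key column defaults to "id"; adjust it to match your schema.
    """
    return (
        f"MERGE `{dataset_id}.{prod_table_id}` AS prod\n"
        f"USING `{dataset_id}.{staging_table_id}` AS staging\n"
        f"ON prod.{key_column} = staging.{key_column}\n"
        f"WHEN NOT MATCHED THEN\n"
        f"  INSERT ROW"
    )


# The statement could then be executed with google-cloud-bigquery:
# from google.cloud import bigquery
# client = bigquery.Client()
# client.query(build_merge_sql("my_dataset", "staging", "prod")).result()
```

`WHEN NOT MATCHED THEN INSERT ROW` copies entire unmatched staging rows, which is what makes the merge idempotent with respect to re-runs.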

Tech Stack

  • Python 3.7+
  • Google Cloud BigQuery
  • Google Cloud Run
  • Google Cloud Workflows
  • Docker

Getting Started

Prerequisites

  • Python 3.7 or higher
  • Google Cloud SDK
  • Docker
  • Google Cloud project with BigQuery, Cloud Run, and Artifact Registry enabled

Installation

  1. Clone the repository:
git clone https://github.com/justin-napolitano/loc_prodifier.git
cd loc_prodifier
  2. Install the required Python packages:
pip install -r requirements.txt

Running Locally

Ensure you have your Google Cloud credentials JSON file. Run:

python loc_prodifier.py --dataset_id your_dataset_id --staging_table_id your_staging_table_id --prod_table_id your_prod_table_id --local
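With the `--local` flag the script needs to find your credentials file; pointing the standard `GOOGLE_APPLICATION_CREDENTIALS` environment variable at your service-account key is the usual approach for the Google Cloud client libraries (the file path below is illustrative):

```shell
# Point the Google Cloud client libraries at your service-account key.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/my-service-account.json"
```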

Running with Docker

  1. Build the Docker image:
docker build -t my-bigquery-script .
  2. Run the container:
docker run --rm my-bigquery-script --dataset_id your_dataset_id --staging_table_id your_staging_table_id --prod_table_id your_prod_table_id --local
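Note that the credentials file is not available inside the container by default. One common pattern (a sketch with illustrative paths, not something the Dockerfile necessarily configures) is to mount the key read-only and point the client libraries at it:

```shell
# Mount the service-account key into the container and tell the
# Google Cloud client libraries where to find it (paths are illustrative).
docker run --rm \
  -v "$HOME/keys/my-service-account.json:/app/key.json:ro" \
  -e GOOGLE_APPLICATION_CREDENTIALS=/app/key.json \
  my-bigquery-script \
  --dataset_id your_dataset_id \
  --staging_table_id your_staging_table_id \
  --prod_table_id your_prod_table_id \
  --local
```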

Deploying to Google Cloud Run

  1. Use the provided cloudbuild.yaml to build and push the Docker image to Artifact Registry.
  2. Deploy the Cloud Run job using the Cloud Build steps or manually with gcloud commands.
  3. Use workflow.yaml to orchestrate parallel merges via Google Cloud Workflows.
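A Cloud Workflows definition that fans out one Cloud Run job execution per table might look roughly like this (the project, region, job name, and table-list shape are all placeholders; the repository's actual workflow.yaml may differ):

```yaml
# Sketch: run the merge job once per table, in parallel.
main:
  params: [args]
  steps:
    - mergeAll:
        parallel:
          for:
            value: t
            in: ${args.tables}
            steps:
              - runMergeJob:
                  call: googleapis.run.v2.projects.locations.jobs.run
                  args:
                    name: projects/my-project/locations/us-central1/jobs/loc-prodifier
                    body:
                      overrides:
                        containerOverrides:
                          - args:
                              - ${"--dataset_id=" + t.dataset}
                              - ${"--staging_table_id=" + t.staging}
                              - ${"--prod_table_id=" + t.prod}
```

The `parallel: for:` block is what lets Workflows launch the per-table job executions concurrently instead of sequentially.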

Project Structure

loc_prodifier/
β”œβ”€β”€ cloudbuild.yaml        # Cloud Build configuration for building and deploying
β”œβ”€β”€ Dockerfile             # Docker image build instructions
β”œβ”€β”€ gcputils/              # Google Cloud utility submodule (BigQuery, Storage, Logging, Secrets)
β”‚   β”œβ”€β”€ BigQueryClient.py  # BigQuery client wrapper
β”‚   β”œβ”€β”€ gcpclient.py       # Google Cloud Storage client
β”‚   β”œβ”€β”€ GoogleCloudLogging.py # Cloud Logging client
β”‚   β”œβ”€β”€ GoogleSecretManager.py # Secret Manager client
β”‚   └── ...
β”œβ”€β”€ loc_prodifier.py       # Main script for merging tables
β”œβ”€β”€ readme-prodifier.md    # Original README content
β”œβ”€β”€ requirements.txt       # Python dependencies
└── workflow.yaml          # Cloud Workflows definition for parallel execution

Future Work / Roadmap

  • Add support for configurable merge conditions and update clauses.
  • Enhance error handling and logging integration.
  • Provide more detailed usage examples and automated tests.
  • Expand support for additional data sources or cloud providers.
  • Implement monitoring and alerting for workflow executions.

Note: Some documentation and features are inferred based on available code and files.
