Document Parser: A Python API for Document Processing

GitHub repo: https://github.com/justin-napolitano/document-parser

Document Parser is a document parsing pipeline that processes various document types, tokenizes and chunks their content, extracts entities, relationships, and citations, and stores the results in a database. It builds on the Unstructured library and exposes an API for document partitioning.

Features

  • Parses multiple document formats, including PDFs, Word documents, HTML, emails, and images.
  • Tokenizes and chunks documents for downstream processing.
  • Extracts entities, relationships, and citations from documents.
  • Integrates with the Unstructured library for document partitioning.
  • Exposes an API for document processing, with support for form parameters.
  • Handles gzip-compressed files and performs content type detection and validation.
  • Includes a Docker Compose setup for running the Unstructured service.
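
The gzip handling and content type validation listed above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation (which lives in filetypes.py): the names `maybe_gunzip`, `guess_content_type`, `validate`, and the `ALLOWED_TYPES` set are hypothetical and chosen only for this example.

```python
import gzip
import mimetypes
from typing import Tuple

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream

# Hypothetical allow-list for illustration; not the project's real list.
ALLOWED_TYPES = {"application/pdf", "text/html", "text/plain"}


def maybe_gunzip(data: bytes) -> bytes:
    """Return decompressed bytes if data starts with the gzip magic, else data unchanged."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


def guess_content_type(filename: str) -> str:
    """Guess a MIME type from the filename; a trailing .gz is reported as an encoding."""
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type or "application/octet-stream"


def validate(filename: str, data: bytes) -> Tuple[str, bytes]:
    """Reject unsupported types, then return (content_type, uncompressed_bytes)."""
    content_type = guess_content_type(filename)
    if content_type not in ALLOWED_TYPES:
        raise ValueError(f"Unsupported content type: {content_type}")
    return content_type, maybe_gunzip(data)
```

Checking the magic bytes instead of trusting the file extension means a gzip payload uploaded under a plain name is still decompressed, while an uncompressed file passes through untouched.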

Tech Stack

  • Python 3.8 or higher
  • FastAPI for API implementation
  • Unstructured library for document partitioning
  • pypdf for PDF manipulation
  • Pandas for data handling
  • Docker for containerization
  • pytest for testing

Getting Started

Prerequisites

  • Python 3.8 or higher
  • Docker and Docker Compose (for running the Unstructured service)

Installation

  1. Clone the repository:
git clone https://github.com/justin-napolitano/document-parser.git
cd document-parser/unstructured-api
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt

Running the Unstructured Service

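Start the Unstructured document parser service with Docker Compose. Run the command from the directory that contains docker-compose.yml (the repository root, per the project structure):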

docker-compose up -d

This starts the service on port 9000.
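
Because `docker-compose up -d` returns before the container is necessarily ready, it can help to wait until the port accepts connections. The helper below is a sketch using only the standard library; `wait_for_port` is a hypothetical name, and an open port shows only that something is listening, not that the service has finished initializing.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means some process is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # not up yet; retry shortly
    return False
```

Calling `wait_for_port("localhost", 9000)` after starting the service returns `True` once the port is reachable and `False` if nothing answers within the timeout.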

Running the API

Run the FastAPI app from the unstructured-api directory:

uvicorn prepline_general.api.app:app --host 0.0.0.0 --port 8000

The API documentation is then available at http://localhost:8000/general/docs.
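
Once the app is running, a document can be submitted as a multipart form upload. The sketch below uses only the standard library and rests on assumptions taken from the upstream Unstructured API rather than from this README: the route `/general/v0/general`, the file field name `files`, and the `strategy` form parameter. Confirm all three against the docs page above before relying on them. The function names are hypothetical.

```python
import io
import json
import urllib.request
import uuid
from typing import Any, Dict, Optional

# Assumption: route used by the upstream Unstructured API; verify at /general/docs.
API_URL = "http://localhost:8000/general/v0/general"


def build_partition_request(
    filename: str,
    data: bytes,
    fields: Optional[Dict[str, str]] = None,
    url: str = API_URL,
) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying one file plus optional form fields."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    for name, value in (fields or {}).items():
        part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        body.write(part.encode())
    # Assumption: the upload field is named "files".
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    body.write(file_header.encode())
    body.write(data)
    body.write(f"\r\n--{boundary}--\r\n".encode())  # closing boundary
    return urllib.request.Request(
        url,
        data=body.getvalue(),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def partition(filename: str, data: bytes, **fields: str) -> Any:
    """Send the file to the running API and return the decoded JSON response."""
    request = build_partition_request(filename, data, fields)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `partition("sample.pdf", open("sample.pdf", "rb").read(), strategy="fast")` would post the file and return whatever JSON the endpoint produces. Building the request separately from sending it keeps the encoding logic easy to inspect without a live server.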

Running Tests

Run the test suite with pytest:

pytest

Project Structure

.
├── docker-compose.yml          # Docker Compose config for Unstructured service
├── index.md                    # Project goals and overview
├── sample-docs/                # Sample documents for testing
├── scripts/                    # Utility scripts including smoketest
├── unstructured-api/           # Main API and processing code
│   ├── prepline_general/       # Core API modules and utilities
│   │   ├── api/                # FastAPI app, routers, models, utils
│   │   ├── filetypes.py        # File type detection and validation
│   │   ├── models/             # Pydantic models for form parameters
│   │   ├── utils.py            # Helpers for type parsing and conversion
│   │   ├── openapi.py          # Custom OpenAPI schema generation
│   │   ├── general.py          # API endpoints and processing logic
│   │   └── ...
│   ├── LICENSE.md              # Apache 2.0 License
│   ├── README.md               # Unstructured API announcement and info
│   ├── CHANGELOG.md            # Version history and changes
│   └── test_general/           # Tests for API and utilities
└── README.md                   # This file

Future Work / Roadmap

  • Add detailed support for more document types and complex layouts.
  • Improve entity and relationship extraction capabilities.
  • Enhance chunking strategies and support for multipage sections.
  • Add authentication and rate limiting to the API.
  • Provide hosted deployment options and scalability improvements.
  • Expand test coverage and add benchmarks.
  • Improve documentation with usage examples and tutorials.

Assumptions

  • The primary language is Python, based on the code and dependencies.
  • The project is a wrapper around the Unstructured library with added API and utilities.
  • Some details on usage and installation are inferred from typical FastAPI and Docker setups.

License

This project is licensed under the Apache License 2.0, as indicated in unstructured-api/LICENSE.md.
