This project is a document parsing pipeline that processes various document types, tokenizes and chunks their content, extracts entities, relationships, and citations, and stores the results in a database. It builds on the Unstructured library and exposes an API for document partitioning.

Features

Supports parsing of multiple document formats including PDFs, Word documents, HTML, emails, and images.
Tokenization and chunking of documents for downstream processing.
Extraction of entities, relationships, and citations from documents.
Integration with the Unstructured library for document partitioning.
API for document processing that accepts options as form parameters.
Handles compressed files (gzip) and supports content type detection and validation.
Docker-compose setup for running the Unstructured service.
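
To make the chunking feature concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the project's actual implementation: the function name chunk_words, its parameters, and the whitespace tokenization are assumptions chosen for illustration, and a real pipeline would likely use a model-specific tokenizer.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that context is not lost
    at chunk boundaries. (Hypothetical helper, not part of the project.)
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()  # whitespace tokenization, a stand-in for a real tokenizer
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap trades a little redundancy for continuity: a sentence cut at a chunk boundary still appears partly in both neighbouring chunks.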

Tech Stack

Python 3.8 or higher
FastAPI for API implementation
Unstructured library for document partitioning
pypdf for PDF manipulation
Pandas for data handling
Docker for containerization
pytest for testing

Getting Started

Prerequisites

Python 3.8 or higher
Docker and Docker Compose (for running the Unstructured service)

Installation

Clone the repository:

git clone https://github.com/justin-napolitano/document-parser.git
cd document-parser/unstructured-api

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Running the Unstructured Service

Start the Unstructured document parser service using Docker Compose. Run this from the repository root, where docker-compose.yml lives:

docker-compose up -d

This starts the service in the background on port 9000.
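
Because the container starts in the background, it can be useful to confirm that the port accepts connections before sending documents. The following is a minimal sketch using only the standard library. The helper name is_port_open is hypothetical, and the check only proves that something is listening on the port, not that the service is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # Port 9000 is the port named above; adjust if docker-compose.yml maps another one.
    print("service reachable:", is_port_open("localhost", 9000))
```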

Running the API

Run the FastAPI app:

uvicorn prepline_general.api.app:app --host 0.0.0.0 --port 8000

The API documentation will be available at http://localhost:8000/general/docs.
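
Once the app is running, a document can be submitted as a multipart form upload. The sketch below uses only the standard library and rests on several assumptions that should be checked against the docs page: the endpoint path /general/v0/general, the file field name "files", and the "strategy" form parameter are taken from the upstream Unstructured API conventions and are not confirmed by this repository. The helper names build_multipart and partition_file are hypothetical.

```python
import urllib.request
import uuid
from typing import Dict, Tuple

# Assumed endpoint path; confirm it on the docs page served by the app.
API_URL = "http://localhost:8000/general/v0/general"


def build_multipart(fields: Dict[str, str], filename: str, content: bytes,
                    content_type: str = "application/octet-stream") -> Tuple[bytes, str]:
    """Encode form fields plus one file part (field name "files", an assumption)
    as multipart/form-data. Returns (body, Content-Type header value)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        parts.append(value.encode() + b"\r\n")
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'.encode()
    )
    parts.append(f"Content-Type: {content_type}\r\n\r\n".encode())
    parts.append(content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())  # closing boundary
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def partition_file(path: str, strategy: str = "auto") -> bytes:
    """POST a local file to the API and return the raw response body."""
    with open(path, "rb") as fh:
        body, ctype = build_multipart({"strategy": strategy}, path, fh.read())
    # Supplying data makes urllib issue a POST request.
    request = urllib.request.Request(API_URL, data=body, headers={"Content-Type": ctype})
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling partition_file on one of the files in sample-docs/ should return the partitioned elements if the assumptions above hold. In practice a client library such as requests would replace the hand-built multipart body.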

Running Tests

Run the test suite with pytest:

pytest

Project Structure

.
├── docker-compose.yml          # Docker Compose config for Unstructured service
├── index.md                   # Project goals and overview
├── sample-docs/               # Sample documents for testing
├── scripts/                   # Utility scripts including smoketest
├── unstructured-api/          # Main API and processing code
│   ├── prepline_general/      # Core API modules and utilities
│   │   ├── api/               # FastAPI app, routers, models, utils
│   │   ├── filetypes.py       # File type detection and validation
│   │   ├── models/            # Pydantic models for form parameters
│   │   ├── utils.py           # Helpers for type parsing and conversion
│   │   ├── openapi.py         # Custom OpenAPI schema generation
│   │   ├── general.py         # API endpoints and processing logic
│   │   └── ...
│   ├── LICENSE.md             # Apache 2.0 License
│   ├── README.md              # Unstructured API announcement and info
│   ├── CHANGELOG.md           # Version history and changes
│   └── test_general/          # Tests for API and utilities
└── README.md                  # This file
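
The gzip handling and content type validation mentioned under Features can be illustrated with a short standard-library sketch. This is not the code in filetypes.py: the function names and the allow-list below are hypothetical, and the real module may detect types differently (for example by inspecting file contents instead of filenames).

```python
import gzip
import mimetypes
from typing import Optional, Tuple

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream

# Illustrative allow-list only; the project's supported types may differ.
ALLOWED_TYPES = {"application/pdf", "text/html", "text/plain"}


def maybe_gunzip(data: bytes) -> Tuple[bytes, bool]:
    """Return (payload, was_compressed), decompressing only if the gzip magic is present."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data), True
    return data, False


def guess_content_type(filename: str) -> Optional[str]:
    """Guess a MIME type from the filename, ignoring a trailing .gz suffix."""
    if filename.lower().endswith(".gz"):
        filename = filename[:-3]
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


def validate(filename: str) -> str:
    """Return the guessed MIME type, or raise ValueError if it is not allowed."""
    mime = guess_content_type(filename)
    if mime not in ALLOWED_TYPES:
        raise ValueError(f"unsupported content type: {mime!r}")
    return mime
```

Checking the magic bytes before decompressing avoids trusting the file extension, which an uploader can set arbitrarily.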

Future Work / Roadmap

Add detailed support for more document types and complex layouts.
Improve entity and relationship extraction capabilities.
Enhance chunking strategies and support for multipage sections.
Add authentication and rate limiting to the API.
Provide hosted deployment options and scalability improvements.
Expand test coverage and add benchmarks.
Improve documentation with usage examples and tutorials.

Assumptions

The primary language is Python, judging by the code and dependencies.
The project wraps the Unstructured library and adds an API layer and utilities.
Some details on usage and installation are inferred from typical FastAPI and Docker setups.

License

This project is licensed under the Apache License 2.0, as indicated in unstructured-api/LICENSE.md.