A document parsing pipeline that processes multiple document types, tokenizes and chunks their content, extracts entities, relationships, and citations, and stores the results in a database. It builds on the Unstructured library and exposes an API for document partitioning.
Features
- Supports parsing of multiple document formats including PDFs, Word documents, HTML, emails, and images.
- Tokenization and chunking of documents for downstream processing.
- Extraction of entities, relationships, and citations from documents.
- Integration with the Unstructured library for document partitioning.
- API interface for document processing with support for form parameters.
- Handles compressed files (gzip) and supports content type detection and validation.
- Docker Compose setup for running the Unstructured service.
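As a rough illustration of the gzip handling, content type detection, and chunking features listed above, the sketch below uses only the Python standard library. It is not the project's implementation: the function names `maybe_decompress`, `guess_content_type`, `is_allowed`, and `chunk_tokens`, the `ALLOWED_TYPES` set, and the whitespace tokenization are all assumptions made for the example.

```python
import gzip
import mimetypes
from typing import List, Optional

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

# Hypothetical allow-list for illustration; the real service may accept other types.
ALLOWED_TYPES = {"application/pdf", "text/html", "text/plain"}


def maybe_decompress(data: bytes) -> bytes:
    """Return decompressed bytes if data starts with the gzip magic number."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


def guess_content_type(filename: str) -> Optional[str]:
    """Guess a MIME type from the file name alone."""
    # mimetypes treats ".gz" as an encoding, so "report.pdf.gz" maps to
    # "application/pdf" with encoding "gzip".
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type


def is_allowed(filename: str) -> bool:
    """Validate the guessed type against the (assumed) allow-list."""
    return guess_content_type(filename) in ALLOWED_TYPES


def chunk_tokens(text: str, max_tokens: int = 200, overlap: int = 20) -> List[str]:
    """Split text into overlapping chunks of whitespace-separated tokens."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

For example, `chunk_tokens("a b c d e", max_tokens=3, overlap=1)` yields two overlapping chunks, `"a b c"` and `"c d e"`. A production pipeline would likely use a model-specific tokenizer rather than whitespace splitting.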
Tech Stack
- Python 3.8 or higher
- FastAPI for API implementation
- Unstructured library for document partitioning
- pypdf for PDF manipulation
- Pandas for data handling
- Docker for containerization
- pytest for testing
Getting Started
Prerequisites
- Python 3.8 or higher
- Docker and Docker Compose (for running the Unstructured service)
Installation
- Clone the repository:
git clone https://github.com/justin-napolitano/document-parser.git
cd document-parser/unstructured-api
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
Running the Unstructured Service
Start the Unstructured document parser service using Docker Compose:
docker-compose up -d
This will start the service on port 9000.
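To confirm that something is listening on that port before sending documents, a small standard-library check such as the following can help. The helper name `is_port_open` and the default host are assumptions for this sketch; it only tests that a TCP connection succeeds, not that the service is healthy.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 9000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_port_open()` with the defaults checks `localhost:9000`; pass a different host or port if the Compose file maps the service elsewhere.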
Running the API
Run the FastAPI app:
uvicorn prepline_general.api.app:app --host 0.0.0.0 --port 8000
The API documentation will be available at http://localhost:8000/general/docs.
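Once the app is running, a document can be submitted as a multipart form upload. The sketch below uses only the standard library and rests on several assumptions that should be checked against the generated docs: the endpoint path `/general/v0/general`, the file field name `files`, and the `strategy` form parameter follow the upstream Unstructured API and are not confirmed by this README. The helper names `build_multipart` and `partition_file` are hypothetical.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path
from typing import Dict, Tuple

# Assumed endpoint path; verify it in the API docs before relying on it.
API_URL = "http://localhost:8000/general/v0/general"


def build_multipart(filename: str, content: bytes, fields: Dict[str, str]) -> Tuple[bytes, str]:
    """Encode one file plus plain form fields as multipart/form-data."""
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts = []
    for name, value in fields.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        parts.append(header.encode("utf-8") + str(value).encode("utf-8") + b"\r\n")
    # "files" is the assumed name of the upload field.
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
        f'Content-Type: {file_type}\r\n\r\n'
    )
    parts.append(file_header.encode("utf-8") + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def partition_file(path: str, url: str = API_URL, **fields: str):
    """POST a local file to the API and return the decoded JSON response."""
    source = Path(path)
    body, content_type = build_multipart(source.name, source.read_bytes(), fields)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With the server running, a call might look like `partition_file("sample-docs/example.pdf", strategy="fast")`, where the file name is a placeholder and the response shape depends on the API version.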
Running Tests
Run the test suite with pytest:
pytest
Project Structure
.
├── docker-compose.yml          # Docker Compose config for Unstructured service
├── index.md                    # Project goals and overview
├── sample-docs/                # Sample documents for testing
├── scripts/                    # Utility scripts including smoketest
├── unstructured-api/           # Main API and processing code
│   ├── prepline_general/       # Core API modules and utilities
│   │   ├── api/                # FastAPI app, routers, models, utils
│   │   ├── filetypes.py        # File type detection and validation
│   │   ├── models/             # Pydantic models for form parameters
│   │   ├── utils.py            # Helpers for type parsing and conversion
│   │   ├── openapi.py          # Custom OpenAPI schema generation
│   │   ├── general.py          # API endpoints and processing logic
│   │   └── ...
│   ├── LICENSE.md              # Apache 2.0 License
│   ├── README.md               # Unstructured API announcement and info
│   ├── CHANGELOG.md            # Version history and changes
│   └── test_general/           # Tests for API and utilities
└── README.md                   # This file
Future Work / Roadmap
- Add detailed support for more document types and complex layouts.
- Improve entity and relationship extraction capabilities.
- Enhance chunking strategies and support for multipage sections.
- Add authentication and rate limiting to the API.
- Provide hosted deployment options and scalability improvements.
- Expand test coverage and add benchmarks.
- Improve documentation with usage examples and tutorials.
Assumptions
- Primary language is Python based on code and dependencies.
- The project is a wrapper around the Unstructured library with added API and utilities.
- Some details on usage and installation are inferred from typical FastAPI and Docker setups.
License
This project is licensed under the Apache License 2.0; see unstructured-api/LICENSE.md.