A high-performance FastAPI service for generating text embeddings using SentenceTransformers, specifically designed for processing legal documents and search queries. The service efficiently handles both short search queries and lengthy court opinions, generating semantic embeddings that can be used for document similarity matching and semantic search applications. It includes support for GPU acceleration when available.
The service is optimized to handle two main use cases:
- Embedding search queries: Quick, CPU-based processing for short search queries
- Embedding court opinions: GPU-accelerated processing for longer legal documents, with intelligent text chunking to maintain context
- Specialized text embedding generation for legal documents using the `sentence-transformers/all-mpnet-base-v2` model
- Intelligent text chunking optimized for court opinions, based on sentence boundaries
- Dedicated CPU-based processing for search queries, ensuring fast response times
- GPU acceleration support for processing lengthy court opinions
- Batch processing capabilities for multiple documents
- Comprehensive text preprocessing and cleaning tailored for legal text
- Health check endpoint
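The sentence-boundary chunking described above can be sketched roughly as follows. This is an illustrative approximation, not the service's actual implementation: the `chunk_text` helper, its greedy sentence-packing logic, and the naive sentence splitter are all assumptions.

```python
import re
from typing import List

MAX_WORDS = 350  # mirrors the documented MAX_WORDS default

def chunk_text(text: str, max_words: int = MAX_WORDS) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_words words."""
    # Naive split on terminal punctuation; real legal text needs something
    # smarter (abbreviations like "v." or "U.S." would break this).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            # Current chunk is full; start a new one at a sentence boundary.
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Keeping chunk boundaries on sentence boundaries preserves local context, so each chunk embeds as a coherent unit rather than cutting a sentence in half.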
This project uses Poetry for dependency management. To get started:

- Install Poetry:

```bash
curl -sSL https://install.python-poetry.org | python3 -
```

- Clone the repository and install dependencies:

```bash
git clone <repository-url>
cd inception
poetry install
```
The easiest way to run the embedding service is using Docker:
```bash
docker run -d -p 8005:8005 freelawproject/inception:latest
```
To handle more concurrent tasks, increase the number of workers:
```bash
docker run -d -p 8005:8005 -e EMBEDDING_WORKERS=4 freelawproject/inception:latest
```
Test that the service is running:

```bash
curl http://localhost:8005
# Should return: "Heartbeat detected."
```
The service includes a Python client for easy integration:
```python
from examples.client_example import EmbeddingClient

# Initialize the client
client = EmbeddingClient("http://localhost:8005")

# Get an embedding for a query
query_embedding = client.get_query_embedding("What is copyright infringement?")

# Get embeddings for a document
doc_embeddings = client.get_document_embedding("The court finds that...")

# Process multiple documents
batch_results = client.get_batch_embeddings([
    {"id": 1, "text": "First document..."},
    {"id": 2, "text": "Second document..."},
])
```
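Once you have embeddings back, a common next step is ranking documents against a query by cosine similarity. Below is a minimal, pure-Python sketch; it assumes embeddings arrive as plain lists of floats, and the `rank_documents` helper is an illustration, not part of the client's API.

```python
import math
from typing import Iterable, List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_documents(
    query_emb: List[float],
    doc_embs: Iterable[Tuple[int, List[float]]],
) -> List[Tuple[int, float]]:
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_emb, emb)) for doc_id, emb in doc_embs]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In practice you would pass the query embedding from `get_query_embedding` and the per-document embeddings from `get_batch_embeddings`, keeping each document's `id` alongside its vector.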
Install client requirements:
```bash
pip install -r examples/requirements.txt
```
See DEVELOPING.md for more examples and detailed usage.
Generate embeddings for search queries (CPU-optimized):
```bash
curl 'http://localhost:8005/api/v1/embed/query' \
  -X 'POST' \
  -H 'Content-Type: application/json' \
  -d '{"text": "What are the requirements for copyright infringement?"}'
```
Generate embeddings for court opinions or legal documents (GPU-accelerated when available):
```bash
curl 'http://localhost:8005/api/v1/embed/text' \
  -X 'POST' \
  -H 'Content-Type: text/plain' \
  -d 'The court finds that the defendant...'
```
Process multiple documents in one request:
```bash
curl 'http://localhost:8005/api/v1/embed/batch' \
  -X 'POST' \
  -H 'Content-Type: application/json' \
  -d '{
    "documents": [
      {"id": 1, "text": "First court opinion..."},
      {"id": 2, "text": "Second court opinion..."}
    ]
  }'
```
The service can be configured through environment variables or a `.env` file. Copy `.env.example` to `.env` to get started:

```bash
cp .env.example .env
```
Model Settings:

- `TRANSFORMER_MODEL_NAME`: Model to use (default: "sentence-transformers/all-mpnet-base-v2")
- `MAX_WORDS`: Maximum words per chunk (default: 350)

Server Settings:

- `HOST`: Server host (default: "0.0.0.0")
- `PORT`: Server port (default: 8005)
- `EMBEDDING_WORKERS`: Number of Gunicorn workers (default: 4)

GPU Settings:

- `FORCE_CPU`: Force CPU usage even if GPU is available (default: false)

Monitoring:

- `SENTRY_DSN`: Sentry DSN for error tracking (optional)
- `ENABLE_METRICS`: Enable Prometheus metrics (default: true)

CORS Settings:

- `ALLOWED_ORIGINS`: Comma-separated list of allowed origins
- `ALLOWED_METHODS`: Comma-separated list of allowed methods
- `ALLOWED_HEADERS`: Comma-separated list of allowed headers

See `.env.example` for a complete list of configuration options.
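As a starting point, a `.env` might look like the following. The values shown are the documented defaults; `SENTRY_DSN` and the CORS variables are left commented out because they have no documented defaults.

```bash
# .env — example configuration using the documented defaults
TRANSFORMER_MODEL_NAME=sentence-transformers/all-mpnet-base-v2
MAX_WORDS=350
HOST=0.0.0.0
PORT=8005
EMBEDDING_WORKERS=4
FORCE_CPU=false
ENABLE_METRICS=true
# SENTRY_DSN=
# ALLOWED_ORIGINS=
```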
For development setup and testing instructions, see DEVELOPING.md.
We welcome contributions to improve the embedding service!
- For development setup, see DEVELOPING.md
- For submitting changes, see SUBMITTING.md
Please ensure you:
- Follow the existing code style
- Add tests for new features
- Update documentation as needed
- Test thoroughly using provided tools:
```bash
# Run tests
docker-compose -f docker-compose.dev.yml up test

# Test endpoints
./test_service.sh

# Test Python client
python examples/client_example.py
```
The service includes several monitoring endpoints:
- `/health`: Health check endpoint providing service status and GPU information
- `/metrics`: Prometheus metrics endpoint for monitoring request counts and processing times
Example health check:
```bash
curl http://localhost:8005/health
```
Example metrics:
```bash
curl http://localhost:8005/metrics
```
- Python 3.8+
- CUDA-compatible GPU (highly recommended for embedding long texts)