Add media description feature using Azure Content Understanding #2195

Open · wants to merge 6 commits into base: main
1 change: 1 addition & 0 deletions .azdo/pipelines/azure-dev.yml
@@ -120,6 +120,7 @@ steps:
      DEPLOYMENT_TARGET: $(DEPLOYMENT_TARGET)
      AZURE_CONTAINER_APPS_WORKLOAD_PROFILE: $(AZURE_CONTAINER_APPS_WORKLOAD_PROFILE)
      USE_CHAT_HISTORY_BROWSER: $(USE_CHAT_HISTORY_BROWSER)
      USE_MEDIA_DESCRIBER_AZURE_CU: $(USE_MEDIA_DESCRIBER_AZURE_CU)
  - task: AzureCLI@2
    displayName: Deploy Application
    inputs:
1 change: 1 addition & 0 deletions .github/workflows/azure-dev.yml
@@ -103,6 +103,7 @@ jobs:
      DEPLOYMENT_TARGET: ${{ vars.DEPLOYMENT_TARGET }}
      AZURE_CONTAINER_APPS_WORKLOAD_PROFILE: ${{ vars.AZURE_CONTAINER_APPS_WORKLOAD_PROFILE }}
      USE_CHAT_HISTORY_BROWSER: ${{ vars.USE_CHAT_HISTORY_BROWSER }}
      USE_MEDIA_DESCRIBER_AZURE_CU: ${{ vars.USE_MEDIA_DESCRIBER_AZURE_CU }}
    steps:
      - name: Checkout
        uses: actions/checkout@v4
4 changes: 3 additions & 1 deletion CONTRIBUTING.md
@@ -22,7 +22,7 @@
- [Running unit tests](#running-unit-tests)
- [Running E2E tests](#running-e2e-tests)
- [Code Style](#code-style)
- [Adding new azd environment variables](#add-new-azd-environment-variables)
- [Adding new azd environment variables](#adding-new-azd-environment-variables)

## Code of Conduct

@@ -166,6 +166,8 @@

When adding new azd environment variables, please remember to update:

1. [main.parameters.json](./main.parameters.json)

Check failure on line 169 in CONTRIBUTING.md (GitHub Actions / Check Broken Relative Paths): Link ./main.parameters.json is broken.
1. [appEnvVariables in main.bicep](./main.bicep)

Check failure on line 170 in CONTRIBUTING.md (GitHub Actions / Check Broken Relative Paths): Link ./main.bicep is broken.
1. App Service's [azure.yaml](./azure.yaml)
1. [ADO pipeline](.azdo/pipelines/azure-dev.yml)
1. [GitHub workflows](.github/workflows/azure-dev.yml)
4 changes: 3 additions & 1 deletion README.md
@@ -92,7 +92,9 @@
- Azure AI Document Intelligence: S0 (Standard) tier using pre-built layout. Pricing per document page, sample documents have 261 pages total. [Pricing](https://azure.microsoft.com/pricing/details/form-recognizer/)
- Azure AI Search: Basic tier, 1 replica, free level of semantic search. Pricing per hour. [Pricing](https://azure.microsoft.com/pricing/details/search/)
- Azure Blob Storage: Standard tier with ZRS (Zone-redundant storage). Pricing per storage and read operations. [Pricing](https://azure.microsoft.com/pricing/details/storage/blobs/)
- Azure Cosmos DB: Serverless tier. Pricing per request unit and storage. [Pricing](https://azure.microsoft.com/pricing/details/cosmos-db/)
- Azure Cosmos DB: Only provisioned if you enabled [chat history with Cosmos DB](docs/deploy_features.md#enabling-persistent-chat-history-with-azure-cosmos-db). Serverless tier. Pricing per request unit and storage. [Pricing](https://azure.microsoft.com/pricing/details/cosmos-db/)
- Azure AI Vision: Only provisioned if you enabled [GPT-4 with vision](docs/gpt4v.md). Pricing per 1K transactions. [Pricing](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/computer-vision/)

Check failure on line 96 in README.md (GitHub Actions / Check URLs Don't Have Locale): Link https://azure.microsoft.com/en-us/pricing/details/cognitive-services/computer-vision/ has locale.
- Azure AI Content Understanding: Only provisioned if you enabled [media description](docs/deploy_features.md#enabling-media-description-with-azure-content-understanding). Pricing per TODO. [Pricing](TODO)
- Azure Monitor: Pay-as-you-go tier. Costs based on data ingested. [Pricing](https://azure.microsoft.com/pricing/details/monitor/)

To reduce costs, you can switch to free SKUs for various services, but those SKUs have limitations.
19 changes: 14 additions & 5 deletions app/backend/prepdocs.py
@@ -7,6 +7,7 @@
from azure.core.credentials import AzureKeyCredential
from azure.core.credentials_async import AsyncTokenCredential
from azure.identity.aio import AzureDeveloperCliCredential, get_bearer_token_provider
from rich.logging import RichHandler

from load_azd_env import load_azd_env
from prepdocslib.blobmanager import BlobManager
@@ -158,8 +159,10 @@ def setup_file_processors(
    local_pdf_parser: bool = False,
    local_html_parser: bool = False,
    search_images: bool = False,
    use_content_understanding: bool = False,
    content_understanding_endpoint: Union[str, None] = None,
):
    sentence_text_splitter = SentenceTextSplitter(has_image_embeddings=search_images)
    sentence_text_splitter = SentenceTextSplitter()

    doc_int_parser: Optional[DocumentAnalysisParser] = None
    # check if Azure Document Intelligence credentials are provided
@@ -170,6 +173,8 @@
        doc_int_parser = DocumentAnalysisParser(
            endpoint=f"https://{document_intelligence_service}.cognitiveservices.azure.com/",
            credential=documentintelligence_creds,
            use_content_understanding=use_content_understanding,
            content_understanding_endpoint=content_understanding_endpoint,
        )

    pdf_parser: Optional[Parser] = None
@@ -241,8 +246,7 @@ async def main(strategy: Strategy, setup_index: bool = True):

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Prepare documents by extracting content from PDFs, splitting content into sections, uploading to blob storage, and indexing in a search index.",
        epilog="Example: prepdocs.py '.\\data\*' -v",
        description="Prepare documents by extracting content from PDFs, splitting content into sections, uploading to blob storage, and indexing in a search index."
    )
    parser.add_argument("files", nargs="?", help="Files to be processed")

@@ -295,17 +299,18 @@ async def main(strategy: Strategy, setup_index: bool = True):
    args = parser.parse_args()

    if args.verbose:
        logging.basicConfig(format="%(message)s")
        logging.basicConfig(format="%(message)s", datefmt="[%X]", handlers=[RichHandler(rich_tracebacks=True)])
        # We only set the level to INFO for our logger,
        # to avoid seeing the noisy INFO level logs from the Azure SDKs
        logger.setLevel(logging.INFO)
        logger.setLevel(logging.DEBUG)

    load_azd_env()

    use_int_vectorization = os.getenv("USE_FEATURE_INT_VECTORIZATION", "").lower() == "true"
    use_gptvision = os.getenv("USE_GPT4V", "").lower() == "true"
    use_acls = os.getenv("AZURE_ADLS_GEN2_STORAGE_ACCOUNT") is not None
    dont_use_vectors = os.getenv("USE_VECTORS", "").lower() == "false"
    use_content_understanding = os.getenv("USE_MEDIA_DESCRIBER_AZURE_CU", "").lower() == "true"

    # Use the current user identity to connect to Azure services. See infra/main.bicep for role assignments.
    if tenant_id := os.getenv("AZURE_TENANT_ID"):
@@ -403,6 +408,8 @@ async def main(strategy: Strategy, setup_index: bool = True):
            local_pdf_parser=os.getenv("USE_LOCAL_PDF_PARSER") == "true",
            local_html_parser=os.getenv("USE_LOCAL_HTML_PARSER") == "true",
            search_images=use_gptvision,
            use_content_understanding=use_content_understanding,
            content_understanding_endpoint=os.getenv("AZURE_CONTENTUNDERSTANDING_ENDPOINT"),
        )
        image_embeddings_service = setup_image_embeddings_service(
            azure_credential=azd_credential,
@@ -421,6 +428,8 @@
            search_analyzer_name=os.getenv("AZURE_SEARCH_ANALYZER_NAME"),
            use_acls=use_acls,
            category=args.category,
            use_content_understanding=use_content_understanding,
            content_understanding_endpoint=os.getenv("AZURE_CONTENTUNDERSTANDING_ENDPOINT"),
        )

    loop.run_until_complete(main(ingestion_strategy, setup_index=not args.remove and not args.removeall))
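As a quick reference, the following is a minimal sketch (not part of this diff) of how the two new environment toggles threaded through prepdocs.py above would be read. The endpoint value is a placeholder, and the final guard is purely illustrative; prepdocs.py itself does not raise when the endpoint is missing.

import os

# Mirrors the flags read in prepdocs.py in this diff.
use_content_understanding = os.getenv("USE_MEDIA_DESCRIBER_AZURE_CU", "").lower() == "true"
# Placeholder example value: "https://<your-resource>.cognitiveservices.azure.com"
content_understanding_endpoint = os.getenv("AZURE_CONTENTUNDERSTANDING_ENDPOINT")

# Illustrative guard only, not present in the script.
if use_content_understanding and not content_understanding_endpoint:
    raise ValueError("Set AZURE_CONTENTUNDERSTANDING_ENDPOINT when USE_MEDIA_DESCRIBER_AZURE_CU is true")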
2 changes: 1 addition & 1 deletion app/backend/prepdocslib/blobmanager.py
@@ -171,7 +171,7 @@ def sourcepage_from_file_page(cls, filename, page=0) -> str:

    @classmethod
    def blob_image_name_from_file_page(cls, filename, page=0) -> str:
        return os.path.splitext(os.path.basename(filename))[0] + f"-{page}" + ".png"
        return os.path.splitext(os.path.basename(filename))[0] + f"-{page+1}" + ".png"

    @classmethod
    def blob_name_from_file_name(cls, filename) -> str:
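The one-line change above switches generated image blob names from a 0-based to a 1-based page suffix. A standalone sketch of the resulting naming, using an arbitrary example filename:

import os

def image_blob_name(filename: str, page: int = 0) -> str:
    # Same expression as the updated blob_image_name_from_file_page.
    return os.path.splitext(os.path.basename(filename))[0] + f"-{page+1}" + ".png"

print(image_blob_name("data/Benefit_Options.pdf", page=0))  # Benefit_Options-1.png (previously Benefit_Options-0.png)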
109 changes: 109 additions & 0 deletions app/backend/prepdocslib/cu_image.py
@@ -0,0 +1,109 @@
import logging
from typing import Union

import aiohttp
from azure.core.credentials_async import AsyncTokenCredential
from azure.identity.aio import get_bearer_token_provider
from rich.progress import Progress
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed

logger = logging.getLogger("scripts")

CU_API_VERSION = "2024-12-01-preview"

PATH_ANALYZER_MANAGEMENT = "/analyzers/{analyzerId}"
PATH_ANALYZER_MANAGEMENT_OPERATION = "/analyzers/{analyzerId}/operations/{operationId}"

# Define Analyzer inference paths
PATH_ANALYZER_INFERENCE = "/analyzers/{analyzerId}:analyze"
PATH_ANALYZER_INFERENCE_GET_IMAGE = "/analyzers/{analyzerId}/results/{operationId}/images/{imageId}"

analyzer_name = "image_analyzer"
image_schema = {
"analyzerId": analyzer_name,
"name": "Image understanding",
"description": "Extract detailed structured information from images extracted from documents.",
"baseAnalyzerId": "prebuilt-image",
"scenario": "image",
"config": {"returnDetails": False},
"fieldSchema": {
"name": "ImageInformation",
"descriptions": "Description of image.",
"fields": {
"Description": {
"type": "string",
"description": "Description of the image. If the image has a title, start with the title. Include a 2-sentence summary. If the image is a chart, diagram, or table, include the underlying data in an HTML table tag, with accurate numbers. If the image is a chart, describe any axis or legends. The only allowed HTML tags are the table/thead/tr/td/tbody tags.",
},
},
},
}


class ContentUnderstandingManager:

    def __init__(self, endpoint: str, credential: Union[AsyncTokenCredential, str]):
        self.endpoint = endpoint
        self.credential = credential

    async def poll_api(self, session, poll_url, headers):

        @retry(stop=stop_after_attempt(60), wait=wait_fixed(2), retry=retry_if_exception_type(ValueError))
        async def poll():
            async with session.get(poll_url, headers=headers) as response:
                response.raise_for_status()
                response_json = await response.json()
                if response_json["status"] == "Failed":
                    raise Exception("Failed")
                if response_json["status"] == "Running":
                    raise ValueError("Running")
                return response_json

        return await poll()

    async def create_analyzer(self):
        logger.info("Creating analyzer '%s'...", image_schema["analyzerId"])

        token_provider = get_bearer_token_provider(self.credential, "https://cognitiveservices.azure.com/.default")
        token = await token_provider()
        headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
        params = {"api-version": CU_API_VERSION}
        analyzer_id = image_schema["analyzerId"]
        cu_endpoint = f"{self.endpoint}/contentunderstanding/analyzers/{analyzer_id}"
        async with aiohttp.ClientSession() as session:
            async with session.put(url=cu_endpoint, params=params, headers=headers, json=image_schema) as response:
                if response.status == 409:
                    logger.info("Analyzer '%s' already exists.", analyzer_id)
                    return
                elif response.status != 201:
                    data = await response.text()
                    logger.error("Error creating analyzer: %s", data)
                    response.raise_for_status()
                else:
                    poll_url = response.headers.get("Operation-Location")

            with Progress() as progress:
                progress.add_task("Creating analyzer...", total=None, start=False)
                await self.poll_api(session, poll_url, headers)

    async def describe_image(self, image_bytes) -> str:
        logger.info("Sending image to Azure Content Understanding service...")
        async with aiohttp.ClientSession() as session:
            token = await self.credential.get_token("https://cognitiveservices.azure.com/.default")
            headers = {"Authorization": "Bearer " + token.token}
            params = {"api-version": CU_API_VERSION}

            async with session.post(
                url=f"{self.endpoint}/contentunderstanding/analyzers/{analyzer_name}:analyze",
                params=params,
                headers=headers,
                data=image_bytes,
            ) as response:
                response.raise_for_status()
                poll_url = response.headers["Operation-Location"]

            with Progress() as progress:
                progress.add_task("Processing...", total=None, start=False)
                results = await self.poll_api(session, poll_url, headers)

            fields = results["result"]["contents"][0]["fields"]
            return fields["Description"]["valueString"]
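For context, here is a minimal usage sketch of the new ContentUnderstandingManager. It assumes an async Azure credential, a placeholder endpoint (in practice the value of AZURE_CONTENTUNDERSTANDING_ENDPOINT), a placeholder image file, and that it runs from app/backend so the prepdocslib import resolves; it is an illustration, not code from this PR.

import asyncio

from azure.identity.aio import AzureDeveloperCliCredential

from prepdocslib.cu_image import ContentUnderstandingManager


async def run() -> None:
    endpoint = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder endpoint
    async with AzureDeveloperCliCredential() as credential:
        manager = ContentUnderstandingManager(endpoint, credential)
        await manager.create_analyzer()  # returns early (409) if the analyzer already exists
        with open("figure.png", "rb") as image_file:  # placeholder image extracted from a document
            description = await manager.describe_image(image_file.read())
        print(description)


if __name__ == "__main__":
    asyncio.run(run())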