Aspiring Architect: Azure AI Services - A Complete Guide to Building Intelligent Applications with Azure AI Foundry

A while ago i started exploring Azure AI Foundry and ended up going down a rabbit hole of 30+ implementations covering everything from GPT-5 chat to live speech transcription. In this post i will walk you through all the major Azure AI services, what they do, how to implement them, and when to use them — so you don't have to figure it all out the hard way like i did.

You can download git repo and start embedding your Azure OpenAI Service keys in.env file and start executing them as we go along.

README: https://github.com/pratappilaka24/Azure-Foundry-Samples?tab=readme-ov-file#readme

REPO: https://github.com/pratappilaka24/Azure-Foundry-Samples

Our objective is to understand the complete Azure AI Services ecosystem and how you can combine them to build enterprise-grade intelligent applications.

Azure OpenAI - GPT-5 Chat, Vision and Code

This is where most people start, and for good reason. Azure OpenAI gives you access to GPT-5 with enterprise-grade security, regional deployment and SLAs — unlike calling OpenAI directly.

The basic setup is straightforward. You initialize an AzureOpenAI client with your endpoint and API key, define a system role (something like "you are a helpful travel assistant"), pass in user messages and configure temperature and top_p for response behavior. That's it, you are doing conversational AI.

from openai import AzureOpenAI
from azure.core.credentials import AzureKeyCredential
from dotenv import load_dotenv
import os

load_dotenv()
client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2024-12-01-preview"
)

What makes it more interesting is Vision. You can encode an image to base64, pass it as image_url in the message content, and GPT-5 will analyze and explain it — diagrams, screenshots, anything. I used this for code explanation too. Point it at a source file with a "you are a teacher" system prompt and let it stream the explanation back. Really useful for documentation generation and code reviews.

Chat Output:

Image Reading Output:

DALL-E 3 - Generating Images from Text

This one is fun. DALL-E 3 connects to a separate Azure endpoint and lets you describe an image in plain English and get back a 1024x1024 image in seconds. Just provide a detailed text prompt, set your quality and size parameters and download the result.

result = client.images.generate(
    model="dall-e-3",
    prompt="A futuristic cityscape at sunset with Azure cloud symbols",
    size="1024x1024",
    quality="standard",
    n=1
)

Good for marketing materials, UI mockups, design concepts. The more detailed your prompt, the better the output.

Image Generation Output:

Structured Output - Extracting JSON from Documents

This one is underrated. Using OpenAI's function calling capability, you can take something like an unstructured invoice and get back perfectly structured JSON — invoice number, date, vendor, line items, totals, payment method. All of it.

You define a function schema with your properties, pass the function definitions to the chat completion API and the model returns data that matches your schema exactly. No hallucination, no parsing headaches. Perfect for financial processing and form automation workflows.

Azure Computer Vision - OCR, Object Detection and Brand Recognition

Computer Vision gives you deep image understanding without training any models yourself. You load an image as binary content, initialize ImageAnalysisClient and specify which visual features you want — READ for text extraction (OCR), TAGS for classification, CAPTION for description, OBJECTS, PEOPLE, SMART_CROPS.

from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures

client = ImageAnalysisClient(endpoint=endpoint, credential=credential)
result = client.analyze(
    image_data=image_data,
    visual_features=[VisualFeatures.READ, VisualFeatures.TAGS, VisualFeatures.CAPTION]
)

There is also a Brand Recognition mode specifically for detecting logos in images — it returns the brand name and a confidence score. Great for marketing analytics, competitive intelligence and shelf monitoring in retail.

Object detection Output:

Brand Recognition Output:

Azure Face API - Facial Detection and Attributes

The Face API detects human faces in images and extracts attributes like head pose, whether the person is wearing glasses, exposure levels and more. You pass in your image, specify which features you want detected, and get back structured facial attribute data.

Use cases range from access control and security to retail analytics and media analysis. The API generates face IDs in a privacy-respecting way and supports batch processing.

Azure Custom Vision - Train Your Own Classifier

This is where it gets really powerful for domain-specific problems. Custom Vision lets you train your own image classification model without needing any deep ML expertise. You collect images, tag them in the Custom Vision portal, train the model, deploy it to a prediction endpoint, and then call it from your code.

from azure.cognitiveservices.vision.customvision.prediction import CustomVisionPredictionClient

predictor = CustomVisionPredictionClient(endpoint, credentials)
results = predictor.classify_image(project_id, model_name, image_data)

for prediction in results.predictions:
    if prediction.probability > 0.90:
        print(f"{prediction.tag_name}: {prediction.probability:.2%}")

I tested this with a pet classification model and a food/object model. With enough good training images you can hit 95%+ accuracy and sub-100ms inference. No infrastructure to manage, just train and deploy.

Azure Document Intelligence - Invoice and Receipt Processing

This is a game changer for anyone doing document processing. Document Intelligence has pre-built models for invoices, receipts and other document types. You point it at a document URL, it processes it asynchronously, and you get back structured field-level data — vendor name, invoice number, line items, tax, totals — all of it.

from azure.ai.documentintelligence import DocumentIntelligenceClient

client = DocumentIntelligenceClient(endpoint=endpoint, credential=credential)
poller = client.begin_analyze_document("prebuilt-invoice", document_url)
result = poller.result()

95%+ extraction accuracy on standard invoice formats. This kind of thing used to take weeks to build, now you are up and running in an afternoon.

Document Intelligence Output:

Azure Language Service - Text Intelligence

The Language Service is a whole suite of NLP capabilities under one client.

Language Detection identifies which of 120+ languages your text is in. Useful for multi-language applications and routing pipelines.

Key Phrase Extraction pulls out the main topics from any text. Feed it an article and it tells you the key concepts. Good for SEO, content categorization and trend analysis.

Sentiment Analysis goes beyond just positive/negative. It does sentence-level breakdown, opinion mining, and returns confidence scores per sentiment category. You can even do aspect-based sentiment — so for a restaurant review you can get separate sentiment scores for food, service and ambience.

Entity Extraction identifies people, places, organisations, dates, amounts, and job titles in your text. Great for populating CRM data from unstructured sources or building advanced search.

Conversational Language Understanding lets you define custom intents and entities, train a model, and then extract structured information from conversational input. For example:

Utterance: "I need to book a flight from London to Bangalore on 10th December for 4 adults in economy"

Extracted:
- Intent: BookFlight
- From: London
- To: Bangalore  
- Date: 10th December
- Passengers: 4
- Class: Economy

This is the backbone of any chatbot or voice assistant workflow.

Sentiment Analysis Output:

Azure Translator - 100+ Languages

The Translator service is simple and incredibly powerful. One API call, multiple target languages, done.

documents = [InputTextItem(text="早上好，你好吗?")]
response = client.translate(content=documents, to=["en", "it", "fr"], from_parameter="zh-Hans")

Supports 100+ languages, auto-detection, transliteration and custom terminology for enterprise use. If you are building a global application this is the fastest path to multilingual support.

Translation Output:

Azure Speech Service - TTS, STT and Live Transcription

Text-to-Speech uses neural voice synthesis to convert text to natural-sounding audio. You pick a voice (there are 400+ across 140+ languages), configure your speech settings and call it. The en-US-AndrewMultilingualNeural voice is surprisingly natural.

For even more control you use SSML (Speech Synthesis Markup Language) — this lets you control pitch, rate, volume, add pauses, set speaking style (cheerful, professional, etc.) and even specify phoneme-level pronunciation. Full studio-quality control.

Speech-to-Text goes the other direction — audio file in, transcribed text out. Under 5% word error rate on clean audio. Supports real-time recognition and continuous recognition for ongoing streams.

Live Transcription is continuous speech recognition in real time — low latency, partial results as you speak, final results on pause. Good for live captioning, accessibility and interactive voice applications.

Azure Content Safety - Moderation at Scale

Content Safety detects harmful content in both images and text. It covers four categories — self-harm, violence, hate speech and sexual content — and returns a severity score for each. You set your own threshold and build your moderation workflow from there.

request = AnalyzeImageOptions(image=ImageData(content=image_file.read()))
response = safetyClient.analyze_image(request)

request = AnalyzeTextOptions(text=input_text)
response = safetyClient.analyze_text(request)

For any platform dealing with user-generated content this is essential. Reduces manual moderation load significantly.

Content Saftey Output:

Putting It All Together

The real power of Azure AI Services is in combining them. A simple example: Speech → Text → Language Understanding → Trigger Workflow. That is a fully voice-driven automation pipeline built entirely from managed services with no ML infrastructure.

Authentication across all services follows the same pattern:

from azure.core.credentials import AzureKeyCredential
from dotenv import load_dotenv
import os

load_dotenv()
endpoint = os.getenv("AZURE_SERVICE_ENDPOINT")
key = os.getenv("AZURE_SERVICE_KEY")
credential = AzureKeyCredential(key)

Always store keys in .env files, never hardcode them. In production use managed identities instead of API keys altogether.

The 30+ implementations i built in this AIFoundry project cover all of the above end to end. Everything is pay-per-use, auto-scales, no infrastructure to manage and all come with 99.9% SLA. The recommended approach is to start with Azure OpenAI for your foundational needs, layer on the specialised services as required, and combine them for compound intelligence workflows.

Learning and Sharing !

02/04/2026

Azure AI Services - A Complete Guide to Building Intelligent Applications with Azure AI Foundry