JSON Prompting for LLMs: A Practical Guide with Python Coding Examples

JSON Prompting is a technique for structuring instructions to AI models using the JavaScript Object Notation (JSON) format, making prompts clear, explicit, and machine-readable. Unlike traditional text-based prompts, which can leave room for ambiguity and misinterpretation, JSON prompts organize requirements as key-value pairs, arrays, and nested objects, turning vague requests into precise blueprints for the model to follow. This method greatly improves consistency and accuracy—especially for complex or repetitive tasks—by allowing users to specify things like task type, topic, audience, output format, and other parameters in an organized way that language models inherently understand. As AI systems increasingly rely on predictable, structured input for real-world workflows, JSON prompting has become a preferred strategy for generating sharper, more reliable results across major LLMs, including GPT-4, Claude, and Gemini.

In this tutorial, we’ll dive deep into the power of JSON prompting and why it can transform the way you interact with AI models.

We will walk through the benefits of JSON prompting with coding examples, from simple text prompts to structured JSON prompts, and compare their outputs. By the end, you'll clearly see how structured prompts bring precision, consistency, and scalability to your workflows, whether you're generating summaries, extracting data, or building advanced AI pipelines.

Installing the dependencies

pip install openai

import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key: ')

To get an OpenAI API key, visit https://platform.openai.com/settings/organization/api-keys and generate a new key. If you’re a new user, you may need to add billing details and make a minimum payment of $5 to activate API access.

from openai import OpenAI
client = OpenAI()

Structured Prompts Ensure Consistency

Using structured prompts, such as JSON-based formats, forces you to think in terms of fields and values, a true advantage when working with LLMs.

By defining a fixed structure, you eliminate ambiguity and guesswork, ensuring that every response follows a predictable pattern.

Here’s a simple example:

Summarize the following email and list the action items clearly.

Email:
Hi team, let’s finalize the marketing plan by Tuesday. Alice, prepare the draft; Bob, handle the design.

We’ll feed this prompt to the LLM in two ways and then compare the outputs generated by a free-form prompt versus a structured (JSON-based) prompt to observe the difference in clarity and consistency.

Free-Form Prompt

prompt_text = """
Summarize the following email and list the action items clearly.

Email:
Hi team, let's finalize the marketing plan by Tuesday. Alice, prepare the draft; Bob, handle the design.
"""

response_text = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": prompt_text}]
)

text_output = response_text.choices[0].message.content
print(text_output)

Summary:
The team needs to finalize the marketing plan by Tuesday. Alice will prepare the draft, and Bob will handle the design.

Action items:
– Alice: Prepare the draft of the marketing plan by Tuesday.
– Bob: Handle the design by Tuesday.
– Team: Finalize the marketing plan by Tuesday.

JSON Prompt

prompt_json = """
Summarize the following email and return the output strictly in JSON format:

{
  "summary": "short summary of the email",
  "action_items": ["task 1", "task 2", "task 3"],
  "priority": "low | medium | high"
}

Email:
Hi team, let's finalize the marketing plan by Tuesday. Alice, prepare the draft; Bob, handle the design.
"""

response_json = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "You are a precise assistant that always replies in valid JSON."},
        {"role": "user", "content": prompt_json}
    ]
)

json_output = response_json.choices[0].message.content
print(json_output)

{
  "summary": "Finalize the marketing plan by Tuesday; Alice to draft and Bob to handle design.",
  "action_items": [
    "Alice: prepare the draft",
    "Bob: handle the design",
    "Team: finalize the marketing plan by Tuesday"
  ],
  "priority": "medium"
}

In this example, the use of a structured JSON prompt leads to a clear and concise output that is easy to parse and evaluate. By defining fields such as “summary”, “action_items”, and “priority”, the LLM response becomes more consistent and actionable. Instead of generating free-flowing text, which might vary in style and detail, the model provides a predictable structure that eliminates ambiguity. This approach not only improves the readability and reliability of responses but also makes it easier to integrate the output into downstream workflows, such as project trackers, dashboards, or automated email handlers.
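
For instance, once the model replies in this structure, the output can be consumed directly in code. The following minimal sketch reuses the json_output variable from the example above and parses the fields with Python's json module; in practice you may add a retry or repair step for the rare case where the model returns malformed JSON:

import json

# json_output comes from the JSON-prompted call above
try:
    parsed = json.loads(json_output)
except json.JSONDecodeError:
    parsed = None  # fall back, repair, or re-prompt if the model returned malformed JSON

if parsed:
    print("Summary:", parsed["summary"])
    for item in parsed["action_items"]:
        print("TODO:", item)
    print("Priority:", parsed["priority"])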

Users can control the output

When you frame your prompt in JSON, you remove ambiguity from both the instruction and the output. In this example, asking for a market summary, sentiment, opportunities, risks, and a confidence score can yield inconsistent formats when passed as plain text. However, by structuring the request in JSON — with clearly defined fields like “summary”, “sentiment”, “opportunities”, “risks”, and “confidence_score” — the response becomes predictable, machine-friendly, and easier to parse. This consistency ensures that, whether you’re generating content, analyzing reports, or extracting insights, your workflow remains streamlined and reliable, with no surprises — just clean, structured results every time.

Free-Form Prompt

plain_text_prompt = """
Analyze the following market update:

Market Text:
Tesla's Q2 earnings beat expectations due to higher Model Y sales, but rising competition from BYD is a risk.
Apple reported steady revenue growth driven by iPhone sales, but services revenue slightly declined.
Amazon's AWS division continues to dominate cloud computing, though regulatory scrutiny in Europe is increasing.

Generate:
- A 2-line market summary
- Sentiment for each company (positive, negative, neutral)
- Key growth opportunities and risks
- A confidence score from 0 to 10
"""

response_plain = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": plain_text_prompt}]
)

plain_output = response_plain.choices[0].message.content
print(plain_output)

Market summary:
– Earnings updates skew constructive: Tesla beat on Q2 with strong Model Y, Apple grew on iPhone, and AWS remains the cloud leader.
– Offsetting risks include BYD pressure on Tesla, Apple’s services dip, and rising European scrutiny on Amazon.

Sentiment:
– Tesla: Positive
– Apple: Neutral
– Amazon: Positive

Key growth opportunities and risks:
– Tesla
– Opportunities: Sustain Model Y momentum; scale production and pricing to drive volume.
– Risks: Intensifying competition from BYD could pressure share and margins.
– Apple
– Opportunities: Monetize large iPhone base; re-accelerate services via bundles and ecosystem engagement.
– Risks: Services softness; dependence on iPhone for top-line growth.
– Amazon (AWS)
– Opportunities: Leverage leadership to win more enterprise/AI workloads and multi-year commitments.
– Risks: European regulatory scrutiny may lead to fines, compliance costs, or contract/pricing constraints.

Confidence score: 7/10

JSON Prompt

json_prompt = """
Analyze the following market update and return the response in this JSON format:

{
  "summary": "2-line market overview",
  "companies": [
    {
      "name": "string",
      "sentiment": "positive | negative | neutral",
      "opportunities": ["list of opportunities"],
      "risks": ["list of risks"]
    }
  ],
  "confidence_score": "integer (0-10)"
}

Market Text:
Tesla's Q2 earnings beat expectations due to higher Model Y sales, but rising competition from BYD is a risk.
Apple reported steady revenue growth driven by iPhone sales, but services revenue slightly declined.
Amazon's AWS division continues to dominate cloud computing, though regulatory scrutiny in Europe is increasing.
"""

response_json = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "You are a precise assistant that always outputs valid JSON."},
        {"role": "user", "content": json_prompt}
    ]
)

json_output = response_json.choices[0].message.content
print(json_output)

{
  "summary": "Markets saw mixed corporate updates: Tesla beat expectations on strong Model Y sales and AWS maintained cloud leadership.\nHowever, Apple's growth was tempered by softer services revenue while Tesla and AWS face competition and regulatory risks.",
  "companies": [
    {
      "name": "Tesla",
      "sentiment": "positive",
      "opportunities": [
        "Leverage strong Model Y demand to drive revenue and scale production",
        "Sustain earnings momentum from better-than-expected Q2 results"
      ],
      "risks": [
        "Intensifying competition from BYD",
        "Potential price pressure impacting margins"
      ]
    },
    {
      "name": "Apple",
      "sentiment": "neutral",
      "opportunities": [
        "Build on steady iPhone-driven revenue growth",
        "Revitalize Services to reaccelerate growth"
      ],
      "risks": [
        "Slight decline in services revenue",
        "Reliance on iPhone as the primary growth driver"
      ]
    },
    {
      "name": "Amazon (AWS)",
      "sentiment": "positive",
      "opportunities": [
        "Capitalize on cloud leadership to win new enterprise workloads",
        "Expand higher-margin managed services and deepen customer spend"
      ],
      "risks": [
        "Increasing regulatory scrutiny in Europe",
        "Potential compliance costs or operational restrictions"
      ]
    }
  ],
  "confidence_score": 8
}

The free-form prompt produced a useful summary but lacked structure, giving the model too much freedom and making it harder to parse programmatically or integrate into workflows.

In contrast, the JSON-prompted result gave the user full control over the output format, ensuring clean, machine-readable results with distinct fields for summary, sentiment, opportunities, risks, and confidence score. This structured approach not only simplifies downstream processing — for dashboards, automated alerts, or data pipelines — but also guarantees consistency across responses. By defining the fields upfront, users effectively guide the model to deliver exactly what they need, reducing ambiguity and improving reliability.
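
One optional hardening step, not shown in the article's examples, is to validate the returned JSON against a schema before it reaches a dashboard or pipeline. The sketch below assumes the third-party jsonschema package and reuses the json_output variable from the previous call:

import json
from jsonschema import validate, ValidationError  # assumes: pip install jsonschema

# Expected shape of the market-analysis response, mirroring the prompt's JSON template
market_schema = {
    "type": "object",
    "required": ["summary", "companies", "confidence_score"],
    "properties": {
        "summary": {"type": "string"},
        "companies": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "sentiment", "opportunities", "risks"],
            },
        },
        "confidence_score": {"type": "integer", "minimum": 0, "maximum": 10},
    },
}

data = json.loads(json_output)  # json_output from the JSON-prompted call above
try:
    validate(instance=data, schema=market_schema)
    print("Response matches the expected structure.")
except ValidationError as err:
    print("Response failed validation:", err.message)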

Reusable JSON prompt templates unlock scalability, speed, and clean handoffs

By defining structured fields upfront, teams can generate consistent, machine-readable outputs that plug directly into APIs, databases, or apps without manual formatting. This standardization not only accelerates workflows but also ensures reliable, repeatable results, making collaboration and automation seamless across projects.
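
As a rough illustration of that reuse, here is a minimal sketch of a template helper; the function name and field choices are ours, not from the article. It turns a field specification into a JSON prompt so the same scaffold can serve many tasks:

import json

def build_json_prompt(task: str, fields: dict, source_text: str) -> str:
    """Assemble a reusable JSON prompt: task instruction, output template, then the input text."""
    template = json.dumps(fields, indent=2)
    return f"{task}\nReturn the output strictly in this JSON format:\n\n{template}\n\nText:\n{source_text}"

# Example reuse across two different tasks with the same helper
email_prompt = build_json_prompt(
    "Summarize the following email.",
    {"summary": "short summary", "action_items": ["task 1"], "priority": "low | medium | high"},
    "Hi team, let's finalize the marketing plan by Tuesday.",
)
market_prompt = build_json_prompt(
    "Analyze the following market update.",
    {"summary": "2-line overview", "sentiment": "positive | negative | neutral"},
    "Tesla's Q2 earnings beat expectations...",
)
print(email_prompt)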

What is a Voice Agent in AI? Top 9 Voice Agent Platforms to Know (2025)

What is a Voice Agent?

An AI voice agent is a software system that can hold two-way, real-time conversations over the phone or internet (VoIP). Unlike legacy interactive voice response (IVR) trees, voice agents allow free-form speech, handle interruptions (“barge-in”), and can connect to external tools and APIs (e.g., CRMs, schedulers, payment systems) to complete tasks end-to-end.

The Core Pipeline

Automatic Speech Recognition (ASR)

Real-time transcription of incoming audio into text.

Requires streaming ASR with partial hypotheses within ~200–300 ms latency for natural turn-taking.

Language Understanding & Planning (often LLMs + tools)

Maintains dialog state and interprets user intent.

May call APIs, databases, or retrieval systems (RAG) to fetch answers or complete multi-step tasks.

Text-to-Speech (TTS)

Converts the agent’s response back into natural-sounding speech.

Modern TTS systems deliver first audio tokens in ~250 ms, support emotional tone, and allow barge-in handling.

Transport & Telephony Integration

Connects the agent to phone networks (PSTN), VoIP (SIP/WebRTC), and contact center systems.

Often includes DTMF (keypad tone) fallback for compliance-sensitive workflows.
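
To make the pipeline concrete, here is a schematic sketch of a turn-taking loop; it is our own illustration, not code from any particular platform, and asr_stream, llm_plan, and tts_speak are placeholders for real ASR, LLM, and TTS services:

# Schematic voice-agent loop. The asr_stream, llm_plan, and tts_speak helpers are
# placeholders for whatever ASR, LLM, and TTS services you use; real systems run
# these stages concurrently with barge-in handling and stricter latency budgets.
def handle_call(audio_frames, asr_stream, llm_plan, tts_speak, tools):
    dialog_state = {"history": []}
    for partial in asr_stream(audio_frames):          # 1. streaming ASR yields partial transcripts
        if not partial.is_final:
            continue                                  # wait for an end-of-turn hypothesis
        dialog_state["history"].append({"role": "user", "content": partial.text})

        plan = llm_plan(dialog_state)                 # 2. LLM interprets intent and may request a tool
        if plan.tool_call:
            result = tools[plan.tool_call.name](**plan.tool_call.args)   # e.g. CRM lookup, scheduler
            dialog_state["history"].append({"role": "tool", "content": str(result)})
            plan = llm_plan(dialog_state)             # re-plan with the tool result as context

        dialog_state["history"].append({"role": "assistant", "content": plan.reply})
        yield from tts_speak(plan.reply)              # 3. TTS streams audio back to the caller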

Why Voice Agents Now?

A few trends explain their sudden viability:

Higher-quality ASR and TTS: Near-human transcription accuracy and natural-sounding synthetic voices.

Real-time LLMs: Models that can plan, reason, and generate responses with sub-second latency.

Improved endpointing: Better detection of turn-taking, interruptions, and phrase boundaries.

Together, these make conversations smoother and more human-like—leading enterprises to adopt voice agents for call deflection, after-hours coverage, and automated workflows.

How Voice Agents Differ from Assistants

Many confuse voice assistants (e.g., smart speakers) with voice agents. The difference:

Assistants answer questions → primarily informational.

Agents take action → perform real tasks via APIs and workflows (e.g., rescheduling an appointment, updating a CRM, processing a payment).

Top 9 AI Voice Agent Platforms (Voice-Capable)

Here is a list of leading platforms helping developers and enterprises build production-grade voice agents:

OpenAI Voice Agents – Low-latency, multimodal API for building realtime, context-aware AI voice agents.

Google Dialogflow CX – Robust dialog management platform with deep Google Cloud integration and multichannel telephony.

Microsoft Copilot Studio – No-code/low-code agent builder for Dynamics, CRM, and Microsoft 365 workflows.

Amazon Lex – AWS-native conversational AI for building voice and chat interfaces, with cloud contact center integration.

Deepgram Voice AI Platform – Unified platform for streaming speech-to-text, TTS, and agent orchestration, designed for enterprise use.

Voiceflow – Collaborative agent design and operations platform for voice, web, and chat agents.

Vapi – Developer-first API to build, test, and deploy advanced voice AI agents with high configurability.

Retell AI – Comprehensive tooling for designing, testing, and deploying production-grade call center AI agents.

VoiceSpin – Contact-center solution with inbound and outbound AI voice bots, CRM integrations, and omnichannel messaging.

Conclusion

Voice agents have moved far beyond interactive voice response (IVR) systems. Today’s production systems integrate streaming ASR, tool-using planners (LLMs), and low-latency TTS to carry out tasks instead of just routing calls.

When selecting a platform, organizations should consider:

Integration surface (telephony, CRM, APIs)

Latency envelope (sub-second turn-taking vs. batch responses)

Operations needs (testing, analytics, compliance)

Native RAG vs. Agentic RAG: Which Approach Advances Enterprise AI Decision-Making?

Retrieval-Augmented Generation (RAG) has emerged as a cornerstone technique for enhancing Large Language Models (LLMs) with real-time, domain-specific knowledge. But the landscape is rapidly shifting—today, the most common implementations are “Native RAG” pipelines, and a new paradigm called “Agentic RAG” is redefining what’s possible in AI-powered information synthesis and decision support.

Native RAG: The Standard Pipeline

Architecture

A Native RAG pipeline harnesses retrieval and generation-based methods to answer complex queries while ensuring accuracy and relevance. The pipeline typically involves:

Query Processing & Embedding: The user’s question is rewritten, if needed, embedded into a vector representation using an LLM or dedicated embedding model, and prepared for semantic search.

Retrieval: The system searches a vector database or document store, identifying top-k relevant chunks using similarity metrics (cosine, Euclidean, dot product). Efficient ANN algorithms optimize this stage for speed and scalability.

Reranking: Retrieved results are reranked based on relevance, recency, domain-specificity, or user preference. Reranking models—ranging from rule-based to fine-tuned ML systems—prioritize the highest-quality information.

Synthesis & Generation: The LLM synthesizes the reranked information to generate a coherent, context-aware response for the user.
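
To make the retrieval step concrete, here is a minimal sketch of top-k selection by cosine similarity; it is a toy example with random vectors standing in for real embeddings, whereas a production pipeline would use an embedding model, a vector database, and an ANN index:

import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=3):
    """Rank document chunks by cosine similarity to the query embedding and return the top-k indices."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Toy example with random "embeddings"; in a real pipeline these come from an embedding model
rng = np.random.default_rng(0)
chunks = ["chunk A", "chunk B", "chunk C", "chunk D"]
doc_vecs = rng.normal(size=(4, 8))
query_vec = rng.normal(size=8)

idx, scores = cosine_top_k(query_vec, doc_vecs, k=2)
for i, s in zip(idx, scores):
    print(f"{chunks[i]} (score={s:.3f})")  # retrieved context to rerank and pass to the LLM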

Common Optimizations

Recent advances include dynamic reranking (adjusting depth by query complexity), fusion-based strategies that aggregate rankings from multiple queries, and hybrid approaches that combine semantic partitioning with agent-based selection for optimal retrieval robustness and latency.

Agentic RAG: Autonomous, Multi-Agent Information Workflows

What Is Agentic RAG?

Agentic RAG is an agent-based approach to RAG, leveraging multiple autonomous agents to answer questions and process documents in a highly coordinated fashion. Rather than a single retrieval/generation pipeline, Agentic RAG structures its workflow for deep reasoning, multi-document comparison, planning, and real-time adaptability.

Key Components

Document Agent – Each document is assigned its own agent, able to answer queries about the document and perform summary tasks, working independently within its scope.

Meta-Agent – Orchestrates all document agents, managing their interactions, integrating outputs, and synthesizing a comprehensive answer or action.
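
A schematic sketch of this split is shown below; the class and method names are ours, and the stub LLM callable stands in for a real per-document RAG pipeline:

# Schematic sketch of the document-agent / meta-agent split. The answer() calls are
# placeholders for per-document RAG pipelines backed by an LLM; real orchestration
# also handles planning, tool use, and iterative refinement.
class DocumentAgent:
    def __init__(self, doc_id, text, llm):
        self.doc_id, self.text, self.llm = doc_id, text, llm

    def answer(self, question):
        # Scope the model strictly to this document's content
        prompt = f"Answer using only this document:\n{self.text}\n\nQuestion: {question}"
        return self.llm(prompt)

class MetaAgent:
    def __init__(self, doc_agents, llm):
        self.doc_agents, self.llm = doc_agents, llm

    def answer(self, question):
        partials = {a.doc_id: a.answer(question) for a in self.doc_agents}  # fan out to document agents
        merged = "\n".join(f"[{d}] {p}" for d, p in partials.items())
        return self.llm(f"Synthesize one answer from these per-document findings:\n{merged}\n\nQuestion: {question}")

# Usage with a stub LLM callable; swap in a real model client
fake_llm = lambda prompt: f"(model output for: {prompt[:40]}...)"
agents = [DocumentAgent("spec_a", "Widget A supports 10 GbE.", fake_llm),
          DocumentAgent("spec_b", "Widget B supports 25 GbE.", fake_llm)]
print(MetaAgent(agents, fake_llm).answer("Which widget has faster networking?"))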

Features and Benefits

Autonomy: Agents operate independently, retrieving, processing, and generating answers or actions for specific documents or tasks.

Adaptability: The system dynamically adjusts its strategy (e.g., reranking depth, document prioritization, tool selection) based on new queries or changing data contexts.

Proactivity: Agents anticipate needs, take preemptive steps towards goals (e.g., pulling additional sources or suggesting actions), and learn from previous interactions.

Advanced Capabilities

Agentic RAG goes beyond “passive” retrieval—agents can compare documents, summarize or contrast specific sections, aggregate multi-source insights, and even invoke tools or APIs for enriched reasoning. This enables:

Automated research and multi-database aggregation

Complex decision support (e.g., comparing technical features, summarizing key differences across product sheets)

Executive support tasks that require independent synthesis and real-time action recommendation.

Applications

Agentic RAG is ideal for scenarios where nuanced information processing and decision-making are required:

Enterprise Knowledge Management: Coordinating answers across heterogeneous internal repositories

AI-Driven Research Assistants: Cross-document synthesis for technical writers, analysts, or executives

Automated Action Workflows: Triggering actions (e.g., responding to invitations, updating records) after multi-step reasoning over documents or databases.

Complex Compliance and Security Audits: Aggregating and comparing evidence from varied sources in real time.

Conclusion

Native RAG pipelines have standardized the process of embedding, retrieving, reranking, and synthesizing answers from external data, enabling LLMs to serve as dynamic knowledge engines. Agentic RAG pushes the boundaries even further—by introducing autonomous agents, orchestration layers, and proactive, adaptive workflows, it transforms RAG from a retrieval tool into a full-blown agentic framework for advanced reasoning and multi-document intelligence.

Organizations seeking to move beyond basic augmentation—and into realms of deep, flexible AI orchestration—will find in Agentic RAG the blueprint for the next generation of intelligent systems.

Huawei CloudMatrix: A Peer-to-Peer AI Datacenter Architecture for Scalable and Efficient LLM Serving

LLMs have rapidly advanced with soaring parameter counts, widespread use of mixture-of-experts (MoE) designs, and massive context lengths. Models like DeepSeek-R1, LLaMA-4, and Qwen-3 now reach trillions of parameters, demanding enormous compute, memory bandwidth, and fast inter-chip communication. MoE improves efficiency but creates challenges in expert routing, while context windows exceeding a million tokens strain attention and KV cache storage, which scales with concurrent users. In real-world deployments, unpredictable inputs, uneven expert activations, and bursty queries further complicate serving. Addressing these pressures requires a ground-up rethinking of AI infrastructure through hardware–software co-design, adaptive orchestration, and elastic resource management. 

Recent progress in LLMs is shaped by three main trends: ever-growing parameter counts, sparse MoE architectures, and extended context windows. Models like Llama 4, DeepSeek-V3, and Google’s PaLM push scale into the trillions of parameters, while MoE designs activate only subsets of experts per token, balancing efficiency with capacity. Meanwhile, context windows now span hundreds of thousands to millions of tokens, enabling long-form reasoning but straining compute and memory through large key-value caches. These advances place immense pressure on datacenters, demanding higher compute, memory, and bandwidth while introducing challenges in parallelism, workload heterogeneity, data convergence, and storage performance. 

Huawei researchers introduced CloudMatrix, a new AI datacenter architecture designed to handle the rising demands of large-scale LLMs. Its first implementation, CloudMatrix384, combines 384 Ascend 910C NPUs and 192 Kunpeng CPUs, all linked by a high-bandwidth, low-latency Unified Bus that enables fully peer-to-peer communication. This design allows flexible pooling of compute, memory, and network resources, making it ideal for MoE parallelism and distributed KV cache access. On top of this, CloudMatrix-Infer offers an optimized serving framework with peer-to-peer resource pools, large-scale expert parallelism, and hardware-aware optimizations like pipelining and INT8 quantization. Evaluations with DeepSeek-R1 show state-of-the-art throughput, efficiency, and scalability. 

Huawei CloudMatrix is a new AI datacenter architecture built on peer-to-peer high-bandwidth interconnects and fine-grained resource disaggregation. Its first large-scale implementation, CloudMatrix384, integrates 384 Ascend 910C NPUs and 192 Kunpeng CPUs into a single supernode, all linked by a unified bus network that enables direct all-to-all communication. This design allows compute, memory, and network resources to be shared seamlessly and scaled independently, operating as one cohesive system. By avoiding the bottlenecks of traditional hierarchical setups, CloudMatrix384 is particularly effective for communication-heavy tasks such as large-scale MoE parallelism and distributed KV cache management, making it ideal for scalable LLM serving. 

The researchers evaluate CloudMatrix-Infer on the DeepSeek-R1 model using the CloudMatrix384 supernode. The system achieves a prefill throughput of 6,688 tokens per second per NPU and a decode throughput of 1,943 tokens per second with latency kept under 50 ms, outperforming comparable systems such as SGLang on NVIDIA H100 and DeepSeek on H800. Even when constrained to stricter latency requirements of under 15 ms, it sustains 538 tokens per second in decoding. Moreover, INT8 quantization on the Ascend 910C preserves accuracy across 16 benchmarks, showing that efficiency improvements do not compromise model quality. 

In conclusion, Huawei CloudMatrix is a next-generation AI datacenter architecture designed to overcome the scalability limits of conventional clusters. Its first production system, CloudMatrix384, combines 384 Ascend 910C NPUs and 192 Kunpeng CPUs in a fully peer-to-peer supernode connected through a high-bandwidth, low-latency Unified Bus. To exploit this design, the study proposes CloudMatrix-Infer, which separates prefill, decode, and caching into independent pools, supports large-scale expert parallelism, and applies hardware-aware optimizations like pipelining and INT8 quantization. Tested on DeepSeek-R1, it achieved superior throughput and latency performance compared to NVIDIA-based systems, while preserving accuracy, showcasing its potential for large-scale AI deployments. 

AmbiGraph-Eval: A Benchmark for Resolving Ambiguity in Graph Query Generation

Semantic parsing converts natural language into formal query languages such as SQL or Cypher, allowing users to interact with databases more intuitively. Yet, natural language is inherently ambiguous, often supporting multiple valid interpretations, while query languages demand exactness. Although ambiguity in tabular queries has been explored, graph databases present a challenge due to their interconnected structures. Natural language queries on graph nodes and relationships often yield multiple interpretations due to the structural richness and diversity of graph data. For example, a query like “best evaluated restaurant” may vary depending on whether results consider individual ratings or aggregate scores.

Ambiguities in interactive systems pose serious risks, as failures in semantic parsing can cause queries to diverge from user intent. Such errors may result in unnecessary data retrieval and computation, wasting time and resources. In high-stakes contexts such as real-time decision-making, these issues can degrade performance, raise operational costs, and reduce effectiveness. LLM-based semantic parsing shows promise in addressing complex and ambiguous queries by using linguistic knowledge and interactive clarification. However, LLMs face a challenge of self-preference bias. Trained on human feedback, they may adopt annotator preferences, leading to systematic misalignment with actual user intent.

Researchers from Hong Kong Baptist University, the National University of Singapore, BIFOLD & TU Berlin, and Ant Group present a method to address ambiguity in graph query generation. The concept of ambiguity in graph database queries is developed, categorizing it into three types: Attribute, Relationship, and Attribute-Relationship ambiguities. Researchers introduced AmbiGraph-Eval, a benchmark containing 560 ambiguous queries and corresponding graph database samples to evaluate model performance. It tests nine LLMs, analyzing their ability to resolve ambiguities and identifying areas for improvement. The study reveals that reasoning capabilities provide a limited advantage, highlighting the importance of understanding graph ambiguity and mastering query syntax.

The AmbiGraph-Eval benchmark is designed to evaluate LLMs’ ability to generate syntactically correct and semantically appropriate graph queries, such as Cypher, from ambiguous natural language inputs. Moreover, the dataset is created in two phases: data collection and human review. Ambiguous prompts are obtained through three methods, including direct extraction from graph databases, synthesis from unambiguous data using LLMs, and full generation by prompting LLMs to create new cases. To evaluate performance, the researchers tested four closed-source LLMs (e.g., GPT-4, Claude-3.5-Sonnet) and four open-source LLMs (e.g., Qwen-2.5, LLaMA-3.1). Evaluations are conducted through API calls or using 4x NVIDIA A40 GPUs.

The evaluation of zero-shot performance on the AmbiGraph-Eval benchmark shows disparities among models in resolving graph data ambiguities. In attribute ambiguity tasks, O1-mini excels in same-entity (SE) scenarios, with GPT-4o and LLaMA-3.1 performing well. However, GPT-4o outperforms others in cross-entity (CE) tasks, showing superior reasoning across entities. For relationship ambiguity, LLaMA-3.1 leads, while GPT-4o shows limitations in SE tasks but excels in CE tasks. Attribute-relationship ambiguity emerges as the most challenging, with LLaMA-3.1 performing best in SE tasks and GPT-4o dominating CE tasks. Overall, models struggle more with multi-dimensional ambiguities compared to isolated attribute or relationship ambiguities.

In conclusion, researchers introduced AmbiGraph-Eval, a benchmark for evaluating the ability of LLMs to resolve ambiguity in graph database queries. Evaluations of nine models reveal significant challenges in generating accurate Cypher statements, with strong reasoning skills offering only limited benefits. Core challenges include recognizing ambiguous intent, generating valid syntax, interpreting graph structures, and performing numerical aggregations. Ambiguity detection and syntax generation emerged as major bottlenecks hindering performance. To address these issues, future research should enhance models’ ambiguity resolution and syntax handling using methods like syntax-aware prompting and explicit ambiguity signaling.

Enhance Geospatial Analysis and GIS Workflows with Amazon Bedrock Capabilities

As data becomes more abundant and information systems grow in complexity, stakeholders need solutions that reveal quality insights. Applying emerging technologies to the geospatial domain offers a unique opportunity to create transformative user experiences and intuitive workstreams for users and organizations to deliver on their missions and responsibilities.
In this post, we explore how you can integrate existing systems with Amazon Bedrock to create new workflows that unlock efficiencies and insights. This integration can benefit technical, nontechnical, and leadership roles alike.
Introduction to geospatial data
Geospatial data is associated with a position relative to Earth (latitude, longitude, altitude). Numerical and structured geospatial data formats can be categorized as follows:

Vector data – Geographical features, such as roads, buildings, or city boundaries, represented as points, lines, or polygons
Raster data – Geographical information, such as satellite imagery, temperature, or elevation maps, represented as a grid of cells
Tabular data – Location-based data, such as descriptions and metrics (average rainfall, population, ownership), represented in a table of rows and columns

Geospatial data sources might also contain natural language text elements for unstructured attributes and metadata for categorizing and describing the record in question. Geographic information systems (GIS) provide a way to store, analyze, and display geospatial information. In GIS applications, this information is frequently presented with a map to visualize streets, buildings, and vegetation.
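As a small illustration of how vector and tabular geospatial data are commonly combined in code, the following sketch uses the third-party shapely library (our choice; the post does not prescribe one) to run a point-in-polygon test against attribute records:

from shapely.geometry import Point, Polygon  # assumes: pip install shapely

# Vector data: a boundary polygon and two points of interest (illustrative coordinates)
city_boundary = Polygon([(-0.1, 51.4), (0.1, 51.4), (0.1, 51.6), (-0.1, 51.6)])
sensor = Point(0.0, 51.5)
warehouse = Point(0.3, 51.5)

# Tabular attributes keyed to those features
attributes = {"sensor": {"avg_rainfall_mm": 610}, "warehouse": {"avg_rainfall_mm": 580}}

for name, geom in [("sensor", sensor), ("warehouse", warehouse)]:
    inside = city_boundary.contains(geom)  # classic vector operation: point-in-polygon
    print(f"{name}: inside boundary={inside}, attributes={attributes[name]}")
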
LLMs and Amazon Bedrock
Large language models (LLMs) are a subset of foundation models (FMs) that can transform input (usually text or image, depending on model modality) into outputs (generally text) through a process called generation. Amazon Bedrock is a comprehensive, secure, and flexible service for building generative AI applications and agents.
LLMs work in many generalized tasks involving natural language. Some common LLM use cases include:

Summarization – Use a model to summarize text or a document.
Q&A – Use a model to answer questions about data or facts from context provided during training or inference using Retrieval Augmented Generation (RAG).
Reasoning – Use a model to provide chain of thought reasoning to assist a human with decision-making and hypothesis evaluation.
Data generation – Use a model to generate synthetic data for testing simulations or hypothetical scenarios.
Content generation – Use a model to draft a report from insights derived from an Amazon Bedrock knowledge base or a user’s prompt.
AI agent and tool orchestration – Use a model to plan the invocation of other systems and processes. After other systems are invoked by an agent, the agent’s output can then be used as context for further LLM generation.

GIS can implement these capabilities to create value and improve user experiences. Benefits can include:

Live decision-making – Taking real-time insights to support immediate decision-making, such as emergency response coordination and traffic management
Research and analysis – In-depth analysis that humans or systems can identify, such as trend analysis, patterns and relationships, and environmental monitoring
Planning – Using research and analysis for informed long-term decision-making, such as infrastructure development, resource allocation, and environmental regulation

Augmenting GIS and workflows with LLM capabilities leads to simpler analysis and exploration of data, discovery of new insights, and improved decision-making. Amazon Bedrock provides a way to host and invoke models as well as integrate the AI models with surrounding infrastructure, which we elaborate on in this post.
Combining GIS and AI through RAG and agentic workflows
LLMs are trained with large amounts of generalized information to discover patterns in how language is produced. To improve the performance of LLMs for specific use cases, approaches such as RAG and agentic workflows have been created. Retrieving policies and general knowledge for geospatial use cases can be accomplished with RAG, whereas calculating and analyzing GIS data would require an agentic workflow. In this section, we expand upon both RAG and agentic workflows in the context of geospatial use cases.
Retrieval Augmented Generation
With RAG, you can dynamically inject contextual information from a knowledge base during model invocation.
RAG supplements a user-provided prompt with data sourced from a knowledge base (a collection of documents). Amazon Bedrock offers managed knowledge bases that connect to data sources, such as Amazon Simple Storage Service (Amazon S3) and SharePoint, so you can provide supplemental information, such as city development plans, intelligence reports, or policies and regulations, when your AI assistant is generating a response for a user.
Knowledge bases are ideal for unstructured documents with information stored in natural language. When your AI model responds to a user with information sourced from RAG, it can provide references and citations to its source material. The following diagram shows how the systems connect together.

Because geospatial data is often structured and in a GIS, you can connect the GIS to the LLM using tools and agents instead of knowledge bases.
Tools and agents (to control a UI and a system)
Many LLMs, such as Anthropic’s Claude on Amazon Bedrock, make it possible to provide a description of tools available so your AI model can generate text to invoke external processes. These processes might retrieve live information, such as the current weather in a location or querying a structured data store, or might control external systems, such as starting a workflow or adding layers to a map. Some common geospatial functionality that you might want to integrate with your LLM using tools include:

Performing mathematical calculations like the distance between coordinates, filtering datasets based on numeric values, or calculating derived fields
Deriving information from predictive analysis models
Looking up points of interest in structured data stores
Searching content and metadata in unstructured data stores
Retrieving real-time geospatial data, like traffic, directions, or estimated time to reach a destination
Visualizing distances, points of interest, or paths
Submitting work outputs such as analytic reports
Starting workflows, like ordering supplies or adjusting supply chain

Tools are often implemented in AWS Lambda functions. Lambda runs code without the complexity and overhead of running servers. It handles the infrastructure management, enabling faster development, improved performance, enhanced security, and cost-efficiency.
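As a minimal sketch of one such tool, the following Lambda-style handler computes the great-circle distance between two coordinates; the event shape and field names are assumptions for illustration, and an Amazon Bedrock agent would typically reach it through an action group:

import json
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def lambda_handler(event, context):
    # Expected input shape (our assumption): {"from": [lat, lon], "to": [lat, lon]}
    lat1, lon1 = event["from"]
    lat2, lon2 = event["to"]
    distance = haversine_km(lat1, lon1, lat2, lon2)
    return {"statusCode": 200, "body": json.dumps({"distance_km": round(distance, 2)})}
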
Amazon Bedrock offers Amazon Bedrock Agents to simplify orchestration and integration with your geospatial tools. Amazon Bedrock agents follow instructions and use LLM reasoning to break down a user prompt into smaller tasks, then perform actions against the identified tasks through action providers. The following diagram illustrates how Amazon Bedrock Agents works.

The following diagram shows how Amazon Bedrock Agents can enhance GIS solutions.

Solution overview
The following demonstration applies the concepts we’ve discussed to an earthquake analysis agent as an example. This example deploys an Amazon Bedrock agent with a knowledge base backed by Amazon Redshift. The Redshift instance has two tables. One table is for earthquakes, which includes date, magnitude, latitude, and longitude. The second table holds the counties in California, described as polygon shapes. The geospatial capabilities of Amazon Redshift can relate these datasets to answer queries like which county had the most recent earthquake or which county has had the most earthquakes in the last 20 years. The Amazon Bedrock agent can generate these geospatially based queries based on natural language.
This script creates an end-to-end pipeline that performs the following steps:

Processes geospatial data.
Sets up cloud infrastructure.
Loads and configures the spatial database.
Creates an AI agent for spatial analysis.

In the following sections, we create this agent and test it out.
Prerequisites
To implement this approach, you must have an AWS account with the appropriate AWS Identity and Access Management (IAM) permissions for Amazon Bedrock, Amazon Redshift, and Amazon S3.
Additionally, complete the following steps to set up the AWS Command Line Interface (AWS CLI):

Confirm you have access to the latest version of the AWS CLI.
Sign in to the AWS CLI with your credentials.
Make sure jq is installed. If not, use the following command:

yum -y install jq

Set up error handling
Use the following code for the initial setup and error handling:

#!/usr/bin/env bash
set -ex

LOG_FILE="deployment_$(date +%Y%m%d_%H%M%S).log"
touch "$LOG_FILE"

# Minimal logging helpers (assumed; the original script's definitions are not shown in the post)
log() {
    echo "[$1] $2" | tee -a "$LOG_FILE"
}
log_error() {
    log "ERROR" "$1"
}

handle_error() {
    local exit_code=$?
    local line_number=$1
    if [ $exit_code -ne 0 ]; then
        log_error "Failed at line $line_number with exit code $exit_code"
        exit $exit_code
    fi
}
trap 'handle_error $LINENO' ERR

This code performs the following functions:

Creates a timestamped log file
Sets up error trapping that captures line numbers
Enables automatic script termination on errors
Implements detailed logging of failures

Validate the AWS environment
Use the following code to validate the AWS environment:

AWS_VERSION=$(aws --version 2>&1)
log "INFO" "AWS CLI version: $AWS_VERSION"

if ! aws sts get-caller-identity &>/dev/null; then
    log_error "AWS CLI is not configured with valid credentials"
    exit 1
fi

AWS_REGION="us-east-1"
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

This code performs the essential AWS setup verification:

Checks AWS CLI installation
Validates AWS credentials
Retrieves account ID for resource naming

Set up Amazon Redshift and Amazon Bedrock variables
Use the following code to create Amazon Redshift and Amazon Bedrock variables:

REDSHIFT_CLUSTER_IDENTIFIER="geo-analysis-cluster"
REDSHIFT_DATABASE="geo_db"
REDSHIFT_MASTER_USER="[Create username]"
REDSHIFT_MASTER_PASSWORD="[Create Password]"
REDSHIFT_NODE_TYPE="dc2.large"
REDSHIFT_CLUSTER_TYPE="single-node"
BEDROCK_ROLE_NAME="BedrockGeospatialRole"
# Bedrock Configuration
AGENT_NAME="GeoAgentRedshift"
KNOWLEDGE_BASE_NAME="GeospatialKB"

Create IAM roles for Amazon Redshift and Amazon S3
Use the following code to set up IAM roles for Amazon S3 and Amazon Redshift:

if aws iam get-role --role-name "$REDSHIFT_ROLE_NAME" &>/dev/null; then
    REDSHIFT_ROLE_ARN=$(aws iam get-role --role-name "$REDSHIFT_ROLE_NAME" --query 'Role.Arn' --output text)
    log "INFO" "Using existing role ARN: $REDSHIFT_ROLE_ARN"
else
    # Create trust policy document
    cat > /tmp/trust-policy.json << EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "redshift.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
EOF
    # Create role
    CREATE_ROLE_OUTPUT=$(aws iam create-role \
        --role-name "$REDSHIFT_ROLE_NAME" \
        --assume-role-policy-document "file:///tmp/trust-policy.json" \
        --description "Role for Redshift to access S3" 2>&1)

    REDSHIFT_ROLE_ARN=$(aws iam get-role --role-name "$REDSHIFT_ROLE_NAME" --query 'Role.Arn' --output text)
    if [ $? -ne 0 ]; then
        log_error "Failed to create role: $CREATE_ROLE_OUTPUT"
        exit 1
    fi
    REDSHIFT_ROLE_ARN=$(echo "$CREATE_ROLE_OUTPUT" | jq -r '.Role.Arn')
    # Wait for role to be available
    sleep 10
fi
ATTACH_POLICY_OUTPUT=$(aws iam attach-role-policy \
    --role-name "$REDSHIFT_ROLE_NAME" \
    --policy-arn "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess" 2>&1)
if [ $? -ne 0 ]; then
    if echo "$ATTACH_POLICY_OUTPUT" | grep -q "EntityAlreadyExists"; then
        log "INFO" "Policy already attached to role"
    else
        log_error "Failed to attach policy: $ATTACH_POLICY_OUTPUT"
        exit 1
    fi
fi

Prepare the data and Amazon S3
Use the following code to prepare the data and Amazon S3 storage:

DATA_BUCKET="geospatial-bedrock-demo-data-${AWS_ACCOUNT_ID}"
aws s3 mb s3://$DATA_BUCKET

# Download source data
curl -o earthquakes.csv https://raw.githubusercontent.com/Esri/gis-tools-for-hadoop/master/samples/data/earthquake-data/earthquakes.csv
curl -o california-counties.json https://raw.githubusercontent.com/Esri/gis-tools-for-hadoop/master/samples/data/counties-data/california-counties.json

This code sets up data storage and retrieval through the following steps:

Creates a unique S3 bucket
Downloads earthquake and county boundary data
Prepares for data transformation

Transform geospatial data
Use the following code to transform the geospatial data:

INPUT_FILE="california-counties.json"
OUTPUT_FILE="california-counties.csv"

# Create CSV header
echo "OBJECTID,AREA,PERIMETER,CO06_D00_,CO06_D00_I,STATE,COUNTY,NAME,LSAD,LSAD_TRANS,Shape_Length,Shape_Area,WKT" > "$OUTPUT_FILE"

# Function to convert ESRI rings to WKT POLYGON format
esri_to_wkt() {
    local rings=$1

    # Extract the first ring (exterior ring)
    local exterior_ring=$(echo "$rings" | jq -c '.[0]')

    if [ "$exterior_ring" = "null" ] || [ -z "$exterior_ring" ]; then
        echo "POLYGON EMPTY"
        return
    fi

    # Start building the WKT string
    local wkt="POLYGON (("

    # Process each coordinate pair in the ring
    local coords=$(echo "$exterior_ring" | jq -r '.[] | "\(.[0]) \(.[1])"')
    local first_coord=""
    local result=""

    while IFS= read -r coord; do
        if [ -z "$result" ]; then
            result="$coord"
            first_coord="$coord"
        else
            result="$result, $coord"
        fi
    done <<< "$coords"

    # Close the ring by adding the first coordinate again if needed
    if [ "$first_coord" != "$(echo "$coords" | tail -1)" ]; then
        result="$result, $first_coord"
    fi

    wkt="${wkt}${result}))"
    echo "$wkt"
}

# Process each feature in the JSON file
jq -c '.features[]' "$INPUT_FILE" | while read -r feature; do
    # Extract attributes
    OBJECTID=$(echo "$feature" | jq -r '.attributes.OBJECTID // empty')
    AREA=$(echo "$feature" | jq -r '.attributes.AREA // empty')
    PERIMETER=$(echo "$feature" | jq -r '.attributes.PERIMETER // empty')
    CO06_D00_=$(echo "$feature" | jq -r '.attributes.CO06_D00_ // empty')
    CO06_D00_I=$(echo "$feature" | jq -r '.attributes.CO06_D00_I // empty')
    STATE=$(echo "$feature" | jq -r '.attributes.STATE // empty')
    COUNTY=$(echo "$feature" | jq -r '.attributes.COUNTY // empty')
    NAME=$(echo "$feature" | jq -r '.attributes.NAME // empty')
    LSAD=$(echo "$feature" | jq -r '.attributes.LSAD // empty')
    LSAD_TRANS=$(echo "$feature" | jq -r '.attributes.LSAD_TRANS // empty')
    Shape_Length=$(echo "$feature" | jq -r '.attributes.Shape_Length // empty')
    Shape_Area=$(echo "$feature" | jq -r '.attributes.Shape_Area // empty')

    # Extract geometry and convert to WKT
    if echo "$feature" | jq -e '.geometry.rings' > /dev/null 2>&1; then
        rings=$(echo "$feature" | jq -c '.geometry.rings')
        WKT=$(esri_to_wkt "$rings")
    else
        WKT="POLYGON EMPTY"
    fi

    # Escape any commas in the fields
    NAME=$(echo "$NAME" | sed 's/,/\\,/g')
    LSAD=$(echo "$LSAD" | sed 's/,/\\,/g')
    LSAD_TRANS=$(echo "$LSAD_TRANS" | sed 's/,/\\,/g')

    # Write to CSV - wrap WKT field in quotes
    echo "$OBJECTID,$AREA,$PERIMETER,$CO06_D00_,$CO06_D00_I,$STATE,$COUNTY,$NAME,$LSAD,$LSAD_TRANS,$Shape_Length,$Shape_Area,\"$WKT\"" >> "$OUTPUT_FILE"
done

echo "Conversion complete. Output saved to $OUTPUT_FILE"

# Upload data files to S3
aws s3 cp earthquakes.csv s3://$DATA_BUCKET/earthquakes/
aws s3 cp california-counties.csv s3://$DATA_BUCKET/counties/

This code performs the following actions to convert the geospatial data formats:

Transforms ESRI JSON to WKT format
Processes county boundaries into CSV format
Preserves spatial information for Amazon Redshift

Create a Redshift cluster
Use the following code to set up the Redshift cluster:

# Create Redshift cluster
aws redshift create-cluster \
    --cluster-identifier "$REDSHIFT_CLUSTER_IDENTIFIER" \
    --node-type "$REDSHIFT_NODE_TYPE" \
    --cluster-type single-node \
    --master-username "$REDSHIFT_MASTER_USER" \
    --master-user-password "$REDSHIFT_MASTER_PASSWORD" \
    --db-name "$REDSHIFT_DATABASE" \
    --cluster-subnet-group-name "$SUBNET_GROUP_NAME" \
    --vpc-security-group-ids "$SG_ID" \
    --iam-roles "$REDSHIFT_ROLE_ARN"

# Wait for cluster availability
while true; do
    CLUSTER_STATUS=$(aws redshift describe-clusters \
        --cluster-identifier "$REDSHIFT_CLUSTER_IDENTIFIER" \
        --query 'Clusters[0].ClusterStatus' \
        --output text)
    if [ "$CLUSTER_STATUS" = "available" ]; then
        break
    fi
    sleep 30
done

This code performs the following functions:

Sets up a single-node cluster
Configures networking and security
Waits for cluster availability

Create a database schema
Use the following code to create the database schema:

aws redshift-data execute-statement \
    --cluster-identifier "$REDSHIFT_CLUSTER_IDENTIFIER" \
    --database "$REDSHIFT_DATABASE" \
    --sql "
CREATE TABLE IF NOT EXISTS counties (
    OBJECTID INTEGER PRIMARY KEY,
    AREA DOUBLE PRECISION,
    NAME VARCHAR(100),
    geom GEOMETRY
);

CREATE TABLE IF NOT EXISTS earthquakes (
    earthquake_date VARCHAR(50),
    latitude double precision,
    longitude double precision,
    magnitude double precision
);"

This code performs the following functions:

Creates a counties table with spatial data
Creates an earthquakes table
Configures appropriate data types

Create an Amazon Bedrock knowledge base
Use the following code to create a knowledge base:

# Create knowledge base
aws bedrock-agent create-knowledge-base \
    --name "$KNOWLEDGE_BASE_NAME" \
    --knowledge-base-configuration '{
        "type": "SQL",
        "sqlKnowledgeBaseConfiguration": {
            "type": "REDSHIFT"
        }
    }' \
    --region "$AWS_REGION"

# Create data source
aws bedrock-agent create-data-source \
    --knowledge-base-id "$KB_ID" \
    --name "EarthquakeDataSource" \
    --data-source-configuration '{"type": "REDSHIFT_METADATA"}'

This code performs the following functions:

Creates an Amazon Bedrock knowledge base
Sets up an Amazon Redshift data source
Enables spatial queries

Create an Amazon Bedrock agent
Use the following code to create and configure an agent:

# Create agent
aws bedrock-agent create-agent \
    --agent-name "$AGENT_NAME" \
    --instruction "You are a geospatial analysis assistant..." \
    --foundation-model "anthropic.claude-3-sonnet-20240229-v1:0"

# Associate knowledge base
aws bedrock-agent associate-agent-knowledge-base \
    --agent-id "$AGENT_ID" \
    --knowledge-base-id "$KB_ID" \
    --description "Earthquake data knowledge base" \
    --agent-version "DRAFT"

This code performs the following functions:

Creates an Amazon Bedrock agent
Associates the agent with the knowledge base
Configures the AI model and instructions

Test the solution
Let’s observe the system behavior with the following natural language user inputs in the chat window.
Example 1: Summarization and Q&A
For this example, we use the prompt “Summarize which zones allow for building of an apartment.”
The LLM performs retrieval with a RAG approach, then uses the retrieved residential code documents as context to answer the user’s query in natural language.

This example demonstrates the LLM capabilities for hallucination mitigation, RAG, and summarization.
Example 2: Generate a draft report
Next, we input the prompt “Write me a report on how various zones and related housing data can be utilized to plan new housing development to meet high demand.”
The LLM retrieves relevant urban planning code documents, then summarizes the information into a standard reporting format as described in its system prompt.

This example demonstrates the LLM capabilities for prompt templates, RAG, and summarization.
Example 3: Show places on the map
For this example, we use the prompt “Show me the low density properties on Abbeville street in Macgregor on the map with their address.”
The LLM creates a chain of thought to look up which properties match the user’s query and then invokes the draw marker tool on the map. The LLM provides tool invocation parameters in its scratchpad, awaits the completion of these tool invocations, then responds in natural language with a bulleted list of markers placed on the map.

This example demonstrates the LLM capabilities for chain of thought reasoning, tool use, retrieval systems using agents, and UI control.
Example 4: Use the UI as context
For this example, we choose a marker on a map and input the prompt “Can I build an apartment here.”
The “here” is not contextualized from conversation history but rather from the state of the map view. Having a state engine that can relay information from a frontend view to the LLM input adds richer context (see the sketch after this example).
The LLM understands the context of “here” based on the selected marker, performs retrieval to see the land development policy, and responds to the user in simple natural language, “No, and here is why…”

This example demonstrates the LLM capabilities for UI context, chain of thought reasoning, RAG, and tool use.
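The following minimal sketch is our own illustration of that idea; the field names and values are hypothetical. It shows one way a frontend could serialize the selected-marker state and prepend it to the user’s message before invoking the agent:

import json

def build_contextual_prompt(user_message, map_state):
    """Fold frontend map state into the model input so references like 'here' resolve to the selected feature."""
    context_blob = json.dumps(map_state, indent=2)
    return (
        "Current map view state (selected feature and viewport):\n"
        f"{context_blob}\n\n"
        f"User message: {user_message}"
    )

# Hypothetical state relayed by the frontend when the user clicks a marker
map_state = {
    "selected_marker": {"parcel_id": "12-345", "zoning": "R1 low density", "lat": -27.49, "lon": 153.07},
    "visible_layers": ["zoning", "parcels"],
}
print(build_contextual_prompt("Can I build an apartment here?", map_state))
# The returned string would then be sent to the agent (for example, through an InvokeAgent call).
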
Example 5: UI context and UI control
Next, we choose a marker on the map and input the prompt “draw a .25 mile circle around here so I can visualize walking distance.”
The LLM invokes the draw circle tool to create a layer on the map centered at the selected marker, contextualized by “here.”

This example demonstrates the LLM capabilities for UI context, chain of thought reasoning, tool use, and UI control.
Clean up
To clean up your resources and prevent AWS charges from being incurred, complete the following steps:

Delete the Amazon Bedrock knowledge base.
Delete the Redshift cluster.
Delete the S3 bucket.

Conclusion
The integration of LLMs with GIS creates intuitive systems that help users of different technical levels perform complex spatial analysis through natural language interactions. By using RAG and agent-based workflows, organizations can maintain data accuracy while seamlessly connecting AI models to their existing knowledge bases and structured data systems. Amazon Bedrock facilitates this convergence of AI and GIS technology by providing a robust platform for model invocation, knowledge retrieval, and system control, ultimately transforming how users visualize, analyze, and interact with geographical data.
For further exploration, Earth on AWS has videos and articles you can explore to understand how AWS is helping build GIS applications on the cloud.

About the Authors
Dave Horne is a Sr. Solutions Architect supporting Federal System Integrators at AWS. He is based in Washington, DC, and has 15 years of experience building, modernizing, and integrating systems for public sector customers. Outside of work, Dave enjoys playing with his kids, hiking, and watching Penn State football!
Kai-Jia Yue is a solutions architect on the Worldwide Public Sector Global Systems Integrator Architecture team at Amazon Web Services (AWS). She has a focus in data analytics and helping customer organizations make data-driven decisions. Outside of work, she loves spending time with friends and family and traveling.
Brian Smitches is the Head of Partner Deployed Engineering at Windsurf focusing on how partners can bring organizational value through the adoption of Agentic AI software development tools like Windsurf and Devin. Brian has a background in Cloud Solutions Architecture from his time at AWS, where he worked in the AWS Federal Partner ecosystem. In his personal time, Brian enjoys skiing, water sports, and traveling with friends and family.

Beyond the basics: A comprehensive foundation model selection framework

Most organizations evaluating foundation models limit their analysis to three primary dimensions: accuracy, latency, and cost. While these metrics provide a useful starting point, they represent an oversimplification of the complex interplay of factors that determine real-world model performance.
Foundation models have revolutionized how enterprises develop generative AI applications, offering unprecedented capabilities in understanding and generating human-like content. However, as the model landscape expands, organizations face complex scenarios when selecting the right foundation model for their applications. In this blog post we present a systematic evaluation methodology for Amazon Bedrock users, combining theoretical frameworks with practical implementation strategies that empower data scientists and machine learning (ML) engineers to make optimal model selections.
The challenge of foundation model selection
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models from leading AI companies such as AI21 Labs, Anthropic, Cohere, DeepSeek, Luma, Meta, Mistral AI, poolside (coming soon), Stability AI, TwelveLabs (coming soon), Writer, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. The service’s API-driven approach allows seamless model interchangeability, but this flexibility introduces a critical challenge: which model will deliver optimal performance for a specific application while meeting operational constraints?
Our research with enterprise customers reveals that many early generative AI projects select models based on either limited manual testing or reputation, rather than systematic evaluation against business requirements. This approach frequently results in:

Over-provisioning computational resources to accommodate larger models than required
Sub-optimal performance because of misalignment between model strengths and use case requirements
Unnecessarily high operational costs because of inefficient token utilization
Production performance issues discovered too late in the development lifecycle

In this post, we outline a comprehensive evaluation methodology optimized for Amazon Bedrock implementations using Amazon Bedrock Evaluations while providing forward-compatible patterns as the foundation model landscape evolves. To read more about how to evaluate large language model (LLM) performance, see LLM-as-a-judge on Amazon Bedrock Model Evaluation.
A multidimensional evaluation framework—Foundation model capability matrix
Foundation models vary significantly across multiple dimensions, with performance characteristics that interact in complex ways. Our capability matrix provides a structured view of critical dimensions to consider when evaluating models in Amazon Bedrock. Below are four core dimensions (in no specific order) – Task performance, Architectural characteristics, Operational considerations, and Responsible AI attributes.
Task performance
Evaluating models on task performance is crucial because it directly affects business outcomes, ROI, user adoption and trust, and competitive advantage.

Task-specific accuracy: Evaluate models using benchmarks relevant to your use case (MMLU, HELM, or domain-specific benchmarks).
Few-shot learning capabilities: Strong few-shot performers require minimal examples to adapt to new tasks, leading to cost efficiency, faster time-to-market, resource optimization, and operational benefits.
Instruction following fidelity: For applications that require precise adherence to commands and constraints, it is critical to evaluate the model’s instruction-following fidelity.
Output consistency: Reliability and reproducibility across multiple runs with identical prompts.
Domain-specific knowledge: Model performance varies dramatically across specialized fields based on training data. Evaluate models against your domain-specific use case scenarios.
Reasoning capabilities: Evaluate the model’s ability to perform logical inference, causal reasoning, and multi-step problem-solving. This can include deductive and inductive reasoning, mathematical reasoning, chain-of-thought, and so on.

Architectural characteristics
Architectural characteristics for evaluating the models are important as they directly impact the model’s performance, efficiency, and suitability for specific tasks.

Parameter count (model size): Larger models typically offer more capabilities but require greater computational resources and may have higher inference costs and latency.
Training data composition: Models trained on diverse, high-quality datasets tend to have better generalization abilities across different domains.
Model architecture: Decoder-only models excel at text generation, encoder-decoder architectures handle translation and summarization more effectively, while mixture of experts (MoE) architectures can be a powerful tool for improving the performance of both decoder-only and encoder-decoder models. Some specialized architectures focus on enhancing reasoning capabilities through techniques like chain-of-thought prompting or recursive reasoning.
Tokenization methodology: The way models process text affects performance on domain-specific tasks, particularly with specialized vocabulary.
Context window capabilities: Larger context windows enable processing more information at once, critical for document analysis and extended conversations.
Modality: Modality refers to the type of data a model can process and generate, such as text, image, audio, or video. Consider the modality requirements of your use case, and choose a model optimized for that specific modality.

Operational considerations
The operational considerations listed below are critical for model selection because they directly impact the real-world feasibility, cost-effectiveness, and sustainability of AI deployments.

Throughput and latency profiles: Response speed impacts user experience and throughput determines scalability.
Cost structures: Input/output token pricing significantly affects economics at scale.
Scalability characteristics: Ability to handle concurrent requests and maintain performance during traffic spikes.
Customization options: Fine-tuning capabilities and adaptation methods for tailoring to specific use cases or domains.
Ease of integration: How easily the model can be integrated into existing systems and workflows is an important consideration.
Security: When dealing with sensitive data, model security—including data encryption, access control, and vulnerability management—is a crucial consideration.

Responsible AI attributes
As AI becomes increasingly embedded in business operations and daily lives, evaluating models on responsible AI attributes isn’t just a technical consideration—it’s a business imperative.

Hallucination propensity: Models vary in their tendency to generate plausible but incorrect information.
Bias measurements: Performance across different demographic groups affects fairness and equity.
Safety guardrail effectiveness: Resistance to generating harmful or inappropriate content.
Explainability and privacy: Transparency features and handling of sensitive information.
Legal implications: Legal considerations should include data privacy, non-discrimination, intellectual property, and product liability.

Agentic AI considerations for model selection
The growing popularity of agentic AI applications introduces evaluation dimensions beyond traditional metrics. When assessing models for use in autonomous agents, consider these critical capabilities:
Agent-specific evaluation dimensions

Planning and reasoning capabilities: Evaluate chain-of-thought consistency across complex multi-step tasks and self-correction mechanisms that allow agents to identify and fix their own reasoning errors.
Tool and API integration: Test function calling capabilities, parameter handling precision, and structured output consistency (JSON/XML) for seamless tool use (see the sketch after this list).
Agent-to-agent communication: Assess protocol adherence to frameworks like A2A and efficient contextual memory management across extended multi-agent interactions.
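As an illustrative way to quantify structured output consistency (the expected schema and the sample outputs below are hypothetical), you can re-run the same tool-calling prompt several times and validate each response against the JSON shape your agent framework expects:

import json

REQUIRED_KEYS = {"tool_name", "arguments"}  # hypothetical schema for a tool call

def is_valid_tool_call(raw_output: str) -> bool:
    # A response counts as consistent only if it parses as JSON and matches the expected shape.
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return REQUIRED_KEYS.issubset(payload) and isinstance(payload.get("arguments"), dict)

# Hypothetical model outputs collected from repeated runs of the same prompt.
runs = [
    '{"tool_name": "get_weather", "arguments": {"city": "Seattle"}}',
    '{"tool_name": "get_weather", "arguments": "Seattle"}',  # malformed arguments
    'Sure! Calling get_weather now.',                        # not JSON at all
]
consistency = sum(is_valid_tool_call(r) for r in runs) / len(runs)
print(f"Structured-output consistency: {consistency:.0%}")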

Multi-agent collaboration testing for applications using multiple specialized agents

Role adherence: Measure how well models maintain distinct agent personas and responsibilities without role confusion.
Information sharing efficiency: Test how effectively information flows between agent instances without critical detail loss.
Collaborative intelligence: Verify whether multiple agents working together produce better outcomes than single-model approaches.
Error propagation resistance: Assess how robustly multi-agent systems contain and correct errors rather than amplifying them.

A four-phase evaluation methodology
Our recommended methodology progressively narrows model selection through increasingly sophisticated assessment techniques:
Phase 1: Requirements engineering
Begin with a precise specification of your application’s requirements:

Functional requirements: Define primary tasks, domain knowledge needs, language support, output formats, and reasoning complexity.
Non-functional requirements: Specify latency thresholds, throughput requirements, budget constraints, context window needs, and availability expectations.
Responsible AI requirements: Establish hallucination tolerance, bias mitigation needs, safety requirements, explainability level, and privacy constraints.
Agent-specific requirements: For agentic applications, define tool-use capabilities, protocol adherence standards, and collaboration requirements.

Assign weights to each requirement based on business priorities to create your evaluation scorecard foundation.
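As a minimal sketch of such a scorecard foundation (the requirement names, weights, and thresholds below are hypothetical), the prioritized requirements can be captured as a simple weighted structure that later phases reuse:

# Hypothetical evaluation scorecard; weights reflect business priorities and sum to 1.0.
REQUIREMENTS = {
    "task_accuracy":        {"weight": 0.30, "type": "functional"},
    "latency_p95_ms":       {"weight": 0.20, "type": "non-functional", "threshold": 1500},
    "cost_per_1k_requests": {"weight": 0.20, "type": "non-functional", "budget_usd": 5.00},
    "hallucination_rate":   {"weight": 0.15, "type": "responsible_ai"},
    "tool_call_success":    {"weight": 0.15, "type": "agentic"},
}

# Guard against weights drifting as requirements are added or reprioritized.
assert abs(sum(r["weight"] for r in REQUIREMENTS.values()) - 1.0) < 1e-9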
Phase 2: Candidate model selection
Use the Amazon Bedrock model information API to filter models based on hard requirements. This typically reduces candidates from dozens to 3–7 models that are worth detailed evaluation.
Filter options include but aren’t limited to the following:

Filter by modality support, context length, and language capabilities
Exclude models that don’t meet minimum performance thresholds
Calculate theoretical costs at projected scale so that you can exclude options that exceed the available budget
Filter for customization requirements such as fine-tuning capabilities
For agentic applications, filter for function calling and multi-agent protocol support

Although the Amazon Bedrock model information API might not provide the filters you need for candidate selection, you can use the Amazon Bedrock model catalog (shown in the following figure) to obtain additional information about these models.
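To make the filtering step concrete, the following is a minimal sketch using the boto3 Bedrock control-plane client; the filter values and the streaming requirement are illustrative assumptions, and the available API filters might not cover every hard requirement you have:

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# List text-output models that support on-demand inference.
response = bedrock.list_foundation_models(
    byOutputModality="TEXT",
    byInferenceType="ON_DEMAND",
)

candidates = [
    summary["modelId"]
    for summary in response["modelSummaries"]
    # Keep models that advertise response streaming, a hard requirement in this hypothetical use case.
    if summary.get("responseStreamingSupported")
]
print(candidates)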

Phase 3: Systematic performance evaluation
Implement structured evaluation using Amazon Bedrock Evaluations:

Prepare evaluation datasets: Create representative task examples, challenging edge cases, domain-specific content, and adversarial examples.
Design evaluation prompts: Standardize instruction format, maintain consistent examples, and mirror production usage patterns.
Configure metrics: Select appropriate metrics for subjective tasks (human evaluation and reference-free quality), objective tasks (precision, recall, and F1 score), and reasoning tasks (logical consistency and step validity).
For agentic applications: Add protocol conformance testing, multi-step planning assessment, and tool-use evaluation.
Execute evaluation jobs: Maintain consistent parameters across models and collect comprehensive performance data.
Measure operational performance: Capture throughput, latency distributions, error rates, and actual token consumption costs.
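For the operational measurements in the last step above, a minimal sketch using the Amazon Bedrock Converse API might look like the following; the model ID and prompts are placeholders, and the model must be enabled in your account:

import time
import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # placeholder candidate model

latencies, input_tokens, output_tokens = [], 0, 0
for prompt in ["Summarize our refund policy.", "List three onboarding steps."]:
    start = time.time()
    resp = runtime.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    latencies.append(time.time() - start)
    input_tokens += resp["usage"]["inputTokens"]
    output_tokens += resp["usage"]["outputTokens"]

print(f"Average latency: {sum(latencies) / len(latencies):.2f}s, "
      f"tokens in/out: {input_tokens}/{output_tokens}")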

Phase 4: Decision analysis
Transform evaluation data into actionable insights:

Normalize metrics: Scale all metrics to comparable units using min-max normalization (see the sketch after this list).
Apply weighted scoring: Calculate composite scores based on your prioritized requirements.
Perform sensitivity analysis: Test how robust your conclusions are against weight variations.
Visualize performance: Create radar charts, efficiency frontiers, and tradeoff curves for clear comparison.
Document findings: Detail each model’s strengths, limitations, and optimal use cases.
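The first two steps can be sketched in a few lines of Python; the models, raw metric values, and weights below are hypothetical:

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.5 for v in values]

# Hypothetical evaluation results; lower is better for latency and cost.
raw = {
    "model-a": {"accuracy": 0.82, "latency_ms": 900,  "cost": 3.1},
    "model-b": {"accuracy": 0.78, "latency_ms": 450,  "cost": 1.8},
    "model-c": {"accuracy": 0.85, "latency_ms": 1400, "cost": 4.2},
}
weights = {"accuracy": 0.5, "latency_ms": 0.3, "cost": 0.2}
higher_is_better = {"accuracy": True, "latency_ms": False, "cost": False}

models = list(raw)
normalized = {m: {} for m in models}
for metric in weights:
    scaled = min_max([raw[m][metric] for m in models])
    if not higher_is_better[metric]:
        scaled = [1 - s for s in scaled]  # invert so that 1.0 is always best
    for m, s in zip(models, scaled):
        normalized[m][metric] = s

composite = {m: sum(weights[k] * normalized[m][k] for k in weights) for m in models}
print(sorted(composite.items(), key=lambda kv: kv[1], reverse=True))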

Advanced evaluation techniques
Beyond standard procedures, consider the following approaches for evaluating models.
A/B testing with production traffic
Implement comparative testing using Amazon Bedrock’s routing capabilities to gather real-world performance data from actual users.
Adversarial testing
Test model vulnerabilities through prompt injection attempts, challenging syntax, edge case handling, and domain-specific factual challenges.
Multi-model ensemble evaluation
Assess combinations such as sequential pipelines, voting ensembles, and cost-efficient routing based on task complexity.
Continuous evaluation architecture
Design systems to monitor production performance with:

Stratified sampling of production traffic across task types and domains
Regular evaluations and trigger-based reassessments when new models emerge
Performance thresholds and alerts for quality degradation
User feedback collection and failure case repositories for continuous improvement

Industry-specific considerations
Different sectors have unique requirements that influence model selection:

Financial services: Regulatory compliance, numerical precision, and personally identifiable information (PII) handling capabilities
Healthcare: Medical terminology understanding, HIPAA adherence, and clinical reasoning
Manufacturing: Technical specification comprehension, procedural knowledge, and spatial reasoning
Agentic systems: Autonomous reasoning, tool integration, and protocol conformance

Best practices for model selection
Through this comprehensive approach to model evaluation and selection, organizations can make informed decisions that balance performance, cost, and operational requirements while maintaining alignment with business objectives. The methodology makes sure that model selection isn’t a one-time exercise but an evolving process that adapts to changing needs and technological capabilities.

Assess your situation thoroughly: Understand your specific use case requirements and available resources
Select meaningful metrics: Focus on metrics that directly relate to your business objectives
Build for continuous evaluation: Design your evaluation process to be repeatable as new models are released

Looking forward: The future of model selection
As foundation models evolve, evaluation methodologies must keep pace. Below are further considerations to take into account when selecting the best model(s) for your use case(s); this list is by no means exhaustive and will continue to evolve as technology and best practices advance.

Multi-model architectures: Enterprises will increasingly deploy specialized models in concert rather than relying on single models for all tasks.
Agentic landscapes: Evaluation frameworks must assess how models perform as autonomous agents with tool-use capabilities and inter-agent collaboration.
Domain specialization: The growing landscape of domain-specific models will require more nuanced evaluation of specialized capabilities.
Alignment and control: As models become more capable, evaluation of controllability and alignment with human intent becomes increasingly important.

Conclusion
By implementing a comprehensive evaluation framework that extends beyond basic metrics, organizations can make informed decisions about which foundation models will best serve their requirements. For agentic AI applications in particular, thorough evaluation of reasoning, planning, and collaboration capabilities is essential for success. By approaching model selection systematically, organizations can avoid the common pitfalls of over-provisioning, misalignment with use case needs, excessive operational costs, and late discovery of performance issues. The investment in thorough evaluation pays dividends through optimized costs, improved performance, and superior user experiences.

About the author
Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Accelerate intelligent document processing with generative AI on AWS

Every day, organizations process millions of documents, including invoices, contracts, insurance claims, medical records, and financial statements. Despite the critical role these documents play, an estimated 80–90% of the data they contain is unstructured and largely untapped, hiding valuable insights that could transform business outcomes. Despite advances in technology, many organizations still rely on manual data entry, spending countless hours extracting information from PDFs, scanned images, and forms. This manual approach is time-consuming, error-prone, and prevents organizations from scaling their operations and responding quickly to business demands.
Although generative AI has made it easier to build proof-of-concept document processing solutions, the journey from proof of concept to production remains fraught with challenges. Organizations often find themselves rebuilding from scratch when they discover their prototype can’t handle production volumes, lacks proper error handling, doesn’t scale cost-effectively, or fails to meet enterprise security and compliance requirements. What works in a demo with a handful of documents often breaks down when processing thousands of documents daily in a production environment.
In this post, we introduce our open source GenAI IDP Accelerator—a tested solution that we use to help customers across industries address their document processing challenges. Automated document processing workflows accurately extract structured information from documents, reducing manual effort. We will show you how this ready-to-deploy solution can help you build those workflows with generative AI on AWS in days instead of months.
Understanding intelligent document processing
Intelligent document processing (IDP) encompasses the technologies and techniques used to extract and process data from various document types. Common IDP tasks include:

OCR (Optical Character Recognition) – Converting scanned documents and images into machine-readable text
Document classification – Automatically identifying document types (such as invoices, contracts, or forms)
Data extraction – Pulling structured information from unstructured documents
Assessment – Evaluating the quality and confidence of extracted data
Summarization – Creating concise summaries of document content
Evaluation – Measuring accuracy and performance against expected outcomes

These capabilities are critical across industries. In financial services, organizations use IDP to process loan applications, extract data from bank statements, and validate insurance claims. Healthcare providers rely on IDP to extract patient information from medical records, process insurance forms, and handle lab results efficiently. Manufacturing and logistics companies use IDP to process invoices and purchase orders, extract shipping information, and handle quality certificates. Government agencies use IDP to process citizen applications, extract data from tax forms, manage permits and licenses, and enforce regulatory compliance.
The generative AI revolution in IDP
Traditional IDP solutions relied on template-based extraction, regular expressions, and classical machine learning (ML) models. Though functional, these approaches required extensive setup, struggled with document variations, and achieved limited accuracy on complex documents.
The emergence of large language models (LLMs) and generative AI has fundamentally transformed IDP capabilities. Modern AI models can understand document context, handle variations without templates, achieve near-human accuracy on complex extractions, and adapt to new document types with minimal examples. This shift from rule-based to intelligence-based processing means organizations can now process different document types with high accuracy, dramatically reducing the time and cost of implementation.
GenAI IDP Accelerator
We’re excited to share the GenAI IDP Accelerator—an open source solution that transforms how organizations handle document processing by dramatically reducing manual effort and improving accuracy. This serverless foundation offers processing patterns that use Amazon Bedrock Data Automation (for rich out-of-the-box document processing features, high accuracy, ease of use, and straightforward per-page pricing), Amazon Bedrock state-of-the-art foundation models (FMs) (for complex documents requiring custom logic), and other AWS AI services, providing a flexible, scalable starting point for enterprises to build document automation tailored to their specific needs.
The following is a short demo of the solution in action, in this case showcasing the default Amazon Bedrock Data Automation processing pattern.

Real-world impact
The GenAI IDP Accelerator is already transforming document processing for organizations across industries.
Competiscan: Transforming marketing intelligence at scale
Competiscan, a leader in competitive marketing intelligence, faced a massive challenge: processing 35,000–45,000 marketing campaigns daily while maintaining a searchable archive of 45 million campaigns spanning 15 years.
Using the GenAI IDP Accelerator, Competiscan achieved the following:

85% classification and extraction accuracy across diverse marketing materials
Increased scalability to handle 35,000–45,000 daily campaigns
Removal of critical bottlenecks, facilitating business growth
Production deployment in just 8 weeks from initial concept

Ricoh: Scaling document processing
Ricoh, a global leader in document management, implemented the GenAI IDP Accelerator to transform healthcare document processing for their clients. Processing over 10,000 healthcare documents monthly with potential to scale to 70,000, they needed a solution that could handle complex medical documentation with high accuracy.
The results speak for themselves:

Savings potential of over 1,900 person-hours annually through automation
Achieved extraction accuracy to help minimize financial penalties from processing errors
Automated classification of grievances vs. appeals
Created a reusable framework deployable across multiple healthcare customers
Integrated with human-in-the-loop review for cases requiring expert validation
Leveraged modular architecture to integrate with existing systems, enabling custom document splitting and large-scale document processing

Solution overview
The GenAI IDP Accelerator is a modular, serverless solution that automatically converts unstructured documents into structured, actionable data. Built entirely on AWS services, it provides enterprise-grade scalability, security, and cost-effectiveness while requiring minimal setup and maintenance. Its configuration-driven design helps teams quickly adapt prompts, extraction templates, and validation rules for their specific document types without touching the underlying infrastructure.
The solution follows a modular pipeline that enriches documents at each stage, from OCR to classification, to extraction, to assessment, to summarization, and ending with evaluation.
You can deploy and customize each step independently, so you can optimize for your specific use cases while maintaining the benefits of the integrated workflow.
The following diagram illustrates the solution architecture, showing the default Bedrock Data Automation workflow (Pattern-1).

Refer to the GitHub repo for additional details and processing patterns.
Some of the key features of the solution include:

Serverless architecture – Built on AWS Lambda, AWS Step Functions, and other serverless technologies for queueing, concurrency management, and retries to provide automatic scaling and pay-per-use pricing for production workloads of many sizes
Generative AI-powered document packet splitting and classification – Intelligent document classification using Amazon Bedrock Data Automation or Amazon Bedrock multimodal FMs, including support for multi-document packets and packet splitting
Advanced AI key information extraction – Key information extraction using Amazon Bedrock Data Automation or Amazon Bedrock multimodal FMs
Multiple processing patterns – Choose from pre-built patterns optimized for different workloads with different configurability, cost, and accuracy requirements, or extend the solution with additional patterns:

Pattern 1 – Uses Amazon Bedrock Data Automation, a fully managed service that offers rich out-of-the-box features, ease of use, and straightforward per-page pricing. This pattern is recommended for most use cases.
Pattern 2 – Uses Amazon Textract and Amazon Bedrock with Amazon Nova, Anthropic’s Claude, or custom fine-tuned Amazon Nova models. This pattern is ideal for complex documents requiring custom logic.
Pattern 3 – Uses Amazon Textract, Amazon SageMaker with a fine-tuned model for classification, and Amazon Bedrock for extraction. This pattern is ideal for documents requiring specialized classification.

We expect to add more pattern options to handle additional real-world document processing needs and to take advantage of ever-improving state-of-the-art capabilities. Other key features include the following:

Few-shot learning – Improve accuracy for classification and extraction by providing few-shot examples to guide the AI models
Confidence assessment – AI-powered quality assurance that evaluates extraction field confidence, used to indicate documents for human review
Human-in-the-loop (HITL) review – Integrated workflow for human review of low-confidence extractions using Amazon SageMaker Augmented AI (Amazon A2I), currently available for Pattern 1, with support for Patterns 2 and 3 coming soon
Web user interface – Responsive web UI for monitoring document processing, viewing results, and managing configurations
Knowledge base integration – Query processed documents using natural language through Amazon Bedrock Knowledge Bases
Built-in evaluation – Framework to evaluate and improve accuracy against baseline data
Analytics and reporting database – Centralized analytics database for tracking processing metrics, accuracy trends, and cost optimization across document workflows, and for analyzing extracted document content using Amazon Athena
No-code configuration – Customize document types, extraction fields, and processing logic through configuration, editable in the web UI
Developer-friendly Python package – For data science and engineering teams who want to experiment, optimize, or integrate the IDP capabilities directly into their workflows, the solution’s core logic is available through the idp_common Python package

Prerequisites
Before you deploy the solution, make sure you have an AWS account with administrator permissions and access to Amazon and Anthropic models on Amazon Bedrock. For more details, see Access Amazon Bedrock foundation models.
Deploy the GenAI IDP Accelerator
To deploy the GenAI IDP Accelerator, you can use the provided AWS CloudFormation template. For more details, see the quick start option on the GitHub repo. The high-level steps are as follows:

Log in to your AWS account.
Choose Launch Stack for your preferred AWS Region:

Launch Stack links are provided for the US East (N. Virginia) and US West (Oregon) Regions.

Enter your email address and choose your processing pattern (default is Pattern 1, using Amazon Bedrock Data Automation).
Use defaults for all other configuration parameters.
Deploy the stack.

The stack takes approximately 15–20 minutes to deploy the resources. After deployment, you will receive an email with login credentials for the web interface.
Process documents
After you deploy the solution, you can start processing documents:

Use the web interface to upload a sample document (you can use the provided sample: lending_package.pdf).

In production, you typically automate loading your documents directly to the Amazon Simple Storage Service (Amazon S3) input bucket, automatically triggering processing (a minimal upload sketch follows these steps). To learn more, see Testing without the UI.

Select your document from the document list and choose View Processing Flow to watch as your document flows through the pipeline.

Examine the extracted data with confidence scores.

Use the knowledge base feature to ask questions about processed content.
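To automate document loading as noted above, a minimal sketch with boto3 follows; the bucket name and object key are placeholders, and the actual input bucket name comes from your deployed stack's outputs:

import boto3

s3 = boto3.client("s3")

INPUT_BUCKET = "<idp-accelerator-input-bucket>"  # placeholder; copy the name from the stack outputs

# Uploading a document to the input bucket triggers the processing workflow automatically.
s3.upload_file("lending_package.pdf", INPUT_BUCKET, "incoming/lending_package.pdf")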

Alternative deployment methods
You can build the solution from source code if you need to deploy it to additional Regions or to build and deploy code changes.
We hope to add support for AWS Cloud Development Kit (AWS CDK) and Terraform deployments. Follow the GitHub repository for updates, or contact AWS Professional Services for implementation assistance.
Update an existing GenAI IDP Accelerator stack
You can update your existing GenAI IDP Accelerator stack to the latest release. For more details, see Updating an Existing Stack.
Clean up
When you’re finished experimenting, clean up your resources by using the AWS CloudFormation console to delete the IDP stack that you deployed.
Conclusion
In this post, we discussed the GenAI IDP Accelerator, a new approach to document processing that combines the power of generative AI with the reliability and scale of AWS. You can process hundreds or even millions of documents to achieve better results faster and more cost-effectively than traditional approaches.
Visit the GitHub repository for detailed guides and examples and choose watch to stay informed on new releases and features. AWS Professional Services and AWS Partners are available to help with implementation. You can also join the GitHub community to contribute improvements and share your experiences.

About the Authors
Bob Strahan is a Principal Solutions Architect in the AWS Generative AI Innovation Center.
Joe King is a Senior Data Scientist in the AWS Generative AI Innovation Center.
Mofijul Islam is an Applied Scientist in the AWS Generative AI Innovation Center.
Vincil Bishop is a Senior Deep Learning Architect in the AWS Generative AI Innovation Center.
David Kaleko is a Senior Applied Scientist in the AWS Generative AI Innovation Center.
Rafal Pawlaszek is a Senior Cloud Application Architect in the AWS Generative AI Innovation Center.
Spencer Romo is a Senior Data Scientist in the AWS Generative AI Innovation Center.
Vamsi Thilak Gudi is a Solutions Architect in the AWS World Wide Public Sector team.

Acknowledgments
We would like to thank Abhi Sharma, Akhil Nooney, Aleksei Iancheruk, Ava Kong, Boyi Xie, Diego Socolinsky, Guillermo Tantachuco, Ilya Marmur, Jared Kramer, Jason Zhang, Jordan Ratner, Mariano Bellagamba, Mark Aiyer, Niharika Jain, Nimish Radia, Shean Sager, Sirajus Salekin, Yingwei Yu, and many others in our expanding community, for their unwavering vision, passion, contributions, and guidance throughout.

What Is Speaker Diarization? A 2025 Technical Guide: Top 9 Speaker Diarization Libraries and APIs in 2025

Table of contents: How Speaker Diarization Works | Accuracy, Metrics, and Current Challenges | Technical Insights and 2025 Trends | Top 9 Speaker Diarization Libraries and APIs in 2025 | FAQs

Speaker diarization is the process of answering “who spoke when” by separating an audio stream into segments and consistently labeling each segment by speaker identity (e.g., Speaker A, Speaker B), thereby making transcripts clearer, searchable, and useful for analytics across domains like call centers, legal, healthcare, media, and conversational AI. As of 2025, modern systems rely on deep neural networks to learn robust speaker embeddings that generalize across environments, and many no longer require prior knowledge of the number of speakers—enabling practical real-time scenarios such as debates, podcasts, and multi-speaker meetings.

How Speaker Diarization Works

Modern diarization pipelines comprise several coordinated components; weakness in one stage (e.g., VAD quality) cascades to others.

Voice Activity Detection (VAD): Filters out silence and noise to pass speech to later stages; high-quality VADs trained on diverse data sustain strong accuracy in noisy conditions.

Segmentation: Splits continuous audio into utterances (commonly 0.5–10 seconds) or at learned change points; deep models increasingly detect speaker turns dynamically instead of fixed windows, reducing fragmentation.

Speaker Embeddings: Converts segments into fixed-length vectors (e.g., x-vectors, d-vectors) capturing vocal timbre and idiosyncrasies; state-of-the-art systems train on large, multilingual corpora to improve generalization to unseen speakers and accents.

Speaker Count Estimation: Some systems estimate how many unique speakers are present before clustering, while others cluster adaptively without a preset count.

Clustering and Assignment: Groups embeddings by likely speaker using methods such as spectral clustering or agglomerative hierarchical clustering; tuning is pivotal for borderline cases, accent variation, and similar voices.
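As an illustrative sketch of the clustering stage (not a production pipeline), assuming you already have fixed-length speaker embeddings, agglomerative clustering without a preset speaker count might look like this with scikit-learn:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic 192-dimensional embeddings for six segments from two speakers.
rng = np.random.default_rng(0)
voice_a, voice_b = rng.normal(size=192), rng.normal(size=192)
embeddings = np.vstack(
    [voice_a + 0.05 * rng.normal(size=192) for _ in range(3)]
    + [voice_b + 0.05 * rng.normal(size=192) for _ in range(3)]
)

# Cosine-distance agglomerative clustering; no fixed number of speakers.
# Note: scikit-learn >= 1.2 uses the `metric` argument (older releases used `affinity`).
clusterer = AgglomerativeClustering(
    n_clusters=None,
    metric="cosine",
    linkage="average",
    distance_threshold=0.5,
)
labels = clusterer.fit_predict(embeddings)
print(labels)  # two clusters, e.g. [0 0 0 1 1 1] -> Speaker A / Speaker B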

Accuracy, Metrics, and Current Challenges

Industry practice views real-world diarization below roughly 10% total error as reliable enough for production use, though thresholds vary by domain.

Key metrics include Diarization Error Rate (DER), which aggregates missed speech, false alarms, and speaker confusion; boundary errors (turn-change placement) also matter for readability and timestamp fidelity.
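To make DER concrete, here is a minimal sketch of how it is typically computed from aggregated durations; the numbers are made up:

def diarization_error_rate(missed_s, false_alarm_s, confusion_s, reference_speech_s):
    """DER = (missed speech + false alarms + speaker confusion) / total reference speech time."""
    return (missed_s + false_alarm_s + confusion_s) / reference_speech_s

# Hypothetical totals (in seconds) for a recording with 480 s of reference speech.
der = diarization_error_rate(missed_s=12.0, false_alarm_s=9.5, confusion_s=21.0,
                             reference_speech_s=480.0)
print(f"DER = {der:.1%}")  # roughly 8.9%, under the ~10% rule of thumb mentioned above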

Persistent challenges include overlapping speech (simultaneous speakers), noisy or far-field microphones, highly similar voices, and robustness across accents and languages; cutting-edge systems mitigate these with better VADs, multi-condition training, and refined clustering, but difficult audio still degrades performance.

Technical Insights and 2025 Trends

Deep embeddings trained on large-scale, multilingual data are now the norm, improving robustness across accents and environments.

Many APIs bundle diarization with transcription, but standalone engines and open-source stacks remain popular for custom pipelines and cost control.

Audio-visual diarization is an active research area to resolve overlaps and improve turn detection using visual cues when available.

Real-time diarization is increasingly feasible with optimized inference and clustering, though latency and stability constraints remain in noisy multi-party settings.

Top 9 Speaker Diarization Libraries and APIs in 2025

NVIDIA Streaming Sortformer: Real-time speaker diarization that instantly identifies and labels participants in meetings, calls, and voice-enabled applications—even in noisy, multi-speaker environments

AssemblyAI (API): Cloud Speech-to-Text with built‑in diarization; recent model updates include lower DER, stronger short‑segment handling (~250 ms), and improved robustness in noisy and overlapped speech, enabled via a simple speaker_labels parameter at no extra cost. Integrates with a broader audio intelligence stack (sentiment, topics, summarization) and publishes practical guidance and examples for production use

Deepgram (API): Language‑agnostic diarization trained on 100k+ speakers and 80+ languages; vendor benchmarks highlight ~53% accuracy gains vs. prior version and 10× faster processing vs. the next fastest vendor, with no fixed limit on number of speakers. Designed to pair speed with clustering‑based precision for real‑world, multi‑speaker audio.

Speechmatics (API): Enterprise‑focused STT with diarization available through Flow; offers both cloud and on‑prem deployment, configurable max speakers, and claims competitive accuracy with punctuation‑aware refinements for readability. Suitable where compliance and infrastructure control are priorities.

Gladia (API): Combines Whisper transcription with pyannote diarization and offers an “enhanced” mode for tougher audio; supports streaming and speaker hints, making it a fit for teams standardizing on Whisper who need integrated diarization without stitching multiple components together.

SpeechBrain (Library): PyTorch toolkit with recipes spanning 20+ speech tasks, including diarization; supports training/fine‑tuning, dynamic batching, mixed precision, and multi‑GPU, balancing research flexibility with production‑oriented patterns. Good fit for PyTorch‑native teams building bespoke diarization stacks.

FastPix (API): Developer‑centric API emphasizing quick integration and real‑time pipelines; positions diarization alongside adjacent features like audio normalization, STT, and language detection to streamline production workflows. A pragmatic choice when teams want API simplicity over managing open‑source stacks.

NVIDIA NeMo (Toolkit): GPU‑optimized speech toolkit including diarization pipelines (VAD, embedding extraction, clustering) and research directions like Sortformer/MSDD for end‑to‑end diarization; supports both oracle and system VAD for flexible experimentation. Best for teams with CUDA/GPU workflows seeking custom multi‑speaker ASR systems

pyannote‑audio (Library): Widely used PyTorch toolkit with pretrained models for segmentation, embeddings, and end‑to‑end diarization; active research community and frequent updates, with reports of strong DER on benchmarks under optimized configs. Ideal for teams wanting open‑source control and the ability to fine‑tune on domain data

FAQs

What is speaker diarization? Speaker diarization is the process of determining “who spoke when” in an audio stream by segmenting speech and assigning consistent speaker labels (e.g., Speaker A, Speaker B). It improves transcript readability and enables analytics like speaker-specific insights.

How is diarization different from speaker recognition? Diarization separates and labels distinct speakers without knowing their identities, while speaker recognition matches a voice to a known identity (e.g., verifying a specific person). Diarization answers “who spoke when,” recognition answers “who is speaking.”

What factors most affect diarization accuracy? Audio quality, overlapping speech, microphone distance, background noise, number of speakers, and very short utterances all impact accuracy. Clean, well-mic’d audio with clearer turn-taking and sufficient speech per speaker generally yields better results.

NVIDIA AI Just Released Streaming Sortformer: A Real-Time Speaker Diarization that Figures Out Who’s Talking in Meetings and Calls Instantly

NVIDIA has released its Streaming Sortformer, a breakthrough in real-time speaker diarization that instantly identifies and labels participants in meetings, calls, and voice-enabled applications—even in noisy, multi-speaker environments. Designed for low-latency, GPU-powered inference, the model is optimized for English and Mandarin, and can track up to four simultaneous speakers with millisecond-level precision. This innovation marks a major step forward in conversational AI, enabling a new generation of productivity, compliance, and interactive voice applications.

Core Capabilities: Real-Time, Multi-Speaker Tracking

Unlike traditional diarization systems that require batch processing or expensive, specialized hardware, Streaming Sortformer performs frame-level diarization in real time. That means every utterance is tagged with a speaker label (e.g., spk_0, spk_1) and a precise timestamp as the conversation unfolds. The model is low-latency, processing audio in small, overlapping chunks—a critical feature for live transcriptions, smart assistants, and contact center analytics where every millisecond counts.

Labels 2–4+ speakers on the fly: Robustly tracks up to four participants per conversation, assigning consistent labels as each speaker enters the stream.

GPU-accelerated inference: Fully optimized for NVIDIA GPUs, integrating seamlessly with the NVIDIA NeMo and NVIDIA Riva platforms for scalable, production deployment.

Multilingual support: While tuned for English, the model shows strong results on Mandarin meeting data and even non-English datasets like CALLHOME, indicating broad language compatibility beyond its core targets.

Precision and reliability: Delivers a competitive Diarization Error Rate (DER), outperforming recent alternatives like EEND-GLA and LS-EEND in real-world benchmarks.

These capabilities make Streaming Sortformer immediately useful for live meeting transcripts, contact center compliance logs, voicebot turn-taking, media editing, and enterprise analytics—all scenarios where knowing “who said what, when” is essential.

Architecture and Innovation

At its core, Streaming Sortformer is a hybrid neural architecture, combining the strengths of Convolutional Neural Networks (CNNs), Conformers, and Transformers. Here’s how it works:

Audio pre-processing: A convolutional pre-encode module compresses raw audio into a compact representation, preserving critical acoustic features while reducing computational overhead.

Context-aware sorting: A multi-layer Fast-Conformer encoder (17 layers in the streaming variant) processes these features, extracting speaker-specific embeddings. These are then fed into an 18-layer Transformer encoder with a hidden size of 192, followed by two feedforward layers with sigmoid outputs for each frame.

Arrival-Order Speaker Cache (AOSC): The real magic happens here. Streaming Sortformer maintains a dynamic memory buffer—AOSC—that stores embeddings of all speakers detected so far. As new audio chunks arrive, the model compares them against this cache, ensuring that each participant retains a consistent label throughout the conversation. This elegant solution to the “speaker permutation problem” is what enables real-time, multi-speaker tracking without expensive recomputation.

End-to-end training: Unlike some diarization pipelines that rely on separate voice activity detection and clustering steps, Sortformer is trained end-to-end, unifying speaker separation and labeling in a single neural network.

Source: https://developer.nvidia.com/blog/identify-speakers-in-meetings-calls-and-voice-apps-in-real-time-with-nvidia-streaming-sortformer/

Integration and Deployment

Streaming Sortformer is open, production-grade, and ready for integration into existing workflows. Developers can deploy it via NVIDIA NeMo or Riva, making it a drop-in replacement for legacy diarization systems. The model accepts standard 16kHz mono-channel audio (WAV files) and outputs a matrix of speaker activity probabilities for each frame—ideal for building custom analytics or transcription pipelines.
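To illustrate consuming that output, the following sketch converts a frame-level speaker-activity matrix into labeled segments by thresholding and merging consecutive frames; the probability array and frame duration below are synthetic assumptions, not values produced by the model:

import numpy as np

FRAME_SEC = 0.08  # assumed frame hop; check the model card for the actual value
probs = np.array([  # synthetic (frames x speakers) activity probabilities
    [0.9, 0.1], [0.8, 0.2], [0.7, 0.1],
    [0.2, 0.9], [0.1, 0.8], [0.2, 0.7],
])

active = probs > 0.5  # binarize each frame per speaker
segments = []
for spk in range(active.shape[1]):
    start = None
    for i, is_on in enumerate(active[:, spk]):
        if is_on and start is None:
            start = i
        elif not is_on and start is not None:
            segments.append((f"spk_{spk}", start * FRAME_SEC, i * FRAME_SEC))
            start = None
    if start is not None:
        segments.append((f"spk_{spk}", start * FRAME_SEC, len(active) * FRAME_SEC))

print(segments)  # e.g. [('spk_0', 0.0, 0.24), ('spk_1', 0.24, 0.48)]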

Real-World Applications

The practical impact of Streaming Sortformer is vast:

Meetings and productivity: Generate live, speaker-tagged transcripts and summaries, making it easier to follow discussions and assign action items.

Contact centers: Separate agent and customer audio streams for compliance, quality assurance, and real-time coaching.

Voicebots and AI assistants: Enable more natural, context-aware dialogues by accurately tracking speaker identity and turn-taking patterns.

Media and broadcast: Automatically label speakers in recordings for editing, transcription, and moderation workflows.

Enterprise compliance: Create auditable, speaker-resolved logs for regulatory and legal requirements.

Source: https://developer.nvidia.com/blog/identify-speakers-in-meetings-calls-and-voice-apps-in-real-time-with-nvidia-streaming-sortformer/

Benchmark Performance and Limitations

In benchmarks, Streaming Sortformer achieves a lower Diarization Error Rate (DER) than recent streaming diarization systems, indicating higher accuracy in real-world conditions. However, the model is currently optimized for scenarios with up to four speakers; expanding to larger groups remains an area for future research. Performance may also vary in challenging acoustic environments or with underrepresented languages, though the architecture’s flexibility suggests room for adaptation as new training data becomes available.

Technical Highlights at a Glance

Feature | Streaming Sortformer
Max speakers | 2–4+
Latency | Low (real-time, frame-level)
Languages | English (optimized), Mandarin (validated), others possible
Architecture | CNN + Fast-Conformer + Transformer + AOSC
Integration | NVIDIA NeMo, NVIDIA Riva, Hugging Face
Output | Frame-level speaker labels, precise timestamps
GPU support | Yes (NVIDIA GPUs required)
Open source | Yes (pre-trained models, codebase)

Looking Ahead

NVIDIA’s Streaming Sortformer is not just a technical demo—it’s a production-ready tool already changing how enterprises, developers, and service providers handle multi-speaker audio. With GPU acceleration, seamless integration, and robust performance across languages, it’s poised to become the de facto standard for real-time speaker diarization in 2025 and beyond.

For AI managers, content creators, and digital marketers focused on conversational analytics, cloud infrastructure, or voice applications, Streaming Sortformer is a must-evaluate platform. Its combination of speed, accuracy, and ease of deployment makes it a compelling choice for anyone building the next generation of voice-enabled products.

Summary

NVIDIA’s Streaming Sortformer delivers instant, GPU-accelerated speaker diarization for up to four participants, with proven results in English and Mandarin. Its novel architecture and open accessibility position it as a foundational technology for real-time voice analytics—a leap forward for meetings, contact centers, AI assistants, and beyond.

FAQs: NVIDIA Streaming Sortformer

How does Streaming Sortformer handle multiple speakers in real time?

Streaming Sortformer processes audio in small, overlapping chunks and assigns consistent labels (e.g., spk_0–spk_3) as each speaker enters the conversation. It maintains a lightweight memory of detected speakers, enabling instant, frame-level diarization without waiting for the full recording. This supports fluid, low-latency experiences for live transcripts, contact centers, and voice assistants.

What hardware and setup are recommended for best performance?

It’s designed for NVIDIA GPUs to achieve low-latency inference. A typical setup uses 16 kHz mono audio input, with integration paths through NVIDIA’s speech AI stacks (e.g., NeMo/Riva) or the available pretrained models. For production workloads, allocate a recent NVIDIA GPU and ensure streaming-friendly audio buffering (e.g., 20–40 ms frames with slight overlap).

Does it support languages beyond English, and how many speakers can it track?

The current release targets English with validated performance on Mandarin and can label two to four speakers on the fly. While it can generalize to other languages to some extent, accuracy depends on acoustic conditions and training coverage. For scenarios with more than four concurrent speakers, consider segmenting the session or evaluating pipeline adjustments as model variants evolve.

Check out the Model on Hugging Face and Technical details here.

What is DeepSeek-V3.1 and Why is Everyone Talking About It?

The Chinese AI startup DeepSeek has released DeepSeek-V3.1, its latest flagship language model. It builds on the architecture of DeepSeek-V3, adding significant enhancements to reasoning, tool use, and coding performance. Notably, DeepSeek models have rapidly gained a reputation for delivering OpenAI and Anthropic-level performance at a fraction of the cost.

Model Architecture and Capabilities

Hybrid Thinking Mode: DeepSeek-V3.1 supports both thinking (chain-of-thought reasoning, more deliberative) and non-thinking (direct, stream-of-consciousness) generation, switchable via the chat template. This is a departure from previous versions and offers flexibility for varied use cases.

Tool and Agent Support: The model has been optimized for tool calling and agent tasks (e.g., using APIs, code execution, search). Tool calls use a structured format, and the model supports custom code agents and search agents, with detailed templates provided in the repository.

Massive Scale, Efficient Activation: The model boasts 671B total parameters, with 37B activated per token—a Mixture-of-Experts (MoE) design that lowers inference costs while maintaining capacity. The context window is 128K tokens, much larger than most competitors.

Long Context Extension: DeepSeek-V3.1 uses a two-phase long-context extension approach. The first phase (32K) was trained on 630B tokens (10x more than V3), and the second (128K) on 209B tokens (3.3x more than V3). The model is trained with FP8 microscaling for efficient arithmetic on next-gen hardware.

Chat Template: The template supports multi-turn conversations with explicit tokens for system prompts, user queries, and assistant responses. The thinking and non-thinking modes are triggered by <think> and </think> tokens in the prompt sequence.
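As a rough sketch of switching between the two modes from Python, the following uses DeepSeek's OpenAI-compatible API; the endpoint and the assumption that "deepseek-chat" maps to non-thinking mode and "deepseek-reasoner" to thinking mode reflect DeepSeek's documentation at the time of writing and should be verified before use:

from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible endpoint; the API key below is a placeholder.
client = OpenAI(api_key="<DEEPSEEK_API_KEY>", base_url="https://api.deepseek.com")

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

# Non-thinking mode: direct generation (assumed model ID: deepseek-chat).
fast = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": question}],
)

# Thinking mode: deliberate chain-of-thought reasoning (assumed model ID: deepseek-reasoner).
deliberate = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": question}],
)

print(fast.choices[0].message.content)
print(deliberate.choices[0].message.content)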

Performance Benchmarks

DeepSeek-V3.1 is evaluated across a wide range of benchmarks (see table below), including general knowledge, coding, math, tool use, and agent tasks. Here are highlights:

Metric | V3.1-NonThinking | V3.1-Thinking | Competitor (R1-0528)
MMLU-Redux (EM) | 91.8 | 93.7 | 93.4
MMLU-Pro (EM) | 83.7 | 84.8 | 85.0
GPQA-Diamond (Pass@1) | 74.9 | 80.1 | 81.0
LiveCodeBench (Pass@1) | 56.4 | 74.8 | 73.3
AIME 2025 (Pass@1) | 49.8 | 88.4 | 87.5
SWE-bench (Agent mode) | 54.5 | — | 30.5

The thinking mode consistently matches or exceeds previous state-of-the-art versions, especially in coding and math. The non-thinking mode is faster but slightly less accurate, making it ideal for latency-sensitive applications.

Tool and Code Agent Integration

Tool Calling: Structured tool invocations are supported in non-thinking mode, allowing for scriptable workflows with external APIs and services.

Code Agents: Developers can build custom code agents by following the provided trajectory templates, which detail the interaction protocol for code generation, execution, and debugging. DeepSeek-V3.1 can use external search tools for up-to-date information, a feature critical for business, finance, and technical research applications.

Deployment

Open Source, MIT License: All model weights and code are freely available on Hugging Face and ModelScope under the MIT license, encouraging both research and commercial use.

Local Inference: The model structure is compatible with DeepSeek-V3, and detailed instructions for local deployment are provided. Running requires significant GPU resources due to the model’s scale, but the open ecosystem and community tools lower barriers to adoption.

Summary

DeepSeek-V3.1 represents a milestone in the democratization of advanced AI, demonstrating that open-source models can be both cost-efficient and highly capable. Its blend of scalable reasoning, tool integration, and exceptional performance in coding and math tasks positions it as a practical choice for both research and applied AI development.

Check out the Model on Hugging Face.

Fine-tune OpenAI GPT-OSS models using Amazon SageMaker HyperPod recipes

This post is the second part of the GPT-OSS series focusing on model customization with Amazon SageMaker AI. In Part 1, we demonstrated fine-tuning GPT-OSS models using open source Hugging Face libraries with SageMaker training jobs, which supports distributed multi-GPU and multi-node configurations, so you can spin up high-performance clusters on demand.
In this post, we show how you can fine-tune GPT-OSS models using recipes on SageMaker HyperPod and SageMaker training jobs. SageMaker HyperPod recipes help you get started with training and fine-tuning popular publicly available foundation models (FMs) such as Meta’s Llama, Mistral, and DeepSeek in just minutes, using either SageMaker HyperPod or training jobs. The recipes provide pre-built, validated configurations that alleviate the complexity of setting up distributed training environments while maintaining enterprise-grade performance and scalability for models. We outline steps to fine-tune the GPT-OSS model on a multilingual reasoning dataset, HuggingFaceH4/Multilingual-Thinking, so GPT-OSS can handle structured, chain-of-thought (CoT) reasoning across multiple languages.
Solution overview
This solution uses SageMaker HyperPod recipes to run a fine-tuning job on HyperPod using Amazon Elastic Kubernetes Service (Amazon EKS) orchestration or training jobs. Recipes are processed through the SageMaker HyperPod recipe launcher, which serves as the orchestration layer responsible for launching a job on the corresponding architecture such as SageMaker HyperPod (Slurm or Amazon EKS) or training jobs. To learn more, see SageMaker HyperPod recipes.
For details on fine-tuning the GPT-OSS model, see Fine-tune OpenAI GPT-OSS models on Amazon SageMaker AI using Hugging Face libraries.
In the following sections, we discuss the prerequisites for both options, and then move on to the data preparation. The prepared data is saved to Amazon FSx for Lustre, which is used as the persistent file system for SageMaker HyperPod, or Amazon Simple Storage Service (Amazon S3) for training jobs. We then use recipes to submit the fine-tuning job, and finally deploy the trained model to a SageMaker endpoint for testing and evaluating the model. The following diagram illustrates this architecture.

Prerequisites
To follow along, you must have the following prerequisites:

A local development environment with AWS credentials configured for creating and accessing SageMaker resources, or a remote environment such as Amazon SageMaker Studio.
For SageMaker HyperPod fine-tuning, complete the following:

Make sure you have one ml.p5.48xlarge instance (with 8 x NVIDIA H100 GPUs) for cluster usage. If you don’t have sufficient limits, request the following SageMaker quotas on the Service Quotas console: P5 instance (ml.p5.48xlarge) for HyperPod clusters (ml.p5.48xlarge for cluster usage): 1.
Set up a SageMaker HyperPod cluster on Amazon EKS. For instructions, refer to Orchestrating SageMaker HyperPod clusters with Amazon EKS. Alternatively, you can use the AWS CloudFormation template provided in the Amazon EKS Support in Amazon SageMaker HyperPod workshop and follow the instructions to set up a cluster and a development environment to access and submit jobs to the cluster.
Set up an FSx for Lustre file system for saving and loading data and checkpoints. Refer to Set Up an FSx for Lustre File System to set up an FSx for Lustre volume and associate it with the cluster.

For fine-tuning the model using SageMaker training jobs, you must have one ml.p5.48xlarge instance (with 8 x NVIDIA H100 GPUs) for training jobs usage. If you don’t have sufficient limits, request the following SageMaker quotas on the Service Quotas console: P5 instance (ml.p5.48xlarge) for training jobs (ml.p5.48xlarge for cluster usage): 1.

It might take up to 24 hours for these limits to be approved. You can also use SageMaker training plans to reserve these instances for a specific timeframe and use case (cluster or training jobs usage). For more details, see Reserve training plans for your training jobs or HyperPod clusters.
Next, use your preferred development environment to prepare the dataset for fine-tuning. You can find the full code in the Generative AI using Amazon SageMaker repository on GitHub.
Data tokenization
We use the HuggingFaceH4/Multilingual-Thinking dataset, which is a multilingual reasoning dataset containing CoT examples translated into languages such as French, Spanish, and German. The recipe supports a sequence length of 4,000 tokens for the GPT-OSS 120B model. The following example code demonstrates how to tokenize the Multilingual-Thinking dataset. The recipe accepts data in Hugging Face (Arrow) format. After it’s tokenized, you can save the processed dataset to disk.

from datasets import load_dataset
from transformers import AutoTokenizer
import numpy as np

# boto3 and os are required by the S3 upload helper used for training jobs.
import boto3
import os

dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")
messages = dataset[0]["messages"]
conversation = tokenizer.apply_chat_template(messages, tokenize=False)
print(conversation)

def preprocess_function(example):
    return tokenizer.apply_chat_template(example["messages"],
                                         return_dict=True,
                                         padding="max_length",
                                         max_length=4096,
                                         truncation=True)

def label(x):
    x["labels"] = np.array(x["input_ids"])
    x["labels"][x["labels"] == tokenizer.pad_token_id] = -100
    x["labels"] = x["labels"].tolist()
    return x

dataset = dataset.map(preprocess_function,
                      remove_columns=["reasoning_language",
                                      "developer",
                                      "user",
                                      "analysis",
                                      "final",
                                      "messages"])
dataset = dataset.map(label)

# For HyperPod, save to the mounted FSx volume
dataset.save_to_disk("/fsx/multilingual_4096")

# For training jobs, save locally and upload to S3
dataset.save_to_disk("multilingual_4096")

def upload_directory(local_dir, bucket_name, s3_prefix=""):
    s3_client = boto3.client("s3")

    for root, dirs, files in os.walk(local_dir):
        for file in files:
            local_path = os.path.join(root, file)
            # Calculate the relative path to use as the S3 key
            relative_path = os.path.relpath(local_path, local_dir)
            s3_path = os.path.join(s3_prefix, relative_path).replace("\\", "/")

            print(f"Uploading {local_path} to {s3_path}")
            s3_client.upload_file(local_path, bucket_name, s3_path)

upload_directory("./multilingual_4096/", "<your-bucket>", "multilingual_4096")

Now that you have prepared and tokenized the dataset, you can fine-tune the GPT-OSS model on your dataset, using either SageMaker HyperPod or training jobs. SageMaker training jobs are ideal for one-off or periodic training workloads that need temporary compute resources, making it a fully managed, on-demand experience for your training needs. SageMaker HyperPod is optimal for continuous development and experimentation, providing a persistent, preconfigured, and failure-resilient cluster. Depending on your choice, skip to the appropriate section for next steps.
Fine-tune the model using SageMaker HyperPod
To fine-tune the model using HyperPod, start by setting up the virtual environment and installing the necessary dependencies to execute the training job on the EKS cluster. Make sure the cluster is InService before proceeding, and you’re using Python 3.9 or greater in your development environment.

python3 -m venv ${PWD}/venv
source venv/bin/activate

Next, download and set up the SageMaker HyperPod recipes repository:

git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
cd sagemaker-hyperpod-recipes
pip3 install -r requirements.txt 

You can now use the SageMaker HyperPod recipe launch scripts to submit your training job. Using the recipe involves updating the k8s.yaml configuration file and executing the launch script.
In recipes_collection/cluster/k8s.yaml, update the persistent_volume_claims section. It mounts the FSx claim to the /fsx directory of each computing pod:

- claimName: fsx-claim
  mountPath: fsx

SageMaker HyperPod recipes provide a launch script for each recipe within the launcher_scripts directory. To fine-tune the GPT-OSS-120B model, update the launch script located at launcher_scripts/gpt_oss/run_hf_gpt_oss_120b_seq4k_gpu_lora.sh, setting the cluster and cluster_type parameters.
The updated launch script should look similar to the following code when running SageMaker HyperPod with Amazon EKS. Make sure that cluster=k8s and cluster_type=k8s are set in the launch script:

#!/bin/bash

# Original Copyright (c), NVIDIA CORPORATION. Modifications © Amazon.com

# Users should set up their cluster type in /recipes_collection/config.yaml

SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}

HF_MODEL_NAME_OR_PATH="openai/gpt-oss-120b" # HuggingFace pretrained model name or path

TRAIN_DIR="/fsx/multilingual_4096" # Location of training dataset
VAL_DIR="/fsx/multilingual_4096" # Location of validation dataset

EXP_DIR="/fsx/experiment" # Location to save experiment info including logging, checkpoints, etc.
HF_ACCESS_TOKEN="hf_xxxxxxxx" # Optional HuggingFace access token

# Important: include cluster=k8s and cluster_type=k8s when running on SageMaker HyperPod with Amazon EKS
HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=fine-tuning/gpt_oss/hf_gpt_oss_120b_seq4k_gpu_lora \
    container="658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:sm-pytorch_gpt_oss_patch_pt-2.7_cuda12.8" \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-gpt-oss-120b-lora" \
    cluster=k8s \
    cluster_type=k8s \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.trainer.num_nodes=1 \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
    recipes.model.hf_access_token="$HF_ACCESS_TOKEN"

When the script is ready, you can launch fine-tuning of the GPT-OSS-120B model using the following code:

chmod +x launcher_scripts/gpt_oss/run_hf_gpt_oss_120b_seq4k_gpu_lora.sh
bash launcher_scripts/gpt_oss/run_hf_gpt_oss_120b_seq4k_gpu_lora.sh

After submitting a job for fine-tuning, you can use the following command to verify successful submission. You should be able to see the pods running in your cluster:

kubectl get pods
NAME                                READY  STATUS   RESTARTS   AGE
hf-gpt-oss-120b-lora-h2cwd-worker-0 1/1    Running  0          14m

To check logs for the job, you can use the kubectl logs command:
kubectl logs -f hf-gpt-oss-120b-lora-h2cwd-worker-0
You should be able to see the following logs when the training begins and completes. You will find the checkpoints written to the /fsx/experiment/checkpoints folder.

warnings.warn(
    
Epoch 0:  40%|████      | 50/125 [08:47<13:10,  0.09it/s, Loss/train=0.254, Norms/grad_norm=0.128, LR/learning_rate=2.2e-6] [NeMo I 2025-08-18 17:49:48 nemo_logging:381] save SageMakerCheckpointType.PEFT_FULL checkpoint: /fsx/experiment/checkpoints/peft_full/steps_50
[NeMo I 2025-08-18 17:49:48 nemo_logging:381] Saving PEFT checkpoint to /fsx/experiment/checkpoints/peft_full/steps_50
[NeMo I 2025-08-18 17:49:49 nemo_logging:381] Loading Base model from : openai/gpt-oss-120b
You are attempting to use Flash Attention 2 without specifying a torch dtype. This might lead to unexpected behaviour
Loading checkpoint shards: 100%|██████████| 15/15 [01:49<00:00,  7.33s/it]
[NeMo I 2025-08-18 17:51:39 nemo_logging:381] Merging the adapter, this might take a while……
Unloading and merging model: 100%|██████████| 547/547 [00:07<00:00, 71.27it/s]
[NeMo I 2025-08-18 17:51:47 nemo_logging:381] Checkpointing to /fsx/experiment/checkpoints/peft_full/steps_50/final-model……
[NeMo I 2025-08-18 18:00:14 nemo_logging:381] Successfully save the merged model checkpoint.
`Trainer.fit` stopped: `max_steps=50` reached.
Epoch 0:  40%|████      | 50/125 [23:09<34:43,  0.04it/s, Loss/train=0.264, Norms/grad_norm=0.137, LR/learning_rate=2e-6]  

When the training is complete, the final merged model can be found in the experiment directory path you defined in the launcher script under /fsx/experiment/checkpoints/peft_full/steps_50/final-model.
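To deploy the fine-tuned model from Amazon S3 (as described in the Run inference section later in this post), you also need a copy of the merged artifacts in an S3 bucket. One possible approach, sketched below, reuses the upload_directory helper defined earlier in the data preparation step; the bucket name and prefix are placeholders to adapt to your environment.

# Sketch: copy the merged checkpoint from the FSx mount to S3 for later deployment.
# Reuses the upload_directory helper defined earlier; bucket and prefix are placeholders.
upload_directory("/fsx/experiment/checkpoints/peft_full/steps_50/final-model",
                 "<your-bucket>",
                 "gpt-oss-120b/final-model")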
Fine-tune using SageMaker training jobs
You can also use recipes directly with SageMaker training jobs using the SageMaker Python SDK. The training jobs automatically spin up the compute, load the input data, run the training script, save the model to your output location, and tear down the instances, for a smooth training experience.
The following code snippet shows how to use recipes with the PyTorch estimator. You can use the training_recipe parameter to specify the training or fine-tuning recipe to be used, and recipe_overrides for any parameters that need replacement. For training jobs, update the input, output, and results directories to locations in /opt/ml as required by SageMaker training jobs.

import os
import sagemaker, boto3
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import FileSystemInput

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
output = os.path.join(f"s3://{bucket}", "output")

recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
    "use_smp_model": "False",
}

# create the estimator object
estimator = PyTorch(
    output_path=output,
    base_job_name="gpt-oss-recipe",
    role=role,
    instance_type="ml.p5.48xlarge",
    training_recipe="fine-tuning/gpt_oss/hf_gpt_oss_120b_seq4k_gpu_lora",
    recipe_overrides=recipe_overrides,
    sagemaker_session=sagemaker_session,
    image_uri="658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:sm-pytorch_gpt_oss_patch_pt-2.7_cuda12.8",
)

# submit the training job
estimator.fit(
    inputs={
        "train": f"s3://{bucket}/datasets/multilingual_4096/",
        "val": f"s3://{bucket}/datasets/multilingual_4096/",
    },
    wait=True,
)

After the job is submitted, you can monitor the status of your training job on the SageMaker console, by choosing Training jobs under Training in the navigation pane. Choose the training job that starts with gpt-oss-recipe to view its details and logs. When the training job is complete, the outputs will be saved to an S3 location. You can get the location of the output artifacts from the S3 model artifact section on the job details page.
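If you prefer to retrieve the artifact location programmatically instead of using the console, the estimator exposes it after training finishes. The following is a minimal sketch that assumes the estimator.fit() call above has completed:

# Print the S3 URI of the model artifacts produced by the training job.
print(estimator.model_data)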
Run inference
After you fine-tune your GPT-OSS model with SageMaker recipes on either SageMaker training jobs or SageMaker HyperPod, the output is a customized model artifact that merges the base model with the customized PEFT adapters. This final model is stored in Amazon S3 and can be deployed directly from Amazon S3 to SageMaker endpoints for real-time inference.
To serve GPT-OSS models, you must use a recent vLLM container (v0.10.1 or later). A full list of vllm-openai Docker image versions is available on Docker Hub.
The steps to deploy your fine-tuned GPT-OSS model are outlined in this section.
Build the latest GPT-OSS container for your SageMaker endpoint
If you're deploying the model from SageMaker Studio using JupyterLab or the Code Editor, both environments come with Docker preinstalled. Make sure that you're using the SageMaker Distribution image v3.0 or later for compatibility. You can build your deployment container by running the following commands:

%%bash # <- use this if you're running this inside a JupyterLab cell

# navigate to the deploy dir from the current workdir, to build the container
cd ./deploy

# build and push the container
chmod +x build.sh
bash build.sh

cd ..

If you’re running these commands from a local terminal or other environment, simply omit the %%bash line and run the commands as standard shell commands.
The build.sh script is responsible for automatically building and pushing a vllm-openai container that is optimized for SageMaker endpoints. After it’s built, the custom SageMaker endpoint compatible vllm image is pushed to Amazon Elastic Container Registry (Amazon ECR). SageMaker endpoints can then pull this image from Amazon ECR at runtime to spin up the container for inference.
The following is an example of the build.sh script:

export REGION={region}
export ACCOUNT_ID={account_id}
export REPOSITORY_NAME=vllm
export TAG=v0.10.1

full_name="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPOSITORY_NAME}:${TAG}"

echo "building $full_name"

DOCKER_BUILDKIT=0 docker build . --network sagemaker --tag $full_name --file Dockerfile

aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --region ${REGION} --repository-names "${REPOSITORY_NAME}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --region ${REGION} --repository-name "${REPOSITORY_NAME}" > /dev/null
fi

docker tag $REPOSITORY_NAME:$TAG ${full_name}
docker push ${full_name}

The Dockerfile defines how we convert an open source vLLM Docker image into a SageMaker hosting-compatible image. This involves extending the base vllm-openai image, adding the serve entrypoint script, and making it executable. See the following example Dockerfile:

FROM vllm/vllm-openai:v0.10.1

COPY serve /usr/bin/serve
RUN chmod 777 /usr/bin/serve

ENTRYPOINT [ "/usr/bin/serve" ]

The serve script acts as a translation layer between SageMaker hosting conventions and the vLLM runtime. You can maintain the same deployment workflow you’re familiar with when hosting models on SageMaker endpoints, while automatically converting SageMaker-specific configurations into the format expected by vLLM.
Key points to note about this script:

It enforces the use of port 8080, which SageMaker requires for inference containers
It dynamically translates environment variables prefixed with OPTION_ into CLI arguments for vLLM (for example, OPTION_MAX_MODEL_LEN=4096 becomes --max-model-len 4096)
It prints the final set of arguments for visibility
It finally launches the vLLM API server with the translated arguments

The following is an example serve script:

#!/bin/bash

# Define the prefix for environment variables to look for
PREFIX="OPTION_"
ARG_PREFIX="--"

# Initialize an array for storing the arguments
# port 8080 required by sagemaker, https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-container-response
ARGS=(--port 8080)

# Loop through all environment variables
while IFS='=' read -r key value; do
    # Remove the prefix from the key, convert to lowercase, and replace underscores with dashes
    arg_name=$(echo "${key#"${PREFIX}"}" | tr '[:upper:]' '[:lower:]' | tr '_' '-')

    # Add the argument name and value to the ARGS array
    ARGS+=("${ARG_PREFIX}${arg_name}")
    if [ -n "$value" ]; then
        ARGS+=("$value")
    fi
done < <(env | grep "^${PREFIX}")

echo "-------------------------------------------------------------------"
echo "vLLM engine args: [${ARGS[@]}]"
echo "-------------------------------------------------------------------"

# Pass the collected arguments to the main entrypoint
exec python3 -m vllm.entrypoints.openai.api_server "${ARGS[@]}"

Host customized GPT-OSS as a SageMaker real-time endpoint
Now you can deploy your fine-tuned GPT-OSS model using the ECR image URI you built in the previous step. In this example, the model artifacts are stored securely in an S3 bucket, and SageMaker will download them into the container at runtime. Complete the following configurations:

Set model_data to point to the S3 prefix where your model artifacts are located
Set the OPTION_MODEL environment variable to /opt/ml/model, which is where SageMaker mounts the model inside the container
(Optional) If you’re serving a model from Hugging Face Hub instead of Amazon S3, you can set OPTION_MODEL directly to the Hugging Face model ID instead

The endpoint startup might take several minutes as the model artifacts are downloaded and the container is initialized.The following is an example deployment code:

inference_image = f"{account_id}.dkr.ecr.{region}.amazonaws.com/vllm:v0.10.1"

lmi_model = sagemaker.Model(
    image_uri=inference_image,
    env={
        "OPTION_MODEL": "/opt/ml/model",  # set this to let the SageMaker endpoint read a model stored in S3; otherwise set it to a HF model ID
        "OPTION_SERVED_MODEL_NAME": "model",
        "OPTION_TENSOR_PARALLEL_SIZE": json.dumps(num_gpus),
        "OPTION_DTYPE": "bfloat16",
        # "VLLM_ATTENTION_BACKEND": "TRITON_ATTN_VLLM_V1",  # not required for vLLM 0.10.1 and above
        "OPTION_ASYNC_SCHEDULING": "true",
        "OPTION_QUANTIZATION": "mxfp4"
    },
    role=role,
    name=model_name,
    model_data={
        "S3DataSource": {
            "S3Uri": "s3://path/to/gpt-oss/model/artifacts",
            "S3DataType": "S3Prefix",
            "CompressionType": "None"
        }
    },
)

lmi_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=600,
    endpoint_name=endpoint_name,
    endpoint_type=sagemaker.enums.EndpointType.INFERENCE_COMPONENT_BASED,
    inference_component_name=inference_component_name,
    resources=ResourceRequirements(requests={"num_accelerators": 1, "memory": 1024 * 3, "copies": 1}),
)

Sample inference
After your endpoint is deployed and in the InService state, you can invoke your fine-tuned GPT-OSS model using the SageMaker Python SDK.
The following is an example predictor setup:

pretrained_predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker.Session(boto3.Session(region_name=boto3.Session().region_name)),
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
    component_name=inference_component_name
)

The modified vLLM container is fully compatible with the OpenAI-style messages input format, making it straightforward to send chat-style requests:

payload = {
    "messages": [{"role": "user", "content": "Hello who are you?"}],
    "parameters": {"max_new_tokens": 64, "temperature": 0.2}
}

output = pretrained_predictor.predict(payload)
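Because the container exposes an OpenAI-compatible chat completions API, the response generally follows that schema. The following minimal sketch extracts the generated text; the field names assume the standard OpenAI-style response and may vary with the vLLM version:

# Extract the assistant message from an OpenAI-style chat completions response.
generated_text = output["choices"][0]["message"]["content"]
print(generated_text)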

You have successfully deployed and invoked your custom fine-tuned GPT-OSS model on SageMaker real-time endpoints, using the vLLM framework for optimized, low-latency inference. You can find more GPT-OSS hosting examples in the OpenAI gpt-oss examples GitHub repo.
Clean up
To avoid incurring additional charges, complete the following steps to clean up the resources used in this post:

Delete the SageMaker endpoint:

pretrained_predictor.delete_endpoint()

If you created a SageMaker HyperPod cluster for the purposes of this post, delete the cluster by following the instructions in Deleting a SageMaker HyperPod cluster.
Clean up the FSx for Lustre volume if it’s no longer needed by following instructions in Deleting a file system.
If you used training jobs, the training instances are automatically deleted when the jobs are complete.

Conclusion
In this post, we showed how to fine-tune OpenAI’s GPT-OSS models (gpt-oss-120b and gpt-oss-20b) on SageMaker AI using SageMaker HyperPod recipes. We discussed how SageMaker HyperPod recipes provide a powerful yet accessible solution for organizations to scale their AI model training capabilities with large language models (LLMs) including GPT-OSS, using either a persistent cluster through SageMaker HyperPod, or an ephemeral cluster using SageMaker training jobs. The architecture streamlines complex distributed training workflows through its intuitive recipe-based approach, reducing setup time from weeks to minutes. We also showed how these fine-tuned models can be seamlessly deployed to production using SageMaker endpoints with vLLM optimization, providing enterprise-grade inference capabilities with OpenAI-compatible APIs. This end-to-end workflow, from training to deployment, helps organizations build and serve custom LLM solutions while using the scalable infrastructure of AWS and comprehensive ML platform capabilities of SageMaker.
To begin using the SageMaker HyperPod recipes, visit the Amazon SageMaker HyperPod recipes GitHub repo for comprehensive documentation and example implementations. If you’re interested in exploring the fine-tuning further, the Generative AI using Amazon SageMaker GitHub repo has the necessary code and notebooks. Our team continues to expand the recipe ecosystem based on customer feedback and emerging ML trends, making sure that you have the tools needed for successful AI model training.
Special thanks to everyone who contributed to the launch: Hengzhi Pei, Zach Kimberg, Andrew Tian, Leonard Lausen, Sanjay Dorairaj, Manish Agarwal, Sareeta Panda, Chang Ning Tsai, Maxwell Nuyens, Natasha Sivananjaiah, and Kanwaljit Khurmi.

About the authors
Durga Sury is a Senior Solutions Architect at Amazon SageMaker, where she helps enterprise customers build secure and scalable AI/ML systems. When she’s not architecting solutions, you can find her enjoying sunny walks with her dog, immersing herself in murder mystery books, or catching up on her favorite Netflix shows.
Pranav Murthy is a Senior Generative AI Data Scientist at AWS, specializing in helping organizations innovate with Generative AI, Deep Learning, and Machine Learning on Amazon SageMaker AI. Over the past 10+ years, he has developed and scaled advanced computer vision (CV) and natural language processing (NLP) models to tackle high-impact problems—from optimizing global supply chains to enabling real-time video analytics and multilingual search. When he’s not building AI solutions, Pranav enjoys playing strategic games like chess, traveling to discover new cultures, and mentoring aspiring AI practitioners. You can find Pranav on LinkedIn.
Sumedha Swamy is a Senior Manager of Product Management at Amazon Web Services (AWS), where he leads several areas of the Amazon SageMaker, including SageMaker Studio – the industry-leading integrated development environment for machine learning, developer and administrator experiences, AI infrastructure, and SageMaker SDK.
Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in Generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. You can connect with Dmitry on LinkedIn.
Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.
Anirudh Viswanathan is a Senior Product Manager, Technical, at AWS with the SageMaker team, where he focuses on Machine Learning. He holds a Master’s in Robotics from Carnegie Mellon University and an MBA from the Wharton School of Business. Anirudh is a named inventor on more than 50 AI/ML patents. He enjoys long-distance running, exploring art galleries, and attending Broadway shows.

Inline code nodes now supported in Amazon Bedrock Flows in public preview

Today, we are excited to announce the public preview of support for inline code nodes in Amazon Bedrock Flows. With this powerful new capability, you can write Python scripts directly within your workflow, alleviating the need for separate AWS Lambda functions for simple logic. This feature streamlines preprocessing and postprocessing tasks (like data normalization and response formatting), simplifying generative AI application development and making it more accessible across organizations. By removing adoption barriers and reducing maintenance overhead, the inline code feature accelerates enterprise adoption of generative AI solutions, resulting in faster iteration cycles and broader participation in AI application building.
Organizations using Amazon Bedrock Flows now can use inline code nodes to design and deploy workflows for building more scalable and efficient generative AI applications fully within the Amazon Bedrock environment while achieving the following:

Preprocessing – Transforming input data before sending it to a large language model (LLM) without having to set up a separate Lambda function. For example, extracting specific fields from JSON, formatting text data, or normalizing values.
Postprocessing – Performing operations on model outputs directly within the flow. For example, extracting entities from responses, formatting JSON for downstream systems, or applying business rules to the results.
Complex use cases – Managing the execution of complex, multi-step generative AI workflows that can call popular packages like opencv, scipy, or pypdf.
Builder-friendly – Creating and managing inline code through both the Amazon Bedrock API and the AWS Management Console.
Observability – Seamless user experience with the ability to trace the inputs and outputs from each node.

In this post, we discuss the benefits of this new feature, and show how to use inline code nodes in Amazon Bedrock Flows.
Benefits of inline code in Amazon Bedrock Flows
Thomson Reuters, a global information services company providing essential news, insights, and technology solutions to professionals across legal, tax, accounting, media, and corporate sectors, handles complex, multi-step generative AI use cases that require simple preprocessing and postprocessing as part of the workflow. With the inline code feature in Amazon Bedrock Flows, Thomson Reuters can now benefit from the following:

Simplified flow management – Alleviate the need to create and maintain individual Lambda functions for each custom code block, making it straightforward to manage thousands of workflows across a large user base (over 16,000 users and 6,000 chains) with less operational overhead.
Flexible data processing – Enable direct preprocessing of data before LLM calls and postprocessing of LLM responses, including the ability to interact with internal AWS services and third-party APIs through a single interface.
DIY flow creation – Help users build complex workflows with custom code blocks through a self-service interface, without exposing them to the underlying infrastructure complexities or requiring Lambda function management.

Solution overview
In the following sections, we show how to create a simple Amazon Bedrock flow and add inline code nodes. Our example showcases a practical application where we’ll construct a flow that processes user requests for music playlists, incorporating both preprocessing and postprocessing inline code nodes to handle data validation and response formatting.
Prerequisites
Before implementing the new capabilities, make sure you have the following:

An AWS account
Other Amazon Bedrock services in place:

Create and test your base prompts for customer service interactions in Amazon Bedrock Prompt Management
Create guardrails with relevant rules using Amazon Bedrock Guardrails

Resources in auxiliary AWS services needed for your workflow, such as Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), and Amazon Simple Notification Service (Amazon SNS)
Required AWS Identity and Access Management (IAM) permissions:

Access to Amazon Bedrock Flows
Appropriate access to LLMs in Amazon Bedrock

After these components are in place, you can proceed with using Amazon Bedrock Flows with inline code capabilities in your generative AI use case.
Create your flow using inline code nodes
Complete the following steps to create your flow:

On the Amazon Bedrock console, choose Flows under Builder tools in the navigation pane.
Create a new flow, for example, easy-inline-code-flow. For detailed instructions on creating a flow, see Amazon Bedrock Flows is now generally available with enhanced safety and traceability.
Add an inline code node. (For this example, we create two nodes for two separate prompts).

Amazon Bedrock provides different node types to build your prompt flow. For this example, we use an inline code node instead of calling a Lambda function for custom code for a generative AI-powered application. There are two inline code nodes in the flow. We have extended the sample from the documentation Create a flow with a single prompt. The new node type Inline Code is on the Nodes tab in the left pane.

Add some code to the Preprocessing_InlineCode node to process the input before sending it to the prompt node prompt_1. Only Python 3 is supported at the time of writing. In this example, if the number of songs requested by the user is more than 10, we cap it at 10.

A Python code editor and sample code templates are available for writing the code.

We use the following code:

import json

def __func():
    try:
        if userprompt['number'] > 10:
            userprompt['number'] = 10
        return userprompt
    except Exception as e:
        return {
            "error": "Invalid input format",
            "details": str(e)
        }

__func()

In the Postprocessing_InlineCode node, we check the number of words in the response and feed the data to the next prompt node, prompt_2.

def __func():
    # Remove extra whitespace and count
    cleaned_text = ' '.join(playlist.split())
    word_count = len(cleaned_text.split())
    return {
        "playlist": playlist,
        "word_count": word_count
    }

__func()

Test the flow with the following prompt:

Sample input for the Flow Input node:
{
    "genre": "pop",
    "number": 8
}

Input to the inline code node (Python function) must be treated as untrusted user input, and appropriate parsing, validation, and data handling should be implemented.
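As an illustration, a defensive check such as the following hypothetical sketch could be added to the preprocessing inline code node; it validates the same userprompt input used above before the value is passed downstream:

def __func():
    # Reject anything that is not a dictionary with a positive integer 'number' field.
    if not isinstance(userprompt, dict) or not isinstance(userprompt.get("number"), int) or userprompt["number"] < 1:
        return {"error": "Invalid input: 'number' must be a positive integer"}
    # Cap the requested number of songs at 10, as in the example above.
    userprompt["number"] = min(userprompt["number"], 10)
    return userprompt

__func()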
You can see the output as shown in the following screenshot. The system also provides access to node execution traces, offering detailed insights into each processing step and real-time performance metrics, and highlighting any issues that occurred during the flow's execution. Traces can be enabled using an API and sent to Amazon CloudWatch Logs. In the API, set the enableTrace field to true in an InvokeFlow request. Each flowOutputEvent in the response is returned alongside a flowTraceEvent.
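For example, you can enable traces when invoking the flow with the AWS SDK for Python (Boto3). In the following sketch, the flow and alias identifiers are placeholders, and the input node name assumes the default FlowInputNode:

import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.invoke_flow(
    flowIdentifier="<flow-id>",
    flowAliasIdentifier="<flow-alias-id>",
    enableTrace=True,  # return flowTraceEvent entries alongside flowOutputEvent
    inputs=[{
        "nodeName": "FlowInputNode",
        "nodeOutputName": "document",
        "content": {"document": {"genre": "pop", "number": 8}},
    }],
)

# The response is an event stream; trace events are interleaved with output events.
for event in response["responseStream"]:
    print(event)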

You have now successfully created and executed an Amazon Bedrock flow using inline code nodes. You can also use Amazon Bedrock APIs to programmatically execute this flow. For additional details on how to configure flows with enhanced safety and traceability, see Amazon Bedrock Flows is now generally available with enhanced safety and traceability.
Considerations
When working with inline code nodes in Amazon Bedrock Flows, note the following important points:

Code is executed in an AWS managed, secured, sandbox environment that is not shared with anyone and doesn’t have internet access
The feature supports Python 3.12 and above
It efficiently handles code with binary size up to 4 MB, which is roughly 4 million characters
It supports popular packages like opencv, scipy, and pypdf
It supports 25 concurrent code execution sessions per AWS account

Conclusion
The integration of inline code nodes in Amazon Bedrock Flows marks a significant advancement in democratizing generative AI development, reducing the complexity of managing separate Lambda functions for basic processing tasks. This enhancement responds directly to enterprise customers’ needs for a more streamlined development experience, helping developers focus on building sophisticated AI workflows rather than managing infrastructure.
Inline code in Amazon Bedrock Flows is now available in public preview in the following AWS Regions: US East (N. Virginia, Ohio), US West (Oregon) and Europe (Frankfurt). To get started, open the Amazon Bedrock console or Amazon Bedrock APIs to begin building flows with Amazon Bedrock Flows. To learn more, refer to Create your first flow in Amazon Bedrock and Track each step in your flow by viewing its trace in Amazon Bedrock.
We’re excited to see the innovative applications you will build with these new capabilities. As always, we welcome your feedback through AWS re:Post for Amazon Bedrock or your usual AWS contacts. Join the generative AI builder community at community.aws to share your experiences and learn from others.

About the authors
Shubhankar Sumar is a Senior Solutions Architect at AWS, where he specializes in architecting generative AI-powered solutions for enterprise software and SaaS companies across the UK. With a strong background in software engineering, Shubhankar excels at designing secure, scalable, and cost-effective multi-tenant systems on the cloud. His expertise lies in seamlessly integrating cutting-edge generative AI capabilities into existing SaaS applications, helping customers stay at the forefront of technological innovation.
Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

Accelerate enterprise AI implementations with Amazon Q Business

As an Amazon Web Services (AWS) enterprise customer, you’re probably exploring ways to use generative AI to enhance your business processes, improve customer experiences, and drive innovation.
With a variety of options available—from Amazon Q Business to other AWS services or third-party offerings—choosing the right tool for your use case can be challenging. This post aims to guide you through the decision-making process and highlight the unique advantages of Amazon Q Business and how to build an AWS architecture to get started and onboard more use cases.
Amazon Q Business is an AI-powered assistant that can help employees quickly find information, solve problems, and get work done across their company’s data and applications. With Amazon Q Business, employees can access information from various internal documents, websites, wikis, and other business resources through natural conversations, helping them to find exactly what they need without extensive searching. It can also be used to automate common workflows across enterprise systems. Amazon Q Business prioritizes security and privacy by operating within your organization’s existing permissions and access controls, helping to ensure that employees only see information that they’re authorized to access.
Understand your use case
The first step in selecting the right generative AI solution is to clearly define your use case. Are you looking to enhance a single system, or do you need a solution that spans multiple platforms? Single-system use cases might be well-served by specific generative AI solutions, while cross-system scenarios often benefit from a more unified approach. Organizations that benefit most from Amazon Q Business typically share several key characteristics:

Data complexity: Companies with large volumes of data spread across multiple repositories and formats (documents, images, audio, video)
Knowledge dependency: Organizations where employee productivity depends on accessing institutional knowledge quickly and accurately
Security requirements: Organizations with strict security and compliance needs requiring role-based permissions and access controls
Collaboration needs: Teams that need to share information and collaborate across departments and geographies
Process complexity: Organizations with complex workflows that could benefit from automation and streamlining

Key considerations for tool selection
When evaluating generative AI tools, there are several factors you should consider to help ensure successful implementation and adoption:

Customization needs: Determine if you need custom AI behaviors or if out-of-the-box solutions suffice
Integration complexity: Assess the number of systems involved and the complexity of data flows between them
Future scalability: Think about your long-term needs and choose a solution that can grow with you
Data privacy and residency: Understand your data governance requirements and make sure that your chosen solution can meet them
Cost-effectiveness: Evaluate the total cost of ownership, including implementation, maintenance, and scaling costs
Time to market: Consider how quickly you need to implement your generative AI solution
Change management: As with any enterprise AI implementation, organizations must invest in proper training and change management strategies to help ensure adoption

The case for Amazon Q Business
Amazon Q Business offers unique advantages, especially for organizations that already use AWS services or that have complex, cross-system needs. For AWS enterprise customers that have the resources to build and operate their own solutions, an architecture that includes Amazon Q Business offers flexibility and cost advantages, including:

Unified experience: Amazon Q Business can provide a consistent AI experience across multiple systems, creating a seamless interface for users.
Architectural benefits: As a native AWS service, Amazon Q Business integrates seamlessly with your existing AWS architecture, reducing complexity and potential points of failure.
Flexibility: Amazon Q Business can connect to various enterprise systems, so that you can use it to create custom workflows that span multiple platforms.
Scalability: By using Amazon Q Business, you can take advantage of the proven scalability of AWS to handle growing workloads without worrying about infrastructure management.
Security and compliance: Use the robust security features and compliance certifications of AWS to help reduce your security and compliance burden.
Cost advantages: Amazon Q Business offers a pay-as-you-go model, so you can scale costs with the number of users and usage for knowledge bases. This can lead to significant cost savings (see pricing details).

Implement your generative AI use cases
After you’ve chosen your generative AI use cases, consider a phased implementation approach:

Start with pilot use cases to prove value quickly: Good pilot use cases include IT help desk or HR workflows. You can get started by taking advantage of AWS-provided example projects and open source samples.
Evaluate the next use cases: Prioritize your next use cases by business impact and feature coverage with existing Amazon Q Business connectors and plugins. AIOps use cases that include integrations or chat interfaces on top of ServiceNow, Confluence, Teams, or Slack are often good examples.
Use existing data sources: Connect Amazon Q Business to enterprise systems with supported connectors first to maximize immediate value.
Implement accuracy testing using frameworks: Use tools such as the AWS evaluation framework for Amazon Q Business, which includes automated testing pipelines, ground truth datasets, and comprehensive metrics for measuring response quality, relevancy, truthfulness, and overall accuracy.
Iteratively scale successful implementations across your organization: Start your implementation with the teams that are most interested in the application and willing to provide feedback. Make changes based on the feedback as needed, then expand it across the organization.
Measure and track results: Establish clear KPIs before implementation to quantify business impact.

Monitor usage and costs, implement feedback loops, and make sure to support security and compliance throughout your generative AI journey. Amazon Q Business can provide significant value when implemented in appropriate use cases with proper planning and governance. Success depends on careful evaluation of business needs, thorough implementation planning, and ongoing management of the solution.
Get started on AWS
When implementing your generative AI use cases, architectural decisions play a crucial role in achieving long-term success. Let’s explore some best practices for a typical AWS enterprise environment.

AWS Identity and Access Management (IAM): Connecting your corporate source of identities to AWS IAM Identity Center provides better security and user experience. Amazon Q Business users authorize their Amazon Q session with their usual sign-in process, using their existing organizational credentials through the identity source already in place.
Account structure: Set up Amazon Q Business service, data sources, and plugins in a shared services account based on application group or business unit to help reduce the number of similar deployments across different AWS accounts.
Access channels: When rolling out new use cases, consider also enabling existing familiar enterprise channels such as collaboration tools (Teams or Slack) to provide a frictionless way to test and roll out new use cases.
Data sources: When adding data sources, estimate index storage needs and determine whether your use case requires crawling access control list (ACL) and identity information from the data source, and whether the connector supports it. To reduce initial complexity, focus on use cases that provide the same data to all users, then expand in a second phase to use cases that rely on ACLs to control access.
Plugins: Use plugins to integrate external services as actions. For each use case, verify if a built-in plugin can provide this functionality, or if a custom plugin is needed. For custom plugins, plan an architecture that enables pointing to backend services using OpenAPI endpoints in other AWS accounts across the organization. This allows flexible integration of existing AWS Lambda functions or container-based functionality.

By carefully considering these aspects, you can create a solid foundation for your generative AI implementation that aligns with your organization’s needs and future growth plans.
How to deploy Amazon Q Business in your organization
The following reference architecture illustrates the main components and flow of a typical Amazon Q Business implementation:

The workflow is as follows:

A user interacts with an assistant through an enterprise collaboration system.
Alternate: A user interacts with the built-in web interface provided by Amazon Q Business.
The user is authenticated using IAM Identity Center and federated by a third-party identity provider (IdP).
Data sources are configured for existing enterprise systems and data is crawled and indexed in Amazon Q Business. You can use custom connectors to integrate data sources that aren’t provided by Amazon Q Business.
The user makes a request that requires action through a custom plugin. Use custom plugins to integrate third-party applications.
The custom plugin calls an API endpoint that calls an Amazon Bedrock agent using Lambda or Amazon Elastic Kubernetes Service (Amazon EKS) in another AWS account. The response is returned to Amazon Q Business and the user.

Use Amazon Q Business to improve enterprise productivity
Amazon Q Business offers numerous practical applications across enterprise functions. Let's explore some of the key use cases where Amazon Q Business can enhance organizational efficiency and productivity.

Knowledge management and support: Amazon Q Business can manage and retrieve information from documentation and repositories such as internal wikis, SharePoint, Confluence, and other knowledge bases. It provides contextual answers through natural language queries and helps maintain documentation quality by suggesting updates while connecting related information across different repositories. For examples, see Smartsheet enhances productivity with Amazon Q Business.
Employee onboarding and training: Improve your employee onboarding experience with automated, personalized learning journeys powered by intelligent support. From instant answers to common questions to guided system setup and interactive training content, this solution helps integrate new team members while supporting their continuous learning and development. To learn more, see Deriv Boosts Productivity and Reduces Onboarding Time by 45% with Amazon Q Business and this Amazon Machine Learning blog post.
IT help desk support: Shorten IT response times by using AI-driven assistance that delivers round-the-clock support and intelligent troubleshooting guidance. By automating ticket management and using historical data for solution recommendations, this system dramatically reduces response times while easing the burden on your IT support teams.
Human resources: Support your HR operations and increase employee satisfaction with an AI-powered solution that provides quick answers to policy questions and streamlines benefits management. This intelligent assistant guides employees through HR processes, simplifies leave management, and offers quick access to essential forms and documents, creating a more efficient and user-friendly HR experience.
Sales and marketing: Strengthen your sales and marketing efforts with an AI-powered platform that streamlines content creation, market analysis, and proposal development. From generating fresh content ideas to quickly providing product information and competitor insights, teams can use this solution to respond faster to customer needs while making data-driven decisions. See How AWS sales uses Amazon Q Business for customer engagement.
AI operations: Upgrade and improve your operational workflow with AI-driven monitoring and automation that transforms system management and incident response. From real-time performance tracking to automated routine tasks and intelligent root cause analysis, teams can use this solution to maintain operational efficiency and reduce manual intervention.

Customer case study
A leading enterprise organization transformed its operational efficiency by implementing Amazon Q Business to tackle widespread knowledge accessibility challenges. Prior to implementation, the company struggled with fragmented institutional knowledge scattered across multiple systems, causing significant productivity losses as employees—from systems analysts to executives—spent hours daily searching through documentation, legacy code, and reports.
By deploying Amazon Q Business, the organization centralized its scattered information from various sources including Amazon Simple Storage Service (Amazon S3) buckets, Jira, SharePoint, and other content management systems into a single, intelligent interface. The solution dramatically streamlined access to critical information across their complex ecosystem of enterprise resource planning (ERP) systems, databases, sales platforms, and e-commerce integrations.

With approximately 300 employees each saving two hours daily on routine information retrieval tasks, the company achieved remarkable productivity and efficiency gains. Beyond the gains, Amazon Q Business fostered smarter collaboration, reduced subject-matter expert (SME) dependencies, and accelerated decision-making processes, effectively redefining how enterprise knowledge is accessed and used across the organization.
Conclusion
Amazon Q Business offers AWS customers a scalable and comprehensive solution for enhancing business processes across their organization. By carefully evaluating your use cases, following implementation best practices, and using the architectural guidance provided in this post, you can deploy Amazon Q Business to transform your enterprise productivity. The key to success lies in starting small, proving value quickly, and scaling systematically across your organization.
For more information on Amazon Q Business, including detailed documentation and getting started guides, visit:

Explore the Amazon Q documentation to understand more about building custom plugins.
Check out these related resources:

Getting Started with Amazon Q Business
Plugins for Amazon Q Business
Amazon Q Business FAQs

For questions and feedback, visit the AWS re:Post or contact AWS Support.

About the authors
Oliver Steffmann is a Principal Solutions Architect at AWS based in New York and is passionate about GenAI and public blockchain use cases. He has over 20 years of experience working with financial institutions and helps his customers get their cloud transformation off the ground. Outside of work he enjoys spending time with his family and training for the next Ironman.
Krishna Pramod is a Senior Solutions Architect at AWS. He works as a trusted advisor for customers, guiding them through innovation with modern technologies and development of well-architected applications in the AWS cloud. Outside of work, Krishna enjoys reading, music and exploring new destinations.
Mo Naqvi is a Generative AI Specialist at AWS on the Amazon Q Business team, where he helps enterprise customers leverage generative AI to transform workplace productivity and unlock business intelligence. With expertise in AI-powered search, deep research capabilities, and agentic workflows, he enables organizations to break down data silos and derive actionable insights from their enterprise information.

Liquid AI Releases LFM2-VL: Super-Fast, Open-Weight Vision-Language Models Designed for Low-Latency and Device-Aware Deployment

Liquid AI has officially released LFM2-VL, a new family of vision-language foundation models optimized for low-latency, on-device deployment. With two highly efficient variants—LFM2-VL-450M and LFM2-VL-1.6B—this launch marks a significant leap in bringing multimodal AI to smartphones, laptops, wearables, and embedded systems without compromising speed or accuracy.

Unprecedented Speed and Efficiency

LFM2-VL models are engineered to deliver up to 2× faster GPU inference compared to existing vision-language models, while maintaining competitive benchmark performance on tasks like image description, visual question answering, and multimodal reasoning. The 450M-parameter variant is tailored for highly resource-constrained environments, while the 1.6B-parameter version offers greater capability while still remaining lightweight enough for single-GPU or high-end mobile use.

https://www.liquid.ai/blog/lfm2-vl-efficient-vision-language-models

Technical Innovations

Modular Architecture: LFM2-VL combines a language model backbone (LFM2-1.2B or LFM2-350M), a SigLIP2 NaFlex vision encoder (400M or 86M parameters), and a multimodal projector with a “pixel unshuffle” technique that dynamically reduces image token counts for faster processing.

Native Resolution Handling: Images are processed at their native resolution up to 512×512 pixels without distortion from upscaling. Larger images are split into non-overlapping 512×512 patches, preserving detail and aspect ratio. The 1.6B model also encodes a downscaled thumbnail of the full image for global context understanding.

Flexible Inference: Users can tune the speed-quality tradeoff at inference time by adjusting maximum image tokens and patch count, allowing real-time adaptation to device capabilities and application needs.

Training: The models were first pre-trained on the LFM2 backbone, then jointly mid-trained to fuse vision and language capabilities using a progressive adjustment of text-to-image data ratios, and finally fine-tuned for image understanding on approximately 100 billion multimodal tokens.

Benchmark Performance

LFM2-VL delivers competitive results on public benchmarks such as RealWorldQA, MM-IFEval, and OCRBench, rivaling larger models like InternVL3 and SmolVLM2, but with a smaller memory footprint and much faster processing—making it ideal for edge and mobile applications.

Both model sizes are open-weight and downloadable on Hugging Face under an Apache 2.0-based license, permitting free use for research and commercial use by companies. Larger enterprises must contact Liquid AI for a commercial license. The models integrate seamlessly with Hugging Face Transformers and support quantization for further efficiency gains on edge hardware.
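As a quick orientation, loading one of the checkpoints with Hugging Face Transformers might look like the following sketch. The repository ID and model class are assumptions based on the announcement, so check the model cards in the Liquid AI Hugging Face collection for the exact usage and the minimum transformers version required:

from transformers import AutoProcessor, AutoModelForImageTextToText

# Repository ID is an assumption based on the release naming; verify on Hugging Face.
model_id = "LiquidAI/LFM2-VL-450M"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)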


Use Cases and Integration

LFM2-VL is designed for developers and enterprises seeking to deploy fast, accurate, and efficient multimodal AI directly on devices—reducing cloud dependency and enabling new applications in robotics, IoT, smart cameras, mobile assistants, and more. Example applications include real-time image captioning, visual search, and interactive multimodal chatbots.

Getting Started

Download: Both models are available now on the Liquid AI Hugging Face collection.

Run: Example inference code is provided for platforms like llama.cpp, supporting various quantization levels for optimal performance on different hardware.

Customize: The architecture supports integration with Liquid AI’s LEAP platform for further customization and multi-platform edge deployment.

In summary, Liquid AI’s LFM2-VL sets a new standard for efficient, open-weight vision-language models on the edge. With native resolution support, tunable speed-quality tradeoffs, and a focus on real-world deployment, it empowers developers to build the next generation of AI-powered applications—anywhere, on any device.

Check out the Technical Details and Models on Hugging Face.
