MIT’s LEGO: A Compiler for AI Chips that Auto-Generates Fast, Efficient Spatial Accelerators


MIT researchers (Han Lab) introduced LEGO, a compiler-like framework that takes tensor workloads (e.g., GEMM, Conv2D, attention, MTTKRP) and automatically generates synthesizable RTL for spatial accelerators—no handwritten templates. LEGO’s front end expresses workloads and dataflows in a relation-centric affine representation, builds FU (functional unit) interconnects and on-chip memory layouts for reuse, and supports fusing multiple spatial dataflows in a single design. The back end lowers to a primitive-level graph and uses linear programming and graph transforms to insert pipeline registers, rewire broadcasts, extract reduction trees, and shrink area and power. Evaluated across foundation models and classic CNNs/Transformers, LEGO’s generated hardware shows 3.2× speedup and 2.4× energy efficiency over Gemmini under matched resources.

https://hanlab.mit.edu/projects/lego

Hardware Generation without Templates

Existing flows either: (1) analyze dataflows without generating hardware, or (2) generate RTL from hand-tuned templates with fixed topologies. These approaches restrict the architecture space and struggle with modern workloads that need to switch dataflows dynamically across layers/ops (e.g., conv vs. depthwise vs. attention). LEGO directly targets any dataflow and combinations, generating both architecture and RTL from a high-level description rather than configuring a few numeric parameters in a template.


Input IR: Affine, Relation-Centric Semantics (Deconstruct)

LEGO models tensor programs as loop nests with three index classes: temporal (for-loops), spatial (par-for FUs), and computation (pre-tiling iteration domain). Two affine relations drive the compiler:

Data mapping f_{I→D}: maps computation indices to tensor indices.

Dataflow mapping f_{TS→I}: maps temporal/spatial indices to computation indices.

This affine-only representation eliminates modulo/division in the core analysis, making reuse detection and address generation a linear-algebra problem. LEGO also decouples control flow from dataflow (a vector c encodes control signal propagation/delay), enabling shared control across FUs and substantially reducing control logic overhead.
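To make the affine relations concrete, the following is a minimal Python sketch (not LEGO's code; the matrix, layout, and sizes are made up) of how an affine data mapping turns computation indices into tensor indices and then into a linear address using nothing but integer linear algebra, with no modulo or division.

import numpy as np

# Toy GEMM: computation indices I = (i, j, k) for C[i, j] += A[i, k] * B[k, j].
# The data mapping f_{I→D} for tensor A selects (i, k); it is purely affine (here linear).
F_A = np.array([[1, 0, 0],   # A's row index    = i
                [0, 0, 1]])  # A's column index = k

def address_of_A(comp_idx, lda):
    # Map computation indices to A's tensor indices, then to a row-major address.
    d = F_A @ np.asarray(comp_idx)   # affine data mapping f_{I→D}
    return int(d[0] * lda + d[1])    # linear address generation, no mod/div needed

print(address_of_A((2, 5, 3), lda=16))   # A[2, 3] -> address 2*16 + 3 = 35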

Front End: FU Graph + Memory Co-Design (Architect)

The main objective is to maximize reuse and on-chip bandwidth while minimizing interconnect/mux overhead.

Interconnection synthesis. LEGO formulates reuse as solving linear systems over the affine relations to discover direct and delay (FIFO) connections between FUs. It then computes minimum-spanning arborescences (Chu-Liu/Edmonds) to keep only necessary edges (cost = FIFO depth). A BFS-based heuristic rewrites direct interconnects when multiple dataflows must co-exist, prioritizing chain reuse and nodes already fed by delay connections to cut muxes and data nodes.

Banked memory synthesis. Given the set of FUs that must read/write a tensor in the same cycle, LEGO computes bank counts per tensor dimension from the maximum index deltas (optionally dividing by GCD to reduce banks). It then instantiates data-distribution switches to route between banks and FUs, leaving FU-to-FU reuse to the interconnect.
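The bank-count rule lends itself to a short illustration. The following toy Python function is one interpretation of the rule described above, not LEGO's exact algorithm: it derives a per-dimension bank count from the spread of tensor indices touched in the same cycle, reduced by the GCD of the access strides.

from functools import reduce
from math import gcd

def banks_per_dim(concurrent_indices):
    # concurrent_indices: tensor indices accessed by different FUs in the same cycle.
    dims = list(zip(*concurrent_indices))      # group values by tensor dimension
    banks = []
    for values in dims:
        lo, hi = min(values), max(values)
        deltas = [v - lo for v in values if v != lo]
        step = reduce(gcd, deltas, 0) or 1     # common stride between accesses (GCD reduction)
        banks.append((hi - lo) // step + 1)    # banks needed for conflict-free access
    return banks

# Four FUs reading A[0,0], A[0,2], A[1,0], A[1,2] in the same cycle -> 2 banks per dimension.
print(banks_per_dim([(0, 0), (0, 2), (1, 0), (1, 2)]))   # [2, 2]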

Dataflow fusion. Interconnects for different spatial dataflows are combined into a single FU-level Architecture Description Graph (ADG); careful planning avoids naïve mux-heavy merges and yields up to ~20% energy gains compared to naïve fusion.

Back End: Compile & Optimize to RTL (Compile & Optimize)

The ADG is lowered to a Detailed Architecture Graph (DAG) of primitives (FIFOs, muxes, adders, address generators). LEGO applies several LP/graph passes:

Delay matching via LP. A linear program chooses output delays D_v to minimize the inserted pipeline registers, ∑ (D_v − D_u − L_v) · bitwidth across edges, meeting timing alignment with minimal storage (a small illustrative sketch follows this list).

Broadcast pin rewiring. A two-stage optimization (virtual cost shaping + MST-based rewiring among destinations) converts expensive broadcasts into forward chains, enabling register sharing and lower latency; a final LP re-balances delays.

Reduction tree extraction + pin reuse. Sequential adder chains become balanced trees; a 0-1 ILP remaps reducer inputs across dataflows so fewer physical pins are required (mux instead of add). This reduces both logic depth and register count.
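As a concrete illustration of the delay-matching pass, here is a minimal sketch of that LP on a made-up four-node graph, using SciPy's HiGHS backend (the same solver family LEGO uses). The latencies and bitwidths are invented for the example; this is not LEGO's implementation.

import numpy as np
from scipy.optimize import linprog

# Toy DAG: node 0 feeds nodes 1 and 2, which both feed node 3.
# L[v] is node v's latency; bw[(u, v)] is the bitwidth of edge u -> v (all values made up).
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
L = {1: 1, 2: 3, 3: 1}
bw = {(0, 1): 16, (0, 2): 16, (1, 3): 32, (2, 3): 32}
n = 4

# Objective: minimize sum over edges of bw * (D_v - D_u - L_v), i.e. inserted register bits.
c = np.zeros(n)
for (u, v) in edges:
    c[v] += bw[(u, v)]
    c[u] -= bw[(u, v)]

# Constraints: D_v - D_u >= L_v for every edge, written as D_u - D_v <= -L_v.
A_ub = np.zeros((len(edges), n))
b_ub = np.zeros(len(edges))
for row, (u, v) in enumerate(edges):
    A_ub[row, u], A_ub[row, v] = 1.0, -1.0
    b_ub[row] = -L[v]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * n, method="highs")
D = res.x
regs = {(u, v): (D[v] - D[u] - L[v]) * bw[(u, v)] for (u, v) in edges}
print(np.round(D, 1), regs)
# The two balancing cycles land on the narrow 16-bit edge 0 -> 1 rather than the wide
# 32-bit edge 1 -> 3, which is exactly what weighting the slack by bitwidth buys.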

These passes focus on the datapath, which dominates resources (e.g., FU-array registers ≈ 40% area, 60% power), and produce ~35% area savings versus naïve generation.

Outcome

Setup. LEGO is implemented in C++ with HiGHS as the LP solver and emits SpinalHDL→Verilog. Evaluation covers tensor kernels and end-to-end models (AlexNet, MobileNetV2, ResNet-50, EfficientNetV2, BERT, GPT-2, CoAtNet, DDPM, Stable Diffusion, LLaMA-7B). A single LEGO-MNICOC accelerator instance is used across models; a mapper picks per-layer tiling/dataflow. Gemmini is the main baseline under matched resources (256 MACs, 256 KB on-chip buffer, 128-bit bus @ 16 GB/s).

End-to-end speed/efficiency. LEGO achieves 3.2× speedup and 2.4× energy efficiency on average vs. Gemmini. Gains stem from: (i) a fast, accurate performance model guiding mapping; (ii) dynamic spatial dataflow switching enabled by generated interconnects (e.g., depthwise conv layers choose OH–OW–IC–OC). Both designs are bandwidth-bound on GPT-2.

Resource breakdown. Example SoC-style configuration shows FU array and NoC dominate area/power, with PPUs contributing ~2–5%. This supports the decision to aggressively optimize datapaths and control reuse.

Generative models. On a larger 1024-FU configuration, LEGO sustains >80% utilization for DDPM/Stable Diffusion; LLaMA-7B remains bandwidth-limited (expected for low operational intensity).


Importance for each segment

For researchers: LEGO provides a mathematically grounded path from loop-nest specifications to spatial hardware with provable LP-based optimizations. It abstracts away low-level RTL and exposes meaningful levers (tiling, spatialization, reuse patterns) for systematic exploration.

For practitioners: It is effectively hardware-as-code. You can target arbitrary dataflows and fuse them in one accelerator, letting a compiler derive interconnects, buffers, and controllers while shrinking mux/FIFO overheads. This improves energy and supports multi-op pipelines without manual template redesign.

For product leaders: By lowering the barrier to custom silicon, LEGO enables task-tuned, power-efficient edge accelerators (wearables, IoT) that keep pace with fast-moving AI stacks—the silicon adapts to the model, not the other way around. End-to-end results against a state-of-the-art generator (Gemmini) quantify the upside.

How the “Compiler for AI Chips” Works—Step-by-Step?

Deconstruct (Affine IR). Write the tensor op as loop nests; supply affine f_{I→D} (data mapping), f_{TS→I} (dataflow), and control flow vector c. This specifies what to compute and how it is spatialized, without templates.

Architect (Graph Synthesis). Solve reuse equations → FU interconnects (direct/delay) → MST/heuristics for minimal edges and fused dataflows; compute banked memory and distribution switches to satisfy concurrent accesses without conflicts.

Compile & Optimize (LP + Graph Transforms). Lower to a primitive DAG; run delay-matching LP, broadcast rewiring (MST), reduction-tree extraction, and pin-reuse ILP; perform bit-width inference and optional power gating. These passes jointly deliver ~35% area and ~28% energy savings vs. naïve codegen.

Where It Lands in the Ecosystem?

Compared with analysis tools (Timeloop/MAESTRO) and template-bound generators (Gemmini, DNA, MAGNET), LEGO is template-free, supports any dataflow and their combinations, and emits synthesizable RTL. Results show comparable or better area/power versus expert handwritten accelerators under similar dataflows and technologies, while offering one-architecture-for-many-models deployment.

Summary

LEGO operationalizes hardware generation as compilation for tensor programs: an affine front end for reuse-aware interconnect/memory synthesis and an LP-powered back end for datapath minimization. The framework’s measured 3.2× performance and 2.4× energy gains over a leading open generator, plus ~35% area reductions from back-end optimizations, position it as a practical path to application-specific AI accelerators at the edge and beyond.


The post MIT’s LEGO: A Compiler for AI Chips that Auto-Generates Fast, Efficient Spatial Accelerators appeared first on MarkTechPost.

Bringing AI Agents Into Any UI: The AG-UI Protocol for Real-Time, Structured Agent–Frontend Streams

AI agents are no longer just chatbots that spit out answers. They’re evolving into complex systems that can reason step by step, call APIs, update dashboards, and collaborate with humans in real time. But this raises a key question: how should agents talk to user interfaces?

Ad-hoc sockets and custom APIs can work for prototypes, but they don’t scale. Each project reinvents how to stream outputs, manage tool calls, or handle user corrections. That’s exactly the gap the AG-UI (Agent–User Interaction) Protocol aims to fill.

What AG-UI Brings to the Table

AG-UI is a streaming event protocol designed for agent-to-UI communication. Instead of returning a single blob of text, agents emit a continuous sequence of JSON events:

TEXT_MESSAGE_CONTENT for streaming responses token by token.

TOOL_CALL_START / ARGS / END for external function calls.

STATE_SNAPSHOT and STATE_DELTA for keeping UI state in sync with the backend.

Lifecycle events (RUN_STARTED, RUN_FINISHED) to frame each interaction.

All of this flows over standard transports like HTTP SSE or WebSockets, so developers don’t have to build custom protocols. The frontend subscribes once and can render partial results, update charts, and even send user corrections mid-run.
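To see what that contract looks like on the wire, here is a small framework-agnostic Python sketch. The event type names are the ones listed above, but the payload field names and the SSE framing details are illustrative, not the official AG-UI schema.

import json
import time

def agent_run(prompt: str):
    # Minimal AG-UI-style event sequence for one run (payload fields are illustrative).
    yield {"type": "RUN_STARTED"}
    yield {"type": "TOOL_CALL_START", "tool": "search"}
    yield {"type": "TOOL_CALL_END", "tool": "search"}
    for token in ["The ", "answer ", "is ", "42."]:
        yield {"type": "TEXT_MESSAGE_CONTENT", "delta": token}
        time.sleep(0.05)   # simulate token-by-token streaming
    yield {"type": "STATE_DELTA", "patch": [{"op": "replace", "path": "/status", "value": "done"}]}
    yield {"type": "RUN_FINISHED"}

def to_sse(events):
    # Frame each JSON event as a Server-Sent Events message: "data: ...\n\n".
    for event in events:
        yield f"data: {json.dumps(event)}\n\n"

for frame in to_sse(agent_run("hello")):
    print(frame, end="")

A real deployment would return to_sse(...) as a streaming HTTP response (or push the same frames over a WebSocket) and let the frontend render each event as it arrives.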

This design makes AG-UI more than a messaging layer—it’s a contract between agents and UIs. Backend frameworks can evolve, UIs can change, but as long as they speak AG-UI, everything stays interoperable.

First-Party and Partner Integrations

One reason AG-UI is gaining traction is its breadth of supported integrations. Instead of leaving developers to wire everything manually, many agent frameworks already ship with AG-UI support.

Mastra (TypeScript): Native AG-UI support with strong typing, ideal for finance and data-driven copilots.

LangGraph: AG-UI integrated into orchestration workflows so every node emits structured events.

CrewAI: Multi-agent coordination exposed to UIs via AG-UI, letting users follow and guide “agent crews.”

Agno: Full-stack multi-agent systems with AG-UI-ready backends for dashboards and ops tools.

LlamaIndex: Adds interactive data retrieval workflows with live evidence streaming to UIs.

Pydantic AI: Python SDK with AG-UI baked in, plus example apps like the AG-UI Dojo.

CopilotKit: Frontend toolkit offering React components that subscribe to AG-UI streams.

Other integrations are in progress—like AWS Bedrock Agents, Google ADK, and Cloudflare Agents—which will make AG-UI accessible on major cloud platforms. Language SDKs are also expanding: Kotlin support is complete, while .NET, Go, Rust, Nim, and Java are in development.

Real-World Use Cases

Healthcare, finance, and analytics teams use AG-UI to turn critical data streams into live, context-rich interfaces: clinicians see patient vitals update without page reloads, stock traders trigger a stock-analysis agent and watch results stream inline, and analysts view a LangGraph-powered dashboard that visualizes charting plans token by token as the agent reasons.

Beyond data display, AG-UI simplifies workflow automation. Common patterns—data migration, research summarization, form-filling—are reduced to a single SSE event stream instead of custom sockets or polling loops. Because agents emit only STATE_DELTA patches, the UI refreshes just the pieces that changed, cutting bandwidth and eliminating jarring reloads. The same mechanism powers 24/7 customer-support bots that show typing indicators, tool-call progress, and final answers within one chat window, keeping users engaged throughout the interaction.
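A sketch of the client side of that pattern follows. It assumes a JSON-Patch-style delta (a list of op/path/value operations); the actual AG-UI payload schema may differ, so treat the field names as illustrative.

state = {}

def on_event(event):
    # Apply STATE_SNAPSHOT (full resync) and STATE_DELTA (partial patch) events to local state.
    global state
    if event["type"] == "STATE_SNAPSHOT":
        state = event["state"]
    elif event["type"] == "STATE_DELTA":
        for op in event["patch"]:
            keys = [k for k in op["path"].split("/") if k]
            target = state
            for k in keys[:-1]:
                target = target[k]
            if op["op"] in ("add", "replace"):
                target[keys[-1]] = op["value"]   # touch only the piece that changed
            elif op["op"] == "remove":
                target.pop(keys[-1], None)

on_event({"type": "STATE_SNAPSHOT", "state": {"vitals": {"hr": 72}, "status": "running"}})
on_event({"type": "STATE_DELTA", "patch": [{"op": "replace", "path": "/vitals/hr", "value": 75}]})
print(state)   # {'vitals': {'hr': 75}, 'status': 'running'}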

For developers, the protocol enables code-assistants and multi-agent applications with minimal glue code. Experiences that mirror GitHub Copilot—real-time suggestions streaming into editors—are built by simply listening to AG-UI events. Frameworks such as LangGraph, CrewAI, and Mastra already emit the spec’s 16 event types, so teams can swap back-end agents while the front-end remains unchanged. This decoupling speeds prototyping across domains: tax software can show optimistic deduction estimates while validation runs in the background, and a CRM page can autofill client details as an agent returns structured data to a Svelte + Tailwind UI.

AG-UI Dojo

CopilotKit has also recently introduced AG-UI Dojo, a “learning-first” suite of minimal, runnable demos that teach and validate AG-UI integrations end-to-end. Each demo includes a live preview, code, and linked docs, covering six primitives needed for production agent UIs: agentic chat (streaming + tool hooks), human-in-the-loop planning, agentic and tool-based generative UI, shared state, and predictive state updates for real-time collaboration. Teams can use the Dojo as a checklist to troubleshoot event ordering, payload shape, and UI–agent state sync before shipping, reducing integration ambiguity and debugging time.

You can play around with the Dojo here; the Dojo source code and more technical details are available in the blog.

Roadmap and Community Contributions

The public roadmap shows where AG-UI is heading and where developers can plug in:

SDK maturity: Ongoing investment in TypeScript and Python SDKs, with expansion into more languages.

Debugging and developer tools: Better error handling, observability, and lifecycle event clarity.

Performance and transports: Work on large payload handling and alternative streaming transports beyond SSE/WS.

Sample apps and playgrounds: The AG-UI Dojo demonstrates building blocks for UIs and is expanding with more patterns.

On the contribution side, the community has added integrations, improved SDKs, expanded documentation, and built demos. Pull requests across frameworks like Mastra, LangGraph, and Pydantic AI have come from both maintainers and external contributors. This collaborative model ensures AG-UI is shaped by real developer needs, not just spec writers.

Summary

AG-UI is emerging as the default interaction protocol for agent UIs. It standardizes the messy middle ground between agents and frontends, making applications more responsive, transparent, and maintainable.

With first-party integrations across popular frameworks, community contributions shaping the roadmap, and tooling like the AG-UI Dojo lowering the barrier to entry, the ecosystem is maturing fast.

Launch AG-UI with a single command, choose your agent framework, and be prototyping in under five minutes.

npx create-ag-ui-app@latest
# then
<pick your agent framework>

# For details and patterns, see the quickstart blog: go.copilotkit.ai/ag-ui-cli-blog.

FAQs

FAQ 1: What problem does AG-UI solve?

AG-UI standardizes how agents communicate with user interfaces. Instead of ad-hoc APIs, it defines a clear event protocol for streaming text, tool calls, state updates, and lifecycle signals—making interactive UIs easier to build and maintain.

FAQ 2: Which frameworks already support AG-UI?

AG-UI has first-party integrations with Mastra, LangGraph, CrewAI, Agno, LlamaIndex, and Pydantic AI. Partner integrations include CopilotKit on the frontend. Support for AWS Bedrock Agents, Google ADK, and additional languages like .NET, Go, and Rust is in progress.

FAQ 3: How does AG-UI differ from REST APIs?

REST works for single request–response tasks. AG-UI is designed for interactive agents—it supports streaming output, incremental updates, tool usage, and user input during a run, which REST cannot handle natively.

FAQ 4: What transports does AG-UI use?

By default, AG-UI runs over HTTP Server-Sent Events (SSE). It also supports WebSockets, and the roadmap includes exploration of alternative transports for high-performance or binary data use cases.

FAQ 5: How can developers get started with AG-UI?

You can install official SDKs (TypeScript, Python) or use supported frameworks like Mastra or Pydantic AI. The AG-UI Dojo provides working examples and UI building blocks to experiment with event streams.

Thanks to the CopilotKit team for the thought leadership and resources for this article. The CopilotKit team supported us in producing this content.

The post Bringing AI Agents Into Any UI: The AG-UI Protocol for Real-Time, Structured Agent–Frontend Streams appeared first on MarkTechPost.

Scale visual production using Stability AI Image Services in Amazon Bedrock

This post was written with Alex Gnibus of Stability AI.
Stability AI Image Services are now available in Amazon Bedrock, offering ready-to-use media editing capabilities delivered through the Amazon Bedrock API. These image editing tools expand on the capabilities of Stability AI’s Stable Diffusion 3.5 models (SD3.5) and Stable Image Core and Ultra models, which are already available in Amazon Bedrock and have set new standards in image generation.
The professional creative production process consists of multiple editing steps to get the exact output needed. With Stability AI Image Services in Amazon Bedrock, you can modify, enhance, and transform existing images without jumping between multiple systems or sending files to external services. Everything runs through the same Amazon Bedrock experience you’re already using. The business impact can be immediate for teams that produce visual content at scale.
In this post, we explore examples of how these tools enable precise creative control to accelerate professional-grade visual content.
Editing tools now available in Amazon Bedrock
Stability AI Image Services span 9 tools across two categories: Edit and Control. Each tool handles specific editing tasks that typically require specialized software or manual intervention.
Edit: Advanced capabilities for granular editing steps
The tools in the Edit category make complex editing tasks more accessible and efficient.
The suite begins with fundamental yet powerful retouching tools. The Erase Object tool, for example, removes unwanted elements from images while intelligently maintaining background consistency. The following animation showcases the Erase Object tool removing a mannequin from a product shot while preserving the background. The tool can transform a source image based on a mask image or derive the mask from the source image’s alpha channel.

The Remove Background tool automatically isolates subjects with precision. This enables the creation of clean, professional product listings with consistent backgrounds or a variety of lifestyle settings, which is a game changer for ecommerce.
The following example illustrates the removal of an image background, while preserving details of a furniture product in the foreground.

The Search and Recolor and Search and Replace tools target specific elements within images for modification. Search and Recolor changes object colors; for example, showing different colorways of a dress without new photoshoots. In the following illustration, Search and Recolor changes the color swatch on furniture.

Search and Replace can swap objects entirely, which is useful for updating seasonal elements in marketing materials or replacing products. The following is an application of Search and Replace for virtual try-on experiences.

The Inpaint tool intelligently modifies images by filling in or replacing specified areas with new content based on the content of a mask image.
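Before moving on to the Control tools, here is a hedged sketch of what calling one of these Edit tools through the Amazon Bedrock runtime API can look like from Python. The invoke_model call is the standard Bedrock runtime API, but the model identifier and the request and response field names below are placeholders; check the Amazon Bedrock model catalog and the sample notebook referenced later in this post for the exact values.

import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

MODEL_ID = "stability.<image-service-tool-id>"   # placeholder, e.g. the Remove Background tool's ID

with open("product-shot.png", "rb") as f:
    source_b64 = base64.b64encode(f.read()).decode()

response = bedrock.invoke_model(
    modelId=MODEL_ID,
    body=json.dumps({"image": source_b64, "output_format": "png"}),   # assumed request shape
)
result = json.loads(response["body"].read())
with open("product-shot-edited.png", "wb") as f:
    f.write(base64.b64decode(result["images"][0]))   # assumed response shape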
Control: Structural and stylistic precision
This category of tools provides precise manipulation of image structure and style through three specialized tools.
The Sketch tool transforms sketch-style renderings into photorealistic concepts. Architecture firms might use this to convert conceptual drawings into realistic visualizations, and apparel brands to turn design sketches into product mockups. The tool helps accelerate the creative production process from initial concepts to final visual execution.
In this example, the Sketch tool transforms a building architecture drawing to help real estate developers visualize the concept against a cityscape.

In another example, the Sketch tool transforms a mannequin drawing into a photorealistic model shot.

The Structure tool maintains the structural elements of input images while allowing content modification. This tool helps preserve layouts, compositions, and spatial relationships while changing subjects or styles. Creative teams can use the Structure tool to recreate scenes with different subjects or render new characters while maintaining consistent framing.
The following example demonstrates the Structure tool transforming a workshop scene into a new scene while preserving the composition and spatial relationships.

The Style Guide and Style Transfer tools help marketing teams produce new images that align with brand style and guidelines. The Style Guide tool takes artistic styles and colors from a reference style image and generates new images based on text prompts.
In the following example, the Style Guide tool takes clues from a brand’s color palette and textures and generates new images matching brand identity.

The Style Transfer tool uses visual characteristics from reference images to transform existing images, while preserving the original composition. For example, a home decor retailer can transform product imagery from modern minimalist to traditional styles without new photography. Marketing teams could create seasonal variations by applying different visual styles to existing product catalogs.
Solution overview
To demonstrate Stability AI Image Services in Amazon Bedrock, let’s walk through an example using a Jupyter notebook found in the GitHub repo.
Prerequisites
To follow along, you must have the following prerequisites:

An AWS account.
AWS credentials configured for creating and accessing Amazon Bedrock and Amazon SageMaker AI resources.
An AWS Identity and Access Management (IAM) execution role for SageMaker AI, which has the AmazonSageMakerFullAccess and AmazonBedrockLimitedAccess AWS managed policies attached. For more details, see How to use SageMaker AI execution roles.
A SageMaker notebook instance.
Stability AI Image Services model access, which you can request through the Amazon Bedrock console. Refer to Access Amazon Bedrock foundation models for more details.

Create a SageMaker AI notebook instance
Complete the following steps to create a SageMaker AI notebook instance, which can be used to run the sample notebook:

On the SageMaker AI console, in the navigation pane, under Applications and IDEs, choose Notebooks.
Choose Create notebook instance.
For Notebook instance name, enter a name for your notebook instance (for example, ai-images-notebook-instance).
For Notebook Instance type, choose ml.t2.medium.
For Platform identifier, choose Amazon Linux 2.
For IAM role, choose either an existing IAM role, which has the AmazonSageMakerFullAccess and AmazonBedrockLimitedAccess policies attached, or choose Create a new role.
Note the name of the IAM role that you chose.
Leave other settings as default and choose Create notebook instance.

After a few minutes, SageMaker AI creates a notebook instance, and its status changes from Pending to InService.
Confirm the IAM role for the notebook instance has the necessary permissions
Complete the following steps to verify that the SageMaker AI execution role that you assigned to the notebook instance has the correct permissions:

On the IAM console, in the navigation pane, under Access management, choose Roles.
In the Roles search bar, enter the name of the SageMaker AI execution role that you used when creating the notebook instance.
Choose the IAM role.
Under Permissions policies, verify that the AWS managed policies AmazonSageMakerFullAccess and AmazonBedrockLimitedAccess are present.
(Optional) If either policy is missing, choose Add permissions, then choose Attach policies to attach the missing policy.

In the Other permissions policies search bar, enter the policy name.
Select the policy, then choose Add permissions.

Run the notebook
Complete the following steps to run the notebook:

On the SageMaker AI console, in the navigation pane, under Applications and IDEs, choose Notebooks.
Choose the newly created ai-images-notebook-instance notebook instance.
Wait for the notebook to be in InService status.
Choose the Open JupyterLab link to launch JupyterLab in a new browser tab.
On the Git menu, choose Clone a Repository.
Enter the URI https://github.com/aws-samples/stabilityai-sample-notebooks.git and select Include submodules and Download the repository.
Choose Clone.
On the File menu, choose Open from path.
Enter the following: stabilityai-sample-notebooks/stability-ai-image-services/stability-ai-image-services-sample-notebook.ipynb
Choose Open.
When prompted, choose the kernel conda_python3, then choose Select.
Run through each notebook cell to experience Stability AI Image Services in Amazon Bedrock.

Clean up
To avoid ongoing charges, stop the ai-images-notebook-instance SageMaker AI notebook instance that you created in this walkthrough:

On the SageMaker AI console, in the navigation pane, under Applications and IDEs, choose Notebooks.
Choose the ai-images-notebook-instance SageMaker AI notebook instance that you created.
Choose Actions, then choose Stop.

After a few minutes, the notebook instance transitions from Stopping to Stopped status.

Choose Actions, then Delete.

After a few seconds, SageMaker AI deletes the notebook instance.
For more details, refer to Clean up Amazon SageMaker notebook instance resources.
Conclusion
The availability of Stability AI Image Services in Amazon Bedrock is an exciting step forward for visual content creation and manipulation, with particularly time-saving implications for professional creative teams at enterprises.
For example, in media and entertainment, creators can rapidly enhance scenes and create special effects, and marketing teams can generate multiple campaign variations effortlessly. Retail and ecommerce businesses can streamline product photography and digital catalog creation, and gaming developers can prototype environments more efficiently. Architecture firms can visualize design concepts instantly, and educational institutions can create more engaging visual content.
With these tools, businesses of different sizes can produce professional-grade, highly engaging visual content with efficiency and creativity. These tools can streamline operations, reduce costs, and open new creative possibilities, helping brands tell their stories more effectively and engage customers in more compelling ways.
To get started, check out Stability AI models in Amazon Bedrock and the AWS Samples GitHub repo.

About the authors
Alex Gnibus is a Product Marketing Manager at Stability AI, connecting the dots between cutting-edge research breakthroughs and practical use cases. With experience spanning from creative agencies to deep enterprise tech, Alex brings both technical expertise and an understanding of the challenges that professional creative teams can solve with generative AI.
Isha Dua is a Senior Solutions Architect based in the San Francisco Bay Area. She helps AWS Enterprise customers grow by understanding their goals and challenges and guiding them on how they can architect their applications in a cloud-based manner while making sure they are resilient and scalable. She’s passionate about machine learning technologies and environmental sustainability.
Fabio Branco is a Senior Customer Solutions Manager at Amazon Web Services (AWS) and strategic advisor helping customers achieve business transformation, drive innovation through generative AI and data solutions, and successfully navigate their cloud journeys. Prior to AWS, he held Product Management, Engineering, Consulting, and Technology Delivery roles across multiple Fortune 500 companies in industries, including retail and consumer goods, oil and gas, financial services, insurance, and aerospace and defense.
Suleman Patel is a Senior Solutions Architect at Amazon Web Services (AWS), with a special focus on machine learning and modernization. With expertise in both business and technology, Suleman helps customers design and build solutions that tackle real-world business problems. When he’s not immersed in his work, Suleman loves exploring the outdoors, taking road trips, and cooking up delicious dishes in the kitchen.

Prompting for precision with Stability AI Image Services in Amazon Bedrock

Amazon Bedrock now offers Stability AI Image Services: 9 tools that improve how businesses create and modify images. The technology extends Stable Diffusion and Stable Image models to give you precise control over image creation and editing. Clear prompts are critical—they provide art direction to the AI system. Strong prompts control specific elements like tone, texture, lighting, and composition to create the desired visual outcomes. This capability serves professional needs across product photography, concept, and marketing campaigns.
In this post, we expand on the post Understanding prompt engineering: Unlock the creative potential of Stability AI models on AWS. We show how to effectively use advanced prompting techniques to maximize image generation quality and precision for enterprise application using Stability AI Image Services in Amazon Bedrock.
Solution overview
Stability AI Image Services are available as APIs in Amazon Bedrock, featuring capabilities such as inpainting, style transfer, recoloring, background removal, object removal, style guide, and much more.
In the following sections, we first discuss prompt structure for maximum control of image generation, then cover advanced prompting techniques for stylistic guidance. Code samples can be found in the following GitHub repository.
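As a taste of what those samples look like, the following is a hedged sketch of generating an image with a weighted prompt and a negative prompt through the Bedrock runtime API. The model ID and request/response field names follow the commonly documented Stable Image request shape, but verify them against the Amazon Bedrock documentation and the GitHub repository before relying on them.

import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

body = {
    "prompt": "editorial product photo of (a translucent gel moisturizer jar:1.4) on a "
              "(frosted glass pedestal:1.2), soft (diffused lighting:1.3), shallow depth of field",
    "negative_prompt": "blurry, lowres, watermark, cartoon, oversaturated",
    "output_format": "png",
}
response = bedrock.invoke_model(
    modelId="stability.stable-image-ultra-v1:1",   # assumed ID; use the version enabled in your account
    body=json.dumps(body),
)
result = json.loads(response["body"].read())
with open("moisturizer.png", "wb") as f:
    f.write(base64.b64decode(result["images"][0]))   # assumed response field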
Prerequisites
To get started with Stability AI Image Services in Amazon Bedrock, follow the instructions in Getting started with the API to complete the following prerequisites:

Set up your AWS account.
Acquire credentials to grant programmatic access.
Attach the Amazon Bedrock permission to an AWS Identity and Access Management (IAM) user or role.
Request access to the Amazon Bedrock models.

Structure prompts that maximize control
To maximize the granular capabilities of Stability AI Image Services in Amazon Bedrock, you must construct prompts that enable fine-grained control.
This section outlines best practices for building effective prompts that produce the desired output. We demonstrate how prompt structure affects results and why more structured prompts typically yield more consistent and controllable outcomes.
Choose the right prompt type for your use case
Selecting the right prompt format helps the model better understand your intent. Three primary prompt formats deliver different levels of control and readability:

Natural language maximizes readability and is best for general usage
Tag-based formats enable precise structural control and are ideal for technical application
Hybrid formats combine natural language and the structural elements of tags to provide even more control

The following table provides examples of these three common ways to phrase your prompts. Each prompt format has its strengths depending on your goal or the interface you’re using.

Prompt type
Prompt example
Generated image using Stable Image Ultra in Amazon Bedrock
Description and use case

Basic Prompt (Natural Language)
“A clean product photo of a perfume bottle on a marble countertop”

This is readable and intuitive. Great for exploration, conversational tools, and some model types. Stable Diffusion 3.5 responds best to this style.

Tag-Based Prompt
“perfume bottle, marble surface, soft light, high quality, product photo”

Used in many generation UIs or with models trained on datasets like LAION or Danbooru. Compact and good for stacking details.

Hybrid Prompt
“perfume bottle on marble counter, soft studio lighting, sharp focus, f/2.8 lens”

Best of both worlds. Add emphasis with weighting syntax to influence the model’s priorities.

Build modular prompts
Modular prompting enhances AI image generation effectiveness. This approach divides prompts into distinct components, each specifying what to draw and how it should appear. Modular structures provide several benefits: they help prevent conflicting or confusing instructions, allow for precise output control, and simplify prompt debugging. By isolating individual elements, you can quickly identify and adjust effective or ineffective parts of your prompts. This method ultimately leads to more refined and targeted AI-generated images.
The following table provides examples of prompt modules. Experiment with different prompt sequences for your desired outcome; for example, placing the style before the subject gives it more visual weight.

Module
Example
Description

Prefix
“fashion editorial portrait of”
Sets the tone and intent for a high-fashion styled portrait

Subject
“a woman with medium-brown skin and short coiled hair”
Gives the model’s look and surface detail to help guide facial features

Modifiers
“wearing an asymmetrical black mesh top, metallic jewelry”
Adds stylized clothing and accessories for visual interest

Action
“seated with her shoulders angled, eyes locked on camera, one arm lifted”
Describes body language and pose to give dynamic composition

Environment
“bathed in intersecting beams of hard directional light through window slats”
Adds context for dramatic light play and atmosphere

Style
“high-contrast chiaroscuro lighting, sculptural and abstract”
Informs the aesthetic and mood (shadow-driven, moody, architectural)

Camera/Lighting
“shot on 85mm, studio setup, layered shadows and light falling across face and body”
Adds technical precision and helps control realism and fidelity

The following example illustrates how to use a modular prompt to generate the desired output.

Modular Prompt
Generated Image Using Stable Image Ultra in Amazon Bedrock

“fashion editorial portrait of a woman with medium-brown skin and short coiled hair, wearing an asymmetrical black mesh top and metallic jewelry, seated with shoulders angled and one arm lifted, eyes locked on camera, bathed in intersecting beams of hard directional light through window slats, layered shadows and highlights sculpting her face and body, high-contrast chiaroscuro lighting, abstract and bold, shot on 85mm in studio”

Use negative prompts for polished output
Negative prompts improve AI output quality by removing specific visual elements. Explicitly defining what not to include in the prompt guides the model’s output, typically leading to professional outputs. Negative prompts act like a retoucher’s checklist used to address aspects of an image to enhance quality and appeal. For example, “No weird hands. No blurry corners. No cartoon filters. Definitely no watermarks.” Negative prompts result in clean, confident compositions, free of distracting elements and distortions.
The following table provides examples of additional tokens that can be used in negative prompts.

Artifact Type
Tokens to Use

Low quality or noise
blurry, lowres, jpeg artifacts, noisy

Anatomy or model issues
deformed, extra limbs, bad hands, missing fingers

Style clashes
cartoon, illustration, anime, painting

Technical errors
watermark, text, signature, overexposed

General cleanup
ugly, poorly drawn, distortion, worst quality

The following example illustrates how a well-structured negative prompt can enhance photorealism.

Without Negative Prompt
Prompt “(medium full shot) of (charming office cubicle) made of glass material, multiple colors, modern style, space-saving, upholstered seat, patina, gold trim, located in a modern garden, with sleek furniture, stylish decor, bright lighting, comfortable seating, Masterpiece, best quality, raw photo, realistic, very aesthetic, dark “

With Negative Prompt
Prompt “(medium full shot) of (charming office cubicle) made of glass material, multiple colors, modern style, space-saving, upholstered seat, patina, gold trim, located in a modern garden, with sleek furniture, stylish decor, bright lighting, comfortable seating, Masterpiece, best quality, raw photo, realistic, very aesthetic, dark” Negative Prompt “cartoon, 3d render, cgi, oversaturated, smooth plastic textures, unreal lighting, artificial, matte surface, painterly, dreamy, glossy finish, digital art, low detail background”

Emphasize or suppress elements with prompt weighting
Prompt weighting controls the influence of individual elements in AI image generation. These numerical weights prioritize specific prompt components over others. For example, to emphasize the character over the background, you can apply a 1.8 weight to “character” (character: 1.8) and 1.1 to “background” (background: 1.1), which makes sure the model prioritizes character detail while maintaining environmental context. This targeted emphasis produces more precise outputs by minimizing competition between prompt elements and clarifying the model’s priorities.
The syntax for prompt weights is (<term>:<weight>). You can also use a shorthand such as ((<term>)), where the number of parentheses represents the weight. Values between 0.0 and 1.0 deemphasize the term, and values between 1.1 and 2.0 emphasize the term. For example:

(term:1.2): Emphasize
(term:0.8): Deemphasize
((term)): Shorthand for (term:1.2)
((((((((term)))))))): Shorthand for (term:1.8)

The following example shows how prompt weights contribute to the generated output.

Prompt with weights “editorial product photo of (a translucent gel moisturizer jar:1.4) placed on a (frosted glass pedestal:1.2), surrounded by (dewy pink flower petals:1.1), with soft (diffused lighting:1.3), subtle water droplets, shallow depth of field”

Prompt without weights “editorial product photo of a translucent gel moisturizer jar placed on a frosted glass pedestal, surrounded by dewy pink flower petals, with soft, subtle water droplets, shallow depth of field”

You can also use weights in negative prompts to reduce how strongly the model avoids something. For example, “(text:0.5), (blurry:0.2), (lowres:0.1)” still steers the model away from text, blur, and low-resolution artifacts, but with less force than unweighted negative terms.
Giving specific stylistic guidance
Effective prompt writing when using Stability AI Image Services such as Style Transfer and Style Guide requires a good understanding of style matching and reference-driven prompting. These techniques help provide clear stylistic direction for both text-to-image and image-to-image creation.
Image-to-image style transfer extracts stylistic elements from an input image (control image) and uses them to guide the creation of an output image based on the prompt. Approach writing the prompt as if you’re directing a professional photographer or stylist. Focus on materials, lighting quality, and artistic intention—not just objects. For example, a well-structured prompt might read: “Close-up editorial photo of a translucent green lip gloss tube on crushed iridescent plastic, diffused colored lighting, shallow DOF, high fashion product styling.”
Style tag layering: Known aesthetic labels that align with brand identity
The art of crafting effective prompts often relies on incorporating established style tags that resonate with familiar visual languages and datasets. By strategically blending terms from recognized aesthetic categories (ranging from editorial photography and analog film to anime, cyberpunk cityscapes, and brutalist structures), creators can guide the AI toward specific visual outcomes that align with their brand identity. These style descriptors serve as powerful anchors in the prompt engineering process. The versatility of these tags extends further through their ability to be combined and weighted, allowing for nuanced control over the final aesthetic. For instance, a skincare brand might blend the clean lines of product photography with dreamy, surreal elements, whereas a tech company could merge brutalist structure with cyberpunk elements for a distinctive visual identity. This approach to style mixing helps creators improve their outputs while maintaining clear ties to recognizable visual genres that resonate with their target audience. The key is understanding how these style tags interact and using their combinations to create unique, yet culturally relevant, visual expressions that serve specific creative or commercial objectives. The following table provides examples of prompts for a desired aesthetic.

Desired aesthetic
Prompt phrases
Example use case

Retro / Y2K
2000s nostalgia, flash photography, candy tones, harsh lighting
Metallic textures, thin fonts, early digital feel.

Clean modern
neutral tones, soft gradients, minimalist styling, editorial layout
Great for wellness or skincare products.

Bold streetwear
urban background, oversized fit, strong pose, midday shadow
Fashion photography and lifestyle ads. Prioritize outfit structure and location cues.

Hyperreal surrealism
dreamcore lighting, glossy textures, cinematic DOF, surreal shadows
Plays well in music, fashion, or alt-culture campaigns.

Invoke a named style as a reference
Some prompt structures benefit from invoking a named visual signature from a specific artist, especially when combined with your own stylistic phrasing or workflows, as shown in the following example.

Prompt “editorial studio portrait of a woman with glowing skin in minimalist glam makeup, high-contrast lighting, clean background, (depiction of Van Gogh style:1.3)”

The following is a more conceptual example.

Prompt “product shot of a silver hair oil bottle with soft reflections on curved chrome, (depiction of Wes Anderson style:1.2), under cold studio lighting”

These phrases function like calling on a genre; they imply choices around materials, lighting, layout, and color tonality.
Use reference images to guide style
Another useful technique is using a reference image to guide the pose, color, or composition of the output. For use cases like matching a pose from a lookbook image, transferring a color palette from a campaign still, or copying shadowplay from a photo shoot, you can extract and apply structure or style from reference images.
Stability AI Image Services support a variety of image-to-image workflows where you can use a reference image (control image) to guide the output, such as Structure, Sketch, and Style. Tools like ControlNet (a neural network architecture that adds structural control to diffusion models), IP-Adapter (an image prompt adapter), or CLIP-based captioning also enable further control when paired with Stability AI models.
We will discuss ControlNet, IP-Adapter, and CLIP-based captioning in a subsequent post.
The following is an example of an image-to-image workflow:

Find a high-quality editorial reference.
Use it with a depth, canny, or seg ControlNet to lock a pose.
Style with a prompt.

Prompt “fashion editorial of a model in layered knitwear, dramatic colored lighting, strong shadows, high ISO texture”

Create the right mood with lighting control
In a prompt, lighting sets tone, adds dimensionality, and mimics the language of photography. It shouldn’t just be “bright vs. dark.” Lighting is often the style itself, especially for audiences like Gen Z: think TikTok-style lighting, early-aughts flash, harsh backlight, and color gels. The following table provides some useful lighting style prompt terms.

Lighting style
Prompt terms
Example use case

High-contrast studio
hard directional light, deep shadows, controlled highlights
Beauty, tech, fashion with punchy visuals

Soft editorial
diffused light, soft shadows, ambient glow, overcast
Skincare, fashion, wellness

Colored gel lighting
blue and pink gel lighting, dramatic color shadows, rim lighting
Nightlife, music-adjacent fashion, youth-forward styling

Natural bounce
golden hour, soft natural light, sun flare, warm tones
Outdoors, lifestyle, brand-friendly minimalism

Build intent with posing and framing terms
Good posing helps products feel aspirational and digital models more dynamic. With AI, you must be intentional. Framing and pose cues help avoid stiffness, anatomical errors, and randomness. The following table provides some useful posing and framing prompt terms.

Prompt cue
Description
Tip

looking off camera
Creates candid or editorial energy
Useful for lookbooks or ad pages

hands in motion
Adds realism and fluidity
Avoids awkward, static body posture

seated with body turned
Adds depth and twist to the torso
Reduces symmetry, feels natural

shot from low angle
Power or status cue
Works well for stylized streetwear or product hero shots

Example: Putting it all together
The following example puts together what we’ve discussed in this post.

Prompt “studio portrait of a model with platinum hair in metallic cargo pants and a cropped mesh hoodie, seated with legs wide on (acrylic stairs:1.6), magenta and teal gel lighting from left and behind, dramatic contrast, shot on 50mm, streetwear editorial for Gen Z campaign” Negative prompt “blurry, extra limbs, watermark, cartoon, distorted face missing fingers, bad anatomy”

Let’s break down the preceding prompt. We direct the look of the subject (platinum hair, metallic clothes), specify their pose (seated wide-legged, confident, unposed), define the environment (acrylic stairs and studio setup, controlled, modern), state the lighting (mixed gel sources, bold stylization), designate the lens (50mm, portrait realism), and lastly detail the purpose (for Gen Z campaign, sets visual and cultural tone). Together, the prompt produces the desired result.
Best practices and troubleshooting
Prompting is rarely a one-and-done task, especially for creative use cases. Most great images come from refining an idea over multiple attempts. Consider the following methodology to iterate over your prompts:

Keep a prompt log
Change one variable at a time
Save seeds and base images
Use comparison grids

Sometimes things go wrong—maybe the model ignores your prompt, or the image looks messy. These issues are common and often quick to fix, and you can get sharper, cleaner, and more intentional outputs with every adjustment. The following table provides useful tips for troubleshooting your prompts.

Problem
Cause of issue
How to fix it

Style feels random
Model is confused or terms are vague
Clarify style, add weight, remove conflicts

Face gets warped
Over-styled or lacks facial cues
Add portrait of, headshot, or adjust pose or lighting

Image is too dark
Lighting not defined
Add softbox from left, natural light, or time of day

Repetitive poses
Same seed or static structure
Switch seed or change camera angle or subject action

Lacks realism or feels “AI-ish”
Wrong tone or artifacts
Add negatives like cartoon, digital texture, distorted

Conclusion
Mastering advanced prompting techniques can turn basic image generation into professional creative outputs. Stability AI Image Services in Amazon Bedrock provide precise control over visual creation and editing, helping businesses convert concepts into production-ready assets. The combination of technical expertise and creative intent can help creators achieve the precision and consistency required in professional settings. This control proves valuable across multiple applications, such as marketing campaigns, brand consistency, and product visualizations. This post demonstrated how to optimize Stability AI Image Services in Amazon Bedrock to produce high-quality imagery that aligns with your creative goals.
To implement these techniques, access Stability AI Image Services through Amazon Bedrock or explore Stability AI’s foundation models available in Amazon SageMaker JumpStart. You can also find practical code examples in our GitHub repository.

About the authors
Maxfield Hulker is the VP of Community and Business Development at Stability AI. He is a longtime leader in the generative AI space. He has helped build creator-focused platforms like Civitai and Dream Studio. Maxfield regularly publishes guides and tutorials to make advanced AI techniques more accessible.
Suleman Patel is a Senior Solutions Architect at Amazon Web Services (AWS), with a special focus on machine learning and modernization. Leveraging his expertise in both business and technology, Suleman helps customers design and build solutions that tackle real-world business problems. When he’s not immersed in his work, Suleman loves exploring the outdoors, taking road trips, and cooking up delicious dishes in the kitchen.
Isha Dua is a Senior Solutions Architect based in the San Francisco Bay Area, working with generative AI model providers and helping customers optimize their generative AI workloads on AWS. She helps enterprise customers grow by understanding their goals and challenges, and guides them on how they can architect their applications in a cloud-based manner while supporting resilience and scalability. She’s passionate about machine learning technologies and environmental sustainability.
Fabio Branco is a Senior Customer Solutions Manager at Amazon Web Services (AWS) and a strategic advisor, helping customers achieve business transformation, drive innovation through generative AI and data solutions, and successfully navigate their cloud journeys. Prior to AWS, he held Product Management, Engineering, Consulting, and Technology Delivery roles across multiple Fortune 500 companies in industries, including retail and consumer goods, oil and gas, financial services, insurance, and aerospace and defense.

Monitor Amazon Bedrock batch inference using Amazon CloudWatch metrics

As organizations scale their use of generative AI, many workloads require cost-efficient, bulk processing rather than real-time responses. Amazon Bedrock batch inference addresses this need by enabling large datasets to be processed in bulk with predictable performance—at 50% lower cost than on-demand inference. This makes it ideal for tasks such as historical data analysis, large-scale text summarization, and background processing workloads.
In this post, we explore how to monitor and manage Amazon Bedrock batch inference jobs using Amazon CloudWatch metrics, alarms, and dashboards to optimize performance, cost, and operational efficiency.
New features in Amazon Bedrock batch inference
Batch inference in Amazon Bedrock is constantly evolving, and recent updates bring significant enhancements to performance, flexibility, and cost transparency:

Expanded model support – Batch inference now supports additional model families, including Anthropic’s Claude Sonnet 4 and OpenAI OSS models. For the most up-to-date list, refer to Supported Regions and models for batch inference.
Performance enhancements – Batch inference optimizations on newer Anthropic Claude and OpenAI GPT OSS models now deliver higher batch throughput as compared to previous models, helping you process large workloads more quickly.
Job monitoring capabilities – You can now track how your submitted batch jobs are progressing directly in CloudWatch, without the heavy lifting of building custom monitoring solutions. This capability provides AWS account-level visibility into job progress, making it straightforward to manage large-scale workloads.

Use cases for batch inference
AWS recommends using batch inference in the following use cases:

Jobs are not time-sensitive and can tolerate minutes to hours of delay
Processing is periodic, such as daily or weekly summarization of large datasets (news, reports, transcripts)
Bulk or historical data needs to be analyzed, such as archives of call center transcripts, emails, or chat logs
Knowledge bases need enrichment, including generating embeddings, summaries, tags, or translations at scale
Content requires large-scale transformation, such as classification, sentiment analysis, or converting unstructured text into structured outputs
Experimentation or evaluation is needed, for example testing prompt variations or generating synthetic datasets
Compliance and risk checks must be run on historical content for sensitive data detection or governance

Launch an Amazon Bedrock batch inference job
You can start a batch inference job in Amazon Bedrock using the AWS Management Console, AWS SDKs, or AWS Command Line Interface (AWS CLI). For detailed instructions, see Create a batch inference job.
To use the console, complete the following steps:

On the Amazon Bedrock console, choose Batch inference under Infer in the navigation pane.
Choose Create batch inference job.
For Job name, enter a name for your job.
For Model, choose the model to use.
For Input data, enter the location of the Amazon Simple Storage Service (Amazon S3) input bucket (JSONL format).
For Output data, enter the S3 location of the output bucket.
For Service access, select your method to authorize Amazon Bedrock.
Choose Create batch inference job.
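The console steps above have a direct SDK equivalent. The following is a hedged boto3 sketch using create_model_invocation_job; the model ID, role ARN, and bucket names are placeholders, so verify the request shape against the current Amazon Bedrock documentation for your Region.

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

job = bedrock.create_model_invocation_job(
    jobName="claude-batch-summaries-example",
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",   # assumed ID; pick a batch-supported model
    roleArn="arn:aws:iam::111122223333:role/BedrockBatchInferenceRole",
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://my-input-bucket/records.jsonl"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-output-bucket/batch-output/"}},
)
print(job["jobArn"])   # poll get_model_invocation_job(jobIdentifier=job["jobArn"]) for status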

Monitor batch inference with CloudWatch metrics
Amazon Bedrock now automatically publishes metrics for batch inference jobs under the AWS/Bedrock/Batch namespace. You can track batch workload progress at the AWS account level with the following CloudWatch metrics. For current Amazon Bedrock models, these metrics include records pending processing, input and output tokens processed per minute, and for Anthropic Claude models, they also include tokens pending processing.
The following metrics can be monitored by modelId:

NumberOfTokensPendingProcessing – Shows how many tokens are still waiting to be processed, helping you gauge backlog size
NumberOfRecordsPendingProcessing – Tracks how many inference requests remain in the queue, giving visibility into job progress
NumberOfInputTokensProcessedPerMinute – Measures how quickly input tokens are being consumed, indicating overall processing throughput
NumberOfOutputTokensProcessedPerMinute – Measures generation speed

To view these metrics using the CloudWatch console, complete the following steps:

On the CloudWatch console, choose Metrics in the navigation pane.
Filter metrics by AWS/Bedrock/Batch.
Select your modelId to view detailed metrics for your batch job.

To learn more about how to use CloudWatch to monitor metrics, refer to Query your CloudWatch metrics with CloudWatch Metrics Insights.
Best practices for monitoring and managing batch inference
Consider the following best practices for monitoring and managing your batch inference jobs:

Cost monitoring and optimization – By monitoring token throughput metrics (NumberOfInputTokensProcessedPerMinute and NumberOfOutputTokensProcessedPerMinute) alongside your batch job schedules, you can estimate inference costs using information on the Amazon Bedrock pricing page. This helps you understand how fast tokens are being processed, what that means for cost, and how to adjust job size or scheduling to stay within budget while still meeting throughput needs.
SLA and performance tracking – The NumberOfTokensPendingProcessing metric is useful for understanding your batch backlog size and tracking overall job progress, but it should not be relied on to predict job completion times because they might vary depending on overall inference traffic to Amazon Bedrock. To understand batch processing speed, we recommend monitoring throughput metrics (NumberOfInputTokensProcessedPerMinute and NumberOfOutputTokensProcessedPerMinute) instead. If these throughput rates fall significantly below your expected baseline, you can configure automated alerts to trigger remediation steps—for example, shifting some jobs to on-demand processing to meet your expected timelines.
Job completion tracking – When the metric NumberOfRecordsPendingProcessing reaches zero, it indicates that all running batch inference jobs are complete. You can use this signal to trigger stakeholder notifications or start downstream workflows.

Example of CloudWatch metrics
In this section, we demonstrate how you can use CloudWatch metrics to set up proactive alerts and automation.
For example, you can create a CloudWatch alarm that sends an Amazon Simple Notification Service (Amazon SNS) notification when the average NumberOfInputTokensProcessedPerMinute exceeds 1 million within a 6-hour period. This alert could prompt an Ops team review or trigger downstream data pipelines.
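A hedged sketch of creating such an alarm with boto3 follows; the ModelId dimension and the SNS topic ARN are placeholders you would replace with your own values.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm when average input-token throughput over a 6-hour window exceeds 1 million.
cloudwatch.put_metric_alarm(
    AlarmName="bedrock-batch-input-token-throughput",
    Namespace="AWS/Bedrock/Batch",
    MetricName="NumberOfInputTokensProcessedPerMinute",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-5-sonnet-20240620-v1:0"}],  # example value
    Statistic="Average",
    Period=21600,              # 6 hours, in seconds
    EvaluationPeriods=1,
    Threshold=1_000_000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-team-notifications"],  # placeholder topic ARN
    TreatMissingData="notBreaching",
)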

The following screenshot shows that the alert has In alarm status because the batch inference job met the threshold. The alarm will trigger the target action, in our case an SNS notification email to the Ops team.

The following screenshot shows an example of the email the Ops team received, notifying them that the number of processed tokens exceeded their threshold.

You can also build a CloudWatch dashboard displaying the relevant metrics. This is ideal for centralized operational monitoring and troubleshooting.

Conclusion
Amazon Bedrock batch inference now offers expanded model support, improved performance, deeper visibility into the progress of your batch workloads, and enhanced cost monitoring.
Get started today by launching an Amazon Bedrock batch inference job, setting up CloudWatch alarms, and building a monitoring dashboard, so you can maximize efficiency and value from your generative AI workloads.

About the authors
Vamsi Thilak Gudi is a Solutions Architect at Amazon Web Services (AWS) in Austin, Texas, helping Public Sector customers build effective cloud solutions. He brings diverse technical experience to show customers what’s possible with AWS technologies. He actively contributes to the AWS Technical Field Community for Generative AI.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Avish Khosla is a software developer on the Amazon Bedrock Batch Inference team, where the team builds reliable, scalable systems to run large-scale inference workloads on generative AI models. He cares about clean architecture and great docs. When he is not shipping code, he is on a badminton court or glued to a good cricket match.
Chintan Vyas serves as a Principal Product Manager–Technical at Amazon Web Services (AWS), where he focuses on Amazon Bedrock services. With over a decade of experience in Software Engineering and Product Management, he specializes in building and scaling large-scale, secure, and high-performance Generative AI services. In his current role, he leads the enhancement of programmatic interfaces for Amazon Bedrock. Throughout his tenure at AWS, he has successfully driven Product Management initiatives across multiple strategic services, including Service Quotas, Resource Management, Tagging, Amazon Personalize, Amazon Bedrock, and more. Outside of work, Chintan is passionate about mentoring emerging Product Managers and enjoys exploring the scenic mountain ranges of the Pacific Northwest.
Mayank Parashar is a Software Development Manager for Amazon Bedrock services.

IBM AI Releases Granite-Docling-258M: An Open-Source, Enterprise-Ready …

IBM has released Granite-Docling-258M, an open-source (Apache-2.0) vision-language model designed specifically for end-to-end document conversion. The model targets layout-faithful extraction—tables, code, equations, lists, captions, and reading order—emitting a structured, machine-readable representation rather than lossy Markdown. It is available on Hugging Face with a live demo and MLX build for Apple Silicon.

What’s new compared to SmolDocling?

Granite-Docling is the product-ready successor to SmolDocling-256M. IBM replaced the earlier backbone with a Granite 165M language model and upgraded the vision encoder to SigLIP2 (base, patch16-512) while retaining the Idefics3-style connector (pixel-shuffle projector). The resulting model has 258M parameters and shows consistent accuracy gains across layout analysis, full-page OCR, code, equations, and tables (see metrics below). IBM also addressed instability failure modes observed in the preview model (e.g., repetitive token loops).

Architecture and training pipeline

Backbone: Idefics3-derived stack with SigLIP2 vision encoder → pixel-shuffle connector → Granite 165M LLM.

Training framework: nanoVLM (lightweight, pure-PyTorch VLM training toolkit).

Representation: Outputs DocTags, an IBM-authored markup designed for unambiguous document structure (elements + coordinates + relationships), which downstream tools convert to Markdown/HTML/JSON.

Compute: Trained on IBM’s Blue Vela H100 cluster.

Quantified improvements (Granite-Docling-258M vs. SmolDocling-256M preview)

Evaluated with docling-eval, LMMS-Eval, and task-specific datasets:

Layout: MAP 0.27 vs. 0.23; F1 0.86 vs. 0.85.

Full-page OCR: F1 0.84 vs. 0.80; lower edit distance.

Code recognition: F1 0.988 vs. 0.915; edit distance 0.013 vs. 0.114.

Equation recognition: F1 0.968 vs. 0.947.

Table recognition (FinTabNet @150dpi): TEDS-structure 0.97 vs. 0.82; TEDS with content 0.96 vs. 0.76.

Other benchmarks: MMStar 0.30 vs. 0.17; OCRBench 500 vs. 338.

Stability: “Avoids infinite loops more effectively” (production-oriented fix).

Multilingual support

Granite-Docling adds experimental support for Japanese, Arabic, and Chinese. IBM marks this as early-stage; English remains the primary target.

How the DocTags pathway changes Document AI

Conventional OCR-to-Markdown pipelines lose structural information and complicate downstream retrieval-augmented generation (RAG). Granite-Docling emits DocTags—a compact, LLM-friendly structural grammar—which Docling converts into Markdown/HTML/JSON. This preserves table topology, inline/floating math, code blocks, captions, and reading order with explicit coordinates, improving index quality and grounding for RAG and analytics.

Inference and integration

Docling Integration (recommended): The docling CLI/SDK automatically pulls Granite-Docling and converts PDFs/office docs/images to multiple formats. IBM positions the model as a component inside Docling pipelines rather than a general VLM.
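As a rough sketch of the SDK route, the basic Docling conversion call looks like the following. This uses Docling's default pipeline, the file name is a placeholder, and explicitly selecting the Granite-Docling VLM pipeline requires additional pipeline options not shown here.

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("annual_report.pdf")   # local path or URL to a document
print(result.document.export_to_markdown())       # export the parsed structure as Markdown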

Runtimes: Works with Transformers, vLLM, ONNX, and MLX; a dedicated MLX build is optimized for Apple Silicon. A Hugging Face Space provides an interactive demo (ZeroGPU).

License: Apache-2.0.

Why Granite-Docling?

For enterprise document AI, small VLMs that preserve structure reduce inference cost and pipeline complexity. Granite-Docling replaces multiple single-purpose models (layout, OCR, table, code, equations) with a single component that emits a richer intermediate representation, improving downstream retrieval and conversion fidelity. The measured gains—in TEDS for tables, F1 for code/equations, and reduced instability—make it a practical upgrade from SmolDocling for production workflows.

Demo

Summary

Granite-Docling-258M marks a significant advancement in compact, structure-preserving document AI. By combining IBM’s Granite backbone, SigLIP2 vision encoder, and the nanoVLM training framework, it delivers enterprise-ready performance across tables, equations, code, and multilingual text—all while remaining lightweight and open-source under Apache 2.0. With measurable gains over its SmolDocling predecessor and seamless integration into Docling pipelines, Granite-Docling provides a practical foundation for document conversion and RAG workflows where precision and reliability are critical.

Check out the Models on Hugging Face and Demo here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post IBM AI Releases Granite-Docling-258M: An Open-Source, Enterprise-Ready Document AI Model appeared first on MarkTechPost.

How to Build an Advanced End-to-End Voice AI Agent Using Hugging Face …

In this tutorial, we build an advanced voice AI agent using Hugging Face’s freely available models, and we keep the entire pipeline simple enough to run smoothly on Google Colab. We combine Whisper for speech recognition, FLAN-T5 for natural language reasoning, and Bark for speech synthesis, all connected through transformers pipelines. By doing this, we avoid heavy dependencies, API keys, or complicated setups, and we focus on showing how we can turn voice input into meaningful conversation and get back natural-sounding voice responses in real time. Check out the FULL CODES here.

!pip -q install "transformers>=4.42.0" accelerate torchaudio sentencepiece gradio soundfile

import os, torch, tempfile, numpy as np
import gradio as gr
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

DEVICE = 0 if torch.cuda.is_available() else -1

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small.en",
    device=DEVICE,
    chunk_length_s=30,
    return_timestamps=False
)

LLM_MODEL = "google/flan-t5-base"
tok = AutoTokenizer.from_pretrained(LLM_MODEL)
llm = AutoModelForSeq2SeqLM.from_pretrained(LLM_MODEL, device_map="auto")

tts = pipeline("text-to-speech", model="suno/bark-small")

We install the necessary libraries and load three Hugging Face pipelines: Whisper for speech-to-text, FLAN-T5 for generating responses, and Bark for text-to-speech. We set the device automatically so that we can use GPU if available. Check out the FULL CODES here.

SYSTEM_PROMPT = (
    "You are a helpful, concise voice assistant. "
    "Prefer direct, structured answers. "
    "If the user asks for steps or code, use short bullet points."
)

def format_dialog(history, user_text):
    turns = []
    for u, a in history:
        if u: turns.append(f"User: {u}")
        if a: turns.append(f"Assistant: {a}")
    turns.append(f"User: {user_text}")
    prompt = (
        "Instruction:\n"
        f"{SYSTEM_PROMPT}\n\n"
        "Dialog so far:\n" + "\n".join(turns) + "\n\n"
        "Assistant:"
    )
    return prompt

We define a system prompt that guides our agent to stay concise and structured, and we implement a format_dialog function that takes past conversation history along with the user input and builds a prompt string for the model to generate the assistant’s reply. Check out the FULL CODES here.

def transcribe(filepath):
    out = asr(filepath)
    text = out["text"].strip()
    return text

def generate_reply(history, user_text, max_new_tokens=256):
    prompt = format_dialog(history, user_text)
    inputs = tok(prompt, return_tensors="pt", truncation=True).to(llm.device)
    with torch.no_grad():
        ids = llm.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.05,
        )
    reply = tok.decode(ids[0], skip_special_tokens=True).strip()
    return reply

def synthesize_speech(text):
    out = tts(text)
    audio = out["audio"]
    sr = out["sampling_rate"]
    audio = np.asarray(audio, dtype=np.float32)
    return (sr, audio)

We create three core functions for our voice agent: transcribe converts recorded audio into text using Whisper, generate_reply builds a context-aware response from FLAN-T5, and synthesize_speech turns that response back into spoken audio with Bark. Check out the FULL CODES here.

def clear_history():
    return [], []

def voice_to_voice(mic_file, history):
    history = history or []
    if not mic_file:
        return history, None, "Please record something!"
    try:
        user_text = transcribe(mic_file)
    except Exception as e:
        return history, None, f"ASR error: {e}"

    if not user_text:
        return history, None, "Didn't catch that. Try again?"

    try:
        reply = generate_reply(history, user_text)
    except Exception as e:
        return history, None, f"LLM error: {e}"

    try:
        sr, wav = synthesize_speech(reply)
    except Exception as e:
        return history + [(user_text, reply)], None, f"TTS error: {e}"

    return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"

def text_to_voice(user_text, history):
    history = history or []
    user_text = (user_text or "").strip()
    if not user_text:
        return history, None, "Type a message first."
    try:
        reply = generate_reply(history, user_text)
        sr, wav = synthesize_speech(reply)
    except Exception as e:
        return history, None, f"Error: {e}"
    return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"

def export_chat(history):
    lines = []
    for u, a in history or []:
        lines += [f"User: {u}", f"Assistant: {a}", ""]
    text = "\n".join(lines).strip() or "No conversation yet."
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt", mode="w") as f:
        f.write(text)
        path = f.name
    return path

We add interactive functions for our agent: clear_history resets the conversation, voice_to_voice handles speech input and returns a spoken reply, text_to_voice processes typed input and speaks back, and export_chat saves the entire dialog into a downloadable text file. Check out the FULL CODES here.

with gr.Blocks(title="Advanced Voice AI Agent (HF Pipelines)") as demo:
    gr.Markdown(
        "## Advanced Voice AI Agent (Hugging Face Pipelines Only)\n"
        "- **ASR**: openai/whisper-small.en\n"
        "- **LLM**: google/flan-t5-base\n"
        "- **TTS**: suno/bark-small\n"
        "Speak or type; the agent replies with voice + text."
    )

    with gr.Row():
        with gr.Column(scale=1):
            mic = gr.Audio(sources=["microphone"], type="filepath", label="Record")
            say_btn = gr.Button("Speak")
            text_in = gr.Textbox(label="Or type instead", placeholder="Ask me anything...")
            text_btn = gr.Button("Send")
            export_btn = gr.Button("Export Chat (.txt)")
            reset_btn = gr.Button("Reset")
        with gr.Column(scale=1):
            audio_out = gr.Audio(label="Assistant Voice", autoplay=True)
            transcript = gr.Textbox(label="Transcript", lines=6)
            chat = gr.Chatbot(height=360)
    state = gr.State([])

    def update_chat(history):
        return [(u, a) for u, a in (history or [])]

    say_btn.click(voice_to_voice, [mic, state], [state, audio_out, transcript]).then(
        update_chat, inputs=state, outputs=chat
    )
    text_btn.click(text_to_voice, [text_in, state], [state, audio_out, transcript]).then(
        update_chat, inputs=state, outputs=chat
    )
    reset_btn.click(clear_history, None, [chat, state])
    export_btn.click(export_chat, state, gr.File(label="Download chat.txt"))

demo.launch(debug=False)

We build a clean Gradio UI that lets us speak or type and then hear the agent’s response. We wire buttons to our callbacks, maintain chat state, and stream results into a chatbot, transcript, and audio player, all launched in one Colab app.

In conclusion, we see how seamlessly Hugging Face pipelines enable us to create a voice-driven conversational agent that listens, thinks, and responds. We now have a working demo that captures audio, transcribes it, generates intelligent responses, and returns speech output, all inside Colab. With this foundation, we can experiment with larger models, add multilingual support, or even extend the system with custom logic. Still, the core idea remains the same: we can bring together ASR, LLM, and TTS into one smooth workflow for an interactive voice AI experience.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post How to Build an Advanced End-to-End Voice AI Agent Using Hugging Face Pipelines? appeared first on MarkTechPost.

Meta AI Researchers Release MapAnything: An End-to-End Transformer Arc …

A team of researchers from Meta Reality Labs and Carnegie Mellon University has introduced MapAnything, an end-to-end transformer architecture that directly regresses factored metric 3D scene geometry from images and optional sensor inputs. Released under Apache 2.0 with full training and benchmarking code, MapAnything advances beyond specialist pipelines by supporting over 12 distinct 3D vision tasks in a single feed-forward pass.

https://map-anything.github.io/assets/MapAnything.pdf

Why a Universal Model for 3D Reconstruction?

Image-based 3D reconstruction has historically relied on fragmented pipelines: feature detection, two-view pose estimation, bundle adjustment, multi-view stereo, or monocular depth inference. While effective, these modular solutions require task-specific tuning, optimization, and heavy post-processing.

Recent transformer-based feed-forward models such as DUSt3R, MASt3R, and VGGT simplified parts of this pipeline but remained limited: fixed numbers of views, rigid camera assumptions, or reliance on coupled representations that needed expensive optimization.

MapAnything overcomes these constraints by:

Accepting up to 2,000 input images in a single inference run.

Flexibly using auxiliary data such as camera intrinsics, poses, and depth maps.

Producing direct metric 3D reconstructions without bundle adjustment.

The model’s factored scene representation—composed of ray maps, depth, poses, and a global scale factor—provides modularity and generality unmatched by prior approaches.

Architecture and Representation

At its core, MapAnything employs a multi-view alternating-attention transformer. Each input image is encoded with DINOv2 ViT-L features, while optional inputs (rays, depth, poses) are encoded into the same latent space via shallow CNNs or MLPs. A learnable scale token enables metric normalization across views.

The network outputs a factored representation:

Per-view ray directions (camera calibration).

Depth along rays, predicted up-to-scale.

Camera poses relative to a reference view.

A single metric scale factor converting local reconstructions into a globally consistent frame.

This explicit factorization avoids redundancy, allowing the same model to handle monocular depth estimation, multi-view stereo, structure-from-motion (SfM), or depth completion without specialized heads.
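To make the factorization concrete, the following is a small illustrative NumPy sketch of how such factored outputs could be composed into metric 3D points in the reference frame. The function name, shapes, and scaling/pose conventions are assumptions for exposition, not the released model's API.

import numpy as np

def compose_metric_points(ray_dirs, depth, R, t, metric_scale):
    """Combine factored outputs (rays, depth, pose, scale) into metric 3D points.

    ray_dirs:     (H, W, 3) unit ray directions for one view (camera calibration)
    depth:        (H, W)    up-to-scale depth along each ray
    R, t:         (3, 3) and (3,) pose of this view relative to the reference view
    metric_scale: scalar converting the up-to-scale reconstruction to metric units
    """
    pts_cam = metric_scale * depth[..., None] * ray_dirs  # points in this view's camera frame
    pts_ref = pts_cam @ R.T + t                           # transform into the reference frame
    return pts_ref

# Toy usage with random values (shapes only; not real model outputs)
H, W = 4, 5
rays = np.random.randn(H, W, 3)
rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
pts = compose_metric_points(rays, np.abs(np.random.rand(H, W)), np.eye(3), np.zeros(3), 1.0)
print(pts.shape)  # (4, 5, 3)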

https://map-anything.github.io/assets/MapAnything.pdf

Training Strategy

MapAnything was trained across 13 diverse datasets spanning indoor, outdoor, and synthetic domains, including BlendedMVS, Mapillary Planet-Scale Depth, ScanNet++, and TartanAirV2. Two variants are released:

Apache 2.0 licensed model trained on six datasets.

CC BY-NC model trained on all thirteen datasets for stronger performance.

Key training strategies include:

Probabilistic input dropout: During training, geometric inputs (rays, depth, pose) are provided with varying probabilities, enabling robustness across heterogeneous configurations.

Covisibility-based sampling: Ensures input views have meaningful overlap, supporting reconstruction up to 100+ views.

Factored losses in log-space: Depth, scale, and pose are optimized using scale-invariant and robust regression losses to improve stability.
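The exact losses are described in the paper; as a rough illustration of the idea, the following NumPy sketch shows a scale-invariant, robust depth loss in log-space, where median alignment removes any global scale difference and an absolute error keeps the penalty robust. It is not the paper's actual formulation.

import numpy as np

def scale_invariant_log_depth_loss(pred, gt, eps=1e-6):
    """Illustrative scale-invariant depth loss in log-space (not the paper's exact loss)."""
    log_pred = np.log(np.clip(pred, eps, None))
    log_gt = np.log(np.clip(gt, eps, None))
    # Remove global scale via median alignment in log-space
    shift = np.median(log_gt - log_pred)
    return np.mean(np.abs(log_pred + shift - log_gt))

pred = np.random.rand(64, 64) * 3.0                   # arbitrary up-to-scale depths
gt = 2.5 * pred + 0.05 * np.random.rand(64, 64)        # same geometry at a different scale, plus noise
print(scale_invariant_log_depth_loss(pred, gt))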

Training was performed on 64 H200 GPUs with mixed precision, gradient checkpointing, and curriculum scheduling, scaling from 4 to 24 input views.

Benchmarking Results

Multi-View Dense Reconstruction

On ETH3D, ScanNet++ v2, and TartanAirV2-WB, MapAnything achieves state-of-the-art (SoTA) performance across pointmaps, depth, pose, and ray estimation. It surpasses baselines like VGGT and Pow3R even when limited to images only, and improves further with calibration or pose priors.

For example:

Pointmap relative error (rel) improves to 0.16 with only images, compared to 0.20 for VGGT.

With images + intrinsics + poses + depth, the error drops to 0.01, while achieving >90% inlier ratios.

Two-View Reconstruction

Against DUSt3R, MASt3R, and Pow3R, MapAnything consistently outperforms across scale, depth, and pose accuracy. Notably, with additional priors, it achieves >92% inlier ratios on two-view tasks, significantly beyond prior feed-forward models.

Single-View Calibration

Despite not being trained specifically for single-image calibration, MapAnything achieves an average angular error of 1.18°, outperforming AnyCalib (2.01°) and MoGe-2 (1.95°).

Depth Estimation

On the Robust-MVD benchmark:

MapAnything sets new SoTA for multi-view metric depth estimation.

With auxiliary inputs, its error rates rival or surpass specialized depth models such as MVSA and Metric3D v2.

Overall, benchmarks confirm a 2× improvement over prior SoTA methods on many tasks, validating the benefits of unified training.

Key Contributions

The research team highlights four major contributions:

Unified Feed-Forward Model capable of handling more than 12 problem settings, from monocular depth to SfM and stereo.

Factored Scene Representation enabling explicit separation of rays, depth, pose, and metric scale.

State-of-the-Art Performance across diverse benchmarks with fewer redundancies and higher scalability.

Open-Source Release including data processing, training scripts, benchmarks, and pretrained weights under Apache 2.0.

Conclusion

MapAnything establishes a new benchmark in 3D vision by unifying multiple reconstruction tasks—SfM, stereo, depth estimation, and calibration—under a single transformer model with a factored scene representation. It not only outperforms specialist methods across benchmarks but also adapts seamlessly to heterogeneous inputs, including intrinsics, poses, and depth. With open-source code, pretrained models, and support for over 12 tasks, MapAnything lays the groundwork for a truly general-purpose 3D reconstruction backbone.

Check out the Paper, Codes and Project Page. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post Meta AI Researchers Release MapAnything: An End-to-End Transformer Architecture that Directly Regresses Factored, Metric 3D Scene Geometry appeared first on MarkTechPost.

Supercharge your organization’s productivity with the Amazon Q Busin …

Generative AI solutions like Amazon Q Business are transforming the way employees work. Organizations in every industry are embracing these tools to help their workforce extract valuable insights from increasingly fragmented data to accelerate decision-making processes. However, the adoption of generative AI tools hasn’t been without its challenges.
Two hurdles have emerged in the implementation of generative AI solutions. First, users often find themselves compelled to abandon familiar workflows, manually transferring data to an AI assistant for analysis. This creates unnecessary friction and increases the time to value. Second, the absence of generative AI tools in commonly used software makes it difficult for employees to identify opportunities where AI can significantly boost their productivity.
Enter Amazon Q Business, a generative AI-powered assistant tailored for the modern workplace, so you can engage in conversations, solve complex problems, and take action by seamlessly connecting to company data and enterprise systems. Amazon Q Business provides employees with instant access to relevant information and advice, streamlining tasks, accelerating decision-making, and fostering creativity and innovation in the workplace. We recently launched the Amazon Q Business browser extension, which is now available to Amazon Q Business subscribers (Lite and Pro). The Amazon Q Business browser extension brings the power of Amazon Q Business directly into your browser, so you can receive context-aware, generative AI assistance and get on-the-go help for daily tasks.
In this post, we show how to implement this solution for your own enterprise, giving your team seamless access to AI-driven insights and assistance.
Use cases for the Amazon Q Business browser extension
The Amazon Q Business browser extension is deployed to all Amazonians, making tens of thousands of users more productive every day. In this section, we highlight some of the most impactful use cases for which Amazonians use the Amazon Q Business browser extension to boost their productivity.
Analyze web content
Business and technical teams need to analyze and synthesize information across various reports, competitive analyses, and industry documents found outside the company’s data to develop insights and strategy. They must make sure their strategic recommendations are based on verified data sources and trustworthy industry information. Additionally, identifying patterns across multiple sources is time-consuming and complex. With the Amazon Q Business browser extension, strategists can quickly generate industry insights and identify trends across trusted internal and external data sources in seconds, while maintaining the human element in strategic thinking.
Check out the following demo video:

Improve content quality
The Amazon Q Business browser extension brings the unique ability to incorporate context that might not be readily available to your generative AI assistant. You can use the Amazon Q Business browser extension for content creation and content quality improvements by including multiple disparate sources in your queries that typically aren’t available to generative AI assistants. You can use it to perform real-time validation of content from various sources and incorporate web-based style guides and best practices to accelerate content creation.
Check out the following demo video:

Solution overview
In the following sections, we walk through how to get started with the Amazon Q Business browser extension if you have already enabled Amazon Q Business for your organization. To learn more, see Configuring the Amazon Q Business browser extension for use.
Prerequisites
Complete the prerequisite steps in this section before deploying the browser extension.
Create an Amazon Q Business application and subscribe your users
The Amazon Q Business browser extension is a feature of Amazon Q Business and requires customers to first create an Amazon Q Business application and subscribe their users before the browser extension can be enabled. To learn more about how you can get started with Amazon Q Business, see Getting started with Amazon Q Business.
Set up the Amazon Q Business web experience
The browser extension uses the Amazon Q Business web experience client as the mechanism to authenticate users and offer Amazon Q Business features. The first step to enabling the browser extension is to create an Amazon Q Business web experience. If you have already created a web experience for your users, you can skip this step. However, if you have developed a custom web experience using the Amazon Q Business APIs, complete the following steps to create an Amazon Q Business web experience:

On the Amazon Q Business console, go to your Amazon Q Business application.

The Web experience settings section shows if you already have a web experience deployed. If you don’t have a web experience deployed, this section will be empty, with the message “A web experience needs to be created before deploying.”

At the top of your application details page, choose Edit.

For Outcome, select Web experience.
Choose Update.

This step might take a few minutes to complete.

After your web experience is deployed, you will find a URL where your web experience is hosted on your Amazon Q Business application details page. Save this URL for later.

Grant users access to send queries directly to the large language model
The Amazon Q Business browser extension can include your users’ web page context in queries by passing the web page content as file attachments alongside a user’s prompt. Because the file attachment feature is available only for General knowledge mode, the browser extension requires Amazon Q Business admins to grant users access to send queries directly to the large language model (LLM) to take advantage of the full feature set of the browser extension. Without this prerequisite, users can only access their company knowledge through the browser extension and can’t ask Amazon Q Business questions about their web page content.
Amazon Q Business does not store user conversation data and does not use queries or conversations for training its LLMs. Conversations are only stored within the application for 30 days. You can delete these conversations by accessing the Amazon Q Business web experience and choosing Chat in the navigation pane, as shown in the following screenshot.

To grant users access to send queries directly to the Amazon Q LLM, complete the following steps:

On the Amazon Q Business console, go to your application.
Choose Admin controls and guardrails in the navigation pane.

In the Global controls section, choose Edit.

Select Allow end users to send queries directly to the LLM.
Choose Save.

You are now ready to enable the browser extension for your users.
Configure the Amazon Q Business browser extension
Now that you have completed the prerequisites for the browser extension, complete the following steps to enable the browser extension for your users:

On the Amazon Q Business console, go to your application.
Under Enhancements in the navigation pane, choose Integrations.
In the Browser extensions section, choose Edit.

Select the check boxes for the browser extensions you want to enable:

The Chromium check box enables the Chrome store extension, which supports Google Chrome and Microsoft Edge browsers.
The Firefox check box enables the Firefox Browser add-on for Firefox browsers.

You can also view the Chrome or Firefox store pages for the extension using the links in the respective Learn more sections.

Choose Save.

Your users will now see instructions to install the Amazon Q Business browser extension the next time they log in to the Amazon Q Business web experience. If you have not yet done so, share the web experience URL you obtained in the earlier steps with your users so they can follow the steps to install the browser extension.
Activate the browser extension if you are using IAM federation authentication for Amazon Q Business
If you’re using an external identity provider (IdP) for your Amazon Q Business application, you must allow-list the browser extension with the external provider before your users can start using the browser extension. You can allow-list the following URLs with your IdP to activate the browser extension:

For the Chromium browser extension (suitable for Google Chrome and Microsoft Edge), use https://feihpdljijcgnokhfoibicengfiellbp.chromiumapp.org/
For the Mozilla Firefox browser extension, use https://ba6e8e6e4fa44c1057cf5f26fba9b2e788dfc34f.extensions.allizom.org/

You don’t need to take the aforementioned steps if you’re using AWS IAM Identity Center as the authentication solution for your Amazon Q Business application.
Get started with the browser extension
After you share the web experience URL with your users, they can use it to find the browser extension store page and install the browser extension. Users can complete the following steps:

Log in to the Amazon Q Business web experience provided by your admin.

You will notice a banner letting you know that your admin has enabled the browser extension for you.

Choose Install extension.

The link will take you to the appropriate Amazon Q Business browser extension store page based on the browser you’re using.

Choose Add to Chrome or the appropriate installation option for your browser.

Upon installing the extension, you will find it in your browser’s tool bar under Extensions. You can choose the pin icon to pin the browser extension.

After you open your browser extension, you will see a side pane as shown in the following screenshot. It will automatically detect the correct web experience URL from your open tabs to help you sign in. If it doesn’t, enter the web experience URL provided by your admin in the Amazon Q URL section and choose Sign in.

Upon sign in, you’re ready to go! Refer to the earlier section discussing Amazon’s use cases for inspiration on how you can use the extension to boost your productivity.

Deploy the Amazon Q Business browser extension on behalf of your users
Some admins might choose to directly deploy the Amazon Q Business browser extension on their users’ browsers to streamline and accelerate adoption.
Enterprises use varying mobile device management software and have differing requirements for their browser policies. To deploy the Amazon Q Business browser extension, refer to the following resources:

Mozilla Firefox policy settings
Google Chrome policy settings
Microsoft Edge:

Policy settings
Reference guide

Customize the Amazon Q Business browser extension for your enterprise
Some admins might choose to customize the look and feel of the Amazon Q Business browser extension to fit their enterprise’s needs. This section outlines the extension’s supported customization functionality and the corresponding browser extension policy values to configure on your users’ browsers.
Remove the Amazon Q Business URL input from the browser extension login page
If you don’t want to require an Amazon Q Business web experience URL from your users at sign-in, you can set a default URL on their behalf by setting the Q_BIZ_BROWSER_EXTENSION_URL policy to the appropriate Amazon Q Business web experience URL for your users.

Replace the browser extension’s toolbar icon
You can modify the toolbar icon of your browser extension by setting the value of one or more of the following browser policy keys to the URL of your PNG or SVG image or a valid datauri for your users:

Q_BIZ_BROWSER_EXTENSION_ICON_128 (mandatory)
Q_BIZ_BROWSER_EXTENSION_ICON_16 (optional)
Q_BIZ_BROWSER_EXTENSION_ICON_32 (optional)
Q_BIZ_BROWSER_EXTENSION_ICON_48 (optional)

Replace the logo or icon in the browser extension window
To change the logo or icon in your browser extension window, set the value of the Q_BIZ_BROWSER_EXTENSION_LOGO policy key with a URL to your PNG or SVG image or a valid datauri for your users.

Modify the name of the browser extension shown in the browser extension window
To replace references to “Amazon Q,” “Amazon Q Business,” “AWS,” and “Amazon Web Services” with a name of your choice inside the browser extension window, set the value of the Q_BIZ_BROWSER_EXTENSION_ENTERPRISE_NAME policy key with the new name for your users.

Modify the title of your browser extension in hover text
To change the title of your browser extension as it shows in the text when hovering over your extension (“Amazon Q Business has access to this site,” as seen in the prior screenshot), set the Q_BIZ_BROWSER_EXTENSION_TITLE_NAME policy to the appropriate string for your users.

Replace the AI policy link in the browser extension footer with your own link
To replace the link text in the footer of your browser extension, set Q_BIZ_BROWSER_EXTENSION_FOOTER_POLICY_NAME to the appropriate string for your users.
To replace the URL in the footer of your browser extension, set Q_BIZ_BROWSER_EXTENSION_FOOTER_POLICY_URL to the appropriate URL for your users.

Congratulations! You and your organization are ready to receive generative assistance for your browser-based tasks.
Clean up
This section outlines the steps to disable or remove the browser extension or revert deployments and customization for your users.
Disable the Amazon Q Business browser extension through the Amazon Q Business console
You can disable the Amazon Q Business browser extension from the Amazon Q Business console whenever you choose, even before removing the browser extension from your users’ browsers. To do so, complete the following steps:

On the Amazon Q Business console, go to your application.
Under Enhancements in the navigation pane, choose Integrations.
In the Browser extensions section, choose Edit.

Deselect the check boxes for the browser extensions you want to disable:

The Chromium check box disables the Chrome store extension, which supports Google Chrome and Microsoft Edge browsers.
The Firefox check box disables the Firefox Browser add-on for Firefox browsers.

Choose Save.

Revert the deployment of the Amazon Q Business browser extension on behalf of your users
Enterprises use varying mobile device management software and have differing requirements for their browser policies. If you deployed the browser extension by updating your browser policy settings, you should remove those policies by following the guidance in the policy settings documentation for the respective browsers:

Mozilla Firefox policy settings
Google Chrome policy settings
Microsoft Edge:

Policy settings
Reference guide

Revert customizations of the Amazon Q Business browser extension
If you customized the Amazon Q Business browser extension by modifying browser policies as detailed earlier in this post, you can revert those customizations by simply removing the corresponding policy entry in your browser policy settings.
Conclusion
In this post, we showed how to use the Amazon Q Business browser extension to give your team seamless access to AI-driven insights and assistance. The browser extension is now available in US East (N. Virginia) and US West (Oregon) AWS Regions for Mozilla, Google Chrome, and Microsoft Edge as part of the Lite Subscription. There is no additional cost to use the browser extension.
To get started, log in to the Amazon Q Business console and set up the browser extension for your Amazon Q Business application. To learn more, see Configuring the Amazon Q Business browser extension for use.

About the authors
Firaz Akmal is a Sr. Product Manager for Amazon Q Business and has been at AWS for 8+ years. He is a customer advocate, helping customers transform their search and generative AI use-cases on AWS. Outside of work Firaz enjoys spending time in the mountains of the PNW or experiencing the world through his daughter’s perspective.
Abhinand Sukumar is a Senior Product Manager at Amazon Web Services for Amazon Q Business, where he drives the product vision and roadmap for innovative generative AI solutions. Abhinand works closely with customers and engineering to deliver successful integrations, including the browser extension. His expertise spans generative AI experiences and AI/ML educational devices, with a deep passion for education, artificial intelligence, and design thinking. Prior to joining AWS, Abhinand worked as an embedded software engineer in the networking industry, and he brings 5 to 6 years of experience in technology.

Build Agentic Workflows with OpenAI GPT OSS on Amazon SageMaker AI and …

OpenAI has released two open-weight models, gpt-oss-120b (117 billion parameters) and gpt-oss-20b (21 billion parameters), both built with a Mixture of Experts (MoE) design and a 128K context window. These models are the leading open source models, according to Artificial Analysis benchmarks, and excel at reasoning and agentic workflows. With Amazon SageMaker AI, you can fine-tune or customize models and deploy with your choice of framework through a fully managed service. Amazon SageMaker Inference gives you the flexibility to bring your own inference code and framework without having to build and maintain your own clusters.
Although large language models (LLMs) excel at understanding language and generating content, building real-world agentic applications requires complex workflow management, tool calling capabilities, and context management. Multi-agent architectures address these challenges by breaking down complex systems into specialized components, but they introduce new complexities in agent coordination, memory management, and workflow orchestration.
In this post, we show how to deploy gpt-oss-20b model to SageMaker managed endpoints and demonstrate a practical stock analyzer agent assistant example with LangGraph, a powerful graph-based framework that handles state management, coordinated workflows, and persistent memory systems. We will then deploy our agents to Amazon Bedrock AgentCore, a unified orchestration layer that abstracts away infrastructure and allows you to securely deploy and operate AI agents at scale.
Solution overview
In this solution, we build an agentic stock analyzer with the following key components:

The GPT OSS 20B model deployed to a SageMaker endpoint using vLLM, an open source serving framework for LLMs
LangGraph to build a multi-agent orchestration framework
Amazon Bedrock AgentCore to deploy the agents

The following diagram illustrates the solution architecture.

This architecture illustrates a multi-agent workflow hosted on Amazon Bedrock AgentCore Runtime running on AWS. A user submits a query, which is handled by a pipeline of specialized agents—Data Gathering Agent, Stock Performance Analyzer Agent, and Stock Report Generation Agent—that are each responsible for a distinct part of the stock evaluation process.
These agents collaborate within Amazon Bedrock AgentCore Runtime, and when language understanding or generation is required, they invoke a GPT OSS model hosted on SageMaker AI. The model processes the input and returns structured outputs that inform agent actions, enabling a fully serverless, modular, and scalable agentic system using open-source models.
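As a minimal sketch of this interaction, an agent node could call the deployed endpoint through the SageMaker Runtime API as shown below. The endpoint and inference component names are placeholders for the resources created later in this post, and the OpenAI-style message payload mirrors the inference example shown in the deployment section.

import boto3
import json

smr = boto3.client("sagemaker-runtime", region_name="us-west-2")

# Placeholder names; use the endpoint and inference component created during deployment.
response = smr.invoke_endpoint(
    EndpointName="model-byoc-endpoint",
    InferenceComponentName="ic-model-byoc",
    ContentType="application/json",
    Body=json.dumps({
        "messages": [{"role": "user", "content": "Summarize SIM_STOCK's recent performance."}]
    }),
)
result = json.loads(response["Body"].read())
print(result["choices"][0]["message"]["content"])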
Prerequisites

Ensure that you have the required quota for G6e instances to deploy the model. Request a quota increase here if you do not.
If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain.
Ensure your IAM role has the required permissions to deploy SageMaker models and endpoints. For more information, see How Amazon SageMaker AI works with IAM in the SageMaker Developer Guide.

Deploy GPT-OSS models to SageMaker Inference
Customers who want to customize their models and frameworks can use self-managed (serverful) deployments, but this requires access to GPUs, serving frameworks, load balancers, and infrastructure setup. SageMaker AI provides a fully managed hosting platform that takes care of provisioning the infrastructure with the necessary drivers, downloads the models, and deploys them. OpenAI’s GPT-OSS models are launched with a 4-bit quantization scheme (MXFP4), enabling fast inference while keeping resource usage low. These models can run on P5 (H100), P6 (H200), P4 (A100), and G6e (L40) instances. The GPT-OSS models are sparse MoE architectures with 128 experts (120B) or 32 experts (20B), where each token is routed to 4 experts with no shared expert. Using MXFP4 for the MoE weights alone reduces the model sizes to 63 GB (120B) and 14 GB (20B), making them runnable on a single H100 GPU.
To deploy these models effectively, you need a powerful serving framework like vLLM. To deploy the model, we build a vLLM container with the latest version that supports GPT OSS models on SageMaker AI.
You can use the following Docker file and script to build the container and push it to a local Amazon Elastic Container Registry (Amazon ECR). The recommended approach is to do this directly from Amazon SageMaker Studio, which provides a managed JupyterLab environment with AWS CLI access where you can build and push images to ECR as part of your SageMaker workflow. Alternatively, you can also perform the same steps on an Amazon Elastic Compute Cloud (Amazon EC2) instance with Docker installed.
After you have built and pushed the container to Amazon ECR, you can open Amazon SageMaker Studio by going to the SageMaker AI console, as shown in the following screenshot.

You can then create a Jupyter space or use an existing one to launch JupyterLab and run notebooks.

Clone the following notebook and run “Option 3: Deploying from HF using BYOC.” Update the required parameters, such as the inference image in the notebook with the container image. We also provide necessary environment variables, as shown in the following code.

import json
import sagemaker

# account_id, region, and role are assumed to be defined earlier in the notebook
inference_image = f"{account_id}.dkr.ecr.{region}.amazonaws.com/vllm:v0.10.0-gpt-oss"
instance_type = "ml.g6e.4xlarge"
num_gpu = 1
model_name = sagemaker.utils.name_from_base("model-byoc")
endpoint_name = model_name
inference_component_name = f"ic-{model_name}"
config = {
    "OPTION_MODEL": "openai/gpt-oss-20b",
    "OPTION_SERVED_MODEL_NAME": "model",
    "OPTION_TENSOR_PARALLEL_SIZE": json.dumps(num_gpu),
    "OPTION_ASYNC_SCHEDULING": "true",
}

After you set up the deployment configuration, you can deploy to SageMaker AI using the following code:

from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

lmi_model = sagemaker.Model(
    image_uri=inference_image,
    env=config,
    role=role,
    name=model_name,
)

lmi_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=600,
    endpoint_name=endpoint_name,
    endpoint_type=sagemaker.enums.EndpointType.INFERENCE_COMPONENT_BASED,
    inference_component_name=inference_component_name,
    resources=ResourceRequirements(requests={"num_accelerators": num_gpu, "memory": 1024*5, "copies": 1}),
)

You can now run an inference example:

payload = {
    "messages": [
        {"role": "user", "content": "Name popular places to visit in London?"}
    ],
}
res = llm.predict(payload)
print("-----\n" + res["choices"][0]["message"]["content"] + "\n-----\n")
print(res["usage"])

-----
Here are some of the must‑see spots in London — a mix of iconic landmarks, world‑class museums, and vibrant neighborhoods:

| # | Place | Why It’s Popular |
|—|——-|——————|
| 1 | **Buckingham Palace** | The Queen’s official London residence – watch the Changing of the Guard. |
| 2 | **The Tower of London & Tower Bridge** | Historic castle, Crown Jewels, and the iconic bridge with glass floors. |
| 3 | **The British Museum** | World‑famous collection from the Rosetta Stone to Egyptian mummies (free entry). |
| 4 | **The Houses of Parliament & Big Ben** | The classic symbol of London’s politics and architecture. |
| 5 | **The National Gallery (Tate Britain)** | Home to masterpieces from Van Gogh to Turner. |
| 6 | **Buckinghamshire Gardens (Kew Gardens)** | Stunning botanical gardens with a glasshouse and the Horniman Insect Zoo. |
| 7 | **Camden Market** | Eclectic stalls, street food, music and vintage fashion. |
| 8 | **Covent Garden** | Lively piazza with street performers, boutique shops, and the Royal Opera House. |
| 9 | **West End Theatres** | Theatre district famous for grand productions (musicals, dramas). |
|10 | **The Shard** | Skyscraper with panoramic 360° views of London. |
|11 | **St. Paul’s Cathedral** | Massive dome, stunning interior and a climb up the Whispering Gallery. |
|12 | **The Tate Modern** | Contemporary art museum set in a former power station. |
|13 | **The Victoria & Albert Museum** | Design and fashion, costume, and jewelry collections. |
|14 | **Hyde Park & Kensington Gardens** | Huge green spaces with Serpentine Lake, Speaker’s Corner and Speakers’ Corner. |
|15 | **Oxford Street & Regent Street** | Prime shopping streets for fashion, flagship stores, and historic architecture. |

These spots cover history, culture, shopping, and leisure—perfect for a first visit or a weekend escape in London!
-----

Use LangGraph to build a stock analyzer agent
For our stock-analyzing multi-agent system, we use LangGraph to orchestrate the workflow. The Jupyter notebook for the code is located in this GitHub repository. The system comprises three specialized tools that work together to analyze stocks comprehensively:

The gather_stock_data tool collects comprehensive stock data for a given ticker symbol, including current price, historical performance, financial metrics, and market data. It returns formatted information covering price history, company fundamentals, trading metrics, and recent news headlines.
The analyze_stock_performance tool performs detailed technical and fundamental analysis of the stock data, calculating metrics like price trends, volatility, and overall investment scores. It evaluates multiple factors including P/E ratios, profit margins, and dividend yields to provide a comprehensive performance analysis.
The generate_stock_report tool creates professional PDF reports from the gathered stock data and analysis, automatically uploading them to Amazon S3 in organized date-based folders. A simplified sketch of one such tool follows below.
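The full tool implementations live in the repository; the following is a simplified, hypothetical sketch of what one such tool can look like using LangChain's @tool decorator, with hardcoded values standing in for a real market-data source.

from langchain_core.tools import tool

@tool
def gather_stock_data(ticker: str) -> str:
    """Collect price, fundamentals, and recent headlines for a ticker (simulated data)."""
    # In the real workflow this would query a market-data source; here we fake the values.
    data = {
        "ticker": ticker,
        "current_price": 29.31,
        "pe_ratio": 44.8,
        "profit_margin": 0.243,
        "headlines": ["Quarterly results beat expectations", "New product line announced"],
    }
    return (
        f"Stock Symbol: {data['ticker']}\n"
        f"Current Price: ${data['current_price']:.2f}\n"
        f"P/E Ratio: {data['pe_ratio']:.2f}\n"
        f"Profit Margin: {data['profit_margin']:.1%}\n"
        "Recent News: " + "; ".join(data["headlines"])
    )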

For local testing, you can use a simplified version of the system by importing the necessary functions from your local script. For example:

from langgraph_stock_local import langgraph_stock_sagemaker
# Test the agent locally
result = langgraph_stock_sagemaker({
    "prompt": "Analyze SIM_STOCK Stock for Investment purposes."
})
print(result)

This way, you can iterate quickly on your agent’s logic before deploying it to a scalable platform, making sure each component functions correctly and the overall workflow produces the expected results for different types of stocks.
Deploy to Amazon Bedrock AgentCore
After you have developed and tested your LangGraph framework locally, you can deploy it to Amazon Bedrock AgentCore Runtime. Amazon Bedrock AgentCore handles the heavy lifting of container orchestration, session management, and scaling, abstracting away the management of infrastructure. It provides persistent execution environments that can maintain an agent’s state across multiple invocations.
Before deploying our stock analyzer agent to Amazon Bedrock AgentCore Runtime, we need to create an AWS Identity and Access Management (IAM) role with the appropriate permissions. This role allows Amazon Bedrock AgentCore to invoke your SageMaker endpoint for GPT-OSS model inference, manage Amazon ECR repositories for storing container images, write Amazon CloudWatch logs for monitoring and debugging, access Amazon Bedrock AgentCore workload services for runtime operations, and send telemetry data to AWS X-Ray and CloudWatch for observability. See the following code:

from create_agentcore_role import create_bedrock_agentcore_role
role_arn = create_bedrock_agentcore_role(
    role_name="MyStockAnalyzerRole",
    sagemaker_endpoint_name="your-endpoint-name",
    region="us-west-2"
)

After creating the role, you can use the Amazon Bedrock AgentCore Starter Toolkit to deploy your agent. The toolkit simplifies the deployment process by packaging your code, creating the necessary container image, and configuring the runtime environment:

from bedrock_agentcore_starter_toolkit import Runtime
agentcore_runtime = Runtime()
# Configure the agent
response = agentcore_runtime.configure(
    entrypoint="langgraph_stock_sagemaker_gpt_oss.py",
    execution_role=role_arn,
    auto_create_ecr=True,
    requirements_file="requirements.txt",
    region="us-west-2",
    agent_name="stock_analyzer_agent"
)
# Deploy to the cloud
launch_result = agentcore_runtime.launch(local=False, local_build=False)

When you’re using BedrockAgentCoreApp, it automatically creates an HTTP server that listens on port 8080, implements the required /invocations endpoint for processing the agent’s requests, implements the /ping endpoint for health checks (which is very important for asynchronous agents), handles proper content types and response formats, and manages error handling according to AWS standards.
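For orientation, here is a minimal sketch of what such an entrypoint file can look like. It assumes the bedrock_agentcore Python SDK's BedrockAgentCoreApp class and import path; the actual langgraph_stock_sagemaker_gpt_oss.py in the repository wires in the full LangGraph workflow instead of the echo shown here.

from bedrock_agentcore.runtime import BedrockAgentCoreApp

app = BedrockAgentCoreApp()

@app.entrypoint
def invoke(payload):
    """Receive the JSON payload posted to /invocations and return the agent's answer."""
    prompt = payload.get("prompt", "")
    # The LangGraph stock-analysis workflow would run here; we echo the prompt for brevity.
    return {"result": f"Received prompt: {prompt}"}

if __name__ == "__main__":
    app.run()   # serves /invocations and /ping on port 8080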
After you deploy to Amazon Bedrock AgentCore Runtime, you will be able to see the status show as Ready on the Amazon Bedrock AgentCore console.

Invoke the agent
After you create the agent, you must set up the agent invocation entry point. With Amazon Bedrock AgentCore Runtime, we decorate the invocation part of our agent with the @app.entrypoint decorator and use it as the entry point for our runtime. After you deploy the agent to Amazon Bedrock AgentCore Runtime, you can invoke it using the AWS SDK:

import boto3
import json
agentcore_client = boto3.client('bedrock-agentcore', region_name='us-west-2')
response = agentcore_client.invoke_agent_runtime(
    agentRuntimeArn=launch_result.agent_arn,
    qualifier="DEFAULT",
    payload=json.dumps({
        "prompt": "Analyze SIM_STOCK for investment purposes"
    })
)

After invoking the stock analyzer agent through Amazon Bedrock AgentCore Runtime, you must parse and format the response for clear presentation. The response processing involves the following steps:

Decode the byte stream from Amazon Bedrock AgentCore into readable text.
Parse the JSON response containing the complete stock analysis.
Extract three main sections using regex pattern matching:

Stock Data Gathering Section: Extracts core stock information including symbol, company details, current pricing, market metrics, financial ratios, trading data, and recent news headlines.
Performance Analysis section: Analyzes technical indicators, fundamental metrics, and volatility measures to generate comprehensive stock analysis.
Stock Report Generation Section: Generates a detailed PDF report with all the Stock Technical Analysis.

The system also includes error handling that gracefully handles JSON parsing errors, falls back to plain text display if structured parsing fails, and provides debugging information for troubleshooting parsing issues of the stock analysis response.
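The actual parse_bedrock_agentcore_stock_response helper lives in the repository; the following is an illustrative reconstruction under stated assumptions, namely a file-like body under the response key and the section headers shown in the sample output below.

import json
import re

def parse_bedrock_agentcore_stock_response(invoke_response):
    """Illustrative parser: decode the byte stream, parse JSON, and split report sections."""
    # 1. Decode the streaming body into text (assumes a file-like object under "response").
    raw = invoke_response["response"].read().decode("utf-8")

    # 2. Parse the JSON envelope; fall back to plain text if structured parsing fails.
    try:
        body = json.loads(raw)
        text = body.get("result", raw) if isinstance(body, dict) else str(body)
    except json.JSONDecodeError:
        return {"raw_text": raw}

    # 3. Extract the three report sections with simple regex pattern matching.
    headers = [
        "STOCK DATA GATHERING REPORT",
        "STOCK PERFORMANCE ANALYSIS",
        "STOCK REPORT GENERATION",
    ]
    sections = {}
    for i, header in enumerate(headers):
        stop = "|".join(re.escape(h) for h in headers[i + 1:]) or r"\Z"
        match = re.search(rf"{re.escape(header)}.*?(?={stop})", text, re.DOTALL)
        sections[header] = match.group(0).strip() if match else ""
    return sections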

stock_analysis = parse_bedrock_agentcore_stock_response(invoke_response)

This formatted output makes it straightforward to review the agent’s decision-making process and present professional stock analysis results to stakeholders, completing the end-to-end workflow from model deployment to meaningful business output:

STOCK DATA GATHERING REPORT:
================================
Stock Symbol: SIM_STOCK
Company Name: Simulated Stock Inc.
Sector: SIM_SECTOR
Industry: SIM INDUSTRY
CURRENT MARKET DATA:
– Current Price: $29.31
– Market Cap: $3,958
– 52-Week High: $29.18
– 52-Week Low: $16.80
– YTD Return: 1.30%
– Volatility (Annualized): 32.22%
FINANCIAL METRICS:
– P/E Ratio: 44.80
– Forward P/E: 47.59
– Price-to-Book: 11.75
– Dividend Yield: 0.46%
– Revenue (TTM): $4,988
– Profit Margin: 24.30%

STOCK PERFORMANCE ANALYSIS:
===============================
Stock: SIM_STOCK | Current Price: $29.31
TECHNICAL ANALYSIS:
– Price Trend: SLIGHT UPTREND
– YTD Performance: 1.03%
– Technical Score: 3/5
FUNDAMENTAL ANALYSIS:
– P/E Ratio: 34.80
– Profit Margin: 24.30%
– Dividend Yield: 0.46%
– Beta: 1.165
– Fundamental Score: 3/5
STOCK REPORT GENERATION:
===============================
Stock: SIM_STOCK
Sector: SIM_INDUSTRY
Current Price: $29.78
REPORT SUMMARY:
– Technical Analysis: 8.33% YTD performance
– Report Type: Comprehensive stock analysis for informational purposes
– Generated: 2025-09-04 23:11:55
PDF report uploaded to S3: s3://amzn-s3-demo-bucket/2025/09/04/SIM_STOCK_Stock_Report_20250904_231155.pdf
REPORT CONTENTS:
• Executive Summary with key metrics
• Detailed market data and financial metrics
• Technical and fundamental analysis
• Professional formatting for documentation

Clean up
You can delete the SageMaker endpoint to avoid accruing costs after your testing by running the following cells in the same notebook:

sess.delete_inference_component(inference_component_name)
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
sess.delete_model(model_name)

You can also delete Amazon Bedrock AgentCore resources using the following commands:

runtime_delete_response = agentcore_control_client.delete_agent_runtime(
    agentRuntimeId=launch_result.agent_id
)
response = ecr_client.delete_repository(
    repositoryName=launch_result.ecr_uri.split('/')[1],
    force=True
)

Conclusion
In this post, we built an end-to-end solution for deploying OpenAI’s open-weight models on a single G6e (L40S) GPU, creating a multi-agent stock analysis system with LangGraph and deploying it seamlessly with Amazon Bedrock AgentCore. This implementation demonstrates how organizations can now use powerful open source LLMs cost-effectively with efficient serving frameworks such as vLLM. Beyond the technical implementation, enhancing this workflow can provide significant business value, such as reduced stock analysis processing time and increased analyst productivity from automating routine stock assessments. Furthermore, by freeing analysts from repetitive tasks, organizations can redirect skilled professionals toward complex cases and relationship-building activities that drive business growth.
We invite you to try out our code samples and iterate your agentic workflows to meet your use cases.

About the authors
Vivek Gangasani is a Worldwide Lead GenAI Specialist Solutions Architect for SageMaker Inference. He drives Go-to-Market (GTM) and Outbound Product strategy for SageMaker Inference. He also helps enterprises and startups deploy, manage, and scale their GenAI models with SageMaker and GPUs. Currently, he is focused on developing strategies and solutions for optimizing inference performance and GPU efficiency for hosting Large Language Models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Surya Kari is a Senior Generative AI Data Scientist at AWS, specializing in developing solutions leveraging state-of-the-art foundation models. He has extensive experience working with advanced language models including DeepSeek-R1, the Llama family, and Qwen, focusing on their fine-tuning and optimization for specific scientific applications. His expertise extends to implementing efficient training pipelines and deployment strategies using AWS SageMaker, enabling the scaling of foundation models from development to production. He collaborates with customers to design and implement generative AI solutions, helping them navigate model selection, fine-tuning approaches, and deployment strategies to achieve optimal performance for their specific use cases.

A Coding Guide to Implement Zarr for Large-Scale Data: Chunking, Compr …

In this tutorial, we take a deep dive into the capabilities of Zarr, a library designed for efficient storage & manipulation of large, multidimensional arrays. We begin by exploring the basics, creating arrays, setting chunking strategies, and modifying values directly on disk. From there, we expand into more advanced operations such as experimenting with chunk sizes for different access patterns, applying multiple compression codecs to optimize both speed and storage efficiency, and comparing their performance on synthetic datasets. We also build hierarchical structures enriched with metadata, simulate realistic workflows with time-series and volumetric data, and demonstrate advanced indexing to extract meaningful subsets. Check out the FULL CODES here.

!pip install zarr numcodecs -q
import zarr
import numpy as np
import matplotlib.pyplot as plt
from numcodecs import Blosc, Delta, FixedScaleOffset
import tempfile
import shutil
import os
from pathlib import Path

print(f"Zarr version: {zarr.__version__}")
print(f"NumPy version: {np.__version__}")

print("=== BASIC ZARR OPERATIONS ===")

We begin our tutorial by installing Zarr and Numcodecs, along with essential libraries like NumPy and Matplotlib. We then set up the environment and verify the versions, preparing ourselves to dive into basic Zarr operations. Check out the FULL CODES here.

tutorial_dir = Path(tempfile.mkdtemp(prefix="zarr_tutorial_"))
print(f"Working directory: {tutorial_dir}")

z1 = zarr.zeros((1000, 1000), chunks=(100, 100), dtype='f4',
                store=str(tutorial_dir / 'basic_array.zarr'), zarr_format=2)
z2 = zarr.ones((500, 500, 10), chunks=(100, 100, 5), dtype='i4',
               store=str(tutorial_dir / 'multi_dim.zarr'), zarr_format=2)

print(f"2D Array shape: {z1.shape}, chunks: {z1.chunks}, dtype: {z1.dtype}")
print(f"3D Array shape: {z2.shape}, chunks: {z2.chunks}, dtype: {z2.dtype}")

z1[100:200, 100:200] = np.random.random((100, 100)).astype('f4')
z2[:, :, 0] = np.arange(500*500).reshape(500, 500)

print(f"Memory usage estimate: {z1.nbytes_stored() / 1024**2:.2f} MB")

We create our working directory and initialize Zarr arrays: a 2D array of zeros and a 3D array of ones. We then fill them with random and sequential values, while also checking their shapes, chunk sizes, and memory usage in real time. Check out the FULL CODES here.

print("\n=== ADVANCED CHUNKING ===")

time_steps, height, width = 365, 1000, 2000
time_series = zarr.zeros(
    (time_steps, height, width),
    chunks=(30, 250, 500),
    dtype='f4',
    store=str(tutorial_dir / 'time_series.zarr'),
    zarr_format=2
)

for t in range(0, time_steps, 30):
    end_t = min(t + 30, time_steps)
    seasonal = np.sin(2 * np.pi * np.arange(t, end_t) / 365)[:, None, None]
    spatial = np.random.normal(20, 5, (end_t - t, height, width))
    time_series[t:end_t] = (spatial + 10 * seasonal).astype('f4')

print(f"Time series created: {time_series.shape}")
print("Approximate chunks created")

import time
start = time.time()
temporal_slice = time_series[:, 500, 1000]
temporal_time = time.time() - start

start = time.time()
spatial_slice = time_series[100, :200, :200]
spatial_time = time.time() - start

print(f"Temporal access time: {temporal_time:.4f}s")
print(f"Spatial access time: {spatial_time:.4f}s")

In this step, we simulate a year-long time-series dataset with optimized chunking for both temporal and spatial access. We add seasonal patterns and spatial noise, then measure access speeds, allowing us to see firsthand how chunking impacts performance in real-world data exploration. Check out the FULL CODES here.

print("\n=== COMPRESSION AND CODECS ===")

data = np.random.randint(0, 1000, (1000, 1000), dtype='i4')

from zarr.codecs import BloscCodec, BytesCodec

z_none = zarr.array(data, chunks=(100, 100),
                    codecs=[BytesCodec()],
                    store=str(tutorial_dir / 'no_compress.zarr'))

z_lz4 = zarr.array(data, chunks=(100, 100),
                   codecs=[BytesCodec(), BloscCodec(cname='lz4', clevel=5)],
                   store=str(tutorial_dir / 'lz4_compress.zarr'))

z_zstd = zarr.array(data, chunks=(100, 100),
                    codecs=[BytesCodec(), BloscCodec(cname='zstd', clevel=9)],
                    store=str(tutorial_dir / 'zstd_compress.zarr'))

sequential_data = np.cumsum(np.random.randint(-5, 6, (1000, 1000)), axis=1)
z_delta = zarr.array(sequential_data, chunks=(100, 100),
                     codecs=[BytesCodec(), BloscCodec(cname='zstd', clevel=5)],
                     store=str(tutorial_dir / 'sequential_compress.zarr'))

sizes = {
    'No compression': z_none.nbytes_stored(),
    'LZ4': z_lz4.nbytes_stored(),
    'ZSTD': z_zstd.nbytes_stored(),
    'Sequential+ZSTD': z_delta.nbytes_stored()
}

print("Compression comparison:")
original_size = data.nbytes
for name, size in sizes.items():
    ratio = size / original_size
    print(f"{name}: {size/1024**2:.2f} MB (ratio: {ratio:.3f})")

print("\n=== HIERARCHICAL DATA ORGANIZATION ===")

root = zarr.open_group(str(tutorial_dir / 'experiment.zarr'), mode='w')

raw_data = root.create_group('raw_data')
processed = root.create_group('processed')
metadata = root.create_group('metadata')

raw_data.create_dataset('images', shape=(100, 512, 512), chunks=(10, 128, 128), dtype='u2')
raw_data.create_dataset('timestamps', shape=(100,), dtype='datetime64[ns]')

processed.create_dataset('normalized', shape=(100, 512, 512), chunks=(10, 128, 128), dtype='f4')
processed.create_dataset('features', shape=(100, 50), chunks=(20, 50), dtype='f4')

root.attrs['experiment_id'] = 'EXP_2024_001'
root.attrs['description'] = 'Advanced Zarr tutorial demonstration'
root.attrs['created'] = str(np.datetime64('2024-01-01'))

raw_data.attrs['instrument'] = 'Synthetic Camera'
raw_data.attrs['resolution'] = [512, 512]
processed.attrs['normalization'] = 'z-score'

timestamps = np.datetime64('2024-01-01') + np.arange(100) * np.timedelta64(1, 'h')
raw_data['timestamps'][:] = timestamps

for i in range(100):
    frame = np.random.poisson(100 + 50 * np.sin(2 * np.pi * i / 100), (512, 512)).astype('u2')
    raw_data['images'][i] = frame

print(f"Created hierarchical structure with {len(list(root.group_keys()))} groups")
print("Data arrays and groups created successfully")

print("\n=== ADVANCED INDEXING ===")

volume_data = zarr.zeros((50, 20, 256, 256), chunks=(5, 5, 64, 64), dtype='f4',
                         store=str(tutorial_dir / 'volume.zarr'), zarr_format=2)

for t in range(50):
    for z in range(20):
        y, x = np.ogrid[:256, :256]
        center_y, center_x = 128 + 20*np.sin(t*0.1), 128 + 20*np.cos(t*0.1)
        focus_quality = 1 - abs(z - 10) / 10

        signal = focus_quality * np.exp(-((y-center_y)**2 + (x-center_x)**2) / (50**2))
        noise = 0.1 * np.random.random((256, 256))
        volume_data[t, z] = (signal + noise).astype('f4')

print("Various slicing operations:")

max_projection = np.max(volume_data[:, 10], axis=0)
print(f"Max projection shape: {max_projection.shape}")

z_stack = volume_data[25, :, 100:156, 100:156]
print(f"Z-stack subset: {z_stack.shape}")

bright_pixels = volume_data[volume_data > 0.5]
print(f"Pixels above threshold: {len(bright_pixels)}")

We benchmark compression by writing the same data with no compression, LZ4, and ZSTD, then compare on-disk sizes to see practical savings. Next, we organize an experiment as a Zarr group hierarchy with rich attributes, images, and timestamps. Finally, we generate a synthetic 4D volume and perform advanced indexing (max projections, sub-stacks, and thresholding) to validate fast, slice-wise access. Check out the FULL CODES here.

print("\n=== PERFORMANCE OPTIMIZATION ===")

def process_chunk_serial(data, func):
    results = []
    for i in range(0, len(data), 100):
        chunk = data[i:i+100]
        results.append(func(chunk))
    return np.concatenate(results)

def gaussian_filter_1d(x, sigma=1.0):
    kernel_size = int(4 * sigma)
    if kernel_size % 2 == 0:
        kernel_size += 1
    kernel = np.exp(-0.5 * ((np.arange(kernel_size) - kernel_size//2) / sigma)**2)
    kernel = kernel / kernel.sum()
    return np.convolve(x.astype(float), kernel, mode='same')

large_array = zarr.random.random((10000,), chunks=(1000,),
                                 store=str(tutorial_dir / 'large.zarr'), zarr_format=2)

start_time = time.time()
chunk_size = 1000
filtered_data = []
for i in range(0, len(large_array), chunk_size):
    end_idx = min(i + chunk_size, len(large_array))
    chunk_data = large_array[i:end_idx]
    smoothed = np.convolve(chunk_data, np.ones(5)/5, mode='same')
    filtered_data.append(smoothed)

result = np.concatenate(filtered_data)
processing_time = time.time() - start_time

print(f"Chunk-aware processing time: {processing_time:.4f}s")
print(f"Processed {len(large_array):,} elements")

print("\n=== VISUALIZATION ===")

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Advanced Zarr Tutorial - Data Visualization', fontsize=16)

axes[0,0].plot(temporal_slice)
axes[0,0].set_title('Temporal Evolution (Single Pixel)')
axes[0,0].set_xlabel('Day of Year')
axes[0,0].set_ylabel('Temperature')

im1 = axes[0,1].imshow(spatial_slice, cmap='viridis')
axes[0,1].set_title('Spatial Pattern (Day 100)')
plt.colorbar(im1, ax=axes[0,1])

methods = list(sizes.keys())
ratios = [sizes[m]/original_size for m in methods]
axes[0,2].bar(range(len(methods)), ratios)
axes[0,2].set_xticks(range(len(methods)))
axes[0,2].set_xticklabels(methods, rotation=45)
axes[0,2].set_title('Compression Ratios')
axes[0,2].set_ylabel('Size Ratio')

axes[1,0].imshow(max_projection, cmap='hot')
axes[1,0].set_title('Max Intensity Projection')

z_profile = np.mean(volume_data[25, :, 120:136, 120:136], axis=(1,2))
axes[1,1].plot(z_profile, 'o-')
axes[1,1].set_title('Z-Profile (Center Region)')
axes[1,1].set_xlabel('Z-slice')
axes[1,1].set_ylabel('Mean Intensity')

axes[1,2].plot(result[:1000])
axes[1,2].set_title('Processed Signal (First 1000 points)')
axes[1,2].set_xlabel('Sample')
axes[1,2].set_ylabel('Amplitude')

plt.tight_layout()
plt.show()

We optimize performance by processing data in chunk-sized batches, applying simple smoothing filters without loading everything into memory. We then visualize temporal trends, spatial patterns, compression effects, and volume profiles, allowing us to see at a glance how our choices in chunking and compression shape the results. Check out the FULL CODES here.

print("\n=== TUTORIAL SUMMARY ===")
print("Zarr features demonstrated:")
print("✓ Multi-dimensional array creation and manipulation")
print("✓ Optimal chunking strategies for different access patterns")
print("✓ Advanced compression with multiple codecs")
print("✓ Hierarchical data organization with metadata")
print("✓ Advanced indexing and data views")
print("✓ Performance optimization techniques")
print("✓ Integration with visualization tools")

def show_tree(path, prefix="", max_depth=3, current_depth=0):
    if current_depth > max_depth:
        return
    items = sorted(path.iterdir())
    for i, item in enumerate(items):
        is_last = i == len(items) - 1
        current_prefix = "└── " if is_last else "├── "
        print(f"{prefix}{current_prefix}{item.name}")
        if item.is_dir() and current_depth < max_depth:
            next_prefix = prefix + ("    " if is_last else "│   ")
            show_tree(item, next_prefix, max_depth, current_depth + 1)

print(f"\nFiles created in {tutorial_dir}:")
show_tree(tutorial_dir)

print(f"\nTotal disk usage: {sum(f.stat().st_size for f in tutorial_dir.rglob('*') if f.is_file()) / 1024**2:.2f} MB")

print("\nAdvanced Zarr tutorial completed successfully!")

We wrap up the tutorial by highlighting everything we explored: array creation, chunking, compression, hierarchical organization, indexing, performance tuning, and visualization. We also review the files generated during the session and confirm total disk usage, giving us a complete picture of how Zarr handles large-scale data efficiently from start to finish.

In conclusion, we move beyond the fundamentals and gain a comprehensive view of how Zarr fits into modern data workflows. We see how it handles storage optimization through compression, organizes complex experiments through hierarchical groups, and enables smooth access to slices of large datasets with minimal overhead. Performance enhancements, such as chunk-aware processing and integration with visualization tools, bring additional depth, demonstrating how theory is directly translated into practice.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post A Coding Guide to Implement Zarr for Large-Scale Data: Chunking, Compression, Indexing, and Visualization Techniques appeared first on MarkTechPost.

Google AI Ships TimesFM-2.5: Smaller, Longer-Context Foundation Model …

Google Research has released TimesFM-2.5, a 200M-parameter, decoder-only time-series foundation model with a 16K context length and native probabilistic forecasting support. The new checkpoint is live on Hugging Face. On GIFT-Eval, TimesFM-2.5 now tops the leaderboard across accuracy metrics (MASE, CRPS) among zero-shot foundation models.

What is Time-Series Forecasting?

Time-series forecasting is the practice of analyzing sequential data points collected over time to identify patterns and predict future values. It underpins critical applications across industries, including forecasting product demand in retail, monitoring weather and precipitation trends, and optimizing large-scale systems such as supply chains and energy grids. By capturing temporal dependencies and seasonal variations, time-series forecasting enables data-driven decision-making in dynamic environments.

What changed in TimesFM-2.5 vs v2.0?

Parameters: 200M (down from 500M in 2.0).

Max context: 16,384 points (up from 2,048).

Quantiles: Optional 30M-param quantile head for continuous quantile forecasts up to 1K horizon.

Inputs: No “frequency” indicator required; new inference flags (flip-invariance, positivity inference, quantile-crossing fix).

Roadmap: Upcoming Flax implementation for faster inference; covariates support slated to return; docs being expanded.

Why does a longer context matter?

16K historical points allow a single forward pass to capture multi-seasonal structure, regime breaks, and low-frequency components without tiling or hierarchical stitching. In practice, that reduces pre-processing heuristics and improves stability for domains where context >> horizon (e.g., energy load, retail demand). The longer context is a core design change explicitly noted for 2.5.

What’s the research context?

TimesFM’s core thesis—a single, decoder-only foundation model for forecasting—was introduced in the ICML 2024 paper and Google’s research blog. GIFT-Eval (Salesforce) emerged to standardize evaluation across domains, frequencies, horizon lengths, and univariate/multivariate regimes, with a public leaderboard hosted on Hugging Face.
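For orientation, the sketch below shows how the two leaderboard metrics are commonly computed; it is a minimal NumPy illustration (the array names and toy data are assumptions, not TimesFM or GIFT-Eval code). MASE scales point-forecast error by the in-sample error of a naive seasonal baseline, and CRPS can be approximated as twice the average pinball loss over a grid of forecast quantiles.

import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    """Mean Absolute Scaled Error: forecast MAE divided by the in-sample MAE
    of a naive seasonal forecast."""
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

def crps_from_quantiles(y_true, q_preds, q_levels):
    """Approximate CRPS as twice the average pinball (quantile) loss.
    q_preds has shape (num_quantiles, horizon); q_levels is e.g. 0.1 ... 0.9."""
    losses = []
    for q, pred in zip(q_levels, q_preds):
        diff = y_true - pred
        losses.append(np.mean(np.maximum(q * diff, (q - 1) * diff)))
    return 2 * np.mean(losses)

# Toy usage with synthetic data
rng = np.random.default_rng(0)
history = rng.normal(size=512)          # "training" context
actuals = rng.normal(size=64)           # future values to score against
point_forecast = np.zeros(64)           # stand-in point forecast
levels = np.arange(0.1, 1.0, 0.1)
quantile_forecast = np.quantile(history, levels)[:, None] * np.ones((9, 64))

print("MASE:", mase(actuals, point_forecast, history))
print("CRPS (approx.):", crps_from_quantiles(actuals, quantile_forecast, levels))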

Key Takeaways

Smaller, Faster Model: TimesFM-2.5 runs with 200M parameters (down from 500M in 2.0) while improving accuracy.

Longer Context: Supports 16K input length, enabling forecasts with deeper historical coverage.

Benchmark Leader: Now ranks #1 among zero-shot foundation models on GIFT-Eval for both MASE (point accuracy) and CRPS (probabilistic accuracy).

Production-Ready: Efficient design and quantile forecasting support make it suitable for real-world deployments across industries.

Broad Availability: The model is live on Hugging Face.

Summary

TimesFM-2.5 shows that foundation models for forecasting are moving past proof-of-concept into practical, production-ready tools. By cutting parameters in half while extending context length and leading GIFT-Eval across both point and probabilistic accuracy, it marks a step-change in efficiency and capability. With Hugging Face access already live and BigQuery/Model Garden integration on the way, the model is positioned to accelerate adoption of zero-shot time-series forecasting in real-world pipelines.

Check out the Model card (HF), Repo, Benchmark and Paper. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Google AI Ships TimesFM-2.5: Smaller, Longer-Context Foundation Model That Now Leads GIFT-Eval (Zero-Shot Forecasting) appeared first on MarkTechPost.

Stanford Researchers Introduced MedAgentBench: A Real-World Benchmark …

A team of Stanford University researchers have released MedAgentBench, a new benchmark suite designed to evaluate large language model (LLM) agents in healthcare contexts. Unlike prior question-answering datasets, MedAgentBench provides a virtual electronic health record (EHR) environment where AI systems must interact, plan, and execute multi-step clinical tasks. This marks a significant shift from testing static reasoning to assessing agentic capabilities in live, tool-based medical workflows.

https://ai.nejm.org/doi/full/10.1056/AIdbp2500144

Why Do We Need Agentic Benchmarks in Healthcare?

Recent LLMs have moved beyond static chat-based interactions toward agentic behavior—interpreting high-level instructions, calling APIs, integrating patient data, and automating complex processes. In medicine, this evolution could help address staff shortages, documentation burden, and administrative inefficiencies.

While general-purpose agent benchmarks (e.g., AgentBench, AgentBoard, tau-bench) exist, healthcare lacked a standardized benchmark that captures the complexity of medical data, FHIR interoperability, and longitudinal patient records. MedAgentBench fills this gap by offering a reproducible, clinically relevant evaluation framework.

What Does MedAgentBench Contain?

How Are the Tasks Structured?

MedAgentBench consists of 300 tasks across 10 categories, written by licensed physicians. These tasks include patient information retrieval, lab result tracking, documentation, test ordering, referrals, and medication management. Tasks average 2–3 steps and mirror workflows encountered in inpatient and outpatient care.

What Patient Data Supports the Benchmark?

The benchmark leverages 100 realistic patient profiles extracted from Stanford’s STARR data repository, comprising over 700,000 records including labs, vitals, diagnoses, procedures, and medication orders. Data was de-identified and jittered for privacy while preserving clinical validity.

How Is the Environment Built?

The environment is FHIR-compliant, supporting both retrieval (GET) and modification (POST) of EHR data. AI systems can simulate realistic clinical interactions such as documenting vitals or placing medication orders. This design makes the benchmark directly translatable to live EHR systems.
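To make the GET/POST interaction pattern concrete, here is a small illustrative sketch that follows standard FHIR REST conventions; the base URL, patient reference, and task framing are hypothetical and are not taken from the MedAgentBench orchestrator or task set.

import requests

FHIR_BASE = "http://localhost:8080/fhir"  # hypothetical FHIR server endpoint

# GET: retrieve the most recent glucose observation for a patient (LOINC 2345-7)
resp = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": "Patient/123", "code": "2345-7", "_sort": "-date", "_count": 1},
)
latest = resp.json()["entry"][0]["resource"]
print("Latest glucose:", latest["valueQuantity"]["value"], latest["valueQuantity"]["unit"])

# POST: record a new vital-sign observation (a documentation-style task)
new_obs = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4", "display": "Heart rate"}]},
    "subject": {"reference": "Patient/123"},
    "valueQuantity": {"value": 72, "unit": "beats/minute"},
}
resp = requests.post(f"{FHIR_BASE}/Observation", json=new_obs)
print("Create status:", resp.status_code)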

How Are Models Evaluated?

Metric: Task success rate (SR), measured with strict pass@1 to reflect real-world safety requirements.

Models Tested: 12 leading LLMs including GPT-4o, Claude 3.5 Sonnet, Gemini 2.0, DeepSeek-V3, Qwen2.5, and Llama 3.3.

Agent Orchestrator: A baseline orchestration setup with nine FHIR functions, limited to eight interaction rounds per task.

Which Models Performed Best?

Claude 3.5 Sonnet v2: Best overall with 69.67% success, especially strong in retrieval tasks (85.33%).

GPT-4o: 64.0% success, showing balanced retrieval and action performance.

DeepSeek-V3: 62.67% success, leading among open-weight models.

Observation: Most models excelled at query tasks but struggled with action-based tasks requiring safe multi-step execution.

https://ai.nejm.org/doi/full/10.1056/AIdbp2500144

What Errors Did Models Make?

Two dominant failure patterns emerged:

Instruction adherence failures — invalid API calls or incorrect JSON formatting.

Output mismatch — providing full sentences when structured numerical values were required.

These errors highlight gaps in precision and reliability, both critical in clinical deployment.

Summary

MedAgentBench establishes the first large-scale benchmark for evaluating LLM agents in realistic EHR settings, pairing 300 clinician-authored tasks with a FHIR-compliant environment and 100 patient profiles. Results show strong potential but limited reliability—Claude 3.5 Sonnet v2 leads at 69.67%—highlighting the gap between query success and safe action execution. While constrained by single-institution data and EHR-focused scope, MedAgentBench provides an open, reproducible framework to drive the next generation of dependable healthcare AI agents.

Check out the PAPER and Technical Blog. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Stanford Researchers Introduced MedAgentBench: A Real-World Benchmark for Healthcare AI Agents appeared first on MarkTechPost.

Streamline access to ISO-rating content changes with Verisk rating ins …

This post is co-written with Samit Verma, Eusha Rizvi, Manmeet Singh, Troy Smith, and Corey Finley from Verisk.
Verisk Rating Insights, a feature of ISO Electronic Rating Content (ERC), is a powerful tool designed to provide summaries of ISO Rating changes between two releases. Traditionally, extracting specific filing information or identifying differences across multiple releases required manual downloads of full packages, which was time-consuming and prone to inefficiencies. This challenge, coupled with the need for accurate and timely customer support, prompted Verisk to explore innovative ways to enhance user accessibility and automate repetitive processes. Using generative AI and Amazon Web Services (AWS) services, Verisk has made significant strides in creating a conversational user interface for users to easily retrieve specific information, identify content differences, and improve overall operational efficiency.
In this post, we dive into how Verisk Rating Insights, powered by Amazon Bedrock, large language models (LLM), and Retrieval Augmented Generation (RAG), is transforming the way customers interact with and access ISO ERC changes.
The challenge
Rating Insights provides valuable content, but there were significant challenges with user accessibility and the time it took to extract actionable insights:

Manual downloading – Customers had to download entire packages to get even a small piece of relevant information. This was inefficient, especially when only a part of the filing needed to be reviewed.
Inefficient data retrieval – Users couldn’t quickly identify the differences between two content packages without downloading and manually comparing them, which could take hours and sometimes days of analysis.
Time-consuming customer support – Verisk’s ERC Customer Support team spent 15% of their time weekly addressing queries from customers who were impacted by these inefficiencies. Furthermore, onboarding new customers required half a day of repetitive training to ensure they understood how to access and interpret the data.
Manual analysis time – Customers often spent 3–4 hours per test case analyzing the differences between filings. With multiple test cases to address, this led to significant delays in critical decision-making.

Solution overview
To solve these challenges, Verisk embarked on a journey to enhance Rating Insights with generative AI technologies. By integrating Anthropic’s Claude, available in Amazon Bedrock, and Amazon OpenSearch Service, Verisk created a sophisticated conversational platform where users can effortlessly access and analyze rating content changes.
The following diagram illustrates the high-level architecture of the solution, with distinct sections showing the data ingestion process and inference loop. The architecture uses multiple AWS services to add generative AI capabilities to the Ratings Insight system. This system’s components work together seamlessly, coordinating multiple LLM calls to generate user responses.

The following diagram shows the architectural components and the high-level steps involved in the Data Ingestion process.

The steps in the data ingestion process proceed as follows:

This process is triggered when a new file is dropped. It is responsible for chunking the document using a custom chunking strategy. This strategy recursively checks each section and keeps them intact without overlap. The process then embeds the chunks and stores them in OpenSearch Service as vector embeddings (see the sketch after these steps).
The embedding model used in Amazon Bedrock is amazon.titan-embed-g1-text-02.
Amazon OpenSearch Serverless is utilized as a vector embedding store with metadata filtering capability.
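The sketch below is a minimal illustration of this ingestion path, assuming section boundaries have already been extracted from the document; the collection endpoint, index name, and chunking helper are simplified assumptions rather than Verisk's actual implementation.

import json
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
auth = AWSV4SignerAuth(boto3.Session().get_credentials(), "us-east-1", "aoss")
aoss = OpenSearch(
    hosts=[{"host": "my-collection.us-east-1.aoss.amazonaws.com", "port": 443}],  # placeholder endpoint
    http_auth=auth, use_ssl=True, connection_class=RequestsHttpConnection,
)

def chunk_by_section(sections, max_chars=4000):
    """Keep each section intact (no overlap); split only if a section exceeds the limit."""
    chunks = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            chunks.extend(section[i:i + max_chars] for i in range(0, len(section), max_chars))
    return chunks

def embed(text):
    """Embed a chunk with the Titan embedding model used by the pipeline."""
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-g1-text-02",
                                body=json.dumps({"inputText": text}))
    return json.loads(resp["body"].read())["embedding"]

for i, chunk in enumerate(chunk_by_section(["Section 1 text ...", "Section 2 text ..."])):
    aoss.index(index="rating-insights", body={"chunk_id": i, "text": chunk, "vector": embed(chunk)})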

The following diagram shows the architectural components and the high-level steps involved in the inference loop to generate user responses.

The steps in the inference loop proceed as follows:

This component is responsible for multiple tasks: it supplements user questions with recent chat history, embeds the questions, retrieves relevant chunks from the vector database, and finally calls the generation model to synthesize a response.
Amazon ElastiCache is used for storing recent chat history.
The embedding model utilized in Amazon Bedrock is amazon.titan-embed-g1-text-02.
OpenSearch Serverless is implemented for RAG (Retrieval-Augmented Generation).
For generating responses to user queries, the system uses Anthropic’s Claude Sonnet 3.5 (model ID: anthropic.claude-3-5-sonnet-20240620-v1:0), which is available through Amazon Bedrock.

Key technologies and frameworks used
We used Anthropic’s Claude Sonnet 3.5 (model ID: anthropic.claude-3-5-sonnet-20240620-v1:0) to understand user input and provide detailed, contextually relevant responses. Anthropic’s Claude Sonnet 3.5 enhances the platform’s ability to interpret user queries and deliver accurate insights from complex content changes. LlamaIndex, which is an open source framework, served as the chain framework for efficiently connecting and managing different data sources to enable dynamic retrieval of content and insights.
We implemented RAG, which allows the model to pull specific, relevant data from the OpenSearch Serverless vector database. This means the system generates precise, up-to-date responses based on a user’s query without needing to sift through massive content downloads. The vector database enables intelligent search and retrieval, organizing content changes in a way that makes them quickly and easily accessible. This eliminates the need for manual searching or downloading of entire content packages. Verisk applied guardrails in Amazon Bedrock Guardrails along with custom guardrails around the generative model so the output adheres to specific compliance and quality standards, safeguarding the integrity of responses.
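To make the retrieval-and-generation loop concrete, the following condensed single-turn sketch reuses the bedrock client, aoss client, and embed helper from the ingestion sketch above; the prompt wording, index name, and field names are illustrative, and the production system additionally routes through LlamaIndex, guardrails, and the evaluation APIs described later.

def answer(question, chat_history, k=5):
    # Supplement the question with recent chat history, then embed the combined query
    contextual_query = "\n".join(chat_history[-4:] + [question])
    query_vector = embed(contextual_query)

    # Retrieve the most relevant chunks from the OpenSearch Serverless vector index
    hits = aoss.search(index="rating-insights", body={
        "size": k,
        "query": {"knn": {"vector": {"vector": query_vector, "k": k}}},
    })["hits"]["hits"]
    context = "\n\n".join(h["_source"]["text"] for h in hits)

    # Generate a grounded response with Anthropic's Claude 3.5 Sonnet via the Converse API
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        system=[{"text": "Answer only from the provided ISO ERC context."}],
        messages=[{"role": "user", "content": [{"text": f"Context:\n{context}\n\nQuestion: {question}"}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]
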
Verisk’s generative AI solution is built on Amazon Bedrock, a comprehensive, secure, and flexible service for building generative AI applications and agents. Amazon Bedrock connects you to leading FMs, services to deploy and operate agents, and tools for fine-tuning, safeguarding, and optimizing models along with knowledge bases to connect applications to your latest data so that you have everything you need to quickly move from experimentation to real-world deployment.
Given the novelty of generative AI, Verisk has established a governance council to oversee its solutions, ensuring they meet security, compliance, and data usage standards. Verisk implemented strict controls within the RAG pipeline to ensure data is only accessible to authorized users. This helps maintain the integrity and privacy of sensitive information. Legal reviews ensure IP protection and contract compliance.
How it works
The integration of these advanced technologies enables a seamless, user-friendly experience. Here’s how Verisk Rating Insights now works for customers:

Conversational user interface – Users can interact with the platform by using a conversational interface. Instead of manually reviewing content packages, users enter a natural language query (for example, “What are the changes in coverage scope between the two recent filings?”). The system uses Anthropic’s Claude Sonnet 3.5 to understand the intent and provides an instant summary of the relevant changes.
Dynamic content retrieval – Thanks to RAG and OpenSearch Service, the platform doesn’t require downloading entire files. Instead, it dynamically retrieves and presents the specific changes a user is seeking, enabling quicker analysis and decision-making.
Automated difference analysis – The system can automatically compare two content packages, highlighting the differences without requiring manual intervention. Users can query for precise comparisons (for example, “Show me the differences in rating criteria between Release 1 and Release 2”).
Customized insights – The guardrails in place mean that responses are accurate, compliant, and actionable. Additionally, if needed, the system can help users understand the impact of changes and assist them in navigating the complexities of filings, providing clear, concise insights.

The following diagram shows the architectural components and the high-level steps involved in the evaluation loop to generate relevant and grounded responses.

The steps in the evaluation loop proceed as follows:

This component is responsible for calling Anthropic’s Claude Sonnet 3.5 model and subsequently invoking the custom-built evaluation APIs to ensure response accuracy.
The generation model employed is Anthropic’s Claude Sonnet 3.5, which handles the creation of responses.
The Evaluation API ensures that responses remain relevant to user queries and stay grounded within the provided context.

The following diagram shows the process of capturing the chat history as contextual memory and storage for analysis.

Quality benchmarks
The Verisk Rating Insights team has implemented a comprehensive evaluation framework and a feedback loop mechanism, shown in the preceding figures, to support continuous improvement and address issues as they arise.
Ensuring high accuracy and consistency in responses is essential for Verisk’s generative AI solutions. However, LLMs can sometimes produce hallucinations or provide irrelevant details, affecting reliability. To address this, Verisk implemented:

Evaluation framework – Integrated into the query pipeline, it validates responses for precision and relevance before delivery.
Extensive testing – Product subject matter experts (SMEs) and quality experts rigorously tested the solution to ensure accuracy and reliability. Verisk collaborated with in-house insurance domain experts to develop SME evaluation metrics for accuracy and consistency. Multiple rounds of SME evaluations were conducted, where experts graded these metrics on a 1–10 scale. Latency was also tracked to assess speed. Feedback from each round was incorporated into subsequent tests to drive improvements.
Continual model improvement – Customer feedback serves as a crucial component in driving the continuous evolution and refinement of the generative models, improving both accuracy and relevance. By seamlessly integrating user interactions and feedback with chat history, a robust data pipeline is created that streams the user interactions to an Amazon Simple Storage Service (Amazon S3) bucket, which acts as a data hub. The interactions then go into Snowflake, which is a cloud-based data platform and data warehouse as a service that offers capabilities such as data warehousing, data lakes, data sharing, and data exchange. Through this integration, we built comprehensive analytics dashboards that provide valuable insights into user experience patterns and pain points.

Although the initial results were promising, they didn’t meet the desired accuracy and consistency levels. The development process involved several iterative improvements, such as redesigning the system and making multiple calls to the LLM. The primary metric for success was a manual grading system where business experts compared the results and provided continuous feedback to improve overall benchmarks.
Business impact and opportunity
By integrating generative AI into Verisk Rating Insights, the business has seen a remarkable transformation. Customers enjoyed significant time savings. By eliminating the need to download entire packages and manually search for differences, the time spent on analysis has been drastically reduced. Customers no longer spend 3–4 hours per test case. What at one time took days now takes minutes.
This time savings brought increased productivity. With an automated solution that instantly provides relevant insights, customers can focus more on decision-making rather than spending time on manual data retrieval. And by automating difference analysis and providing a centralized, effortless platform, customers can be more confident in the accuracy of their results and avoid missing critical changes.
For Verisk, the benefit was a reduced customer support burden because the ERC customer support team now spends less time addressing queries. With the AI-powered conversational interface, users can self-serve and get answers in real time, freeing up support resources for more complex inquiries.
The automation of repetitive training tasks meant quicker and more efficient customer onboarding. This reduces the need for lengthy training sessions, and new customers become proficient faster. The integration of generative AI has reduced redundant workflows and the need for manual intervention. This streamlines operations across multiple departments, leading to a more agile and responsive business.
Conclusion
Looking ahead, Verisk plans to continue enhancing the Rating Insights platform in two ways. First, we’ll expand the scope of queries, enabling more sophisticated queries related to different filing types and more nuanced coverage areas. Second, we’ll scale the platform. With Amazon Bedrock providing the infrastructure, Verisk aims to scale this solution further to support more users and additional content sets across various product lines.
Verisk Rating Insights, now powered by generative AI and AWS technologies, has transformed the way customers interact with and access rating content changes. Through a conversational user interface, RAG, and vector databases, Verisk intends to eliminate inefficiencies and save customers valuable time and resources while enhancing overall accessibility. For Verisk, this solution has improved operational efficiency and provided a strong foundation for continued innovation.
With Amazon Bedrock and a focus on automation, Verisk is driving the future of intelligent customer support and content management, empowering both their customers and their internal teams to make smarter, faster decisions.
For more information, refer to the following resources:

Explore generative AI on AWS
Learn about unlocking the business value of generative AI
Learn more about Anthropic’s Claude 3 models on Amazon Bedrock
Learn about Amazon Bedrock and how to build and scale generative AI applications with FMs
Explore generative AI quick start proofs of concept

About the authors
Samit Verma serves as the Director of Software Engineering at Verisk, overseeing the Rating and Coverage development teams. In this role, he plays a key part in architectural design and provides strategic direction to multiple development teams, enhancing efficiency and ensuring long-term solution maintainability. He holds a master’s degree in information technology.
Eusha Rizvi serves as a Software Development Manager at Verisk, leading several technology teams within the Ratings Products division. Possessing strong expertise in system design, architecture, and engineering, Eusha offers essential guidance that advances the development of innovative solutions. He holds a bachelor’s degree in information systems from Stony Brook University.
Manmeet Singh is a Software Engineering Lead at Verisk and AWS Certified Generative AI Specialist. He leads the development of an agentic RAG-based generative AI system on Amazon Bedrock, with expertise in LLM orchestration, prompt engineering, vector databases, microservices, and high-availability architecture. Manmeet is passionate about applying advanced AI and cloud technologies to deliver resilient, scalable, and business-critical systems.
Troy Smith is a Vice President of Rating Solutions at Verisk. Troy is a seasoned insurance technology leader with more than 25 years of experience in rating, pricing, and product strategy. At Verisk, he leads the team behind ISO Electronic Rating Content, a widely used resource across the insurance industry. Troy has held leadership roles at Earnix and Capgemini and was the cofounder and original creator of the Oracle Insbridge Rating Engine.
Corey Finley is a Product Manager at Verisk. Corey has over 22 years of experience across personal and commercial lines of insurance. He has worked in both implementation and product support roles and has led efforts for major carriers including Allianz, CNA, Citizens, and others. At Verisk, he serves as Product Manager for VRI, RaaS, and ERC.
Arun Pradeep Selvaraj is a Senior Solutions Architect at Amazon Web Services (AWS). Arun is passionate about working with his customers and stakeholders on digital transformations and innovation in the cloud while continuing to learn, build, and reinvent. He is creative, energetic, deeply customer-obsessed, and uses the working backward process to build modern architectures to help customers solve their unique challenges. Connect with him on LinkedIn.
Ryan Doty is a Solutions Architect Manager at Amazon Web Services (AWS), based out of New York. He helps financial services customers accelerate their adoption of the AWS Cloud by providing architectural guidelines to design innovative and scalable solutions. Coming from a software development and sales engineering background, the possibilities that the cloud can bring to the world excite him.

Unified multimodal access layer for Quora’s Poe using Amazon Bedrock

Organizations gain competitive advantage by deploying and integrating new generative AI models quickly through Generative AI Gateway architectures. This unified interface approach simplifies access to multiple foundation models (FMs), addressing a critical challenge: the proliferation of specialized AI models, each with unique capabilities, API specifications, and operational requirements. Rather than building and maintaining separate integration points for each model, the smart move is to build an abstraction layer that normalizes these differences behind a single, consistent API.
The AWS Generative AI Innovation Center and Quora recently collaborated on an innovative solution to address this challenge. Together, they developed a unified wrapper API framework that streamlines the deployment of Amazon Bedrock FMs on Quora’s Poe system. This architecture delivers a “build once, deploy multiple models” capability that significantly reduces deployment time and engineering effort, with real protocol bridging code visible throughout the codebase.
For technology leaders and developers working on AI multi-model deployment at scale, this framework demonstrates how thoughtful abstraction and protocol translation can accelerate innovation cycles while maintaining operational control.
In this post, we explore how the AWS Generative AI Innovation Center and Quora collaborated to build a unified wrapper API framework that dramatically accelerates the deployment of Amazon Bedrock FMs on Quora’s Poe system. We detail the technical architecture that bridges Poe’s event-driven ServerSentEvents protocol with Amazon Bedrock REST-based APIs, demonstrate how a template-based configuration system reduced deployment time from days to 15 minutes, and share implementation patterns for protocol translation, error handling, and multi-modal capabilities. We show how this “build once, deploy multiple models” approach helped Poe integrate over 30 Amazon Bedrock models across text, image, and video modalities while reducing code changes by up to 95%.
Quora and Amazon Bedrock
Poe.com is an AI system developed by Quora that users and developers can use to interact with a wide range of advanced AI models and assistants powered by multiple providers. The system offers multi-model access, enabling side-by-side conversations with various AI chatbots for tasks such as natural language understanding, content generation, image creation, and more.
The screenshot below showcases the user interface of Poe, the AI platform created by Quora. The image displays Poe’s extensive library of AI models, which are presented as individual “chatbots” that users can interact with.

The following screenshot provides a view of the Model Catalog within Amazon Bedrock, a fully managed service from Amazon Web Services (AWS) that offers access to a diverse range of foundation models (FMs). This catalog acts as a central hub for developers to discover, evaluate, and access state-of-the-art AI from various providers.

Initially, integrating the diverse FMs available through Amazon Bedrock presented significant technical challenges for the Poe.com team. The process required substantial engineering resources to establish connections with each model while maintaining consistent performance and reliability standards. Maintainability emerged as an extremely important consideration, as was the ability to efficiently onboard new models as they became available—both factors adding further complexity to the integration challenges.
Technical challenge: Bridging different systems
The integration between Poe and Amazon Bedrock presented fundamental architectural challenges that required innovative solutions. These systems were built with different design philosophies and communication patterns, creating a significant technical divide that the wrapper API needed to bridge.
Architectural divide
The core challenge stems from the fundamentally different architectural approaches of the two systems. Understanding these differences is essential to appreciating the complexity of the integration solution. Poe operates on a modern, reactive, ServerSentEvents-based architecture through the FastAPI library (fastapi_poe). This architecture is stream-optimized for real-time interactions and uses an event-driven response model designed for continuous, conversational AI. Amazon Bedrock, on the other hand, functions as an enterprise cloud service. It offers REST-based access patterns through the AWS SDK, SigV4 authentication requirements, AWS Region-specific model availability, and a traditional request-response pattern with streaming options. This fundamental API mismatch creates several technical challenges that the Poe wrapper API solves, as detailed in the following table.

Challenge Category | Technical Issue | Source Protocol | Target Protocol | Integration Complexity
Protocol Translation | Converting between WebSocket-based protocol and REST APIs | WebSocket (bidirectional, persistent) | REST (request/response, stateless) | High: Requires protocol bridging
Authentication Bridging | Connecting JWT validation with AWS SigV4 signing | JWT token validation | AWS SigV4 authentication | Medium: Credential transformation needed
Response Format Transformation | Adapting JSON responses into expected format | Standard JSON structure | Custom format requirements | Medium: Data structure mapping
Streaming Reconciliation | Mapping chunked responses to ServerSentEvents | Chunked HTTP responses | ServerSentEvents stream | High: Real-time data flow conversion
Parameter Standardization | Creating unified parameter space across models | Model-specific parameters | Standardized parameter interface | Medium: Parameter normalization

API evolution and the Converse API
In May 2024, Amazon Bedrock introduced the Converse API, which offered standardization benefits that significantly simplified the integration architecture:

Unified interface across diverse model providers (such as Anthropic, Meta, and Mistral)
Conversation memory with consistent handling of chat history
Streaming and non-streaming modes through a single API pattern
Multimodal support for text, images, and structured data
Parameter normalization that reduces model-specific implementation quirks
Built-in content moderation capabilities

The solution presented in this post uses the Converse API where appropriate, while also maintaining compatibility with model-specific APIs for specialized capabilities. This hybrid approach provides flexibility while taking advantage of the Converse API’s standardization benefits.
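For reference, a minimal boto3 call against the Converse API looks like the following; the model ID and prompt are placeholders, but the request and response shapes are the ones the wrapper normalizes across models.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="amazon.nova-micro-v1:0",  # any Converse-compatible Bedrock model ID
    system=[{"text": "You are a concise assistant."}],
    messages=[{"role": "user", "content": [{"text": "Summarize what a wrapper API does."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.7, "topP": 0.9},
)

print(response["output"]["message"]["content"][0]["text"])
print("Tokens used:", response["usage"])
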
Solution overview
The wrapper API framework provides a unified interface between Poe and Amazon Bedrock models. It serves as a translation layer that normalizes the differences between models and protocols while maintaining the unique capabilities of each model.
The solution architecture follows a modular design that separates concerns and enables flexible scaling, as illustrated in the following diagram.

The wrapper API consists of several key components working together to provide a seamless integration experience:

Client – The entry point where users interact with AI capabilities through various interfaces.
Poe layer – Consists of the following:

Poe UI – Handles user experience, request formation, parameters controls, file uploads, and response visualization.
Poe FastAPI – Standardizes user interactions and manages the communication protocol between clients and underlying systems.

Bot Factory – Dynamically creates appropriate model handlers (bots) based on the requested model type (chat, image, or video). This factory pattern provides extensibility for new model types and variations. See the following code:

# From core/bot_factory.py - Actual implementation
class BotFactory:
    """
    Factory for creating different types of bots.
    Handles bot creation based on the bot type and configuration.
    """
    @staticmethod
    def create_bot(bot_config: BotConfig) -> PoeBot:
        # Check if a custom bot class is specified
        if hasattr(bot_config, 'bot_class') and bot_config.bot_class:
            # Use the custom bot class directly
            bot = bot_config.bot_class(bot_config)

            # Explicitly ensure we're returning a PoeBot
            if not isinstance(bot, PoeBot):
                raise TypeError(f"Custom bot class must return a PoeBot instance, got {type(bot)}")
            return bot

        # Determine bot type based on configuration
        if hasattr(bot_config, 'enable_video_generation') and bot_config.enable_video_generation:
            # Video generation bot
            if 'luma' in bot_config.bot_name:
                from core.refactored_luma_bot import LumaVideoBot
                return LumaVideoBot(bot_config)
            else:
                from core.refactored_nova_reel_bot import NovaReelVideoBot
                return NovaReelVideoBot(bot_config)

        elif hasattr(bot_config, 'enable_image_generation') and bot_config.enable_image_generation:
            # Image generation bot
            if hasattr(bot_config, 'model_id') and "stability" in bot_config.model_id.lower():
                # Stability AI image generation bot
                from core.refactored_image_stability_ai import AmazonBedrockImageStabilityAIBot
                return AmazonBedrockImageStabilityAIBot(bot_config)
            else:
                # Other image generation bot (Titan, Canvas, etc.)
                from core.refactored_image_bot_amazon import RefactoredAmazonImageGenerationBot
                return RefactoredAmazonImageGenerationBot(bot_config)

        else:
            # Check if this is a Claude 3.7 model
            if hasattr(bot_config, 'model_id') and "claude-3-7" in bot_config.model_id.lower():
                return ClaudePlusBot(bot_config)
            else:
                # Default to standard chat bot
                return RefactoredAmazonBedrockPoeBot(bot_config)

Service manager: Orchestrates the services needed to process requests effectively. It coordinates between different specialized services, including:

Token services – Managing token limits and counting.
Streaming services – Handling real-time responses.
Error services – Normalizing and handling errors.
AWS service integration – Managing API calls to Amazon Bedrock.

AWS services component – Converts responses from Amazon Bedrock format to Poe’s expected format and vice-versa, handling streaming chunks, image data, and video outputs.
Amazon Bedrock layer – Amazon’s FM service that provides the actual AI processing capabilities and model hosting, including:

Model diversity – Provides access to over 30 text models (such as Amazon Titan, Amazon Nova, Anthropic’s Claude, Meta’s Llama, Mistral, and more), image models, and video models.
API structure – Exposes both model-specific APIs and the unified Converse API.
Authentication – Requires AWS SigV4 signing for secure access to model endpoints.
Response management – Returns model outputs with standardized metadata and usage statistics.

The request processing flow in this unified wrapper API shows the orchestration required when bridging Poe’s event-driven ServerSentEvents protocol with Amazon Bedrock REST-based APIs, showcasing how multiple specialized services work together to deliver a seamless user experience.
The flow begins when a client sends a request through Poe’s interface, which then forwards it to the Bot Factory component. This factory pattern dynamically creates the appropriate model handler based on the requested model type, whether for chat, image, or video generation. The service manager component then orchestrates the various specialized services needed to process the request effectively, including token services, streaming services, and error handling services.
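A condensed sketch of that bridging step might look like the following; it pairs the Bedrock ConverseStream API with fastapi_poe's PartialResponse type, and the model ID and handler shape are assumptions rather than the actual Poe wrapper code.

import boto3
from fastapi_poe import PartialResponse  # Poe's streaming partial-response type

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

async def stream_chat(messages, model_id="amazon.nova-micro-v1:0"):
    """Bridge Bedrock streaming chunks into Poe-style ServerSentEvents partials."""
    response = bedrock.converse_stream(
        modelId=model_id,
        messages=messages,  # assumed to be already translated into Converse message format
        inferenceConfig={"maxTokens": 1024},
    )
    for event in response["stream"]:
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            # Each text delta becomes one partial response event for the Poe client
            yield PartialResponse(text=delta["text"])
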
The following sequence diagram illustrates the complete request processing flow.

Configuration template for rapid multi-bot deployment
The most powerful aspect of the wrapper API is its unified configuration template system, which supports rapid deployment and management of multiple bots with minimal code changes. This approach is central to the solution’s success in reducing deployment time.
The system uses a template-based configuration approach with shared defaults and model-specific overrides:

# Bot configurations using the template pattern

CHAT_BOTS = {
    'poe-nova-micro': BotConfig(
        # Identity
        bot_name='poe-nova-micro',
        model_id='amazon.nova-micro-v1:0',
        aws_region=aws_config['region'],
        poe_access_key='XXXXXXXXXXXXXXXXXXXXXX',

        # Model-specific parameters
        supports_system_messages=True,
        enable_image_comprehension=True,
        expand_text_attachments=True,
        streaming=True,
        max_tokens=1300,
        temperature=0.7,
        top_p=0.9,

        # Model-specific pricing
        enable_monetization=True,
        pricing_type="variable",
        input_token_cost_milli_cents=2,
        output_token_cost_milli_cents=4,
        image_analysis_cost_milli_cents=25,

        # Generate rate card with model-specific values
        custom_rate_card=create_rate_card(2, 4, 25),

        # Include common parameters
        **DEFAULT_CHAT_CONFIG
    ),

    'poe-mistral-pixtral': BotConfig(
        # Identity
        bot_name='poe-mistral-pixtral',
        model_id='us.mistral.pixtral-large-2502-v1:0',
        aws_region=aws_config['region'],
        poe_access_key='XXXXXXXXXXXXXXXXXXXXXX',

        # Model-specific parameters
        supports_system_messages=False,
        enable_image_comprehension=False,
        # ...
        # Include common parameters
        **DEFAULT_CHAT_CONFIG
    )
}

This configuration-driven architecture offers several significant advantages:

Rapid deployment – Adding new models requires only creating a new configuration entry rather than writing integration code. This is a key factor in the significant improvement in deployment time.
Consistent parameter management – Common parameters are defined one time in DEFAULT_CHAT_CONFIG and inherited by bots, maintaining consistency and reducing duplication.
Model-specific customization – Each model can have its own unique settings while still benefiting from the shared infrastructure.
Operational flexibility – Parameters can be adjusted without code changes, allowing for quick experimentation and optimization.
Centralized credential management – AWS credentials are managed in one place, improving security and simplifying updates.
Region-specific deployment – Models can be deployed to different Regions as needed, with Region settings controlled at the configuration level.

The BotConfig class provides a structured way to define bot configurations with type validation:

# From config/bot_config.py - Actual implementation (partial)
# Imports implied by the snippet
from typing import Optional
from pydantic import BaseModel, Field

class BotConfig(BaseModel):
    # Core Bot Identity
    bot_name: str = Field(..., description="Name of the bot")
    model_id: str = Field(..., description="Identifier for the AI model")

    # AWS Configuration
    aws_region: Optional[str] = Field(default="us-east-1", description="AWS region for deployment")
    aws_access_key: Optional[str] = Field(default=None, description="AWS access key")
    aws_secret_key: Optional[str] = Field(default=None, description="AWS secret key")
    aws_security_token: Optional[str] = None

    # Poe Configuration
    poe_access_key: str = Field(..., description="Poe access key")
    modal_app_name: str = Field(..., description="Modal app name")

    # Capability Flags
    allow_attachments: bool = Field(default=True, description="Whether to allow file attachments in Poe")
    supports_system_messages: bool = Field(default=False)
    enable_image_comprehension: bool = Field(default=False)
    expand_text_attachments: bool = Field(default=False)
    streaming: bool = Field(default=False)
    enable_image_generation: bool = Field(default=False)
    enable_video_generation: bool = Field(default=False)

    # Inference Configuration
    max_tokens: Optional[int] = Field(default=None, description="Maximum number of tokens to generate")
    temperature: Optional[float] = Field(default=None, description="Temperature for sampling")
    top_p: Optional[float] = Field(default=None, description="Top-p sampling parameter")
    optimize_latency: bool = Field(default=False, description="Enable latency optimization with performanceConfig")

    # Reasoning Configuration (Claude 3.7+)
    enable_reasoning: bool = Field(default=False, description="Enable Claude's reasoning capability")
    reasoning_budget: Optional[int] = Field(default=1024, description="Token budget for reasoning (1024-4000 recommended)")

    # Monetization Configuration
    enable_monetization: bool = Field(default=False, description="Enable variable pricing monetization")
    custom_rate_card: Optional[str] = Field(
        default=None,
        description="Custom rate card for variable pricing in markdown format"
    )
    input_token_cost_milli_cents: Optional[int] = Field(
        default=None,
        description="Cost per input token in thousandths of a cent"
    )
    output_token_cost_milli_cents: Optional[int] = Field(
        default=None,
        description="Cost per output token in thousandths of a cent"
    )
    image_analysis_cost_milli_cents: Optional[int] = Field(
        default=None,
        description="Cost per image analysis in thousandths of a cent"
    )

Advanced multimodal capabilities
One of the most powerful aspects of the framework is how it handles multimodal capabilities through simple configuration flags:

enable_image_comprehension – When set to True for text-only models like Amazon Nova Micro, Poe itself uses vision capabilities to analyze images and convert them into text descriptions that are sent to the Amazon Bedrock model. This enables even text-only models to classify images without having built-in vision capabilities.
expand_text_attachments – When set to True, Poe parses uploaded text files and includes their content in the conversation, enabling models to work with document content without requiring special file handling capabilities.
supports_system_messages – This parameter controls whether the model can accept system prompts, allowing for consistent behavior across models with different capabilities.

These configuration flags create a powerful abstraction layer that offers the following benefits:

Extends model capabilities – Text-only models gain pseudo-multimodal capabilities through Poe’s preprocessing
Optimizes built-in features – True multimodal models can use their built-in capabilities for optimal results
Simplifies integration – It’s controlled through simple configuration switches rather than code changes
Maintains consistency – It provides a uniform user experience regardless of the underlying model’s native capabilities

Next, we explore the technical implementation of the solution in more detail.
Protocol translation layer
The most technically challenging aspect of the solution was bridging between Poe’s API protocols and the diverse model interfaces available through Amazon Bedrock. The team accomplished this through a sophisticated protocol translation layer:

# From services/streaming_service.py - Actual implementation
def _extract_content_from_event(self, event: Dict[str, Any]) -> Optional[str]:
    """Extract content from a streaming event based on model provider."""
    try:
        # Handle Anthropic Claude models
        if "message" in event:
            message = event.get("message", {})
            if "content" in message and isinstance(message["content"], list):
                for content_item in message["content"]:
                    if content_item.get("type") == "text":
                        return content_item.get("text", "")
            elif "content" in message:
                return str(message.get("content", ""))

        # Handle Amazon Titan models
        if "delta" in event:
            delta = event.get("delta", {})
            if "text" in delta:
                return delta.get("text", "")

        # Handle other model formats
        if "chunk" in event:
            chunk_data = event.get("chunk", {})
            if "bytes" in chunk_data:
                # Process binary data if present
                try:
                    text = chunk_data["bytes"].decode("utf-8")
                    return json.loads(text).get("completion", "")
                except Exception:
                    self.logger.warning("Failed to decode bytes in chunk")

        # No matching format found
        return None
    except Exception as e:
        # Closing handler for the outer try, implied by the snippet
        self.logger.error(f"Error extracting content from event: {e}")
        return None

This translation layer handles subtle differences between models and makes sure that regardless of which Amazon Bedrock model is being used, the response to Poe is consistent and follows Poe’s expected format.
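
As a quick sanity check (illustrative only; it assumes streaming_service is an instance of the class that defines this method), differently shaped streaming events normalize to the same plain-text form:

# Illustrative only - sample events in Claude-style and delta-style shapes
claude_event = {"message": {"content": [{"type": "text", "text": "Hello from Claude"}]}}
titan_event = {"delta": {"text": "Hello from Titan"}}

assert streaming_service._extract_content_from_event(claude_event) == "Hello from Claude"
assert streaming_service._extract_content_from_event(titan_event) == "Hello from Titan"
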
Error handling and normalization
A critical aspect of the implementation is comprehensive error handling and normalization. The ErrorService provides consistent error handling across different models:

# Simplified example of error handling (not actual code)
from botocore.exceptions import ClientError


class ErrorService:
    def normalize_bedrock_error(self, error: Exception) -> str:
        """Normalize Amazon Bedrock errors into a consistent format."""
        if isinstance(error, ClientError):
            if "ThrottlingException" in str(error):
                return "The model is currently experiencing high demand. Please try again in a moment."
            elif "ValidationException" in str(error):
                return "There was an issue with the request parameters. Please try again with different settings."
            elif "AccessDeniedException" in str(error):
                return "Access to this model is restricted. Please check your permissions."
            else:
                return f"An error occurred while communicating with the model: {str(error)}"
        elif isinstance(error, ConnectionError):
            return "Connection error. Please check your network and try again."
        else:
            return f"An unexpected error occurred: {str(error)}"

This approach makes sure users receive meaningful error messages regardless of the underlying model or error condition.
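
The following short usage sketch exercises the simplified class above (illustrative only, not production code):

# Illustrative only - normalizing two common failure modes
from botocore.exceptions import ClientError

error_service = ErrorService()

throttle = ClientError(
    error_response={"Error": {"Code": "ThrottlingException", "Message": "Rate exceeded"}},
    operation_name="ConverseStream",
)
print(error_service.normalize_bedrock_error(throttle))
# -> "The model is currently experiencing high demand. Please try again in a moment."

print(error_service.normalize_bedrock_error(ConnectionError()))
# -> "Connection error. Please check your network and try again."
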
Token counting and optimization
The system implements detailed token counting and overhead estimation to make effective use of models:

# From services/streaming_service.py - Actual implementation (partial)
# Calculate approximate JSON overhead
user_message_tokens = 0
for msg in conversation['messages']:
    for content_block in msg.get('content', []):
        if 'text' in content_block:
            # Simple word-based estimation of actual text content
            user_message_tokens += len(content_block['text'].split())

# Estimate JSON structure overhead (difference between total and content)
json_overhead = int((input_tokens - system_tokens) - user_message_tokens)

# Ensure we're working with integers for calculations
input_tokens_for_pct = int(input_tokens)
system_tokens_for_pct = int(system_tokens)
json_overhead_for_pct = int(json_overhead)

# Calculate percentage with float arithmetic and proper integer division
json_overhead_percent = (float(json_overhead_for_pct) / max(1, input_tokens_for_pct - system_tokens_for_pct)) * 100

This detailed token tracking enables accurate cost estimation and optimization, facilitating efficient use of model resources.
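For example, with 1,200 total input tokens, a 200-token system prompt, and roughly 850 words of user message content, the estimated JSON overhead is (1,200 - 200) - 850 = 150 tokens, or about 15 percent of the non-system input (150 / 1,000); because the content estimate is word-based, this figure is an approximation rather than an exact tokenizer count.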
AWS authentication and security
The AwsClientService handles authentication and security for Amazon Bedrock API calls. It provides secure authentication with AWS services, along with proper error handling and connection management.
Comparative analysis
The implementation of the wrapper API dramatically improved the efficiency and capabilities of deploying Amazon Bedrock models on Poe, as detailed in the following table.

Feature | Before (Direct API) | After (Wrapper API)
Deployment Time | Days per model | Minutes per model
Developer Focus | Configuration and plumbing | Innovation and features
Model Diversity | Limited by integration capacity | Extensive (across Amazon Bedrock models)
Maintenance Overhead | High (separate code for each model) | Low (configuration-based)
Error Handling | Custom per model | Standardized across models
Cost Tracking | Complex (multiple integrations) | Simplified (centralized)
Multimodal Support | Fragmented | Unified
Security | Varied implementations | Consistent best practices

This comparison highlights the significant improvements achieved through the wrapper API approach, demonstrating the value of investing in a robust abstraction layer.
Performance metrics and business impact
The wrapper API framework delivered significant and measurable business impact across multiple dimensions, including increased model diversity, deployment efficiency, and developer productivity.
Poe was able to rapidly expand its model offerings, integrating tens of Amazon Bedrock models across text, image, and video modalities. This expansion took weeks rather than the months it would have taken with the previous approach.
The following table summarizes the deployment efficiency metrics.

Metric | Before | After | Improvement
New Model Deployment | 2–3 days | 15 minutes | 96x faster
Code Changes Required | 500+ lines | 20–30 lines | 95% reduction
Testing Time | 8–12 hours | 30–60 minutes | 87% reduction
Deployment Steps | 10–15 steps | 3–5 steps | 75% reduction

These metrics were measured through direct comparison of engineering hours required before and after implementation, tracking actual deployments of new models.
The engineering team saw a dramatic shift in focus from integration work to feature development, as detailed in the following table.

Activity | Before (% of time) | After (% of time) | Change
API Integration | 65% | 15% | -50%
Feature Development | 20% | 60% | +40%
Testing | 10% | 15% | +5%
Documentation | 5% | 10% | +5%

Scaling and performance considerations
The wrapper API is designed to handle high-volume production workloads with robust scaling capabilities.
Connection pooling
To handle multiple concurrent requests efficiently, the wrapper implements connection pooling using aiobotocore. This allows it to maintain a pool of connections to Amazon Bedrock, reducing the overhead of establishing new connections for each request:

# From services/aws_service.py - Connection management
async def setup_client(self) -> None:
    """Initialize AWS client with proper configuration."""
    async with self._client_lock:
        try:
            # Always clean up existing clients first to avoid stale connections
            if self.bedrock_client:
                await self.cleanup()

            # Increase timeout for image generation
            config = Config(
                read_timeout=300,  # 5 minutes timeout
                retries={'max_attempts': 3, 'mode': 'adaptive'},
                connect_timeout=30  # 30 second connection timeout
            )

            # Create the Amazon Bedrock client with proper error handling
            self.bedrock_client = await self.session.create_client(
                service_name="bedrock-runtime",
                region_name=self.bot_config.aws_region,
                aws_access_key_id=self.bot_config.aws_access_key,
                aws_secret_access_key=self.bot_config.aws_secret_key,
                aws_session_token=self.bot_config.aws_security_token,
                config=config
            ).__aenter__()
        except Exception as e:
            self.bedrock_client = None
            raise

Asynchronous processing
The entire framework uses asynchronous processing to handle concurrent requests efficiently:

# From core/refactored_chat_bot.py - Asynchronous request handling
async def get_response(self, query: QueryRequest) -> AsyncIterable[PartialResponse]:
    try:
        # Ensure AWS client is set up
        await aws_service.setup_client()

        # Validate and format the conversation
        conversation = await conversation_service.validate_conversation(query)

        # Process the request with streaming
        if self.bot_config.streaming:
            async for chunk in streaming_service.stream_bedrock_response(conversation, request_id):
                yield chunk
        else:
            # Non-streaming mode
            response_text, input_tokens, output_tokens = await streaming_service.non_stream_bedrock_response(conversation, request_id)
            if response_text:
                yield PartialResponse(text=response_text)
            else:
                yield PartialResponse(text=self.bot_config.fallback_response)
            # Send done event for non-streaming mode
            yield self.done_event()

    except Exception as e:
        # Error handling
        error_message = error_service.log_error(e, request_id, "Error during request processing")
        yield PartialResponse(text=error_message)
        yield self.done_event()

Error recovery and retry logic
The system implements error recovery and retry logic with exponential backoff to handle transient issues:

# From services/streaming_service.py - Retry logic
max_retries = 3
base_delay = 1  # Start with 1 second delay

for attempt in range(max_retries):
    try:
        if not self.aws_service.bedrock_client:
            yield PartialResponse(text="Error: Amazon Bedrock client is not initialized")
            break

        response = await self.aws_service.bedrock_client.converse_stream(**stream_config)
        # Process response...
        break  # Success, exit retry loop

    except ClientError as e:
        if "ThrottlingException" in str(e):
            if attempt < max_retries - 1:
                delay = base_delay * (2 ** attempt)  # Exponential backoff
                await asyncio.sleep(delay)
                continue
        error_message = f"Amazon Bedrock API Error: {str(e)}"
        yield PartialResponse(text=f"Error: {error_message}")
        break

Performance metrics
The system collects detailed performance metrics to help identify bottlenecks and optimize performance:

# From services/streaming_service.py - Performance metrics
# Log token usage and latency
latency = time.perf_counter() - start_time

self.logger.info(
    f"[{request_id}] Streaming Response Metrics:\n"
    f"  Time to First Token: {first_token_time:.4f} seconds\n"
    f"  Input Tokens: {input_tokens} (includes system prompt)\n"
    f"  Input Tokens for Billing: {input_tokens - system_tokens} (excludes system prompt)\n"
    f"  Output Tokens: {output_tokens}\n"
    f"  Total Tokens: {total_tokens}\n"
    f"  Amazon Bedrock Latency: {latency:.4f} seconds\n"
    f"  Latency Optimization: {'enabled' if hasattr(self.bot_config, 'optimize_latency') and self.bot_config.optimize_latency else 'disabled'}"
)

Security considerations
Security is a critical aspect of the wrapper implementation, with several key features to support secure operation.
JWT validation with AWS SigV4 signing
The system integrates JWT validation for Poe’s authentication with AWS SigV4 signing for Amazon Bedrock API calls:

JWT validation – Makes sure only authorized Poe requests can access the wrapper API
SigV4 signing – Makes sure the wrapper API can securely authenticate with Amazon Bedrock (a signing sketch follows this list)
Credential management – AWS credentials are securely managed and not exposed to clients
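
Although aiobotocore signs Amazon Bedrock requests automatically, the following minimal sketch shows what SigV4 signing involves under the hood; the region and model ID are placeholders, and this is not the wrapper's production code:

# Illustrative only - manually SigV4-signing a Bedrock Runtime Converse request
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

region = "us-east-1"  # placeholder
model_id = "anthropic.claude-3-haiku-20240307-v1:0"  # placeholder

credentials = boto3.Session().get_credentials()
request = AWSRequest(
    method="POST",
    url=f"https://bedrock-runtime.{region}.amazonaws.com/model/{model_id}/converse",
    data=b'{"messages": []}',
    headers={"Content-Type": "application/json"},
)
# "bedrock" is the SigV4 signing name used by the Bedrock Runtime endpoint
SigV4Auth(credentials, "bedrock", region).add_auth(request)
# request.headers now carries the Authorization and X-Amz-Date headers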

Secrets management
The system integrates with AWS Secrets Manager to securely store and retrieve sensitive credentials:

# From services/aws_service.py - Secrets management
@staticmethod
def get_secret(secret_name: str, region_name: str = "us-east-1") -> Dict[str, Any]:
    """
    Retrieve a secret from AWS Secrets Manager.

    Args:
        secret_name: Name of the secret to retrieve
        region_name: AWS region where the secret is stored

    Returns:
        Dict[str, Any]: The secret value as a dictionary
    """
    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except Exception as e:
        logging.error(f"Error retrieving secret {secret_name}: {str(e)}")
        raise

    # Depending on whether the secret is a string or binary, one of these fields will be populated.
    if 'SecretString' in get_secret_value_response:
        import json
        try:
            # Explicitly annotate the return type for mypy
            result: Dict[str, Any] = json.loads(get_secret_value_response['SecretString'])
            return result
        except json.JSONDecodeError:
            # If not a JSON, return as a single-key dictionary
            return {"SecretString": get_secret_value_response['SecretString']}
    else:
        import base64
        decoded_binary_secret = base64.b64decode(get_secret_value_response['SecretBinary'])
        return {"SecretBinary": decoded_binary_secret}

Secure connection management
The system implements secure connection management to help prevent credential leakage and facilitate proper cleanup:

# From services/aws_service.py - Secure connection cleanup
async def cleanup(self) -> None:
    """Clean up AWS client resources."""
    try:
        if self.bedrock_client:
            try:
                await self.bedrock_client.__aexit__(None, None, None)
            except Exception as e:
                self.logger.error(f"Error closing Amazon Bedrock client: {str(e)}")
            finally:
                self.bedrock_client = None

        self.logger.info("Successfully cleaned up AWS client resources")
    except Exception as e:
        # Even if cleanup fails, reset the references to avoid stale connections
        self.bedrock_client = None

Troubleshooting and debugging
The wrapper API includes comprehensive logging and debugging capabilities to help identify and resolve issues. The system implements detailed logging throughout the request processing flow. Each request is assigned a unique ID that is used throughout the processing flow to enable tracing:

# From core/refactored_chat_bot.py - Request tracing
request_id = str(id(query))
start_time = time.perf_counter()

# Used in all log messages
self.logger.info(f"[{request_id}] Incoming request received")
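
One design note: str(id(query)) is only unique while the query object is alive, because Python can reuse object IDs. A UUID-based identifier, shown below as an alternative sketch rather than the team's implementation, avoids any chance of collision in long-running logs:

# Alternative sketch - a globally unique request ID
import uuid

request_id = uuid.uuid4().hex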

Lessons learned and best practices
Through this collaboration, several important technical insights emerged that might benefit others undertaking similar projects:

Configuration-driven architecture – Using configuration files rather than code for model-specific behaviors proved enormously beneficial for maintenance and extensibility. This approach allowed new models to be added without code changes, significantly reducing the risk of introducing bugs (a brief sketch follows this list).
Protocol translation challenges – The most complex aspect was handling the subtle differences in streaming protocols between different models. Building a robust abstraction required careful consideration of edge cases and comprehensive error handling.
Error normalization – Creating a consistent error experience across diverse models required sophisticated error handling that could translate model-specific errors into user-friendly, actionable messages. This improved both developer and end-user experiences.
Type safety – Strong typing (using Python’s type hints extensively) was crucial for maintaining code quality across a complex codebase with multiple contributors. This practice reduced bugs and improved code maintainability.
Security first – Integrating Secrets Manager from the start made sure credentials were handled securely throughout the system’s lifecycle, helping prevent potential security vulnerabilities.
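
To make the configuration-driven point concrete, onboarding a new model could amount to a declarative entry like the following; this assumes a BotConfig class that collects the configuration fields shown throughout this post, and model_id is an assumed field name:

# Hypothetical bot definition for a newly onboarded Amazon Bedrock model -
# no new code paths, just configuration
nova_micro_bot = BotConfig(
    model_id="amazon.nova-micro-v1:0",  # assumed field for the Bedrock model ID
    aws_region="us-east-1",
    streaming=True,
    supports_system_messages=True,
    enable_image_comprehension=True,  # Poe converts images to text for this text-only model
    expand_text_attachments=True,
    enable_monetization=False,
)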

Conclusion
The collaboration between the AWS Generative AI Innovation Center and Quora demonstrates how thoughtful architectural design can dramatically accelerate AI deployment and innovation. By creating a unified wrapper API for Amazon Bedrock models, the teams were able to reduce deployment time from days to minutes while expanding model diversity and improving user experience.
This approach—focusing on abstraction, configuration-driven development, and robust error handling—offers valuable lessons for organizations looking to integrate multiple AI models efficiently. The patterns and techniques demonstrated in this solution can be applied to similar challenges across a wide range of AI integration scenarios.
For technology leaders and developers working on similar challenges, this case study highlights the value of investing in flexible integration frameworks rather than point-to-point integrations. The initial investment in building a robust abstraction layer pays dividends in long-term maintenance and capability expansion.
To learn more about implementing similar solutions, explore the following resources:

The AWS Well-Architected Framework for best practices in building secure, high-performing, resilient, and efficient infrastructure
The Amazon Bedrock Developer Guide for detailed information on working with FMs
The AWS Generative AI Innovation Center for assistance with your generative AI projects
AWS Prescriptive Guidance for LLM Deployment for best practices in deploying large language models

The AWS Generative AI Innovation Center and Quora teams continue to collaborate on enhancements to this framework, making sure Poe users have access to the latest and most capable AI models with minimal deployment delay.

About the authors
Dr. Gilbert V Lepadatu is a Senior Deep Learning Architect at the AWS Generative AI Innovation Center, where he helps enterprise customers design and deploy scalable, cutting-edge GenAI solutions. With a PhD in Philosophy and dual Master’s degrees, he brings a holistic and interdisciplinary approach to data science and AI.
Nick Huber is the AI Ecosystem Lead for Poe (by Quora), where he is responsible for ensuring high-quality & timely integrations of the leading AI models onto the Poe platform.