Monitor Amazon Bedrock batch inference using Amazon CloudWatch metrics

As organizations scale their use of generative AI, many workloads require cost-efficient, bulk processing rather than real-time responses. Amazon Bedrock batch inference addresses this need by enabling large datasets to be processed in bulk with predictable performance—at 50% lower cost than on-demand inference. This makes it ideal for tasks such as historical data analysis, large-scale text summarization, and background processing workloads.
In this post, we explore how to monitor and manage Amazon Bedrock batch inference jobs using Amazon CloudWatch metrics, alarms, and dashboards to optimize performance, cost, and operational efficiency.
New features in Amazon Bedrock batch inference
Batch inference in Amazon Bedrock is constantly evolving, and recent updates bring significant enhancements to performance, flexibility, and cost transparency:

Expanded model support – Batch inference now supports additional model families, including Anthropic’s Claude Sonnet 4 and OpenAI OSS models. For the most up-to-date list, refer to Supported Regions and models for batch inference.
Performance enhancements – Batch inference optimizations on newer Anthropic Claude and OpenAI GPT OSS models now deliver higher batch throughput compared to previous models, helping you process large workloads more quickly.
Job monitoring capabilities – You can now track how your submitted batch jobs are progressing directly in CloudWatch, without the heavy lifting of building custom monitoring solutions. This capability provides AWS account-level visibility into job progress, making it straightforward to manage large-scale workloads.

Use cases for batch inference
AWS recommends using batch inference in the following use cases:

Jobs are not time-sensitive and can tolerate minutes to hours of delay
Processing is periodic, such as daily or weekly summarization of large datasets (news, reports, transcripts)
Bulk or historical data needs to be analyzed, such as archives of call center transcripts, emails, or chat logs
Knowledge bases need enrichment, including generating embeddings, summaries, tags, or translations at scale
Content requires large-scale transformation, such as classification, sentiment analysis, or converting unstructured text into structured outputs
Experimentation or evaluation is needed, for example testing prompt variations or generating synthetic datasets
Compliance and risk checks must be run on historical content for sensitive data detection or governance

Launch an Amazon Bedrock batch inference job
You can start a batch inference job in Amazon Bedrock using the AWS Management Console, AWS SDKs, or AWS Command Line Interface (AWS CLI). For detailed instructions, see Create a batch inference job.
To use the console, complete the following steps:

On the Amazon Bedrock console, choose Batch inference under Infer in the navigation pane.
Choose Create batch inference job.
For Job name, enter a name for your job.
For Model, choose the model to use.
For Input data, enter the location of the Amazon Simple Storage Service (Amazon S3) input bucket (JSONL format).
For Output data, enter the S3 location of the output bucket.
For Service access, select your method to authorize Amazon Bedrock.
Choose Create batch inference job.
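If you prefer to automate job creation, the following is a minimal boto3 sketch of the equivalent API call. The S3 locations, role ARN, model ID, and Region are placeholders you would replace with your own values.

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Placeholder values - replace with your own S3 locations, role, and model ID
response = bedrock.create_model_invocation_job(
    jobName="my-batch-inference-job",
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchInferenceRole",
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",
    inputDataConfig={
        "s3InputDataConfig": {"s3Uri": "s3://my-input-bucket/batch-records.jsonl"}
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://my-output-bucket/batch-output/"}
    },
)
print(response["jobArn"])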

Monitor batch inference with CloudWatch metrics
Amazon Bedrock now automatically publishes metrics for batch inference jobs under the AWS/Bedrock/Batch namespace. You can track batch workload progress at the AWS account level with the following CloudWatch metrics. For current Amazon Bedrock models, these metrics include records pending processing and input and output tokens processed per minute; for Anthropic Claude models, they also include tokens pending processing.
The following metrics can be monitored by modelId:

NumberOfTokensPendingProcessing – Shows how many tokens are still waiting to be processed, helping you gauge backlog size
NumberOfRecordsPendingProcessing – Tracks how many inference requests remain in the queue, giving visibility into job progress
NumberOfInputTokensProcessedPerMinute – Measures how quickly input tokens are being consumed, indicating overall processing throughput
NumberOfOutputTokensProcessedPerMinute – Measures generation speed

To view these metrics using the CloudWatch console, complete the following steps:

On the CloudWatch console, choose Metrics in the navigation pane.
Filter metrics by AWS/Bedrock/Batch.
Select your modelId to view detailed metrics for your batch job.

To learn more about how to use CloudWatch to monitor metrics, refer to Query your CloudWatch metrics with CloudWatch Metrics Insights.
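You can also pull these metrics programmatically. The following boto3 sketch queries the records-pending metric over the last 6 hours; the model ID and Region are placeholders, and the dimension name ModelId is an assumption based on the console's modelId grouping.

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Placeholder model ID; the dimension name ModelId is assumed
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock/Batch",
    MetricName="NumberOfRecordsPendingProcessing",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-sonnet-4-20250514-v1:0"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=6),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])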
Best practices for monitoring and managing batch inference
Consider the following best practices for monitoring and managing your batch inference jobs:

Cost monitoring and optimization – By monitoring token throughput metrics (NumberOfInputTokensProcessedPerMinute and NumberOfOutputTokensProcessedPerMinute) alongside your batch job schedules, you can estimate inference costs using information on the Amazon Bedrock pricing page. This helps you understand how fast tokens are being processed, what that means for cost, and how to adjust job size or scheduling to stay within budget while still meeting throughput needs.
SLA and performance tracking – The NumberOfTokensPendingProcessing metric is useful for understanding your batch backlog size and tracking overall job progress, but it should not be relied on to predict job completion times because they might vary depending on overall inference traffic to Amazon Bedrock. To understand batch processing speed, we recommend monitoring throughput metrics (NumberOfInputTokensProcessedPerMinute and NumberOfOutputTokensProcessedPerMinute) instead. If these throughput rates fall significantly below your expected baseline, you can configure automated alerts to trigger remediation steps—for example, shifting some jobs to on-demand processing to meet your expected timelines.
Job completion tracking – When the metric NumberOfRecordsPendingProcessing reaches zero, it indicates that all running batch inference jobs are complete. You can use this signal to trigger stakeholder notifications or start downstream workflows.

Example of CloudWatch metrics
In this section, we demonstrate how you can use CloudWatch metrics to set up proactive alerts and automation.
For example, you can create a CloudWatch alarm that sends an Amazon Simple Notification Service (Amazon SNS) notification when the average NumberOfInputTokensProcessedPerMinute exceeds 1 million within a 6-hour period. This alert could prompt an Ops team review or trigger downstream data pipelines.
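The following boto3 sketch shows one way to configure such an alarm; the SNS topic ARN, model ID, and dimension name are illustrative placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Placeholder SNS topic and model ID; adjust the threshold and period to your workload
cloudwatch.put_metric_alarm(
    AlarmName="bedrock-batch-input-token-throughput",
    Namespace="AWS/Bedrock/Batch",
    MetricName="NumberOfInputTokensProcessedPerMinute",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-sonnet-4-20250514-v1:0"}],
    Statistic="Average",
    Period=21600,          # evaluate the average over a 6-hour window
    EvaluationPeriods=1,
    Threshold=1000000,     # 1 million input tokens per minute, on average
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-team-notifications"],
)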

The following screenshot shows the alarm in the In alarm state because the batch inference job met the threshold. The alarm then triggers the target action, in our case an SNS notification email to the Ops team.

The following screenshot shows an example of the email the Ops team received, notifying them that the number of processed tokens exceeded their threshold.

You can also build a CloudWatch dashboard displaying the relevant metrics. This is ideal for centralized operational monitoring and troubleshooting.
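The following sketch scripts such a dashboard with boto3; the widget layout, dimension name, model ID, and Region are illustrative and can be adapted to your workloads.

import boto3, json

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Placeholder dashboard showing the two throughput metrics in one widget
dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Batch inference token throughput",
                "region": "us-east-1",
                "stat": "Average",
                "period": 300,
                "metrics": [
                    ["AWS/Bedrock/Batch", "NumberOfInputTokensProcessedPerMinute",
                     "ModelId", "anthropic.claude-sonnet-4-20250514-v1:0"],
                    ["AWS/Bedrock/Batch", "NumberOfOutputTokensProcessedPerMinute",
                     "ModelId", "anthropic.claude-sonnet-4-20250514-v1:0"],
                ],
            },
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName="bedrock-batch-inference",
    DashboardBody=json.dumps(dashboard_body),
)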

Conclusion
Amazon Bedrock batch inference now offers expanded model support, improved performance, deeper visibility into the progress of your batch workloads, and enhanced cost monitoring.
Get started today by launching an Amazon Bedrock batch inference job, setting up CloudWatch alarms, and building a monitoring dashboard, so you can maximize efficiency and value from your generative AI workloads.

About the authors
Vamsi Thilak Gudi is a Solutions Architect at Amazon Web Services (AWS) in Austin, Texas, helping Public Sector customers build effective cloud solutions. He brings diverse technical experience to show customers what’s possible with AWS technologies. He actively contributes to the AWS Technical Field Community for Generative AI.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Avish Khosla is a software developer on Bedrock’s Batch Inference team, where the team builds reliable, scalable systems to run large-scale inference workloads on generative AI models. He cares about clean architecture and great docs. When he is not shipping code, he is on a badminton court or glued to a good cricket match.
Chintan Vyas serves as a Principal Product Manager–Technical at Amazon Web Services (AWS), where he focuses on Amazon Bedrock services. With over a decade of experience in Software Engineering and Product Management, he specializes in building and scaling large-scale, secure, and high-performance Generative AI services. In his current role, he leads the enhancement of programmatic interfaces for Amazon Bedrock. Throughout his tenure at AWS, he has successfully driven Product Management initiatives across multiple strategic services, including Service Quotas, Resource Management, Tagging, Amazon Personalize, Amazon Bedrock, and more. Outside of work, Chintan is passionate about mentoring emerging Product Managers and enjoys exploring the scenic mountain ranges of the Pacific Northwest.
Mayank Parashar is a Software Development Manager for Amazon Bedrock services.

IBM AI Releases Granite-Docling-258M: An Open-Source, Enterprise-Ready …

IBM has released Granite-Docling-258M, an open-source (Apache-2.0) vision-language model designed specifically for end-to-end document conversion. The model targets layout-faithful extraction—tables, code, equations, lists, captions, and reading order—emitting a structured, machine-readable representation rather than lossy Markdown. It is available on Hugging Face with a live demo and MLX build for Apple Silicon.

What’s new compared to SmolDocling?

Granite-Docling is the product-ready successor to SmolDocling-256M. IBM replaced the earlier backbone with a Granite 165M language model and upgraded the vision encoder to SigLIP2 (base, patch16-512) while retaining the Idefics3-style connector (pixel-shuffle projector). The resulting model has 258M parameters and shows consistent accuracy gains across layout analysis, full-page OCR, code, equations, and tables (see metrics below). IBM also addressed instability failure modes observed in the preview model (e.g., repetitive token loops).

Architecture and training pipeline

Backbone: Idefics3-derived stack with SigLIP2 vision encoder → pixel-shuffle connector → Granite 165M LLM.

Training framework: nanoVLM (lightweight, pure-PyTorch VLM training toolkit).

Representation: Outputs DocTags, an IBM-authored markup designed for unambiguous document structure (elements + coordinates + relationships), which downstream tools convert to Markdown/HTML/JSON.

Compute: Trained on IBM’s Blue Vela H100 cluster.

Quantified improvements (Granite-Docling-258M vs. SmolDocling-256M preview)

Evaluated with docling-eval, LMMS-Eval, and task-specific datasets:

Layout: MAP 0.27 vs. 0.23; F1 0.86 vs. 0.85.

Full-page OCR: F1 0.84 vs. 0.80; lower edit distance.

Code recognition: F1 0.988 vs. 0.915; edit distance 0.013 vs. 0.114.

Equation recognition: F1 0.968 vs. 0.947.

Table recognition (FinTabNet @150dpi): TEDS-structure 0.97 vs. 0.82; TEDS with content 0.96 vs. 0.76.

Other benchmarks: MMStar 0.30 vs. 0.17; OCRBench 500 vs. 338.

Stability: “Avoids infinite loops more effectively” (production-oriented fix).

Multilingual support

Granite-Docling adds experimental support for Japanese, Arabic, and Chinese. IBM marks this as early-stage; English remains the primary target.

How the DocTags pathway changes Document AI

Conventional OCR-to-Markdown pipelines lose structural information and complicate downstream retrieval-augmented generation (RAG). Granite-Docling emits DocTags—a compact, LLM-friendly structural grammar—which Docling converts into Markdown/HTML/JSON. This preserves table topology, inline/floating math, code blocks, captions, and reading order with explicit coordinates, improving index quality and grounding for RAG and analytics.

Inference and integration

Docling Integration (recommended): The docling CLI/SDK automatically pulls Granite-Docling and converts PDFs/office docs/images to multiple formats. IBM positions the model as a component inside Docling pipelines rather than a general VLM.
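As a rough illustration of that workflow, the following sketch uses Docling's Python SDK to convert a document and export its structure. How you select Granite-Docling as the VLM backend depends on your Docling version, so treat the pipeline configuration as an assumption and consult the Docling documentation.

# Minimal sketch: convert a PDF with Docling and export structured output.
# Selecting Granite-Docling as the VLM backend is version-dependent and not shown here.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()           # default pipeline; a VLM pipeline can be configured instead
result = converter.convert("report.pdf")  # local path or URL to the source document

print(result.document.export_to_markdown())  # Markdown view of the parsed document
# The same document object can also be exported to HTML or JSON for RAG indexing.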

Runtimes: Works with Transformers, vLLM, ONNX, and MLX; a dedicated MLX build is optimized for Apple Silicon. A Hugging Face Space provides an interactive demo (ZeroGPU).

License: Apache-2.0.

Why Granite-Docling?

For enterprise document AI, small VLMs that preserve structure reduce inference cost and pipeline complexity. Granite-Docling replaces multiple single-purpose models (layout, OCR, table, code, equations) with a single component that emits a richer intermediate representation, improving downstream retrieval and conversion fidelity. The measured gains—in TEDS for tables, F1 for code/equations, and reduced instability—make it a practical upgrade from SmolDocling for production workflows.

Demo

Summary

Granite-Docling-258M marks a significant advancement in compact, structure-preserving document AI. By combining IBM’s Granite backbone, SigLIP2 vision encoder, and the nanoVLM training framework, it delivers enterprise-ready performance across tables, equations, code, and multilingual text—all while remaining lightweight and open-source under Apache 2.0. With measurable gains over its SmolDocling predecessor and seamless integration into Docling pipelines, Granite-Docling provides a practical foundation for document conversion and RAG workflows where precision and reliability are critical.

Check out the Models on Hugging Face and Demo here.

The post IBM AI Releases Granite-Docling-258M: An Open-Source, Enterprise-Ready Document AI Model appeared first on MarkTechPost.

Meta AI Researchers Release MapAnything: An End-to-End Transformer Arc …

A team of researchers from Meta Reality Labs and Carnegie Mellon University has introduced MapAnything, an end-to-end transformer architecture that directly regresses factored metric 3D scene geometry from images and optional sensor inputs. Released under Apache 2.0 with full training and benchmarking code, MapAnything advances beyond specialist pipelines by supporting over 12 distinct 3D vision tasks in a single feed-forward pass.

https://map-anything.github.io/assets/MapAnything.pdf

Why a Universal Model for 3D Reconstruction?

Image-based 3D reconstruction has historically relied on fragmented pipelines: feature detection, two-view pose estimation, bundle adjustment, multi-view stereo, or monocular depth inference. While effective, these modular solutions require task-specific tuning, optimization, and heavy post-processing.

Recent transformer-based feed-forward models such as DUSt3R, MASt3R, and VGGT simplified parts of this pipeline but remained limited: fixed numbers of views, rigid camera assumptions, or reliance on coupled representations that needed expensive optimization.

MapAnything overcomes these constraints by:

Accepting up to 2,000 input images in a single inference run.

Flexibly using auxiliary data such as camera intrinsics, poses, and depth maps.

Producing direct metric 3D reconstructions without bundle adjustment.

The model’s factored scene representation—composed of ray maps, depth, poses, and a global scale factor—provides modularity and generality unmatched by prior approaches.

Architecture and Representation

At its core, MapAnything employs a multi-view alternating-attention transformer. Each input image is encoded with DINOv2 ViT-L features, while optional inputs (rays, depth, poses) are encoded into the same latent space via shallow CNNs or MLPs. A learnable scale token enables metric normalization across views.

The network outputs a factored representation:

Per-view ray directions (camera calibration).

Depth along rays, predicted up-to-scale.

Camera poses relative to a reference view.

A single metric scale factor converting local reconstructions into a globally consistent frame.

This explicit factorization avoids redundancy, allowing the same model to handle monocular depth estimation, multi-view stereo, structure-from-motion (SfM), or depth completion without specialized heads.
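To make the factorization concrete, the following NumPy sketch (not MapAnything's actual API) shows how per-view ray directions, up-to-scale depth, a camera pose, and the global metric scale combine into world-space points under one reasonable convention.

import numpy as np

# Illustrative shapes and convention only; the model's real outputs and APIs may differ.
H, W = 4, 4
rays = np.random.randn(H, W, 3)                       # per-pixel ray directions (camera frame)
rays /= np.linalg.norm(rays, axis=-1, keepdims=True)  # unit rays encode camera calibration
depth = np.random.rand(H, W, 1)                       # depth along each ray, predicted up to scale
R = np.eye(3)                                         # rotation of this view w.r.t. the reference view
t = np.array([0.1, 0.0, 0.0])                         # translation of this view
metric_scale = 2.5                                    # single global scale factor

points_cam = rays * depth                             # local, up-to-scale 3D points
points_world = metric_scale * (points_cam @ R.T + t)  # metric points in the reference frame
print(points_world.shape)                             # (4, 4, 3)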


Training Strategy

MapAnything was trained across 13 diverse datasets spanning indoor, outdoor, and synthetic domains, including BlendedMVS, Mapillary Planet-Scale Depth, ScanNet++, and TartanAirV2. Two variants are released:

Apache 2.0 licensed model trained on six datasets.

CC BY-NC model trained on all thirteen datasets for stronger performance.

Key training strategies include:

Probabilistic input dropout: During training, geometric inputs (rays, depth, pose) are provided with varying probabilities, enabling robustness across heterogeneous configurations.

Covisibility-based sampling: Ensures input views have meaningful overlap, supporting reconstruction up to 100+ views.

Factored losses in log-space: Depth, scale, and pose are optimized using scale-invariant and robust regression losses to improve stability.

Training was performed on 64 H200 GPUs with mixed precision, gradient checkpointing, and curriculum scheduling, scaling from 4 to 24 input views.

Benchmarking Results

Multi-View Dense Reconstruction

On ETH3D, ScanNet++ v2, and TartanAirV2-WB, MapAnything achieves state-of-the-art (SoTA) performance across pointmaps, depth, pose, and ray estimation. It surpasses baselines like VGGT and Pow3R even when limited to images only, and improves further with calibration or pose priors.

For example:

Pointmap relative error (rel) improves to 0.16 with only images, compared to 0.20 for VGGT.

With images + intrinsics + poses + depth, the error drops to 0.01, while achieving >90% inlier ratios.

Two-View Reconstruction

Against DUSt3R, MASt3R, and Pow3R, MapAnything consistently outperforms across scale, depth, and pose accuracy. Notably, with additional priors, it achieves >92% inlier ratios on two-view tasks, significantly beyond prior feed-forward models.

Single-View Calibration

Despite not being trained specifically for single-image calibration, MapAnything achieves an average angular error of 1.18°, outperforming AnyCalib (2.01°) and MoGe-2 (1.95°).

Depth Estimation

On the Robust-MVD benchmark:

MapAnything sets new SoTA for multi-view metric depth estimation.

With auxiliary inputs, its error rates rival or surpass specialized depth models such as MVSA and Metric3D v2.

Overall, benchmarks confirm 2× improvement over prior SoTA methods in many tasks, validating the benefits of unified training.

Key Contributions

The research team highlights four major contributions:

Unified Feed-Forward Model capable of handling more than 12 problem settings, from monocular depth to SfM and stereo.

Factored Scene Representation enabling explicit separation of rays, depth, pose, and metric scale.

State-of-the-Art Performance across diverse benchmarks with fewer redundancies and higher scalability.

Open-Source Release including data processing, training scripts, benchmarks, and pretrained weights under Apache 2.0.

Conclusion

MapAnything establishes a new benchmark in 3D vision by unifying multiple reconstruction tasks—SfM, stereo, depth estimation, and calibration—under a single transformer model with a factored scene representation. It not only outperforms specialist methods across benchmarks but also adapts seamlessly to heterogeneous inputs, including intrinsics, poses, and depth. With open-source code, pretrained models, and support for over 12 tasks, MapAnything lays the groundwork for a truly general-purpose 3D reconstruction backbone.

Check out the Paper, Codes and Project Page.

The post Meta AI Researchers Release MapAnything: An End-to-End Transformer Architecture that Directly Regresses Factored, Metric 3D Scene Geometry appeared first on MarkTechPost.

How to Build an Advanced End-to-End Voice AI Agent Using Hugging Face …

In this tutorial, we build an advanced voice AI agent using Hugging Face’s freely available models, and we keep the entire pipeline simple enough to run smoothly on Google Colab. We combine Whisper for speech recognition, FLAN-T5 for natural language reasoning, and Bark for speech synthesis, all connected through transformers pipelines. By doing this, we avoid heavy dependencies, API keys, or complicated setups, and we focus on showing how we can turn voice input into meaningful conversation and get back natural-sounding voice responses in real time. Check out the FULL CODES here.

!pip -q install "transformers>=4.42.0" accelerate torchaudio sentencepiece gradio soundfile

import os, torch, tempfile, numpy as np
import gradio as gr
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

DEVICE = 0 if torch.cuda.is_available() else -1

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small.en",
    device=DEVICE,
    chunk_length_s=30,
    return_timestamps=False
)

LLM_MODEL = "google/flan-t5-base"
tok = AutoTokenizer.from_pretrained(LLM_MODEL)
llm = AutoModelForSeq2SeqLM.from_pretrained(LLM_MODEL, device_map="auto")

tts = pipeline("text-to-speech", model="suno/bark-small")

We install the necessary libraries and load three Hugging Face pipelines: Whisper for speech-to-text, FLAN-T5 for generating responses, and Bark for text-to-speech. We set the device automatically so that we can use GPU if available. Check out the FULL CODES here.

SYSTEM_PROMPT = (
    "You are a helpful, concise voice assistant. "
    "Prefer direct, structured answers. "
    "If the user asks for steps or code, use short bullet points."
)

def format_dialog(history, user_text):
    turns = []
    for u, a in history:
        if u: turns.append(f"User: {u}")
        if a: turns.append(f"Assistant: {a}")
    turns.append(f"User: {user_text}")
    prompt = (
        "Instruction:\n"
        f"{SYSTEM_PROMPT}\n\n"
        "Dialog so far:\n" + "\n".join(turns) + "\n\n"
        "Assistant:"
    )
    return prompt

We define a system prompt that guides our agent to stay concise and structured, and we implement a format_dialog function that takes past conversation history along with the user input and builds a prompt string for the model to generate the assistant’s reply. Check out the FULL CODES here.

def transcribe(filepath):
    out = asr(filepath)
    text = out["text"].strip()
    return text

def generate_reply(history, user_text, max_new_tokens=256):
    prompt = format_dialog(history, user_text)
    inputs = tok(prompt, return_tensors="pt", truncation=True).to(llm.device)
    with torch.no_grad():
        ids = llm.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.05,
        )
    reply = tok.decode(ids[0], skip_special_tokens=True).strip()
    return reply

def synthesize_speech(text):
    out = tts(text)
    audio = out["audio"]
    sr = out["sampling_rate"]
    audio = np.asarray(audio, dtype=np.float32)
    return (sr, audio)

We create three core functions for our voice agent: transcribe converts recorded audio into text using Whisper, generate_reply builds a context-aware response from FLAN-T5, and synthesize_speech turns that response back into spoken audio with Bark. Check out the FULL CODES here.

def clear_history():
    return [], []

def voice_to_voice(mic_file, history):
    history = history or []
    if not mic_file:
        return history, None, "Please record something!"
    try:
        user_text = transcribe(mic_file)
    except Exception as e:
        return history, None, f"ASR error: {e}"

    if not user_text:
        return history, None, "Didn't catch that. Try again?"

    try:
        reply = generate_reply(history, user_text)
    except Exception as e:
        return history, None, f"LLM error: {e}"

    try:
        sr, wav = synthesize_speech(reply)
    except Exception as e:
        return history + [(user_text, reply)], None, f"TTS error: {e}"

    return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"

def text_to_voice(user_text, history):
    history = history or []
    user_text = (user_text or "").strip()
    if not user_text:
        return history, None, "Type a message first."
    try:
        reply = generate_reply(history, user_text)
        sr, wav = synthesize_speech(reply)
    except Exception as e:
        return history, None, f"Error: {e}"
    return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"

def export_chat(history):
    lines = []
    for u, a in history or []:
        lines += [f"User: {u}", f"Assistant: {a}", ""]
    text = "\n".join(lines).strip() or "No conversation yet."
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt", mode="w") as f:
        f.write(text)
        path = f.name
    return path

We add interactive functions for our agent: clear_history resets the conversation, voice_to_voice handles speech input and returns a spoken reply, text_to_voice processes typed input and speaks back, and export_chat saves the entire dialog into a downloadable text file. Check out the FULL CODES here.

with gr.Blocks(title="Advanced Voice AI Agent (HF Pipelines)") as demo:
    gr.Markdown(
        "## Advanced Voice AI Agent (Hugging Face Pipelines Only)\n"
        "- **ASR**: openai/whisper-small.en\n"
        "- **LLM**: google/flan-t5-base\n"
        "- **TTS**: suno/bark-small\n"
        "Speak or type; the agent replies with voice + text."
    )

    with gr.Row():
        with gr.Column(scale=1):
            mic = gr.Audio(sources=["microphone"], type="filepath", label="Record")
            say_btn = gr.Button("Speak")
            text_in = gr.Textbox(label="Or type instead", placeholder="Ask me anything...")
            text_btn = gr.Button("Send")
            export_btn = gr.Button("Export Chat (.txt)")
            reset_btn = gr.Button("Reset")
        with gr.Column(scale=1):
            audio_out = gr.Audio(label="Assistant Voice", autoplay=True)
            transcript = gr.Textbox(label="Transcript", lines=6)
    chat = gr.Chatbot(height=360)
    state = gr.State([])

    def update_chat(history):
        return [(u, a) for u, a in (history or [])]

    say_btn.click(voice_to_voice, [mic, state], [state, audio_out, transcript]).then(
        update_chat, inputs=state, outputs=chat
    )
    text_btn.click(text_to_voice, [text_in, state], [state, audio_out, transcript]).then(
        update_chat, inputs=state, outputs=chat
    )
    reset_btn.click(clear_history, None, [chat, state])
    export_btn.click(export_chat, state, gr.File(label="Download chat.txt"))

demo.launch(debug=False)

We build a clean Gradio UI that lets us speak or type and then hear the agent’s response. We wire buttons to our callbacks, maintain chat state, and stream results into a chatbot, transcript, and audio player, all launched in one Colab app.

In conclusion, we see how seamlessly Hugging Face pipelines enable us to create a voice-driven conversational agent that listens, thinks, and responds. We now have a working demo that captures audio, transcribes it, generates intelligent responses, and returns speech output, all inside Colab. With this foundation, we can experiment with larger models, add multilingual support, or even extend the system with custom logic. Still, the core idea remains the same: we can bring together ASR, LLM, and TTS into one smooth workflow for an interactive voice AI experience.

Check out the FULL CODES here.

The post How to Build an Advanced End-to-End Voice AI Agent Using Hugging Face Pipelines? appeared first on MarkTechPost.

Supercharge your organization’s productivity with the Amazon Q Busin …

Generative AI solutions like Amazon Q Business are transforming the way employees work. Organizations in every industry are embracing these tools to help their workforce extract valuable insights from increasingly fragmented data to accelerate decision-making processes. However, the adoption of generative AI tools hasn’t been without its challenges.
Two hurdles have emerged in the implementation of generative AI solutions. First, users often find themselves compelled to abandon familiar workflows, manually transferring data to an AI assistant for analysis. This creates unnecessary friction and increases the time to value. Second, the absence of generative AI tools in commonly used software makes it difficult for employees to identify opportunities where AI can significantly boost their productivity.
Enter Amazon Q Business, a generative AI-powered assistant tailored for the modern workplace that lets you engage in conversations, solve complex problems, and take action by seamlessly connecting to company data and enterprise systems. Amazon Q Business provides employees with instant access to relevant information and advice, streamlining tasks, accelerating decision-making, and fostering creativity and innovation in the workplace. We recently launched the Amazon Q Business browser extension, now available to Amazon Q Business subscribers (Lite and Pro). The browser extension brings the power of Amazon Q Business directly into your browser, so you can receive context-aware, generative AI assistance and get on-the-go help for daily tasks.
In this post, we show how to implement this solution for your own enterprise, giving your team seamless access to AI-driven insights and assistance.
Use cases for the Amazon Q Business browser extension
The Amazon Q Business browser extension is deployed to all Amazonians, making tens of thousands of users more productive every day. In this section, we highlight some of the most impactful use cases for which Amazonians use the Amazon Q Business browser extension to boost their productivity.
Analyze web content
Business and technical teams need to analyze and synthesize information across various reports, competitive analyses, and industry documents found outside the company’s data to develop insights and strategy. They must make sure their strategic recommendations are based on verified data sources and trustworthy industry information. Additionally, identifying patterns across multiple sources is time-consuming and complex. With the Amazon Q Business browser extension, strategists can quickly generate industry insights and identify trends across trusted internal and external data sources in seconds, while maintaining the human element in strategic thinking.
Check out the following demo video:

Improve content quality
The Amazon Q Business browser extension brings the unique ability to incorporate context that might not be readily available to your generative AI assistant. You can use the Amazon Q Business browser extension for content creation and content quality improvements by including multiple disparate sources in your queries that typically aren’t available to generative AI assistants. You can use it to perform real-time validation of content from various sources and incorporate web-based style guides and best practices to accelerate content creation.
Check out the following demo video:

Solution overview
In the following sections, we walk through how to get started with the Amazon Q Business browser extension if you have already enabled Amazon Q Business for your organization. To learn more, see Configuring the Amazon Q Business browser extension for use.
Prerequisites
Complete the prerequisite steps in this section before deploying the browser extension.
Create an Amazon Q Business application and subscribe your users
The Amazon Q Business browser extension is a feature of Amazon Q Business and requires customers to first create an Amazon Q Business application and subscribe their users before the browser extension can be enabled. To learn more about how you can get started with Amazon Q Business, see Getting started with Amazon Q Business.
Set up the Amazon Q Business web experience
The browser extension uses the Amazon Q Business web experience client as the mechanism to authenticate users and offer Amazon Q Business features. The first step to enabling the browser extension is to create an Amazon Q Business web experience. If you have already created a web experience for your users, you can skip this step. However, if you have developed a custom web experience using the Amazon Q Business APIs, complete the following steps to create an Amazon Q Business web experience:

On the Amazon Q Business console, go to your Amazon Q Business application.

The Web experience settings section shows if you already have a web experience deployed. If you don’t have a web experience deployed, this section will be empty, with the message “A web experience needs to be created before deploying.”

At the top of your application details page, choose Edit.

For Outcome, select Web experience.
Choose Update.

This step might take a few minutes to complete.

After your web experience is deployed, you will find a URL where your web experience is hosted on your Amazon Q Business application details page. Save this URL for later.

Grant users access to send queries directly to the large language model
The Amazon Q Business browser extension can include your users’ web page context in queries by passing the web page content as file attachments alongside a user’s prompt. Because the file attachment feature is available only for General knowledge mode, the browser extension requires Amazon Q Business admins to grant users access to send queries directly to the large language model (LLM) to take advantage of the full feature set of the browser extension. Without this prerequisite, users can only access their company knowledge through the browser extension and can’t ask Amazon Q Business questions about their web page content.
Amazon Q Business does not store user conversation data and does not use queries or conversations for training its LLMs. Conversations are only stored within the application for 30 days. You can delete these conversations by accessing the Amazon Q Business web experience and choosing Chat in the navigation pane, as shown in the following screenshot.

To grant users access to send queries directly to the Amazon Q LLM, complete the following steps:

On the Amazon Q Business console, go to your application.
Choose Admin controls and guardrails in the navigation pane.

In the Global controls section, choose Edit.

Select Allow end users to send queries directly to the LLM.
Choose Save.

You are now ready to enable the browser extension for your users.
Configure the Amazon Q Business browser extension
Now that you have completed the prerequisites for the browser extension, complete the following steps to enable the browser extension for your users:

On the Amazon Q Business console, go to your application.
Under Enhancements in the navigation pane, choose Integrations.
In the Browser extensions section, choose Edit.

Select the check boxes for the browser extensions you want to enable:

The Chromium check box enables the Chrome store extension, which supports Google Chrome and Microsoft Edge browsers.
The Firefox check box enables the Firefox Browser add-on for Firefox browsers.

You can also view the Chrome or Firefox store pages for the extension using the links in the respective Learn more sections.

Choose Save.

Your users will now see instructions to install the Amazon Q Business browser extension the next time they log in to the Amazon Q Business web experience. If you have not yet done so, share the web experience URL you obtained in the earlier steps with your users so they can follow the steps to install the browser extension.
Activate the browser extension if you are using IAM federation authentication for Amazon Q Business
If you’re using an external identity provider (IdP) for your Amazon Q Business application, you must allow-list the browser extension with the external provider before your users can start using the browser extension. You can allow-list the following URLs with your IdP to activate the browser extension:

For the Chromium browser extension (suitable for Google Chrome and Microsoft Edge), use https://feihpdljijcgnokhfoibicengfiellbp.chromiumapp.org/
For the Mozilla Firefox browser extension, use https://ba6e8e6e4fa44c1057cf5f26fba9b2e788dfc34f.extensions.allizom.org/

You don’t need to take the aforementioned steps if you’re using AWS IAM Identity Center as the authentication solution for your Amazon Q Business application.
Get started with the browser extension
After you share the web experience URL with your users, they can use it to find the browser extension store page and install the browser extension. Users can complete the following steps:

Log in to the Amazon Q Business web experience provided by your admin.

You will notice a banner letting you know that your admin has enabled the browser extension for you.

Choose Install extension.

The link will take you to the appropriate Amazon Q Business browser extension store page based on the browser you’re using.

Choose Add to Chrome or the appropriate installation option for your browser.

Upon installing the extension, you will find it in your browser’s toolbar under Extensions. You can choose the pin icon to pin the browser extension.

After you open your browser extension, you will see a side pane as shown in the following screenshot. It will automatically detect the correct web experience URL from your open tabs to help you sign in. If it doesn’t, enter the web experience URL provided by your admin in the Amazon Q URL section and choose Sign in.

Upon signing in, you’re ready to go! Refer to the earlier section discussing Amazon’s use cases for inspiration on how you can use the extension to boost your productivity.

Deploy the Amazon Q Business browser extension on behalf of your users
Some admins might choose to directly deploy the Amazon Q Business browser extension on their users’ browsers to streamline and accelerate adoption.
Enterprises use varying mobile device management software and have differing requirements for their browser policies. To deploy the Amazon Q Business browser extension, refer to the following resources:

Mozilla Firefox policy settings
Google Chrome policy settings
Microsoft Edge:

Policy settings
Reference guide

Customize the Amazon Q Business browser extension for your enterprise
Some admins might choose to customize the look and feel of the Amazon Q Business browser extension to fit their enterprise’s needs. This section outlines the extension’s supported customization functionality and the corresponding browser extension policy values to configure on your users’ browsers.
Remove the Amazon Q Business URL input from the browser extension login page
If you don’t want to require an Amazon Q Business web experience URL from your users at sign-in, you can set a default URL on their behalf by setting the Q_BIZ_BROWSER_EXTENSION_URL policy to the appropriate Amazon Q Business web experience URL for your users.

Replace the browser extension’s toolbar icon
You can modify the toolbar icon of your browser extension by setting the value of one or more of the following browser policy keys to the URL of your PNG or SVG image or a valid datauri for your users:

Q_BIZ_BROWSER_EXTENSION_ICON_128 (mandatory)
Q_BIZ_BROWSER_EXTENSION_ICON_16 (optional)
Q_BIZ_BROWSER_EXTENSION_ICON_32 (optional)
Q_BIZ_BROWSER_EXTENSION_ICON_48 (optional)

Replace the logo or icon in the browser extension window
To change the logo or icon in your browser extension window, set the value of the Q_BIZ_BROWSER_EXTENSION_LOGO policy key with a URL to your PNG or SVG image or a valid datauri for your users.

Modify the name of the browser extension shown in the browser extension window
To replace references to “Amazon Q,” “Amazon Q Business,” “AWS,” and “Amazon Web Services” with a name of your choice inside the browser extension window, set the value of the Q_BIZ_BROWSER_EXTENSION_ENTERPRISE_NAME policy key with the new name for your users.

Modify the title of your browser extension in hover text
To change the title of your browser extension as it shows in the text when hovering over your extension (“Amazon Q Business has access to this site,” as seen in the prior screenshot), set the Q_BIZ_BROWSER_EXTENSION_TITLE_NAME policy to the appropriate string for your users.

Replace the AI policy link in the browser extension footer with your own link
To replace the link text in the footer of your browser extension, set Q_BIZ_BROWSER_EXTENSION_FOOTER_POLICY_NAME to the appropriate string for your users.
To replace the URL in the footer of your browser extension, set Q_BIZ_BROWSER_EXTENSION_FOOTER_POLICY_URL to the appropriate URL for your users.

Congratulations! You and your organization are ready to receive generative assistance for your browser-based tasks.
Clean up
This section outlines the steps to disable or remove the browser extension or revert deployments and customization for your users.
Disable the Amazon Q Business browser extension through the Amazon Q Business console
You can disable the Amazon Q Business browser extension from the Amazon Q Business console whenever you choose, even before removing the browser extension from your users’ browsers. To do so, complete the following steps:

On the Amazon Q Business console, go to your application.
Under Enhancements in the navigation pane, choose Integrations.
In the Browser extensions section, choose Edit.

Deselect the check boxes for the browser extensions you want to disable:

The Chromium check box disables the Chrome store extension, which supports Google Chrome and Microsoft Edge browsers.
The Firefox check box disables the Firefox Browser add-on for Firefox browsers.

Choose Save.

Revert the deployment of the Amazon Q Business browser extension on behalf of your users
Enterprises use varying mobile device management software and have differing requirements for their browser policies. If you deployed the browser extension by updating your browser policy settings, you should remove those policies by following the guidance in the policy settings documentation for the respective browsers:

Mozilla Firefox policy settings
Google Chrome policy settings
Microsoft Edge:

Policy settings
Reference guide

Revert the customization of the Amazon Q Business browser extension on behalf of your users
If you customized the Amazon Q Business browser extension by modifying browser policies as detailed earlier in this post, you can revert those customizations by simply removing the corresponding policy entry in your browser policy settings.
Conclusion
In this post, we showed how to use the Amazon Q Business browser extension to give your team seamless access to AI-driven insights and assistance. The browser extension is now available in US East (N. Virginia) and US West (Oregon) AWS Regions for Mozilla, Google Chrome, and Microsoft Edge as part of the Lite Subscription. There is no additional cost to use the browser extension.
To get started, log in to the Amazon Q Business console and set up the browser extension for your Amazon Q Business application. To learn more, see Configuring the Amazon Q Business browser extension for use.

About the authors
Firaz Akmal is a Sr. Product Manager for Amazon Q Business and has been at AWS for 8+ years. He is a customer advocate, helping customers transform their search and generative AI use-cases on AWS. Outside of work Firaz enjoys spending time in the mountains of the PNW or experiencing the world through his daughter’s perspective.
Abhinand Sukumar is a Senior Product Manager at Amazon Web Services for Amazon Q Business, where he drives the product vision and roadmap for innovative generative AI solutions. Abhinand works closely with customers and engineering to deliver successful integrations, including the browser extension. His expertise spans generative AI experiences and AI/ML educational devices, with a deep passion for education, artificial intelligence, and design thinking. Prior to joining AWS, Abhinand worked as an embedded software engineer in the networking industry. He has 5-6 years of experience in technology.

Build Agentic Workflows with OpenAI GPT OSS on Amazon SageMaker AI and …

OpenAI has released two open-weight models, gpt-oss-120b (117 billion parameters) and gpt-oss-20b (21 billion parameters), both built with a Mixture of Experts (MoE) design and a 128K context window. These models are the leading open source models, according to Artificial Analysis benchmarks, and excel at reasoning and agentic workflows. With Amazon SageMaker AI, you can fine-tune or customize models and deploy with your choice of framework through a fully managed service. Amazon SageMaker Inference gives you the flexibility to bring your own inference code and framework without having to build and maintain your own clusters.
Although large language models (LLMs) excel at understanding language and generating content, building real-world agentic applications requires complex workflow management, tool calling capabilities, and context management. Multi-agent architectures address these challenges by breaking down complex systems into specialized components, but they introduce new complexities in agent coordination, memory management, and workflow orchestration.
In this post, we show how to deploy gpt-oss-20b model to SageMaker managed endpoints and demonstrate a practical stock analyzer agent assistant example with LangGraph, a powerful graph-based framework that handles state management, coordinated workflows, and persistent memory systems. We will then deploy our agents to Amazon Bedrock AgentCore, a unified orchestration layer that abstracts away infrastructure and allows you to securely deploy and operate AI agents at scale.
Solution overview
In this solution, we build an agentic stock analyzer with the following key components:

The GPT OSS 20B model deployed to a SageMaker endpoint using vLLM, an open source serving framework for LLMs
LangGraph to build a multi-agent orchestration framework
Amazon Bedrock AgentCore to deploy the agents

The following diagram illustrates the solution architecture.

This architecture illustrates a multi-agent workflow hosted on Amazon Bedrock AgentCore Runtime running on AWS. A user submits a query, which is handled by a pipeline of specialized agents—Data Gathering Agent, Stock Performance Analyzer Agent, and Stock Report Generation Agent—that are each responsible for a distinct part of the stock evaluation process.
These agents collaborate within Amazon Bedrock AgentCore Runtime, and when language understanding or generation is required, they invoke a GPT OSS model hosted on SageMaker AI. The model processes the input and returns structured outputs that inform agent actions, enabling a fully serverless, modular, and scalable agentic system using open-source models.
Prerequisites

Ensure that you have the required quota for G6e instances to deploy the model. Request quota here if you do not.
If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain.
Ensure your IAM role has the required permissions to deploy SageMaker models and endpoints. For more information, see How Amazon SageMaker AI works with IAM in the SageMaker Developer Guide.

Deploy GPT-OSS models to SageMaker Inference
Customers who want to customize their models and frameworks can deploy using serverful deployments, but this requires access to GPUs, serving frameworks, load balancers, and infrastructure setup. SageMaker AI provides a fully managed hosting platform that takes care of provisioning the infrastructure with the necessary drivers, downloads the models, and deploys them. OpenAI’s GPT-OSS models are launched with a 4-bit quantization scheme (MXFP4), enabling fast inference while keeping resource usage low. These models can run on P5 (H100), P6 (H200), P4 (A100), and G6e (L40) instances. The GPT-OSS models are sparse MoE architectures with 128 experts (120B) or 32 experts (20B), where each token is routed to 4 experts with no shared expert. Using MXFP4 for MoE weights alone reduces the model sizes to 63 GB (120B) and 14 GB (20B), making them runnable on a single H100 GPU.
To deploy these models effectively, you need a powerful serving framework like vLLM. To deploy the model, we build a vLLM container with the latest version that supports GPT OSS models on SageMaker AI.
You can use the following Docker file and script to build the container and push it to a local Amazon Elastic Container Registry (Amazon ECR). The recommended approach is to do this directly from Amazon SageMaker Studio, which provides a managed JupyterLab environment with AWS CLI access where you can build and push images to ECR as part of your SageMaker workflow. Alternatively, you can also perform the same steps on an Amazon Elastic Compute Cloud (Amazon EC2) instance with Docker installed.
After you have built and pushed the container to Amazon ECR, you can open Amazon SageMaker Studio by going to the SageMaker AI console, as shown in the following screenshot.

You can then create a Jupyter space or use an existing one to launch JupyterLab and run notebooks.

Clone the following notebook and run “Option 3: Deploying from HF using BYOC.” Update the required parameters in the notebook, such as the inference image, with your container image URI. We also provide the necessary environment variables, as shown in the following code.

inference_image = f"{account_id}.dkr.ecr.{region}.amazonaws.com/vllm:v0.10.0-gpt-oss"
instance_type = "ml.g6e.4xlarge"
num_gpu = 1
model_name = sagemaker.utils.name_from_base("model-byoc")
endpoint_name = model_name
inference_component_name = f"ic-{model_name}"
config = {
    "OPTION_MODEL": "openai/gpt-oss-20b",
    "OPTION_SERVED_MODEL_NAME": "model",
    "OPTION_TENSOR_PARALLEL_SIZE": json.dumps(num_gpu),
    "OPTION_ASYNC_SCHEDULING": "true",
}

After you set up the deployment configuration, you can deploy to SageMaker AI using the following code:

from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

lmi_model = sagemaker.Model(
    image_uri=inference_image,
    env=config,
    role=role,
    name=model_name,
)

lmi_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=600,
    endpoint_name=endpoint_name,
    endpoint_type=sagemaker.enums.EndpointType.INFERENCE_COMPONENT_BASED,
    inference_component_name=inference_component_name,
    resources=ResourceRequirements(requests={"num_accelerators": num_gpu, "memory": 1024*5, "copies": 1}),
)

You can now run an inference example:

payload = {
    "messages": [
        {"role": "user", "content": "Name popular places to visit in London?"}
    ],
}
res = llm.predict(payload)
print("-----\n" + res["choices"][0]["message"]["content"] + "\n-----\n")
print(res["usage"])

—–
Here are some of the must‑see spots in London — a mix of iconic landmarks, world‑class museums, and vibrant neighborhoods:

| # | Place | Why It’s Popular |
|—|——-|——————|
| 1 | **Buckingham Palace** | The Queen’s official London residence – watch the Changing of the Guard. |
| 2 | **The Tower of London & Tower Bridge** | Historic castle, Crown Jewels, and the iconic bridge with glass floors. |
| 3 | **The British Museum** | World‑famous collection from the Rosetta Stone to Egyptian mummies (free entry). |
| 4 | **The Houses of Parliament & Big Ben** | The classic symbol of London’s politics and architecture. |
| 5 | **The National Gallery (Tate Britain)** | Home to masterpieces from Van Gogh to Turner. |
| 6 | **Buckinghamshire Gardens (Kew Gardens)** | Stunning botanical gardens with a glasshouse and the Horniman Insect Zoo. |
| 7 | **Camden Market** | Eclectic stalls, street food, music and vintage fashion. |
| 8 | **Covent Garden** | Lively piazza with street performers, boutique shops, and the Royal Opera House. |
| 9 | **West End Theatres** | Theatre district famous for grand productions (musicals, dramas). |
|10 | **The Shard** | Skyscraper with panoramic 360° views of London. |
|11 | **St. Paul’s Cathedral** | Massive dome, stunning interior and a climb up the Whispering Gallery. |
|12 | **The Tate Modern** | Contemporary art museum set in a former power station. |
|13 | **The Victoria & Albert Museum** | Design and fashion, costume, and jewelry collections. |
|14 | **Hyde Park & Kensington Gardens** | Huge green spaces with Serpentine Lake, Speaker’s Corner and Speakers’ Corner. |
|15 | **Oxford Street & Regent Street** | Prime shopping streets for fashion, flagship stores, and historic architecture. |

These spots cover history, culture, shopping, and leisure—perfect for a first visit or a weekend escape in London!
—–

Use LangGraph to build a stock analyzer agent
For our stock analyzer multi-agent system, we use LangGraph to orchestrate the workflow. The Jupyter notebook for the code is located in this GitHub repository. The system comprises three specialized tools that work together to analyze stocks comprehensively (a simplified sketch of one tool follows the list):

The gather_stock_data tool collects comprehensive stock data for a given ticker symbol, including current price, historical performance, financial metrics, and market data. It returns formatted information covering price history, company fundamentals, trading metrics, and recent news headlines.
The analyze_stock_performance tool performs detailed technical and fundamental analysis of stock data, calculating metrics like price trends, volatility, and overall investment scores. It evaluates multiple factors including P/E ratios, profit margins, and dividend yields to provide a comprehensive performance analysis.
The generate_stock_report tool creates professional PDF reports from the gathered stock data and analysis, automatically uploading them to Amazon S3 with organized date-based folders.
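To give a flavor of how such a tool can be declared, the following is a heavily simplified sketch using LangChain's @tool decorator; the real implementations live in the linked repository and return far richer data (the placeholder values echo the sample output later in this post).

# Simplified sketch of one tool; the repository's version is more complete.
from langchain_core.tools import tool

@tool
def gather_stock_data(ticker: str) -> str:
    """Collect basic stock data for a ticker symbol (placeholder values)."""
    # In the real tool, this pulls prices, fundamentals, and news for the ticker.
    return f"Stock Symbol: {ticker}\nCurrent Price: $29.31\nYTD Return: 1.30%"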

For local testing, you can use a simplified version of the system by importing the necessary functions from your local script. For example:

from langgraph_stock_local import langgraph_stock_sagemaker

# Test the agent locally
result = langgraph_stock_sagemaker({
    "prompt": "Analyze SIM_STOCK Stock for Investment purposes."
})
print(result)

This way, you can iterate quickly on your agent’s logic before deploying it to a scalable platform, making sure each component functions correctly and the overall workflow produces the expected results for different types of stocks.
Deploy to Amazon Bedrock AgentCore
After you have developed and tested your LangGraph workflow locally, you can deploy it to Amazon Bedrock AgentCore Runtime. Amazon Bedrock AgentCore handles the heavy lifting of container orchestration, session management, and scaling, abstracting away infrastructure management. It provides persistent execution environments that can maintain an agent’s state across multiple invocations.
Before deploying our stock analyzer agent to Amazon Bedrock AgentCore Runtime, we need to create an AWS Identity and Access Management (IAM) role with the appropriate permissions. This role allows Amazon Bedrock AgentCore to invoke your SageMaker endpoint for GPT-OSS model inference, manage Amazon ECR repositories for storing container images, write Amazon CloudWatch logs for monitoring and debugging, access Amazon Bedrock AgentCore workload services for runtime operations, and send telemetry data to AWS X-Ray and CloudWatch for observability. See the following code:

from create_agentcore_role import create_bedrock_agentcore_role

role_arn = create_bedrock_agentcore_role(
    role_name="MyStockAnalyzerRole",
    sagemaker_endpoint_name="your-endpoint-name",
    region="us-west-2"
)

After creating the role, you can use the Amazon Bedrock AgentCore Starter Toolkit to deploy your agent. The toolkit simplifies the deployment process by packaging your code, creating the necessary container image, and configuring the runtime environment:

from bedrock_agentcore_starter_toolkit import Runtime

agentcore_runtime = Runtime()

# Configure the agent
response = agentcore_runtime.configure(
    entrypoint="langgraph_stock_sagemaker_gpt_oss.py",
    execution_role=role_arn,
    auto_create_ecr=True,
    requirements_file="requirements.txt",
    region="us-west-2",
    agent_name="stock_analyzer_agent"
)

# Deploy to the cloud
launch_result = agentcore_runtime.launch(local=False, local_build=False)

When you use BedrockAgentCoreApp, it automatically creates an HTTP server that listens on port 8080, implements the required /invocations endpoint for processing the agent's requests, implements the /ping endpoint for health checks (which is especially important for asynchronous agents), handles proper content types and response formats, and manages error handling according to AWS standards.
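
The following is a minimal sketch of what such an entrypoint looks like, assuming the bedrock-agentcore Python SDK; the handler body is illustrative and not the stock analyzer's actual code:

from bedrock_agentcore.runtime import BedrockAgentCoreApp

app = BedrockAgentCoreApp()

@app.entrypoint
def invoke(payload):
    # BedrockAgentCoreApp serves this handler at /invocations and answers /ping itself
    prompt = payload.get("prompt", "")
    return {"result": f"Received prompt: {prompt}"}

if __name__ == "__main__":
    app.run()  # starts the HTTP server on port 8080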
After you deploy to Amazon Bedrock AgentCore Runtime, you will be able to see the status show as Ready on the Amazon Bedrock AgentCore console.

Invoke the agent
After you create the agent, you must set up the agent invocation entry point. With Amazon Bedrock AgentCore Runtime, we decorate the invocation part of our agent with the @app.entrypoint decorator and use it as the entry point for our runtime. After you deploy the agent to Amazon Bedrock AgentCore Runtime, you can invoke it using the AWS SDK:

import boto3
import json
agentcore_client = boto3.client('bedrock-agentcore', region_name='us-west-2')
response = agentcore_client.invoke_agent_runtime(
    agentRuntimeArn=launch_result.agent_arn,
    qualifier="DEFAULT",
    payload=json.dumps({
        "prompt": "Analyze SIM_STOCK for investment purposes"
    })
)

After invoking the stock analyzer agent through Amazon Bedrock AgentCore Runtime, you must parse and format the response for clear presentation. The response processing involves the following steps:

Decode the byte stream from Amazon Bedrock AgentCore into readable text.
Parse the JSON response containing the complete stock analysis.
Extract three main sections using regex pattern matching:

Stock data gathering section: Extracts core stock information including symbol, company details, current pricing, market metrics, financial ratios, trading data, and recent news headlines.
Performance analysis section: Analyzes technical indicators, fundamental metrics, and volatility measures to generate a comprehensive stock analysis.
Stock report generation section: Generates a detailed PDF report containing the complete technical analysis.

The system also includes error handling that gracefully recovers from JSON parsing errors, falls back to plain text display if structured parsing fails, and provides debugging information for troubleshooting issues when parsing the stock analysis response.
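
A minimal sketch of such a parser is shown below, assuming the SDK returns the payload as a streaming body under the response key and that the analysis text contains the section headers shown later in this post; the regex patterns are illustrative:

import json
import re

def parse_bedrock_agentcore_stock_response(invoke_response):
    # Decode the byte stream returned by invoke_agent_runtime into text
    raw = invoke_response["response"].read().decode("utf-8")
    try:
        body = json.loads(raw)
        text = body if isinstance(body, str) else json.dumps(body)
    except json.JSONDecodeError:
        # Fall back to plain text if the payload is not valid JSON
        text = raw
    # Extract each section by its header; patterns are illustrative
    patterns = {
        "gathering": r"STOCK DATA GATHERING REPORT:(.*?)(?=STOCK PERFORMANCE ANALYSIS:|$)",
        "analysis": r"STOCK PERFORMANCE ANALYSIS:(.*?)(?=STOCK REPORT GENERATION:|$)",
        "report": r"STOCK REPORT GENERATION:(.*)$",
    }
    sections = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, re.DOTALL)
        sections[name] = match.group(1).strip() if match else ""
    return sections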

stock_analysis = parse_bedrock_agentcore_stock_response(invoke_response)

This formatted output makes it straightforward to review the agent’s decision-making process and present professional stock analysis results to stakeholders, completing the end-to-end workflow from model deployment to meaningful business output:

STOCK DATA GATHERING REPORT:
================================
Stock Symbol: SIM_STOCK
Company Name: Simulated Stock Inc.
Sector: SIM_SECTOR
Industry: SIM INDUSTRY
CURRENT MARKET DATA:
– Current Price: $29.31
– Market Cap: $3,958
– 52-Week High: $29.18
– 52-Week Low: $16.80
– YTD Return: 1.30%
– Volatility (Annualized): 32.22%
FINANCIAL METRICS:
– P/E Ratio: 44.80
– Forward P/E: 47.59
– Price-to-Book: 11.75
– Dividend Yield: 0.46%
– Revenue (TTM): $4,988
– Profit Margin: 24.30%

STOCK PERFORMANCE ANALYSIS:
===============================
Stock: SIM_STOCK | Current Price: $29.31
TECHNICAL ANALYSIS:
– Price Trend: SLIGHT UPTREND
– YTD Performance: 1.03%
– Technical Score: 3/5
FUNDAMENTAL ANALYSIS:
– P/E Ratio: 34.80
– Profit Margin: 24.30%
– Dividend Yield: 0.46%
– Beta: 1.165
– Fundamental Score: 3/5
STOCK REPORT GENERATION:
===============================
Stock: SIM_STOCK
Sector: SIM_INDUSTRY
Current Price: $29.78
REPORT SUMMARY:
– Technical Analysis: 8.33% YTD performance
– Report Type: Comprehensive stock analysis for informational purposes
– Generated: 2025-09-04 23:11:55
PDF report uploaded to S3: s3://amzn-s3-demo-bucket/2025/09/04/SIM_STOCK_Stock_Report_20250904_231155.pdf
REPORT CONTENTS:
• Executive Summary with key metrics
• Detailed market data and financial metrics
• Technical and fundamental analysis
• Professional formatting for documentation

Clean up
You can delete the SageMaker endpoint to avoid accruing costs after your testing by running the following cells in the same notebook:

sess.delete_inference_component(inference_component_name)
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
sess.delete_model(model_name)

You can also delete Amazon Bedrock AgentCore resources using the following commands:

runtime_delete_response = agentcore_control_client.delete_agent_runtime(
    agentRuntimeId=launch_result.agent_id
)
response = ecr_client.delete_repository(
    repositoryName=launch_result.ecr_uri.split('/')[1],
    force=True
)

Conclusion
In this post, we built an end-to-end solution for deploying OpenAI's open-weight models on a single G6e (NVIDIA L40S) GPU instance, creating a multi-agent stock analysis system with LangGraph, and deploying it seamlessly with Amazon Bedrock AgentCore. This implementation demonstrates how organizations can now use powerful open source LLMs cost-effectively with efficient serving frameworks such as vLLM. Beyond the technical implementation, enhancing this workflow can provide significant business value, such as reduced stock analysis processing time and increased analyst productivity through automation of routine stock assessments. Furthermore, by freeing analysts from repetitive tasks, organizations can redirect skilled professionals toward complex cases and relationship-building activities that drive business growth.
We invite you to try out our code samples and iterate your agentic workflows to meet your use cases.

About the authors
Vivek Gangasani is a Worldwide Lead GenAI Specialist Solutions Architect for SageMaker Inference. He drives Go-to-Market (GTM) and Outbound Product strategy for SageMaker Inference. He also helps enterprises and startups deploy, manage, and scale their GenAI models with SageMaker and GPUs. Currently, he is focused on developing strategies and solutions for optimizing inference performance and GPU efficiency for hosting Large Language Models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Surya Kari is a Senior Generative AI Data Scientist at AWS, specializing in developing solutions leveraging state-of-the-art foundation models. He has extensive experience working with advanced language models including DeepSeek-R1, the Llama family, and Qwen, focusing on their fine-tuning and optimization for specific scientific applications. His expertise extends to implementing efficient training pipelines and deployment strategies using AWS SageMaker, enabling the scaling of foundation models from development to production. He collaborates with customers to design and implement generative AI solutions, helping them navigate model selection, fine-tuning approaches, and deployment strategies to achieve optimal performance for their specific use cases.

A Coding Guide to Implement Zarr for Large-Scale Data: Chunking, Compression, Indexing, and Visualization Techniques

In this tutorial, we take a deep dive into the capabilities of Zarr, a library designed for efficient storage & manipulation of large, multidimensional arrays. We begin by exploring the basics, creating arrays, setting chunking strategies, and modifying values directly on disk. From there, we expand into more advanced operations such as experimenting with chunk sizes for different access patterns, applying multiple compression codecs to optimize both speed and storage efficiency, and comparing their performance on synthetic datasets. We also build hierarchical structures enriched with metadata, simulate realistic workflows with time-series and volumetric data, and demonstrate advanced indexing to extract meaningful subsets. Check out the FULL CODES here.

!pip install zarr numcodecs -q
import zarr
import numpy as np
import matplotlib.pyplot as plt
from numcodecs import Blosc, Delta, FixedScaleOffset
import tempfile
import shutil
import os
from pathlib import Path

print(f"Zarr version: {zarr.__version__}")
print(f"NumPy version: {np.__version__}")

print("=== BASIC ZARR OPERATIONS ===")

We begin our tutorial by installing Zarr and Numcodecs, along with essential libraries like NumPy and Matplotlib. We then set up the environment and verify the versions, preparing ourselves to dive into basic Zarr operations. Check out the FULL CODES here.

tutorial_dir = Path(tempfile.mkdtemp(prefix="zarr_tutorial_"))
print(f"Working directory: {tutorial_dir}")

z1 = zarr.zeros((1000, 1000), chunks=(100, 100), dtype='f4',
                store=str(tutorial_dir / 'basic_array.zarr'), zarr_format=2)
z2 = zarr.ones((500, 500, 10), chunks=(100, 100, 5), dtype='i4',
               store=str(tutorial_dir / 'multi_dim.zarr'), zarr_format=2)

print(f"2D Array shape: {z1.shape}, chunks: {z1.chunks}, dtype: {z1.dtype}")
print(f"3D Array shape: {z2.shape}, chunks: {z2.chunks}, dtype: {z2.dtype}")

z1[100:200, 100:200] = np.random.random((100, 100)).astype('f4')
z2[:, :, 0] = np.arange(500*500).reshape(500, 500)

print(f"Memory usage estimate: {z1.nbytes_stored() / 1024**2:.2f} MB")

We create our working directory and initialize Zarr arrays: a 2D array of zeros and a 3D array of ones. We then fill them with random and sequential values, while also checking their shapes, chunk sizes, and memory usage in real time. Check out the FULL CODES here.

print("\n=== ADVANCED CHUNKING ===")

time_steps, height, width = 365, 1000, 2000
time_series = zarr.zeros(
    (time_steps, height, width),
    chunks=(30, 250, 500),
    dtype='f4',
    store=str(tutorial_dir / 'time_series.zarr'),
    zarr_format=2
)

for t in range(0, time_steps, 30):
    end_t = min(t + 30, time_steps)
    seasonal = np.sin(2 * np.pi * np.arange(t, end_t) / 365)[:, None, None]
    spatial = np.random.normal(20, 5, (end_t - t, height, width))
    time_series[t:end_t] = (spatial + 10 * seasonal).astype('f4')

print(f"Time series created: {time_series.shape}")
print(f"Approximate chunks created")

import time
start = time.time()
temporal_slice = time_series[:, 500, 1000]
temporal_time = time.time() - start

start = time.time()
spatial_slice = time_series[100, :200, :200]
spatial_time = time.time() - start

print(f"Temporal access time: {temporal_time:.4f}s")
print(f"Spatial access time: {spatial_time:.4f}s")

In this step, we simulate a year-long time-series dataset with optimized chunking for both temporal and spatial access. We add seasonal patterns and spatial noise, then measure access speeds, allowing us to see firsthand how chunking impacts performance in real-world data exploration. Check out the FULL CODES here.

print("\n=== COMPRESSION AND CODECS ===")

data = np.random.randint(0, 1000, (1000, 1000), dtype='i4')

from zarr.codecs import BloscCodec, BytesCodec

z_none = zarr.array(data, chunks=(100, 100),
                    codecs=[BytesCodec()],
                    store=str(tutorial_dir / 'no_compress.zarr'))

z_lz4 = zarr.array(data, chunks=(100, 100),
                   codecs=[BytesCodec(), BloscCodec(cname='lz4', clevel=5)],
                   store=str(tutorial_dir / 'lz4_compress.zarr'))

z_zstd = zarr.array(data, chunks=(100, 100),
                    codecs=[BytesCodec(), BloscCodec(cname='zstd', clevel=9)],
                    store=str(tutorial_dir / 'zstd_compress.zarr'))

sequential_data = np.cumsum(np.random.randint(-5, 6, (1000, 1000)), axis=1)
z_delta = zarr.array(sequential_data, chunks=(100, 100),
                     codecs=[BytesCodec(), BloscCodec(cname='zstd', clevel=5)],
                     store=str(tutorial_dir / 'sequential_compress.zarr'))

sizes = {
    'No compression': z_none.nbytes_stored(),
    'LZ4': z_lz4.nbytes_stored(),
    'ZSTD': z_zstd.nbytes_stored(),
    'Sequential+ZSTD': z_delta.nbytes_stored()
}

print("Compression comparison:")
original_size = data.nbytes
for name, size in sizes.items():
    ratio = size / original_size
    print(f"{name}: {size/1024**2:.2f} MB (ratio: {ratio:.3f})")

print("\n=== HIERARCHICAL DATA ORGANIZATION ===")

root = zarr.open_group(str(tutorial_dir / 'experiment.zarr'), mode='w')

raw_data = root.create_group('raw_data')
processed = root.create_group('processed')
metadata = root.create_group('metadata')

raw_data.create_dataset('images', shape=(100, 512, 512), chunks=(10, 128, 128), dtype='u2')
raw_data.create_dataset('timestamps', shape=(100,), dtype='datetime64[ns]')

processed.create_dataset('normalized', shape=(100, 512, 512), chunks=(10, 128, 128), dtype='f4')
processed.create_dataset('features', shape=(100, 50), chunks=(20, 50), dtype='f4')

root.attrs['experiment_id'] = 'EXP_2024_001'
root.attrs['description'] = 'Advanced Zarr tutorial demonstration'
root.attrs['created'] = str(np.datetime64('2024-01-01'))

raw_data.attrs['instrument'] = 'Synthetic Camera'
raw_data.attrs['resolution'] = [512, 512]
processed.attrs['normalization'] = 'z-score'

timestamps = np.datetime64('2024-01-01') + np.arange(100) * np.timedelta64(1, 'h')
raw_data['timestamps'][:] = timestamps

for i in range(100):
    frame = np.random.poisson(100 + 50 * np.sin(2 * np.pi * i / 100), (512, 512)).astype('u2')
    raw_data['images'][i] = frame

print(f"Created hierarchical structure with {len(list(root.group_keys()))} groups")
print(f"Data arrays and groups created successfully")

print("\n=== ADVANCED INDEXING ===")

volume_data = zarr.zeros((50, 20, 256, 256), chunks=(5, 5, 64, 64), dtype='f4',
                         store=str(tutorial_dir / 'volume.zarr'), zarr_format=2)

for t in range(50):
    for z in range(20):
        y, x = np.ogrid[:256, :256]
        center_y, center_x = 128 + 20*np.sin(t*0.1), 128 + 20*np.cos(t*0.1)
        focus_quality = 1 - abs(z - 10) / 10

        signal = focus_quality * np.exp(-((y-center_y)**2 + (x-center_x)**2) / (50**2))
        noise = 0.1 * np.random.random((256, 256))
        volume_data[t, z] = (signal + noise).astype('f4')

print("Various slicing operations:")

max_projection = np.max(volume_data[:, 10], axis=0)
print(f"Max projection shape: {max_projection.shape}")

z_stack = volume_data[25, :, 100:156, 100:156]
print(f"Z-stack subset: {z_stack.shape}")

bright_pixels = volume_data[volume_data > 0.5]
print(f"Pixels above threshold: {len(bright_pixels)}")

We benchmark compression by writing the same data with no compression, LZ4, and ZSTD, then compare on-disk sizes to see practical savings. Next, we organize an experiment as a Zarr group hierarchy with rich attributes, images, and timestamps. Finally, we generate a synthetic 4D volume and perform advanced indexing, max projections, sub-stacks, and thresholding, to validate fast, slice-wise access. Check out the FULL CODES here.

print("\n=== PERFORMANCE OPTIMIZATION ===")

def process_chunk_serial(data, func):
    results = []
    for i in range(0, len(data), 100):
        chunk = data[i:i+100]
        results.append(func(chunk))
    return np.concatenate(results)

def gaussian_filter_1d(x, sigma=1.0):
    kernel_size = int(4 * sigma)
    if kernel_size % 2 == 0:
        kernel_size += 1
    kernel = np.exp(-0.5 * ((np.arange(kernel_size) - kernel_size//2) / sigma)**2)
    kernel = kernel / kernel.sum()
    return np.convolve(x.astype(float), kernel, mode='same')

# Create a chunked 1D array of random values on disk
large_array = zarr.array(np.random.random(10000), chunks=(1000,),
                         store=str(tutorial_dir / 'large.zarr'), zarr_format=2)

start_time = time.time()
chunk_size = 1000
filtered_data = []
for i in range(0, len(large_array), chunk_size):
    end_idx = min(i + chunk_size, len(large_array))
    chunk_data = large_array[i:end_idx]
    smoothed = np.convolve(chunk_data, np.ones(5)/5, mode='same')
    filtered_data.append(smoothed)

result = np.concatenate(filtered_data)
processing_time = time.time() - start_time

print(f"Chunk-aware processing time: {processing_time:.4f}s")
print(f"Processed {len(large_array):,} elements")

print("\n=== VISUALIZATION ===")

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Advanced Zarr Tutorial - Data Visualization', fontsize=16)

axes[0,0].plot(temporal_slice)
axes[0,0].set_title('Temporal Evolution (Single Pixel)')
axes[0,0].set_xlabel('Day of Year')
axes[0,0].set_ylabel('Temperature')

im1 = axes[0,1].imshow(spatial_slice, cmap='viridis')
axes[0,1].set_title('Spatial Pattern (Day 100)')
plt.colorbar(im1, ax=axes[0,1])

methods = list(sizes.keys())
ratios = [sizes[m]/original_size for m in methods]
axes[0,2].bar(range(len(methods)), ratios)
axes[0,2].set_xticks(range(len(methods)))
axes[0,2].set_xticklabels(methods, rotation=45)
axes[0,2].set_title('Compression Ratios')
axes[0,2].set_ylabel('Size Ratio')

axes[1,0].imshow(max_projection, cmap='hot')
axes[1,0].set_title('Max Intensity Projection')

z_profile = np.mean(volume_data[25, :, 120:136, 120:136], axis=(1,2))
axes[1,1].plot(z_profile, 'o-')
axes[1,1].set_title('Z-Profile (Center Region)')
axes[1,1].set_xlabel('Z-slice')
axes[1,1].set_ylabel('Mean Intensity')

axes[1,2].plot(result[:1000])
axes[1,2].set_title('Processed Signal (First 1000 points)')
axes[1,2].set_xlabel('Sample')
axes[1,2].set_ylabel('Amplitude')

plt.tight_layout()
plt.show()

We optimize performance by processing data in chunk-sized batches, applying simple smoothing filters without loading everything into memory. We then visualize temporal trends, spatial patterns, compression effects, and volume profiles, allowing us to see at a glance how our choices in chunking and compression shape the results. Check out the FULL CODES here.

print("\n=== TUTORIAL SUMMARY ===")
print("Zarr features demonstrated:")
print("✓ Multi-dimensional array creation and manipulation")
print("✓ Optimal chunking strategies for different access patterns")
print("✓ Advanced compression with multiple codecs")
print("✓ Hierarchical data organization with metadata")
print("✓ Advanced indexing and data views")
print("✓ Performance optimization techniques")
print("✓ Integration with visualization tools")

def show_tree(path, prefix="", max_depth=3, current_depth=0):
    if current_depth > max_depth:
        return
    items = sorted(path.iterdir())
    for i, item in enumerate(items):
        is_last = i == len(items) - 1
        current_prefix = "└── " if is_last else "├── "
        print(f"{prefix}{current_prefix}{item.name}")
        if item.is_dir() and current_depth < max_depth:
            next_prefix = prefix + ("    " if is_last else "│   ")
            show_tree(item, next_prefix, max_depth, current_depth + 1)

print(f"\nFiles created in {tutorial_dir}:")
show_tree(tutorial_dir)

print(f"\nTotal disk usage: {sum(f.stat().st_size for f in tutorial_dir.rglob('*') if f.is_file()) / 1024**2:.2f} MB")

print("\nAdvanced Zarr tutorial completed successfully!")

We wrap up the tutorial by highlighting everything we explored: array creation, chunking, compression, hierarchical organization, indexing, performance tuning, and visualization. We also review the files generated during the session and confirm total disk usage, giving us a complete picture of how Zarr handles large-scale data efficiently from start to finish.

In conclusion, we move beyond the fundamentals and gain a comprehensive view of how Zarr fits into modern data workflows. We see how it handles storage optimization through compression, organizes complex experiments through hierarchical groups, and enables smooth access to slices of large datasets with minimal overhead. Performance enhancements, such as chunk-aware processing and integration with visualization tools, bring additional depth, demonstrating how theory is directly translated into practice.

Check out the FULL CODES here.
The post A Coding Guide to Implement Zarr for Large-Scale Data: Chunking, Compression, Indexing, and Visualization Techniques appeared first on MarkTechPost.

Google AI Ships TimesFM-2.5: Smaller, Longer-Context Foundation Model That Now Leads GIFT-Eval (Zero-Shot Forecasting)

Google Research has released TimesFM-2.5, a 200M-parameter, decoder-only time-series foundation model with a 16K context length and native probabilistic forecasting support. The new checkpoint is live on Hugging Face. On GIFT-Eval, TimesFM-2.5 now tops the leaderboard across accuracy metrics (MASE, CRPS) among zero-shot foundation models.

What is Time-Series Forecasting?

Time-series forecasting is the practice of analyzing sequential data points collected over time to identify patterns and predict future values. It underpins critical applications across industries, including forecasting product demand in retail, monitoring weather and precipitation trends, and optimizing large-scale systems such as supply chains and energy grids. By capturing temporal dependencies and seasonal variations, time-series forecasting enables data-driven decision-making in dynamic environments.

What changed in TimesFM-2.5 vs v2.0?

Parameters: 200M (down from 500M in 2.0).

Max context: 16,384 points (up from 2,048).

Quantiles: Optional 30M-param quantile head for continuous quantile forecasts up to 1K horizon.

Inputs: No “frequency” indicator required; new inference flags (flip-invariance, positivity inference, quantile-crossing fix).

Roadmap: Upcoming Flax implementation for faster inference; covariates support slated to return; docs being expanded.

Why does a longer context matter?

16K historical points allow a single forward pass to capture multi-seasonal structure, regime breaks, and low-frequency components without tiling or hierarchical stitching. In practice, that reduces pre-processing heuristics and improves stability for domains where context >> horizon (e.g., energy load, retail demand). The longer context is a core design change explicitly noted for 2.5.

What’s the research context?

TimesFM’s core thesis—a single, decoder-only foundation model for forecasting—was introduced in the ICML 2024 paper and Google’s research blog. GIFT-Eval (Salesforce) emerged to standardize evaluation across domains, frequencies, horizon lengths, and univariate/multivariate regimes, with a public leaderboard hosted on Hugging Face.

Key Takeaways

Smaller, Faster Model: TimesFM-2.5 runs with 200M parameters (half of 2.0’s size) while improving accuracy.

Longer Context: Supports 16K input length, enabling forecasts with deeper historical coverage.

Benchmark Leader: Now ranks #1 among zero-shot foundation models on GIFT-Eval for both MASE (point accuracy) and CRPS (probabilistic accuracy).

Production-Ready: Efficient design and quantile forecasting support make it suitable for real-world deployments across industries.

Broad Availability: The model is live on Hugging Face.

Summary

TimesFM-2.5 shows that foundation models for forecasting are moving past proof-of-concept into practical, production-ready tools. By cutting parameters in half while extending context length and leading GIFT-Eval across both point and probabilistic accuracy, it marks a step-change in efficiency and capability. With Hugging Face access already live and BigQuery/Model Garden integration on the way, the model is positioned to accelerate adoption of zero-shot time-series forecasting in real-world pipelines.

Check out the Model card (HF), Repo, Benchmark and Paper.
The post Google AI Ships TimesFM-2.5: Smaller, Longer-Context Foundation Model That Now Leads GIFT-Eval (Zero-Shot Forecasting) appeared first on MarkTechPost.

Stanford Researchers Introduced MedAgentBench: A Real-World Benchmark for Healthcare AI Agents

A team of Stanford University researchers have released MedAgentBench, a new benchmark suite designed to evaluate large language model (LLM) agents in healthcare contexts. Unlike prior question-answering datasets, MedAgentBench provides a virtual electronic health record (EHR) environment where AI systems must interact, plan, and execute multi-step clinical tasks. This marks a significant shift from testing static reasoning to assessing agentic capabilities in live, tool-based medical workflows.

https://ai.nejm.org/doi/full/10.1056/AIdbp2500144

Why Do We Need Agentic Benchmarks in Healthcare?

Recent LLMs have moved beyond static chat-based interactions toward agentic behavior—interpreting high-level instructions, calling APIs, integrating patient data, and automating complex processes. In medicine, this evolution could help address staff shortages, documentation burden, and administrative inefficiencies.

While general-purpose agent benchmarks (e.g., AgentBench, AgentBoard, tau-bench) exist, healthcare lacked a standardized benchmark that captures the complexity of medical data, FHIR interoperability, and longitudinal patient records. MedAgentBench fills this gap by offering a reproducible, clinically relevant evaluation framework.

What Does MedAgentBench Contain?

How Are the Tasks Structured?

MedAgentBench consists of 300 tasks across 10 categories, written by licensed physicians. These tasks include patient information retrieval, lab result tracking, documentation, test ordering, referrals, and medication management. Tasks average 2–3 steps and mirror workflows encountered in inpatient and outpatient care.

What Patient Data Supports the Benchmark?

The benchmark leverages 100 realistic patient profiles extracted from Stanford’s STARR data repository, comprising over 700,000 records including labs, vitals, diagnoses, procedures, and medication orders. Data was de-identified and jittered for privacy while preserving clinical validity.

How Is the Environment Built?

The environment is FHIR-compliant, supporting both retrieval (GET) and modification (POST) of EHR data. AI systems can simulate realistic clinical interactions such as documenting vitals or placing medication orders. This design makes the benchmark directly translatable to live EHR systems.
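
To make this concrete, here is an illustrative example of the kind of FHIR calls an agent would issue; the base URL, patient reference, and resource fields are hypothetical placeholders, not MedAgentBench's actual data or function set:

import requests

FHIR_BASE = "http://localhost:8080/fhir"  # hypothetical FHIR server endpoint

# GET: retrieve recent lab observations for a patient
labs = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": "Patient/example-123", "category": "laboratory", "_sort": "-date", "_count": 5},
).json()

# POST: place a simple medication order as a FHIR MedicationRequest resource
order = {
    "resourceType": "MedicationRequest",
    "status": "active",
    "intent": "order",
    "subject": {"reference": "Patient/example-123"},
    "medicationCodeableConcept": {"text": "Acetaminophen 500 mg tablet"},
}
created = requests.post(f"{FHIR_BASE}/MedicationRequest", json=order).json()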

How Are Models Evaluated?

Metric: Task success rate (SR), measured with strict pass@1 to reflect real-world safety requirements.

Models Tested: 12 leading LLMs including GPT-4o, Claude 3.5 Sonnet, Gemini 2.0, DeepSeek-V3, Qwen2.5, and Llama 3.3.

Agent Orchestrator: A baseline orchestration setup with nine FHIR functions, limited to eight interaction rounds per task.

Which Models Performed Best?

Claude 3.5 Sonnet v2: Best overall with 69.67% success, especially strong in retrieval tasks (85.33%).

GPT-4o: 64.0% success, showing balanced retrieval and action performance.

DeepSeek-V3: 62.67% success, leading among open-weight models.

Observation: Most models excelled at query tasks but struggled with action-based tasks requiring safe multi-step execution.

https://ai.nejm.org/doi/full/10.1056/AIdbp2500144

What Errors Did Models Make?

Two dominant failure patterns emerged:

Instruction adherence failures — invalid API calls or incorrect JSON formatting.

Output mismatch — providing full sentences when structured numerical values were required.

These errors highlight gaps in precision and reliability, both critical in clinical deployment.

Summary

MedAgentBench establishes the first large-scale benchmark for evaluating LLM agents in realistic EHR settings, pairing 300 clinician-authored tasks with a FHIR-compliant environment and 100 patient profiles. Results show strong potential but limited reliability—Claude 3.5 Sonnet v2 leads at 69.67%—highlighting the gap between query success and safe action execution. While constrained by single-institution data and EHR-focused scope, MedAgentBench provides an open, reproducible framework to drive the next generation of dependable healthcare AI agents.

Check out the PAPER and Technical Blog.
The post Stanford Researchers Introduced MedAgentBench: A Real-World Benchmark for Healthcare AI Agents appeared first on MarkTechPost.

Streamline access to ISO-rating content changes with Verisk rating ins …

This post is co-written with Samit Verma, Eusha Rizvi, Manmeet Singh, Troy Smith, and Corey Finley from Verisk.
Verisk Rating Insights, a feature of ISO Electronic Rating Content (ERC), is a powerful tool designed to provide summaries of ISO Rating changes between two releases. Traditionally, extracting specific filing information or identifying differences across multiple releases required manual downloads of full packages, which was time-consuming and prone to inefficiencies. This challenge, coupled with the need for accurate and timely customer support, prompted Verisk to explore innovative ways to enhance user accessibility and automate repetitive processes. Using generative AI and Amazon Web Services (AWS), Verisk has made significant strides in creating a conversational user interface for users to easily retrieve specific information, identify content differences, and improve overall operational efficiency.
In this post, we dive into how Verisk Rating Insights, powered by Amazon Bedrock, large language models (LLMs), and Retrieval Augmented Generation (RAG), is transforming the way customers interact with and access ISO ERC changes.
The challenge
Rating Insights provides valuable content, but there were significant challenges with user accessibility and the time it took to extract actionable insights:

Manual downloading – Customers had to download entire packages to get even a small piece of relevant information. This was inefficient, especially when only a part of the filing needed to be reviewed.
Inefficient data retrieval – Users couldn’t quickly identify the differences between two content packages without downloading and manually comparing them, which could take hours and sometimes days of analysis.
Time-consuming customer support – Verisk’s ERC Customer Support team spent 15% of their time weekly addressing queries from customers who were impacted by these inefficiencies. Furthermore, onboarding new customers required half a day of repetitive training to ensure they understood how to access and interpret the data.
Manual analysis time – Customers often spent 3–4 hours per test case analyzing the differences between filings. With multiple test cases to address, this led to significant delays in critical decision-making.

Solution overview
To solve these challenges, Verisk embarked on a journey to enhance Rating Insights with generative AI technologies. By integrating Anthropic’s Claude, available in Amazon Bedrock, and Amazon OpenSearch Service, Verisk created a sophisticated conversational platform where users can effortlessly access and analyze rating content changes.
The following diagram illustrates the high-level architecture of the solution, with distinct sections showing the data ingestion process and inference loop. The architecture uses multiple AWS services to add generative AI capabilities to the Ratings Insight system. This system’s components work together seamlessly, coordinating multiple LLM calls to generate user responses.

The following diagram shows the architectural components and the high-level steps involved in the Data Ingestion process.

The steps in the data ingestion process proceed as follows:

This process is triggered when a new file is dropped. It is responsible for chunking the document using a custom chunking strategy. This strategy recursively checks each section and keeps them intact without overlap. The process then embeds the chunks and stores them in OpenSearch Service as vector embeddings.
The embedding model used in Amazon Bedrock is amazon.titan-embed-g1-text-02 (a minimal embedding call is sketched after this list).
Amazon OpenSearch Serverless is utilized as a vector embedding store with metadata filtering capability.
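
The following is a minimal sketch of the embed step, assuming a boto3 Bedrock Runtime client; the document shape and metadata fields are illustrative rather than Verisk's actual schema:

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_chunk(text):
    # Titan Text Embeddings takes {"inputText": ...} and returns {"embedding": [...]}
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-g1-text-02",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

# Example document to index into the OpenSearch Serverless vector collection
chunk_text = "Section 4.2 - Updated rating factors ..."
document = {
    "chunk_text": chunk_text,
    "embedding": embed_chunk(chunk_text),
    "metadata": {"release": "2025-06", "line": "General Liability"},  # illustrative metadata
}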

The following diagram shows the architectural components and the high-level steps involved in the inference loop to generate user responses.

The steps in the inference loop proceed as follows:

This component is responsible for multiple tasks: it supplements user questions with recent chat history, embeds the questions, retrieves relevant chunks from the vector database, and finally calls the generation model to synthesize a response.
Amazon ElastiCache is used for storing recent chat history.
The embedding model used in Amazon Bedrock is amazon.titan-embed-g1-text-02.
OpenSearch Serverless is implemented for RAG (Retrieval-Augmented Generation).
For generating responses to user queries, the system uses Anthropic’s Claude Sonnet 3.5 (model ID: anthropic.claude-3-5-sonnet-20240620-v1:0), which is available through Amazon Bedrock.
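
The following is a minimal sketch of that generation step using the Bedrock Converse API; the prompt template and variable names are illustrative, not Verisk's production prompt:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def generate_answer(question, retrieved_chunks, chat_history):
    # Combine retrieved chunks and recent chat history into a single grounded prompt
    context = "\n\n".join(retrieved_chunks)
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        messages=[{
            "role": "user",
            "content": [{"text": f"Chat history:\n{chat_history}\n\nContext:\n{context}\n\nQuestion: {question}"}],
        }],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]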

Key technologies and frameworks used
We used Anthropic’s Claude Sonnet 3.5 (model ID: anthropic.claude-3-5-sonnet-20240620-v1:0) to understand user input and provide detailed, contextually relevant responses. Anthropic’s Claude Sonnet 3.5 enhances the platform’s ability to interpret user queries and deliver accurate insights from complex content changes. LlamaIndex, which is an open source framework, served as the chain framework for efficiently connecting and managing different data sources to enable dynamic retrieval of content and insights.
We implemented RAG, which allows the model to pull specific, relevant data from the OpenSearch Serverless vector database. This means the system generates precise, up-to-date responses based on a user’s query without needing to sift through massive content downloads. The vector database enables intelligent search and retrieval, organizing content changes in a way that makes them quickly and easily accessible. This eliminates the need for manual searching or downloading of entire content packages. Verisk applied guardrails in Amazon Bedrock Guardrails along with custom guardrails around the generative model so the output adheres to specific compliance and quality standards, safeguarding the integrity of responses.
Verisk’s generative AI solution is a comprehensive, secure, and flexible service for building generative AI applications and agents. Amazon Bedrock connects you to leading FMs, services to deploy and operate agents, and tools for fine-tuning, safeguarding, and optimizing models along with knowledge bases to connect applications to your latest data so that you have everything you need to quickly move from experimentation to real-world deployment.
Given the novelty of generative AI, Verisk has established a governance council to oversee its solutions, ensuring they meet security, compliance, and data usage standards. Verisk implemented strict controls within the RAG pipeline to ensure data is only accessible to authorized users. This helps maintain the integrity and privacy of sensitive information. Legal reviews ensure IP protection and contract compliance.
How it works
The integration of these advanced technologies enables a seamless, user-friendly experience. Here’s how Verisk Rating Insights now works for customers:

Conversational user interface – Users can interact with the platform by using a conversational interface. Instead of manually reviewing content packages, users enter a natural language query (for example, “What are the changes in coverage scope between the two recent filings?”). The system uses Anthropic’s Claude Sonnet 3.5 to understand the intent and provides an instant summary of the relevant changes.
Dynamic content retrieval – Thanks to RAG and OpenSearch Service, the platform doesn’t require downloading entire files. Instead, it dynamically retrieves and presents the specific changes a user is seeking, enabling quicker analysis and decision-making.
Automated difference analysis – The system can automatically compare two content packages, highlighting the differences without requiring manual intervention. Users can query for precise comparisons (for example, “Show me the differences in rating criteria between Release 1 and Release 2”).
Customized insights – The guardrails in place mean that responses are accurate, compliant, and actionable. Additionally, if needed, the system can help users understand the impact of changes and assist them in navigating the complexities of filings, providing clear, concise insights.

The following diagram shows the architectural components and the high-level steps involved in the evaluation loop to generate relevant and grounded responses.

The steps in the evaluation loop proceed as follows:

This component is responsible for calling Anthropic’s Claude Sonnet 3.5 model and subsequently invoking the custom-built evaluation APIs to ensure response accuracy.
The generation model employed is Anthropic’s Claude Sonnet 3.5, which handles the creation of responses.
The Evaluation API ensures that responses remain relevant to user queries and stay grounded within the provided context.

The following diagram shows the process of capturing the chat history as contextual memory and storage for analysis.

Quality benchmarks
The Verisk Rating Insights team has implemented a comprehensive evaluation framework and a feedback loop mechanism, shown in the preceding figures, to support continuous improvement and address issues that might arise.
Ensuring high accuracy and consistency in responses is essential for Verisk’s generative AI solutions. However, LLMs can sometimes produce hallucinations or provide irrelevant details, affecting reliability. To address this, Verisk implemented:

Evaluation framework – Integrated into the query pipeline, it validates responses for precision and relevance before delivery.
Extensive testing – Product subject matter experts (SMEs) and quality experts rigorously tested the solution to ensure accuracy and reliability. Verisk collaborated with in-house insurance domain experts to develop SME evaluation metrics for accuracy and consistency. Multiple rounds of SME evaluations were conducted, where experts graded these metrics on a 1–10 scale. Latency was also tracked to assess speed. Feedback from each round was incorporated into subsequent tests to drive improvements.
Continual model improvement – Customer feedback serves as a crucial component in driving the continuous evolution and refinement of the generative models, improving both accuracy and relevance. By seamlessly integrating user interactions and feedback with chat history, a robust data pipeline streams the user interactions to an Amazon Simple Storage Service (Amazon S3) bucket, which acts as a data hub. The interactions then flow into Snowflake, a cloud-based data platform that offers capabilities such as data warehousing, data lakes, data sharing, and data exchange. Through this integration, we built comprehensive analytics dashboards that provide valuable insights into user experience patterns and pain points.

Although the initial results were promising, they didn’t meet the desired accuracy and consistency levels. The development process involved several iterative improvements, such as redesigning the system and making multiple calls to the LLM. The primary metric for success was a manual grading system where business experts compared the results and provided continuous feedback to improve overall benchmarks.
Business impact and opportunity
By integrating generative AI into Verisk Rating Insights, the business has seen a remarkable transformation. Customers enjoyed significant time savings. By eliminating the need to download entire packages and manually search for differences, the time spent on analysis has been drastically reduced. Customers no longer spend 3–4 hours per test case. What at one time took days now takes minutes.
This time savings brought increased productivity. With an automated solution that instantly provides relevant insights, customers can focus more on decision-making rather than spending time on manual data retrieval. And by automating difference analysis and providing a centralized, effortless platform, customers can be more confident in the accuracy of their results and avoid missing critical changes.
For Verisk, the benefit was a reduced customer support burden because the ERC customer support team now spends less time addressing queries. With the AI-powered conversational interface, users can self-serve and get answers in real time, freeing up support resources for more complex inquiries.
The automation of repetitive training tasks meant quicker and more efficient customer onboarding. This reduces the need for lengthy training sessions, and new customers become proficient faster. The integration of generative AI has reduced redundant workflows and the need for manual intervention. This streamlines operations across multiple departments, leading to a more agile and responsive business.
Conclusion
Looking ahead, Verisk plans to continue enhancing the Rating Insights platform twofold. First, we’ll expand the scope of queries, enabling more sophisticated queries related to different filing types and more nuanced coverage areas. Second, we’ll scale the platform. With Amazon Bedrock providing the infrastructure, Verisk aims to scale this solution further to support more users and additional content sets across various product lines.
Verisk Rating Insights, now powered by generative AI and AWS technologies, has transformed the way customers interact with and access rating content changes. Through a conversational user interface, RAG, and vector databases, Verisk intends to eliminate inefficiencies and save customers valuable time and resources while enhancing overall accessibility. For Verisk, this solution has improved operational efficiency and provided a strong foundation for continued innovation.
With Amazon Bedrock and a focus on automation, Verisk is driving the future of intelligent customer support and content management, empowering both their customers and their internal teams to make smarter, faster decisions.
For more information, refer to the following resources:

Explore generative AI on AWS
Learn about unlocking the business value of generative AI
Learn more about Anthropic’s Claude 3 models on Amazon Bedrock
Learn about Amazon Bedrock and how to build and scale generative AI applications with FMs
Explore generative AI quick start proofs of concept

About the authors
Samit Verma serves as the Director of Software Engineering at Verisk, overseeing the Rating and Coverage development teams. In this role, he plays a key part in architectural design and provides strategic direction to multiple development teams, enhancing efficiency and ensuring long-term solution maintainability. He holds a master’s degree in information technology.
Eusha Rizvi serves as a Software Development Manager at Verisk, leading several technology teams within the Ratings Products division. Possessing strong expertise in system design, architecture, and engineering, Eusha offers essential guidance that advances the development of innovative solutions. He holds a bachelor’s degree in information systems from Stony Brook University.
Manmeet Singh is a Software Engineering Lead at Verisk and AWS Certified Generative AI Specialist. He leads the development of an agentic RAG-based generative AI system on Amazon Bedrock, with expertise in LLM orchestration, prompt engineering, vector databases, microservices, and high-availability architecture. Manmeet is passionate about applying advanced AI and cloud technologies to deliver resilient, scalable, and business-critical systems.
Troy Smith is a Vice President of Rating Solutions at Verisk. Troy is a seasoned insurance technology leader with more than 25 years of experience in rating, pricing, and product strategy. At Verisk, he leads the team behind ISO Electronic Rating Content, a widely used resource across the insurance industry. Troy has held leadership roles at Earnix and Capgemini and was the cofounder and original creator of the Oracle Insbridge Rating Engine.
Corey Finley is a Product Manager at Verisk. Corey has over 22 years of experience across personal and commercial lines of insurance. He has worked in both implementation and product support roles and has led efforts for major carriers including Allianz, CNA, Citizens, and others. At Verisk, he serves as Product Manager for VRI, RaaS, and ERC.
Arun Pradeep Selvaraj is a Senior Solutions Architect at Amazon Web Services (AWS). Arun is passionate about working with his customers and stakeholders on digital transformations and innovation in the cloud while continuing to learn, build, and reinvent. He is creative, energetic, deeply customer-obsessed, and uses the working backward process to build modern architectures to help customers solve their unique challenges. Connect with him on LinkedIn.
Ryan Doty is a Solutions Architect Manager at Amazon Web Services (AWS), based out of New York. He helps financial services customers accelerate their adoption of the AWS Cloud by providing architectural guidelines to design innovative and scalable solutions. Coming from a software development and sales engineering background, the possibilities that the cloud can bring to the world excite him.

Unified multimodal access layer for Quora’s Poe using Amazon Bedrock

Organizations gain competitive advantage by deploying and integrating new generative AI models quickly through Generative AI Gateway architectures. This unified interface approach simplifies access to multiple foundation models (FMs), addressing a critical challenge: the proliferation of specialized AI models, each with unique capabilities, API specifications, and operational requirements. Rather than building and maintaining separate integration points for each model, the smart move is to build an abstraction layer that normalizes these differences behind a single, consistent API.
The AWS Generative AI Innovation Center and Quora recently collaborated on an innovative solution to address this challenge. Together, they developed a unified wrapper API framework that streamlines the deployment of Amazon Bedrock FMs on Quora’s Poe system. This architecture delivers a “build once, deploy multiple models” capability that significantly reduces deployment time and engineering effort, with real protocol bridging code visible throughout the codebase.
For technology leaders and developers working on AI multi-model deployment at scale, this framework demonstrates how thoughtful abstraction and protocol translation can accelerate innovation cycles while maintaining operational control.
In this post, we explore how the AWS Generative AI Innovation Center and Quora collaborated to build a unified wrapper API framework that dramatically accelerates the deployment of Amazon Bedrock FMs on Quora’s Poe system. We detail the technical architecture that bridges Poe’s event-driven ServerSentEvents protocol with Amazon Bedrock REST-based APIs, demonstrate how a template-based configuration system reduced deployment time from days to 15 minutes, and share implementation patterns for protocol translation, error handling, and multi-modal capabilities. We show how this “build once, deploy multiple models” approach helped Poe integrate over 30 Amazon Bedrock models across text, image, and video modalities while reducing code changes by up to 95%.
Quora and Amazon Bedrock
Poe.com is an AI system developed by Quora that users and developers can use to interact with a wide range of advanced AI models and assistants powered by multiple providers. The system offers multi-model access, enabling side-by-side conversations with various AI chatbots for tasks such as natural language understanding, content generation, image creation, and more.
The following screenshot showcases the user interface of Poe, the AI platform created by Quora. The image displays Poe’s extensive library of AI models, which are presented as individual “chatbots” that users can interact with.

The following screenshot provides a view of the Model Catalog within Amazon Bedrock, a fully managed service from Amazon Web Services (AWS) that offers access to a diverse range of foundation models (FMs). This catalog acts as a central hub for developers to discover, evaluate, and access state-of-the-art AI from various providers.

Initially, integrating the diverse FMs available through Amazon Bedrock presented significant technical challenges for the Poe.com team. The process required substantial engineering resources to establish connections with each model while maintaining consistent performance and reliability standards. Maintainability emerged as an extremely important consideration, as was the ability to efficiently onboard new models as they became available—both factors adding further complexity to the integration challenges.
Technical challenge: Bridging different systems
The integration between Poe and Amazon Bedrock presented fundamental architectural challenges that required innovative solutions. These systems were built with different design philosophies and communication patterns, creating a significant technical divide that the wrapper API needed to bridge.
Architectural divide
The core challenge stems from the fundamentally different architectural approaches of the two systems. Understanding these differences is essential to appreciating the complexity of the integration solution. Poe operates on a modern, reactive, ServerSentEvents-based architecture through the FastAPI library (fastapi_poe). This architecture is stream-optimized for real-time interactions and uses an event-driven response model designed for continuous, conversational AI. Amazon Bedrock, on the other hand, functions as an enterprise cloud service. It offers REST-based APIs with AWS SDK access patterns, SigV4 authentication requirements, AWS Region-specific model availability, and a traditional request-response pattern with streaming options. This fundamental API mismatch creates several technical challenges that the Poe wrapper API solves, as detailed in the following table.

| Challenge Category | Technical Issue | Source Protocol | Target Protocol | Integration Complexity |
| --- | --- | --- | --- | --- |
| Protocol Translation | Converting between WebSocket-based protocol and REST APIs | WebSocket (bidirectional, persistent) | REST (request/response, stateless) | High: Requires protocol bridging |
| Authentication Bridging | Connecting JWT validation with AWS SigV4 signing | JWT token validation | AWS SigV4 authentication | Medium: Credential transformation needed |
| Response Format Transformation | Adapting JSON responses into expected format | Standard JSON structure | Custom format requirements | Medium: Data structure mapping |
| Streaming Reconciliation | Mapping chunked responses to ServerSentEvents | Chunked HTTP responses | ServerSentEvents stream | High: Real-time data flow conversion |
| Parameter Standardization | Creating unified parameter space across models | Model-specific parameters | Standardized parameter interface | Medium: Parameter normalization |
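
To make the streaming reconciliation challenge concrete, the following is an illustrative sketch of how a Bedrock streaming response can be re-emitted as Poe partial responses; it is a simplified example under stated assumptions, not the wrapper's actual code:

import boto3
from fastapi_poe import PartialResponse

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

async def stream_bedrock_as_poe_events(model_id, user_text):
    # Call the Bedrock Converse streaming API (chunked HTTP response)
    response = bedrock_runtime.converse_stream(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": user_text}]}],
    )
    for event in response["stream"]:
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            # Each Bedrock chunk becomes a ServerSentEvents-style partial response for Poe
            yield PartialResponse(text=delta["text"])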

API evolution and the Converse API
In May 2024, Amazon Bedrock introduced the Converse API, which offered standardization benefits that significantly simplified the integration architecture:

Unified interface across diverse model providers (such as Anthropic, Meta, and Mistral)
Conversation memory with consistent handling of chat history
Streaming and non-streaming modes through a single API pattern
Multimodal support for text, images, and structured data
Parameter normalization that reduces model-specific implementation quirks
Built-in content moderation capabilities

The solution presented in this post uses the Converse API where appropriate, while also maintaining compatibility with model-specific APIs for specialized capabilities. This hybrid approach provides flexibility while taking advantage of the Converse API’s standardization benefits.
Solution overview
The wrapper API framework provides a unified interface between Poe and Amazon Bedrock models. It serves as a translation layer that normalizes the differences between models and protocols while maintaining the unique capabilities of each model.
The solution architecture follows a modular design that separates concerns and enables flexible scaling, as illustrated in the following diagram.

The wrapper API consists of several key components working together to provide a seamless integration experience:

Client – The entry point where users interact with AI capabilities through various interfaces.
Poe layer – Consists of the following:

Poe UI – Handles user experience, request formation, parameters controls, file uploads, and response visualization.
Poe FastAPI – Standardizes user interactions and manages the communication protocol between clients and underlying systems.

Bot Factory – Dynamically creates appropriate model handlers (bots) based on the requested model type (chat, image, or video). This factory pattern provides extensibility for new model types and variations. See the following code:

# From core/bot_factory.py - Actual implementation
class BotFactory:
    """
    Factory for creating different types of bots.
    Handles bot creation based on the bot type and configuration.
    """
    @staticmethod
    def create_bot(bot_config: BotConfig) -> PoeBot:
        # Check if a custom bot class is specified
        if hasattr(bot_config, 'bot_class') and bot_config.bot_class:
            # Use the custom bot class directly
            bot = bot_config.bot_class(bot_config)

            # Explicitly ensure we're returning a PoeBot
            if not isinstance(bot, PoeBot):
                raise TypeError(f"Custom bot class must return a PoeBot instance, got {type(bot)}")
            return bot

        # Determine bot type based on configuration
        if hasattr(bot_config, 'enable_video_generation') and bot_config.enable_video_generation:
            # Video generation bot
            if 'luma' in bot_config.bot_name:
                from core.refactored_luma_bot import LumaVideoBot
                return LumaVideoBot(bot_config)
            else:
                from core.refactored_nova_reel_bot import NovaReelVideoBot
                return NovaReelVideoBot(bot_config)

        elif hasattr(bot_config, 'enable_image_generation') and bot_config.enable_image_generation:
            # Image generation bot
            if hasattr(bot_config, 'model_id') and "stability" in bot_config.model_id.lower():
                # Stability AI image generation bot
                from core.refactored_image_stability_ai import AmazonBedrockImageStabilityAIBot
                return AmazonBedrockImageStabilityAIBot(bot_config)
            else:
                # Other image generation bot (Titan, Canvas, etc.)
                from core.refactored_image_bot_amazon import RefactoredAmazonImageGenerationBot
                return RefactoredAmazonImageGenerationBot(bot_config)

        else:
            # Check if this is a Claude 3.7 model
            if hasattr(bot_config, 'model_id') and "claude-3-7" in bot_config.model_id.lower():
                return ClaudePlusBot(bot_config)
            else:
                # Default to standard chat bot
                return RefactoredAmazonBedrockPoeBot(bot_config)

Service manager – Orchestrates the services needed to process requests effectively. It coordinates between different specialized services, including:

Token services – Managing token limits and counting.
Streaming services – Handling real-time responses.
Error services – Normalizing and handling errors.
AWS service integration – Managing API calls to Amazon Bedrock.

AWS services component – Converts responses from Amazon Bedrock format to Poe’s expected format and vice versa, handling streaming chunks, image data, and video outputs.
Amazon Bedrock layer – Amazon’s FM service that provides the actual AI processing capabilities and model hosting, including:

Model diversity – Provides access to over 30 text models (such as Amazon Titan, Amazon Nova, Anthropic’s Claude, Meta’s Llama, Mistral, and more), image models, and video models.
API structure – Exposes both model-specific APIs and the unified Converse API.
Authentication – Requires AWS SigV4 signing for secure access to model endpoints.
Response management – Returns model outputs with standardized metadata and usage statistics.

The request processing flow in this unified wrapper API shows the orchestration required when bridging Poe’s event-driven ServerSentEvents protocol with Amazon Bedrock REST-based APIs, showcasing how multiple specialized services work together to deliver a seamless user experience.
The flow begins when a client sends a request through Poe’s interface, which then forwards it to the Bot Factory component. This factory pattern dynamically creates the appropriate model handler based on the requested model type, whether for chat, image, or video generation. The service manager component then orchestrates the various specialized services needed to process the request effectively, including token services, streaming services, and error handling services.
The following sequence diagram illustrates the complete request processing flow.

Configuration template for rapid multi-bot deployment
The most powerful aspect of the wrapper API is its unified configuration template system, which supports rapid deployment and management of multiple bots with minimal code changes. This approach is central to the solution’s success in reducing deployment time.
The system uses a template-based configuration approach with shared defaults and model-specific overrides:

# Bot configurations using the template pattern

CHAT_BOTS = {
    'poe-nova-micro': BotConfig(
        # Identity
        bot_name='poe-nova-micro',
        model_id='amazon.nova-micro-v1:0',
        aws_region=aws_config['region'],
        poe_access_key='XXXXXXXXXXXXXXXXXXXXXX',

        # Model-specific parameters
        supports_system_messages=True,
        enable_image_comprehension=True,
        expand_text_attachments=True,
        streaming=True,
        max_tokens=1300,
        temperature=0.7,
        top_p=0.9,

        # Model-specific pricing
        enable_monetization=True,
        pricing_type="variable",
        input_token_cost_milli_cents=2,
        output_token_cost_milli_cents=4,
        image_analysis_cost_milli_cents=25,

        # Generate rate card with model-specific values
        custom_rate_card=create_rate_card(2, 4, 25),

        # Include common parameters
        **DEFAULT_CHAT_CONFIG
    ),

    'poe-mistral-pixtral': BotConfig(
        # Identity
        bot_name='poe-mistral-pixtral',
        model_id='us.mistral.pixtral-large-2502-v1:0',
        aws_region=aws_config['region'],
        poe_access_key='XXXXXXXXXXXXXXXXXXXXXX',

        # Model-specific parameters
        supports_system_messages=False,
        enable_image_comprehension=False,
        # ...
        # Include common parameters
        **DEFAULT_CHAT_CONFIG
    )
}
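
The configuration above references DEFAULT_CHAT_CONFIG and create_rate_card, which aren’t shown in the excerpt. The following is a minimal sketch of what they might look like; the specific fields and rate card wording are assumptions for illustration only.

# Hypothetical sketch of the shared defaults and rate card helper referenced above.
# Field values and wording are illustrative assumptions, not the actual implementation.

DEFAULT_CHAT_CONFIG = {
    'modal_app_name': 'bedrock-poe-wrapper',  # assumed shared Modal app name
    'allow_attachments': True,
    'optimize_latency': False,
}

def create_rate_card(input_cost: int, output_cost: int, image_cost: int) -> str:
    """Build a markdown rate card from per-unit costs in thousandths of a cent."""
    return (
        "| Item | Cost (milli-cents) |\n"
        "| --- | --- |\n"
        f"| Input token | {input_cost} |\n"
        f"| Output token | {output_cost} |\n"
        f"| Image analysis | {image_cost} |"
    )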

This configuration-driven architecture offers several significant advantages:

Rapid deployment – Adding new models requires only creating a new configuration entry rather than writing integration code. This is a key factor in the significant improvement in deployment time.
Consistent parameter management – Common parameters are defined one time in DEFAULT_CHAT_CONFIG and inherited by bots, maintaining consistency and reducing duplication.
Model-specific customization – Each model can have its own unique settings while still benefiting from the shared infrastructure.
Operational flexibility – Parameters can be adjusted without code changes, allowing for quick experimentation and optimization.
Centralized credential management – AWS credentials are managed in one place, improving security and simplifying updates.
Region-specific deployment – Models can be deployed to different Regions as needed, with Region settings controlled at the configuration level.

The BotConfig class provides a structured way to define bot configurations with type validation:

# From config/bot_config.py – Actual implementation (partial)
from typing import Optional
from pydantic import BaseModel, Field

class BotConfig(BaseModel):
    # Core Bot Identity
    bot_name: str = Field(..., description="Name of the bot")
    model_id: str = Field(..., description="Identifier for the AI model")

    # AWS Configuration
    aws_region: Optional[str] = Field(default="us-east-1", description="AWS region for deployment")
    aws_access_key: Optional[str] = Field(default=None, description="AWS access key")
    aws_secret_key: Optional[str] = Field(default=None, description="AWS secret key")
    aws_security_token: Optional[str] = None

    # Poe Configuration
    poe_access_key: str = Field(..., description="Poe access key")
    modal_app_name: str = Field(..., description="Modal app name")

    # Capability Flags
    allow_attachments: bool = Field(default=True, description="Whether to allow file attachments in Poe")
    supports_system_messages: bool = Field(default=False)
    enable_image_comprehension: bool = Field(default=False)
    expand_text_attachments: bool = Field(default=False)
    streaming: bool = Field(default=False)
    enable_image_generation: bool = Field(default=False)
    enable_video_generation: bool = Field(default=False)

    # Inference Configuration
    max_tokens: Optional[int] = Field(default=None, description="Maximum number of tokens to generate")
    temperature: Optional[float] = Field(default=None, description="Temperature for sampling")
    top_p: Optional[float] = Field(default=None, description="Top-p sampling parameter")
    optimize_latency: bool = Field(default=False, description="Enable latency optimization with performanceConfig")

    # Reasoning Configuration (Claude 3.7+)
    enable_reasoning: bool = Field(default=False, description="Enable Claude's reasoning capability")
    reasoning_budget: Optional[int] = Field(default=1024, description="Token budget for reasoning (1024-4000 recommended)")

    # Monetization Configuration
    enable_monetization: bool = Field(default=False, description="Enable variable pricing monetization")
    custom_rate_card: Optional[str] = Field(
        default=None,
        description="Custom rate card for variable pricing in markdown format"
    )
    input_token_cost_milli_cents: Optional[int] = Field(
        default=None,
        description="Cost per input token in thousandths of a cent"
    )
    output_token_cost_milli_cents: Optional[int] = Field(
        default=None,
        description="Cost per output token in thousandths of a cent"
    )
    image_analysis_cost_milli_cents: Optional[int] = Field(
        default=None,
        description="Cost per image analysis in thousandths of a cent"
    )
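
As a quick illustration of the type validation this provides, constructing a BotConfig with the required fields succeeds and picks up the declared defaults, while an incomplete configuration fails fast with a Pydantic validation error. The values below are placeholders.

from pydantic import ValidationError

# Valid configuration: required fields supplied, defaults fill in the rest (placeholder values)
config = BotConfig(
    bot_name="poe-example-bot",
    model_id="amazon.nova-micro-v1:0",
    poe_access_key="XXXXXXXXXXXXXXXXXXXXXX",
    modal_app_name="bedrock-poe-wrapper",
)
print(config.aws_region)  # "us-east-1" (default)

# Invalid configuration: missing required fields is caught at construction time
try:
    BotConfig(bot_name="incomplete-bot")
except ValidationError as e:
    print(e)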

Advanced multimodal capabilities
One of the most powerful aspects of the framework is how it handles multimodal capabilities through simple configuration flags:

enable_image_comprehension – When set to True for text-only models like Amazon Nova Micro, Poe itself uses vision capabilities to analyze images and convert them into text descriptions that are sent to the Amazon Bedrock model. This enables even text-only models to classify images without having built-in vision capabilities.
expand_text_attachments – When set to True, Poe parses uploaded text files and includes their content in the conversation, enabling models to work with document content without requiring special file handling capabilities.
supports_system_messages – This parameter controls whether the model can accept system prompts, allowing for consistent behavior across models with different capabilities.
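
For example, the same flags can give a text-only model pseudo-multimodal behavior while letting a natively multimodal model rely on its built-in vision. The two entries below are illustrative assumptions (hypothetical bot names and an assumed shared Modal app name), not configurations from the actual deployment.

# Illustrative only: two hypothetical entries showing how the flags differ by model type
MULTIMODAL_EXAMPLES = {
    # Text-only model: Poe pre-processes images into text descriptions for the model
    'poe-text-only-example': BotConfig(
        bot_name='poe-text-only-example',
        model_id='amazon.nova-micro-v1:0',
        poe_access_key='XXXXXXXXXXXXXXXXXXXXXX',
        modal_app_name='bedrock-poe-wrapper',
        enable_image_comprehension=True,   # Poe converts images to text for the model
        expand_text_attachments=True,      # Poe inlines text file contents
        supports_system_messages=True,
    ),
    # Natively multimodal model: pass images through and rely on built-in vision
    'poe-native-vision-example': BotConfig(
        bot_name='poe-native-vision-example',
        model_id='us.mistral.pixtral-large-2502-v1:0',
        poe_access_key='XXXXXXXXXXXXXXXXXXXXXX',
        modal_app_name='bedrock-poe-wrapper',
        enable_image_comprehension=False,  # let the model's own vision handle images
        supports_system_messages=False,
    ),
}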

These configuration flags create a powerful abstraction layer that offers the following benefits:

Extends model capabilities – Text-only models gain pseudo-multimodal capabilities through Poe’s preprocessing
Optimizes built-in features – True multimodal models can use their built-in capabilities for optimal results
Simplifies integration – It’s controlled through simple configuration switches rather than code changes
Maintains consistency – It provides a uniform user experience regardless of the underlying model’s native capabilities

Next, we explore the technical implementation of the solution in more detail.
Protocol translation layer
The most technically challenging aspect of the solution was bridging between Poe’s API protocols and the diverse model interfaces available through Amazon Bedrock. The team accomplished this through a sophisticated protocol translation layer:

# From services/streaming_service.py – Actual implementation
def _extract_content_from_event(self, event: Dict[str, Any]) -> Optional[str]:
    """Extract content from a streaming event based on model provider."""
    try:
        # Handle Anthropic Claude models
        if "message" in event:
            message = event.get("message", {})
            if "content" in message and isinstance(message["content"], list):
                for content_item in message["content"]:
                    if content_item.get("type") == "text":
                        return content_item.get("text", "")
            elif "content" in message:
                return str(message.get("content", ""))

        # Handle Amazon Titan models
        if "delta" in event:
            delta = event.get("delta", {})
            if "text" in delta:
                return delta.get("text", "")

        # Handle other model formats
        if "chunk" in event:
            chunk_data = event.get("chunk", {})
            if "bytes" in chunk_data:
                # Process binary data if present
                try:
                    text = chunk_data["bytes"].decode("utf-8")
                    return json.loads(text).get("completion", "")
                except Exception:
                    self.logger.warning("Failed to decode bytes in chunk")

        # No matching format found
        return None
    except Exception as e:
        self.logger.warning(f"Failed to extract content from event: {str(e)}")
        return None

This translation layer handles subtle differences between models and makes sure that regardless of which Amazon Bedrock model is being used, the response to Poe is consistent and follows Poe’s expected format.
Error handling and normalization
A critical aspect of the implementation is comprehensive error handling and normalization. The ErrorService provides consistent error handling across different models:

# Simplified example of error handling (not actual code)
from botocore.exceptions import ClientError

class ErrorService:
    def normalize_Amazon_Bedrock_error(self, error: Exception) -> str:
        """Normalize Amazon Bedrock errors into a consistent format."""
        if isinstance(error, ClientError):
            if "ThrottlingException" in str(error):
                return "The model is currently experiencing high demand. Please try again in a moment."
            elif "ValidationException" in str(error):
                return "There was an issue with the request parameters. Please try again with different settings."
            elif "AccessDeniedException" in str(error):
                return "Access to this model is restricted. Please check your permissions."
            else:
                return f"An error occurred while communicating with the model: {str(error)}"
        elif isinstance(error, ConnectionError):
            return "Connection error. Please check your network and try again."
        else:
            return f"An unexpected error occurred: {str(error)}"

This approach makes sure users receive meaningful error messages regardless of the underlying model or error condition.
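
As an illustration, a throttling error raised by botocore maps to the friendlier message above. The ClientError constructed below simulates what the SDK would raise; the operation name is a placeholder.

from botocore.exceptions import ClientError

error_service = ErrorService()

# Simulate the ClientError botocore raises when Amazon Bedrock throttles a request
throttle = ClientError(
    error_response={"Error": {"Code": "ThrottlingException", "Message": "Rate exceeded"}},
    operation_name="ConverseStream",
)
print(error_service.normalize_Amazon_Bedrock_error(throttle))
# -> "The model is currently experiencing high demand. Please try again in a moment."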
Token counting and optimization
The system implements sophisticated token counting and optimization to maximize effective use of models:

# From services/streaming_service.py – Actual implementation (partial)
# Calculate approximate JSON overhead
user_message_tokens = 0
for msg in conversation['messages']:
    for content_block in msg.get('content', []):
        if 'text' in content_block:
            # Simple word-based estimation of actual text content
            user_message_tokens += len(content_block['text'].split())

# Estimate JSON structure overhead (difference between total and content)
json_overhead = int((input_tokens - system_tokens) - user_message_tokens)

# Ensure we're working with integers for calculations
input_tokens_for_pct = int(input_tokens)
system_tokens_for_pct = int(system_tokens)
json_overhead_for_pct = int(json_overhead)

# Calculate percentage with float arithmetic and proper integer division
json_overhead_percent = (float(json_overhead_for_pct) / max(1, input_tokens_for_pct - system_tokens_for_pct)) * 100

This detailed token tracking enables accurate cost estimation and optimization, facilitating efficient use of model resources.
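
As a sketch of how these counts might feed the variable-pricing fields defined in BotConfig, a per-request cost estimate could be computed as follows. This assumes costs are expressed per token in thousandths of a cent, as the field names suggest; it is not the wrapper’s actual billing logic.

def estimate_cost_milli_cents(config, input_tokens: int, output_tokens: int, images: int = 0) -> float:
    """Rough per-request cost estimate from the BotConfig pricing fields (illustrative sketch)."""
    cost = 0.0
    cost += input_tokens * (config.input_token_cost_milli_cents or 0)
    cost += output_tokens * (config.output_token_cost_milli_cents or 0)
    cost += images * (config.image_analysis_cost_milli_cents or 0)
    return cost

With the poe-nova-micro pricing shown earlier (2, 4, and 25 milli-cents), 1,000 input tokens and 300 output tokens come to 3,200 milli-cents, or about 3.2 cents.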
AWS authentication and security
The AwsClientService handles authentication and security for Amazon Bedrock API calls. This implementation provides secure authentication with AWS services along with proper error handling and connection management.
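
The SigV4 signing itself is performed by the underlying aiobotocore/botocore client, but the following sketch shows how a Bedrock runtime request could be signed explicitly with botocore if needed. The region, model ID, and request body are placeholders, and the snippet assumes AWS credentials are already configured locally.

import json
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

session = boto3.Session()
credentials = session.get_credentials()  # assumes credentials are configured locally
region = "us-east-1"                     # placeholder region

# Placeholder Bedrock runtime InvokeModel request
model_id = "amazon.nova-micro-v1:0"
url = f"https://bedrock-runtime.{region}.amazonaws.com/model/{model_id}/invoke"
body = json.dumps({"messages": [{"role": "user", "content": [{"text": "Hello"}]}]})

request = AWSRequest(method="POST", url=url, data=body,
                     headers={"Content-Type": "application/json"})
SigV4Auth(credentials, "bedrock", region).add_auth(request)  # adds Authorization and x-amz-date headers

print(request.headers["Authorization"][:60], "...")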
Comparative analysis
The implementation of the wrapper API dramatically improved the efficiency and capabilities of deploying Amazon Bedrock models on Poe, as detailed in the following table.

| Feature | Before (Direct API) | After (Wrapper API) |
| --- | --- | --- |
| Deployment Time | Days per model | Minutes per model |
| Developer Focus | Configuration and plumbing | Innovation and features |
| Model Diversity | Limited by integration capacity | Extensive (across Amazon Bedrock models) |
| Maintenance Overhead | High (separate code for each model) | Low (configuration-based) |
| Error Handling | Custom per model | Standardized across models |
| Cost Tracking | Complex (multiple integrations) | Simplified (centralized) |
| Multimodal Support | Fragmented | Unified |
| Security | Varied implementations | Consistent best practices |

This comparison highlights the significant improvements achieved through the wrapper API approach, demonstrating the value of investing in a robust abstraction layer.
Performance metrics and business impact
The wrapper API framework delivered significant and measurable business impact across multiple dimensions, including increased model diversity, deployment efficiency, and developer productivity.
Poe was able to rapidly expand its model offerings, integrating dozens of Amazon Bedrock models across text, image, and video modalities. This expansion occurred over a period of weeks rather than the months it would have taken with the previous approach.
The following table summarizes the deployment efficiency metrics.

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| New Model Deployment | 2–3 days | 15 minutes | 96x faster |
| Code Changes Required | 500+ lines | 20–30 lines | 95% reduction |
| Testing Time | 8–12 hours | 30–60 minutes | 87% reduction |
| Deployment Steps | 10–15 steps | 3–5 steps | 75% reduction |

These metrics were measured through direct comparison of engineering hours required before and after implementation, tracking actual deployments of new models.
The engineering team saw a dramatic shift in focus from integration work to feature development, as detailed in the following table.

| Activity | Before (% of time) | After (% of time) | Change |
| --- | --- | --- | --- |
| API Integration | 65% | 15% | -50% |
| Feature Development | 20% | 60% | +40% |
| Testing | 10% | 15% | +5% |
| Documentation | 5% | 10% | +5% |

Scaling and performance considerations
The wrapper API is designed to handle high-volume production workloads with robust scaling capabilities.
Connection pooling
To handle multiple concurrent requests efficiently, the wrapper implements connection pooling using aiobotocore. This allows it to maintain a pool of connections to Amazon Bedrock, reducing the overhead of establishing new connections for each request:

# From services/aws_service.py – Connection management
from botocore.config import Config

async def setup_client(self) -> None:
    """Initialize AWS client with proper configuration."""
    async with self._client_lock:
        try:
            # Always clean up existing clients first to avoid stale connections
            if self.Amazon_Bedrock_client:
                await self.cleanup()

            # Increase timeout for image generation
            config = Config(
                read_timeout=300,  # 5 minutes timeout
                retries={'max_attempts': 3, 'mode': 'adaptive'},
                connect_timeout=30  # 30 second connection timeout
            )

            # Create the Amazon Bedrock client with proper error handling
            self.Amazon_Bedrock_client = await self.session.create_client(
                service_name="bedrock-runtime",
                region_name=self.bot_config.aws_region,
                aws_access_key_id=self.bot_config.aws_access_key,
                aws_secret_access_key=self.bot_config.aws_secret_key,
                aws_session_token=self.bot_config.aws_security_token,
                config=config
            ).__aenter__()
        except Exception as e:
            self.Amazon_Bedrock_client = None
            raise

Asynchronous processing
The entire framework uses asynchronous processing to handle concurrent requests efficiently:

# From core/refactored_chat_bot.py – Asynchronous request handling
async def get_response(self, query: QueryRequest) -> AsyncIterable[PartialResponse]:
    try:
        # Ensure AWS client is set up
        await aws_service.setup_client()

        # Validate and format the conversation
        conversation = await conversation_service.validate_conversation(query)

        # Process the request with streaming
        if self.bot_config.streaming:
            async for chunk in streaming_service.stream_Amazon_Bedrock_response(conversation, request_id):
                yield chunk
        else:
            # Non-streaming mode
            response_text, input_tokens, output_tokens = await streaming_service.non_stream_Amazon_Bedrock_response(conversation, request_id)
            if response_text:
                yield PartialResponse(text=response_text)
            else:
                yield PartialResponse(text=self.bot_config.fallback_response)
            # Send done event for non-streaming mode
            yield self.done_event()

    except Exception as e:
        # Error handling
        error_message = error_service.log_error(e, request_id, "Error during request processing")
        yield PartialResponse(text=error_message)
        yield self.done_event()

Error recovery and retry logic
The system implements sophisticated error recovery and retry logic to handle transient issues:

# From services/streaming_service.py – Retry logic
max_retries = 3
base_delay = 1  # Start with 1 second delay

for attempt in range(max_retries):
    try:
        if not self.aws_service.Amazon_Bedrock_client:
            yield PartialResponse(text="Error: Amazon Bedrock client is not initialized")
            break

        response = await self.aws_service.Amazon_Bedrock_client.converse_stream(**stream_config)
        # Process response...
        break  # Success, exit retry loop

    except ClientError as e:
        if "ThrottlingException" in str(e):
            if attempt < max_retries - 1:
                delay = base_delay * (2 ** attempt)  # Exponential backoff
                await asyncio.sleep(delay)
                continue
        error_message = f"Amazon Bedrock API Error: {str(e)}"
        yield PartialResponse(text=f"Error: {error_message}")
        break

Performance metrics
The system collects detailed performance metrics to help identify bottlenecks and optimize performance:

# From services/streaming_service.py – Performance metrics
# Log token usage and latency
latency = time.perf_counter() - start_time

self.logger.info(
    f"[{request_id}] Streaming Response Metrics:\n"
    f"  Time to First Token: {first_token_time:.4f} seconds\n"
    f"  Input Tokens: {input_tokens} (includes system prompt)\n"
    f"  Input Tokens for Billing: {input_tokens - system_tokens} (excludes system prompt)\n"
    f"  Output Tokens: {output_tokens}\n"
    f"  Total Tokens: {total_tokens}\n"
    f"  Amazon Bedrock Latency: {latency:.4f} seconds\n"
    f"  Latency Optimization: {'enabled' if hasattr(self.bot_config, 'optimize_latency') and self.bot_config.optimize_latency else 'disabled'}"
)

Security considerations
Security is a critical aspect of the wrapper implementation, with several key features to support secure operation.
JWT validation with AWS SigV4 signing
The system integrates JWT validation for Poe’s authentication with AWS SigV4 signing for Amazon Bedrock API calls:

JWT validation – Makes sure only authorized Poe requests can access the wrapper API
SigV4 signing – Makes sure the wrapper API can securely authenticate with Amazon Bedrock
Credential management – AWS credentials are securely managed and not exposed to clients
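
The wrapper’s request authentication is handled by the Poe bot framework, but the general shape of JWT validation looks like the following sketch. PyJWT, the shared secret, and the HS256 algorithm are assumptions for illustration, not the actual mechanism.

import jwt  # PyJWT; illustrative assumption, not the wrapper's actual auth code

def validate_poe_request(auth_header: str, shared_secret: str) -> dict:
    """Reject requests whose bearer token is missing, expired, or not signed with the shared secret."""
    if not auth_header.startswith("Bearer "):
        raise PermissionError("Missing bearer token")
    token = auth_header.removeprefix("Bearer ")
    try:
        # Signature and expiry are checked here; additional claims could be enforced if needed
        return jwt.decode(token, shared_secret, algorithms=["HS256"])
    except jwt.InvalidTokenError as exc:
        raise PermissionError(f"Invalid token: {exc}")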

Secrets management
The system integrates with AWS Secrets Manager to securely store and retrieve sensitive credentials:

# From services/aws_service.py – Secrets management
@staticmethod
def get_secret(secret_name: str, region_name: str = "us-east-1") -> Dict[str, Any]:
    """
    Retrieve a secret from AWS Secrets Manager.

    Args:
        secret_name: Name of the secret to retrieve
        region_name: AWS region where the secret is stored

    Returns:
        Dict[str, Any]: The secret value as a dictionary
    """
    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except Exception as e:
        logging.error(f"Error retrieving secret {secret_name}: {str(e)}")
        raise

    # Depending on whether the secret is a string or binary, one of these fields will be populated.
    if 'SecretString' in get_secret_value_response:
        import json
        try:
            # Explicitly annotate the return type for mypy
            result: Dict[str, Any] = json.loads(get_secret_value_response['SecretString'])
            return result
        except json.JSONDecodeError:
            # If not a JSON, return as a single-key dictionary
            return {"SecretString": get_secret_value_response['SecretString']}
    else:
        import base64
        decoded_binary_secret = base64.b64decode(get_secret_value_response['SecretBinary'])
        return {"SecretBinary": decoded_binary_secret}
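
Calling it is straightforward. The secret name and key below are placeholders, and this assumes the static method lives on the AwsClientService class mentioned earlier:

# Placeholder secret name and key; the actual names depend on how credentials are organized
secrets = AwsClientService.get_secret("poe-bedrock-wrapper/credentials", region_name="us-east-1")
poe_access_key = secrets.get("poe_access_key")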

Secure connection management
The system implements secure connection management to help prevent credential leakage and facilitate proper cleanup:

# From services/aws_service.py – Secure connection cleanup
async def cleanup(self) -> None:
    """Clean up AWS client resources."""
    try:
        if self.Amazon_Bedrock_client:
            try:
                await self.Amazon_Bedrock_client.__aexit__(None, None, None)
            except Exception as e:
                self.logger.error(f"Error closing Amazon Bedrock client: {str(e)}")
            finally:
                self.Amazon_Bedrock_client = None

        self.logger.info("Successfully cleaned up AWS client resources")
    except Exception as e:
        # Even if cleanup fails, reset the references to avoid stale connections
        self.Amazon_Bedrock_client = None

Troubleshooting and debugging
The wrapper API includes comprehensive logging and debugging capabilities to help identify and resolve issues. The system implements detailed logging throughout the request processing flow. Each request is assigned a unique ID that is used throughout the processing flow to enable tracing:

# From core/refactored_chat_bot.py – Request tracing
request_id = str(id(query))
start_time = time.perf_counter()

# Used in all log messages
self.logger.info(f"[{request_id}] Incoming request received")

Lessons learned and best practices
Through this collaboration, several important technical insights emerged that might benefit others undertaking similar projects:

Configuration-driven architecture – Using configuration files rather than code for model-specific behaviors proved enormously beneficial for maintenance and extensibility. This approach allowed new models to be added without code changes, significantly reducing the risk of introducing bugs.
Protocol translation challenges – The most complex aspect was handling the subtle differences in streaming protocols between different models. Building a robust abstraction required careful consideration of edge cases and comprehensive error handling.
Error normalization – Creating a consistent error experience across diverse models required sophisticated error handling that could translate model-specific errors into user-friendly, actionable messages. This improved both developer and end-user experiences.
Type safety – Strong typing (using Python’s type hints extensively) was crucial for maintaining code quality across a complex codebase with multiple contributors. This practice reduced bugs and improved code maintainability.
Security first – Integrating Secrets Manager from the start made sure credentials were handled securely throughout the system’s lifecycle, helping prevent potential security vulnerabilities.

Conclusion
The collaboration between the AWS Generative AI Innovation Center and Quora demonstrates how thoughtful architectural design can dramatically accelerate AI deployment and innovation. By creating a unified wrapper API for Amazon Bedrock models, the teams were able to reduce deployment time from days to minutes while expanding model diversity and improving user experience.
This approach—focusing on abstraction, configuration-driven development, and robust error handling—offers valuable lessons for organizations looking to integrate multiple AI models efficiently. The patterns and techniques demonstrated in this solution can be applied to similar challenges across a wide range of AI integration scenarios.
For technology leaders and developers working on similar challenges, this case study highlights the value of investing in flexible integration frameworks rather than point-to-point integrations. The initial investment in building a robust abstraction layer pays dividends in long-term maintenance and capability expansion.
To learn more about implementing similar solutions, explore the following resources:

The AWS Well-Architected Framework for best practices in building secure, high-performing, resilient, and efficient infrastructure
The Amazon Bedrock Developer Guide for detailed information on working with FMs
The AWS Generative AI Innovation Center for assistance with your generative AI projects
AWS Prescriptive Guidance for LLM Deployment for best practices in deploying large language models

The AWS Generative AI Innovation Center and Quora teams continue to collaborate on enhancements to this framework, making sure Poe users have access to the latest and most capable AI models with minimal deployment delay.

About the authors
Dr. Gilbert V Lepadatu is a Senior Deep Learning Architect at the AWS Generative AI Innovation Center, where he helps enterprise customers design and deploy scalable, cutting-edge GenAI solutions. With a PhD in Philosophy and dual Master’s degrees, he brings a holistic and interdisciplinary approach to data science and AI.
Nick Huber is the AI Ecosystem Lead for Poe (by Quora), where he is responsible for ensuring high-quality & timely integrations of the leading AI models onto the Poe platform.

Building an Advanced Convolutional Neural Network with Attention for DNA Sequence Classification and Interpretability

In this tutorial, we take a hands-on approach to building an advanced convolutional neural network for DNA sequence classification. We focus on simulating real biological tasks, such as promoter prediction, splice site detection, and regulatory element identification. By combining one-hot encoding, multi-scale convolutional layers, and an attention mechanism, we design a model that not only learns complex motifs but also provides interpretability. As we progress, we generate synthetic data, train with robust callbacks, and visualize results to ensure we fully understand the strengths and limitations of our approach. Check out the FULL CODES here.

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import random

np.random.seed(42)
tf.random.set_seed(42)
random.seed(42)

We begin by importing the libraries for deep learning, data handling, and visualization. We set random seeds to ensure reproducibility so that our experiments run consistently each time. Check out the FULL CODES here.

class DNASequenceClassifier:
    def __init__(self, sequence_length=200, num_classes=2):
        self.sequence_length = sequence_length
        self.num_classes = num_classes
        self.model = None
        self.history = None

    def one_hot_encode(self, sequences):
        mapping = {'A': 0, 'T': 1, 'G': 2, 'C': 3}
        encoded = np.zeros((len(sequences), self.sequence_length, 4))

        for i, seq in enumerate(sequences):
            for j, nucleotide in enumerate(seq[:self.sequence_length]):
                if nucleotide in mapping:
                    encoded[i, j, mapping[nucleotide]] = 1
        return encoded

    def attention_layer(self, inputs, name="attention"):
        attention_weights = layers.Dense(1, activation='tanh', name=f"{name}_weights")(inputs)
        attention_weights = layers.Flatten()(attention_weights)
        attention_weights = layers.Activation('softmax', name=f"{name}_softmax")(attention_weights)
        attention_weights = layers.RepeatVector(inputs.shape[-1])(attention_weights)
        attention_weights = layers.Permute([2, 1])(attention_weights)

        attended = layers.Multiply(name=f"{name}_multiply")([inputs, attention_weights])
        return layers.GlobalMaxPooling1D()(attended)

    def build_model(self):
        inputs = layers.Input(shape=(self.sequence_length, 4), name="dna_input")

        conv_layers = []
        filter_sizes = [3, 7, 15, 25]

        for i, filter_size in enumerate(filter_sizes):
            conv = layers.Conv1D(
                filters=64,
                kernel_size=filter_size,
                activation='relu',
                padding='same',
                name=f"conv_{filter_size}"
            )(inputs)
            conv = layers.BatchNormalization(name=f"bn_conv_{filter_size}")(conv)
            conv = layers.Dropout(0.2, name=f"dropout_conv_{filter_size}")(conv)

            attended = self.attention_layer(conv, name=f"attention_{filter_size}")
            conv_layers.append(attended)

        if len(conv_layers) > 1:
            merged = layers.Concatenate(name="concat_multiscale")(conv_layers)
        else:
            merged = conv_layers[0]

        dense = layers.Dense(256, activation='relu', name="dense_1")(merged)
        dense = layers.BatchNormalization(name="bn_dense_1")(dense)
        dense = layers.Dropout(0.5, name="dropout_dense_1")(dense)

        dense = layers.Dense(128, activation='relu', name="dense_2")(dense)
        dense = layers.BatchNormalization(name="bn_dense_2")(dense)
        dense = layers.Dropout(0.3, name="dropout_dense_2")(dense)

        if self.num_classes == 2:
            outputs = layers.Dense(1, activation='sigmoid', name="output")(dense)
            loss = 'binary_crossentropy'
            metrics = ['accuracy', 'precision', 'recall']
        else:
            outputs = layers.Dense(self.num_classes, activation='softmax', name="output")(dense)
            loss = 'categorical_crossentropy'
            metrics = ['accuracy']

        self.model = keras.Model(inputs=inputs, outputs=outputs, name="DNA_CNN_Classifier")

        optimizer = keras.optimizers.Adam(
            learning_rate=0.001,
            beta_1=0.9,
            beta_2=0.999,
            epsilon=1e-7
        )

        self.model.compile(
            optimizer=optimizer,
            loss=loss,
            metrics=metrics
        )

        return self.model

    def generate_synthetic_data(self, n_samples=10000):
        sequences = []
        labels = []

        positive_motifs = ['TATAAA', 'CAAT', 'GGGCGG', 'TTGACA']
        negative_motifs = ['AAAAAAA', 'TTTTTTT', 'CCCCCCC', 'GGGGGGG']

        nucleotides = ['A', 'T', 'G', 'C']

        for i in range(n_samples):
            sequence = ''.join(random.choices(nucleotides, k=self.sequence_length))

            if i < n_samples // 2:
                motif = random.choice(positive_motifs)
                pos = random.randint(0, self.sequence_length - len(motif))
                sequence = sequence[:pos] + motif + sequence[pos + len(motif):]
                label = 1
            else:
                if random.random() < 0.3:
                    motif = random.choice(negative_motifs)
                    pos = random.randint(0, self.sequence_length - len(motif))
                    sequence = sequence[:pos] + motif + sequence[pos + len(motif):]
                label = 0

            sequences.append(sequence)
            labels.append(label)

        return sequences, np.array(labels)

    def train(self, X_train, y_train, X_val, y_val, epochs=50, batch_size=32):
        callbacks = [
            keras.callbacks.EarlyStopping(
                monitor='val_loss',
                patience=10,
                restore_best_weights=True
            ),
            keras.callbacks.ReduceLROnPlateau(
                monitor='val_loss',
                factor=0.5,
                patience=5,
                min_lr=1e-6
            )
        ]

        self.history = self.model.fit(
            X_train, y_train,
            validation_data=(X_val, y_val),
            epochs=epochs,
            batch_size=batch_size,
            callbacks=callbacks,
            verbose=1
        )

        return self.history

    def evaluate_and_visualize(self, X_test, y_test):
        y_pred_proba = self.model.predict(X_test)
        y_pred = (y_pred_proba > 0.5).astype(int).flatten()

        print("Classification Report:")
        print(classification_report(y_test, y_pred))

        fig, axes = plt.subplots(2, 2, figsize=(15, 10))

        axes[0, 0].plot(self.history.history['loss'], label='Training Loss')
        axes[0, 0].plot(self.history.history['val_loss'], label='Validation Loss')
        axes[0, 0].set_title('Training History - Loss')
        axes[0, 0].set_xlabel('Epoch')
        axes[0, 0].set_ylabel('Loss')
        axes[0, 0].legend()

        axes[0, 1].plot(self.history.history['accuracy'], label='Training Accuracy')
        axes[0, 1].plot(self.history.history['val_accuracy'], label='Validation Accuracy')
        axes[0, 1].set_title('Training History - Accuracy')
        axes[0, 1].set_xlabel('Epoch')
        axes[0, 1].set_ylabel('Accuracy')
        axes[0, 1].legend()

        cm = confusion_matrix(y_test, y_pred)
        sns.heatmap(cm, annot=True, fmt='d', ax=axes[1, 0], cmap='Blues')
        axes[1, 0].set_title('Confusion Matrix')
        axes[1, 0].set_ylabel('Actual')
        axes[1, 0].set_xlabel('Predicted')

        axes[1, 1].hist(y_pred_proba[y_test == 0], bins=50, alpha=0.7, label='Negative', density=True)
        axes[1, 1].hist(y_pred_proba[y_test == 1], bins=50, alpha=0.7, label='Positive', density=True)
        axes[1, 1].set_title('Prediction Score Distribution')
        axes[1, 1].set_xlabel('Prediction Score')
        axes[1, 1].set_ylabel('Density')
        axes[1, 1].legend()

        plt.tight_layout()
        plt.show()

        return y_pred, y_pred_proba

We define a DNASequenceClassifier that encodes sequences, learns multi-scale motifs with CNNs, and applies an attention mechanism for interpretability. We build and compile the model, generate synthetic motif-rich data, and then train with robust callbacks and visualize performance to evaluate classification quality. Check out the FULL CODES here.

def main():
    print(" Advanced DNA Sequence Classification with CNN")
    print("=" * 50)

    classifier = DNASequenceClassifier(sequence_length=200, num_classes=2)

    print("Generating synthetic DNA sequences...")
    sequences, labels = classifier.generate_synthetic_data(n_samples=10000)

    print("Encoding DNA sequences...")
    X = classifier.one_hot_encode(sequences)

    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=42, stratify=labels
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
    )

    print(f"Training set: {X_train.shape}")
    print(f"Validation set: {X_val.shape}")
    print(f"Test set: {X_test.shape}")

    print("Building CNN model...")
    model = classifier.build_model()
    print(model.summary())

    print("Training model...")
    classifier.train(X_train, y_train, X_val, y_val, epochs=30, batch_size=64)

    print("Evaluating model...")
    y_pred, y_pred_proba = classifier.evaluate_and_visualize(X_test, y_test)

    print(" Training and evaluation complete!")

if __name__ == "__main__":
    main()

We wrap up the workflow in the main() function, where we generate synthetic DNA data, encode it, split it into training, validation, and test sets, then build, train, and evaluate our CNN model. We conclude by visualizing the performance and confirming that the classification pipeline runs successfully from start to finish.
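
Because each attention branch is named (for example, attention_3_softmax for the filter-size-3 branch), the learned attention weights can also be pulled out of the trained model for inspection. The following is a short sketch, assuming the classifier and X_test from main() are in scope:

# Sketch: inspect where the filter-size-3 attention branch focuses along a sequence.
# Assumes `classifier` has been built and trained and `X_test` is available, as in main().
from tensorflow import keras
import numpy as np

attention_model = keras.Model(
    inputs=classifier.model.input,
    outputs=classifier.model.get_layer("attention_3_softmax").output,
)
weights = attention_model.predict(X_test[:1])       # shape: (1, sequence_length)
top_positions = np.argsort(weights[0])[-10:][::-1]  # ten most-attended positions
print("Most attended positions:", top_positions)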

In conclusion, we successfully demonstrate how a carefully designed CNN with attention can classify DNA sequences with high accuracy and interpretability. We see how synthetic biological motifs help validate the model’s capacity for pattern recognition, and how visualization techniques provide meaningful insights into training dynamics and predictions. Through this journey, we enhance our ability to integrate deep learning architectures with biological data, laying the groundwork for applying these methods to real-world genomics research.

Check out the FULL CODES here.
The post Building an Advanced Convolutional Neural Network with Attention for DNA Sequence Classification and Interpretability appeared first on MarkTechPost.

OpenAI Introduces GPT-5-Codex: An Advanced Version of GPT-5 Further Optimized for Agentic Coding in Codex

OpenAI has just released GPT-5-Codex, a version of GPT-5 further optimized for “agentic coding” tasks within the Codex ecosystem. The goal: improve reliability, speed, and autonomous behavior so that Codex acts more like a teammate, not just a prompt-executor.

Codex is now available across the full developer workflow: CLI, IDE extensions, web, mobile, GitHub code reviews. It integrates well with cloud environments and developer tools.

https://openai.com/index/introducing-upgrades-to-codex/

Key Capabilities / Improvements

Agentic behavior – GPT-5-Codex can take on long, complex, multi-step tasks more autonomously. It balances “interactive” sessions (short feedback loops) with “independent execution” (long refactors, tests, etc.).

Steerability & style compliance – Less need for developers to micro-specify style and hygiene. The model better understands high-level instructions (“do this”, “follow cleanliness guidelines”) without being told every detail each time.

Code review improvements

Trained to catch critical bugs, not just surface or stylistic issues.

It examines the full context: codebase, dependencies, tests.

Can run code & tests to validate behavior.

Evaluated on pull requests / commits from popular open source. Feedback from actual engineers confirms fewer “incorrect/unimportant” comments.

Performance & efficiency

For small requests, the model is “snappier”.

For big tasks, it “thinks more”—spends more compute/time reasoning, editing, iterating.

On internal testing: bottom-10% of user turns (by tokens) use ~93.7% fewer tokens than vanilla GPT-5. Top-10% use roughly twice as much reasoning/iteration.

Tooling & integration improvements

Codex CLI: better tracking of progress (to-do lists), ability to embed/share images (wireframes, screenshots), upgraded terminal UI, improved permission modes.

IDE Extension: works in VSCode, Cursor (and forks); maintains context of open files / selection; allows switching between cloud/local work seamlessly; preview local code changes directly.

Cloud environment enhancements:

Cached containers – median completion time for new tasks and follow-ups drops by roughly 90%.

Automatic setup of environments (scanning for setup scripts, installing dependencies).

Configurable network access and ability to run pip installs etc. at runtime.

Visual & front-end context – The model now accepts image or screenshot inputs (e.g., UI designs or bugs) and can show visual output, such as screenshots of its work. Better human preference performance in mobile web and front-end tasks.

Safety, trust, and deployment controls

Default sandboxed execution (network access disabled unless explicitly permitted).

Approval modes in tools: read-only vs auto access vs full access.

Support for reviewing agent work, terminal logs, test results.

Marked as “High capability” in Biological / Chemical domains; extra safeguards.

Use Cases & Scenarios

Large scale refactoring: changing architecture, propagating context (e.g. threading a variable through many modules) in multiple languages (Python, Go, OCaml) as demonstrated.

Feature additions with tests: generate new functionality and tests, fixing broken tests, handling test failures.

Continuous code reviews: PR review suggestions, catching regressions or security flaws earlier.

Front-end / UI design workflows: prototype or debug UI from specs/screenshots.

Hybrid workflows human + agent: human gives high-level instruction; Codex manages sub-tasks, dependencies, iteration.

https://openai.com/index/introducing-upgrades-to-codex/

Implications

For engineering teams: can shift more burden to Codex for repetitive / structurally heavy work (refactoring, test scaffolding), freeing human time for architectural decisions, design, etc.

For codebases: maintaining consistency in style, dependencies, test coverage could be easier since Codex consistently applies patterns.

For hiring / workflow: teams may need to adjust roles: reviewer focus may shift from “spotting minor errors” to oversight of agent suggestions.

Tool ecosystem: tighter IDE integrations mean workflows become more seamless; code reviews via bots may become more common & expected.

Risk management: organizations will need policy and audit controls for agentic code tasks, especially for production-critical or high-security code.

Comparison: GPT-5 vs GPT-5-Codex

| Dimension | GPT-5 (base) | GPT-5-Codex |
| --- | --- | --- |
| Autonomy on long tasks | Less; more interactive / prompt heavy | More; longer independent execution, iterative work |
| Use in agentic coding environments | Possible, but not optimized | Purpose-built and tuned for Codex workflows only |
| Steerability & instruction compliance | Requires more detailed directions | Better adherence to high-level style / code quality instructions |
| Efficiency (token usage, latency) | More tokens and passes; slower on big tasks | More efficient on small tasks; spends extra reasoning only when needed |

Conclusion

GPT-5-Codex represents a meaningful step forward in AI-assisted software engineering. By optimizing for long tasks, autonomous work, and integrating deeply into developer workflows (CLI, IDE, cloud, code review), it offers tangible improvements in speed, quality, and efficiency. But it does not eliminate the need for expert oversight; safe usage requires policies, review loops, and understanding of the system’s limitations.

Check out the FULL TECHNICAL DETAILS here.
The post OpenAI Introduces GPT-5-Codex: An Advanced Version of GPT-5 Further Optimized for Agentic Coding in Codex appeared first on MarkTechPost.

NVIDIA AI Open-Sources ViPE (Video Pose Engine): A Powerful and Versatile 3D Video Annotation Tool for Spatial AI

How do you create 3D datasets to train AI for robotics without expensive traditional approaches? A team of researchers from NVIDIA released “ViPE: Video Pose Engine for 3D Geometric Perception,” a key advance for Spatial AI. It addresses the central bottleneck that has constrained the field of 3D computer vision for years.

ViPE is a robust, versatile engine designed to process raw, unconstrained, “in-the-wild” video footage and automatically output the critical elements of 3D reality:

Camera Intrinsics (sensor calibration parameters)

Precise Camera Motion (pose)

Dense, Metric Depth Maps (real-world distances for every pixel)

To appreciate the magnitude of this breakthrough, we must first understand the difficulty of the problem it solves.

The challenge: Unlocking 3D Reality from 2D Video 

The ultimate goal of Spatial AI is to enable machines such as robots, autonomous vehicles, and AR glasses to perceive and interact with the world in 3D. We live in a 3D world, but the vast majority of our recorded data, from smartphone clips to cinematic footage, is trapped in 2D.

The Core Problem: How do we reliably and scalably reverse-engineer the 3D reality hidden inside these flat video streams?

Achieving this accurately from everyday video, which features shaky movements, dynamic objects, and unknown camera types, is notoriously difficult, yet it is the essential first step for virtually any advanced spatial application.

Problems with Existing Approaches

For decades, the field has been forced to choose between two powerful yet flawed paradigms.

1. The Precision Trap (Classical SLAM/SfM) 

Traditional methods like Simultaneous Localization and Mapping (SLAM) and Structure-from-Motion (SfM) rely on sophisticated geometric optimization. They are capable of pinpoint accuracy under ideal conditions.

The Fatal Flaw: Brittleness. These systems generally assume the world is static. Introduce a moving car, a textureless wall, or use an unknown camera, and the entire reconstruction can shatter. They are too delicate for the messy reality of everyday video.

2. The Scalability Wall (End-to-End Deep Learning) 

Recently, powerful deep learning models have emerged. By training on vast datasets, they learn robust “priors” about the world and are impressively resilient to noise and dynamism.

The Fatal Flaw: Intractability. These models are computationally hungry. Their memory requirements explode as video length increases, making the processing of long videos practically impossible. They simply do not scale.

This deadlock created a dilemma. The future of advanced AI demands massive datasets annotated with perfect 3D geometry, but the tools required to generate that data were either too brittle or too slow to deploy at scale.

Meet ViPE: NVIDIA’s Hybrid Breakthrough Shatters the Mold 

This is where ViPE changes the game. It is not merely an incremental improvement; it is a well-designed and well-integrated hybrid pipeline that successfully fuses the best of both worlds. It takes the efficient, mathematically rigorous optimization framework of classical SLAM and injects it with the powerful, learned intuition of modern deep neural networks.

This synergy allows ViPE to be accurate, robust, efficient, and versatile simultaneously. ViPE delivers a solution that scales without compromising on precision.

How it Works: Inside the ViPE Engine 

ViPE‘s architecture uses a keyframe-based Bundle Adjustment (BA) framework for efficiency. 

Here are the Key Innovations:

Key Innovation 1: A Synergy of Powerful Constraints

ViPE achieves unprecedented accuracy by masterfully balancing three critical inputs:

Dense Flow (Learned Robustness): Uses a learned optical flow network for robust correspondences between frames, even in tough conditions.

Sparse Tracks (Classical Precision): Incorporates high-resolution, traditional feature tracking to capture fine-grained details, drastically improving localization accuracy.

Metric Depth Regularization (Real-World Scale): ViPE integrates priors from state-of-the-art monocular depth models to produce results in true, real-world metric scale.

Key Innovation 2: Mastering Dynamic, Real-World Scenes 

To handle the chaos of real-world video, ViPE employs advanced foundational segmentation tools, GroundingDINO and Segment Anything (SAM), to identify and mask out moving objects (e.g., people, cars). By intelligently ignoring these dynamic regions, ViPE ensures the camera motion is calculated based only on the static environment.

Key Innovation 3: Fast Speed & General Versatility 

ViPE operates at a remarkable 3-5 FPS on a single GPU, making it significantly faster than comparable methods. Furthermore, ViPE is universally applicable, supporting diverse camera models including standard, wide-angle/fisheye, and even 360° panoramic videos, automatically optimizing the intrinsics for each.

Key Innovation 4: High-Fidelity Depth Maps

The final output is enhanced by a sophisticated post-processing step. ViPE smoothly aligns high-detail depth maps with the geometrically consistent maps from its core process. The result is stunning: depth maps that are both high-fidelity and temporally stable.

The results are striking, even on complex scenes.

Proven Performance

ViPE demonstrates superior performance, outperforming existing uncalibrated pose estimation baselines by a staggering:

18% on the TUM dataset (indoor dynamics)

50% on the KITTI dataset (outdoor driving)

Crucially, the evaluations confirm that ViPE provides accurate metric scale, while other approaches/engines often produce inconsistent, unusable scales.

The Real Innovation: A Data Explosion for Spatial AI

The most significant contribution of this work is not just the engine itself, but its deployment as a large-scale data annotation factory to fuel the future of AI. The lack of massive, diverse, geometrically annotated video data has been the primary bottleneck for training robust 3D models. ViPE solves this problem.

The research team used ViPE to create and release an unprecedented dataset totaling approximately 96 million annotated frames:

Dynpose-100K++: Nearly 100,000 real-world internet videos (15.7M frames) with high-quality poses and dense geometry.

Wild-SDG-1M: A massive collection of 1 million high-quality, AI-generated videos (78M frames).

Web360: A specialized dataset of annotated panoramic videos.

This massive release provides the necessary fuel for the next generation of 3D geometric foundation models and is already proving instrumental in training advanced world generation models like NVIDIA’s Gen3C and Cosmos.

By resolving the fundamental conflicts between accuracy, robustness, and scalability, ViPE provides the practical, efficient, and universal tool needed to unlock the 3D structure of almost any video. Its release is poised to dramatically accelerate innovation across the entire landscape of Spatial AI, robotics, and AR/VR.

NVIDIA AI has released the code here

Sources /links

https://research.nvidia.com/labs/toronto-ai/vipe/

https://github.com/nv-tlabs/vipe

Datasets:

https://huggingface.co/datasets/nvidia/vipe-dynpose-100kpp

https://huggingface.co/datasets/nvidia/vipe-wild-sdg-1m

https://huggingface.co/datasets/nvidia/vipe-web360

https://www.nvidia.com/en-us/ai/cosmos/

Thanks to the NVIDIA team for the thought leadership and resources for this article. The NVIDIA team has supported and sponsored this content.
The post NVIDIA AI Open-Sources ViPE (Video Pose Engine): A Powerful and Versatile 3D Video Annotation Tool for Spatial AI appeared first on MarkTechPost.

Schedule topology-aware workloads using Amazon SageMaker HyperPod task governance

Today, we are excited to announce a new capability of Amazon SageMaker HyperPod task governance to help you optimize training efficiency and network latency of your AI workloads. SageMaker HyperPod task governance streamlines resource allocation and facilitates efficient compute resource utilization across teams and projects on Amazon Elastic Kubernetes Service (Amazon EKS) clusters. Administrators can govern accelerated compute allocation and enforce task priority policies, improving resource utilization. This helps organizations focus on accelerating generative AI innovation and reducing time to market, rather than coordinating resource allocation and replanning tasks. Refer to Best practices for Amazon SageMaker HyperPod task governance for more information.
Generative AI workloads typically demand extensive network communication across Amazon Elastic Compute Cloud (Amazon EC2) instances, where network bandwidth impacts both workload runtime and processing latency. The network latency of these communications depends on the physical placement of instances within a data center’s hierarchical infrastructure. Data centers can be organized into nested organizational units such as network nodes and node sets, with multiple instances per network node and multiple network nodes per node set. For example, instances within the same organizational unit experience faster processing time compared to those across different units. This means fewer network hops between instances result in lower communication latency.
To optimize the placement of your generative AI workloads in your SageMaker HyperPod clusters by considering the physical and logical arrangement of resources, you can use EC2 network topology information during your job submissions. An EC2 instance’s topology is described by a set of nodes, with one node in each layer of the network. Refer to How Amazon EC2 instance topology works for details on how EC2 topology is arranged. Network topology labels offer the following key benefits:

Reduced latency by minimizing network hops and routing traffic to nearby instances
Improved training efficiency by optimizing workload placement across network resources

With topology-aware scheduling for SageMaker HyperPod task governance, you can use topology network labels to schedule your jobs with optimized network communication, thereby improving task efficiency and resource utilization for your AI workloads.
In this post, we introduce topology-aware scheduling with SageMaker HyperPod task governance by submitting jobs that represent hierarchical network information. We provide details about how to use SageMaker HyperPod task governance to optimize your job efficiency.
Solution overview
Data scientists interact with SageMaker HyperPod clusters. They are responsible for the training, fine-tuning, and deployment of models on accelerated compute instances. It’s important to make sure data scientists have the necessary capacity and permissions when interacting with clusters of GPUs.
To implement topology-aware scheduling, you first confirm the topology information for all nodes in your cluster, then run a script that tells you which instances are on the same network nodes, and finally schedule a topology-aware training task on your cluster. This workflow facilitates higher visibility and control over the placement of your training instances.
In this post, we walk through viewing node topology information and submitting topology-aware tasks to your cluster. For reference, NetworkNodes describes the network node set of an instance. In each network node set, three layers comprise the hierarchical view of the topology for each instance. Instances that are closest to each other will share the same layer 3 network node. If there are no common network nodes in the bottom layer (layer 3), then see if there is commonality at layer 2.
Prerequisites
To get started with topology-aware scheduling, you must have the following prerequisites:

An EKS cluster
A SageMaker HyperPod cluster with instances enabled for topology information
The SageMaker HyperPod task governance add-on installed (version 1.2.2 or later)
Kubectl installed
(Optional) The SageMaker HyperPod CLI installed

Get node topology information
Run the following command to show node labels in your cluster. This command provides network topology information for each instance.

kubectl get nodes -L topology.k8s.aws/network-node-layer-1
kubectl get nodes -L topology.k8s.aws/network-node-layer-2
kubectl get nodes -L topology.k8s.aws/network-node-layer-3

Instances with the same network node layer 3 are as close as possible, following the EC2 topology hierarchy. You should see a list of node labels that look like the following:

topology.k8s.aws/network-node-layer-3: nn-33333example

Run the following script to show the nodes in your cluster that are on the same layers 1, 2, and 3 network nodes:

git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/1.architectures/7.sagemaker-hyperpod-eks/task-governance
chmod +x visualize_topology.sh
bash visualize_topology.sh

The output of this script will print a flow chart that you can use in a flow diagram editor such as Mermaid.js.org to visualize the node topology of your cluster. The following figure is an example of the cluster topology for a seven-instance cluster.

Submit tasks
SageMaker HyperPod task governance offers two ways to submit tasks using topology awareness. In this section, we discuss these two options and a third alternative option to task governance.
Modify your Kubernetes manifest file
First, you can modify your existing Kubernetes manifest file to include one of two annotation options:

kueue.x-k8s.io/podset-required-topology – Use this option if you must have all pods scheduled on nodes on the same network node layer in order to begin the job
kueue.x-k8s.io/podset-preferred-topology – Use this option if you ideally want all pods scheduled on nodes in the same network node layer, but you have flexibility

The following code is an example of a sample job that uses the kueue.x-k8s.io/podset-required-topology setting to schedule pods that share the same layer 3 network node:

apiVersion: batch/v1
kind: Job
metadata:
  name: test-tas-job
  namespace: hyperpod-ns-team-a
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-a-localqueue
    kueue.x-k8s.io/priority-class: inference-priority
spec:
  parallelism: 10
  completions: 10
  suspend: true
  template:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team-a-localqueue
      annotations:
        kueue.x-k8s.io/podset-required-topology: "topology.k8s.aws/network-node-layer-3"
    spec:
      containers:
        - name: dummy-job
          image: public.ecr.aws/docker/library/alpine:latest
          command: ["sleep", "3600s"]
          resources:
            requests:
              cpu: "1"
      restartPolicy: Never

To verify which nodes your pods are running on, use the following command to view node IDs per pod:

kubectl get pods -n hyperpod-ns-team-a -o wide
Use the SageMaker HyperPod CLI
The second way to submit a job is through the SageMaker HyperPod CLI. Be sure to install the latest version (version pending) to use topology-aware scheduling. To use topology-aware scheduling with the SageMaker HyperPod CLI, you can include either the --preferred-topology parameter or the --required-topology parameter in your create job command.
The following code is an example command to start a topology-aware mnist training job using the SageMaker HyperPod CLI, replace XXXXXXXXXXXX with your AWS account ID:

hyp create hyp-pytorch-job \
  --job-name test-pytorch-job-cli \
  --image XXXXXXXXXXXX.dkr.ecr.us-west-2.amazonaws.com/ptjob:mnist \
  --pull-policy "Always" \
  --tasks-per-node 1 \
  --max-retry 1 \
  --preferred-topology topology.k8s.aws/network-node-layer-3

Clean up
If you deployed new resources while following this post, refer to the Clean Up section in the SageMaker HyperPod EKS workshop to make sure you don’t accrue unwanted charges.
Conclusion
During large language model (LLM) training, pod-to-pod communication distributes the model across multiple instances, requiring frequent data exchange between these instances. In this post, we discussed how SageMaker HyperPod task governance helps schedule workloads to enable job efficiency by optimizing throughput and latency. We also walked through how to schedule jobs using SageMaker HyperPod topology network information to optimize network communication latency for your AI tasks.
We encourage you to try out this solution and share your feedback in the comments section.

About the authors
Nisha Nadkarni is a Senior GenAI Specialist Solutions Architect at AWS, where she guides companies through best practices when deploying large scale distributed training and inference on AWS. Prior to her current role, she spent several years at AWS focused on helping emerging GenAI startups develop models from ideation to production.
Siamak Nariman is a Senior Product Manager at AWS. He is focused on AI/ML technology, ML model management, and ML governance to improve overall organizational efficiency and productivity. He has extensive experience automating processes and deploying various technologies.
Zican Li is a Senior Software Engineer at Amazon Web Services (AWS), where he leads software development for Task Governance on SageMaker HyperPod. In his role, he focuses on empowering customers with advanced AI capabilities while fostering an environment that maximizes engineering team efficiency and productivity.
Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.