UltraCUA: A Foundation Computer-Use Agents Model that Bridges the Gap between General-Purpose GUI Agents and Specialized API-based Agents

Computer-use agents have been limited to primitives. They click, they type, they scroll. Long action chains amplify grounding errors and waste steps. Apple researchers introduce UltraCUA, a foundation model built on a hybrid action space that lets an agent interleave low level GUI actions with high level programmatic tool calls. The model chooses the cheaper and more reliable move at each step. The approach improves success rates and reduces step counts on OSWorld, and transfers to WindowsAgentArena without Windows specific training.

https://arxiv.org/pdf/2510.17790

What does hybrid action change?

Hybrid action treats tools as first class actions. A tool call encapsulates a multi step operation as a single function with a clear signature and a docstring. Clicks and key presses remain available when no programmatic path exists. The agent learns to alternate between the two modes. The goal is to reduce cascading errors and to cut step counts. The research team positions this as a bridge between GUI only CUAs and tool centric agent frameworks.

https://arxiv.org/pdf/2510.17790

Scaled tool acquisition

UltraCUA builds its tool library with an automated pipeline. The system extracts keyboard shortcuts and commands from software documentation, integrates open source implementations from agent toolkits, and uses coding agents to synthesize new tools. Each tool is a callable interface that hides a long GUI sequence. The research team reports coverage across 10 desktop domains with 881 tools. The largest buckets include VS Code with 135 tools and LibreOffice Writer with 123 tools. Thunderbird and GIMP also have deep coverage.
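
To make the callable interface idea concrete, here is a minimal sketch of a hybrid action space in which one tool wraps a multi step GUI sequence. The primitives and the tool itself are invented for illustration and are not UltraCUA's actual API.

# Hypothetical sketch: a programmatic tool that wraps a multi step GUI sequence.
# The primitives and the tool are illustrative assumptions, not UltraCUA's implementation.
def click(x: int, y: int) -> None: ...      # GUI primitive
def hotkey(*keys: str) -> None: ...         # GUI primitive
def type_text(text: str) -> None: ...       # GUI primitive

def save_file_as(path: str) -> None:
    """Save the current document under a new path.

    Encapsulates the GUI sequence: open the Save As dialog,
    type the destination path, confirm.
    """
    hotkey("ctrl", "shift", "s")
    type_text(path)
    hotkey("enter")

# A hybrid action space exposes both levels to the policy.
ACTIONS = {"primitives": [click, hotkey, type_text], "tools": [save_file_as]}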

https://arxiv.org/pdf/2510.17790

Verifiable synthetic tasks and trajectories

Training requires grounded supervision and stable rewards. UltraCUA uses a dual synthetic engine. An evaluator-first pipeline composes atomic verifiers for browsers, files, images, and system state, then generates tasks that satisfy those checks. An instruction-first pipeline explores the OS and proposes context-aligned tasks, which are then verified. The result is 17,864 verifiable tasks across 10 domains such as Chrome, LibreOffice, GIMP, VS Code, system, Thunderbird, VLC, and multi app workflows. Chrome has 2,826 tasks. The LibreOffice suite sums to 5,885 tasks. Multi app tasks reach 2,113.
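
A minimal sketch of the evaluator-first idea, composing atomic verifiers into a single task check, could look like the following. The verifier names and the task format are assumptions for illustration, not the paper's actual schema.

# Minimal sketch: compose atomic verifiers into one verifiable task check.
# Verifier names and the task format are illustrative assumptions.
from typing import Callable, Dict

Verifier = Callable[[Dict], bool]  # takes an environment snapshot, returns pass or fail

def file_exists(path: str) -> Verifier:
    return lambda env: path in env.get("files", {})

def file_contains(path: str, text: str) -> Verifier:
    return lambda env: text in env.get("files", {}).get(path, "")

def all_of(*checks: Verifier) -> Verifier:
    return lambda env: all(check(env) for check in checks)

task = {
    "instruction": "Export the report as report.pdf containing the word Summary",
    "verifier": all_of(file_exists("report.pdf"), file_contains("report.pdf", "Summary")),
}

env_snapshot = {"files": {"report.pdf": "Quarterly Summary ..."}}
print(task["verifier"](env_snapshot))  # True when all atomic checks pass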

https://arxiv.org/pdf/2510.17790

A multi agent rollout produces successful hybrid trajectories. The planner uses OpenAI o3 for decision making. The grounder uses GTA1-7B for accurate visual localization. The rollout yields about 26.8K successful trajectories that show when to use a tool and when to act in the GUI. These trajectories are the core of the supervised phase.

Training Approach

Training has two stages. Stage 1 is supervised fine tuning. The models train for 3 epochs at a learning rate of 2e-5 on the successful trajectories. Loss is applied turn-wise to avoid over-weighting early steps. Stage 2 is online reinforcement learning. The models train for 150 steps at a learning rate of 1e-6 on verified tasks that are sampled by difficulty. The policy optimization follows a GRPO variant with clip-higher, and removes KL regularization and format rewards. The reward combines a sparse task outcome with a tool use term. Experiments use NVIDIA H100 GPUs. The context is kept near 32K tokens by controlling the number of exposed tools.
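
As a rough sketch of how a sparse task outcome can be combined with a tool use term, consider the function below. The 0.1 weight and the form of the tool use bonus are assumptions for illustration, not the paper's exact reward.

# Illustrative sketch of a hybrid RL reward: sparse outcome plus a tool use term.
# The 0.1 weight and the bonus form are assumptions, not the paper's values.
def hybrid_reward(task_success: bool, tool_calls: int, total_steps: int,
                  tool_weight: float = 0.1) -> float:
    outcome = 1.0 if task_success else 0.0        # sparse, verifier based outcome
    tool_term = tool_calls / max(total_steps, 1)  # rewards useful programmatic moves
    return outcome + tool_weight * tool_term

print(hybrid_reward(task_success=True, tool_calls=3, total_steps=12))  # 1.025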

Results on OSWorld

UltraCUA improves success at both 7B and 32B scales. Under 15 step budgets, UltraCUA-32B reaches 41.0 percent success. OpenCUA-32B reaches 29.7 percent. The absolute gain is 11.3 points. UltraCUA-7B reaches 28.9 percent. UI-TARS-1.5-7B reaches 23.4 percent. Gains persist under 50 step budgets. A per domain breakdown shows consistent lifts across Chrome, Writer, VS Code, and cross application tasks. Average steps decrease against baselines. These shifts indicate better action selection rather than only more attempts.

https://arxiv.org/pdf/2510.17790

Cross platform transfer on WindowsAgentArena

UltraCUA trains only on Ubuntu based OSWorld data. The model is then evaluated on WindowsAgentArena. UltraCUA-7B reaches 21.7 percent success. This exceeds UI-TARS-1.5-7B at 18.1 percent and a Qwen2 baseline trained with Windows data at 13.5 percent. The result suggests that hybrid action strategies learned on one platform transfer to other platforms. The paper highlights this as zero shot platform generalization.

https://arxiv.org/pdf/2510.17790

Key Takeaways

UltraCUA formalizes a hybrid action space that lets a single agent alternate between GUI primitives and programmatic tool calls, which reduces long error prone action chains.

The research team scales a reusable tool library through an automated pipeline and pairs it with a synthetic data engine, yielding 17,000 plus verifiable computer use tasks for training and evaluation.

Training follows a two stage recipe, supervised fine tuning on successful hybrid trajectories then online reinforcement learning on verifiable tasks, which optimizes when to call tools versus act in the GUI.

On OSWorld, UltraCUA reports an average 22 percent relative improvement over base models and 11 percent fewer steps, which indicates gains in reliability and efficiency.

The 7B model reaches 21.7 percent success on WindowsAgentArena without Windows specific training, which shows cross platform generalization of the hybrid action policy.

Editorial Comments

UltraCUA moves computer use agents from brittle primitive action chains to a hybrid action policy, integrating GUI primitives with programmatic tool calls, which reduces error propagation and step counts. It scales tools via an automated pipeline and pairs them with a synthetic data engine that yields 17,000 plus verifiable tasks, enabling supervised fine tuning and online reinforcement learning on grounded signals. Reported results include 22 percent relative improvement on OSWorld with 11 percent fewer steps, and 21.7 percent success on WindowsAgentArena without Windows specific training, which indicates cross platform transfer of the policy.


A Coding Guide to Build a Fully Functional Multi-Agent Marketplace Using uAgents

In this tutorial, we explore how to build a small yet functional multi-agent system using the uAgents framework. We set up three agents — Directory, Seller, and Buyer — that communicate via well-defined message protocols to simulate a real-world marketplace interaction. We design message schemas, define agent behaviors, and implement request-response cycles to demonstrate discovery, negotiation, and transaction among agents, all running asynchronously in a shared event loop. Through this, we understand how autonomous agents collaborate, trade, and efficiently maintain decentralized workflows. Check out the Full Codes here.

!pip -q install "uagents>=0.11.2"

import asyncio, random
from typing import List, Dict, Optional
from uagents import Agent, Context, Bureau, Model, Protocol

class ServiceAnnounce(Model):
    category: str
    endpoint: str

class ServiceQuery(Model):
    category: str

class ServiceList(Model):
    addresses: List[str]

class OfferRequest(Model):
    item: str
    max_price: int

class Offer(Model):
    item: str
    price: int
    qty: int

class Order(Model):
    item: str
    qty: int

class Receipt(Model):
    item: str
    qty: int
    total: int
    ok: bool
    note: Optional[str] = None

We begin by installing the uAgents library and defining all the message models that underpin our communication system. We create structured data types for announcements, queries, offers, and orders, enabling agents to exchange information seamlessly. Check out the Full Codes here.

registry_proto = Protocol(name="registry", version="1.0")
trade_proto = Protocol(name="trade", version="1.0")

directory = Agent(name="directory", seed="dir-seed-001")
seller = Agent(name="seller", seed="seller-seed-001")
buyer = Agent(name="buyer", seed="buyer-seed-001")

directory.include(registry_proto)
seller.include(trade_proto)
buyer.include(registry_proto)
buyer.include(trade_proto)

@registry_proto.on_message(model=ServiceAnnounce)
async def on_announce(ctx: Context, sender: str, msg: ServiceAnnounce):
    reg = await ctx.storage.get("reg") or {}
    reg.setdefault(msg.category, set()).add(sender)
    await ctx.storage.set("reg", reg)
    ctx.logger.info(f"Registered {sender} under '{msg.category}'")

@registry_proto.on_message(model=ServiceQuery)
async def on_query(ctx: Context, sender: str, msg: ServiceQuery):
    reg = await ctx.storage.get("reg") or {}
    addrs = sorted(list(reg.get(msg.category, set())))
    await ctx.send(sender, ServiceList(addresses=addrs))
    ctx.logger.info(f"Returned {len(addrs)} providers for '{msg.category}'")

We set up the Directory, Seller, and Buyer agents and define the registry protocol that manages service discovery. We make the directory respond to announcements and queries, allowing agents to register and locate each other dynamically. Check out the Full Codes here.

CATALOG: Dict[str, Dict[str, int]] = {
    "camera": {"price": 120, "qty": 3},
    "laptop": {"price": 650, "qty": 2},
    "headphones": {"price": 60, "qty": 5},
}

@seller.on_event("startup")
async def seller_start(ctx: Context):
    await ctx.send(directory.address, ServiceAnnounce(category="electronics", endpoint=seller.address))
    ctx.logger.info("Seller announced to directory")

@trade_proto.on_message(model=OfferRequest)
async def on_offer_request(ctx: Context, sender: str, req: OfferRequest):
    item = CATALOG.get(req.item)
    if not item:
        await ctx.send(sender, Offer(item=req.item, price=0, qty=0))
        return
    price = max(1, int(item["price"] * (0.9 + 0.2 * random.random())))
    if price > req.max_price or item["qty"] <= 0:
        await ctx.send(sender, Offer(item=req.item, price=0, qty=0))
        return
    await ctx.send(sender, Offer(item=req.item, price=price, qty=item["qty"]))
    ctx.logger.info(f'Offered {req.item} at {price} with qty {item["qty"]}')

@trade_proto.on_message(model=Order)
async def on_order(ctx: Context, sender: str, order: Order):
    item = CATALOG.get(order.item)
    if not item or item["qty"] < order.qty:
        await ctx.send(sender, Receipt(item=order.item, qty=0, total=0, ok=False, note="Not enough stock"))
        return
    total = item["price"] * order.qty
    item["qty"] -= order.qty
    await ctx.send(sender, Receipt(item=order.item, qty=order.qty, total=total, ok=True, note="Thanks!"))

We create the Seller agent’s catalog and implement logic for responding to offer requests and processing orders. We simulate real-world trading by adding variable pricing and stock management, showing how the seller negotiates and completes transactions. Check out the Full Codes here.

@buyer.on_event("startup")
async def buyer_start(ctx: Context):
    ctx.logger.info("Buyer querying directory for electronics...")
    resp = await ctx.ask(directory.address, ServiceQuery(category="electronics"), expects=ServiceList, timeout=5.0)
    sellers = resp.addresses if resp else []
    if not sellers:
        return
    target = sellers[0]
    desired = "laptop"
    budget = 700
    ctx.logger.info(f"Requesting offer for '{desired}' within budget {budget} from {target}")
    offer = await ctx.ask(target, OfferRequest(item=desired, max_price=budget), expects=Offer, timeout=5.0)
    if not offer or offer.price <= 0:
        return
    qty = 1 if offer.qty >= 1 else 0
    if qty == 0:
        return
    ctx.logger.info(f"Placing order for {qty} x {offer.item} at {offer.price}")
    receipt = await ctx.ask(target, Order(item=offer.item, qty=qty), expects=Receipt, timeout=5.0)
    if receipt and receipt.ok:
        ctx.logger.info(f"ORDER SUCCESS: {receipt.qty} x {receipt.item} | total={receipt.total}")

We program the Buyer agent to discover sellers, request offers, and place orders based on availability and budget. We observe how the buyer interacts with the seller through asynchronous communication to complete a purchase successfully. Check out the Full Codes here.

@buyer.on_interval(period=6.0)
async def periodic_discovery(ctx: Context):
    seen = await ctx.storage.get("seen") or 0
    if seen >= 1:
        return
    await ctx.storage.set("seen", seen + 1)
    ctx.logger.info("Periodic discovery tick -> re-query directory")
    resp = await ctx.ask(directory.address, ServiceQuery(category="electronics"), expects=ServiceList, timeout=3.0)
    n = len(resp.addresses) if resp else 0
    ctx.logger.info(f"Periodic: directory reports {n} seller(s)")

bureau = Bureau()
bureau.add(directory)
bureau.add(seller)
bureau.add(buyer)

async def run_demo(seconds=10):
    task = asyncio.create_task(bureau.run_async())
    try:
        await asyncio.sleep(seconds)
    finally:
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
    print("\n Demo run complete.\n")

try:
    loop = asyncio.get_running_loop()
    await run_demo(10)
except RuntimeError:
    asyncio.run(run_demo(10))

We add periodic discovery to have the buyer recheck available sellers, then have the Bureau run all agents together. We launch the asynchronous runtime to see the full marketplace simulation unfold and complete smoothly.

In conclusion, we have seen our agents discover one another, negotiate an offer, and complete a transaction entirely through message-based interactions. We realize how uAgents simplifies multi-agent orchestration by combining structure, communication, and state management seamlessly within Python. As we run this example, we not only witness a dynamic, autonomous system in action but also gain insight into how the same architecture can be extended to complex decentralized marketplaces, AI collaborations, and intelligent service networks, all within a lightweight, easy-to-use framework.


Generate Gremlin queries using Amazon Bedrock models

Graph databases have revolutionized how organizations manage complex, interconnected data. However, specialized query languages such as Gremlin often create a barrier for teams looking to extract insights efficiently. Unlike traditional relational databases with well-defined schemas, graph databases lack a centralized schema, requiring deep technical expertise for effective querying.
To address this challenge, we explore an approach that converts natural language to Gremlin queries, using Amazon Bedrock models such as Amazon Nova Pro. This approach helps business analysts, data scientists, and other non-technical users access and interact with graph databases seamlessly.
In this post, we outline our methodology for generating Gremlin queries from natural language, comparing different techniques and demonstrating how to evaluate the effectiveness of these generated queries using large language models (LLMs) as judges.
Solution overview
Transforming natural language queries into Gremlin queries requires a deep understanding of graph structures and the domain-specific knowledge encapsulated within the graph database. To achieve this, we divided our approach into three key steps:

Understanding and extracting graph knowledge
Structuring the graph similar to text-to-SQL processing
Generating and executing Gremlin queries

The following diagram illustrates this workflow.

Step 1: Extract graph knowledge
A successful query generation framework must integrate both graph knowledge and domain knowledge to accurately translate natural language queries. Graph knowledge encompasses structural and semantic information extracted directly from the graph database. Specifically, it includes:

Vertex labels and properties – A listing of vertex types, names, and their associated attributes
Edge labels and properties – Information about edge types and their attributes
One-hop neighbors for each vertex – Capturing local connectivity information, such as direct relationships between vertices

With this graph-specific knowledge, the framework can effectively reason about the heterogeneous properties and complex connections inherent to graph databases.
Domain knowledge captures additional context that augments the graph knowledge and is tailored specifically to the application domain. It is sourced in two ways:

Customer-provided domain knowledge – For example, the customer kscope.ai specified which vertices represent metadata and should never be queried. Such constraints are encoded to guide the query generation process.
LLM-generated descriptions – To enhance the system’s understanding of vertex labels and their relevance to specific questions, we use an LLM to generate detailed semantic descriptions of vertex names, properties, and edges. These descriptions are stored within the domain knowledge repository and provide additional context to improve the relevance of the generated queries.

Step 2: Structure the graph as a text-to-SQL schema
To improve the model’s comprehension of graph structures, we adopt an approach similar to text-to-SQL processing, where we construct a schema representing vertex types, edges, and properties. This structured representation enhances the model’s ability to interpret and generate meaningful queries.
The question processing component transforms natural language input into structured elements for query generation. It operates in three stages:

Entity recognition and classification – Identifies key database elements in the input question (such as vertices, edges, and properties) and categorizes the question based on its intent
Context enhancement – Enriches the question with relevant information from the knowledge component, so both graph-specific and domain-specific context is properly captured
Query planning – Maps the enhanced question to specific database elements needed for query execution

The context generation component makes sure the generated queries accurately reflect the underlying graph structure by assembling the following:

Element properties – Retrieves attributes of vertices and edges along with their data types
Graph structure – Facilitates alignment with the database’s topology
Domain rules – Applies business constraints and logic

Step 3: Generate and execute Gremlin queries
The final step is query generation, where the LLM constructs a Gremlin query based on the extracted context. The process follows these steps:

The LLM generates an initial Gremlin query.
The query is executed within a Gremlin engine.
If the execution is successful, results are returned.
If execution fails, an error message parsing mechanism analyzes the returned errors and refines the query using LLM-based feedback.

This iterative refinement makes sure the generated queries align with the database’s structure and constraints, improving overall accuracy and usability.
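
A compact sketch of this generate, execute, and refine loop is shown below. The Bedrock model identifier, the Neptune endpoint, and the prompt wording are assumptions for illustration rather than the production implementation; in practice, the query would also be parsed out of the XML response format described next.

# Hedged sketch of the generate -> execute -> refine loop.
# Model ID, endpoint, and prompt wording are illustrative assumptions.
import boto3
from gremlin_python.driver import client as gremlin_client

bedrock = boto3.client("bedrock-runtime")
graph = gremlin_client.Client("wss://your-neptune-endpoint:8182/gremlin", "g")

def ask_llm(prompt: str) -> str:
    resp = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # assumed model identifier
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]

def generate_and_run(question: str, schema_context: str, max_retries: int = 3):
    prompt = f"Write a Gremlin query for: {question}\nSchema:\n{schema_context}"
    for _ in range(max_retries):
        query = ask_llm(prompt)
        try:
            return graph.submit(query).all().result()   # execute in the Gremlin engine
        except Exception as err:                        # parse the error and refine
            prompt = (f"The query below failed with: {err}\n{query}\n"
                      f"Fix it for question: {question}\nSchema:\n{schema_context}")
    return None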
Prompt template
Our final prompt template is as follows:

## Request
Please write a gremlin query to answer the given question:
{{question}}
You will be provided with couple relevant vertices, together with their
schema and other information.
Please choose the most relevant vertex according to its schema and other
information to make the gremlin query correct.

## Instructions
1. Here are related vertices and their details:
{{schema}}
2. Don’t rename properties.
3. Don’t change lines (using slash n) in the generated query.

## IMPORTANT
Return the results in the following XML format:

<Results>
<Query>INSERT YOUR QUERY HERE</Query>
<Explanation>
PROVIDE YOUR EXPLANATION ON HOW THIS QUERY WAS GENERATED
AND HOW THE PROVIDED SCHEMA WAS LEVERAGED
</Explanation>
</Results>

Comparing LLM-generated queries to ground truth
We implemented an LLM-based evaluation system using Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock as a judge to assess both query generation and execution results for Amazon Nova Pro and a benchmark model. The system operates in two key areas:

Query evaluation – Assesses correctness, efficiency, and similarity to ground-truth queries; calculates exact matching component percentages; and provides an overall rating based on predefined rules developed with domain experts
Execution evaluation – Initially used a single-stage approach to compare generated results with ground truth, then enhanced to a two-stage evaluation process:

Item-by-item verification against ground truth
Calculation of overall match percentage

Testing across 120 questions demonstrated the framework’s ability to effectively distinguish correct from incorrect queries. The two-stage approach particularly improved the reliability of execution result evaluation by conducting thorough comparison before scoring.
Experiments and results
In this section, we discuss the experiments we conducted and their results.
Query similarity
In the query evaluation case, we propose two metrics: query exact match and query overall rating. An exact match score is calculated by identifying matching vs. non-matching components between generated and ground truth queries. The following table summarizes the scores for query exact match.

                   Easy      Medium    Hard      Overall
Amazon Nova Pro    82.70%    61%       46.60%    70.36%
Benchmark Model    92.60%    68.70%    56.20%    78.93%

An overall rating is provided after considering factors including query correctness, efficiency, and completeness, as instructed in the prompt. The overall rating is on a scale of 1–10. The following table summarizes the scores for query overall rating.

                   Easy    Medium    Hard    Overall
Amazon Nova Pro    8.7     7         5.3     7.6
Benchmark Model    9.7     8         6.1     8.5

One limitation in the current query evaluation setup is that we rely solely on the LLM’s ability to compare ground truth against LLM-generated queries and arrive at the final scores. As a result, the LLM can fail to align with human preferences and under- or over-penalize the generated query. To address this, we recommend working with a subject matter expert to include domain-specific rules in the evaluation prompt.
Execution accuracy
To calculate accuracy, we compare the results of the LLM-generated Gremlin queries against the results of ground truth queries. If the results from both queries match exactly, we count the instance as correct; otherwise, it is considered incorrect. Accuracy is then computed as the ratio of correct query executions to the total number of queries tested. This metric provides a straightforward evaluation of how well the model-generated queries retrieve the expected information from the graph database, facilitating alignment with the intended query logic.
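
A minimal sketch of this exact-match accuracy computation, assuming results are returned as lists whose ordering does not matter for the comparison:

# Minimal sketch: execution accuracy as the fraction of queries whose results
# exactly match the ground truth results. Sorting assumes result order is irrelevant.
def execution_accuracy(generated_results, ground_truth_results):
    correct = sum(
        sorted(map(str, gen)) == sorted(map(str, gt))
        for gen, gt in zip(generated_results, ground_truth_results)
    )
    return correct / max(len(ground_truth_results), 1)

print(execution_accuracy([[1, 2], ["a"]], [[2, 1], ["b"]]))  # 0.5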
The following table summarizes the scores for execution results count match.

                   Easy    Medium    Hard    Overall
Amazon Nova Pro    80%     50%       10%     60.42%
Benchmark Model    90%     70%       30%     74.83%

Query execution latency
In addition to accuracy, we evaluate the efficiency of generated queries by measuring their runtime and comparing it with the ground truth queries. For each query, we record the runtime in milliseconds and analyze the difference between the generated query and the corresponding ground truth query. A lower runtime indicates a more optimized query, whereas significant deviations might suggest inefficiencies in query structure or execution planning. By considering both accuracy and runtime, we gain a more comprehensive assessment of query quality, making sure the generated queries are correct and performant within the graph database.

The following box plot showcases query execution latency with respect to time for the ground truth query and the query generated by Amazon Nova Pro. As illustrated, all three types of queries exhibit comparable runtimes, with similar median latencies and overlapping interquartile ranges. Although the ground truth queries display a slightly wider range and a higher outlier, the median values across all three groups remain close. This suggests that the model-generated queries are at the same level as human-written ones in terms of execution efficiency, supporting the claim that AI-generated queries are of similar quality and don’t incur additional latency overhead.

Query generation latency and cost
Finally, we compare the time taken to generate each query and calculate the cost based on token consumption. More specifically, we measure the query generation time and track the number of tokens used, because most LLM-based APIs charge based on token usage. By analyzing both the generation speed and token cost, we can determine whether the model is efficient and cost-effective. These results provide insights in selecting the optimal model that balances query accuracy, execution efficiency, and economic feasibility.
As shown in the following plots, Amazon Nova Pro consistently outperforms the benchmark model in both generation latency and cost. In the left plot, which depicts query generation latency, Amazon Nova Pro demonstrates a significantly lower median generation time, with most values clustered between 1.8–4 seconds, compared to the benchmark model’s broader range from around 5–11 seconds. The right plot, illustrating query generation cost, shows that Amazon Nova Pro maintains a much smaller cost per query—centered well below $0.005—whereas the benchmark model incurs higher and more variable costs, reaching up to $0.025 in some cases. These results highlight Amazon Nova Pro’s advantage in terms of both speed and affordability, making it a strong candidate for deployment in time-sensitive or large-scale systems.

Conclusion
We experimented with all 120 ground truth queries provided to us by kscope.ai and achieved an overall accuracy of 74.17% in generating correct results. The proposed framework demonstrates its potential by effectively addressing the unique challenges of graph query generation, including handling heterogeneous vertex and edge properties, reasoning over complex graph structures, and incorporating domain knowledge. Key components of the framework, such as the integration of graph and domain knowledge, the use of Retrieval Augmented Generation (RAG) for query plan creation, and the iterative error-handling mechanism for query refinement, have been instrumental in achieving this performance.
In addition to improving accuracy, we are actively working on several enhancements. These include refining the evaluation methodology to handle deeply nested query results more effectively and further optimizing the use of LLMs for query generation. Moreover, we are using the RAGAS-faithfulness metric to improve the automated evaluation of query results, resulting in greater reliability and consistency in assessing the framework’s outputs.

About the authors
Mengdie (Flora) Wang is a Data Scientist at AWS Generative AI Innovation Center, where she works with customers to architect and implement scalable Generative AI solutions that address their unique business challenges. She specializes in model customization techniques and agent-based AI systems, helping organizations harness the full potential of generative AI technology. Prior to AWS, Flora earned her Master’s degree in Computer Science from the University of Minnesota, where she developed her expertise in machine learning and artificial intelligence.
Jason Zhang has expertise in machine learning, reinforcement learning, and generative AI. He earned his Ph.D. in Mechanical Engineering in 2014, where his research focused on applying reinforcement learning to real-time optimal control problems. He began his career at Tesla, applying machine learning to vehicle diagnostics, then advanced NLP research at Apple and Amazon Alexa. At AWS, he worked as a Senior Data Scientist on generative AI solutions for customers.
Rachel Hanspal is a Deep Learning Architect at AWS Generative AI Innovation Center, specializing in end-to-end GenAI solutions with a focus on frontend architecture and LLM integration. She excels in translating complex business requirements into innovative applications, leveraging expertise in natural language processing, automated visualization, and secure cloud architectures.
Zubair Nabi is the CTO and Co-Founder of Kscope, an Integrated Security Posture Management (ISPM) platform. His expertise lies at the intersection of Big Data, Machine Learning, and Distributed Systems, with over a decade of experience building software, data, and AI platforms. Zubair is also an adjunct faculty member at George Washington University and the author of Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark. He holds an MPhil from the University of Cambridge.
Suparna Pal is the CEO and Co-Founder of kscope.ai, with more than 20 years of experience building innovative platforms and solutions for industrial, healthcare, and IT operations at PTC, GE, and Cisco.
Wan Chen is an Applied Science Manager at AWS Generative AI Innovation Center. As an ML/AI veteran in the tech industry, she has a wide range of expertise in traditional machine learning, recommender systems, deep learning, and generative AI. She is a strong believer in superintelligence and is passionate about pushing the boundary of AI research and application to enhance human life and drive business growth. She holds a Ph.D. in Applied Mathematics from the University of British Columbia and worked as a postdoctoral fellow at Oxford University.
Mu Li is a Principal Solutions Architect with AWS Energy. He’s also the Worldwide Tech Leader for the AWS Energy & Utilities Technical Field Community (TFC), a community of 300+ industry and technical experts. Li is passionate about working with customers to achieve business outcomes using technology. Li has worked with customers to migrate all-in to AWS from on-prem and Azure, launch the Production Monitoring and Surveillance industry solution, deploy ION/OpenLink Endur on AWS, and implement AWS-based IoT and machine learning workloads. Outside of work, Li enjoys spending time with his family, investing, following Houston sports teams, and catching up on business and technology.

Incorporating responsible AI into generative AI project prioritization

Over the past two years, companies have seen an increasing need to develop a project prioritization methodology for generative AI. There is no shortage of generative AI use cases to consider. Rather, companies want to evaluate the business value against the cost, level of effort, and other concerns for a large number of potential generative AI projects. Compared to other domains, generative AI raises new concerns such as hallucination, agents making incorrect decisions and then acting on them through tool calls to downstream systems, and a rapidly changing regulatory landscape. In this post, we describe how to incorporate responsible AI practices into a prioritization method to systematically address these types of concerns.
Responsible AI overview
The AWS Well-Architected Framework defines responsible AI as “the practice of designing, developing, and using AI technology with the goal of maximizing benefits and minimizing risks.” The AWS responsible AI framework begins by defining eight dimensions of responsible AI: fairness, explainability, privacy and security, safety, controllability, veracity and robustness, governance, and transparency. At key points in the development lifecycle, a generative AI team should consider the possible harms or risks for each dimension (inherent and residual risks), implement risk mitigations, and monitor risk on an ongoing basis. Responsible AI applies across the entire development lifecycle and should be considered during initial project prioritization. That’s especially true for generative AI projects, where there are novel types of risks to consider, and mitigations might not be as well understood or researched. Considering responsible AI up front gives a more accurate picture of project risk and mitigation level of effort and reduces the chance of costly rework if risks are uncovered later in the development lifecycle. In addition to potentially delaying projects due to rework, unmitigated concerns might also harm customer trust, result in representational harm, or fail to meet regulatory requirements.
Generative AI prioritization
While most companies have their own prioritization methods, here we’ll demonstrate how to use the weighted shortest job first (WSJF) method from the Scaled Agile system. WSJF assigns a priority using this formula:
Priority = (cost of delay) / (job size)
The cost of delay is a measure of business value. It includes the direct value (for example, additional revenue or cost savings), the timeliness (such as, is shipping this project worth a lot more today than a year from now), and the adjacent opportunities (such as, would delivering this project open up other opportunities down the road).
The job size is where you consider the level of effort to deliver the project. That normally includes direct development costs and paying for any infrastructure or software you need. The job size is where you can include the results of the initial responsible AI risk assessment and expected mitigations. For example, if the initial assessment uncovers three risks that require mitigation, you include the development cost for those mitigations in the job size. You can also qualitatively assess that a project with ten high-priority risks is more complex than a project with only two high-priority risks.
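
The scoring arithmetic can be sketched in a few lines; the component scores in the example calls below are placeholders matching the worked example later in this post, not recommendations.

# Minimal sketch of WSJF scoring: priority = cost of delay / job size.
# The component scores here are placeholder values, not recommendations.
def wsjf_priority(direct_value: int, timeliness: int,
                  adjacent_opportunity: int, job_size: int) -> float:
    cost_of_delay = direct_value + timeliness + adjacent_opportunity
    return cost_of_delay / job_size

# Example: two candidate projects scored 1-5 on each component.
print(wsjf_priority(3, 2, 2, 2))  # 3.5
print(wsjf_priority(3, 4, 3, 2))  # 5.0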
Example scenario
Now, let’s walk through a prioritization exercise that compares two generative AI projects. The first project uses a large language model (LLM) to generate product descriptions. A marketing team will use this application to automatically create production descriptions that go into the online product catalog website. The second project uses a text-to-image model to generate new visuals for advertising campaigns and the product catalog. The marketing team will use this application to more quickly create customized brand assets.
First pass prioritization
First, we’ll go through the prioritization method without considering responsible AI, assigning a score of 1–5 for each part of the WSJF formula. The specific scores vary by organization. Some companies prefer to use t-shirt sizing (S, M, L, and XL), others prefer a score of 1–5, and others will use a more granular score. A score of 1–5 is a common and straightforward way to start. For example, the direct value scores can be calculated as:
1 = no direct value
2 = 20% improvement in KPI (time to create high-quality descriptions)
3 = 40% improvement in KPI
4 = 80% improvement in KPI
5 = 100% or more improvement in KPI

Both projects are scored from 1–5. Project 1 is the automated product descriptions project; Project 2 is the visual brand assets project.

Direct value
  Project 1: 3 (helps the marketing team create higher quality descriptions more quickly)
  Project 2: 3 (helps the marketing team create higher quality assets more quickly)

Timeliness
  Project 1: 2 (not particularly urgent)
  Project 2: 4 (a new ad campaign is planned this quarter; without this project, the team cannot create enough brand assets without hiring a new agency to supplement the team)

Adjacent opportunities
  Project 1: 2 (might be able to reuse for similar scenarios)
  Project 2: 3 (experience gained in image generation will build competence for future projects)

Job size
  Project 1: 2 (basic, well-known pattern)
  Project 2: 2 (basic, well-known pattern)

Score
  Project 1: (3+2+2)/2 = 3.5
  Project 2: (3+4+3)/2 = 5

At first glance, it looks like Project 2 is more compelling. Intuitively that makes sense—it takes people a lot longer to make high-quality visuals than to create textual product descriptions.
Risk assessment
Now let’s go through a risk assessment for each project. The following summary gives a brief overview of the risk assessment outcome along each of the AWS responsible AI dimensions, with a t-shirt size (S, M, L, XL) severity level and suggested mitigations.

Project 1 is the automated product descriptions project; Project 2 is the visual brand assets project.

Fairness
  Project 1 (L): Are descriptions appropriate in terms of gender and demographics? Mitigate using guardrails.
  Project 2 (L): Images must not portray particular demographics in a biased way. Mitigate using human and automated checks.

Explainability
  Project 1: No risks identified.
  Project 2: No risks identified.

Privacy and security
  Project 1 (L): Some product information is proprietary and cannot be listed on a public site. Mitigate using data governance.
  Project 2 (L): Model must not be trained on any images that contain proprietary information. Mitigate using data governance.

Safety
  Project 1 (M): Language must be age-appropriate and not cover offensive topics. Mitigate using guardrails.
  Project 2 (L): Images must not contain adult content or images of drugs, alcohol, or weapons. Mitigate using guardrails.

Controllability
  Project 1 (S): Need to track customer feedback on the descriptions. Mitigate using customer feedback collection.
  Project 2 (L): Do images align to our brand guidelines? Mitigate using human and automated checks.

Veracity and robustness
  Project 1 (M): Will the system hallucinate and imply product capabilities that aren’t real? Mitigate using guardrails.
  Project 2 (L): Are images realistic enough to avoid uncanny valley effects? Mitigate using human and automated checks.

Governance
  Project 1 (M): Prefer LLM providers that offer copyright indemnification. Mitigate using LLM provider selection.
  Project 2 (L): Require copyright indemnification and image source attribution. Mitigate using model provider selection.

Transparency
  Project 1 (S): Disclose that descriptions are AI generated.
  Project 2 (S): Disclose that images are AI generated.

The risks and mitigations are use-case specific. The preceding assessment is for illustrative purposes only.
Second pass prioritization
How does the risk assessment affect the prioritization?

Both projects are rescored from 1–5 with the risk mitigations included in the job size.

Job size
  Project 1: 3 (basic, well-known pattern; requires fairly standard guardrails, governance, and feedback collection)
  Project 2: 5 (basic, well-known pattern, but requires advanced image guardrails with human oversight and a more expensive commercial model; a research spike is needed)

Score
  Project 1: (3+2+2)/3 = 2.3
  Project 2: (3+4+3)/5 = 2

Now it looks like Project 1 is a better one to start with. Intuitively, after you consider responsible AI, that makes sense. Poorly crafted or offensive images are more noticeable and have a larger impact than a poorly phrased product description. And the guardrails you can use for maintaining image safety are less mature than the equivalent guardrails for text, particularly in ambiguous cases like adhering to brand guidelines. In fact, an image guardrail system might require training a monitoring model or using people to spot-check some percentage of the output. You might need to dedicate a small science team to study this problem first.
Conclusion
In this post, you saw how to include responsible AI considerations in a generative AI project prioritization method. You saw how conducting a responsible AI risk assessment in the initial prioritization phase can change the outcome by uncovering a substantial amount of mitigation work. Moving forward, you should develop your own responsible AI policy and start adopting responsible AI practices for generative AI projects. You can find additional details and resources at Transform responsible AI from theory into practice.

About the author
Randy DeFauw is a Sr. Principal Solutions Architect at AWS. He has over 20 years of experience in technology, starting with his university work on autonomous vehicles. He has worked with and for customers ranging from startups to Fortune 50 companies, launching Big Data and Machine Learning applications. He holds an MSEE and an MBA, serves as a board advisor to K-12 STEM education initiatives, and has spoken at leading conferences including Strata and GlueCon. He is the co-author of the books SageMaker Best Practices and Generative AI Cloud Solutions. Randy currently acts as a technical advisor to AWS’ director of technology in North America.

PokeeResearch-7B: An Open 7B Deep-Research Agent Trained with Reinforcement Learning from AI Feedback (RLAIF) and a Robust Reasoning Scaffold

Pokee AI has open sourced PokeeResearch-7B, a 7B parameter deep research agent that executes full research loops, decomposes a query, issues search and read calls, verifies candidate answers, then synthesizes multiple research threads into a final response.

The agent runs a research and verification loop. In research, it calls external tools for web search and page reading or proposes an interim answer. In verification, it checks the answer against retrieved evidence, and either accepts or restarts research. This structure reduces brittle trajectories and catches obvious errors before finalization. The research team formalizes this loop and adds a test-time synthesis stage that merges several independent research threads.

Training recipe, RLAIF with RLOO

PokeeResearch-7B is finetuned from Qwen2.5-7B-Instruct using annotation-free Reinforcement Learning from AI Feedback (RLAIF) with the REINFORCE Leave-One-Out algorithm (RLOO). The reward targets semantic correctness, citation faithfulness, and instruction adherence, not token overlap. The model’s Hugging Face card lists a batch size of 64, 8 research threads per prompt during RL, a learning rate of 3e-6, 140 steps, a 32,768 token context, bf16 precision, and a checkpoint near 13 GB. The research team emphasizes that RLOO provides an unbiased on policy gradient and contrasts it with the PPO family, which is approximately on policy and biased.
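
For intuition on the RLOO estimator, here is a minimal sketch of the leave-one-out baseline over the 8 research threads sampled per prompt; the reward values are placeholders, and this omits the policy gradient machinery around it.

# Minimal sketch of REINFORCE Leave-One-Out (RLOO) advantages.
# For each of the k sampled threads, the baseline is the mean reward of the other k-1 threads.
def rloo_advantages(rewards):
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# Placeholder rewards for 8 research threads sampled from the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
print(rloo_advantages(rewards))  # positive for above-baseline threads, negative otherwise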

https://arxiv.org/pdf/2510.15862

Reasoning scaffold and Research Threads Synthesis

The scaffold includes three mechanisms. Self correction: the agent detects malformed tool calls and retries. Self verification: the agent inspects its own answer against evidence. Research Threads Synthesis: the agent runs several independent threads per question, summarizes them, then synthesizes a final answer. The research team reports that synthesis improves accuracy on difficult benchmarks.

https://arxiv.org/pdf/2510.15862

Evaluation protocol

The research team evaluates text only questions from 10 benchmarks: NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle, GAIA, BrowseComp, and Humanity’s Last Exam. They sample 125 questions per dataset, except GAIA with 103, for a total of 1,228 questions. For each question, they run 4 research threads, then compute mean accuracy over the threads (mean at 4), using Gemini-2.5-Flash-lite to judge correctness. The maximum number of interaction turns is set to 100.
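
A tiny sketch of the mean at 4 computation under this protocol, assuming the judge's output is already reduced to 1 for correct and 0 for incorrect:

# Tiny sketch of mean@4: average judged correctness over 4 research threads per question.
# The 0/1 judgments stand in for the Gemini-2.5-Flash-lite correctness verdicts.
def mean_at_k(judgments_per_question, k=4):
    # judgments_per_question: one list of 0/1 judge outcomes per question
    return sum(sum(j[:k]) / k for j in judgments_per_question) / len(judgments_per_question)

print(mean_at_k([[1, 1, 0, 1], [0, 0, 1, 0]]))  # (0.75 + 0.25) / 2 = 0.5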

https://github.com/Pokee-AI/PokeeResearchOSS

Results at 7B scale

PokeeResearch-7B reports the best mean at 4 accuracy among 7B deep research agents across the 10 datasets. On HLE the model reports 15.2 without RTS and 17.6 with RTS. On GAIA the model reports 36.9 without RTS and 41.3 with RTS. On BrowseComp the model reports 5.4 without RTS and 8.4 with RTS. On the seven QA benchmarks, Bamboogle, 2WikiMultiHopQA, TriviaQA, NQ, PopQA, Musique, HotpotQA, the model improves over recent 7B baselines. Gains from RTS are largest on HLE, GAIA, and BrowseComp, and smaller on the QA sets.

Key Takeaways

Training: PokeeResearch-7B fine tunes Qwen2.5-7B-Instruct with RLAIF using the RLOO estimator, optimizing rewards for factual accuracy, citation faithfulness, and instruction adherence, not token overlap.

Scaffold: The agent runs a research and verification loop with Research Threads Synthesis, executing multiple independent threads, then synthesizing evidence to a final answer.

Evaluation protocol: Benchmarks span 10 datasets with 125 questions each, except GAIA with 103, 4 threads per question, mean@4 accuracy judged by Gemini-2.5-Flash-lite, with a 100 turn cap.

Results and release: PokeeResearch-7B reports state of the art among 7B deep research agents, for example HLE 17.6 with RTS, GAIA 41.3 with RTS, BrowseComp 8.4 with RTS, and is released under Apache-2.0 with code and weights public.

Editorial Comments

PokeeResearch-7B is a useful step for practical deep research agents. It aligns training with RLAIF using RLOO, so the objective targets semantic correctness, citation faithfulness, and instruction adherence. The reasoning scaffold includes self verification and Research Threads Synthesis, which improves difficult benchmarks. The evaluation uses mean at 4 with Gemini 2.5 Flash lite as the judge, across 10 datasets. The release ships Apache 2.0 code and weights with a clear tool stack using Serper and Jina. The setup runs on a single A100 80 GB and scales.


How to Design a Fully Functional Enterprise AI Assistant with Retrieval Augmentation and Policy Guardrails Using Open Source AI Models

In this tutorial, we explore how we can build a compact yet powerful Enterprise AI assistant that runs effortlessly on Colab. We start by integrating retrieval-augmented generation (RAG) using FAISS for document retrieval and FLAN-T5 for text generation, both fully open-source and free. As we progress, we embed enterprise policies such as data redaction, access control, and PII protection directly into the workflow, ensuring our system is intelligent and compliant. Check out the FULL CODES here.

!pip -q install faiss-cpu transformers==4.44.2 accelerate sentence-transformers==3.0.1

from typing import List, Dict, Tuple
import re, textwrap, numpy as np, torch
from sentence_transformers import SentenceTransformer
import faiss
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

GEN_MODEL = "google/flan-t5-base"
EMB_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

gen_tok = AutoTokenizer.from_pretrained(GEN_MODEL)
gen_model = AutoModelForSeq2SeqLM.from_pretrained(GEN_MODEL, device_map="auto")
generate = pipeline("text2text-generation", model=gen_model, tokenizer=gen_tok)

emb_device = "cuda" if torch.cuda.is_available() else "cpu"
emb_model = SentenceTransformer(EMB_MODEL, device=emb_device)

We begin by setting up our environment and loading the required models. We initialize FLAN-T5 for text generation and MiniLM for embedding representations. We ensure both models are configured to automatically use the GPU when available, so our pipeline runs efficiently. Check out the FULL CODES here.

DOCS = [
    {"id": "policy_sec_001", "title": "Data Security Policy",
     "text": "All customer data must be encrypted at rest (AES-256) and in transit (TLS 1.2+). Access is role-based (RBAC). Secrets are stored in a managed vault. Backups run nightly with 35-day retention. PII includes name, email, phone, address, PAN/Aadhaar."},
    {"id": "policy_ai_002", "title": "Responsible AI Guidelines",
     "text": "Use internal models for confidential data. Retrieval sources must be logged. No customer decisioning without human-in-the-loop. Redact PII in prompts and outputs. All model prompts and outputs are stored for audit for 180 days."},
    {"id": "runbook_inc_003", "title": "Incident Response Runbook",
     "text": "If a suspected breach occurs, page on-call SecOps. Rotate keys, isolate affected services, perform forensic capture, notify DPO within regulatory SLA. Communicate via the incident room only."},
    {"id": "sop_sales_004", "title": "Sales SOP – Enterprise Deals",
     "text": "For RFPs, use the approved security questionnaire responses. Claims must match policy_sec_001. Custom clauses need Legal sign-off. Keep records in CRM with deal room links."}
]

def chunk(text: str, chunk_size=600, overlap=80):
    w = text.split()
    if len(w) <= chunk_size:
        return [text]
    out = []
    i = 0
    while i < len(w):
        j = min(i + chunk_size, len(w))
        out.append(" ".join(w[i:j]))
        if j == len(w):
            break
        i = j - overlap
    return out

CORPUS = []
for d in DOCS:
    for i, c in enumerate(chunk(d["text"])):
        CORPUS.append({"doc_id": d["id"], "title": d["title"], "chunk_id": i, "text": c})

We create a small enterprise-style document set to simulate internal policies and procedures. We then break these long texts into manageable chunks so they can be embedded and retrieved effectively. This chunking helps our AI assistant handle contextual information with better precision. Check out the FULL CODES here.

def build_index(chunks: List[Dict]) -> Tuple[faiss.IndexFlatIP, np.ndarray]:
    vecs = emb_model.encode([c["text"] for c in chunks], normalize_embeddings=True, convert_to_numpy=True)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)
    return index, vecs

INDEX, VECS = build_index(CORPUS)

PII_PATTERNS = [
    (re.compile(r"\b\d{10}\b"), "<REDACTED_PHONE>"),
    (re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.I), "<REDACTED_EMAIL>"),
    (re.compile(r"\b\d{12}\b"), "<REDACTED_ID12>"),
    (re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"), "<REDACTED_PAN>")
]

def redact(t: str) -> str:
    for p, r in PII_PATTERNS:
        t = p.sub(r, t)
    return t

POLICY_DISALLOWED = [
    re.compile(r"\b(share|exfiltrate)\b.*\b(raw|all)\b.*\bdata\b", re.I),
    re.compile(r"\bdisable\b.*\bencryption\b", re.I),
]

def policy_check(q: str):
    for r in POLICY_DISALLOWED:
        if r.search(q):
            return False, "Request violates security policy (data exfiltration/encryption tampering)."
    return True, ""

We embed all chunks using Sentence Transformers and store them in a FAISS index for fast retrieval. We introduce PII redaction rules and policy checks to prevent misuse of data. By doing this, we ensure our assistant adheres to enterprise security and compliance guidelines. Check out the FULL CODES here.

def retrieve(query: str, k=4) -> List[Dict]:
    qv = emb_model.encode([query], normalize_embeddings=True, convert_to_numpy=True)
    scores, idxs = INDEX.search(qv, k)
    return [{**CORPUS[i], "score": float(s)} for s, i in zip(scores[0], idxs[0])]

SYSTEM = ("You are an enterprise AI assistant.\n"
          "- Answer strictly from the provided CONTEXT.\n"
          "- If missing info, say what is unknown and suggest the correct policy/runbook.\n"
          "- Keep it concise and cite titles + doc_ids inline like [Title (doc_id:chunk)].")

def build_prompt(user_q: str, ctx_blocks: List[Dict]) -> str:
    ctx = "\n\n".join(
        f'[{i+1}] {b["title"]} (doc:{b["doc_id"]}:{b["chunk_id"]})\n{b["text"]}'
        for i, b in enumerate(ctx_blocks)
    )
    uq = redact(user_q)
    return (f"SYSTEM:\n{SYSTEM}\n\nCONTEXT:\n{ctx}\n\nUSER QUESTION:\n{uq}\n\n"
            f"INSTRUCTIONS:\n- Cite sources inline.\n- Keep to 5-8 sentences.\n- Preserve redactions.")

def answer(user_q: str, k=4, max_new_tokens=220) -> Dict:
    ok, msg = policy_check(user_q)
    if not ok:
        return {"answer": f" {msg}", "ctx": []}
    ctx = retrieve(user_q, k=k)
    prompt = build_prompt(user_q, ctx)
    out = generate(prompt, max_new_tokens=max_new_tokens, do_sample=False)[0]["generated_text"].strip()
    return {"answer": out, "ctx": ctx}

We design the retrieval function to fetch relevant document sections for each user query. We then construct a structured prompt combining context and questions for FLAN-T5 to generate precise answers. This step ensures that our assistant produces grounded, policy-compliant responses. Check out the FULL CODES here.

def eval_query(user_q: str, ctx: List[Dict]) -> Dict:
    terms = [w.lower() for w in re.findall(r"[a-zA-Z]{4,}", user_q)]
    ctx_text = " ".join(c["text"].lower() for c in ctx)
    hits = sum(t in ctx_text for t in terms)
    return {"terms": len(terms), "hits": hits, "hit_rate": round(hits / max(1, len(terms)), 2)}

QUERIES = [
    "What encryption and backup rules do we follow for customer data?",
    "Can we auto-answer RFP security questionnaires? What should we cite?",
    "If there is a suspected breach, what are the first three steps?",
    "Is it allowed to share all raw customer data externally for testing?"
]

for q in QUERIES:
    res = answer(q, k=3)
    print("\n" + "=" * 100)
    print("Q:", q)
    print("\nA:", res["answer"])
    if res["ctx"]:
        ev = eval_query(q, res["ctx"])
        print("\nRetrieved Context (top 3):")
        for r in res["ctx"]:
            print(f'- {r["title"]} [{r["doc_id"]}:{r["chunk_id"]}] score={r["score"]:.3f}')
        print("Eval:", ev)

We evaluate our system using sample enterprise queries that test encryption, RFPs, and incident procedures. We display retrieved documents, answers, and simple hit-rate scores to check relevance. Through this demo, we observe our Enterprise AI assistant performing retrieval-augmented reasoning securely and accurately.

In conclusion, we successfully created a self-contained enterprise AI system that retrieves, analyzes, and responds to business queries while maintaining strong guardrails. We appreciate how seamlessly we can combine FAISS for retrieval, Sentence Transformers for embeddings, and FLAN-T5 for generation to simulate an internal enterprise knowledge engine. As we finish, we realize that this simple Colab-based implementation can serve as a blueprint for scalable, auditable, and compliant enterprise deployments.


Google AI Introduces VISTA: A Test Time Self Improving Agent for Text …

TLDR: VISTA is a multi agent framework that improves text to video generation during inference. It plans structured prompts as scenes, runs a pairwise tournament to select the best candidate, uses specialized judges across visual, audio, and context, then rewrites the prompt with a Deep Thinking Prompting Agent. The method shows consistent gains over strong prompt optimization baselines in single scene and multi scene settings, and human raters prefer its outputs.

https://arxiv.org/pdf/2510.15831

What is VISTA?

VISTA stands for Video Iterative Self improvemenT Agent. It is a black box, multi agent loop that refines prompts and regenerates videos at test time. The system targets three aspects jointly: visual, audio, and context. It follows four steps: structured video prompt planning, pairwise tournament selection, multi dimensional multi agent critiques, and a Deep Thinking Prompting Agent for prompt rewriting.

The research team evaluates VISTA on a single scene benchmark and on an internal multi scene set. It reports consistent improvements and up to 60 percent pairwise win rate against state of the art baselines in some settings, and a 66.4 percent human preference over the strongest baseline.

https://arxiv.org/pdf/2510.15831

Understanding the key problem

Text to video models like Veo 3 can produce high quality video and audio, yet outputs remain sensitive to exact prompt phrasing, adherence to physics can fail, and alignment to user goals can drift, which forces manual trial and error. VISTA frames this as a test time optimization problem. It seeks unified improvement across visual signals, audio signals, and contextual alignment.

How VISTA works, step by step

Step 1: structured video prompt planning

The user prompt is decomposed into timed scenes. Each scene carries 9 properties: duration, scene type, characters, actions, dialogues, visual environment, camera, sounds, and moods. A multimodal LLM fills missing properties and enforces constraints on realism, relevancy, and creativity by default. The system also keeps the original user prompt in the candidate set to accommodate models that do not benefit from decomposition.
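
A minimal sketch of what such a structured scene plan could look like as data; the field values are invented for illustration, and the real system's schema may differ.

# Illustrative structured prompt plan: one timed scene with the 9 properties described above.
# The example values are invented for illustration.
scene = {
    "duration": "0:00-0:04",
    "scene_type": "establishing shot",
    "characters": ["a lighthouse keeper"],
    "actions": ["climbs the spiral staircase"],
    "dialogues": [],
    "visual_environment": "stormy coast at dusk, rain on the windows",
    "camera": "slow upward dolly",
    "sounds": ["wind", "distant thunder", "footsteps on metal stairs"],
    "moods": ["tense", "solitary"],
}

structured_prompt = {"scenes": [scene]}  # a multi scene prompt is a list of such scenes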

Step 2: pairwise tournament video selection

The system samples multiple video and prompt pairs. An MLLM acts as a judge, running binary tournaments with bidirectional swapping to reduce token order bias. The default criteria include visual fidelity, physical commonsense, text video alignment, audio video alignment, and engagement. The method first elicits probing critiques to support analysis, then performs pairwise comparison, and applies customizable penalties for common text to video failures.
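
The pairwise tournament with bidirectional swapping can be sketched as below; the judge function is a stand-in for the MLLM comparison call and is an assumption for illustration.

# Sketch of a pairwise tournament with bidirectional swapping to reduce order bias.
# judge(a, b) is a placeholder for the MLLM call that returns "A" or "B".
import random

def judge(candidate_a, candidate_b) -> str:
    # Placeholder: a real system would prompt a multimodal LLM with both videos
    # and the selection criteria, then parse its verdict.
    return random.choice(["A", "B"])

def compare(a, b):
    first = judge(a, b)                 # original order
    second = judge(b, a)                # swapped order
    if first == "A" and second == "B":  # a wins in both directions
        return a
    if first == "B" and second == "A":  # b wins in both directions
        return b
    return random.choice([a, b])        # disagreement, treat as a tie

def tournament(candidates):
    champion = candidates[0]
    for challenger in candidates[1:]:
        champion = compare(champion, challenger)
    return champion

print(tournament(["video_1", "video_2", "video_3", "video_4"]))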

Step 3: multi dimensional multi agent critiques

The champion video and prompt receive critiques along three dimensions: visual, audio, and context. Each dimension uses a triad: a normal judge, an adversarial judge, and a meta judge that consolidates both sides. Visual metrics include visual fidelity, motions and dynamics, temporal consistency, camera focus, and visual safety. Audio metrics include audio fidelity, audio video alignment, and audio safety. Context metrics include situational appropriateness, semantic coherence, text video alignment, physical commonsense, engagement, and video format. Scores are on a 1 to 10 scale, which supports targeted error discovery.

Step 4: Deep Thinking Prompting Agent

The reasoning module reads the meta critiques and runs a 6 step introspection: it identifies low scoring metrics, clarifies expected outcomes, checks prompt sufficiency, separates model limits from prompt issues, detects conflicts or vagueness, and proposes modification actions, then samples refined prompts for the next generation cycle.
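The sketch below illustrates the shape of this introspection loop under the assumption that each step is issued as a separate prompt to a generic llm callable. The step wording is paraphrased from the description above, not copied from the paper.

from typing import Callable, List

# Paraphrased six-step introspection; not the paper's exact prompts.
INTROSPECTION_STEPS = [
    "Identify the lowest-scoring metrics in the meta critiques.",
    "Clarify what the expected outcome for those metrics looks like.",
    "Check whether the current prompt contains enough information.",
    "Separate generator limitations from prompt issues.",
    "Detect conflicting or vague instructions in the prompt.",
    "Propose concrete modification actions.",
]

def refine_prompt(llm: Callable[[str], str], current_prompt: str,
                  meta_critiques: str, n_samples: int = 3) -> List[str]:
    analysis = ""
    for step in INTROSPECTION_STEPS:
        # Accumulate the analysis so later steps can build on earlier ones.
        analysis += llm(
            f"{step}\n\nPrompt:\n{current_prompt}\n\n"
            f"Critiques:\n{meta_critiques}\n\nAnalysis so far:\n{analysis}"
        ) + "\n"
    # Sample several refined prompts for the next generation cycle.
    return [llm(f"Rewrite the prompt using this analysis:\n{analysis}")
            for _ in range(n_samples)]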

https://arxiv.org/pdf/2510.15831

Understanding the results

Automatic evaluation: The research study reports win, tie, loss rates on ten criteria using an MLLM as a judge, with bidirectional comparisons. VISTA achieves a win rate over direct prompting that rises across iterations, reaching 45.9 percent in single scene and 46.3 percent in multi scene at iteration 5. It also wins directly against each baseline under the same compute budget.

Human studies: Annotators with prompt optimization experience prefer VISTA in 66.4 percent of head to head trials against the best baseline at iteration 5. Experts rate optimization trajectories higher for VISTA, and they score visual quality and audio quality higher than direct prompting.

Cost and scaling: Average tokens per iteration are about 0.7 million across two datasets, generation tokens are not included. Most token use comes from selection and critiques, which process videos as long context inputs. Win rate tends to increase as the number of sampled videos and tokens per iteration increases.

Ablations: Removing prompt planning weakens initialization. Removing tournament selection destabilizes later iterations. Using only one judge type reduces performance. Removing the Deep Thinking Prompting Agent lowers final win rates.

Evaluators: The research team repeated evaluation with alternative evaluator models and observed similar iterative improvements, which supports the robustness of the trend.

https://arxiv.org/pdf/2510.15831


Key Takeaways

VISTA is a test time, multi agent loop that jointly optimizes visual, audio, and context for text to video generation.

It plans prompts as timed scenes with 9 attributes: duration, scene type, characters, actions, dialogues, visual environment, camera, sounds, and moods.

Candidate videos are selected via pairwise tournaments using an MLLM judge with bidirectional swap, scored on visual fidelity, physical commonsense, text video alignment, audio video alignment, and engagement.

A triad of judges per dimension (normal, adversarial, meta) produces 1 to 10 scores that guide the Deep Thinking Prompting Agent to rewrite the prompt and iterate.

Results show 45.9 percent wins on single scene and 46.3 percent on multi scene at iteration 5 over direct prompting; human raters prefer VISTA in 66.4 percent of trials; average token cost per iteration is about 0.7 million.

Editorial Comments

VISTA is a practical step toward reliable text to video generation: it treats inference as an optimization loop and keeps the generator as a black box. The structured video prompt planning is useful for early engineers, since the 9 scene attributes give a concrete checklist. The pairwise tournament selection with a multimodal LLM judge and bidirectional swap is a sensible way to reduce ordering bias, and the criteria target real failure modes: visual fidelity, physical commonsense, text video alignment, audio video alignment, and engagement. The multi dimensional critiques separate visual, audio, and context, and the normal, adversarial, and meta judges expose weaknesses that single judges miss. The Deep Thinking Prompting Agent turns those diagnostics into targeted prompt edits. The use of Gemini 2.5 Flash and Veo 3 clarifies the reference setup, and the Veo 2 study is a helpful lower bound. The reported 45.9 and 46.3 percent win rates and the 66.4 percent human preference indicate repeatable gains. The roughly 0.7 million token cost per iteration is non trivial, yet transparent and scalable.


Build scalable creative solutions for product teams with Amazon Bedroc …

Creative teams and product developers are constantly seeking ways to streamline their workflows and reduce time to market while maintaining quality and brand consistency. This post demonstrates how to use AWS services, particularly Amazon Bedrock, to transform your creative processes through generative AI. You can implement a secure, scalable solution that accelerates your creative workflow, such as managing product launches, creating marketing campaigns, or developing multimedia content.
This post examines how product teams can deploy a generative AI application that enables rapid content iteration across formats. The solution addresses comprehensive needs—from product descriptions and marketing copy to visual concepts and video content for social media. By integrating with brand guidelines and compliance requirements, teams can significantly reduce time to market while maintaining creative quality and consistency.
Solution overview
Consider a product development team at an ecommerce company creating multimedia marketing campaigns for their seasonal product launches. Their traditional workflow has bottlenecks due to lengthy revisions, manual compliance reviews, and complex coordination across creative teams. The team is exploring solutions to rapidly iterate through creative concepts and generate multiple variations of marketing materials.
By using Amazon Bedrock and Amazon Nova models, the team can transform its creative process. Amazon Nova models enable the generation of product descriptions and marketing copy. The team creates concept visuals and product mockups with Amazon Nova Canvas, and uses Amazon Nova Reel to produce engaging video content for social media presence. Amazon Bedrock Guardrails can help the team maintain consistent brand guidelines with configurable safeguards and governance for its generative AI applications at scale.
The team can further enhance its brand consistency with Amazon Bedrock Knowledge Bases, which can serve as a centralized repository for brand style guides, visual identity documentation, and successful campaign materials. This comprehensive knowledge base makes sure generated content is informed by the organization’s historical success and established brand standards. Product specifications, market research, and approved messaging are seamlessly integrated into the creative process, enabling more relevant and effective content generation.
With this solution, the team can simultaneously develop materials for multiple channels while maintaining consistent brand voice across their content. Creative professionals can now focus their energy on strategic decisions rather than repetitive tasks, leading to higher-quality outputs and improved team satisfaction.
The following sample application creates a scalable environment that streamlines the creative workflow. It helps product teams move seamlessly from initial concept to market-ready materials with automated systems handling compliance and consistency checks throughout the journey.

The solution’s workflow begins with the application engineer’s setup:

Creative assets and brand guidelines are securely stored in encrypted Amazon Simple Storage Service (Amazon S3) buckets. This content is then indexed in Amazon OpenSearch Service to create a comprehensive knowledge base.
Guardrails are configured to enforce brand standards and compliance requirements.

The user experience flows from authentication to content delivery:

Creative team members access the interface through a secure portal hosted in Amazon S3.
Authentication is managed through Amazon Cognito.
Team members’ submitted creative briefs or requirements are routed to Amazon API Gateway.
An AWS Lambda function queries relevant brand guidelines and assets from the knowledge base.
The Lambda function sends the contextual information from the knowledge base to Amazon Bedrock, along with the user’s creative briefs (a minimal code sketch of this step follows the list).
The prompt and generated response are filtered through Amazon Bedrock Guardrails.
Amazon Polly converts text into lifelike speech, generating audio streams that can be played immediately and stored in S3 buckets for later use.
The models’ generated content is delivered to the user.
Chat history is stored in Amazon DynamoDB.
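To make the retrieval and generation steps concrete, here is a minimal, hedged sketch of what such a Lambda function might do with boto3. The knowledge base ID, model ID, and guardrail identifiers are placeholders, and the handler deployed by this solution will differ in its details.

import json
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")
bedrock_runtime = boto3.client("bedrock-runtime")

def handler(event, context):
    brief = event["creativeBrief"]

    # Pull brand guidelines and related assets from the knowledge base.
    retrieved = agent_runtime.retrieve(
        knowledgeBaseId="KB_ID_PLACEHOLDER",
        retrievalQuery={"text": brief},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
    )
    context_text = "\n".join(
        r["content"]["text"] for r in retrieved["retrievalResults"]
    )

    # Generate content from the brief plus retrieved context, filtered by a guardrail.
    response = bedrock_runtime.converse(
        modelId="amazon.nova-pro-v1:0",  # placeholder; choose the model enabled in your account
        messages=[{"role": "user", "content": [{"text": f"{context_text}\n\n{brief}"}]}],
        guardrailConfig={
            "guardrailIdentifier": "GUARDRAIL_ID_PLACEHOLDER",
            "guardrailVersion": "1",
        },
    )
    return {
        "statusCode": 200,
        "body": json.dumps(response["output"]["message"]["content"][0]["text"]),
    }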

Prerequisites
The following prerequisites are required before continuing:

An AWS account
An AWS Identity and Access Management (IAM) role with permission to manage AWS Marketplace subscriptions and AWS services
AWS services:

AWS CloudFormation
Amazon API Gateway
Amazon Cognito
Amazon DynamoDB
Amazon Polly
Amazon S3
Amazon Virtual Private Cloud (Amazon VPC) with two public subnets

Amazon Bedrock models enabled:

Amazon Nova Canvas
Amazon Nova Reel
Amazon Nova Pro
Amazon Nova Lite

Anthropic models (optional):

Anthropic’s Claude 3 Sonnet

Select the Models to Use in Amazon Bedrock
When working with Amazon Bedrock for generative AI applications, one of the first steps is selecting which foundation models you want to access. Amazon Bedrock offers a variety of models from different providers, and you’ll need to explicitly enable the ones used in this post.

In the Amazon Bedrock console, find and select Model access from the navigation menu on the left.
Choose Modify model access to begin selecting your models.
Select the following Amazon models:

Nova Canvas
Nova Premier (cross-Region inference)
Nova Pro
Titan Embeddings G1 – Text
Titan Text Embeddings V2

Select the Anthropic Claude 3.7 Sonnet model.
Choose Next.
Review your selections carefully on the summary page, then choose Submit to confirm your choices.

Set up the CloudFormation template
We use a CloudFormation template to deploy all necessary solution resources. Follow these steps to prepare your installation files:

Clone the GitHub repository:

git clone https://github.com/aws-samples/aws-service-catalog-reference-architectures.git

Navigate to the solution directory:

cd aws-service-catalog-reference-architectures/blog_content/bedrock_genai

(Make note of this location as you’ll need it in the following steps)
Sign in to your AWS account with administrator privileges to ensure you can create all required AWS resources.
Create an S3 bucket in the AWS Region where you plan to deploy this solution. Remember the bucket name for later steps.
Upload the entire content folder to your newly created S3 bucket.
Navigate to the content/genairacer/src folder in your S3 bucket.
Copy the URL for the content/genairacer/src/genairacer_setup.json file. You’ll need this URL for the deployment phase.

Deploy the CloudFormation template
Complete the following steps to use the provided CloudFormation template to automatically create and configure the application components within your AWS account:

On the CloudFormation console, choose Stacks in the navigation pane.
Choose Create stack and select With new resources (standard).
On the Create stack page, under Specify template, for Object URL, enter the URL copied from the previous step, then choose Next.
On the Specify stack details page, enter a stack name.
Under Parameters, choose Next.
On the Configure stack options page, choose Next.
On the Review page, select the acknowledgement check boxes and choose Submit.

Sign in to the Amazon Bedrock generative AI application
Accessing your newly deployed application is simple and straightforward. Follow these steps to log in for the first time and start exploring the Amazon Bedrock generative AI interface.

On the CloudFormation console, select the stack you deployed and select the Outputs tab.
Find the FrontendURL value and open the provided link.
When the sign-in screen displays, enter the username you specified during the CloudFormation deployment process.
Enter the temporary password that was sent to the email address you provided during setup.
After you sign in, follow the prompts to change your password.
Choose Send to confirm your new credentials.

Once authenticated, you’ll be directed to the main Amazon Bedrock generative AI dashboard, where you can begin exploring all the features and capabilities of your new application.
Using the application
Now that the application has been deployed, you can use it for text, image, and audio management. In the following sections, we explore some sample use cases.
Text generation
The creative team at the ecommerce company wants to draft compelling product descriptions. By inputting the basic product features and desired tone, the LLM generates engaging and persuasive text that highlights the unique selling points of each item, making sure the online store’s product pages are both informative and captivating for potential customers.
To use the text generation feature and perform actions with the supported text models using Amazon Bedrock, follow these steps:

On the AWS CloudFormation console, go to the stack you created.
Choose the Outputs tab.
Choose the link for FrontendURL.
Log in using the credentials sent to the email you provided during the stack deployment process.
On the Text tab, enter your desired prompt in the input field.
Choose the specific model ID you want Amazon Bedrock to use from the available options.
Choose Run.

Repeat this process for any additional prompts you want to process.

Image generation
The creative team can now conceptualize and produce stunning product images. By describing the desired scene, style, and product placement, they can enhance the online shopping experience and increase the likelihood of customer engagement and purchase. To use the image generation feature, follow these steps:

In the UI, choose the Images tab.
Enter your desired text-to-image prompt in the input field.
Choose the specific model ID you want Amazon Bedrock to use from the available options.
Optionally, choose the desired style of the image from the provided style options.
Choose Generate Image.

Repeat this process for any additional prompts you want to process.

Audio generation
The ecommerce company’s creative team wants to develop audio content for marketing campaigns. By specifying the message, brand voice, target demographic, and audio components, they can compose scripts and generate voiceovers for promotional videos and audio ads, resulting in consistent and professional audio materials that effectively convey the brand’s message and values. To use the audio generation feature, follow these steps:

In the UI, choose the Audio tab.
Enter your desired prompt in the input field.
Choose Run. An audio file will appear and start to play.
Choose the file (right-click) and choose Save Audio As to save the file.

Amazon Bedrock Knowledge Bases
With Amazon Bedrock Knowledge Bases, you can provide foundation models (FMs) and agents with contextual information from your organization’s private data sources, to deliver more relevant, accurate, and tailored responses. It is a powerful and user-friendly implementation of the Retrieval Augmented Generation (RAG) approach. The application showcased in this post uses the Amazon Bedrock components in the backend, simplifying the process to merely uploading a document using the application’s GUI, and then entering a prompt that will query the documents you upload.
For our example use case, the creative team now needs to research information about internal processes and customer data, which are typically stored in documentation. When this documentation is stored in the knowledge base, they can query it on the KnowledgeBase tab. The queries executed on this tab will search the documents for the specific information they are looking for.
Manage documents
The documents you have uploaded will be listed on the KnowledgeBase tab. To add more, complete the following steps:

In the UI, choose the KnowledgeBase tab.
Choose Manage Document.
Choose Browse, then choose a file.
Choose Upload.

You will see a message confirming that the file was uploaded successfully. The Amazon Bedrock Knowledge Bases syncing process is triggered when the file is uploaded. The application will be ready for queries against the new document within a minute.
Query the knowledge base
To query the knowledge base, complete the following steps:

In the UI, choose the KnowledgeBase tab.
Enter your query in the input field.
For Model, choose the model you want Amazon Bedrock to use for performing the query.
Choose Run.

The generated text response from Amazon Bedrock will appear.
Amazon Bedrock Guardrails
You can use the Guardrails tab to manage your guardrails, and create and remove guardrails as needed. Guardrails are used on the Text tab when performing queries.
Create a guardrail
Complete the following steps to create a new guardrail:

In the UI, choose the Guardrails tab.
Enter the required fields or choose the appropriate options.
Choose the type of guardrail under Content Filter Type.
Choose Create Guardrail.

The newly created guardrail will appear in the right pane.
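If you prefer to script guardrail creation instead of using the Guardrails tab, a roughly equivalent boto3 call might look like the following sketch. The guardrail name, messages, and PII entity choices are placeholder assumptions, not the application's actual configuration.

import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_guardrail(
    name="creative-team-pii-filter",  # placeholder name
    description="Masks PII in prompts and model responses",
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "EMAIL", "action": "ANONYMIZE"},
            {"type": "PHONE", "action": "ANONYMIZE"},
            {"type": "NAME", "action": "ANONYMIZE"},
        ]
    },
    blockedInputMessaging="This request contains restricted information.",
    blockedOutputsMessaging="The response was blocked by the PII guardrail.",
)
print(response["guardrailId"], response["version"])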
Delete a guardrail
Complete the following steps to delete a guardrail:

In the UI, choose the Guardrails tab.
Choose the guardrail you want to delete in the right pane.
Choose the X icon next to the guardrail.

By following these steps, you can effectively manage your guardrails, for a seamless and controlled experience when performing queries in the Text tab.
Use guardrails
The creative team requires access to information about internal processes and customer data, which are securely stored in documentation within the knowledge base. To enforce compliance with personally identifiable information (PII) guardrails, queries executed using the Text tab are designed to search documents for specific, non-sensitive information while preventing the exposure or inclusion of PII in both prompts and answers. This approach helps the team retrieve necessary data without compromising privacy or security standards.
To use the guardrails feature, complete the following steps:

In the UI, choose the Text tab.
Enter your prompt in the input field.
For Model ID, choose the specific model ID you want Amazon Bedrock to use.
Turn on Guardrails.
For Select Filter, choose the guardrail you want to use.
Choose Run.

The generated text from Amazon Bedrock will appear within a few seconds. Repeat this process for any additional prompts you want to process.

Clean up
To avoid incurring costs, delete resources that are no longer needed. If you no longer need the solution, complete the following steps to delete all resources you created from your AWS account:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Select the stack you deployed and choose Delete.

Conclusion
By combining Amazon Bedrock, Knowledge Bases, and Guardrails with Cognito, API Gateway, and Lambda, organizations can give employees powerful AI tools for text, image, and data work. This serverless approach integrates generative AI into daily workflows securely and scalably, boosting productivity and innovation across teams.
For more information about generative AI and Amazon Bedrock, refer to the Amazon Bedrock category in the AWS News Blog.

About the authors
Kenneth Walsh is a Senior AI Acceleration Architect based in New York who transforms AWS builder productivity through innovative generative AI automation tools. With a strategic focus on standardized frameworks, Kenneth accelerates partner adoption of generative AI technologies at scale. As a trusted advisor, he guides customers through their GenAI journeys with both technical expertise and genuine passion. Outside the world of artificial intelligence, Kenneth enjoys crafting culinary creations, immersing himself in audiobooks, and cherishing quality time with his family and dog.
Wanjiko Kahara is a New York–based Solutions Architect with an interest in generative AI. Wanjiko is excited about learning new technology to help her customers be successful. Outside of work, Wanjiko loves to travel, explore the outdoors, and read.
Greg Medard is a Solutions Architect with AWS. Greg guides clients in architecting, designing, and developing cloud-optimized infrastructure solutions. His drive lies in fostering cultural shifts by embracing DevOps principles that overcome organizational hurdles. Beyond work, he cherishes quality time with loved ones, tinkering with the latest tech gadgets, or embarking on adventures to discover new destinations and culinary delights.
Bezuayehu Wate is a Specialist Solutions Architect at AWS, with a focus on big data analytics. Passionate about helping customers design, build, and modernize their cloud-based analytics solutions, she finds joy in learning and exploring new technologies. Outside of work, Bezuayehu enjoys quality time with family and traveling.
Nicole Murray is a generative AI Senior Solutions Architect at AWS, specializing in MLOps and Cloud Operations for AI startups. With 17 years of experience—including helping government agencies design secure, compliant applications on AWS—she now partners with startup founders to build and scale innovative AI/ML solutions. Nicole helps teams navigate secure cloud management, technical strategy, and regulatory best practices in the generative AI space, and is also a passionate speaker and educator known for making complex cloud and AI topics accessible.

Build a proactive AI cost management system for Amazon Bedrock – Par …

In Part 1 of our series, we introduced a proactive cost management solution for Amazon Bedrock, featuring a robust cost sentry mechanism designed to enforce real-time token usage limits. We explored the core architecture, token tracking strategies, and initial budget enforcement techniques that help organizations control their generative AI expenses.
Building upon that foundation, this post explores advanced cost monitoring strategies for generative AI deployments. We introduce granular custom tagging approaches for precise cost allocation, and develop comprehensive reporting mechanisms.
Solution overview
The cost sentry solution introduced in Part 1 was developed as a centralized mechanism to proactively limit generative AI usage to adhere to prescribed budgets. The following diagram illustrates the core components of the solution, adding in cost monitoring through AWS Billing and Cost Management.

Invocation-level tagging for enhanced traceability
Invocation-level tagging extends our solution’s capabilities by attaching rich metadata to every API request, creating a comprehensive audit trail within Amazon CloudWatch logs. This becomes particularly valuable when investigating budget-related decisions, analyzing rate-limiting impacts, or understanding usage patterns across different applications and teams. To support this, the main AWS Step Functions workflow was updated, as illustrated in the following figure.

Enhanced API input
We also evolved the API input to support custom tagging. The new input structure introduces optional parameters for model-specific configurations and custom tagging:

{
  "model": "string",     // e.g., "claude-3" or "anthropic.claude-3-sonnet-20240229-v1:0"
  "prompt": {
    "messages": [
      {
        "role": "string",    // "system", "user", or "assistant"
        "content": "string"
      }
    ],
    "parameters": {
      "max_tokens": number,    // Optional, model-specific defaults
      "temperature": number,   // Optional, model-specific defaults
      "top_p": number,         // Optional, model-specific defaults
      "top_k": number          // Optional, model-specific defaults
    }
  },
  "tags": {
    "applicationId": "string",  // Required
    "costCenter": "string",     // Optional
    "environment": "string"     // Optional – dev/staging/prod
  }
}

The input structure comprises three key components:

model – Maps simple names (for example, claude-3) to full Amazon Bedrock model IDs (for example, anthropic.claude-3-sonnet-20240229-v1:0)
prompt – Provides a messages array for prompts, supporting both single-turn and multi-turn conversations
tags – Supports application-level tracking, with applicationId as the required field and costCenter and environment as optional fields

In this example, we use different cost centers for sales, services, and support to simulate the use of a business attribute to track usage and spend for inference in Amazon Bedrock. For example:

{
  "model": "claude-3-5-haiku",
  "prompt": {
    "messages": [
      {
        "role": "user",
        "content": "Explain the benefits of using S3 using only 100 words."
      },
      {
        "role": "assistant",
        "content": "You are a helpful AWS expert."
      }
    ],
    "parameters": {
      "max_tokens": 2000,
      "temperature": 0.7,
      "top_p": 0.9,
      "top_k": 50
    }
  },
  "tags": {
    "applicationId": "aws-documentation-helper",
    "costCenter": "support",
    "environment": "production"
  }
}

Validation and tagging
A new validation step was added to the workflow for tagging. This step uses an AWS Lambda function to add validation checks and maps the model requested to the specific model ID in Amazon Bedrock. It supplements the tags object with tags that will be required for downstream analysis.
The following code is an example of a simple map to get the appropriate model ID from the model specified:

MODEL_ID_MAPPING = {
    "nova-lite": "amazon.nova-lite-v1:0",
    "nova-micro": "amazon.nova-micro-v1:0",
    "claude-2": "anthropic.claude-v2:0",
    "claude-3-haiku": "anthropic.claude-3-haiku-20240307-v1:0",
    "claude-3-5-sonnet-v2": "us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    "claude-3-5-haiku": "us.anthropic.claude-3-5-haiku-20241022-v1:0"
}

Logging and analysis
By using CloudWatch metrics with custom-generated tags and dimensions, you can track detailed metrics across multiple dimensions such as model type, cost center, application, and environment. Custom tags and dimensions show how teams use AI services. To see this analysis, steps were implemented to generate custom tags, store metric data, and analyze metric data:

We include a unique set of tags that capture contextual information. This can include user-supplied tags as well as ones that are dynamically generated, such as requestId and timestamp:

  "tags": {
    "requestId": "ded98994-eb76-48d9-9dbc-f269541b5e49",
    "timestamp": "2025-01-31T14:05:26.854682",
    "applicationId": "aws-documentation-helper",
    "costCenter": "support",
    "environment": "production"
  }

As each workflow is executed, the limit for each model will be evaluated to make sure the request is within budgetary guidelines. The workflow will end based on three possible outcomes:

Rate limit approved and invocation successful
Rate limit approved and invocation unsuccessful
Rate limit denied
The custom metric data is saved in CloudWatch in the GenAIRateLimiting namespace. This namespace includes the following key metrics:

TotalRequests – Counts every invocation attempt regardless of outcome
RateLimitApproved – Tracks requests that passed rate limiting checks
RateLimitDenied – Tracks requests blocked by rate limiting
InvocationFailed – Counts requests that failed during model invocation
InputTokens – Measures input token consumption for successful requests
OutputTokens – Measures output token consumption for successful requests
Each metric includes dimensions for Model, ModelId, CostCenter, Application, and Environment for data analysis.
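As a rough illustration of how a workflow step could publish one of these custom metrics, a minimal put_metric_data sketch follows. The helper name and tag values are examples, not the solution's actual code.

import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client("cloudwatch")

def record_outcome(metric_name: str, value: float, tags: dict) -> None:
    # Map the request tags onto the dimensions listed above.
    dimensions = [
        {"Name": "Model", "Value": tags.get("model", "claude-3-5-haiku")},
        {"Name": "ModelId", "Value": tags.get("modelId", "us.anthropic.claude-3-5-haiku-20241022-v1:0")},
        {"Name": "CostCenter", "Value": tags["costCenter"]},
        {"Name": "Application", "Value": tags["applicationId"]},
        {"Name": "Environment", "Value": tags["environment"]},
    ]
    cloudwatch.put_metric_data(
        Namespace="GenAIRateLimiting",
        MetricData=[{
            "MetricName": metric_name,   # e.g., "RateLimitApproved" or "InputTokens"
            "Dimensions": dimensions,
            "Timestamp": datetime.now(timezone.utc),
            "Value": value,
            "Unit": "Count",
        }],
    )

# Example: record an approved request for the support cost center.
record_outcome("RateLimitApproved", 1, {
    "applicationId": "aws-documentation-helper",
    "costCenter": "support",
    "environment": "production",
})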
We use CloudWatch metrics query capabilities with math expressions to analyze the data collected by the workflow. The data can be displayed in a variety of visual formats to get a granular view of requests by the dimensions provided, such as model or cost center. The following screenshot shows an example dashboard that displays invocation metrics where one model has reached its limit.

Additional Amazon Bedrock analytics
In addition to the custom metrics dashboard, CloudWatch provides automatic dashboards for monitoring Amazon Bedrock performance and usage. The Bedrock dashboard offers visibility into key performance metrics and operational insights, as shown in the following screenshot.

Cost tagging and reporting
Amazon Bedrock has introduced application inference profiles, a new capability that organizations can use to apply custom cost allocation tags to track and manage their on-demand foundation model (FM) usage. This feature addresses a previous limitation where tagging wasn’t possible for on-demand FMs, making it difficult to track costs across different business units and applications. You can now create custom inference profiles for base FMs and apply cost allocation tags like department, team, and application identifiers. These tags integrate with AWS cost management tools including AWS Cost Explorer, AWS Budgets, and AWS Cost Anomaly Detection, enabling detailed cost analysis and budget control.
Application inference profiles
To start, you must create application inference profiles for each type of usage you want to track. In this case, the solution defines custom tags for costCenter, environment, and applicationId. An inference profile will also be based on an existing Amazon Bedrock model profile, so you must combine the desired tags and model into the profile. At the time of writing, you must use the AWS Command Line Interface (AWS CLI) or AWS API to create one. See the following example code:

aws bedrock create-inference-profile \
  --inference-profile-name "aws-docs-sales-prod" \
  --model-source '{"copyFrom": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"}' \
  --tags '[
    {"key": "applicationId", "value": "aws-documentation-helper"},
    {"key": "costCenter", "value": "sales"},
    {"key": "environment", "value": "production"}
  ]'

This command creates a profile for the sales cost center and production environment using Anthropic’s Claude 3 Haiku model. The output from this command is an Amazon Resource Name (ARN) that you will use as the model ID. In this solution, the ValidateAndSetContext Lambda function was modified to allow for specifying the model by cost center (for example, sales). To see which profiles you created, use the following command:
aws bedrock list-inference-profiles --type-equals APPLICATION
After the profiles have been created and the validation has been updated to map cost centers to the profile ARNs, the workflow will start running inference requests with the aligned profile. For example, when the user submits a request, they will specify the model as sales, services, or support to align with the three cost centers defined. The following code is a similar map to the previous example:

MODEL_ID_MAPPING = {
    "sales": "arn:aws:bedrock:<region>:<account>:application-inference-profile/<unique id1>",
    "services": "arn:aws:bedrock:<region>:<account>:application-inference-profile/<unique id2>",
    "support": "arn:aws:bedrock:<region>:<account>:application-inference-profile/<unique id3>"
}

To query CloudWatch metrics for the model usage correctly when using application inference profiles, you must specify the unique ID for the profile (the last part of the ARN). CloudWatch will store metrics like token usage based on the unique ID. To support both profile and direct model usage, the Lambda function was modified to add a new tag for modelMetric to be the appropriate term to use to query for token usage. See the following code:

  "tags": {
    "requestId": "ded98994-eb76-48d9-9dbc-f269541b5e49",
    "timestamp": "2025-01-31T14:05:26.854682",
    "applicationId": "aws-documentation-helper",
    "costCenter": "support",
    "environment": "production",
    "modelMetric": "<unique id> | <model id>"
  }

Cost Explorer
Cost Explorer is a powerful cost management tool that provides comprehensive visualization and analysis of your cloud spending across AWS services, including Amazon Bedrock. It offers intuitive dashboards to track historical costs, forecast future expenses, and gain insights into your cloud consumption. With Cost Explorer, you can break down expenses by service, tags, and custom dimensions, for detailed financial analysis. The tool updates on a daily basis.
When you use application inference profiles with Amazon Bedrock, your AI service usage is automatically tagged and flows directly into Billing and Cost Management. These tags enable detailed cost tracking across different dimensions like cost center, application, and environment. This means you can generate reports that break down Amazon Bedrock AI expenses by specific business units, projects, or organizational hierarchies, providing clear visibility into your generative AI spending.
Cost allocation tags
Cost allocation tags are key-value pairs that help you categorize and track AWS resource costs across your organization. In the context of Amazon Bedrock, these tags can include attributes like application name, cost center, environment, or project ID. To activate a cost allocation tag, you must first enable it on the Billing and Cost Management console. After they’re activated, these tags will appear in your AWS Cost and Usage Report (CUR), helping you break down Amazon Bedrock expenses with granular detail.
To activate a cost allocation tag, complete the following steps:

On the Billing and Cost Management console, in the navigation pane, choose Cost Allocation Tags.
Locate your tag (for this example, it’s named costCenter) and choose Activate.
Confirm the activation.

After activation, the costCenter tag will appear in your CUR and will be used in Cost Explorer. It might take 24 hours for the tag to become fully active in your billing reports.

Cost Explorer reporting
To create an Amazon Bedrock usage report in Cost Explorer based on your tag, complete the following steps:

On the Billing and Cost Management console, choose Cost Explorer in the navigation pane.
Set your desired date range (relative time range or custom period).
Select Daily or Monthly granularity.
On the Group by dropdown menu, choose Tag.
Choose costCenter as the tag key.
Review the displayed Amazon Bedrock costs broken down by each unique cost center value.
Optionally, filter the values by applying a filter in the Filters section:

Choose Tag filter.
Choose the costCenter tag.
Choose specific cost center values you want to analyze.

The resulting report will provide a detailed view of Amazon Bedrock AI service expenses, helping you compare spending across different organizational units or projects with precision.

Summary
The AWS Cost and Usage Reports (including budgets) act as trailing edge indicators because they show what you’ve already spent on Amazon Bedrock after the fact. By blending real-time alerts from Step Functions with comprehensive cost reports, you can get a 360-degree view of your Amazon Bedrock usage. This reporting can alert you before you overspend and help you understand your actual consumption. This approach gives you the power to manage AI resources proactively, keeping your innovation budget on track and your projects running smoothly.
Try out this cost management approach for your own use case, and share your feedback in the comments.

About the Author
Jason Salcido is a Startups Senior Solutions Architect with nearly 30 years of experience pioneering innovative solutions for organizations from startups to enterprises. His expertise spans cloud architecture, serverless computing, machine learning, generative AI, and distributed systems. Jason combines deep technical knowledge with a forward-thinking approach to design scalable solutions that drive value, while translating complex concepts into actionable strategies.

Build a proactive AI cost management system for Amazon Bedrock – Par …

As organizations embrace generative AI powered by Amazon Bedrock, they face the challenge of managing costs associated with the token-based pricing model. Amazon Bedrock offers a pay-as-you-go pricing structure that can potentially lead to unexpected and excessive bills if usage is not carefully monitored. Traditional methods of cost monitoring, such as budget alerts and cost anomaly detection, can help spot unexpectedly high usage but are reactive in nature. To address costs proactively, it is vital to use both leading and trailing indicators.
Leading indicators are predictive signals that help you anticipate future trends or potential issues before they fully materialize. These indicators provide proactive insights that allow for timely intervention. In contrast, trailing indicators are retrospective measurements that confirm what has already occurred. By understanding and tracking both types of indicators, organizations can develop more strategic and responsive decision-making processes.
In this two-part series, we introduce a comprehensive solution for proactively managing Amazon Bedrock inference costs. Our approach features a cost sentry mechanism designed to establish and enforce token usage limits, providing organizations with a robust framework for controlling generative AI expenses. In this post, we focus on core architecture, cost sentry design, token usage tracking, and initial budget enforcement strategies. In Part 2, we explore advanced monitoring techniques, custom tagging, reporting, and long-term cost optimization best practices. The goal is to deliver a predictable, cost-effective approach to Amazon Bedrock deployments that aligns with organizational financial constraints.
Solution overview
Amazon Bedrock is billed on a token usage-based policy with charges based on the input and output tokens used. The rate charged depends on the model used and AWS Region where inference is performed. Developers must implement robust token management strategies in their applications to help prevent runaway costs, making sure generative AI applications include circuit breakers and consumption limits that align with budgetary constraints.
To address this, you can configure Amazon CloudWatch alarms or monitor costs with billing alerts and budgets, but these mechanisms look at incurred costs or usage after the fact. Another option is the Generative AI Gateway Solution in the AWS Solutions Library, which uses LiteLLM to enforce budgetary limits for Amazon Bedrock and other model providers.
This solution was developed to identify a proactive, centralized mechanism that could limit the generative AI usage to a specific budget that can be adjusted. This approach uses serverless workflows and native Amazon Bedrock integration that offers less operational complexity while providing large-scale performance and scaling.
When building applications with Amazon Bedrock, it is common practice to access the service through a developed API, either synchronously through a REST API or asynchronously through a queuing system. The following diagram compares these architectures.

For synchronous interactions, clients make direct REST API calls to Amazon Bedrock, passing in the necessary parameters. In an asynchronous architecture, clients submit inference requests to a queue or message broker, such as Amazon Simple Queue Service (Amazon SQS). A backend processing system, often implemented as a serverless function or a containerized application, continuously monitors the queue and processes incoming requests. This approach decouples the client from the inference processing, enabling scalability and resilience in handling bursts of requests.
This solution is a centralized mechanism that can be used to interact with Amazon Bedrock to serve as a proactive cost sentry. It is designed using a serverless architecture that uses AWS Step Functions to orchestrate a workflow that validates token usage against configured limits before allowing Amazon Bedrock inference requests to proceed. This solution makes sure that generative AI applications stay within predefined budgetary boundaries, providing cost predictability and control.
The following diagram illustrates the architecture we build in this post.

The core components of this solution include:

Rate limiter workflow – A Step Functions workflow that retrieves current token usage metrics from CloudWatch, compares them against predefined limits stored in Amazon DynamoDB, and determines whether to proceed with or deny the Amazon Bedrock inference request.
Amazon Bedrock model router – A separate Step Functions state machine that acts as a centralized gateway for invoking various Amazon Bedrock models. This component abstracts the complexity of handling different I/O parameters required by each model.
Token usage tracking – Uses CloudWatch metrics integration with Amazon Bedrock to retrieve current token usage data for input and output tokens across all or specific models.
Budget configuration – Allows setting token usage limits on a per-model basis by storing the desired budget values in DynamoDB. A default limit can also be set to apply to models without specific budgets defined.
Cost and usage visibility – Provides visibility for AI usage with CloudWatch dashboards and cost over time reporting in AWS Cost Explorer.

The solution follows a serverless architecture approach, using managed AWS services like Step Functions, AWS Lambda, DynamoDB, and CloudWatch to provide a scalable, extensible, and cost-effective implementation.
The goal is to provide a proactive method of setting generative AI usage limits that operate as a leading indicator to limit usage:

Proactive budgeting – Enforces token usage limits before allowing inference requests, helping prevent accidental overspending
Model-specific budgets – Supports setting individual budgets for different Amazon Bedrock models based on their pricing and usage patterns
Default budget fallback – If no specific budget is defined for a model, a default limit can be applied to enable cost control
Monitoring – Uses CloudWatch metrics integration to track token usage, enabling accurate budget enforcement
Serverless architecture – Uses Step Functions, Lambda, DynamoDB, and CloudWatch for a scalable and cost-effective solution
Extensibility – The modular design allows for seamless integration of additional Amazon Bedrock models or alternative inference methods

Step Functions workflows
In this section, we explore how the solution uses Step Functions to implement rate limiting and model routing workflows.
Rate limiting workflow
The rate limiting workflow is designed to take a minimal JSON document as input with the following format:

{
  "modelId": "string",       // e.g., "anthropic.claude-3-sonnet-20240229-v1:0"
  "prompt": {
    "messages": [
      {
        "role": "string",    // "system", "user", or "assistant"
        "content": "string"
      }
    ]
  }
}

This workflow is the core component that enforces budgetary controls. The key steps are as follows:

A Lambda function retrieves the start and end dates for the current month, which is used to query token usage metrics for the appropriate time range.
The workflow queries CloudWatch to retrieve the current month’s token usage metrics for the specified Amazon Bedrock model.
The workflow retrieves the configured token usage limit for the specified Amazon Bedrock model from DynamoDB. If no specific limit is found, it falls back to retrieving the default limit.
The workflow compares the current token usage against the configured limit to determine if the budget has been exceeded or not.
If the token usage is within the budget, this step invokes the Amazon Bedrock model router state machine to perform the actual inference request.
Depending on the outcome of the budget check, the workflow returns either the formatted inference result or an error indicating that the budget has been exceeded.

The following diagram illustrates the Step Functions workflow.
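For readers who want to reason about the budget check outside of Step Functions, the following is a rough, standalone boto3 sketch of the same logic. The table name BedrockTokenLimits and its key names are placeholder assumptions; the actual solution implements these steps as Step Functions states and Lambda functions.

import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client("cloudwatch")
limits_table = boto3.resource("dynamodb").Table("BedrockTokenLimits")  # placeholder name

def month_to_date_usage(model_id: str, metric_name: str) -> float:
    # Sum the Amazon Bedrock token metric for the current month.
    now = datetime.now(timezone.utc)
    start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Bedrock",
        MetricName=metric_name,            # "InputTokenCount" or "OutputTokenCount"
        Dimensions=[{"Name": "ModelId", "Value": model_id}],
        StartTime=start,
        EndTime=now,
        Period=2592000,                    # one bucket covering the whole month
        Statistics=["Sum"],
    )
    return sum(dp["Sum"] for dp in stats["Datapoints"])

def within_budget(model_id: str) -> bool:
    # Look up the model-specific limit, falling back to a default record.
    item = limits_table.get_item(Key={"modelId": model_id}).get("Item")
    if item is None:
        item = limits_table.get_item(Key={"modelId": "default"}).get("Item", {})
    limit = item.get("limit", {"input": 0, "output": 0})
    return (
        month_to_date_usage(model_id, "InputTokenCount") < limit["input"]
        and month_to_date_usage(model_id, "OutputTokenCount") < limit["output"]
    )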

Amazon Bedrock model router workflow
The Amazon Bedrock model router workflow is a separate Step Functions state machine responsible for invoking the appropriate Amazon Bedrock model based on the request parameters. It abstracts the complexity of handling different I/O formats required by various Amazon Bedrock models and combines the result into a standardized format.
The key steps in the workflow include:

Based on the provided model ID, the workflow determines the specific Amazon Bedrock model to be invoked.
The workflow calls the appropriate Amazon Bedrock model with the required input parameters.
The workflow normalizes the output from the Amazon Bedrock model to a consistent format for further processing or returning to the client.
The workflow returns the transformed inference result to the calling workflow (budget sentry workflow).

The following diagram illustrates the Step Functions workflow.

You can implement additional steps to handle error conditions and format the output appropriately. In this example, the Anthropic flow includes error processing.
Token usage tracking with CloudWatch metrics
The Amazon Bedrock cost sentry uses the CloudWatch integration with Amazon Bedrock to retrieve current token usage metrics. These metrics are used to enforce budgetary limits proactively. For example, see the following query:

{
    "sparkline": false,
    "metrics": [
        [ { "expression": "SEARCH('{AWS/Bedrock} MetricName=\"InputTokenCount\"', 'Sum', 60)", "region": "us-east-1" } ],
        [ { "expression": "SEARCH('{AWS/Bedrock} MetricName=\"OutputTokenCount\"', 'Sum', 60)", "region": "us-east-1" } ]
    ],
    "legend": {
        "position": "right"
    },
    "title": "InputTokenCount, OutputTokenCount",
    "region": "us-east-1",
    "liveData": true,
    "view": "gauge",
    "stacked": false,
    "period": 2592000,
    "table": {
        "summaryColumns": [
            "SUM"
        ]
    },
    "yAxis": {
        "left": {
            "min": 0,
            "max": 1000000
        }
    },
    "setPeriodToTimeRange": true,
    "trend": false,
    "startTime": "2024-05-01T00:00:00Z",
    "endTime": "2024-05-30T23:59:59Z"
}

This CloudWatch metric query retrieves the total input and output token counts for a specified time range, allowing the rate limiter workflow to accurately enforce budgets based on real-time usage data.
Budget configuration with DynamoDB
The Amazon Bedrock cost sentry stores token usage limits in a DynamoDB table, providing seamless configuration and updates to individual model budgets or the default limit. For example, see the following code:

{
    "modelId": "anthropic.claude-3-sonnet-20240229-v1:0",
    "limit": {
        "input": 1000000,
        "output": 3000000
    }
}

In this example, the token usage limit for the specified Amazon Bedrock model (anthropic.claude-3-sonnet-20240229-v1:0) is set to 1,000,000 input tokens and 3,000,000 output tokens.
Administrators can quickly update these limits by modifying the corresponding DynamoDB records, providing flexibility in adjusting budgets as needed.
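For example, an administrator could update a model's limit with a short boto3 call like the following; the table name BedrockTokenLimits is an assumption used here for illustration.

import boto3

# Hypothetical table name; the deployed stack defines its own.
table = boto3.resource("dynamodb").Table("BedrockTokenLimits")

table.put_item(Item={
    "modelId": "anthropic.claude-3-sonnet-20240229-v1:0",
    "limit": {"input": 1000000, "output": 3000000},
})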
Performance analysis of the rate limiter workflow
To assess the performance impact of introducing the workflow, we used an array of inference requests. Test cases included various prompts designed to generate responses ranging from concise answers to detailed explanations over 500 words, effectively testing the workflow’s performance across different output token sizes. The workflow demonstrated exceptional performance characteristics across 501 successful executions, handling a diverse set of inference requests from brief responses to extensive content generation.
The workflow maintains consistent execution patterns while processing requests ranging from 6.76 seconds to 32.24 seconds in total duration, with the variation primarily reflecting the different output token requirements of each request:

Quick responses (under 10 seconds) – Typically handling concise answers and simple queries
Medium-length content (11–22 seconds) – Common for detailed explanations and multi-paragraph responses
Extended generation (up to 32 seconds) – Handling comprehensive responses requiring more than 500 words

The following diagram illustrates our time distribution findings.

The time distribution analysis reveals highly optimized resource utilization:

Amazon Bedrock model router – 5.80–31.99 seconds (98.26% of runtime)
Other workflow steps – 0.11–4.74 seconds (1.65% of runtime)
System overhead – 0.02 seconds average (0.09% of runtime)

This performance profile aligns with best practices for workflow orchestration, where minimizing overhead and maintaining consistent execution patterns are crucial for reliability. The workflow’s efficiency is evidenced by its remarkably low system overhead of just 0.09%, demonstrating effective use of the built-in controls and state management capabilities of Step Functions regardless of the response size being generated.
The execution consistency is particularly noteworthy, with a predictable event pattern of 47–49 events per execution, regardless of the inference request complexity or output size. This predictability is essential for workload management and resource planning, especially when handling varied request complexities and token outputs.
These metrics indicate a well-architected workflow that effectively uses Step Functions Express workflow capabilities for high-volume event processing while maintaining minimal overhead and consistent performance characteristics across both simple queries and complex, token-intensive inference requests.
Cost analysis
To analyze the cost implications, estimates were generated using the AWS Pricing Calculator for both Standard and Express Step Functions workflows, assuming 100,000 requests per month. The following table summarizes these estimates.

Detailed estimate:

| Region | Description | Service | Upfront | Monthly | First 12 Months Total | Currency | Configuration Summary |
|---|---|---|---|---|---|---|---|
| US East (Ohio) | Step Functions Standard | Step Functions – Standard Workflows | 0 | $37.40 | $448.80 | USD | Workflow requests (100,000 per month); state transitions per workflow (15) |
| US East (Ohio) | Step Functions Express | Step Functions – Express Workflows | 0 | $3.75 | $45 | USD | Duration of each workflow (35,000); memory consumed by each workflow (64 MB); workflow requests (100,000 per month) |

The cost analysis revealed that the Step Functions Express workflow offers a more cost-effective solution compared to the Standard workflow, with potential cost savings of up to 90% for the same workload. There is a potential for cost reduction for Standard if the number of steps can be optimized. For example, a few formatting pass steps could potentially be removed, but these steps help format the downstream input to later steps.
Consult the AWS Pricing Calculator for more details on pricing and to run your own scenario.
Conclusion
In this solution, we used Step Functions to build a system that serves as a leading indicator because it tracks rate limiting and token usage, warning us immediately when we’re approaching our usage limits. In Part 2, we discuss combining this with trailing indicators to stay aware of usage and costs.

About the author
Jason Salcido is a Startups Senior Solutions Architect with nearly 30 years of experience pioneering innovative solutions for organizations from startups to enterprises. His expertise spans cloud architecture, serverless computing, machine learning, generative AI, and distributed systems. Jason combines deep technical knowledge with a forward-thinking approach to design scalable solutions that drive value, while translating complex concepts into actionable strategies.

Serverless deployment for your Amazon SageMaker Canvas models

Deploying machine learning (ML) models into production can often be a complex and resource-intensive task, especially for customers without deep ML and DevOps expertise. Amazon SageMaker Canvas simplifies model building by offering a no-code interface, so you can create highly accurate ML models using your existing data sources and without writing a single line of code. But building a model is only half the journey; deploying it efficiently and cost-effectively is just as crucial. Amazon SageMaker Serverless Inference is designed for workloads with variable traffic patterns and idle periods. It automatically provisions and scales infrastructure based on demand, alleviating the need to manage servers or pre-configure capacity.
In this post, we walk through how to take an ML model built in SageMaker Canvas and deploy it using SageMaker Serverless Inference. This solution can help you go from model creation to production-ready predictions quickly, efficiently, and without managing any infrastructure.
Solution overview
To demonstrate serverless endpoint creation for a SageMaker Canvas trained model, let’s explore an example workflow:

Add the trained model to the Amazon SageMaker Model Registry.
Create a new SageMaker model with the correct configuration.
Create a serverless endpoint configuration.
Deploy the serverless endpoint with the created model and endpoint configuration.

You can also automate the process, as illustrated in the following diagram.

In this example, we deploy a pre-trained classification model to a serverless SageMaker endpoint. This way, we can use the model for variable workloads that don’t need a dedicated, always-on real-time endpoint.
Prerequisites
As a prerequisite, you must have access to Amazon Simple Storage Service (Amazon S3) and Amazon SageMaker AI. If you don’t already have a SageMaker AI domain configured in your account, you also need permissions to create a SageMaker AI domain.
You must also have a regression or classification model that you have trained. You can train your SageMaker Canvas model as you normally would. This includes creating the Amazon SageMaker Data Wrangler flow, performing necessary data transformations, and choosing the model training configuration. If you don’t already have a trained model, you can follow one of the labs in the Amazon SageMaker Canvas Immersion Day to create one before continuing. For this example, we use a classification model that was trained on the canvas-sample-shipping-logs.csv sample dataset.
Save your model to the SageMaker Model Registry
Complete the following steps to save your model to the SageMaker Model Registry:

On the SageMaker AI console, choose Studio to launch Amazon SageMaker Studio.
In the SageMaker Studio interface, launch SageMaker Canvas, which will open in a new tab.

Locate the model and model version that you want to deploy to your serverless endpoint.
On the options menu (three vertical dots), choose Add to Model Registry.

You can now exit SageMaker Canvas by logging out. To manage costs and prevent additional workspace charges, you can also configure SageMaker Canvas to automatically shut down when idle.
Approve your model for deployment
After you have added your model to the Model Registry, complete the following steps:

In the SageMaker Studio UI, choose Models in the navigation pane.

The model you just exported from SageMaker Canvas should be added with a deployment status of Pending manual approval.

Choose the model version you want to deploy and update the status to Approved by choosing the deployment status.

Choose the model version and navigate to the Deploy tab. This is where you will find the information related to the model and associated container.
Select the container and model location related to the trained model. You can identify it by checking the presence of the environment variable SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT.

Create a new model
Complete the following steps to create a new model:

Without closing the SageMaker Studio tab, open a new tab and open the SageMaker AI console.
Choose Models in the Inference section and choose Create model.
Name your model.
Leave the container input option as Provide model artifacts and inference image location and use the CompressedModel type.
Enter the Amazon Elastic Container Registry (Amazon ECR) URI, Amazon S3 URI, and environment variables that you located in the previous step.

The environment variables will be shown as a single line in SageMaker Studio, with the following format:

SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT: text/csv, SAGEMAKER_INFERENCE_OUTPUT: predicted_label, SAGEMAKER_INFERENCE_SUPPORTED: predicted_label, SAGEMAKER_PROGRAM: tabular_serve.py, SAGEMAKER_SUBMIT_DIRECTORY: /opt/ml/model/code

You might have different variables than those in the preceding example. Add all of the environment variables you found to your model, and make sure each environment variable is on its own line when creating your new model.

Choose Create model.

Create an endpoint configuration
Complete the following steps to create an endpoint configuration:

On the SageMaker AI console, choose Endpoint configurations to create a new model endpoint configuration.
Set the type of endpoint to Serverless and set the model variant to the model created in the previous step.

Choose Create endpoint configuration.

Create an endpoint
Complete the following steps to create an endpoint:

On the SageMaker AI console, choose Endpoints in the navigation pane and create a new endpoint.
Name the endpoint.
Select the endpoint configuration created in the previous step and choose Select endpoint configuration.
Choose Create endpoint.
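If you prefer to script these last two steps, the following boto3 sketch creates an equivalent serverless endpoint configuration and endpoint. The model name, endpoint names, memory size, and concurrency are placeholder values to replace with your own:

import boto3

sagemaker = boto3.client("sagemaker")

model_name = "canvas-shipping-data-model-1"  # placeholder: the model created in the previous step
endpoint_config_name = f"{model_name}-serverless-config"
endpoint_name = f"{model_name}-serverless-endpoint"

# Serverless endpoint configuration; memory size and max concurrency are example values
sagemaker.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": 2048,
            "MaxConcurrency": 5,
        },
    }],
)

# Create the endpoint from the configuration
sagemaker.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

# Wait until the endpoint is InService before invoking it
waiter = sagemaker.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)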

The endpoint might take a few minutes to be created. When the status is updated to InService, you can begin calling the endpoint.
The following sample code demonstrates how you can call an endpoint from a Jupyter notebook located in your SageMaker Studio environment:

import boto3
import csv
from io import StringIO
from ast import literal_eval
import time

def invoke_shipping_prediction(features):
    sagemaker_client = boto3.client('sagemaker-runtime')

    # Convert to CSV string format
    output = StringIO()
    csv.writer(output).writerow(features)
    payload = output.getvalue()

    response = sagemaker_client.invoke_endpoint(
        EndpointName='canvas-shipping-data-model-1-serverless-endpoint',
        ContentType='text/csv',
        Accept='text/csv',
        Body=payload
    )

    response_body = response['Body'].read().decode()
    reader = csv.reader(StringIO(response_body))
    result = list(reader)[0]  # Get first row

    # Parse the response into a more usable format
    # literal_eval safely parses the list-formatted probability and label fields
    prediction = {
        'predicted_label': result[0],
        'confidence': float(result[1]),
        'class_probabilities': literal_eval(result[2]),
        'possible_labels': literal_eval(result[3])
    }

    return prediction

# Features for inference
features_set_1 = [
    "Bell",
    "Base",
    14,
    6,
    11,
    11,
    "GlobalFreight",
    "Bulk Order",
    "Atlanta",
    "2020-09-11 00:00:00",
    "Express",
    109.25199890136719
]

features_set_2 = [
    "Bell",
    "Base",
    14,
    6,
    15,
    15,
    "MicroCarrier",
    "Single Order",
    "Seattle",
    "2021-06-22 00:00:00",
    "Standard",
    155.0483856201172
]

# Invoke the SageMaker endpoint for feature set 1
start_time = time.time()
result = invoke_shipping_prediction(features_set_1)

# Print output and timing
end_time = time.time()
total_time = end_time - start_time

print(f"Total response time with endpoint cold start: {total_time:.3f} seconds")
print(f"Prediction for feature set 1: {result['predicted_label']}")
print(f"Confidence for feature set 1: {result['confidence']*100:.2f}%")
print("\nProbabilities for feature set 1:")
for label, prob in zip(result['possible_labels'], result['class_probabilities']):
    print(f"{label}: {prob*100:.2f}%")

print("---------------------------------------------------------")

# Invoke the SageMaker endpoint for feature set 2
start_time = time.time()
result = invoke_shipping_prediction(features_set_2)

# Print output and timing
end_time = time.time()
total_time = end_time - start_time

print(f"Total response time with warm endpoint: {total_time:.3f} seconds")
print(f"Prediction for feature set 2: {result['predicted_label']}")
print(f"Confidence for feature set 2: {result['confidence']*100:.2f}%")
print("\nProbabilities for feature set 2:")
for label, prob in zip(result['possible_labels'], result['class_probabilities']):
    print(f"{label}: {prob*100:.2f}%")

Automate the process
To automatically create serverless endpoints each time a new model is approved, you can use the following YAML file with AWS CloudFormation. This file will automate the creation of SageMaker endpoints with the configuration you specify.
This sample CloudFormation template is provided solely for inspirational purposes and is not intended for direct production use. Developers should thoroughly test this template according to their organization’s security guidelines before deployment.

AWSTemplateFormatVersion: "2010-09-09"
Description: Template for creating Lambda function to handle SageMaker model
  package state changes and create serverless endpoints

Parameters:
  MemorySizeInMB:
    Type: Number
    Default: 1024
    Description: Memory size in MB for the serverless endpoint (between 1024 and 6144)
    MinValue: 1024
    MaxValue: 6144

  MaxConcurrency:
    Type: Number
    Default: 20
    Description: Maximum number of concurrent invocations for the serverless endpoint
    MinValue: 1
    MaxValue: 200

  AllowedRegion:
    Type: String
    Default: "us-east-1"
    Description: AWS region where SageMaker resources can be created

  AllowedDomainId:
    Type: String
    Description: SageMaker Studio domain ID that can trigger deployments
    NoEcho: true

  AllowedDomainIdParameterName:
    Type: String
    Default: "/sagemaker/serverless-deployment/allowed-domain-id"
    Description: SSM Parameter name containing the SageMaker Studio domain ID that can trigger deployments

Resources:
  AllowedDomainIdParameter:
    Type: AWS::SSM::Parameter
    Properties:
      Name: !Ref AllowedDomainIdParameterName
      Type: String
      Value: !Ref AllowedDomainId
      Description: SageMaker Studio domain ID that can trigger deployments

  SageMakerAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: Managed policy for SageMaker serverless endpoint creation
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action:
              - sagemaker:CreateModel
              - sagemaker:CreateEndpointConfig
              - sagemaker:CreateEndpoint
              - sagemaker:DescribeModel
              - sagemaker:DescribeEndpointConfig
              - sagemaker:DescribeEndpoint
              - sagemaker:DeleteModel
              - sagemaker:DeleteEndpointConfig
              - sagemaker:DeleteEndpoint
            Resource: !Sub "arn:aws:sagemaker:${AllowedRegion}:${AWS::AccountId}:*"
          - Effect: Allow
            Action:
              - sagemaker:DescribeModelPackage
            Resource: !Sub "arn:aws:sagemaker:${AllowedRegion}:${AWS::AccountId}:model-package/*/*"
          - Effect: Allow
            Action:
              - iam:PassRole
            Resource: !Sub "arn:aws:iam::${AWS::AccountId}:role/service-role/AmazonSageMaker-ExecutionRole-*"
            Condition:
              StringEquals:
                "iam:PassedToService": "sagemaker.amazonaws.com"
          - Effect: Allow
            Action:
              - ssm:GetParameter
            Resource: !Sub "arn:aws:ssm:${AllowedRegion}:${AWS::AccountId}:parameter${AllowedDomainIdParameterName}"

  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
        - !Ref SageMakerAccessPolicy

  ModelDeploymentFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        ZipFile: |
          import os
          import json
          import boto3

          sagemaker_client = boto3.client('sagemaker')
          ssm_client = boto3.client('ssm')

          def handler(event, context):
              print(f"Received event: {json.dumps(event, indent=2)}")
              try:
                  # Get details directly from the event
                  detail = event['detail']
                  print(f'detail: {detail}')

                  # Get allowed domain ID from SSM Parameter Store
                  parameter_name = os.environ.get('ALLOWED_DOMAIN_ID_PARAMETER_NAME')
                  try:
                      response = ssm_client.get_parameter(Name=parameter_name)
                      allowed_domain = response['Parameter']['Value']
                  except Exception as e:
                      print(f"Error retrieving parameter {parameter_name}: {str(e)}")
                      allowed_domain = '*'  # Default fallback

                  # Check if domain ID is allowed
                  if allowed_domain != '*':
                      created_by_domain = detail.get('CreatedBy', {}).get('DomainId')
                      if created_by_domain != allowed_domain:
                          print(f"Domain {created_by_domain} not allowed. Allowed: {allowed_domain}")
                          return {'statusCode': 403, 'body': 'Domain not authorized'}

                  # Get the model package ARN from the event resources
                  model_package_arn = event['resources'][0]

                  # Get the model package details from SageMaker
                  model_package_response = sagemaker_client.describe_model_package(
                      ModelPackageName=model_package_arn
                  )

                  # Parse model name and version from ModelPackageName
                  model_name, version = detail['ModelPackageName'].split('/')
                  serverless_model_name = f"{model_name}-{version}-serverless"

                  # Get all container details directly from the event
                  container_defs = detail['InferenceSpecification']['Containers']

                  # Get the execution role from the event and convert to proper IAM role ARN format
                  assumed_role_arn = detail['CreatedBy']['IamIdentity']['Arn']
                  execution_role_arn = (
                      assumed_role_arn.replace(':sts:', ':iam:')
                      .replace('assumed-role', 'role/service-role')
                      .rsplit('/', 1)[0]
                  )

                  # Prepare containers configuration for the model
                  containers = []
                  for i, container_def in enumerate(container_defs):
                      # Get environment variables from the model package for this container
                      environment_vars = model_package_response['InferenceSpecification']['Containers'][i].get('Environment', {}) or {}

                      containers.append({
                          'Image': container_def['Image'],
                          'ModelDataUrl': container_def['ModelDataUrl'],
                          'Environment': environment_vars
                      })

                  # Create model with all containers
                  if len(containers) == 1:
                      # Use PrimaryContainer if there's only one container
                      create_model_response = sagemaker_client.create_model(
                          ModelName=serverless_model_name,
                          PrimaryContainer=containers[0],
                          ExecutionRoleArn=execution_role_arn
                      )
                  else:
                      # Use Containers parameter for multiple containers
                      create_model_response = sagemaker_client.create_model(
                          ModelName=serverless_model_name,
                          Containers=containers,
                          ExecutionRoleArn=execution_role_arn
                      )

                  # Create endpoint config
                  endpoint_config_name = f"{serverless_model_name}-config"
                  create_endpoint_config_response = sagemaker_client.create_endpoint_config(
                      EndpointConfigName=endpoint_config_name,
                      ProductionVariants=[{
                          'VariantName': 'AllTraffic',
                          'ModelName': serverless_model_name,
                          'ServerlessConfig': {
                              'MemorySizeInMB': int(os.environ.get('MEMORY_SIZE_IN_MB')),
                              'MaxConcurrency': int(os.environ.get('MAX_CONCURRENT_INVOCATIONS'))
                          }
                      }]
                  )

                  # Create endpoint
                  endpoint_name = f"{serverless_model_name}-endpoint"
                  create_endpoint_response = sagemaker_client.create_endpoint(
                      EndpointName=endpoint_name,
                      EndpointConfigName=endpoint_config_name
                  )

                  return {
                      'statusCode': 200,
                      'body': json.dumps({
                          'message': 'Serverless endpoint deployment initiated',
                          'endpointName': endpoint_name
                      })
                  }

              except Exception as e:
                  print(f"Error: {str(e)}")
                  raise
      Runtime: python3.12
      Timeout: 300
      MemorySize: 128
      Environment:
        Variables:
          MEMORY_SIZE_IN_MB: !Ref MemorySizeInMB
          MAX_CONCURRENT_INVOCATIONS: !Ref MaxConcurrency
          ALLOWED_DOMAIN_ID_PARAMETER_NAME: !Ref AllowedDomainIdParameterName

  EventRule:
    Type: AWS::Events::Rule
    Properties:
      Description: Rule to trigger Lambda when SageMaker Model Package state changes
      EventPattern:
        source:
          - aws.sagemaker
        detail-type:
          - SageMaker Model Package State Change
        detail:
          ModelApprovalStatus:
            - Approved
          UpdatedModelPackageFields:
            - ModelApprovalStatus
      State: ENABLED
      Targets:
        - Arn: !GetAtt ModelDeploymentFunction.Arn
          Id: ModelDeploymentFunction

  LambdaInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref ModelDeploymentFunction
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt EventRule.Arn

Outputs:
  LambdaFunctionArn:
    Description: ARN of the Lambda function
    Value: !GetAtt ModelDeploymentFunction.Arn
  EventRuleArn:
    Description: ARN of the EventBridge rule
    Value: !GetAtt EventRule.Arn

This stack will limit automated serverless endpoint creation to a specific AWS Region and domain. You can find your domain ID when accessing SageMaker Studio from the SageMaker AI console, or by running the following command: aws sagemaker list-domains --region [your-region]
Clean up
To manage costs and prevent additional workspace charges, make sure that you have logged out of SageMaker Canvas. If you tested your endpoint using a Jupyter notebook, you can shut down your JupyterLab instance by choosing Stop or configuring automated shutdown for JupyterLab.

In this post, we showed how to deploy a SageMaker Canvas model to a serverless endpoint using SageMaker Serverless Inference. By using this serverless approach, you can quickly and efficiently serve predictions from your SageMaker Canvas models without needing to manage the underlying infrastructure.
This seamless deployment experience is just one example of how AWS services like SageMaker Canvas and SageMaker Serverless Inference simplify the ML journey, helping businesses of different sizes and technical proficiencies unlock the value of AI and ML. As you continue exploring the SageMaker ecosystem, be sure to check out how you can unlock data governance for no-code ML with Amazon DataZone, and seamlessly transition between no-code and code-first model development using SageMaker Canvas and SageMaker Studio.

About the authors
Nadhya Polanco is a Solutions Architect at AWS based in Brussels, Belgium. In this role, she supports organizations looking to incorporate AI and Machine Learning into their workloads. In her free time, Nadhya enjoys indulging in her passion for coffee and traveling.
Brajendra Singh is a Principal Solutions Architect at Amazon Web Services, where he partners with enterprise customers to design and implement innovative solutions. With a strong background in software development, he brings deep expertise in Data Analytics, Machine Learning, and Generative AI.

Building a multi-agent voice assistant with Amazon Nova Sonic and Amaz …

Amazon Nova Sonic is a foundation model that creates natural, human-like speech-to-speech conversations for generative AI applications, allowing users to interact with AI through voice in real-time, with capabilities for understanding tone, enabling natural flow, and performing actions.
Multi-agent architecture offers a modular, robust, and scalable design pattern for production-level voice assistants. This blog post explores Amazon Nova Sonic voice agent applications and demonstrates how they integrate with Strands Agents framework sub-agents while leveraging Amazon Bedrock AgentCore to create an effective multi-agent system.
Why multi-agent architecture?
Imagine developing a financial assistant application responsible for user onboarding, information collection, identity verification, account inquiries, exception handling, and handing off to human agents based on predefined conditions. As functional requirements expand, the voice agent continues to add new inquiry types. The system prompt grows enormous, and the underlying logic becomes increasingly complex. This illustrates a persistent challenge in software development: monolithic designs lead to systems that are difficult to maintain and enhance.
Think of multi-agent architecture as building a team of specialized AI assistants rather than relying on a single do-it-all helper. Just like companies divide responsibilities across different departments, this approach breaks complex tasks into smaller, manageable pieces. Each AI agent becomes an expert in a specific area—whether that’s fact-checking, data processing, or handling specialized requests. For the user, the experience feels seamless: there’s no delay, no change in voice, and no visible handoff. The system functions behind the scenes, directing each expert agent to step in at the right moment.
In addition to modular and robust benefits, multi-agent systems offer advantages similar to a microservice architecture, a popular enterprise software design pattern, providing scalability, distribution and maintainability while allowing organizations to reuse agentic workflows already developed for their large language model (LLM)-powered applications.
Sample application
In this blog, we refer to the Amazon Nova Sonic workshop multi-agent lab code, which uses the banking voice assistant as a sample to demonstrate how to deploy specialized agents on Amazon Bedrock AgentCore. It uses Nova Sonic as the voice interface layer and acts as an orchestrator to delegate detailed inquiries to sub-agents written in Strands Agents hosted on AgentCore Runtime. You can find the sample source code on the GitHub repo.
In the banking voice agent sample, the conversation flow begins by greeting the user and collecting their name, and then it handles inquiries related to banking or mortgages. We use three sub-agents hosted on AgentCore to handle the specialized logic:

Authenticate sub-agent: Handles user authentication using the account ID and other information
Banking sub-agent: Handles account balance checks, statements, and other banking-related inquiries
Mortgage sub-agent: Handles mortgage-related inquiries, including refinancing, rates, and repayment options

Sub-agents are self-contained, handling their own logic such as input validation. For instance, the authentication agent validates account IDs and returns errors to Nova Sonic if needed. This simplifies the reasoning logic in Nova Sonic while keeping business logic encapsulated, similar to the software engineering modular design patterns.
Integrate Nova Sonic with AgentCore through tool use events
Amazon Nova Sonic relies on tool use to integrate with agentic workflows. During the Nova Sonic event lifecycle, you can provide tool use configurations through the promptStart event, which initiates tool use when Sonic receives specific types of input.
For example, in the following Sonic tool configuration sample, tool use is configured to initiate events based on Sonic’s built-in reasoning model, which classifies the inquiry for routing to the banking sub-agents.

[
    {
        "toolSpec": {
            "name": "bankAgent",
            "description": `Use this tool whenever the customer asks about their **bank account balance** or **bank statement**.
                    It should be triggered for queries such as:
                    - "What's my balance?"
                    - "How much money do I have in my account?"
                    - "Can I see my latest bank statement?"
                    - "Show me my account summary."`,
            "inputSchema": {
                "json": JSON.stringify({
                    "type": "object",
                    "properties": {
                        "accountId": {
                            "type": "string",
                            "description": "This is a user input. It is the bank account Id which is a numeric number."
                        },
                        "query": {
                            "type": "string",
                            "description": "The inquiry to the bank agent such as check account balance, get statement etc."
                        }
                    },
                    "required": [
                        "accountId", "query"
                    ]
                })
            }
        }
    }
]

When a user asks Nova Sonic a question such as ‘What is my account balance?’, Sonic sends a toolUse event to the client application with the specified toolName (for example, bankAgent) defined in the configuration. The application can then invoke the sub-agent hosted on AgentCore to handle the banking logic and return the response to Sonic, which in turn generates an audio reply for the user.

{
  "event": {
    "toolUse": {
      "completionId": "UUID",
      "content": "{\"accountId\":\"one two three four five\",\"query\":\"check account balance\"}",
      "contentId": "UUID",
      "promptName": "UUID",
      "role": "TOOL",
      "sessionId": "UUID",
      "toolName": "bankAgent",
      "toolUseId": "UUID"
    }
  }
}
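On the client side, the handoff can be a small dispatcher that reads the toolUse event, routes it to the matching sub-agent, and returns the result to Nova Sonic. The sketch below is illustrative only; invoke_sub_agent is a hypothetical stand-in for however you call your AgentCore-hosted sub-agent (for example, through its SDK or an HTTPS endpoint):

import json

def invoke_sub_agent(tool_name: str, payload: dict) -> str:
    # Hypothetical helper: call the Strands sub-agent hosted on AgentCore Runtime
    # and return its text reply. Replace with your actual invocation mechanism.
    raise NotImplementedError

def handle_tool_use(event: dict) -> dict:
    tool_use = event["event"]["toolUse"]
    tool_name = tool_use["toolName"]              # e.g. "bankAgent"
    tool_input = json.loads(tool_use["content"])  # {"accountId": "...", "query": "..."}

    # Route the inquiry to the matching sub-agent
    reply_text = invoke_sub_agent(tool_name, tool_input)

    # Return a tool result that the application sends back to Nova Sonic,
    # which then generates the audio response for the user.
    return {
        "toolUseId": tool_use["toolUseId"],
        "content": reply_text,
    }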

Sub-agent on AgentCore
The following sample showcases the banking sub-agent developed using the Strands Agents framework, specifically configured for deployment on Bedrock AgentCore. It leverages Nova Lite through Amazon Bedrock as its reasoning model, providing effective cognitive capabilities with minimal latency. The agent implementation features a system prompt that defines its banking assistant responsibilities, complemented by two specialized tools: one for account balance inquiries and another for bank statement retrieval.

from strands import Agent, tool
import json
from bedrock_agentcore.runtime import BedrockAgentCoreApp
from strands.models import BedrockModel

app = BedrockAgentCoreApp()

@tool
def get_account_balance(account_id: str) -> dict:
    """Get account balance for given account Id

    Args:
        account_id: Bank account Id
    """

    # The actual implementation will retrieve information from a database API or another backend service.
    result = {}  # placeholder; replace with data from your backend

    return {"result": result}

@tool
def get_statement(account_id: str, year_and_month: str) -> dict:
    """Get account statement for a given year and month

    Args:
        account_id: Bank account Id
        year_and_month: Year and month of the bank statement. For example: 2025_08 or August 2025
    """
    # The actual implementation will retrieve information from a database API or another backend service.
    result = {}  # placeholder; replace with data from your backend

    return {"result": result}

# Specify Bedrock LLM for the Agent
bedrock_model = BedrockModel(
    model_id="amazon.nova-lite-v1:0",
)

# System prompt
system_prompt = '''
You are a banking agent. You will receive requests that include:
- `account_id`
- `query` (the inquiry type, such as **balance** or **statement**, plus any additional details like month).

## Instructions
1. Use the provided `account_id` and `query` to call the tools.
2. The tool will return a JSON response.
3. Summarize the result in 2-3 sentences.
   - For a **balance inquiry**, give the account balance with currency and date.
   - For a **statement inquiry**, provide opening balance, closing balance, and number of transactions.
4. Do not return raw JSON. Always respond in natural language.
'''

# Create an agent with tools, LLM, and system prompt
agent = Agent(
    tools=[get_account_balance, get_statement],
    model=bedrock_model,
    system_prompt=system_prompt
)

@app.entrypoint
def banking_agent(payload):
    response = agent(json.dumps(payload))
    return response.message['content'][0]['text']

if __name__ == "__main__":
    app.run()

Best practices for voice-based multi-agent systems
Multi-agent architecture provides exceptional flexibility and a modular design approach, allowing developers to structure voice assistants efficiently and potentially reuse existing specialized agent workflows. When implementing voice-first experiences, there are important best practices to consider that address the unique challenges of this modality.

Balance flexibility and latency: Although the ability to invoke sub-agents using Nova Sonic tool use events creates powerful capabilities, it can introduce additional latency to voice responses. For the use cases that require a synchronized experience, each agent handoff represents a potential delay point in the interaction flow. Therefore, it’s important to design with response time in mind.
Optimize model selection for sub-agents: Starting with smaller, more efficient models like Nova Lite for sub-agents can significantly reduce latency while still handling specialized tasks effectively. Reserve larger, more capable models for complex reasoning or when sophisticated natural language understanding is essential.
Craft voice-optimized responses: Voice assistants perform best with concise, focused responses that can be followed by additional details when needed. This approach not only improves latency but also creates a more natural conversational flow that aligns with human expectations for verbal communication.

Consider stateless vs. stateful sub-agent design
Stateless sub-agents handle each request independently, without retaining memory of past interactions or session-level states. They are simple to implement, easy to scale, and work well for straightforward, one-off tasks. However, they cannot provide context-aware responses unless external state management is introduced.
Stateful sub-agents, on the other hand, maintain memory across interactions to support context-aware responses and session-level states. This enables more personalized and cohesive user experiences, but comes with added complexity and resource requirements. They are best suited for scenarios involving multi-turn interactions and user or session-level context caching.
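As a rough illustration of the trade-off, the stateless variant below handles each request in isolation, while the stateful variant keeps lightweight per-session context; the in-memory dict is a stand-in for the external session store or memory capability you would use in practice:

# Stateless: each request is handled independently, with no memory of prior turns
def stateless_balance_inquiry(account_id: str, query: str) -> str:
    return f"Handling '{query}' for account {account_id}"

# Stateful: keep lightweight per-session context (illustrative in-memory store only)
session_store: dict[str, dict] = {}

def stateful_balance_inquiry(session_id: str, account_id: str, query: str) -> str:
    state = session_store.setdefault(session_id, {"history": []})
    state["history"].append(query)
    follow_up = len(state["history"]) > 1
    prefix = "Following up on your earlier question, " if follow_up else ""
    return prefix + f"handling '{query}' for account {account_id}"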
Conclusion
Multi-agent architectures unlock flexibility, scalability, and accuracy for complex AI-driven workflows. By combining the Nova Sonic conversational capabilities with the orchestration power of Bedrock AgentCore, you can build intelligent, specialized agents that work together seamlessly. If you’re exploring ways to enhance your AI applications, multi-agent patterns with Nova Sonic and AgentCore are a powerful approach worth testing.
Learn more about Amazon Nova Sonic by visiting the User Guide, building your application with the sample applications, and exploring the Nova Sonic workshop to get started. You can also refer to the technical report and model card for additional benchmarks.

About the authors
Lana Zhang is a Senior Specialist Solutions Architect for Generative AI at AWS within the Worldwide Specialist Organization. She specializes in AI/ML, with a focus on use cases such as AI voice assistants and multimodal understanding. She works closely with customers across diverse industries, including media and entertainment, gaming, sports, advertising, financial services, and healthcare, to help them transform their business solutions through AI.

Accelerate large-scale AI training with Amazon SageMaker HyperPod trai …

Large-scale AI model training faces significant challenges with failure recovery and monitoring. Traditional training requires complete job restarts when even a single training process fails, resulting in additional downtime and increased costs. As training clusters expand, identifying and resolving critical issues like stalled GPUs and numerical instabilities typically requires complex custom monitoring code.
With Amazon SageMaker HyperPod you can accelerate AI model development across hundreds or thousands of GPUs with built-in resiliency, decreasing model training time by up to 40%. The Amazon SageMaker HyperPod training operator further enhances training resilience for Kubernetes workloads through pinpoint recovery and customizable monitoring capabilities.
In this blog post, we show you how to deploy and manage machine learning training workloads using the Amazon SageMaker HyperPod training operator, including setup instructions and a complete training example.
Amazon SageMaker HyperPod training operator
The Amazon SageMaker HyperPod training operator helps you accelerate generative AI model development by efficiently managing distributed training across large GPU clusters. The Amazon SageMaker HyperPod training operator uses built-in fault resiliency components, comes packaged as an Amazon Elastic Kubernetes Service (Amazon EKS) add-on, and deploys the necessary custom resource definitions (CRDs) to the HyperPod cluster.
Solution overview
The following diagram depicts the architecture of Amazon SageMaker HyperPod training operator.

The HyperPod training operator follows the Kubernetes operator pattern and has the following major components:

Custom Resource Definitions (CRDs): HyperPodPyTorchJob defines the job specification (for example, node count, image) and serves as the interface for customers to submit jobs (apiVersion: sagemaker.amazonaws.com/v1, kind: HyperPodPyTorchJob).
RBAC policies: Defines the actions the controller is allowed to perform, such as creating pods and managing HyperPodPyTorchJob resources.
Job controller: Listens to job creation and fulfills requests by creating job pods and pod managers.
Pod manager: Monitors training process health on each pod. The number of Pod Managers is determined by the number of pods required by the job. One Pod Manager currently controls several hundred pods.
HyperPod elastic agent: Customers install the elastic agent into their training container. It orchestrates lifecycles of training workers on each container and communicates with the Amazon SageMaker HyperPod training operator. The HyperPod elastic agent is an extension of PyTorch’s ElasticAgent.

The job controller uses fault detection components such as the SageMaker HyperPod health-monitoring agent and node health check mechanisms like AWS retirement notices to update job state and repair faults. It also relies on the HyperPod elastic agent to check the status of training processes for crashes and hung job detection.
When a HyperPodPyTorchJob is submitted, the Amazon SageMaker HyperPod training operator spins up job pods along with pod manager pods that help manage the training job lifecycle. The pod managers interact with the HyperPod elastic agent so that all job pods maintain a healthy state.
Benefits of using the operator
The Amazon SageMaker HyperPod training operator can be installed as an EKS add-on on your cluster. The key benefits include:

Centralized training process monitoring and restart – The HyperPod training operator maintains a control plane with a global view of health across all ranks. When one rank encounters an issue, it broadcasts a stop signal to all ranks to prevent other ranks from failing individually at different times due to collective communication timeout. This supports more efficient fault detection and recovery.
Centralized efficient rank assignment – A separate HyperPod rendezvous backend allows the HyperPod training operator to assign ranks directly. This reduces initialization overhead by eliminating the need for worker-to-worker discovery.
Unhealthy training node detection and job restart – The HyperPod training operator is fully integrated with the HyperPod EKS cluster resiliency features, helping restart jobs or training processes due to bad nodes and hardware issues in ML workloads. This reduces the need to self-manage job recovery solutions.
Granular process recovery – Rather than restarting entire jobs when failures occur, the operator precisely targets and restarts only training processes, reducing recovery times from tens of minutes to seconds. This makes HyperPod training operator job recovery time scale linearly as cluster size grows.
Hanging job detection and performance degradation detection – Based on training script log monitoring, the HyperPod training operator helps overcome problematic training scenarios including stalled training batches, non-numeric loss values, and performance degradation through simple YAML configurations. For more information, see Using the training operator to run jobs in the Amazon SageMaker AI Developer Guide.

Training operator setup
This section walks through installing the Amazon SageMaker HyperPod training operator as an Amazon EKS add-on.
Estimated Setup Time: 30-45 minutes
Prerequisites
Before getting started, verify that you have the following resources and permissions.
Required AWS resources:

Active AWS account
Amazon EKS cluster (version 1.28 or later)
Amazon SageMaker HyperPod EKS cluster
Amazon ECR repository for container images

Required IAM permissions:

AmazonSageMakerHyperPodTrainingOperatorAccess managed policy
EKS cluster access permissions
ECR push/pull permissions
eks-pod-identity-agent add-on installed on EKS cluster

Required software:

kubectl (version 1.28 or later), for more information see the kubectl installation documentation
docker (version 20.10 or later), for more information see the docker installation documentation
AWS Command Line Interface (AWS CLI) (version 2.0 or later), for more information see the AWS CLI installation documentation
envsubst utility
HuggingFace account with access token

Installation instructions
Before running the installation steps below, you first need to create a HyperPod cluster. If you haven't done so already, please follow the instructions to create an EKS-orchestrated SageMaker HyperPod cluster to get started. Make sure to install the eks-pod-identity-agent add-on on the EKS cluster by following the Set up the Amazon EKS Pod Identity Agent instructions.
Install cert-manager
First, install the cert-manager add-on which is required for the HyperPod training operator:

Open the Amazon EKS console
Navigate to your EKS cluster and go to the Add-ons page
On the Add-ons page, locate Get more add-ons and navigate to the Community add-ons section
Find the Cert Manager add-on, select it, and choose Next
On the add-on configuration page, proceed with default settings and choose Next
Preview all selections for the Cert Manager add-on and choose Create
Wait for the add-on status to change to Active before proceeding

Install the HyperPod training operator add-on
Once cert-manager is active, install the Amazon SageMaker HyperPod training operator:

Open the Amazon SageMaker console
Navigate to your cluster’s details page
On the Dashboard tab, locate Amazon SageMaker HyperPod training operator and choose Install

During installation, SageMaker creates an IAM execution role with permissions similar to the AmazonSageMakerHyperPodTrainingOperatorAccess managed policy and creates a pod identity association between your Amazon EKS cluster and the new execution role.

Verify installation
We have now successfully set up the Amazon SageMaker HyperPod training operator. You can confirm that the pods are running by using the following command:

kubectl -n aws-hyperpod get pods -l hp-training-control-plane=hp-training-operator-controller-manager

Your output should contain the training operator controller as shown below:

NAME                                                               READY   STATUS    RESTARTS   AGE
hp-training-operator-hp-training-controller-manager-85c68bmd79b   1/1     Running   0          24m

Set up training job
Let’s run a PyTorch-based training example on a Llama model. We begin by checking out the following code base:

git clone 
cd awsome-distributed-training/3.test_cases/pytorch/FSDP

These scripts provide an easy way to get started with multinode FSDP training on EKS. It is designed to be as simple as possible, requires no data preparation, and uses a container image.
Next, build the docker container image.

aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/hpc-cloud
export REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/

docker build -t ${REGISTRY}fsdp:pytorch2.5.1 .

The above command works in Linux-based environments. If you are on a Mac, use buildx to target the linux/amd64 architecture:

docker buildx build --platform linux/amd64 -t ${REGISTRY}fsdp:pytorch2.5.1 .

Push the image to Amazon ECR:

# Create registry if needed
REGISTRY_COUNT=$(aws ecr describe-repositories | grep “fsdp” | wc -l)
if [ “$REGISTRY_COUNT” -eq 0 ]; then
    aws ecr create-repository –repository-name fsdp
fi

# Login to registry
echo “Logging in to $REGISTRY …”
aws ecr get-login-password | docker login –username AWS –password-stdin $REGISTRY

# Push image to registry
docker image push ${REGISTRY}fsdp:pytorch2.5.1

Note: Pushing the image may take some time depending on your network bandwidth.
Data
For this example, we’ll be using the allenai/c4 dataset. Instead of downloading the whole thing, the create_streaming_dataloaders function will stream the dataset from HuggingFace, so there’s no data prep required for running this training.
If you'd like to instead use your own dataset, you can do so by formatting it as a HuggingFace dataset, and passing its location to the --dataset_path argument.
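For reference, streaming the dataset with the Hugging Face datasets library looks roughly like the following. This is a minimal sketch rather than the repo's create_streaming_dataloaders implementation, and the "en" configuration name is an assumption:

from datasets import load_dataset

# Stream the allenai/c4 dataset rather than downloading it in full.
# The "en" config is an assumption; pick the subset that matches your training setup.
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Iterate lazily; each example is fetched on demand from the Hugging Face Hub.
for i, example in enumerate(dataset):
    print(example["text"][:80])
    if i == 2:
        break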
For the dataset, you will need a Hugging Face access token. First, create a Hugging Face account. Then generate your access token with read permissions.
We will reference this token in the next step by setting it as an environment variable.
This example uses envsubst to generate a Kubernetes manifest file from a template file and parameters. If you don’t have envsubst on your development environment, install it by following the installation instructions.
Launch Llama 3.1 8B training job
Next, we generate the Kubernetes manifest and apply it to the cluster. Let’s navigate to the FSDP source repo:

cd awsome-distributed-training/3.test_cases/pytorch/FSDP/Kubernetes

Here, we start by creating environment variables that are used in our training job. Fill out the placeholders as per your cluster size.

cat << 'EOF' > env_vars
export ACCOUNT_ID=<AWS_ACCOUNT_ID>
export REGION=<REGION>
export REGISTRY=${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com
export IMAGE_URI=${REGISTRY}/fsdp:pytorch2.5.1
export INSTANCE_TYPE=<INSTANCE TYPE> # ml.p5.48xlarge
export NUM_NODES=<NUMBER OF NODES> # 2
export GPU_PER_NODE=<NUMBER OF GPUS PER NODE> # 8
export EFA_PER_NODE=<NUMBER OF EFA PER NODE> # 32
export FI_PROVIDER=efa
export HF_TOKEN=<YOUR HF ACCESS TOKEN> # HF_xxxx
EOF

Once you have filled in env_vars, source the variables:

source env_vars

You can then apply the YAML to submit the training job:

envsubst < llama3_1_8b-fsdp-hpto.yaml | kubectl apply -f -

You can also adjust the training parameters in the TRAINING_ARGS section of the llama3_1_8b-fsdp-hpto.yaml. Additional parameters can be found under model/arguments.py. Note that we use the same directory for both --checkpoint_dir and --resume_from_checkpoint. If there are multiple checkpoints, --resume_from_checkpoint will automatically select the most recent one. This way if our training is interrupted for any reason, it will automatically pick up the most recent checkpoint.
Additionally, you can prepare and submit jobs compatible with the Amazon SageMaker HyperPod training operator through the recently announced HyperPod CLI and SDK capabilities; more information on how to use them is available in the development guide.
Monitor training job
To see the status of your job, use the following command:

kubectl get hyperpodpytorchjobs

The output lists the jobs run using the HyperPod training operator:

NAME              AGE
llama2-13b-fsdp   2m15s

Use the following command to list all the pods for the training jobs:

kubectl get pods

The output looks similar to the following:

NAME                    READY  STATUS   RESTARTS AGE
llama2-13b-fsdp-pods-0  1/1    Running   0       13s
llama2-13b-fsdp-pods-1  1/1    Running   0       13s
llama2-13b-fsdp-pods-2  1/1    Running   0       13s
llama2-13b-fsdp-pods-3  1/1    Running   0       13s

To continuously stream the pod logs to stdout, use the following command:

kubectl logs -f llama2-13b-fsdp-pods-0

Configure log monitoring
With the Amazon SageMaker HyperPod training operator, users can configure log patterns that the operator continuously monitors. The operator looks for the configured regex pattern and stops the training job if it finds a violation. The llama3_1_8b-fsdp-hpto.yaml file that we used previously contains log monitoring configurations for tracking job start hangs, hangs during training, and checkpoint creation failures, as shown below:

logMonitoringConfiguration:
  - name: "JobStart"
    logPattern: ".*Loss:.*"
    expectedStartCutOffInSeconds: 240 # job should print loss within 4 mins of start time
  - name: "JobHangingDetection"
    logPattern: ".*Loss:.*"
    expectedRecurringFrequencyInSeconds: 300 # if the next batch is not printed within 300 seconds
  - name: "NoSCheckpointingDetection"
    logPattern: ".*Completed checkpoint.*"
    expectedRecurringFrequencyInSeconds: 600 # if the next checkpoint upload doesn't happen within 10 mins, mark it as hung
    expectedStartCutOffInSeconds: 900 # allow 15 minutes for the first checkpoint upload

The corresponding code in src/train.py has the necessary log statements.

logger.info(
    "Batch %d Loss: %.5f, Speed: %.2f samples/sec, lr: %.6f",  # pylint: disable=line-too-long
    batch_idx,
    loss_scalar,
    throughput,
    current_lr,
)

Any time these metrics exhibit deviation from their expected values, the operator will detect it as a fault, and trigger a recovery process to re-execute the job, up to a user-specified maximum number of retries.
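Conceptually, the hang detection amounts to tracking when a configured pattern last matched in the log stream. The following sketch only illustrates that idea; it is not the operator's implementation:

import re
import time

LOSS_PATTERN = re.compile(r".*Loss:.*")
EXPECTED_RECURRING_FREQUENCY_SECONDS = 300  # mirrors expectedRecurringFrequencyInSeconds

def monitor_log_lines(lines, now=time.time):
    """Flag a hang if the loss pattern stops appearing for too long (illustrative only)."""
    last_match = now()
    for line in lines:
        if LOSS_PATTERN.match(line):
            last_match = now()
        elif now() - last_match > EXPECTED_RECURRING_FREQUENCY_SECONDS:
            return "hang detected, trigger recovery"
    return "healthy"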
Additionally, the HyperPod training operator also supports integration with Amazon SageMaker Task Governance.
Integration with HyperPod Observability
SageMaker HyperPod offers a managed observability experience through the newly launched HyperPod monitoring and observability EKS add-on. The observability add-on automatically populates Kubeflow Training metrics in Grafana dashboards out of the box, but for HyperPod PyTorch job metrics, you have to turn on the advanced training metrics, which leverage the HyperPod training operator to show information around job downtime, recovery, and faults.
To get these advanced metrics, you can refer to Setting up the SageMaker HyperPod observability add-on. This helps to streamline the process of manually setting up a scraper and building dashboards.
Clean up
To avoid incurring unnecessary charges, clean up the resources created in this walkthrough.
Delete training jobs
Remove all HyperPod training jobs:

kubectl delete hyperpodpytorchjobs --all

Verify jobs are deleted:

kubectl get hyperpodpytorchjobs

Remove container images
Delete the ECR repository and images:

aws ecr delete-repository --repository-name fsdp --force

Remove add-ons
Remove the Amazon SageMaker HyperPod training operator add-on:

Open the Amazon SageMaker console
Navigate to your cluster’s details page
On the Add-ons tab, select the Amazon SageMaker HyperPod training operator
Choose Remove

Remove the cert manager add-on:

Open the Amazon EKS console
Navigate to your EKS cluster’s Add-ons page
Select Cert Manager and choose Remove

Additional clean up
Consider removing these resources if no longer needed:

Any persistent volumes created during training
CloudWatch log groups (if you want to retain logs, leave these)
Custom IAM roles created specifically for this example
The HyperPod cluster itself (if no longer needed).

Conclusion
As organizations continue to push the boundaries of AI model development, tools like the Amazon SageMaker HyperPod training operator can be used to maintain efficiency and reliability at scale. Amazon SageMaker HyperPod training operator offers a robust solution to common challenges in large model training. Key takeaways include:

One-click installation through the Amazon SageMaker HyperPod cluster console user interface.
Custom rendezvous backend eliminates initialization and worker synchronization overhead which results in faster job starts and recovery.
Process level restarts maximize recovery efficiency when runtime faults occur.
Customizable hang job detection during training.
Comprehensive monitoring for early detection of training issues.
Out-of-box integration with existing HyperPod resiliency features.

To get started with the Amazon SageMaker HyperPod training operator, follow the setup instructions provided in this post and explore the example training job to understand how it can benefit your specific use case. For more information and best practices, visit the Amazon SageMaker documentation.

About the authors
Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker AI. He holds a Master’s degree from UIUC with a specialization in Data science. He specializes in Generative AI workloads, helping customers build and deploy LLM’s using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.
Haard Mehta is a Software Engineer with Amazon’s SageMaker AI team and holds a Master’s degree in Computer Science with a specialization in big data systems from Arizona State University. He has extensive experience building managed machine learning services at scale, with a focus on hardware resiliency and enabling customers to succeed in their AI use cases without complex infrastructure management. Haard enjoys exploring new places, photography, cooking, and road trips.
Anirudh Viswanathan is a Sr Product Manager, Technical – External Services with the SageMaker AI Training team. He holds a Masters in Robotics from Carnegie Mellon University, an MBA from the Wharton School of Business, and is named inventor on over 40 patents. He enjoys long-distance running, visiting art galleries, and Broadway shows.

Google AI Research Releases DeepSomatic: A New AI Model that Identifie …

A team of researchers from Google Research and UC Santa Cruz released DeepSomatic, an AI model that identifies cancer cell genetic variants. In research with Children’s Mercy, it found 10 variants in pediatric leukemia cells missed by other tools. DeepSomatic has a somatic small variant caller for cancer genomes that works across Illumina short reads, PacBio HiFi long reads, and Oxford Nanopore long reads. The method extends DeepVariant, detects single nucleotide variants and small insertions and deletions in whole genome and whole exome data, and supports tumor normal and tumor only workflows, including FFPE models.

https://research.google/blog/using-ai-to-identify-genetic-variants-in-tumors-with-deepsomatic/?utm_source=twitter&utm_medium=social&utm_campaign=social_post&utm_content=gr-acct

How It Works?

DeepSomatic converts aligned reads into image like tensors that encode pileups, base qualities, and alignment context. A convolutional neural network classifies candidate sites as somatic or not and the pipeline emits VCF or gVCF. This design is platform agnostic because the tensor summarizes local haplotype and error patterns across technologies. Google researchers describe the approach and its focus on distinguishing inherited and acquired variants including difficult samples such as glioblastoma and pediatric leukemia.
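To make the image-like tensor idea concrete, the toy sketch below runs a pileup-style tensor through a small convolutional classifier. The channel layout, tensor sizes, and network are illustrative stand-ins, not the DeepSomatic architecture:

import torch
import torch.nn as nn

# Toy pileup tensor: a batch of candidate sites, with channels standing in for
# read bases, base qualities, and alignment context around each site.
# (Channel meaning and dimensions are illustrative, not DeepSomatic's encoding.)
pileup = torch.randn(8, 6, 100, 221)  # (batch, channels, reads, window width)

classifier = nn.Sequential(
    nn.Conv2d(6, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 2),  # somatic vs. not somatic
)

logits = classifier(pileup)
somatic_probability = torch.softmax(logits, dim=-1)[:, 1]
print(somatic_probability.shape)  # torch.Size([8])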

Datasets and Benchmarking

Training and evaluation use CASTLE, Cancer Standards Long read Evaluation. CASTLE contains 6 matched tumor and normal cell line pairs that were whole genome sequenced on Illumina, PacBio HiFi, and Oxford Nanopore. The research team releases benchmark sets and accessions for reuse. This fills a gap in multi technology somatic training and testing resources.

https://research.google/blog/using-ai-to-identify-genetic-variants-in-tumors-with-deepsomatic/?utm_source=twitter&utm_medium=social&utm_campaign=social_post&utm_content=gr-acct

Reported Results

The research team reports consistent gains over widely used methods in both single nucleotide variants and indels. On Illumina indels, the next best method is about 80 percent F1, DeepSomatic is about 90 percent. On PacBio indels, the next best method is under 50 percent, DeepSomatic is above 80 percent. Baselines include SomaticSniper, MuTect2, and Strelka2 for short reads and ClairS for long reads. The study reports 329,011 somatic variants across the reference lines and an additional preserved sample. The Google research team reports that DeepSomatic outperforms current methods with particular strength on indels.

https://research.google/blog/using-ai-to-identify-genetic-variants-in-tumors-with-deepsomatic/?utm_source=twitter&utm_medium=social&utm_campaign=social_post&utm_content=gr-acct

Generalization to Real Samples

The research team evaluates transfer to cancers beyond the training set. A glioblastoma sample shows recovery of known drivers. Pediatric leukemia samples test the tumor only mode where a clean normal is not available. The tool recovers known calls and reports additional variants in that cohort. These studies indicate the representation and training scheme generalize to new disease contexts and to settings without matched normals.

Key Takeaways

DeepSomatic detects somatic SNVs (single nucleotide variants) and indels across Illumina, PacBio HiFi, and Oxford Nanopore, and builds on the DeepVariant methodology.

The pipeline supports tumor normal and tumor only workflows, includes FFPE WGS and WES models, and is released on GitHub.

It encodes read pileups as image like tensors and uses a convolutional neural network to classify somatic sites and emit VCF or gVCF.

Training and evaluation use the CASTLE dataset with 6 matched tumor normal cell line pairs sequenced on three platforms, with benchmarks and accessions provided.

Reported results show about 90 percent indel F1 on Illumina and above 80 percent on PacBio, outperforming common baselines, with 329,011 somatic variants identified across reference samples.

Editorial Comments

DeepSomatic is a pragmatic step for somatic variant calling across sequencing platforms. The model keeps DeepVariant's image tensor representation and a convolutional neural network, so the same architecture scales from Illumina to PacBio HiFi to Oxford Nanopore with consistent preprocessing and outputs. The CASTLE dataset is the right move: it supplies matched tumor and normal cell lines across 3 technologies, which strengthens training and benchmarking and aids reproducibility. Reported results emphasize indel accuracy, about 90% F1 on Illumina and more than 80% on PacBio against lower baselines, which addresses a long running weakness in indel detection. The pipeline supports WGS and WES, tumor normal and tumor only, and FFPE, which matches real laboratory constraints.

Check out the Technical Paper, Technical details, Dataset and GitHub Repo.

DeepSeek Just Released a 3B OCR Model: A 3B VLM Designed for High-Perf …

DeepSeek-AI released DeepSeek-OCR, a 3B end to end OCR and document parsing Vision-Language Model (VLM) system that compresses long text into a small set of vision tokens, then decodes those tokens with a language model. The method is simple: images carry compact representations of text, which reduces sequence length for the decoder. The research team reports 97% decoding precision when text tokens are within 10 times the vision tokens on the Fox benchmark, and useful behavior even at 20 times compression. It also reports competitive results on OmniDocBench with far fewer tokens than common baselines.

https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf

Architecture, what is actually new?

DeepSeek-OCR-3B has two components, a vision encoder named DeepEncoder and a Mixture of Experts decoder named DeepSeek3B-MoE-A570M. The encoder is designed for high resolution inputs with low activation cost and with few output tokens. It uses a window attention stage based on SAM for local perception, a 2 layer convolutional compressor for 16× token downsampling, and a dense global attention stage based on CLIP for visual knowledge aggregation. This design keeps activation memory controlled at high resolution, and keeps the vision token count low. The decoder is a 3B parameter MoE model (named as DeepSeek3B-MoE-A570M) with about 570M active parameters per token.

https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf

Multi resolution modes, engineered for token budgets

DeepEncoder supports native modes and dynamic modes. Native modes are Tiny with 64 tokens at 512 by 512 pixels, Small with 100 tokens at 640 by 640, Base with 256 tokens at 1024 by 1024, and Large with 400 tokens at 1280 by 1280. Dynamic modes named Gundam and Gundam-Master mix tiled local views with a global view. Gundam yields n×100 plus 256 tokens, or n×256 plus 400 tokens, with n in the range 2 to 9. For padded modes, the research team gives a formula for valid tokens, which is lower than the raw token count, and depends on the aspect ratio. These modes let AI developers and researchers align token budgets with page complexity.
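As a quick way to reason about these budgets, the helper below computes vision token counts per native mode and the implied compression ratio for an expected text token count. The numbers come from the mode list above; the Gundam count uses the n×100 plus 256 formula and ignores the padded-mode correction:

# Vision token budgets for the native modes described above.
NATIVE_MODES = {
    "tiny": 64,    # 512 x 512
    "small": 100,  # 640 x 640
    "base": 256,   # 1024 x 1024
    "large": 400,  # 1280 x 1280
}

def gundam_tokens(n_tiles: int) -> int:
    """Gundam mode: n local tiles (100 tokens each) plus a 256 token global view, n in 2..9."""
    assert 2 <= n_tiles <= 9
    return n_tiles * 100 + 256

def compression_ratio(expected_text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each vision token has to carry."""
    return expected_text_tokens / vision_tokens

# Example: a page expected to decode to about 1000 text tokens.
for name, tokens in NATIVE_MODES.items():
    print(f"{name:>5}: {tokens} vision tokens, {compression_ratio(1000, tokens):.1f}x compression")
print(f"gundam(n=3): {gundam_tokens(3)} vision tokens, {compression_ratio(1000, gundam_tokens(3)):.1f}x")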

https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf

https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf

Compression results, what the numbers say…..

The Fox benchmark study measures precision as exact text match after decoding. With 100 vision tokens, pages with 600 to 700 text tokens reach 98.5% precision at 6.7× compression. Pages with 900 to 1000 text tokens reach 96.8% precision at 9.7× compression. With 64 vision tokens, precision decreases as compression increases, for example 59.1% at about 19.7× for 1200 to 1300 text tokens. These values come directly from Table 2.

https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf

On OmniDocBench, the abstract reports that DeepSeek-OCR surpasses GOT-OCR 2.0 when using only 100 vision tokens per page, and that under 800 vision tokens it outperforms MinerU 2.0, which uses over 6000 tokens per page on average. The benchmark section presents overall performance in terms of edit distance.

https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf

Training details that matter….

The research team describes a two phase training pipeline. It first trains DeepEncoder with next token prediction on OCR 1.0 and OCR 2.0 data and 100M LAION samples, then trains the full system with pipeline parallelism across 4 partitions. For hardware, the run used 20 nodes, each with 8 A100 40G GPUs, and used AdamW. The team reports a training speed of 90B tokens per day on text only data, and 70B tokens per day on multimodal data. In production, it reports the ability to generate over 200k pages per day on a single A100 40G node.

How to evaluate it in a practical stack

If your target documents are typical reports or books, start with Small mode at 100 tokens, then adjust upward only if the edit distance is unacceptable. If your pages contain dense small fonts or very high token counts, use a Gundam mode, since it combines global and local fields of view with explicit token budgeting. If your workload includes charts, tables, or chemical structures, review the “Deep parsing” qualitative section, which shows conversions to HTML tables and SMILES and structured geometry, then design outputs that are easy to validate.

https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf

Key Takeaways

DeepSeek OCR targets token efficiency using optical context compression with near lossless decoding at about 10 times compression, and around 60 percent precision at about 20 times compression.

The HF release exposes explicit token budgets: Tiny uses 64 tokens at 512 by 512, Small uses 100 tokens at 640 by 640, Base uses 256 tokens at 1024 by 1024, Large uses 400 tokens at 1280 by 1280, and Gundam composes n views at 640 by 640 plus one global view at 1024 by 1024.

The system structure is a DeepEncoder that compresses pages into vision tokens and a DeepSeek3B MoE decoder with about 570M active parameters, as described by the research team in the technical report.

The Hugging Face model card documents a tested setup for immediate use, Python 3.12.9, CUDA 11.8, PyTorch 2.6.0, Transformers 4.46.3, Tokenizers 0.20.3, and Flash Attention 2.7.3.

Editorial Comments

DeepSeek OCR is a practical step for document AI. It treats pages as compact optical carriers that reduce decoder sequence length without discarding most information, and the model card and technical report describe 97 percent decoding precision at about 10 times compression on the Fox benchmark, which is the key claim to test in real workloads. The released model is a 3B MoE decoder with a DeepEncoder front end, packaged for Transformers, with tested versions for PyTorch 2.6.0, CUDA 11.8, and Flash Attention 2.7.3, which lowers setup cost for engineers. The repository shows a single 6.67 GB safetensors shard, which suits common GPUs. Overall, DeepSeek OCR operationalizes optical context compression with a 3B MoE decoder, reports about 97% decoding precision at 10x compression on Fox, provides explicit token budget modes, and includes a tested Transformers setup; validate the throughput claim in your own pipeline.

Check out the Technical Paper, Model on HF and GitHub Repo.