How to Build a Fully Functional Computer-Use Agent that Thinks, Plans, and Executes Virtual Actions Using Local AI Models

In this tutorial, we build an advanced computer-use agent from scratch that can reason, plan, and perform virtual actions using a local open-weight model. We create a miniature simulated desktop, equip it with a tool interface, and design an intelligent agent that can analyze its environment, decide on actions like clicking or typing, and execute them step by step. By the end, we see how the agent interprets goals such as opening emails or taking notes, demonstrating how a local language model can mimic interactive reasoning and task execution. Check out the FULL CODES here.

!pip install -q transformers accelerate sentencepiece nest_asyncio
import torch, asyncio, uuid
from transformers import pipeline
import nest_asyncio
nest_asyncio.apply()

We set up our environment by installing essential libraries such as Transformers, Accelerate, and Nest Asyncio, which enable us to run local models and asynchronous tasks seamlessly in Colab. We prepare the runtime so that the upcoming components of our agent can work efficiently without external dependencies. Check out the FULL CODES here.

class LocalLLM:
    def __init__(self, model_name="google/flan-t5-small", max_new_tokens=128):
        self.pipe = pipeline("text2text-generation", model=model_name,
                             device=0 if torch.cuda.is_available() else -1)
        self.max_new_tokens = max_new_tokens

    def generate(self, prompt: str) -> str:
        out = self.pipe(prompt, max_new_tokens=self.max_new_tokens, temperature=0.0)[0]["generated_text"]
        return out.strip()

class VirtualComputer:
    def __init__(self):
        self.apps = {"browser": "https://example.com", "notes": "", "mail": ["Welcome to CUA", "Invoice #221", "Weekly Report"]}
        self.focus = "browser"
        self.screen = "Browser open at https://example.com\nSearch bar focused."
        self.action_log = []

    def screenshot(self):
        return f"FOCUS:{self.focus}\nSCREEN:\n{self.screen}\nAPPS:{list(self.apps.keys())}"

    def click(self, target: str):
        if target in self.apps:
            self.focus = target
            if target == "browser":
                self.screen = f"Browser tab: {self.apps['browser']}\nAddress bar focused."
            elif target == "notes":
                self.screen = f"Notes App\nCurrent notes:\n{self.apps['notes']}"
            elif target == "mail":
                inbox = "\n".join(f"- {s}" for s in self.apps['mail'])
                self.screen = f"Mail App Inbox:\n{inbox}\n(Read-only preview)"
        else:
            self.screen += f"\nClicked '{target}'."
        self.action_log.append({"type": "click", "target": target})

    def type(self, text: str):
        if self.focus == "browser":
            self.apps["browser"] = text
            self.screen = f"Browser tab now at {text}\nPage headline: Example Domain"
        elif self.focus == "notes":
            self.apps["notes"] += ("\n" + text)
            self.screen = f"Notes App\nCurrent notes:\n{self.apps['notes']}"
        else:
            self.screen += f"\nTyped '{text}' but no editable field."
        self.action_log.append({"type": "type", "text": text})

We define the core components, a lightweight local model, and a virtual computer. We use Flan-T5 as our reasoning engine and create a simulated desktop that can open apps, display screens, and respond to typing and clicking actions. Check out the FULL CODES here.
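As a quick sanity check of these two pieces in isolation, a cell like the following can be run right after the definitions above; it assumes only the LocalLLM and VirtualComputer classes already in scope.

computer = VirtualComputer()
computer.click("mail")
print(computer.screenshot())   # shows the simulated inbox view

llm = LocalLLM()
print(llm.generate("List three colors."))  # small smoke test of the local Flan-T5 model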

class ComputerTool:
    def __init__(self, computer: VirtualComputer):
        self.computer = computer

    def run(self, command: str, argument: str = ""):
        if command == "click":
            self.computer.click(argument)
            return {"status": "completed", "result": f"clicked {argument}"}
        if command == "type":
            self.computer.type(argument)
            return {"status": "completed", "result": f"typed {argument}"}
        if command == "screenshot":
            snap = self.computer.screenshot()
            return {"status": "completed", "result": snap}
        return {"status": "error", "result": f"unknown command {command}"}

We introduce the ComputerTool interface, which acts as the communication bridge between the agent’s reasoning and the virtual desktop. We define high-level operations such as click, type, and screenshot, enabling the agent to interact with the environment in a structured way. Check out the FULL CODES here.
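As a quick illustration of the tool contract, the following snippet (assuming the classes defined above) drives the virtual desktop directly through ComputerTool and prints the structured results:

tool = ComputerTool(VirtualComputer())
print(tool.run("click", "notes"))          # {'status': 'completed', 'result': 'clicked notes'}
print(tool.run("type", "Buy milk"))        # typing lands in the focused notes app
print(tool.run("screenshot")["result"])    # text snapshot of the simulated screen
print(tool.run("scroll", "down"))          # unknown command -> {'status': 'error', ...}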

class ComputerAgent:
    def __init__(self, llm: LocalLLM, tool: ComputerTool, max_trajectory_budget: float = 5.0):
        self.llm = llm
        self.tool = tool
        self.max_trajectory_budget = max_trajectory_budget

    async def run(self, messages):
        user_goal = messages[-1]["content"]
        steps_remaining = int(self.max_trajectory_budget)
        output_events = []
        total_prompt_tokens = 0
        total_completion_tokens = 0
        while steps_remaining > 0:
            screen = self.tool.computer.screenshot()
            prompt = (
                "You are a computer-use agent.\n"
                f"User goal: {user_goal}\n"
                f"Current screen:\n{screen}\n\n"
                "Think step-by-step.\n"
                "Reply with: ACTION <click/type/screenshot> ARG <target or text> THEN <assistant message>.\n"
            )
            thought = self.llm.generate(prompt)
            total_prompt_tokens += len(prompt.split())
            total_completion_tokens += len(thought.split())
            action = "screenshot"; arg = ""; assistant_msg = "Working..."
            for line in thought.splitlines():
                if line.strip().startswith("ACTION "):
                    after = line.split("ACTION ", 1)[1]
                    action = after.split()[0].strip()
                if "ARG " in line:
                    part = line.split("ARG ", 1)[1]
                    if " THEN " in part:
                        arg = part.split(" THEN ")[0].strip()
                    else:
                        arg = part.strip()
                if "THEN " in line:
                    assistant_msg = line.split("THEN ", 1)[1].strip()
            output_events.append({"summary": [{"text": assistant_msg, "type": "summary_text"}], "type": "reasoning"})
            call_id = "call_" + uuid.uuid4().hex[:16]
            tool_res = self.tool.run(action, arg)
            output_events.append({"action": {"type": action, "text": arg}, "call_id": call_id, "status": tool_res["status"], "type": "computer_call"})
            snap = self.tool.computer.screenshot()
            output_events.append({"type": "computer_call_output", "call_id": call_id, "output": {"type": "input_image", "image_url": snap}})
            output_events.append({"type": "message", "role": "assistant", "content": [{"type": "output_text", "text": assistant_msg}]})
            if "done" in assistant_msg.lower() or "here is" in assistant_msg.lower():
                break
            steps_remaining -= 1
        usage = {"prompt_tokens": total_prompt_tokens, "completion_tokens": total_completion_tokens, "total_tokens": total_prompt_tokens + total_completion_tokens, "response_cost": 0.0}
        yield {"output": output_events, "usage": usage}

We construct the ComputerAgent, which serves as the system’s intelligent controller. We program it to reason about goals, decide which actions to take, execute those through the tool interface, and record each interaction as a step in its decision-making process. Check out the FULL CODES here.
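For reference, the parsing loop above expects the local model to reply on a single line in this format (a hypothetical example, not an actual Flan-T5 generation):

ACTION click ARG mail THEN Here is your inbox summary.

If the reply does not match the pattern, the agent falls back to a screenshot action with the placeholder message "Working...", so the loop still advances to the next step.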

async def main_demo():
    computer = VirtualComputer()
    tool = ComputerTool(computer)
    llm = LocalLLM()
    agent = ComputerAgent(llm, tool, max_trajectory_budget=4)
    messages = [{"role": "user", "content": "Open mail, read inbox subjects, and summarize."}]
    async for result in agent.run(messages):
        print("==== STREAM RESULT ====")
        for event in result["output"]:
            if event["type"] == "computer_call":
                a = event.get("action", {})
                print(f"[TOOL CALL] {a.get('type')} -> {a.get('text')} [{event.get('status')}]")
            if event["type"] == "computer_call_output":
                snap = event["output"]["image_url"]
                print("SCREEN AFTER ACTION:\n", snap[:400], "...\n")
            if event["type"] == "message":
                print("ASSISTANT:", event["content"][0]["text"], "\n")
        print("USAGE:", result["usage"])

loop = asyncio.get_event_loop()
loop.run_until_complete(main_demo())

We bring everything together by running the demo, where the agent interprets a user’s request and performs tasks on the virtual computer. We observe it generating reasoning, executing commands, updating the virtual screen, and achieving its goal in a clear, step-by-step manner.

In conclusion, we implemented the essence of a computer-use agent capable of autonomous reasoning and interaction. We witness how local language models like Flan-T5 can powerfully simulate desktop-level automation within a safe, text-based sandbox. This project helps us understand the architecture behind intelligent agents such as those in computer-use agents, bridging natural language reasoning with virtual tool control. It lays a strong foundation for extending these capabilities toward real-world, multimodal, and secure automation systems.


Google vs OpenAI vs Anthropic: The Agentic AI Arms Race Breakdown

Table of contents:
OpenAI: CUA for GUI Autonomy, Responses as Agent Surface, and AgentKit for Lifecycle
Google: Gemini 2.0 and Astra for Perception, Vertex AI Agent Builder for Orchestration, Gemini Enterprise for Governance
Anthropic: Computer Use and App-Builder Path via Artifacts
Benchmarks That Matter for Agent Selection
Comparative Analysis
Deployment Guidance for Technical Teams
Bottom Line by Vendor
Editorial Comments

In this article we will analyze how Google, OpenAI, and Anthropic are productizing ‘agentic’ capabilities across computer-use control, tool/function calling, orchestration, governance, and enterprise packaging.

Agent platforms, not only models, now define competitive advantage. Google is aligning Gemini 2.0 with an enterprise control plane on Vertex AI and a new ‘front door’ called Gemini Enterprise. OpenAI is consolidating developers around the Responses API, packaging agent lifecycle elements as AgentKit, and deploying a general GUI controller called the Computer-Using Agent (CUA). Anthropic is expanding Computer Use while turning Artifacts into a lightweight app builder for rapid internal tools.

OpenAI: CUA for GUI Autonomy, Responses as Agent Surface, and AgentKit for Lifecycle

Computer-Using Agent (CUA)

OpenAI introduced Operator in January 2025, powered by the CUA model. CUA combines GPT-4o-class vision with reinforcement learning for GUI policies, executing tasks through human-like primitives: screen perception, mouse, and keyboard. The stated purpose is a single interface that generalizes across web and desktop tasks.

Responses API

OpenAI repositioned Responses as the primary agent-native API. The design folds chat, tool use, state, and multimodality into one API call and is marketed as the integration surface for GPT-5-era reasoning workflows. This simplifies the historical split across Chat Completions and Assistants, formalizing hosted tools and persistent reasoning in a single endpoint.

AgentKit

Launched in October 2025, AgentKit packages agent building blocks: visual design surfaces, connectors/registries, evaluation hooks, and embeddable agent UIs. The aim is to reduce orchestration sprawl and standardize agent lifecycle from design to deployment. ​

Risk Profile

Early third-party evaluations note brittleness on practical automations: flaky DOM targets, window focus loss, and recovery failure on layout changes. While not unique to OpenAI, this matters for production SLAs. Teams should instrument retries, stabilize selectors, and gate high-risk steps behind review. Pair CUA experiments with execution-based evaluation such as OSWorld tasks.​

Position: OpenAI is optimizing for a programmable agent substrate: a single API surface (Responses), a lifecycle kit (AgentKit), and a universal GUI controller (CUA). For teams willing to own their evaluation harness and operations, this stack provides tight control and fast iteration loops.​

Google: Gemini 2.0 and Astra for Perception, Vertex AI Agent Builder for Orchestration, Gemini Enterprise for Governance

Models and Runtime

Google frames Gemini 2.0 as ‘built for the agentic era,’ with native tool use and multimodal I/O including image/audio output. Project Astra demonstrations highlight low-latency, always-on perception and continuous assistance patterns that map to planning plus acting loops. These capabilities are intended to feed Gemini Live and the broader agent runtime.​

Vertex AI Agent Builder

Google’s control plane for building and deploying agents on GCP is Vertex AI Agent Builder. The official documentation shows Agent Garden for templates and tools, orchestration for multi-agent experiences, and integration with other Vertex components. This serves as the platform to implement policies, logging, and evaluation pipelines for GCP users.​

Gemini Enterprise

In October 2025, Google announced Gemini Enterprise as a governed front door to ‘discover, create, share, and run AI agents’ with central policy and visibility. It emphasizes cross-suite context spanning Google Workspace and Microsoft 365/SharePoint, plus line-of-business integrations such as Salesforce and SAP. This is positioned as a fleet-level governance layer, not only a development kit.

Application Surface

Google is also pushing agentic control into end-user environments. Agent Mode in the Gemini app and Project Mariner extend consumer and prosumer workflows: teach-and-repeat, multi-task management, and autonomous execution for common tasks like search and filtering. This serves as both a data source for guardrails and a proving ground for UI-safety patterns.​

Position: Google is optimizing for governed enterprise deployment with wide surface integration. If you need centralized policy/visibility across many agents, with Workspace and cross-suite context, the Gemini Enterprise + Vertex pairing offers the most prescriptive path today.​

Anthropic: Computer Use and App-Builder Path via Artifacts

Computer Use

Anthropic introduced Computer Use for Claude 3.5 Sonnet in October 2024, explicitly as a beta capability that requires appropriate software setup to emulate human cursor and keyboard interactions. The company has been quite transparent about error profiles and the need for careful mediation. For production, expect policy-first defaults and incremental broadening rather than a hard pivot to full autonomy.​

Artifacts → App Building

In June 2025, Anthropic extended Artifacts from an inline canvas to build, host, and share interactive apps directly from Claude. The feature targets rapid internal tools and shareable mini-apps. Developers can create apps that call back into Claude via a new API, and published app usage bills the end user rather than the author.​

Position: Anthropic is optimizing for fast human-in-the-loop creation with explicit safety posture. The combination of Computer Use and Artifacts supports a design pattern where users co-pilot agents, validate actions, and graduate prototypes into shareable internal apps without heavy scaffolding.​

Benchmarks That Matter for Agent Selection

Function/Tool Calling

The Berkeley Function-Calling Leaderboard (BFCL) V4 expands beyond single calls to multi-turn planning, live/non-live settings, and hallucination measurement. You can use BFCL for tool-routing quality, argument fidelity, and sequencing under state changes.​

Computer/Web Use

OSWorld defines a benchmark of 369 real desktop tasks with execution-based evaluations across OSes and multi-app workflows. Original results showed large human–agent gaps and identified GUI grounding as a major bottleneck. You can treat OSWorld as the minimum bar for assessing GUI agents, then layer domain-specific workflows.​

Conversational Tool Agents

τ-Bench simulates dynamic conversations where an agent must follow domain rules and interact with tools; the 2025 τ²-Bench extension adds dual-control scenarios where both the user and agent can act, increasing realism for support workflows. You can use these when you care about policy adherence, user guidance, and multi-trial reliability.​

Software-Engineering Agents

SWE-Bench family leaderboards cover end-to-end issue resolution; SWE-Bench Pro (2025) raises task difficulty and adds contamination resistance with 1,865 instances across 41 repositories. For engineering assistants, you should not rely on ‘Lite’ alone—run Verified or Pro with a locked scaffold.​​

Comparative Analysis

Model Core and Modality

OpenAI currently couples GPT-5-era orchestration via Responses with a general GUI controller (CUA). This allows one integration surface for reasoning and tools plus a controller trained with RL for on-screen actions. Google pushes Gemini 2.0 and Astra for low-latency multimodal perception with tool use, then exposes agent plumbing through Vertex and Gemini Enterprise. Anthropic advances Claude 3.5 with Computer Use, while offering Artifacts to transform prompts into shareable apps that can call the model. The differences map to strategy: programmable substrate (OpenAI), governed enterprise scale (Google), and human-in-the-loop app creation (Anthropic).​

Agent Platform and Lifecycle

OpenAI’s AgentKit is an opinionated toolkit that reduces custom scaffolds and aligns with Responses. Google’s Vertex AI Agent Builder offers multi-agent orchestration plus governance hooks in a GCP-native control plane. Anthropic’s Artifacts/app-builder anchors a rapid prototyping loop for internal tools and user-validated workflows. Select based on where you want to spend engineering effort: programmable pipelines (OpenAI), centralized IT management (Google), or fastest human-supervised iteration (Anthropic).​

Governance and Policy

Google’s Gemini Enterprise is the clearest statement of fleet-level governance: central policy, visibility, cross-suite context for Workspace and Microsoft 365, and connectors for line-of-business apps. OpenAI’s consolidation into Responses reduces integration surfaces and should simplify policy attachment, but enterprise posture varies by customer architecture. Anthropic’s default stance is cautious feature rollout with explicit policy framing and human mediation.​

Evaluation Story and External Signals

OpenAI claims strong computer-/browser-use performance for CUA, but independent harnesses like OSWorld still report significant gaps across agents. Google’s agent messaging leans on demonstrations and enterprise rollouts; verify claims on BFCL, OSWorld, and domain workloads in Vertex. Anthropic’s Artifacts provides a pathway to test-and-deploy small apps quickly, then measure them against τ-Bench-style dialogue tasks and OSWorld-style GUI tasks.

Deployment Guidance for Technical Teams

1) Lock the Runner Before the Model

You can adopt execution-based, state-aware harnesses. For GUI control, use OSWorld’s verified setups and task scripts. For tool orchestration, use BFCL V4’s multi-turn and hallucination components. For policy-bound dialogues, prefer τ/τ²-Bench. For engineering assistants, add SWE-Bench Verified or Pro. Keep the runner constant while iterating on models, prompts, and retries.​

2) Decide Where Governance Lives

If you need centralized visibility across many agents plus Workspace and Microsoft 365 context, Google’s Gemini Enterprise combined with Vertex AI Agent Builder provides the most prescriptive governance plane. If you want a programmable substrate and will own policy integration yourself, OpenAI’s Responses + AgentKit stack is coherent. Anthropic’s approach favors human-in-the-loop controls with clear policy boundaries through the product surface.​

3) Design for GUI Failure and Recovery

Selectors drift, window focus changes, and visual similarity confuses detectors. You can build retries, add ‘are we on the right page’ checks, and gate irreversible actions behind review. This guidance applies to OpenAI CUA and Anthropic Computer Use alike, and the gaps are documented in OSWorld results.​
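A schematic version of that guidance, written as a vendor-neutral wrapper, is sketched below; the callables and their names are hypothetical placeholders supplied by whatever agent stack you use, not any vendor's API.

import time

def run_step(perform_action, page_ok, approve, irreversible=False, retries=3):
    """Retry a GUI step, verify the resulting page, and gate risky actions.
    perform_action, page_ok, and approve are caller-supplied callables."""
    for attempt in range(retries):
        result = perform_action()          # one click/type/navigate action
        if page_ok():                      # "are we on the right page" check
            if irreversible and not approve(result):
                raise RuntimeError("Irreversible step rejected in review gate")
            return result
        time.sleep(2 ** attempt)           # back off before retrying after drift or focus loss
    raise RuntimeError(f"Step failed after {retries} attempts")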

4) Optimize for Your Iteration Style

If you prototype many small internal tools, Anthropic’s Artifacts/app-builder minimizes scaffolding and lets non-specialists contribute. If you need deeply programmable pipelines with hosted tools and memory, Responses plus AgentKit offers the most consolidated primitives today. For governed, fleet-level rollouts, Google’s Vertex + Gemini Enterprise stack is designed for IT-managed scale.​

Bottom Line by Vendor

OpenAI: A programmable agent substrate: Responses as the unifying API, AgentKit for lifecycle, and CUA for GUI autonomy. This stack is attractive when you want direct control over tools, memory, and evaluation and are prepared to operate your own runners. You can validate GUI tasks on OSWorld and dialogue planning on τ-Bench.​

Google: A governed enterprise plane: Vertex AI Agent Builder for orchestration and Gemini Enterprise for organization-wide policy, visibility, and cross-suite context. This may be the clearest route to standardized agent operations in large estates using Workspace or hybrid 365 environments. You can test tool quality on BFCL and GUI reliability on OSWorld before scaling.​

Anthropic: A human-in-the-loop path: Computer Use plus Artifacts/app-builder for rapid creation and sharing of internal apps. This works well for teams that want fast iteration with explicit checkpoints and policy framing. You can use τ-Bench to assess policy adherence and user guidance, and OSWorld to check GUI action reliability.​

Editorial Comments

The agentic AI landscape of 2025 reveals three fundamentally different philosophies that will likely define the next phase of enterprise AI adoption. OpenAI’s bet on a unified, programmable substrate reflects their developer-first DNA, but risks overwhelming teams without strong engineering capabilities. Google’s enterprise governance play is strategically sound given their Workspace dominance, yet feels bureaucratic compared to the nimble iteration cycles that define successful AI deployments. Anthropic’s human-in-the-loop approach appears most aligned with current organizational realities—where trust, not just capability, remains the bottleneck for AI adoption. The real winner may not be determined by technical superiority alone, but by which vendor best navigates the gap between AI possibility and enterprise practicality. With 95% of generative AI pilots failing to reach production according to MIT research, the platform that solves deployment friction rather than just model performance will likely capture the largest share of the projected $47.1 billion AI agent market by 2030.

References: ​

https://www.fanktank.ch/en/blog/choosing-ai-models-openai-anthropic-google-2025

https://www.mindset.ai/blogs/in-the-loop-ep15-the-three-battles-to-own-all-ai

https://deeplp.com/f/xxx

https://akka.io/blog/agentic-ai-tools

https://www.alvarezandmarsal.com/thought-leadership/demystifying-ai-agents-in-2025-separating-hype-from-reality-and-navigating-market-outlook

https://www.datacamp.com/blog/best-ai-agents

https://mashable.com/article/best-ai-agents-work

https://claude.ai/public/artifacts/e7c1cf72-338c-4b70-bab2-fff4bf0ac553

OpenAI launches Operator, an AI agent that performs tasks autonomously

https://openai.com/index/introducing-agentkit/

https://cloud.google.com/blog/products/ai-machine-learning/introducing-gemini-enterprise

https://www.anthropic.com/news/3-5-models-and-computer-use

https://openai.com/index/introducing-operator/

https://openai.com/index/computer-using-agent/

https://openai.com/index/new-tools-and-features-in-the-responses-api/

https://developers.openai.com/blog/responses-api/

OpenAI launches AgentKit to help developers build and ship AI agents 

OpenAI Launches AgentKit for Building AI Agents – Here Is All You Need To Know

https://www.technologyreview.com/2025/01/23/1110484/openai-launches-operator-an-agent-that-can-use-a-computer-for-you/

https://shellypalmer.com/2024/12/google-launches-gemini-2-0-ushering-in-the-agentic-era/

https://blog.google/products/gemini/google-gemini-ai-collection-2024/

https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/

Google ramps up its ‘AI in the workplace’ ambitions with Gemini Enterprise

https://www.reuters.com/business/google-launches-gemini-enterprise-ai-platform-business-clients-2025-10-09/

https://blog.google/products/google-cloud/gemini-enterprise-sundar-pichai/

https://www.anthropic.com/news/developing-computer-use

https://www.nist.gov/news-events/news/2024/11/pre-deployment-evaluation-anthropics-upgraded-claude-35-sonnet

https://www.infoq.com/news/2025/06/anthropic-artifacts-app/

https://www.anthropic.com/news/build-artifacts

https://www.anthropic.com/news/claude-powered-artifacts

https://gorilla.cs.berkeley.edu/leaderboard.html

https://gorilla.cs.berkeley.edu/blogs/15_bfcl_v4_web_search.html

https://openreview.net/forum?id=2GmDdhBdDk

https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf


Liquid AI’s LFM2-VL-3B Brings a 3B Parameter Vision Language Model (VLM) to Edge-Class Devices

Liquid AI released LFM2-VL-3B, a 3B parameter vision language model for image text to text tasks. It extends the LFM2-VL family beyond the 450M and 1.6B variants. The model targets higher accuracy while preserving the speed profile of the LFM2 architecture. It is available on LEAP and Hugging Face under the LFM Open License v1.0.

Model overview and interface

LFM2-VL-3B accepts interleaved image and text inputs and produces text outputs. The model exposes a ChatML-like template, and the processor inserts an <image> sentinel that is replaced with encoded image tokens at run time. The default text context length is 32,768 tokens. These details help developers reproduce evaluations and integrate the model with existing multimodal pipelines.

https://www.liquid.ai/blog/lfm2-vl-3b-a-new-efficient-vision-language-for-the-edge

Architecture

The stack pairs a language tower with a shape-aware vision tower and a projector. The language tower is LFM2-2.6B, a hybrid convolution-plus-attention backbone. The vision tower is SigLIP2 NaFlex at 400M parameters, which preserves native aspect ratios and avoids distortion. The connector is a 2-layer MLP with pixel unshuffle that compresses image tokens before fusion with the language space. This design lets users cap vision token budgets without retraining the model.

The encoder processes native resolutions up to 512×512. Larger inputs are split into non-overlapping 512×512 patches, and a thumbnail pathway provides global context during tiling. The token mapping is documented with concrete examples: a 256×384 image maps to 96 tokens, and a 1000×3000 image maps to 1,020 tokens. The model card exposes user controls for minimum and maximum image tokens as well as the tiling switch, which tune speed and quality at inference time.

Inference settings

The Hugging Face model card provides recommended parameters. Text generation uses temperature 0.1, min p 0.15, and a repetition penalty of 1.05. Vision settings use min image tokens 64, max image tokens 256, and image splitting enabled. The processor applies the chat template and the image sentinel automatically. The example uses AutoModelForImageTextToText and AutoProcessor with bfloat16 precision.
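A minimal loading-and-generation sketch consistent with those settings is shown below. Treat it as an assumption-laden outline rather than the official snippet: the repository id LiquidAI/LFM2-VL-3B, the placeholder image URL, and the availability of AutoModelForImageTextToText and min_p sampling in your installed transformers version should all be verified against the model card.

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "LiquidAI/LFM2-VL-3B"  # assumed Hugging Face repo id, verify on the model card
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# ChatML-like conversation; the processor inserts the <image> sentinel for us.
messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder image
        {"type": "text", "text": "Describe this image in one sentence."},
    ]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

# Recommended decoding settings from the model card: temperature 0.1,
# min_p 0.15, repetition penalty 1.05.
output = model.generate(
    **inputs, do_sample=True, temperature=0.1, min_p=0.15,
    repetition_penalty=1.05, max_new_tokens=256,
)
print(processor.batch_decode(output, skip_special_tokens=True)[0])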

How is it trained?

Liquid AI describes a staged approach. The team performs joint mid training that adjusts the text to image ratio over time. The model then undergoes supervised fine tuning focused on image understanding. The data sources are large scale open datasets plus in house synthetic vision data for task coverage.

Benchmarks

The research team reports competitive results among lightweight open VLMs. On MM-IFEval the model reaches 51.83. On RealWorldQA it reaches 71.37. On MMBench dev en it reaches 79.81. The POPE score is 89.01. The table notes that scores for other systems were computed with VLMEvalKit. The table excludes Qwen3-VL-2B because that system was released one day earlier.

https://www.liquid.ai/blog/lfm2-vl-3b-a-new-efficient-vision-language-for-the-edge

The language capability remains close to the LFM2-2.6B backbone. The research team cites 30 percent on GPQA and 63 percent on MMLU. This matters when perception tasks include knowledge queries. The team also states expanded multilingual visual understanding across English, Japanese, French, Spanish, German, Italian, Portuguese, Arabic, Chinese, and Korean.

Why edge users should care?

The architecture keeps compute and memory within small device budgets. Image tokens are compressible and user constrained, so throughput is predictable. SigLIP2 400M NaFlex encoder preserves aspect ratios, which helps fine grained perception. The projector reduces tokens at the connector, which improves tokens per second. The research team also published a GGUF build for on device runtimes. These properties are useful for robotics, mobile, and industrial clients that need local processing and strict data boundaries.

Key Takeaways

Compact multimodal stack: 3B parameter LFM2-VL-3B pairs an LFM2-2.6B language tower with a 400M SigLIP2 NaFlex vision encoder and a 2-layer MLP projector for image-token fusion. NaFlex preserves native aspect ratios.

Resolution handling and token budgets: Images run natively up to 512×512, larger inputs tile into non overlapping 512×512 patches with a thumbnail pathway for global context. Documented token mappings include 256×384 → 96 tokens and 1000×3000 → 1,020 tokens.

Inference interface: ChatML-like prompting with an <image> sentinel, default text context 32,768 tokens, recommended decoding settings, and processor-level controls for image splitting enable reproducible evaluation and easy integration in multimodal pipelines.

Measured performance: Reported results include MM-IFEval 51.83, RealWorldQA 71.37, MMBench-dev-en 79.81, and POPE 89.01. Language-only signals from the backbone are about 30% GPQA and 63% MMLU, useful for mixed perception plus knowledge workloads.

Editorial Comments

LFM2-VL-3B is a practical step for edge multimodal workloads, the 3B stack pairs LFM2-2.6B with a 400M SigLIP2 NaFlex encoder and an efficient projector, which lowers image token counts for predictable latency. Native resolution processing with 512 by 512 tiling and token caps gives deterministic budgets. Reported scores on MM-IFEval, RealWorldQA, MMBench, and POPE are competitive for this size. Open weights, a GGUF build, and LEAP access reduce integration friction. Overall, this is an edge ready VLM release with clear controls and transparent benchmarks.

Check out the Model on HF and Technical details.

An Implementation on Building Advanced Multi-Endpoint Machine Learning APIs with LitServe: Batching, Streaming, Caching, and Local Inference

In this tutorial, we explore LitServe, a lightweight and powerful serving framework that allows us to deploy machine learning models as APIs with minimal effort. We build and test multiple endpoints that demonstrate real-world functionalities such as text generation, batching, streaming, multi-task processing, and caching, all running locally without relying on external APIs. By the end, we clearly understand how to design scalable and flexible ML serving pipelines that are both efficient and easy to extend for production-level applications. Check out the FULL CODES here.

!pip install litserve torch transformers -q

import litserve as ls
import torch
from transformers import pipeline
import time
from typing import List

We begin by setting up our environment on Google Colab and installing all required dependencies, including LitServe, PyTorch, and Transformers. We then import the essential libraries and modules that will allow us to define, serve, and test our APIs efficiently. Check out the FULL CODES here.

class TextGeneratorAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("text-generation", model="distilgpt2",
                              device=0 if device == "cuda" and torch.cuda.is_available() else -1)
        self.device = device

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        result = self.model(prompt, max_length=100, num_return_sequences=1, temperature=0.8, do_sample=True)
        return result[0]['generated_text']

    def encode_response(self, output):
        return {"generated_text": output, "model": "distilgpt2"}


class BatchedSentimentAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english",
                              device=0 if device == "cuda" and torch.cuda.is_available() else -1)

    def decode_request(self, request):
        return request["text"]

    def batch(self, inputs: List[str]) -> List[str]:
        return inputs

    def predict(self, batch: List[str]):
        results = self.model(batch)
        return results

    def unbatch(self, output):
        return output

    def encode_response(self, output):
        return {"label": output["label"], "score": float(output["score"]), "batched": True}

Here, we create two LitServe APIs, one for text generation using a local DistilGPT2 model and another for batched sentiment analysis. We define how each API decodes incoming requests, performs inference, and returns structured responses, demonstrating how easy it is to build scalable, reusable model-serving endpoints. Check out the FULL CODES here.
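The notebook exercises these classes in-process later on, but serving one over HTTP is a short script. The sketch below follows the standard LitServe pattern of wrapping an API in ls.LitServer; it is not part of the tutorial's executed code, and the max_batch_size and batch_timeout values are illustrative settings that enable server-side batching for the sentiment endpoint.

# serve_sentiment.py (run as a separate process, assuming the imports and classes above)
if __name__ == "__main__":
    server = ls.LitServer(BatchedSentimentAPI(), accelerator="auto",
                          max_batch_size=8, batch_timeout=0.05)
    server.run(port=8000)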

class StreamingTextAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("text-generation", model="distilgpt2",
                              device=0 if device == "cuda" and torch.cuda.is_available() else -1)

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        words = ["Once", "upon", "a", "time", "in", "a", "digital", "world"]
        for word in words:
            time.sleep(0.1)
            yield word + " "

    def encode_response(self, output):
        for token in output:
            yield {"token": token}

In this section, we design a streaming text-generation API that emits tokens as they are generated. We simulate real-time streaming by yielding words one at a time, demonstrating how LitServe can handle continuous token generation efficiently. Check out the FULL CODES here.
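To stream over HTTP rather than in-process, LitServe must be told the API yields incrementally. The sketch below uses the stream=True server flag and the default /predict route with a requests-based client loop; these are assumptions to verify against the LitServe docs, not code executed in this notebook.

# Server side (separate process), assuming the imports and class above:
server = ls.LitServer(StreamingTextAPI(), stream=True)
server.run(port=8000)

# Client side: consume chunks as they arrive.
import requests
with requests.post("http://localhost:8000/predict",
                   json={"prompt": "Tell me a story"}, stream=True) as resp:
    for line in resp.iter_lines():
        if line:
            print(line.decode())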

class MultiTaskAPI(ls.LitAPI):
    def setup(self, device):
        self.sentiment = pipeline("sentiment-analysis", device=-1)
        self.summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6", device=-1)
        self.device = device

    def decode_request(self, request):
        return {"task": request.get("task", "sentiment"), "text": request["text"]}

    def predict(self, inputs):
        task = inputs["task"]
        text = inputs["text"]
        if task == "sentiment":
            result = self.sentiment(text)[0]
            return {"task": "sentiment", "result": result}
        elif task == "summarize":
            if len(text.split()) < 30:
                return {"task": "summarize", "result": {"summary_text": text}}
            result = self.summarizer(text, max_length=50, min_length=10)[0]
            return {"task": "summarize", "result": result}
        else:
            return {"task": "unknown", "error": "Unsupported task"}

    def encode_response(self, output):
        return output

We now develop a multi-task API that handles both sentiment analysis and summarization via a single endpoint. This snippet demonstrates how we can manage multiple model pipelines through a unified interface, dynamically routing each request to the appropriate pipeline based on the specified task. Check out the FULL CODES here.
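Once served, a single /predict route handles both tasks, with the task field selecting the pipeline. The payloads below are illustrative client calls that assume the default LitServe endpoint and a server started as in the earlier sketch.

import requests

# Sentiment request
print(requests.post("http://localhost:8000/predict",
                    json={"task": "sentiment", "text": "Amazing tutorial!"}).json())

# Summarization request (inputs under 30 words are echoed back unchanged)
print(requests.post("http://localhost:8000/predict",
                    json={"task": "summarize", "text": "A long article body goes here ..."}).json())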

class CachedAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("sentiment-analysis", device=-1)
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def decode_request(self, request):
        return request["text"]

    def predict(self, text):
        if text in self.cache:
            self.hits += 1
            return self.cache[text], True
        self.misses += 1
        result = self.model(text)[0]
        self.cache[text] = result
        return result, False

    def encode_response(self, output):
        result, from_cache = output
        return {"label": result["label"], "score": float(result["score"]), "from_cache": from_cache,
                "cache_stats": {"hits": self.hits, "misses": self.misses}}

We implement an API that uses caching to store previous inference results, reducing redundant computation for repeated requests. We track cache hits and misses in real time, illustrating how simple caching mechanisms can drastically improve performance in repeated inference scenarios. Check out the FULL CODES here.

def test_apis_locally():
    print("=" * 70)
    print("Testing APIs Locally (No Server)")
    print("=" * 70)

    api1 = TextGeneratorAPI(); api1.setup("cpu")
    decoded = api1.decode_request({"prompt": "Artificial intelligence will"})
    result = api1.predict(decoded)
    encoded = api1.encode_response(result)
    print(f"✓ Result: {encoded['generated_text'][:100]}...")

    api2 = BatchedSentimentAPI(); api2.setup("cpu")
    texts = ["I love Python!", "This is terrible.", "Neutral statement."]
    decoded_batch = [api2.decode_request({"text": t}) for t in texts]
    batched = api2.batch(decoded_batch)
    results = api2.predict(batched)
    unbatched = api2.unbatch(results)
    for i, r in enumerate(unbatched):
        encoded = api2.encode_response(r)
        print(f"✓ '{texts[i]}' -> {encoded['label']} ({encoded['score']:.2f})")

    api3 = MultiTaskAPI(); api3.setup("cpu")
    decoded = api3.decode_request({"task": "sentiment", "text": "Amazing tutorial!"})
    result = api3.predict(decoded)
    print(f"✓ Sentiment: {result['result']}")

    api4 = CachedAPI(); api4.setup("cpu")
    test_text = "LitServe is awesome!"
    for i in range(3):
        decoded = api4.decode_request({"text": test_text})
        result = api4.predict(decoded)
        encoded = api4.encode_response(result)
        print(f"✓ Request {i+1}: {encoded['label']} (cached: {encoded['from_cache']})")

    print("=" * 70)
    print(" All tests completed successfully!")
    print("=" * 70)

test_apis_locally()

We test all our APIs locally to verify their correctness and performance without starting an external server. We sequentially evaluate text generation, batched sentiment analysis, multi-tasking, and caching, ensuring each component of our LitServe setup runs smoothly and efficiently.

In conclusion, we create and run diverse APIs that showcase the framework’s versatility. We experiment with text generation, sentiment analysis, multi-tasking, and caching to experience LitServe’s seamless integration with Hugging Face pipelines. As we complete the tutorial, we realize how LitServe simplifies model deployment workflows, enabling us to serve intelligent ML systems in just a few lines of Python code while maintaining flexibility, performance, and simplicity.


Salesforce AI Research Introduces WALT (Web Agents that Learn Tools): Enabling LLM Agents to Automatically Discover Reusable Tools from Any Website

A team of Salesforce AI researchers introduced WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into reusable invocable tools. It reframes browser automation around callable tools rather than long chains of clicks. Agents then call operations such as search, filter, sort, post_comment, and create_listing. This reduces dependence on large language model step by step reasoning and increases determinism during execution.

https://arxiv.org/pdf/2510.01524

What WALT builds?

Web agents often fail when layouts shift or when tasks require long sequences. WALT targets this failure mode by mining site functionality offline, then exposing it as tools that encapsulate navigation, selection, extraction, and optional agentic steps. Tools carry contracts in the form of schemas and examples. At runtime, an agent composes a short program with a few tool calls to complete a task. The design goal is higher success with fewer steps and less reliance on free form reasoning.

Pipeline in two phases

The pipeline has discovery and construction with validation. In discovery, WALT explores a website and proposes tool candidates that map to common goals such as discovery, content management, and communication. In construction and validation, WALT converts traces to deterministic scripts, stabilizes selectors, attempts URL promotion when possible, induces an input schema, and registers a tool only after end to end checks pass. This shifts as much work as possible into stable URL and form operations and leaves agentic grounding for the cases that truly require it.

https://arxiv.org/pdf/2510.01524

Results on VisualWebArena and WebArena

On VisualWebArena, WALT reports an average success rate of 52.9 percent with per split results of 64.1 percent on Classifieds, 53.4 percent on Shopping, and 39.0 percent on Reddit. The table lists baselines such as SGV at 50.2 percent and ExaCT at 33.7 percent. Human performance is 88.7 percent on average.

On WebArena, WALT reaches 50.1 percent average across GitLab, Map, Shopping, CMS, Reddit, and Multi. The table shows WALT ahead of prior methods with a nine point margin over the best skill induction baseline. Human performance is 78.2 percent.

https://arxiv.org/pdf/2510.01524

Efficiency and ablations

Tools reduce action count by a factor near 1.4 on average relative to a matched agent without tools. On the Classifieds split, ablations show consistent gains when tools are used across different agent backbones. WALT with GPT 5 mini records 7 percent higher success and 27 percent fewer steps, while a human demonstration strategy yields 66.0 percent success. The fully autonomous WALT reaches 64.1 percent with 5 percent fewer steps than the human demonstration case. Multimodal DOM parsing adds 2.6 percent absolute improvement. External verification adds 3.3 percent while increasing checks. Across components, WALT records 21.3 percent fewer steps than baseline policies.

https://arxiv.org/pdf/2510.01524

Design choices that enforce determinism

WALT prefers URL level operations when the site exposes query parameters or routes for search and filtering. When pages require dynamic grounding, the tool script inserts bounded agentic steps such as content extraction or wait for page load. Selector stabilization and schema validation reduce drift when sites change. The method keeps the fraction of agentic operations low in discovered tool sets and biases toward deterministic actions like navigation, input, and click.
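To make the idea concrete, a URL-promoted tool of the kind described here might look like the sketch below. The schema, site, and function are hypothetical illustrations of the pattern, not code from the WALT release.

from typing import Optional
from urllib.parse import urlencode

# Hypothetical tool contract: an input schema plus a deterministic URL-level operation.
SEARCH_TOOL_SCHEMA = {"query": "string", "max_price": "integer (optional)"}

def search(query: str, max_price: Optional[int] = None) -> str:
    """Build the site's search URL directly from query parameters, so no
    step-by-step clicking or LLM grounding is needed for this operation."""
    params = {"q": query}
    if max_price is not None:
        params["price_max"] = max_price
    return "https://classifieds.example/search?" + urlencode(params)

print(search("mountain bike", max_price=300))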

Key Takeaways

Approach: WALT discovers and validates website-native functions, then exposes them as callable tools with input schemas, selector stabilization, and URL promotion, reducing brittle step sequences to deterministic operations.

Results — VisualWebArena: Average success rate 52.9%, with 64.1% on Classifieds, 53.4% on Shopping, and 39.0% on Reddit, outperforming several baselines reported in the paper.

Results — WebArena: Average success rate 50.1% across GitLab, Map, Shopping, CMS, Reddit, and Multi, showing consistent gains over skill-induction and search-based baselines.

Efficiency and Ablations: Toolization cuts steps by about 1.4x, with 21.3% fewer actions on average. Multimodal DOM parsing adds +2.6% absolute success, and external verification adds +3.3%.

Editorial Comments

WALT is a useful pivot from step sequence agents to functionality grounded tools. The framework reverse engineers latent website functionality into reusable invocable tools across discovery, content management, and communication. By promoting UI traces to deterministic tools with schema validation and URL operations, WALT lifts web agent success to 52.9 percent on VisualWebArena and 50.1 percent on WebArena, while cutting actions by about 21.3 percent. The release ships a CLI, walt discover, walt agent, and MCP serving for integration.

Check out the Paper and GitHub Page.

Responsible AI design in healthcare and life sciences

Generative AI has emerged as a transformative technology in healthcare, driving digital transformation in essential areas such as patient engagement and care management. It has shown potential to revolutionize how clinicians provide improved care through automated systems with diagnostic support tools that provide timely, personalized suggestions, ultimately leading to better health outcomes. For example, a study reported in BMC Medical Education found that medical students who received large language model (LLM)-generated feedback during simulated patient interactions significantly improved their clinical decision-making compared to those who did not.
At the center of most generative AI systems are LLMs capable of generating remarkably natural conversations, enabling healthcare customers to build products across billing, diagnosis, treatment, and research that can perform tasks and operate independently with human oversight. However, the utility of generative AI requires an understanding of the potential risks and impacts on healthcare service delivery, which necessitates the need for careful planning, definition, and execution of a system-level approach to building safe and responsible generative AI-infused applications.
In this post, we focus on the design phase of building healthcare generative AI applications, including defining system-level policies that determine the inputs and outputs. These policies can be thought of as guidelines that, when followed, help build a responsible AI system.
Designing responsibly
LLMs can transform healthcare by reducing the cost and time of delivering care, provided that considerations such as quality and reliability are addressed. As shown in the following diagram, responsible AI considerations can be successfully integrated into an LLM-powered healthcare application by considering quality, reliability, trust, and fairness for everyone. The goal is to promote and encourage responsible AI functionality in these systems. Examples include the following:

Each component’s input and output is aligned with clinical priorities to maintain alignment and promote controllability
Safeguards, such as guardrails, are implemented to enhance the safety and reliability of your AI system
Comprehensive AI red-teaming and evaluations are applied to the entire end-to-end system to assess safety and privacy-impacting inputs and outputs

Conceptual architecture
The following diagram shows a conceptual architecture of a generative AI application with an LLM. The inputs (directly from an end-user) are mediated through input guardrails. After the input has been accepted, the LLM can process the user’s request using internal data sources. The output of the LLM is again mediated through guardrails and can be shared with end-users.
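As one concrete way to wire this flow on AWS, the sketch below calls the Amazon Bedrock Converse API with an Amazon Bedrock Guardrails policy applied around the model invocation. The model ID and guardrail identifier and version are placeholders to replace with your own resources, and the overall snippet is a minimal outline rather than a production implementation.

import boto3

bedrock = boto3.client("bedrock-runtime")

def guarded_chat(prompt: str) -> str:
    # guardrailConfig applies the configured input and output policies
    # (PII detection, prompt-attack checks, denied topics) around the call.
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        guardrailConfig={
            "guardrailIdentifier": "your-guardrail-id",  # placeholder
            "guardrailVersion": "1",                     # placeholder
        },
    )
    # If the guardrail intervenes, the returned message contains the configured
    # blocked-content response instead of raw model output.
    return response["output"]["message"]["content"][0]["text"]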

Establish governance mechanisms
When building generative AI applications in healthcare, it’s essential to consider the various risks at the individual model or system level, as well as at the application or implementation level. The risks associated with generative AI can differ from or even amplify existing AI risks. Two of the most important risks are confabulation and bias:

Confabulation — The model generates confident but erroneous outputs, sometimes referred to as hallucinations. This could mislead patients or clinicians.
Bias — This refers to the risk of exacerbating historical societal biases among different subgroups, which can result from non-representative training data.

To mitigate these risks, consider establishing content policies that clearly define the types of content your applications should avoid generating. These policies should also guide how to fine-tune models and which appropriate guardrails to implement. It is crucial that the policies and guidelines are tailored and specific to the intended use case. For instance, a generative AI application designed for clinical documentation should have a policy that prohibits it from diagnosing diseases or offering personalized treatment plans.
Additionally, defining clear and detailed policies that are specific to your use case is fundamental to building responsibly. This approach fosters trust and helps developers and healthcare organizations carefully consider the risks, benefits, limitations, and societal implications associated with each LLM in a particular application.
The following are some example policies you might consider using for your healthcare-specific applications. The first table summarizes the roles and responsibilities for human-AI configurations.

Action ID
Suggested Action
Generative AI Risks

GV-3.2-001
Policies are in place to bolster oversight of generative AI systems with independent evaluations or assessments of generative AI models or systems where the type and robustness of evaluations are proportional to the identified risks.
CBRN Information or Capabilities; Harmful Bias and Homogenization

GV-3.2-002
Consider adjustment of organizational roles and components across lifecycle stages of large or complex generative AI systems, including: test and evaluation, validation, and red-teaming of generative AI systems; generative AI content moderation; generative AI system development and engineering; increased accessibility of generative AI tools, interfaces, and systems; and incident response and containment.
Human-AI Configuration; Information Security; Harmful Bias and Homogenization

GV-3.2-003
Define acceptable use policies for generative AI interfaces, modalities, and human-AI configurations (for example, for AI assistants and decision-making tasks), including criteria for the kinds of queries generative AI applications should refuse to respond to.
Human-AI Configuration

GV-3.2-004
Establish policies for user feedback mechanisms for generative AI systems that include thorough instructions and any mechanisms for recourse.
Human-AI Configuration

GV-3.2-005
Engage in threat modeling to anticipate potential risks from generative AI systems.
CBRN Information or Capabilities; Information Security

The following table summarizes policies for risk management in AI system design.

Action ID
Suggested Action
Generative AI Risks

GV-4.1-001
Establish policies and procedures that address continual improvement processes for generative AI risk measurement. Address general risks associated with a lack of explainability and transparency in generative AI systems by using ample documentation and techniques such as application of gradient-based attributions, occlusion or term reduction, counterfactual prompts and prompt engineering, and analysis of embeddings. Assess and update risk measurement approaches at regular cadences.
Confabulation

GV-4.1-002
Establish policies, procedures, and processes detailing risk measurement in context of use with standardized measurement protocols and structured public feedback exercises such as AI red-teaming or independent external evaluations.
CBRN Information and Capability; Value Chain and Component Integration

Transparency artifacts
Promoting transparency and accountability throughout the AI lifecycle can foster trust, facilitate debugging and monitoring, and enable audits. This involves documenting data sources, design decisions, and limitations through tools like model cards and offering clear communication about experimental features. Incorporating user feedback mechanisms further supports continuous improvement and fosters greater confidence in AI-driven healthcare solutions.
AI developers and DevOps engineers should be transparent about the evidence and reasons behind all outputs by providing clear documentation of the underlying data sources and design decisions so that end-users can make informed decisions about the use of the system. Transparency enables the tracking of potential problems and facilitates the evaluation of AI systems by both internal and external teams. Transparency artifacts guide AI researchers and developers on the responsible use of the model, promote trust, and help end-users make informed decisions about the use of the system.
The following are some implementation suggestions:

When building AI features with experimental models or services, it’s essential to highlight the possibility of unexpected model behavior so healthcare professionals can accurately assess whether to use the AI system.
Consider publishing artifacts such as Amazon SageMaker model cards or AWS system cards. Also, at AWS we provide detailed information about our AI systems through AWS AI Service Cards, which list intended use cases and limitations, responsible AI design choices, and deployment and performance optimization best practices for some of our AI services. AWS also recommends establishing transparency policies and processes for documenting the origin and history of training data while balancing the proprietary nature of training approaches. Consider creating a hybrid document that combines elements of both model cards and service cards, because your application likely uses foundation models (FMs) but provides a specific service.
Offer a user feedback mechanism. Gathering regular, scheduled feedback from healthcare professionals can help developers make necessary refinements to improve system performance. Also consider establishing policies that require user feedback mechanisms for AI systems, including thorough instructions and mechanisms for recourse.

Security by design
When developing AI systems, consider security best practices at each layer of the application. Generative AI systems might be vulnerable to adversarial attacks such as prompt injection, which exploit LLMs by manipulating their inputs or prompts. These types of attacks can result in data leakage, unauthorized access, or other security breaches. To address these concerns, it can be helpful to perform a risk assessment and implement guardrails for both the input and output layers of the application. As a general rule, your operating model should be designed to perform the following actions:

Safeguard patient privacy and data security by implementing personally identifiable information (PII) detection and configuring guardrails that check for prompt attacks
Continually assess the benefits and risks of all generative AI features and tools and regularly monitor their performance through Amazon CloudWatch or other alerts
Thoroughly evaluate all AI-based tools for quality, safety, and equity before deploying

Developer resources
The following resources are useful when architecting and building generative AI applications:

Amazon Bedrock Guardrails helps you implement safeguards for your generative AI applications based on your use cases and responsible AI policies. You can create multiple guardrails tailored to different use cases and apply them across multiple FMs, providing a consistent user experience and standardizing safety and privacy controls across your generative AI applications.
The AWS responsible AI whitepaper serves as an invaluable resource for healthcare professionals and other developers that are developing AI applications in critical care environments where errors could have life-threatening consequences.
AWS AI Service Cards explains the use cases for which the service is intended, how machine learning (ML) is used by the service, and key considerations in the responsible design and use of the service.

Conclusion
Generative AI has the potential to improve nearly every aspect of healthcare by enhancing care quality, patient experience, clinical safety, and administrative safety through responsible implementation. When designing, developing, or operating an AI application, try to systematically consider potential limitations by establishing a governance and evaluation framework grounded by the need to maintain the safety, privacy, and trust that your users expect.
For more information about responsible AI, refer to the following resources:

NIST Trustworthy and Responsible AI
OWASP Top 10 for Large Language Model applications

About the authors
Tonny Ouma is an Applied AI Specialist at AWS, specializing in generative AI and machine learning. As part of the Applied AI team, Tonny helps internal teams and AWS customers incorporate leading-edge AI systems into their products. In his spare time, Tonny enjoys riding sports bikes, golfing, and entertaining family and friends with his mixology skills.
Simon Handley, PhD, is a Senior AI/ML Solutions Architect in the Global Healthcare and Life Sciences team at Amazon Web Services. He has more than 25 years’ experience in biotechnology and machine learning and is passionate about helping customers solve their machine learning and life sciences challenges. In his spare time, he enjoys horseback riding and playing ice hockey.

Beyond pilots: A proven framework for scaling AI to production

The era of perpetual AI pilots is over. This year, 65% of AWS Generative AI Innovation Center customer projects moved from concept to production—some launching in just 45 days, as AWS VP Swami Sivasubramanian shared on LinkedIn. These results come from insights gained across more than one thousand customer implementations.
The Generative AI Innovation Center pairs organizations across industries with AWS scientists, strategists, and engineers to implement practical AI solutions that drive measurable outcomes. These initiatives transform diverse sectors worldwide. For example, through a cross-functional AWS collaboration, we supported the National Football League (NFL) in creating a generative AI-powered solution that obtains statistical game insights within 30 seconds. This helps their media and production teams locate video content six times faster. Similarly, we helped Druva’s DruAI system streamline customer support and data protection through natural language processing, reducing investigation time from hours to minutes.
These achievements reflect a broader pattern of success, driven by a powerful methodology: The Five V’s Framework for AI Implementation.

This framework takes projects from initial testing to full deployment by focusing on concrete business outcomes and operational excellence. It’s grounded in two of Amazon’s Leadership Principles, Customer Obsession and Deliver Results. By starting with what customers actually need and working backwards, we’ve helped companies across industries modernize their operations and better serve their customers.
The Five V’s Framework: A foundation for success
Every successful AI deployment begins with groundwork. In our experience, projects thrive when organizations first identify specific challenges they need to solve, align key stakeholders around these goals, and establish clear accountability for results. The Five V’s Framework helps guide organizations through a structured process:

Value: Target high-impact opportunities aligned with your strategic priorities
Visualize: Define clear success metrics that link directly to business outcomes
Validate: Test solutions against real-world requirements and constraints
Verify: Create a scalable path to production that delivers sustainable results
Venture: Secure the resources and support needed for long-term success

Value: The critical first step
The Value phase emphasizes working backwards from your most pressing business challenges. By starting with existing pain points and collaborating across technical and business teams, organizations can develop solutions that deliver meaningful return on investment (ROI). This focused approach helps direct resources where they’ll have the greatest impact.
Visualize: Defining success through measurement
The next step requires translating the potential benefits—cost reduction, revenue growth, risk mitigation, improved customer experience, and competitive advantage—into clear, measurable performance indicators. A comprehensive measurement framework starts with baseline metrics using historical data where available. These metrics should address both technical aspects like accuracy and response time, as well as business outcomes such as productivity gains and customer satisfaction.
The Visualize phase examines data availability and quality to support proper measurement while working with stakeholders to define success criteria that align with strategic objectives. This dual focus helps organizations track not just the performance of the AI solution, but its actual impact on business goals.
Validate: Where ambition meets reality
The Validate phase focuses on testing solutions against real-world conditions and constraints. Our approach integrates strategic vision with implementation expertise from day one. As Sri Elaprolu, Director of the Generative AI Innovation Center, explains: “Effective validation creates alignment between vision and execution. We unite diverse perspectives—from scientists to business leaders—so that solutions deliver both technical excellence and measurable business impact.”
This process involves systematic integration testing, stress testing for expected loads, verifying compliance requirements, and gathering end-user feedback. Security specialists shape the core architecture. Industry subject matter experts define the operational processes and decision logic that guide prompt design and model refinement. Change management strategies are integrated early to ensure alignment and adoption.
The Generative AI Innovation Center partnered with SparkXGlobal, an AI-driven marketing-technology company, to validate their new solution through comprehensive testing. Their platform, Xnurta, provides business analytics and reporting for Amazon merchants, demonstrating impressive results: report processing time dropped from 6-8 hours to just 8 minutes while maintaining 95% accuracy. This successful validation established a foundation for SparkXGlobal’s continued innovation and enhanced AI capabilities.
Working with the Generative AI Innovation Center, the U.S. Environmental Protection Agency (EPA) created an intelligent document processing solution powered by Anthropic models on Amazon Bedrock. This solution helped EPA scientists accelerate chemical risk assessments and pesticide reviews through transparent, verifiable, and human-controlled AI practices. The impact has been substantial: document processing time decreased by 85%, evaluation costs dropped by 99%, and more than 10,000 regulatory applications have advanced faster to protect public health.
Verify: The path to production
Moving from pilot to production requires more than proof of concept—it demands scalable solutions that integrate with existing systems and deliver consistent value. While demos can seem compelling, verification reveals the true complexity of enterprise-wide deployment. This critical stage maps the journey from prototype to production, establishing a foundation for sustainable success.
Building production-ready AI solutions brings together several key elements. Robust governance structures must facilitate responsible AI deployment and oversight, managing risk and compliance in an evolving regulatory landscape. Change management prepares teams and processes for new ways of working, driving organization-wide adoption. Operational readiness assessments evaluate existing workflows, integration points, and team capabilities to facilitate smooth implementation.
Architectural decisions in the verification phase balance scale, reliability, and operability, with security and compliance woven into the solution’s fabric. This often involves practical trade-offs based on real-world constraints. A simpler solution aligned to existing team capabilities may prove more valuable than a complex one requiring specialized expertise. Similarly, meeting strict latency requirements might necessitate choosing a streamlined model over a more sophisticated one, as model selection requires a balance of performance, accuracy, and computational costs based on the use case.
Generative AI Innovation Center Principal Data Scientist, Isaac Privitera, captures this philosophy: “When building a generative AI solution, we focus primarily on three things: measurable business impact, production readiness from day one, and sustained operational excellence. This trinity drives solutions that thrive in real-world conditions.”
Effective verification demands both technical expertise and practical wisdom from real-world deployments. It requires proving not just that a solution works in principle, but that it can operate at scale within existing systems and team capabilities. By systematically addressing these factors, we help make sure deployments deliver sustainable, long-term value.
Venture: Securing long-term success
Long-term success in AI also requires mindful resource planning across people, processes, and funding. The Venture phase maps the full journey from implementation through sustained organizational adoption.
Financial viability starts with understanding the total cost of ownership, from initial development through deployment, integration, training, and ongoing operations. Promising projects can stall mid-implementation due to insufficient resource planning. Success requires strategic budget allocation across all phases, with clear ROI milestones and the flexibility to scale.
Successful ventures demand organizational commitment through executive sponsorship, stakeholder alignment, and dedicated teams for ongoing optimization and maintenance. Organizations must also account for both direct and indirect costs—from infrastructure and development, to team training, process adaptation, and change management. A blend of sound financial planning and flexible resource strategies allows teams to accelerate and adjust as opportunities and challenges arise.
From there, the solution must integrate seamlessly into daily operations with clear ownership and widespread adoption. This transforms AI from a project into a core organizational capability.
Adopting the Five V’s Framework in your enterprise
The Five V’s Framework shifts AI focus from technical capabilities to business results, replacing ‘What can AI do?’ with ‘What do we need AI to do?’. Successful implementation requires both an innovative culture and access to specialized expertise.

AWS resources to support your journey
AWS offers a variety of resources to help you scale your AI to production.
Expert guidance
The AWS Partner Network (APN) offers multiple pathways to access specialized expertise, while AWS Professional Services brings proven methodologies from its own successful AI implementations. Certified partners, including Generative AI Partner Innovation Alliance members who receive direct enablement training from the Generative AI Innovation Center team, extend this expertise across industries. AWS Generative AI Competency Partners bring use case-specific success, while specialized partners focus on model customization and evaluation.
Self-service learning
For teams building internal capabilities, AWS provides technical blogs with implementation guides based on real-world experience, GitHub repositories with production-ready code, and AWS Workshop Studio for hands-on learning that bridges theory and practice.
Balancing learning and innovation
Even with the right framework and resources, not every AI project will reach production. These initiatives still provide valuable lessons that strengthen your overall program. Organizations can build lasting AI capabilities through three key principles:

Embracing a portfolio approach: Treat AI initiatives as an investment portfolio where diversification drives risk management and value creation. Balance quick wins (delivering value within months), strategic initiatives (driving longer-term transformation), and moonshot projects (potentially revolutionizing your business).
Creating a culture of safe experimentation: Organizations thrive with AI when teams can innovate boldly. In rapidly evolving fields, the cost of inaction often exceeds the risk of calculated experiments.
Learning from “productive failures”: Capture insights systematically across projects. Technical challenges reveal capability gaps, data issues expose information needs, and organizational readiness concerns illuminate broader transformation requirements – all shaping future initiatives.

The path forward
The next 12-18 months present a pivotal opportunity for organizations to harness generative AI and agentic AI to solve previously intractable problems, establish competitive advantages, and explore entirely new frontiers of business possibility. Those who successfully move from pilot to production will help define what’s possible within their industries and beyond.
Are you ready to move your AI initiatives into production?

Learn more about the AWS Generative AI Innovation Center and contact your AWS Account Manager to be connected to our expert guidance and support.
Join our AWS Builder community to connect with others on a similar AI journey.

About the authors
Sri Elaprolu serves as Director of the AWS Generative AI Innovation Center, where he leverages nearly three decades of technology leadership experience to drive artificial intelligence and machine learning innovation. In this role, he leads a global team of machine learning scientists and engineers who develop and deploy advanced generative and agentic AI solutions for enterprise and government organizations facing complex business challenges. Throughout his nearly 13-year tenure at AWS, Sri has held progressively senior positions, including leadership of ML science teams that partnered with high-profile organizations such as the NFL, Cerner, and NASA. These collaborations enabled AWS customers to harness AI and ML technologies for transformative business and operational outcomes. Prior to joining AWS, he spent 14 years at Northrop Grumman, where he successfully managed product development and software engineering teams. Sri holds a Master’s degree in Engineering Science and an MBA with a concentration in general management, providing him with both the technical depth and business acumen essential for his current leadership role.
Dr. Diego Socolinsky is currently the North America Head of the Generative AI Innovation Center at Amazon Web Services (AWS). With over 25 years of experience at the intersection of technology, machine learning, and computer vision, he has built a career driving innovation from cutting-edge research to production-ready solutions. Dr. Socolinsky holds a Ph.D. in Mathematics from The Johns Hopkins University and has been a pioneer in various fields including thermal imaging biometrics, augmented/mixed reality, and generative AI initiatives. His technical expertise spans from optimizing low-level embedded systems to architecting complex real-time deep learning solutions, with particular focus on generative AI platforms, large-scale unstructured data classification, and advanced computer vision applications. He is known for his ability to bridge the gap between technical innovation and strategic business objectives, consistently delivering transformative technology that solves complex real-world problems.
Sabine Khan is a Strategic Initiatives Leader with the AWS Generative AI Innovation Center, where she implements delivery and strategy initiatives focused on scaling enterprise-grade Generative AI solutions. She specializes in production-ready AI systems and drives agentic AI projects from concept to deployment. With over twenty years of experience in software delivery and a strong focus on AI/ML during her tenure at AWS, she has established a track record of successful enterprise implementations. Prior to AWS, she led digital transformation initiatives and held product development and software engineering leadership roles in Houston’s energy sector. Sabine holds a Master’s degree in GeoScience and an MBA.
Andrea Jimenez is a dual master’s candidate at the Massachusetts Institute of Technology, pursuing an M.S. in Computer Science from the School of Engineering and an MBA from the Sloan School of Management. As a GenAI Lead Graduate Fellow at the MIT GenAI Innovation Center, she researches agentic AI systems and the economic implications of generative AI technologies, while leveraging her background in artificial intelligence, product development, and startup innovation to lead teams at the intersection of technology and business strategy. Her work focuses on advancing human-AI collaboration and translating cutting-edge research into scalable, high-impact solutions. Prior to AWS and MIT, she led product and engineering teams in the tech industry and founded and sold a startup that helped early-stage companies build and launch SaaS products.
Randi Larson connects AI innovation with executive strategy for the AWS Generative AI Innovation Center, shaping how organizations understand and translate technical breakthroughs into business value. She combines strategic storytelling with data-driven insight through global keynotes, Amazon’s first tech-for-good podcast, and conversations with industry and Amazon leaders on AI transformation. Before Amazon, Randi refined her analytical precision as a Bloomberg journalist and advisor to economic institutions, think tanks, and family offices on technology initiatives. Randi holds an MBA from Duke University’s Fuqua School of Business and a B.S. in Journalism and Spanish from Boston University.

Google AI Introduces FLAME Approach: A One-Step Active Learning that S …

Open vocabulary object detectors answer text queries with boxes. In remote sensing, zero shot performance drops because classes are fine grained and visual context is unusual. The Google Research team proposes FLAME, a one step active learning strategy that rides on a strong open vocabulary detector and adds a tiny refiner that you can train in near real time on a CPU. The base model generates high recall proposals, the refiner filters false positives with a few targeted labels, and you avoid full model fine tuning. FLAME reports state of the art accuracy on DOTA and DIOR with 30 shots, and minute scale adaptation per label on a CPU.

https://arxiv.org/pdf/2510.17670v1

Problem framing

Open vocabulary detectors such as OWL ViT v2 are trained on web scale image text pairs. They generalize well on natural images, yet they struggle when categories are subtle, for example chimney versus storage tank, or when the imaging geometry is different, for example nadir aerial tiles with rotated objects and small scales. Precision falls because the text embedding and the visual embedding overlap for look alike categories. A practical system needs the breadth of open vocabulary models, and the precision of a local specialist, without hours of GPU fine tuning or thousands of new labels.

Method and design in brief

FLAME is a cascaded pipeline. Step one, run a zero shot open vocabulary detector to produce many candidate boxes for a text query, for example “chimney.” Step two, represent each candidate with visual features and its similarity to the text. Step three, retrieve marginal samples that sit near the decision boundary by doing a low dimensional projection with PCA, then a density estimate, then select the uncertain band. Step four, cluster this band and pick one item per cluster for diversity. Step five, have a user label about 30 crops as positive or negative. Step six, optionally rebalance with SMOTE or SVM SMOTE if the labels are skewed. Step seven, train a small classifier, for example an RBF SVM or a two layer MLP, to accept or reject the original proposals. The base detector stays frozen, so you keep recall and generalization, and the refiner learns the exact semantics the user meant.
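To make the refiner stage concrete, here is a minimal, hypothetical sketch using scikit-learn. It is not the authors' code: the quantile thresholds for the uncertain band are assumptions, and the text-similarity signal and SMOTE rebalancing steps are omitted for brevity.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def select_samples_to_label(proposal_features, n_labels=30):
    """Pick a diverse set of boundary proposals to show the annotator."""
    z = PCA(n_components=2).fit_transform(proposal_features)      # low-dimensional projection
    log_density = KernelDensity().fit(z).score_samples(z)         # density estimate
    lo, hi = np.quantile(log_density, [0.3, 0.7])                 # assumed mid-density "uncertain band"
    band = np.where((log_density >= lo) & (log_density <= hi))[0] # assumes at least n_labels proposals fall here
    km = KMeans(n_clusters=n_labels, n_init=10).fit(z[band])      # cluster the band for diversity
    picked = [band[int(np.argmin(np.linalg.norm(z[band] - c, axis=1)))]
              for c in km.cluster_centers_]                       # one proposal per cluster
    return np.array(picked)

def train_refiner(labeled_features, labels):
    """Roughly 30 positive/negative crops fit a small RBF SVM that filters proposals."""
    return SVC(kernel="rbf", probability=True).fit(labeled_features, labels)

# At inference, keep only proposals the refiner accepts; the base detector stays frozen.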

Datasets, base models, and setup

Evaluation uses two standard remote sensing detection benchmarks. DOTA has oriented boxes over 15 categories in high resolution aerial images. DIOR has 23,463 images and 192,472 instances over 20 categories. The comparison includes a zero shot OWL ViT v2 baseline, a zero shot RS OWL ViT v2 that is fine tuned on RS WebLI, and several few shot baselines. RS OWL ViT v2 improves zero shot mean AP to 31.827 percent on DOTA and 29.387 percent on DIOR, which becomes the starting point for FLAME.

Understanding the Results

On 30 shot adaptation, FLAME cascaded on RS OWL ViT v2 reaches 53.96 percent AP on DOTA and 53.21 percent AP on DIOR, which is the top accuracy among the listed methods. The comparison includes SIoU, a prototype based method with DINOv2, and a few shot method proposed by the research team. These numbers appear in Table 1. The research team also reports the per class breakdown in Table 2. On DIOR, the chimney class improves from 0.11 in zero shot to 0.94 after FLAME, which illustrates how the refiner removes look alike false positives from the open vocabulary proposals.

Key Takeaways

FLAME is a one step active learning cascade over OWL ViT v2, it retrieves marginal samples using density estimation, enforces diversity with clustering, collects about 30 labels, and trains a lightweight refiner such as an RBF SVM or a small MLP, with no base model fine tuning.

With 30 shots, FLAME on RS OWL ViT v2 reaches 53.96% AP on DOTA and 53.21% AP on DIOR, exceeding prior few shot baselines including SIoU and a prototype method with DINOv2.

On DIOR, the chimney class improves from 0.11 in zero shot to 0.94 after FLAME, which shows strong filtering of look alike false positives.

Adaptation runs in about 1 minute for each label on a standard CPU, which supports near real time, user in the loop specialization.

Zero shot OWL ViT v2 starts at 13.774% AP on DOTA and 14.982% on DIOR, RS OWL ViT v2 raises zero shot AP to 31.827% and 29.387% respectively, and FLAME then delivers the large precision gains on top.

Editorial Comments

FLAME is a one step active learning cascade that layers a tiny refiner on top of OWL ViT v2, selecting marginal detections, collecting about 30 labels, and training a small classifier without touching the base model. On DOTA and DIOR, FLAME with RS OWL ViT v2 reports 53.96 percent AP and 53.21 percent AP, establishing a strong few shot baseline. On DIOR chimney, average precision rises from 0.11 to 0.94 after refinement, illustrating false positive suppression. Adaptation runs in about 1 minute per label on a CPU, enabling interactive specialization. OWLv2 and RS WebLI provide the foundation for zero shot proposals. Overall, FLAME demonstrates a practical path to open vocabulary detection specialization in remote sensing by pairing RS OWL ViT v2 proposals with a minute scale CPU refiner that lifts DOTA to 53.96 percent AP and DIOR to 53.21 percent AP.

Check out the Paper here.
The post Google AI Introduces FLAME Approach: A One-Step Active Learning that Selects the Most Informative Samples for Training and Makes a Model Specialization Super Fast appeared first on MarkTechPost.

UltraCUA: A Foundation Computer-Use Agents Model that Bridges the Gap …

Computer-use agents have been limited to primitives. They click, they type, they scroll. Long action chains amplify grounding errors and waste steps. Apple researchers introduce UltraCUA, a foundation model that builds a hybrid action space, letting an agent interleave low level GUI actions with high level programmatic tool calls. The model chooses the cheaper and more reliable move at each step. The approach improves success and reduces steps on OSWorld, and transfers to WindowsAgentArena without Windows specific training.

https://arxiv.org/pdf/2510.17790

What hybrid action changes?

Hybrid action treats tools as first class actions. A tool call encapsulates a multi step operation as a single function with a clear signature and a docstring. A click or a key press still exists when no programmatic path is available. The agent learns to alternate between both modes. The goal is to reduce cascade errors and to cut step counts. The research team positions this as a bridge between GUI only CUAs and tool centric agent frameworks.
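As an illustration of what a hybrid action space can look like in code, the sketch below defines two action types and a dispatcher. The types, tool names, and the env interface are hypothetical and are not taken from the paper.

from dataclasses import dataclass, field
from typing import Union

@dataclass
class GuiAction:            # low-level primitive: click, type, scroll, hotkey
    kind: str
    target: str = ""
    text: str = ""

@dataclass
class ToolCall:             # high-level programmatic call with a clear signature and docstring
    name: str               # e.g., "writer_insert_table" (hypothetical tool name)
    kwargs: dict = field(default_factory=dict)

HybridAction = Union[GuiAction, ToolCall]

def execute(action: HybridAction, env) -> str:
    """The agent emits one HybridAction per step; the runtime dispatches it."""
    if isinstance(action, ToolCall):
        return env.call_tool(action.name, **action.kwargs)   # one call replaces a long GUI chain
    return env.gui(action.kind, target=action.target, text=action.text)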

Scaled tool acquisition

UltraCUA builds its tool library with an automated pipeline. The system extracts keyboard shortcuts and commands from software documentation. The system integrates open source implementations from agent toolkits. The system also uses coding agents to synthesize new tools. Each tool is a callable interface that hides a long GUI sequence. The research team reports coverage across 10 desktop domains with 881 tools. The largest buckets include VS Code with 135 tools and LibreOffice Writer with 123 tools. Thunderbird and GIMP also have deep coverage.

Verifiable synthetic tasks and trajectories

Training requires grounded supervision and stable rewards. UltraCUA uses a dual synthetic engine. An evaluator first pipeline composes atomic verifiers for browsers, files, images, and system state, then generates tasks that satisfy those checks. An instruction first pipeline explores the OS and proposes context aligned tasks which are then verified. The result is 17,864 verifiable tasks across 10 domains such as Chrome, LibreOffice, GIMP, VS Code, system, Thunderbird, VLC, and multi app workflows. Chrome has 2,826 tasks. The LibreOffice suite sums to 5,885 tasks. Multi app tasks reach 2,113.

A multi agent rollout produces successful hybrid trajectories. The planner uses OpenAI o3 for decision making. The grounder uses GTA1-7B for accurate visual localization. The rollout yields about 26.8K successful trajectories that show when to use a tool and when to act in the GUI. These trajectories are the core of the supervised phase.

Training Approach

Training has two stages. Stage 1 is supervised fine tuning. The models train for 3 epochs at a learning rate of 2e-5 on the successful trajectories. Loss is applied turn wise to avoid over weighting early steps. Stage 2 is online reinforcement learning. The models train for 150 steps at a learning rate of 1e-6 on verified tasks that are sampled by difficulty. The policy optimization follows a GRPO variant with clip higher, and removes KL regularization and format rewards. The reward combines sparse task outcome with a tool use term. Experiments use NVIDIA H100 GPUs. The context is kept near 32K by controlling the number of exposed tools.
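The paper states that the reward combines a sparse task outcome with a tool use term but the exact shaping is not reproduced here, so the following is a loosely illustrative sketch with assumed weights, not the authors' reward function.

def trajectory_reward(task_succeeded: bool, tool_calls: int, total_steps: int,
                      tool_bonus: float = 0.1) -> float:
    # Sparse, verifier-based outcome signal.
    outcome = 1.0 if task_succeeded else 0.0
    # Assumed tool-use term: lightly reward programmatic shortcuts over long GUI chains.
    tool_term = tool_bonus * (tool_calls / max(total_steps, 1))
    return outcome + tool_term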

Results on OSWorld

UltraCUA improves success at both 7B and 32B scales. Under 15 step budgets, UltraCUA-32B reaches 41.0 percent success. OpenCUA-32B reaches 29.7 percent. The absolute gain is 11.3 points. UltraCUA-7B reaches 28.9 percent. UI-TARS-1.5-7B reaches 23.4 percent. Gains persist under 50 step budgets. A per domain breakdown shows consistent lifts across Chrome, Writer, VS Code, and cross application tasks. Average steps decrease against baselines. These shifts indicate better action selection rather than only more attempts.

Cross platform transfer on WindowsAgentArena

UltraCUA trains only on Ubuntu based OSWorld data. The model is then evaluated on WindowsAgentArena. UltraCUA-7B reaches 21.7 percent success. This exceeds UI-TARS-1.5-7B at 18.1 percent and a Qwen2 baseline trained with Windows data at 13.5 percent. The result suggests that hybrid action strategies learned on one platform transfer to other platforms. The paper highlights this as zero shot platform generalization.

Key Takeaways

UltraCUA formalizes a hybrid action space that lets a single agent alternate between GUI primitives and programmatic tool calls, which reduces long error prone action chains.

The research team scales a reusable tool library through an automated pipeline and pairs it with a synthetic data engine, yielding 17,000 plus verifiable computer use tasks for training and evaluation.

Training follows a two stage recipe, supervised fine tuning on successful hybrid trajectories then online reinforcement learning on verifiable tasks, which optimizes when to call tools versus act in the GUI.

On OSWorld, UltraCUA reports an average 22 percent relative improvement over base models and 11 percent fewer steps, which indicates gains in reliability and efficiency.

The 7B model reaches 21.7 percent success on WindowsAgentArena without Windows specific training, which shows cross platform generalization of the hybrid action policy.

Editorial Comments

UltraCUA moves computer use agents from brittle primitive action chains to a hybrid action policy, integrating GUI primitives with programmatic tool calls, which reduces error propagation and step counts. It scales tools via an automated pipeline and pairs them with a synthetic data engine that yields 17,000 plus verifiable tasks, enabling supervised fine tuning and online reinforcement learning on grounded signals. Reported results include 22 percent relative improvement on OSWorld with 11 percent fewer steps, and 21.7 percent success on WindowsAgentArena without Windows specific training, which indicates cross platform transfer of the policy.

Check out the Paper here.
The post UltraCUA: A Foundation Computer-Use Agents Model that Bridges the Gap between General-Purpose GUI Agents and Specialized API-based Agents appeared first on MarkTechPost.

A Coding Guide to Build a Fully Functional Multi-Agent Marketplace Usi …

In this tutorial, we explore how to build a small yet functional multi-agent system using the uAgents framework. We set up three agents — Directory, Seller, and Buyer — that communicate via well-defined message protocols to simulate a real-world marketplace interaction. We design message schemas, define agent behaviors, and implement request-response cycles to demonstrate discovery, negotiation, and transaction among agents, all running asynchronously in a shared event loop. Through this, we understand how autonomous agents collaborate, trade, and efficiently maintain decentralized workflows. Check out the Full Codes here.

!pip -q install "uagents>=0.11.2"

import asyncio, random
from typing import List, Dict, Optional
from uagents import Agent, Context, Bureau, Model, Protocol

class ServiceAnnounce(Model):
    category: str
    endpoint: str

class ServiceQuery(Model):
    category: str

class ServiceList(Model):
    addresses: List[str]

class OfferRequest(Model):
    item: str
    max_price: int

class Offer(Model):
    item: str
    price: int
    qty: int

class Order(Model):
    item: str
    qty: int

class Receipt(Model):
    item: str
    qty: int
    total: int
    ok: bool
    note: Optional[str] = None

We begin by installing the uAgents library and defining all the message models that underpin our communication system. We create structured data types for announcements, queries, offers, and orders, enabling agents to exchange information seamlessly. Check out the Full Codes here.

registry_proto = Protocol(name="registry", version="1.0")
trade_proto = Protocol(name="trade", version="1.0")

directory = Agent(name="directory", seed="dir-seed-001")
seller = Agent(name="seller", seed="seller-seed-001")
buyer = Agent(name="buyer", seed="buyer-seed-001")

directory.include(registry_proto)
seller.include(trade_proto)
buyer.include(registry_proto)
buyer.include(trade_proto)

@registry_proto.on_message(model=ServiceAnnounce)
async def on_announce(ctx: Context, sender: str, msg: ServiceAnnounce):
    reg = await ctx.storage.get("reg") or {}
    reg.setdefault(msg.category, set()).add(sender)
    await ctx.storage.set("reg", reg)
    ctx.logger.info(f"Registered {sender} under '{msg.category}'")

@registry_proto.on_message(model=ServiceQuery)
async def on_query(ctx: Context, sender: str, msg: ServiceQuery):
    reg = await ctx.storage.get("reg") or {}
    addrs = sorted(list(reg.get(msg.category, set())))
    await ctx.send(sender, ServiceList(addresses=addrs))
    ctx.logger.info(f"Returned {len(addrs)} providers for '{msg.category}'")

We set up the Directory, Seller, and Buyer agents and define the registry protocol that manages service discovery. We make the directory respond to announcements and queries, allowing agents to register and locate each other dynamically. Check out the Full Codes here.

CATALOG: Dict[str, Dict[str, int]] = {
    "camera": {"price": 120, "qty": 3},
    "laptop": {"price": 650, "qty": 2},
    "headphones": {"price": 60, "qty": 5},
}

@seller.on_event("startup")
async def seller_start(ctx: Context):
    await ctx.send(directory.address, ServiceAnnounce(category="electronics", endpoint=seller.address))
    ctx.logger.info("Seller announced to directory")

@trade_proto.on_message(model=OfferRequest)
async def on_offer_request(ctx: Context, sender: str, req: OfferRequest):
    item = CATALOG.get(req.item)
    if not item:
        await ctx.send(sender, Offer(item=req.item, price=0, qty=0))
        return
    price = max(1, int(item["price"] * (0.9 + 0.2 * random.random())))
    if price > req.max_price or item["qty"] <= 0:
        await ctx.send(sender, Offer(item=req.item, price=0, qty=0))
        return
    await ctx.send(sender, Offer(item=req.item, price=price, qty=item["qty"]))
    ctx.logger.info(f"Offered {req.item} at {price} with qty {item['qty']}")

@trade_proto.on_message(model=Order)
async def on_order(ctx: Context, sender: str, order: Order):
    item = CATALOG.get(order.item)
    if not item or item["qty"] < order.qty:
        await ctx.send(sender, Receipt(item=order.item, qty=0, total=0, ok=False, note="Not enough stock"))
        return
    total = item["price"] * order.qty
    item["qty"] -= order.qty
    await ctx.send(sender, Receipt(item=order.item, qty=order.qty, total=total, ok=True, note="Thanks!"))

We create the Seller agent’s catalog and implement logic for responding to offer requests and processing orders. We simulate real-world trading by adding variable pricing and stock management, showing how the seller negotiates and completes transactions. Check out the Full Codes here.

@buyer.on_event("startup")
async def buyer_start(ctx: Context):
    ctx.logger.info("Buyer querying directory for electronics...")
    resp = await ctx.ask(directory.address, ServiceQuery(category="electronics"), expects=ServiceList, timeout=5.0)
    sellers = resp.addresses if resp else []
    if not sellers:
        return
    target = sellers[0]
    desired = "laptop"
    budget = 700
    ctx.logger.info(f"Requesting offer for '{desired}' within budget {budget} from {target}")
    offer = await ctx.ask(target, OfferRequest(item=desired, max_price=budget), expects=Offer, timeout=5.0)
    if not offer or offer.price <= 0:
        return
    qty = 1 if offer.qty >= 1 else 0
    if qty == 0:
        return
    ctx.logger.info(f"Placing order for {qty} x {offer.item} at {offer.price}")
    receipt = await ctx.ask(target, Order(item=offer.item, qty=qty), expects=Receipt, timeout=5.0)
    if receipt and receipt.ok:
        ctx.logger.info(f"ORDER SUCCESS: {receipt.qty} x {receipt.item} | total={receipt.total}")

We program the Buyer agent to discover sellers, request offers, and place orders based on availability and budget. We observe how the buyer interacts with the seller through asynchronous communication to complete a purchase successfully. Check out the Full Codes here.

@buyer.on_interval(period=6.0)
async def periodic_discovery(ctx: Context):
    seen = await ctx.storage.get("seen") or 0
    if seen >= 1:
        return
    await ctx.storage.set("seen", seen + 1)
    ctx.logger.info("Periodic discovery tick -> re-query directory")
    resp = await ctx.ask(directory.address, ServiceQuery(category="electronics"), expects=ServiceList, timeout=3.0)
    n = len(resp.addresses) if resp else 0
    ctx.logger.info(f"Periodic: directory reports {n} seller(s)")

bureau = Bureau()
bureau.add(directory)
bureau.add(seller)
bureau.add(buyer)

async def run_demo(seconds=10):
    task = asyncio.create_task(bureau.run_async())
    try:
        await asyncio.sleep(seconds)
    finally:
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
    print("\nDemo run complete.\n")

try:
    loop = asyncio.get_running_loop()
    await run_demo(10)
except RuntimeError:
    asyncio.run(run_demo(10))

We add periodic discovery to have the buyer recheck available sellers, then have the Bureau run all agents together. We launch the asynchronous runtime to see the full marketplace simulation unfold and complete smoothly.

In conclusion, we have seen our agents discover one another, negotiate an offer, and complete a transaction entirely through message-based interactions. We realize how uAgents simplifies multi-agent orchestration by combining structure, communication, and state management seamlessly within Python. As we run this example, we not only witness a dynamic, autonomous system in action but also gain insight into how the same architecture can be extended to complex decentralized marketplaces, AI collaborations, and intelligent service networks, all within a lightweight, easy-to-use framework.

Check out the Full Codes here.
The post A Coding Guide to Build a Fully Functional Multi-Agent Marketplace Using uAgent appeared first on MarkTechPost.

Generate Gremlin queries using Amazon Bedrock models

Graph databases have revolutionized how organizations manage complex, interconnected data. However, specialized query languages such as Gremlin often create a barrier for teams looking to extract insights efficiently. Unlike traditional relational databases with well-defined schemas, graph databases lack a centralized schema, requiring deep technical expertise for effective querying.
To address this challenge, we explore an approach that converts natural language to Gremlin queries, using Amazon Bedrock models such as Amazon Nova Pro. This approach helps business analysts, data scientists, and other non-technical users access and interact with graph databases seamlessly.
In this post, we outline our methodology for generating Gremlin queries from natural language, comparing different techniques and demonstrating how to evaluate the effectiveness of these generated queries using large language models (LLMs) as judges.
Solution overview
Transforming natural language queries into Gremlin queries requires a deep understanding of graph structures and the domain-specific knowledge encapsulated within the graph database. To achieve this, we divided our approach into three key steps:

Understanding and extracting graph knowledge
Structuring the graph similar to text-to-SQL processing
Generating and executing Gremlin queries

The following diagram illustrates this workflow.

Step 1: Extract graph knowledge
A successful query generation framework must integrate both graph knowledge and domain knowledge to accurately translate natural language queries. Graph knowledge encompasses structural and semantic information extracted directly from the graph database. Specifically, it includes:

Vertex labels and properties – A listing of vertex types, names, and their associated attributes
Edge labels and properties – Information about edge types and their attributes
One-hop neighbors for each vertex – Capturing local connectivity information, such as direct relationships between vertices

With this graph-specific knowledge, the framework can effectively reason about the heterogeneous properties and complex connections inherent to graph databases.
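As a rough illustration of how this graph knowledge can be pulled from the database, the following sketch submits plain Gremlin scripts through the gremlinpython client. The endpoint, serializer choice, and sampling limits are assumptions to adapt for your own Amazon Neptune or Gremlin Server deployment.

from gremlin_python.driver import client, serializer

# Hypothetical endpoint; replace with your own cluster address.
c = client.Client("wss://your-neptune-endpoint:8182/gremlin", "g",
                  message_serializer=serializer.GraphSONSerializersV2d0())

def submit(query: str):
    return c.submit(query).all().result()

vertex_labels = submit("g.V().label().dedup()")
edge_labels = submit("g.E().label().dedup()")
# Property keys per vertex label, sampled for speed on large graphs.
vertex_props = {lbl: submit(f"g.V().hasLabel('{lbl}').limit(100).properties().key().dedup()")
                for lbl in vertex_labels}
# One-hop connectivity: which vertex labels connect to which, and via which edge label.
one_hop = submit("g.E().limit(10000).project('from','edge','to')"
                 ".by(outV().label()).by(label()).by(inV().label()).dedup()")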
Domain knowledge captures additional context that augments the graph knowledge and is tailored specifically to the application domain. It is sourced in two ways:

Customer-provided domain knowledge – For example, the customer kscope.ai helped specify which vertices represent metadata and should never be queried. Such constraints are encoded to guide the query generation process.
LLM-generated descriptions – To enhance the system’s understanding of vertex labels and their relevance to specific questions, we use an LLM to generate detailed semantic descriptions of vertex names, properties, and edges. These descriptions are stored within the domain knowledge repository and provide additional context to improve the relevance of the generated queries.

Step 2: Structure the graph as a text-to-SQL schema
To improve the model’s comprehension of graph structures, we adopt an approach similar to text-to-SQL processing, where we construct a schema representing vertex types, edges, and properties. This structured representation enhances the model’s ability to interpret and generate meaningful queries.
The question processing component transforms natural language input into structured elements for query generation. It operates in three stages:

Entity recognition and classification – Identifies key database elements in the input question (such as vertices, edges, and properties) and categorizes the question based on its intent
Context enhancement – Enriches the question with relevant information from the knowledge component, so both graph-specific and domain-specific context is properly captured
Query planning – Maps the enhanced question to specific database elements needed for query execution

The context generation component makes sure the generated queries accurately reflect the underlying graph structure by assembling the following:

Element properties – Retrieves attributes of vertices and edges along with their data types
Graph structure – Facilitates alignment with the database’s topology
Domain rules – Applies business constraints and logic

Step 3: Generate and execute Gremlin queries
The final step is query generation, where the LLM constructs a Gremlin query based on the extracted context. The process follows these steps:

The LLM generates an initial Gremlin query.
The query is executed within a Gremlin engine.
If the execution is successful, results are returned.
If execution fails, an error message parsing mechanism analyzes the returned errors and refines the query using LLM-based feedback.

This iterative refinement makes sure the generated queries align with the database’s structure and constraints, improving overall accuracy and usability.
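A minimal sketch of this generate-execute-refine loop is shown below. The helpers build_prompt, generate_query, and run_gremlin are placeholders for the prompt assembly, the LLM call (for example, Amazon Nova Pro through Amazon Bedrock), and the Gremlin engine call; the retry budget is an assumption.

def text_to_gremlin(question: str, context: str, max_attempts: int = 3):
    prompt = build_prompt(question, context)                 # assembles the template shown below
    error_feedback = ""
    query = ""
    for _ in range(max_attempts):
        query = generate_query(prompt + error_feedback)      # LLM generates a candidate query
        try:
            return query, run_gremlin(query)                 # execute against the Gremlin engine
        except Exception as err:                             # parse the error and ask for a fix
            error_feedback = f"\nThe previous query failed with: {err}\nPlease fix it."
    return query, None                                       # give up after max_attempts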
Prompt template
Our final prompt template is as follows:

## Request
Please write a gremlin query to answer the given question:
{{question}}
You will be provided with a couple of relevant vertices, together with their
schema and other information.
Please choose the most relevant vertex according to its schema and other
information to make the gremlin query correct.

## Instructions
1. Here are related vertices and their details:
{{schema}}
2. Don’t rename properties.
3. Don’t change lines (using \n) in the generated query.

## IMPORTANT
Return the results in the following XML format:

<Results>
<Query>INSERT YOUR QUERY HERE</Query>
<Explanation>
PROVIDE YOUR EXPLANATION ON HOW THIS QUERY WAS GENERATED
AND HOW THE PROVIDED SCHEMA WAS LEVERAGED
</Explanation>
</Results>

Comparing LLM-generated queries to ground truth
We implemented an LLM-based evaluation system using Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock as a judge to assess both query generation and execution results for Amazon Nova Pro and a benchmark model. The system operates in two key areas:

Query evaluation – Assesses correctness, efficiency, and similarity to ground-truth queries; calculates exact matching component percentages; and provides an overall rating based on predefined rules developed with domain experts
Execution evaluation – Initially used a single-stage approach to compare generated results with ground truth, then enhanced to a two-stage evaluation process:

Item-by-item verification against ground truth
Calculation of overall match percentage

Testing across 120 questions demonstrated the framework’s ability to effectively distinguish correct from incorrect queries. The two-stage approach particularly improved the reliability of execution result evaluation by conducting thorough comparison before scoring.
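The sketch below shows one way such an LLM judge can be invoked through the Amazon Bedrock Converse API with boto3. The judging prompt is a simplified stand-in for the rule-based prompt developed with domain experts, and the model ID is an example value to adjust for your account and Region.

import boto3

bedrock = boto3.client("bedrock-runtime")

JUDGE_PROMPT = """Compare the generated Gremlin query with the ground truth query.
Ground truth: {truth}
Generated: {generated}
Rate correctness, efficiency, and similarity, then give an overall score from 1 to 10.
Return only the score and a one-sentence justification."""

def judge_query(truth: str, generated: str,
                model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0") -> str:
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user",
                   "content": [{"text": JUDGE_PROMPT.format(truth=truth, generated=generated)}]}],
        inferenceConfig={"temperature": 0},
    )
    return resp["output"]["message"]["content"][0]["text"]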
Experiments and results
In this section, we discuss the experiments we conducted and their results.
Query similarity
In the query evaluation case, we propose two metrics: query exact match and query overall rating. An exact match score is calculated by identifying matching vs. non-matching components between generated and ground truth queries. The following table summarizes the scores for query exact match.

Model | Easy | Medium | Hard | Overall
Amazon Nova Pro | 82.70% | 61% | 46.60% | 70.36%
Benchmark Model | 92.60% | 68.70% | 56.20% | 78.93%

An overall rating is provided after considering factors including query correctness, efficiency, and completeness, as instructed in the prompt. The overall rating is on a scale of 1–10. The following table summarizes the scores for query overall rating.

Model | Easy | Medium | Hard | Overall
Amazon Nova Pro | 8.7 | 7 | 5.3 | 7.6
Benchmark Model | 9.7 | 8 | 6.1 | 8.5

One limitation in the current query evaluation setup is that we rely solely on the LLM’s ability to compare ground truth against LLM-generated queries and arrive at the final scores. As a result, the LLM can fail to align with human preferences and under- or over-penalize the generated query. To address this, we recommend working with a subject matter expert to include domain-specific rules in the evaluation prompt.
Execution accuracy
To calculate accuracy, we compare the results of the LLM-generated Gremlin queries against the results of ground truth queries. If the results from both queries match exactly, we count the instance as correct; otherwise, it is considered incorrect. Accuracy is then computed as the ratio of correct query executions to the total number of queries tested. This metric provides a straightforward evaluation of how well the model-generated queries retrieve the expected information from the graph database, facilitating alignment with the intended query logic.
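In code, this metric is a simple ratio over paired execution results. The sketch below assumes an order-insensitive comparison and a crude canonical form for result rows; adapt the normalization to your own result types (the real pipeline also handles nested results, as discussed in the conclusion).

def execution_accuracy(generated_results, ground_truth_results):
    def normalize(rows):
        return sorted(repr(r) for r in rows)      # crude canonical form for comparison
    correct = sum(normalize(g) == normalize(t)
                  for g, t in zip(generated_results, ground_truth_results))
    return correct / len(ground_truth_results)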
The following table summarizes the scores for execution results count match.

Model | Easy | Medium | Hard | Overall
Amazon Nova Pro | 80% | 50% | 10% | 60.42%
Benchmark Model | 90% | 70% | 30% | 74.83%

Query execution latency
In addition to accuracy, we evaluate the efficiency of generated queries by measuring their runtime and comparing it with the ground truth queries. For each query, we record the runtime in milliseconds and analyze the difference between the generated query and the corresponding ground truth query. A lower runtime indicates a more optimized query, whereas significant deviations might suggest inefficiencies in query structure or execution planning. By considering both accuracy and runtime, we gain a more comprehensive assessment of query quality, making sure the generated queries are correct and performant within the graph database.
The following box plot showcases query execution latency for the ground truth queries and the queries generated by Amazon Nova Pro and the benchmark model. As illustrated, all three types of queries exhibit comparable runtimes, with similar median latencies and overlapping interquartile ranges. Although the ground truth queries display a slightly wider range and a higher outlier, the median values across all three groups remain close. This suggests that the model-generated queries are on par with human-written ones in terms of execution efficiency, supporting the claim that AI-generated queries are of similar quality and don’t incur additional latency overhead.

Query generation latency and cost
Finally, we compare the time taken to generate each query and calculate the cost based on token consumption. More specifically, we measure the query generation time and track the number of tokens used, because most LLM-based APIs charge based on token usage. By analyzing both the generation speed and token cost, we can determine whether the model is efficient and cost-effective. These results provide insights in selecting the optimal model that balances query accuracy, execution efficiency, and economic feasibility.
As shown in the following plots, Amazon Nova Pro consistently outperforms the benchmark model in both generation latency and cost. In the left plot, which depicts query generation latency, Amazon Nova Pro demonstrates a significantly lower median generation time, with most values clustered between 1.8–4 seconds, compared to the benchmark model’s broader range from around 5–11 seconds. The right plot, illustrating query generation cost, shows that Amazon Nova Pro maintains a much smaller cost per query—centered well below $0.005—whereas the benchmark model incurs higher and more variable costs, reaching up to $0.025 in some cases. These results highlight Amazon Nova Pro’s advantage in terms of both speed and affordability, making it a strong candidate for deployment in time-sensitive or large-scale systems.

Conclusion
We experimented with all 120 ground truth queries provided to us by kscope.ai and achieved an overall accuracy of 74.17% in generating correct results. The proposed framework demonstrates its potential by effectively addressing the unique challenges of graph query generation, including handling heterogeneous vertex and edge properties, reasoning over complex graph structures, and incorporating domain knowledge. Key components of the framework, such as the integration of graph and domain knowledge, the use of Retrieval Augmented Generation (RAG) for query plan creation, and the iterative error-handling mechanism for query refinement, have been instrumental in achieving this performance.
In addition to improving accuracy, we are actively working on several enhancements. These include refining the evaluation methodology to handle deeply nested query results more effectively and further optimizing the use of LLMs for query generation. Moreover, we are using the RAGAS-faithfulness metric to improve the automated evaluation of query results, resulting in greater reliability and consistency in assessing the framework’s outputs.

About the authors
Mengdie (Flora) Wang is a Data Scientist at AWS Generative AI Innovation Center, where she works with customers to architect and implement scalable Generative AI solutions that address their unique business challenges. She specializes in model customization techniques and agent-based AI systems, helping organizations harness the full potential of generative AI technology. Prior to AWS, Flora earned her Master’s degree in Computer Science from the University of Minnesota, where she developed her expertise in machine learning and artificial intelligence.
Jason Zhang has expertise in machine learning, reinforcement learning, and generative AI. He earned his Ph.D. in Mechanical Engineering in 2014, where his research focused on applying reinforcement learning to real-time optimal control problems. He began his career at Tesla, applying machine learning to vehicle diagnostics, then advanced NLP research at Apple and Amazon Alexa. At AWS, he worked as a Senior Data Scientist on generative AI solutions for customers.
Rachel Hanspal is a Deep Learning Architect at AWS Generative AI Innovation Center, specializing in end-to-end GenAI solutions with a focus on frontend architecture and LLM integration. She excels in translating complex business requirements into innovative applications, leveraging expertise in natural language processing, automated visualization, and secure cloud architectures.
Zubair Nabi is the CTO and Co-Founder of Kscope, an Integrated Security Posture Management (ISPM) platform. His expertise lies at the intersection of Big Data, Machine Learning, and Distributed Systems, with over a decade of experience building software, data, and AI platforms. Zubair is also an adjunct faculty member at George Washington University and the author of Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark. He holds an MPhil from the University of Cambridge.
Suparna Pal is the CEO and Co-Founder of kscope.ai, with more than 20 years of experience building innovative platforms and solutions for industrial, healthcare, and IT operations at PTC, GE, and Cisco.
Wan Chen is an Applied Science Manager at AWS Generative AI Innovation Center. As an ML/AI veteran in the tech industry, she has a wide range of expertise spanning traditional machine learning, recommender systems, deep learning, and generative AI. She is a strong believer in superintelligence and is passionate about pushing the boundaries of AI research and application to enhance human life and drive business growth. She holds a Ph.D. in Applied Mathematics from the University of British Columbia and worked as a postdoctoral fellow at Oxford University.
Mu Li is a Principal Solutions Architect with AWS Energy. He’s also the Worldwide Tech Leader for the AWS Energy & Utilities Technical Field Community (TFC), a community of 300+ industry and technical experts. Li is passionate about working with customers to achieve business outcomes using technology. Li has worked with customers to migrate all-in to AWS from on-prem and Azure, launch the Production Monitoring and Surveillance industry solution, deploy ION/OpenLink Endur on AWS, and implement AWS-based IoT and machine learning workloads. Outside of work, Li enjoys spending time with his family, investing, following Houston sports teams, and catching up on business and technology.

Incorporating responsible AI into generative AI project prioritization

Over the past two years, companies have seen an increasing need to develop a project prioritization methodology for generative AI. There is no shortage of generative AI use cases to consider. Rather, companies want to evaluate the business value against the cost, level of effort, and other concerns for a large number of potential generative AI projects. Generative AI also introduces concerns that other domains do not: hallucination, agents making incorrect decisions and then acting on them through tool calls to downstream systems, and a rapidly changing regulatory landscape. In this post, we describe how to incorporate responsible AI practices into a prioritization method to systematically address these types of concerns.
Responsible AI overview
The AWS Well-Architected Framework defines responsible AI as “the practice of designing, developing, and using AI technology with the goal of maximizing benefits and minimizing risks.” The AWS responsible AI framework begins by defining eight dimensions of responsible AI: fairness, explainability, privacy and security, safety, controllability, veracity and robustness, governance, and transparency. At key points in the development lifecycle, a generative AI team should consider the possible harms or risks for each dimension (inherent and residual risks), implement risk mitigations, and monitor risk on an ongoing basis. Responsible AI applies across the entire development lifecycle and should be considered during initial project prioritization. That’s especially true for generative AI projects, where there are novel types of risks to consider, and mitigations might not be as well understood or researched. Considering responsible AI up front gives a more accurate picture of project risk and mitigation level of effort and reduces the chance of costly rework if risks are uncovered later in the development lifecycle. In addition to potentially delayed projects due to rework, unmitigated concerns might also harm customer trust, result in representational harm, or fail to meet regulatory requirements.
Generative AI prioritization
While most companies have their own prioritization methods, here we’ll demonstrate how to use the weighted shortest job first (WSJF) method from the Scaled Agile system. WSJF assigns a priority using this formula:
Priority = (cost of delay) / (job size)
The cost of delay is a measure of business value. It includes the direct value (for example, additional revenue or cost savings), the timeliness (such as, is shipping this project worth a lot more today than a year from now), and the adjacent opportunities (such as, would delivering this project open up other opportunities down the road).
The job size is where you consider the level of effort to deliver the project. That normally includes direct development costs and paying for any infrastructure or software you need. The job size is where you can include the results of the initial responsible AI risk assessment and expected mitigations. For example, if the initial assessment uncovers three risks that require mitigation, you include the development cost for those mitigations in the job size. You can also qualitatively assess that a project with ten high-priority risks is more complex than a project with only two high-priority risks.
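The formula is simple enough to capture in a few lines of Python. The sketch below uses the 1–5 scoring scheme described in this post; the example values match the two projects scored later on.

def wsjf(direct_value: int, timeliness: int, adjacent_opportunity: int, job_size: int) -> float:
    # Cost of delay = direct value + timeliness + adjacent opportunities.
    cost_of_delay = direct_value + timeliness + adjacent_opportunity
    return cost_of_delay / job_size

print(wsjf(3, 2, 2, 2))   # Project 1 (automated product descriptions) -> 3.5
print(wsjf(3, 4, 3, 2))   # Project 2 (visual brand assets) -> 5.0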
Example scenario
Now, let’s walk through a prioritization exercise that compares two generative AI projects. The first project uses a large language model (LLM) to generate product descriptions. A marketing team will use this application to automatically create product descriptions that go into the online product catalog website. The second project uses a text-to-image model to generate new visuals for advertising campaigns and the product catalog. The marketing team will use this application to more quickly create customized brand assets.
First pass prioritization
First, we’ll go through the prioritization method without considering responsible AI, assigning a score of 1–5 for each part of the WSJF formula. The specific scores vary by organization. Some companies prefer to use t-shirt sizing (S, M, L, and XL), others prefer a score of 1–5, and others will use a more granular score. A score of 1–5 is a common and straightforward way to start. For example, the direct value scores can be calculated as:
1 = no direct value
2 = 20% improvement in KPI (time to create high-quality descriptions)
3 = 40% improvement in KPI
4 = 80% improvement in KPI
5 = 100% or more improvement in KPI

Project 1: Automated product descriptions. Project 2: Creating visual brand assets. (Each scored from 1–5.)

Direct value
Project 1 (3): Helps the marketing team create higher-quality descriptions more quickly.
Project 2 (3): Helps the marketing team create higher-quality assets more quickly.

Timeliness
Project 1 (2): Not particularly urgent.
Project 2 (4): A new ad campaign is planned this quarter; without this project, the team cannot create enough brand assets without hiring an agency to supplement it.

Adjacent opportunities
Project 1 (2): Might be able to reuse for similar scenarios.
Project 2 (3): Experience gained in image generation will build competence for future projects.

Job size
Project 1 (2): Basic, well-known pattern.
Project 2 (2): Basic, well-known pattern.

Score
Project 1: (3+2+2)/2 = 3.5
Project 2: (3+4+3)/2 = 5

At first glance, it looks like Project 2 is more compelling. Intuitively that makes sense—it takes people a lot longer to make high-quality visuals than to create textual product descriptions.
Risk assessment
Now let’s go through a risk assessment for each project. The following summary gives a brief overview of the outcome of a risk assessment along each of the AWS responsible AI dimensions, along with a t-shirt size (S, M, L, or XL) severity level and suggested mitigations.

Project 1: Automated product descriptions. Project 2: Creating visual brand assets.

Fairness
Project 1 (L): Are descriptions appropriate in terms of gender and demographics? Mitigate using guardrails.
Project 2 (L): Images must not portray particular demographics in a biased way. Mitigate using human and automated checks.

Explainability
Project 1: No risks identified.
Project 2: No risks identified.

Privacy and security
Project 1 (L): Some product information is proprietary and cannot be listed on a public site. Mitigate using data governance.
Project 2 (L): The model must not be trained on any images that contain proprietary information. Mitigate using data governance.

Safety
Project 1 (M): Language must be age-appropriate and not cover offensive topics. Mitigate using guardrails.
Project 2 (L): Images must not contain adult content or images of drugs, alcohol, or weapons. Mitigate using guardrails.

Controllability
Project 1 (S): Need to track customer feedback on the descriptions. Mitigate using customer feedback collection.
Project 2 (L): Do images align to our brand guidelines? Mitigate using human and automated checks.

Veracity and robustness
Project 1 (M): Will the system hallucinate and imply product capabilities that aren’t real? Mitigate using guardrails.
Project 2 (L): Are images realistic enough to avoid uncanny valley effects? Mitigate using human and automated checks.

Governance
Project 1 (M): Prefer LLM providers that offer copyright indemnification. Mitigate using LLM provider selection.
Project 2 (L): Require copyright indemnification and image source attribution. Mitigate using model provider selection.

Transparency
Project 1 (S): Disclose that descriptions are AI generated.
Project 2 (S): Disclose that images are AI generated.

The risks and mitigations are use-case specific. The preceding assessment is for illustrative purposes only.
Second pass prioritization
How does the risk assessment affect the prioritization?

Project 1: Automated product descriptions. Project 2: Creating visual brand assets. (Scored from 1–5.)

Job size
Project 1 (3): Basic, well-known pattern; requires fairly standard guardrails, governance, and feedback collection.
Project 2 (5): Basic, well-known pattern, but requires advanced image guardrails with human oversight and a more expensive commercial model. A research spike is needed.

Score
Project 1: (3+2+2)/3 = 2.3
Project 2: (3+4+3)/5 = 2

Now it looks like Project 1 is the better one to start with. Intuitively, after you consider responsible AI, that makes sense. Poorly crafted or offensive images are more noticeable and have a larger impact than a poorly phrased product description. And the guardrails you can use for maintaining image safety are less mature than the equivalent guardrails for text, particularly in ambiguous cases like adhering to brand guidelines. In fact, an image guardrail system might require training a monitoring model or using people to spot-check some percentage of the output. You might need to dedicate a small science team to study this problem first.
Conclusion
In this post, you saw how to include responsible AI considerations in a generative AI project prioritization method. You saw how conducting a responsible AI risk assessment in the initial prioritization phase can change the outcome by uncovering a substantial amount of mitigation work. Moving forward, you should develop your own responsible AI policy and start adopting responsible AI practices for generative AI projects. You can find additional details and resources at Transform responsible AI from theory into practice.

About the author
Randy DeFauw is a Sr. Principal Solutions Architect at AWS. He has over 20 years of experience in technology, starting with his university work on autonomous vehicles. He has worked with and for customers ranging from startups to Fortune 50 companies, launching Big Data and Machine Learning applications. He holds an MSEE and an MBA, serves as a board advisor to K-12 STEM education initiatives, and has spoken at leading conferences including Strata and GlueCon. He is the co-author of the books SageMaker Best Practices and Generative AI Cloud Solutions. Randy currently acts as a technical advisor to AWS’ director of technology in North America.

PokeeResearch-7B: An Open 7B Deep-Research Agent Trained with Reinforc …

Pokee AI has open-sourced PokeeResearch-7B, a 7B-parameter deep research agent that executes full research loops: it decomposes a query, issues search and read calls, verifies candidate answers, then synthesizes multiple research threads into a final response.

The agent runs a research and verification loop. In research, it calls external tools for web search and page reading or proposes an interim answer. In verification, it checks the answer against retrieved evidence, and either accepts or restarts research. This structure reduces brittle trajectories and catches obvious errors before finalization. The research team formalizes this loop and adds a test-time synthesis stage that merges several independent research threads.

Training recipe: RLAIF with RLOO

PokeeResearch-7B is fine-tuned from Qwen2.5-7B-Instruct using annotation-free Reinforcement Learning from AI Feedback (RLAIF) with the REINFORCE Leave-One-Out (RLOO) algorithm. The reward targets semantic correctness, citation faithfulness, and instruction adherence, not token overlap. The model’s Hugging Face card lists a batch size of 64, 8 research threads per prompt during RL, a learning rate of 3e-6, 140 steps, a 32,768-token context, bf16 precision, and a checkpoint near 13 GB. The research team emphasizes that RLOO provides an unbiased on-policy gradient and contrasts it with the PPO family, which is approximately on-policy and biased.
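To see what the leave-one-out baseline does, the short sketch below (not the PokeeResearch training code; the function name and reward values are illustrative) computes RLOO advantages for the 8 research threads sampled per prompt, where each thread's baseline is the mean reward of the other threads:

import numpy as np

def rloo_advantages(rewards):
    # REINFORCE Leave-One-Out: each sample's baseline is the mean reward of the other samples
    rewards = np.asarray(rewards, dtype=np.float64)
    n = rewards.shape[0]
    leave_one_out_mean = (rewards.sum() - rewards) / (n - 1)
    return rewards - leave_one_out_mean

# Example: rewards for 8 research threads on one prompt, as scored by the AI feedback signal
print(rloo_advantages([0.9, 0.1, 0.7, 0.0, 0.3, 0.8, 0.2, 0.5]))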

https://arxiv.org/pdf/2510.15862

Reasoning scaffold and Research Threads Synthesis

The scaffold includes three mechanisms. Self-correction: the agent detects malformed tool calls and retries. Self-verification: the agent inspects its own answer against the retrieved evidence. Research Threads Synthesis: the agent runs several independent threads per question, summarizes them, then synthesizes a final answer. The research team reports that synthesis improves accuracy on difficult benchmarks.
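As a rough illustration of this control flow (a sketch, not the released implementation; llm, web_search, and read_page are placeholder callables), a research-and-verification loop with Research Threads Synthesis might look like this:

def run_thread(question, llm, web_search, read_page, max_turns=100):
    # One research thread: gather evidence until an interim answer passes self-verification
    evidence = []
    for _ in range(max_turns):
        step = llm(f"Question: {question}\nEvidence so far: {evidence}\nNext action?")
        if step.startswith("SEARCH:"):
            evidence.append(web_search(step[len("SEARCH:"):].strip()))
        elif step.startswith("READ:"):
            evidence.append(read_page(step[len("READ:"):].strip()))
        elif step.startswith("ANSWER:"):
            answer = step[len("ANSWER:"):].strip()
            verdict = llm(f"Does this evidence support '{answer}'? {evidence}\nReply yes or no.")
            if verdict.strip().lower().startswith("yes"):
                return answer, evidence
        # malformed or rejected steps simply trigger another research turn (self-correction)
    return None, evidence

def research_threads_synthesis(question, llm, web_search, read_page, n_threads=4):
    # Run independent threads, summarize each, then synthesize a final answer
    threads = [run_thread(question, llm, web_search, read_page) for _ in range(n_threads)]
    summaries = [f"Answer: {a}; evidence items: {len(e)}" for a, e in threads]
    return llm(f"Question: {question}\nThread summaries: {summaries}\nSynthesize the final answer.")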

Evaluation protocol

The research team evaluates text-only questions from 10 benchmarks: NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle, GAIA, BrowseComp, and Humanity’s Last Exam. They sample 125 questions per dataset, except GAIA with 103, for a total of 1,228 questions. For each question, they run 4 research threads and compute mean accuracy (mean@4), using Gemini-2.5-Flash-lite to judge correctness. The maximum number of interaction turns is set to 100.
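The mean@4 metric itself is just the average of per-thread correctness judgments; a minimal sketch (with run_thread and judge as placeholders, the latter standing in for the Gemini-2.5-Flash-lite judge) could look like this:

def mean_at_k(questions, run_thread, judge, k=4):
    # mean@k: average correctness over k independent research threads per question,
    # then averaged over all questions
    per_question = []
    for q in questions:
        answers = [run_thread(q) for _ in range(k)]
        per_question.append(sum(judge(q, a) for a in answers) / k)  # judge is assumed to return 0 or 1
    return sum(per_question) / len(per_question)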

https://github.com/Pokee-AI/PokeeResearchOSS

Results at 7B scale

PokeeResearch-7B reports the best mean@4 accuracy among 7B deep research agents across the 10 datasets. On HLE, the model scores 15.2 without RTS and 17.6 with RTS. On GAIA, it scores 36.9 without RTS and 41.3 with RTS. On BrowseComp, it scores 5.4 without RTS and 8.4 with RTS. On the seven QA benchmarks (Bamboogle, 2WikiMultiHopQA, TriviaQA, NQ, PopQA, Musique, HotpotQA), the model improves over recent 7B baselines. Gains from RTS are largest on HLE, GAIA, and BrowseComp, and smaller on the QA sets.

Key Takeaways

Training: PokeeResearch-7B fine-tunes Qwen2.5-7B-Instruct with RLAIF using the RLOO estimator, optimizing rewards for factual accuracy, citation faithfulness, and instruction adherence, not token overlap.

Scaffold: The agent runs a research and verification loop with Research Threads Synthesis, executing multiple independent threads, then synthesizing evidence to a final answer.

Evaluation protocol: Benchmarks span 10 datasets with 125 questions each, except GAIA with 103, 4 threads per question, mean@4 accuracy judged by Gemini-2.5-Flash-lite, with a 100 turn cap.

Results and release: PokeeResearch-7B reports state-of-the-art results among 7B deep research agents, for example HLE 17.6 with RTS, GAIA 41.3 with RTS, and BrowseComp 8.4 with RTS, and is released under Apache-2.0 with code and weights public.

Editorial Comments

PokeeResearch-7B is a useful step for practical deep research agents. It aligns training with RLAIF using RLOO, so the objective targets semantic correctness, citation faithfulness, and instruction adherence. The reasoning scaffold includes self verification and Research Threads Synthesis, which improves difficult benchmarks. The evaluation uses mean at 4 with Gemini 2.5 Flash lite as the judge, across 10 datasets. The release ships Apache 2.0 code and weights with a clear tool stack using Serper and Jina. The setup runs on a single A100 80 GB and scales.

Check out the Paper, Model on HF and GitHub Repo.
The post PokeeResearch-7B: An Open 7B Deep-Research Agent Trained with Reinforcement Learning from AI Feedback (RLAIF) and a Robust Reasoning Scaffold appeared first on MarkTechPost.

How to Design a Fully Functional Enterprise AI Assistant with Retrieva …

In this tutorial, we explore how we can build a compact yet powerful Enterprise AI assistant that runs effortlessly on Colab. We start by integrating retrieval-augmented generation (RAG) using FAISS for document retrieval and FLAN-T5 for text generation, both fully open-source and free. As we progress, we embed enterprise policies such as data redaction, access control, and PII protection directly into the workflow, ensuring our system is intelligent and compliant. Check out the FULL CODES here.

!pip -q install faiss-cpu transformers==4.44.2 accelerate sentence-transformers==3.0.1

from typing import List, Dict, Tuple
import re, textwrap, numpy as np, torch
from sentence_transformers import SentenceTransformer
import faiss
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

GEN_MODEL = "google/flan-t5-base"
EMB_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

gen_tok = AutoTokenizer.from_pretrained(GEN_MODEL)
gen_model = AutoModelForSeq2SeqLM.from_pretrained(GEN_MODEL, device_map="auto")
generate = pipeline("text2text-generation", model=gen_model, tokenizer=gen_tok)

emb_device = "cuda" if torch.cuda.is_available() else "cpu"
emb_model = SentenceTransformer(EMB_MODEL, device=emb_device)

We begin by setting up our environment and loading the required models. We initialize FLAN-T5 for text generation and MiniLM for embedding representations. We ensure both models are configured to automatically use the GPU when available, so our pipeline runs efficiently. Check out the FULL CODES here.

DOCS = [
    {"id": "policy_sec_001", "title": "Data Security Policy",
     "text": "All customer data must be encrypted at rest (AES-256) and in transit (TLS 1.2+). Access is role-based (RBAC). Secrets are stored in a managed vault. Backups run nightly with 35-day retention. PII includes name, email, phone, address, PAN/Aadhaar."},
    {"id": "policy_ai_002", "title": "Responsible AI Guidelines",
     "text": "Use internal models for confidential data. Retrieval sources must be logged. No customer decisioning without human-in-the-loop. Redact PII in prompts and outputs. All model prompts and outputs are stored for audit for 180 days."},
    {"id": "runbook_inc_003", "title": "Incident Response Runbook",
     "text": "If a suspected breach occurs, page on-call SecOps. Rotate keys, isolate affected services, perform forensic capture, notify DPO within regulatory SLA. Communicate via the incident room only."},
    {"id": "sop_sales_004", "title": "Sales SOP – Enterprise Deals",
     "text": "For RFPs, use the approved security questionnaire responses. Claims must match policy_sec_001. Custom clauses need Legal sign-off. Keep records in CRM with deal room links."}
]

def chunk(text: str, chunk_size=600, overlap=80):
    w = text.split()
    if len(w) <= chunk_size:
        return [text]
    out = []
    i = 0
    while i < len(w):
        j = min(i + chunk_size, len(w))
        out.append(" ".join(w[i:j]))
        if j == len(w):
            break
        i = j - overlap
    return out

CORPUS = []
for d in DOCS:
    for i, c in enumerate(chunk(d["text"])):
        CORPUS.append({"doc_id": d["id"], "title": d["title"], "chunk_id": i, "text": c})

We create a small enterprise-style document set to simulate internal policies and procedures. We then break these long texts into manageable chunks so they can be embedded and retrieved effectively. This chunking helps our AI assistant handle contextual information with better precision. Check out the FULL CODES here.

def build_index(chunks: List[Dict]) -> Tuple[faiss.IndexFlatIP, np.ndarray]:
    vecs = emb_model.encode([c["text"] for c in chunks], normalize_embeddings=True, convert_to_numpy=True)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)
    return index, vecs

INDEX, VECS = build_index(CORPUS)

PII_PATTERNS = [
    (re.compile(r"\b\d{10}\b"), "<REDACTED_PHONE>"),
    (re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.I), "<REDACTED_EMAIL>"),
    (re.compile(r"\b\d{12}\b"), "<REDACTED_ID12>"),
    (re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"), "<REDACTED_PAN>")
]

def redact(t: str) -> str:
    for p, r in PII_PATTERNS:
        t = p.sub(r, t)
    return t

POLICY_DISALLOWED = [
    re.compile(r"\b(share|exfiltrate)\b.*\b(raw|all)\b.*\bdata\b", re.I),
    re.compile(r"\bdisable\b.*\bencryption\b", re.I),
]

def policy_check(q: str):
    for r in POLICY_DISALLOWED:
        if r.search(q):
            return False, "Request violates security policy (data exfiltration/encryption tampering)."
    return True, ""

We embed all chunks using Sentence Transformers and store them in a FAISS index for fast retrieval. We introduce PII redaction rules and policy checks to prevent misuse of data. By doing this, we ensure our assistant adheres to enterprise security and compliance guidelines. Check out the FULL CODES here.

def retrieve(query: str, k=4) -> List[Dict]:
    qv = emb_model.encode([query], normalize_embeddings=True, convert_to_numpy=True)
    scores, idxs = INDEX.search(qv, k)
    return [{**CORPUS[i], "score": float(s)} for s, i in zip(scores[0], idxs[0])]

SYSTEM = ("You are an enterprise AI assistant.\n"
          "- Answer strictly from the provided CONTEXT.\n"
          "- If missing info, say what is unknown and suggest the correct policy/runbook.\n"
          "- Keep it concise and cite titles + doc_ids inline like [Title (doc_id:chunk)].")

def build_prompt(user_q: str, ctx_blocks: List[Dict]) -> str:
    ctx = "\n\n".join(f"[{i+1}] {b['title']} (doc:{b['doc_id']}:{b['chunk_id']})\n{b['text']}" for i, b in enumerate(ctx_blocks))
    uq = redact(user_q)
    return f"SYSTEM:\n{SYSTEM}\n\nCONTEXT:\n{ctx}\n\nUSER QUESTION:\n{uq}\n\nINSTRUCTIONS:\n- Cite sources inline.\n- Keep to 5-8 sentences.\n- Preserve redactions."

def answer(user_q: str, k=4, max_new_tokens=220) -> Dict:
    ok, msg = policy_check(user_q)
    if not ok:
        return {"answer": f" {msg}", "ctx": []}
    ctx = retrieve(user_q, k=k)
    prompt = build_prompt(user_q, ctx)
    out = generate(prompt, max_new_tokens=max_new_tokens, do_sample=False)[0]["generated_text"].strip()
    return {"answer": out, "ctx": ctx}

We design the retrieval function to fetch relevant document sections for each user query. We then construct a structured prompt combining context and questions for FLAN-T5 to generate precise answers. This step ensures that our assistant produces grounded, policy-compliant responses. Check out the FULL CODES here.

def eval_query(user_q: str, ctx: List[Dict]) -> Dict:
    terms = [w.lower() for w in re.findall(r"[a-zA-Z]{4,}", user_q)]
    ctx_text = " ".join(c["text"].lower() for c in ctx)
    hits = sum(t in ctx_text for t in terms)
    return {"terms": len(terms), "hits": hits, "hit_rate": round(hits / max(1, len(terms)), 2)}

QUERIES = [
    "What encryption and backup rules do we follow for customer data?",
    "Can we auto-answer RFP security questionnaires? What should we cite?",
    "If there is a suspected breach, what are the first three steps?",
    "Is it allowed to share all raw customer data externally for testing?"
]

for q in QUERIES:
    res = answer(q, k=3)
    print("\n" + "=" * 100)
    print("Q:", q)
    print("\nA:", res["answer"])
    if res["ctx"]:
        ev = eval_query(q, res["ctx"])
        print("\nRetrieved Context (top 3):")
        for r in res["ctx"]:
            print(f"- {r['title']} [{r['doc_id']}:{r['chunk_id']}] score={r['score']:.3f}")
        print("Eval:", ev)

We evaluate our system using sample enterprise queries that test encryption, RFPs, and incident procedures. We display retrieved documents, answers, and simple hit-rate scores to check relevance. Through this demo, we observe our Enterprise AI assistant performing retrieval-augmented reasoning securely and accurately.

In conclusion, we successfully created a self-contained enterprise AI system that retrieves, analyzes, and responds to business queries while maintaining strong guardrails. We appreciate how seamlessly we can combine FAISS for retrieval, Sentence Transformers for embeddings, and FLAN-T5 for generation to simulate an internal enterprise knowledge engine. As we finish, we realize that this simple Colab-based implementation can serve as a blueprint for scalable, auditable, and compliant enterprise deployments.

Check out the FULL CODES here.
The post How to Design a Fully Functional Enterprise AI Assistant with Retrieval Augmentation and Policy Guardrails Using Open Source AI Models appeared first on MarkTechPost.

Google AI Introduces VISTA: A Test Time Self Improving Agent for Text …

TLDR: VISTA is a multi-agent framework that improves text-to-video generation during inference. It plans structured prompts as scenes, runs a pairwise tournament to select the best candidate, uses specialized judges across visual, audio, and context dimensions, then rewrites the prompt with a Deep Thinking Prompting Agent. The method shows consistent gains over strong prompt-optimization baselines in single-scene and multi-scene settings, and human raters prefer its outputs.

https://arxiv.org/pdf/2510.15831

What is VISTA?

VISTA stands for Video Iterative Self improvemenT Agent. It is a black-box, multi-agent loop that refines prompts and regenerates videos at test time. The system targets three aspects jointly: visual, audio, and context. It follows four steps: structured video prompt planning, pairwise tournament selection, multi-dimensional multi-agent critiques, and a Deep Thinking Prompting Agent for prompt rewriting.

The research team evaluates VISTA on a single-scene benchmark and on an internal multi-scene set. It reports consistent improvements, up to a 60 percent pairwise win rate against state-of-the-art baselines in some settings, and a 66.4 percent human preference over the strongest baseline.

Understanding the key problem

Text to video models like Veo 3 can produce high quality video and audio, yet outputs remain sensitive to exact prompt phrasing, adherence to physics can fail, and alignment to user goals can drift, which forces manual trial and error. VISTA frames this as a test time optimization problem. It seeks unified improvement across visual signals, audio signals, and contextual alignment.

How VISTA works, step by step

Step 1: structured video prompt planning

The user prompt is decomposed into timed scenes. Each scene carries 9 properties: duration, scene type, characters, actions, dialogues, visual environment, camera, sounds, and moods. A multimodal LLM fills missing properties and enforces constraints on realism, relevancy, and creativity by default. The system also keeps the original user prompt in the candidate set to accommodate models that do not benefit from decomposition.
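Such a scene plan maps naturally onto a simple data structure; the dataclass below is a hypothetical illustration of the 9 per-scene properties, not the schema used in the paper:

from dataclasses import dataclass, field
from typing import List

@dataclass
class ScenePlan:
    # One timed scene in a structured video prompt (illustrative field names)
    duration_seconds: float
    scene_type: str
    characters: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    dialogues: List[str] = field(default_factory=list)
    visual_environment: str = ""
    camera: str = ""
    sounds: str = ""
    moods: str = ""

scene = ScenePlan(duration_seconds=4.0, scene_type="establishing shot",
                  characters=["narrator"], visual_environment="rainy city street at dusk",
                  camera="slow dolly-in", sounds="rain, distant traffic", moods="melancholic")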

Step 2: pairwise tournament video selection

The system samples multiple (video, prompt) pairs. An MLLM acts as a judge, running binary tournaments with bidirectional swapping to reduce token-order bias. The default criteria include visual fidelity, physical commonsense, text-video alignment, audio-video alignment, and engagement. The method first elicits probing critiques to support analysis, then performs the pairwise comparison, and applies customizable penalties for common text-to-video failures.
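A pairwise tournament with bidirectional swapping can be sketched as follows (illustrative only; judge is a placeholder for the MLLM comparison call and is assumed to return "A" or "B"):

def compare(a, b, judge):
    # Judge the pair in both orders to reduce token-order bias
    first = judge(a, b)    # "A" means the first argument wins
    second = judge(b, a)   # swapped order
    if first == "A" and second == "B":
        return a
    if first == "B" and second == "A":
        return b
    return a  # the two orders disagree, keep the incumbent

def tournament(candidates, judge):
    # Successive binary comparisons that return the champion (video, prompt) pair
    champion = candidates[0]
    for challenger in candidates[1:]:
        champion = compare(champion, challenger, judge)
    return champion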

Step 3: multi dimensional multi agent critiques

The champion video and prompt receive critiques along three dimensions: visual, audio, and context. Each dimension uses a triad: a normal judge, an adversarial judge, and a meta judge that consolidates both sides. Metrics include visual fidelity, motion and dynamics, temporal consistency, camera focus, and visual safety for the visual dimension; audio fidelity, audio-video alignment, and audio safety for the audio dimension; and situational appropriateness, semantic coherence, text-video alignment, physical commonsense, engagement, and video format for the context dimension. Scores are on a 1 to 10 scale, which supports targeted error discovery.
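The triad of judges per dimension can be mimicked with three prompts per metric; the sketch below is a loose illustration (llm is a placeholder, and the prompt wording and parsing are ours):

def critique_dimension(video_desc, prompt, metrics, llm):
    # For each metric: a normal critique, an adversarial critique, then a meta judge
    # that consolidates both into a 1-10 score plus a summary critique
    scores = {}
    for metric in metrics:
        normal = llm(f"As a supportive judge, critique '{metric}' for video {video_desc} and prompt {prompt}.")
        adversarial = llm(f"As an adversarial judge, find every flaw in '{metric}' for video {video_desc}.")
        meta = llm("Consolidate the two critiques and score the metric from 1 to 10.\n"
                   f"Normal: {normal}\nAdversarial: {adversarial}\nReply as: <score> | <critique>")
        score_text, _, critique = meta.partition("|")
        scores[metric] = (int(score_text.strip()), critique.strip())
    return scores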

Step 4: Deep Thinking Prompting Agent

The reasoning module reads the meta critiques and runs a 6-step introspection: it identifies low-scoring metrics, clarifies expected outcomes, checks prompt sufficiency, separates model limits from prompt issues, detects conflicts or vagueness, and proposes modification actions, then samples refined prompts for the next generation cycle.
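One way to drive the six-step introspection is a single structured prompt; the template below is a hypothetical sketch (llm is a placeholder call, and the wording is ours, not the paper's):

DTPA_TEMPLATE = """You are refining a text-to-video prompt.
Meta critiques with 1-10 scores per metric:
{critiques}

Current prompt:
{prompt}

Work through these steps:
1. Identify the lowest-scoring metrics.
2. Clarify the expected outcome for each of them.
3. Check whether the current prompt gives enough information to achieve it.
4. Separate model limitations from prompt issues.
5. Detect conflicting or vague instructions.
6. Propose concrete modification actions.

Then output {n} refined prompts, one per line."""

def deep_thinking_rewrite(prompt, critiques, llm, n=3):
    # Sample refined prompt candidates for the next generation cycle
    out = llm(DTPA_TEMPLATE.format(critiques=critiques, prompt=prompt, n=n))
    return [line.strip() for line in out.splitlines() if line.strip()]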

Understanding the results

Automatic evaluation: The research team reports win, tie, and loss rates on ten criteria using an MLLM as a judge, with bidirectional comparisons. VISTA's win rate over direct prompting rises across iterations, reaching 45.9 percent in the single-scene setting and 46.3 percent in the multi-scene setting at iteration 5. It also wins directly against each baseline under the same compute budget.

Human studies: Annotators with prompt optimization experience prefer VISTA in 66.4 percent of head to head trials against the best baseline at iteration 5. Experts rate optimization trajectories higher for VISTA, and they score visual quality and audio quality higher than direct prompting.

Cost and scaling: Average token usage per iteration is about 0.7 million across the two datasets; generation tokens are not included. Most token use comes from selection and critiques, which process videos as long-context inputs. The win rate tends to increase as the number of sampled videos and tokens per iteration increases.

Ablations: Removing prompt planning weakens initialization. Removing tournament selection destabilizes later iterations. Using only one judge type reduces performance. Removing the Deep Thinking Prompting Agent lowers final win rates.

Evaluators: The research team repeated the evaluation with alternative evaluator models and observed similar iterative improvements, which supports the robustness of the trend.

Key Takeaways

VISTA is a test time, multi agent loop that jointly optimizes visual, audio, and context for text to video generation.

It plans prompts as timed scenes with 9 attributes: duration, scene type, characters, actions, dialogues, visual environment, camera, sounds, and moods.

Candidate videos are selected via pairwise tournaments using an MLLM judge with bidirectional swap, scored on visual fidelity, physical commonsense, text video alignment, audio video alignment, and engagement.

A triad of judges per dimension, normal, adversarial, meta, produces 1 to 10 scores that guide the Deep Thinking Prompting Agent to rewrite the prompt and iterate.

Results show 45.9 percent wins in the single-scene setting and 46.3 percent in the multi-scene setting at iteration 5 over direct prompting; human raters prefer VISTA in 66.4 percent of trials; the average token cost per iteration is about 0.7 million.

Editorial Comments

VISTA is a practical step toward reliable text to video generation, it treats inference as an optimization loop and keeps the generator as a black box. The structured video prompt planning is useful for early engineers, the 9 scene attributes give a concrete checklist. The pairwise tournament selection with a multimodal LLM judge and bidirectional swap is a sensible way to reduce ordering bias, the criteria target real failure modes, visual fidelity, physical commonsense, text video alignment, audio video alignment, engagement. The multi dimensional critiques separate visual, audio, and context, the normal, adversarial, and meta judges expose weaknesses that single judges miss. The Deep Thinking Prompting Agent turns those diagnostics into targeted prompt edits. The use of Gemini 2.5 Flash and Veo 3 clarifies the reference setup, the Veo 2 study is a helpful lower bound. The reported 45.9 and 46.3 percent win rates and 66.4 percent human preference indicate repeatable gains. The 0.7 million token cost is non trivial, yet transparent and scalable.

Check out the Paper and Project Page.
The post Google AI Introduces VISTA: A Test Time Self Improving Agent for Text to Video Generation appeared first on MarkTechPost.