Implement automated smoke testing using Amazon Nova Act headless mode

Automated smoke testing using Amazon Nova Act headless mode helps development teams validate core functionality in continuous integration and continuous delivery (CI/CD) pipelines. Development teams often deploy code several times daily, so fast testing helps maintain application quality. Traditional end-to-end testing can take hours to complete, creating delays in your CI/CD pipeline.
Smoke testing is a subset of testing that validates the most critical functions of an application work correctly after deployment. These tests focus on key workflows like user login, core navigation, and key transactions rather than exhaustive feature coverage. Smoke tests typically complete in minutes rather than hours, making them ideal for CI/CD pipelines where fast feedback on code changes is essential.
Amazon Nova Act uses AI-powered UI understanding and natural language processing to interact with web applications. Instead of maintaining brittle CSS selectors and complex test scripts, you can write tests using simple English commands that adapt to UI changes.
This post shows how to implement automated smoke testing using Amazon Nova Act headless mode in CI/CD pipelines. We use SauceDemo, a sample ecommerce application, as our target for demonstration. We demonstrate setting up Amazon Nova Act for headless browser automation in CI/CD environments and creating smoke tests that validate key user workflows. We then show how to implement parallel execution to maximize testing efficiency, configure GitLab CI/CD for automatic test execution on every deployment, and apply best practices for maintainable and scalable test automation.
Solution overview
The solution includes a Python test runner that executes smoke tests, ecommerce workflow validation for complete user journeys, GitLab CI/CD integration for automation, and parallel execution to speed up testing. Headless mode runs browser tests in the background without opening a browser window, which works well for automated testing.
The following diagram illustrates the testing workflow.

We walk through the following steps to implement automated smoke testing with Amazon Nova Act:

Set up your project and dependencies.
Create a smoke test with login validation.
Configure validation for the entire ecommerce workflow.
Configure the automated testing pipeline.
Configure parallel execution.

Prerequisites
To complete this walkthrough, you must have the following:

Access to Amazon Nova Act with an API key.
A GitLab repository.
UV package manager. For instructions, refer to Installing uv.
Familiarity with Python and GitLab CI/CD.

Set up project and dependencies
Create your project and install dependencies:

# Create the project
uv init nova-act-smoke-tests
# Navigate to the project
cd nova-act-smoke-tests
# Open in VS Code
code .
# Install required packages
uv add nova-act

UV is a fast Python package manager that handles dependency installation and virtual environment management automatically, similar to npm for Node.js projects.

Create a test runner
Create smoke_tests.py:

import os
from nova_act import NovaAct

# Check API key
if not os.getenv("NOVA_ACT_API_KEY"):
    exit("❌ Set NOVA_ACT_API_KEY environment variable")

SAUCEDEMO_URL = "https://www.saucedemo.com/"

with NovaAct(starting_page=SAUCEDEMO_URL) as nova:
    nova.act("Verify you are in the login page")

print("✅ Foundation setup complete!")

Test your setup

Test your setup with the following commands:

export NOVA_ACT_API_KEY="your-api-key"
uv run smoke_tests.py

Environment variables like NOVA_ACT_API_KEY keep sensitive information secure and separate from your code.
This solution implements the following security features:

Stores API keys in environment variables or .env files (add .env to .gitignore)
Uses different API keys for development, staging, and production environments
Implements key rotation every 90 days using automated scripts or calendar reminders
Monitors API key usage through logs to detect unauthorized access
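
To illustrate the first practice above, the following minimal sketch loads the API key from a local .env file at startup. It assumes you add the python-dotenv package (for example, with uv add python-dotenv), which is not part of the project's original dependencies:

import os
from dotenv import load_dotenv

# Read key=value pairs from a local .env file (kept out of version control via .gitignore)
load_dotenv()

if not os.getenv("NOVA_ACT_API_KEY"):
    exit("❌ Set NOVA_ACT_API_KEY in your environment or .env file")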

You now have a modern Python project with Amazon Nova Act configured and ready for testing. Next, we show how to create a working smoke test that uses natural language browser automation.
Create smoke test for login validation
Let’s expand your foundation code to include a complete login test with proper structure.
Add main function and login test
Update smoke_tests.py:

import os
from nova_act import NovaAct

SAUCEDEMO_URL = "https://www.saucedemo.com/"

def test_login_flow():
    """Test complete login flow and product page verification"""
    with NovaAct(starting_page=SAUCEDEMO_URL) as nova:
        nova.act("Enter 'standard_user' in the username field")
        nova.act("Enter 'secret_sauce' in the password field")
        nova.act("Click the login button")
        nova.act("Verify Products appear on the page")

def main():
    # Check API key
    if not os.getenv("NOVA_ACT_API_KEY"):
        exit("❌ Set NOVA_ACT_API_KEY environment variable")

    print("🚀 Starting Nova Act Smoke Test")

    try:
        test_login_flow()
        print("✅ Login test: PASS")
    except Exception as e:
        print(f"❌ Login test: FAIL - {e}")
        exit(1)

    print("🎉 All tests passed!")

if __name__ == "__main__":
    main()

Test your login flow
Run your complete login test:

export NOVA_ACT_API_KEY="your-api-key"
uv run smoke_tests.py

You should see the following output:

🚀 Starting Nova Act Smoke Test
✅ Login test: PASS
🎉 All tests passed!

Your smoke test now validates a complete user journey that uses natural language with Amazon Nova Act. The test handles page verification to confirm you’re on the login page, form interactions that enter user name and password credentials, action execution that clicks the login button, and success validation that verifies the products page loads correctly. The built-in error handling provides retry logic if the login process encounters any issues, showing how the AI-powered automation of Amazon Nova Act adapts to dynamic web applications without the brittleness of traditional CSS selector-based testing frameworks.
Although a login test provides valuable validation, real-world applications require testing complete user workflows that span multiple pages and complex interactions. Next, we expand the testing capabilities by building a comprehensive ecommerce journey that validates the entire customer experience.
Configure ecommerce workflow validation
Let’s build a comprehensive ecommerce workflow that tests the end-to-end customer journey from login to logout.
Add complete ecommerce test
Update smoke_tests.py to include the full workflow:

import os
from nova_act import NovaAct

SAUCEDEMO_URL = "https://www.saucedemo.com/"

def test_login_flow():
    """Test complete login flow and product page verification"""
    with NovaAct(starting_page=SAUCEDEMO_URL) as nova:
        nova.act("Enter 'standard_user' in the username field")
        nova.act("Enter 'secret_sauce' in the password field")
        nova.act("Click the login button")
        nova.act("Verify Products appear on the page")

def test_ecommerce_workflow():
    """Test complete e-commerce workflow: login → shop → checkout → logout"""
    with NovaAct(starting_page=SAUCEDEMO_URL) as nova:
        # Login
        nova.act("Enter 'standard_user' in the username field")
        nova.act("Enter 'secret_sauce' in the password field")
        nova.act("Click the login button")
        nova.act("Verify Products appear on the page")

        # Shopping
        nova.act("Select Sauce Labs Backpack")
        nova.act("Add Sauce Labs Backpack to the cart")
        nova.act("Navigate back to products page")
        nova.act("Select Sauce Labs Onesie")
        nova.act("Add Sauce Labs Onesie to the cart")
        nova.act("Navigate back to products page")

        # Cart verification
        nova.act("Click cart and Navigate to the cart page")
        nova.act("Verify 2 items are in the cart")

        # Checkout process
        nova.act("Click the Checkout button")
        nova.act("Enter 'John' in the First Name field")
        nova.act("Enter 'Doe' in the Last Name field")
        nova.act("Enter '12345' in the Zip/Postal Code field")
        nova.act("Click the Continue button")

        # Order completion
        nova.act("Verify Checkout:Overview page appears")
        nova.act("Click the Finish button")
        nova.act("Verify 'THANK YOU FOR YOUR ORDER' appears on the page")

        # Return and logout
        nova.act("Click the Back Home button")
        nova.act("Click the hamburger menu on the left")
        nova.act("Click the Logout link")
        nova.act("Verify the user is on the login page")

def main():
    # Check API key
    if not os.getenv("NOVA_ACT_API_KEY"):
        exit("❌ Set NOVA_ACT_API_KEY environment variable")

    print("🚀 Starting Nova Act E-commerce Tests")

    tests = [
        ("Login Flow", test_login_flow),
        ("E-commerce Workflow", test_ecommerce_workflow)
    ]

    passed = 0
    for test_name, test_func in tests:
        try:
            test_func()
            print(f"✅ {test_name}: PASS")
            passed += 1
        except Exception as e:
            print(f"❌ {test_name}: FAIL - {e}")

    print(f"\n📊 Results: {passed}/{len(tests)} tests passed")

    if passed == len(tests):
        print("🎉 All tests passed!")
    else:
        exit(1)

if __name__ == "__main__":
    main()

Test your ecommerce workflow
Run your comprehensive test suite:

export NOVA_ACT_API_KEY="your-api-key"
uv run smoke_tests.py

You should see the following output:

🚀 Starting Nova Act E-commerce Tests
✅ Login Flow: PASS
✅ E-commerce Workflow: PASS
📊 Results: 2/2 tests passed
🎉 All tests passed!

Understanding the ecommerce journey
The workflow tests a complete customer experience:

Authentication – Login with valid credentials
Product discovery – Browse and select products
Shopping cart – Add items and verify cart contents
Checkout process – Enter shipping information
Order completion – Complete purchase and verify success
Navigation – Return to products and log out

The following screenshot shows the step-by-step visual guide of the user journey.

Your smoke tests now validate complete user journeys that mirror real customer experiences. The ecommerce workflow shows how Amazon Nova Act handles complex, multi-step processes across multiple pages. By testing the entire customer journey from authentication through order completion, you’re validating the primary revenue-generating workflows in your application.
This approach reduces maintenance overhead while providing comprehensive coverage of your application’s core functionality.
Running these tests manually provides immediate value, but the real power comes from integrating them into your development workflow. Automating test execution makes sure code changes are validated against your critical user journeys before reaching production.
Configure automated testing pipeline
With your comprehensive ecommerce workflow in place, you’re ready to integrate these tests into your CI pipeline. This step shows how to configure GitLab CI/CD to automatically run these smoke tests on every code change, making sure key user journeys remain functional throughout your development cycle. We show how to configure headless mode for CI environments while maintaining the visual debugging capabilities for local development.
Add headless mode for CI/CD
Update smoke_tests.py to support headless mode for CI environments by adding the following lines to both test functions:

def test_login_flow():
    """Test complete login flow and product page verification"""
    headless = os.getenv("HEADLESS", "false").lower() == "true"

    with NovaAct(starting_page=SAUCEDEMO_URL, headless=headless) as nova:
        # ... rest of your test code remains the same

def test_ecommerce_workflow():
    """Test complete e-commerce workflow: login → shop → checkout → logout"""
    headless = os.getenv("HEADLESS", "false").lower() == "true"

    with NovaAct(starting_page=SAUCEDEMO_URL, headless=headless) as nova:
        # ... rest of your test code remains the same

Create the GitLab CI/CD pipeline
GitLab CI/CD is GitLab’s built-in CI system that automatically runs pipelines when code changes occur. Pipelines are defined in YAML files that specify when to run tests and what steps to execute.
Create .gitlab-ci.yml:

stages:
  - test

smoke-tests:
  stage: test
  image: mcr.microsoft.com/playwright/python:v1.40.0-jammy
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
    - if: $CI_COMMIT_BRANCH == "develop"
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_PIPELINE_SOURCE == "web"
  before_script:
    - pip install uv
    - uv sync
    - uv run playwright install chromium
  script:
    - uv run python smoke_tests.py
  variables:
    HEADLESS: 'true'
    NOVA_ACT_SKIP_PLAYWRIGHT_INSTALL: 'true'
    MAX_WORKERS: '2'

Configure GitLab CI/CD variables
GitLab CI/CD variables provide secure storage for sensitive information like API keys. These values are encrypted and only accessible to your GitLab CI/CD pipelines. Complete the following steps to add a variable:

In your project, choose Settings, CI/CD, and Variables.
Choose Add variable.
For the key, enter NOVA_ACT_API_KEY.
For the value, enter your Amazon Nova Act API key.
Select Mask variable to hide the value in job logs.
Choose Add variable.

Understanding the code changes
The key change is the headless mode configuration:

headless = os.getenv("HEADLESS", "false").lower() == "true"
with NovaAct(starting_page=SAUCEDEMO_URL, headless=headless) as nova:

This configuration provides flexibility for different development environments. During local development when the HEADLESS environment variable is not set, the headless parameter defaults to False, which opens a browser window so you can see the automation in action. This visual feedback is invaluable for debugging test failures and understanding how Amazon Nova Act interacts with your application. In CI/CD environments where HEADLESS is set to true, the browser runs in the background without opening any windows, making it ideal for automated testing pipelines that don’t have display capabilities and need to run efficiently without visual overhead.
Test your CI/CD setup
Push your code to trigger the workflow:

git add .
git commit -m "Add Nova Act smoke tests with CI/CD"
git push origin main

Check the Pipelines section in your GitLab project to see the tests running.

Your smoke tests now run automatically as part of your CI pipeline, providing immediate feedback on code changes. The GitLab CI/CD integration makes sure critical user journeys are validated before any deployment reaches production, reducing the risk of shipping broken functionality to customers.
The implementation shows how modern package management with UV reduces CI/CD pipeline execution time compared to traditional pip installations. Combined with secure API key management through GitLab CI/CD variables, your testing infrastructure follows enterprise security best practices.
As your test suite grows, you might notice that running tests sequentially can become a bottleneck in your deployment pipeline. The next section addresses this challenge by implementing parallel execution to maximize your CI/CD efficiency.
Configure parallel execution
With your CI/CD pipeline successfully validating individual test cases, the next optimization focuses on performance enhancement through parallel execution. Concurrent test execution can reduce your total testing time by running multiple browser instances simultaneously, maximizing the efficiency of your CI/CD resources while maintaining test reliability and isolation.
Add parallel execution framework
Update smoke_tests.py to support concurrent testing:

import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from nova_act import NovaAct

SAUCEDEMO_URL = "https://www.saucedemo.com/"
headless = os.getenv("HEADLESS", "false").lower() == "true"

def test_login_flow():
    """Test complete login flow and product page verification"""

    with NovaAct(starting_page=SAUCEDEMO_URL, headless=headless) as nova:
        nova.act("Enter 'standard_user' in the username field")
        nova.act("Enter 'secret_sauce' in the password field")
        nova.act("Click the login button")
        # nova.act("In case of error, make sure the username and password are correct, if required re-enter the username and password")
        nova.act("Verify Products appear on the page")

def test_ecommerce_workflow():
    """Test complete e-commerce workflow: login → shop → checkout → logout"""
    with NovaAct(starting_page=SAUCEDEMO_URL, headless=headless) as nova:
        # Login
        nova.act("Enter 'standard_user' in the username field")
        nova.act("Enter 'secret_sauce' in the password field")
        nova.act("Click the login button")
        nova.act("Verify Products appear on the page")

        # Shopping
        nova.act("Select Sauce Labs Backpack")
        nova.act("Add Sauce Labs Backpack to the cart")
        nova.act("Navigate back to products page")
        nova.act("Select Sauce Labs Onesie")
        nova.act("Add Sauce Labs Onesie to the cart")
        nova.act("Navigate back to products page")

        # Cart verification
        nova.act("Click cart and Navigate to the cart page")
        nova.act("Verify 2 items are in the cart")

        # Checkout process
        nova.act("Click the Checkout button")
        nova.act("Enter 'John' in the First Name field")
        nova.act("Enter 'Doe' in the Last Name field")
        nova.act("Enter '12345' in the Zip/Postal Code field")
        nova.act("Click the Continue button")

        # Order completion
        nova.act("Verify Checkout:Overview page appears")
        nova.act("Click the Finish button")
        nova.act("Verify 'THANK YOU FOR YOUR ORDER' appears on the page")

        # Return and logout
        nova.act("Click the Back Home button")
        nova.act("Click the hamburger menu on the left")
        nova.act("Click the Logout link")
        nova.act("Verify the user is on the login page")

def run_test(test_name, test_func):
    """Execute a single test and return result"""
    try:
        test_func()
        print(f"✅ {test_name}: PASS")
        return True
    except Exception as e:
        print(f"❌ {test_name}: FAIL - {e}")
        return False

def main():
    # Check API key
    if not os.getenv("NOVA_ACT_API_KEY"):
        exit("❌ Set NOVA_ACT_API_KEY environment variable")

    print("🚀 Starting Nova Act Tests (Parallel)")

    tests = [
        ("Login Flow", test_login_flow),
        ("E-commerce Workflow", test_ecommerce_workflow)
    ]

    # Configure parallel execution
    max_workers = int(os.getenv("MAX_WORKERS", "2"))

    # Run tests in parallel
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_test = {
            executor.submit(run_test, name, func): name
            for name, func in tests
        }

        for future in as_completed(future_to_test):
            results.append(future.result())

    # Report results
    passed = sum(results)
    total = len(results)

    print(f"\n📊 Results: {passed}/{total} tests passed")

    if passed == total:
        print("🎉 All tests passed!")
    else:
        exit(1)

if __name__ == "__main__":
    main()

Update GitLab CI/CD for parallel execution
Parallel execution is already configured in your .gitlab-ci.yml through the MAX_WORKERS: '2' variable, so the pipeline automatically uses the parallel framework when running the smoke tests.
Test parallel execution
Run your optimized tests:

export NOVA_ACT_API_KEY="your-api-key"
export MAX_WORKERS="2"
uv run smoke_tests.py

You should see both tests running simultaneously:

🚀 Starting Nova Act Tests (Parallel)
✅ Login Flow: PASS
✅ E-commerce Workflow: PASS
📊 Results: 2/2 tests passed
🎉 All tests passed!

Understanding parallel execution
ThreadPoolExecutor is a Python class that manages a pool of worker threads, allowing multiple tasks to run simultaneously. In this case, each thread runs a separate browser test, reducing total execution time.

# Configure worker count
max_workers = int(os.getenv("MAX_WORKERS", "2"))

# Execute tests concurrently
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    future_to_test = {
        executor.submit(run_test, name, func): name
        for name, func in tests
    }

Parallel execution provides benefits such as faster execution (because tests run simultaneously instead of sequentially), configurable workers that adjust based on system resources, resource efficiency that optimizes CI/CD compute time, and scalability that makes it straightforward to add more tests without increasing total runtime.
However, there are important considerations to keep in mind. Each test opens a browser instance (which increases resource usage), tests must be independent of each other to maintain proper isolation, and you must balance worker counts with available CPU and memory limits in CI environments.
Each parallel test uses system resources and incurs API usage. Start with two workers and adjust based on your environment’s capacity and cost requirements. Monitor your Amazon Nova Act usage to optimize the balance between test speed and expenses.
The performance improvement is significant when comparing sequential vs. parallel execution. In sequential execution, tests run one after another with the total time being the sum of all individual test durations. With parallel execution, multiple tests run simultaneously, completing in approximately the time of the longest test, resulting in substantial time savings that become more valuable as your test suite grows.
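To make the comparison concrete, consider a hypothetical suite in which the login test takes about 2 minutes and the ecommerce workflow takes about 5 minutes (illustrative numbers, not measured results):

# Hypothetical durations in minutes, for illustration only
durations = [2, 5]
sequential_time = sum(durations)  # 7 minutes when tests run one after another
parallel_time = max(durations)    # about 5 minutes when 2 workers run both tests at once
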
Your smoke tests now feature concurrent execution that significantly reduces total testing time while maintaining complete test isolation and reliability. The ThreadPoolExecutor implementation allows multiple browser instances to run simultaneously, transforming your sequential test suite into a parallel execution that completes much faster. This performance improvement becomes increasingly valuable as your test suite grows, so comprehensive validation doesn’t become a bottleneck in your deployment pipeline.
The configurable worker count through the MAX_WORKERS environment variable provides flexibility to optimize performance based on available system resources. In CI/CD environments, this allows you to balance test execution speed with resource constraints, and local development can use full system capabilities for faster feedback cycles. The architecture maintains complete test independence, making sure parallel execution doesn’t introduce flakiness or cross-test dependencies that could compromise reliability. As a best practice, keep tests independent—each test should work correctly regardless of execution order or other tests running simultaneously.
Best practices
With your performance-optimized testing framework complete, consider the following practices for production readiness:

Keep tests independent, so they are not affected by execution order or by other tests running simultaneously.
Add retry logic by wrapping your test functions in try/except blocks with a retry mechanism for handling transient network issues (see the sketch after this list).
Configure your GitLab CI/CD pipeline with a reasonable timeout and consider adding a scheduled run for daily validation of your production environment.
For ongoing maintenance, establish a rotation schedule for your Amazon Nova Act API keys and monitor your test execution times to catch performance regressions early. As your application grows, you can add new test functions to the parallel execution framework without impacting overall runtime, making this solution highly scalable for future needs.
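
The following is a minimal sketch of the retry practice from the list above; the attempt count and delay are illustrative values rather than part of the original test runner:

import time

def run_with_retries(test_func, attempts=2, delay=5):
    """Run a test function, retrying on transient failures before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            test_func()
            return True
        except Exception as e:
            print(f"Attempt {attempt} failed: {e}")
            if attempt < attempts:
                time.sleep(delay)  # brief pause before retrying a transient issue
    return False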

Clean up
To avoid incurring future charges and maintain security, clean up the resources you created:

Remove or disable unused GitLab CI/CD pipelines.
Rotate API keys every 90 days and revoke unused keys.
Delete the repositories provided with this post.
Remove API keys from inactive projects.
Clear cached credentials and temporary files from your local environment.

Conclusion
In this post, we showed how to implement automated smoke testing using Amazon Nova Act headless mode for CI/CD pipelines. We demonstrated how to create comprehensive ecommerce workflow tests that validate user journeys, implement parallel execution for faster test completion, and integrate automated testing with GitLab CI/CD for continuous validation.
The natural language approach using Amazon Nova Act needs less maintenance than traditional frameworks that use CSS selectors. Combined with modern tooling like UV package management and GitLab CI/CD, this solution provides fast, reliable test execution that scales with your development workflow. Your implementation now catches issues before they reach production, providing the fast feedback essential for confident continuous deployment while maintaining high application quality standards.
To learn more about browser automation and testing strategies on AWS, explore the following resources:

Getting Started Using the Nova Act Dev Tools and explore the capability in Nova Act playground
Amazon CloudWatch Synthetics for additional monitoring capabilities
GitLab CI/CD documentation for CI/CD best practices
AWS security best practices for securing your automation infrastructure

Try implementing these smoke tests in your own applications and consider extending the framework with additional test scenarios that match your specific user journeys. Share your experience and any optimizations you discover in the comments section.

About the authors
Sakthi Chellapparimanam Sakthivel is a Solutions Architect at AWS, specializing in .NET modernization and enterprise cloud transformations. He helps GSI and software/services customers build scalable, innovative solutions on AWS. He architects intelligent automation frameworks and GenAI-powered applications that drive measurable business outcomes across diverse industries. Beyond his technical pursuits, Sakthivel enjoys spending quality time with his family and playing cricket.
Shyam Soundar is a Solutions Architect at AWS with an extensive background in security, cost-optimization, and analytics offerings. Shyam works with enterprise customers to help them build and scale applications to achieve their business outcomes with lower cost.
Reena M is an FSI Solutions Architect at AWS, specializing in analytics and generative AI-based workloads, helping capital markets and banking customers create secure, scalable, and efficient solutions on AWS. She architects cutting-edge data platforms and AI-powered applications that transform how financial institutions leverage cloud technologies. Beyond her technical pursuits, Reena is also a writer and enjoys spending time with her family.

A Coding Guide to Build a Procedural Memory Agent That Learns, Stores, Retrieves, and Reuses Skills as Neural Modules Over Time

In this tutorial, we explore how an intelligent agent can gradually form procedural memory by learning reusable skills directly from its interactions with an environment. We design a minimal yet powerful framework in which skills behave like neural modules: they store action sequences, carry contextual embeddings, and are retrieved by similarity when a new situation resembles an experience. As we run our agent through multiple episodes, we observe how its behaviour becomes more efficient, moving from primitive exploration to leveraging a library of skills that it has learned on its own. Check out the FULL CODES here.

import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict

class Skill:
    def __init__(self, name, preconditions, action_sequence, embedding, success_count=0):
        self.name = name
        self.preconditions = preconditions
        self.action_sequence = action_sequence
        self.embedding = embedding
        self.success_count = success_count
        self.times_used = 0

    def is_applicable(self, state):
        for key, value in self.preconditions.items():
            if state.get(key) != value:
                return False
        return True

    def __repr__(self):
        return f"Skill({self.name}, used={self.times_used}, success={self.success_count})"

class SkillLibrary:
    def __init__(self, embedding_dim=8):
        self.skills = []
        self.embedding_dim = embedding_dim
        self.skill_stats = defaultdict(lambda: {"attempts": 0, "successes": 0})

    def add_skill(self, skill):
        for existing_skill in self.skills:
            if self._similarity(skill.embedding, existing_skill.embedding) > 0.9:
                existing_skill.success_count += 1
                return existing_skill
        self.skills.append(skill)
        return skill

    def retrieve_skills(self, state, query_embedding=None, top_k=3):
        applicable = [s for s in self.skills if s.is_applicable(state)]
        if query_embedding is not None and applicable:
            similarities = [self._similarity(query_embedding, s.embedding) for s in applicable]
            sorted_skills = [s for _, s in sorted(zip(similarities, applicable), key=lambda pair: pair[0], reverse=True)]
            return sorted_skills[:top_k]
        return sorted(applicable, key=lambda s: s.success_count / max(s.times_used, 1), reverse=True)[:top_k]

    def _similarity(self, emb1, emb2):
        return np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2) + 1e-8)

    def get_stats(self):
        return {
            "total_skills": len(self.skills),
            "total_uses": sum(s.times_used for s in self.skills),
            "avg_success_rate": np.mean([s.success_count / max(s.times_used, 1) for s in self.skills]) if self.skills else 0
        }

We define how skills are represented and stored in a memory structure. We implement similarity-based retrieval so that the agent can match a new state with past skills using cosine similarity. As we work through this layer, we see how skill reuse becomes possible once skills acquire metadata, embeddings, and usage statistics. Check out the FULL CODES here.

class GridWorld:
    def __init__(self, size=5):
        self.size = size
        self.reset()

    def reset(self):
        self.agent_pos = [0, 0]
        self.goal_pos = [self.size - 1, self.size - 1]
        self.objects = {"key": [2, 2], "door": [3, 3], "box": [1, 3]}
        self.inventory = []
        self.door_open = False
        return self.get_state()

    def get_state(self):
        return {
            "agent_pos": tuple(self.agent_pos),
            "has_key": "key" in self.inventory,
            "door_open": self.door_open,
            "at_goal": self.agent_pos == self.goal_pos,
            "objects": {k: tuple(v) for k, v in self.objects.items()}
        }

    def step(self, action):
        reward = -0.1
        if action == "move_up":
            self.agent_pos[1] = min(self.agent_pos[1] + 1, self.size - 1)
        elif action == "move_down":
            self.agent_pos[1] = max(self.agent_pos[1] - 1, 0)
        elif action == "move_left":
            self.agent_pos[0] = max(self.agent_pos[0] - 1, 0)
        elif action == "move_right":
            self.agent_pos[0] = min(self.agent_pos[0] + 1, self.size - 1)
        elif action == "pickup_key":
            if self.agent_pos == self.objects["key"] and "key" not in self.inventory:
                self.inventory.append("key")
                reward = 1.0
        elif action == "open_door":
            if self.agent_pos == self.objects["door"] and "key" in self.inventory:
                self.door_open = True
                reward = 2.0
        done = self.agent_pos == self.goal_pos and self.door_open
        if done:
            reward = 10.0
        return self.get_state(), reward, done

We construct a simple environment in which the agent learns tasks such as picking up a key, opening a door, and reaching a goal. We use this environment as a playground for our procedural memory system, allowing us to observe how primitive actions evolve into more complex, reusable skills. The environment’s structure helps us observe clear, interpretable improvements in behaviour across episodes. Check out the FULL CODES here.

class ProceduralMemoryAgent:
    def __init__(self, env, embedding_dim=8):
        self.env = env
        self.skill_library = SkillLibrary(embedding_dim)
        self.embedding_dim = embedding_dim
        self.episode_history = []
        self.primitive_actions = ["move_up", "move_down", "move_left", "move_right", "pickup_key", "open_door"]

    def create_embedding(self, state, action_seq):
        state_vec = np.zeros(self.embedding_dim)
        state_vec[0] = hash(str(state["agent_pos"])) % 1000 / 1000
        state_vec[1] = 1.0 if state.get("has_key") else 0.0
        state_vec[2] = 1.0 if state.get("door_open") else 0.0
        for i, action in enumerate(action_seq[:self.embedding_dim - 3]):
            state_vec[3 + i] = hash(action) % 1000 / 1000
        return state_vec / (np.linalg.norm(state_vec) + 1e-8)

    def extract_skill(self, trajectory):
        if len(trajectory) < 2:
            return None
        start_state = trajectory[0][0]
        actions = [a for _, a, _ in trajectory]
        preconditions = {"has_key": start_state.get("has_key", False), "door_open": start_state.get("door_open", False)}
        end_state = self.env.get_state()
        if end_state.get("has_key") and not start_state.get("has_key"):
            name = "acquire_key"
        elif end_state.get("door_open") and not start_state.get("door_open"):
            name = "open_door_sequence"
        else:
            name = f"navigate_{len(actions)}_steps"
        embedding = self.create_embedding(start_state, actions)
        return Skill(name, preconditions, actions, embedding, success_count=1)

    def execute_skill(self, skill):
        skill.times_used += 1
        trajectory = []
        total_reward = 0
        for action in skill.action_sequence:
            state = self.env.get_state()
            next_state, reward, done = self.env.step(action)
            trajectory.append((state, action, reward))
            total_reward += reward
            if done:
                skill.success_count += 1
                return trajectory, total_reward, True
        return trajectory, total_reward, False

    def explore(self, max_steps=20):
        trajectory = []
        state = self.env.get_state()
        for _ in range(max_steps):
            action = self._choose_exploration_action(state)
            next_state, reward, done = self.env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
            if done:
                return trajectory, True
        return trajectory, False

We focus on building embeddings that encode the context of a state-action sequence, enabling us to meaningfully compare skills. We also extract skills from successful trajectories, transforming raw experience into reusable behaviours. As we run this code, we observe how simple exploration gradually yields structured knowledge that the agent can apply later. Check out the FULL CODES here.

    def _choose_exploration_action(self, state):
        agent_pos = state["agent_pos"]
        if not state.get("has_key"):
            key_pos = state["objects"]["key"]
            if agent_pos == key_pos:
                return "pickup_key"
            if agent_pos[0] < key_pos[0]:
                return "move_right"
            if agent_pos[0] > key_pos[0]:
                return "move_left"
            if agent_pos[1] < key_pos[1]:
                return "move_up"
            return "move_down"
        if state.get("has_key") and not state.get("door_open"):
            door_pos = state["objects"]["door"]
            if agent_pos == door_pos:
                return "open_door"
            if agent_pos[0] < door_pos[0]:
                return "move_right"
            if agent_pos[0] > door_pos[0]:
                return "move_left"
            if agent_pos[1] < door_pos[1]:
                return "move_up"
            return "move_down"
        goal_pos = (4, 4)
        if agent_pos[0] < goal_pos[0]:
            return "move_right"
        if agent_pos[1] < goal_pos[1]:
            return "move_up"
        return np.random.choice(self.primitive_actions)

    def run_episode(self, use_skills=True):
        self.env.reset()
        total_reward = 0
        steps = 0
        trajectory = []
        while steps < 50:
            state = self.env.get_state()
            if use_skills and self.skill_library.skills:
                query_emb = self.create_embedding(state, [])
                skills = self.skill_library.retrieve_skills(state, query_emb, top_k=1)
                if skills:
                    skill_traj, skill_reward, success = self.execute_skill(skills[0])
                    trajectory.extend(skill_traj)
                    total_reward += skill_reward
                    steps += len(skill_traj)
                    if success:
                        return trajectory, total_reward, steps, True
                    continue
            action = self._choose_exploration_action(state)
            next_state, reward, done = self.env.step(action)
            trajectory.append((state, action, reward))
            total_reward += reward
            steps += 1
            if done:
                return trajectory, total_reward, steps, True
        return trajectory, total_reward, steps, False

    def train(self, episodes=10):
        stats = {"rewards": [], "steps": [], "skills_learned": [], "skill_uses": []}
        for ep in range(episodes):
            trajectory, reward, steps, success = self.run_episode(use_skills=True)
            if success and len(trajectory) >= 3:
                segment = trajectory[-min(5, len(trajectory)):]
                skill = self.extract_skill(segment)
                if skill:
                    self.skill_library.add_skill(skill)
            stats["rewards"].append(reward)
            stats["steps"].append(steps)
            stats["skills_learned"].append(len(self.skill_library.skills))
            stats["skill_uses"].append(self.skill_library.get_stats()["total_uses"])
            print(f"Episode {ep+1}: Reward={reward:.1f}, Steps={steps}, Skills={len(self.skill_library.skills)}, Success={success}")
        return stats

We define how the agent chooses between using known skills and exploring with primitive actions. We train the agent across several episodes and record the evolution of learned skills, usage counts, and success rates. As we examine this part, we observe that skill reuse reduces episode length and improves overall rewards. Check out the FULL CODES here.

def visualize_training(stats):
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    axes[0, 0].plot(stats["rewards"])
    axes[0, 0].set_title("Episode Rewards")
    axes[0, 1].plot(stats["steps"])
    axes[0, 1].set_title("Steps per Episode")
    axes[1, 0].plot(stats["skills_learned"])
    axes[1, 0].set_title("Skills in Library")
    axes[1, 1].plot(stats["skill_uses"])
    axes[1, 1].set_title("Cumulative Skill Uses")
    plt.tight_layout()
    plt.savefig("skill_learning_stats.png", dpi=150, bbox_inches='tight')
    plt.show()

if __name__ == "__main__":
    print("=== Procedural Memory Agent Demo ===\n")
    env = GridWorld(size=5)
    agent = ProceduralMemoryAgent(env)
    print("Training agent to learn reusable skills...\n")
    stats = agent.train(episodes=15)
    print("\n=== Learned Skills ===")
    for skill in agent.skill_library.skills:
        print(f"{skill.name}: {len(skill.action_sequence)} actions, used {skill.times_used} times, {skill.success_count} successes")
    lib_stats = agent.skill_library.get_stats()
    print("\n=== Library Statistics ===")
    print(f"Total skills: {lib_stats['total_skills']}")
    print(f"Total skill uses: {lib_stats['total_uses']}")
    print(f"Avg success rate: {lib_stats['avg_success_rate']:.2%}")
    visualize_training(stats)
    print("\n✓ Skill learning complete! Check the visualization above.")
print(“n✓ Skill learning complete! Check the visualization above.”)

We bring everything together by running training, printing learned skills, and plotting behaviour statistics. We visualize the trend in rewards and how the skill library grows over time. By running this snippet, we complete the lifecycle of procedural memory formation and confirm that the agent learns to behave more intelligently with experience.

In conclusion, we see how procedural memory emerges naturally when an agent learns to extract skills from its own successful trajectories. We observe how skills gain structure, metadata, embeddings, and usage patterns, allowing the agent to reuse them efficiently in future situations. Lastly, we appreciate how even a small environment and simple heuristics lead to meaningful learning dynamics, giving us a concrete understanding of what it means for an agent to develop reusable internal competencies over time.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks.
The post A Coding Guide to Build a Procedural Memory Agent That Learns, Stores, Retrieves, and Reuses Skills as Neural Modules Over Time appeared first on MarkTechPost.

Google LiteRT NeuroPilot Stack Turns MediaTek Dimensity NPUs into First Class Targets for on Device LLMs

The new LiteRT NeuroPilot Accelerator from Google and MediaTek is a concrete step toward running real generative models on phones, laptops, and IoT hardware without shipping every request to a data center. It takes the existing LiteRT runtime and wires it directly into MediaTek’s NeuroPilot NPU stack, so developers can deploy LLMs and embedding models with a single API surface instead of per chip custom code.

What is LiteRT NeuroPilot Accelerator?

LiteRT is the successor of TensorFlow Lite. It is a high performance runtime that sits on device, runs models in .tflite FlatBuffer format, and can target CPU, GPU and now NPU backends through a unified hardware acceleration layer.

LiteRT NeuroPilot Accelerator is the new NPU path for MediaTek hardware. It replaces the older TFLite NeuroPilot delegate with a direct integration to the NeuroPilot compiler and runtime. Instead of treating the NPU as a thin delegate, LiteRT now uses a Compiled Model API that understands Ahead of Time (AOT) compilation and on device compilation, and exposes both through the same C++ and Kotlin APIs.

On the hardware side, the integration currently targets MediaTek Dimensity 7300, 8300, 9000, 9200, 9300 and 9400 SoCs, which together cover a large part of the Android mid range and flagship device space.

Why Developers Care, Unified Workflow For Fragmented NPUs

Historically, on device ML stacks were CPU and GPU first. NPU SDKs shipped as vendor specific toolchains that required separate compilation flows per SoC, custom delegates, and manual runtime packaging. The result was a combinatorial explosion of binaries and a lot of device specific debugging.

LiteRT NeuroPilot Accelerator replaces that with a three step workflow that is the same regardless of which MediaTek NPU is present:

Convert or load a .tflite model as usual.

Optionally use the LiteRT Python tools to run AOT compilation and produce an AI Pack that is tied to one or more target SoCs.

Ship the AI Pack through Play for On-device AI (PODAI), then select Accelerator.NPU at runtime. LiteRT handles device targeting, runtime loading, and falls back to GPU or CPU if the NPU is not available.

For you as an engineer, the main change is that device targeting logic moves into a structured configuration file and Play delivery, while the app code mostly interacts with CompiledModel and Accelerator.NPU.

AOT and on device compilation are both supported. AOT compiles for a known SoC ahead of time and is recommended for larger models because it removes the cost of compiling on the user device. On device compilation is better for small models and generic .tflite distribution, at the cost of higher first run latency. The blog shows that for a model such as Gemma-3-270M, pure on device compilation can take more than 1 minute, which makes AOT the realistic option for production LLM use.

Gemma, Qwen, And Embedding Models On MediaTek NPU

The stack is built around open weight models rather than a single proprietary NLU path. Google and MediaTek list explicit, production oriented support for:

Qwen3 0.6B, for text generation in markets such as mainland China.

Gemma-3-270M, a compact base model that is easy to fine tune for tasks like sentiment analysis and entity extraction.

Gemma-3-1B, a multilingual text only model for summarization and general reasoning.

Gemma-3n E2B, a multimodal model that handles text, audio and vision for things like real time translation and visual question answering.

EmbeddingGemma 300M, a text embedding model for retrieval augmented generation, semantic search and classification.

On the latest Dimensity 9500, running on a Vivo X300 Pro, the Gemma 3n E2B variant reaches more than 1600 tokens per second in prefill and 28 tokens per second in decode at a 4K context length when executed on the NPU.

For text generation use cases, LiteRT-LM sits on top of LiteRT and exposes a stateful engine with a text in text out API. A typical C++ flow is to create ModelAssets, build an Engine with litert::lm::Backend::NPU, then create a Session and call GenerateContent per conversation. For embedding workloads, EmbeddingGemma uses the lower level LiteRT CompiledModel API in a tensor in tensor out configuration, again with the NPU selected through hardware accelerator options.

Developer Experience, C++ Pipeline And Zero Copy Buffers

LiteRT introduces a new C++ API that replaces the older C entry points and is designed around explicit Environment, Model, CompiledModel and TensorBuffer objects.

For MediaTek NPUs, this API integrates tightly with Android’s AHardwareBuffer and GPU buffers. You can construct input TensorBuffer instances directly from OpenGL or OpenCL buffers with TensorBuffer::CreateFromGlBuffer, which lets image processing code feed NPU inputs without an intermediate copy through CPU memory. This is important for real time camera and video processing where multiple copies per frame quickly saturate memory bandwidth.

A typical high level C++ path on device looks like this, omitting error handling for clarity:

// Load model compiled for NPU
auto model = Model::CreateFromFile("model.tflite");
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);

// Create compiled model
auto compiled = CompiledModel::Create(*env, *model, *options);

// Allocate buffers and run
auto input_buffers = compiled->CreateInputBuffers();
auto output_buffers = compiled->CreateOutputBuffers();
input_buffers[0].Write<float>(input_span);
compiled->Run(input_buffers, output_buffers);
output_buffers[0].Read(output_span);

The same Compiled Model API is used whether you are targeting CPU, GPU or the MediaTek NPU, which reduces the amount of conditional logic in application code.

Key Takeaways

LiteRT NeuroPilot Accelerator is the new, first class NPU integration between LiteRT and MediaTek NeuroPilot, replacing the old TFLite delegate and exposing a unified Compiled Model API with AOT and on device compilation on supported Dimensity SoCs.

The stack targets concrete open weight models, including Qwen3-0.6B, Gemma-3-270M, Gemma-3-1B, Gemma-3n-E2B and EmbeddingGemma-300M, and runs them through LiteRT and LiteRT LM on MediaTek NPUs with a single accelerator abstraction.

AOT compilation is strongly recommended for LLMs, for example Gemma-3-270M can take more than 1 minute to compile on device, so production deployments should compile once in the pipeline and ship AI Packs via Play for On device AI.

On a Dimensity 9500 class NPU, Gemma-3n-E2B can reach more than 1600 tokens per second in prefill and 28 tokens per second in decode at 4K context, with measured throughput up to 12 times CPU and 10 times GPU for LLM workloads.

For developers, the C++ and Kotlin LiteRT APIs provide a common path to select Accelerator.NPU, manage compiled models and use zero copy tensor buffers, so CPU, GPU and MediaTek NPU targets can share one code path and one deployment workflow.

Check out the Docs and Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks.
The post Google LiteRT NeuroPilot Stack Turns MediaTek Dimensity NPUs into First Class Targets for on Device LLMs appeared first on MarkTechPost.

Zhipu AI Releases GLM-4.6V: A 128K Context Vision Language Model with Native Tool Calling

Zhipu AI has open sourced the GLM-4.6V series as a pair of vision language models that treat images, video and tools as first class inputs for agents, not as afterthoughts bolted on top of text.

Model lineup and context length

The series has 2 models. GLM-4.6V is a 106B parameter foundation model for cloud and high performance cluster workloads. GLM-4.6V-Flash is a 9B parameter variant tuned for local deployment and low latency use.

GLM-4.6V extends the training context window to 128K tokens. In practice this supports roughly 150 pages of dense documents, 200 slide pages or one hour of video in a single pass because pages are encoded as images and consumed by the visual encoder.

Native multimodal tool use

The main technical change is native multimodal Function Calling. Traditional tool use in LLM systems routes everything through text. Images or pages are first turned into descriptions, the model calls tools using text arguments and then reads textual responses. This wastes information and increases latency.

GLM-4.6V introduces native multimodal Function Calling. Images, screenshots and document pages pass directly as tool parameters. Tools can return search result grids, charts, rendered web pages or product images. The model consumes those visual outputs and fuses them with text in the same reasoning chain. This closes the loop from perception to understanding to execution and is explicitly positioned as the bridge between visual perception and executable action for multimodal agents.

To support this, Zhipu AI extends the Model Context Protocol with URL based multimodal handling. Tools receive and return URLs that identify specific images or frames, which avoids file size limits and allows precise selection inside multi image contexts.

Rich text content, web search and frontend replication

Zhipu AI research team describes 4 canonical scenarios:

First, rich text content understanding and creation. GLM-4.6V reads mixed inputs such as papers, reports or slide decks and produces structured image text interleaved outputs. It understands text, charts, figures, tables and formulas in the same document. During generation it can crop relevant visuals or retrieve external images through tools, then run a visual audit step that filters low quality images and composes the final article with inline figures.

Second, visual web search. The model can detect user intent, plan which search tools to call and combine text to image and image to text search. It then aligns retrieved images and text, selects the relevant evidence and outputs a structured answer, for example a visual comparison of products or places.

Third, frontend replication and visual interaction. GLM-4.6V is tuned for design to code workflows. From a UI screenshot, it reconstructs pixel accurate HTML, CSS and JavaScript. Developers can then mark a region on the screenshot and issue natural language instructions, for example move this button left or change this card background. The model maps those instructions back to the code and returns an updated snippet.

Fourth, multimodal document understanding at long context. GLM-4.6V can read multi document inputs up to the 128K token context limit by treating pages as images. The research team reports a case where the model processes financial reports from 4 public companies, extracts core metrics and builds a comparison table, and a case where it summarises a full football match while keeping the ability to answer questions about specific goals and timestamps.

Architecture, data and reinforcement learning

The GLM-4.6V models belong to the GLM-V family and build on the technical reports for GLM-4.5V and GLM-4.1V-Thinking. The research team highlights three main technical ingredients.

First, long sequence modeling. GLM-4.6V extends the training context window to 128K tokens and runs continual pre training on massive long context image text corpora. It uses compression alignment ideas from Glyph so that visual tokens can carry dense information that is aligned with language tokens.

Second, world knowledge enhancement. Zhipu AI team adds a billion scale multimodal perception and world knowledge dataset at pre training time. This covers layered encyclopedic concepts and everyday visual entities. The stated goal is to improve both basic perception and cross modal question answering completeness, not only benchmarks.

Third, agentic data synthesis and extended MCP. The research team generates large synthetic traces where the model calls tools, processes visual outputs and iterates on plans. They extend MCP with URL based multimodal handling and an interleaved output mechanism. The generation stack follows a Draft, Image Selection, Final Polish sequence. The model can autonomously call cropping or search tools between these stages to place images at the right positions in the output.

Tool invocation is part of the reinforcement learning objective. GLM-4.6V uses RL to align planning, instruction following and format adherence in complex tool chains.

Performance

Performance benchmark details are available at https://z.ai/blog/glm-4.6v.

Key Takeaways

GLM-4.6V is a 106B multimodal foundation model with a 128K token training context, and GLM-4.6V-Flash is a 9B variant optimized for local and low latency use.

Both models support native multimodal Function Calling so tools can consume and return images, video frames and document pages directly, which links visual perception to executable actions for agents.

GLM-4.6V is trained for long context multimodal understanding and interleaved generation, so it can read large mixed document sets and emit structured text with inline figures and tool selected images in one pass.

The series achieves state of the art performance on major multimodal benchmarks at similar parameter scales and is released as open source weights under the MIT license on Hugging Face and ModelScope.

Check out the Model Card on HF and Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks.
The post Zhipu AI Releases GLM-4.6V: A 128K Context Vision Language Model with Native Tool Calling appeared first on MarkTechPost.

Real-world reasoning: How Amazon Nova Lite 2.0 handles complex custome …

Artificial intelligence (AI) reasoning capabilities determine whether models can handle complex, real-world tasks beyond simple pattern matching. With strong reasoning, models can identify problems from ambiguous descriptions, apply policies under competing constraints, adapt tone to sensitive situations, and provide complete solutions that address root causes. Without robust reasoning, AI systems fail when faced with nuanced scenarios requiring judgment, context awareness, and multi-step problem-solving.
This post evaluates the reasoning capabilities of our latest offering in the Nova family, Amazon Nova Lite 2.0, using practical scenarios that test these critical dimensions. We compare its performance against other models in the Nova family—Lite 1.0, Micro, Pro 1.0, and Premier—to elucidate how the latest version advances reasoning quality and consistency.
Solution overview
We evaluate five Amazon Nova models across five customer support scenarios, measuring performance on eight dimensions:

Problem identification
Solution completeness
Policy adherence
Factual accuracy
Empathy and tone
Communication clarity
Logical coherence
Practical utility

An independent evaluator model (gpt-oss-20b) provides automated, unbiased scoring.
The evaluation architecture runs all models in the same AWS Region (us-east-1) and automatically handles the different API formats: the Converse API for the Nova models and the OpenAI Chat Completions format for gpt-oss-20b.
The sample notebook is available in the GitHub repository.
Test scenarios
To generate the scenarios evaluation dataset, we use Claude Sonnet 4.5 by Anthropic on Amazon Bedrock to generate a sample of 100 scenarios that pertain to common customer support interactions. We don’t use any of the Nova models to generate the scenarios to avoid any bias. We then randomly select five scenarios for our testing purposes that evaluate common real-world reasoning challenges:

Angry customer complaint – Tests de-escalation, empathy, and problem resolution when a customer threatens to leave after delayed delivery and poor service.
Software technical problem – Evaluates technical troubleshooting when an app crashes during photo uploads despite basic troubleshooting attempts.
Billing dispute – Assesses investigation skills and security awareness for unrecognized charges potentially indicating unauthorized access.
Product defect report – Measures warranty policy application and customer service for a two-month-old defective product.
Account security concern – Tests urgency response and security protocols for unauthorized password changes and fraudulent purchases.

Each scenario includes key issues to identify, required solutions, and relevant policies—providing objective criteria for evaluation. Depending on your industry/domain/use case, the scenarios and associated context may be different.
Implementation details
The evaluation framework establishes a comprehensive methodology for assessing model performance across multiple dimensions simultaneously. This systematic approach ensures that each model undergoes identical testing conditions, enabling fair comparison of reasoning capabilities across the Nova family. The technical implementation handles the complexity of managing different API formats while maintaining evaluation consistency. The framework assumes an active AWS account, access to Nova models and gpt-oss-20b, along with the availability of the boto3 SDK, and pandas, matplotlib, seaborn, scipy and numpy packages.
Model invocation
The system automatically detects which API format each model requires and routes requests accordingly. Nova models (Lite, Micro, Pro, Premier) use Amazon Bedrock Converse API, which provides a unified interface for conversational interactions. gpt-oss models use the OpenAI Chat Completions format, requiring a different request structure with the InvokeModel API. The invocation function checks the model identifier to determine the appropriate format. For gpt-oss models, it constructs a JSON request body with messages, token limits, and temperature settings, then parses the response to extract the generated content. For Nova models, it uses the Converse API with structured message objects and inference configuration parameters, extracting the response from the output message content. This dual-API approach supports seamless evaluation across different model families without requiring separate code paths or manual configuration changes. The same evaluation logic works for all models regardless of their underlying API requirements, with the system handling format differences transparently. The architecture also allows us to use models from different Regions while maintaining a single evaluation workflow.
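The following is a minimal sketch of this routing logic, assuming a boto3 bedrock-runtime client in us-east-1. The model ID prefix check and the request and response fields used for the gpt-oss payload are assumptions based on the Chat Completions format described above, not the notebook's actual implementation:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def invoke(model_id: str, prompt: str, max_tokens: int = 1024, temperature: float = 0.2) -> str:
    """Route a prompt to the API format the target model expects and return the text output."""
    if model_id.startswith("openai.gpt-oss"):
        # OpenAI Chat Completions style request sent through InvokeModel (field names assumed)
        body = {
            "messages": [{"role": "user", "content": prompt}],
            "max_completion_tokens": max_tokens,
            "temperature": temperature,
        }
        response = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
        payload = json.loads(response["body"].read())
        return payload["choices"][0]["message"]["content"]
    # Nova models use the unified Converse API
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": max_tokens, "temperature": temperature},
    )
    return response["output"]["message"]["content"][0]["text"]
```
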
The evaluation framework uses optimized prompts generated by the Amazon Bedrock Prompt Optimizer API. The optimizer analyzes and rewrites raw prompts to improve model performance with better structure, clarity, and organization, creating model-specific optimizations for each Nova model.
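As a sketch, invoking the Prompt Optimizer might look like the following; it assumes the optimize_prompt operation of the bedrock-agent-runtime client and a streamed response containing an optimizedPromptEvent, so treat the event parsing as an approximation of the documented response shape.

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

def optimize(raw_prompt: str, target_model_id: str) -> str:
    """Return an optimized version of raw_prompt tailored to target_model_id."""
    response = agent_runtime.optimize_prompt(
        input={"textPrompt": {"text": raw_prompt}},
        targetModelId=target_model_id,
    )
    optimized = raw_prompt
    # The optimized text arrives as a streamed optimizedPromptEvent (response shape assumed here).
    for event in response["optimizedPrompt"]:
        if "optimizedPromptEvent" in event:
            optimized = event["optimizedPromptEvent"]["optimizedPrompt"]["textPrompt"]["text"]
    return optimized

optimized_prompt = optimize("Respond to this angry customer complaint about a delayed laptop delivery.",
                            target_model_id="amazon.nova-2-lite-v1:0")
```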
A scenario with the optimized prompt is shown in the following example:

```json
{
  "angry_customer": {
    "name": "Angry Customer Complaint",
    "prompt": "# Customer Support Response Task\n\n## Context\nYou are a professional customer support representative for a technology company. You need to respond to an upset customer who has written the following message:\n\n\"I am absolutely furious! I ordered a laptop 3 weeks ago and it still hasn't arrived. When I called last week, your representative was rude and unhelpful. I've been a loyal customer for 5 years and this is how you treat me? I want my money back immediately and I'm considering switching to your competitor. This is unacceptable!\"\n\n## Instructions\nCraft a professional, empathetic response that:\n1. Acknowledges the customer's frustration and validates their feelings\n2. Apologizes sincerely for the specific issues (delayed delivery and poor customer service)\n3. Demonstrates understanding of their value as a loyal 5-year customer\n4. Offers a clear solution to address their refund request\n5. Provides a specific action plan to resolve the delivery issue (if they choose not to cancel)\n6. Includes a concrete step to follow up and rebuild trust\n7. Maintains a respectful, professional tone throughout\n\nYour response should be concise, solution-oriented, and focused on retaining this valuable customer. Avoid making excuses or shifting blame.\n\nProvide your response immediately without any preamble.",
    "key_issues": [
      "Delayed delivery",
      "Poor customer service experience",
      "Customer loyalty concerns",
      "Refund request"
    ],
    "required_solutions": [
      "Apologize sincerely",
      "Investigate delivery status",
      "Offer compensation",
      "Escalate if needed"
    ],
    "policies": [
      "Always acknowledge customer emotions",
      "Provide specific next steps",
      "Offer multiple resolution options"
    ],
    "_optimization_metadata": {
      "original_length": 463,
      "optimized_length": 1330,
      "target_model": "amazon.nova-2-lite-v1:0"
    }
  }
}
```

Evaluation framework
The evaluator receives the scenario, model response, and evaluation criteria. We employ a two-step scoring process: first, the evaluator assigns a category label that best characterizes the response; then, the evaluator assigns a predetermined score corresponding to that category label. This approach ensures a consistent and uniform scoring methodology across all model responses.
The evaluation prompt structure:

```python
EVALUATION_PROMPT = """
# Customer Support Response Evaluation Task

You are an expert evaluator assessing customer support responses. Your task is to
provide **detailed, objective scoring** across 8 dimensions with specific reasoning
for each score.

## Context

### Original Customer Scenario
{scenario}

### Model’s Response to Evaluate
{response}

## Evaluation Criteria

### Key Issues That Should Be Identified
{key_issues}

### Required Solutions/Actions
{required_solutions}

### Company Policies to Follow
{policies}

## Scoring Instructions

Evaluate the response across **8 dimensions** using a **two-step process**:

### Step 1: Assign Category Label

For each dimension, first determine which category best describes the response:

**EXCELLENT**: Comprehensive, professional, exceeds expectations
– All requirements fully met with exceptional quality
– No significant improvements needed
– Demonstrates mastery of the dimension

**GOOD**: Solid performance with minor room for improvement
– Most requirements met effectively
– Minor gaps or areas for enhancement
– Clearly competent but not exceptional

**ADEQUATE**: Meets basic requirements but has notable gaps
– Core requirements partially met
– Significant room for improvement
– Functional but not impressive

**POOR**: Significant issues requiring major improvements
– Many requirements not met
– Critical gaps in quality
– Barely functional or ineffective

**FAILING**: Critical failures, does not meet requirements
– Fundamental requirements not met
– Unusable or harmful response
– Complete failure on this dimension

### Step 2: Assign Fixed Score

Each category maps to a fixed score:
– **EXCELLENT** → 10
– **GOOD** → 8
– **ADEQUATE** → 6
– **POOR** → 4
– **FAILING** → 2

For **EACH dimension**, provide:
1. **Category label** (EXCELLENT/GOOD/ADEQUATE/POOR/FAILING)
2. **Fixed score** (10/8/6/4/2 based on category)
3. **Specific reasoning** explaining your categorization

## Evaluation Dimensions

### 1. Problem Identification
**Question**: Did the response identify all key issues from the customer’s message?
– Check if all items from “Key Issues” were recognized
– Note any missed or misunderstood problems

### 2. Solution Completeness
**Question**: Are all identified problems addressed with appropriate solutions?
– Verify each issue has a corresponding solution or action
– Check if solutions are practical and actionable

### 3. Policy Adherence
**Question**: Does the response follow all stated company policies?
– Review against “Company Policies to Follow”
– Note any policy violations or omissions

### 4. Factual Accuracy
**Question**: Are technical details, processes, and options stated correctly?
– Check for factual errors or misleading information
– Verify technical accuracy of troubleshooting steps

### 5. Empathy & Tone
**Question**: Does the response demonstrate appropriate emotional intelligence?
– Assess acknowledgment of customer emotions
– Evaluate professionalism and empathy level

### 6. Communication Clarity
**Question**: Is the response clear, well-structured, and actionable?
– Check for clear language and organization
– Verify instructions are easy to follow

### 7. Logical Coherence
**Question**: Is the reasoning sound without contradictions?
– Look for logical flow and consistency
– Identify any contradictory statements

### 8. Practical Utility
**Question**: Would this response actually help the customer resolve their issue?
– Consider real-world effectiveness
– Assess likelihood of customer satisfaction

## Example Evaluation
<>
"""
```

The evaluator must justify scores, providing transparency into the assessment. To address transparency concerns in AI evaluation, the evaluator provides detailed reasoning for each of the eight dimensions, plus an overall justification. This ensures that scores are not just numerical but backed by specific explanations of why each score was assigned.
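The two-step scoring can be applied mechanically once the evaluator returns its labels. The sketch below assumes the evaluator is prompted to return a JSON object with a category and reasoning per dimension (the exact output contract is defined in the notebook); parse failures fall back to a score of 0, which the downstream statistics exclude as failed evaluations.

```python
import json

CATEGORY_SCORES = {"EXCELLENT": 10, "GOOD": 8, "ADEQUATE": 6, "POOR": 4, "FAILING": 2}

DIMENSIONS = [
    "problem_identification", "solution_completeness", "policy_adherence", "factual_accuracy",
    "empathy_tone", "communication_clarity", "logical_coherence", "practical_utility",
]

def score_evaluation(evaluator_output: str) -> dict:
    """Convert the evaluator's category labels into fixed scores per dimension."""
    try:
        labels = json.loads(evaluator_output)
    except json.JSONDecodeError:
        # Treat unparseable output as a failed evaluation (score 0 on every dimension).
        return {dim: 0 for dim in DIMENSIONS}
    return {
        dim: CATEGORY_SCORES.get(labels.get(dim, {}).get("category", "").upper(), 0)
        for dim in DIMENSIONS
    }
```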
Large language model (LLM)-as-a-judge evaluation
Machine translation-based evaluation techniques like ROUGE and BLEU fall short when it comes to open-ended conversations. LLM-as-a-judge provides scalability, flexibility, and evaluations that align with human preferences in up to 80% of cases.
Refer to the comparison table in the README for further details.
Evaluation process
For each model and scenario combination, we perform 10 runs to measure consistency. This produces 250 evaluations (5 models × 5 scenarios × 10 runs), providing a statistical spread through multiple measurements. The number of runs and scenarios can be increased according to the specific use case. The framework includes diagnostic checks to verify evaluation quality and reliability. Failed evaluations (where the evaluator returns a score of 0 due to technical issues such as JSON parsing errors, or where a model's response is blocked by Responsible AI criteria) are excluded from mean and standard deviation calculations to ensure accurate performance metrics. This prevents technical failures from artificially lowering model scores.
Results
The chosen scenarios and approach described here enable deep statistical analysis of model performance patterns. By examining both individual scenario outcomes and aggregate metrics, we can identify strengths and potential areas for improvement across the Nova model family. This multi-dimensional analysis approach provides confidence in the reliability of performance rankings.
Statistical analysis
The statistical evaluation follows the methods outlined in Miller (2024). To quantify uncertainty in model performance estimates, we calculate the standard error (SE) as:

SE = √(σ^2/n),

where σ^2 is the sample variance and n is the sample size. SE measures how precise our estimate of the mean is and tells us how much the sample mean would vary if we repeated the evaluation many times. The standard error allows us to construct 95% confidence intervals (CI = μ ± 1.96 × SE), where μ is the sample mean. This provides plausible ranges for true model performance, facilitating statistical significance testing through interval overlap analysis. In addition, we introduce a coefficient of variation (CV) based consistency score calculated as (100 - CV%), where CV% = (σ/μ) × 100 and σ is the standard deviation. This normalizes reliability measurement to a 0-100 scale, providing an intuitive metric for response stability. Finally, zero-exclusion averaging prevents failed evaluations from artificially deflating scores, while error bars on visualizations transparently communicate uncertainty.
For completeness, the code in the GitHub repository calculates additional statistics: a minimum detectable effect that demonstrates the ability to reliably detect meaningful performance differences, a pairwise model comparison metric that identifies correlations between model responses, and a power analysis that validates the chosen sample size. These methods transform the evaluation from simple score comparison into rigorous experimental science with quantified uncertainty, enabling confident conclusions about model performance differences.
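The core statistics translate directly into a few lines of numpy. The following sketch mirrors the definitions above (zero-exclusion, SE, the 95% confidence interval, and the CV-based consistency score) and stands in for the fuller analysis in the GitHub repository.

```python
import numpy as np

def summarize(scores):
    """Mean, standard error, 95% CI, and CV-based consistency for one model's run scores."""
    valid = np.array([s for s in scores if s > 0], dtype=float)  # zero-exclusion of failed runs
    mean = valid.mean()
    se = valid.std(ddof=1) / np.sqrt(len(valid))       # SE = sqrt(sample variance / n)
    cv_pct = (valid.std(ddof=1) / mean) * 100          # coefficient of variation in percent
    return {
        "mean": round(mean, 2),
        "standard_error": round(se, 2),
        "ci_95": (round(mean - 1.96 * se, 2), round(mean + 1.96 * se, 2)),
        "consistency_score": round(100 - cv_pct, 2),
    }

# Example: 10 runs of one model on one scenario, with one failed (zero-score) evaluation.
print(summarize([9.5, 9.0, 9.8, 0, 9.2, 9.6, 9.4, 9.1, 9.7, 9.3]))
```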

Figure 1 Performance of models across the dimensions considered in the study with 95% confidence intervals

Figure 2 Overall performance of Nova Lite 2.0 compared to other models in the Nova family
Figure 1 shows the performance of models with scores averaged across all the runs for each dimension considered in the study; this is also depicted on the radar chart in Figure 2. Table 1 shows the scores across all dimensions considered in the study. Nova Lite 2.0 achieved the highest overall score (9.42/10) with a standard error of 0.08 and a coefficient of variation of 5.55%, demonstrating high-quality reasoning.

| Metric | Nova Lite 2.0 | Nova Lite 1.0 | Nova Pro 1.0 | Nova Micro | Nova Premier |
|---|---|---|---|---|---|
| Overall Score | 9.42 | 8.65 | 8.53 | 7.70 | 7.16 |
| Standard Error (SE) | 0.08 | 0.09 | 0.12 | 0.32 | 0.38 |
| 95% Confidence Interval | [9.28, 9.57] | [8.48, 8.82] | [8.30, 8.76] | [7.08, 8.32] | [6.41, 7.91] |
| Consistency Score (CV-based) | 94.45 | 93.05 | 90.46 | 71.37 | 62.96 |
| Coefficient of Variation | 5.55% | 6.95% | 9.54% | 28.63% | 37.04% |

Table 1: Overall Model Performance Summary

| Metric | Nova Lite 2.0 | Nova Lite 1.0 | Nova Pro 1.0 | Nova Micro | Nova Premier |
|---|---|---|---|---|---|
| Problem Identification | 9.63 ± 0.27 | 8.57 ± 0.46 | 8.16 ± 0.44 | 7.59 ± 0.74 | 6.94 ± 0.82 |
| Solution Completeness | 9.59 ± 0.23 | 8.08 ± 0.32 | 8.04 ± 0.42 | 6.78 ± 0.65 | 6.33 ± 0.69 |
| Policy Adherence | 8.82 ± 0.54 | 7.76 ± 0.59 | 7.55 ± 0.64 | 7.02 ± 0.69 | 6.37 ± 0.81 |
| Factual Accuracy | 9.55 ± 0.26 | 9.18 ± 0.30 | 9.10 ± 0.28 | 8.08 ± 0.74 | 8.00 ± 0.89 |
| Empathy & Tone | 8.98 ± 0.33 | 8.57 ± 0.34 | 8.08 ± 0.36 | 7.55 ± 0.65 | 7.10 ± 0.79 |
| Communication Clarity | 9.76 ± 0.19 | 9.14 ± 0.28 | 8.94 ± 0.28 | 8.04 ± 0.69 | 7.63 ± 0.85 |
| Logical Coherence | 9.71 ± 0.35 | 9.67 ± 0.29 | 9.92 ± 0.11 | 8.98 ± 0.74 | 8.16 ± 0.91 |
| Practical Utility | 9.35 ± 0.27 | 8.24 ± 0.22 | 8.45 ± 0.24 | 7.55 ± 0.62 | 6.78 ± 0.70 |

Table 2: Dimension-Level Performance of the Nova Models (Mean Scores with 95% Confidence Intervals)
Table 2 shows the performance across the eight dimensions considered in the study. Nova Lite 2.0 achieved consistently high scores across all dimensions.

| Scenario | Nova Lite 2.0 | Nova Lite 1.0 | Nova Micro | Nova Pro 1.0 | Nova Premier |
|---|---|---|---|---|---|
| Account Security Concern | 9.25 | 7.95 | 7.65 | 6.90 | 2.00 |
| Angry Customer Complaint | 9.95 | 9.50 | 9.30 | 8.35 | 8.20 |
| Billing Dispute | 9.15 | 8.75 | 8.60 | 8.85 | 8.20 |
| Product Defect Report | 9.25 | 8.90 | 7.70 | 8.00 | 8.75 |
| Software Technical Problem | 10.00 | 8.20 | 8.55 | 8.75 | 8.60 |

Table 3: Summary of scores (on a scale of 1-10) across models and scenarios considered. A score of 2 for Nova Premier on Account Security Concern is due to guardrails being invoked for almost all of the responses.
Table 3 summarizes the mean scores corresponding to each scenario considered in the study. Again, Nova Lite 2.0 achieves high scores across all scenarios.
Dimension analysis
The dimensional strengths of Nova Lite 2.0 demonstrate balanced capabilities across critical evaluation criteria. High scores in problem identification, communication, and logical reasoning indicate mature performance that translates effectively to real-world applications, distinguishing it from models that excel in individual dimensions but lack consistency.
Problem Identification: Nova Lite 2.0 excelled at identifying all key issues—crucial where missing problems lead to incomplete solutions.
Communication Clarity: The model achieved the highest score in this dimension, producing well-structured, actionable responses customers could follow easily.
Logical Coherence: Strong performance indicates the model maintains sound reasoning without contradictions across complex scenarios.
Empathy and Tone: High scores demonstrate appropriate emotional intelligence, critical for de-escalation and sensitive situations.
Table 4 shows sample evaluator explanations for high-scoring and low-scoring models, illustrating effective scoring methodology.

Nova Lite 2.0 – Score: 10 – Category: “Excellent”: The response explicitly recognizes the four key issues: it mentions the delayed delivery (“delay in receiving your laptop”), the poor customer service experience (“unhelpful interaction with our support team”), the customer’s loyalty (“a valued customer of five years”), and the refund request (“cancel your order and receive a full refund”). All issues are acknowledged with appropriate language.

Nova Premier – Score: 6 – Category: “Adequate”: The response acknowledges frustration and loyalty, but it does not explicitly mention the delayed delivery or the rude customer service representative, two key issues from the customer message.

Table 4: Sample explanations provided by the evaluator for Nova Lite 2.0 and Nova Premier for the Angry Customer scenario along the Problem Identification dimension
Key findings
The evaluation results reveal critical insights for model selection and deployment strategies. These findings emphasize considering multiple performance factors rather than focusing solely on aggregate scores, as optimal choices depend on specific application requirements and operational constraints.

Multi-dimensional reasoning matters: Models scoring well on accuracy but poorly on empathy or clarity are unsuitable for customer-facing applications. The balanced performance of Nova Lite 2.0 across all dimensions makes it production-ready.
Consistency predicts production success: The low variability of Nova Lite 2.0 versus other models indicates reliable performance across diverse scenarios—critical where inconsistent responses damage user trust.
Real-world evaluation reveals practical capabilities: Synthetic benchmarks miss critical dimensions like empathy, policy adherence, and practical utility. This framework surfaces production-relevant capabilities.

Implementation considerations
Successfully implementing this evaluation framework requires attention to operational factors that significantly impact assessment quality and cost-effectiveness. The choice of evaluation methodology, scoring mechanisms, and technical infrastructure directly influences result reliability and scalability.

Evaluator selection: We selected gpt-oss-20b to ensure independence from the Nova family, reducing potential bias. Amazon Bedrock offers built-in LLM-as-a-judge capabilities with standard metrics like correctness, completeness, and harmfulness. The framework presented in this post provides the flexibility to define specialized evaluation criteria and multi-dimensional assessments that can be customized to the specific use case of interest.
Scenario design: Effective scenarios balance realism with measurability. Each includes specific details grounding evaluation in realistic contexts. Objective criteria—key issues to identify, required solutions, relevant policies—enable consistent scoring. Realistic complexity combining multiple problems (billing dispute + security breach) and competing priorities (urgency vs protocols) reveals how models handle real-world ambiguity and surfaces capability gaps.
Statistical validation: Multiple runs per scenario provide confidence intervals and detect inconsistency, ensuring performance differences are statistically significant.

Key takeaways
Amazon Nova Lite 2.0 demonstrates impressive reasoning capabilities in tested real-world scenarios, achieving consistent high performance across diverse problem-solving tasks. Balanced scores across evaluation dimensions—from technical problem identification to empathetic communication—indicate robust reasoning potentially applicable to other domains after comprehensive testing.
Multi-dimensional evaluation reveals nuanced model capabilities that single-metric benchmarks miss. Understanding performance across problem identification, solution completeness, policy adherence, empathy, clarity, and logical coherence provides actionable deployment insights.
This practical testing methodology provides actionable insights for organizations evaluating AI systems. The framework’s focus on objective criteria, independent evaluation, and statistical validation creates reproducible assessments adaptable to domains requiring contextual judgment and problem-solving. As models advance, assessment methodologies must evolve to capture increasingly sophisticated reasoning capabilities—multi-turn conversations, complex decision-making under uncertainty, and nuanced judgment in ambiguous situations.
Conclusion
This comprehensive evaluation demonstrates that Amazon Nova Lite 2.0 delivers production-ready AI reasoning capabilities with measurable reliability across diverse business applications. The multi-dimensional assessment framework provides organizations with quantitative evidence needed to confidently deploy AI systems in critical operational environments.
Next steps
Evaluate Nova Lite 2.0 for your use case:

Bedrock Model Evaluation: Start with the model evaluation tools in Amazon Bedrock, including the built-in LLM-as-a-judge capabilities for standard metrics, or adapt the custom framework discussed in this post for specialized evaluation criteria.
Implement multi-dimensional testing: Adapt the evaluation framework to your specific domain requirements.
Pilot deployment: Begin with low-risk scenarios to validate performance in your environment.
Scale systematically: Use the statistical validation approach to expand to additional use cases.

Additional resources

Miller, E. (2024). Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
Amazon Bedrock Documentation
Amazon Nova models
GitHub repository
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

About the authors
Madhu Pai, Ph.D., is a Principal Specialist Solutions Architect for Generative AI and Machine Learning at AWS. He leads strategic AI/ML initiatives that deliver scalable impact across diverse industries by identifying customer needs and building impactful solutions. Previously at AWS, Madhu served as the WW Partner Tech Lead for Manufacturing where he delivered compelling partner solutions that drove strategic outcomes for industrial manufacturing customers. He brings over 18 years of experience across multiple industries, leveraging data, AI, and ML to deliver measurable business results.
Sunita Koppar is a Senior Specialist Solutions Architect in Generative AI and Machine Learning at AWS, where she partners with customers across diverse industries to design solutions, build proof-of-concepts, and drive measurable business outcomes. Beyond her professional role, she is deeply passionate about learning and teaching Sanskrit, actively engaging with student communities to help them upskill and grow.
Satyanarayana Adimula is a Senior Builder in the AWS Generative AI Innovation Center. With over 20 years of experience in data and analytics and deep expertise in generative AI, he helps organizations achieve measurable business outcomes. He builds agentic AI systems that automate workflows, accelerate decision-making, reduce costs, increase productivity, and create new revenue opportunities. His work spans large enterprise customers across various industries, including retail, banking, financial services, insurance, healthcare, media and entertainment, and professional services.

Create AI-powered chat assistants for your enterprise with Amazon Quic …

Teams need instant access to enterprise data and intelligent guidance on how to use it. Instead, they get scattered information across multiple systems. This results in employees spending valuable time searching for answers instead of making decisions.
In this post, we show how to build chat agents in Amazon Quick Suite to address this problem. We walk through a three-layer framework—identity, instructions, and knowledge—that transforms Quick Suite chat agents into intelligent enterprise AI assistants. In our example, we demonstrate how our chat agent guides feature discovery, uses enterprise data to inform recommendations, and tailors solutions based on impact potential and your team’s adoption readiness.
Benefits of Quick Suite chat agents
Quick Suite chat agents make advanced AI capabilities accessible to non-technical business users. Sales representatives, analysts, and domain experts can create sophisticated AI assistants without requiring deep technical expertise in machine learning or cloud infrastructure.
Quick Suite instances come with their own default system chat agent (My Assistant). Administrators can enable the ability to create custom chat agents for the users. Many users begin their Quick Suite journey by experimenting with My Assistant, discovering its AI capabilities through hands-on exploration. Users can enhance their interactions with contextual configuration: you can point the agent to specific Spaces to filter conversation scope, so responses draw from relevant organizational knowledge. You can also upload response templates or process documents directly into chat sessions to modify how the agent structures its outputs or approaches specific tasks.
Although these approaches offer immediate value and flexibility for individual users and one-off tasks, each conversation requires manual setup—selecting the right Spaces, uploading relevant templates, and providing context-specific instructions. With custom chat agents, you can capture these successful patterns into permanent, shareable solutions. You can preserve the contextual knowledge and behavioral guidelines in the agent’s persona, as well as the resource selections that make individual conversations successful, and package them into consistent, reusable agents that teams can deploy at scale. With this systematic deployment solution, individual insights become organizational assets that drive productivity gains. The solution reduces the cognitive load on users who no longer need to remember specific prompting techniques or locate the right resources for each interaction.
The three-layer foundation: Identity, instructions, and knowledge
Effective chat agents are built on three essential components that work together to create consistent, reliable AI assistants:

Identity – Defines who the agent is and what role it serves
Instructions – Specifies how the agent should think and respond
Knowledge – Provides the information the agent can access to search for answers and generate content

Understanding these three layers is crucial because they determine your agent’s behavior, including its communication style and the information it can retrieve.
Identity
Identity defines who your agent is and what role it plays, which shapes how it responds to every request. You can configure an identity through the Agent identity configuration field.
Instructions
Instructions function as behavioral directives that provide granular control over agent response generation, with specificity and consistency being crucial for effectiveness. Effective prompt engineering skills become essential when crafting both identity and instructions, because the precision and clarity of these elements directly impact the agent’s ability to understand context, follow behavioral directives, and maintain consistent, persona-driven responses. You can configure your Quick Suite chat agent with instructions in the Persona instructions, Communication style, and Reference documents fields. Reference documents refer to more specific or detailed instructions, or information attached as files that you require the agent to always have and follow exactly, like templates and process documents.
Knowledge
Large language models (LLMs) power the agents. The custom chat agent provides the required context to LLMs through two distinct means: instructions, as discussed in the previous section, and searchable knowledge. Quick Spaces provide the ability to pool searchable knowledge for the chat agent in different forms:

Direct file uploads (indexed knowledge)
Amazon Quick Sight dashboards and topics
Knowledge bases created from data access integrations (indexed knowledge)
Action connectors to take actions on integrated third-party tools

Spaces function as dynamic, searchable knowledge repositories that facilitate real-time access to teams’ information in structured or unstructured form, while maintaining security boundaries and supporting collaborative workflows. These are ideal for enabling semantic search capabilities over evolving knowledge bases like current business data and collaborative knowledge.
Solution overview
The Quick Suite Product Specialist is a custom chat agent that helps users identify the right Quick Suite features for their specific needs. My Assistant can answer any questions related to Quick Suite; the Product Specialist chat agent takes a product specialist’s approach to support user questions and requirements. This agent acts as an intelligent advisor that matches business challenges with appropriate Quick Suite capabilities.
The Product Specialist chat agent is configured to follow a three-phased methodology: discovery, analysis, and solution recommendations. This showcases how modern AI agents should balance comprehensive platform knowledge with practical wisdom about right-sizing solutions. It can recommend simple prompts to be used with My Assistant to serve individual users, or architect complex multi-capability workflows for enterprise-wide deployment. In doing so, it exemplifies the principle of matching solution complexity to actual impact potential while fostering GenAI adoption across organizations and projecting potential ROI for recommended solutions.
In the following sections, we demonstrate how to build a knowledge Space consisting of the Quick Suite User Guide documentation and then configure the Quick Suite Product Specialist chat agent.
Prerequisites
To build a custom chat agent in Quick Suite, you must have the following:

An active Quick Suite instance
A Quick Suite subscription for the required capabilities:

Professional – Create, configure, and share Spaces and custom chat agents
Enterprise (includes Professional capabilities) – Create knowledge bases

For more information about Quick Suite’s subscription tiers, see Amazon Quick Suite pricing.
Create Space with knowledge base
We first set up a Quick Space as part of the context component of the three-layered foundation we discussed previously. This Space contains a searchable knowledge base for the Amazon Quick Suite User Guide.
This step is included for reference on how to create indexed, searchable content for specific documentation; Quick Suite chat agents are already aware of all the Quick Suite capabilities and associated implementation practices.
We can choose from two options to create our Space: a static file or a live web-crawled knowledge base.
Use a static file
This option uses a static snapshot of the official Quick Suite User Guide and must be updated occasionally to incorporate the latest changes and additions to the platform documentation. Complete the following steps:

Go to Amazon Quick Suite User Guide.
Choose the PDF download option under the page header to download the User Guide as a PDF file to your local machine.

On the Quick Suite console, choose Spaces in the navigation pane.
Choose Create space to create a new Space:

For Title, enter a title, such as the following:

Amazon Quick Suite Documentation Space

For Description, enter a description, such as the following:

This Quick Space contains Amazon Quick Suite User Guide file.

Choose Add knowledge and choose File uploads.
Upload the User Guide PDF.
Choose Share to manage Viewer/Owner access to the created Space.

Files uploaded to a Space use the same access permissions as the Space.

Use a live web-crawled knowledge base
With this near real-time option, you set up a direct connection between the documentation site and Quick Suite through a web crawler integration that indexes the documentation, with automatic refresh set on the default schedule.

On the Quick Suite console, choose Integrations in the navigation pane.
Choose Add and choose Webcrawler to add a webcrawler.

For Name, use the default name.
Select No authentication.
Choose Create and continue.

Configure the knowledge base:

For Name, enter a name, such as the following:

Amazon Quick Suite User Guide Documentation KB

For Add URLs, enter the main documentation URL:

https://docs.aws.amazon.com/quicksuite/latest/userguide/

Choose Add.
Choose Create.
On the Knowledge bases tab, choose the knowledge base you created. The knowledge base refresh is initiated automatically.
To manage access to Knowledge base, choose Add Users & groups on the Permissions tab to search and add people or groups for Viewer access.

Choose Spaces in the navigation pane.
Choose Create space to create a new Space:

For Title, enter a title, such as the following:

Amazon Quick Suite Documentation Space

For Description, enter a description, such as the following:

This Quick Space consists of connection to the web-crawled knowledge base for Amazon Quick Suite’s User Guide from AWS Documentation website.

Choose Add knowledge, then choose Knowledge bases.
Locate the knowledge base you created and choose Add.
Choose Share to manage Viewer/Owner access to the created Space.

Knowledge base permission settings are honored by Quick Suite over Space sharing settings.
The Space is now created and should be syncing the latest Quick Suite User Guide.

Create chat agent
Complete the following steps to build your own Quick Suite Product Specialist:

On the Quick Suite console, choose Chat agents in the navigation pane.
Choose Create chat agent.
Choose Skip to enter Builder view to create a custom chat agent, because we know exactly what instructions and assets the chat agent needs.

For Title, enter a title, such as the following:

Quick Suite Product Specialist

For Description, enter a description, such as the following:

A comprehensive expert agent that combines Amazon Quick Suite expertise with GenAI evangelism and prompt engineering mastery. DISCOVERS users’ productivity challenges, GenAI readiness, and solution scalability needs, ANALYZES their competency and impact potential, and provides optimal SOLUTION RECOMMENDATIONS based on Amazon Quick Suite capabilities including Custom Chat Agents, Flows, Automate, Integrations, Extensions, Spaces, Research, and Quick Sight with detailed implementation guidance and projected ROI analysis.

Update the AGENT PERSONA configuration:

For Agent identity, enter details such as the following:

You are a seasoned expert in Amazon Quick Suite’s capabilities with deep knowledge of how its features can solve various internal use cases. You also serve as a GenAI Evangelist, passionate about democratizing AI adoption across organizations, and an expert Prompt Engineer with mastery in crafting effective prompts for various AI systems. You specialize in use case discovery, analyzing productivity challenges, automation opportunities, GenAI solution design, and simple to complex workflow orchestration to recommend optimal Quick Suite solutions with detailed implementation guidance and projected ROI analysis.
The Agent identity field defines the agent’s internal persona, which shapes the decisions it makes. Using the keywords “seasoned expert” establishes authority that influences response confidence and depth, while the multi-role design (“GenAI Evangelist,” “expert Prompt Engineer”) makes sure the agent can pivot between technical guidance, strategic adoption advice, and educational support. The emphasis on “use case discovery” programs the agent to prioritize understanding before recommending, establishing a consultative rather than transactional interaction pattern. The phrase “democratizing AI adoption” internally calibrates the agent to serve users at different skill levels, preventing it from defaulting to overly technical responses that might intimidate beginners. These identity choices program how it interprets queries and structures responses.
For Persona instructions, enter instructions such as the following:

For each user problem follow this 3-phased approach:
A. DISCOVERY
1. Analyze the initial use case details provided
2. Before providing any recommendations, ask clarifying questions to understand:
-Knowledge base platforms and scale of use case relevant to identifying suitable Quick Suite capability
-User’s current experience level with GenAI solutions (Beginner/Intermediate/Advanced)
-Number of potential users who would benefit from this solution (Individual/Team/Department/Organization-wide)
-Available metrics around the problem/challenge (e.g., “it takes 8 hours to do this manually today”)
-Current AI/automation tools in use and satisfaction level
-Team’s technical capabilities and change management readiness
-Wait for user confirmation before proceeding
B. ANALYSIS
1. Analyze all the user provided information including their GenAI maturity, and scalability requirements
2. Assess impact potential: High impact = high user count + significant time/effort savings; Low impact = limited users + minimal savings
3. Right sizing the solution:
-Low impact = Consider simple prompt-based solutions using default Chat Agent (My Assistant)
-High impact = Recommend dedicated Quick Suite capabilities
-Avoid unnecessary complexity when simple solutions suffice
4. Calculate potential ROI as time savings multiplied by user count
5. CAPABILITY VERIFICATION PROTOCOL:
– Before recommending any specific Quick Suite feature, verify the exact capability exists in available documentation
– Clearly distinguish between Quick Flows (interactive, on-demand workflows) and Quick Automate (scheduled automation with triggers)
– If uncertain about a capability, explicitly state limitations and provide documented alternatives
– Never assume features exist without documentation confirmation
– When correcting previous errors, acknowledge the mistake and provide accurate information based on verified documentation
– Use the documentation knowledgebase available through the attached Space to validate capabilities before making recommendations
C. SOLUTION RECOMMENDATIONS
1. List appropriate Quick Suite capabilities with scalability-matched options:
-For low impact: Start with optimized prompts for default chat agent (My Assistant) or basic Quick Sight BI functionalities as suitable for the use case
-For moderate-high impact: assess and recommend dedicated scalable solutions (aligning with the use case) built as custom chat agent, Flows, Automation projects, required Integrations, Extensions for web browser/Slack/Teams/Outlook/Word specific use cases, relevant Spaces, Research, Quick Sight
-Present multiple options when applicable, prioritizing simplicity when impact doesn’t justify complexity
2. Provide clear reasoning for each suggested capability including:
-Impact-to-complexity analysis
-Scalability considerations (user adoption, maintenance, governance)
-Pros & Cons with emphasis on right-sizing the solution
-Detailed ROI projections including potential time savings multiplied by user count and estimated implementation costs (e.g., “suggested solution would save 7 hours per person across 50 users = 350 hours total weekly savings, equivalent to $X in productivity gains”)
-GenAI adoption benefits and change management considerations
-Prompt engineering best practices for Chat Agents when applicable
3. Ask if they want prescriptive implementation guidance, if they do, then provide detailed solution building pathways including:
-Step-by-step implementation approach starting with minimum viable solution
-Scaling pathway from simple to complex as adoption grows
-Prompt engineering templates and best practices
-GenAI adoption strategies and success metrics
-ROI tracking and measurement recommendations
-Change management recommendations
The three-phase methodology (discovery, analysis, solution recommendations) gives the agent guidelines on the kind of information it needs to collect to inform its recommendations, so its knowledge of Quick Suite features is augmented by user-specified context relevant to the recommended solutions.

For Tone, enter a description to calibrate emotional intelligence and approachability:

Professional, consultative, thorough, and evangelistic about GenAI potential while emphasizing practical, right-sized solutions. Ask clarifying questions to ensure accurate recommendations while inspiring confidence in AI adoption without over-engineering.

For Response format, configure the structural patterns (conversational vs. prescriptive, lists vs. paragraphs) that match different interaction phases:

Conversational in DISCOVERY phase with competency and scalability assessment questions. Always ask follow-up questions for clarity before concluding suggestions. Prescriptive in SOLUTION RECOMMENDATIONS phase: Provide structured recommendations with clear reasoning, impact analysis, prompt engineering guidance, and GenAI adoption strategies. Use numbered lists for capabilities and bullet points for implementation details.

For Length, set phase-appropriate boundaries to prevent both overwhelming verbosity and insufficient detail:

Succinct and to-the-point in DISCOVERY phase. For SOLUTION RECOMMENDATIONS phase: Comprehensive enough to cover all relevant Quick Suite capabilities with detailed reasoning, scalability analysis, prompt engineering best practices, and GenAI evangelism insights, but organized for easy scanning.

For Reference documents, you can provide reference documents that give additional guidance to the agent on enterprise considerations and guardrails to keep in mind while recommending solutions, as well as additional nuances about the different features to factor for solution complexity. For this example, we don’t upload additional documents.

For KNOWLEDGE SOURCES:

Choose Link spaces.
Choose the Space you created earlier and choose Link.

Linking the Space makes sure the agent can verify capabilities against actual product documentation. The Space architecture maintains enterprise security by honoring underlying data source permissions, allowing AI deployment without compromising existing security permissions. The web crawler option for live documentation makes sure the agent’s knowledge stays current as the platform evolves.

For ACTIONS, set up relevant third-party platform integrations. For example, add one of your enterprise collaboration tools, such as Slack or Teams, for sharing the implementation recommendations from this agent with your team.

Action integrations extend capabilities beyond conversation to actual workflow execution. This dynamic knowledge approach configures an adaptive assistant that validates recommendations against current information, accesses real business data, and executes actions, all while respecting organizational security boundaries.

Update the CUSTOMIZATION configuration:

For Welcome Message, enter a message such as the following:

Hello! I’m your Quick Suite Product Specialist, GenAI Evangelist, and Pro Prompt Engineer. Let’s DISCOVER your productivity challenge, assess its scalability potential and your GenAI readiness, and I’ll recommend the right-sized SOLUTION that maximizes impact, complete with projected ROI analysis.

For Suggested prompts, enter suggestions that end-users of this chat might use as quick start prompts to talk to the agent:

“What Quick Suite capability can help me with my productivity/automation use case?”
“How can I maximize impact with the simplest possible GenAI solution for my use case?”
“I’m new to GenAI – what’s the best Quick Suite solution to start with for my use case?”

Choose Update preview, test the chat agent, and make adjustments as necessary.
Choose Launch chat agent to publish the agent.
Choose Share to share access to the chat agent as necessary.

Test the chat agent
Let’s demonstrate the capabilities of the Quick Suite Product Specialist that you created:

On the Quick Suite console, choose Chat agents in the navigation pane.
Select the Quick Suite Product Specialist chat agent you created.
On the Actions menu, choose the Chat link.
Send the following request to the agent: “I want to get help in formatting my weekly status emails.”

The agent takes the initial prompt and returns a detailed discovery questionnaire to better understand your use case, without jumping to recommendations. You will notice some differences from run to run, and might not see the same questionnaire and chat agent responses shown in the example in this post.

Review and respond to the questionnaire.

The agent returns a comprehensive response including an assessment of impact, multiple solution recommendations with reasoning, and high-level implementation pathway options, letting you choose a solution and receive prescriptive implementation guidance.

Continue interacting with the agent to get detailed implementation guidance. Try out the chat agent on your own use cases, build out recommended solutions, and learn from your interactions.

Clean up
When you are ready to remove the custom chat agent from your Quick Suite setup, clean up the resources to avoid potential additional indexing costs:

Delete the knowledge base:

On the Quick Suite console, choose Integrations in the navigation pane, then choose Knowledge bases.
Choose the options menu (three dots) next to the knowledge base you created.
Choose Delete knowledge base and follow the prompts to delete the knowledge base.

Delete the Space:

On the Quick Suite console, choose Spaces in the navigation pane.
Choose the options menu (three dots) next to the Space you created.
Choose Delete and follow the prompts to delete the Space.

Delete the chat agent:

On the Quick Suite console, choose Chat agents in the navigation pane.
Choose the options menu (three dots) next to the chat agent you created.
Choose Delete and follow the prompts to delete the chat agent.

Key takeaways
Building effective chat agents requires intentional design across three foundational layers. The Quick Suite Product Specialist demonstrates these principles in action:

Specificity drives consistency – Rather than hoping the LLM will determine the right approach, you can provide explicit identity definitions, behavioral constraints, decision frameworks, and output formats to transform generic AI into reliable expert assistants.
Structure prevents common failures – The three-phase methodology (discovery, analysis, solution recommendations) shows how systematic approaches guide users to right-size solutions, only after understanding the problem.
Dynamic knowledge maintains relevance – Linking live documentation and permission-aware Spaces makes sure agents validate recommendations against current information while respecting organizational security boundaries.

Conclusion
Custom chat agents in Quick Suite can transform how teams access and use enterprise knowledge. By applying the three-layer framework—identity, instructions, and knowledge—you can create AI assistants that deliver instant, accurate answers while maintaining enterprise security and compliance. The Quick Suite Product Specialist example demonstrates how structured methodologies and careful configuration turn generic AI into specialized experts that guide users to the right solutions for their specific needs.
Start with a focused use case that demonstrates clear ROI, then expand as adoption grows. Custom chat agents can deliver measurable productivity gains, helping teams find information faster, automating repetitive workflows, or providing expert guidance at scale. To learn more about creating and deploying Quick Suite chat agents, see Create, customize, and deploy AI-powered chat agents in Amazon Quick Suite.

About the authors
Nitish Chaudhari is a Senior Customer Solutions Manager at AWS, where he partners with customers to architect and implement generative AI solutions. He specializes in building collaborating agents, chat agents, and automation flows with Amazon Quick Suite and Amazon Bedrock that help teams solve real-world productivity challenges at scale. Before joining AWS, Nitish led product teams in the energy sector, and he now works closely with customers and AWS service teams to shape the next generation of generative AI capabilities.
Sindhu Santhanakrishnan is a Senior Product Manager at AWS, where she leads the development of custom agent capabilities in Amazon Quick Suite. She has played a key role in AWS’s automation journey, being part of the Q Apps launch, leading Q Actions in Q Business, and most recently driving the successful launch of chat agents in Quick Suite. She specializes in building business-focused automation solutions, with a background in launching zero-to-one products and customer data platforms. Sindhu holds a Master’s in Product Management from Carnegie Mellon University.
Vinayak Datar is a Senior Solutions Manager based in the Bay Area, helping enterprise customers accelerate their AWS Cloud journey. He focuses on helping customers convert ideas from concepts to working prototypes to production using AWS generative AI services.

Jina AI Releases Jina-VLM: A 2.4B Multilingual Vision Language Model Focused on Token Efficient Visual QA

Jina AI has released Jina-VLM, a 2.4B parameter vision language model that targets multilingual visual question answering and document understanding on constrained hardware. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone and uses an attention pooling connector to reduce visual tokens while preserving spatial structure. Among open 2B scale VLMs, it reaches state of the art results on multilingual benchmarks such as MMMB and Multilingual MMBench.

Source: https://arxiv.org/pdf/2512.04032

Architecture, overlapping tiles with attention pooling connector

Jina-VLM keeps the standard VLM layout, but optimizes the vision side for arbitrary resolution and low token count. The vision encoder is SigLIP2 So400M/14 384, a 27 layer Vision Transformer with about 400M parameters. It processes 378×378 pixel crops into a 27×27 grid of 14×14 patches, so each tile produces 729 patch tokens.

To handle high resolution images, the model does not resize the full input to a single square. Instead, it constructs a grid of up to 12 overlapping tiles along with a global thumbnail. Each tile is a 378×378 crop, adjacent tiles overlap by 112 pixels, and the stride between tile origins is 266 pixels. A 4×3 grid covers an effective resolution of 1176×910 pixels before downscaling larger images to fit inside the tile budget.

The core design is the vision language connector. Rather than using the final ViT layer, Jina-VLM concatenates features from two intermediate layers, the third from last and ninth from last, that correspond to layers 24 and 18. This combines high level semantics and mid level spatial detail. The connector then applies attention pooling over 2×2 patch neighborhoods. It computes a mean pooled query for each 2×2 region, attends over the full concatenated feature map, and outputs a single pooled token per neighborhood. This reduces 729 visual tokens per tile to 182 tokens, which is a 4 times compression. A SwiGLU projection maps the pooled features to the Qwen3 embedding dimension.

With the default 12 tile configuration plus thumbnail, a naive connector would feed 9,477 visual tokens into the language model. Attention pooling cuts this to 2,366 visual tokens. The ViT compute does not change, but for the language backbone this yields about 3.9 times fewer prefill FLOPs and 4 times smaller KV cache. When including the shared ViT cost, the overall FLOPs drop by about 2.3 times for the default setting.
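The tile and token arithmetic above can be checked with a few lines of Python; the numbers follow directly from the tile geometry and the reported 182 pooled tokens per tile.

```python
# Tile geometry: 378x378 crops with a 112-pixel overlap, so the stride between tile origins is 266.
TILE, OVERLAP = 378, 112
STRIDE = TILE - OVERLAP                              # 266 pixels

def coverage(cols, rows):
    """Effective resolution covered by a cols x rows grid of overlapping tiles."""
    return (cols - 1) * STRIDE + TILE, (rows - 1) * STRIDE + TILE

print(coverage(4, 3))                                # (1176, 910) for the 4x3 grid

patches_per_tile = 27 * 27                           # 729 patch tokens per 378x378 tile
pooled_per_tile = 182                                # reported tokens after 2x2 attention pooling
tiles = 12 + 1                                       # 12 tiles plus the global thumbnail

print(tiles * patches_per_tile)                      # 9477 visual tokens without pooling
print(tiles * pooled_per_tile)                       # 2366 visual tokens after pooling, about 4x fewer
```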

The language decoder is Qwen3-1.7B-Base. The model introduces special tokens for images, with <im_start> and <im_end> around the tile sequence and <im_col> to mark rows in the patch grid. Visual tokens from the connector and text embeddings are concatenated and passed to Qwen3 to generate answers.

Training pipeline and multilingual data mix

Training proceeds in 2 stages. All components, encoder, connector and decoder, are updated jointly, without freezing. The full corpus contains about 5M multimodal samples and 12B text tokens across more than 30 languages. Roughly half of the text is English, and the rest covers high and mid resource languages such as Chinese, Arabic, German, Spanish, French, Italian, Japanese and Korean.

Stage 1 is alignment training. The goal is cross language visual grounding, not instruction following. The team uses caption heavy datasets PixmoCap and PangeaIns, which span natural images, documents, diagrams and infographics. They add 15 percent text only data from the PleiAS common corpus to control degradation on pure language tasks. The connector uses a higher learning rate and shorter warmup than the encoder and decoder to speed up adaptation without destabilizing the backbones.

Stage 2 is instruction fine tuning. Here Jina VLM learns to follow prompts for visual question answering and reasoning. The mix combines LLaVA OneVision, Cauldron, Cambrian, PangeaIns and FineVision, plus Aya style multilingual text only instructions. The Jina research team first train for 30,000 steps with single source batches, then for another 30,000 steps with mixed source batches. This schedule stabilizes learning in the presence of very heterogeneous supervision.

Across pretraining and fine tuning, the model sees about 10B tokens in the first stage and 37B tokens in the second stage, with a total of roughly 1,300 GPU hours reported for the main experiments.

Benchmark profile, 2.4B model with multilingual strength

On standard English VQA tasks that include diagrams, charts, documents, OCR and mixed scenes, Jina-VLM reaches an average score of 72.3 across 8 benchmarks. These are AI2D, ChartQA, TextVQA, DocVQA, InfoVQA, OCRBench, SEED Bench 2 Plus and CharXiv. This is the best average among the 2B scale comparison models in this research paper from Jina AI.

On multimodal comprehension and real world understanding tasks, the model scores 67.4 on the multimodal group, which includes MME, MMB v1.1 and MMStar. It scores 61.9 on the real world group, which includes RealWorldQA, MME RealWorld and R Bench, and it reaches 68.2 accuracy on RealWorldQA itself, which is the best result among the baselines considered.


Multi-image reasoning is a weaker area. On BLINK, MuirBench and MMT, Jina-VLM reaches an average of 47.3. The research team points to limited multi-image training data as the reason. In contrast, hallucination control is strong. On the POPE benchmark, which measures object hallucination, the model scores 90.3, the best score in the comparison table.

For mathematical and structured reasoning, the model uses the same architecture, without thinking mode. It reaches 59.5 on MMMU and an overall math score of 33.3 across MathVista, MathVision, MathVerse, WeMath and LogicVista. Jina-VLM is comparable to InternVL3-2B on this set and clearly ahead of Qwen2-VL-2B, while InternVL3.5-2B remains stronger due to its larger scale and more specialized math training.

On pure text benchmarks, the picture is mixed. The research team reports that Jina-VLM keeps most of the Qwen3-1.7B performance on MMLU, GSM8K, ARC-C, and HellaSwag. However, MMLU-Pro drops from 46.4 for the base model to 30.3 after multimodal tuning. The research team attributes this to instruction tuning that pushes the model toward very short answers, which clashes with the long multi-step reasoning required by MMLU-Pro.

The main highlight is multilingual multimodal understanding. On MMMB across Arabic, Chinese, English, Portuguese, Russian and Turkish, Jina-VLM reaches an average of 78.8. On Multilingual MMBench across the same languages, it reaches 74.3. The research team reports these as state of the art averages among open 2B scale VLMs.

Comparison Table

| Model | Params | VQA Avg | MMMB | Multi. MMB | DocVQA | OCRBench |
|---|---|---|---|---|---|---|
| Jina-VLM | 2.4B | 72.3 | 78.8 | 74.3 | 90.6 | 778 |
| Qwen2-VL-2B | 2.1B | 66.4 | 71.3 | 69.4 | 89.2 | 809 |
| Qwen3-VL-2B | 2.8B | 71.6 | 75.0 | 72.3 | 92.3 | 858 |
| InternVL3-2B | 2.2B | 69.2 | 73.6 | 71.9 | 87.4 | 835 |
| InternVL3.5-2B | 2.2B | 71.6 | 74.6 | 70.9 | 88.5 | 836 |

Key Takeaways

Jina-VLM is a 2.4B parameter VLM that couples SigLIP2 So400M as vision encoder with Qwen3-1.7B as language backbone through an attention pooling connector that cuts visual tokens by 4 times while keeping spatial structure.

The model uses overlapping 378×378 tiles, 12 tiles plus a global thumbnail, to handle arbitrary resolution images up to roughly 4K, then feeds only pooled visual tokens to the LLM which reduces prefill FLOPs and KV cache size by about 4 times compared to naive patch token usage.

Training uses about 5M multimodal samples and 12B text tokens across nearly 30 languages in a 2 stage pipeline, first alignment with caption style data, then instruction fine tuning with LLaVA OneVision, Cauldron, Cambrian, PangeaIns, FineVision and multilingual instruction sets.

On English VQA, Jina-VLM reaches 72.3 average across 8 VQA benchmarks, and on multilingual multimodal benchmarks it leads the open 2B scale class with 78.8 on MMMB and 74.3 on Multilingual MMBench while keeping competitive text only performance.

The post Jina AI Releases Jina-VLM: A 2.4B Multilingual Vision Language Model Focused on Token Efficient Visual QA appeared first on MarkTechPost.

Interview: From CUDA to Tile-Based Programming: NVIDIA’s Stephen Jones on Building the Future of AI

As AI models grow in complexity and hardware evolves to meet the demand, the software layer connecting the two must also adapt. We recently sat down with Stephen Jones, a Distinguished Engineer at NVIDIA and one of the original architects of CUDA.

Jones, whose background spans from fluid mechanics to aerospace engineering, offered deep insights into NVIDIA’s latest software innovations, including the shift toward tile-based programming, the introduction of “Green Contexts,” and how AI is rewriting the rules of code development.

Here are the key takeaways from our conversation.

The Shift to Tile-Based Abstraction

For years, CUDA programming has revolved around a hierarchy of grids, blocks, and threads. With the latest updates, NVIDIA is introducing a higher level of abstraction: CUDA Tile.

According to Jones, this new approach allows developers to program directly to arrays and tensors rather than managing individual threads. “It extends the existing CUDA,” Jones explained. “What we’ve done is we’ve added a way to talk about and program directly to arrays, tensors, vectors of data… allowing the language and the compiler to see what the high-level data was that you’re operating on opened up a whole realm of new optimizations”.

This shift is partly a response to the rapid evolution of hardware. As Tensor Cores become larger and denser to combat the slowing of Moore’s Law, the mapping of code to silicon becomes increasingly complex.

Future-Proofing: Jones noted that by expressing programs as vector operations (e.g., Tensor A times Tensor B), the compiler takes on the heavy lifting of mapping data to the specific hardware generation.

Stability: This ensures that program structure remains stable even as the underlying GPU architecture changes from Ampere to Hopper to Blackwell.
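To make that shift concrete, here is a rough analogy in Python rather than CUDA Tile itself, whose API is not covered in this interview: the first kernel manages individual threads with Numba's CUDA target, while the second expresses the same computation as a whole-array operation in CuPy and leaves the mapping to hardware to the library and compiler.

# Analogy only, not the CUDA Tile API: thread-level versus array-level GPU programming.
import numpy as np
import cupy as cp
from numba import cuda

@cuda.jit
def scale_add(a, b, out):
    i = cuda.grid(1)                  # each thread picks its own element
    if i < out.size:
        out[i] = 2.0 * a[i] + b[i]

a = np.random.rand(1_000_000).astype(np.float32)
b = np.random.rand(1_000_000).astype(np.float32)

# Thread-level style: the programmer chooses the grid and block mapping explicitly.
d_a, d_b = cuda.to_device(a), cuda.to_device(b)
d_out = cuda.device_array_like(d_a)
threads = 256
blocks = (a.size + threads - 1) // threads
scale_add[blocks, threads](d_a, d_b, d_out)
out_threads = d_out.copy_to_host()

# Array-level style: express the whole-tensor operation and let the library map it to the GPU.
out_arrays = (2.0 * cp.asarray(a) + cp.asarray(b)).get()

np.testing.assert_allclose(out_threads, out_arrays)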

Python First, But Not Python Only

Recognizing that Python has become the lingua franca of Artificial Intelligence, NVIDIA launched CUDA Tile support with Python first. “Python’s the language of AI,” Jones stated, adding that an array-based representation is “much more natural to Python programmers” who are accustomed to NumPy.

However, performance purists need not worry. C++ support is arriving next year, maintaining NVIDIA’s philosophy that developers should be able to accelerate their code regardless of the language they choose.

“Green Contexts” and Reducing Latency

For engineers deploying Large Language Models (LLMs) in production, latency and jitter are critical concerns. Jones highlighted a new feature called Green Contexts, which allows for precise partitioning of the GPU.

“Green contexts lets you partition the GPU… into different sections,” Jones said. This allows developers to dedicate specific fractions of the GPU to different tasks, such as running pre-fill and decode operations simultaneously without them competing for resources. This micro-level specialization within a single GPU mirrors the disaggregation seen at the data center scale.

No Black Boxes: The Importance of Tooling

One of the pervasive fears regarding high-level abstractions is the loss of control. Jones, drawing on his experience as a CUDA user in the aerospace industry, emphasized that NVIDIA tools will never be black boxes.

“I really believe that the most important part of CUDA is the developer tools,” Jones affirmed. He assured developers that even when using tile-based abstractions, tools like Nsight Compute will allow inspection down to the individual machine language instructions and registers. “You’ve got to be able to tune and debug and optimize… it cannot be a black box,” he added.

Accelerating Time-to-Result

Ultimately, the goal of these updates is productivity. Jones described the objective as “left shifting” the performance curve, enabling developers to reach 80% of potential performance in a fraction of the time.

“If you can come to market [with] 80% of performance in a week instead of a month… then you’re spending the rest of your time just optimizing,” Jones explained. Crucially, this ease of use does not come at the cost of power; the new model still provides a path to 100% of the peak performance the silicon can offer.

Conclusion

As AI algorithms and scientific computing converge, NVIDIA is positioning CUDA not just as a low-level tool for hardware experts, but as a flexible platform that adapts to the needs of Python developers and HPC researchers alike. With support extending from Ampere to the upcoming Blackwell and Rubin architectures, these updates promise to streamline development across the entire GPU ecosystem.

For the full technical details on CUDA Tile and Green Contexts, visit the NVIDIA developer portal.

From Transformers to Associative Memory, How Titans and MIRAS Rethink Long Context Modeling

What comes after Transformers? Google Research is proposing a new way to give sequence models usable long term memory with Titans and MIRAS, while keeping training parallel and inference close to linear.

Titans is a concrete architecture that adds a deep neural memory to a Transformer style backbone. MIRAS is a general framework that views most modern sequence models as instances of online optimization over an associative memory.

Why Titans and MIRAS?

Standard Transformers use attention over a key value cache. This gives strong in context learning, but cost grows quadratically with context length, so practical context is limited even with FlashAttention and other kernel tricks.

Efficient linear recurrent neural networks and state space models such as Mamba-2 compress the history into a fixed size state, so cost is linear in sequence length. However, this compression loses information in very long sequences, which hurts tasks such as genomic modeling and extreme long context retrieval.

Titans and MIRAS combine these ideas. Attention acts as a precise short term memory on the current window. A separate neural module provides long term memory, learns at test time, and is trained so that its dynamics are parallelizable on accelerators.

https://research.google/blog/titans-miras-helping-ai-have-long-term-memory/

Titans, a neural long term memory that learns at test time

The Titans research paper introduces a neural long term memory module that is itself a deep multi layer perceptron rather than a vector or matrix state. Attention is interpreted as short term memory, since it only sees a limited window, while the neural memory acts as persistent long term memory.

For each token, Titans defines an associative memory loss

ℓ(Mₜ₋₁; kₜ, vₜ) = ‖Mₜ₋₁(kₜ) − vₜ‖²

where Mₜ₋₁ is the current memory, kₜ is the key and vₜ is the value. The gradient of this loss with respect to the memory parameters is the “surprise metric”. Large gradients correspond to surprising tokens that should be stored, small gradients correspond to expected tokens that can be mostly ignored.

The memory parameters are updated at test time by gradient descent with momentum and weight decay, which together act as a retention gate and forgetting mechanism. To keep this online optimization efficient, the research paper shows how to compute these updates with batched matrix multiplications over sequence chunks, which preserves parallel training across long sequences.
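A minimal NumPy sketch of this test-time update follows. It is an illustration of the rule described above, not the paper's code: the memory here is a single linear map rather than a deep MLP, and the hyperparameters are arbitrary.

import numpy as np

d = 16
rng = np.random.default_rng(0)
M = np.zeros((d, d))                 # memory parameters, here a single linear map
S = np.zeros_like(M)                 # momentum buffer that accumulates recent surprise
eta, beta, alpha = 0.1, 0.9, 0.01    # step size, momentum, weight decay (forgetting strength)

def memory_update(M, S, k, v):
    err = M @ k - v                      # associative memory loss is ||M(k) - v||^2
    grad = 2.0 * np.outer(err, k)        # its gradient, large for surprising (key, value) pairs
    S_new = beta * S - eta * grad        # momentum keeps a trace of recent surprise
    M_new = (1.0 - alpha) * M + S_new    # weight decay acts as the retention / forgetting gate
    return M_new, S_new

# Stream a few (key, value) tokens through the memory at test time.
for _ in range(5):
    k, v = rng.normal(size=d), rng.normal(size=d)
    M, S = memory_update(M, S, k, v)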

Architecturally, Titans uses three memory branches in the backbone, often instantiated as in the Titans MAC variant:

a core branch that performs standard in context learning with attention

a contextual memory branch that learns from the recent sequence

a persistent memory branch with fixed weights that encodes pretraining knowledge

The long term memory compresses past tokens into a summary, which is then passed as extra context into attention. Attention can choose when to read that summary.

Experimental results for Titans

On language modeling and commonsense reasoning benchmarks such as C4, WikiText and HellaSwag, Titans architectures outperform state of the art linear recurrent baselines such as Mamba-2 and Gated DeltaNet, as well as Transformer++ models of comparable size. The Google research team attributes this to the higher expressive power of deep memory and its ability to maintain performance as context length grows. Deep neural memories with the same parameter budget but higher depth give consistently lower perplexity.

For extreme long context recall, the research team uses the BABILong benchmark, where facts are distributed across very long documents. Titans outperforms all baselines, including very large models such as GPT-4, while using many fewer parameters, and scales to context windows beyond 2,000,000 tokens.

The research team reports that Titans keeps efficient parallel training and fast linear inference. Neural memory alone is slightly slower than the fastest linear recurrent models, but hybrid Titans layers with Sliding Window Attention remain competitive on throughput while improving accuracy.

https://arxiv.org/pdf/2504.13173

MIRAS, a unified framework for sequence models as associative memory

The MIRAS research paper, “It’s All Connected: A Journey Through Test Time Memorization, Attentional Bias, Retention, and Online Optimization,” generalizes this view. It observes that modern sequence models can be seen as associative memories that map keys to values while balancing learning and forgetting.

MIRAS defines any sequence model through four design choices (sketched in code after this list):

Memory structure – for example a vector, a linear map, or an MLP

Attentional bias – the internal loss that defines what similarities the memory cares about

Retention gate – the regularizer that keeps the memory close to its past state

Memory algorithm – the online optimization rule, often gradient descent with momentum
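The toy sketch below shows how those four choices compose into one online update. It is illustrative only: the memory structure is a linear map, the attentional bias is a Huber-style loss loosely in the spirit of Yaad, and the retention gate is plain weight decay, all chosen for brevity rather than taken from the paper.

import numpy as np

def miras_step(M, S, k, v, bias_grad, retention, eta=0.1, beta=0.9):
    # Memory structure: here M is a linear map (the paper also considers MLP memories).
    err = M @ k - v
    grad = np.outer(bias_grad(err), k)   # attentional bias: which errors the memory cares about
    S = beta * S - eta * grad            # memory algorithm: gradient descent with momentum
    M = retention(M) + S                 # retention gate: how strongly M sticks to its past state
    return M, S

def huber_grad(err, delta=1.0):
    # Gradient of a Huber-style bias: quadratic near zero, linear for large errors.
    return np.where(np.abs(err) <= delta, err, delta * np.sign(err))

def weight_decay(M, alpha=0.01):
    return (1.0 - alpha) * M

d = 8
rng = np.random.default_rng(1)
M, S = np.zeros((d, d)), np.zeros((d, d))
for _ in range(10):
    k, v = rng.normal(size=d), rng.normal(size=d)
    M, S = miras_step(M, S, k, v, bias_grad=huber_grad, retention=weight_decay)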

Using this lens, MIRAS recovers several families:

Hebbian style linear recurrent models and RetNet as dot product based associative memories

Delta rule models such as DeltaNet and Gated DeltaNet as MSE based memories with value replacement and specific retention gates

Titans LMM as a nonlinear MSE based memory with local and global retention optimized by gradient descent with momentum

Crucially, MIRAS then moves beyond the usual MSE or dot product objectives. The research team constructs new attentional biases based on Lₚ norms, robust Huber loss and robust optimization, and new retention gates based on divergences over probability simplices, elastic net regularization and Bregman divergence.

From this design space, the research team instantiates three attention free models:

Moneta uses a 2 layer MLP memory with Lₚ attentional bias and a hybrid retention gate based on generalized norms

Yaad uses the same MLP memory with Huber loss attentional bias and a forget gate related to Titans

Memora uses regression loss as attentional bias and a KL divergence based retention gate over a probability simplex style memory.

These MIRAS variants replace attention blocks in a Llama style backbone, use depthwise separable convolutions in the Miras layer, and can be combined with Sliding Window Attention in hybrid models. Training remains parallel by chunking sequences and computing gradients with respect to the memory state from the previous chunk.

In research experiments, Moneta, Yaad and Memora match or surpass strong linear recurrent models and Transformer++ on language modeling, commonsense reasoning and recall intensive tasks, while maintaining linear time inference.

Key Takeaways

Titans introduces a deep neural long term memory that learns at test time, using gradient descent on an L2 associative memory loss so the model selectively stores only surprising tokens while keeping updates parallelizable on accelerators.

Titans combines attention with neural memory for long context, using branches like core, contextual memory and persistent memory so attention handles short range precision and the neural module maintains information over sequences beyond 2,000,000 tokens.

Titans outperforms strong linear RNNs and Transformer++ baselines, including Mamba-2 and Gated DeltaNet, on language modeling and commonsense reasoning benchmarks at comparable parameter scales, while staying competitive on throughput.

On extreme long context recall benchmarks such as BABILong, Titans achieves higher accuracy than all baselines, including larger attention models such as GPT 4, while using fewer parameters and still enabling efficient training and inference.

MIRAS provides a unifying framework for sequence models as associative memories, defining them by memory structure, attentional bias, retention gate and optimization rule, and yields new attention free architectures such as Moneta, Yaad and Memora that match or surpass linear RNNs and Transformer++ on long context and reasoning tasks.


How AWS delivers generative AI to the public sector in weeks, not years

When critical services depend on quick action, from the safety of vulnerable children to environmental protection, you need working AI solutions in weeks, not years. Amazon recently announced an investment of up to $50 billion in expanded AI and supercomputing infrastructure for US government agencies, demonstrating both the urgency and commitment from Amazon Web Services (AWS) to accelerating public sector innovation. The AWS Generative AI Innovation Center is already making this happen, consistently delivering production-ready solutions for government organizations.
What makes this time different
The convergence of three factors makes this technology moment different:

Mission urgency – Public sector organizations currently face the challenge of managing both growing workloads in mission-critical areas, such as veterans’ benefits claims and bridge safety inspections, and workforce and budget limitations.
Technology readiness – Production-ready AI solutions can now be deployed securely and at scale, with unprecedented compute capacity being built specifically for US government requirements.
Proven success models – Early adopters have demonstrated that rapid AI implementation is possible in government settings, creating blueprints for others to follow.

Drawing from over a thousand implementations, the Generative AI Innovation Center combines AWS infrastructure and security conformance to help you transform mission delivery.

Accelerating real-world innovation
Public sector organizations working to improve mission speed and effectiveness can collaborate with the Innovation Center to develop targeted solutions. These three case studies show this approach in action.
AI systems that support critical care to protect vulnerable children
When protecting a child’s welfare, having key information surface at exactly the right moment is crucial. Systems must work reliably, every time.
This was the challenge the Miracle Foundation faced when managing foster care caseloads globally. In the span of weeks, the Innovation Center worked alongside caseworkers to build a production AI assistant that analyzes case files, flags urgent situations, and recommends evidence-based interventions tailored to each child’s unique circumstances.
“When a caseworker misses an urgent signal in a child’s file, it can have life-changing consequences,” explains Innovation Center strategist Brittany Roush. “We were building a system that needed to surface critical information at exactly the right moment.”
The solution aims to help caseworkers make faster, more informed decisions for vulnerable children around the world. It also includes built-in enterprise-grade security, designed for scalability and delivered with comprehensive knowledge transfer so the Miracle Foundation team can fully manage and evolve their system.
It’s important to start with actual users on day one. The Miracle Foundation team interfaced directly with caseworkers to understand workflows before writing a single line of code. This user-first approach removed months of work to gather requirements and iterate through revisions.
Innovation at institutional scale
The University of Texas at Austin (UT Austin) approached the Innovation Center about personalized academic support for 52,000 students. The team delivered UT Sage, a production AI tutoring service designed by learning scientists and trained by faculty, which is now in open beta across the UT Austin campus. Unlike generic AI tools, UT Sage provides custom, course-specific support while maintaining academic integrity standards. “It’s like having a knowledgeable teaching assistant available whenever you need help,” one student reported during testing.
“The UT Sage project empowers our faculty to create personalized learning tools, designed to motivate student engagement,” said Julie Schell, Assistant Vice Provost and Director of the Office of Academic Technology. “With the potential to deploy across hundreds of courses, we are aiming to enhance learning outcomes and reduce the time and effort required to design student-centered, high-quality course materials.”
Build flexible foundations, not point solutions. The team architected UT Sage as a service that faculty could adapt to specific courses. This extensible design enabled institutional scale from day one, avoiding the trap of a successful pilot that never scales, which can plague technology projects.
Transforming government speed with the EPA
The U.S. Environmental Protection Agency (EPA) partnered with the Innovation Center to transform document processing workflows that used to take weeks or months. Together, they delivered two breakthrough solutions that demonstrate both the team’s velocity and increasing technical complexity:

Chemical risk assessment acceleration – An intelligent document processing system that evaluates research studies against predetermined scientific criteria. What once required hours of manual review by EPA scientists now takes minutes. The system achieved an 85% reduction in processing time while maintaining 85% accuracy. Processing 250 documents through the system costs the team $40, compared with roughly 500 hours of scientist time for the same work done manually.
Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA) application reviews – Automated creation of data evaluation records (DERs) from health and safety studies for pesticide applications under FIFRA. This process traditionally took EPA reviewers 4 months of manual work. The AI solution now generates these critical regulatory documents in seconds, achieving a 99% cost reduction while potentially accelerating approval timelines for safe pesticide products.

Both solutions incorporate rigorous human-in-the-loop review processes to maintain scientific integrity and regulatory compliance alignment. EPA scientists oversee AI-generated assessments, but they can now focus their expertise on analysis and decision-making rather than manual data processing.
“We’re not replacing scientific judgment,” explained an EPA team member. “We’re eliminating the tedious work so our scientists can spend more time on what matters most—protecting public health and the environment.”
The EPA cases demonstrate that AI augmentation can deliver both speed and trust. The team designed review workflows into the architecture to improve trust, making the systems immediately acceptable to scientific staff and leadership.
Strategies to increase the pace of innovation
Experts at the Innovation Center have developed several strategies to help organizations excel with generative AI. To facilitate building your own production systems and increase the pace of innovation, follow these best practices:

Build on day 1, not week 6 – Traditional projects spend months on requirements and architecture. The Innovation Center starts building immediately, using extensive libraries of reusable, secure infrastructure-as-code (IaC) components. They also use tools such as Kiro, an AI integrated development environment (IDE) that efficiently converts developer prompts into detailed specifications and working code. This approach has been refined with each engagement, meaning the team is building increasingly complex use cases faster than ever before. Access to the expanded government AI infrastructure of AWS can further accelerate this development process, so you can tackle increasingly sophisticated use cases.
Get the right people on your team – Each engagement brings together scientists, architects, security experts, and domain specialists who understand public sector missions. This cross-functional composition minimizes the typical back-and-forth that often complicates requirement gathering and refinement. Everyone who’s needed to make decisions is already in the discussion, collaboratively working toward a common goal.
Knowledge transfer happens throughout, not at the end – Don’t wait to think about technology hand-offs. Advancing a project to the next team without prior coordination is rarely an effective strategy. The deep collaboration between stakeholders working alongside Innovation Center experts happens throughout development. Knowledge transfer occurs naturally in daily collaboration, with formal documentation being handed off at the end. The Innovation Center team then continues to support in an advisory capacity until the solution goes into production.
Harness the secure and reliable infrastructure and services of AWS – For public sector organizations, moving fast can’t mean compromising on security or compliance. Every solution is architected on secure AWS infrastructure with the ability to meet even stringent Federal Risk and Authorization Management Program (FedRAMP) High requirements. The Innovation Center follows a secure-by-design approach where compliance alignment is woven into the entire development lifecycle. By making compliance alignment concurrent, not sequential, the team demonstrates that security and speed aren’t trade-offs. The upcoming expansion of the AWS government cloud infrastructure further strengthens these security and compliance capabilities, providing you with one of the most comprehensive and secure AI computing environments.

Next steps in public sector AI
Every case study in this post started with a specific, pressing challenge. Each example achieved institutional scale by delivering value quickly, not by waiting for the perfect moment. Start with one persistent operational need, deliver results in weeks, then expand. With the AWS investment of up to $50 billion in purpose-built government AI infrastructure, these transformations can now happen at even greater scale and speed. Each successful engagement creates a blueprint for the next, continuously expanding what’s possible for public sector AI.
Learn more about the AWS Generative AI Innovation Center and how they’re helping public sector organizations turn AI potential into production reality.

About the authors
Kate Zimmerman serves as the Generative AI Innovation Center Geo Leader for Worldwide Public Sector at AWS. Kate leads a team of generative AI strategists and scientists, architecting innovative solutions for public sector organizations globally. Her role combines strategic leadership with hands-on technical expertise, and she works directly with Director, VP, and C-level executives to drive GenAI adoption and deliver mission-critical outcomes. With 13+ years of experience spanning commercial cloud, defense, national security, and aerospace, Kate brings a unique perspective to driving transformative AI/ML solutions. Previously, as Chief Scientist & VP of Data and Analytics at HawkEye 360, she led 50+ developers, engineers, and scientists to launch the company’s first production AI/ML capabilities. Her tenure at AWS included leadership roles as Senior Manager & Principal Architect of the ML Solutions Lab, where she accelerated AI/ML adoption among national security customers, and Senior Solutions Architect supporting the National Reconnaissance Office. Kate also served in the USAF on active duty for 5 years developing advanced satellite systems and continues to serve as a reservist supporting strategic AI/ML initiatives with the USAF 804th Test Group.
Sri Elaprolu serves as Director of the AWS Generative AI Innovation Center, where he leverages nearly three decades of technology leadership experience to drive artificial intelligence and machine learning innovation. In this role, he leads a global team of machine learning scientists and engineers who develop and deploy advanced generative and agentic AI solutions for enterprise and government organizations facing complex business challenges. Throughout his nearly 13-year tenure at AWS, Sri has held progressively senior positions, including leadership of ML science teams that partnered with high-profile organizations such as the NFL, Cerner, and NASA. These collaborations enabled AWS customers to harness AI and ML technologies for transformative business and operational outcomes. Prior to joining AWS, he spent 14 years at Northrop Grumman, where he successfully managed product development and software engineering teams. Sri holds a Master’s degree in Engineering Science and an MBA with a concentration in general management, providing him with both the technical depth and business acumen essential for his current leadership role.

S&P Global Data integration expands Amazon Quick Research capabilities

Today, we are pleased to announce a new integration between Amazon Quick Research and S&P Global. This integration brings both S&P Global Energy news, research, and insights and S&P Global Market Intelligence data to Quick Research customers in one deep research agent.
The S&P Global integration extends the capabilities of Quick Research so that business professionals can analyze multiple data sources—including global energy news and premium financial intelligence—in one workspace, eliminating the need to switch between platforms and transforming weeks of research into minutes of focused insight generation. Quick Suite connects information across internal repositories, popular applications, AWS services, and, through Model Context Protocol (MCP) integrations, more than 1,000 apps. This agentic AI application is reshaping how work gets done by transforming how teams find insights, conduct deep research, automate tasks, visualize data, and take actions across apps.
In this post, we explore S&P Global’s data sets and the solution architecture of the integration with Quick Research.
Solution overview
S&P Global has pioneered two MCP server implementations on AWS so organizations can easily integrate trusted financial services and energy content into AI-powered workflows while maintaining the quality, security, and reliability that business leaders demand.

“Our collaboration with AWS expands how S&P Global delivers trusted intelligence through the next generation of agentic AI experiences. By working alongside leading AI companies, our goal is to ensure customers can access our trusted data and insights wherever their workflows take place.” 
– Bhavesh Dayalji, Chief AI Officer of S&P Global and CEO of Kensho.

S&P Global Energy: Comprehensive commodity and energy intelligence
The S&P Global Energy integration, now available in Amazon Quick Research, utilizes an AI Ready Data MCP server to deliver comprehensive access to commodity and energy market intelligence spanning Oil, Gas, Power, Metals, Clean Energy, Agriculture, and Shipping sectors across global markets. Built on S&P Global’s reputation as a trusted market authority, the MCP server uses hundreds of thousands of expert-created documents including analyses, commentaries, and news articles reflecting decades of industry expertise.
The solution provides a unique multi-horizon perspective, offering intelligence from daily market updates to one-year outlooks and extending to 20+ year scenario analyses. With data refreshing every 30 minutes, business leaders gain near real-time access to commodity and energy intelligence, dramatically accelerating decision velocity when exploring regulatory challenges, investment opportunities, or environmental implications.
S&P Global Market Intelligence: Trusted financial intelligence
The S&P Global Market Intelligence integration, now available in Amazon Quick Research, uses the Kensho LLM-ready API MCP server developed by Kensho, S&P Global’s AI innovation hub. This MCP server makes trusted financial data accessible through natural language queries, integrating seamlessly with Amazon Quick Research. Financial professionals can access S&P Capital IQ Financials, earnings call transcripts, company information, transactions and more, simply by asking questions.
The Kensho solution addresses a critical challenge in financial services: making vast repositories of financial data immediately accessible without requiring complex query languages or technical expertise. Engineering, product, and business teams can save significant time and resources by transforming what once required hours of data extraction into conversational queries that return precise, trusted information in seconds.
Solution architecture
S&P Global’s MCP server architecture is shown in the following diagram. When using one of the S&P integrations, traffic flows from Quick Research through an Amazon API Gateway to an Application Load Balancer, with the MCP services hosted on Amazon Elastic Kubernetes Service (Amazon EKS). The MCP server uses data hosted in Amazon S3 and Amazon Relational Database Service (Amazon RDS) for PostgreSQL for structured data, and Amazon OpenSearch Service for vector storage. This architecture delivers enterprise-ready MCP servers with defense-in-depth security, automated scaling, and comprehensive observability.

MCP is an open standard that supports seamless communication between AI agents and external data sources, tools, and services. MCP operates on a client-server architecture where MCP servers handle tool calls, which typically consist of multiple API calls, and expose business logic implementations as callable functions. This enables AI agents to discover capabilities dynamically, negotiate features, and share context securely, all of which are critical requirements for enterprise-grade applications.
S&P Global’s solution has the following key building blocks:

Automated data pipeline with Amazon Bedrock: At the heart of the solution is a Retrieval Augmented Generation (RAG) data ingestion pipeline using Amazon Bedrock (sketched in code after this list). This pipeline transforms raw market data into AI Ready Data. Documents from S&P Global’s proprietary repositories undergo preprocessing, chunking, and enrichment before being converted into vector embeddings using the Cohere Embed model hosted on Amazon Bedrock. The ingestion pipeline runs on a scheduled basis, refreshing the OpenSearch vector store every 30 minutes for near real-time access to the energy data.
Vector and semantic search: Amazon OpenSearch serves as the vector database, storing embeddings generated by Bedrock and enabling semantic search capabilities across S&P Global’s energy data. The OpenSearch vector store is optimized for high-dimensional vector operations, supporting rapid similarity searches that power the MCP servers’ ability to retrieve contextually relevant information in response to natural language queries.
Resilience and scale: This solution uses Amazon EKS to host all MCP server solutions with two production clusters enabling traffic splitting and failover capabilities. This dual-cluster approach provides continuous availability even during unexpected failures. Both the Cluster Autoscaler and Horizontal Pod Autoscaler enable dynamic scaling based on demand. The MCP servers are built with the FastMCP framework, providing high-performance HTTP endpoints that comply with the Streamable HTTP Transport specification required by the MCP protocol.
Security: Security is built in to every layer of the solution. API Gateway serves as the endpoint for MCP server access. S&P Global’s enterprise identity provider is used for OAuth authentication. API Gateway is further secured with AWS WAF (a web application firewall) with advanced threat detection. AWS IAM roles and policies enforce least privilege principles, so that each component has only the permissions it requires. AWS Secrets Manager securely stores credentials for accessing resources and AWS services. Security groups and VPC configurations provide network isolation, while TLS 1.2+ with certificates from AWS Certificate Manager keeps all data in transit encrypted. Together, these layers provide defense-in-depth security controls.
Observability: Amazon CloudWatch provides centralized logging, metrics collection, and real-time monitoring of the entire pipeline from data ingestion through MCP server responses. AWS CloudTrail captures detailed API activity logs and audit trails, essential for compliance in regulated industries.
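The ingestion pipeline described in the first bullet can be sketched roughly as follows. The Cohere Embed model ID, request and response shapes, index name, and endpoint are assumptions for illustration, not details disclosed by S&P Global, so treat this as a generic Bedrock-plus-OpenSearch pattern rather than their implementation.

import json
import boto3
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime")
search = OpenSearch(hosts=[{"host": "<opensearch-endpoint>", "port": 443}], use_ssl=True)

def embed(texts):
    # Assumed request/response shape for the Cohere Embed v3 model on Amazon Bedrock.
    response = bedrock.invoke_model(
        modelId="cohere.embed-english-v3",
        body=json.dumps({"texts": texts, "input_type": "search_document"}),
    )
    return json.loads(response["body"].read())["embeddings"]

def ingest(chunks, index="energy-insights"):
    # Index each chunk with its embedding; the index is assumed to have a knn_vector mapping.
    for chunk, vector in zip(chunks, embed(chunks)):
        search.index(index=index, body={"text": chunk, "embedding": vector})

ingest(["Crude inventories fell for a third straight week ..."])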

Conclusion
Together, these MCP servers built on AWS and integrated into Amazon Quick Research demonstrate S&P Global’s vision for the future of financial services and energy intelligence: maintaining the trust, accuracy, and depth that business leaders require while embracing the transformative potential of AI to make that intelligence more accessible, actionable, and integrated into modern workflows.
Ready to get started? Please refer to Quick Research Third Party Data for more details.

About the authors
Jon Einkauf is a Product leader at AWS based in Seattle, where he focuses on building AI-powered tools that help businesses synthesize information and accelerate research. With over a decade of experience at Amazon spanning digital health, cloud computing, and AI products, he has led cross-functional teams in product management, engineering, and design to deliver innovative solutions for customers worldwide.
Prasanth Ponnoth is an AWS Solutions Architect supporting global financial services customers, with more than 20 years of industry and technology experience in cloud migration, modernization, and building distributed systems at scale. His areas of interest are machine learning, containers/Kubernetes, and open-source technologies. At AWS, he is part of the machine learning technical field community, focusing on Amazon Bedrock, Amazon SageMaker AI, and Amazon Bedrock AgentCore services.
Brandon Pominville is a Senior Solutions Architect at AWS based in New York, where he works with global financial services customers to build secure, scalable data and AI platforms in the cloud. With over 20 years of experience across financial services, enterprise data platforms, and cloud computing, he specializes in translating business requirements into technical solutions. Outside of work, Brandon enjoys spending time with his family outdoors or on a cruise ship, and playing volleyball.

Streamline AI agent tool interactions: Connect API Gateway to AgentCore Gateway

AgentCore Gateway now supports API Gateway

As organizations explore the possibilities of agentic applications, they continue to navigate challenges of using enterprise data as context in invocation requests to large language models (LLMs) in a manner that is secure and aligned with enterprise policies. To help standardize and secure those interactions, many organizations are using the Model Context Protocol (MCP) specification, which defines how agentic applications can securely connect to data sources and tools.
While MCP has been advantageous for net new use cases, organizations also navigate challenges with bringing their existing API estate into the agentic era. MCP can certainly wrap existing APIs, but it requires additional work, translating requests from MCP to RESTful APIs, making sure security is maintained through the entire request flow, and applying the standard observability required for production deployments.
Amazon Bedrock AgentCore Gateway now supports Amazon API Gateway as a target, translating MCP requests to AgentCore Gateway into RESTful requests to API Gateway. You can now expose both new and existing API endpoints to agentic applications using MCP, with built-in security and observability. This post covers these new capabilities and shows how to implement them.
What’s new: API Gateway support in AgentCore Gateway
AgentCore Gateway now supports API Gateway targets in addition to existing target types (Lambda functions, OpenAPI schemas, Smithy models, and MCP servers).

Our customers have successfully built extensive API ecosystems using API Gateway, connecting backends across numerous applications. As enterprises advance toward next-generation agentic applications, the natural evolution is to expose these existing APIs and backend tools to AI-powered systems, enabling seamless integration between established infrastructure and modern intelligent agents.
This integration between AgentCore Gateway and API Gateway simplifies the connection between API Gateway and AgentCore Gateway. It allows you to directly target API Gateway, so that you don’t need to export API Gateway APIs as an OpenAPI 3 specification and then add it to AgentCore Gateway as an OpenAPI target.
With this integration, a new API_GATEWAY target type will be added to AgentCore Gateway, eliminating the manual export/import process. REST API owners can add their API as an AgentCore Gateway target with a few console interactions or a single CLI command to expose their existing REST API as MCP tools using AgentCore Gateway. API consumers can then connect AI agents with these REST APIs through the Model Context Protocol (MCP) and power their workflows with AI integration. Your agentic applications can now connect to your new or existing API Gateway API. This integration between AgentCore Gateway and API Gateway supports IAM and API key authorization.

Both AgentCore Gateway and API Gateway have integrations with Amazon CloudWatch Logs, AWS CloudTrail, and AWS X-Ray for observability. Agent developers using this new capability between AgentCore Gateway and API Gateway can use these observability tools.
Walkthrough
This post shows you how to set up an existing REST API with API Gateway as a target for AgentCore Gateway. With this integration you can use your existing REST APIs as a tool for your agentic applications exposed using AgentCore Gateway.
Prerequisites
For this example, you need the following:

An AWS account with an existing REST API in API Gateway.
An Identity and Access Management (IAM) role or user with enough permissions to create an AgentCore Gateway and set up an API Gateway target.

You can create gateways and add targets in multiple ways:

AWS Management Console
AWS SDK for Python (Boto3)
AWS Command Line Interface (AWS CLI)
AgentCore starter toolkit for fast and straightforward set up

This post uses Boto3 for setting up the integration between AgentCore Gateway and API Gateway. For an interactive walkthrough, you can use the Jupyter Notebook sample on GitHub.
Set up prerequisites for inbound and outbound authorization.
Inbound authorization authenticates incoming user requests. Outbound authorization helps AgentCore Gateway to securely connect to gateway targets, such as an API Gateway, on behalf of the authenticated user.

For API Gateway as a target, AgentCore Gateway supports the following types of outbound authorization:

No authorization (not recommended) – Some target types provide you the option to bypass outbound authorization. We do not recommend this less secure option.
IAM-based outbound authorization – Use the gateway service role to authorize access to the gateway target with AWS Signature Version 4 (Sig V4).
API key – Use an API key, set up using AgentCore Identity, to authorize access to the API Gateway target. API keys created in API Gateway and mapped to API Gateway usage plans help you monitor and control API usage. Refer to the API Gateway documentation for more details.

Create an IAM role with the trust policy from the documentation.

For Outbound Authorization with IAM-based authorization, the policy should include execute-api:Invoke permission. Sample inline policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "execute-api:Invoke"
            ],
            "Resource": "arn:aws:execute-api:{AWS_Region}:{AWS_Account_ID}:api-id/stage/METHOD_HTTP_VERB/resource-path",
            "Effect": "Allow"
        }
    ]
}

For API key authorization, you can create an API key (see the API Gateway documentation) and associate it with your API Gateway usage plan. Then create an API key credential provider with AgentCore Identity.

Once done, update the policy as described in the AgentCore Gateway documentation.
Create an AgentCore Gateway
When using the AgentCore starter toolkit, you can create a gateway with a default authorization configuration using Amazon Cognito for JWT-based inbound authorization.

import boto3

gateway_client = boto3.client('bedrock-agentcore-control')

auth_config = {
    "customJWTAuthorizer": {
        "allowedClients": ['<cognito_client_id>'],  # Must match the client ID configured in Cognito, for example 7rfbikfsm51j2fpaggacgng84g
        "discoveryUrl": '<cognito_oauth_discovery_url>'
    }
}

create_response = gateway_client.create_gateway(
    name='sample-ac-gateway',
    roleArn='<IAM_Role_ARN>',  # The IAM role must have permissions to create/list/get/delete gateways
    protocolType='MCP',
    protocolConfiguration={
        'mcp': {
            'supportedVersions': ['2025-03-26'],
            'searchType': 'SEMANTIC'
        }
    },
    authorizerType='CUSTOM_JWT',
    authorizerConfiguration=auth_config,
    description='AgentCore Gateway with API Gateway target'
)
print(create_response)

# Retrieve the gateway ID and URL used for gateway target creation
gatewayID = create_response["gatewayId"]
gatewayURL = create_response["gatewayUrl"]
print(gatewayID)

This returns the gateway ID (gatewayId) that you need to create the gateway target.
Create an AgentCore Gateway target

Create a target configuration
To create an API Gateway target, you need to specify the following as part of the target configuration:

toolFilters: Use this to determine which resources on the REST API are exposed as tools on the gateway. Filters also support wildcards in the filterPath.
toolOverrides (optional): Use this to override tool names and descriptions. You must specify explicit paths and methods.
restApiId: Use this to pass the API Gateway API ID.

Below are a few examples of target configurations:
Example 1
This exposes GET & POST /pets, GET /pets/{petId} to the gateway and overrides their tool names and descriptions.

{
    "mcp": {
        "apiGateway": {
            "restApiId": "<api-id>",
            "stage": "<stage>",
            "apiGatewayToolConfiguration": {
                "toolFilters": [
                    {
                        "filterPath": "/pets",
                        "methods": ["GET", "POST"]
                    },
                    {
                        "filterPath": "/pets/{petId}",
                        "methods": ["GET"]
                    }
                ],
                "toolOverrides": [
                    {
                        "name": "ListPets",
                        "path": "/pets",
                        "method": "GET",
                        "description": "Retrieves all the available Pets."
                    },
                    {
                        "name": "AddPet",
                        "path": "/pets",
                        "method": "POST",
                        "description": "Add a new pet to the available Pets."
                    },
                    {
                        "name": "GetPetById",
                        "path": "/pets/{petId}",
                        "method": "GET",
                        "description": "Retrieve a specific pet by its ID"
                    }
                ]
            }
        }
    }
}

Example 2
This exposes GET /pets as well as GET /pets/{petId} and anything else under /pets. Since toolOverrides is not specified, it uses the resource descriptions from API Gateway.

{
    "mcp": {
        "apiGateway": {
            "restApiId": "<api-id>",
            "stage": "<stage>",
            "apiGatewayToolConfiguration": {
                "toolFilters": [
                    {
                        "filterPath": "/pets/*",
                        "methods": ["GET"]
                    }
                ]
            }
        }
    }
}

Credential provider configuration
When creating a target, you also need to specify the target’s outbound authorization using a credential provider configuration. As discussed above, there are three types of credential providers:
GATEWAY_IAM_ROLE
This uses the ROLE_ARN you specified when creating the gateway. Define the credential provider configuration as follows:

[
    {
        "credentialProviderType": "GATEWAY_IAM_ROLE"
    }
]

API_KEY
This requires the creation of an API key credential provider with AgentCore Identity.

[
    {
        "credentialProviderType": "API_KEY",
        "credentialProvider": {
            "apiKeyCredentialProvider": {
                "providerArn": "<provider-arn>",
                "credentialParameterName": "x-api-key",  // optional
                "credentialPrefix": "abc",  // optional, prefix is added to the API key when sending it to the target endpoint
                "credentialLocation": "HEADER"  // optional, specifies where in the request the API key should be placed
            }
        }
    }
]

NO_AUTH
NO_AUTH can be configured by not specifying a credential provider configuration while creating the AgentCore Gateway target. This is not recommended.
Create an AgentCore Gateway target
Now configure your REST API as a gateway target:

import boto3

gateway_client = boto3.client('bedrock-agentcore-control')

create_gateway_target_response = gateway_client.create_gateway_target(
    name='api-gateway-target',
    gatewayIdentifier='<gateway_ID>',
    targetConfiguration=<your_target_configuration>,  # one of the target configuration examples above
    credentialProviderConfigurations=[<your_credential_config>]
)
print(create_gateway_target_response)

gateway_target_id = create_gateway_target_response['targetId']

Test the gateway with the Strands Agents framework
Test the gateway with the Strands Agents framework to list and call the available tools from the MCP server. You can also use other MCP-compatible agents built with different agentic frameworks.
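Because the gateway above uses a Cognito JWT authorizer for inbound authorization, the test client needs a bearer token. A hedged sketch of fetching one through the OAuth client_credentials flow follows; the Cognito domain, client ID, client secret, and scope are placeholders you must replace, and your app client must have this flow enabled. The resulting access token is what <Bearer_Token> refers to in the test code below.

import requests

# Placeholder Cognito hosted UI domain and client credentials.
TOKEN_URL = "https://<your-cognito-domain>.auth.<region>.amazoncognito.com/oauth2/token"

resp = requests.post(
    TOKEN_URL,
    data={
        "grant_type": "client_credentials",
        "client_id": "<cognito_client_id>",
        "client_secret": "<cognito_client_secret>",
        "scope": "<resource-server/scope>",
    },
)
resp.raise_for_status()
bearer_token = resp.json()["access_token"]   # use this value as <Bearer_Token> below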

# Import paths can vary by library version; these follow common Strands Agents and MCP Python SDK examples.
from mcp.client.streamable_http import streamablehttp_client
from strands import Agent
from strands.tools.mcp import MCPClient

def create_streamable_http_transport():
    return streamablehttp_client(
        gatewayURL, headers={"Authorization": f"Bearer {<Bearer_Token>}"}
    )

client = MCPClient(create_streamable_http_transport)

with client:
    # List the tools exposed through the gateway
    tools = client.list_tools_sync()
    # Create an agent with the model and tools
    agent = Agent(model=yourModel, tools=tools)  # you can replace yourModel with any model you like
    # This prompt only triggers the MCP listTools call and shows which tools the LLM can access; it does not call any tool.
    agent("Hi, can you list all tools available to you")
    # Tool calling
    agent("List all the available pets")
    agent("Tell me about the pet with petId 3")
    agent("When will my order be delivered? My order id is 2")

You will observe the following output:

I have access to the following tools:
1. **x_amz_bedrock_agentcore_search** – A search tool that returns a trimmed down list of tools based on a provided context/query
2. **api-gateway-target-1___Add_Pet** – Add a new pet to the available Pets
3. **api-gateway-target-1___GetPetById** – Retrieve a specific pet by its ID (requires petId parameter)
4. **api-gateway-target-1___List_Pets** – Retrieves all the available Pets (optional parameters: page, type)
5. **api-gateway-target-2___GetOrderById** – Retrieve a specific order by its ID (requires orderId parameter)
I’ll retrieve all the available pets for you.
Tool #1: api-gateway-target-1___List_Pets
“HTTP/1.1 200 OK”
Here are all the available pets:
1. **Pet ID 1** – Dog – $249.99
2. **Pet ID 2** – Cat – $124.99
3. **Pet ID 3** – Fish – $0.99
I’ll retrieve the details for pet ID 3.
Tool #2: api-gateway-target-1___GetPetById
“HTTP/1.1 200 OK”
Here are the details for pet ID 3:
– **Pet ID**: 3
– **Type**: Fish
– **Price**: $0.99
I’ll check the details of your order with ID 2 to see the delivery information.
Tool #3: api-gateway-target-2___GetOrderById
“HTTP/1.1 200 OK”
Based on your order details:
– **Order ID**: 2
– **Pet Category**: Cat
– **Price**: $124.99
– **Delivery Date**: 02-12-2025 (December 2nd, 2025)
Your cat order will be delivered on **December 2nd, 2025**.

Observability
Enable application logs and tracing for your AgentCore Gateway resource. You will see detailed logs to help monitor and troubleshoot your AgentCore Gateway resource. It will include the tool calls performed by your agentic application, request parameters, responses, and errors if any.
Example logs:

{
    "resource_arn": "arn:aws:bedrock-agentcore:us-west-2:<AWS_Account_Id>:gateway/sample-ac-gateway2-mgtqozexct",
    "event_timestamp": 1763621922275,
    "body": {
        "isError": false,
        "log": "Executing tool api-gateway-target-1___GetPetById from target W8BCF5VEAZ",
        "id": "3"
    },
    "account_id": "<AWS_Account_Id>",
    "request_id": "8a70f423-79ee-4168-9d68-b76ad3*****",
    "trace_id": "324a2ecc08631a55a02bb8f74104****",
    "span_id": "f58914982450ad9b",
    "timestamp": "1763621922275",
    "gateway_id": "sample-ac-gateway2-mgtqozexct"
}
{
    "resource_arn": "arn:aws:bedrock-agentcore:us-west-2:<AWS_Account_Id>:gateway/sample-ac-gateway2-mgtqozexct",
    "event_timestamp": 1763621922348,
    "body": {
        "isError": false,
        "responseBody": "{jsonrpc=2.0, id=3, result={isError=false, content=[{type=text, text={\"id\":3,\"type\":\"fish\",\"price\":0.99}}]}}",
        "log": "Successfully processed request",
        "id": "3"
    },
    "account_id": "<AWS_Account_Id>",
    "request_id": "8a70f423-79ee-4168-9d68-b76ad3ef****",
    "trace_id": "324a2ecc08631a55a02bb8f7410****",
    "span_id": "f58914982450ad9b",
    "timestamp": "1763621922348",
    "gateway_id": "sample-ac-gateway2-mgtqozexct"
}

Along with this, AgentCore Gateway offers detailed CloudWatch metrics including the usage metrics (TargetType, IngressAuthType, EgressAuthType, RequestsPerSession), invocation metrics (Invocations, ConcurrentExecutions, Sessions), performance metrics (Latency, Duration, TargetExecutionTime), and error rates (Throttles, SystemErrors, UserErrors).
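To pull these metrics programmatically, you can query CloudWatch with boto3 as in the hedged sketch below; the namespace and dimension name are placeholders, not documented values, so confirm them in the AgentCore Gateway observability documentation.

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="<AgentCore Gateway metric namespace>",   # placeholder, check the observability docs
    MetricName="Invocations",
    Dimensions=[{"Name": "GatewayId", "Value": "<Gateway_Id>"}],  # dimension name is an assumption
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Sum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])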

AgentCore Gateway also supports AWS X-Ray and OTEL conformant vended spans that customers can use to track invocations across different primitives that are being used.

To learn more, see the AgentCore Gateway Observability documentation.
Clean up
To avoid recurring charges, make sure to delete the resources created by running the following code.

import boto3

gateway_client = boto3.client('bedrock-agentcore-control')

# Delete the gateway target
target_response = gateway_client.delete_gateway_target(
    gatewayIdentifier='<Gateway_Id>',
    targetId='<Target_Id>'
)
print(target_response)

# Delete the gateway
response = gateway_client.delete_gateway(
    gatewayIdentifier='<Gateway_Id>'
)
print(response)

Conclusion
AgentCore Gateway now supports Amazon API Gateway as a target, exposing REST APIs as MCP-compatible endpoints. You can bring your existing API infrastructure to agentic use cases while using your current security and observability tools.
Visit our developer documentation and workshop to learn more and get started today.

About the authors
With over six years at AWS, Sparsh Wadhwa brings deep expertise in serverless, event-driven architectures, and generative AI to his work with ISV customers in India. As a Solutions Architect, he partners with Independent Software Vendors to reimagine their products for the cloud era—from modernizing legacy systems to embedding AI capabilities that differentiate their offerings. Sparsh believes the best solutions emerge from understanding both technical possibilities and business context.
Heeki Park is a Principal Solutions Architect at AWS. In his 9+ years at AWS, he helped enterprise customers think about how to build and operate cloud-native applications, adopt serverless and event-driven patterns, and build pragmatic generative AI applications. Heeki is an avid runner and enjoys analyzing activity data to measure improvement in cardiovascular fitness.
Dhawal Patel is a Principal Generative AI Tech lead at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to agentic AI, deep learning, and distributed computing.

Cisco Released Cisco Time Series Model: Their First Open-Weights Foundation Model based on Decoder-only Transformer Architecture

Cisco and Splunk have introduced the Cisco Time Series Model, a univariate zero shot time series foundation model designed for observability and security metrics. It is released as an open weight checkpoint on Hugging Face under an Apache 2.0 license, and it targets forecasting workloads without task specific fine tuning. The model extends TimesFM 2.0 with an explicit multiresolution architecture that fuses coarse and fine history in one context window.

https://arxiv.org/pdf/2511.19841

Why observability needs multiresolution context

Production metrics are not simple single scale signals. Weekly patterns, long term growth and saturation are visible only at coarse resolutions. Saturation events, traffic spikes and incident dynamics show up at 1 minute or 5 minute resolution. Most time series foundation models work at a single resolution with context windows between 512 and 4096 points, while TimesFM 2.5 extends this to 16384 points. For 1 minute data this still covers at most a couple of weeks and often less.

This is a problem in observability, where data platforms often retain old data only in aggregated form. Fine grained samples expire and survive only as 1 hour rollups. Cisco Time Series Model is built for this storage pattern. It treats coarse history as a first class input that improves forecasts at the fine resolution. The architecture operates directly on a multiresolution context instead of pretending that all inputs live on a single grid.

https://arxiv.org/pdf/2511.19841

Multiresolution input and forecasting objective

Formally, the model consumes a pair of contexts (x_c, x_f). The coarse context x_c and the fine context x_f each have length up to 512. The spacing of x_c is fixed at 60 times the spacing of x_f. A typical observability setup uses 512 hours of 1 hour aggregates and 512 minutes of 1 minute values. Both series terminate at the same forecast cut point. The model predicts a horizon of 128 points at the fine resolution, with a mean and a set of quantiles from 0.1 to 0.9.
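As a concrete illustration of this input format (not Cisco's preprocessing code), the following sketch builds a (coarse, fine) context pair from a single 1 minute metric series with pandas:

import numpy as np
import pandas as pd

# Roughly 40 days of a synthetic 1-minute metric.
idx = pd.date_range("2025-01-01", periods=60 * 24 * 40, freq="min")
series = pd.Series(np.random.default_rng(0).normal(size=len(idx)), index=idx)

fine = series.iloc[-512:]                             # last 512 minutes at 1-minute resolution
coarse = series.resample("60min").mean().iloc[-512:]  # last 512 hours at 1-hour resolution, 60x spacing

# Both contexts end at the same forecast cut point; the model would predict the next 128 fine steps.
print(len(fine), len(coarse), fine.index[-1], coarse.index[-1])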

Architecture, TimesFM core with resolution embeddings

Internally, Cisco Time Series Model reuses the TimesFM patch based decoder stack. The inputs are normalized, patched into non overlapping chunks, and passed through a residual embedding block. The transformer core consists of 50 decoder only layers. A final residual block maps tokens back to the horizon. The research team removes positional embeddings and instead relies on patch ordering, the multiresolution structure and a new resolution embedding to encode structure.

Two additions make the architecture multiresolution aware. A special token, often called ST in the report, is inserted between the coarse and fine token streams. It lives in sequence space and marks the boundary between resolutions. Resolution embeddings, often called RE, are added in model space. One embedding vector is used for all coarse tokens and another for all fine tokens. Ablation studies in the paper show that both components improve quality, especially in long context scenarios.

The decode procedure is also multiresolution. The model outputs mean and quantile forecasts for the fine resolution horizon. During long horizon decoding, newly predicted fine points are appended to the fine context. Aggregates of these predictions update the coarse context. This creates an autoregressive loop in which both resolutions evolve together during forecasting.
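A conceptual sketch of that decode loop follows. It is an illustration only, with a stand-in model callable and simplified aggregation alignment; the real model also carries quantile outputs.

import numpy as np

def forecast_long_horizon(model, fine, coarse, total_steps, agg=60):
    # model(coarse_ctx, fine_ctx) is assumed to return 128 fine-resolution mean predictions.
    fine, coarse, out, pending = list(fine), list(coarse), [], []
    while len(out) < total_steps:
        preds = list(model(coarse[-512:], fine[-512:]))
        out.extend(preds)
        fine.extend(preds)                    # new fine predictions extend the fine context
        pending.extend(preds)
        while len(pending) >= agg:            # every 60 new fine points become one new coarse point
            coarse.append(float(np.mean(pending[:agg])))
            pending = pending[agg:]
    return out[:total_steps]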

https://arxiv.org/pdf/2511.19841

Training data and recipe

Cisco Time Series Model is trained by continued pretraining on top of TimesFM weights. The final model has 500 million parameters. Training uses AdamW for biases, norms and embeddings, and Muon for the hidden layers, with cosine learning rate schedules. The loss combines mean squared error on the mean forecast with quantile loss over the quantiles from 0.1 to 0.9. The team trains for 20 epochs and picks the best checkpoint by validation loss.

The dataset is large and skewed toward observability. The Splunk team reports about 400 million metrics time series from their own Splunk Observability Cloud deployments, collected at 1 minute resolution over 13 months and partly aggregated to 5 minute resolution. The research team states that the final corpus contains more than 300 billion unique data points, with about 35 percent 1 minute observability, 16.5 percent 5 minute observability, 29.5 percent GIFT Eval pretraining data, 4.5 percent Chronos datasets and 14.5 percent synthetic KernelSynth series.

Benchmark results on observability and GIFT Eval

The research team evaluates the model on two main benchmarks. The first is an observability dataset derived from Splunk metrics at 1 minute and 5 minute resolution. The second is a filtered version of GIFT Eval, where datasets that leak TimesFM 2.0 training data are removed.

On observability data at 1 minute resolution with 512 fine steps, Cisco Time Series Model using a 512 multiresolution context reduces mean absolute error from 0.6265 for TimesFM 2.5 and 0.6315 for TimesFM 2.0 to 0.4788, with similar improvements in mean absolute scaled error and continuous ranked probability score. Similar gains appear at 5 minute resolution. Across both resolutions, the model outperforms Chronos 2, Chronos Bolt, Toto and AutoARIMA baselines under the normalized metrics used in the paper.

On the filtered GIFT Eval benchmark, Cisco Time Series Model matches the base TimesFM 2.0 model and performs competitively with TimesFM-2.5, Chronos-2 and Toto. The key claim is not universal dominance but preservation of general forecasting quality while adding a strong advantage on long context windows and observability workloads.

https://arxiv.org/pdf/2511.19841

Key Takeaways

Cisco Time Series Model is a univariate zero shot time series foundation model that extends the TimesFM 2.0 decoder only backbone with a multiresolution architecture for observability and security metrics.

The model consumes a multiresolution context, with a coarse series and a fine series, each up to 512 steps long, where the coarse resolution is 60 times the fine resolution, and it predicts 128 fine resolution steps with mean and quantile outputs.

Cisco Time Series Model is trained on more than 300B data points, with more than half from observability, mixing Splunk machine data, GIFT Eval, Chronos datasets and synthetic KernelSynth series, and it has about 0.5B parameters.

On observability benchmarks at 1 minute and 5 minute resolutions, the model achieves lower error than TimesFM 2.0, Chronos and other baselines, while retaining competitive performance on the general purpose GIFT Eval benchmark.


Google Colab Integrates KaggleHub for One Click Access to Kaggle Datasets, Models and Competitions

Google is closing an old gap between Kaggle and Colab. Colab now has a built-in Data Explorer that lets you search Kaggle datasets, models and competitions directly inside a notebook, then pull them in through KaggleHub without leaving the editor.

What does the Colab Data Explorer actually ship?

Kaggle recently announced the feature, describing a panel in the Colab notebook editor that connects to Kaggle search.

From this panel you can:

Search Kaggle datasets, models and competitions

Access the feature from the left toolbar in Colab

Use integrated filters to refine the results, for example by resource type or relevance

In short, the Colab Data Explorer lets you search Kaggle datasets, models and competitions directly from a Colab notebook, filter the results, and import data with a generated KaggleHub code snippet.

The old Kaggle to Colab pipeline was all setup work

Before this launch, most workflows that pulled Kaggle data into Colab followed a fixed sequence.

You created a Kaggle account, generated an API token, downloaded the kaggle.json credentials file, uploaded that file into the Colab runtime, set environment variables and then used the Kaggle API or command line interface to download datasets.

The steps were well documented and reliable. They were also mechanical and easy to misconfigure, especially for beginners who had to debug missing credentials or incorrect paths before they could even run pandas.read_csv on a file. Many tutorials exist only to explain this setup.
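As a point of reference, that flow in a Colab cell looked roughly like the sketch below; the ~/.kaggle path and CLI command reflect the commonly documented pattern, and the dataset handle is a placeholder.

# Classic Kaggle-to-Colab setup (sketch of the steps described above):
# place the manually uploaded kaggle.json where the Kaggle tools expect it.
import os, shutil

kaggle_dir = os.path.expanduser("~/.kaggle")
os.makedirs(kaggle_dir, exist_ok=True)
shutil.copy("kaggle.json", os.path.join(kaggle_dir, "kaggle.json"))
os.chmod(os.path.join(kaggle_dir, "kaggle.json"), 0o600)

# The download itself then usually runs through the Kaggle CLI in a cell:
# !kaggle datasets download -d <owner>/<dataset> --unzip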

Colab Data Explorer does not remove the need for Kaggle credentials. It changes how you reach Kaggle resources and how much code you must write before you can start analysis.

KaggleHub is the integration layer

KaggleHub is a Python library that provides a simple interface to Kaggle datasets, models and notebook outputs from Python environments.

The key properties that matter for Colab users are:

KaggleHub works in Kaggle notebooks and in external environments such as local Python and Colab

It authenticates using existing Kaggle API credentials when needed

It exposes resource-centric functions such as model_download and dataset_download, which take Kaggle identifiers and return paths or objects in the current environment

Colab Data Explorer uses this library as the loading mechanism. When you select a dataset or model in the panel, Colab shows a KaggleHub code snippet that you run inside the notebook to access that resource.

Once the snippet runs, the data is available in the Colab runtime. You can then read it with pandas, train models with PyTorch or TensorFlow or plug it into evaluation code, just as you would with any local files or data objects.
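A generated snippet typically takes roughly the following shape; the dataset handle and CSV file name are placeholders rather than values shown in the announcement.

import os
import pandas as pd
import kagglehub

# Download a Kaggle dataset by handle; kagglehub returns the local path
# where the files were placed inside the Colab runtime.
path = kagglehub.dataset_download("<owner>/<dataset>")
print("Downloaded to:", path)
print("Files:", os.listdir(path))

# From here the files behave like any local data, for example:
# df = pd.read_csv(os.path.join(path, "<some_file>.csv"))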

The post Google Colab Integrates KaggleHub for One Click Access to Kaggle Datasets, Models and Competitions appeared first on MarkTechPost.

A Coding Implementation of a Complete Hierarchical Bayesian Regression Workflow in NumPyro Using JAX-Powered Inference and Posterior Predictive Analysis

In this tutorial, we explore hierarchical Bayesian regression with NumPyro and walk through the entire workflow in a structured manner. We start by generating synthetic data, then we define a probabilistic model that captures both global patterns and group-level variations. Through each snippet, we set up inference using NUTS, analyze posterior distributions, and perform posterior predictive checks to understand how well our model captures the underlying structure. By approaching the tutorial step by step, we build an intuitive understanding of how NumPyro enables flexible, scalable Bayesian modeling. Check out the Full Codes here.

try:
    import numpyro
except ImportError:
    !pip install -q "llvmlite>=0.45.1" "numpyro[cpu]" matplotlib pandas

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import jax
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS, Predictive
from numpyro.diagnostics import hpdi

numpyro.set_host_device_count(1)

We set up our environment by installing NumPyro and importing all required libraries. We prepare JAX, NumPyro, and plotting tools so we have everything ready for Bayesian inference. As we run this cell, we ensure our Colab session is fully equipped for hierarchical modeling. Check out the Full Codes here.

def generate_data(key, n_groups=8, n_per_group=40):
    k1, k2, k3, k4 = random.split(key, 4)
    true_alpha = 1.0
    true_beta = 0.6
    sigma_alpha_g = 0.8
    sigma_beta_g = 0.5
    sigma_eps = 0.7
    group_ids = np.repeat(np.arange(n_groups), n_per_group)
    n = n_groups * n_per_group
    alpha_g = random.normal(k1, (n_groups,)) * sigma_alpha_g
    beta_g = random.normal(k2, (n_groups,)) * sigma_beta_g
    x = random.normal(k3, (n,)) * 2.0
    eps = random.normal(k4, (n,)) * sigma_eps
    a = true_alpha + alpha_g[group_ids]
    b = true_beta + beta_g[group_ids]
    y = a + b * x + eps
    df = pd.DataFrame({"y": np.array(y), "x": np.array(x), "group": group_ids})
    truth = dict(true_alpha=true_alpha, true_beta=true_beta,
                 sigma_alpha_group=sigma_alpha_g, sigma_beta_group=sigma_beta_g,
                 sigma_eps=sigma_eps)
    return df, truth

key = random.PRNGKey(0)
df, truth = generate_data(key)
x = jnp.array(df["x"].values)
y = jnp.array(df["y"].values)
groups = jnp.array(df["group"].values)
n_groups = int(df["group"].nunique())

We generate synthetic hierarchical data that mimics real-world group-level variation. We convert this data into JAX-friendly arrays so NumPyro can process it efficiently. By doing this, we lay the foundation for fitting a model that learns both global trends and group differences. Check out the Full Codes here.

def hierarchical_regression_model(x, group_idx, n_groups, y=None):
    mu_alpha = numpyro.sample("mu_alpha", dist.Normal(0.0, 5.0))
    mu_beta = numpyro.sample("mu_beta", dist.Normal(0.0, 5.0))
    sigma_alpha = numpyro.sample("sigma_alpha", dist.HalfCauchy(2.0))
    sigma_beta = numpyro.sample("sigma_beta", dist.HalfCauchy(2.0))
    with numpyro.plate("group", n_groups):
        alpha_g = numpyro.sample("alpha_g", dist.Normal(mu_alpha, sigma_alpha))
        beta_g = numpyro.sample("beta_g", dist.Normal(mu_beta, sigma_beta))
    sigma_obs = numpyro.sample("sigma_obs", dist.Exponential(1.0))
    alpha = alpha_g[group_idx]
    beta = beta_g[group_idx]
    mean = alpha + beta * x
    with numpyro.plate("data", x.shape[0]):
        numpyro.sample("y", dist.Normal(mean, sigma_obs), obs=y)

nuts = NUTS(hierarchical_regression_model, target_accept_prob=0.9)
mcmc = MCMC(nuts, num_warmup=1000, num_samples=1000, num_chains=1, progress_bar=True)
mcmc.run(random.PRNGKey(1), x=x, group_idx=groups, n_groups=n_groups, y=y)
samples = mcmc.get_samples()

We define our hierarchical regression model and launch the NUTS-based MCMC sampler. We allow NumPyro to explore the posterior space and learn parameters such as group intercepts and slopes. As this sampling completes, we obtain rich posterior distributions that reflect uncertainty at every level. Check out the Full Codes here.

def param_summary(arr):
    arr = np.asarray(arr)
    mean = arr.mean()
    lo, hi = hpdi(arr, prob=0.9)
    return mean, float(lo), float(hi)

for name in ["mu_alpha", "mu_beta", "sigma_alpha", "sigma_beta", "sigma_obs"]:
    m, lo, hi = param_summary(samples[name])
    print(f"{name}: mean={m:.3f}, HPDI=[{lo:.3f}, {hi:.3f}]")

predictive = Predictive(hierarchical_regression_model, samples, return_sites=["y"])
ppc = predictive(random.PRNGKey(2), x=x, group_idx=groups, n_groups=n_groups)
y_rep = np.asarray(ppc["y"])

group_to_plot = 0
mask = df["group"].values == group_to_plot
x_g = df.loc[mask, "x"].values
y_g = df.loc[mask, "y"].values
y_rep_g = y_rep[:, mask]

order = np.argsort(x_g)
x_sorted = x_g[order]
y_rep_sorted = y_rep_g[:, order]
y_med = np.median(y_rep_sorted, axis=0)
y_lo, y_hi = np.percentile(y_rep_sorted, [5, 95], axis=0)

plt.figure(figsize=(8, 5))
plt.scatter(x_g, y_g)
plt.plot(x_sorted, y_med)
plt.fill_between(x_sorted, y_lo, y_hi, alpha=0.3)
plt.show()

We analyze our posterior samples by computing summaries and performing posterior predictive checks. We visualize how well the model recreates observed data for a selected group. This step helps us understand how accurately our model captures the underlying generative process. Check out the Full Codes here.

alpha_g = np.asarray(samples["alpha_g"]).mean(axis=0)
beta_g = np.asarray(samples["beta_g"]).mean(axis=0)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].bar(range(n_groups), alpha_g)
axes[0].axhline(truth["true_alpha"], linestyle="--")
axes[1].bar(range(n_groups), beta_g)
axes[1].axhline(truth["true_beta"], linestyle="--")
plt.tight_layout()
plt.show()

We plot the estimated group-level intercepts and slopes to compare their learned patterns with the true values. We explore how each group behaves and how the model adapts to their differences. This final visualization brings together the complete picture of hierarchical inference.

In conclusion, we saw how NumPyro allows us to model hierarchical relationships with clarity, efficiency, and strong expressive power. We observed how the posterior results reveal meaningful global and group-specific effects, and how predictive checks validate the model's fit to the generated data. As we put everything together, we gain confidence in constructing, fitting, and interpreting hierarchical models using JAX-powered inference. This process strengthens our ability to apply Bayesian thinking to richer, more realistic datasets where multilevel structure is essential.

Check out the Full Codes here.
The post A Coding Implementation of a Complete Hierarchical Bayesian Regression Workflow in NumPyro Using JAX-Powered Inference and Posterior Predictive Analysis appeared first on MarkTechPost.