New Relic transforms productivity with generative AI on AWS

New Relic Inc. is a San Francisco-based technology company that pioneered application performance monitoring (APM) and provides comprehensive observability solutions. Serving leading customers worldwide, including major brands like Ryanair, New Relic helps organizations monitor and optimize their digital systems to deliver better customer experiences.
New Relic faced a challenge common to many rapidly growing enterprises. Its engineers were spending valuable time searching through fragmented documentation across multiple systems, with internal system queries sometimes taking more than a day. As a leading observability platform supporting thousands of customers worldwide, New Relic knew it needed a more efficient way to access and use organizational knowledge.
This challenge led to the creation of New Relic NOVA (New Relic Omnipresence Virtual Assistant): an innovative artificial intelligence (AI) tool built on Amazon Web Services (AWS). New Relic NOVA has transformed how New Relic employees access and interact with company knowledge and systems.
Working with the Generative AI Innovation Center, New Relic NOVA evolved from a knowledge assistant into a comprehensive productivity engine. New Relic NOVA is built on AWS services including Amazon Bedrock, Amazon Kendra, Amazon Simple Storage Service (Amazon S3), and Amazon DynamoDB. Through Strands Agents, New Relic NOVA provides intelligent code reviews, AI governance, and managed Model Context Protocol (MCP) services.
Amazon Bedrock is a fully managed service that provides access to leading foundation models for building generative AI applications, eliminating the need to manage infrastructure while enabling teams to customize models for their specific use cases. Through a single API, developers can experiment with and evaluate different foundation models, integrate them with enterprise systems, and build secure AI applications at scale.
The solution has reduced information search time while automating complex operational workflows. Through collaboration with the Generative AI Innovation Center, New Relic NOVA was developed into a solution that now processes over 1,000 daily queries across their organization. New Relic NOVA integrates seamlessly with Confluence, GitHub, Salesforce, Slack, and various internal systems, maintaining 80% accuracy in its responses for both knowledge-based queries and transactional tasks.
We will show how New Relic NOVA is architected using AWS services to create a scalable, intelligent assistant that goes beyond document retrieval to handle complex tasks like automated team permission requests and rate limit management. We explore the technical architecture, development journey, and key lessons learned in building an enterprise-grade AI solution that delivers measurable productivity gains at scale.
Solution overview
In designing New Relic NOVA, New Relic established several critical objectives beyond the initial goal of improving documentation search. These included maintaining data security during knowledge retrieval and achieving consistent response quality across different data sources. As shown in Figure 1, New Relic NOVA’s AWS architecture enables seamless interaction between users and various AWS services while maintaining security and scalability. The solution required a flexible framework that could evolve with the organization’s needs for both knowledge retrieval and transactional tasks. A key challenge was balancing these requirements while keeping response times under 20 seconds to maintain user engagement.

Figure 1 – Solution architecture of New Relic NOVA framework
The development team identified several potential risks early in the project. These included the possibility of exposing sensitive information through AI responses, maintaining accuracy when retrieving from multiple data sources, and ensuring system reliability at enterprise scale. Figure 2 illustrates New Relic NOVA’s detailed agent workflow, demonstrating how queries are processed and routed through various specialized agents to address user intentions. Additionally, the team implemented comprehensive security controls that included personally identifiable information (PII) detection and masking, along with a robust evaluation framework to monitor and maintain response quality.

Figure 2 – New Relic NOVA agent workflow architecture
The project also revealed opportunities for future optimization. These include expanding an agent hierarchy architecture to support additional automated workflows and developing more sophisticated analytics for tracking user interaction patterns. The team’s experience suggests that organizations undertaking similar projects should focus on establishing clear evaluation metrics early and building flexible architectures that can accommodate evolving business needs.
Solution
New Relic NOVA was developed over an eight-week period, involving a collaborative effort between internal engineering, security, legal, and compliance teams and the AWS Generative AI Innovation Center. This partnership enabled rapid development and iteration, drawing on AWS expertise in large-scale AI implementations.
Agent architecture
The New Relic NOVA architecture consists of three key layers:

Main agent layer – This acts as a controllable orchestration layer that executes different workflows by identifying the user intent and delegating work to the following downstream components:

Retrieval Augmented Generation (RAG) with customized ingested knowledge from Amazon Bedrock Knowledge Bases or Amazon Kendra.
Agents for direct interaction with third-party platforms.
Customized agents for handling internal New Relic tasks.
Fallback handling when user intent cannot be determined.

Data source layers (vector DB, enrich, data sources) – These layers represent resources where internal knowledge (for example, New Relic standards documentation and code repository documentation) is ingested for retrieval or RAG purposes. These custom resources enhance information quality and search performance for user information requests.
Agents layer – Comprises two distinct agent types:

Strands Agents with MCP: Handle multi-step processes for third-party services, leveraging MCP for standardized service interactions.
Custom action agents: Execute New Relic-specific tasks such as permission requests and service limit modifications, providing precise control over internal systems.

A central agent acts as an orchestrator, routing queries to specialized sub-agents in a delegation model where responses flow directly back to the user rather than requiring inter-agent reasoning or adjustments. Meanwhile, Strands Agents are used to efficiently manage third-party service integrations using MCP. This approach gives New Relic NOVA the best of both worlds: the orchestration model maintains flexibility for internal processes while standardizing external services through MCP, creating a scalable foundation for New Relic regarding future automation needs.
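The sketch below illustrates this delegation pattern in plain Python using the Amazon Bedrock Converse API through boto3. It is not New Relic's implementation: the sub-agent names, the routing prompt, and the model identifier are illustrative assumptions, and the sub-agents are stubbed out as placeholder functions.

# Minimal sketch of the delegation pattern described above (not New Relic's code).
# Assumes valid AWS credentials and Bedrock model access; the model ID and
# sub-agent names are hypothetical placeholders.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "amazon.nova-lite-v1:0"  # assumed model identifier

SUB_AGENTS = {
    "knowledge": lambda q: f"[RAG agent would answer: {q}]",
    "permissions": lambda q: f"[permission-request agent would handle: {q}]",
    "third_party": lambda q: f"[Strands/MCP agent would call the external service for: {q}]",
}

def classify_intent(query: str) -> str:
    """Ask the model to pick one downstream agent; fall back to 'knowledge'."""
    resp = bedrock.converse(
        modelId=MODEL_ID,
        system=[{"text": "Classify the request as one of: knowledge, permissions, third_party. "
                         "Reply with the single label only."}],
        messages=[{"role": "user", "content": [{"text": query}]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0.0},
    )
    label = resp["output"]["message"]["content"][0]["text"].strip().lower()
    return label if label in SUB_AGENTS else "knowledge"

def orchestrate(query: str) -> str:
    # The main agent only routes; the sub-agent's response flows straight back to the user.
    return SUB_AGENTS[classify_intent(query)](query)

print(orchestrate("Please grant my team read access to the billing repo"))

In this pattern the orchestrator only classifies and routes, while each sub-agent owns its own tools and returns its result directly, which keeps the main agent simple and makes new workflows straightforward to add.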
Data integration strategy
The power lies in the ability of New Relic NOVA to seamlessly integrate multiple data sources, providing a unified interface for knowledge retrieval. This approach includes:

Amazon Bedrock Knowledge Bases for Confluence: Provides direct synchronization with Confluence spaces and maintains up-to-date information.
Amazon Kendra for GitHub Enterprise: Indexes and searches GitHub repositories, providing quick access to code documentation.
Strands Agents for Salesforce and Jira: Custom agents execute SOQL and JQL queries to fetch relevant data from Salesforce and Jira, respectively.
Amazon Q Index for Slack: Uses Amazon Q Index capabilities to implement a RAG solution for Slack channel history, chosen for its rapid development potential.

A unique aspect of the data integration of New Relic NOVA is the custom document enrichment process. During ingestion, documents are enhanced with metadata, keywords, and summaries, significantly improving retrieval relevance and accuracy.
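A minimal sketch of what such an enrichment step can look like is shown below. The field names and the keyword heuristic are assumptions for illustration, not New Relic's ingestion schema, and in practice the summary would typically come from a foundation model call rather than a simple truncation.

# Hypothetical document enrichment at ingestion time (illustrative field names only).
# Keywords here come from simple term frequency; an LLM-generated summary could
# replace the placeholder truncation.
import re
from collections import Counter
from datetime import datetime, timezone

STOPWORDS = {"the", "and", "for", "with", "that", "this", "from", "are", "was"}

def enrich_document(doc_id: str, text: str, source: str) -> dict:
    words = [w.lower() for w in re.findall(r"[A-Za-z]{4,}", text)]
    keywords = [w for w, _ in Counter(w for w in words if w not in STOPWORDS).most_common(8)]
    return {
        "doc_id": doc_id,
        "source": source,
        "text": text,
        "metadata": {
            "keywords": keywords,
            "summary": text[:280],  # placeholder; an LLM-generated summary would go here
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        },
    }

print(enrich_document("kb-001", "New Relic NOVA routes permission requests to custom agents.", "confluence"))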
Using Amazon Nova models
Amazon Nova is AWS’s new generation of foundation models designed to deliver frontier intelligence with industry-leading price performance for enterprise use cases. The Amazon Nova family of models can process diverse inputs including text, images, and video, excelling in tasks from interactive chat to document analysis, while supporting advanced capabilities like RAG systems and AI agent workflows.
To optimize performance and cost-efficiency, New Relic NOVA utilizes Amazon Nova Lite and Pro models through Amazon Bedrock. These models were carefully selected to balance response quality with latency, enabling New Relic NOVA to maintain sub-20-second response times while processing complex queries. Amazon Bedrock provides access to diverse foundation model families, and its standardized framework and prompt optimization support seamless switching between models without code changes. This allows New Relic NOVA to optimize for speed with Amazon Nova Lite or switch to Amazon Nova Pro for more complex queries, while maintaining consistent performance and cost efficiency.
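A hedged sketch of this model-switching pattern with the Bedrock Converse API is shown below; the Amazon Nova model identifiers and the complexity flag are assumptions, not New Relic's routing logic.

# Sketch only: the same Converse call works for either model, so switching is a
# one-string change. Model IDs are assumed; verify against your Bedrock region.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def answer(query: str, complex_query: bool = False) -> str:
    model_id = "amazon.nova-pro-v1:0" if complex_query else "amazon.nova-lite-v1:0"
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": query}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    )
    return resp["output"]["message"]["content"][0]["text"]

print(answer("Summarize our deployment runbook in three bullets"))
print(answer("Compare last quarter's rate-limit incidents across services", complex_query=True))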
Advanced RAG implementation
New Relic NOVA employs a sophisticated RAG approach, utilizing Amazon Bedrock Knowledge Bases, Amazon Kendra, and Amazon Q Index. To maximize retrieval accuracy, New Relic NOVA implements several key optimization techniques:

Hierarchical chunking: Amazon Bedrock Knowledge Bases employs hierarchical chunking, a method proven most effective through extensive experimentation with various chunking methodologies.
Context enrichment: A custom AWS Lambda function enhances chunks during knowledge base ingestion, incorporating relevant keywords and contextual information. This process is particularly valuable for code-related content, where structural and semantic cues significantly impact retrieval performance.
Metadata integration: During knowledge base document ingestion, additional context, such as summaries, titles, authors, creation dates, and last modified dates, is appended as document metadata. This enriched metadata enhances the quality and relevance of retrieved information.
Custom document processing: For specific data sources like GitHub repositories, tailored document processing techniques are applied to preserve code structure and improve search relevance.

These techniques work in concert to optimize the RAG system within New Relic NOVA, delivering highly accurate retrieval across varied document types while minimizing development effort through existing connectors. The combination of hierarchical chunking, context enrichment, metadata integration, and custom document processing enables New Relic NOVA to provide precise, context-aware responses regardless of the data source or document format.
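As an illustration of the hierarchical chunking idea, independent of any specific AWS API, the sketch below produces small child chunks for matching while keeping a pointer to the larger parent chunk that would be returned to the model as context; the sizes and field names are arbitrary.

# Illustrative parent/child chunking sketch (not the Bedrock Knowledge Bases API):
# children are small chunks used for matching, and each keeps its larger parent
# chunk, which is what gets handed to the LLM as context.
def hierarchical_chunks(text: str, parent_chars: int = 2000, child_chars: int = 400):
    chunks = []
    for p_idx in range(0, len(text), parent_chars):
        parent = text[p_idx:p_idx + parent_chars]
        for c_idx in range(0, len(parent), child_chars):
            chunks.append({
                "child": parent[c_idx:c_idx + child_chars],  # matched against the query
                "parent": parent,                            # returned as context
                "parent_offset": p_idx,
            })
    return chunks

doc = "New Relic NOVA ingests standards documentation and repository docs. " * 50
print(len(hierarchical_chunks(doc)), "child chunks with parent context")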
Evaluation framework
New Relic NOVA implements a comprehensive evaluation framework, leveraging Amazon Bedrock foundation models for its large language model (LLM)-as-a-judge approach, along with validation datasets that combine questions, ground truth answers, and source document URLs. This evaluation framework, which can be executed on-demand in development environments, encompasses three critical metrics for system validation:

Answer accuracy measurement utilizes a 1–5 discrete scale rating system, where the LLM evaluates the generated response’s factual alignment with the established ground truth data.
Context relevance assessment uses a 1–5 scale, analyzing the retrieved context’s relevance to the user query.
Response latency tracking measures workflow performance, from initial query input to final answer generation, ensuring optimal user experience through comprehensive timing analysis.

This triple-metric evaluation approach supports detailed performance optimization across the core functionalities of the New Relic NOVA solution.
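A minimal LLM-as-a-judge sketch for the answer-accuracy metric is shown below, using the Bedrock Converse API; the prompt wording, judge model ID, and JSON output convention are assumptions rather than the production evaluation framework.

# Hypothetical judge for the 1-5 answer-accuracy metric described above.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def judge_accuracy(question: str, ground_truth: str, answer: str) -> dict:
    prompt = (
        "Rate how factually aligned the answer is with the ground truth on a 1-5 scale "
        "(5 = fully consistent). Respond only with JSON like {\"score\": 4, \"reason\": \"...\"}.\n"
        f"Question: {question}\nGround truth: {ground_truth}\nAnswer: {answer}"
    )
    resp = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # assumed judge model identifier
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 200, "temperature": 0.0},
    )
    raw = resp["output"]["message"]["content"][0]["text"]
    return json.loads(raw)  # a production framework would validate and repair malformed output

print(judge_accuracy("What is NOVA?", "An internal AI assistant built on AWS.",
                     "NOVA is New Relic's internal AI assistant running on AWS."))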
Observability and continuous improvements
The solution includes a comprehensive observability framework that collects metrics and analyzes user feedback. Metric and feedback collection is implemented through New Relic AI monitoring solutions. Feedback is gathered through the Slack reaction feature (emoji responses), which lets users quickly rate New Relic NOVA responses. These reactions are captured by a New Relic Python agent and sent to https://one.newrelic.com/. The feedback collection system provides valuable insights for:

Measuring user satisfaction with responses.
Identifying areas where accuracy can be improved.
Understanding usage patterns across different teams.
Tracking the effectiveness of different types of queries.
Monitoring the performance of various data sources.
Tracing each LLM call and latency.

The collected feedback data can be analyzed using AWS analytics services such as AWS Glue for ETL processing, Amazon Athena for querying, and Amazon QuickSight for visualization. This data-driven approach enables continuous improvement of New Relic NOVA and helps prioritize future enhancements based on actual user interactions.
Internal teams are already experiencing the advantages of New Relic NOVA. Figure 3 showcases some of the responses captured by the Slack feedback process.

Figure 3 – Users’ Slack message exchanges about their New Relic NOVA experience
Considerations and next steps
The success of New Relic NOVA highlights several key learnings for organizations looking to implement similar solutions:

Start with a clear understanding of user pain points and measurable success criteria.
Implement robust data integration strategies with custom document enrichment.
Use the generative AI services and foundation models that best fit your use cases to achieve optimal results.
Build in feedback mechanisms from the start to enable continuous improvement.
Focus on both speed and accuracy to ensure user adoption.

In terms of next steps, New Relic NOVA is evolving from a standalone solution into a comprehensive enterprise AI platform by integrating cutting-edge AWS technologies and open-source frameworks. In the future, New Relic anticipates leveraging Amazon S3 Vectors, which offers up to 90% cost reduction for vector storage and querying compared to conventional approaches, enabling massive-scale AI workloads to be handled more efficiently. New Relic is also looking to explore Amazon Bedrock AgentCore for enterprise-grade security, memory management, and scalable AI agent deployment, supporting robust production capabilities.
Additionally, New Relic is exploring Strands Agent Workflows, an open-source SDK that streamlines building AI agents from simple conversational assistants to complex autonomous workflows. This technology stack positions New Relic NOVA to deliver enterprise-ready AI solutions that scale seamlessly while maintaining cost efficiency and developer productivity.
Conclusion
The journey of creating New Relic NOVA demonstrates how enterprises can use AWS generative AI services to transform organizational productivity. Through the integration of Amazon Bedrock, Amazon Kendra, and other AWS services, New Relic created an AI assistant that transforms its internal operations. Working with the AWS Generative AI Innovation Center, New Relic achieved a 95% reduction in information search time across the organization while automating complex operational workflows.
Learn more about transforming your business with generative AI by visiting the Generative AI Innovation Center, or speak with an AWS Partner Specialist or AWS Representative to learn how we can help accelerate your business.
Further reading

Building generative AI applications on AWS – AWS Classroom Training
Generative AI Lens – AWS Well-Architected Framework – Gain a deep understanding of how to design, deploy, and operate generative AI applications on AWS effectively
Build an end-to-end RAG solution using Amazon Bedrock Knowledge Bases and AWS CloudFormation
Open Protocols for Agent Interoperability Part 1: Inter-Agent Communication on MCP

About the authors
Yicheng Shen is a lead software engineer for New Relic NOVA, where he focuses on developing gen AI and agentic solutions that transform how businesses understand their application performance. When he’s not building intelligent systems, you’ll find him exploring the outdoors with his family and their dog.
Sarathy Varadarajan, Senior Director of Engineering at New Relic, drives AI-first transformation and developer productivity, aiming for tenfold gains via intelligent automation and enterprise AI. He scaled engineering teams from 15 to over 350 in Bangalore and Hyderabad. He enjoys family time and volleyball.
Joe King is an AWS Senior Data Scientist at the Generative AI Innovation Center, where he helps organizations architect and implement cutting-edge generative AI solutions. With deep expertise in science, engineering, and AI/ML architecture, he specializes in transforming complex generative AI use cases into scalable solutions on AWS.
Priyashree Roy is an AWS data scientist at the Generative AI Innovation Center, where she applies her deep expertise in machine learning and generative AI to build cutting-edge solutions for AWS strategic customers. With a PhD in experimental particle physics, she brings a rigorous scientific approach to solving complex real-world problems through advanced AI technologies.
Gene Su is an AWS Data Scientist at the Generative AI Innovation Center, specializing in generative AI solutions for finance, retail, and other industries. He uses his expertise in large language models (LLMs) to deliver generative AI applications on AWS.
Dipanshu Jain is a generative AI Strategist at AWS, helping unlock the potential of gen AI through strategic advisory and tailored solution development. He specializes in identifying high-impact generative AI use cases, shaping execution roadmaps, and guiding cross-functional teams through proofs of concept, from discovery to production.
Ameer Hakme is an AWS Solutions Architect who collaborates with Independent Software Vendors (ISVs) in the Northeast region, assisting in designing and building scalable and modern platforms on the AWS Cloud. An expert in AI/ML and generative AI, Ameer helps customers unlock the potential of these cutting-edge technologies. In his leisure time, he enjoys riding his motorcycle and spending quality time with his family.

How to Design Production-Grade Mock Data Pipelines Using Polyfactory with Dataclasses, Pydantic, Attrs, and Nested Models

In this tutorial, we walk through an advanced, end-to-end exploration of Polyfactory, focusing on how we can generate rich, realistic mock data directly from Python type hints. We start by setting up the environment and progressively build factories for data classes, Pydantic models, and attrs-based classes, while demonstrating customization, overrides, calculated fields, and the generation of nested objects. As we move through each snippet, we show how we can control randomness, enforce constraints, and model real-world structures, making this tutorial directly applicable to testing, prototyping, and data-driven development workflows. Check out the FULL CODES here.

import subprocess
import sys

def install_package(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

packages = [
    "polyfactory",
    "pydantic",
    "email-validator",
    "faker",
    "msgspec",
    "attrs"
]

for package in packages:
    try:
        install_package(package)
        print(f"✓ Installed {package}")
    except Exception as e:
        print(f"✗ Failed to install {package}: {e}")

print("\n")

print("=" * 80)
print("SECTION 2: Basic Dataclass Factories")
print("=" * 80)

from dataclasses import dataclass
from typing import List, Optional
from datetime import datetime, date
from uuid import UUID
from polyfactory.factories import DataclassFactory

@dataclass
class Address:
    street: str
    city: str
    country: str
    zip_code: str

@dataclass
class Person:
    id: UUID
    name: str
    email: str
    age: int
    birth_date: date
    is_active: bool
    address: Address
    phone_numbers: List[str]
    bio: Optional[str] = None

class PersonFactory(DataclassFactory[Person]):
    pass

person = PersonFactory.build()
print("Generated Person:")
print(f"  ID: {person.id}")
print(f"  Name: {person.name}")
print(f"  Email: {person.email}")
print(f"  Age: {person.age}")
print(f"  Address: {person.address.city}, {person.address.country}")
print(f"  Phone Numbers: {person.phone_numbers[:2]}")
print()

people = PersonFactory.batch(5)
print(f"Generated {len(people)} people:")
for i, p in enumerate(people, 1):
    print(f"  {i}. {p.name} - {p.email}")
print("\n")

We set up the environment and ensure all required dependencies are installed. We also introduce the core idea of using Polyfactory to generate mock data from type hints. By initializing the basic dataclass factories, we establish the foundation for all subsequent examples.

print("=" * 80)
print("SECTION 3: Customizing Factory Behavior")
print("=" * 80)

from faker import Faker
from polyfactory.fields import Use, Ignore

@dataclass
class Employee:
    employee_id: str
    full_name: str
    department: str
    salary: float
    hire_date: date
    is_manager: bool
    email: str
    internal_notes: Optional[str] = None

class EmployeeFactory(DataclassFactory[Employee]):
    __faker__ = Faker(locale="en_US")
    __random_seed__ = 42

    @classmethod
    def employee_id(cls) -> str:
        return f"EMP-{cls.__random__.randint(10000, 99999)}"

    @classmethod
    def full_name(cls) -> str:
        return cls.__faker__.name()

    @classmethod
    def department(cls) -> str:
        departments = ["Engineering", "Marketing", "Sales", "HR", "Finance"]
        return cls.__random__.choice(departments)

    @classmethod
    def salary(cls) -> float:
        return round(cls.__random__.uniform(50000, 150000), 2)

    @classmethod
    def email(cls) -> str:
        return cls.__faker__.company_email()

employees = EmployeeFactory.batch(3)
print("Generated Employees:")
for emp in employees:
    print(f"  {emp.employee_id}: {emp.full_name}")
    print(f"    Department: {emp.department}")
    print(f"    Salary: ${emp.salary:,.2f}")
    print(f"    Email: {emp.email}")
    print()
print()

print("=" * 80)
print("SECTION 4: Field Constraints and Calculated Fields")
print("=" * 80)

@dataclass
class Product:
    product_id: str
    name: str
    description: str
    price: float
    discount_percentage: float
    stock_quantity: int
    final_price: Optional[float] = None
    sku: Optional[str] = None

class ProductFactory(DataclassFactory[Product]):
    @classmethod
    def product_id(cls) -> str:
        return f"PROD-{cls.__random__.randint(1000, 9999)}"

    @classmethod
    def name(cls) -> str:
        adjectives = ["Premium", "Deluxe", "Classic", "Modern", "Eco"]
        nouns = ["Widget", "Gadget", "Device", "Tool", "Appliance"]
        return f"{cls.__random__.choice(adjectives)} {cls.__random__.choice(nouns)}"

    @classmethod
    def price(cls) -> float:
        return round(cls.__random__.uniform(10.0, 1000.0), 2)

    @classmethod
    def discount_percentage(cls) -> float:
        return round(cls.__random__.uniform(0, 30), 2)

    @classmethod
    def stock_quantity(cls) -> int:
        return cls.__random__.randint(0, 500)

    @classmethod
    def build(cls, **kwargs):
        instance = super().build(**kwargs)
        if instance.final_price is None:
            instance.final_price = round(
                instance.price * (1 - instance.discount_percentage / 100), 2
            )
        if instance.sku is None:
            name_part = instance.name.replace(" ", "-").upper()[:10]
            instance.sku = f"{instance.product_id}-{name_part}"
        return instance

products = ProductFactory.batch(3)
print("Generated Products:")
for prod in products:
    print(f"  {prod.sku}")
    print(f"    Name: {prod.name}")
    print(f"    Price: ${prod.price:.2f}")
    print(f"    Discount: {prod.discount_percentage}%")
    print(f"    Final Price: ${prod.final_price:.2f}")
    print(f"    Stock: {prod.stock_quantity} units")
    print()
print()

We customize factory behavior with Faker integration, seeded randomness, and per-field generator methods, and we add calculated fields such as final prices and SKUs in an overridden build step. This shows how we shape individual field values while Polyfactory still fills in the rest automatically from type hints.

print("=" * 80)
print("SECTION 6: Complex Nested Structures")
print("=" * 80)

from enum import Enum

class OrderStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    SHIPPED = "shipped"
    DELIVERED = "delivered"
    CANCELLED = "cancelled"

@dataclass
class OrderItem:
    product_name: str
    quantity: int
    unit_price: float
    total_price: Optional[float] = None

@dataclass
class ShippingInfo:
    carrier: str
    tracking_number: str
    estimated_delivery: date

@dataclass
class Order:
    order_id: str
    customer_name: str
    customer_email: str
    status: OrderStatus
    items: List[OrderItem]
    order_date: datetime
    shipping_info: Optional[ShippingInfo] = None
    total_amount: Optional[float] = None
    notes: Optional[str] = None

class OrderItemFactory(DataclassFactory[OrderItem]):
    @classmethod
    def product_name(cls) -> str:
        products = ["Laptop", "Mouse", "Keyboard", "Monitor", "Headphones",
                    "Webcam", "USB Cable", "Phone Case", "Charger", "Tablet"]
        return cls.__random__.choice(products)

    @classmethod
    def quantity(cls) -> int:
        return cls.__random__.randint(1, 5)

    @classmethod
    def unit_price(cls) -> float:
        return round(cls.__random__.uniform(5.0, 500.0), 2)

    @classmethod
    def build(cls, **kwargs):
        instance = super().build(**kwargs)
        if instance.total_price is None:
            instance.total_price = round(instance.quantity * instance.unit_price, 2)
        return instance

class ShippingInfoFactory(DataclassFactory[ShippingInfo]):
    @classmethod
    def carrier(cls) -> str:
        carriers = ["FedEx", "UPS", "DHL", "USPS"]
        return cls.__random__.choice(carriers)

    @classmethod
    def tracking_number(cls) -> str:
        return "".join(cls.__random__.choices("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ", k=12))

class OrderFactory(DataclassFactory[Order]):
    @classmethod
    def order_id(cls) -> str:
        return f"ORD-{datetime.now().year}-{cls.__random__.randint(100000, 999999)}"

    @classmethod
    def items(cls) -> List[OrderItem]:
        return OrderItemFactory.batch(cls.__random__.randint(1, 5))

    @classmethod
    def build(cls, **kwargs):
        instance = super().build(**kwargs)
        if instance.total_amount is None:
            instance.total_amount = round(sum(item.total_price for item in instance.items), 2)
        if instance.shipping_info is None and instance.status in [OrderStatus.SHIPPED, OrderStatus.DELIVERED]:
            instance.shipping_info = ShippingInfoFactory.build()
        return instance

orders = OrderFactory.batch(2)
print("Generated Orders:")
for order in orders:
    print(f"\n  Order {order.order_id}")
    print(f"  Customer: {order.customer_name} ({order.customer_email})")
    print(f"  Status: {order.status.value}")
    print(f"  Items ({len(order.items)}):")
    for item in order.items:
        print(f"    - {item.quantity}x {item.product_name} @ ${item.unit_price:.2f} = ${item.total_price:.2f}")
    print(f"  Total: ${order.total_amount:.2f}")
    if order.shipping_info:
        print(f"  Shipping: {order.shipping_info.carrier} - {order.shipping_info.tracking_number}")
print("\n")

We build more complex domain logic by introducing calculated and dependent fields within factories. We show how we can derive values such as final prices, totals, and shipping details after object creation. This allows us to model realistic business rules directly inside our test data generators.

print("=" * 80)
print("SECTION 7: Attrs Integration")
print("=" * 80)

import attrs
from polyfactory.factories.attrs_factory import AttrsFactory

@attrs.define
class BlogPost:
    title: str
    author: str
    content: str
    views: int = 0
    likes: int = 0
    published: bool = False
    published_at: Optional[datetime] = None
    tags: List[str] = attrs.field(factory=list)

class BlogPostFactory(AttrsFactory[BlogPost]):
    @classmethod
    def title(cls) -> str:
        templates = [
            "10 Tips for {}",
            "Understanding {}",
            "The Complete Guide to {}",
            "Why {} Matters",
            "Getting Started with {}"
        ]
        topics = ["Python", "Data Science", "Machine Learning", "Web Development", "DevOps"]
        template = cls.__random__.choice(templates)
        topic = cls.__random__.choice(topics)
        return template.format(topic)

    @classmethod
    def content(cls) -> str:
        return " ".join(Faker().sentences(nb=cls.__random__.randint(3, 8)))

    @classmethod
    def views(cls) -> int:
        return cls.__random__.randint(0, 10000)

    @classmethod
    def likes(cls) -> int:
        return cls.__random__.randint(0, 1000)

    @classmethod
    def tags(cls) -> List[str]:
        all_tags = ["python", "tutorial", "beginner", "advanced", "guide",
                    "tips", "best-practices", "2024"]
        return cls.__random__.sample(all_tags, k=cls.__random__.randint(2, 5))

posts = BlogPostFactory.batch(3)
print("Generated Blog Posts:")
for post in posts:
    print(f"\n  '{post.title}'")
    print(f"  Author: {post.author}")
    print(f"  Views: {post.views:,} | Likes: {post.likes:,}")
    print(f"  Published: {post.published}")
    print(f"  Tags: {', '.join(post.tags)}")
    print(f"  Preview: {post.content[:100]}...")
print("\n")

print("=" * 80)
print("SECTION 8: Building with Specific Overrides")
print("=" * 80)

custom_person = PersonFactory.build(
    name="Alice Johnson",
    age=30,
    email="alice@example.com"
)
print("Custom Person:")
print(f"  Name: {custom_person.name}")
print(f"  Age: {custom_person.age}")
print(f"  Email: {custom_person.email}")
print(f"  ID (auto-generated): {custom_person.id}")
print()

vip_customers = PersonFactory.batch(
    3,
    bio="VIP Customer"
)
print("VIP Customers:")
for customer in vip_customers:
    print(f"  {customer.name}: {customer.bio}")
print("\n")

We extend Polyfactory usage to attrs-based classes and show how to override specific fields at build time while letting the remaining values stay randomized. It ensures our mock data remains compatible with real application schemas while still being easy to pin down for targeted test cases.

print("=" * 80)
print("SECTION 9: Field-Level Control with Use and Ignore")
print("=" * 80)

from polyfactory.fields import Use, Ignore

@dataclass
class Configuration:
    app_name: str
    version: str
    debug: bool
    created_at: datetime
    api_key: str
    secret_key: str

class ConfigFactory(DataclassFactory[Configuration]):
    app_name = Use(lambda: "MyAwesomeApp")
    version = Use(lambda: "1.0.0")
    debug = Use(lambda: False)

    @classmethod
    def api_key(cls) -> str:
        return f"api_key_{''.join(cls.__random__.choices('0123456789abcdef', k=32))}"

    @classmethod
    def secret_key(cls) -> str:
        return f"secret_{''.join(cls.__random__.choices('0123456789abcdef', k=64))}"

configs = ConfigFactory.batch(2)
print("Generated Configurations:")
for config in configs:
    print(f"  App: {config.app_name} v{config.version}")
    print(f"  Debug: {config.debug}")
    print(f"  API Key: {config.api_key[:20]}...")
    print(f"  Created: {config.created_at}")
    print()
print()

print("=" * 80)
print("SECTION 10: Model Coverage Testing")
print("=" * 80)

from pydantic import BaseModel, ConfigDict
from typing import Union
from polyfactory.factories.pydantic_factory import ModelFactory  # needed for Pydantic-based factories

class PaymentMethod(BaseModel):
    model_config = ConfigDict(use_enum_values=True)
    type: str
    card_number: Optional[str] = None
    bank_name: Optional[str] = None
    verified: bool = False

class PaymentMethodFactory(ModelFactory[PaymentMethod]):
    __model__ = PaymentMethod

payment_methods = [
    PaymentMethodFactory.build(type="card", card_number="4111111111111111"),
    PaymentMethodFactory.build(type="bank", bank_name="Chase Bank"),
    PaymentMethodFactory.build(verified=True),
]

print("Payment Method Coverage:")
for i, pm in enumerate(payment_methods, 1):
    print(f"  {i}. Type: {pm.type}")
    if pm.card_number:
        print(f"     Card: {pm.card_number}")
    if pm.bank_name:
        print(f"     Bank: {pm.bank_name}")
    print(f"     Verified: {pm.verified}")
print("\n")

print("=" * 80)
print("TUTORIAL SUMMARY")
print("=" * 80)
print("""
This tutorial covered:

1. ✓ Basic Dataclass Factories - Simple mock data generation
2. ✓ Custom Field Generators - Controlling individual field values
3. ✓ Field Constraints - Using PostGenerated for calculated fields
4. ✓ Pydantic Integration - Working with validated models
5. ✓ Complex Nested Structures - Building related objects
6. ✓ Attrs Support - Alternative to dataclasses
7. ✓ Build Overrides - Customizing specific instances
8. ✓ Use and Ignore - Explicit field control
9. ✓ Coverage Testing - Ensuring comprehensive test data

Key Takeaways:
- Polyfactory automatically generates mock data from type hints
- Customize generation with classmethods and decorators
- Supports multiple libraries: dataclasses, Pydantic, attrs, msgspec
- Use PostGenerated for calculated/dependent fields
- Override specific values while keeping others random
- Perfect for testing, development, and prototyping

For more information:
- Documentation: https://polyfactory.litestar.dev/
- GitHub: https://github.com/litestar-org/polyfactory
""")
print("=" * 80)

We cover advanced usage patterns such as explicit overrides, constant field values, and coverage testing scenarios. We show how we can intentionally construct edge cases and variant instances for robust testing. This final step ties everything together by demonstrating how Polyfactory supports comprehensive and production-grade test data strategies.

In conclusion, we demonstrated how Polyfactory enables us to create comprehensive, flexible test data with minimal boilerplate while still retaining fine-grained control over every field. We showed how to handle simple entities, complex nested structures, and Pydantic model validation, as well as explicit field overrides, within a single, consistent factory-based approach. Overall, we found that Polyfactory enables us to move faster and test more confidently, as it reliably generates realistic datasets that closely mirror production-like scenarios without sacrificing clarity or maintainability.

Check out the FULL CODES here.
The post How to Design Production-Grade Mock Data Pipelines Using Polyfactory with Dataclasses, Pydantic, Attrs, and Nested Models appeared first on MarkTechPost.

ByteDance Releases Protenix-v1: A New Open-Source Model Achieving AF3-Level Performance in Biomolecular Structure Prediction

How close can an open model get to AlphaFold3-level accuracy when it matches training data, model scale and inference budget? ByteDance has introduced Protenix-v1, a comprehensive AlphaFold3 (AF3) reproduction for biomolecular structure prediction, released with code and model parameters under Apache 2.0. The model targets AF3-level performance across protein, DNA, RNA and ligand structures while keeping the entire stack open and extensible for research and production.

The core release also ships with PXMeter v1.0.0, an evaluation toolkit and dataset suite for transparent benchmarking on more than 6k complexes with time-split and domain-specific subsets.

What is Protenix-v1?

Protenix is described as 'Protenix: Protein + X', a foundation model for high-accuracy biomolecular structure prediction. It predicts all-atom 3D structures for complexes that can include:

Proteins

Nucleic acids (DNA and RNA)

Small-molecule ligands

The research team defines Protenix as a comprehensive AF3 reproduction. It re-implements the AF3-style diffusion architecture for all-atom complexes and exposes it in a trainable PyTorch codebase.

The project is released as a full stack:

Training and inference code

Pre-trained model weights

Data and MSA pipelines

A browser-based Protenix Web Server for interactive use

AF3-level performance under matched constraints

According to the research team, Protenix-v1 (protenix_base_default_v1.0.0) is 'the first fully open-source model that outperforms AlphaFold3 across diverse benchmark sets while adhering to the same training data cutoff, model scale, and inference budget as AlphaFold3.'

The important constraints are:

Training data cutoff: 2021-09-30, aligned with AF3’s PDB cutoff.

Model scale: Protenix-v1 itself has 368M parameters; AF3 scale is matched but not disclosed.

Inference budget: comparisons use similar sampling budgets and runtime constraints.

https://github.com/bytedance/Protenix

On challenging targets such as antigen–antibody complexes, increasing the number of sampled candidates from several to hundreds yields consistent log-linear improvements in accuracy. This gives a clear and documented inference-time scaling behavior rather than a single fixed operating point.

PXMeter v1.0.0: Evaluation for 6k+ complexes

To support these claims, the research team released PXMeter v1.0.0, an open-source toolkit for reproducible structure prediction benchmarks.

PXMeter provides:

A manually curated benchmark dataset, with non-biological artifacts and problematic entries removed

Time-split and domain-specific subsets (for example, antibody–antigen, protein–RNA, ligand complexes)

A unified evaluation framework that computes metrics such as complex LDDT and DockQ across models

The associated PXMeter research paper, 'Revisiting Structure Prediction Benchmarks with PXMeter', evaluates Protenix, AlphaFold3, Boltz-1 and Chai-1 on the same curated tasks, and shows how different dataset designs affect model ranking and perceived performance.

How does Protenix fit into the broader stack?

Protenix is part of a small ecosystem of related projects:

PXDesign: a binder design suite built on the Protenix foundation model. It reports 20–73% experimental hit rates and 2–6× higher success than methods such as AlphaProteo and RFdiffusion, and is accessible via the Protenix Server.

Protenix-Dock: a classical protein–ligand docking framework that uses empirical scoring functions rather than deep nets, tuned for rigid docking tasks.

Protenix-Mini and follow-on work such as Protenix-Mini+: lightweight variants that reduce inference cost using architectural compression and few-step diffusion samplers, while keeping accuracy within a few percent of the full model on standard benchmarks.

Together, these components cover structure prediction, docking, and design, and share interfaces and formats, which simplifies integration into downstream pipelines.

Key Takeaways

AF3-class, fully open model: Protenix-v1 is an AF3-style all-atom biomolecular structure predictor with open code and weights under Apache 2.0, targeting proteins, DNA, RNA and ligands.

Strict AF3 alignment for fair comparison: Protenix-v1 matches AlphaFold3 on critical axes: training data cutoff (2021-09-30), model scale class and comparable inference budget, enabling fair AF3-level performance claims.

Transparent benchmarking with PXMeter v1.0.0: PXMeter provides a curated benchmark suite over 6k+ complexes with time-split and domain-specific subsets plus unified metrics (for example, complex LDDT, DockQ) for reproducible evaluation.

Verified inference-time scaling behavior: Protenix-v1 shows log-linear accuracy gains as the number of sampled candidates increases, giving a documented latency–accuracy trade-off rather than a single fixed operating point.

Check out the Repo and try it here.
The post ByteDance Releases Protenix-v1: A New Open-Source Model Achieving AF3-Level Performance in Biomolecular Structure Prediction appeared first on MarkTechPost.

Google AI Introduces PaperBanana: An Agentic Framework that Automates Publication Ready Methodology Diagrams and Statistical Plots

Generating publication-ready illustrations is a labor-intensive bottleneck in the research workflow. While AI scientists can now handle literature reviews and code, they struggle to visually communicate complex discoveries. A research team from Google and Peking University introduces a new framework called 'PaperBanana', which changes that by using a multi-agent system to automate high-quality academic diagrams and plots.

https://dwzhu-pku.github.io/PaperBanana/

5 Specialized Agents: The Architecture

PaperBanana does not rely on a single prompt. It orchestrates a collaborative team of 5 agents to transform raw text into professional visuals.


Phase 1: Linear Planning

Retriever Agent: Identifies the 10 most relevant reference examples from a database to guide the style and structure.

Planner Agent: Translates technical methodology text into a detailed textual description of the target figure.

Stylist Agent: Acts as a design consultant to ensure the output matches the “NeurIPS Look” using specific color palettes and layouts.

Phase 2: Iterative Refinement

Visualizer Agent: Transforms the description into a visual output. For diagrams, it uses image models like Nano-Banana-Pro. For statistical plots, it writes executable Python Matplotlib code.

Critic Agent: Inspects the generated image against the source text to find factual errors or visual glitches. It provides feedback for 3 rounds of refinement.
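A schematic sketch of this generate-critique-refine loop is shown below; the generate_figure and critique bodies are placeholders standing in for the Visualizer and Critic agents, not PaperBanana's actual implementation.

# Schematic 3-round refinement loop; agent behavior is stubbed out for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Feedback:
    ok: bool
    notes: str

def generate_figure(description: str, feedback: Optional[Feedback]) -> str:
    # Stand-in for the Visualizer agent (image model or Matplotlib code generation).
    hint = f" (revised per: {feedback.notes})" if feedback else ""
    return f"<figure for '{description}'{hint}>"

def critique(figure: str, source_text: str) -> Feedback:
    # Stand-in for the Critic agent; a real critic compares the figure against the methodology text.
    return Feedback(ok="revised" in figure, notes="label the data-flow arrows explicitly")

def refine(description: str, source_text: str, rounds: int = 3) -> str:
    figure = generate_figure(description, None)
    for _ in range(rounds):
        feedback = critique(figure, source_text)
        if feedback.ok:
            break
        figure = generate_figure(description, feedback)
    return figure

print(refine("two-tower retrieval architecture", "methodology text describing the model"))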

Beating the NeurIPS 2025 Benchmark

https://dwzhu-pku.github.io/PaperBanana/

The research team introduced PaperBananaBench, a dataset of 292 test cases curated from actual NeurIPS 2025 publications. Using a VLM-as-a-Judge approach, they compared PaperBanana against leading baselines.

Improvement over baseline, by metric:
Overall Score: +17.0%
Conciseness: +37.2%
Readability: +12.9%
Aesthetics: +6.6%
Faithfulness: +2.8%

The system excels in ‘Agent & Reasoning’ diagrams, achieving a 69.9% overall score. It also provides an automated ‘Aesthetic Guideline’ that favors ‘Soft Tech Pastels’ over harsh primary colors.

Statistical Plots: Code vs. Image

Statistical plots require numerical precision that standard image models often lack. PaperBanana solves this by having the Visualizer Agent write code instead of drawing pixels.

Image Generation: Excels in aesthetics but often suffers from ‘numerical hallucinations’ or repeated elements.

Code-Based Generation: Ensures 100% data fidelity by using the Matplotlib library to render the final plot.
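As a simple illustration of the code-based approach (not PaperBanana's output), the Matplotlib sketch below renders a plot from explicit numbers, so every value in the figure is exactly the value in the data; the numbers themselves are dummies.

# Illustration only: deterministic plot rendering from explicit data with Matplotlib.
import matplotlib.pyplot as plt

rounds = [1, 2, 3, 4]
accuracy = [0.61, 0.68, 0.74, 0.79]  # dummy values purely for illustration

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(rounds, accuracy, marker="o", color="#7fa8c9")  # soft pastel, in the spirit of the style guide
ax.set_xlabel("Refinement round")
ax.set_ylabel("Accuracy")
ax.set_title("Example statistical plot")
for x, y in zip(rounds, accuracy):
    ax.annotate(f"{y:.2f}", (x, y), textcoords="offset points", xytext=(0, 6), ha="center")
fig.tight_layout()
fig.savefig("example_plot.png", dpi=200)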

Domain-Specific Aesthetic Preferences in AI Research

According to the PaperBanana style guide, aesthetic choices often shift based on the research domain to match the expectations of different scholarly communities.

Agent & Reasoning – Visual 'vibe': illustrative, narrative, "friendly". Key design elements: 2D vector robots, human avatars, emojis, and "user interface" aesthetics (chat bubbles, document icons).
Computer Vision & 3D – Visual 'vibe': spatial, dense, geometric. Key design elements: camera cones (frustums), ray lines, point clouds, and RGB color coding for axis correspondence.
Generative & Learning – Visual 'vibe': modular, flow-oriented. Key design elements: 3D cuboids for tensors, matrix grids, and "zone" strategies using light pastel fills to group logic.
Theory & Optimization – Visual 'vibe': minimalist, abstract, "textbook". Key design elements: graph nodes (circles), manifolds (planes), and a restrained grayscale palette with single highlight colors.

Comparison of Visualization Paradigms

For statistical plots, the framework highlights a clear trade-off between using an image generation model (IMG) versus executable code (Coding).

Aesthetics – Image generation (IMG): generally higher; plots look more "visually appealing". Coding (Matplotlib): professional and standard academic look.
Fidelity – Image generation (IMG): lower; prone to "numerical hallucinations" or element repetition. Coding (Matplotlib): 100% accurate; strictly represents the raw data provided.
Readability – Image generation (IMG): high for sparse data but struggles with complex datasets. Coding (Matplotlib): consistently high; handles dense or multi-series data without error.

Key Takeaways

Multi-Agent Collaborative Framework: PaperBanana is a reference-driven system that orchestrates 5 specialized agents—Retriever, Planner, Stylist, Visualizer, and Critic—to transform raw technical text and captions into publication-quality methodology diagrams and statistical plots.

Dual-Phase Generation Process: The workflow consists of a Linear Planning Phase to retrieve reference examples and set aesthetic guidelines, followed by a 3-round Iterative Refinement Loop where the Critic agent identifies errors and the Visualizer agent regenerates the image for higher accuracy.

Superior Performance on PaperBananaBench: Evaluated against 292 test cases from NeurIPS 2025, the framework outperformed vanilla baselines in Overall Score (+17.0%), Conciseness (+37.2%), Readability (+12.9%), and Aesthetics (+6.6%).

Precision-Focused Statistical Plots: For statistical data, the system switches from direct image generation to executable Python Matplotlib code; this hybrid approach ensures numerical precision and eliminates “hallucinations” common in standard AI image generators.

Check out the Paper and Repo.
The post Google AI Introduces PaperBanana: An Agentic Framework that Automates Publication Ready Methodology Diagrams and Statistical Plots appeared first on MarkTechPost.

How to Build a Production-Grade Agentic AI System with Hybrid Retrieva …

In this tutorial, we build an ultra-advanced agentic AI workflow that behaves like a production-grade research and reasoning system rather than a single prompt call. We ingest real web sources asynchronously, split them into provenance-tracked chunks, and run hybrid retrieval using both TF-IDF (sparse) and OpenAI embeddings (dense), then fuse results for higher recall and stability. We orchestrate multiple agents, planning, synthesis, and repair, while enforcing strict guardrails so every major claim is grounded in retrieved evidence, and we persist episodic memory. Hence, the system improves its strategy over time. Check out the FULL CODES here.

!pip -q install openai openai-agents pydantic httpx beautifulsoup4 lxml scikit-learn numpy

import os, re, json, time, getpass, asyncio, sqlite3, hashlib
from typing import List, Dict, Tuple, Optional, Any

import numpy as np
import httpx
from bs4 import BeautifulSoup
from pydantic import BaseModel, Field

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from openai import AsyncOpenAI
from agents import Agent, Runner, SQLiteSession

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY not provided.")
print("OpenAI API key loaded securely.")
oa = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

def sha1(s: str) -> str:
    return hashlib.sha1(s.encode("utf-8", errors="ignore")).hexdigest()

def normalize_url(u: str) -> str:
    u = (u or "").strip()
    return u.rstrip(").,]\"'")

def clean_html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    txt = soup.get_text("\n")
    txt = re.sub(r"\n{3,}", "\n\n", txt).strip()
    txt = re.sub(r"[ \t]+", " ", txt)
    return txt

def chunk_text(text: str, chunk_chars: int = 1600, overlap_chars: int = 320) -> List[str]:
    if not text:
        return []
    text = re.sub(r"\s+", " ", text).strip()
    n = len(text)
    step = max(1, chunk_chars - overlap_chars)
    chunks = []
    i = 0
    while i < n:
        chunks.append(text[i:i + chunk_chars])
        i += step
    return chunks

def canonical_chunk_id(s: str) -> str:
    if s is None:
        return ""
    s = str(s).strip()
    s = s.strip("<>\"'()[]{}")
    s = s.rstrip(".,;:")
    return s

def inject_exec_summary_citations(exec_summary: str, citations: List[str], allowed_chunk_ids: List[str]) -> str:
    exec_summary = exec_summary or ""
    cset = []
    for c in citations:
        c = canonical_chunk_id(c)
        if c and c in allowed_chunk_ids and c not in cset:
            cset.append(c)
        if len(cset) >= 2:
            break
    if len(cset) < 2:
        for c in allowed_chunk_ids:
            if c not in cset:
                cset.append(c)
            if len(cset) >= 2:
                break
    if len(cset) >= 2:
        needed = [c for c in cset if c not in exec_summary]
        if needed:
            exec_summary = exec_summary.strip()
            if exec_summary and not exec_summary.endswith("."):
                exec_summary += "."
            exec_summary += f" (cite: {cset[0]}) (cite: {cset[1]})"
    return exec_summary

We set up the environment, securely load the OpenAI API key, and initialize core utilities that everything else depends on. We define hashing, URL normalization, HTML cleaning, and chunking so all downstream steps operate on clean, consistent text. We also add deterministic helpers to normalize and inject citations, ensuring guardrails are always satisfied. Check out the FULL CODES here.

async def fetch_many(urls: List[str], timeout_s: float = 25.0, per_url_char_limit: int = 60000) -> Dict[str, str]:
    headers = {"User-Agent": "Mozilla/5.0 (AgenticAI/4.2)"}
    urls = [normalize_url(u) for u in urls]
    urls = [u for u in urls if u.startswith("http")]
    urls = list(dict.fromkeys(urls))
    out: Dict[str, str] = {}
    async with httpx.AsyncClient(timeout=timeout_s, follow_redirects=True, headers=headers) as client:
        async def _one(url: str):
            try:
                r = await client.get(url)
                r.raise_for_status()
                out[url] = clean_html_to_text(r.text)[:per_url_char_limit]
            except Exception as e:
                out[url] = f"__FETCH_ERROR__ {type(e).__name__}: {e}"
        await asyncio.gather(*[_one(u) for u in urls])
    return out

def dedupe_texts(sources: Dict[str, str]) -> Dict[str, str]:
    seen = set()
    out = {}
    for url, txt in sources.items():
        if not isinstance(txt, str) or txt.startswith("__FETCH_ERROR__"):
            continue
        h = sha1(txt[:25000])
        if h in seen:
            continue
        seen.add(h)
        out[url] = txt
    return out

class ChunkRecord(BaseModel):
    chunk_id: str
    url: str
    chunk_index: int
    text: str

class RetrievalHit(BaseModel):
    chunk_id: str
    url: str
    chunk_index: int
    score_sparse: float = 0.0
    score_dense: float = 0.0
    score_fused: float = 0.0
    text: str

class EvidencePack(BaseModel):
    query: str
    hits: List[RetrievalHit]

We asynchronously fetch multiple web sources in parallel and aggressively deduplicate content to avoid redundant evidence. We convert raw pages into structured text and define the core data models that represent chunks and retrieval hits. We ensure every piece of text is traceable back to a specific source and chunk index. Check out the FULL CODES here.

EPISODE_DB = "agentic_episode_memory.db"

def episode_db_init():
    con = sqlite3.connect(EPISODE_DB)
    cur = con.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS episodes (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            ts INTEGER NOT NULL,
            question TEXT NOT NULL,
            urls_json TEXT NOT NULL,
            retrieval_queries_json TEXT NOT NULL,
            useful_sources_json TEXT NOT NULL
        )
    """)
    con.commit()
    con.close()

def episode_store(question: str, urls: List[str], retrieval_queries: List[str], useful_sources: List[str]):
    con = sqlite3.connect(EPISODE_DB)
    cur = con.cursor()
    cur.execute(
        "INSERT INTO episodes(ts, question, urls_json, retrieval_queries_json, useful_sources_json) VALUES(?,?,?,?,?)",
        (int(time.time()), question, json.dumps(urls), json.dumps(retrieval_queries), json.dumps(useful_sources)),
    )
    con.commit()
    con.close()

def episode_recall(question: str, top_k: int = 2) -> List[Dict[str, Any]]:
    con = sqlite3.connect(EPISODE_DB)
    cur = con.cursor()
    cur.execute("SELECT ts, question, urls_json, retrieval_queries_json, useful_sources_json FROM episodes ORDER BY ts DESC LIMIT 200")
    rows = cur.fetchall()
    con.close()
    q_tokens = set(re.findall(r"[A-Za-z]{3,}", (question or "").lower()))
    scored = []
    for ts, q2, u, rq, us in rows:
        t2 = set(re.findall(r"[A-Za-z]{3,}", (q2 or "").lower()))
        if not t2:
            continue
        score = len(q_tokens & t2) / max(1, len(q_tokens))
        if score > 0:
            scored.append((score, {
                "ts": ts,
                "question": q2,
                "urls": json.loads(u),
                "retrieval_queries": json.loads(rq),
                "useful_sources": json.loads(us),
            }))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [x[1] for x in scored[:top_k]]

episode_db_init()

We introduce episodic memory backed by SQLite so the system can recall what worked in previous runs. We store questions, retrieval strategies, and useful sources to guide future planning. We also implement lightweight similarity-based recall to bias the system toward historically effective patterns. Check out the FULL CODES here.

class HybridIndex:
    def __init__(self):
        self.records: List[ChunkRecord] = []
        self.tfidf: Optional[TfidfVectorizer] = None
        self.tfidf_mat = None
        self.emb_mat: Optional[np.ndarray] = None

    def build_sparse(self):
        corpus = [r.text for r in self.records] if self.records else [""]
        self.tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), max_features=80000)
        self.tfidf_mat = self.tfidf.fit_transform(corpus)

    def search_sparse(self, query: str, k: int) -> List[Tuple[int, float]]:
        if not self.records or self.tfidf is None or self.tfidf_mat is None:
            return []
        qv = self.tfidf.transform([query])
        sims = cosine_similarity(qv, self.tfidf_mat).flatten()
        top = np.argsort(-sims)[:k]
        return [(int(i), float(sims[i])) for i in top]

    def set_dense(self, mat: np.ndarray):
        self.emb_mat = mat.astype(np.float32)

    def search_dense(self, q_emb: np.ndarray, k: int) -> List[Tuple[int, float]]:
        if self.emb_mat is None or not self.records:
            return []
        M = self.emb_mat
        q = q_emb.astype(np.float32).reshape(1, -1)
        M_norm = M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-9)
        q_norm = q / (np.linalg.norm(q) + 1e-9)
        sims = (M_norm @ q_norm.T).flatten()
        top = np.argsort(-sims)[:k]
        return [(int(i), float(sims[i])) for i in top]

def rrf_fuse(rankings: List[List[int]], k: int = 60) -> Dict[int, float]:
    scores: Dict[int, float] = {}
    for r in rankings:
        for pos, idx in enumerate(r, start=1):
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (k + pos)
    return scores

HYBRID = HybridIndex()
ALLOWED_URLS: List[str] = []

EMBED_MODEL = "text-embedding-3-small"

async def embed_batch(texts: List[str]) -> np.ndarray:
    resp = await oa.embeddings.create(model=EMBED_MODEL, input=texts, encoding_format="float")
    vecs = [np.array(item.embedding, dtype=np.float32) for item in resp.data]
    return np.vstack(vecs) if vecs else np.zeros((0, 0), dtype=np.float32)

async def embed_texts(texts: List[str], batch_size: int = 96, max_concurrency: int = 3) -> np.ndarray:
    sem = asyncio.Semaphore(max_concurrency)
    mats: List[Tuple[int, np.ndarray]] = []

    async def _one(start: int, batch: List[str]):
        async with sem:
            m = await embed_batch(batch)
            mats.append((start, m))

    tasks = []
    for start in range(0, len(texts), batch_size):
        batch = [t[:7000] for t in texts[start:start + batch_size]]
        tasks.append(_one(start, batch))
    await asyncio.gather(*tasks)

    mats.sort(key=lambda x: x[0])
    emb = np.vstack([m for _, m in mats]) if mats else np.zeros((len(texts), 0), dtype=np.float32)
    if emb.shape[0] != len(texts):
        raise RuntimeError(f"Embedding rows mismatch: got {emb.shape[0]} expected {len(texts)}")
    return emb

async def embed_query(query: str) -> np.ndarray:
    m = await embed_batch([query[:7000]])
    return m[0] if m.shape[0] else np.zeros((0,), dtype=np.float32)

async def build_index(urls: List[str], max_chunks_per_url: int = 60):
    global ALLOWED_URLS
    fetched = await fetch_many(urls)
    fetched = dedupe_texts(fetched)

    records: List[ChunkRecord] = []
    allowed: List[str] = []

    for url, txt in fetched.items():
        if not isinstance(txt, str) or txt.startswith("__FETCH_ERROR__"):
            continue
        allowed.append(url)
        chunks = chunk_text(txt)[:max_chunks_per_url]
        for i, ch in enumerate(chunks):
            cid = f"{sha1(url)}:{i}"
            records.append(ChunkRecord(chunk_id=cid, url=url, chunk_index=i, text=ch))

    if not records:
        err_view = {normalize_url(u): fetched.get(normalize_url(u), "") for u in urls}
        raise RuntimeError("No sources fetched successfully.\n" + json.dumps(err_view, indent=2)[:4000])

    ALLOWED_URLS = allowed
    HYBRID.records = records
    HYBRID.build_sparse()

    texts = [r.text for r in HYBRID.records]
    emb = await embed_texts(texts, batch_size=96, max_concurrency=3)
    HYBRID.set_dense(emb)
We build a hybrid retrieval index that combines sparse TF-IDF search with dense OpenAI embeddings. We enable reciprocal rank fusion, so that sparse and dense signals complement each other rather than compete. We construct the index once per run and reuse it across all retrieval queries for efficiency. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserdef build_evidence_pack(query: str, sparse: List[Tuple[int,float]], dense: List[Tuple[int,float]], k: int = 10) -> EvidencePack:
sparse_rank = [i for i,_ in sparse]
dense_rank = [i for i,_ in dense]
sparse_scores = {i:s for i,s in sparse}
dense_scores = {i:s for i,s in dense}
fused = rrf_fuse([sparse_rank, dense_rank], k=60) if dense_rank else rrf_fuse([sparse_rank], k=60)
top = sorted(fused.keys(), key=lambda i: fused[i], reverse=True)[:k]

hits: List[RetrievalHit] = []
for idx in top:
r = HYBRID.records[idx]
hits.append(RetrievalHit(
chunk_id=r.chunk_id, url=r.url, chunk_index=r.chunk_index,
score_sparse=float(sparse_scores.get(idx, 0.0)),
score_dense=float(dense_scores.get(idx, 0.0)),
score_fused=float(fused.get(idx, 0.0)),
text=r.text
))
return EvidencePack(query=query, hits=hits)

async def gather_evidence(queries: List[str], per_query_k: int = 10, sparse_k: int = 60, dense_k: int = 60):
    evidence: List[EvidencePack] = []
    useful_sources_count: Dict[str, int] = {}
    all_chunk_ids: List[str] = []

    for q in queries:
        sparse = HYBRID.search_sparse(q, k=sparse_k)
        q_emb = await embed_query(q)
        dense = HYBRID.search_dense(q_emb, k=dense_k)
        pack = build_evidence_pack(q, sparse, dense, k=per_query_k)
        evidence.append(pack)
        for h in pack.hits[:6]:
            useful_sources_count[h.url] = useful_sources_count.get(h.url, 0) + 1
        for h in pack.hits:
            all_chunk_ids.append(h.chunk_id)

    useful_sources = sorted(useful_sources_count.keys(), key=lambda u: useful_sources_count[u], reverse=True)
    all_chunk_ids = sorted(list(dict.fromkeys(all_chunk_ids)))
    return evidence, useful_sources[:8], all_chunk_ids

class Plan(BaseModel):
    objective: str
    subtasks: List[str]
    retrieval_queries: List[str]
    acceptance_checks: List[str]

class UltraAnswer(BaseModel):
    title: str
    executive_summary: str
    architecture: List[str]
    retrieval_strategy: List[str]
    agent_graph: List[str]
    implementation_notes: List[str]
    risks_and_limits: List[str]
    citations: List[str]
    sources: List[str]

def normalize_answer(ans: UltraAnswer, allowed_chunk_ids: List[str]) -> UltraAnswer:
    data = ans.model_dump()
    data["citations"] = [canonical_chunk_id(x) for x in (data.get("citations") or [])]
    data["citations"] = [x for x in data["citations"] if x in allowed_chunk_ids]
    data["executive_summary"] = inject_exec_summary_citations(data.get("executive_summary", ""), data["citations"], allowed_chunk_ids)
    return UltraAnswer(**data)

def validate_ultra(ans: UltraAnswer, allowed_chunk_ids: List[str]) -> None:
    extras = [u for u in ans.sources if u not in ALLOWED_URLS]
    if extras:
        raise ValueError(f"Non-allowed sources in output: {extras}")

    cset = set(ans.citations or [])
    missing = [cid for cid in cset if cid not in set(allowed_chunk_ids)]
    if missing:
        raise ValueError(f"Citations reference unknown chunk_ids (not retrieved): {missing}")

    if len(cset) < 6:
        raise ValueError("Need at least 6 distinct chunk_id citations in ultra mode.")

    es_text = ans.executive_summary or ""
    es_count = sum(1 for cid in cset if cid in es_text)
    if es_count < 2:
        raise ValueError("Executive summary must include at least 2 chunk_id citations verbatim.")

PLANNER = Agent(
    name="Planner",
    model="gpt-4o-mini",
    instructions=(
        "Return a technical Plan schema.\n"
        "Make 10-16 retrieval_queries.\n"
        "Acceptance must include: at least 6 citations and exec_summary contains at least 2 citations verbatim."
    ),
    output_type=Plan,
)

SYNTHESIZER = Agent(
    name="Synthesizer",
    model="gpt-4o-mini",
    instructions=(
        "Return UltraAnswer schema.\n"
        "Hard constraints:\n"
        "- executive_summary MUST include at least TWO citations verbatim as: (cite: <chunk_id>).\n"
        "- citations must be chosen ONLY from ALLOWED_CHUNK_IDS list.\n"
        "- citations list must include at least 6 unique chunk_ids.\n"
        "- sources must be subset of allowed URLs.\n"
    ),
    output_type=UltraAnswer,
)

FIXER = Agent(
    name="Fixer",
    model="gpt-4o-mini",
    instructions=(
        "Repair to satisfy guardrails.\n"
        "Ensure executive_summary includes at least TWO citations verbatim.\n"
        "Choose citations ONLY from ALLOWED_CHUNK_IDS list.\n"
        "Return UltraAnswer schema."
    ),
    output_type=UltraAnswer,
)

session = SQLiteSession("ultra_agentic_user", "ultra_agentic_session.db")

We gather evidence by running multiple targeted queries, fusing sparse and dense results, and assembling evidence packs with scores and provenance. We define strict schemas for plans and final answers, then normalize and validate citations against retrieved chunk IDs. We enforce hard guardrails so every answer remains grounded and auditable. Check out the FULL CODES here.

async def run_ultra_agentic(question: str, urls: List[str], max_repairs: int = 2) -> UltraAnswer:
    await build_index(urls)
    recall_hint = json.dumps(episode_recall(question, top_k=2), indent=2)[:2000]

    plan_res = await Runner.run(
        PLANNER,
        f"Question:\n{question}\n\nAllowed URLs:\n{json.dumps(ALLOWED_URLS, indent=2)}\n\nRecall:\n{recall_hint}\n",
        session=session
    )
    plan: Plan = plan_res.final_output
    queries = (plan.retrieval_queries or [])[:16]

    evidence_packs, useful_sources, allowed_chunk_ids = await gather_evidence(queries)

    evidence_json = json.dumps([p.model_dump() for p in evidence_packs], indent=2)[:16000]
    allowed_chunk_ids_json = json.dumps(allowed_chunk_ids[:200], indent=2)

    draft_res = await Runner.run(
        SYNTHESIZER,
        f"Question:\n{question}\n\nAllowed URLs:\n{json.dumps(ALLOWED_URLS, indent=2)}\n\n"
        f"ALLOWED_CHUNK_IDS:\n{allowed_chunk_ids_json}\n\n"
        f"Evidence packs:\n{evidence_json}\n\n"
        "Return UltraAnswer.",
        session=session
    )
    draft = normalize_answer(draft_res.final_output, allowed_chunk_ids)

    last_err = None
    for i in range(max_repairs + 1):
        try:
            validate_ultra(draft, allowed_chunk_ids)
            episode_store(question, ALLOWED_URLS, plan.retrieval_queries, useful_sources)
            return draft
        except Exception as e:
            last_err = str(e)
            if i >= max_repairs:
                draft = normalize_answer(draft, allowed_chunk_ids)
                validate_ultra(draft, allowed_chunk_ids)
                return draft

            fixer_res = await Runner.run(
                FIXER,
                f"Question:\n{question}\n\nAllowed URLs:\n{json.dumps(ALLOWED_URLS, indent=2)}\n\n"
                f"ALLOWED_CHUNK_IDS:\n{allowed_chunk_ids_json}\n\n"
                f"Guardrail error:\n{last_err}\n\n"
                f"Draft:\n{json.dumps(draft.model_dump(), indent=2)[:12000]}\n\n"
                f"Evidence packs:\n{evidence_json}\n\n"
                "Return corrected UltraAnswer that passes guardrails.",
                session=session
            )
            draft = normalize_answer(fixer_res.final_output, allowed_chunk_ids)

    raise RuntimeError(f"Unexpected failure: {last_err}")

question = (
    "Design a production-lean but advanced agentic AI workflow in Python with hybrid retrieval, "
    "provenance-first citations, critique-and-repair loops, and episodic memory. "
    "Explain why each layer matters, failure modes, and evaluation."
)

urls = [
    "https://openai.github.io/openai-agents-python/",
    "https://openai.github.io/openai-agents-python/agents/",
    "https://openai.github.io/openai-agents-python/running_agents/",
    "https://github.com/openai/openai-agents-python",
]

ans = await run_ultra_agentic(question, urls, max_repairs=2)

print("\nTITLE:\n", ans.title)
print("\nEXECUTIVE SUMMARY:\n", ans.executive_summary)
print("\nARCHITECTURE:")
for x in ans.architecture:
    print("-", x)
print("\nRETRIEVAL STRATEGY:")
for x in ans.retrieval_strategy:
    print("-", x)
print("\nAGENT GRAPH:")
for x in ans.agent_graph:
    print("-", x)
print("\nIMPLEMENTATION NOTES:")
for x in ans.implementation_notes:
    print("-", x)
print("\nRISKS & LIMITS:")
for x in ans.risks_and_limits:
    print("-", x)
print("\nCITATIONS (chunk_ids):")
for c in ans.citations:
    print("-", c)
print("\nSOURCES:")
for s in ans.sources:
    print("-", s)

We orchestrate the full agentic loop by chaining planning, synthesis, validation, and repair in an async-safe pipeline. We automatically retry and fix outputs until they pass all constraints without human intervention. We finish by running a full example and printing a fully grounded, production-ready agentic response.

In conclusion, we developed a comprehensive agentic pipeline robust to common failure modes: unstable embedding shapes, citation drift, and missing grounding in executive summaries. We validated outputs against allowlisted sources, retrieved chunk IDs, automatically normalized citations, and injected deterministic citations when needed to guarantee compliance without sacrificing correctness. By combining hybrid retrieval, critique-and-repair loops, and episodic memory, we created a reusable foundation we can extend with stronger evaluations (claim-to-evidence coverage scoring, adversarial red-teaming, and regression tests) to continuously harden the system as it scales to new domains and larger corpora.

Check out the FULL CODES here.
The post How to Build a Production-Grade Agentic AI System with Hybrid Retrieval, Provenance-First Citations, Repair Loops, and Episodic Memory appeared first on MarkTechPost.

Waymo Introduces the Waymo World Model: A New Frontier Simulator Model …

Waymo is introducing the Waymo World Model, a frontier generative model that drives its next generation of autonomous driving simulation. The system is built on top of Genie 3, Google DeepMind’s general-purpose world model, and adapts it to produce photorealistic, controllable, multi-sensor driving scenes at scale.

Waymo already reports nearly 200 million fully autonomous miles on public roads. Behind the scenes, the Driver trains and is evaluated on billions of additional miles in virtual worlds. The Waymo World Model is now the main engine generating those worlds, with the explicit goal of exposing the stack to rare, safety-critical ‘long-tail’ events that are almost impossible to see often enough in reality.

From Genie 3 to a driving-specific world model

Genie 3 is a general-purpose world model that turns text prompts into interactive environments you can navigate in real time at roughly 24 frames per second, typically at 720p resolution. It learns the dynamics of scenes directly from large video corpora and supports fluid control by user inputs.

Waymo uses Genie 3 as the backbone and post-trains it for the driving domain. The Waymo World Model keeps Genie 3’s ability to generate coherent 3D worlds, but aligns the outputs with Waymo’s sensor suite and operating constraints. It generates high-fidelity camera images and lidar point clouds that evolve consistently over time, matching how the Waymo Driver actually perceives the environment.

This is not just video rendering. The model produces multi-sensor, temporally consistent observations that downstream autonomous driving systems can consume under the same conditions as real-world logs.

Emergent multimodal world knowledge

Most AV simulators are trained only on on-road fleet data. That limits them to the weather, infrastructure, and traffic patterns a fleet actually encountered. Waymo instead leverages Genie 3’s pre-training on an extremely large and diverse set of videos to import broad ‘world knowledge’ into the simulator.

Waymo then applies specialized post-training to transfer this knowledge from 2D video into 3D lidar outputs tailored to its hardware. Cameras provide rich appearance and lighting. Lidar contributes precise geometry and depth. The Waymo World Model jointly generates these modalities, so a simulated scene comes with both RGB streams and realistic 4D point clouds.

Because of the diversity of the pre-training data, the model can synthesize conditions that Waymo’s fleet has not directly seen. The Waymo team shows examples such as light snow on the Golden Gate Bridge, tornadoes, flooded cul-de-sacs, tropical streets strangely covered in snow, and driving out of a roadway fire. It also handles unusual objects and edge cases like elephants, Texas longhorns, lions, pedestrians dressed as T-rexes, and car-sized tumbleweed.

The important point is that these behaviors are emergent. The model is not explicitly programmed with rules for elephants or tornado fluid dynamics. Instead, it reuses generic spatiotemporal structure learned from videos and adapts it to driving scenes.

Three axes of controllability

A key design goal is strong simulation controllability. The Waymo World Model exposes three main control mechanisms: driving action control, scene layout control, and language control.

Driving action control: The simulator responds to specific driving inputs, allowing ‘what if’ counterfactuals on top of recorded logs. Devs can ask whether the Waymo Driver could have driven more assertively instead of yielding in a past scene, and then simulate that alternative behavior. Because the model is fully generative, it maintains realism even when the simulated route diverges far from the original trajectory, where purely reconstructive methods like 3D Gaussian Splatting (3DGS) would suffer from missing viewpoints.

Scene layout control: The model can be conditioned on modified road geometry, traffic signal states, and other road users. Waymo can insert or reposition vehicles and pedestrians or apply mutations to road layouts to synthesize targeted interaction scenarios. This supports systematic stress testing of yielding, merging, and negotiation behaviors beyond what appears in raw logs.

Language control: Natural language prompts act as a flexible, high-level interface for editing time-of-day, weather, or even generating entirely synthetic scenes. The Waymo team demonstrates ‘World Mutation’ sequences where the same base city scene is rendered at dawn, morning, noon, afternoon, evening, and night, and then under cloudy, foggy, rainy, snowy, and sunny conditions.

This tri-axis control is close to a structured API: numeric driving actions, structural layout edits, and semantic text prompts all steer the same underlying world model.

Turning ordinary videos into multimodal simulations

The Waymo World Model can convert regular mobile or dashcam recordings into multimodal simulations that show how the Waymo Driver would perceive the same scene.

Waymo showcases examples from scenic drives in Norway, Arches National Park, and Death Valley. Given only the video, the model reconstructs a simulation with aligned camera images and lidar output. This creates scenarios with strong realism and factuality because the generated world is anchored to actual footage, while still being controllable via the three mechanisms above.

Practically, this means a large corpus of consumer-style video can be reused as structured simulation input without requiring lidar recordings in those locations.

Scalable inference and long rollouts

Long-horizon maneuvers such as threading a narrow lane with oncoming traffic or navigating dense neighborhoods require many simulation steps. Naive generative models suffer from quality drift and high compute cost over long rollouts.

The Waymo team reports an efficient variant of the Waymo World Model that supports long sequences with a dramatic reduction in compute while maintaining realism. They show 4x-speed playback of extended scenes like freeway navigation around an in-lane stopper, busy neighborhood driving, climbing steep streets around motorcyclists, and handling SUV U-turns.

For training and regression testing, this reduces the hardware budget per scenario and makes large test suites more tractable.

Key Takeaways

Genie 3–based world model: Waymo World Model adapts Google DeepMind’s Genie 3 into a driving-specific world model that generates photorealistic, interactive, multi-sensor 3D environments for AV simulation.

Multi-sensor, 4D outputs aligned with the Waymo Driver: The simulator jointly produces temporally consistent camera imagery and lidar point clouds, aligned with Waymo’s real sensor stack, so downstream autonomy systems can consume simulation like real logs.

Emergent coverage of rare and long-tail scenarios: By leveraging large-scale video pre-training, the model can synthesize rare conditions and objects, such as snow on unusual roads, floods, fires, and animals like elephants or lions, that the fleet has never directly observed.

Tri-axis controllability for targeted stress testing: Driving action control, scene layout control, and language control let devs run counterfactuals, edit road geometry and traffic participants, and mutate time-of-day or weather via text prompts in the same generative environment.

Efficient long-horizon and video-anchored simulation: An optimized variant supports long rollouts at reduced compute cost, and the system can also convert ordinary dashcam or mobile videos into controllable multimodal simulations, expanding the pool of realistic scenarios.

Check out the Technical details.
The post Waymo Introduces the Waymo World Model: A New Frontier Simulator Model for Autonomous Driving and Built on Top of Genie 3 appeared first on MarkTechPost.

A Coding, Data-Driven Guide to Measuring, Visualizing, and Enforcing C …

In this tutorial, we build an end-to-end cognitive complexity analysis workflow using complexipy. We start by measuring complexity directly from raw code strings, then scale the same analysis to individual files and an entire project directory. Along the way, we generate machine-readable reports, normalize them into structured DataFrames, and visualize complexity distributions to understand how decision depth accumulates across functions. By treating cognitive complexity as a measurable engineering signal, we show how it can be integrated naturally into everyday Python development and quality checks. Check out the FULL CODES here.

!pip -q install complexipy pandas matplotlib

import os
import json
import textwrap
import subprocess
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt

from complexipy import code_complexity, file_complexity

print("Installed complexipy and dependencies")

We set up the environment by installing the required libraries and importing all dependencies needed for analysis and visualization. We ensure the notebook is fully self-contained and ready to run in Google Colab without external setup. It forms the backbone of execution for everything that follows.

snippet = """
def score_orders(orders):
    total = 0
    for o in orders:
        if o.get("valid"):
            if o.get("priority"):
                if o.get("amount", 0) > 100:
                    total += 3
                else:
                    total += 2
            else:
                if o.get("amount", 0) > 100:
                    total += 2
                else:
                    total += 1
        else:
            total -= 1
    return total
"""

res = code_complexity(snippet)
print("=== Code string complexity ===")
print("Overall complexity:", res.complexity)
print("Functions:")
for f in res.functions:
    print(f" - {f.name}: {f.complexity} (lines {f.line_start}-{f.line_end})")

We begin by analyzing a raw Python code string to understand cognitive complexity at the function level. We directly inspect how nested conditionals and control flow contribute to complexity. It helps us validate the core behavior of complexipy before scaling to real files.

root = Path("toy_project")
src = root / "src"
tests = root / "tests"
src.mkdir(parents=True, exist_ok=True)
tests.mkdir(parents=True, exist_ok=True)

(src / "__init__.py").write_text("")
(tests / "__init__.py").write_text("")

(src / "simple.py").write_text(textwrap.dedent("""
def add(a, b):
    return a + b

def safe_div(a, b):
    if b == 0:
        return None
    return a / b
""").strip() + "\n")

(src / "legacy_adapter.py").write_text(textwrap.dedent("""
def legacy_adapter(x, y):
    if x and y:
        if x > 0:
            if y > 0:
                return x + y
            else:
                return x - y
        else:
            if y > 0:
                return y - x
            else:
                return -(x + y)
    return 0
""").strip() + "\n")

(src / "engine.py").write_text(textwrap.dedent("""
def route_event(event):
    kind = event.get("kind")
    payload = event.get("payload", {})
    if kind == "A":
        if payload.get("x") and payload.get("y"):
            return _handle_a(payload)
        return None
    elif kind == "B":
        if payload.get("flags"):
            return _handle_b(payload)
        else:
            return None
    elif kind == "C":
        for item in payload.get("items", []):
            if item.get("enabled"):
                if item.get("mode") == "fast":
                    _do_fast(item)
                else:
                    _do_safe(item)
        return True
    else:
        return None

def _handle_a(p):
    total = 0
    for v in p.get("vals", []):
        if v > 10:
            total += 2
        else:
            total += 1
    return total

def _handle_b(p):
    score = 0
    for f in p.get("flags", []):
        if f == "x":
            score += 1
        elif f == "y":
            score += 2
        else:
            score -= 1
    return score

def _do_fast(item):
    return item.get("id")

def _do_safe(item):
    if item.get("id") is None:
        return None
    return item.get("id")
""").strip() + "\n")

(tests / "test_engine.py").write_text(textwrap.dedent("""
from src.engine import route_event

def test_route_event_smoke():
    assert route_event({"kind": "A", "payload": {"x": 1, "y": 2, "vals": [1, 20]}}) == 3
""").strip() + "\n")

print(f"Created project at: {root.resolve()}")

We programmatically construct a small but realistic Python project with multiple modules and test files. We intentionally include varied control-flow patterns to create meaningful differences in complexity. Check out the FULL CODES here.

engine_path = src / "engine.py"
file_res = file_complexity(str(engine_path))

print("\n=== File complexity (Python API) ===")
print("Path:", file_res.path)
print("File complexity:", file_res.complexity)
for f in file_res.functions:
    print(f" - {f.name}: {f.complexity} (lines {f.line_start}-{f.line_end})")

MAX_ALLOWED = 8

def run_complexipy_cli(project_dir: Path, max_allowed: int = 8):
    cmd = [
        "complexipy",
        ".",
        "--max-complexity-allowed", str(max_allowed),
        "--output-json",
        "--output-csv",
    ]
    proc = subprocess.run(cmd, cwd=str(project_dir), capture_output=True, text=True)

    preferred_csv = project_dir / "complexipy.csv"
    preferred_json = project_dir / "complexipy.json"

    csv_candidates = []
    json_candidates = []

    if preferred_csv.exists():
        csv_candidates.append(preferred_csv)
    if preferred_json.exists():
        json_candidates.append(preferred_json)

    csv_candidates += list(project_dir.glob("*.csv")) + list(project_dir.glob("**/*.csv"))
    json_candidates += list(project_dir.glob("*.json")) + list(project_dir.glob("**/*.json"))

    def uniq(paths):
        seen = set()
        out = []
        for p in paths:
            p = p.resolve()
            if p not in seen and p.is_file():
                seen.add(p)
                out.append(p)
        return out

    csv_candidates = uniq(csv_candidates)
    json_candidates = uniq(json_candidates)

    def pick_best(paths):
        if not paths:
            return None
        paths = sorted(paths, key=lambda p: p.stat().st_mtime, reverse=True)
        return paths[0]

    return proc.returncode, pick_best(csv_candidates), pick_best(json_candidates)

rc, csv_report, json_report = run_complexipy_cli(root, MAX_ALLOWED)

We analyze a real source file using the Python API, then run the complexipy CLI on the entire project. We run the CLI from the correct working directory to reliably generate reports. This step bridges local API usage with production-style static analysis workflows.

df = None

if csv_report and csv_report.exists():
    df = pd.read_csv(csv_report)
elif json_report and json_report.exists():
    data = json.loads(json_report.read_text())
    if isinstance(data, list):
        df = pd.DataFrame(data)
    elif isinstance(data, dict):
        if "files" in data and isinstance(data["files"], list):
            df = pd.DataFrame(data["files"])
        elif "results" in data and isinstance(data["results"], list):
            df = pd.DataFrame(data["results"])
        else:
            df = pd.json_normalize(data)

if df is None:
    raise RuntimeError("No report produced")

def explode_functions_table(df_in):
    if "functions" in df_in.columns:
        tmp = df_in.explode("functions", ignore_index=True)
        if tmp["functions"].notna().any() and isinstance(tmp["functions"].dropna().iloc[0], dict):
            fn = pd.json_normalize(tmp["functions"])
            base = tmp.drop(columns=["functions"])
            return pd.concat([base.reset_index(drop=True), fn.reset_index(drop=True)], axis=1)
        return tmp
    return df_in

fn_df = explode_functions_table(df)

col_map = {}
for c in fn_df.columns:
    lc = c.lower()
    if lc in ("path", "file", "filename", "module"):
        col_map[c] = "path"
    if ("function" in lc and "name" in lc) or lc in ("function", "func", "function_name"):
        col_map[c] = "function"
    if lc == "name" and "function" not in fn_df.columns:
        col_map[c] = "function"
    if "complexity" in lc and "allowed" not in lc and "max" not in lc:
        col_map[c] = "complexity"
    if lc in ("line_start", "linestart", "start_line", "startline"):
        col_map[c] = "line_start"
    if lc in ("line_end", "lineend", "end_line", "endline"):
        col_map[c] = "line_end"

fn_df = fn_df.rename(columns=col_map)

We load the generated complexity reports into pandas and normalize them into a function-level table. We handle multiple possible report schemas to keep the workflow robust. This structured representation allows us to reason about complexity using standard data analysis tools.

if "complexity" in fn_df.columns:
    fn_df["complexity"] = pd.to_numeric(fn_df["complexity"], errors="coerce")
    plt.figure()
    fn_df["complexity"].dropna().plot(kind="hist", bins=20)
    plt.title("Cognitive Complexity Distribution (functions)")
    plt.xlabel("complexity")
    plt.ylabel("count")
    plt.show()

def refactor_hints(complexity):
    if complexity >= 20:
        return [
            "Split into smaller pure functions",
            "Replace deep nesting with guard clauses",
            "Extract complex boolean predicates"
        ]
    if complexity >= 12:
        return [
            "Extract inner logic into helpers",
            "Flatten conditionals",
            "Use dispatch tables"
        ]
    if complexity >= 8:
        return [
            "Reduce nesting",
            "Early returns"
        ]
    return ["Acceptable complexity"]

if "complexity" in fn_df.columns and "function" in fn_df.columns:
    for _, r in fn_df.sort_values("complexity", ascending=False).head(8).iterrows():
        cx = float(r["complexity"]) if pd.notna(r["complexity"]) else None
        if cx is None:
            continue
        print(r["function"], cx, refactor_hints(cx))

print("Tutorial complete.")

We visualize the distribution of cognitive complexity and derive refactoring guidance from numeric thresholds. We translate abstract complexity scores into concrete engineering actions. It closes the loop by connecting measurement directly to maintainability decisions.

In conclusion, we presented a practical, reproducible pipeline for auditing cognitive complexity in Python projects using complexipy. We demonstrated how we can move from ad hoc inspection to data-driven reasoning about code structure, identify high-risk functions, and provide actionable refactoring guidance based on quantified thresholds. The workflow allows us to reason about maintainability early, enforce complexity budgets consistently, and evolve codebases with clarity and confidence, rather than relying solely on intuition.

Check out the FULL CODES here.
The post A Coding, Data-Driven Guide to Measuring, Visualizing, and Enforcing Cognitive Complexity in Python Projects Using complexipy appeared first on MarkTechPost.

NVIDIA AI releases C-RADIOv4 vision backbone unifying SigLIP2, DINOv3, …

How do you combine SigLIP2, DINOv3, and SAM3 into a single vision backbone without sacrificing dense or segmentation performance? NVIDIA’s C-RADIOv4 is a new agglomerative vision backbone that distills three strong teacher models, SigLIP2-g-384, DINOv3-7B, and SAM3, into a single student encoder. It extends the AM-RADIO and RADIOv2.5 line, keeping similar computational cost while improving dense prediction quality, resolution robustness, and drop-in compatibility with SAM3.

The key idea is simple. Instead of choosing between a vision language model, a self supervised dense model, and a segmentation model, C-RADIOv4 tries to approximate all three at once with one backbone.

https://www.arxiv.org/pdf/2601.17237

Agglomerative distillation in RADIO

The RADIO family uses agglomerative distillation. A single ViT style student is trained to match both dense feature maps and summary tokens from several heterogeneous teachers.

Earlier RADIO models combined DFN CLIP, DINOv2, and SAM. They already supported multi resolution training but showed ‘mode switching’, where the representation changed qualitatively as input resolution changed. Later work such as PHI-S, RADIOv2.5, and FeatSharp added better multi resolution distillation and regularization, but the teacher set was still limited.

C-RADIOv4 upgrades the teachers:

SigLIP2-g-384 for stronger image text alignment

DINOv3-7B for high quality self supervised dense features

SAM3 for segmentation oriented features and compatibility with the SAM3 decoder

The student is trained so that its dense features match DINOv3 and SAM3, while its summary tokens match SigLIP2 and DINOv3. This gives one encoder that can support classification, retrieval, dense prediction, and segmentation.

Stochastic multi resolution training

C-RADIOv4 uses stochastic multi resolution training rather than a small fixed set of resolutions.

Training samples input sizes from two partitions:

Low resolution: {128, 192, 224, 256, 384, 432}

High resolution: {512, 768, 1024, 1152}

SigLIP2 operates natively at 384 pixels. Its features are upsampled by a factor of 3 using FeatSharp to align with 1152 pixel SAM3 features. SAM3 is trained with mosaic augmentation at 1152 × 1152.

This design smooths the performance curve over resolution and improves low resolution behavior. For example, on ADE20k linear probing, C-RADIOv4-H reaches around:

55.20 mIoU at 512 px

57.02 mIoU at 1024 px

57.72 mIoU at 1536 px

The scaling trend is close to DINOv3-7B while using roughly an order of magnitude fewer parameters.

Removing teacher noise with shift equivariant losses and MESA

Distilling from large vision models tends to copy their artifacts, not just their useful structure. SigLIP2 has border noise patterns, and ViTDet style models can show window boundary artifacts. Direct feature regression can force the student to reproduce those patterns.

C-RADIOv4 introduces two shift equivariant mechanisms to suppress such noise:

Shift equivariant dense loss: Each teacher and the student see independently shifted crops of an image. Before computing the squared error, features are aligned via a shift mapping and the loss only uses overlapping spatial positions. Because the student never sees the same absolute positions as the teacher, it cannot simply memorize position fixed noise and is forced to track input dependent structure instead.

Shift equivariant MESA: C-RADIOv4 also uses MESA style regularization between the online network and an EMA copy. Here again, the student and its EMA see different crops, features are aligned by a shift, and the loss is applied after layer normalization. This encourages smooth loss landscapes and robustness, while being invariant to absolute position.

In addition, training uses DAMP, which injects multiplicative noise into weights. This further improves robustness to corruptions and small distribution shifts.

Balancing teachers with an angular dispersion aware summary loss

The summary loss in previous RADIO models used cosine distance between student and teacher embeddings. Cosine distance removes magnitude but not directional dispersion on the sphere. Some teachers, such as SigLIP2, produce embeddings concentrated in a narrow cone, while DINOv3 variants produce more spread out embeddings.

If raw cosine distance is used, teachers with wider angular dispersion contribute larger losses and dominate optimization. In practice, DINOv3 tended to overshadow SigLIP2 in the summary term.

C-RADIOv4 replaces this with an angle normalized loss. The squared angle between student and teacher embeddings is divided by the teacher’s angular dispersion. Measured dispersions show SigLIP2-g-384 around 0.694, while DINOv3-H+ and DINOv3-7B are around 2.12 and 2.19. Normalizing by these values equalizes their influence and preserves both vision language and dense semantics.
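
To make the balancing idea concrete, the following is a minimal PyTorch sketch of an angle-normalized summary loss written from the description above; the function name, tensor shapes, and the exact way dispersion is applied are our own illustrative assumptions, not NVIDIA's released implementation.

import torch
import torch.nn.functional as F

def angle_normalized_summary_loss(student_emb, teacher_emb, teacher_dispersion):
    # Cosine similarity between matched student/teacher summary embeddings.
    cos = F.cosine_similarity(student_emb, teacher_emb, dim=-1).clamp(-1.0, 1.0)
    # Squared angle on the sphere, scaled by the teacher's angular dispersion so
    # teachers with wider dispersion (DINOv3) do not dominate narrower ones (SigLIP2).
    angle = torch.arccos(cos)
    return (angle ** 2 / teacher_dispersion).mean()

# Toy tensors standing in for per-teacher summary embeddings.
student = torch.randn(8, 1024)
siglip_target = torch.randn(8, 1024)
dino_target = torch.randn(8, 1024)

# Dispersion values quoted in the text: ~0.694 for SigLIP2-g-384, ~2.19 for DINOv3-7B.
summary_loss = (angle_normalized_summary_loss(student, siglip_target, 0.694)
                + angle_normalized_summary_loss(student, dino_target, 2.19))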

Performance: classification, dense prediction, and Probe3d

On ImageNet-1k zero shot classification, C-RADIOv4-H reaches about 83.09 % top-1 accuracy. It matches or improves on RADIOv2.5-H and C-RADIOv3-H across resolutions, with the best performance near 1024 px.

On k-NN classification, C-RADIOv4-H improves over RADIOv2.5 and C-RADIOv3, and matches or surpasses DINOv3 starting around 256 px. DINOv3 peaks near 192–256 px and then degrades, while C-RADIOv4 keeps stable or improving performance at higher resolutions.

Dense and 3D aware metrics show the intended tradeoff. On ADE20k, PASCAL VOC, NAVI, and SPair, C-RADIOv4-H and the SO400M variant outperform earlier RADIO models and are competitive with DINOv3-7B on dense benchmarks. For C-RADIOv4-H, typical scores are:

ADE20k: 55.20 mIoU

VOC: 87.24 mIoU

NAVI: 63.44

SPair: 60.57

https://www.arxiv.org/pdf/2601.17237

On Probe3d, which includes Depth Normals, Surface Normals, NAVI, and SPair, C-RADIOv4-H achieves the best NAVI and SPair scores in the RADIO family. Depth and Surface metrics are close to those of C-RADIOv3-H, with small differences in either direction, rather than a uniform improvement.

Integration with SAM3 and ViTDet-mode deployment

C-RADIOv4 is designed to be a drop in replacement for the Perception Encoder backbone in SAM3. The SAM3 decoder and memory components remain unchanged. A reference implementation is provided in a SAM3 fork. Qualitative examples show that segmentation behavior is preserved for both text prompts such as “shoe”, “helmet”, “bike”, “spectator” and box prompts, and in some reported cases C-RADIOv4 based SAM3 resolves failure cases from the original encoder.

For deployment, C-RADIOv4 exposes a ViTDet-mode configuration. Most transformer blocks use windowed attention, while a few use global attention. Supported window sizes range from 6 × 6 to 32 × 32 tokens, subject to divisibility with patch size and image resolution. On an A100, the SO400M model with window size at most 12 is faster than the SAM3 ViT-L+ encoder across a wide range of input sizes, and the Huge model with window size 8 is close in latency.

This makes C-RADIOv4 a practical backbone for high resolution dense tasks where full global attention at all layers is too expensive.

Key Takeaways

Single unified backbone: C-RADIOv4 distills SigLIP2-g-384, DINOv3-7B, and SAM3 into one ViT-style encoder that supports classification, retrieval, dense prediction, and segmentation.

Any-resolution behavior: Stochastic multi resolution training over {128…1152} px, and FeatSharp upsampling for SigLIP2, stabilizes performance across resolutions and tracks DINOv3-7B scaling with far fewer parameters.

Noise suppression via shift equivariance: Shift equivariant dense loss and shift equivariant MESA prevent the student from copying teacher border and window artifacts, focusing learning on input dependent semantics.

Balanced multi-teacher distillation: An angular dispersion normalized summary loss equalizes the contribution of SigLIP2 and DINOv3, preserving both text alignment and dense representation quality.

SAM3 and ViTDet-ready deployment: C-RADIOv4 can directly replace the SAM3 Perception Encoder, offers ViTDet-mode windowed attention for faster high resolution inference, and is released under the NVIDIA Open Model License.

Check out the Paper, Repo, Model-1 and Model-2.
The post NVIDIA AI releases C-RADIOv4 vision backbone unifying SigLIP2, DINOv3, SAM3 for classification, dense prediction, segmentation workloads at scale appeared first on MarkTechPost.

Structured outputs on Amazon Bedrock: Schema-compliant AI responses

Today, we’re announcing structured outputs on Amazon Bedrock—a capability that fundamentally transforms how you can obtain validated JSON responses from foundation models through constrained decoding for schema compliance.
This represents a paradigm shift in AI application development. Instead of validating JSON responses and writing fallback logic for when they fail, you can move straight to building with the data. With structured outputs, you can build zero-validation data pipelines that trust model outputs, reliable agentic systems that confidently call external functions, and simplified application architectures without retry logic.
In this post, we explore the challenges of traditional JSON generation and how structured outputs solves them. We cover the two core mechanisms—JSON Schema output format and strict tool use—along with implementation details, best practices, and practical code examples. Whether you’re building data extraction pipelines, agentic workflows, or AI-powered APIs, you’ll learn how to use structured outputs to create reliable, production-ready applications. Our companion Jupyter notebook provides hands-on examples for every feature covered here.
The problem with traditional JSON generation
For years, getting structured data from language models meant crafting detailed prompts, hoping for the best, and building elaborate error-handling systems. Even with careful prompting, developers routinely encounter:

Parsing failures: Invalid JSON syntax that breaks json.loads() calls
Missing fields: Required data points absent from responses
Type mismatches: Strings where integers are expected, breaking downstream processing
Schema violations: Responses that technically parse but don’t match your data model

In production systems, these failures compound. A single malformed response can cascade through your pipeline, requiring retries that increase latency and costs. For agentic workflows where models call tools, invalid parameters can break function calls entirely.
Consider a booking system requiring passengers: int. Without schema enforcement, the model might return passengers: “two” or passengers: “2”—syntactically valid JSON, but semantically wrong for your function signature.
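To illustrate, a schema like the following (the field names here are hypothetical) is the kind of constraint structured outputs lets you enforce, so passengers can only ever be an integer:

# Illustrative schema for the booking example: passengers must be an integer,
# so string values like "two" or "2" can no longer appear.
booking_schema = {
    "type": "object",
    "properties": {
        "destination": {"type": "string", "description": "Destination city"},
        "passengers": {"type": "integer", "description": "Number of passengers"}
    },
    "required": ["destination", "passengers"],
    "additionalProperties": False
}
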
What changes with structured outputs
Structured outputs on Amazon Bedrock isn't an incremental improvement—it's a fundamental shift from probabilistic to deterministic output formatting. Through constrained decoding, Amazon Bedrock constrains model responses to conform to your specified JSON schema. Two complementary mechanisms are available:

Feature | Purpose | Use case
JSON Schema output format | Control the model's response format | Data extraction, report generation, API responses
Strict tool use | Validate tool parameters | Agentic workflows, function calling, multi-step automation

These features can be used independently or together, giving you precise control over both what the model outputs and how it calls your functions.
What structured outputs delivers:

Always valid: No more JSON.parse() errors or parsing exceptions
Type safe: Field types are enforced and required fields are always present
Reliable: No retries needed for schema violations
Production ready: Deploy with confidence at enterprise scale

How structured outputs works
Structured outputs uses constrained sampling with compiled grammar artifacts. Here’s what happens when you make a request:

Schema validation: Amazon Bedrock validates your JSON schema against the supported JSON Schema Draft 2020-12 subset
Grammar compilation: For new schemas, Amazon Bedrock compiles a grammar (first request might take longer)
Caching: Compiled grammars are cached for 24 hours, making subsequent requests faster
Constrained generation: The model generates tokens that produce valid JSON matching your schema

Performance considerations:

First request latency: Initial compilation might add latency to new schemas
Cached performance: Subsequent requests with identical schemas have minimal overhead
Cache scope: Grammars are cached per account for 24 hours from first access

Changing the JSON schema structure or a tool’s input schema invalidates the cache, but changing only name or description fields does not.
Getting started with structured outputs
The following example demonstrates structured outputs with the Converse API:

import boto3
import json

# Initialize the Bedrock Runtime client
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-east-1'  # Choose your preferred region
)

# Define your JSON schema
extraction_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Customer name"},
        "email": {"type": "string", "description": "Customer email address"},
        "plan_interest": {"type": "string", "description": "Product plan of interest"},
        "demo_requested": {"type": "boolean", "description": "Whether a demo was requested"}
    },
    "required": ["name", "email", "plan_interest", "demo_requested"],
    "additionalProperties": False
}

# Make the request with structured outputs
response = bedrock_runtime.converse(
    modelId="us.anthropic.claude-opus-4-5-20251101-v1:0",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "text": "Extract the key information from this email: John Smith (john@example.com) is interested in our Enterprise plan and wants to schedule a demo for next Tuesday at 2pm."
                }
            ]
        }
    ],
    inferenceConfig={
        "maxTokens": 1024
    },
    outputConfig={
        "textFormat": {
            "type": "json_schema",
            "structure": {
                "jsonSchema": {
                    "schema": json.dumps(extraction_schema),
                    "name": "lead_extraction",
                    "description": "Extract lead information from customer emails"
                }
            }
        }
    }
)

# Parse the schema-compliant JSON response
result = json.loads(response["output"]["message"]["content"][0]["text"])
print(json.dumps(result, indent=2))

Output:

{
  "name": "John Smith",
  "email": "john@example.com",
  "plan_interest": "Enterprise",
  "demo_requested": true
}

The response conforms to your schema—no additional validation required.
Requirements and best practices
To use structured outputs effectively, follow these guidelines:

Set additionalProperties: false on all objects. This is required for structured outputs to work. Without it, your schema won’t be accepted.

{
  "type": "object",
  "properties": {
    "name": {"type": "string"}
  },
  "required": ["name"],
  "additionalProperties": false
}

Use descriptive field names and descriptions. Models use property names and descriptions to understand what data to extract. Clear names like customer_email outperform generic names like field1.
Use enum for constrained values. When a field has a limited set of valid values, use enum to constrain options. This improves accuracy and produces valid values.
Start basic, then add complexity. Begin with the minimum required fields and add complexity incrementally. Basic schemas compile faster and are easier to maintain.
Reuse schemas to benefit from caching. Structure your application to reuse schemas across requests. The 24-hour grammar cache significantly improves performance for repeated queries.
Check stopReason in every response. Two scenarios can produce non-conforming responses: refusals (when the model declines for safety reasons) and token limits (when max_tokens is reached before completing). Handle both cases in your code.
Test with realistic data before deployment. Validate your schemas against production-representative inputs. Edge cases in real data often reveal schema design issues.
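As a minimal sketch of the stopReason check described above (the helper name and error handling are our own), you might wrap response parsing like this:

import json

def parse_structured_response(response):
    # stopReason is returned by the Converse API alongside the output message.
    stop_reason = response.get("stopReason")
    if stop_reason == "max_tokens":
        raise ValueError("Response was truncated; increase maxTokens and retry.")

    message = response["output"]["message"]
    text_blocks = [block["text"] for block in message["content"] if "text" in block]
    if not text_blocks:
        # For example, the model may have declined to answer for safety reasons.
        raise ValueError(f"No text content returned (stopReason={stop_reason}).")

    return json.loads(text_blocks[0])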

Supported JSON Schema features:

All basic types: object, array, string, integer, number, boolean, null
enum (strings, numbers, bools, or nulls only)
const, anyOf, allOf (with limitations)
$ref, $defs, and definitions (internal references only)
String formats: date-time, time, date, duration, email, hostname, uri, ipv4, ipv6, uuid
Array minItems (only values 0 and 1)

Not supported:

Recursive schemas
External $ref references
Numerical constraints (minimum, maximum, multipleOf)
String constraints (minLength, maxLength)
additionalProperties set to anything other than false

Strict tool use for agentic workflows
When building applications where models call tools, set strict: true in your tool definition to constrain tool parameters to match your input schema exactly:

import boto3
import json

bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock_runtime.converse(
    modelId="us.anthropic.claude-opus-4-5-20251101-v1:0",
    messages=[
        {
            "role": "user",
            "content": [{"text": "What's the weather like in San Francisco?"}]
        }
    ],
    inferenceConfig={"maxTokens": 1024},
    toolConfig={
        "tools": [
            {
                "toolSpec": {
                    "name": "get_weather",
                    "description": "Get the current weather for a specified location",
                    "strict": True,  # Enable strict mode
                    "inputSchema": {
                        "json": {
                            "type": "object",
                            "properties": {
                                "location": {
                                    "type": "string",
                                    "description": "The city and state, e.g., San Francisco, CA"
                                },
                                "unit": {
                                    "type": "string",
                                    "enum": ["celsius", "fahrenheit"],
                                    "description": "Temperature unit"
                                }
                            },
                            "required": ["location", "unit"],
                            "additionalProperties": False
                        }
                    }
                }
            }
        ]
    }
)

# Tool inputs conform to the schema
for content_block in response["output"]["message"]["content"]:
    if "toolUse" in content_block:
        tool_input = content_block["toolUse"]["input"]
        print(f"Tool: {content_block['toolUse']['name']}")
        print(f"Input: {json.dumps(tool_input, indent=2)}")

With strict: true, structured outputs constrains the output so that:

The location field is always a string
The unit field is always either celsius or fahrenheit
No unexpected fields appear in the input
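Continuing the previous snippet, a sketch of the next agentic step might execute the tool and return its result in a follow-up converse call; get_weather and the tool_config variable are our own stand-ins, while the toolResult message shape follows the Converse API:

def get_weather(location: str, unit: str) -> dict:
    # Stand-in implementation; replace with a real weather lookup.
    return {"location": location, "temperature": 18, "unit": unit}

# Start from the conversation so far: the user turn plus the assistant's toolUse turn.
messages = [{"role": "user", "content": [{"text": "What's the weather like in San Francisco?"}]}]
assistant_message = response["output"]["message"]
messages.append(assistant_message)

for block in assistant_message["content"]:
    if "toolUse" in block:
        tool_use = block["toolUse"]
        result = get_weather(**tool_use["input"])  # Safe to unpack: input matches the schema exactly
        messages.append({
            "role": "user",
            "content": [{
                "toolResult": {
                    "toolUseId": tool_use["toolUseId"],
                    "content": [{"json": result}]
                }
            }]
        })

# Ask the model to produce the final answer using the tool result.
followup = bedrock_runtime.converse(
    modelId="us.anthropic.claude-opus-4-5-20251101-v1:0",
    messages=messages,
    inferenceConfig={"maxTokens": 1024},
    toolConfig=tool_config,  # Assumed: the same tool definition as above, stored in a variable
)
print(followup["output"]["message"]["content"][0]["text"])
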

Practical applications across industries
The notebook demonstrates use cases that span industries:

Financial services: Extract structured data from earnings reports, loan applications, and compliance documents. With structured outputs, every required field is present and correctly typed for downstream processing.
Healthcare: Parse clinical notes into structured, schema-compliant records. Extract patient information, diagnoses, and treatment plans into validated JSON for EHR integration.
Ecommerce: Build reliable product catalog enrichment pipelines. Extract specifications, categories, and attributes from product descriptions with consistent, reliable results.
Legal: Analyze contracts and extract key terms, parties, dates, and obligations into structured formats suitable for contract management systems.
Customer service: Build intelligent ticket routing and response systems where extracted intents, sentiments, and entities match your application’s data model.

Choosing the right approach
Our testing revealed clear patterns for when to use each feature:
Use JSON Schema output format when:

You need the model’s response in a specific structure
Building data extraction pipelines
Generating API-ready responses
Creating structured reports or summaries

Use strict tool use when:

Building agentic systems that call external functions
Implementing multi-step workflows with tool chains
Requiring validated parameter types for function calls
Connecting AI to databases, APIs, or external services

Use both together when:

Building complex agents that need validated tool calls and structured final responses
Creating systems where intermediate tool results feed into structured outputs
Implementing enterprise workflows requiring end-to-end schema compliance

API comparison: Converse compared to InvokeModel
Both the Converse API and InvokeModel API support structured outputs, with slightly different parameter formats:

Aspect | Converse API | InvokeModel (Anthropic Claude) | InvokeModel (open-weight models)
Schema location | outputConfig.textFormat | output_config.format | response_format
Tool strict flag | toolSpec.strict | tools[].strict | tools[].function.strict
Schema format | JSON string in jsonSchema.schema | JSON object in schema | JSON object in json_schema.schema
Best for | Conversational workflows | Single-turn inference (Claude) | Single-turn inference (open-weight)

Note: The InvokeModel API uses different request field names depending on the model type. For Anthropic Claude models, use output_config.format for JSON schema outputs. For open-weight models, use response_format instead.
Choose the Converse API for multi-turn conversations and the InvokeModel API when you need direct model access with provider-specific request formats.
Supported models and availability
Structured outputs is generally available in all commercial AWS Regions for select Amazon Bedrock model providers:

Anthropic
DeepSeek
Google
MiniMax
Mistral AI
Moonshot AI
NVIDIA
OpenAI
Qwen

The feature works seamlessly with:

Cross-Region inference: Use structured outputs across AWS Regions without additional setup
Batch inference: Process large volumes with schema-compliant outputs
Streaming: Stream structured responses with ConverseStream or InvokeModelWithResponseStream

Conclusion
In this post, you discovered how structured outputs on Amazon Bedrock reduce the uncertainty of AI-generated JSON through validated, schema-compliant responses. By using JSON Schema output format and strict tool use, you can build reliable data extraction pipelines, robust agentic workflows, and production-ready AI applications—without custom parsing or validation logic. Whether you're extracting data from documents, building intelligent automation, or creating AI-powered APIs, structured outputs deliver the reliability your applications demand.
Structured outputs is now generally available on Amazon Bedrock. To use structured outputs with the Converse APIs, update to the latest AWS SDK. To learn more, see the Amazon Bedrock documentation and explore our sample notebook.
What workflows could validated, schema-compliant JSON unlock in your organization? The notebook provides everything you need to find out.

About the authors

Jeffrey Zeng
Jeffrey Zeng is a Worldwide Specialist Solutions Architect for Generative AI at AWS, leading third-party models on Amazon Bedrock. He focuses on agentic coding and workflows, with hands-on experience helping customers build and deploy AI solutions from proof-of-concept to production.

Jonathan Evans
Jonathan Evans is a Worldwide Solutions Architect for Generative AI at AWS, where he helps customers leverage cutting-edge AI technologies with Anthropic Claude models on Amazon Bedrock, to solve complex business challenges. With a background in AI/ML engineering and hands-on experience supporting machine learning workflows in the cloud, Jonathan is passionate about making advanced AI accessible and impactful for organizations of all sizes.

Manage Amazon SageMaker HyperPod clusters using the HyperPod CLI and S …

Training and deploying large AI models requires advanced distributed computing capabilities, but managing these distributed systems shouldn’t be complex for data scientists and machine learning (ML) practitioners. The command line interface (CLI) and software development kit (SDK) for Amazon SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS) orchestration simplify how you manage cluster infrastructure and use the service’s distributed training and inference capabilities.
The SageMaker HyperPod CLI provides data scientists with an intuitive command-line experience, abstracting away the underlying complexity of distributed systems. Built on top of the SageMaker HyperPod SDK, the CLI offers straightforward commands for managing HyperPod clusters and common workflows like launching training or fine-tuning jobs, deploying inference endpoints, and monitoring cluster performance. This makes it ideal for quick experimentation and iteration.
A layered architecture for simplicity
The HyperPod CLI and SDK follow a multi-layered, shared architecture. The CLI and the Python module serve as user-facing entry points and are both built on top of common SDK components to provide consistent behavior across interfaces. For infrastructure automation, the SDK orchestrates cluster lifecycle management through a combination of AWS CloudFormation stack provisioning and direct AWS API interactions. Training and inference workloads and integrated development environments (IDEs) (Spaces) are expressed as Kubernetes Custom Resource Definitions (CRDs), which the SDK manages through the Kubernetes API.
In this post, we demonstrate how to use the CLI and the SDK to create and manage SageMaker HyperPod clusters in your AWS account. We walk through a practical example and dive deeper into the user workflow and parameter choices.
This post focuses on cluster creation and management. For a deep dive into using the HyperPod CLI and SDK to submit training jobs and deploy inference endpoints, see our companion post: Train and deploy models on Amazon SageMaker HyperPod using the new HyperPod CLI and SDK.
Prerequisites
To follow the examples in this post, you must have the following prerequisites:

An AWS account with access to SageMaker HyperPod, Amazon Simple Storage Service (Amazon S3) and Amazon FSx for Lustre.
Sufficient service quota for creating the HyperPod cluster and instance groups.
A local environment (either your local machine or a cloud-based compute environment) from which to run the SageMaker HyperPod CLI commands, configured as follows:

Operating system based on Linux or MacOS.
Python 3.8 or later installed.
The AWS Command Line Interface (AWS CLI) configured with the appropriate credentials to use the aforementioned services.

Install the SageMaker HyperPod CLI
First, install the latest version of the SageMaker HyperPod CLI and SDK. The examples in this post are based on version 3.5.0. From your local environment, run the following command (you can alternatively install the CLI in a Python virtual environment):

# Install the HyperPod CLI and SDK
pip install sagemaker-hyperpod

This command sets up the tools needed to interact with SageMaker HyperPod clusters. For an existing installation, make sure you have the latest version of the package installed (SageMaker HyperPod 3.5.0 or later) to use the features described in this post. To verify that the CLI is installed correctly, run the hyp command and check the output:

# Check if the HyperPod CLI is correctly installed
hyp

The output will be similar to the following, and includes instructions on how to use the CLI:

Usage: hyp [OPTIONS] COMMAND [ARGS]…

Options:
  --version  Show version information
  --help     Show this message and exit.

Commands:
  configure                       Update any subset of fields in ./config.yaml by passing --<field> flags.
  create                          Create endpoints, pytorch jobs, cluster stacks, space, space access or space admin config.
  delete                          Delete endpoints, pytorch jobs, space, space access or space template.
  describe                        Describe endpoints, pytorch jobs or cluster stacks, spaces or space template.
  exec                            Execute commands in pods for endpoints or pytorch jobs.
  get-cluster-context             Get context related to the current set cluster.
  get-logs                        Get pod logs for endpoints, pytorch jobs or spaces.
  get-monitoring                  Get monitoring configurations for Hyperpod cluster.
  get-operator-logs               Get operator logs for endpoints.
  init                            Initialize a TEMPLATE scaffold in DIRECTORY.
  invoke                          Invoke model endpoints.
  list                            List endpoints, pytorch jobs, cluster stacks, spaces, and space templates.
  list-accelerator-partition-type
                                  List available accelerator partition types for an instance type.
  list-cluster                    List SageMaker Hyperpod Clusters with metadata.
  list-pods                       List pods for endpoints or pytorch jobs.
  reset                           Reset the current directory's config.yaml to an "empty" scaffold: all schema keys set to default values (but keeping the…
  set-cluster-context             Connect to a HyperPod EKS cluster.
  start                           Start space resources.
  stop                            Stop space resources.
  update                          Update an existing HyperPod cluster configuration, space, or space template.
  validate                        Validate this directory’s config.yaml against the appropriate schema.

For more information on CLI usage and the available commands and respective parameters, see the CLI reference documentation.
The HyperPod CLI provides commands to manage the full lifecycle of HyperPod clusters. The following sections explain how to create new clusters, monitor their creation, modify instance groups, and delete clusters.
Creating a new HyperPod cluster
HyperPod clusters can be created through the AWS Management Console or the HyperPod CLI, both of which provide streamlined experiences for cluster creation. The console offers the easiest and most guided approach, while the CLI is especially useful for customers who prefer a programmatic experience—for example, to enable reproducibility or to build automation around cluster creation. Both methods use the same underlying CloudFormation template, which is available in the SageMaker HyperPod cluster setup GitHub repository. For a walkthrough of the console-based experience, see the cluster creation experience blog post.
Creating a new cluster through the HyperPod CLI follows a configuration-based workflow: the CLI first generates configuration files, which are then edited to match the intended cluster specifications. These files are subsequently submitted as a CloudFormation stack that creates the HyperPod cluster along with the required resources, such as a VPC and FSx for Lustre filesystem, among others. To initialize a new cluster configuration, run the following command:
hyp init cluster-stack
This initializes a new cluster configuration in the current directory and generates a config.yaml file that you can use to specify the configuration of the cluster stack. It also creates a README.md describing the functionality and workflow, along with a template for the CloudFormation stack parameters in cfn_params.jinja.

(base) xxxxxxxx@3c06303f9abb hyperpod % hyp init cluster-stack
Initializing new scaffold for 'cluster-stack'...
✔ cluster-stack for schema version='1.0' is initialized in .
🚀 Welcome!
📘 See ./README.md for usage.

The cluster stack’s configuration variables are defined in config.yaml. The following is an excerpt from the file:


# Prefix to be used for all resources. A 4-digit UUID will be added to prefix during submission
resource_name_prefix: hyp-eks-stack
# Boolean to Create HyperPod Cluster Stack
create_hyperpod_cluster_stack: True
# Name of SageMaker HyperPod Cluster
hyperpod_cluster_name: hyperpod-cluster
# Boolean to Create EKS Cluster Stack
create_eks_cluster_stack: True
# The Kubernetes version
kubernetes_version: 1.31

The resource_name_prefix parameter serves as the primary identifier for the AWS resources created during deployment. Each deployment must use a unique resource name prefix to avoid conflicts. The value of the prefix parameter is automatically appended with a unique identifier during cluster creation to provide resource uniqueness.
The configuration can be edited either directly by opening config.yaml in an editor of your choice or by running the hyp configure command. The following example shows how to specify the Kubernetes version of the Amazon EKS cluster that will be created by the stack:
hyp configure --kubernetes-version 1.33
Updating variables through the CLI commands adds a layer of safety by validating the value against the defined schema before it is set in config.yaml.
Besides the Kubernetes version and the resource name prefix, some examples of significant parameters are listed below:

# List of string containing instance group configurations
instance_group_settings:
  - {'InstanceCount': 1, 'InstanceGroupName': 'default', 'InstanceType': 'ml.t3.medium', 'TargetAvailabilityZoneId': 'use2-az2', 'ThreadsPerCore': 1, 'InstanceStorageConfigs': [{'EbsVolumeConfig': {'VolumeSizeInGB': 500}}]}

# Boolean to Create EKS Cluster Stack
create_eks_cluster_stack: True

# The name of the S3 bucket used to store the cluster lifecycle scripts
s3_bucket_name: amzn-s3-demo-bucket

# Storage capacity for the FSx file system in GiB
storage_capacity: 1200

There are two important nuances when updating the configuration values through hyp configure commands:

Underscores (_) in variable names within config.yaml become hyphens (-) in the CLI commands. Thus kubernetes_version in config.yaml is configured via hyp configure --kubernetes-version in the CLI.
Variables that contain lists of entries within config.yaml are configured as JSON lists in the CLI command. For example, multiple instance groups are configured within config.yaml as the following:

instance_group_settings:
  - {'InstanceCount': 1, 'InstanceGroupName': 'default', 'InstanceType': 'ml.t3.medium', 'TargetAvailabilityZoneId': 'use2-az2', 'ThreadsPerCore': 1, 'InstanceStorageConfigs': [{'EbsVolumeConfig': {'VolumeSizeInGB': 500}}]}
  - {'InstanceCount': 2, 'InstanceGroupName': 'worker', 'InstanceType': 'ml.t3.large', 'TargetAvailabilityZoneId': 'use2-az2', 'ThreadsPerCore': 1, 'InstanceStorageConfigs': [{'EbsVolumeConfig': {'VolumeSizeInGB': 1000}}]}

Which translates to the following CLI command:

hyp configure --instance-group-settings "[{'InstanceCount': 1, 'InstanceGroupName': 'default', 'InstanceType': 'ml.t3.medium', 'TargetAvailabilityZoneId': 'use2-az2', 'ThreadsPerCore': 1, 'InstanceStorageConfigs': [{'EbsVolumeConfig': {'VolumeSizeInGB': 500}}]}, {'InstanceCount': 2, 'InstanceGroupName': 'worker', 'InstanceType': 'ml.t3.large', 'TargetAvailabilityZoneId': 'use2-az2', 'ThreadsPerCore': 1, 'InstanceStorageConfigs': [{'EbsVolumeConfig': {'VolumeSizeInGB': 1000}}]}]"

After you're done making the desired changes, validate your configuration file by running the following command:

hyp validate
This will validate the parameters in config.yaml against the defined schema. If successful, the CLI will output the following:

(base) xxxxxxxx@3c06303f9abb hyperpod % hyp validate
✔  config.yaml is valid!

The cluster creation stack can be submitted to CloudFormation by running the following command:

hyp create --region <region>
The hyp create command performs validation and injects values from config.yaml into the cfn_params.jinja template. If no AWS Region is explicitly provided, the command uses the default Region from your AWS credentials configuration. The resolved configuration file and CloudFormation template values are saved to a timestamped subdirectory under the ./run/ directory, providing a lightweight local versioning mechanism to track which configuration was used to create a cluster at a given point in time. You can also choose to commit these artifacts to your version control system to improve reproducibility and auditability. If successful, the command outputs the CloudFormation stack ID:

(base) xxxxxxxx@3c06303f9abb dev % hyp create
✔ config.yaml is valid!
✔ Submitted! Files written to run/20251118T101501
Submitting to default region: us-east-1.
Stack creation initiated. Stack ID: arn:aws:cloudformation:us-east-1:xxxxxxxxxxx:stack/HyperpodClusterStack-d5351/5b83ed40-c491-11f0-a31f-1234073395a1

Monitoring the HyperPod cluster creation process
You can list the existing CloudFormation stacks by running the following command:

hyp list cluster-stack --region <region>
You can optionally filter the output by stack status by adding the following flag: --status "['CREATE_COMPLETE', 'UPDATE_COMPLETE']".
The output of this command will look similar to the following:

(base) xxxxxxxx@3c06303f9abb dev % hyp list cluster-stack
📋 HyperPod Cluster Stacks (94 found)

[1] Stack Details:
Field | Value
———————+—————————————————————————————————————————————————
StackId | arn:aws:cloudformation:us-east-1:xxxxxxxxxxx:stack/HyperpodClusterStack-d5351-S3EndpointStack-10JBD25F965A8/e2898250-c491-11f0-bf25-0afff7e082cf
StackName | HyperpodClusterStack-d5351-S3EndpointStack-10JBD25F965A8
TemplateDescription | S3 Endpoint Stack
 CreationTime | :18:50
StackStatus | CREATE_COMPLETE
ParentId | arn:aws:cloudformation:us-east-1:xxxxxxxxxxx:stack/HyperpodClusterStack-d5351/5b83ed40-c491-11f0-a31f-1234073395a1
RootId | arn:aws:cloudformation:us-east-1:xxxxxxxxxxx:stack/HyperpodClusterStack-d5351/5b83ed40-c491-11f0-a31f-1234073395a1
DriftInformation | {'StackDriftStatus': 'NOT_CHECKED'}

Depending on the configuration in config.yaml, several nested stacks are created that cover different aspects of the HyperPod cluster setup such as the EKSClusterStack, FsxStack and the VPCStack.
You can use the describe command to view details about any of the individual stacks:

hyp describe cluster-stack <stack-name> --region <region>
The output for an example substack, the S3EndpointStack, will look like the following:

(base) xxxxxxxx@3c06303f9abb dev % hyp describe cluster-stack HyperpodClusterStack-d5351-S3EndpointStack-10JBD25F965A8
📋 Stack Details for: HyperpodClusterStack-d5351-S3EndpointStack-10JBD25F965A8
Status: CREATE_COMPLETE
Field | Value
—————————–+—————————————————————————————————————————————————
StackId | arn:aws:cloudformation:us-east-1:xxxxxxxxxxx:stack/HyperpodClusterStack-d5351-S3EndpointStack-10JBD25F965A8/e2898250-c491-11f0-bf25-0afff7e082cf
StackName | HyperpodClusterStack-d5351-S3EndpointStack-10JBD25F965A8
Description | S3 Endpoint Stack
Parameters | [
| {
| "ParameterKey": "ResourceNamePrefix",
| "ParameterValue": "hyp-eks-demo-stack"
| },
| {
| "ParameterKey": "VpcId",
| "ParameterValue": "vpc-XXXXXXXXXXXXXX"
| },
| {
| "ParameterKey": "EksPrivateRouteTableIds",
| "ParameterValue": "rtb-XXXXXXXXXXXXXX,rtb-XXXXXXXXXXXXXX"
| },
| {
| "ParameterKey": "PrivateRouteTableIds",
| "ParameterValue": "rtb-XXXXXXXXXXXXXX,rtb-XXXXXXXXXXXXXX"
| }
| ]
 CreationTime | :18:50.007000+00:00
RollbackConfiguration | {}
StackStatus | CREATE_COMPLETE
DisableRollback | True
NotificationARNs | []
Capabilities | [
| "CAPABILITY_AUTO_EXPAND",
| "CAPABILITY_IAM",
| "CAPABILITY_NAMED_IAM"
| ]
Tags | []
EnableTerminationProtection | False
ParentId | arn:aws:cloudformation:us-east-1:xxxxxxxxxxx:stack/HyperpodClusterStack-d5351/5b83ed40-c491-11f0-a31f-1234073395a1
RootId | arn:aws:cloudformation:us-east-1:xxxxxxxxxxx:stack/HyperpodClusterStack-d5351/5b83ed40-c491-11f0-a31f-1234073395a1
DriftInformation | {
| "StackDriftStatus": "NOT_CHECKED"
| }

If any of the stacks show CREATE_FAILED, ROLLBACK_* or DELETE_*, open the CloudFormation page in the console to investigate the root cause. Failed cluster creation stacks are often related to insufficient service quotas for the cluster itself, the instance groups, or the network components such as VPCs or NAT gateways. Check the corresponding SageMaker HyperPod Quotas to learn more about the required quotas for SageMaker HyperPod.
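If you prefer to monitor stack creation programmatically rather than through the CLI or console, a minimal boto3 sketch like the following can poll the parent stack until it reaches a terminal state; the stack name is taken from the earlier hyp create output and should be replaced with your own:

import time
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")
stack_name = "HyperpodClusterStack-d5351"  # replace with the stack name returned by hyp create

while True:
    # describe_stacks returns the current status of the parent stack
    status = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["StackStatus"]
    print(f"{stack_name}: {status}")
    if status.endswith("_COMPLETE") or status.endswith("_FAILED"):
        break
    time.sleep(60)  # poll once per minute until a terminal state is reached
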
Connecting to a cluster
After the cluster stack has successfully created the required resources and the status has changed to CREATE_COMPLETE, you can configure the CLI and your local Kubernetes environment to interact with the HyperPod cluster.
hyp set-cluster-context --cluster-name <cluster-name> --region <region>
The --cluster-name option specifies the name of the HyperPod cluster to connect to, and the --region option specifies the Region where the cluster has been created. Optionally, a specific namespace can be configured using the --namespace parameter. The command updates your local Kubernetes config in ~/.kube/config, so that you can use both the HyperPod CLI and Kubernetes utilities such as kubectl to manage the resources in your HyperPod cluster.
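Because set-cluster-context writes a standard kubeconfig entry, any Kubernetes tooling picks up the new context. As a quick sanity check, a short sketch with the official Kubernetes Python client (an illustration, not part of the HyperPod CLI) can list the cluster's nodes:

from kubernetes import client, config

# Load the kubeconfig entry that hyp set-cluster-context just updated
config.load_kube_config()

v1 = client.CoreV1Api()
for node in v1.list_node().items:
    # Print each node name and its instance type label, if present
    labels = node.metadata.labels or {}
    print(node.metadata.name, labels.get("node.kubernetes.io/instance-type", "unknown"))
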
See our companion blog post for further information about how to use the CLI to submit training jobs and inference deployments to your newly created HyperPod cluster: Train and deploy models on Amazon SageMaker HyperPod using the new HyperPod CLI and SDK.
Modifying an existing HyperPod cluster
The HyperPod CLI provides a command to modify the instance groups and node recovery mode of an existing HyperPod cluster through the hyp update cluster command. This can be useful if you need to scale your cluster by adding or removing worker nodes, or if you want to change the instance types used by the node groups.
To update the instance groups, run the following command, adapted with your cluster name and desired instance group settings:

hyp update cluster --cluster-name <cluster-name> --region <region> \
  --instance-groups '[{
        "instance_count": 2,
        "instance_group_name": "worker-nodes",
        "instance_type": "ml.m5.large",
        "execution_role": "arn:aws:iam::<account-id>:role/<execution-role-name>",
        "life_cycle_config": {
            "source_s3_uri": "s3://amzn-s3-demo-source-bucket/",
            "on_create": "on_create.sh"
        }
    }]'

Note that all of the fields in the preceding command are required to run the update command, even if, for example, only the instance count is modified. You can list the current cluster and instance group configurations to obtain the required values by running the hyp describe cluster <cluster-name> --region <region> command.
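If you'd rather collect those values programmatically, the same information is exposed through the SageMaker API. The following boto3 sketch (with an assumed cluster name) prints the existing instance group configuration, which you can then adapt for the update call:

import json
import boto3

sm = boto3.client("sagemaker")

# DescribeCluster returns the current instance groups, execution role, and lifecycle config
cluster = sm.describe_cluster(ClusterName="hyperpod-cluster")  # assumed cluster name; replace with yours
for group in cluster["InstanceGroups"]:
    print(json.dumps(group, indent=2, default=str))
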
The output of the update command will look like the following:

[11/18/25 13:21:57] Update Params: {'instance_groups': [ClusterInstanceGroupSpecification(instance_count=2, instance_group_name='worker-nodes', instance_type='ml.m5.large', life_cycle_config=ClusterLifeCycleConfig(source_s3_uri='s3://amzn-s3-demo-source-bucket2', on_create='on_create.sh'), execution_role='arn:aws:iam::037065979077:role/hyp-eks-stack-4e5aExecRole', threads_per_core=<sagemaker_core.main.utils.Unassigned object at 0x106637810>, instance_storage_configs=<sagemaker_core.main.utils.Unassigned object at 0x106637810>, on_start_deep_health_checks=<sagemaker_core.main.utils.Unassigned object at 0x106637810>, training_plan_arn=<sagemaker_core.main.utils.Unassigned object at 0x106637810>, override_vpc_config=<sagemaker_core.main.utils.Unassigned object at 0x106637810>, scheduled_update_config=<sagemaker_core.main.utils.Unassigned object at 0x106637810>, image_id=<sagemaker_core.main.utils.Unassigned object at 0x106637810>)], 'node_recovery': 'Automatic'}
[11/18/25 13:21:58] Updating cluster resource. resources.py:3506
INFO:sagemaker_core.main.resources:Updating cluster resource.
Cluster has been updated
Cluster hyperpod-cluster has been updated

The --node-recovery option lets you configure the node recovery behavior, which can be set to either Automatic or None. For information about the SageMaker HyperPod automatic node recovery feature, see Automatic node recovery.
Deleting an existing HyperPod cluster
To delete an existing HyperPod cluster, run the following command. Note that this action is not reversible:
hyp delete cluster-stack <stack-name> --region <region>
This command removes the specified CloudFormation stack and the associated AWS resources. You can use the optional --retain-resources flag to specify a comma-separated list of logical resource IDs to retain during the deletion process. It's important to carefully consider which resources you need to retain, because the delete operation cannot be undone.
The output of this command will look like the following, asking you to confirm the resource deletion:

⚠ WARNING: This will delete the following 12 resources:

Other (12):
– EKSClusterStack
– FsxStack
– HelmChartStack
– HyperPodClusterStack
– HyperPodParamClusterStack
– LifeCycleScriptStack
– PrivateSubnetStack
– S3BucketStack
– S3EndpointStack
– SageMakerIAMRoleStack
– SecurityGroupStack
– VPCStack

Continue? [y/N]: y
✓ Stack 'HyperpodClusterStack-d5351' deletion initiated successfully

SageMaker HyperPod SDK
SageMaker HyperPod also includes a Python SDK for programmatic access to the features described earlier. The Python SDK is used by the CLI commands and is installed when you install the sagemaker-hyperpod Python package as described in the beginning of this post. The HyperPod CLI is best suited for users who prefer a streamlined, interactive experience for common HyperPod management tasks like creating and monitoring clusters, training jobs, and inference endpoints. It’s particularly helpful for quick prototyping, experimentation, and automating repetitive HyperPod workflows through scripts or continuous integration and delivery (CI/CD) pipelines. In contrast, the HyperPod SDK provides more programmatic control and flexibility, making it the preferred choice when you need to embed HyperPod functionality directly into your application, integrate with other AWS or third-party services, or build complex, customized HyperPod management workflows. Consider the complexity of your use case, the need for automation and integration, and your team’s familiarity with programming languages when deciding whether to use the HyperPod CLI or SDK.
The SageMaker HyperPod CLI GitHub repository shows examples of how cluster creation and management can be implemented using the Python SDK.
Conclusion
The SageMaker HyperPod CLI and SDK simplify cluster creation and management. With the examples in this post, we’ve demonstrated how these tools provide value through:

Simplified lifecycle management – From initial configuration to cluster updates and cleanup, the CLI aligns with how teams manage long-running training and inference environments and abstracts away unnecessary complexity.
Declarative control when needed – The SDK exposes the underlying configuration model, so that teams can codify cluster specifications, instance groups, storage filesystems, and more.
Integrated observability – Visibility into CloudFormation stacks is available without switching tools, supporting smooth iteration during development and operation.

Getting started with these tools is as straightforward as installing the SageMaker HyperPod package. The SageMaker HyperPod CLI and SDK provide the right level of abstraction for both data scientists looking to quickly experiment with distributed training and ML engineers building production systems.
If you’re interested in how to use the HyperPod CLI and SDK for submitting training jobs and deploying models to your new cluster, make sure to check our companion blog post: Train and deploy models on Amazon SageMaker HyperPod using the new HyperPod CLI and SDK.

About the authors

Nicolas Jourdan
Nicolas Jourdan is a Specialist Solutions Architect at AWS, where he helps customers unlock the full potential of AI and ML in the cloud. He holds a PhD in Engineering from TU Darmstadt in Germany, where his research focused on the reliability and MLOps of industrial ML applications. Nicolas has extensive hands-on experience across industries, including autonomous driving, drones, and manufacturing, having worked in roles ranging from research scientist to engineering manager. He has contributed to award-winning research, holds patents in object detection and anomaly detection, and is passionate about applying cutting-edge AI to solve complex real-world problems.

Andrew Brown
Andrew Brown is a Sr. Solutions Architect who has been working at AWS in the Energy Industry for the last four years. He specializes in Deep Learning and High Performance Computing.

Giuseppe Angelo Porcelli
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Evaluate generative AI models with an Amazon Nova rubric-based LLM jud …

In the post Evaluating generative AI models with Amazon Nova LLM-as-a-Judge on Amazon SageMaker AI, we introduced the Amazon Nova LLM-as-a-judge capability, which is a specialized evaluation model available through Amazon SageMaker AI that you can use to systematically measure the relative performance of generative AI systems.
SageMaker AI now offers a rubric-based large language model (LLM) judge powered by Amazon Nova. Instead of using the same general rules for every task, it automatically creates specific evaluation criteria for each individual prompt. This helps generative AI developers and machine learning (ML) engineers automatically generate precise, scenario-specific evaluation criteria for their LLMs and generative AI products, without manually crafting rule sets for every use case.
In this post, we explore the Amazon Nova rubric-based judge feature: what a rubric-based judge is, how the judge is trained, what metrics to consider, and how to calibrate the judge. We share notebook code that applies the Amazon Nova rubric-based LLM-as-a-judge methodology to evaluate and compare the outputs of two different LLMs using SageMaker training jobs.
Overview of Amazon Nova rubric-based judge
A rubric-based LLM judge uses a highly capable LLM (specifically Amazon Nova) to act as a judge for responses generated by other AI models or by humans. This capability provides pairwise comparisons between model iterations, so you can make data-driven decisions about model improvements with confidence. A rubric is a scoring guide used to evaluate performance. Traditionally, to evaluate an AI model, humans manually write a checklist of rules (a static rubric) that applies to every task. For example: "Is the answer polite?" or "Is the answer short?" The challenge with this approach is that general rules don't scale across use cases: a creative writing prompt needs different grading criteria than a coding prompt or a legal summary, so a single general checklist falls short. The Amazon Nova powered judge instead creates a custom checklist for each interaction. Users send a specific prompt to the system (for example, "Summarize this medical document for a patient"). The judge analyzes that specific prompt and decides in the moment what an ideal answer should look like. It automatically writes criteria such as the following:

Does it use simple, non-medical language?
Does it capture the diagnosis accurately?
Is the tone empathetic?

The judge then looks at the output generated by your LLM and grades it against the specific criteria it just created. You get a quality score based on rules that are relevant to that specific task, not generic rules.
Example input
For the example prompt "Do dinosaurs really exist?", we have the two responses shown below.

Response A:

The simple answer is: Yes, dinosaurs absolutely existed, but they do not exist today (except for their bird descendants). Here’s a breakdown of what that means: 1. Dinosaurs Were Very Real For over 150 million years, dinosaurs were the dominant land animals on Earth. We know they existed because we have an enormous amount of proof: Fossils: We have found millions of dinosaur fossils—bones, teeth, skulls, and claws—all over the world. These aren’t myths; they are the actual preserved remains of these animals. Footprints: We have discovered countless fossilized trackways, which are footprints left in mud that later turned to stone. These tell us how dinosaurs walked, ran, and lived in groups. Eggs and Nests: Paleontologists have found fossilized dinosaur eggs and nests, showing us how they reproduced. Even Soft Tissues: In some extremely rare cases, scientists have found evidence of feathers, skin impressions, and even chemical traces of proteins from dinosaurs. 2. But They Are Extinct (Most of Them) About 66 million years ago, a massive asteroid struck the Earth (in what is now the Gulf of Mexico). This caused a catastrophic event that led to the extinction of all non-avian dinosaurs—that is, all dinosaurs except for the ones that could fly. 3. Their Descendants Are All Around Us: Birds! This is the most exciting part of the answer! Birds are actually a type of dinosaur. Think of it this way: The dinosaur group is huge. T. rex, Triceratops, and Brachiosaurus are all famous members. Another branch of the dinosaur family tree were small, feathered dinosaurs. One group of these feathered dinosaurs evolved the ability to fly. They are the direct ancestors of all modern birds. So, when you see a pigeon on the street or a chicken on your plate, you are looking at a living dinosaur. In summary:

Did dinosaurs exist? Yes, overwhelmingly, we have proof they were real animals.
Do they exist today? Not in the form of T. rex or Stegosaurus. Those specific types are extinct.
But wait! Their direct descendants, the birds, are everywhere. So, in a very real scientific sense, dinosaurs are still with us today.

Response B:
Yes, dinosaurs did really exist — but not today. Dinosaurs were a diverse group of reptiles that lived on Earth millions of years ago, primarily during the Mesozoic Era, which spanned from about 252 million years ago to 66 million years ago. Fossil evidence — including bones, footprints, eggs, and even skin impressions — confirms that dinosaurs once roamed the planet. Scientists have identified hundreds of dinosaur species, ranging from the massive Tyrannosaurus rex and long-necked Brachiosaurus to smaller, agile predators like Velociraptor. Most non-avian dinosaurs went extinct around 66 million years ago, likely due to a massive asteroid impact combined with volcanic activity and climate change. However, one group of dinosaurs survived: the ancestors of modern birds. So in a scientific sense, birds are living dinosaurs, making them the closest living relatives to extinct dinosaurs like T. rex and Triceratops. So while you won’t see a Stegosaurus walking around today, the legacy of dinosaurs lives on — especially every time you see a sparrow or eagle fly by.

Example output
Out of the two responses, the rubric-based judge prefers Response A and also provides a justification of why it prefers response A over response B, as shown in the following screenshot.

The evaluation is tailored to the exact intent of the user’s prompt (coding vs. writing vs. summarizing). Generative AI developers, data scientists, and ML engineers don’t have to spend hundreds of hours manually writing evaluation rules for every possible scenario. You can evaluate thousands of different types of prompts instantly, achieving high quality across diverse use cases.
Enterprise implementation examples
The Amazon Nova rubric-based LLM judge addresses critical evaluation challenges across different scenarios:

Model development and checkpoint selection – Development teams integrate the Amazon Nova rubric-based judge evaluation into training pipelines to automatically evaluate checkpoints. Per-criterion scores reveal which capabilities strengthened or regressed across iterations, enabling data-driven decisions about hyperparameter adjustments and data curation.
Training data quality control – Teams use the Amazon Nova rubric-based judge evaluation to filter supervised fine-tuning datasets by generating point-wise scores on relevance criteria, identifying low-quality examples. For preference datasets, calculated margins between response pairs enable curriculum learning strategies that filter out overwhelmingly one-sided examples, which provide limited learning signal.
Automated deep dive and root cause analysis – Organizations deploying generative AI at scale can use the Amazon Nova rubric-based judge evaluation for systematic analysis across thousands of model outputs without manual review. When models exhibit quality issues, developers can examine which specific criteria drive preference judgments, identifying systematic weaknesses that inform targeted improvements instead of broad retraining efforts.

How dynamic rubric generation works
The Amazon Nova rubric-based LLM judge takes as input a triplet: <prompt, response_1, response_2>. The judge compares the quality of the two responses for the given prompt and outputs a preference label. In addition to the overall label, the judge generates a justification for its decision, guided by a rubric.
A rubric is a set of weighted criteria used to evaluate the two responses. The rubric-based LLM judge is trained to generate criteria with weights that sum to 1. Each criterion in the rubric has a short_name, description, and weight. The judge’s decision includes a score for each response on each criterion in the rubric along with justifications for the scores.
The Amazon Nova rubric-based LLM judge employs an evaluation methodology where each judgment is supported by dynamically generated, prompt-specific criteria. When the judge receives an evaluation request containing a prompt and candidate responses, it analyzes the prompt to understand the prompt context, and generates criteria based on that context. This dynamic generation process makes sure evaluations are grounded in criteria directly applicable to the task at hand, providing transparent and interpretable assessments.
For each evaluation, the judge produces structured YAML output containing the generated criteria with their definitions, per-criterion scores on a 1–5 scale, and detailed justifications explaining each score. The final output includes one of four preference labels: [[A>B]], [[B>A]], [[A=B]], or [[A=B (both bad)]]. Each criterion score is accompanied by a justification that grounds the assessment in observable characteristics of the responses, enabling deep-dive analysis and debugging of model behavior.
Comparing rubric-based Amazon Nova LLM-as-a-judge to previous versions
The rubric-based judge differs from previous versions in how it presents evaluation results and what information it provides.
The previous version of the Amazon Nova LLM-as-a-judge model returned simple preference labels ([[A>B]] or [[B>A]]). The rubric-based version generates a structured YAML output that consists of the following:

A prompt-specific rubric for assessing the responses, organized as a set of criteria with associated per-criterion importance weights (the weights sum to 1)
A brief natural language description of each criterion
A Likert score (on a 1–5 scale) or a binary (true/false) decision for each criterion for every candidate response in the input
A justification for each criterion score for every candidate response
An overall preference judgement: one of A>B, B>A, A=B, or A=B (both bad)

The new detailed output format facilitates a broad range of nuanced use cases. For example, specific criteria within rubrics allow for pointed comparisons of responses: a succinct response might be more suitable for certain use cases, whereas a comprehensive response might be needed in others. Justifications and explicit criterion scoring help users discard criteria that are unsuitable for their needs and recompute the preference judgements without rerunning the query through the LLM judge.
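As an illustration of that last point, the following minimal sketch (with illustrative data structures rather than the exact output schema) drops criteria you don't care about, renormalizes the remaining weights, and recomputes the preference locally:

# Illustrative per-criterion output for one comparison; field names are assumptions
criteria = {
    "completeness": {"weight": 0.5, "score_a": 2, "score_b": 5},
    "clarity": {"weight": 0.3, "score_a": 3, "score_b": 5},
    "accuracy": {"weight": 0.2, "score_a": 4, "score_b": 5},
}

def recompute(criteria, keep):
    """Renormalize weights over the kept criteria and recompute weighted scores."""
    kept = [c for name, c in criteria.items() if name in keep]
    total_weight = sum(c["weight"] for c in kept)
    # Normalize 1-5 scale scores to 0-1 before weighting (see the metrics explanation later in this post)
    score_a = sum(c["weight"] / total_weight * (c["score_a"] - 1) / 4 for c in kept)
    score_b = sum(c["weight"] / total_weight * (c["score_b"] - 1) / 4 for c in kept)
    label = "A>B" if score_a > score_b else "B>A" if score_b > score_a else "A=B"
    return label, round(score_a, 3), round(score_b, 3)

# Keep only the criteria that matter for your use case, for example accuracy and completeness
print(recompute(criteria, keep={"accuracy", "completeness"}))
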
Metrics explanation
In our judge evaluation process, we use several important metrics to serve as comparison points for ranking judge quality. Forward agreement measures agreement with human preference when the chosen and rejected responses are presented to the judge in a fixed order, which means the correct label is always either A>B or B>A across the entire dataset. Because positional consistency is an important desired property of a trustworthy LLM judge, we evaluate our checkpoints on reconciled agreement: that is, we obtain two judgements, with the responses presented to the judge in both possible orders. We only credit the judge with a correct answer if the judge agrees in both directions and the judgement matches human preference. This number, by definition, will always be lower than forward agreement. However, because real-world datasets aren't sorted, it provides a more accurate proxy for the real-world performance of an LLM judge model.
Weighted scores (weighted_score_A and weighted_score_B) are new metrics added to the rubric judge evaluation output, which provide a view into the confidence of the judgment. A large difference between the weighted scores indicates a strong preference for one response over the other. These scores are calculated per sample based on the assigned scores for each criterion in the rubric. Each criterion score is normalized to a 0–1 range (scale scores 1–5 map to 0.0–1.0, and binary True/False map to 1.0/0.0), then multiplied by the criterion's weight and summed to produce the weighted score for each response.
The score_margin shows the difference between the weighted scores, with negative values indicating a preference towards response B and positive values indicating a preference towards response A. In the final evaluation output, these metrics are reported as averages across all samples. Per-sample criteria breakdowns, individual scores, and justifications can be found in the detailed Parquet output file.
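To work with those per-sample details directly, you can load the Parquet file with pandas after copying it locally from the job's output location; the file name below is a placeholder for illustration:

import pandas as pd

# Assumed local copy of the detailed per-sample output; the actual file name and
# columns come from the evaluation job's S3 output location
details = pd.read_parquet("eval_details.parquet")

# Inspect the available columns, then look at one sample's criteria, scores, and justifications
print(details.columns.tolist())
print(details.iloc[0])
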
For each comparison sample, we can get the specific criteria that the new rubric-based judge model used to compare the two responses, as in the following example output:

================================================================================
Row 1:
  Preference: ['B>A']
  A wins: 0.0
  B wins: 2.0
  Weighted A: 0.225
  Weighted B: 1.000
  Margin: -0.775

  Overall Justification:
    Response B provides a comprehensive and detailed explanation of photosynthesis, covering the process, location, chemical equation, and importance. Response A only provides a brief, surface-level description without explaining the mechanism or significance.

  Criteria:

    completeness:
      Score A: 2, Score B: 5
      Weight: 0.5, Type: scale
      Description: How thoroughly the response explains the photosynthesis process.
      Justification A: Response A mentions the basic inputs and outputs but lacks detail on the mechanism, location in the cell, or the chemical equation.
      Justification B: Response B provides a complete explanation including the process, chloroplasts, chemical equation, and the importance to life on Earth.

    clarity:
      Score A: 3, Score B: 5
      Weight: 0.3, Type: scale
      Description: How clearly the response communicates the concept.
      Justification A: Response A is clear but overly simplistic, lacking the detail needed for full understanding.
      Justification B: Response B is well-structured and clearly explains each component of photosynthesis in an accessible way.

    accuracy:
      Score A: 4, Score B: 5
      Weight: 0.2, Type: scale
      Description: How accurate the scientific information is.
      Justification A: Response A is accurate in what it states but incomplete.
      Justification B: Response B is fully accurate and includes the correct chemical equation and scientific terminology.
================================================================================

These weighted metrics are informational and provide quantitative insight into the scoring breakdown, but the actual preference decision (A>B, B>A, or A=B) that determines the final win counts is based on the judge model’s overall preference output.
Training approach for the judge
The Amazon Nova rubric-based judge is trained with a multi-aspect reward package. In our training methodology, we optimize for several desirable characteristics for an LLM judge using an effective reward formulation. We mainly target the following criteria:

Preference accuracy – The judge is rewarded when it produces decisions that align with gold human preferences, that is, when it chooses the response that human annotators preferred.
Positional consistency – The judge's decisions are trained to be consistent regardless of the order in which the candidate responses are presented.
Justification quality – The judge’s justifications for making the decision must align with the generated rubrics, scores, and final judgement.
Score calibration – The weighted scores for the responses must be calibrated with the decision accuracy (high confidence judgements must be correct more often than low confidence judgements).

We start with human-annotated preference data and employ a custom data filtering and synthetic data generation setup to obtain rubric-aligned preference justifications. We sample from the generated synthetic rubrics and use a custom pipeline to train the Amazon Nova rubric-based LLM judge to generate appropriate criteria at the right granularity for consistent and robust decision-making.
Benchmark performance
Testing on standard evaluation datasets shows improvements, particularly on tasks requiring nuanced judgment, as shown in the following table.

Benchmark | Previous Amazon Nova Judge | New Amazon Nova Rubric-Based Judge
PPE | 0.61 | 0.64
RMBench | 0.66 | 0.88
RewardBench | 0.88 | 0.90
JudgeBench | 0.51 | 0.76
CodeUltraFeedback | 0.69 | 0.72
MMEval | 0.80 | 0.84

The larger improvements on JudgeBench and RMBench reflect better handling of complex evaluation scenarios.
Calibration
During our training process, as well as during postprocessing, we evaluate the Amazon Nova rubric-based judge's ability to make well-calibrated decisions. To achieve balanced calibration, we look at confidence buckets on a human-annotated preference dataset, using the difference of weighted scores for each response pair as the confidence signal. We aim for calibration of confidence to accuracy: ideally, the LLM judge should be more accurate when making high-confidence decisions and is allowed to be less accurate when making low-confidence decisions. We find that this calibration methodology results in consistent decision-making on in-distribution and out-of-distribution datasets. We also look at the distributions of scores generated for different criteria, expecting an approximately normal distribution over Likert scale scores (1–5) across the evaluation dataset. This two-pronged calibration check helps us identify better LLM judge checkpoints among several similarly well-performing checkpoints.
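As a rough illustration of this calibration check (not the internal training code), you could bucket judgements by the absolute weighted-score margin as a confidence proxy and measure agreement with human labels per bucket:

import numpy as np

def calibration_table(margins, judge_labels, human_labels, n_buckets=5):
    """Bucket samples by |margin| (a confidence proxy) and report accuracy per bucket."""
    margins = np.abs(np.asarray(margins))
    correct = np.asarray(judge_labels) == np.asarray(human_labels)
    edges = np.quantile(margins, np.linspace(0, 1, n_buckets + 1))
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (margins >= lo) & (margins <= hi)
        if mask.any():
            print(f"confidence {lo:.2f}-{hi:.2f}: accuracy {correct[mask].mean():.2f} (n={mask.sum()})")

# A well-calibrated judge should show higher accuracy in the high-confidence buckets
calibration_table(
    margins=[-0.70, 0.05, 0.40, -0.10, 0.65],
    judge_labels=["B>A", "A=B", "A>B", "B>A", "A>B"],
    human_labels=["B>A", "A>B", "A>B", "A>B", "A>B"],
)
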
Use cases of rubric-based judgement
The reliability of dynamically generated rubrics stems from three decisions:

The judge is trained on diverse, high-quality rubric-annotated preference data representing real-world use cases, teaching it patterns that distinguish effective evaluation criteria from superficial ones.
Our filtering mechanism during training prioritizes rubrics exhibiting desirable properties—comprehensiveness, mutual exclusivity, appropriate specificity, and task relevance—making sure the model learns from the best examples.
Our reward formulation directly incentivizes rubric quality: criteria that lead to accurate, position-invariant preferences with well-calibrated confidence receive positive rewards, whereas those producing inconsistent judgments are penalized.

How to use rubrics to improve practical applications
Many modern applications operate in reference-free environments, where no gold-standard human answers exist. In these cases, the usefulness of the rubric is paramount. In this section, we spotlight instances where rubrics generated by our judge could be useful inputs for informed decision-making. We demonstrate how outputs of our rubric-based judge—specifically the weighted criteria, granular scores, and explicit justifications—serve as critical control mechanisms.
Evaluating RAG systems
In Retrieval Augmented Generation (RAG), the primary failure mode is hallucination. Traditional preference judges typically conflate "is the response good?" with "is this fluent?", "is this well-formatted?", "does the internal logic hold up?", and so on. A fluent but factually incorrect response is often perceived as more credible than a disjointed one containing accurate information. A factuality-focused evaluation helps you choose a summarization model that stays grounded in the retrieved results instead of hallucinating. Using a rubric-based judge for such judgements helps you understand whether a preference judgement is based on criteria like fluency and formatting, or on relevant criteria such as faithfulness and context relevance. Users can disregard the scores of irrelevant criteria and re-evaluate judgements based on the subset of criteria they care about for their application.
The creative critic
In this example, we look in the other direction, where creativity and originality are desirable over faithfulness to real-world facts or previous context. Consider a use case where you are using an LLM to generate short stories or scripts that are original, but the user provides a few examples of past scripts to demonstrate the requirements. Selecting good outputs from these generations requires the generated stories to be sufficiently different from the examples, creative, original, and free of direct borrowing from existing training data. When using our rubric-based judge, the end user could index on criteria such as originality, coherence, and engagement to obtain preference judgements suited to this use case, and could further inspect the explicit justifications for criterion scores to verify the specific type of originality and creativity that is desirable.
Solution overview
This solution demonstrates how to evaluate generative AI models on SageMaker AI using a rubric-based judge capability. You can also evaluate human generated responses, but in this solution, we show how you can evaluate responses generated by other LLMs such as Qwen models using Amazon Nova as a rubric-based judge.
First, we prepare a dataset by sampling questions from the Stanford Question Answering Dataset (SQuAD) and generating candidate responses from both Qwen2.5 1.5B Instruct and Qwen2.5 7B Instruct. Both models are accessed through SageMaker hosted Hugging Face endpoints. The responses from both models are saved in a JSONL file (llm_judge.jsonl) containing the prompt, response_A (from Qwen2.5 1.5B Instruct), and response_B (from Qwen2.5 7B Instruct).
Next, the JSONL file is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket. A PyTorch Estimator then launches an evaluation job using the Amazon Nova rubric-based LLM-as-a-judge recipe. The judge model dynamically generates evaluation rubrics and criteria tailored to each task, then compares the two candidate responses against these criteria. The job runs on GPU instances such as ml.g5.12xlarge and produces evaluation metrics, including per-criterion scores, justifications, comparative assessments, preference counts, and confidence measures. Results are saved to Amazon S3 for analysis.
Finally, a visualization function renders charts and tables, summarizing the generated rubrics, score distributions across evaluation dimensions, comparative performance between the two Qwen2.5 models, and detailed examples with justifications. Through this end-to-end approach, you can assess which model performs better, identify specific strengths and weaknesses, track improvements, and make data-driven decisions about deploying generative models—all without manual annotation.
Prerequisites
You must complete the following prerequisites before you can run the notebook:

Make the following quota increase requests for SageMaker AI. For this use case, you must request (on the Service Quotas console) a minimum of two ml.g5.12xlarge instances for endpoint usage and at least one ml.g5.12xlarge instance for training job usage.
Create an AWS Identity and Access Management (IAM) role with the managed policies AmazonSageMakerFullAccess, AmazonS3FullAccess, and AmazonBedrockFullAccess to give the required access to SageMaker AI and Amazon Bedrock to run the examples.
(Optional) Create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding IAM role. (You can use JupyterLab in your local setup, too.)
Before proceeding, make sure to grant the execution role direct s3:PutObject permissions for your S3 bucket prefix as an inline policy:

{
  "Effect": "Allow",
  "Action": [
    "s3:PutObject",
    "s3:GetObject",
    "s3:ListBucket"
  ],
  "Resource": [
    "arn:aws:s3:::my-bucket-east",
    "arn:aws:s3:::my-bucket-east/*"
  ]
}

Clone the GitHub repository with the assets for this deployment. This repository consists of a notebook that references training assets.

git clone https://github.com/aws-samples/amazon-nova-samples.git
cd customization/Nova_2.0/04_eval/Amazon-Nova-Rubric-Based-LLM-As-A-Judge

Run the notebook Amazon-Nova-Rubric-LLM-as-a-Judge-Sagemaker-AI.ipynb to start using the Amazon Nova LLM-as-a-judge implementation on SageMaker AI.

Configure models
To conduct a rubric-based Amazon Nova LLM-as-a-judge evaluation, you must generate outputs from both candidate models you want to compare. In this project, we deploy Qwen2.5 1.5B Instruct and Qwen2.5 7B Instruct on SageMaker to generate responses that will be compared by the Amazon Nova judge model.
Both models are open-weight multilingual language models deployed on dedicated SageMaker endpoints. This is achieved by using the HuggingFaceModel deployment interface. To deploy the Qwen2.5 1.5B Instruct and Qwen2.5 7B Instruct models, we provide a convenient script that accepts the model name as an argument:

python3 deploy_model_arg.py Qwen/Qwen2.5-1.5B-Instruct
python3 deploy_model_arg.py Qwen/Qwen2.5-7B-Instruct

We have also included the ability to test both of these deployed models. When you have deployed the models, you can move on to creating the evaluation data for the rubric-based Amazon Nova LLM-as-a-judge.
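The repository's helper scripts handle the endpoint test, but a minimal direct check with boto3 looks roughly like the following; the endpoint name matches the deployment above, and the payload format assumes a standard Hugging Face text-generation container:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Assumed payload format for a Hugging Face text-generation container
response = runtime.invoke_endpoint(
    EndpointName="qwen25-7b-instruct-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": "Do dinosaurs really exist?", "parameters": {"max_new_tokens": 256}}),
)
print(json.loads(response["Body"].read()))
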
Prepare dataset
To create a realistic evaluation dataset for comparing the Qwen models, we used SQuAD, a widely adopted benchmark in natural language understanding distributed under the CC BY-SA 4.0 license. SQuAD consists of thousands of crowd-sourced question-answer pairs covering a diverse range of Wikipedia articles. By sampling from this dataset, we made sure that our evaluation prompts reflected high-quality, factual question-answering tasks representative of real-world applications.
We began by loading a small subset of examples to keep the workflow fast and reproducible. Specifically, we used the Hugging Face datasets library to download and load the first 20 examples from the SQuAD training split:

from datasets import load_dataset
squad = load_dataset("squad", split="train[:20]")

This command retrieves a slice of the full dataset, containing 20 entries with structured fields including context, question, and answers. To verify the contents and inspect an example, we printed out a sample question and its ground truth answer:

print(squad[3]["question"])
print(squad[3]["answers"]["text"][0])

For the evaluation set, we selected the first six questions from this subset:

questions = [squad[i]["question"] for i in range(6)]
Generate evaluation dataset
After preparing a set of evaluation questions from SQuAD, we generated outputs from both Qwen2.5 models and assembled them into a structured dataset to be used by the Amazon Nova rubric-based LLM-as-a-judge workflow. This dataset serves as the core input for SageMaker AI evaluation recipes. To do this, we iterated over each question prompt and invoked the generation function for both SageMaker endpoints:

generate_response("qwen25-15b-instruct-endpoint", q) for completions from the Qwen2.5 1.5B Instruct model
generate_response("qwen25-7b-instruct-endpoint", q) for completions from the Qwen2.5 7B Instruct model

For each prompt, the workflow attempted to generate a response from each model. The following code calls two different versions of the Qwen2.5 model. This allows the LLM judge to later determine if the larger model provides significantly better accuracy or if the smaller model is sufficient for the task.

import json  # needed for json.dumps when writing the JSONL records

# Define the output file path for the LLM judge dataset
output_path = "llm_judge.jsonl"

with open(output_path, "w") as f:
    for q in questions:
        try:
            # Generate response from Model A (1.5B parameter model)
            response_a = generate_response("qwen25-15b-instruct-endpoint", q)
        except Exception as e:
            # Fallback error message if the API call fails
            response_a = f"[Qwen2.5 generation failed: {e}]"
        try:
            # Generate response from Model B (7B parameter model)
            response_b = generate_response("qwen25-7b-instruct-endpoint", q)
        except Exception as e:
            # Fallback error message if the API call fails
            response_b = f"[qwen25-7b generation failed: {e}]"
        # Construct a dictionary containing the prompt and both model responses
        row = {
            "prompt": q,
            "response_A": response_a,
            "response_B": response_b
        }
        # Write the record to the JSONL file as a single line
        f.write(json.dumps(row) + "\n")

print(f"JSONL file created at: {output_path}")

This workflow produced a JSON Lines file named llm_judge.jsonl. Each line contains a single evaluation record structured as follows:

{
  "prompt": "What is the capital of France?",
  "response_A": "The capital of France is Paris.",
  "response_B": "Paris is the capital city of France."
}

Then, we uploaded the llm_judge.jsonl to an S3 bucket:

upload_to_s3(
    "llm_judge.jsonl",
    "s3://<YOUR_BUCKET_NAME>/datasets/byo-datasets-dev/custom-llm-judge/llm_judge.jsonl"
)
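
The upload_to_s3 helper comes from the sample notebook; if you prefer to skip it, the equivalent with plain boto3 is roughly the following (the bucket name and key are placeholders that should match the URI above):

import boto3

s3 = boto3.client("s3")
# Upload the evaluation dataset so the training job can read it from Amazon S3
s3.upload_file(
    Filename="llm_judge.jsonl",
    Bucket="<YOUR_BUCKET_NAME>",
    Key="datasets/byo-datasets-dev/custom-llm-judge/llm_judge.jsonl",
)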

Launch Amazon Nova rubric-based LLM-as-a-judge evaluation job
After preparing the dataset and creating the evaluation recipe, the final step is to launch the SageMaker training job that performs the Amazon Nova rubric-based LLM-as-a-judge evaluation. In this workflow, the training job acts as a fully managed, self-contained process that loads the judge model, processes the comparison dataset, applies dynamically generated rubrics, and generates comprehensive evaluation metrics in your designated Amazon S3 location. We use the PyTorch estimator class from the SageMaker Python SDK to encapsulate the configuration for the evaluation run. The estimator defines the compute resources, container image, evaluation recipe, and output paths for storing results:

from sagemaker.pytorch import PyTorch  # estimator class from the SageMaker Python SDK

estimator = PyTorch(
    output_path=output_s3_uri,
    base_job_name=job_name,
    role=role,
    instance_type=instance_type,
    training_recipe=recipe_path,
    sagemaker_session=sagemaker_session,
    image_uri=image_uri,
    disable_profiler=True,
    debugger_hook_config=False,
)

After the estimator is configured, you initiate the evaluation job using the fit() method. This call submits the job to the SageMaker control plane, provisions the compute cluster (ml.g5.12xlarge instances), and begins processing your evaluation dataset:
estimator.fit(inputs={"train": evalInput})

The job will execute the rubric-based comparison, with the Amazon Nova judge model dynamically generating evaluation criteria and scoring both Qwen2.5 model outputs. Results, including per-criterion scores, justifications, and comparative assessments, are automatically saved to your specified S3 output path for downstream analysis and visualization.
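After the job finishes, you can copy the evaluation artifacts from the configured output location for local inspection. A minimal sketch using the SageMaker SDK's S3Downloader follows; the exact key layout under output_s3_uri depends on the generated training job name, so replace the placeholder accordingly:

from sagemaker.s3 import S3Downloader

# Copy everything the evaluation job wrote under its output prefix;
# replace <job-name> with the training job name printed when fit() starts
S3Downloader.download(
    s3_uri=f"{output_s3_uri}/<job-name>/output",
    local_path="./nova_judge_results",
)
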
Results from Amazon Nova rubric-based LLM-as-a-judge evaluation job
The following is an example result for a row of the evaluation. In this example, Assistant B is the clear winner because it prioritizes grounded, nuanced information over Assistant A’s suspiciously specific but unverified claim of 145 newspapers. The judge penalizes Assistant A for its lack of context, resulting in significantly lower scores for accuracy and completeness. By applying a custom weight that allocates 50% of the total score to accuracy, the evaluation calculates a weighted margin that quantifies precisely why Assistant B’s detailed, verifiable response is superior.

================================================================================
Row 0:
  Preference: ['B>A']
  A wins: 0.0
  B wins: 1.0
  Weighted A: 0.175
  Weighted B: 0.875
  Margin: -0.700

  Overall Justification:
    Assistant B’s response is more accurate and complete as it provides specific examples of student publications and acknowledges the variability in the number of publications. Assistant A’s response, while providing a specific number, lacks context and explanation, making it less useful for understanding the situation.

  Criteria:

    accuracy:
      Score A: 2, Score B: 4
      Weight: 0.5, Type: scale
      Description: How accurate the information provided is regarding the number of student newspapers at Notre Dame.
      Justification A: Assistant A provides a specific number (145) but does not offer any context or explanation for this number, making it difficult to assess its accuracy.
      Justification B: Assistant B provides a more nuanced answer, stating that there are at least three significant student publications but acknowledges that the number can vary. This response is more accurate given the dynamic nature of student publications.

    completeness:
      Score A: 1, Score B: 5
      Weight: 0.3, Type: scale
      Description: How complete the response is in providing information about student newspapers at Notre Dame.
      Justification A: Assistant A’s response is incomplete as it does not provide any context or examples of student newspapers at Notre Dame.
      Justification B: Assistant B’s response is more complete as it provides examples of well-known student publications and acknowledges the variability in the number of publications.

    clarity:
      Score A: 2, Score B: 5
      Weight: 0.2, Type: scale
      Description: How clear and understandable the response is.
      Justification A: Assistant A’s response is clear in providing a number but lacks clarity in explaining what this number represents.
      Justification B: Assistant B’s response is clear and understandable, providing examples and context to help the reader understand the number of student publications.
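
The weighted scores in this output can be reproduced from the per-criterion scores and weights using the normalization described in the metrics section; here is a quick check in Python using only the numbers shown above:

# Normalize each 1-5 criterion score to 0-1, weight it, and sum per response
weights = {"accuracy": 0.5, "completeness": 0.3, "clarity": 0.2}
scores_a = {"accuracy": 2, "completeness": 1, "clarity": 2}
scores_b = {"accuracy": 4, "completeness": 5, "clarity": 5}

weighted_a = sum(w * (scores_a[c] - 1) / 4 for c, w in weights.items())
weighted_b = sum(w * (scores_b[c] - 1) / 4 for c, w in weights.items())

print(round(weighted_a, 3), round(weighted_b, 3), round(weighted_a - weighted_b, 3))
# 0.175 0.875 -0.7 -> matches Weighted A, Weighted B, and Margin above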

As in the post Evaluating generative AI models with Amazon Nova LLM-as-a-Judge on Amazon SageMaker AI, to help practitioners quickly interpret the outcome of an Amazon Nova rubric-based LLM-as-a-judge evaluation, we created a convenience function that produces a single, comprehensive visualization summarizing key metrics, as shown in the following screenshot.

This function, plot_nova_judge_results, uses Matplotlib and Seaborn to render an image with six panels, each highlighting a different perspective of the evaluation outcome.
This function takes the evaluation metrics dictionary produced when the evaluation job is complete and generates the following visual components:

Score distribution bar chart – Shows how many times Model A was preferred (three wins), how many times Model B was preferred (seven wins), how many ties occurred, and how often the judge failed to produce a decision (one inference error out of 11 evaluations). This provides an immediate sense of how decisive the evaluation was, clearly showing Model B’s dominance with a 70% preference rate.
Win rate with 95% confidence interval – Plots Model B’s overall win rate of 70% against Model A, including an error bar reflecting the confidence interval bounds of [0.400, 0.909]. A vertical reference line at 50% marks the point of no preference. Because the confidence interval doesn’t cross this line, we can conclude the result is statistically significant, indicating meaningful superiority for the 7B model.
Preference pie chart – Visually displays the proportion of preferences among the 10 valid judgments: 70% for Model B and 30% for Model A. This can help users quickly understand the clear preference distribution favoring the larger model.
A vs. B score comparison bar chart – Compares the raw counts of preferences for each model side by side (three for Model A vs seven for Model B). A clear label annotates the margin of difference, emphasizing Model B’s four-win advantage. The chart also displays the weighted rubric-based scores: Model A averaged 0.495 whereas Model B averaged 0.630 across all evaluation criteria (accuracy, completeness, clarity), with an average margin of -0.135 favoring Model B.
Win rate gauge – Depicts the 70% win rate as a semicircular gauge with a needle pointing to Model B’s performance relative to the theoretical 0–100% range. This intuitive visualization helps nontechnical stakeholders immediately grasp that Model B outperformed Model A by a substantial margin based on dynamically generated rubric criteria tailored to each question-answer pair.
Summary statistics table – Compiles numerical metrics into a compact, clean table: 11 total evaluations, one error (9.1% error rate), 70% win rate, weighted rubric scores (0.630 for B vs 0.495 for A with -0.135 margin), and confidence intervals [0.400, 0.909]. This makes it straightforward to reference the exact numeric values behind the plots and understand both the statistical rigor and rubric-based assessment of the evaluation.

Because the function outputs a standard Matplotlib figure, you can quickly save the image, display it in Jupyter notebooks, or embed it in other documentation. The visualization clearly demonstrates that Model B shows statistically significant superiority overall with higher rubric-based scores across accuracy, completeness, and clarity dimensions.
Clean up
To stop and delete the SageMaker Studio spaces, follow the cleanup steps in the SageMaker Studio documentation. To stop incurring costs, you must also delete the S3 bucket and the hosted model endpoints. You can delete the real-time endpoints you created using the SageMaker console. For instructions, see Delete Endpoints and Resources.
Conclusion
Evaluating generative AI outputs at scale requires more than simple preference labels; it requires transparency into why one response outperforms another. The Amazon Nova rubric-based LLM judge addresses this need by dynamically generating task-specific evaluation criteria, providing per-criterion scores with explicit justifications, and delivering well-calibrated confidence signals. Compared to previous judge implementations, the rubric-based approach offers three key advantages: interpretability through structured YAML output with criterion-level breakdowns, flexibility that lets users reweight or filter criteria for their specific use cases, and improved accuracy with significant gains across standard benchmarks, including a 49% relative improvement on the complex evaluation scenarios in JudgeBench. Whether you are selecting model checkpoints during development, filtering training data for quality, or debugging production model behavior at scale, the Amazon Nova rubric-based LLM-as-a-judge evaluation transforms opaque preference decisions into actionable insights. By exposing the reasoning behind each judgment, teams can identify systematic weaknesses, validate that evaluations align with their quality priorities, and build greater trust in automated evaluation pipelines.
To get started with the Amazon Nova rubric-based LLM judge on SageMaker AI, refer to Rubric Based Judge.

About the authors
Surya Kari is a Senior Generative AI Data Scientist at AWS, specializing in developing solutions leveraging state-of-the-art foundation models. He has extensive experience working with advanced language models including DeepSeek-R1, the Llama family, and Qwen, focusing on their fine-tuning and optimization for specific scientific applications. His expertise extends to implementing efficient training pipelines and deployment strategies using AWS SageMaker, enabling the scaling of foundation models from development to production. He collaborates with customers to design and implement generative AI solutions, helping them navigate model selection, fine-tuning approaches, and deployment strategies to achieve optimal performance for their specific use cases.
Joseph Moulton is a Software Engineer on the Amazon AGI Customization team supporting the implementation of evaluation and inference workflows for AWS Nova Forge. His current work focuses on developing and implementing new strategies for customers to evaluate their custom-trained Nova models. He has been with the company as a software engineer for 4 years, joining the Alexa AI Machine Learning platform team in 2022 before transitioning to the Nova Forge team in 2025. In his free time, he enjoys golfing and building computers.
Morteza Ziyadi is a senior science lead and manager at Amazon AGI, where he leads several projects on post-training recipes and (multimodal) large language models in the Amazon AGI Foundation modeling team. Before joining Amazon AGI, he spent four years at Microsoft Cloud and AI, where he led projects focused on developing natural language-to-code generation models for various products. He has also served as an adjunct faculty member at Northeastern University. He earned his PhD from the University of Southern California (USC) in 2017 and has since been actively involved as a workshop organizer and reviewer for numerous NLP, computer vision, and machine learning conferences.
Rajkumar Pujari is an Applied Scientist II on the Nova Models post-training team at Amazon AGI. He obtained his Ph.D. in Computer Science from Purdue University, specializing in Machine Learning for Computational Social Science. Currently, his work focuses on post-training and reinforcement learning for Large Language Models. He develops large-scale, dynamic evaluation pipelines for frontier models and builds LLM-as-a-Judge frameworks.
Swastik Roy is a Senior Applied Scientist on Amazon’s AGI Foundation team, specializing in generalizability research and post-training of the Amazon Nova family of models. His expertise spans fine-tuning, reinforcement learning, and evaluation methodologies, where he drives efforts to advance the robustness of foundational AI systems.
Joel Catapano is a Senior Applied Scientist on the Amazon AGI foundation modeling team. He primarily works on developing novel approaches for improving the LLM-as-a-Judge capability of the Nova family of models.
Mona Mona is a Senior Worldwide Generative AI Specialist Solutions Architect focusing on generative AI solutions on the Amazon SageMaker AI team. She was a Lead Generative AI Specialist at Google before joining Amazon. She is a published author of two books, Natural Language Processing with AWS AI Services and Google Cloud Certified Professional Machine Learning Study Guide. She has authored 20+ blogs on AI/ML and cloud technology and is a co-author of a research paper on CORD19 Neural Search, which won an award for Best Research Paper at the prestigious AAAI (Association for the Advancement of Artificial Intelligence) conference.
Pradeep Natarajan is a Senior Principal Scientist in Amazon AGI Foundation modeling team working on post-training recipes and Multimodal large language models. He has 20+ years of experience in developing and launching multiple large-scale machine learning systems. He has a PhD in Computer Science from University of Southern California.

Anthropic Releases Claude Opus 4.6 With 1M Context, Agentic Coding, Adaptive Reasoning Controls, and Expanded Safety Tooling Capabilities

Anthropic has launched Claude Opus 4.6, its most capable model to date, focused on long-context reasoning, agentic coding, and high-value knowledge work. The model builds on Claude Opus 4.5 and is now available on claude.ai, the Claude API, and major cloud providers under the ID claude-opus-4-6.

Model focus: agentic work, not single answers

Opus 4.6 is designed for multi-step tasks where the model must plan, act, and revise over time. According to the Anthropic team, which uses it internally in Claude Code, the model focuses more on the hardest parts of a task, handles ambiguous problems with better judgment, and stays productive over longer sessions.

The model tends to think more deeply and revisit its reasoning before answering. This improves performance on difficult problems but can increase cost and latency on simple ones. Anthropic exposes a /effort parameter with 4 levels — low, medium, high (default), and max — so developers can explicitly trade off reasoning depth against speed and cost per endpoint or use case.
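The announcement does not spell out the exact request syntax, but conceptually the control surface looks like the following hypothetical sketch with the Anthropic Python SDK; the effort field name and its placement in the request body are assumptions for illustration, so check the official API reference before relying on them.

import anthropic

client = anthropic.Anthropic()

# Hypothetical effort control: the announcement describes four levels
# (low, medium, high, max) with high as the default; the exact parameter
# name and location below are assumed, not confirmed API syntax.
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=2048,
    extra_body={"effort": "low"},  # assumed field name; trades reasoning depth for speed and cost
    messages=[{"role": "user", "content": "Triage this stack trace and suggest a fix."}],
)
print(response.content[0].text)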

Beyond coding, Opus 4.6 targets practical knowledge-work tasks:

running financial analyses

doing research with retrieval and browsing

using and creating documents, spreadsheets, and presentations

Inside Cowork, Anthropic’s autonomous work surface, the model can run multi-step workflows that span these artifacts without continuous human prompting.

Long-context capabilities and developer controls

Opus 4.6 is the first Opus-class model with a 1M token context window in beta. For prompts above 200k tokens in this 1M-context mode, pricing rises to $10 per 1M input tokens and $37.50 per 1M output tokens. The model supports up to 128k output tokens, which is enough for very long reports, code reviews, or structured multi-file edits in one response.
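As a rough, illustrative cost calculation at those rates (a simplification that assumes the entire request is billed at the long-context tier), the arithmetic looks like this:

# Back-of-the-envelope estimate at the stated 1M-context rates; assumes the whole
# request is billed at the >200k-token tier, which is a simplification.
input_tokens = 600_000   # e.g., a large codebase plus supporting documents
output_tokens = 50_000   # e.g., a long structured review

cost = (input_tokens / 1_000_000) * 10.00 + (output_tokens / 1_000_000) * 37.50
print(f"Estimated request cost: ${cost:.2f}")  # prints $7.88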

To make long-running agents manageable, Anthropic ships several platform features around Opus 4.6:

Adaptive thinking: the model can decide when to use extended thinking based on task difficulty and context, instead of always running at maximum reasoning depth.

Effort controls: 4 discrete effort levels (low, medium, high, max) expose a clean control surface for latency vs reasoning quality.

Context compaction (beta): the platform automatically summarizes and replaces older parts of the conversation as a configurable context threshold is approached, reducing the need for custom truncation logic.

US-only inference: workloads that must stay in US regions can run at 1.1× token pricing.

These controls target a common real-world pattern: agentic workflows that accumulate hundreds of thousands of tokens while interacting with tools, documents, and code over many steps.

Product integrations: Claude Code, Excel, and PowerPoint

Anthropic has upgraded its product stack so that Opus 4.6 can drive more realistic workflows for engineers and analysts.

In Claude Code, a new ‘agent teams’ mode (research preview) lets users create multiple agents that work in parallel and coordinate autonomously. This is aimed at read-heavy tasks such as codebase reviews. Each sub-agent can be taken over interactively, including via tmux, which fits terminal-centric engineering workflows.

Claude in Excel now plans before acting, can ingest unstructured data and infer structure, and can apply multi-step transformations in a single pass. When paired with Claude in PowerPoint, users can move from raw data in Excel to structured, on-brand slide decks. The model reads layouts, fonts, and slide masters so generated decks stay aligned with existing templates. Claude in PowerPoint is currently in research preview for Max, Team, and Enterprise plans.

Benchmark profile: coding, search, long-context retrieval

The Anthropic team positions Opus 4.6 as state of the art on several external benchmarks that matter for coding agents, search agents, and professional decision support.

Source: https://www.anthropic.com/news/claude-opus-4-6

Key results include:

GDPval-AA (economically valuable knowledge work in finance, legal, and related domains): Opus 4.6 outperforms OpenAI’s GPT-5.2 by around 144 Elo points and Claude Opus 4.5 by 190 points. This implies that, in head-to-head comparisons, Opus 4.6 beats GPT-5.2 on this evaluation about 70% of the time.

Terminal-Bench 2.0: Opus 4.6 achieves the highest reported score on this agentic coding and system task benchmark.

Humanity’s Last Exam: on this multidisciplinary reasoning test with tools (web search, code execution, and others), Opus 4.6 leads other frontier models, including GPT-5.2 and Gemini 3 Pro configurations, under the documented harness.

BrowseComp: Opus 4.6 performs better than any other model on this agentic search benchmark. When Claude models are combined with a multi-agent harness, scores increase to 86.8%.

Source: https://www.anthropic.com/news/claude-opus-4-6

Long-context retrieval is a central improvement. On the 8-needle 1M variant of MRCR v2 — a ‘needle-in-a-haystack’ benchmark where facts are buried inside 1M tokens of text — Opus 4.6 scores 76%, compared to 18.5% for Claude Sonnet 4.5. Anthropic describes this as a qualitative shift in how much context a model can actually use without context rot.

Additional performance gains in:

root cause analysis on complex software failures

multilingual coding

long-term coherence and planning

cybersecurity tasks

life sciences, where Opus 4.6 performs almost 2× better than Opus 4.5 on computational biology, structural biology, organic chemistry, and phylogenetics evaluations

On Vending-Bench 2, a long-horizon economic performance benchmark, Opus 4.6 earns $3,050.53 more than Opus 4.5 under the reported setup.

Key Takeaways

Opus 4.6 is Anthropic’s highest-end model with 1M-token context (beta): Supports 1M input tokens and up to 128k output tokens, with premium pricing above 200k tokens, making it suitable for very long codebases, documents, and multi-step agentic workflows.

Explicit controls for reasoning depth and cost via effort and adaptive thinking: Developers can tune /effort (low, medium, high, max) and let ‘adaptive thinking’ decide when extended reasoning is needed, exposing a clear latency vs accuracy vs cost trade-off for different routes and tasks.

Strong benchmark performance on coding, search, and economic value tasks: Opus 4.6 leads on GDPval-AA, Terminal-Bench 2.0, Humanity’s Last Exam, BrowseComp, and MRCR v2 1M, with large gains over Claude Opus 4.5 and GPT-class baselines in long-context retrieval and tool-augmented reasoning.

Tight integration with Claude Code, Excel, and PowerPoint for real workloads: Agent teams in Claude Code, structured Excel transformations, and template-aware PowerPoint generation position Opus 4.6 as a backbone for practical engineering and analyst workflows, not just chat.

Check out the Technical details and Documentation.
The post Anthropic Releases Claude Opus 4.6 With 1M Context, Agentic Coding, Adaptive Reasoning Controls, and Expanded Safety Tooling Capabilities appeared first on MarkTechPost.

How to Build Production-Grade Data Validation Pipelines Using Pandera, Typed Schemas, and Composable DataFrame Contracts

In this tutorial, we demonstrate how to build robust, production-grade data validation pipelines using Pandera with typed DataFrame models. We start by simulating realistic, imperfect transactional data and progressively enforce strict schema constraints, column-level rules, and cross-column business logic using declarative checks. We show how lazy validation helps us surface multiple data quality issues at once, how invalid records can be quarantined without breaking pipelines, and how schema enforcement can be applied directly at function boundaries to guarantee correctness as data flows through transformations. Check out the FULL CODES here.

!pip -q install "pandera>=0.18" pandas numpy polars pyarrow hypothesis

import json
import numpy as np
import pandas as pd
import pandera as pa
from pandera.errors import SchemaError, SchemaErrors
from pandera.typing import Series, DataFrame

print("pandera version:", pa.__version__)
print("pandas version:", pd.__version__)

We set up the execution environment by installing Pandera and its dependencies and importing all required libraries. We confirm library versions to ensure reproducibility and compatibility. It establishes a clean foundation for enforcing typed data validation throughout the tutorial. Check out the FULL CODES here. 

rng = np.random.default_rng(42)

def make_raw_orders(n=250):
    countries = np.array(["CA", "US", "MX"])
    channels = np.array(["web", "mobile", "partner"])
    raw = pd.DataFrame(
        {
            "order_id": rng.integers(1, 120, size=n),
            "customer_id": rng.integers(1, 90, size=n),
            "email": rng.choice(
                ["alice@example.com", "bob@example.com", "bad_email", None],
                size=n,
                p=[0.45, 0.45, 0.07, 0.03],
            ),
            "country": rng.choice(countries, size=n, p=[0.5, 0.45, 0.05]),
            "channel": rng.choice(channels, size=n, p=[0.55, 0.35, 0.10]),
            "items": rng.integers(0, 8, size=n),
            "unit_price": rng.normal(loc=35, scale=20, size=n),
            "discount": rng.choice([0.0, 0.05, 0.10, 0.20, 0.50], size=n, p=[0.55, 0.15, 0.15, 0.12, 0.03]),
            "ordered_at": pd.to_datetime("2025-01-01") + pd.to_timedelta(rng.integers(0, 120, size=n), unit="D"),
        }
    )

    raw.loc[rng.choice(n, size=8, replace=False), "unit_price"] = -abs(raw["unit_price"].iloc[0])
    raw.loc[rng.choice(n, size=6, replace=False), "items"] = 0
    raw.loc[rng.choice(n, size=5, replace=False), "discount"] = 0.9
    raw.loc[rng.choice(n, size=4, replace=False), "country"] = "ZZ"
    raw.loc[rng.choice(n, size=3, replace=False), "channel"] = "unknown"
    raw.loc[rng.choice(n, size=6, replace=False), "unit_price"] = raw["unit_price"].iloc[:6].round(2).astype(str).values

    return raw

raw_orders = make_raw_orders(250)
display(raw_orders.head(10))

We generate a realistic transactional dataset that intentionally includes common data quality issues. We simulate invalid values, inconsistent types, and unexpected categories to reflect real-world ingestion scenarios. It allows us to meaningfully test and demonstrate the effectiveness of schema-based validation. Check out the FULL CODES here. 

EMAIL_RE = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"

class Orders(pa.DataFrameModel):
    order_id: Series[int] = pa.Field(ge=1)
    customer_id: Series[int] = pa.Field(ge=1)
    email: Series[object] = pa.Field(nullable=True)
    country: Series[str] = pa.Field(isin=["CA", "US", "MX"])
    channel: Series[str] = pa.Field(isin=["web", "mobile", "partner"])
    items: Series[int] = pa.Field(ge=1, le=50)
    unit_price: Series[float] = pa.Field(gt=0)
    discount: Series[float] = pa.Field(ge=0.0, le=0.8)
    ordered_at: Series[pd.Timestamp]

    class Config:
        coerce = True
        strict = True
        ordered = False

    @pa.check("email")
    def email_valid(cls, s: pd.Series) -> pd.Series:
        return s.isna() | s.astype(str).str.match(EMAIL_RE)

    @pa.dataframe_check
    def total_value_reasonable(cls, df: pd.DataFrame) -> pd.Series:
        total = df["items"] * df["unit_price"] * (1.0 - df["discount"])
        return total.between(0.01, 5000.0)

    @pa.dataframe_check
    def channel_country_rule(cls, df: pd.DataFrame) -> pd.Series:
        ok = ~((df["channel"] == "partner") & (df["country"] == "MX"))
        return ok

We define a strict Pandera DataFrameModel that captures both structural and business-level constraints. We apply column-level rules, regex-based validation, and dataframe-wide checks to declaratively encode domain logic. Check out the FULL CODES here. 

try:
    validated = Orders.validate(raw_orders, lazy=True)
    print(validated.dtypes)
except SchemaErrors as exc:
    display(exc.failure_cases.head(25))
    err_json = exc.failure_cases.to_dict(orient="records")
    print(json.dumps(err_json[:5], indent=2, default=str))

We validate the raw dataset using lazy evaluation to surface multiple violations in a single pass. We inspect structured failure cases to understand exactly where and why the data breaks schema rules. It helps us debug data quality issues without interrupting the entire pipeline. Check out the FULL CODES here. 

def split_clean_quarantine(df: pd.DataFrame):
    try:
        clean = Orders.validate(df, lazy=False)
        return clean, df.iloc[0:0].copy()
    except SchemaError:
        pass

    try:
        Orders.validate(df, lazy=True)
        return df.copy(), df.iloc[0:0].copy()
    except SchemaErrors as exc:
        bad_idx = sorted(set(exc.failure_cases["index"].dropna().astype(int).tolist()))
        quarantine = df.loc[bad_idx].copy()
        clean = df.drop(index=bad_idx).copy()
        return Orders.validate(clean, lazy=False), quarantine

clean_orders, quarantine_orders = split_clean_quarantine(raw_orders)
display(quarantine_orders.head(10))
display(clean_orders.head(10))

@pa.check_types
def enrich_orders(df: DataFrame[Orders]) -> DataFrame[Orders]:
    out = df.copy()
    out["unit_price"] = out["unit_price"].round(2)
    out["discount"] = out["discount"].round(2)
    return out

enriched = enrich_orders(clean_orders)
display(enriched.head(5))

We separate valid records from invalid ones by quarantining rows that fail schema checks. We then enforce schema guarantees at function boundaries to ensure only trusted data is transformed. This pattern enables safe data enrichment while preventing silent corruption. Check out the FULL CODES here. 
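As a quick usage sketch against the objects defined above, the boundary check fails loudly when we hand it unvalidated data and passes once the frame has been cleaned:

# raw_orders still contains the injected bad rows, so @pa.check_types rejects it
# at the function boundary before any transformation runs.
try:
    enrich_orders(raw_orders)
except (SchemaError, SchemaErrors) as exc:
    print("Rejected at the function boundary:", type(exc).__name__)

# clean_orders already passed Orders.validate, so the same call succeeds.
print(enrich_orders(clean_orders).shape)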

class EnrichedOrders(Orders):
    total_value: Series[float] = pa.Field(gt=0)

    class Config:
        coerce = True
        strict = True

    @pa.dataframe_check
    def totals_consistent(cls, df: pd.DataFrame) -> pd.Series:
        total = df["items"] * df["unit_price"] * (1.0 - df["discount"])
        return (df["total_value"] - total).abs() <= 1e-6

@pa.check_types
def add_totals(df: DataFrame[Orders]) -> DataFrame[EnrichedOrders]:
    out = df.copy()
    out["total_value"] = out["items"] * out["unit_price"] * (1.0 - out["discount"])
    return EnrichedOrders.validate(out, lazy=False)

enriched2 = add_totals(clean_orders)
display(enriched2.head(5))

We extend the base schema with a derived column and validate cross-column consistency using composable schemas. We verify that computed values obey strict numerical invariants after transformation. It demonstrates how Pandera supports safe feature engineering with enforceable guarantees.

In conclusion, we established a disciplined approach to data validation that treats schemas as first-class contracts rather than optional safeguards. We demonstrated how schema composition enables us to safely extend datasets with derived features while preserving invariants, and how Pandera seamlessly integrates into real analytical and data-engineering workflows. Through this tutorial, we ensured that every transformation operates on trusted data, enabling us to build pipelines that are transparent, debuggable, and resilient in real-world environments.

Check out the FULL CODES here.
The post How to Build Production-Grade Data Validation Pipelines Using Pandera, Typed Schemas, and Composable DataFrame Contracts appeared first on MarkTechPost.

OpenAI Just Launched GPT-5.3-Codex: A Faster Agentic Coding Model Unifying Frontier Code Performance And Professional Reasoning Into One System

OpenAI has just introduced GPT-5.3-Codex, a new agentic coding model that extends Codex from writing and reviewing code to handling a broad range of work on a computer. The model combines the frontier coding performance of GPT-5.2-Codex with the reasoning and professional knowledge capabilities of GPT-5.2 into a single system, and it runs 25% faster for Codex users due to infrastructure and inference improvements.

For developers, GPT-5.3-Codex is positioned as a coding agent that can execute long-running tasks that involve research, tool use, and complex execution, while remaining steerable ‘much like a colleague’ during a run.

Frontier agentic capabilities and benchmark results

OpenAI evaluates GPT-5.3-Codex on four key benchmarks that target real-world coding and agentic behavior: SWE-Bench Pro, Terminal-Bench 2.0, OSWorld-Verified, and GDPval.

Source: https://openai.com/index/introducing-gpt-5-3-codex/

On SWE-Bench Pro, a contamination-resistant benchmark constructed from real GitHub issues and pull requests across 4 languages, GPT-5.3-Codex reaches 56.8% with xhigh reasoning effort. This slightly improves over GPT-5.2-Codex and GPT-5.2 at the same effort level. Terminal-Bench 2.0, which measures terminal skills that coding agents need, shows a larger gap: GPT-5.3-Codex reaches 77.3%, significantly higher than previous models.

Source: https://openai.com/index/introducing-gpt-5-3-codex/

On OSWorld-Verified, an agentic computer-use benchmark where agents complete productivity tasks in a visual desktop environment, GPT-5.3-Codex reaches 64.7%. Humans score around 72% on this benchmark, which gives a rough human-level reference point.

For professional knowledge work, GPT-5.3-Codex is evaluated with GDPval, an evaluation introduced in 2025 that measures performance on well-specified tasks across 44 occupations. GPT-5.3-Codex achieves 70.9% wins or ties on GDPval, matching GPT-5.2 at high reasoning effort. These tasks include constructing presentations, spreadsheets, and other work products that align with typical professional workflows.

A notable systems detail is that GPT-5.3-Codex achieves its results with fewer tokens than previous models, allowing users to “build more” within the same context and cost budgets.

Beyond coding: GDPval and OSWorld

OpenAI emphasizes that software devs, designers, product managers, and data scientists perform a wide range of tasks beyond code generation. GPT-5.3-Codex is built to assist across the software lifecycle: debugging, deployment, monitoring, writing PRDs, editing copy, running user research, tests, and metrics.

With custom skills similar to those used in prior GDPval experiments, GPT-5.3-Codex produces full work products. Examples in the OpenAI official blog include financial advice slide decks, a retail training document, an NPV analysis spreadsheet, and a fashion presentation. Each GDPval task is designed by a domain professional and reflects realistic work from that occupation.

Source: https://openai.com/index/introducing-gpt-5-3-codex/

On OSWorld, GPT-5.3-Codex demonstrates stronger computer-use capabilities than earlier GPT models. OSWorld-Verified requires the model to use vision to complete diverse tasks in a desktop environment, aligning closely with how agents operate real applications and tools instead of only producing text.

An interactive collaborator in the Codex app

As models become more capable, OpenAI frames the main challenge as human supervision and control of many agents working in parallel. The Codex app is designed to make managing and directing agents easier, and with GPT-5.3-Codex it gains more interactive behavior.

Codex now provides frequent updates during a run so users can see key decisions and progress. Instead of waiting for a single final output, users can ask questions, discuss approaches, and steer the model in real time. GPT-5.3-Codex explains what it is doing and responds to feedback while keeping context. This ‘follow-up behavior’ can be configured in the Codex app settings.

A model that helped train and deploy itself

GPT-5.3-Codex is the first model in this family that was ‘instrumental in creating itself.’ OpenAI used early versions of GPT-5.3-Codex to debug its own training, manage deployment, and diagnose test results and evaluations.

The OpenAI research team used Codex to monitor and debug the training run, track patterns across the training process, analyze interaction quality, propose fixes, and build applications that visualize behavioral differences relative to prior models. The development team used Codex to optimize and adapt the serving harness, identify context rendering bugs, find the root causes of low cache hit rates, and dynamically scale GPU clusters to maintain stable latency under traffic surges.

During alpha testing, a researcher asked GPT-5.3-Codex to quantify additional work completed per turn and the effect on productivity. The model generated regex-based classifiers to estimate clarification frequency, positive and negative responses, and task progress, then ran these over session logs and produced a report. Codex also helped build new data pipelines and richer visualizations when standard dashboard tools were insufficient, and it summarized insights from thousands of data points in under 3 minutes.

Cybersecurity capabilities and safeguards

GPT-5.3-Codex is the first model OpenAI classifies as ‘High capability’ for cybersecurity-related tasks under its Preparedness Framework and the first model it has trained directly to identify software vulnerabilities. OpenAI states that it has no definitive evidence that the model can automate cyber attacks end-to-end and is taking a precautionary approach with its most comprehensive cybersecurity safety stack to date.

Mitigations include safety training, automated monitoring, trusted access for advanced capabilities, and enforcement pipelines that incorporate threat intelligence. OpenAI is launching a ‘Trusted Access for Cyber’ pilot, expanding the private beta of Aardvark, a security research agent, and providing free codebase scanning for widely used open-source projects such as Next.js, where Codex was recently used to identify disclosed vulnerabilities.

Key Takeaways

Unified frontier model for coding and work: GPT-5.3-Codex combines the coding strength of GPT-5.2-Codex with the reasoning and professional capabilities of GPT-5.2 in a single agentic model, and runs 25% faster in Codex.

State-of-the-art on coding and agent benchmarks: The model sets new highs on SWE-Bench Pro (56.8% at xhigh), Terminal-Bench 2.0 (77.3%), and achieves 64.7% on OSWorld-Verified and 70.9% wins or ties on GDPval, often with fewer tokens than previous models.

Supports long-horizon web and app development: Using skills such as ‘develop web game’ and generic follow-ups like ‘fix the bug’ and ‘improve the game,’ GPT-5.3-Codex autonomously developed complex racing and diving games over millions of tokens, demonstrating sustained multi-step development ability.

Instrumental in its own training and deployment: Early versions of GPT-5.3-Codex were used to debug the training run, analyze behavior, optimize the serving stack, build custom pipelines, and summarize large-scale alpha logs, making it the first Codex model ‘instrumental in creating itself.’

High-capability cyber model with guarded access: GPT-5.3-Codex is the first OpenAI model rated ‘High capability’ for cyber and the first trained directly to identify software vulnerabilities. OpenAI pairs this with the Trusted Access for Cyber pilot, an expanded Aardvark beta, and free codebase scanning for projects such as Next.js.

Check out the Technical details and Try it here.
The post OpenAI Just Launched GPT-5.3-Codex: A Faster Agentic Coding Model Unifying Frontier Code Performance And Professional Reasoning Into One System appeared first on MarkTechPost.

How Associa transforms document classification with the GenAI IDP Accelerator

This is a guest post co-written with David Meredith and Josh Zacharias from Associa.
Associa, North America’s largest community management company, oversees approximately 7.5 million homeowners with 15,000 employees across more than 300 branch offices. The company manages approximately 48 million documents across 26 TB of data, but their existing document management system lacks efficient automated classification capabilities, making it difficult to organize and retrieve documents across multiple document types. Every day, employees spend countless hours manually categorizing and organizing incoming documents—a time-consuming, error-prone process that creates bottlenecks in operational efficiency and potentially results in operational delays and reduced productivity.
Associa collaborated with the AWS Generative AI Innovation Center to build a generative AI-powered document classification system aligning with Associa’s long-term vision of using generative AI to achieve operational efficiencies in document management. The solution automatically categorizes incoming documents with high accuracy, processes documents efficiently, and provides substantial cost savings while maintaining operational excellence. The document classification system, developed using the Generative AI Intelligent Document Processing (GenAI IDP) Accelerator, is designed to integrate seamlessly into existing workflows. It revolutionizes how employees interact with document management systems by reducing the time spent on manual classification tasks.
This post discusses how Associa is using Amazon Bedrock to automatically classify their documents and to help enhance employee productivity.
Solution overview
The GenAI IDP Accelerator is a cloud-based document processing solution built on AWS that automatically extracts and organizes information from various document types. The system uses OCR technology and generative AI to convert unstructured documents into structured, usable data while scaling seamlessly to handle high document volumes.
The accelerator is built with a flexible, modular design using AWS CloudFormation templates that can handle different types of document processing while sharing core infrastructure for job management, progress tracking, and system monitoring. The accelerator supports three processing patterns; this solution uses Pattern 2, which combines OCR (Amazon Textract) with classification (Amazon Bedrock). The following diagram illustrates this architecture.

We optimized the document classification workflow by evaluating three key aspects:

Prompt input – Full PDF document (all pages) vs. first page only
Prompt design – Multimodal prompting with OCR data (using the Amazon Textract analyze_document_layout) vs. document image only
Model choice – Amazon Nova Lite, Amazon Nova Pro, Amazon Nova Premier, and Anthropic’s Claude Sonnet 4 on Amazon Bedrock

This comprehensive evaluation framework helped us identify the configuration that delivers the highest accuracy while minimizing inference costs for Associa’s specific document types and operational requirements. The evaluation dataset consists of 465 PDF documents across eight distinct document types. The dataset includes some samples identified as draft documents or email correspondences; these samples are categorized as document type Unknown due to insufficient classification criteria. The distribution of samples across document types is unbalanced, ranging from 6 samples for Policies and Resolutions to 155 samples for Minutes.
Evaluation: Prompt input
We started our initial evaluation using full PDF documents, where all pages of a PDF were used as input to the prompt for classification. The following table shows the accuracy for full PDF classification using Amazon Nova Pro with OCR and image. We observed an average classification accuracy of 91% considering the different document types with an average cost of 1.10 cents per document.

Document Type | Number of Samples | Number of Samples Classified Correctly | Classification Accuracy | Classification Cost (in Cents)
Bylaws | 46 | 42 | 91% | 1.52c
CCR Declarations | 22 | 19 | 86% | 1.55c
Certificate of Insurance | 74 | 74 | 100% | 1.49c
Contracts | 71 | 66 | 93% | 1.48c
Minutes | 155 | 147 | 95% | 1.47c
Plat Map | 21 | 20 | 95% | 1.45c
Policies and Resolutions | 6 | 5 | 83% | 0.35c
Rules and Regulations | 50 | 44 | 88% | 0.36c
Unknown | 20 | 8 | 40% | 0.24c
Overall | 465 | 425 | 91% | 1.10c

Using full PDF for document classification demonstrates an accuracy of 100% for Certificate of Insurance and 95% for Minutes. The system correctly classified 425 out of 465 documents. However, for the Unknown document type, it achieved only 40% accuracy, correctly classifying just 8 out of 20 documents.
Next, we experimented with using only the first page of a PDF document for classification, as shown in the following table. This approach improved overall accuracy from 91% to 95% with 443 out of 465 documents classified correctly while reducing classification cost per document from 1.10 cents to 0.55 cents.

Document Type | Number of Samples | Number of Samples Classified Correctly | Classification Accuracy | Classification Cost (in Cents)
Bylaws | 46 | 44 | 96% | 0.55c
CCR Declarations | 22 | 21 | 95% | 0.55c
Certificate of Insurance | 74 | 74 | 100% | 0.59c
Contracts | 71 | 64 | 90% | 0.56c
Minutes | 155 | 153 | 99% | 0.55c
Plat Map | 21 | 17 | 81% | 0.56c
Policies and Resolutions | 6 | 4 | 67% | 0.57c
Rules and Regulations | 50 | 49 | 98% | 0.56c
Unknown | 20 | 17 | 85% | 0.55c
Overall | 465 | 443 | 95% | 0.55c

Apart from improved overall accuracy and reduced cost, the first-page-only approach significantly improved Unknown document classification accuracy from 40% to 85%. First pages typically contain the most distinctive document features, whereas later pages in drafts or email threads can introduce noise that confuses the classifier. Given the faster processing speed and lower infrastructure cost as well, we selected the first-page-only approach for the subsequent evaluations.
Evaluation: Prompt design
Next, we experimented with prompt design to evaluate whether OCR data is necessary for document classification or whether the document image alone is sufficient. We removed the OCR text extraction data from the prompt and used only the image in a multimodal prompt. This approach removes the Amazon Textract costs and relies entirely on the model’s understanding of visual features. The following table shows the accuracy for first-page-only classification using Amazon Nova Pro with only the image.

Document Type | Number of Samples | Number of Samples Classified Correctly | Classification Accuracy | Classification Cost (in Cents)
Bylaws | 46 | 45 | 98% | 0.19c
CCR Declarations | 22 | 20 | 91% | 0.19c
Certificate of Insurance | 74 | 74 | 100% | 0.18c
Contracts | 71 | 63 | 89% | 0.18c
Minutes | 155 | 151 | 97% | 0.18c
Plat Map | 21 | 18 | 86% | 0.19c
Policies and Resolutions | 6 | 4 | 67% | 0.18c
Rules and Regulations | 50 | 48 | 96% | 0.18c
Unknown | 20 | 10 | 50% | 0.18c
Overall | 465 | 433 | 93% | 0.18c

The image-only classification approach exhibits issues similar to the full PDF classification approach. Although this method achieves an overall accuracy of 93%, it classified only 10 out of 20 Unknown documents correctly (50% accuracy). The following table compares the OCR-plus-image and image-only approaches.

Approach | Overall Classification Accuracy (All Document Types, Including Unknown) | Classification Accuracy (Document Type: Unknown) | Classification Cost (in Cents)
First page only classification (OCR + Image) | 95% | 85% | 0.55c
First page only classification (Only Image) | 93% | 50% | 0.18c

The image-only approach removes OCR costs but reduces overall accuracy from 95% to 93% and Unknown document accuracy from 85% to 50%. Accurate Unknown document classification is critical for downstream human review and operational efficiency at Associa. We selected the combined OCR and image approach to maintain this capability.
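The post does not include the accelerator’s classification code, but a minimal sketch of the selected approach (first page only, with OCR text and the page image combined in one multimodal prompt) could look like the following boto3 snippet; the model ID, prompt wording, and label list are illustrative assumptions rather than Associa’s production configuration.

import boto3

textract = boto3.client("textract")
bedrock = boto3.client("bedrock-runtime")

DOC_TYPES = [
    "Bylaws", "CCR Declarations", "Certificate of Insurance", "Contracts", "Minutes",
    "Plat Map", "Policies and Resolutions", "Rules and Regulations", "Unknown",
]

def classify_first_page(page_png_bytes: bytes) -> str:
    # OCR the first page with layout analysis (Amazon Textract)
    ocr = textract.analyze_document(
        Document={"Bytes": page_png_bytes}, FeatureTypes=["LAYOUT"]
    )
    ocr_text = "\n".join(
        block["Text"] for block in ocr["Blocks"] if block["BlockType"] == "LINE"
    )

    # Multimodal prompt: page image plus OCR text, classified with Amazon Nova Pro on Amazon Bedrock
    prompt = (
        "Classify this community management document into exactly one of these types: "
        + ", ".join(DOC_TYPES)
        + ". Reply with the type only.\n\nOCR text of the first page:\n"
        + ocr_text
    )
    response = bedrock.converse(
        modelId="us.amazon.nova-pro-v1:0",  # assumed inference profile ID for Amazon Nova Pro
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "png", "source": {"bytes": page_png_bytes}}},
                {"text": prompt},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"].strip()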
Evaluation: Model choice
Using the optimal configuration of first-page-only classification with OCR and image, we evaluated different models to identify an optimal balance of accuracy and cost, as summarized in the following table. We focus on overall classification performance, classification of unknown documents, and per-document classification costs.

Model | Overall Classification Accuracy (All Document Types, Including Unknown) | Classification Accuracy (Document Type: Unknown) | Classification Cost (in Cents)
Amazon Nova Pro | 95% | 85% | 0.55c
Amazon Nova Lite | 95% | 50% | 0.41c
Amazon Nova Premier | 96% | 90% | 1.12c
Anthropic Claude Sonnet 4 | 95% | 95% | 1.21c

Overall classification accuracy ranged from 95–96% across the models, with variation in unknown document type performance. Certificate of Insurance, Plat Map, and Minutes achieved 98–100% accuracy across the models. Anthropic’s Claude Sonnet 4 achieved the highest unknown document accuracy (95%), followed by Amazon Nova Premier (90%) and Amazon Nova Pro (85%). However, Anthropic’s Claude Sonnet 4 increased classification cost from 0.55 cents to 1.21 cents per document. Amazon Nova Premier achieved the best overall classification accuracy at 1.12 cents per document. Considering the trade-offs between accuracy and cost, we selected Amazon Nova Pro as the optimal model choice.
Conclusion
Associa built a generative AI-powered document classification system using Amazon Nova Pro on Amazon Bedrock that achieves 95% accuracy at an average cost of 0.55 cents per document. The GenAI IDP Accelerator facilitates reliable performance scaling to high volume of documents across their branches. “The solution developed by AWS Generative AI Innovation Center improves how our employees manage and organize documents, and we foresee significant reduction of manual effort in document processing,” says Andrew Brock, President, Digital & Technology Services & Chief Information Officer at Associa. “The document classification system provides substantial cost savings and operational improvements, while maintaining our high accuracy standards in serving residential communities.”
Refer to the GenAI IDP Accelerator GitHub repository for detailed examples and choose Watch to stay informed on new releases. If you’d like to work with the AWS GenAI Innovation Center, please reach out to us or leave a comment.
Acknowledgements
We would like to thank Mike Henry, Bob Strahan, Marcelo Silva, and Mofijul Islam for their significant contributions, strategic decisions, and guidance throughout.

About the authors
David Meredith is Director of Employee Software Development at Associa. He oversees the efforts of the Associa team to create software for their 15,000 employees to use daily. He has almost 20 years of experience with software in the residential property management industry and lives in the Vancouver area of BC, Canada.
Josh Zacharias is a Software Developer at Associa, where he is a lead engineer for the internal software team. His work includes architecting full stack solutions for various departments in the company as well as empowering other developers to be more efficient experts in developing software.
Monica Raj is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she works with organizations across various industries to develop AI solutions. Her work focuses on building and deploying agentic AI solutions, natural language processing, contact center automation, and intelligent document processing. Monica has extensive experience in building scalable AI solutions for enterprise customers.
Tryambak Gangopadhyay is a Senior Applied Scientist at the AWS Generative AI Innovation Center, where he collaborates with organizations across a diverse spectrum of industries. His role involves researching and developing generative AI solutions to address crucial business challenges and accelerate AI adoption. Prior to joining AWS, Tryambak completed his PhD at Iowa State University.
Nkechinyere Agu is an Applied Scientist at the AWS Generative AI Innovation Center, where she works with organizations across various industries to develop AI solutions. Her work focuses on developing multimodal AI solutions, agentic AI solutions, and natural language processing. Prior to joining AWS, Nkechinyere completed her PhD at Rensselaer Polytechnic Institute, Troy NY.
 Naman Sharma is a Generative AI Strategist at the AWS Generative AI Innovation Center, where he collaborates with organizations to drive adoption of generative AI to solve business problems at scale. His work focuses on leading customers from scoping, deploying, and scaling frontier solutions with the GenAIIC Strategy and Applied Science teams.
 Yingwei Yu is an Applied Science Manager at the Generative AI Innovation Center, based in Houston, Texas. With extensive experience in applied machine learning and generative AI, Yingwei leads the development of innovative solutions across various industries.
 Dwaragha Sivalingam is a Senior Solutions Architect specializing in generative AI at AWS, serving as a trusted advisor to customers on cloud transformation and AI strategy. With eight AWS certifications, including ML Specialty, he has helped customers in many industries, including insurance, telecom, utilities, engineering, construction, and real estate. A machine learning enthusiast, he balances his professional life with family time, enjoying road trips, movies, and drone photography.