GPT OSS models from OpenAI are now available on SageMaker JumpStart

Today, we are excited to announce the availability of OpenAI’s new open weight GPT OSS models, gpt-oss-120b and gpt-oss-20b, in Amazon SageMaker JumpStart. With this launch, you can now deploy OpenAI’s newest reasoning models to build, experiment, and responsibly scale your generative AI ideas on AWS.
In this post, we demonstrate how to get started with these models on SageMaker JumpStart.
Solution overview
The OpenAI GPT OSS models (gpt-oss-120b and gpt-oss-20b) excel at coding, scientific analysis, and mathematical reasoning tasks. Both models feature a 128K context window and adjustable reasoning levels (low/medium/high) to match specific requirements. They support external tool integration and can be used in agentic workflows through frameworks like Strands Agents, an open source AI agent SDK. With full chain-of-thought output capabilities, you get detailed visibility into the model’s reasoning process. You can use the OpenAI SDK to call your SageMaker endpoint directly by updating the endpoint that the SDK targets. The models give you the flexibility to modify and customize them for your specific business needs while benefiting from enterprise-grade security and seamless scaling.
SageMaker JumpStart is a fully managed service that offers state-of-the-art foundation models (FMs) for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy, accelerating the development and deployment of machine learning (ML) applications. One of the key components of SageMaker JumpStart is model hubs, which offer a vast catalog of pre-trained models from providers such as OpenAI, for a variety of tasks.
You can now discover and deploy OpenAI models in Amazon SageMaker Studio or programmatically through the Amazon SageMaker Python SDK, so you can derive model performance and MLOps controls with Amazon SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The models are deployed in a secure AWS environment and under your VPC controls, helping to support data security for enterprise needs.
GPT OSS models are available in the US East (Ohio), US East (N. Virginia), Asia Pacific (Mumbai), and Asia Pacific (Tokyo) AWS Regions.
Throughout this example, we use the gpt-oss-120b model. These steps can be replicated with the gpt-oss-20b model as well.
Prerequisites
To deploy the GPT OSS models, you must have the following prerequisites:

An AWS account that will contain your AWS resources.
An AWS Identity and Access Management (IAM) role to access SageMaker. To learn more about how IAM works with SageMaker, see AWS Identity and Access Management for Amazon SageMaker AI.
Access to SageMaker Studio, a SageMaker notebook instance, or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
To deploy GPT OSS models, make sure you have access to the recommended instance types based on the model size. You can find these instance recommendations on the SageMaker JumpStart model card. The default instance type for both these models is p5.48xlarge, but you can also use other P5 family instances where available. To verify you have the necessary service quotas, complete the following steps:

On the Service Quotas console, under AWS Services, choose Amazon SageMaker.
Check that you have sufficient quota for the required instance type for endpoint deployment.
Make sure at least one of these instance types is available in your target Region.
If needed, request a quota increase and contact your AWS account team for support. You can also check your quotas programmatically, as shown in the following example.
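The following Boto3 sketch lists the SageMaker quotas in a Region and filters for the P5 endpoint-usage entries. The quota-name substring and the Region are assumptions; confirm the exact quota name (for example, "ml.p5.48xlarge for endpoint usage") on the Service Quotas console.

import boto3

# Service Quotas client for the Region where you plan to deploy (us-east-1 is an assumption)
quotas = boto3.client("service-quotas", region_name="us-east-1")

# Page through the SageMaker quotas and print the P5 endpoint-usage entries.
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if "p5.48xlarge for endpoint usage" in quota["QuotaName"]:
            print(f"{quota['QuotaName']}: {quota['Value']}")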

Deploy gpt-oss-120b through the SageMaker JumpStart UI
Complete the following steps to deploy gpt-oss-120b through SageMaker JumpStart:

On the SageMaker console, choose Studio in the navigation pane.
If you are a first-time user, you will be prompted to create a domain; otherwise, choose Open Studio.
On the SageMaker Studio console, access SageMaker JumpStart by choosing JumpStart in the navigation pane.
On the SageMaker JumpStart landing page, search for gpt-oss-120b using the search box.

Choose a model card to view details about the model such as license, data used to train, and how to use the model. Before you deploy the model, review the configuration and model details from the model card. The model details page includes the following information:

The model name and provider information.
A Deploy button to deploy the model.

Choose Deploy to proceed with deployment.

For Endpoint name, enter an endpoint name (up to 50 alphanumeric characters).
For Number of instances, enter a number between 1–100 (default: 1).
For Instance type, select your instance type. For optimal performance with gpt-oss-120b, a GPU-based instance type such as p5.48xlarge is recommended.

Choose Deploy to deploy the model and create an endpoint.

When deployment is complete, your endpoint status will change to InService. At this point, the model is ready to accept inference requests through the endpoint, and you can invoke it using a SageMaker runtime client and integrate it with your applications.
Deploy gpt-oss-120b with the SageMaker Python SDK
To deploy using the SDK, start by selecting the gpt-oss-120b model, specified by the model_id with the value openai-reasoning-gpt-oss-120b. You can deploy your choice of model on SageMaker using the Python SDK examples in the following sections. Similarly, you can deploy gpt-oss-20b using its model ID.
Enable web search on your model with EXA
By default, models in SageMaker JumpStart run in network isolation. The GPT OSS models come with a built-in tool for web search using EXA, a meaning-based web search API powered by embeddings. To use this tool, OpenAI requires that you get an API key from EXA and pass it as an environment variable to your JumpStartModel instance when deploying it through the SageMaker Python SDK. The following code details how to deploy the model on SageMaker with network isolation disabled and pass in the EXA API key to the model:

from sagemaker.jumpstart.model import JumpStartModel

accept_eula = True
model = JumpStartModel(
    model_id="openai-reasoning-gpt-oss-120b",
    enable_network_isolation=False,
    env={
        "EXA_API_KEY": "<INSERT_API_KEY>"
    }
)
predictor = model.deploy(
    accept_eula=accept_eula
)

You can change these configurations by specifying other non-default values in JumpStartModel. The end user license agreement (EULA) value must be explicitly defined as True to accept the terms. With the preceding deployment, because network isolation is set at deployment time, turning it back on requires creating a new endpoint.
Optionally, you can deploy your model with the JumpStart default values (with network isolation enabled) as follows:

from sagemaker.jumpstart.model import JumpStartModel
accept_eula = True
model = JumpStartModel(model_id="openai-reasoning-gpt-oss-120b")
predictor = model.deploy(accept_eula=accept_eula)

Run inference with the SageMaker predictor
After the model is deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
    "model": "/opt/ml/model",
    "input": [
        {
            "role": "system",
            "content": "You are a good AI assistant"
        },
        {
            "role": "user",
            "content": "Hello, how is it going?"
        }
    ],
    "max_output_tokens": 200,
    "stream": "false",
    "temperature": 0.7,
    "top_p": 1
}

response = predictor.predict(payload)
print(response['output'][-1]['content'][0]['text'])

We get the following response:

Hey there! All good on my end—just ready to dive into whatever you need. How’s it going on your side?
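If you are calling the endpoint from an application that doesn't use the SageMaker Python SDK predictor, you can send the same payload with the low-level SageMaker Runtime client. The following is a minimal sketch; the endpoint name is a placeholder and should match the endpoint you deployed.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Reuses the payload dictionary from the previous example; the endpoint name is a placeholder.
response = runtime.invoke_endpoint(
    EndpointName="<YOUR_ENDPOINT_NAME>",
    ContentType="application/json",
    Body=json.dumps(payload),
)

result = json.loads(response["Body"].read())
print(result["output"][-1]["content"][0]["text"])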

Function calling
The GPT OSS models were trained on the harmony response format for defining conversation structures, generating reasoning output, and structuring function calls. The format is designed to mimic the OpenAI Responses API, so if you have used that API before, it should feel familiar. The models should not be used without the harmony format. The following code shows an example of tool use with this format:

payload = {
  "model": "/opt/ml/model",
  "input": "System: You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2024-08-05\n\nreasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.\nCalls to these tools must go to the commentary channel: 'functions'.\n\n# Tools\n\n## functions\n\nnamespace functions {\n\n// Gets the current weather for a specific location.\ntype get_current_weather = (_: {\n// The city and state/country, e.g. \"San Francisco, CA\" or \"London, UK\"\nlocation: string,\n// Temperature unit preference\nunit?: \"celsius\" | \"fahrenheit\", // default: celsius\n}) => any;\n\n} // namespace functions\n\nDeveloper: You are a helpful AI assistant. Provide clear, concise, and helpful responses.\n\nHuman: What's the weather like in Seattle?\n\nAssistant:",
  "instructions": "You are a helpful AI assistant. Provide clear, concise, and helpful responses.",
  "max_output_tokens": 2048,
  "stream": "false",
  "temperature": 0.7,
  "reasoning": {
    "effort": "medium"
  },
  "tools": [
    {
      "type": "function",
      "name": "get_current_weather",
      "description": "Gets the current weather for a specific location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "The city and state/country, e.g. 'San Francisco, CA' or 'London, UK'"
          },
          "unit": {
            "type": "string",
            "enum": ["celsius", "fahrenheit"],
            "default": "celsius",
            "description": "Temperature unit preference"
          }
        },
        "required": ["location"]
      }
    }
  ]
}

We get the following response:

{'arguments': '{"location":"Seattle, WA"}', 'call_id': 'call_596a67599df2465495fd444772ff9539', 'name': 'get_current_weather', 'type': 'function_call', 'id': 'ft_596a67599df2465495fd444772ff9539', 'status': None}
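The response contains a function_call item rather than a final answer. A typical next step is to run the requested tool in your own application code and send the result back to the model in a follow-up request. The following minimal sketch shows only the client-side handling; it assumes the function call arrives as an item in response["output"] from a predictor.predict(payload) call, and get_current_weather here is a hypothetical local stub.

import json

def get_current_weather(location, unit="celsius"):
    # Hypothetical stub; replace with a real weather lookup.
    return {"location": location, "temperature": 18, "unit": unit}

# Assumes the function call arrives as an item in response["output"],
# matching the structure shown above.
for item in response["output"]:
    if item.get("type") == "function_call" and item.get("name") == "get_current_weather":
        arguments = json.loads(item["arguments"])
        tool_result = get_current_weather(**arguments)
        print(f"Tool result for {item['call_id']}: {tool_result}")
        # Send tool_result back to the model in a follow-up request that includes
        # the original conversation plus this tool output, per the harmony format.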

Clean up
After you’re done running the notebook, make sure to delete the resources that you created in the process to avoid additional billing. For more details, see Delete Endpoints and Resources.

predictor.delete_model()
predictor.delete_endpoint()

Conclusion
In this post, we demonstrated how to deploy and get started with OpenAI’s GPT OSS models (gpt-oss-120b and gpt-oss-20b) on SageMaker JumpStart. These reasoning models bring advanced capabilities for coding, scientific analysis, and mathematical reasoning tasks directly to your AWS environment with enterprise-grade security and scalability.
Try out the new models, and share your feedback in the comments.

About the Authors
Pradyun Ramadorai, Senior Software Development Engineer
Malav Shastri, Software Development Engineer
Varun Morishetty, Software Development Engineer
Evan Kravitz, Software Development Engineer
Benjamin Crabtree, Software Development Engineer
Shen Teng, Software Development Engineer
Loki Ravi, Senior Software Development Engineer
Nithin Vijeaswaran, Specialist Solutions Architect
Breanne Warner, Enterprise Solutions Architect
Yotam Moss, Software Development Manager
Mike James, Software Development Manager
Sadaf Fardeen, Software Development Manager
Siddharth Shah, Principal Software Development Engineer
June Won, Principal Product Manager

Discover insights from Microsoft Exchange with the Microsoft Exchange …

Amazon Q Business is a fully managed, generative AI-powered assistant that helps enterprises unlock the value of their data and knowledge. With Amazon Q Business, you can quickly find answers to questions, generate summaries and content, and complete tasks by using the information and expertise stored across your company’s various data sources and enterprise systems. At the core of this capability are native data source connectors that seamlessly integrate and index content from multiple repositories into a unified index. This enables the Amazon Q Business large language model (LLM) to provide accurate, well-written answers by drawing from the consolidated data and information. The data source connectors act as a bridge, synchronizing content from disparate systems like Salesforce, Jira, and SharePoint into a centralized index that powers the natural language understanding and generative abilities of Amazon Q Business.
To make this integration process as seamless as possible, Amazon Q Business offers multiple pre-built connectors to a wide range of data sources, including Atlassian Jira, Atlassian Confluence, Amazon Simple Storage Service (Amazon S3), Microsoft Exchange, Microsoft SharePoint, Salesforce, and many more. This allows you to create your generative AI solution with minimal configuration. For a full list of Amazon Q Business supported data source connectors, see Supported connectors.
One of the key integrations for Amazon Q Business is with Microsoft Exchange. Microsoft Exchange is a widely used enterprise email and collaboration environment that contains a wealth of valuable information, including email conversations, attachments, calendar events, and contacts.
With the Microsoft Exchange connector, we are enhancing user productivity and streamlining communication processes within organizations. This integration empowers users to use advanced search capabilities and intelligent email management using natural language.
The Microsoft Exchange connector for Amazon Q Business provides a seamless way to index and query data stored in Microsoft Exchange. With this connector, organizations gain the following benefits:

Centralized access to Microsoft Exchange data – Amazon Q Business allows you to configure Microsoft Exchange as a data source, providing a single, centralized interface to search and access information stored in your Microsoft Exchange repositories. This alleviates the need for users to navigate through individual email accounts or folders to find relevant data.
Intelligent search and retrieval – Amazon Q Business uses advanced natural language processing and machine learning capabilities to enable intelligent, natural language-based search and retrieval of information from Microsoft Exchange. Users can ask questions or make queries in plain language, and Amazon Q Business will surface the most relevant responses and insights.
Enhanced productivity and collaboration – By making it straightforward for employees to find and act on the information stored in Microsoft Exchange, Amazon Q Business can improve productivity and collaboration across the organization. Users can quickly locate key documents, contacts, or calendar events, and use that information to make more informed decisions and drive faster, more effective outcomes.
Secure and compliant data access – Amazon Q Business provides a secure, compliant way to access and query Microsoft Exchange data. Amazon Q Business integrates with your organization’s identity provider (IdP) to make sure only authorized users can access sensitive information, and it maintains a detailed audit trail of all user activity.
Streamlined workflows and decision-making – Because Amazon Q Business can generate summaries, answers, and recommendations based on Microsoft Exchange data, users can make more informed decisions and streamline various workflows, such as customer support, project management, and strategic planning.

By using the Microsoft Exchange connector for Amazon Q Business, organizations can unlock the full value of the data stored in their Microsoft Exchange repositories, empowering employees to work more efficiently, collaborate more effectively, and drive greater business impact.
In this post, we show how to index information stored in Microsoft Exchange and use Amazon Q Business to query your Microsoft Exchange data.
Microsoft Exchange connector for Amazon Q Business features
For an overview of the Microsoft Exchange connector for Amazon Q Business and its supported features, refer to Microsoft Exchange connector overview.

Solution overview
With Amazon Q Business, you can configure multiple data sources to provide a central place to search across your internal repository. For our solution, we demonstrate how to retrieve data from the Microsoft Exchange repository or a folder using the Microsoft Exchange connector for Amazon Q Business. The solution consists of the following steps:

Configure a Microsoft Exchange application and gather connection details
Create users and groups in AWS IAM Identity Center
Create the Microsoft Exchange connector for Amazon Q Business
Query Microsoft Exchange data using the Amazon Q web experience
Troubleshooting

The following diagram illustrates the solution architecture.

Prerequisites
To configure the Microsoft Exchange connector for Amazon Q Business, you need to create a Microsoft Exchange account in Office 365.
Configure a Microsoft Exchange application and gather connection details

Log in to the Azure portal using your global admin user account and choose Next.

Enter your password and choose Sign in.

If multi-factor authentication (MFA) is configured, now authenticate using Microsoft Authenticator.

Choose Yes to stay signed in.
On the Azure portal’s landing page, search for and choose Microsoft Entra ID.

On the Microsoft Entra ID service page, copy the value of Tenant ID.

Choose App registrations in the navigation pane.

Choose New registration.

Enter the name of your choice for Name, then choose Register.

After successful registration, you will land on the application page, as shown in the following screenshot.

Choose Certificates & secrets in the navigation pane.

Choose New client secret.

Enter a description for the client secret for Description and choose Add.

Make a note of the secret value and secret ID.

Now configure API permissions by choosing API permissions in the navigation pane.

For Microsoft Exchange Online, make sure that you have Azure AD Premium P2 activated. This makes sure that Microsoft Exchange Online is available as part of your organization’s APIs.

Add the permissions to the APIs Microsoft Graph and Office 365 Exchange Online.

There are 13 permissions for Microsoft Graph and 1 permission for Office 365 Exchange Online.

Create users and groups in AWS IAM Identity Center
In this section, you create a user John Doe in AWS IAM Identity Center, who will be given permission to use the application.
To create your user, complete the following steps:

Open the IAM Identity Center console.
If you haven’t enabled IAM Identity Center, choose Enable. If there’s a pop-up, choose how you want to enable IAM Identity Center. For this example, select Enable only in this AWS account. Choose Continue.

For more details, refer to Enable IAM Identity Center.

On the IAM Identity Center console, choose Users in the navigation pane.
Choose Add user.
Enter the following user details:

Username: john_doe
Email address: john_doe@xyz.com (Use or create a real email address for each user to use in a later step.)
First name: John
Last name: Doe
Display name: John Doe

Skip the optional fields and choose Next to create the user.
On the Add user to groups page, choose Next and then choose Add user.
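As an alternative to the console steps, you can create the same user programmatically with the Identity Store API. The following Boto3 sketch is illustrative only; the identity store ID is a placeholder that you can copy from the IAM Identity Center console under Settings.

import boto3

identitystore = boto3.client("identitystore")

# The identity store ID (d-xxxxxxxxxx) is a placeholder; copy yours from the
# IAM Identity Center console under Settings.
response = identitystore.create_user(
    IdentityStoreId="d-xxxxxxxxxx",
    UserName="john_doe",
    DisplayName="John Doe",
    Name={"GivenName": "John", "FamilyName": "Doe"},
    Emails=[{"Value": "john_doe@xyz.com", "Type": "work", "Primary": True}],
)
print(response["UserId"])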

Create the Microsoft Exchange connector for Amazon Q Business
For detailed steps to set up Amazon Q Business, refer to Getting started with Amazon Q Business. To configure the Amazon Q Business connector, complete the following steps:

In the Amazon Q Business console, choose Applications in the navigation pane.
Choose Create application.

In the Create application step, for Service access, select Create and use a new service role, then choose Create.

In the Select retriever step, select Use native retriever and choose Next.

In the Connect data sources step, search for and choose Microsoft Exchange, then choose Create application.

On the Applications page, choose your application (qbiz-mx-app).

In the Data sources section, choose Add data source.

On the Add data source page, search for Microsoft Exchange and choose the plus sign to configure the data source.

Enter the name of the data source and the tenant ID noted earlier.

In the Authorization section, enable Access Control List (ACL).
In the Authentication section, for AWS Secrets Manager secret, choose Create and add new secret.

Enter the secret name of your choice, the client ID and client secret values you noted earlier, and choose Save.

In the Configure VPC and security group section, leave the settings as default.
In the IAM role section, choose Create a new service role.
In the Sync scope section, for User email ID, enter the email of your Microsoft Exchange account and choose Add.

Alternatively, if you have a list of user email IDs, you can provide an Amazon S3 path to a file with user emails in your S3 bucket. For more details, refer to Connecting Amazon Q Business to Microsoft Exchange using the console.

In the Sync mode section, use the default Full sync.
In the Sync run schedule section, choose the frequency of your choice.
Leave the remaining sections with default values.

Choose Add data source.

Amazon Q will take 30 seconds to 1 minute to configure your data source. You will see a success banner as shown in the following screenshot.

Choose Sync now to sync your data source.

After successfully syncing the data source, you will see the Status / Summary column as Completed.

For the Update groups and users step, choose Add groups and users.

The users and groups that you add in this section are from the IAM Identity Center users and groups set up by your administrator.

In the Add or assign users and groups pop-up, select Assign existing users and groups to add existing users configured in your connected IAM Identity Center.

Optionally, if you have permissions to add users to connected IAM Identity Center, you can select Add new users.

Choose Get started.

In the Assign users and groups pop-up, search for users by user display name or groups by group name.
Choose the users or groups you want to add and choose Done.

This closes the pop-up. The groups and users that you added should now be available on the Groups or Users tabs.

Choose Assign.

For each group or user entry, an Amazon Q Business subscription tier needs to be assigned.

To enable a subscription for a group, on the Update groups and users page, choose the Groups tab. (If individual users need to be assigned a subscription, choose the Users tab.)
For Subscription, choose Choose subscription and choose a subscription (Q Business Lite or Q Business Pro).
Choose Update application to complete setting up the data connector for Amazon Q Business.

Query Microsoft Exchange data using the Amazon Q web experience
To query the data that is synced through the data source, navigate back to the Amazon Q Business application (qbiz-mx-app) and choose the Web experience URL link.

Sign in to the web application using the credentials of the user assigned and configured in IAM Identity Center.

After a successful sign in, the Amazon Q Business application should be displayed in the list of applications, as shown in the following screenshot.

The application link should redirect you to the Amazon Q Business chat application, as shown in the following screenshot.

The following screenshot shows the emails that were synced earlier. We will first query based on the content from the email highlighted in this screenshot.

The following screenshot shows the response to the query “what are the easy ways to get started on Azure?”

You can choose the data source hyperlink to open the email that the response is based on.

The following screenshot shows an invoice email from Microsoft Outlook, which we will use for another question.

We will also refer to calendar details of a meeting with the billing team along with the agenda details.

We ask the question “Q Assistant, I have a meeting with the billing team tomorrow. Can you summarize the agenda and find relevant information from my emails that I can review in the meeting?” The following screenshot shows the response based on the sample invoices email.

The response included the information from the email along with the hyperlink to the data sources (in this case, it is the hyperlink to the Outlook emails).
We ask another question: “What are the main features and my action items relating to the recent CloudTrail changes? By when should I implement the changes?”

Amazon Q Business retrieved the main features, action items, and the implementation timeline.

Congratulations! You have successfully used the Microsoft Exchange connector for Amazon Q Business to surface answers and insights based on the content indexed from your Microsoft Exchange account.
Troubleshooting
Troubleshooting your Microsoft Exchange connector provides information about error codes you might see for the connector and suggested troubleshooting actions. If you encounter an HTTP status code 403 (Forbidden) error when you open your Amazon Q Business application, it means that the user is unable to access the application. See the troubleshooting documentation for common causes and how to address them.
The sync run history report is a new feature now available in Amazon Q Business that significantly improves visibility into data source sync operations. The latest release introduces a comprehensive document-level report incorporated into the sync history, providing administrators with granular indexing status, metadata, and ACL details for the documents processed during a data source sync job.
Frequently asked questions
In this section, we provide guidance to frequently asked questions.
Amazon Q Business is unable to answer your questions
If you get the response “Sorry, I couldn’t find relevant information to complete your request,” this might be due to a few reasons:

No permissions – Access control lists (ACLs) applied to your account don’t allow you to query certain data sources. If this is the case, reach out to your administrator to make sure your ACLs are configured to access the data sources.
Data connector sync failed – Your data connector might have failed to sync information from the source to the Amazon Q Business application. Verify the data connector’s sync run schedule and sync history to confirm the sync is successful.
Empty mailbox – Verify that the Microsoft Exchange mailbox connected to Amazon Q Business has data.

If none of these are true in your case, open a support case to get this resolved.
How to generate responses from authoritative data sources
You can configure these options using Amazon Q Business application global controls under Admin controls and guardrails:

Log in as an Amazon Q Business application administrator.
Navigate to the application and choose Admin controls and guardrails in the navigation pane.
Choose Edit in the Global controls section to configure these options.

For more information, refer to Admin controls and guardrails in Amazon Q Business.

Amazon Q Business responds using old (stale) data even though your data source is updated
Each Amazon Q Business data connector can be configured with a unique sync run schedule frequency. Verify the sync status and sync schedule frequency for your data connector to see when the last sync ran successfully. Your data connector’s sync run schedule might be set to sync at a scheduled time of day, week, or month. If set to run on demand, then the sync has to be manually triggered. When the sync run is complete, verify the sync history to make sure the run has successfully synced all new content. Refer to Sync run schedule for more information.
How to set up Amazon Q Business using a different IdP
You can set up Amazon Q Business with another SAML 2.0-compliant IdP, such as Okta, Entra ID, or Ping Identity. For more information, see Creating an Amazon Q Business application using Identity Federation through IAM.
Expand the solution
You can explore other features in Amazon Q Business. For example, the Amazon Q Business document enrichment feature helps you control what documents and document attributes are ingested into your index and also how they’re ingested. Using document enrichment, you can create, modify, or delete document attributes and document content when you ingest them into your Amazon Q Business index. For example, you can scrub personally identifiable information (PII) by choosing to delete any document attributes related to PII.
Amazon Q Business also offers the following features:

Filtering using metadata – Use document attributes to customize and control users’ chat experience. This is currently supported only if you use the Amazon Q Business API.
Source attribution with citations – Verify responses using Amazon Q Business source attributions.
Upload files and chat – Let users upload files directly into chat and use uploaded file data to perform web experience tasks.
Quick prompts – Feature sample prompts to inform users of the capabilities of their Amazon Q Business web experience.

To improve retrieved results and customize the user chat experience, you can map document attributes from your data sources to fields in your Amazon Q index. Learn more by exploring Microsoft Exchange data source connector field mappings.
Clean up
To avoid incurring future costs, clean up the resources you created as part of this solution. If you only added a new data source using the Microsoft Exchange connector for Amazon Q Business, delete that data source.
Complete the following steps to clean up your resources:

Open the Office 365 Admin Center using the account of a user who is a member of the Tenant Global Admins group.
Navigate to the Microsoft Azure Portal.
Search for and choose App registrations.
Select the application you created earlier, then choose Delete.
On the Amazon Q Business console, choose Applications in the navigation pane.
Select the application you created, and on the Actions menu, choose Delete.
Delete the users that were added in IAM Identity Center.

Conclusion
With the Microsoft Exchange connector for Amazon Q Business, organizations can tap into the repository of information stored in their account securely using intelligent search powered by Amazon Q Business.
To learn about these possibilities and more, refer to the Amazon Q Business User Guide. For more information on how you can create, modify, or delete metadata and content when ingesting your data from Microsoft Exchange, refer to Enriching your documents during ingestion.

About the Authors
Ram Konchada is Senior Solutions Architect at AWS. He loves helping customers achieve their business goals using technology. Outside of work, Ram enjoys playing tennis.
Armstrong Onaiwu is a Solutions Architect at AWS. He is deeply passionate about technology and helping customers use AWS services to address business challenges. He specializes in designing highly scalable, resilient, and cost-effective network solutions on AWS. When not spending time with his family, Armstrong enjoys traveling and playing FIFA.

NASA Releases Galileo: The Open-Source Multimodal Model Advancing Eart …

Introduction

Galileo is an open-source, highly multimodal foundation model developed to process, analyze, and understand diverse Earth observation (EO) data streams—including optical, radar, elevation, climate, and auxiliary maps—at scale. Galileo was developed with support from researchers at McGill University, NASA Harvest, Ai2, Carleton University, University of British Columbia, Vector Institute, and Arizona State University. Galileo aims to provide a unified, generalist solution for critical applications like agricultural land mapping, disaster response, and environmental monitoring.

In contrast to prior remote sensing models limited to a single data type or scale, Galileo flexibly fuses multiple sensing modalities and is designed to recognize phenomena ranging from tiny objects (such as fishing boats, measuring just 1–2 pixels) to vast, slowly changing features like glaciers.

Key Features and Architecture

Multimodal Transformer Design

Galileo is based on a Vision Transformer (ViT) architecture, meticulously adapted to process:

Multispectral optical imagery (e.g., Sentinel-2)

Synthetic Aperture Radar (SAR) (e.g., Sentinel-1)

Elevation and slope data (e.g., NASA SRTM)

Weather/climate data (e.g., precipitation and temperature from ERA5)

Land cover maps, population, night-lights, and more

Flexible Input Handling: Galileo’s tokenization pipeline splits remote sensing inputs into spatial patches, timesteps, and logical channel groups. This allows the model to process images, time series, and static tabular data in a single architecture configuration.

Unified Local and Global Feature Learning

A core innovation is Galileo’s self-supervised pretraining algorithm, which combines:

Global losses: Encourage abstraction over wide spatial or temporal contexts—ideal for identifying “big” or slowly changing features (glaciers, forest loss).

Local losses: Enhance sensitivity to minute details—crucial for detecting small, fast-changing objects (boats, debris).

Local and global objectives differ in:

Prediction depth: Global tasks target deep latent representations; local tasks use shallow, linearly projected features.

Masking strategies: Global tasks use structured, correlated space-time masks (forcing predictions over large intervals); local tasks use random unstructured masks.

This dual-objective pretraining enhances multi-scale feature representation, making Galileo generalizable across tasks and robust even with limited labels.

Pretraining Dataset and Strategy

To ensure both semantic and geographic diversity, Galileo’s pretraining dataset covers the entire globe, sampled via a clustering approach to maximize both land cover variety and geographic spread. The dataset comprises over 127,000 spatiotemporally aligned samples, each including four categories and nine remote sensing data types.

Pretraining proceeds for 500 epochs on large compute resources. Key aspects:

Batch size: Effective batch size of 512.

Data augmentations: Flipping, rotation, and variable patch sizes.

Optimization: AdamW with scheduled learning rate and weight decay sweeps.

Benchmark Results

Superior Generalization

Galileo is benchmarked on 11 diverse datasets and 15 downstream tasks, spanning image and pixel time series classification, as well as segmentation. Specifically, it dominates on public datasets such as EuroSat, BigEarthNet, So2Sat, MADOS (marine debris), Sen1Floods11 (SAR flood mapping), CropHarvest (multimodal crop classification), and many others.

Performance Highlights of Galileo-Base (ViT-Base):

Classification (Finetune):

EuroSat: 97.7% (top-1 accuracy, 100% training data)

Outperforms specialist models like CROMA (96.6%) and SatMAE (96.6%)

Pixel Timeseries:

CropHarvest (Kenya): 84.2% (tops Presto and AnySat)

Breizhcrops: 73.0%

Segmentation (mIoU):

MADOS: 67.6%

PASTIS: 79.4%

Model Flexibility: Across all benchmarks, Galileo is the top performer overall—outclassing both image-specialized and time-series specialized competitors. Notably, small model variants (ViT-Nano, ViT-Tiny) also achieve top or near-top results, critical for resource-constrained settings.

Ablation and Input Importance

Removing any individual modality (e.g., VIIRS night lights, ERA5, Dynamic World maps) from pretraining leads to a measurable decline in performance—even on benchmarks not directly using that input type. For example, absence of VIIRS data reduces MADOS mIoU from 67.8% to 63.5%, demonstrating the value of full multimodality for feature generalization.

Open-Source and Real-World Impact

Open Weights & Code: All code, model weights, and pretraining data are available on GitHub, fostering transparency and adoption by the global EO community.

Societal Benefits: Galileo supports mission-critical NASA Harvest activities, such as global crop type mapping, rapid disaster mapping (floods, wildfires), and marine pollution detection. The model’s ability to work with limited labeled data makes it especially valuable in regions where ground truth is scarce, supporting food security and climate adaptation efforts.

Technical Summary Table

Model | Params | Tasks Supported | Rank (Lower=Better) | Input Modalities
Galileo-Base | 85M | Images, Time Series | 1 (overall) | Optical, SAR, Weather, etc.
Specialist SOTA | varies | Usually 1 or 2 types | 3–10 | Limited

Galileo-Base: consistently superior performance and flexibility across all major EO benchmarks.

Conclusion

Galileo’s methodological and engineering advances—multimodal inputs, multi-scale local-global feature learning, and large-scale globally diverse pretraining—set a new standard for generalist remote sensing AI. Its flexibility underpins practical deployments from environmental monitoring to climate resilience, offering reliable, high-quality maps and predictions regardless of the task or geography.

With open-source access and active development, Galileo is positioned to catalyze a new wave of innovation in earth system science, empowering practitioners everywhere.

Check out the Paper, Model, and Technical Blog.

Now It’s Claude’s World: How Anthropic Overtook OpenAI in the Ente …

The tides have turned in the enterprise AI landscape. According to Menlo Ventures’ 2025 “Mid-Year LLM Market Update,” Anthropic’s Claude has overtaken OpenAI as the leading language model provider for enterprise, now capturing 32% of market share compared to OpenAI’s 25%—a dramatic reversal from OpenAI’s dominant 50% share just one year ago. This is more than a leaderboard shuffle: it’s a testament to the maturation of enterprise AI and a signal for what businesses truly value in this next phase.

Anthropic’s Strategic Acceleration

Anthropic has charted a meteoric rise, catapulting revenues from $1B to $4B in just six months—largely on the strength of enterprise adoption by discerning, high-value customers. Rather than chasing ubiquity, Anthropic doubled down on the complex needs of large organizations, focusing on areas where AI adoption is not a curiosity but a necessity. With robust logic, structured reasoning, and rigorous regulatory compliance, Claude has become the preferred partner for industries where stakes are highest and trust is non-negotiable.

These efforts are evident in the suite of enterprise-tailored features that Claude now offers: advanced data privacy, granular user management, seamless integration with legacy IT, and sector-specific governance controls. The result? Anthropic’s dominance in code generation, where it now commands a remarkable 42% of the category—twice that of its nearest rival.

Why Enterprise Buyers Are Changing Course

The days when AI adoption decisions were swayed by splashy benchmarks or marginal gains in test scores are behind us. The Menlo Ventures report makes clear that, in 2025, enterprises are investing in outcomes, not outputs. They seek models that don’t merely process language, but power complex workflows, comply with stringent regulations, and snap natively into their existing digital fabric.

Enterprise leaders now prioritize:

Code generation tools to fuel innovation and productivity—now a $1.9B market and steadily rising;

Agent-first architectures that enable autonomous, business-aware solutions;

Production-grade inference that moves AI from experimentation to mission-critical workloads;

Seamless integration with enterprise systems and data, rather than siloed “chatbots.”

The Paradox of Scale: Plummeting Costs, Surging Spend

Since 2022, model costs have plummeted a spectacular 280-fold, yet enterprise AI spending has never been higher. Investment is exploding at a 44% annual pace, headed toward $371B globally in 2025, driven by wide-scale deployment and real-world impact—not just experiments in the lab.

Why the paradox? Enterprises are no longer buying tokens; they are investing in transformation. They pay, and pay handsomely, for platforms that can be molded to their unique needs, that offer trust and compliance, and that promise lasting operational lift.

Model Parity, Workflow Primacy

With model performance now at near parity between Claude and OpenAI, the competitive edge has shifted decisively toward reliability, governance, and successful enterprise integration—not tiny improvements in general intelligence.

Image source: Marktechpost.com

The Road Ahead: Where Enterprise AI Will Win

As the Menlo report affirms, forward-thinking leaders must now orient their teams toward:

Advanced code generation with demonstrable business value;

Autonomous agent frameworks that embed AI deeply into workflow;

Optimization for live, always-on production inference;

Relentless focus on integration and compliance across the entire enterprise stack.

The New Playbook for Enterprise AI

The AI race is no longer about having the largest, fastest, or cheapest model—it’s about trust, results, and partnership. Anthropic’s rapid ascent proves that understanding and serving enterprise needs is the true competitive differentiator. In an era of technological parity, the winner will be the one who best translates model capabilities into business transformation, system-level integration, and operational trust.

As enterprise AI budgets continue to swell, the crown will belong not to the loudest innovator, but to the one that delivers quantifiable value at scale. In 2025, Anthropic wears that crown.

Sources:

https://www.linkedin.com/posts/matt-murphy-0415543_2025-mid-year-llm-market-update-foundation-activity-7356682316062056448-ZBNN

https://www.cnbc.com/2025/05/30/anthropic-hits-3-billion-in-annualized-revenue-on-business-demand-for-ai.html

Claude for Enterprise: What is it & Who is it for?

https://www.emarketer.com/content/anthropic-s-claude-enterprise-takes-on-openai-with-business-focused-ai-capabilities

2025 Mid-Year LLM Market Update: Foundation Model Landscape + Economics

https://explodingtopics.com/blog/ai-statistics

https://www.wsj.com/tech/ai/tech-ai-spending-company-valuations-7b92104b


7 Essential Layers for Building Real-World AI Agents in 2025: A Compre …

Building an intelligent agent goes far beyond clever prompt engineering for language models. To create real-world, autonomous AI systems that can think, reason, act, and learn, you need to engineer a full-stack solution that orchestrates multiple tightly integrated components. The following seven-layer framework is a battle-tested mental model for anyone serious about AI agent development—whether you’re a founder, AI engineer, or product leader.

1. Experience Layer — The Human Interface

The Experience Layer acts as the touchpoint between humans and the agent. It defines how users interact with the system: conversation (chat/web/app), voice, image, or even multimodal engagement. This layer must be intuitive, accessible, and capable of capturing user intent precisely, while providing clear feedback.

Core design challenge: Translate ambiguous human goals into machine-understandable objectives.

Example: A customer support chatbot interface, or a voice assistant in a smart home.

2. Discovery Layer — Information Gathering & Context

Agents need to orient themselves: knowing what to ask, where to look, and how to gather relevant information. The Discovery Layer encompasses techniques like web search, document retrieval, data mining, context collection, sensor integration, and interaction history analysis.

Core design challenge: Efficient, reliable, and context-aware information retrieval that surfaces only what matters.

Example: Fetching product manuals, extracting knowledge bases, or summarizing recent emails.

3. Agent Composition Layer — Structure, Goals, and Behaviors

This layer defines what the agent is and how it should behave. It includes defining the agent’s goals, its modular architecture (sub-agents, policies, roles), possible actions, ethical boundaries, and configurable behaviors.

Core design challenge: Enabling customization and extensibility while ensuring coherence and alignment with user and business objectives.

Example: Setting up a sales assistant agent with negotiation tactics, brand voice, and escalation protocols.

4. Reasoning & Planning Layer — The Agent’s Brain

At the heart of autonomy, the Reasoning & Planning Layer handles logic, decision-making, inference, and action sequencing. Here, the agent evaluates information, weighs alternatives, plans steps, and adapts strategies. This layer can leverage symbolic reasoning engines, LLMs, classical AI planners, or hybrids.

Core design challenge: Moving beyond pattern-matching to true adaptive intelligence.

Example: Prioritizing customer queries, scheduling multi-step workflows, or generating argument chains.

5. Tool & API Layer — Acting in the World

This layer enables the agent to perform real actions: executing code, triggering APIs, controlling IoT devices, managing files, or running external workflows. The agent must safely interface with digital and (sometimes) physical systems, often requiring robust error handling, authentication, and permissions management.

Core design challenge: Safe, reliable, and flexible action-taking with external systems.

Example: Booking a meeting on your calendar, placing an e-commerce order, or running data analysis scripts.

6. Memory & Feedback Layer — Contextual Recall & Learning

Agents that learn and improve over time must maintain memory: tracking prior interactions, storing context, and incorporating user feedback. This layer supports both short-term contextual recall (for conversation) and long-term learning (improving models, policies, or knowledge bases).

Core design challenge: Scalable memory representation and effective feedback integration.

Example: Remembering user preferences, learning common support issues, or iteratively refining suggestions.

7. Infrastructure Layer — Scaling, Orchestration, & Security

Beneath the application stack, robust infrastructure ensures the agent is available, responsive, scalable, and secure. This layer includes orchestration platforms, distributed compute, monitoring, failover, and compliance safeguards.

Core design challenge: Reliability and robustness at scale.

Example: Managing thousands of concurrent agent instances with uptime guarantees and secure API gateways.
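To make the framework concrete, the following is a minimal, illustrative Python sketch of how the seven layers might be composed into a single agent object. The class and function names are invented for illustration only; a production system would back each layer with far more machinery.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Agent:
    experience: Callable[[str], str]                  # 1. Experience: turn user input into a goal
    discovery: Callable[[str], List[str]]             # 2. Discovery: gather relevant context
    composition: Dict[str, str]                       # 3. Composition: goals, roles, boundaries
    reasoner: Callable[[str, List[str]], List[str]]   # 4. Reasoning & planning: produce a plan
    tools: Dict[str, Callable[[str], str]]            # 5. Tool & API layer: named actions
    memory: List[str] = field(default_factory=list)   # 6. Memory & feedback: contextual recall
    # 7. Infrastructure (scaling, monitoring, security) lives outside this object.

    def run(self, user_input: str) -> str:
        goal = self.experience(user_input)
        context = self.discovery(goal)
        plan = self.reasoner(goal, context)
        results = [self.tools[step](goal) for step in plan if step in self.tools]
        self.memory.append(f"{goal} -> {results}")
        return "; ".join(results)

agent = Agent(
    experience=lambda text: text.strip().lower(),
    discovery=lambda goal: [f"context for {goal}"],
    composition={"role": "support assistant"},
    reasoner=lambda goal, ctx: ["answer"],
    tools={"answer": lambda goal: f"Handled request: {goal}"},
)
print(agent.run("Reset my password"))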

Key Takeaways

True autonomy requires more than language understanding.

Integrate all 7 layers for agents that can safely sense, plan, act, learn, and scale.

Adopt this framework to assess, design, and build next-generation AI systems that solve meaningful problems.


Cost tracking multi-tenant model inference on Amazon Bedrock

Organizations serving multiple tenants through AI applications face a common challenge: how to track, analyze, and optimize model usage across different customer segments. Although Amazon Bedrock provides powerful foundation models (FMs) through its Converse API, the true business value emerges when you can connect model interactions to specific tenants, users, and use cases.
Using the Converse API requestMetadata parameter offers a solution to this challenge. By passing tenant-specific identifiers and contextual information with each request, you can transform standard invocation logs into rich analytical datasets. This approach means you can measure model performance, track usage patterns, and allocate costs with tenant-level precision—without modifying your core application logic.
Tracking and managing cost through application inference profiles
Managing costs for generative AI workloads is a challenge that organizations face daily, especially when using on-demand FMs that don’t support cost-allocation tagging. When you monitor spending manually and rely on reactive controls, you create risks of overspending while introducing operational inefficiencies.
Application inference profiles address this by allowing custom tags (for example, tenant, project, or department) to be applied directly to on-demand models, enabling granular cost tracking. Combined with AWS Budgets and cost allocation tools, organizations can automate budget alerts, prioritize critical workloads, and enforce spending guardrails at scale. This shift from manual oversight to programmatic control reduces financial risks while fostering innovation through enhanced visibility into AI spend across teams, applications, and tenants.
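As a sketch of how such a profile might be created programmatically, the following Boto3 snippet creates an application inference profile that copies an on-demand model and attaches tenant tags. The profile name, model ARN, and tag values are placeholders, and the parameter names should be checked against the current Amazon Bedrock API reference.

import boto3

bedrock = boto3.client("bedrock")

# Placeholder values for illustration; the copyFrom ARN must reference a model
# available in your Region.
response = bedrock.create_inference_profile(
    inferenceProfileName="tenant-a-claude-profile",
    description="On-demand profile for tenant A cost tracking",
    modelSource={
        "copyFrom": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0"
    },
    tags=[
        {"key": "tenantId", "value": "tenant-a"},
        {"key": "department", "value": "finance"},
    ],
)
print(response["inferenceProfileArn"])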
For tracking multi-tenant costs when dealing with tens to thousands of application inference profiles, refer to the post Manage multi-tenant Amazon Bedrock costs using application inference profiles on the AWS Artificial Intelligence Blog.
Managing costs and resources in large-scale multi-tenant environments adds complexity when you use application inference profiles in Amazon Bedrock. You face additional considerations when dealing with hundreds of thousands to millions of tenants and complex tagging requirements.
The lifecycle management of these profiles creates operational challenges. You need to handle profile creation, updates, and deletions at scale. Automating these processes requires robust error handling for edge cases like profile naming conflicts, Region-specific replication for high availability, and cascading AWS Identity and Access Management (IAM) policy updates that maintain secure access controls across tenants.
Another layer of complexity arises from cost allocation tagging constraints. Although organizations and teams can add multiple tags per application inference profile resource, the number of tags per resource is limited, and organizations with granular tracking needs—such as combining tenant identifiers (tenantId), departmental codes (department), and cost centers (costCenter)—might find this limit restrictive, potentially compromising the depth of cost attribution. These considerations encourage organizations to implement a consumer or client-side tracking approach, and this is where metadata-based tagging might be a better fit.
Using Converse API with request metadata
You can use the Converse API to include request metadata when you call FMs through Amazon Bedrock. This metadata doesn’t affect the model’s response, but you can use it for tracking and logging purposes; it’s a JSON object with key-value pairs. Common uses for request metadata include:

Adding unique identifiers for tracking requests
Including timestamp information
Tagging requests with application-specific information
Adding version numbers or other contextual data

The request metadata is not typically returned in the API response. It’s primarily used for your own tracking and logging purposes on the client-side.
When using the Converse API, you typically include the request metadata as part of your API call. For example, using the AWS SDK for Python (Boto3), you might structure your request like this:

response = bedrock_runtime.converse(
    modelId="your-model-id",
    messages=[...],
    requestMetadata={
        "requestId": "unique-request-id",
        "timestamp": "unix-timestamp",
        "tenantId": "your-tenant-id",
        "departmentId": "your-department-id"
    },
    # other parameters
)

Solution overview
The following diagram illustrates a comprehensive log processing and analytics architecture across two main environments: a Customer virtual private cloud (VPC) and an AWS Service Account.
In the Customer VPC, the flow begins with Amazon Bedrock invocation logs being processed through an extract, transform, and load (ETL) pipeline managed by AWS Glue. The logs go through a scheduler and transformation process, with an AWS Glue crawler cataloging the data. Failed logs are captured in a separate storage location.
In the AWS Service Account section, the architecture shows the reporting and analysis capabilities. Amazon QuickSight Enterprise edition serves as the primary analytics and visualization service, with tenant-based reporting dashboards.

To convert Amazon Bedrock invocation logs with tenant metadata into actionable business intelligence (BI), we’ve designed a scalable data pipeline that processes, transforms, and visualizes this information. The architecture consists of three main components working together to deliver tenant-specific analytics.
The process begins in your customer’s virtual private cloud (VPC), where Amazon Bedrock invocation logs capture each interaction with your AI application. These logs contain valuable information including the requestMetadata parameters you’ve configured to identify tenants, users, and other business contexts.
An ETL scheduler triggers AWS Glue jobs at regular intervals to process these logs. The AWS Glue ETL job extracts the tenant metadata from each log entry, transforms it into a structured format optimized for analysis, and loads the results into a transformed logs bucket. For data quality assurance, records that fail processing are automatically routed to a separate failed logs bucket for troubleshooting.
After the data is transformed, a crawler scheduler activates an AWS Glue crawler to scan the processed logs. The crawler updates the AWS Glue Data Catalog with the latest schema and partition information, making your tenant-specific data immediately discoverable and queryable.
This automated cataloging creates a unified view of tenant interactions across your Amazon Bedrock applications. The data catalog connects to your analytics environment through an elastic network interface that provides secure access while maintaining network isolation.
Your reporting infrastructure in the Amazon QuickSight account transforms tenant data into actionable insights. Amazon QuickSight Enterprise edition serves as your visualization service and connects to the data catalog through the QuickSight to Amazon Athena connector.
Your reporting administrators can create tenant-based dashboards that show usage patterns, popular queries, and performance metrics segmented by tenant. Cost dashboards provide financial insights into model usage by tenant, helping you understand the economics of your multi-tenant AI application.
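Outside of QuickSight, you can also query the cataloged logs directly with Amazon Athena. The following sketch aggregates token usage by tenant; the database, table, and column names are assumptions that depend on how your AWS Glue job and crawler named them, and the Athena results location is a placeholder.

import boto3

athena = boto3.client("athena")

# Table and column names below are assumptions based on the ETL described above.
query = """
SELECT tenantid,
       SUM(input_tokens)  AS total_input_tokens,
       SUM(output_tokens) AS total_output_tokens
FROM bedrock_transformed_logs
GROUP BY tenantid
ORDER BY total_output_tokens DESC
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "bedrock_usage"},  # assumed database name
    ResultConfiguration={"OutputLocation": "s3://<your-athena-results-bucket>/"},
)
print(response["QueryExecutionId"])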
Monitoring and analyzing Amazon Bedrock performance metrics
The following Amazon QuickSight dashboard demonstrates how you can visualize your Amazon Bedrock usage data across multiple dimensions. You can examine your usage patterns through four key visualization panels.
Using the Bedrock Usage Summary horizontal bar chart shown in the top left, you can compare token usage across tenant groups. You get clear visibility into each tenant’s consumption levels. The Token Usage by Company pie chart in the top right breaks down token usage distribution by company, showing relative shares among organizations.
Token Usage by Department horizontal bar chart in the bottom left reveals departmental consumption. You can see how different business functions such as Finance, Research, HR, and Sales use Amazon Bedrock services. The Model Distribution graphic in the bottom right displays model distribution metrics with a circular gauge showing complete coverage.
You can filter and drill down into your data using the top filter controls for Year, Month, Day, Tenant, and Model selections. This enables detailed temporal and organizational analysis of your Amazon Bedrock consumption patterns.

Bedrock Usage Overview QuickSight dashboard

The comprehensive dashboard shown in the following image provides vital insights into Amazon Bedrock usage patterns and performance metrics across different environments. This “Usage Trends” visualization suite includes key metrics such as token usage trends, input and output token distribution, latency analysis, and environment-wide usage breakdown.
Using the dashboard, stakeholders can make data-driven decisions about resource allocation, performance optimization, and usage patterns across different deployment stages. With intuitive controls for year, month, day, tenant, and model selection, teams can quickly filter and analyze specific usage scenarios.

Usage Trends QuickSight Dashboard

Access to these insights is carefully managed through AWS IAM Identity Center and role-based permissions, so tenant data remains protected while still enabling powerful analytics.
By implementing this architecture, you transform basic model invocation logs into a strategic asset. Your business can answer sophisticated questions about tenant behavior, optimize model performance for specific customer segments, and make data-driven decisions about your AI application’s future development—all powered by the metadata you’ve thoughtfully included in your Amazon Bedrock Converse API requests.
Customize the solution
The Converse metadata cost reporting solution provides several customization points to adapt to your specific multi-tenant requirements and business needs. You can modify the ETL process by editing the AWS Glue ETL script at `cdk/glue/bedrock_logs_transform.py` to extract additional metadata fields or transform data according to your tenant structure. Schema definitions can be updated in the corresponding JSON files to accommodate custom tenant attributes or hierarchical organizational data.
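The actual transformation logic lives in the repository’s ETL script; the following PySpark sketch only illustrates the kind of change you might make there, under the assumption that your invocation logs expose requestMetadata keys and token counts as shown (adjust the field names and bucket paths to your environment):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("bedrock-logs-transform").getOrCreate()

# Read raw Amazon Bedrock invocation logs (JSON Lines) from the logging bucket
raw_logs = spark.read.json("s3://<your-invocation-logs-bucket>/AWSLogs/")

# Flatten tenant metadata plus token counts into an analysis-friendly table;
# the requestMetadata column paths below are illustrative
transformed = raw_logs.select(
    col("timestamp"),
    col("modelId"),
    col("requestMetadata.tenantId").alias("tenant_id"),
    col("requestMetadata.department").alias("department"),  # newly added field
    col("input.inputTokenCount").alias("input_tokens"),
    col("output.outputTokenCount").alias("output_tokens"),
)

transformed.write.mode("append").partitionBy("tenant_id").parquet(
    "s3://<your-transformed-logs-bucket>/bedrock-usage/"
)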
For organizations with evolving pricing models, the pricing data stored in `cdk/glue/pricing.csv` can be updated to reflect current Amazon Bedrock costs, including cache read and write pricing. Edit the .csv file and upload it to your transformed data Amazon Simple Storage Service (Amazon S3) bucket, then run the pricing crawler to refresh the data catalog. This makes sure your cost allocation dashboards are accurate as pricing changes.
QuickSight dashboards offer extensive customization capabilities directly through the console interface. You can modify existing visualizations to focus on specific tenant metrics, add filters for departmental or regional views, and create new analytical insights that align with your business reporting requirements. You can save customized versions in the dashboard editor while preserving the original template for future reference.
Clean up
To avoid incurring future charges, delete the resources. Because the solution is deployed using the AWS Cloud Development Kit (AWS CDK), cleaning up resources is straightforward. From the command line, change into the CDK directory at the root of the converse-metadata-cost-reporting repo and enter the following command to delete the deployed resources. You can also find these instructions in README.md.

cd cdk
cdk destroy

Conclusion
Implementing tenant-specific metadata with Amazon Bedrock Converse API creates a powerful foundation for AI application analytics. This approach transforms standard invocation logs into a strategic asset that drives business decisions and improves customer experiences.
The architecture can deliver immediate benefits through automated processing of tenant metadata. You gain visibility into usage patterns across customer segments. You can allocate costs accurately and identify opportunities for model optimization based on tenant-specific needs. For implementation details, refer to the converse-metadata-cost-reporting GitHub repository.
This solution enables measurable business outcomes. Product teams can prioritize features based on tenant usage data. Customer success managers can provide personalized guidance using tenant-specific insights. Finance teams can develop more accurate pricing models based on actual usage patterns across different customer segments.
As AI applications become increasingly central to business operations, understanding how different tenants interact with your models becomes essential. Implementing the requestMetadata parameter in your Amazon Bedrock Converse API calls today builds the analytics foundation for your future AI strategy. Start small by identifying key tenant identifiers for your metadata, then expand your analytics capabilities as you gather more data. The flexible architecture described here scales with your needs, so you can continuously refine your understanding of tenant behavior and deliver increasingly personalized AI experiences.

About the authors
Praveen Chamarthi brings exceptional expertise to his role as a Senior AI/ML Specialist at Amazon Web Services (AWS), with over two decades in the industry. His passion for machine learning and generative AI, coupled with his specialization in ML inference on Amazon SageMaker, enables him to empower organizations across the Americas to scale and optimize their ML operations. When he’s not advancing ML workloads, Praveen can be found immersed in books or enjoying science fiction films.
Srikanth Reddy is a Senior AI/ML Specialist with Amazon Web Services (AWS). He is responsible for providing deep, domain-specific expertise to enterprise customers, helping them use AWS AI and ML capabilities to their fullest potential.
Dhawal Patel is a Principal Machine Learning Architect at Amazon Web Services (AWS). He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and AI. He focuses on deep learning, including natural language processing (NLP) and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.
Alma Mohapatra is an Enterprise Support Manager helping strategic AI/ML customers optimize their workloads on HPC environments. She guides organizations through performance challenges and infrastructure optimization for LLMs across distributed GPU clusters. Alma translates technical requirements into practical solutions while collaborating with Technical Account Managers to ensure AI/ML initiatives meet business objectives.
John Boren is a Solutions Architect at AWS GenAI Labs in Seattle where he develops full-stack Generative AI demos. Originally from Alaska, he enjoys hiking, traveling, continuous learning, and fishing.
Rahul Sharma is a Senior Specialist Solutions Architect at AWS, helping AWS customers build ML and Generative AI solutions. Prior to joining AWS, Rahul has spent several years in the finance and insurance industries, helping customers build data and analytics platforms.

AI judging AI: Scaling unstructured text analysis with Amazon Nova

Picture this: Your team just received 10,000 customer feedback responses. The traditional approach? Weeks of manual analysis. But what if AI could not only analyze this feedback but also validate its own work? Welcome to the world of large language model (LLM) jury systems deployed using Amazon Bedrock.
As more organizations embrace generative AI, particularly LLMs for various applications, a new challenge has emerged: ensuring that the output from these AI models aligns with human perspectives and is accurate and relevant to the business context. Manual analysis of large datasets can be time consuming, resource intensive, and thus impractical. For example, manually reviewing 2,000 comments can take over 80 hours, depending on comment length, complexity, and researcher analyses. LLMs offer a scalable approach to serve as qualitative text annotators, summarizers, and even judges evaluating text outputs from other AI systems.
This prompts the question, “But how can we deploy such LLM-as-a-judge systems effectively and then use other LLMs to evaluate performance?”
In this post, we highlight how you can deploy multiple generative AI models in Amazon Bedrock to instruct an LLM model to create thematic summaries of text responses (such as from open-ended survey questions to your customers) and then use multiple LLM models as a jury to review these LLM generated summaries and assign a rating to judge the content alignment between the summary title and summary description. This setup is often referred to as an LLM jury system. Think of the LLM jury as a panel of AI judges, each bringing their own perspective to evaluate content. Instead of relying on a single model’s potentially biased view, multiple models work together to provide a more balanced assessment.
Problem: Analyzing text feedback
Your organization receives thousands of customer feedback responses. Traditional manual analysis of responses can take days or weeks of painstaking, resource-intensive work, depending on the volume of free-text comments you receive. Alternative natural language processing techniques, though likely faster, also require extensive data cleanup and coding know-how to analyze the data effectively. Pre-trained LLMs offer a promising, relatively low-code solution for quickly generating thematic summaries from text-based data because these models have been shown to scale data analysis and reduce manual review time. However, when relying on a single pre-trained LLM for both analysis and evaluation, concerns arise regarding biases, such as model hallucinations (that is, producing inaccurate information) or confirmation bias (that is, favoring expected outcomes). Without cross-validation mechanisms, such as comparing outputs from multiple models or benchmarking against human-reviewed data, the risk of unchecked errors increases. Using multiple pre-trained LLMs can address this concern by providing robust and comprehensive analyses, enabling human-in-the-loop oversight, and enhancing reliability over a single-model evaluation. The concept of using LLMs as a jury means deploying multiple generative AI models to independently evaluate or validate each other’s outputs.
Solution: Deploy LLM as judges on Amazon Bedrock
You can use Amazon Bedrock to compare various frontier foundation models (FMs) such as Anthropic’s Claude 3 Sonnet, Amazon Nova Pro, and Meta’s Llama 3. The unified Amazon Web Services (AWS) environment and standardized API calls simplify deploying multiple models for thematic analysis and for judging model outputs. Amazon Bedrock also addresses operational needs through unified security and compliance controls and a consistent model deployment environment across all models.
Our proposed workflow, illustrated in the following diagram, includes these steps:

The preprocessed raw data is prepared in a .txt file and uploaded into Amazon Bedrock. A thematic generation prompt is crafted and tested, then the data and prompt are run in Amazon SageMaker Studio using a pre-trained LLM of choice.
The LLM-generated summaries are converted into a .txt file, and the summary data is uploaded into SageMaker Studio.
Next, an LLM-as-a-judge prompt is crafted and tested, and the summary data and prompt are run in SageMaker Studio using different pre-trained LLMs.
Human-as-judge scores are then statistically compared against the model performance. We use percentage agreement, Cohen’s kappa, Krippendorff’s alpha, and Spearman’s rho.

Prerequisites
To complete the steps, you need to have the following prerequisites:

An AWS account with access to:

Amazon Bedrock – Check out Getting Started with Amazon Bedrock.
Amazon SageMaker AI – Check out Getting Started with Amazon SageMaker AI.
Amazon Simple Storage Service (Amazon S3) – Check out Getting Started with Amazon S3.

Basic understanding of Python and Jupyter notebooks
Preprocessed text data for analysis

Implementation details
In this section, we walk you through the step-by-step implementation.
Try this out for yourself by downloading the Jupyter notebook from GitHub.

Create a SageMaker notebook instance to run the analysis, and then initialize Amazon Bedrock and configure the input and output file locations on Amazon S3. Save the text feedback you’d like to analyze as a .txt file in an S3 bucket. Use the following code:

import boto3
import json

# Initialize our connection to AWS services
# Note: the bedrock-runtime client is the one used for model invocation below
bedrock_runtime = boto3.client('bedrock-runtime')
s3_client = boto3.client('s3')

# Configure where we'll store our evidence (data)
bucket = 'my-example-name'
raw_input = 'feedback_dummy_data.txt'
output_themes = 'feedback_analyzed.txt'

Use Amazon Nova Pro in Amazon Bedrock to generate LLM-based thematic summaries for the feedback you want to analyze. Depending on your use case, you can use any or multiple models offered by Amazon Bedrock for this step. The prompt provided here is also generic and will need to be tuned for your specific use case to give the LLM model of choice adequate context on your data to enable appropriate thematic categorization:

def analyze_comment(comment):
    prompt = f"""You must respond ONLY with a valid JSON object.
Analyze this customer review: "{comment}"
Respond with this exact JSON structure:
{{
    "main_theme": "theme here",
    "sub_theme": "sub-theme here",
    "rationale": "rationale here"
}}
"""
    # Call pre-trained model through Amazon Bedrock
    response = bedrock_runtime.invoke_model(
        modelId="<model-of-choice>",  # model of choice goes here
        body=json.dumps({
            "prompt": prompt,
            "max_tokens": 1000,
            "temperature": 0.1
        })
    )
    return parse_response(response)
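The parse_response helper isn’t shown in the snippet. A minimal sketch, assuming the selected model returns its completion as text that itself contains the requested JSON object (the exact response body shape varies by model provider), might look like this:

def parse_response(response):
    # invoke_model returns the model output as a streaming body; decode it to JSON
    body = json.loads(response["body"].read())
    # Many models return the generated text under a "completion"- or "outputs"-style
    # key; adjust this extraction for the model you selected
    generated_text = body.get("completion") or body.get("outputs", [{}])[0].get("text", "")
    # The prompt instructs the model to answer with a JSON object, so parse it
    return json.loads(generated_text)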

You can now use multiple LLMs as jury to evaluate the themes generated by the LLM in the previous step. In our example, we use Amazon Nova Pro and Anthropic’s Claude 3.5 Sonnet models to each analyze the themes per feedback and provide an alignment score. Here, our alignment score is on a scale of 1–3, where 1 indicates poor alignment in which themes don’t capture the main points, 2 indicates partial alignment in which themes capture some but not all key points, and 3 indicates strong alignment in which themes accurately capture the main points:

def evaluate_alignment_nova(comment, theme, subtheme, rationale):
    judge_prompt = f"""Rate theme alignment (1-3):
Comment: "{comment}"
Main Theme: {theme}
Sub-theme: {subtheme}
Rationale: {rationale}
"""
    # Complete code in attached notebook
# Complete code in attached notebook

When you have the alignment scores from the LLMs, you can implement the following agreement metrics to compare and contrast the scores. If you have ratings from human judges, you can add those as another set of scores to discover how closely the human ratings (the gold standard) align with those of the models:

def calculate_agreement_metrics(ratings_df):
    return {
        'Percentage Agreement': calculate_percentage_agreement(ratings_df),
        'Cohens Kappa': calculate_pairwise_cohens_kappa(ratings_df),
        'Krippendorffs Alpha': calculate_krippendorffs_alpha(ratings_df),
        'Spearmans Rho': calculate_spearmans_rho(ratings_df)
    }
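The individual metric helpers are defined in the accompanying notebook. As a rough sketch, assuming ratings_df holds one column of scores per judge, two of them can be backed by scikit-learn and SciPy, with percentage agreement computed directly (the pairwise averaging shown is an illustrative choice, and Krippendorff's alpha is typically computed with a dedicated package such as krippendorff):

from itertools import combinations

import pandas as pd
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score


def calculate_percentage_agreement(ratings_df: pd.DataFrame) -> float:
    # Share of rows where every judge assigned the identical score
    return (ratings_df.nunique(axis=1) == 1).mean() * 100


def calculate_pairwise_cohens_kappa(ratings_df: pd.DataFrame) -> float:
    # Average Cohen's kappa over every pair of judges
    kappas = [cohen_kappa_score(ratings_df[a], ratings_df[b])
              for a, b in combinations(ratings_df.columns, 2)]
    return sum(kappas) / len(kappas)


def calculate_spearmans_rho(ratings_df: pd.DataFrame) -> float:
    # Average Spearman correlation over every pair of judges;
    # index [0] is the correlation coefficient across SciPy versions
    rhos = [spearmanr(ratings_df[a], ratings_df[b])[0]
            for a, b in combinations(ratings_df.columns, 2)]
    return sum(rhos) / len(rhos)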

We used the following popular agreement metrics to compare alignment and therefore performance across and among models:

Percentage agreement – Percentage agreement tells us how many times two raters provide the same rating (for example, 1–5) of the same thing, such as two people providing the same 5-star rating of a movie. The more times they agree, the better. This is expressed as a percentage of the total number of cases rated and calculated by dividing the total agreements by the total number of ratings and multiplying by 100.
Cohen’s kappa – Cohen’s kappa is essentially a smarter version of percentage agreement. It’s like when two people guess how many of their 5 coworkers will wear blue in the office each day. Sometimes both people guess the same number (for example, 1–5) by chance. Cohen’s kappa considers how well the two people agree, beyond any lucky guesses. The coefficients range from −1 to +1, where 1 represents perfect agreement, 0 represents agreement equivalent to chance, and negative values indicate agreement less than chance.
Spearman’s rho – Spearman’s rho is like a friendship meter for numbers. It shows how well two sets of numbers “get along” or move together. If one set of numbers goes up and the other set also goes up, they have a positive relationship. If one goes up while the other goes down, they have a negative relationship. Coefficients range from −1 to +1, with values closer to ±1 indicating stronger correlations.
Krippendorff’s alpha – Krippendorff’s alpha is a test used to determine how much all raters agree on something. Imagine two people taste-testing different foods at a restaurant and rating the foods on a scale of 1–5. Krippendorff’s alpha provides a score to show how much the two people agree on their food ratings, even if they didn’t taste every dish in the restaurant. The alpha coefficient ranges from 0–1, where values closer to 1 indicate higher agreement among raters. Generally, an alpha above 0.80 signifies strong agreement, an alpha between 0.67 and 0.80 indicates acceptable agreement, and an alpha below 0.67 suggests low agreement. If calculated with the rationale that the levels (1, 2, and 3) are ordinal, Krippendorff’s alpha considers not only agreement but also the magnitude of disagreement. It’s less affected by marginal distributions compared to kappa and provides a more nuanced assessment when ratings are ranked (ordinal). That is, although percentage agreement and kappa treat all disagreements equally, alpha recognizes the difference between minor (for example, “1” compared to “2”) and major disagreements (for example, “1” compared to “3”).

Success! If you followed along, you have now successfully deployed multiple LLMs to judge thematic analysis output from an LLM.
Additional considerations
To help manage costs when running this solution, consider the following options:

Use SageMaker managed Spot Instances
Implement batch processing for large datasets with Amazon Bedrock batch inference
Cache intermediate results in Amazon S3

For sensitive data, consider the following options:

Enable encryption at rest for all S3 buckets
Use AWS Identity and Access Management (IAM) roles with minimum required permissions
Implement Amazon Virtual Private Cloud (Amazon VPC) endpoints for enhanced security

Results
In this post, we demonstrated how you can use Amazon Bedrock to seamlessly use multiple LLMs to generate and judge thematic summaries of qualitative data, such as customer feedback. We also showed how we can compare human evaluator ratings of text-based summaries from survey response data against ratings from multiple LLMs such as Anthropic’s Claude 3 Sonnet, Amazon Nova Pro, and Meta’s Llama 3. In recently published research, Amazon scientists found that LLMs showed inter-model agreement of up to 91%, compared with human-to-model agreement of up to 79%. Our findings suggest that although LLMs can provide reliable thematic evaluations at scale, human oversight remains important for identifying subtle contextual nuances that LLMs might miss.
The best part? Through Amazon Bedrock model hosting, you can compare the various models using the same preprocessed data across all models, so you can choose the one that works best for your context and need.
Conclusion
With organizations turning to generative AI for analyzing unstructured data, this post provides insight into the value of using multiple LLMs to validate LLM-generated analyses. The strong performance of LLM-as-a-judge models opens opportunities to scale text data analysis, and Amazon Bedrock can help organizations interact with and use multiple models in an LLM-as-a-judge framework.

About the Authors
Dr. Sreyoshi Bhaduri is a Senior Research Scientist at Amazon. Currently, she spearheads innovative research in applying generative AI at scale to solve complex supply chain logistics and operations challenges. Her expertise spans applied statistics and natural language processing, with a PhD from Virginia Tech and specialized training in responsible AI from MILA. Sreyoshi is committed to demystifying and democratizing generative AI solutions and bridging the gap between theoretical research and practical applications using AWS technologies.
Dr. Natalie Perez specializes in transformative approaches to customer insights and innovative solutions using generative AI. Previously at AWS, Natalie pioneered large-scale voice of employee research, driving product and programmatic improvements. Natalie is dedicated to revolutionizing how organizations scale, understand, and act on customer needs through the strategic integration of generative AI and human-in-the-loop strategies, driving innovation that puts customers at the heart of product, program, and service development.
John Kitaoka is a Solutions Architect at Amazon Web Services (AWS) and works with government entities, universities, nonprofits, and other public sector organizations to design and scale AI solutions. His work covers a broad range of machine learning (ML) use cases, with a primary interest in inference, responsible AI, and security. In his spare time, he loves woodworking and snowboarding.
Dr. Elizabeth (Liz) Conjar is a Principal Research Scientist at Amazon, where she pioneers at the intersection of HR research, organizational transformation, and AI/ML. Specializing in people analytics, she helps reimagine employees’ work experiences, drive high-velocity organizational change, and develop the next generation of Amazon leaders. Throughout her career, Elizabeth has established herself as a thought leader in translating complex people analytics into actionable strategies. Her work focuses on optimizing employee experiences and accelerating organizational success through data-driven insights and innovative technological solutions.

Building an AI-driven course content generation system using Amazon Be …

The education sector needs efficient, high-quality course material development that can keep pace with rapidly evolving knowledge domains. Faculty invest days to create content and quizzes for topics to be taught in weeks. Increased faculty engagement in manual content creation creates a time deficit for innovation in teaching, inconsistent course material, and a poor experience for both faculty and students.
Generative AI–powered systems can significantly reduce the time and effort faculty spend on course material development while improving educational quality. Automating content creation tasks gives educators more time for interactive teaching and creative classroom strategies.
The solution in this post addresses this challenge by using large language models (LLMs), specifically Anthropic’s Claude 3.5 through Amazon Bedrock, for educational content creation. This AI-powered approach supports the automated generation of structured course outlines and detailed content, reducing development cycles from days to hours while ensuring materials remain current and comprehensive. This technical exploration demonstrates how institutions can use advanced AI capabilities to transform their educational content development process, making it more efficient, scalable, and responsive to modern learning needs.
The solution uses Amazon Simple Queue Service (Amazon SQS), AWS Lambda, Amazon Bedrock, Amazon API Gateway WebSocket APIs, Amazon Simple Storage Service (Amazon S3), Amazon CloudFront, Amazon DynamoDB, Amazon Cognito and AWS WAF. The architecture is designed following the AWS Well-Architected Framework, facilitating robustness, scalability, cost-optimization, high performance, and enhanced security.
In this post, we explore each component in detail, along with the technical implementation of the two core modules: course outline generation and course content generation. Course outline generation produces the course structure for a subject, with modules and submodules organized by week, and generates primary and secondary learning outcomes in a hierarchical structure by week and by semester. Course content generation then produces content for each module and submodule in the outline, including text and video scripts with corresponding multiple-choice questions.
Solution overview
The solution architecture integrates the two core modules through WebSocket APIs. This design is underpinned by using AWS Lambda function for serverless compute, Amazon Bedrock for AI model integration, and Amazon SQS for reliable message queuing.
The system’s security uses a multilayered approach, combining Amazon Cognito for user authentication, AWS WAF for threat mitigation, and a Lambda authorizer function for fine-grained access control. To optimize performance and enhance user experience, AWS WAF is deployed to filter out malicious traffic and help protect against common web vulnerabilities. Furthermore, Amazon CloudFront is implemented as a WebSocket distribution layer to significantly improve content delivery speeds and reduce latency for end users. This comprehensive architecture creates a secure, scalable, and high-performance system for generating and delivering educational content.

WebSocket API and authentication mechanisms
The course WebSocket API manages real-time interactions for course outline and content generation. WebSockets enable streaming AI responses and real-time interactions, reducing latency and improving user responsiveness compared to traditional REST APIs. They also support scalable concurrency, allowing parallel processing of multiple requests without overwhelming system resources. AWS WAF provides rule-based filtering to help protect against web-based threats before traffic reaches API Gateway. Amazon CloudFront enhances performance and security by distributing WebSocket traffic globally. Amazon Cognito and a JWT Lambda authorizer function handle authentication, validating user identity before allowing access.
Each WebSocket implements three primary routes:

$connect – Triggers a Lambda function to log the connection_id in DynamoDB. This enables tracking of active connections, targeted messaging, and efficient connection management, supporting real-time communication and scalability across multiple server instances.
$disconnect – Removes the connection_id record from the DynamoDB table when a client disconnects. This facilitates proper cleanup of inactive connections, helps prevent resource waste, maintains an accurate list of active clients, and helps optimize system performance and resource allocation.
$default – Handles unexpected or invalid traffic.

WebSocket authentication using Amazon Cognito
The WebSocket API integrates Amazon Cognito for authentication and uses a JWT-based Lambda authorizer function for token validation. The authentication flow follows these steps:

User authentication

The course designer signs in using Amazon Cognito, which issues a JWT access token upon successful authentication.
Amazon Cognito supports multiple authentication methods, including username-password login, social identity providers (such as Google or Facebook), and SAML-based federation.

WebSocket connection request

When a user attempts to connect to the WebSocket API, the client includes the JWT access token in the WebSocket request headers.

JWT token validation (Lambda authorizer function)

The JWT token authorizer Lambda function extracts the token and verifies it against the Amazon Cognito public keys (a minimal sketch of this check follows this list).
If the token is valid, the request proceeds. If the token isn’t valid, the connection is rejected.

Maintaining user sessions

Upon successful authentication, the $connect route Lambda function stores the connection_id and user details in DynamoDB, allowing targeted messaging.
When the user disconnects, the $disconnect Lambda function removes the connection_id to maintain an accurate session record.
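The repository ships its own authorizer implementation; the following is only a minimal sketch of the token check described in step 3, assuming PyJWT is installed and that COGNITO_REGION, USER_POOL_ID, and APP_CLIENT_ID are supplied as environment variables (all names here are illustrative):

import os

import jwt  # PyJWT
from jwt import PyJWKClient

REGION = os.environ["COGNITO_REGION"]
USER_POOL_ID = os.environ["USER_POOL_ID"]
APP_CLIENT_ID = os.environ["APP_CLIENT_ID"]
ISSUER = f"https://cognito-idp.{REGION}.amazonaws.com/{USER_POOL_ID}"

# Amazon Cognito publishes the user pool's signing keys as a JWKS document
jwk_client = PyJWKClient(f"{ISSUER}/.well-known/jwks.json")


def _policy(principal_id, effect, resource):
    # WebSocket Lambda authorizers respond with an IAM policy for the $connect call
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {"Action": "execute-api:Invoke", "Effect": effect, "Resource": resource}
            ],
        },
    }


def lambda_handler(event, context):
    token = event.get("headers", {}).get("Authorization", "").replace("Bearer ", "")
    try:
        signing_key = jwk_client.get_signing_key_from_jwt(token)
        claims = jwt.decode(token, signing_key.key, algorithms=["RS256"], issuer=ISSUER)
        # Cognito access tokens carry the app client ID in the client_id claim
        if claims.get("client_id") != APP_CLIENT_ID:
            raise ValueError("Token was not issued for this app client")
        return _policy(claims.get("username", "user"), "Allow", event["methodArn"])
    except Exception:
        # Any validation failure rejects the $connect request
        return _policy("user", "Deny", event["methodArn"])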

The following is a sample AWS CDK code to set up the WebSocket API with Amazon Cognito. AWS CDK is an open source software development framework to define cloud infrastructure in code and provision it through AWS CloudFormation. The following code is written in Python. For more information, refer to Working with the AWS CDK in Python:

from aws_cdk import (
    Stack,
    Duration,
    RemovalPolicy,
    aws_apigatewayv2 as apigwv2,
    aws_lambda as _lambda,
    aws_lambda_python_alpha as _alambda,
    aws_cognito as cognito,
    aws_dynamodb as dynamodb,
    aws_apigatewayv2_integrations as integrationsv2,
    aws_apigatewayv2_authorizers as authorizersv2,
)
from constructs import Construct


class CourseStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # ...
        # Previous code ...
        # ...

        # DynamoDB table to track connections
        course_connections_ddb_table = dynamodb.Table(self, "CourseConnectionsTable",
            partition_key=dynamodb.Attribute(name="connectionId", type=dynamodb.AttributeType.STRING),
            time_to_live_attribute="ttl",
            billing_mode=dynamodb.BillingMode.PAY_PER_REQUEST,
            encryption=dynamodb.TableEncryption.AWS_MANAGED,
            point_in_time_recovery=True,
            removal_policy=RemovalPolicy.DESTROY
        )

        # Create user pool for Amazon Cognito
        user_pool = cognito.UserPool(
            self, "CourseUserPool",
            user_pool_name="CourseUserPool",
            self_sign_up_enabled=True,
            account_recovery=cognito.AccountRecovery.EMAIL_ONLY,
            user_verification=cognito.UserVerificationConfig(
                email_subject="Verify your email for outline and content generation App",
                email_body="Hello {username}, Thanks for signing up to Course outline and content generation App! Your verification code is {####}",
                email_style=cognito.VerificationEmailStyle.CODE,
            ),
            standard_attributes={"fullname": cognito.StandardAttribute(required=True, mutable=True)},
            removal_policy=RemovalPolicy.DESTROY,
        )

        # Create a new Amazon Cognito User Pool Client
        user_pool_client = user_pool.add_client("CourseUserPoolAppClient",
            user_pool_client_name="CourseUserPoolAppClient",
            id_token_validity=Duration.days(1),
            access_token_validity=Duration.days(1),
            auth_flows=cognito.AuthFlow(user_password=True)
        )

        # WebSocket Connect, disconnect, default Lambda functions
        course_ws_connect_lambda = _lambda.Function(
            self, "CourseWSConnect",
            code=_lambda.Code.from_asset("./lambda/connect"),
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="index.lambda_handler",
            timeout=Duration.seconds(30),
            environment={"CONNECTIONS_TABLE": course_connections_ddb_table.table_name},
        )
        course_connections_ddb_table.grant_read_write_data(course_ws_connect_lambda)

        course_ws_disconnect_lambda = _lambda.Function(...)

        course_ws_default_lambda = _lambda.Function(...)

        jwt_auth_course_lambda = _lambda.Function(...)

        course_outline_ws_lambda = _lambda.Function(...)

        course_content_ws_lambda = _lambda.Function(...)

        # Course WebSocket API
        course_ws_authorizer = authorizersv2.WebSocketLambdaAuthorizer("CourseWSAuthorizer", jwt_auth_course_lambda, identity_source=["route.request.header.Authorization",])  # "route.request.querystring.Authorization",
        course_ws_connect_integration = integrationsv2.WebSocketLambdaIntegration("CourseWSConnectIntegration", course_ws_connect_lambda)
        course_ws_disconnect_integration = integrationsv2.WebSocketLambdaIntegration("CourseWSDisconnectIntegration", course_ws_disconnect_lambda)
        course_ws_default_integration = integrationsv2.WebSocketLambdaIntegration("CourseWSDefaultIntegration", course_ws_default_lambda)
        course_outline_ws_integration = integrationsv2.WebSocketLambdaIntegration("CourseOutlineIntegration", course_outline_ws_lambda)
        course_content_ws_integration = integrationsv2.WebSocketLambdaIntegration("CourseContentIntegration", course_content_ws_lambda)

        course_ws_api = apigwv2.WebSocketApi(self, "CourseWSApi",
            api_name="CourseWSApi",
            description="WebSocket API for Course Outline and Content Generation",
            connect_route_options=apigwv2.WebSocketRouteOptions(
                integration=course_ws_connect_integration,
                authorizer=course_ws_authorizer
            ),
            disconnect_route_options=apigwv2.WebSocketRouteOptions(
                integration=course_ws_disconnect_integration,
            ),
            default_route_options=apigwv2.WebSocketRouteOptions(
                integration=course_ws_default_integration,
            )
        )

        # Add a custom message route, to generate course outline
        course_ws_api.add_route("courseOutline", integration=course_outline_ws_integration)

        # Add a custom message route, to generate course content
        course_ws_api.add_route("courseContent", integration=course_content_ws_integration)

        # Create a WebSocket API stage (usually "dev" or "prod")
        course_ws_stage = apigwv2.WebSocketStage(
            self, "CourseWSApiStage",
            web_socket_api=course_ws_api,
            stage_name="dev",  # Change this based on the environment (e.g., "prod")
            auto_deploy=True,
        )

        # Grant permissions for Lambda to manage the WebSocket connection (for sending messages back to clients)
        course_ws_api.grant_manage_connections(course_ws_connect_lambda)
        course_ws_api.grant_manage_connections(course_ws_disconnect_lambda)
        course_ws_api.grant_manage_connections(course_ws_default_lambda)
        course_ws_api.grant_manage_connections(course_outline_ws_lambda)
        course_ws_api.grant_manage_connections(course_content_ws_lambda)

Course outline generation
The course outline generation module helps course designers create a structured course outline. For this proof of concept, the default structure spans 4 weeks, with each week containing three main learning outcomes and supporting secondary outcomes, but it can be adapted to each course or institution’s needs. The module follows this workflow:

The course designer submits a prompt using the course WebSocket (courseOutline route).
CourseOutlineWSLambda sends the request to an SQS queue for asynchronous processing.
The SQS queue triggers CourseOutlineLLMLambda, which invokes Anthropic’s Claude 3.5 Sonnet in Amazon Bedrock to generate the outline.
The response is structured using Pydantic models and returned as JSON.
The structured outline is stored in an S3 OutputBucket, with a finalized version stored in a portal bucket for faculty review.

The following is a sample payload for the courseOutline route, which can be customized to meet institutional requirements. The fields are defined as follows:

action – Specifies the operation to be performed (courseOutline).
is_streaming – Indicates whether the response should be streamed (yes for real-time streaming and no for single output at one time).
s3_input_uri_list – A list of S3 URIs containing reference materials (which can be left empty if not available).
course_title – The title of the course for which the outline is being generated.
course_duration – The total number of weeks for the course.
user_prompt – A structured prompt guiding the AI to generate a detailed course outline based on syllabus information, providing a well-organized weekly learning structure. If using a different LLM, optimize the user_prompt for that model to achieve the best results.

{
    "action": "courseOutline",
    "is_streaming": "yes",
    "s3_input_uri_list": [],
    "course_title": "Fundamental of Machine Learning",
    "course_duration": 2,
    "user_prompt": "I need help developing a {course_duration}-week course content for a {course_title} course. Please use the following syllabus to:\n\n1. If provided, refer to the syllabus text from <syllabus> tags to extract the course learning outcomes.\n2. Design each week to focus on 3 main learning outcomes.\n3. For each main learning outcome, provide 3 supporting sub-learning outcomes.\n\n<syllabus>\n\n{syllabus_text}\n\n</syllabus>\n\nEnsure that each week has 3 main learning outcomes and each of those has 3 supporting sub-learning outcomes."
}

When interacting with the courseOutline route of the WebSocket API, the response follows a structured format that details the course outline and structure. The following is an example of a WebSocket response for a course. This format is designed for straightforward parsing and seamless integration into your applications:

{
    "course_title": "Sample Course",
    "course_duration": "4",
    "weekly_outline": [
        {
            "week": 1,
            "main_outcomes": [
                {
                    "outcome": "Learning Outcome 1",
                    "sub_outcomes": ["Sub-outcome 1", "Sub-outcome 2", "Sub-outcome 3"]
                },
                {... similar for Learning outcome 2},
                {... similar for Learning outcome 3}
            ]
        },
        {... similar for week 2},
        {... similar for week 3},
        {... similar for week 4}
    ]
}

Here’s a snippet of the Lambda function for processing the outline request:

event = json.loads(event['Records'][0]['body'])

route_key = event['requestContext']['routeKey']
connection_id = event['requestContext']['connectionId']
body = json.loads(event["body"])
s3_input_uri_list = body["s3_input_uri_list"]
user_prompt = body["user_prompt"]
course_title = body["course_title"]
course_duration = body["course_duration"]
model_id = os.getenv("MODEL_ID", "")
is_streaming = body["is_streaming"]
websocket_endpoint_url = os.getenv("WEBSOCKET_ENDPOINT_URL", "")
output_bucket = os.getenv("OUTPUT_BUCKET", "")

# Send message to api that message received
apigatewaymanagementapi_client = boto3.client('apigatewaymanagementapi', endpoint_url=websocket_endpoint_url)

# Read the syllabus text from uploaded doc
syllabus_text = ""
for s3_input_uri in s3_input_uri_list:
    bucket, key = get_s3_bucket_and_key(s3_input_uri)
    if key.endswith('.pdf'):
        pdf_text = extract_text_from_pdf(bucket, key)
        syllabus_text = syllabus_text + pdf_text

# Initialize the Pydantic model
pydantic_classes = [CourseOutline]

course_outline = {}

system_prompt = f"""You are an AI assistant tasked with helping an instructor develop a course outline for a {course_title} course.
You have expertise in curriculum design. Your role is to analyze the provided syllabus, extract learning outcomes,
and structure a {course_duration}-week course with specific learning objectives for each week.
Format your response in valid JSON for easy parsing and integration.
Respond only with the requested content, without any preamble or explanation."""

user_msg_prompt = PromptTemplate.from_template(user_prompt)

user_msg = user_msg_prompt.format(course_title=course_title, course_duration=course_duration, syllabus_text=syllabus_text)

messages = [{"role": "user", "content": [{"text": user_msg}]}]

tools = []
for class_ in pydantic_classes:
    tools.append(convert_pydantic_to_bedrock_converse_function(class_))
tool_config = {"tools": tools}

inference_config = {"temperature": 0.5}

converse_response = bedrock_runtime_client.converse(
    system=[{"text": system_prompt}],
    modelId=model_id,
    messages=messages,
    inferenceConfig=inference_config,
    toolConfig=tool_config,
)

# Parse the LLM response into JSON format
course_outline = parse_bedrock_tool_response(converse_response)

send_message_to_ws_client(apigatewaymanagementapi_client, connection_id, response=course_outline)

return {'statusCode': 200,
        'body': json.dumps({'course_outline': course_outline})
}
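The helpers convert_pydantic_to_bedrock_converse_function and parse_bedrock_tool_response come from the sample repository and aren’t reproduced here. As a simplified sketch of the parsing side only (an assumption about what the repository’s helper does), a Converse tool-use response can be unpacked like this:

def parse_bedrock_tool_response(converse_response):
    # The Converse API returns the assistant message as a list of content blocks;
    # when the model calls a tool, the structured arguments arrive in a toolUse block
    for content_block in converse_response["output"]["message"]["content"]:
        if "toolUse" in content_block:
            # "input" already holds the JSON arguments matching the Pydantic schema
            return content_block["toolUse"]["input"]
    return {}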

Course content generation
The course content generation module creates detailed week-by-week content based on the course outline. Although the default configuration generates the following for each main learning outcome, these outputs are fully customizable to meet specific course needs and institutional preferences:

One set of reading materials
Three video scripts (3 minutes each)
A quiz with a multiple-choice question for each video

The module follows this workflow:

The course designer submits learning outcomes using the courseContent route.
CourseContentWSLambda function sends the request to an SQS queue.
The SQS queue triggers CourseContentLLMLambda function, which calls Amazon Bedrock to generate the content.
The generated content is structured and stored in Amazon S3.

The following is a sample payload for the courseContent route, which can be customized to align with institutional requirements. The fields are defined as follows:

action – Specifies the operation to be performed (courseContent).
is_streaming – Determines the response mode (yes for real-time streaming and no for a single output at one time).
s3_input_uri_list – An array of S3 URIs containing additional course materials which will be used to generate course content (optional).
week_number – Indicates the week number for which content is being generated.
course_title – The title of the course.
main_learning_outcome – The primary learning objective for the specified week.
sub_learning_outcome_list – A list of supporting learning outcomes to be covered.
user_prompt – A structured instruction guiding the LLM to generate week-specific course content, facilitating comprehensive coverage. If switching to a different LLM, optimize the user_prompt for optimal performance.

{
    "action": "courseContent",
    "is_streaming": "yes",
    "s3_input_uri_list": ["s3://coursestack-inputbucket3bf8630a-v0xovtepdtey/dinesh_testing_folder/Fundamentals Of Machine Learning/Machine Learning Basics.pdf"],
    "week_number": 1,
    "course_title": "Fundamental of Machine Learning",
    "main_learning_outcome": "Understand the basics of machine learning and its applications",
    "sub_learning_outcome_list": [
        "Define machine learning and its relationship to artificial intelligence",
        "Identify real-world applications of machine learning",
        "Distinguish between supervised, unsupervised, and reinforcement learning"
    ],
    "user_prompt": "For the course {course_title},\ngenerate Week {week_number} content for the main learning outcome:\n{main_learning_outcome}\n\nInclude the following sub-learning outcomes:\n{sub_learning_outcome_list}\n\nFor each sub-learning outcome, provide:\n- 3 video scripts, each 3 minutes long\n- 1 set of reading materials, at least one page long\n- 1 multiple-choice question per video with correct answer\n\nIf provided, refer to the information within the <additional_context> tags for any supplementary details or guidelines.\n\n<additional_context>\n{additional_context}\n</additional_context>\n\nGenerate the content without any introductory text or explanations."
}

When interacting with the courseContent route of the WebSocket API, the response follows a structured format that details the course content. The following is an example of a WebSocket response for course content. This format is designed for easy parsing and seamless integration into your applications:

{
    "CourseContent": {
        "week_number": 1,
        "main_learning_outcome": "Learning Outcome 1",
        "reading_material": {
            "title": "xxx title of the reading material",
            "content": "xxx reading material content"
        },
        "sub_learning_outcomes_content": [
            {
                "sub_learning_outcome": "Sub-outcome 1",
                "video_script": {
                    "script": "xxx video script"
                },
                "multiple_choice_question": {
                    "question": "xxx MCQ question",
                    "options": ["option 1", "option 2", "option 3", "option 4"],
                    "correct_answer": "option 1"
                }
            },
            {... similar for sub_learning_outcome 2},
            {... similar for sub_learning_outcome 3}
        ]
    }
}

Here’s a Lambda function code snippet for content generation:

event = json.loads(event['Records'][0]['body'])

connection_id = event['requestContext']['connectionId']
body = json.loads(event["body"])
s3_input_uri_list = body["s3_input_uri_list"]
user_prompt = body["user_prompt"]
week_number = body["week_number"]
course_title = body["course_title"]
main_learning_outcome = body["main_learning_outcome"]
sub_learning_outcome_list = body["sub_learning_outcome_list"]
is_streaming = body["is_streaming"]
model_id = os.getenv("MODEL_ID", "")
websocket_endpoint_url = os.environ["WEBSOCKET_ENDPOINT_URL"]
output_bucket = os.environ["OUTPUT_BUCKET"]

# Send message to api that message received
apigatewaymanagementapi_client = boto3.client('apigatewaymanagementapi', endpoint_url=websocket_endpoint_url)

# Read the additional_context text from uploaded doc
additional_context = ""
for s3_input_uri in s3_input_uri_list:
    bucket, key = get_s3_bucket_and_key(s3_input_uri)
    if key.endswith('.pdf'):
        pdf_text = extract_text_from_pdf(bucket, key)
        additional_context = additional_context + pdf_text

# Initialize the Pydantic model
pydantic_classes = [CourseContent]

course_content = {}

system_prompt = f"""You are an AI assistant specialized in educational content creation.
Your task is to generate course materials based on given learning outcomes.
Produce concise, accurate, and engaging content suitable for college-level courses.
You may refer to additional context provided within <additional_context> tags if present.
Format your response in valid JSON for easy parsing and integration.
Respond only with the requested content, without any preamble or explanation."""

user_msg_prompt = PromptTemplate.from_template(user_prompt)

user_msg = user_msg_prompt.format(course_title=course_title,
                                  week_number=week_number,
                                  main_learning_outcome=main_learning_outcome,
                                  sub_learning_outcome_list=sub_learning_outcome_list,
                                  additional_context=additional_context)

messages = [{"role": "user", "content": [{"text": user_msg}]}]

tools = []
for class_ in pydantic_classes:
    tools.append(convert_pydantic_to_bedrock_converse_function(class_))
tool_config = {"tools": tools}

converse_response = bedrock_runtime_client.converse(
    system=[{"text": system_prompt}],
    modelId=model_id,
    messages=messages,
    toolConfig=tool_config,
)

# Parse the LLM response into JSON format
course_content = parse_bedrock_tool_response(converse_response)

send_message_to_ws_client(apigatewaymanagementapi_client, connection_id, response=course_content)

return {'statusCode': 200,
        'body': json.dumps({'course_content': json.dumps(course_content)})
}
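Both Lambda snippets call send_message_to_ws_client to push results back over the open WebSocket connection. The repository’s implementation isn’t shown here; a minimal sketch, assuming the payload is JSON-serializable, could use the API Gateway Management API as follows:

def send_message_to_ws_client(apigatewaymanagementapi_client, connection_id, response):
    # post_to_connection delivers a payload to a single connected WebSocket client
    apigatewaymanagementapi_client.post_to_connection(
        ConnectionId=connection_id,
        Data=json.dumps(response).encode("utf-8"),
    )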

Prerequisites
To implement the solution provided in this post, you should have the following:

An active AWS account and familiarity with foundation models (FMs) and Amazon Bedrock. Enable model access for Anthropic’s Claude 3.5 Sonnet v2 and Anthropic’s Claude 3.5 Haiku.
The AWS Cloud Development Kit (AWS CDK) already set up. For installation instructions, refer to the AWS CDK workshop.
When deploying the CDK stack, select a Region where Anthropic’s Claude models in Amazon Bedrock are available. Although this solution uses the US West (Oregon) us-west-2 Region, you can choose a different Region, but you need to verify that it supports Anthropic’s Claude models in Amazon Bedrock before proceeding. The Region you use to access the model must match the Region where you deploy your stack.

Set up the solution
When the prerequisite steps are complete, you’re ready to set up the solution:

Clone the repository:

git clone https://github.com/aws-samples/educational-course-content-generator-with-qna-bot-using-bedrock.git

Navigate to the project directory:

cd educational-course-content-generator-with-qna-bot-using-bedrock/

Create and activate the virtual environment:

python3 -m venv .venv
source .venv/bin/activate

The activation of the virtual environment differs based on the operating system; refer to the AWS CDK workshop for activating in other environments.

After the virtual environment is activated, you can install the required dependencies:

pip install -r requirements.txt

Review and modify the project_config.json file to customize your deployment settings.
In your terminal, export your AWS credentials for a role or user in ACCOUNT_ID. The role needs to have all necessary permissions for CDK deployment:

export AWS_REGION="<region>"                 # Same Region as ACCOUNT_REGION above
export AWS_ACCESS_KEY_ID="<access-key>"      # Set to the access key of your role/user
export AWS_SECRET_ACCESS_KEY="<secret-key>"  # Set to the secret key of your role/user

If you’re deploying the AWS CDK for the first time, invoke the following command:

cdk bootstrap

Deploy the stacks:

cdk deploy --all

Note the CloudFront endpoints, WebSocket API endpoints, and Amazon Cognito user pool details from deployment outputs.

Create a user in the Amazon Cognito user pool using the AWS Management Console or AWS Command Line Interface (AWS CLI). Alternatively, you can use the cognito-user-token-helper repository to quickly create a new Amazon Cognito user and generate JSON Web Tokens (JWTs) for testing.
Connect to the WebSocket endpoint using wscat.

wscat -c wss://xxxxxxxxxx.execute-api.us-west-2.amazonaws.com/dev -H "Authorization: Bearer YOUR_JWT_TOKEN"
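
Once connected, you can exercise a route by sending a message whose action field matches the route key, for example an abbreviated version of the courseOutline payload shown earlier:

> {"action": "courseOutline", "is_streaming": "yes", "s3_input_uri_list": [], "course_title": "Fundamental of Machine Learning", "course_duration": 2, "user_prompt": "..."}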

Scalability and security considerations
The solution is designed with scalability and security as core principles. Because Amazon API Gateway for WebSockets doesn’t inherently support AWS WAF, we’ve integrated Amazon CloudFront as a distribution layer and applied AWS WAF to enhance security.
By using Amazon SQS and AWS Lambda, the system enables asynchronous processing, supports high concurrency, and dynamically scales to handle varying workloads. AWS WAF helps protect against malicious traffic and common web-based threats. Amazon CloudFront can improve global performance, reduce latency, and provide built-in DDoS protection. Amazon Cognito handles authentication so that only authorized users can access the WebSocket API. AWS IAM policies enforce strict access control to secure resources such as Amazon Bedrock, Amazon S3, AWS Lambda, and Amazon DynamoDB.
Clean up
To avoid incurring future charges on the AWS account, invoke the following command in the terminal to delete the CloudFormation stack provisioned using the AWS CDK:

cdk destroy --all

Conclusion
This innovative solution represents a significant leap forward in educational technology, demonstrating how AWS services can be used in course development. By integrating Amazon Bedrock, AWS Lambda, WebSockets, and a robust suite of AWS services, we’ve built a system that streamlines content creation, enhances real-time interactivity, and facilitates secure, scalable, and high-quality learning experiences.
By developing comprehensive course materials rapidly, course designers can focus more on personalized instruction and student mentoring. AI-assisted generation facilitates high-quality, standardized content across courses. The event-driven architecture scales effortlessly to meet institutional demands, and CloudFront, AWS WAF, and Amazon Cognito support secure and optimized content delivery. Institutions adopting this technology position themselves at the forefront of educational innovation, redefining modern learning environments.
This solution goes beyond simple automation—it means teachers and professors can shift their focus from manual content creation to high-impact teaching and mentoring. By using AWS AI and cloud technologies, institutions can enhance student engagement, optimize content quality, and scale seamlessly.
We invite you to explore how this solution can transform your institution’s approach to course creation and student engagement. To learn more about implementing this system or to discuss custom solutions for your specific needs, contact your AWS account team or an AWS education specialist.
Together, let’s build the future of education on the cloud.

About the authors
Dinesh Mane is a Senior ML Prototype Architect at AWS, specializing in machine learning, generative AI, and MLOps. In his current role, he helps customers address real-world, complex business problems by developing machine learning and generative AI solutions through rapid prototyping.
Tasneem Fathima is Senior Solutions Architect at AWS. She supports Higher Education and Research customers in the United Arab Emirates to adopt cloud technologies, improve their time to science, and innovate on AWS.
Amir Majlesi leads the EMEA prototyping team within AWS Worldwide Specialist Organization. Amir has extensive experiences in helping customers accelerate adoption of cloud technologies, expedite path to production and catalyze a culture of innovation. He enables customer teams to build cloud native applications using agile methodologies, with a focus on emerging technologies such as Generative AI, Machine Learning, Analytics, Serverless and IoT.

A Technical Roadmap to Context Engineering in LLMs: Mechanisms, Benchm …

The paper “A Survey of Context Engineering for Large Language Models” establishes Context Engineering as a formal discipline that goes far beyond prompt engineering, providing a unified, systematic framework for designing, optimizing, and managing the information that guides Large Language Models (LLMs). Here’s an overview of its main contributions and framework:

What Is Context Engineering?

Context Engineering is defined as the science and engineering of organizing, assembling, and optimizing all forms of context fed into LLMs to maximize performance across comprehension, reasoning, adaptability, and real-world application. Rather than viewing context as a static string (the premise of prompt engineering), context engineering treats it as a dynamic, structured assembly of components—each sourced, selected, and organized through explicit functions, often under tight resource and architectural constraints.

Taxonomy of Context Engineering

The paper breaks down context engineering into:

1. Foundational Components

a. Context Retrieval and Generation

Encompasses prompt engineering, in-context learning (zero/few-shot, chain-of-thought, tree-of-thought, graph-of-thought), external knowledge retrieval (e.g., Retrieval-Augmented Generation, knowledge graphs), and dynamic assembly of context elements.

Techniques like CLEAR Framework, dynamic template assembly, and modular retrieval architectures are highlighted.

b. Context Processing

Addresses long-sequence processing (with architectures like Mamba, LongNet, FlashAttention), context self-refinement (iterative feedback, self-evaluation), and integration of multimodal and structured information (vision, audio, graphs, tables).

Strategies include attention sparsity, memory compression, and in-context learning meta-optimization.

c. Context Management

Involves memory hierarchies and storage architectures (short-term context windows, long-term memory, external databases), memory paging, context compression (autoencoders, recurrent compression), and scalable management over multi-turn or multi-agent settings.

2. System Implementations

a. Retrieval-Augmented Generation (RAG)

Modular, agentic, and graph-enhanced RAG architectures integrate external knowledge and support dynamic, sometimes multi-agent retrieval pipelines.

Enables both real-time knowledge updates and complex reasoning over structured databases/graphs.

b. Memory Systems

Implement persistent and hierarchical storage, enabling longitudinal learning and knowledge recall for agents (e.g., MemGPT, MemoryBank, external vector databases).

Key for extended, multi-turn dialogs, personalized assistants, and simulation agents.

c. Tool-Integrated Reasoning

LLMs use external tools (APIs, search engines, code execution) via function calling or environment interaction, combining language reasoning with world-acting abilities.

Enables new domains (math, programming, web interaction, scientific research).

d. Multi-Agent Systems

Coordination among multiple LLMs (agents) via standardized protocols, orchestrators, and context sharing—essential for complex, collaborative problem-solving and distributed AI applications.

Key Insights and Research Gaps

Comprehension–Generation Asymmetry: LLMs, with advanced context engineering, can comprehend very sophisticated, multi-faceted contexts but still struggle to generate outputs matching that complexity or length.

Integration and Modularity: Best performance comes from modular architectures combining multiple techniques (retrieval, memory, tool use).

Evaluation Limitations: Current evaluation metrics/benchmarks (like BLEU, ROUGE) often fail to capture the compositional, multi-step, and collaborative behaviors enabled by advanced context engineering. New benchmarks and dynamic, holistic evaluation paradigms are needed.

Open Research Questions: Theoretical foundations, efficient scaling (especially computationally), cross-modal and structured context integration, real-world deployment, safety, alignment, and ethical concerns remain open research challenges.

Applications and Impact

Context engineering supports robust, domain-adaptive AI across:

Long-document/question answering

Personalized digital assistants and memory-augmented agents

Scientific, medical, and technical problem-solving

Multi-agent collaboration in business, education, and research

Future Directions

Unified Theory: Developing mathematical and information-theoretic frameworks.

Scaling & Efficiency: Innovations in attention mechanisms and memory management.

Multi-Modal Integration: Seamless coordination of text, vision, audio, and structured data.

Robust, Safe, and Ethical Deployment: Ensuring reliability, transparency, and fairness in real-world systems.

In summary: Context Engineering is emerging as the pivotal discipline for guiding the next generation of LLM-based intelligent systems, shifting the focus from creative prompt writing to the rigorous science of information optimization, system design, and context-driven AI.


The Ultimate Guide to CPUs, GPUs, NPUs, and TPUs for AI/ML: Performanc …

Artificial intelligence and machine learning workloads have fueled the evolution of specialized hardware to accelerate computation far beyond what traditional CPUs can offer. Each processing unit—CPU, GPU, NPU, TPU—plays a distinct role in the AI ecosystem, optimized for certain models, applications, or environments. Here’s a technical, data-driven breakdown of their core differences and best use cases.

CPU (Central Processing Unit): The Versatile Workhorse

Design & Strengths: CPUs are general-purpose processors with a few powerful cores—ideal for single-threaded tasks and running diverse software, including operating systems, databases, and light AI/ML inference.

AI/ML Role: CPUs can execute any kind of AI model, but lack the massive parallelism needed for efficient deep learning training or inference at scale.

Best for:

Classical ML algorithms (e.g., scikit-learn, XGBoost)

Prototyping and model development

Inference for small models or low-throughput requirements

Technical Note: For neural network operations, CPU throughput (typically measured in GFLOPS—billion floating point operations per second) lags far behind specialized accelerators.
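
A quick back-of-the-envelope calculation illustrates the gap. The numbers below are illustrative assumptions (a sustained 200 GFLOPS on the CPU versus the RTX 3090's 35.6 TFLOPS peak cited later), not measured benchmarks:

# Back-of-the-envelope throughput comparison (illustrative numbers, not benchmarks).
# A single 4096x4096 matrix multiply needs roughly 2 * n^3 floating point operations.
n = 4096
flops_needed = 2 * n**3                     # ~1.37e11 FLOPs

cpu_gflops = 200e9                          # assumed sustained CPU throughput (200 GFLOPS)
gpu_tflops = 35.6e12                        # RTX 3090 peak FP32 from the text

print(f"CPU time: {flops_needed / cpu_gflops * 1e3:.1f} ms")   # ~687 ms
print(f"GPU time: {flops_needed / gpu_tflops * 1e3:.2f} ms")   # ~3.9 ms (peak, ideal)

Even under these idealized assumptions, the GPU finishes the same matrix multiply roughly two orders of magnitude faster.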

GPU (Graphics Processing Unit): The Deep Learning Backbone

Design & Strengths: Originally designed for graphics, modern GPUs feature thousands of parallel cores built for matrix and vector operations, making them highly efficient for training and inference of deep neural networks.

Performance Examples:

NVIDIA RTX 3090: 10,496 CUDA cores, up to 35.6 TFLOPS (teraFLOPS) FP32 compute.

Recent NVIDIA GPUs include “Tensor Cores” for mixed precision, accelerating deep learning operations.

Best for:

Training and inferencing large-scale deep learning models (CNNs, RNNs, Transformers)

Batch processing typical in datacenter and research environments

Supported by all major AI frameworks (TensorFlow, PyTorch)

Benchmarks: A 4x RTX A5000 setup can surpass a single, far more expensive NVIDIA H100 in certain workloads, balancing acquisition cost and performance.

NPU (Neural Processing Unit): The On-device AI Specialist

Design & Strengths: NPUs are ASICs (application-specific integrated circuits) crafted exclusively for neural network operations. They optimize parallel, low-precision computation for deep learning inference, often running at low power for edge and embedded devices.

Use Cases & Applications:

Mobile & Consumer: Powering features like face unlock, real-time image processing, language translation on devices like the Apple A-series, Samsung Exynos, Google Tensor chips.

Edge & IoT: Low-latency vision and speech recognition, smart city cameras, AR/VR, and manufacturing sensors.

Automotive: Real-time data from sensors for autonomous driving and advanced driver assistance.

Performance Example: The Exynos 9820’s NPU is ~7x faster than its predecessor for AI tasks.

Efficiency: NPUs prioritize energy efficiency over raw throughput, extending battery life while supporting advanced AI features locally.

TPU (Tensor Processing Unit): Google’s AI Powerhouse

Design & Strengths: TPUs are custom chips developed by Google specifically for large tensor computations, tuning hardware around the needs of frameworks like TensorFlow.

Key Specifications:

TPU v2: Up to 180 TFLOPS for neural network training and inference.

TPU v4: Available in Google Cloud, up to 275 TFLOPS per chip, scalable to “pods” exceeding 100 petaFLOPS.

Specialized matrix multiplication units (“MXU”) for enormous batch computations.

Up to 30–80x better energy efficiency (TOPS/Watt) for inference compared to contemporary GPUs and CPUs.

Best for:

Training and serving massive models (BERT, GPT-2, EfficientNet) in cloud at scale

High-throughput, low-latency AI for research and production pipelines

Tight integration with TensorFlow and JAX; increasingly interfacing with PyTorch

Note: TPU architecture is less flexible than GPU—optimized for AI, not graphics or general-purpose tasks.

Which Models Run Where?

| Hardware | Best Supported Models | Typical Workloads |
|---|---|---|
| CPU | Classical ML, all deep learning models* | General software, prototyping, small AI |
| GPU | CNNs, RNNs, Transformers | Training and inference (cloud/workstation) |
| NPU | MobileNet, TinyBERT, custom edge models | On-device AI, real-time vision/speech |
| TPU | BERT, GPT-2, ResNet, EfficientNet, etc. | Large-scale model training/inference |

*CPUs support any model, but are not efficient for large-scale DNNs.

Data Processing Units (DPUs): The Data Movers

Role: DPUs accelerate networking, storage, and data movement, offloading these tasks from CPUs/GPUs. They enable higher infrastructure efficiency in AI datacenters by ensuring compute resources focus on model execution, not I/O or data orchestration.

Summary Table: Technical Comparison

| Feature | CPU | GPU | NPU | TPU |
|---|---|---|---|---|
| Use Case | General compute | Deep learning | Edge/on-device AI | Google Cloud AI |
| Parallelism | Low–Moderate | Very high (~10,000+ cores) | Moderate–High | Extremely high (matrix mult.) |
| Efficiency | Moderate | Power-hungry | Ultra-efficient | High for large models |
| Flexibility | Maximum | Very high (all frameworks) | Specialized | Specialized (TensorFlow/JAX) |
| Hardware | x86, ARM, etc. | NVIDIA, AMD | Apple, Samsung, ARM | Google (Cloud only) |
| Example | Intel Xeon | RTX 3090, A100, H100 | Apple Neural Engine | TPU v4, Edge TPU |

Key Takeaways

CPUs are unmatched for general-purpose, flexible workloads.

GPUs remain the workhorse for training and running neural networks across all frameworks and environments, especially outside Google Cloud.

NPUs dominate real-time, privacy-preserving, and power-efficient AI for mobile and edge, unlocking local intelligence everywhere from your phone to self-driving cars.

TPUs offer unmatched scale and speed for massive models—especially in Google’s ecosystem—pushing the frontiers of AI research and industrial deployment.

Choosing the right hardware depends on model size, compute demands, development environment, and desired deployment (cloud vs. edge/mobile). A robust AI stack often leverages a mix of these processors, each where it excels.
The post The Ultimate Guide to CPUs, GPUs, NPUs, and TPUs for AI/ML: Performance, Use Cases, and Key Differences appeared first on MarkTechPost.

Building an End-to-End Object Tracking and Analytics System with Roboflow Supervision

In this advanced Roboflow Supervision tutorial, we build a complete object detection pipeline with the Supervision library. We begin by setting up real-time object tracking using ByteTracker, adding detection smoothing, and defining polygon zones to monitor specific regions in a video stream. As we process the frames, we annotate them with bounding boxes, object IDs, and speed data, enabling us to track and analyze object behavior over time. Our goal is to showcase how we can combine detection, tracking, zone-based analytics, and visual annotation into a seamless and intelligent video analysis workflow. Check out the Full Codes here.

!pip install supervision ultralytics opencv-python
!pip install --upgrade supervision

import cv2
import numpy as np
import supervision as sv
from ultralytics import YOLO
import matplotlib.pyplot as plt
from collections import defaultdict

model = YOLO('yolov8n.pt')

We start by installing the necessary packages, including Supervision, Ultralytics, and OpenCV. After ensuring we have the latest version of Supervision, we import all required libraries. We then initialize the YOLOv8n model, which serves as the core detector in our pipeline. Check out the Full Codes here.

try:
    tracker = sv.ByteTrack()
except AttributeError:
    try:
        tracker = sv.ByteTracker()
    except AttributeError:
        print("Using basic tracking - install latest supervision for advanced tracking")
        tracker = None

try:
    smoother = sv.DetectionsSmoother(length=5)
except AttributeError:
    smoother = None
    print("DetectionsSmoother not available in this version")

try:
    box_annotator = sv.BoundingBoxAnnotator(thickness=2)
    label_annotator = sv.LabelAnnotator()
    if hasattr(sv, 'TraceAnnotator'):
        trace_annotator = sv.TraceAnnotator(thickness=2, trace_length=30)
    else:
        trace_annotator = None
except AttributeError:
    try:
        box_annotator = sv.BoxAnnotator(thickness=2)
        label_annotator = sv.LabelAnnotator()
        trace_annotator = None
    except AttributeError:
        print("Using basic annotators - some features may be limited")
        box_annotator = None
        label_annotator = None
        trace_annotator = None

def create_zones(frame_shape):
    h, w = frame_shape[:2]

    try:
        entry_zone = sv.PolygonZone(
            polygon=np.array([[0, h//3], [w//3, h//3], [w//3, 2*h//3], [0, 2*h//3]]),
            frame_resolution_wh=(w, h)
        )

        exit_zone = sv.PolygonZone(
            polygon=np.array([[2*w//3, h//3], [w, h//3], [w, 2*h//3], [2*w//3, 2*h//3]]),
            frame_resolution_wh=(w, h)
        )
    except TypeError:
        entry_zone = sv.PolygonZone(
            polygon=np.array([[0, h//3], [w//3, h//3], [w//3, 2*h//3], [0, 2*h//3]])
        )
        exit_zone = sv.PolygonZone(
            polygon=np.array([[2*w//3, h//3], [w, h//3], [w, 2*h//3], [2*w//3, 2*h//3]])
        )

    return entry_zone, exit_zone

We set up essential components from the Supervision library, including object tracking with ByteTrack, optional smoothing using DetectionsSmoother, and flexible annotators for bounding boxes, labels, and traces. To ensure compatibility across versions, we use try-except blocks to fall back to alternative classes or basic functionality when needed. Additionally, we define dynamic polygon zones within the frame to monitor specific regions like entry and exit areas, enabling advanced spatial analytics. Check out the Full Codes here.

class AdvancedAnalytics:
    def __init__(self):
        self.track_history = defaultdict(list)
        self.zone_crossings = {"entry": 0, "exit": 0}
        self.speed_data = defaultdict(list)

    def update_tracking(self, detections):
        if hasattr(detections, 'tracker_id') and detections.tracker_id is not None:
            for i in range(len(detections)):
                track_id = detections.tracker_id[i]
                if track_id is not None:
                    bbox = detections.xyxy[i]
                    center = np.array([(bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2])
                    self.track_history[track_id].append(center)

                    if len(self.track_history[track_id]) >= 2:
                        prev_pos = self.track_history[track_id][-2]
                        curr_pos = self.track_history[track_id][-1]
                        speed = np.linalg.norm(curr_pos - prev_pos)
                        self.speed_data[track_id].append(speed)

    def get_statistics(self):
        total_tracks = len(self.track_history)
        avg_speed = np.mean([np.mean(speeds) for speeds in self.speed_data.values() if speeds])
        return {
            "total_objects": total_tracks,
            "zone_entries": self.zone_crossings["entry"],
            "zone_exits": self.zone_crossings["exit"],
            "avg_speed": avg_speed if not np.isnan(avg_speed) else 0
        }

def process_video(source=0, max_frames=300):
    """
    Process video source with advanced supervision features
    source: video path or 0 for webcam
    max_frames: limit processing for demo
    """
    cap = cv2.VideoCapture(source)
    analytics = AdvancedAnalytics()

    ret, frame = cap.read()
    if not ret:
        print("Failed to read video source")
        return

    entry_zone, exit_zone = create_zones(frame.shape)

    try:
        entry_zone_annotator = sv.PolygonZoneAnnotator(
            zone=entry_zone,
            color=sv.Color.GREEN,
            thickness=2
        )
        exit_zone_annotator = sv.PolygonZoneAnnotator(
            zone=exit_zone,
            color=sv.Color.RED,
            thickness=2
        )
    except (AttributeError, TypeError):
        entry_zone_annotator = sv.PolygonZoneAnnotator(zone=entry_zone)
        exit_zone_annotator = sv.PolygonZoneAnnotator(zone=exit_zone)

    frame_count = 0
    results_frames = []

    cap.set(cv2.CAP_PROP_POS_FRAMES, 0)

    while ret and frame_count < max_frames:
        ret, frame = cap.read()
        if not ret:
            break

        results = model(frame, verbose=False)[0]
        detections = sv.Detections.from_ultralytics(results)

        detections = detections[detections.class_id == 0]

        if tracker is not None:
            detections = tracker.update_with_detections(detections)

        if smoother is not None:
            detections = smoother.update_with_detections(detections)

        analytics.update_tracking(detections)

        entry_zone.trigger(detections)
        exit_zone.trigger(detections)

        labels = []
        for i in range(len(detections)):
            confidence = detections.confidence[i] if detections.confidence is not None else 0.0

            if hasattr(detections, 'tracker_id') and detections.tracker_id is not None:
                track_id = detections.tracker_id[i]
                if track_id is not None:
                    speed = analytics.speed_data[track_id][-1] if analytics.speed_data[track_id] else 0
                    label = f"ID:{track_id} | Conf:{confidence:.2f} | Speed:{speed:.1f}"
                else:
                    label = f"Conf:{confidence:.2f}"
            else:
                label = f"Conf:{confidence:.2f}"
            labels.append(label)

        annotated_frame = frame.copy()

        annotated_frame = entry_zone_annotator.annotate(annotated_frame)
        annotated_frame = exit_zone_annotator.annotate(annotated_frame)

        if trace_annotator is not None:
            annotated_frame = trace_annotator.annotate(annotated_frame, detections)

        if box_annotator is not None:
            annotated_frame = box_annotator.annotate(annotated_frame, detections)
        else:
            for i in range(len(detections)):
                bbox = detections.xyxy[i].astype(int)
                cv2.rectangle(annotated_frame, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (0, 255, 0), 2)

        if label_annotator is not None:
            annotated_frame = label_annotator.annotate(annotated_frame, detections, labels)
        else:
            for i, label in enumerate(labels):
                if i < len(detections):
                    bbox = detections.xyxy[i].astype(int)
                    cv2.putText(annotated_frame, label, (bbox[0], bbox[1]-10),
                                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)

        stats = analytics.get_statistics()
        y_offset = 30
        for key, value in stats.items():
            text = f"{key.replace('_', ' ').title()}: {value:.1f}"
            cv2.putText(annotated_frame, text, (10, y_offset),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)
            y_offset += 30

        if frame_count % 30 == 0:
            results_frames.append(cv2.cvtColor(annotated_frame, cv2.COLOR_BGR2RGB))

        frame_count += 1

        if frame_count % 50 == 0:
            print(f"Processed {frame_count} frames...")

    cap.release()

    if results_frames:
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        axes = axes.flatten()

        for i, (ax, frame) in enumerate(zip(axes, results_frames[:4])):
            ax.imshow(frame)
            ax.set_title(f"Frame {i*30}")
            ax.axis('off')

        plt.tight_layout()
        plt.show()

    final_stats = analytics.get_statistics()
    print("\n=== FINAL ANALYTICS ===")
    for key, value in final_stats.items():
        print(f"{key.replace('_', ' ').title()}: {value:.2f}")

    return analytics

print("Starting advanced supervision demo...")
print("Features: Object detection, tracking, zones, speed analysis, smoothing")

We define the AdvancedAnalytics class to track object movement, calculate speed, and count zone crossings, enabling rich real-time video insights. Inside the process_video function, we read each frame from the video source and run it through our detection, tracking, and smoothing pipeline. We annotate frames with bounding boxes, labels, zone overlays, and live statistics, giving us a powerful, flexible system for object monitoring and spatial analytics. Throughout the loop, we also collect data for visualization and print final statistics, showcasing the effectiveness of Roboflow Supervision’s end-to-end capabilities. Check out the Full Codes here.

def create_demo_video():
    """Create a simple demo video with moving objects"""
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter('demo.mp4', fourcc, 20.0, (640, 480))

    for i in range(100):
        frame = np.zeros((480, 640, 3), dtype=np.uint8)

        x1 = int(50 + i * 2)
        y1 = 200
        x2 = int(100 + i * 1.5)
        y2 = 250

        cv2.rectangle(frame, (x1, y1), (x1+50, y1+50), (0, 255, 0), -1)
        cv2.rectangle(frame, (x2, y2), (x2+50, y2+50), (255, 0, 0), -1)

        out.write(frame)

    out.release()
    return 'demo.mp4'

demo_video = create_demo_video()
analytics = process_video(demo_video, max_frames=100)

print("\nTutorial completed! Key features demonstrated:")
print("✓ YOLO integration with Supervision")
print("✓ Multi-object tracking with ByteTracker")
print("✓ Detection smoothing")
print("✓ Polygon zones for area monitoring")
print("✓ Advanced annotations (boxes, labels, traces)")
print("✓ Real-time analytics and statistics")
print("✓ Speed calculation and tracking history")

To test our full pipeline, we generate a synthetic demo video with two moving rectangles simulating tracked objects. This allows us to validate detection, tracking, zone monitoring, and speed analysis without needing a real-world input. We then run the process_video function on the generated clip. At the end, we print out a summary of all key features we’ve implemented, showcasing the power of Roboflow Supervision for real-time visual analytics.

In conclusion, we have successfully implemented a full pipeline that brings together object detection, tracking, zone monitoring, and real-time analytics. We demonstrate how to visualize key insights like object speed, zone crossings, and tracking history with annotated video frames. This setup empowers us to go beyond basic detection and build a smart surveillance or analytics system using open-source tools. Whether for research or production use, we now have a powerful foundation to expand upon with even more advanced capabilities.

Check out the Full Codes here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Building an End-to-End Object Tracking and Analytics System with Roboflow Supervision appeared first on MarkTechPost.

MIT Researchers Develop Methods to Control Transformer Sensitivity with Provable Lipschitz Bounds and Muon

Training large-scale transformers stably has been a longstanding challenge in deep learning, particularly as models grow in size and expressivity. MIT researchers tackle a persistent problem at its root: the unstable growth of activations and loss spikes caused by unconstrained weight and activation norms. Their solution is to enforce provable Lipschitz bounds on the transformer by spectrally regulating the weights, with no use of activation normalization, QK norm, or logit softcapping tricks.

What is a Lipschitz Bound—and Why Enforce It?

A Lipschitz bound on a neural network quantifies the maximum amount by which the output can change in response to input (or weight) perturbations. Mathematically, a function f is K-Lipschitz if ‖f(x₁) − f(x₂)‖ ≤ K ‖x₁ − x₂‖ for all x₁, x₂.

Lower Lipschitz bound ⇒ greater robustness and predictability.

It is crucial for stability, adversarial robustness, privacy, and generalization, with lower bounds meaning the network is less sensitive to changes or adversarial noise.
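
For intuition, a standard (loose) upper bound for a feedforward network with 1-Lipschitz activations such as ReLU is the product of the spectral norms of its weight matrices. The short sketch below computes that bound for toy weights; the matrices are illustrative placeholders.

# Sketch: a simple upper bound on the Lipschitz constant of an MLP with 1-Lipschitz
# activations (e.g., ReLU) is the product of the spectral norms of its weight matrices.
import numpy as np

def lipschitz_upper_bound(weight_matrices):
    return float(np.prod([np.linalg.norm(W, 2) for W in weight_matrices]))

rng = np.random.default_rng(0)
layers = [rng.normal(size=(256, 256)) / 16 for _ in range(4)]   # toy weights
print(lipschitz_upper_bound(layers))   # product of per-layer largest singular values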

Motivation and Problem Statement

Traditionally, training stable transformers at scale has involved a variety of “band-aid” stabilization tricks:

Layer normalization

QK normalization

Logit tanh softcapping

But these do not directly address the underlying spectral norm (largest singular value) growth in the weights, a root cause of exploding activations and training instability—especially in large models.

The central hypothesis: If we spectrally regulate the weights themselves—beyond just the optimizer or activations—we can maintain tight control over Lipschitzness, potentially solving instability at its source.

Key Innovations

Weight Spectral Regulation and the Muon Optimizer

Muon optimizer spectrally regularizes gradients, ensuring each gradient step does not increase the spectral norm beyond a set limit.

The researchers extend regulation to the weights: After each step, they apply operations to cap the singular values of every weight matrix. Activation norms stay remarkably small as a result—rarely exceeding values compatible with fp8 precision in their GPT-2 scale transformers.

Removing Stability Tricks

In all experiments, no layer normalization, QK norm, or logit tanh softcapping was used. Even so, maximum activation entries in their GPT-2 scale transformer never exceeded ~100, while the unconstrained baseline surpassed 148,000.

Table Sample (NanoGPT Experiment)

| Model | Max Activation | Layer Stability Tricks | Validation Accuracy | Lipschitz Bound |
|---|---|---|---|---|
| Baseline (Speedrun) | 148,480 | Yes | 39.4% | ∞ |
| Lipschitz Transformer | 160 | None | 39.5% | 10^264 |

Methods for Enforcing Lipschitz Constraints

A variety of weight norm constraint methods were explored and compared for their ability to:

Maintain high performance,

Guarantee a Lipschitz bound, and

Optimize the performance-Lipschitz tradeoff.

Techniques

Weight Decay: Standard method, but not always strict on spectral norm.

Spectral Normalization: Ensures top singular value is capped, but may affect all singular values globally.

Spectral Soft Cap: Novel method that smoothly and efficiently applies σ → min(σ_max, σ) to all singular values in parallel (using odd polynomial approximations). It is co-designed for Muon's high stable-rank updates to give tight bounds.

Spectral Hammer: Sets only the largest singular value to σ_max; best suited for the AdamW optimizer.
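
The sketch below illustrates the general idea of capping singular values after an optimizer step. It uses an explicit SVD for clarity; the paper's spectral soft cap avoids this cost with odd polynomial approximations, so treat this as a conceptual stand-in rather than the authors' implementation.

# Hedged sketch of spectral weight capping (not the paper's exact implementation):
# after each optimizer step, cap the singular values of a weight matrix at sigma_max.
import numpy as np

def spectral_hard_cap(W, sigma_max=1.0):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.minimum(S, sigma_max)) @ Vt      # sigma -> min(sigma_max, sigma)

def spectral_hammer(W, sigma_max=1.0):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S[0] = min(S[0], sigma_max)                            # touch only the top singular value
    return U @ np.diag(S) @ Vt

W = np.random.randn(512, 512) * 0.1
print(np.linalg.norm(spectral_hard_cap(W), 2))             # <= 1.0 by construction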

Experimental Results and Insights

Model Evaluation at Various Scales

Shakespeare (Small Transformer, <2-Lipschitz):

Achieves 60% validation accuracy with a provable Lipschitz bound below 2.

Outperforms unconstrained baseline in validation loss.

NanoGPT (145M Parameters):

With a Lipschitz bound <10, validation accuracy: 21.2%.

Matching the strong unconstrained baseline (39.4% accuracy) required a large upper bound of 10^264. This highlights how strict Lipschitz constraints currently trade off against expressivity at large scales.

Weight Constraint Method Efficiency

Muon + Spectral Cap: Leads the tradeoff frontier—lower Lipschitz constants for matched or better validation loss compared to AdamW + weight decay.

Spectral soft cap and normalization (under Muon) consistently enable best frontier on the loss-Lipschitz tradeoff.

Stability and Robustness

Adversarial robustness increases sharply at lower Lipschitz bounds.

In experiments, models with a constrained Lipschitz constant suffered much milder accuracy drop under adversarial attack compared to unconstrained baselines.

Activation Magnitudes

With spectral weight regulation: Maximum activations remain tiny (near-fp8 compatible), compared to the unbounded baselines, even at scale.

This opens avenues for low-precision training and inference in hardware, where smaller activations reduce compute, memory, and power costs.

Limitations and Open Questions

Selecting the “tightest” tradeoff for weight norms, logit scaling, and attention scaling still relies on sweeps, not principle.

Current upper-bounding is loose: calculated global bounds can be astronomically large (e.g., 10^264), while real activation norms remain small.

It’s unclear if matching unconstrained baseline performance with strictly small Lipschitz bounds is possible as scale increases—more research needed.

Conclusion

Spectral weight regulation—especially when paired with the Muon optimizer—can stably train large transformers with enforced Lipschitz bounds, without activation normalization or other band-aid tricks. This addresses instability at a deeper level and keeps activations in a compact, predictable range, greatly improving adversarial robustness and potentially hardware efficiency.

This line of work points to new, efficient computational primitives for neural network regulation, with broad applications for privacy, safety, and low-precision AI deployment.

Check out the Paper, GitHub Page and Hugging Face Project Page. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post MIT Researchers Develop Methods to Control Transformer Sensitivity with Provable Lipschitz Bounds and Muon appeared first on MarkTechPost.

How to Use the SHAP-IQ Package to Uncover and Visualize Feature Interactions in Machine Learning Models Using Shapley Interaction Indices (SII)

In this tutorial, we explore how to use the SHAP-IQ package to uncover and visualize feature interactions in machine learning models using Shapley Interaction Indices (SII), building on the foundation of traditional Shapley values.

Shapley values are great for explaining individual feature contributions in AI models but fail to capture feature interactions. Shapley interactions go a step further by separating individual effects from interactions, offering deeper insights—like how longitude and latitude together influence house prices. In this tutorial, we’ll get started with the shapiq package to compute and explore these Shapley interactions for any model. Check out the Full Codes here

Installing the dependencies

!pip install shapiq overrides scikit-learn pandas numpy

Data Loading and Pre-processing

In this tutorial, we’ll use the Bike Sharing dataset from OpenML. After loading the data, we’ll split it into training and testing sets to prepare it for model training and evaluation. Check out the Full Codes here

import shapiq
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np

# Load data
X, y = shapiq.load_bike_sharing(to_numpy=True)

# Split into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Model Training and Performance Evaluation

# Train model
model = RandomForestRegressor()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"R² Score: {r2:.4f}")
print(f"Mean Absolute Error: {mae:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")

Setting up an Explainer

We set up a TabularExplainer using the shapiq package to compute Shapley interaction values based on the k-SII (k-order Shapley Interaction Index) method. By specifying max_order=4, we allow the explainer to consider interactions of up to 4 features simultaneously, enabling deeper insights into how groups of features collectively impact model predictions. Check out the Full Codes here

# set up an explainer with k-SII interaction values up to order 4
explainer = shapiq.TabularExplainer(
    model=model,
    data=X,
    index="k-SII",
    max_order=4
)

Explaining a Local Instance

We select a specific test instance (index 100) to generate local explanations. The code prints the true and predicted values for this instance, followed by a breakdown of its feature values. This helps us understand the exact inputs passed to the model and sets the context for interpreting the Shapley interaction explanations that follow. Check out the Full Codes here

# create explanations for different orders
# NOTE: the original snippet referenced an undefined `df`; here we assume the pandas
# version of the same dataset can be reloaded to recover the feature names.
X_df, _ = shapiq.load_bike_sharing()
feature_names = list(X_df.columns)  # get the feature names
n_features = len(feature_names)

# select a local instance to be explained
instance_id = 100
x_explain = X_test[instance_id]
y_true = y_test[instance_id]
y_pred = model.predict(x_explain.reshape(1, -1))[0]
print(f"Instance {instance_id}, True Value: {y_true}, Predicted Value: {y_pred}")
for i, feature in enumerate(feature_names):
    print(f"{feature}: {x_explain[i]}")

Analyzing Interaction Values

We use the explainer.explain() method to compute Shapley interaction values for a specific data instance (X[100]) with a budget of 256 model evaluations. This returns an InteractionValues object, which captures how individual features and their combinations influence the model’s output. The max_order=4 means we consider interactions involving up to 4 features. Check out the Full Codes here

interaction_values = explainer.explain(X[100], budget=256)
# analyse interaction values
print(interaction_values)

First-Order Interaction Values

To keep things simple, we compute first-order interaction values—i.e., standard Shapley values that capture only individual feature contributions (no interactions).

By setting max_order=1 in the TreeExplainer, we’re saying:

“Tell me how much each feature individually contributes to the prediction, without considering any interaction effects.”

These values are known as standard Shapley values. For each feature, it estimates the average marginal contribution to the prediction across all possible permutations of feature inclusion. Check out the Full Codes here
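
As a side illustration of what "average marginal contribution across all orderings" means (separate from the shapiq workflow), the toy sketch below computes exact Shapley values by brute force for a hypothetical 3-feature value function; the payoff table is invented purely for demonstration.

# Hedged illustration: exact Shapley values for a toy 3-feature value function,
# averaging each feature's marginal contribution over all orderings of inclusion.
from itertools import permutations

def shapley_values(value_fn, features):
    totals = {f: 0.0 for f in features}
    orders = list(permutations(features))
    for order in orders:
        included = set()
        for f in order:
            before = value_fn(included)
            included = included | {f}
            totals[f] += value_fn(included) - before   # marginal contribution of f
    return {f: totals[f] / len(orders) for f in features}

# toy value function: payoff of each coalition of features
payoffs = {frozenset(): 0, frozenset("a"): 1, frozenset("b"): 2, frozenset("c"): 0,
           frozenset("ab"): 4, frozenset("ac"): 1, frozenset("bc"): 2, frozenset("abc"): 5}
print(shapley_values(lambda s: payoffs[frozenset(s)], ["a", "b", "c"]))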

Copy CodeCopiedUse a different Browserfeature_names = list(df[0].columns)
explainer = shapiq.TreeExplainer(model=model, max_order=1, index=”SV”)
si_order = explainer.explain(x=x_explain)
si_order

Plotting a Waterfall chart

A Waterfall chart visually breaks down a model’s prediction into individual feature contributions. It starts from the baseline prediction and adds/subtracts each feature’s Shapley value to reach the final predicted output.

In our case, we’ll use the output of TreeExplainer with max_order=1 (i.e., individual contributions only) to visualize the contribution of each feature. Check out the Full Codes here

si_order.plot_waterfall(feature_names=feature_names, show=True)

In our case, the baseline value (i.e., the model’s expected output without any feature information) is 190.717.

As we add the contributions from individual features (order-1 Shapley values), we can observe how each one pushes the prediction up or pulls it down:

Features like Weather and Humidity have a positive contribution, increasing the prediction above the baseline.

Features like Temperature and Year have a strong negative impact, pulling the prediction down by −35.4 and −45, respectively.

Overall, the Waterfall chart helps us understand which features are driving the prediction, and in which direction—providing valuable insight into the model’s decision-making.

Check out the Full Codes here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post How to Use the SHAP-IQ Package to Uncover and Visualize Feature Interactions in Machine Learning Models Using Shapley Interaction Indices (SII) appeared first on MarkTechPost.

A Coding Guide to Build Intelligent Multi-Agent Systems with the PEER Pattern

In this tutorial, we explore a powerful multi-agent system built around the PEER pattern: Plan, Execute, Express, and Review. We run the entire workflow in Google Colab/Notebook, integrating agents with specialized roles and leveraging Google’s Gemini 1.5 Flash model via a free API key. As we walk through the system, we observe how each agent collaborates to tackle complex tasks across different domains such as finance, technology, and creative strategy. This hands-on tutorial allows us to understand the architecture, workflow, and iterative refinement that underpin high-quality AI outputs.

!pip install agentUniverse google-generativeai python-dotenv pydantic

import os
import asyncio
from typing import Dict, List, Any, Optional
from dataclasses import dataclass
from enum import Enum
import json
import time
import google.generativeai as genai

GEMINI_API_KEY = 'your-gemini-api-key-here'  # replace this placeholder with your own free Gemini API key
genai.configure(api_key=GEMINI_API_KEY)

We begin by installing the required libraries, including agentUniverse and google-generativeai, to set up our multi-agent system. After importing the necessary modules, we configure the Gemini API using our free API key to enable AI-powered content generation. Check out the Full Codes here.

class AgentRole(Enum):
    PLANNER = "planner"
    EXECUTOR = "executor"
    EXPRESSER = "expresser"
    REVIEWER = "reviewer"

@dataclass
class Task:
    id: str
    description: str
    context: Dict[str, Any]
    status: str = "pending"
    result: Optional[str] = None
    feedback: Optional[str] = None

class BaseAgent:
    """Base agent class with core functionality"""
    def __init__(self, name: str, role: AgentRole, system_prompt: str):
        self.name = name
        self.role = role
        self.system_prompt = system_prompt
        self.memory: List[Dict] = []

    async def process(self, task: Task) -> str:
        prompt = f"{self.system_prompt}\n\nTask: {task.description}\nContext: {json.dumps(task.context)}"

        result = await self._simulate_llm_call(prompt, task)

        self.memory.append({
            "task_id": task.id,
            "input": task.description,
            "output": result,
            "timestamp": time.time()
        })

        return result

    async def _simulate_llm_call(self, prompt: str, task: Task) -> str:
        """Call Google Gemini API for real LLM processing"""
        try:
            model = genai.GenerativeModel('gemini-1.5-flash')

            enhanced_prompt = self._create_role_prompt(prompt, task)

            response = await asyncio.to_thread(
                lambda: model.generate_content(enhanced_prompt)
            )

            return response.text.strip()

        except Exception as e:
            print(f"Gemini API error for {self.role.value}: {str(e)}")
            return self._get_fallback_response(task)

    def _create_role_prompt(self, base_prompt: str, task: Task) -> str:
        """Create enhanced role-specific prompts for Gemini"""
        role_instructions = {
            AgentRole.PLANNER: "You are a strategic planning expert. Create detailed, actionable plans. Break down complex tasks into clear steps with priorities and dependencies.",
            AgentRole.EXECUTOR: "You are a skilled executor. Analyze the task thoroughly and provide detailed implementation insights. Focus on practical solutions and potential challenges.",
            AgentRole.EXPRESSER: "You are a professional communicator. Present information clearly, professionally, and engagingly. Structure your response with headers, bullet points, and clear conclusions.",
            AgentRole.REVIEWER: "You are a quality assurance expert. Evaluate completeness, accuracy, and clarity. Provide specific, actionable improvement suggestions."
        }

        context_info = f"Previous context: {json.dumps(task.context, indent=2)}" if task.context else "No previous context"

        return f"""
{role_instructions[self.role]}

{base_prompt}

{context_info}

Task to process: {task.description}

Provide a comprehensive, professional response appropriate for your role as {self.role.value}.
"""

    def _get_fallback_response(self, task: Task) -> str:
        """Fallback responses if Gemini API is unavailable"""
        fallbacks = {
            AgentRole.PLANNER: f"STRATEGIC PLAN for '{task.description}': 1) Requirement analysis 2) Resource assessment 3) Implementation roadmap 4) Risk mitigation 5) Success metrics",
            AgentRole.EXECUTOR: f"EXECUTION ANALYSIS for '{task.description}': Comprehensive analysis completed. Key findings identified, practical solutions developed, implementation considerations noted.",
            AgentRole.EXPRESSER: f"PROFESSIONAL SUMMARY for '{task.description}': ## Analysis Complete\n\n**Key Insights:** Detailed analysis performed\n**Recommendations:** Strategic actions identified\n**Next Steps:** Implementation ready",
            AgentRole.REVIEWER: f"QUALITY REVIEW for '{task.description}': **Assessment:** High quality output achieved. **Strengths:** Comprehensive analysis, clear structure. **Suggestions:** Consider additional quantitative metrics."
        }
        return fallbacks[self.role]

We define four distinct agent roles, Planner, Executor, Expresser, and Reviewer, using an Enum to represent their specialized functions. Then, we create a Task dataclass to manage task metadata, including status, result, and feedback. The BaseAgent class serves as the core blueprint for all agents, enabling them to process tasks, call the Gemini API with role-specific prompts, store results in memory, and gracefully fall back to predefined responses if the API fails. Check out the Full Codes here.

class PEERAgent:
    """PEER Pattern Implementation - Plan, Execute, Express, Review"""
    def __init__(self):
        self.planner = BaseAgent("Strategic Planner", AgentRole.PLANNER,
                                 "You are a strategic planning agent. Break down complex tasks into actionable steps.")

        self.executor = BaseAgent("Task Executor", AgentRole.EXECUTOR,
                                  "You are an execution agent. Complete tasks efficiently using available tools and knowledge.")

        self.expresser = BaseAgent("Result Expresser", AgentRole.EXPRESSER,
                                   "You are a communication agent. Present results clearly and professionally.")

        self.reviewer = BaseAgent("Quality Reviewer", AgentRole.REVIEWER,
                                  "You are a quality assurance agent. Review outputs and provide improvement feedback.")

        self.iteration_count = 0
        self.max_iterations = 3

    async def collaborate(self, task: Task) -> Dict[str, Any]:
        """Execute PEER collaboration pattern"""
        self.iteration_count = 0  # reset so each new task gets the full iteration budget
        results = {"iterations": [], "final_result": None}

        while self.iteration_count < self.max_iterations:
            iteration_result = {}

            print(f"Planning Phase (Iteration {self.iteration_count + 1})")
            plan = await self.planner.process(task)
            iteration_result["plan"] = plan
            task.context["current_plan"] = plan

            print("Execution Phase")
            execution = await self.executor.process(task)
            iteration_result["execution"] = execution
            task.context["execution_result"] = execution

            print("Expression Phase")
            expression = await self.expresser.process(task)
            iteration_result["expression"] = expression
            task.result = expression

            print("Review Phase")
            review = await self.reviewer.process(task)
            iteration_result["review"] = review
            task.feedback = review

            results["iterations"].append(iteration_result)

            if "high" in review.lower() and self.iteration_count >= 1:
                results["final_result"] = expression
                break

            self.iteration_count += 1
            task.context["previous_feedback"] = review

        if results["final_result"] is None:
            results["final_result"] = task.result  # fall back to the last expressed output

        return results

We implement the PEER pattern, Plan, Execute, Express, Review, through the PEERAgent class, which coordinates four specialized agents for collaborative task-solving. Each iteration runs through all four phases, refining the task output based on structured planning, execution, professional expression, and quality review. We allow up to three iterations, concluding early if the review indicates high-quality completion, making the workflow both adaptive and efficient. Check out the Full Codes here.

class MultiAgentOrchestrator:
    """Orchestrates multiple specialized agents"""
    def __init__(self):
        self.agents = {}
        self.peer_system = PEERAgent()
        self.task_queue = []

    def register_agent(self, agent: BaseAgent):
        """Register a specialized agent"""
        self.agents[agent.name] = agent

    async def process_complex_task(self, description: str, domain: str = "general") -> Dict[str, Any]:
        """Process complex task using PEER pattern and domain agents"""
        task = Task(
            id=f"task_{int(time.time())}",
            description=description,
            context={"domain": domain, "complexity": "high"}
        )

        print(f"Starting Complex Task Processing: {description}")
        print("=" * 60)

        peer_results = await self.peer_system.collaborate(task)

        if domain in ["financial", "technical", "creative"]:
            domain_agent = self._get_domain_agent(domain)
            if domain_agent:
                print(f"Domain-Specific Processing ({domain})")
                domain_result = await domain_agent.process(task)
                peer_results["domain_enhancement"] = domain_result

        return {
            "task_id": task.id,
            "original_request": description,
            "peer_results": peer_results,
            "status": "completed",
            "processing_time": f"{len(peer_results['iterations'])} iterations"
        }

    def _get_domain_agent(self, domain: str) -> Optional[BaseAgent]:
        """Get domain-specific agent with enhanced Gemini prompts"""
        domain_agents = {
            "financial": BaseAgent("Financial Analyst", AgentRole.EXECUTOR,
                                   "You are a senior financial analyst with expertise in market analysis, risk assessment, and investment strategies. Provide detailed financial insights with quantitative analysis."),
            "technical": BaseAgent("Technical Expert", AgentRole.EXECUTOR,
                                   "You are a lead software architect with expertise in system design, scalability, and best practices. Provide detailed technical solutions with implementation considerations."),
            "creative": BaseAgent("Creative Director", AgentRole.EXPRESSER,
                                  "You are an award-winning creative director with expertise in brand strategy, content creation, and innovative campaigns. Generate compelling and strategic creative solutions.")
        }
        return domain_agents.get(domain)

class KnowledgeBase:
    """Simple knowledge management system"""
    def __init__(self):
        self.knowledge = {
            "financial_analysis": ["Risk assessment", "Portfolio optimization", "Market analysis"],
            "technical_development": ["System architecture", "Code optimization", "Security protocols"],
            "creative_content": ["Brand storytelling", "Visual design", "Content strategy"]
        }

    def get_domain_knowledge(self, domain: str) -> List[str]:
        return self.knowledge.get(domain, ["General knowledge"])

async def run_advanced_demo():

    orchestrator = MultiAgentOrchestrator()
    knowledge_base = KnowledgeBase()

    print("\nDEMO 1: Financial Analysis with PEER Pattern")
    print("-" * 40)

    financial_task = "Analyze the potential impact of rising interest rates on tech stocks portfolio"
    result1 = await orchestrator.process_complex_task(financial_task, "financial")

    print(f"\nTask Completed: {result1['processing_time']}")
    print(f"Final Result: {result1['peer_results']['final_result']}")

    print("\nDEMO 2: Technical Problem Solving")
    print("-" * 40)

    technical_task = "Design a scalable microservices architecture for a high-traffic e-commerce platform"
    result2 = await orchestrator.process_complex_task(technical_task, "technical")

    print(f"\nTask Completed: {result2['processing_time']}")
    print(f"Final Result: {result2['peer_results']['final_result']}")

    print("\nDEMO 3: Creative Content with Multi-Agent Collaboration")
    print("-" * 40)

    creative_task = "Create a comprehensive brand strategy for a sustainable fashion startup"
    result3 = await orchestrator.process_complex_task(creative_task, "creative")

    print(f"\nTask Completed: {result3['processing_time']}")
    print(f"Final Result: {result3['peer_results']['final_result']}")

    print("\nAGENT MEMORY & LEARNING")
    print("-" * 40)
    print(f"Planner processed {len(orchestrator.peer_system.planner.memory)} tasks")
    print(f"Executor processed {len(orchestrator.peer_system.executor.memory)} tasks")
    print(f"Expresser processed {len(orchestrator.peer_system.expresser.memory)} tasks")
    print(f"Reviewer processed {len(orchestrator.peer_system.reviewer.memory)} tasks")

    return {
        "demo_results": [result1, result2, result3],
        "agent_stats": {
            "total_tasks": 3,
            "success_rate": "100%",
            "avg_iterations": sum(len(r['peer_results']['iterations']) for r in [result1, result2, result3]) / 3
        }
    }

def explain_peer_pattern():
    """Explain the PEER pattern in detail"""
    explanation = """
    PEER Pattern Explained:

    P - PLAN: Strategic decomposition of complex tasks
    E - EXECUTE: Systematic implementation using tools and knowledge
    E - EXPRESS: Clear, structured communication of results
    R - REVIEW: Quality assurance and iterative improvement

    This pattern enables:
    Better task decomposition
    Systematic execution
    Professional output formatting
    Continuous quality improvement
    """
    print(explanation)

def show_architecture():
    """Display the multi-agent architecture"""
    architecture = """
    agentUniverse Architecture:

    Task Input

    PEER System
    ├── Planner Agent
    ├── Executor Agent
    ├── Expresser Agent
    └── Reviewer Agent

    Domain Specialists
    ├── Financial Analyst
    ├── Technical Expert
    └── Creative Director

    Knowledge Base

    Results & Analytics
    """
    print(architecture)

We bring everything together through the MultiAgentOrchestrator, which coordinates the PEER system and, when needed, invokes domain-specific agents like the Financial Analyst or Technical Expert. This orchestrator handles each complex task by first leveraging the PEER pattern and then enhancing results with specialized knowledge. We also define a simple KnowledgeBase to support domain-aware reasoning. In the run_advanced_demo() function, we test the full pipeline with three tasks, financial, technical, and creative, while capturing agent performance and iteration metrics to showcase the power and versatility of our multi-agent setup. Check out the Full Codes here.

if __name__ == "__main__":
    print("Get your FREE API key at: https://makersuite.google.com/app/apikey")
    print("Make sure to replace 'your-gemini-api-key-here' with your actual key!")

    if GEMINI_API_KEY == 'your-gemini-api-key-here':
        print("WARNING: Please set your Gemini API key first!")
        print(" 1. Go to https://makersuite.google.com/app/apikey")
        print(" 2. Create a free API key")
        print(" 3. Replace 'your-gemini-api-key-here' with your key")
        print(" 4. Re-run the tutorial")
    else:
        print("API key configured! Starting tutorial...")

    explain_peer_pattern()
    show_architecture()

    print("\nRunning Advanced Demo with Gemini AI (This may take a moment)...")

    try:
        import nest_asyncio
        nest_asyncio.apply()

        demo_results = asyncio.run(run_advanced_demo())

        print("\nTUTORIAL COMPLETED SUCCESSFULLY!")
        print("=" * 50)
        print("Performance Summary:")
        print(f" • Tasks Processed: {demo_results['agent_stats']['total_tasks']}")
        print(f" • Success Rate: {demo_results['agent_stats']['success_rate']}")
        print(f" • Avg Iterations: {demo_results['agent_stats']['avg_iterations']:.1f}")
        print(" • Powered by: Google Gemini (FREE)")

        print("\nKey Takeaways:")
        print(" • PEER pattern enables systematic problem-solving")
        print(" • Multi-agent collaboration improves output quality")
        print(" • Domain expertise integration enhances specialization")
        print(" • Iterative refinement ensures high-quality results")
        print(" • Gemini provides powerful, free AI capabilities")

    except ImportError:
        print("Note: Install nest_asyncio for full async support in Colab")
        print("Run: !pip install nest_asyncio")
    except Exception as e:
        print(f"Error running demo: {str(e)}")
        print("This might be due to API key configuration or network issues.")

    print("\nNext Steps:")
    print(" • Customize agents for your specific domain")
    print(" • Experiment with different Gemini models (gemini-pro, gemini-1.5-flash)")
    print(" • Build production-ready multi-agent applications")

We conclude the tutorial by initializing the system, verifying the Gemini API key, and executing the full PEER-based multi-agent workflow. We explain the architecture and pattern before running the demo, and upon successful completion, we display a performance summary and key takeaways.

In conclusion, we successfully demonstrate how a multi-agent system can systematically solve complex problems with the help of domain-specific reasoning, structured communication, and iterative quality checks. We gain insights into the collaborative power of the PEER framework and witness how Gemini enhances each agent’s output. Through this experience, we realize the potential of modular AI systems in creating scalable, reliable, and intelligent applications ready for real-world deployment.

Check out the Full Codes here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post A Coding Guide to Build Intelligent Multi-Agent Systems with the PEER Pattern appeared first on MarkTechPost.

Falcon LLM Team Releases Falcon-H1 Technical Report: A Hybrid Attention–SSM Model That Rivals 70B LLMs

Introduction

The Falcon-H1 series, developed by the Technology Innovation Institute (TII), marks a significant advancement in the evolution of large language models (LLMs). By integrating Transformer-based attention with Mamba-based State Space Models (SSMs) in a hybrid parallel configuration, Falcon-H1 achieves exceptional performance, memory efficiency, and scalability. Released in multiple sizes (0.5B to 34B parameters) and versions (base, instruct-tuned, and quantized), Falcon-H1 models redefine the trade-off between compute budget and output quality, offering parameter efficiency superior to many contemporary models such as Qwen2.5-72B and LLaMA3.3-70B.

Key Architectural Innovations

The technical report explains how Falcon-H1 adopts a novel parallel hybrid architecture where both attention and SSM modules operate concurrently, and their outputs are concatenated before the projection. This design deviates from traditional sequential integration and provides the flexibility to tune the number of attention and SSM channels independently. The default configuration uses a 2:1:5 ratio for SSM, attention, and MLP channels respectively, optimizing both efficiency and learning dynamics.
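
A hedged sketch of this parallel mixing idea appears below. It is not the released Falcon-H1 code: the SSM branch is replaced by a GRU stand-in and the channel sizes are illustrative, but it shows attention and an SSM-style mixer running on the same input, with their outputs concatenated before a shared projection.

# Hedged sketch of a parallel hybrid block in the spirit of Falcon-H1 (PyTorch).
# The SSM branch is a simple GRU placeholder, NOT the actual Mamba kernel.
import torch
import torch.nn as nn

class ParallelHybridBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_attn=256, d_ssm=512):
        super().__init__()
        self.attn_in = nn.Linear(d_model, d_attn)
        self.attn = nn.MultiheadAttention(d_attn, n_heads, batch_first=True)
        self.ssm_in = nn.Linear(d_model, d_ssm)
        self.ssm = nn.GRU(d_ssm, d_ssm, batch_first=True)        # placeholder for the Mamba SSM
        self.proj = nn.Linear(d_attn + d_ssm, d_model)           # mix the concatenated channels

    def forward(self, x):                                        # x: (batch, seq, d_model)
        a = self.attn_in(x)
        a, _ = self.attn(a, a, a)                                # attention branch
        s, _ = self.ssm(self.ssm_in(x))                          # SSM-style branch
        return x + self.proj(torch.cat([a, s], dim=-1))          # concatenate, project, residual

x = torch.randn(2, 16, 512)
print(ParallelHybridBlock()(x).shape)                            # torch.Size([2, 16, 512])

Because the two branches are independent, the number of attention and SSM channels can be tuned separately, which is the flexibility the report exploits with its 2:1:5 SSM/attention/MLP ratio.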

To further refine the model, Falcon-H1 explores:

Channel allocation: Ablations show that increasing attention channels deteriorates performance, whereas balancing SSM and MLP yields robust gains.

Block configuration: The SA_M configuration (semi-parallel with attention and SSM run together, followed by MLP) performs best in training loss and computational efficiency.

RoPE base frequency: An unusually high base frequency of 10^11 in Rotary Positional Embeddings (RoPE) proved optimal, improving generalization during long-context training.

Width-depth trade-off: Experiments show that deeper models outperform wider ones under fixed parameter budgets. Falcon-H1-1.5B-Deep (66 layers) outperforms many 3B and 7B models.

Tokenizer Strategy

Falcon-H1 uses a customized Byte Pair Encoding (BPE) tokenizer suite with vocabulary sizes ranging from 32K to 261K. Key design choices include:

Digit and punctuation splitting: Empirically improves performance in code and multilingual settings.

LATEX token injection: Enhances model accuracy on math benchmarks.

Multilingual support: Covers 18 languages and scales to 100+, using optimized fertility and bytes/token metrics.

Pretraining Corpus and Data Strategy

Falcon-H1 models are trained on up to 18T tokens from a carefully curated 20T token corpus, comprising:

High-quality web data (filtered FineWeb)

Multilingual datasets: Common Crawl, Wikipedia, arXiv, OpenSubtitles, and curated resources for 17 languages

Code corpus: 67 languages, processed via MinHash deduplication, CodeBERT quality filters, and PII scrubbing

Math datasets: MATH, GSM8K, and in-house LaTeX-enhanced crawls

Synthetic data: Rewritten from raw corpora using diverse LLMs, plus textbook-style QA from 30K Wikipedia-based topics

Long-context sequences: Enhanced via Fill-in-the-Middle, reordering, and synthetic reasoning tasks up to 256K tokens

Training Infrastructure and Methodology

Training utilized customized Maximal Update Parametrization (µP), supporting smooth scaling across model sizes. The models employ advanced parallelism strategies:

Mixer Parallelism (MP) and Context Parallelism (CP): Enhance throughput for long-context processing

Quantization: Released in bfloat16 and 4-bit variants to facilitate edge deployments

Evaluation and Performance

Falcon-H1 achieves unprecedented performance per parameter:

Falcon-H1-34B-Instruct surpasses or matches 70B-scale models like Qwen2.5-72B and LLaMA3.3-70B across reasoning, math, instruction-following, and multilingual tasks

Falcon-H1-1.5B-Deep rivals 7B–10B models

Falcon-H1-0.5B delivers 2024-era 7B performance

Benchmarks span MMLU, GSM8K, HumanEval, and long-context tasks. The models demonstrate strong alignment via SFT and Direct Preference Optimization (DPO).

Conclusion

Falcon-H1 sets a new standard for open-weight LLMs by integrating parallel hybrid architectures, flexible tokenization, efficient training dynamics, and robust multilingual capability. Its strategic combination of SSM and attention allows for unmatched performance within practical compute and memory budgets, making it ideal for both research and deployment across diverse environments.

Check out the Paper and Models on Hugging Face. Feel free to check our Tutorials page on AI Agent and Agentic AI for various applications. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Falcon LLM Team Releases Falcon-H1 Technical Report: A Hybrid Attention–SSM Model That Rivals 70B LLMs appeared first on MarkTechPost.