OpenAI Adds Full MCP Tool Support in ChatGPT Developer Mode: Enabling Write Actions, Workflow Automation, and Enterprise Integrations

OpenAI has just introduced a major upgrade to ChatGPT’s developer mode by adding full support for Model Context Protocol (MCP) tools. Until now, MCP integrations inside ChatGPT were limited to search and fetch operations—essentially read-only. With this update, MCP connectors can perform write actions, which means developers can now directly update systems, trigger workflows, and chain complex automations from within a ChatGPT conversation. The capability is currently available to Plus and Pro users.

This change moves ChatGPT beyond being just an intelligent query layer. Instead of only retrieving data from connected sources, it can now act on that data. For example, developers can update Jira tickets directly through chat, kick off a Zapier workflow, or combine connectors to perform multi-step tasks such as analyzing error logs, opening an incident ticket, and notifying a team channel. ChatGPT is no longer just a conversational assistant—it is positioned as an orchestration layer for real work across distributed tools.

The technical foundation of this expansion lies in the MCP framework, which defines how large language models interact with external services through structured protocols. Connectors expose capabilities that ChatGPT can call, typically described using JSON schemas. The addition of write support introduces new requirements around authentication, security, and reliability. Since connectors now modify external state, API tokens, OAuth scopes, and access controls need to be tightly scoped. Error handling becomes critical: when a write operation fails, ChatGPT must be able to surface the issue clearly, log it, and recover gracefully. Developers also need to consider transaction safety when chaining multiple write actions across services.
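
To make the shape of such a connector concrete, here is a minimal, hypothetical sketch of a write-capable tool definition and its handler in Python. The tool name, schema fields, and the jira_client object are illustrative assumptions rather than OpenAI's or Atlassian's actual APIs; the point is that the write action is described by a JSON schema, runs with a narrowly scoped credential, and reports failures in a structured way.

# Hypothetical example: a write-capable MCP tool definition plus a handler.
# The tool name, schema fields, and the jira_client object are illustrative,
# not an official OpenAI or Atlassian API.

UPDATE_TICKET_TOOL = {
    "name": "update_jira_ticket",
    "description": "Update the status or assignee of an existing Jira ticket.",
    "inputSchema": {  # JSON Schema describing the write action's parameters
        "type": "object",
        "properties": {
            "ticket_id": {"type": "string"},
            "status": {"type": "string", "enum": ["To Do", "In Progress", "Done"]},
            "assignee": {"type": "string"},
        },
        "required": ["ticket_id"],
    },
}

def handle_update_ticket(args: dict, jira_client) -> dict:
    """Run the write action with a narrowly scoped credential and surface failures."""
    try:
        jira_client.update_issue(  # assumed client method; real code would validate args first
            issue_id=args["ticket_id"],
            fields={k: v for k, v in args.items() if k != "ticket_id"},
        )
        return {"ok": True, "ticket_id": args["ticket_id"]}
    except Exception as exc:  # return a structured error so the model can report it
        return {"ok": False, "error": str(exc)}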

From a developer experience standpoint, enabling these capabilities is straightforward. Once developer mode is activated in ChatGPT, developers can register connectors that include both read and write methods. These connectors can then be invoked naturally during a conversation. The workflow is designed for iteration—developers can prototype, test, and refine integrations directly in chat rather than building custom middleware from scratch. OpenAI’s documentation provides schemas, endpoint definitions, and examples to standardize connector behavior across services.

The impact for enterprise and automation use cases is significant. Operations teams can streamline incident response by having ChatGPT log issues, update tickets, and push alerts automatically. Business teams can embed ChatGPT into CRM pipelines, where a single conversational update might sync customer data, generate reports, and notify account managers. For engineering teams, ChatGPT can now trigger builds, update GitHub pull requests, or synchronize task trackers—all without leaving the chat interface. In each case, ChatGPT is not just summarizing information but actively driving workflows.

This update marks an important step in the future of ChatGPT. By enabling full MCP tool support, OpenAI is pushing the assistant from being a knowledge layer to a true automation platform. It provides developers with the flexibility to build connectors that bridge natural language instructions and real-world actions, effectively turning conversation into a universal interface for enterprise systems. For organizations using ChatGPT Plus or Pro, developer mode now opens the door to integrating conversational AI directly into daily operations, where chat doesn’t just answer questions—it gets work done.

We’ve (finally) added full support for MCP tools in ChatGPT. In developer mode, developers can create connectors and use them in chat for write actions (not just search/fetch). Update Jira tickets, trigger Zapier workflows, or combine connectors for complex automations. pic.twitter.com/1W0rTGGEnu
— OpenAI Developers (@OpenAIDevs), September 10, 2025


Enhance video understanding with Amazon Bedrock Data Automation and op …

In real-world video and image analysis, businesses often face the challenge of detecting objects that weren’t part of a model’s original training set. This becomes especially difficult in dynamic environments where new, unknown, or user-defined objects frequently appear. For example, media publishers might want to track emerging brands or products in user-generated content; advertisers need to analyze product appearances in influencer videos despite visual variations; retail providers aim to support flexible, descriptive search; self-driving cars must identify unexpected road debris; and manufacturing systems need to catch novel or subtle defects without prior labeling.

In all these cases, traditional closed-set object detection (CSOD) models—which only recognize a fixed list of predefined categories—fail to deliver. They either misclassify the unknown objects or ignore them entirely, limiting their usefulness for real-world applications.

Open-set object detection (OSOD) is an approach that enables models to detect both known and previously unseen objects, including those not encountered during training. It supports flexible input prompts, ranging from specific object names to open-ended descriptions, and can adapt to user-defined targets in real time without requiring retraining. By combining visual recognition with semantic understanding—often through vision-language models—OSOD lets users query the system broadly, even when the target object is unfamiliar, ambiguous, or entirely new.
In this post, we explore how Amazon Bedrock Data Automation uses OSOD to enhance video understanding.
Amazon Bedrock Data Automation and video blueprints with OSOD
Amazon Bedrock Data Automation is a cloud-based service that extracts insights from unstructured content like documents, images, video, and audio. Specifically, for video content, Amazon Bedrock Data Automation supports functionalities such as chapter segmentation, frame-level text detection, chapter-level classification using Interactive Advertising Bureau (IAB) taxonomies, and frame-level OSOD. For more information about Amazon Bedrock Data Automation, see Automate video insights for contextual advertising using Amazon Bedrock Data Automation.
Amazon Bedrock Data Automation video blueprints support OSOD at the frame level. You can input a video along with a text prompt specifying the desired objects to detect. For each frame, the model outputs a dictionary containing bounding boxes in XYWH format (the x and y coordinates of the top-left corner, followed by the width and height of the box), along with corresponding labels and confidence scores. You can further customize the output based on your needs—for instance, filtering by high-confidence detections when precision is prioritized.
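
As a concrete illustration, the following minimal sketch (not part of the service API) shows one way to post-process the frame-level results in Python: keep only high-confidence detections and convert the normalized XYWH boxes to pixel coordinates. It assumes the coordinates are normalized to the range [0, 1], as in the sample output later in this post; the helper name and threshold are arbitrary.

def filter_detections(detections, frame_width, frame_height, min_confidence=0.8):
    """Return (label, pixel_box) pairs for detections above a confidence threshold."""
    kept = []
    for det in detections:
        if det["confidence"] < min_confidence:
            continue  # drop low-confidence detections when precision is prioritized
        box = det["bounding_box"]
        kept.append((
            det["label"],
            {
                "x": int(box["left"] * frame_width),
                "y": int(box["top"] * frame_height),
                "width": int(box["width"] * frame_width),
                "height": int(box["height"] * frame_height),
            },
        ))
    return kept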
The input text is highly flexible, so you can define dynamic fields in the Amazon Bedrock Data Automation video blueprints powered by OSOD.
Example use cases
In this section, we explore some examples of different use cases for Amazon Bedrock Data Automation video blueprints using OSOD. The following table summarizes the functionality of this feature.

Functionality: Multi-granular visual comprehension

Sub-functionality: Object detection from fine-grained object reference. Example: “Detect the apple in the video.”
Sub-functionality: Object detection from cross-granularity object reference. Example: “Detect all the fruit items in the image.”
Sub-functionality: Object detection from open questions. Example: “Find and detect the most visually important elements in the image.”

Functionality: Visual hallucination detection

Sub-functionality: Identify and flag objects mentioned in the input text that do not correspond to actual content in the given image. Example: “Detect if apples appear in the image.”

Ads analysis
Advertisers can use this feature to compare the effectiveness of various ad placement strategies across different locations and conduct A/B testing to identify the optimal advertising approach. For example, the following image is the output in response to the prompt “Detect the locations of echo devices.”

Smart resizing
By detecting key elements in the video, you can choose appropriate resizing strategies for devices with different resolutions and aspect ratios, making sure important visual information is preserved. For example, the following image is the output in response to the prompt “Detect the key elements in the video.”

Surveillance with intelligent monitoring
In home security systems, producers or users can take advantage of the model’s high-level understanding and localization capabilities to maintain safety, without the need to manually enumerate all possible scenarios. For example, the following image is the output in response to the prompt “Check dangerous elements in the video.”

Custom labels
You can define your own labels and search through videos to retrieve specific, desired results. For example, the following image is the output in response to the prompt “Detect the white car with red wheels in the video.”

Image and video editing
With flexible text-based object detection, you can accurately remove or replace objects in photo editing software, minimizing the need for imprecise, hand-drawn masks that often require multiple attempts to achieve the desired result. For example, the following image is the output in response to the prompt “Detect the people riding motorcycles in the video.”
Sample video blueprint input and output
The following example demonstrates how to define an Amazon Bedrock Data Automation video blueprint to detect visually prominent objects at the chapter level, with sample output including objects and their bounding boxes.
The following code is our example blueprint schema:

blueprint = {
  "$schema": "http://json-schema.org/draft-07/schema#",
  "description": "This blueprint enhances the searchability and discoverability of video content by providing comprehensive object detection and scene analysis.",
  "class": "media_search_video_analysis",
  "type": "object",
  "properties": {
    # Targeted Object Detection: Identifies visually prominent objects in the video
    # Set granularity to chapter level for more precise object detection
    "targeted-object-detection": {
      "type": "array",
      "instruction": "Please detect all the visually prominent objects in the video",
      "items": {
        "$ref": "bedrock-data-automation#/definitions/Entity"
      },
      "granularity": ["chapter"]  # Chapter-level granularity provides per-scene object detection
    },
  }
}

The following code is our example video custom output:

"chapters": [
    ...,
    {
        "inference_result": {
            "emotional-tone": "Tension and suspense"
        },
        "frames": [
            {
                "frame_index": 10289,
                "inference_result": {
                    "targeted-object-detection": [
                        {
                            "label": "man",
                            "bounding_box": {
                                "left": 0.6198254823684692,
                                "top": 0.10746771097183228,
                                "width": 0.16384708881378174,
                                "height": 0.7655990719795227
                            },
                            "confidence": 0.9174646443068981
                        },
                        {
                            "label": "ocean",
                            "bounding_box": {
                                "left": 0.0027531087398529053,
                                "top": 0.026655912399291992,
                                "width": 0.9967235922813416,
                                "height": 0.7752640247344971
                            },
                            "confidence": 0.7712276351034641
                        },
                        {
                            "label": "cliff",
                            "bounding_box": {
                                "left": 0.4687306359410286,
                                "top": 0.5707792937755585,
                                "width": 0.168929323554039,
                                "height": 0.20445972681045532
                            },
                            "confidence": 0.719932173293829
                        }
                    ]
                },
                "timecode_smpte": "00:05:43;08",
                "timestamp_millis": 343276
            }
        ],
        "chapter_index": 11,
        "start_timecode_smpte": "00:05:36;16",
        "end_timecode_smpte": "00:09:27;14",
        "start_timestamp_millis": 336503,
        "end_timestamp_millis": 567400,
        "start_frame_index": 10086,
        "end_frame_index": 17006,
        "duration_smpte": "00:03:50;26",
        "duration_millis": 230897,
        "duration_frames": 6921
    },
    ...
]
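
The following hypothetical helper shows how you might walk this structure to index detected labels by chapter and frame timestamp; it relies only on the fields shown in the sample output above.

def summarize_detections(chapters):
    """Map chapter_index -> {timestamp_millis: [labels]} from the custom output."""
    summary = {}
    for chapter in chapters:
        labels_by_time = {}
        for frame in chapter.get("frames", []):
            detections = frame.get("inference_result", {}).get("targeted-object-detection", [])
            labels_by_time[frame["timestamp_millis"]] = [d["label"] for d in detections]
        summary[chapter["chapter_index"]] = labels_by_time
    return summary

# For the chapter shown above, this yields {11: {343276: ["man", "ocean", "cliff"]}}.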

For the full example, refer to the following GitHub repo.
Conclusion
The OSOD capability within Amazon Bedrock Data Automation significantly enhances the ability to extract actionable insights from video content. By combining flexible text-driven queries with frame-level object localization, OSOD helps users across industries implement intelligent video analysis workflows—ranging from targeted ad evaluation and security monitoring to custom object tracking. Integrated seamlessly into the broader suite of video analysis tools available in Amazon Bedrock Data Automation, OSOD not only streamlines content understanding but also helps reduce the need for manual intervention and rigid predefined schemas, making it a powerful asset for scalable, real-world applications.
To learn more about Amazon Bedrock Data Automation video and audio analysis, see New Amazon Bedrock Data Automation capabilities streamline video and audio analysis.

About the authors
Dongsheng An is an Applied Scientist at AWS AI, specializing in face recognition, open-set object detection, and vision-language models. He received his Ph.D. in Computer Science from Stony Brook University, focusing on optimal transport and generative modeling.
Lana Zhang is a Senior Solutions Architect in the AWS World Wide Specialist Organization AI Services team, specializing in AI and generative AI with a focus on use cases including content moderation and media analysis. She’s dedicated to promoting AWS AI and generative AI solutions, demonstrating how generative AI can transform classic use cases by adding business value. She assists customers in transforming their business solutions across diverse industries, including social media, gaming, ecommerce, media, advertising, and marketing.
Raj Jayaraman is a Senior Generative AI Solutions Architect at AWS, bringing over a decade of experience in helping customers extract valuable insights from data. Specializing in AWS AI and generative AI solutions, Raj’s expertise lies in transforming business solutions through the strategic application of AWS’s AI capabilities, ensuring customers can harness the full potential of generative AI in their unique contexts. With a strong background in guiding customers across industries in adopting AWS Analytics and Business Intelligence services, Raj now focuses on assisting organizations in their generative AI journey—from initial demonstrations to proof of concepts and ultimately to production implementations.

How Skello uses Amazon Bedrock to query data in a multi-tenant environ …

This is a guest post co-written with Skello.
Skello is a leading human resources (HR) software as a service (SaaS) solution focusing on employee scheduling and workforce management. Catering to diverse sectors such as hospitality, retail, healthcare, construction, and industry, Skello offers features including schedule creation, time tracking, and payroll preparation. With approximately 20,000 customers and 400,000 daily users across Europe as of 2024, Skello continually innovates to meet its clients’ evolving needs.
One such innovation is the implementation of an AI-powered assistant to enhance user experience and data accessibility. In this post, we explain how Skello used Amazon Bedrock to create this AI assistant for end-users while maintaining customer data safety in a multi-tenant environment. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
We dive deep into the challenges of implementing large language models (LLMs) for data querying, particularly in the context of a French company operating under the General Data Protection Regulation (GDPR). Our solution demonstrates how to balance powerful AI capabilities with strict data protection requirements.
Challenges with multi-tenant data access
As Skello’s platform grew to serve thousands of businesses, we identified a critical need: our users needed better ways to access and understand their workforce data. Many of our customers, particularly those in HR and operations roles, found traditional database querying tools too technical and time-consuming. This led us to identify two key areas for improvement:

Quick access to non-structured data – Our users needed to find specific information across various data types—employee records, scheduling data, attendance logs, and performance metrics. Traditional search methods often fell short when users had complex questions like “Show me all part-time employees who worked more than 30 hours last month” or “What’s the average sick leave duration in the retail department?”
Visualization of data through graphs for analytics – Although our platform collected comprehensive workforce data, users struggled to transform this raw information into actionable insights. They needed an intuitive way to create visual representations of trends and patterns without writing complex SQL queries or learning specialized business intelligence tools.

To address these challenges, we needed a solution that could:

Understand natural language questions about complex workforce data
Correctly interpret context and intent from user queries
Generate appropriate database queries while respecting data access rules
Return results in user-friendly formats, including visualizations
Handle variations in how users might phrase similar questions
Process queries about time-based data and trends

LLMs emerged as the ideal solution for this task. Their ability to understand natural language and context, combined with their capability to generate structured outputs, made them perfectly suited for translating user questions into precise database queries. However, implementing LLMs in a business-critical application required careful consideration of security, accuracy, and performance requirements.
Solution overview
Using LLMs to generate structured queries from natural language input is an emerging area of interest. This process enables the transformation of user requests into organized data structures, which can then be used to query databases automatically.
The following diagram of Skello’s high-level architecture illustrates this user request transformation process.

The implementation using AWS Lambda and Amazon Bedrock provides several advantages:

Scalability through serverless architecture
Cost-effective processing with pay-as-you-go pricing
Low-latency performance
Access to advanced language models like Anthropic’s Claude 3.5 Sonnet
Rapid deployment capabilities
Flexible integration options

Basic query generation process
The following diagram illustrates how we transform natural language queries into structured database requests. For this example, the user asks “Give me the gender parity.”

The process works as follows:

The authentication service validates the user’s identity and permissions.
The LLM converts the natural language to a structured query format.
The query validation service enforces compliance with security policies.
The database access layer executes the query within the user’s permitted scope.

Handling complex queries
For more sophisticated requests like “Give me the worked hours per week per position for the last 3 months,” our system completes the following steps (a sketch of the resulting structured query follows the list):

Extract query components:

Target metric: worked hours
Aggregation levels: week, position
Time frame: 3 months

Generate temporal calculations:

Use relative time expressions instead of hard-coded dates
Implement standardized date handling patterns
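
For illustration, the structured representation the LLM emits for this request might look like the following sketch. The field names and values are hypothetical, not Skello’s actual schema; the key point is that the time frame stays relative rather than hard-coded, and tenant scoping is added outside the model.

# Illustrative only: the kind of structured query an LLM might emit for
# "Give me the worked hours per week per position for the last 3 months".
structured_query = {
    "metric": "worked_hours",
    "aggregations": ["week", "position"],
    "time_range": {"relative": "last_3_months"},    # relative expression, not hard-coded dates
    "filters": {"tenant_id": "{{current_tenant}}"}, # injected by the authorization layer, not the LLM
}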

Data schema optimization
To make our system as efficient and user-friendly as possible, we carefully organized our data structure—think of it as creating a well-organized filing system for a large office.
We created standardized schema definitions, establishing consistent ways to store similar types of information. For example, date-related fields (hire dates, shift times, vacation periods) follow the same format. This helps prevent confusion when users ask questions like “Show me all events from last week.” It’s similar to having all calendars in your office using the same date format instead of some using MM/DD/YY and others using DD/MM/YY.
Our system employs consistent naming conventions with clear, predictable names for all data fields. Instead of technical abbreviations like emp_typ_cd, we use clear terms like employee_type. This makes it straightforward for the AI to understand what users mean when they ask questions like “Show me all full-time employees.”
For optimized search patterns, we strategically organized our data to make common searches fast and efficient. This is particularly important because it directly impacts user experience and system performance. We analyzed usage patterns to identify the most frequently requested information and designed our database indexes accordingly. Additionally, we created specialized data views that pre-aggregate common report requests. This comprehensive approach means questions like “Who’s working today?” get answered almost instantly.
We also established clear data relationships by mapping out how different pieces of information relate to each other. For example, we clearly connect employees to their departments, shifts, and managers. This helps answer complex questions like “Show me all department managers who have team members on vacation next week.”
These optimizations deliver real benefits to our users:

Faster response times when asking questions
More accurate answers to queries
Less confusion when referring to specific types of data
Ability to ask more complex questions about relationships between different types of information
Consistent results when asking similar questions in different ways

For example, whether a user asks “Show me everyone’s vacation time” or “Display all holiday schedules,” the system understands they’re looking for the same type of information. This reliability makes the system more trustworthy and easier to use for everyone, regardless of their technical background.
Graph generation and display
One of the most powerful features of our system is its ability to turn data into meaningful visual charts and graphs automatically. This consists of the following actions:

Smart label creation – The system understands what your data means and creates clear, readable labels. For example, if you ask “Show me employee attendance over the last 6 months,” the horizontal axis automatically labels the months (January through June), the vertical axis shows attendance numbers with simple-to-read intervals, and the title clearly states what you’re looking at: “Employee Attendance Trends.”
Automatic legend creation – The system creates helpful legends that explain what each part of the chart means. For instance, if you ask “Compare sales across different departments,” different departments get different colors, a clear legend shows which color represents which department, and additional information like “Dashed lines show previous year” is automatically added when needed.
Choosing the right type of chart – The system is smart about picking the best way to show your information. For example, it uses bar charts for comparing different categories (“Show me sales by department”), line graphs for trends over time (“How has attendance changed this year?”), pie charts for showing parts of a whole (“What’s the breakdown of full-time vs. part-time staff?”), and heat maps for complex patterns (“Show me busiest hours per day of the week”).
Smart sizing and scaling – The system automatically adjusts the size and scale of charts to make them simple to read. For example, if numbers range from 1–100, it might show intervals of 10; if you’re looking at millions, it might show them in a more readable way (1M, 2M, etc.); charts automatically resize to show patterns clearly; and important details are never too small to see.

All of this happens automatically—you ask your question, and the system handles the technical details of creating a clear, professional visualization. For example, the following figure shows the output for the question “How many hours my employees worked over the past 7 weeks?”
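
As a rough illustration of what happens behind the scenes, the assistant’s choices could be captured in a declarative chart specification like the sketch below before any rendering takes place. The keys and values are hypothetical, not Skello’s actual format.

# Illustrative only: a chart specification of the kind the assistant might generate
# for "How many hours my employees worked over the past 7 weeks?" before rendering.
chart_spec = {
    "chart_type": "bar",  # chosen because the question compares discrete weeks
    "title": "Hours Worked per Week",
    "x_axis": {"label": "Week", "values": ["W1", "W2", "W3", "W4", "W5", "W6", "W7"]},
    "y_axis": {"label": "Hours", "scale": "auto"},
    "series": [{"name": "All employees", "aggregation": "sum"}],
}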

Security-first architecture
Our implementation adheres to OWASP best practices (specifically LLM06) by maintaining complete separation between security controls and the LLM.
Through dedicated security services, user authentication and authorization checks are performed before LLM interactions, with user context and permissions managed through Amazon Bedrock SessionParameters, keeping security information entirely outside of LLM processing.
Our validation layer uses Amazon Bedrock Guardrails to protect against prompt injection, inappropriate content, and forbidden topics such as racism, sexism, or illegal content.
The system’s architecture implements strict role-based access controls through a detailed permissions matrix, so users can only access data within their authorized scope. For authentication, we use industry-standard JWT and SAML protocols, and our authorization service maintains granular control over data access permissions.
This multi-layered approach prevents potential security bypasses through prompt manipulation or other LLM-specific attacks. The system automatically enforces data boundaries at both database and API levels, effectively preventing cross-contamination between different customer accounts. For instance, department managers can only access their team’s data, with these restrictions enforced through database compartmentalization.
Additionally, our comprehensive audit system maintains immutable logs of all actions, including timestamps, user identifiers, and accessed resources, stored separately to protect their integrity. This security framework operates seamlessly in the background, maintaining robust protection of sensitive information without disrupting the user experience or legitimate workflows.
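
The following simplified sketch illustrates the principle that access control stays outside the LLM: the structured query the model produces is validated and narrowed against the caller’s permissions before it ever reaches the database. The names used here (UserContext, allowed_departments) are illustrative, not Skello’s implementation.

from dataclasses import dataclass

@dataclass
class UserContext:
    tenant_id: str
    allowed_departments: set

def enforce_scope(structured_query: dict, user: UserContext) -> dict:
    """Reject or narrow a query so it can only touch data the user may access."""
    scoped = dict(structured_query)
    # Always pin the query to the caller's tenant, regardless of what the model produced.
    scoped["filters"] = {**scoped.get("filters", {}), "tenant_id": user.tenant_id}
    requested = set(scoped.get("departments", user.allowed_departments))
    if not requested <= user.allowed_departments:
        raise PermissionError("Query requests departments outside the user's scope")
    scoped["departments"] = sorted(requested)
    return scoped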
Benefits
Creating data visualizations has never been more accessible. Even without specialized expertise, you can now produce professional-quality charts that communicate your insights effectively. The streamlined process makes sure your visualizations remain consistently clear and intuitive, so you can concentrate on exploring your data questions instead of spending time on presentation details.
The solution works through simple conversational requests that require no technical knowledge or specialized software. You simply describe what you want to visualize using everyday language and the system interprets your request and creates the appropriate visualization. There’s no need to learn complex software interfaces, remember specific commands, or understand data formatting requirements. The underlying technology handles the data processing, chart selection, and professional formatting automatically, transforming your spoken or written requests into polished visual presentations within moments.
Your specific information needs drive how the data is displayed, making the insights more relevant and actionable. When it’s time to share your findings, these visualizations seamlessly integrate into your reports and presentations with polished formatting that enhances your overall message. This democratization of data visualization empowers everyone to tell compelling data stories.
Conclusion
In this post, we explored Skello’s implementation of an AI-powered assistant using Amazon Bedrock and Lambda. We saw how end-users can query their own data in a multi-tenant environment while maintaining logical boundaries and complying with GDPR regulations. The combination of serverless architecture and advanced language models proved effective in enhancing data accessibility and user experience.
We invite you to explore the AWS Machine Learning Blog for more insights on AI solutions and their potential business applications. If you’re interested in learning more about Skello’s journey in modernizing HR software, check out our blog post series on the topic.
If you have any questions or suggestions about implementing similar solutions in your own multi-tenant environment, please feel free to share them in the comments section.

About the authors
Nicolas de Place is a Data & AI Solutions Architect specializing in machine learning strategy for high-growth startups. He empowers emerging companies to harness the full potential of artificial intelligence and advanced analytics, designing scalable ML architectures and data-driven solutions.
Cédric Peruzzi is a Software Architect at Skello, where he focuses on designing and implementing Generative AI features. Before his current role, he worked as a software engineer and architect, bringing his experience to help build better software solutions.

Create a private workforce on Amazon SageMaker Ground Truth with the A …

Private workforces for Amazon SageMaker Ground Truth and Amazon Augmented AI (Amazon A2I) help organizations build proprietary, high-quality datasets while keeping high standards of security and privacy.
The AWS Management Console provides a fast and intuitive way to create a private workforce, but many organizations need to automate their infrastructure deployment through infrastructure as code (IaC) because it provides benefits such as automated and consistent deployments, increased operational efficiency, and reduced chances of human errors or misconfigurations.
However, creating a private workforce with IaC is not a straightforward task because of some complex technical dependencies between services during the initial creation.
In this post, we present a complete solution for programmatically creating private workforces on Amazon SageMaker AI using the AWS Cloud Development Kit (AWS CDK), including the setup of a dedicated, fully configured Amazon Cognito user pool. The accompanying GitHub repository provides a customizable AWS CDK example that shows how to create and manage a private workforce, paired with a dedicated Amazon Cognito user pool, and how to integrate the necessary Amazon Cognito configurations.
Solution overview
This solution demonstrates how to create a private workforce and a coupled Amazon Cognito user pool and its dependent resources. The goal is to provide a comprehensive setup for the base infrastructure to enable machine learning (ML) labeling tasks.
The key technical challenge in this solution is the mutual dependency between the Amazon Cognito resources and the private workforce.
Specifically, the creation of the user pool app client requires certain parameters, such as the callback URL, which is only available after the private workforce is created. However, the private workforce creation itself needs the app client to be already present. This mutual dependency makes it challenging to set up the infrastructure in a straightforward manner.
Additionally, the user pool domain name must remain consistent across deployments, because it can’t be easily changed after the initial creation and inconsistency in the name can lead to deployment errors.
To address these challenges, the solution uses several AWS CDK constructs, including AWS CloudFormation custom resources. This custom approach allows the orchestration of the user pool and SageMaker private workforce creation, to correctly configure the resources and manage their interdependencies.
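
As a rough sketch of this pattern (not the accompanying repository’s actual code), the following AWS CDK Python fragment creates a user pool, an app client with a placeholder callback URL, and a custom resource that issues the CreateWorkforce call once both exist. Resource names are placeholders, and production code would also handle updates, deletion, and replacing the placeholder callback URL with the labeling portal URL, as described in the steps that follow.

from aws_cdk import Stack, aws_cognito as cognito, custom_resources as cr
from constructs import Construct

class WorkforceSketchStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # User pool and app client created first, with a temporary callback URL
        user_pool = cognito.UserPool(self, "WorkforceUserPool")
        app_client = user_pool.add_client(
            "WorkforceAppClient",
            generate_secret=True,
            o_auth=cognito.OAuthSettings(
                callback_urls=["https://placeholder.example.com"],  # replaced after workforce creation
            ),
        )

        # Custom resource backing the CreateWorkforce API call
        cr.AwsCustomResource(
            self, "CreateWorkforce",
            on_create=cr.AwsSdkCall(
                service="SageMaker",
                action="createWorkforce",
                parameters={
                    "WorkforceName": "my-private-workforce",
                    "CognitoConfig": {
                        "ClientId": app_client.user_pool_client_id,
                        "UserPool": user_pool.user_pool_id,
                    },
                },
                physical_resource_id=cr.PhysicalResourceId.of("my-private-workforce"),
            ),
            policy=cr.AwsCustomResourcePolicy.from_sdk_calls(
                resources=cr.AwsCustomResourcePolicy.ANY_RESOURCE
            ),
        )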
The solution architecture is composed of one stack with several resources and services, some of which are needed only for the initial setup of the private workforce, and some that are used by the private workforce workers when logging in to complete a labeling task. The following diagram illustrates this architecture.

The solution’s deployment requires AWS services and resources that work together to set up the private workforce. The numbers in the diagram reflect the stack components that support the stack creation, which occur in the following order:

Amazon Cognito user pool – The user pool provides user management and authentication for the SageMaker private workforce. It handles user registration, login, and password management. A default email invitation is initially set to onboard new users to the private workforce. The user pool is both associated with an AWS WAF firewall and configured to deliver user activity logs to Amazon CloudWatch for enhanced security.
Amazon Cognito user pool app client – The user pool app client configures the client application that will interact with the user pool. During the initial deployment, a temporary placeholder callback URL is used, because the actual callback URL can only be determined later in the process.
AWS Systems Manager Parameter Store – Parameter Store, a capability of AWS Systems Manager, stores and persists the prefix of the user pool domain across deployments in a string parameter. The provided prefix must be such that the resulting domain is globally unique.
Amazon Cognito user pool domain – The user pool domain defines the domain name for the managed login experience provided by the user pool. This domain name must remain consistent across deployments, because it can’t be easily changed after the initial creation.
IAM roles – AWS Identity and Access Management (IAM) roles for CloudFormation custom resources include permissions to make AWS SDK calls to create the private workforce and other API calls during the next steps.
Private workforce – Implemented using a custom resource backing the CreateWorkforce API call, the private workforce is the foundation to manage labeling activities. It creates the labeling portal and manages portal-level access controls, including authentication through the integrated user pool. Upon creation, the labeling portal URL is made available to be used as a callback URL by the Amazon Cognito app client. The connected Amazon Cognito app client is automatically updated with the new callback URL.
SDK call to fetch the labeling portal domain – This SDK call reads the subdomain of the labeling portal. This is implemented as a CloudFormation custom resource.
SDK call to update user pool – This SDK call updates the user pool with a user invitation email that points to the labeling portal URL. This is implemented as a CloudFormation custom resource.
Filter for placeholder callback URL – Custom logic separates the placeholder URL from the app client’s callback URLs. This is implemented as a CloudFormation custom resource, backed by a custom AWS Lambda function.
SDK call to update the app client to remove the placeholder callback URL – This SDK call updates the app client with the correct callback URLs. This is implemented as a CloudFormation custom resource.
User creation and invitation emails – Amazon Cognito users are created and sent invitation emails with instructions to join the private workforce.

After this initial setup, a worker can join the private workforce and access the labeling portal. The authentication flow includes the email invitation, initial registration, authentication, and login to the labeling portal. The following diagram illustrates this workflow.

The detailed workflow steps are as follows:

A worker receives an email invitation that provides the user name, temporary password, and URL of the labeling portal.
When trying to reach the labeling portal, the worker is redirected to the Amazon Cognito user pool domain for authentication. Amazon Cognito domain endpoints are additionally protected by AWS WAF. The worker then sets a new password and registers with multi-factor authentication.
Authentication actions by the worker are logged and sent to CloudWatch.
The worker can log in and is redirected to the labeling portal.
In the labeling portal, the worker can access existing labeling jobs in SageMaker Ground Truth.

The solution uses a mix of AWS CDK constructs and CloudFormation custom resources to integrate the Amazon Cognito user pool and the SageMaker private workforce so workers can register and access the labeling portal. In the following sections, we show how to deploy the solution.
Prerequisites
You must have the following prerequisites:

An AWS account, already bootstrapped for the AWS CDK
AWS credentials with sufficient permissions to deploy the solution
The AWS CDK installed (version 2.178.1 or later)
Python (version 3.13 or later)
The AWS Command Line Interface (AWS CLI) installed
A mobile device with an authenticator app installed

Deploy the solution
To deploy the solution, complete the following steps. Make sure you have AWS credentials available in your environment with sufficient permissions to deploy the solution resources.

Clone the GitHub repository.
Follow the detailed instructions in the README file to deploy the stack using the AWS CDK and AWS CLI.
Open the AWS CloudFormation console and choose the Workforce stack for more information on the ongoing deployment and the created resources.

Test the solution
If you invited yourself from the AWS CDK CLI to join the private workforce, follow the instructions in the email that you received to register and access the labeling portal. Otherwise, complete the following steps to invite yourself and others to join the private workforce. For more information, see Creating a new user in the AWS Management Console.

On the Amazon Cognito console, choose User pools in the navigation pane.
Choose the existing user pool, MyWorkforceUserPool.
Choose Users, then choose Create a user.
Choose Email as the alias attribute to sign in.
Choose Send an email invitation as the invitation message.
For User name, enter a name for the new user. Make sure not to use the email address.
For Email address, enter the email address of the worker to be invited.
For simplicity, choose Generate a password for the user.
Choose Create.

After you receive the invitation email, follow the instructions to set a new password and register with an authenticator application. Then you can log in and see a page listing your labeling jobs.

Best practices and considerations
When setting up a private workforce, consider the best practices for Amazon Cognito and the AWS CDK, as well as additional customizations:

Customized domain – Provide your own prefix for the Amazon Cognito subdomain when deploying the solution. This way, you can use a more recognizable domain name for the labeling application, rather than a randomly generated one. For even greater customization, integrate the user pool with a custom domain that you own. This gives you full control over the URL used for the login and aligns it with the rest of your organization’s applications.
Enhance security controls – Depending on your organization’s security and compliance requirements, you can further adapt the Amazon Cognito resources, for instance, by integrating with external identity providers and following other security best practices.
Implement VPC configuration – You can implement additional security controls, such as adding a virtual private cloud (VPC) configuration to the private workforce. This helps you enhance the overall security posture of your solution, providing an additional layer of network-level security and isolation.
Restrict the source IPs – When creating the SageMaker private workforce, you can specify a list of IP address ranges (CIDR) from which workers can log in.
AWS WAF customization – Bring your own existing AWS WAF or configure one to your organization’s needs by setting up custom rules, IP filtering, rate-based rules, and web access control lists (ACLs) to protect your application.
Integrate with CI/CD – Incorporate the IaC in a continuous integration and continuous delivery (CI/CD) pipeline to standardize deployments, track changes, and further improve resource tracking and observability across multiple environments (for instance, development, staging, and production).
Extend the solution – Depending on your specific use case, you might want to extend the solution to include the creation and management of work teams and labeling jobs or flows. This can help integrate the private workforce setup more seamlessly with your existing ML workflows and data labeling processes.
Integrate with additional AWS services – To suit your specific requirements, you can further integrate the private workforce and user pool with other relevant AWS services, such as CloudWatch for logging, monitoring, and alarms, and Amazon Simple Notification Service (Amazon SNS) for notifications to enhance the capabilities of your data labeling solution.

Clean up
To clean up your resources, open the AWS CloudFormation console and delete the Workforce stack. Alternatively, if you deployed using the AWS CDK CLI, you can run cdk destroy from the same terminal where you ran cdk deploy and use the same AWS CDK CLI arguments as during deployment.
Conclusion
This solution demonstrates how to programmatically create a private workforce on SageMaker Ground Truth, paired with a dedicated and fully configured Amazon Cognito user pool. By using the AWS CDK and AWS CloudFormation, this solution brings the benefits of IaC to the setup of your ML data labeling private workforce.
To further customize this solution to meet your organization’s standards, discover how to accelerate your journey on the cloud with the help of AWS Professional Services.
We encourage you to learn more from the developer guides on data labeling on SageMaker and Amazon Cognito user pools. Refer to the following blog posts for more examples of labeling data using SageMaker Ground Truth:

Power Your LLM Training and Evaluation with the New SageMaker AI Generative AI Tools
Enhance speech synthesis and video generation models with RLHF using audio and video segmentation in Amazon SageMaker
Accelerate custom labeling workflows in Amazon SageMaker Ground Truth without using AWS Lambda
Create a data labeling project with Amazon SageMaker Ground Truth Plus
High-quality human feedback for your generative AI applications from Amazon SageMaker Ground Truth Plus

About the author
Dr. Giorgio Pessot is a Machine Learning Engineer at Amazon Web Services Professional Services. With a background in computational physics, he specializes in architecting enterprise-grade AI systems at the confluence of mathematical theory, DevOps, and cloud technologies, where technology and organizational processes converge to achieve business objectives. When he’s not whipping up cloud solutions, you’ll find Giorgio engineering culinary creations in his kitchen.

Building Advanced MCP (Model Context Protocol) Agents with Multi-Agent Coordination, Context Awareness, and Gemini Integration

In this tutorial, we are walking through the process of building an advanced MCP (Model Context Protocol) Agent that runs smoothly inside Jupyter or Google Colab. We are designing the system with real-world practicality in mind, focusing on multi-agent coordination, context awareness, memory management, and dynamic tool usage. As we progress, we see how each agent specializes in its own role, whether it’s coordinating, researching, analyzing, or executing, and how together they form a swarm that can handle complex tasks.

import json
import logging
from typing import Dict, List, Any, Optional
from dataclasses import dataclass
from enum import Enum
from datetime import datetime
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

try:
    import google.generativeai as genai
    GEMINI_AVAILABLE = True
except ImportError:
    print("google-generativeai not installed. Run: pip install google-generativeai")
    GEMINI_AVAILABLE = False

We start by importing essential Python libraries for data handling, logging, and agent structuring, while also setting up logging for better debugging. We then check for the availability of the Gemini API, so we can seamlessly integrate it if it is installed; otherwise, we run in demo mode.

class AgentRole(Enum):
    COORDINATOR = "coordinator"
    RESEARCHER = "researcher"
    ANALYZER = "analyzer"
    EXECUTOR = "executor"

@dataclass
class Message:
    role: str
    content: str
    timestamp: datetime
    metadata: Dict[str, Any] = None

@dataclass
class AgentContext:
    agent_id: str
    role: AgentRole
    capabilities: List[str]
    memory: List[Message]
    tools: List[str]

We define the core building blocks of our agent system. We create AgentRole to assign clear responsibilities, use Message to store conversations with context, and build AgentContext to capture each agent’s identity, role, memory, and tools so we can manage interactions effectively.

class MCPAgent:
    """Advanced MCP Agent with evolved capabilities - Jupyter Compatible"""

    def __init__(self, agent_id: str, role: AgentRole, api_key: str = None):
        self.agent_id = agent_id
        self.role = role
        self.api_key = api_key
        self.memory = []
        self.context = AgentContext(
            agent_id=agent_id,
            role=role,
            capabilities=self._init_capabilities(),
            memory=[],
            tools=self._init_tools()
        )

        self.model = None
        if GEMINI_AVAILABLE and api_key:
            try:
                genai.configure(api_key=api_key)
                self.model = genai.GenerativeModel('gemini-pro')
                print(f"Agent {agent_id} initialized with Gemini API")
            except Exception as e:
                print(f"Gemini configuration failed: {e}")
                print("Running in demo mode with simulated responses")
        else:
            print(f"Agent {agent_id} running in demo mode")

    def _init_capabilities(self) -> List[str]:
        """Initialize role-specific capabilities"""
        capabilities_map = {
            AgentRole.COORDINATOR: ["task_decomposition", "agent_orchestration", "priority_management"],
            AgentRole.RESEARCHER: ["data_gathering", "web_search", "information_synthesis"],
            AgentRole.ANALYZER: ["pattern_recognition", "data_analysis", "insight_generation"],
            AgentRole.EXECUTOR: ["action_execution", "result_validation", "output_formatting"]
        }
        return capabilities_map.get(self.role, [])

    def _init_tools(self) -> List[str]:
        """Initialize available tools based on role"""
        tools_map = {
            AgentRole.COORDINATOR: ["task_splitter", "agent_selector", "progress_tracker"],
            AgentRole.RESEARCHER: ["search_engine", "data_extractor", "source_validator"],
            AgentRole.ANALYZER: ["statistical_analyzer", "pattern_detector", "visualization_tool"],
            AgentRole.EXECUTOR: ["code_executor", "file_handler", "api_caller"]
        }
        return tools_map.get(self.role, [])

    def process_message(self, message: str, context: Optional[Dict] = None) -> Dict[str, Any]:
        """Process incoming message with context awareness - Synchronous version"""

        msg = Message(
            role="user",
            content=message,
            timestamp=datetime.now(),
            metadata=context
        )
        self.memory.append(msg)

        prompt = self._generate_contextual_prompt(message, context)

        try:
            if self.model:
                response = self._generate_response_gemini(prompt)
            else:
                response = self._generate_demo_response(message)

            response_msg = Message(
                role="assistant",
                content=response,
                timestamp=datetime.now(),
                metadata={"agent_id": self.agent_id, "role": self.role.value}
            )
            self.memory.append(response_msg)

            return {
                "agent_id": self.agent_id,
                "role": self.role.value,
                "response": response,
                "capabilities_used": self._analyze_capabilities_used(message),
                "next_actions": self._suggest_next_actions(response),
                "timestamp": datetime.now().isoformat()
            }

        except Exception as e:
            logger.error(f"Error processing message: {e}")
            return {"error": str(e)}

    def _generate_response_gemini(self, prompt: str) -> str:
        """Generate response using Gemini API - Synchronous"""
        try:
            response = self.model.generate_content(prompt)
            return response.text
        except Exception as e:
            logger.error(f"Gemini API error: {e}")
            return self._generate_demo_response(prompt)

    def _generate_demo_response(self, message: str) -> str:
        """Generate simulated response for demo purposes"""
        role_responses = {
            AgentRole.COORDINATOR: f"As coordinator, I'll break down the task: '{message[:50]}...' into manageable components and assign them to specialized agents.",
            AgentRole.RESEARCHER: f"I'll research information about: '{message[:50]}...' using my data gathering and synthesis capabilities.",
            AgentRole.ANALYZER: f"Analyzing the patterns and insights from: '{message[:50]}...' to provide data-driven recommendations.",
            AgentRole.EXECUTOR: f"I'll execute the necessary actions for: '{message[:50]}...' and validate the results."
        }

        base_response = role_responses.get(self.role, f"Processing: {message[:50]}...")

        time.sleep(0.5)

        additional_context = {
            AgentRole.COORDINATOR: " I've identified 3 key subtasks and will coordinate their execution across the agent team.",
            AgentRole.RESEARCHER: " My research indicates several relevant sources and current trends in this area.",
            AgentRole.ANALYZER: " The data shows interesting correlations and actionable insights for decision making.",
            AgentRole.EXECUTOR: " I've completed the requested actions and verified the outputs meet quality standards."
        }

        return base_response + additional_context.get(self.role, "")

    def _generate_contextual_prompt(self, message: str, context: Optional[Dict]) -> str:
        """Generate context-aware prompt based on agent role"""

        base_prompt = f"""
You are an advanced AI agent with the role: {self.role.value}
Your capabilities: {', '.join(self.context.capabilities)}
Available tools: {', '.join(self.context.tools)}

Recent conversation context:
{self._get_recent_context()}

Current request: {message}
"""

        role_instructions = {
            AgentRole.COORDINATOR: """
Focus on breaking down complex tasks, coordinating with other agents,
and maintaining overall project coherence. Consider dependencies and priorities.
Provide clear task decomposition and agent assignments.
""",
            AgentRole.RESEARCHER: """
Prioritize accurate information gathering, source verification,
and comprehensive data collection. Synthesize findings clearly.
Focus on current trends and reliable sources.
""",
            AgentRole.ANALYZER: """
Focus on pattern recognition, data interpretation, and insight generation.
Provide evidence-based conclusions and actionable recommendations.
Highlight key correlations and implications.
""",
            AgentRole.EXECUTOR: """
Concentrate on practical implementation, result validation,
and clear output delivery. Ensure actions are completed effectively.
Focus on quality and completeness of execution.
"""
        }

        return base_prompt + role_instructions.get(self.role, "")

    def _get_recent_context(self, limit: int = 3) -> str:
        """Get recent conversation context"""
        if not self.memory:
            return "No previous context"

        recent = self.memory[-limit:]
        context_str = ""
        for msg in recent:
            context_str += f"{msg.role}: {msg.content[:100]}...\n"
        return context_str

    def _analyze_capabilities_used(self, message: str) -> List[str]:
        """Analyze which capabilities were likely used"""
        used_capabilities = []
        message_lower = message.lower()

        capability_keywords = {
            "task_decomposition": ["break down", "divide", "split", "decompose"],
            "data_gathering": ["research", "find", "collect", "gather"],
            "pattern_recognition": ["analyze", "pattern", "trend", "correlation"],
            "action_execution": ["execute", "run", "implement", "perform"],
            "agent_orchestration": ["coordinate", "manage", "organize", "assign"],
            "information_synthesis": ["synthesize", "combine", "merge", "integrate"]
        }

        for capability, keywords in capability_keywords.items():
            if capability in self.context.capabilities:
                if any(keyword in message_lower for keyword in keywords):
                    used_capabilities.append(capability)

        return used_capabilities

    def _suggest_next_actions(self, response: str) -> List[str]:
        """Suggest logical next actions based on response"""
        suggestions = []
        response_lower = response.lower()

        if "need more information" in response_lower or "research" in response_lower:
            suggestions.append("delegate_to_researcher")
        if "analyze" in response_lower or "pattern" in response_lower:
            suggestions.append("delegate_to_analyzer")
        if "implement" in response_lower or "execute" in response_lower:
            suggestions.append("delegate_to_executor")
        if "coordinate" in response_lower or "manage" in response_lower:
            suggestions.append("initiate_multi_agent_collaboration")
        if "subtask" in response_lower or "break down" in response_lower:
            suggestions.append("task_decomposition_required")

        return suggestions if suggestions else ["continue_conversation"]

We implement the MCPAgent as a notebook-friendly, role-aware agent that initializes capabilities and tools based on its assigned role, keeps a memory of messages, and generates context-aware responses. We seamlessly use Gemini when available (falling back to a demo response otherwise) and wrap everything with structured outputs like capabilities used and suggested next actions. We also provide utilities to craft role-specific prompts, surface recent context, detect implied capabilities, and propose the next step in a multi-agent workflow.

class MCPAgentSwarm:
    """Multi-agent coordination system - Jupyter Compatible"""

    def __init__(self, api_key: str = None):
        self.api_key = api_key
        self.agents = {}
        self.task_history = []
        self.results = {}

    def create_agent(self, agent_id: str, role: AgentRole) -> MCPAgent:
        """Create and register a new agent"""
        agent = MCPAgent(agent_id, role, self.api_key)
        self.agents[agent_id] = agent
        print(f"Created agent: {agent_id} with role: {role.value}")
        return agent

    def coordinate_task(self, task: str) -> Dict[str, Any]:
        """Coordinate complex task across multiple agents - Synchronous"""

        print(f"\nCoordinating task: {task}")
        print("=" * 60)

        if "coordinator" not in self.agents:
            self.create_agent("coordinator", AgentRole.COORDINATOR)

        coordinator = self.agents["coordinator"]

        print("\nStep 1: Task Decomposition")
        decomposition = coordinator.process_message(
            f"Decompose this complex task into subtasks and identify which specialized agents are needed: {task}"
        )
        print(f"Coordinator: {decomposition['response']}")

        self._ensure_required_agents()

        print("\nStep 2: Agent Collaboration")
        results = {}
        for agent_id, agent in self.agents.items():
            if agent_id != "coordinator":
                print(f"\n{agent_id.upper()} working...")
                result = agent.process_message(
                    f"Handle your specialized part of this task: {task}\n"
                    f"Coordinator's guidance: {decomposition['response'][:200]}..."
                )
                results[agent_id] = result
                print(f"{agent_id}: {result['response'][:150]}...")

        print("\nStep 3: Final Synthesis")
        results_summary = [f"{k}: {v['response'][:100]}..." for k, v in results.items()]
        final_result = coordinator.process_message(
            f"Synthesize these agent results into a comprehensive final output for the task '{task}':\n"
            f"Results summary: {results_summary}"
        )
        print(f"Final Result: {final_result['response']}")

        task_record = {
            "task": task,
            "timestamp": datetime.now().isoformat(),
            "decomposition": decomposition,
            "agent_results": results,
            "final_synthesis": final_result,
            "agents_involved": list(self.agents.keys())
        }
        self.task_history.append(task_record)

        return task_record

    def _ensure_required_agents(self):
        """Ensure all required agent types exist"""
        required_roles = [AgentRole.RESEARCHER, AgentRole.ANALYZER, AgentRole.EXECUTOR]

        for role in required_roles:
            agent_id = role.value
            if agent_id not in self.agents:
                self.create_agent(agent_id, role)

    def get_swarm_status(self) -> Dict[str, Any]:
        """Get current status of the agent swarm"""
        return {
            "total_agents": len(self.agents),
            "agent_roles": {aid: agent.role.value for aid, agent in self.agents.items()},
            "tasks_completed": len(self.task_history),
            "last_task": self.task_history[-1]["task"] if self.task_history else "None"
        }

We manage a swarm of role-specific agents, create them on demand, and coordinate complex tasks through decomposition, collaboration, and final synthesis. We track results and history, ensure required agents exist, and provide a quick status view of the whole system at any time.

def demo_notebook_compatible():
    """Demonstrate advanced MCP agent capabilities - Notebook Compatible"""

    print("Starting Advanced MCP Agent Tutorial")
    print("Jupyter/Colab Compatible Version")
    print("=" * 60)

    API_KEY = None  # Set to your actual key

    if not API_KEY:
        print("Running in DEMO MODE (simulated responses)")
        print("Set API_KEY variable for real Gemini AI responses")
        print("-" * 60)

    swarm = MCPAgentSwarm(API_KEY)

    print("\nDemo 1: Single Agent Interaction")
    researcher = swarm.create_agent("research_agent", AgentRole.RESEARCHER)

    result = researcher.process_message(
        "Research the latest trends in AI agent architectures and multi-agent systems"
    )
    print("\nResearcher Response:")
    print(f"  {result['response']}")
    print(f"  Capabilities Used: {result['capabilities_used']}")
    print(f"  Suggested Next Actions: {result['next_actions']}")

    print("\n\nDemo 2: Multi-Agent Coordination")

    complex_task = """
    Analyze the impact of AI agents on software development productivity.
    Include research on current tools, performance metrics, future predictions,
    and provide actionable recommendations for development teams.
    """

    coordination_result = swarm.coordinate_task(complex_task)

    print("\n\nDemo 3: Swarm Status")
    status = swarm.get_swarm_status()
    print(f"  Total Agents: {status['total_agents']}")
    print(f"  Agent Roles: {status['agent_roles']}")
    print(f"  Tasks Completed: {status['tasks_completed']}")

    print("\nTutorial Completed Successfully!")
    return swarm

def run_demo():
    """Simple function to run the demo"""
    return demo_notebook_compatible()

if __name__ == "__main__":
    print("Running MCP Agent Demo...")
    swarm = run_demo()
else:
    print("MCP Agent Tutorial loaded!")
    print("Run: swarm = run_demo() to start the demonstration")

We wrap everything into a notebook-friendly demo that showcases the MCP agent system in action. We start by creating a researcher agent for single-agent interaction, then demonstrate multi-agent collaboration on a complex task, and finally check swarm status. We also ensure the code runs smoothly in both script mode and Jupyter/Colab mode, with a clear fallback to demo responses when no Gemini API key is set.

In conclusion, we have successfully demonstrated how our MCP agents can coordinate, decompose tasks, and synthesize results into actionable insights, all within a notebook-friendly, synchronous setup. We have seen how memory enables continuity of context, how role-based specialization ensures efficiency, and how the swarm can adapt to various challenges. With Gemini integration available for real AI responses and a fallback demo mode for simulation, we are leaving with a working foundation for advanced multi-agent systems.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Building Advanced MCP (Model Context Protocol) Agents with Multi-Agent Coordination, Context Awareness, and Gemini Integration appeared first on MarkTechPost.

NVIDIA AI Releases Universal Deep Research (UDR): A Prototype Framewor …

Why do existing deep research tools fall short?

Deep Research Tools (DRTs) like Gemini Deep Research, Perplexity, OpenAI’s Deep Research, and Grok DeepSearch rely on rigid workflows bound to a fixed LLM. While effective, they impose strict limitations: users cannot define custom strategies, swap models, or enforce domain-specific protocols.

NVIDIA’s analysis identifies three core problems:

Users cannot enforce preferred sources, validation rules, or cost control.

Specialized research strategies for domains such as finance, law, or healthcare are unsupported.

DRTs are tied to single models, preventing flexible pairing of the best LLM with the best strategy.

These issues restrict adoption in high-value enterprise and scientific applications.

https://arxiv.org/pdf/2509.00244

What is Universal Deep Research (UDR)?

Universal Deep Research (UDR) is an open-source system (in preview) that decouples strategy from model. It allows users to design, edit, and run their own deep research workflows without retraining or fine-tuning any LLM.

Unlike existing tools, UDR works at the system orchestration level:

It converts user-defined research strategies into executable code.

It runs workflows in a sandboxed environment for safety.

It treats the LLM as a utility for localized reasoning (summarization, ranking, extraction) instead of giving it full control.

This architecture makes UDR lightweight, flexible, and model-agnostic.

https://arxiv.org/pdf/2509.00244

How does UDR process and execute research strategies?

UDR takes two inputs: the research strategy (step-by-step workflow) and the research prompt (topic and output requirements).

Strategy Processing

Natural language strategies are compiled into Python code with enforced structure.

Variables store intermediate results, avoiding context-window overflow.

All functions are deterministic and transparent.

Strategy Execution

Control logic runs on CPU; only reasoning tasks call the LLM.

Notifications are emitted via yield statements, keeping users updated in real time.

Reports are assembled from stored variable states, ensuring traceability.

This separation of orchestration vs. reasoning improves efficiency and reduces GPU cost.
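To make the orchestration-versus-reasoning split concrete, here is a minimal, hypothetical sketch of what a compiled UDR-style strategy could look like; the injected helpers (search, llm), the notification fields, and the overall shape are illustrative assumptions, not UDR's actual code.

from datetime import datetime, timezone

def minimal_strategy(prompt, search, llm):
    """search(query) -> list[str] and llm(instruction) -> str are injected utilities."""
    def note(kind, description):
        # Structured progress notification emitted via yield
        return {"type": kind,
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "description": description}

    yield note("progress", f"Generating search queries for: {prompt}")
    queries = llm(f"Write three web search queries for: {prompt}").splitlines()[:3]

    snippets = []  # intermediate results live in ordinary variables, not the LLM context
    for q in queries:
        yield note("progress", f"Searching: {q}")
        snippets.extend(search(q))

    yield note("progress", "Summarizing collected results")
    summary = llm("Summarize these snippets into a short report section:\n" + "\n".join(snippets))

    yield note("report", f"# Report: {prompt}\n\n{summary}\n")

A driver loop such as for event in minimal_strategy(topic, search, llm) would stream the notifications and capture the final report, with the LLM invoked only inside the localized llm(...) calls.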

What example strategies are available?

NVIDIA ships UDR with three template strategies:

Minimal – Generate a few search queries, gather results, and compile a concise report.

Expansive – Explore multiple topics in parallel for broader coverage.

Intensive – Iteratively refine queries using evolving subcontexts, ideal for deep dives.

These serve as starting points, but the framework allows users to encode entirely custom workflows.

https://arxiv.org/pdf/2509.00244

What outputs does UDR generate?

UDR produces two key outputs:

Structured Notifications – Progress updates (with type, timestamp, and description) for transparency.

Final Report – A Markdown-formatted research document, complete with sections, tables, and references.

This design gives users both auditability and reproducibility, unlike opaque agentic systems.
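As a hedged sketch of how these two outputs could be consumed downstream (the event shapes are assumptions based on the description above, not the actual UDR schema):

def collect_outputs(events):
    # Separate progress notifications from the final Markdown report
    notifications = []
    final_report_md = None
    for event in events:
        notifications.append(event)
        print(f'[{event["timestamp"]}] {event["type"]}: {event["description"][:80]}')
        if event["type"] == "report":
            final_report_md = event["description"]  # Markdown report with sections, tables, references
    return notifications, final_report_md

Keeping the full notification log alongside the report is what provides the audit trail described above.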

Where can UDR be applied?

UDR’s general-purpose design makes it adaptable across domains:

Scientific discovery: structured literature reviews.

Enterprise due diligence: validation against filings and datasets.

Business intelligence: market analysis pipelines.

Startups: custom assistants built without retraining LLMs.

By separating model choice from research logic, UDR supports innovation in both dimensions.

Summary

Universal Deep Research signals a shift from model-centric to system-centric AI agents. By giving users direct control over workflows, NVIDIA enables customizable, efficient, and auditable research systems.

For startups and enterprises, UDR provides a foundation for building domain-specific assistants without the cost of model retraining—opening new opportunities for innovation across industries.

Check out the PAPER, PROJECT and CODE. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post NVIDIA AI Releases Universal Deep Research (UDR): A Prototype Framework for Scalable and Auditable Deep Research Agents appeared first on MarkTechPost.

Baidu Releases ERNIE-4.5-21B-A3B-Thinking: A Compact MoE Model for Dee …

The Baidu AI Research team has just released ERNIE-4.5-21B-A3B-Thinking, a new reasoning-focused large language model designed around efficiency, long-context reasoning, and tool integration. As part of the ERNIE-4.5 family, the model uses a Mixture-of-Experts (MoE) architecture with 21B total parameters but only 3B active parameters per token, making it computationally efficient while maintaining competitive reasoning capability. Released under the Apache-2.0 license, it is accessible for both research and commercial deployment via Hugging Face.

What is the architectural design of ERNIE-4.5-21B-A3B-Thinking?

ERNIE-4.5-21B-A3B-Thinking is built on a Mixture-of-Experts backbone. Instead of activating all 21B parameters, the router selects a subset of experts, resulting in 3B active parameters per token. This structure reduces computation without compromising the specialization of different experts. The research team applies router orthogonalization loss and token-balanced loss to encourage diverse expert activation and stable training.

This design provides a middle ground between small dense models and ultra-large systems. The research team hypothesizes that roughly 3B active parameters per token is a practical sweet spot between reasoning performance and deployment efficiency.
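For intuition only, the following is a minimal sketch of top-k sparse MoE routing in the spirit described above; the dimensions, expert count, and k are made-up illustrative values, and the auxiliary router-orthogonalization and token-balanced losses are not shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, k=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                      # x: (num_tokens, d_model)
        gate_logits = self.router(x)           # (num_tokens, n_experts)
        weights, idx = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):             # only k experts run per token
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

Only the selected experts' feed-forward weights participate in each token's forward pass, which is why total parameter count and active parameter count diverge.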

How does the model handle long-context reasoning?

A defining capability of ERNIE-4.5-21B-A3B-Thinking is its 128K context length. This allows the model to process very long documents, perform extended multi-step reasoning, and integrate structured data sources such as academic papers or multi-file codebases.

The research team achieves this through progressive scaling of Rotary Position Embeddings (RoPE)—gradually increasing the frequency base from 10K up to 500K during training. Additional optimizations, including FlashMask attention and memory-efficient scheduling, make these long-context operations computationally feasible.
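The core RoPE lever here is the frequency base: raising it stretches the rotary wavelengths so positional phases stay distinguishable over longer sequences. A small illustrative calculation follows; the head dimension and bases are assumptions for demonstration only.

import math

def rope_inv_freq(head_dim, base):
    # Standard RoPE inverse frequencies: 1 / base**(2i/d) for each channel pair
    return [1.0 / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]

for base in (10_000, 100_000, 500_000):
    slowest = rope_inv_freq(128, base)[-1]
    print(f"base={base:>7}: longest rotary wavelength ~ {2 * math.pi / slowest:,.0f} positions")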

What training strategy supports its reasoning?

The model follows the multi-stage recipe defined across the ERNIE-4.5 family:

Stage I – Text-only pretraining builds the core language backbone, starting with 8K context and expanding to 128K.

Stage II – Vision training is skipped for this text-only variant.

Stage III – Joint multimodal training is not used here, as A3B-Thinking is purely textual.

Post-training focuses on reasoning tasks. The research team employs Supervised Fine-Tuning (SFT) across mathematics, logic, coding, and science, followed by Progressive Reinforcement Learning (PRL). Reinforcement stages begin with logic, then extend to mathematics and programming, and finally to broader reasoning tasks. This is enhanced by Unified Preference Optimization (UPO), which integrates preference learning with PPO to stabilize alignment and reduce reward hacking.

What role does tool usage play in this model?

ERNIE-4.5-21B-A3B-Thinking supports structured tool and function calling, making it useful for scenarios where external computation or retrieval is required. Developers can integrate it with vLLM, Transformers 4.54+, and FastDeploy. This tool-use capability is particularly suited for program synthesis, symbolic reasoning, and multi-agent workflows.

Built-in function calling allows the model to reason over long contexts while dynamically invoking external APIs, a key requirement for applied reasoning in enterprise systems.
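As a hedged starting point (generation settings and chat formatting are assumptions; consult the model card for the exact tool-calling conventions), loading the checkpoint with Hugging Face Transformers might look like this:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "baidu/ERNIE-4.5-21B-A3B-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="auto")

messages = [{"role": "user", "content": "Outline a plan to compute 17 * 23, then give the answer."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))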

How does ERNIE-4.5-21B-A3B-Thinking perform on reasoning benchmarks?

It shows strong performance improvements across logical reasoning, mathematics, scientific QA, and programming tasks. In evaluations, the model demonstrates:

Enhanced accuracy in multi-step reasoning datasets, where long chains of thought are required.

Competitiveness with larger dense models on STEM reasoning tasks.

Stable text generation and academic synthesis performance, benefiting from extended context training.

These results suggest that the MoE structure amplifies reasoning specialization, making it efficient without requiring trillion-scale dense parameters.

https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking

How does it compare to other reasoning-focused LLMs?

This release enters a landscape that includes OpenAI's o3, Anthropic's Claude 4, DeepSeek-R1, and Qwen-3. Many of these competitors rely on dense architectures or larger active parameter counts. The Baidu research team's choice of a compact MoE with 3B active parameters offers a different balance:

Scalability: Sparse activation reduces compute overhead while scaling expert capacity.

Long-context readiness: 128K context is directly trained, not retrofitted.

Commercial openness: Apache-2.0 license lowers adoption friction for enterprises.

Summary

ERNIE-4.5-21B-A3B-Thinking shows how deep reasoning can be achieved without massive dense parameter counts. By combining efficient MoE routing, 128K context training, and tool integration, Baidu's research team offers a model that balances research-grade reasoning with deployment feasibility.

Check out the Model on Hugging Face and PAPER. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Baidu Releases ERNIE-4.5-21B-A3B-Thinking: A Compact MoE Model for Deep Reasoning appeared first on MarkTechPost.

TII Falcon-H1 models now available on Amazon Bedrock Marketplace and A …

This post was co-authored with Jingwei Zuo from TII.
We are excited to announce the availability of the Technology Innovation Institute (TII)’s Falcon-H1 models on Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. With this launch, developers and data scientists can now use six instruction-tuned Falcon-H1 models (0.5B, 1.5B, 1.5B-Deep, 3B, 7B, and 34B) on AWS, and have access to a comprehensive suite of hybrid architecture models that combine traditional attention mechanisms with State Space Models (SSMs) to deliver exceptional performance with unprecedented efficiency.
In this post, we present an overview of Falcon-H1 capabilities and show how to get started with TII’s Falcon-H1 models on both Amazon Bedrock Marketplace and SageMaker JumpStart.
Overview of TII and AWS collaboration
TII is a leading research institute based in Abu Dhabi. As part of UAE’s Advanced Technology Research Council (ATRC), TII focuses on advanced technology research and development across AI, quantum computing, autonomous robotics, cryptography, and more. TII employs international teams of scientists, researchers, and engineers in an open and agile environment, aiming to drive technological innovation and position Abu Dhabi and the UAE as a global research and development hub in alignment with the UAE National Strategy for Artificial Intelligence 2031.
TII and Amazon Web Services (AWS) are collaborating to expand access to made-in-the-UAE AI models across the globe. By combining TII’s technical expertise in building large language models (LLMs) with AWS Cloud-based AI and machine learning (ML) services, professionals worldwide can now build and scale generative AI applications using the Falcon-H1 series of models.
About Falcon-H1 models
The Falcon-H1 architecture implements a parallel hybrid design, using elements from Mamba and Transformer architectures to combine the faster inference and lower memory footprint of SSMs like Mamba with the effectiveness of Transformers’ attention mechanism in understanding context and enhanced generalization capabilities. The Falcon-H1 architecture scales across multiple configurations ranging from 0.5–34 billion parameters and provides native support for 18 languages. According to TII, the Falcon-H1 family demonstrates notable efficiency with published metrics indicating that smaller model variants achieve performance parity with larger models. Some of the benefits of Falcon-H1 series include:

Performance – The hybrid attention-SSM model has optimized parameters with adjustable ratios between attention and SSM heads, leading to faster inference, lower memory usage, and strong generalization capabilities. According to TII benchmarks published in Falcon-H1’s technical blog post and technical report, Falcon-H1 models demonstrate superior performance across multiple scales against other leading Transformer models of similar or larger scales. For example, Falcon-H1-0.5B delivers performance similar to typical 7B models from 2024, and Falcon-H1-1.5B-Deep rivals many of the current leading 7B-10B models.
Wide range of model sizes – The Falcon-H1 series includes six sizes: 0.5B, 1.5B, 1.5B-Deep, 3B, 7B, and 34B, with both base and instruction-tuned variants. The Instruct models are now available in Amazon Bedrock Marketplace and SageMaker JumpStart.
Multilingual by design – The models support 18 languages natively (Arabic, Czech, German, English, Spanish, French, Hindi, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Romanian, Russian, Swedish, Urdu, and Chinese) and can scale to over 100 languages according to TII, thanks to a multilingual tokenizer trained on diverse language datasets.
Up to 256,000-token context length – The Falcon-H1 series enables applications in long-document processing, multi-turn dialogue, and long-range reasoning, showing a distinct advantage over competitors in practical long-context applications like Retrieval Augmented Generation (RAG).
Robust data and training strategy – Training of Falcon-H1 models employs an innovative approach that introduces complex data early on, contrary to traditional curriculum learning. It also implements strategic data reuse based on careful memorization window assessment. Additionally, the training process scales smoothly across model sizes through a customized Maximal Update Parametrization (µP) recipe, specifically adapted for this novel architecture.
Balanced performance in science and knowledge-intensive domains – Through a carefully designed data mixture and regular evaluations during training, the model achieves strong general capabilities and broad world knowledge while minimizing unintended specialization or domain-specific biases.

In line with their mission to foster AI accessibility and collaboration, TII has released Falcon-H1 models under the Falcon LLM license. It offers the following benefits:

Open source nature and accessibility
Multi-language capabilities
Cost-effectiveness compared to proprietary models
Energy-efficiency

About Amazon Bedrock Marketplace and SageMaker JumpStart
Amazon Bedrock Marketplace offers access to over 100 popular, emerging, specialized, and domain-specific models, so you can find the best proprietary and publicly available models for your use case based on factors such as accuracy, flexibility, and cost. On Amazon Bedrock Marketplace you can discover models in a single place and access them through unified and secure Amazon Bedrock APIs. You can also select your desired number of instances and the instance type to meet the demands of your workload and optimize your costs.
SageMaker JumpStart helps you quickly get started with machine learning. It provides access to state-of-the-art model architectures, such as language models, computer vision models, and more, without having to build them from scratch. With SageMaker JumpStart you can deploy models in a secure environment by provisioning them on SageMaker inference instances and isolating them within your virtual private cloud (VPC). You can also use Amazon SageMaker AI to further customize and fine-tune the models and streamline the entire model deployment process.
Solution overview
This post demonstrates how to deploy a Falcon-H1 model using both Amazon Bedrock Marketplace and SageMaker JumpStart. Although we use Falcon-H1-0.5B as an example, you can apply these steps to other models in the Falcon-H1 series. For help determining which deployment option—Amazon Bedrock Marketplace or SageMaker JumpStart—best suits your specific requirements, see Amazon Bedrock or Amazon SageMaker AI?
Deploy Falcon-H1-0.5B-Instruct with Amazon Bedrock Marketplace
In this section, we show how to deploy the Falcon-H1-0.5B-Instruct model in Amazon Bedrock Marketplace.
Prerequisites
To try the Falcon-H1-0.5B-Instruct model in Amazon Bedrock Marketplace, you must have access to an AWS account that will contain your AWS resources. Prior to deploying Falcon-H1-0.5B-Instruct, verify that your AWS account has sufficient quota allocation for ml.g6.xlarge instances. The default quota for endpoints using several instance types and sizes is 0, so attempting to deploy the model without a higher quota will trigger a deployment failure.
To request a quota increase, open the AWS Service Quotas console and search for Amazon SageMaker. Locate ml.g6.xlarge for endpoint usage and choose Request quota increase, then specify your required limit value. After the request is approved, you can proceed with the deployment.
Deploy the model using the Amazon Bedrock Marketplace UI
To deploy the model using Amazon Bedrock Marketplace, complete the following steps:

On the Amazon Bedrock console, under Discover in the navigation pane, choose Model catalog.
Filter for Falcon-H1 as the model name and choose Falcon-H1-0.5B-Instruct.

The model overview page includes information about the model’s license terms, features, setup instructions, and links to further resources.

Review the model license terms, and if you agree with the terms, choose Deploy.

For Endpoint name, enter an endpoint name or leave it as the default pre-populated name.
To minimize costs while experimenting, set the Number of instances to 1.
For Instance type, choose from the list of compatible instance types. Falcon-H1-0.5B-Instruct is an efficient model, so ml.g6.xlarge is sufficient for this exercise.

Although the default configurations are typically sufficient for basic needs, you can customize advanced settings like VPC, service access permissions, encryption keys, and resource tags. These advanced settings might require adjustment for production environments to maintain compliance with your organization’s security protocols.

Choose Deploy.
A prompt asks you to stay on the page while the AWS Identity and Access Management (IAM) role is being created. If your AWS account lacks sufficient quota for the selected instance type, you’ll receive an error message. In this case, refer to the preceding prerequisite section to increase your quota, then try the deployment again.

While deployment is in progress, you can choose Marketplace model deployments in the navigation pane to monitor the deployment progress in the Managed deployment section. When the deployment is complete, the endpoint status will change from Creating to In Service.
Interact with the model in the Amazon Bedrock Marketplace playground
You can now test Falcon-H1 capabilities directly in the Amazon Bedrock playground by selecting the managed deployment and choosing Open in playground.

You can now use the Amazon Bedrock Marketplace playground to interact with Falcon-H1-0.5B-Instruct.
Invoke the model using code
In this section, we demonstrate how to invoke the model using the Amazon Bedrock Converse API.
Replace the placeholder code with the endpoint’s Amazon Resource Name (ARN), which begins with arn:aws:sagemaker. You can find this ARN on the endpoint details page in the Managed deployments section.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
endpoint_arn = "{ENDPOINT ARN}"  # Replace with endpoint ARN

response = bedrock_runtime.converse(
    modelId=endpoint_arn,
    messages=[{"role": "user", "content": [{"text": "What is generative AI?"}]}],
    inferenceConfig={"temperature": 0.1, "topP": 0.1},
)

print(response["output"]["message"]["content"][0]["text"])

To learn more about the detailed steps and example code for invoking the model using Amazon Bedrock APIs, refer to Submit prompts and generate response using the API.
Deploy Falcon-H1-0.5B-Instruct with SageMaker JumpStart
You can access FMs in SageMaker JumpStart through Amazon SageMaker Studio, the SageMaker SDK, and the AWS Management Console. In this walkthrough, we demonstrate how to deploy Falcon-H1-0.5B-Instruct using the SageMaker Python SDK. Refer to Deploy a model in Studio to learn how to deploy the model through SageMaker Studio.
Prerequisites
To deploy Falcon-H1-0.5B-Instruct with SageMaker JumpStart, you must have the following prerequisites:

An AWS account that will contain your AWS resources.
An IAM role to access SageMaker AI. To learn more about how IAM works with SageMaker AI, see Identity and Access Management for Amazon SageMaker AI.
Access to SageMaker Studio with a JupyterLab space, or an interactive development environment (IDE) such as Visual Studio Code or PyCharm.

Deploy the model programmatically using the SageMaker Python SDK
Before deploying Falcon-H1-0.5B-Instruct using the SageMaker Python SDK, make sure you have installed the SDK and configured your AWS credentials and permissions.
The following code example demonstrates how to deploy the model:

import sagemaker
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker import Session
import boto3
import json

# Initialize SageMaker session
session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Specify model parameters
model_id = "huggingface-llm-falcon-h1-0-5b-instruct"
instance_type = "ml.g6.xlarge"  # Choose appropriate instance based on your needs

# Create the model
model = JumpStartModel(
    model_id=model_id,
    role=role,
    instance_type=instance_type,
    model_version="*",  # Latest version
)

# Deploy the model
predictor = model.deploy(
    initial_instance_count=1,
    accept_eula=True,  # Required for deploying foundation models
)

print("Endpoint name:")
print(predictor.endpoint_name)

Perform inference using the SageMaker Python API

When the previous code segment completes successfully, the Falcon-H1-0.5B-Instruct model deployment is complete and available on a SageMaker endpoint. Note the endpoint name shown in the output—you will replace the placeholder in the following code segment with this value. The following code demonstrates how to prepare the input data, make the inference API call, and process the model's response:

import json
import boto3

session = boto3.Session()  # Make sure your AWS credentials are configured
sagemaker_runtime = session.client("sagemaker-runtime")

endpoint_name = "{ENDPOINT_NAME}"  # Replace with endpoint name from deployment output

payload = {
    "messages": [
        {"role": "user", "content": "What is generative AI?"}
    ],
    "parameters": {
        "max_tokens": 256,
        "temperature": 0.1,
        "top_p": 0.1
    }
}

# Perform inference
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)

# Parse the response
result = json.loads(response["Body"].read().decode("utf-8"))
generated_text = result["choices"][0]["message"]["content"].strip()
print("Generated Response:")
print(generated_text)

Clean up
To avoid ongoing charges for AWS resources used while experimenting with Falcon-H1 models, make sure to delete all deployed endpoints and their associated resources when you’re finished. To do so, complete the following steps:

Delete Amazon Bedrock Marketplace resources:

On the Amazon Bedrock console, choose Marketplace model deployment in the navigation pane.
Under Managed deployments, choose the Falcon-H1 model endpoint you deployed earlier.
Choose Delete and confirm the deletion if you no longer need to use this endpoint in Amazon Bedrock Marketplace.

Delete SageMaker endpoints:

On the SageMaker AI console, in the navigation pane, choose Endpoints under Inference.
Select the endpoint associated with the Falcon-H1 models.
Choose Delete and confirm the deletion. This stops the endpoint and avoids further compute charges.

Delete SageMaker models:

On the SageMaker AI console, choose Models under Inference.
Select the model associated with your endpoint and choose Delete.

Always verify that all endpoints are deleted after experimentation to optimize costs. Refer to the Amazon SageMaker documentation for additional guidance on managing resources.
Conclusion
The availability of Falcon-H1 models in Amazon Bedrock Marketplace and SageMaker JumpStart helps developers, researchers, and businesses build cutting-edge generative AI applications with ease. Falcon-H1 models offer multilingual support (18 languages) across various model sizes (from 0.5B to 34B parameters) and support up to 256K context length, thanks to their efficient hybrid attention-SSM architecture.
By using the seamless discovery and deployment capabilities of Amazon Bedrock Marketplace and SageMaker JumpStart, you can accelerate your AI innovation while benefiting from the secure, scalable, and cost-effective AWS Cloud infrastructure.
We encourage you to explore the Falcon-H1 models in Amazon Bedrock Marketplace or SageMaker JumpStart. You can use these models in AWS Regions where Amazon Bedrock or SageMaker JumpStart and the required instance types are available.
For further learning, explore the AWS Machine Learning Blog, SageMaker JumpStart GitHub repository, and Amazon Bedrock User Guide. Start building your next generative AI application with Falcon-H1 models and unlock new possibilities with AWS!
Special thanks to everyone who contributed to the launch: Evan Kravitz, Varun Morishetty, and Yotam Moss.

About the authors
Mehran Nikoo leads the Go-to-Market strategy for Amazon Bedrock and agentic AI in EMEA at AWS, where he has been driving the development of AI systems and cloud-native solutions over the last four years. Prior to joining AWS, Mehran held leadership and technical positions at Trainline, McLaren, and Microsoft. He holds an MBA from Warwick Business School and an MRes in Computer Science from Birkbeck, University of London.
Mustapha Tawbi is a Senior Partner Solutions Architect at AWS, specializing in generative AI and ML, with 25 years of enterprise technology experience across AWS, IBM, Sopra Group, and Capgemini. He has a PhD in Computer Science from Sorbonne and a Master’s degree in Data Science from Heriot-Watt University Dubai. Mustapha leads generative AI technical collaborations with AWS partners throughout the MENAT region.
Jingwei Zuo is a Lead Researcher at the Technology Innovation Institute (TII) in the UAE, where he leads the Falcon Foundational Models team. He received his PhD in 2022 from University of Paris-Saclay, where he was awarded the Plateau de Saclay Doctoral Prize. He holds an MSc (2018) from the University of Paris-Saclay, an Engineer degree (2017) from Sorbonne Université, and a BSc from Huazhong University of Science & Technology.
John Liu is a Principal Product Manager for Amazon Bedrock at AWS. Previously, he served as the Head of Product for AWS Web3/Blockchain. Prior to joining AWS, John held various product leadership roles at public blockchain protocols and financial technology (fintech) companies for 14 years. He also has nine years of portfolio management experience at several hedge funds.
Hamza MIMI is a Solutions Architect for partners and strategic deals in the MENAT region at AWS, where he bridges cutting-edge technology with impactful business outcomes. With expertise in AI and a passion for sustainability, he helps organizations architect innovative solutions that drive both digital transformation and environmental responsibility, transforming complex challenges into opportunities for growth and positive change.

Oldcastle accelerates document processing with Amazon Bedrock

This post was written with Avdhesh Paliwal of Oldcastle APG.
Oldcastle APG, one of the largest global networks of manufacturers in the architectural products industry, was grappling with an inefficient and labor-intensive process for handling proof of delivery (POD) documents, known as ship tickets. The company was processing 100,000–300,000 ship tickets per month across more than 200 facilities. Their existing optical character recognition (OCR) system was unreliable, requiring constant maintenance and manual intervention. It could only accurately read 30–40% of the documents, leading to significant time and resource expenditure.
This post explores how Oldcastle partnered with AWS to transform their document processing workflow using Amazon Bedrock with Amazon Textract. We discuss how Oldcastle overcame the limitations of their previous OCR solution to automate the processing of hundreds of thousands of POD documents each month, dramatically improving accuracy while reducing manual effort. This solution demonstrates a practical, scalable approach that can be adapted to your specific needs, such as similar challenges addressing document processing or using generative AI for business process optimization.
Challenges with document processing
The primary challenge for Oldcastle was to find a solution that could accomplish the following:

Accurately process a high volume of ship tickets (PODs) with minimal human intervention
Scale to handle 200,000–300,000 documents per month
Handle inconsistent inputs like rotated pages and variable formatting
Improve the accuracy of data extraction from the current 30–40% to a much higher rate
Add new capabilities like signature validation on PODs
Provide real-time visibility into outstanding PODs and deliveries

Additionally, Oldcastle needed a solution for processing supplier invoices and matching them against purchase orders, which presented similar challenges due to varying document formats. The existing process required dispatchers at more than 200 facilities to spend 4–5 hours daily manually processing ship tickets. This consumed valuable human resources and led to delays in processing and potential errors in data entry. The IT team was burdened with constant maintenance and development efforts to keep the unreliable OCR system functioning.
Solution overview
AWS Solutions Architects worked closely with Oldcastle engineers to build a solution addressing these challenges. The end-to-end workflow uses Amazon Simple Email Service (Amazon SES) to receive ship tickets, which are sent directly from drivers in the field. The system processes emails at scale using an event-based architecture centered on Amazon S3 Event Notifications. The workflow sends ship ticket documents to an automatic scaling compute job orchestrator. Documents are processed with the following steps:

The system sends PDF files to Amazon Textract using the Start Document Analysis API with Layout and Signature features.
Amazon Textract results are processed by an AWS Lambda microservice. This microservice resolves rotation issues with page text and generates a collection of pages of markdown representation of the text.
The markdown is passed to Amazon Bedrock, which efficiently extracts key values from the markdown text.
The orchestrator saves the results to their Amazon Relational Database Service (Amazon RDS) for PostgreSQL database.

The following diagram illustrates the solution architecture.

In this architecture, Amazon Textract is an effective solution to handle large PDF files at scale. The output of Amazon Textract contains the necessary geometries used to calculate rotation and fix layout issues before generating markdown. Quality markdown layouts are critical for Amazon Bedrock in identifying the right key-value pairs from the content. We further optimized cost by extracting only the data needed to limit output tokens and by using Amazon Bedrock batch processing to get the lowest token cost. Amazon Bedrock was used for its cost-effectiveness and its ability to process shipping tickets in varying formats where the fields that need to be extracted are the same.
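For orientation, step 1 of the pipeline could be kicked off with a boto3 call along these lines; the bucket and object names are placeholders, not Oldcastle's actual resources.

import boto3

textract = boto3.client("textract")

response = textract.start_document_analysis(
    DocumentLocation={"S3Object": {"Bucket": "example-ship-tickets", "Name": "incoming/ticket-0001.pdf"}},
    FeatureTypes=["LAYOUT", "SIGNATURES"],
)
job_id = response["JobId"]  # the Lambda microservice later retrieves results with get_document_analysis
print("Started Textract job:", job_id)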
Results
The implementation using this architecture on AWS brought numerous benefits to Oldcastle:

Business process improvement – The solution accomplished the following:

Alleviated the need for manual processing of ship tickets at each facility
Automated document processing with minimal human intervention
Improved accuracy and reliability of data extraction
Enhanced ability to validate signatures and reject incomplete documents
Provided real-time visibility into outstanding PODs and deliveries

Productivity gains – Oldcastle saw the following benefits:

Significantly fewer human hours were spent on manual data entry and document processing
Staff had more time for more value-added activities
The IT team benefited from reduced development and maintenance efforts

Scalability and performance – The team experienced the following performance gains:

They seamlessly scaled from processing a few thousand documents to 200,000–300,000 documents per month
The team observed no performance issues with increased volume

User satisfaction – The solution improved user sentiment in several ways:

High user confidence in the new system due to its accuracy and reliability
Positive feedback from business users on the ease of use and effectiveness

Cost-effective – With this approach, Oldcastle can process documents at less than $0.04 per page

Conclusion
With the success of the AWS implementation, Oldcastle is exploring potential expansion to other use cases such as AP invoice processing, W9 form validation, and automated document approval workflows. This strategic move towards AI-powered document processing is positioning Oldcastle for improved efficiency and scalability in its operations.
Review your current manual document processing procedures and identify where intelligent document processing can help you automate these workflows for your business.
For further exploration and learning, we recommend checking out the following resources:

Intelligent Document Processing on AWS
Automate document processing with Amazon Bedrock Prompt Flows
Intelligent Document Processing with Generative AI

About the authors
Erik Cordsen is a Solutions Architect at AWS serving customers in Georgia. He is passionate about applying cloud technologies and ML to solve real life problems. When he is not designing cloud solutions, Erik enjoys travel, cooking, and cycling.
Sourabh Jain is a Senior Solutions Architect with over 8 years of experience developing cloud solutions that drive better business outcomes for organizations worldwide. He specializes in architecting and implementing robust cloud software solutions, with extensive experience working alongside global Fortune 500 teams across diverse time zones and cultures.
Avdhesh Paliwal is an accomplished Application Architect at Oldcastle APG with 29 years of extensive ERP experience. His expertise spans Manufacturing, Supply Chain, and Human Resources modules, with a proven track record of designing and implementing enterprise solutions that drive operational efficiency and business value.

How London Stock Exchange Group is detecting market abuse with their A …

London Stock Exchange Group (LSEG) is a global provider of financial markets data and infrastructure. It operates the London Stock Exchange and manages international equity, fixed income, and derivative markets. The group also develops capital markets software, offers real-time and reference data products, and provides extensive post-trade services. This post was co-authored with Charles Kellaway and Rasika Withanawasam of LSEG.
Financial markets are remarkably complex, hosting increasingly dynamic investment strategies across new asset classes and interconnected venues. Accordingly, regulators place great emphasis on the ability of market surveillance teams to keep pace with evolving risk profiles. However, the landscape is vast; London Stock Exchange alone facilitates the trading and reporting of over £1 trillion of securities by 400 members annually. Effective monitoring must cover all MiFID asset classes, markets and jurisdictions to detect market abuse, while also giving weight to participant relationships, and market surveillance systems must scale with volumes and volatility. As a result, many systems are outdated and unsatisfactory for regulatory expectations, requiring manual and time-consuming work.
To address these challenges, London Stock Exchange Group (LSEG) has developed an innovative solution using Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models from leading AI companies, to automate and enhance their market surveillance capabilities. LSEG’s AI-powered Surveillance Guide helps analysts efficiently review trades flagged for potential market abuse by automatically analyzing news sensitivity and its impact on market behavior.
In this post, we explore how LSEG used Amazon Bedrock and Anthropic’s Claude foundation models to build an automated system that significantly improves the efficiency and accuracy of market surveillance operations.
The challenge
Currently, LSEG’s surveillance monitoring systems generate automated, customized alerts to flag suspicious trading activity to the Market Supervision team. Analysts then conduct initial triage assessments to determine whether the activity warrants further investigation, which might require undertaking differing levels of qualitative analysis. This could involve manual collation of all and any evidence that might be applicable when methodically corroborating regulation, news, sentiment and trading activity. For example, during an insider dealing investigation, analysts are alerted to statistically significant price movements. The analyst must then conduct an initial assessment of related news during the observation period to determine if the highlighted price move has been caused by specific news and its likely price sensitivity. This initial step in assessing the presence, or absence, of price sensitive news guides the subsequent actions an analyst will take with a possible case of market abuse.
Initial triaging can be a time-consuming and resource-intensive process and still necessitate a full investigation if the identified behavior remains potentially suspicious or abusive.
Moreover, the dynamic nature of financial markets and evolving tactics and sophistication of bad actors demand that market facilitators revisit automated rules-based surveillance systems. The increasing frequency of alerts and high number of false positives adversely impact an analyst’s ability to devote quality time to the most meaningful cases, and such heightened emphasis on resources could result in operational delays.
Solution overview
To address these challenges, LSEG collaborated with AWS to improve insider dealing detection, developing a generative AI prototype that automatically predicts the probability of news articles being price sensitive. The system employs Anthropic’s Claude Sonnet 3.5 model—the most price performant model at the time—through Amazon Bedrock to analyze news content from LSEG’s Regulatory News Service (RNS) and classify articles based on their potential market impact. The results support analysts to more quickly determine whether highlighted trading activity can be mitigated during the observation period.
The architecture consists of three main components:

A data ingestion and preprocessing pipeline for RNS articles
Amazon Bedrock integration for news analysis using Claude Sonnet 3.5
Inference application for visualising results and predictions

The following diagram illustrates the conceptual approach:

The workflow processes news articles through the following steps:

Ingest raw RNS news documents in HTML format
Preprocess and extract clean news text
Fill the classification prompt template with text from the news documents
Prompt Anthropic’s Claude Sonnet 3.5 through Amazon Bedrock
Receive and process model predictions and justifications
Present results through the visualization interface developed using Streamlit
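To illustrate steps 3 through 5 above, a hedged sketch of filling the classification prompt and calling Claude through Amazon Bedrock follows; the model identifier, prompt wording, and inference settings are assumptions rather than LSEG's production configuration.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def classify_article(article_text):
    prompt = (
        "Summarize the following RNS article, classify it as PRICE_SENSITIVE, "
        "NOT_PRICE_SENSITIVE, or OTHER, and justify the classification step by step.\n\n"
        f"Article:\n{article_text}"
    )
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # assumed model identifier
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]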

Methodology
The team collated a comprehensive dataset of approximately 250,000 RNS articles spanning 6 consecutive months of trading activity in 2023. The raw data—HTML documents from RNS—were initially pre-processed within the AWS environment by removing extraneous HTML elements and formatted to extract clean textual content. Having isolated substantive news content, the team subsequently carried out exploratory data analysis to understand distribution patterns within the RNS corpus, focused on three dimensions:

News categories: Distribution of articles across different regulatory categories
Instruments: Financial instruments referenced in the news articles
Article length: Statistical distribution of document sizes

Exploration provided contextual understanding of the news landscape and informed the sampling strategy in creating a representative evaluation dataset. 110 articles were selected to cover major news categories, and this curated subset was presented to market surveillance analysts who, as domain experts, evaluated each article's price sensitivity on the following nine-point scale:

1–3: PRICE_NOT_SENSITIVE – Low probability of price sensitivity
4–6: HARD_TO_DETERMINE – Uncertain price sensitivity
7–9: PRICE_SENSITIVE – High probability of price sensitivity

The experiment was executed within Amazon SageMaker using Jupyter Notebooks as the development environment. The technical stack consisted of:

Instructor library: Provided integration capabilities with Anthropic’s Claude Sonnet 3.5 model in Amazon Bedrock
Amazon Bedrock: Served as the API infrastructure for model access
Custom data processing pipelines (Python): For data ingestion and preprocessing

This infrastructure enabled systematic experimentation with various algorithmic approaches, including traditional supervised learning methods, prompt engineering with foundation models, and fine-tuning scenarios.
The evaluation framework established specific technical success metrics:

Data pipeline implementation: Successful ingestion and preprocessing of RNS data
Metric definition: Clear articulation of precision, recall, and F1 metrics
Workflow completion: Execution of comprehensive exploratory data analysis (EDA) and experimental workflows
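For concreteness, the metrics listed above can be computed against the expert-labeled evaluation subset with scikit-learn; the label values below are invented purely for illustration.

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["PRICE_SENSITIVE", "PRICE_NOT_SENSITIVE", "PRICE_SENSITIVE", "HARD_TO_DETERMINE"]
y_pred = ["PRICE_SENSITIVE", "PRICE_NOT_SENSITIVE", "HARD_TO_DETERMINE", "HARD_TO_DETERMINE"]

for name, metric in [("precision", precision_score), ("recall", recall_score), ("F1", f1_score)]:
    score = metric(y_true, y_pred, labels=["PRICE_SENSITIVE"], average="macro", zero_division=0)
    print(f"{name} (PRICE_SENSITIVE): {score:.2f}")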

The analytical approach was a two-step classification process, as shown in the following figure:

Step 1: Classify news articles as potentially price sensitive or other
Step 2: Classify news articles as potentially price not sensitive or other

This multi-stage architecture was designed to maximize classification accuracy by allowing analysts to focus on specific aspects of price sensitivity at each stage. The results from each step were then merged to produce the final output, which was compared with the human-labeled dataset to generate quantitative results.
To consolidate the results from both classification steps, the data merging rules followed were:

Step 1: Sensitive, Step 2: Other → Final: Sensitive
Step 1: Other, Step 2: Non-sensitive → Final: Non-sensitive
Step 1: Other, Step 2: Other → Final: Ambiguous – requires manual review (i.e., Hard to Determine)
Step 1: Sensitive, Step 2: Non-sensitive → Final: Ambiguous – requires manual review (i.e., Hard to Determine)
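In code form, the merging rules above reduce to a small function; the function and label names are illustrative.

def merge_classifications(step1, step2):
    if step1 == "Sensitive" and step2 == "Other":
        return "Sensitive"
    if step1 == "Other" and step2 == "Non-sensitive":
        return "Non-sensitive"
    # Other+Other or Sensitive+Non-sensitive: the two passes abstain or disagree
    return "Hard to Determine (manual review)"

print(merge_classifications("Sensitive", "Other"))  # Sensitive
print(merge_classifications("Other", "Other"))      # Hard to Determine (manual review)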

Based on the insights gathered, prompts were optimized. The prompt templates elicited three key components from the model:

A concise summary of the news article
A price sensitivity classification
A chain-of-thought explanation justifying the classification decision

The following is an example prompt:

system_non_sensitive = """
You are an expert financial analyst with deep knowledge of market dynamics, investor
psychology, and the intricate relationships between news events and asset prices.
Your core function is to analyze news articles and assess their likelihood of being
non-price sensitive with unparalleled accuracy and insight.
Key aspects of your expertise include:
1. Market Dynamics: You have a comprehensive understanding of how financial markets
operate, including the factors that typically drive price movements and those that
are often overlooked by the market.
2. Investor Psychology: You possess keen insight into how different types of news affect
investor sentiment and decision-making, particularly in distinguishing between
information that causes reactions and information that doesn’t.
3. News Analysis: You excel at dissecting financial news articles, identifying key
elements, and determining their relevance (or lack thereof) to asset valuations and
market movements.
4. Pattern Recognition: You can draw upon a vast knowledge of historical market
reactions to various types of news, allowing you to identify patterns of
non-impactful information.
5. Sector-Specific Knowledge: You understand the nuances of different industry sectors
and how the importance of news can vary across them.
6. Regulatory Insight: You’re well-versed in financial regulations and can identify when
news does or doesn’t meet thresholds for material information.
7. Macroeconomic Perspective: You can place company-specific news in the broader context
of economic trends and assess whether it’s likely to be overshadowed by larger market
forces.
8. Quantitative Skills: You can evaluate financial metrics and understand when changes or
announcements related to them are significant enough to impact prices.
Your primary task is to analyze given news articles and determine, with a high degree of
confidence, whether they are likely to be non-price sensitive. This involves:
– Carefully examining the content and context of each news item
– Assessing its potential (or lack thereof) to influence investor decisions
– Considering both short-term and long-term implications
– Providing clear, well-reasoned justifications for your assessments
– Identifying key factors that support your conclusion
– Recommending further information that could enhance the analysis
– Offering insights that can help traders make more informed decisions
You should always maintain a conservative approach, erring on the side of caution. If
there’s any reasonable doubt about whether news could be price-sensitive, you should
classify it as ‘OTHER’ rather than ‘NOT_PRICE_SENSITIVE’.
Your analyses should be sophisticated yet accessible, catering to both experienced
traders and those new to the market. Always strive for objectivity, acknowledging any
uncertainties or limitations in your assessment.
Remember, your insights play a crucial role in helping traders filter out market noise
and focus on truly impactful information, ultimately contributing to more effective
and educated trading decisions.
"""

As shown in the following figure, the solution was optimized to maximize:

Precision for the NOT SENSITIVE class
Recall for the PRICE SENSITIVE class

This optimization strategy was deliberate, facilitating high confidence in non-sensitive classifications to reduce unnecessary escalations to human analysts (in other words, to reduce false positives). Through this methodical approach, prompts were iteratively refined while maintaining rigorous evaluation standards through comparison against the expert-annotated baseline data.
Key benefits and results
Over a 6-week period, Surveillance Guide demonstrated remarkable accuracy when evaluated on a representative sample dataset. Key achievements include the following:

100% precision in identifying non-sensitive news, allocating 6 articles to this category that analysts confirmed were non price sensitive
100% recall in detecting price-sensitive content, allocating 36 hard to determine and 28 price sensitive articles labelled by analysts into one of these two categories (never misclassifying price sensitive content)
Automated analysis of complex financial news
Detailed justifications for classification decisions
Effective triaging of results by sensitivity level

In this implementation, LSEG has employed Amazon Bedrock so that they can use secure, scalable access to foundation models through a unified API, minimizing the need for direct model management and reducing operational complexity. Because of the serverless architecture of Amazon Bedrock, LSEG can take advantage of dynamic scaling of model inference capacity based on news volume, while maintaining consistent performance during market-critical periods. Its built-in monitoring and governance features support reliable model performance and maintain audit trails for regulatory compliance.
Impact on market surveillance
This AI-powered solution transforms market surveillance operations by:

Reducing manual review time for analysts
Improving consistency in price-sensitivity assessment
Providing detailed audit trails through automated justifications
Enabling faster response to potential market abuse cases
Scaling surveillance capabilities without proportional resource increases

The system’s ability to process news articles instantly and provide detailed justifications helps analysts focus their attention on the most critical cases while maintaining comprehensive market oversight.
Proposed next steps
LSEG plans to first enhance the solution, for internal use, by:

Integrating additional data sources, including company financials and market data
Implementing few-shot prompting and fine-tuning capabilities
Expanding the evaluation dataset for continued accuracy improvements
Deploying in live environments alongside manual processes for validation
Adapting to additional market abuse typologies

Conclusion
LSEG’s Surveillance Guide demonstrates how generative AI can transform market surveillance operations. Powered by Amazon Bedrock, the solution improves efficiency and enhances the quality and consistency of market abuse detection.
As financial markets continue to evolve, AI-powered solutions architected along similar lines will become increasingly important for maintaining integrity and compliance. AWS and LSEG are intent on being at the forefront of this change.
The selection of Amazon Bedrock as the foundation model service provides LSEG with the flexibility to iterate on their solution while maintaining enterprise-grade security and scalability. To learn more about building similar solutions with Amazon Bedrock, visit the Amazon Bedrock documentation or explore other financial services use cases in the AWS Financial Services Blog.

About the authors
Charles Kellaway is a Senior Manager in the Equities Trading team at LSE plc, based in London. With a background spanning both Equity and Insurance markets, Charles specialises in deep market research and business strategy, with a focus on deploying technology to unlock liquidity and drive operational efficiency. His work bridges the gap between finance and engineering, and he always brings a cross-functional perspective to solving complex challenges.
Rasika Withanawasam is a seasoned technology leader with over two decades of experience architecting and developing mission-critical, scalable, low-latency software solutions. Rasika’s core expertise lies in big data and machine learning applications, focusing intently on FinTech and RegTech sectors. He has held several pivotal roles at LSEG, including Chief Product Architect for the flagship Millennium Surveillance and Millennium Analytics platforms, and currently serves as Manager of the Quantitative Surveillance & Technology team, where he leads AI/ML solution development.
Richard Chester is a Principal Solutions Architect at AWS, advising large Financial Services organisations. He has 25+ years’ experience across the Financial Services Industry where he has held leadership roles in transformation programs, DevOps engineering, and Development Tooling. Since moving across to AWS from being a customer, Richard is now focused on driving the execution of strategic initiatives, mitigating risks and tackling complex technical challenges for AWS customers.

MBZUAI Researchers Release K2 Think: A 32B Open-Source System for Adva …

A team of researchers from MBZUAI's Institute of Foundation Models and G42 released K2 Think, a 32B-parameter open reasoning system for advanced AI reasoning. It pairs long chain-of-thought supervised fine-tuning with reinforcement learning from verifiable rewards, agentic planning, test-time scaling, and inference optimizations (speculative decoding plus wafer-scale hardware). The result is frontier-level math performance with a markedly lower parameter count and competitive results on code and science, together with a transparent, fully open release spanning weights, data, and code.

System overview

K2 Think is built by post-training an open-weight Qwen2.5-32B base model and adding a lightweight test-time compute scaffold. The design emphasizes parameter efficiency: a 32B backbone is deliberately chosen to enable fast iteration and deployment while leaving headroom for post-training gains. The core recipe combines six “pillars”: (1) Long chain-of-thought (CoT) supervised fine-tuning; (2) Reinforcement Learning with Verifiable Rewards (RLVR); (3) agentic planning before solving; (4) test-time scaling via best-of-N selection with verifiers; (5) speculative decoding; and (6) inference on a wafer-scale engine.

The goals are straightforward: raise pass@1 on competition-grade math benchmarks, maintain strong code/science performance, and keep response length and wall-clock latency under control through plan-before-you-think prompting and hardware-aware inference.

Pillar 1: Long CoT SFT

Phase-1 SFT uses curated, long chain-of-thought traces and instruction/response pairs spanning math, code, science, instruction following, and general chat (AM-Thinking-v1-Distilled). The effect is to teach the base model to externalize intermediate reasoning and adopt a structured output format. Rapid pass@1 gains occur early (≈0.5 epoch), with AIME’24 stabilizing around ~79% and AIME’25 around ~72% on the SFT checkpoint before RL, indicating convergence.

Pillar 2: RL with Verifiable Rewards

K2 Think then trains with RLVR on Guru, a ~92k-prompt, six-domain dataset (Math, Code, Science, Logic, Simulation, Tabular) designed for verifiable end-to-end correctness. The implementation uses the verl library with a GRPO-style policy-gradient algorithm. Notable observation: starting RL from a strong SFT checkpoint yields modest absolute gains and can plateau/degenerate, whereas applying the same RL recipe directly on the base model shows large relative improvements (e.g., ~40% on AIME’24 over training), supporting a trade-off between SFT strength and RL headroom.
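For intuition, a GRPO-style objective scores each sampled completion against its own group rather than against a learned value baseline. The following is a minimal sketch of that group-relative advantage computation under verifiable rewards; it is illustrative only, not the verl implementation, and the function name and normalization constant are assumptions.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: shape (group_size,), one verifiable reward per sampled completion
    # for the same prompt (e.g., 1.0 if the verifier accepts the final answer, else 0.0).
    mean = rewards.mean()
    std = rewards.std()
    # Each completion is scored relative to its own group; no value network is needed.
    return (rewards - mean) / (std + eps)

# Example: 8 samples for one math prompt, 3 accepted by the verifier.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
advantages = group_relative_advantages(rewards)  # positive for correct samples, negative otherwise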

A second ablation shows multi-stage RL with a reduced initial context window (e.g., 16k → 32k) underperforms—failing to recover the SFT baseline—suggesting that reducing max sequence length below the SFT regime can disrupt learned reasoning patterns.

Pillars 3–4: Agentic “Plan-Before-You-Think” and Test-time Scaling

At inference, the system first elicits a compact plan before generating a full solution, then performs best-of-N (e.g., N=3) sampling with verifiers to select the most likely-correct answer. Two effects are reported: (i) consistent quality gains from the combined scaffold; and (ii) shorter final responses despite the added plan—average token counts drop across benchmarks, with reductions up to ~11.7% (e.g., Omni-HARD), and overall lengths comparable to much larger open models. This matters for both latency and cost.
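As a rough illustration, the plan-before-you-think plus best-of-N scaffold amounts to a small orchestration loop. The sketch below is an approximation under stated assumptions, not the released pipeline: generate and verify_score stand in for the model endpoint and the verifier described in the report.

from typing import Callable, List, Tuple

def plan_then_solve_bon(
    problem: str,
    generate: Callable[[str], str],             # model call (assumed interface)
    verify_score: Callable[[str, str], float],  # verifier score, higher = more likely correct (assumed)
    n: int = 3,
) -> str:
    # Step 1: elicit a compact plan before generating the full solution.
    plan = generate(f"Outline a short plan for solving:\n{problem}")
    # Step 2: sample N candidate solutions conditioned on the plan.
    candidates: List[Tuple[float, str]] = []
    for _ in range(n):
        solution = generate(f"Problem:\n{problem}\nPlan:\n{plan}\nSolve step by step.")
        candidates.append((verify_score(problem, solution), solution))
    # Step 3: best-of-N selection, returning the candidate the verifier ranks highest.
    return max(candidates, key=lambda c: c[0])[1]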

Table-level analysis shows K2 Think’s response lengths are shorter than Qwen3-235B-A22B and in the same range as GPT-OSS-120B on math; after adding plan-before-you-think and verifiers, K2 Think’s average tokens fall versus its own post-training checkpoint (e.g., AIME’24 −6.7%, AIME’25 −3.9%, HMMT25 −7.2%, Omni-HARD −11.7%, LCBv5 −10.5%, GPQA-D −2.1%).

Pillars 5–6: Speculative decoding and wafer-scale inference

K2 Think targets Cerebras Wafer-Scale Engine inference with speculative decoding, advertising per-request throughput upwards of 2,000 tokens/sec, which makes the test-time scaffold practical for production and research loops. The hardware-aware inference path is a central part of the release and aligns with the system’s “small-but-fast” philosophy.

https://k2think-about.pages.dev/assets/tech-report/K2-Think_Tech-Report.pdf

Evaluation protocol

Benchmarking covers competition-level math (AIME’24, AIME’25, HMMT’25, Omni-MATH-HARD), code (LiveCodeBench v5; SciCode sub/main), and science knowledge/reasoning (GPQA-Diamond; HLE). The research team reports a standardized setup: max generation length 64k tokens, temperature 1.0, top-p 0.95, stop marker </answer>, and each score as an average of 16 independent pass@1 evaluations to reduce run-to-run variance.
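The averaging step is simple to reproduce; in the sketch below, run_benchmark_once is a hypothetical callable that runs one full pass over a benchmark and returns its pass@1 fraction.

from statistics import mean
from typing import Callable

def averaged_pass_at_1(run_benchmark_once: Callable[[int], float], runs: int = 16) -> float:
    # Average pass@1 over several independent runs (different seeds) to reduce
    # run-to-run variance at temperature 1.0, top-p 0.95.
    return mean(run_benchmark_once(seed) for seed in range(runs))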


Results

Math (micro-average across AIME’24/’25, HMMT25, Omni-HARD). K2 Think reaches 67.99, leading the open-weight cohort and comparing favorably even to much larger systems; it posts 90.83 (AIME’24), 81.24 (AIME’25), 73.75 (HMMT25), and 60.73 on Omni-HARD—the latter being the most difficult split. The positioning is consistent with strong parameter efficiency relative to DeepSeek V3.1 (671B) and GPT-OSS-120B (120B).

Code. LiveCodeBench v5 score is 63.97, exceeding similarly sized peers and even larger open models (e.g., > Qwen3-235B-A22B at 56.64). On SciCode, K2 Think is 39.2/12.0 (sub/main), tracking the best open systems closely on sub-problem accuracy.

Science. GPQA-Diamond reaches 71.08; HLE is 9.95. The model is not just a math specialist: it stays competitive across knowledge-heavy tasks.


Key numbers at a glance

Backbone: Qwen2.5-32B (open weight), post-trained with long CoT SFT + RLVR (GRPO via verl).

RL data: Guru (~92k prompts) across Math/Code/Science/Logic/Simulation/Tabular.

Inference scaffold: Plan-before-you-think + BoN with verifiers; shorter outputs (e.g., −11.7% tokens on Omni-HARD) at higher accuracy.

Throughput target: ~2,000 tok/s on Cerebras WSE with speculative decoding.

Math micro-avg: 67.99 (AIME’24 90.83, AIME’25 81.24, HMMT’25 73.75, Omni-HARD 60.73).

Code/Science: LCBv5 63.97; SciCode 39.2/12.0; GPQA-D 71.08; HLE 9.95.

Safety-4 macro: 0.75 (Refusal 0.83, Conv. Robustness 0.89, Cybersecurity 0.56, Jailbreak 0.72).

Summary

K2 Think demonstrates that integrative post-training + test-time compute + hardware-aware inference can close much of the gap to larger, proprietary reasoning systems. At 32B, it is tractable to fine-tune and serve; with plan-before-you-think and BoN-with-verifiers, it controls token budgets; with speculative decoding on wafer-scale hardware, it reaches ~2k tok/s per request. K2 Think is presented as a fully open system—weights, training data, deployment code, and test-time optimization code.

Check out the Paper, Model on Hugging Face, GitHub and Direct Access.
The post MBZUAI Researchers Release K2 Think: A 32B Open-Source System for Advanced AI Reasoning and Outperforms 20x Larger Reasoning Models appeared first on MarkTechPost.

Alibaba Qwen Team Releases Qwen3-ASR: A New Speech Recognition Model B …

Alibaba Cloud’s Qwen team unveiled Qwen3-ASR Flash, an all-in-one automatic speech recognition (ASR) model, available as an API service and built on Qwen3-Omni, that handles multilingual, noisy, and domain-specific transcription without requiring multiple specialized systems.

Key Capabilities

Multilingual recognition: Supports automatic detection and transcription across 11 languages: Chinese, English, Arabic, German, Spanish, French, Italian, Japanese, Korean, Portuguese, and Russian. That breadth positions Qwen3-ASR for global usage without separate models.

Context injection mechanism: Users can paste arbitrary text—names, domain-specific jargon, even nonsensical strings—to bias transcription. This is especially powerful in scenarios rich in idioms, proper nouns, or evolving lingo.

Robust audio handling: Maintains performance in noisy environments, low-quality recordings, far-field input (e.g., distance mics), and multimedia vocals like songs or raps. Reported Word Error Rate (WER) remains under 8%, which is technically impressive for such diverse inputs.

Single-model simplicity: Eliminates complexity of maintaining different models for languages or audio contexts—one model with an API Service to rule them all.

Use cases span edtech platforms (lecture capture, multilingual tutoring), media (subtitling, voice-over), and customer service (multilingual IVR or support transcription).

https://qwen.ai/blog?id=41e4c0f6175f9b004a03a07e42343eaaf48329e7&from=research.latest-advancements-list

Technical Assessment

Language Detection + Transcription: Automatic language detection lets the model determine the language before transcribing—crucial for mixed-language environments or passive audio capture. This reduces the need for manual language selection and improves usability.

Context Token Injection: Pasting text as “context” biases recognition toward expected vocabulary. Technically, this could operate via prefix tuning or prefix-injection—embedding context in the input stream to influence decoding. It’s a flexible way to adapt to domain-specific lexicons without re-training the model.

WER < 8% Across Complex Scenarios: Holding sub-8% WER across music, rap, background noise, and low-fidelity audio puts Qwen3-ASR in the upper echelon of open recognition systems. For comparison, robust models on clean read speech target 3–5% WER, but performance typically degrades significantly in noisy or musical contexts. (A reference WER computation is sketched after this list.)

Multilingual Coverage: Supporting 11 languages, including logographic Chinese and languages with very different phonotactics such as Arabic and Japanese, suggests substantial multilingual training data and cross-lingual modeling capacity. Handling both tonal (Mandarin) and non-tonal languages is non-trivial.

Single-Model Architecture: Deploy one model for all tasks. This operational simplicity reduces ops burden—no need to swap or select models dynamically. Everything runs in a unified ASR pipeline with built-in language detection.
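For reference, WER is the word-level edit distance between hypothesis and reference, normalized by the reference length. The following is a minimal, generic implementation of that standard formula (it is not Qwen-specific code):

def word_error_rate(reference: str, hypothesis: str) -> float:
    # WER = (substitutions + deletions + insertions) / reference word count
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution over five reference words -> WER 0.2 (20%).
print(word_error_rate("turn the volume up please", "turn a volume up please"))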

Deployment and Demo

The Hugging Face Space for Qwen3-ASR provides a live interface: upload audio, optionally input context, and choose a language or use auto-detect. It is available as an API Service.

Conclusion

Qwen3-ASR Flash (available as an API Service) is a technically compelling, deploy-friendly ASR solution. It offers a rare combination: multilingual support, context-aware transcription, and noise-robust recognition—all in one model.

Check out the API Service, Technical details and Demo on Hugging Face.
The post Alibaba Qwen Team Releases Qwen3-ASR: A New Speech Recognition Model Built Upon Qwen3-Omni Achieving Robust Speech Recognition Performance appeared first on MarkTechPost.

Top 7 Model Context Protocol (MCP) Servers for Vibe Coding

Modern software development is shifting from static workflows to dynamic, agent-driven coding experiences. At the center of this transition is the Model Context Protocol (MCP), a standard for connecting AI agents to external tools, data, and services. MCP provides a structured way for large language models (LLMs) to request, consume, and persist context. This makes coding sessions more adaptive, reproducible, and collaborative. In short, MCP acts as the “middleware” that enables Vibe Coding—an interactive style of programming where developers and AI agents co-create in real time.

Below are seven notable MCP servers that extend developer environments with specialized capabilities for version control, memory, database integration, research, and browser automation for Vibe Coders.

GitMCP – Git Integration for AI Agents

GitMCP focuses on making repositories natively accessible to AI agents. It bridges MCP with Git workflows, allowing models to clone, browse, and interact with codebases directly. This reduces the overhead of manually feeding context to the agent.

Key Features: Direct access to branches, commits, diffs, and pull requests.

Practical Use: Automating code reviews, generating contextual explanations of commits, and preparing documentation.

Developer Value: Keeps the agent aware of project history and structure, avoiding redundant queries.

Supabase MCP – Database-First Coding

Supabase MCP integrates real-time databases and authentication directly into MCP-enabled workflows. By exposing a Postgres-native API to LLMs, it lets agents query live data, run migrations, or even test queries without leaving the coding session.

Key Features: Postgres queries, authentication, storage access.

Practical Use: Rapid prototyping of applications with live data interaction.

Developer Value: Eliminates the need for separate tooling when testing queries or managing schema changes.

Browser MCP – Web Automation Layer

Browser MCP enables agents to launch headless browsers, scrape data, and interact with web applications. It effectively equips an LLM with browsing capabilities inside a coding environment.

Key Features: Navigation, DOM inspection, form interaction, and screenshot capture.

Practical Use: Debugging frontend applications, testing authentication flows, and collecting real-time content.

Developer Value: Simplifies automated QA and lets developers test code against live production environments without custom scripting.

Context7 – Scalable Context Management

Context7, developed by Upstash, is built to handle persistent memory across sessions. It ensures that agents have long-term awareness of projects without repeatedly re-feeding context.

Key Features: Scalable memory storage, context retrieval APIs.

Practical Use: Multi-session projects where state and knowledge must persist across restarts.

Developer Value: Reduces token costs and boosts reliability by avoiding repeated context injection.

21stDev – Experimental Multi-Agent MCP

21stDev MCP is an experimental server that supports orchestration of multiple agents. Instead of a single AI instance managing all tasks, 21stDev coordinates different specialized agents through MCP.

Key Features: Multi-agent orchestration, modular plugin design.

Practical Use: Building pipelines where one agent manages code generation, another handles database validation, and another performs testing.

Developer Value: Enables a distributed agentic system without complex integration overhead.

OpenMemory MCP – Agent Memory Layer

OpenMemory MCP addresses one of the hardest problems in LLM workflows: persistent, inspectable memory. Unlike vector databases that act as black boxes, OpenMemory MCP provides transparent, queryable memory that developers can inspect and debug.

Key Features: Memory persistence, explainable retrieval, developer-level inspection.

Practical Use: Building agents that can remember user preferences, project requirements, or coding styles across sessions.

Developer Value: Improves trust by making memory retrieval transparent, not opaque.

Exa Search MCP – Research-Driven Development

Exa Search, built by Exa AI, is an MCP server specialized for research. It connects developers to live, verifiable information from the web without leaving the coding environment.

Key Features: Retrieves current statistics, bug fixes, and real-world examples.

Practical Use: When coding requires up-to-date references—such as API changes, performance benchmarks, or bug reports—Exa Search finds and integrates them directly.

Developer Value: Reduces the risk of using outdated or hallucinated information, accelerating bug resolution and feature development.

Conclusion

MCP servers are redefining how developers interact with AI systems by embedding context directly into workflows. Whether it’s GitMCP for version control, Supabase MCP for database interaction, Browser MCP for live web testing, Context7 for persistent memory, or Exa Search for research-driven coding, each server targets a different layer of the development stack. Together, these tools make Vibe Coding a practical reality—where human developers and AI agents collaborate seamlessly, grounded in accurate context and real-time feedback.
The post Top 7 Model Context Protocol (MCP) Servers for Vibe Coding appeared first on MarkTechPost.

Powering innovation at scale: How AWS is tackling AI infrastructure ch …

As generative AI continues to transform how enterprises operate—and develop net new innovations—the infrastructure demands for training and deploying AI models have grown exponentially. Traditional infrastructure approaches are struggling to keep pace with today’s computational requirements, network demands, and resilience needs of modern AI workloads.
At AWS, we’re also seeing a transformation across the technology landscape as organizations move from experimental AI projects to production deployments at scale. This shift demands infrastructure that can deliver unprecedented performance while maintaining security, reliability, and cost-effectiveness. That’s why we’ve made significant investments in networking innovations, specialized compute resources, and resilient infrastructure that’s designed specifically for AI workloads.
Accelerating model experimentation and training with SageMaker AI
The gateway to our AI infrastructure strategy is Amazon SageMaker AI, which provides purpose-built tools and workflows to streamline experimentation and accelerate the end-to-end model development lifecycle. One of our key innovations in this area is Amazon SageMaker HyperPod, which removes the undifferentiated heavy lifting involved in building and optimizing AI infrastructure.
At its core, SageMaker HyperPod represents a paradigm shift by moving beyond the traditional emphasis on raw computational power toward intelligent and adaptive resource management. It comes with advanced resiliency capabilities so that clusters can automatically recover from model training failures across the full stack, while automatically splitting training workloads across thousands of accelerators for parallel processing.
The impact of infrastructure reliability on training efficiency is significant. On a 16,000-chip cluster, for instance, every 0.1% decrease in the daily node failure rate improves cluster productivity by 4.2%, translating to potential savings of up to $200,000 per day for a 16,000 H100 GPU cluster. To address this challenge, we recently introduced Managed Tiered Checkpointing in HyperPod, leveraging CPU memory for high-performance checkpoint storage with automatic data replication. This innovation helps deliver faster recovery times and is a cost-effective solution compared to traditional disk-based approaches.
For those working with today’s most popular models, HyperPod also offers over 30 curated model training recipes, including support for OpenAI GPT-OSS, DeepSeek R1, Llama, Mistral, and Mixtral. These recipes automate key steps like loading training datasets, applying distributed training techniques, and configuring systems for checkpointing and recovery from infrastructure failures. And with support for popular tools like Jupyter, vLLM, LangChain, and MLflow, you can manage containerized apps and scale clusters dynamically as you scale your foundation model training and inference workloads.
Overcoming the bottleneck: Network performance
As organizations scale their AI initiatives from proof of concept to production, network performance often becomes the critical bottleneck that can make or break success. This is particularly true when training large language models, where even minor network delays can add days or weeks to training time and significantly increase costs. In 2024, the scale of our networking investments was unprecedented: we installed over 3 million network links to support our latest AI network fabric, the 10p10u infrastructure. Supporting more than 20,000 GPUs while delivering tens of petabits of bandwidth with under 10 microseconds of latency between servers, this infrastructure enables organizations to train massive models that were previously impractical or impossibly expensive. To put this in perspective: what used to take weeks can now be accomplished in days, allowing companies to iterate faster and bring AI innovations to customers sooner.
At the heart of this network architecture is our revolutionary Scalable Intent Driven Routing (SIDR) protocol and Elastic Fabric Adapter (EFA). SIDR acts as an intelligent traffic control system that can instantly reroute data when it detects network congestion or failures, responding in under one second—ten times faster than traditional distributed networking approaches.
Accelerated computing for AI
The computational demands of modern AI workloads are pushing traditional infrastructure to its limits. Whether you’re fine-tuning a foundation model for your specific use case or training a model from scratch, having the right compute infrastructure isn’t just about raw power—it’s about having the flexibility to choose the most cost-effective and efficient solution for your specific needs.
AWS offers the industry’s broadest selection of accelerated computing options, anchored by both our long-standing partnership with NVIDIA and our custom-built AWS Trainium chips. This year’s launch of P6 instances featuring NVIDIA Blackwell chips demonstrates our continued commitment to bringing the latest GPU technology to our customers. The P6-B200 instances provide 8 NVIDIA Blackwell GPUs with 1.4 TB of high bandwidth GPU memory and up to 3.2 Tbps of EFAv4 networking. In preliminary testing, customers like JetBrains have already seen greater than 85% faster training times on P6-B200 over H200-based P5en instances across their ML pipelines.
To make AI more affordable and accessible, we also developed AWS Trainium, our custom AI chip designed specifically for ML workloads. Using a unique systolic array architecture, Trainium creates efficient computing pipelines that reduce memory bandwidth demands. To simplify access to this infrastructure, EC2 Capacity Blocks for ML also enable you to reserve accelerated compute instances within EC2 UltraClusters for up to six months, giving customers predictable access to the accelerated compute they need.
Preparing for tomorrow’s innovations, today
As AI continues to transform every aspect of our lives, one thing is clear: AI is only as good as the foundation upon which it is built. At AWS, we’re committed to being that foundation, delivering the security, resilience, and continuous innovation needed for the next generation of AI breakthroughs. From our revolutionary 10p10u network fabric to custom Trainium chips, from P6e-GB200 UltraServers to SageMaker HyperPod’s advanced resilience capabilities, we’re enabling organizations of all sizes to push the boundaries of what’s possible with AI. We’re excited to see what our customers will build next on AWS.

About the author
Barry Cooks is a global enterprise technology veteran with 25 years of experience leading teams in cloud computing, hardware design, application microservices, artificial intelligence, and more. As VP of Technology at Amazon, he is responsible for compute abstractions (containers, serverless, VMware, micro-VMs), quantum experimentation, high performance computing, and AI training. He oversees key AWS services including AWS Lambda, Amazon Elastic Container Service, Amazon Elastic Kubernetes Service, and Amazon SageMaker. Barry also leads responsible AI initiatives across AWS, promoting the safe and ethical development of AI as a force for good. Prior to joining Amazon in 2022, Barry served as CTO at DigitalOcean, where he guided the organization through its successful IPO. His career also includes leadership roles at VMware and Sun Microsystems. Barry holds a BS in Computer Science from Purdue University and an MS in Computer Science from the University of Oregon.

Accelerate your model training with managed tiered checkpointing on Am …

As organizations scale their AI infrastructure to support trillion-parameter models, they face a difficult trade-off between training resilience and cost. When they checkpoint frequently to speed up recovery and minimize lost training time, they incur substantially higher storage costs. When they checkpoint infrequently, they reduce costs but risk losing valuable training progress when failures occur.
This challenge is exacerbated in large distributed training environments with thousands of accelerators, where issues can occur frequently. According to an article released by Meta, one failure happened every 3 hours during Meta Llama 3 model training. GPU issues accounted for 60% of the total failures, and network, CPU, and disk issues accounted for the other 40%. With infrequent checkpointing, these accumulated failures can result in losing days of training progress over the course of a complete training run, driving up costs and time to market. Frequent checkpoints, on the other hand, can saturate networks, overload storage, and result in unpredictable performance.
To help solve these challenges, AWS announced managed tiered checkpointing in Amazon SageMaker HyperPod, a purpose-built infrastructure to scale and accelerate generative AI model development across thousands of AI accelerators. Managed tiered checkpointing uses CPU memory for high-performance checkpoint storage with automatic data replication across adjacent compute nodes for enhanced reliability. Although SageMaker HyperPod identifies node issues automatically and replaces those nodes so your training can resume, managed tiered checkpointing helps you implement the best checkpointing strategy and maximize your training throughput.
Managed tiered checkpointing has been tested on large distributed training clusters ranging from hundreds of GPUs to over 15,000 GPUs, with checkpoints saved within seconds.
In this post, we dive deep into those concepts and understand how to use the managed tiered checkpointing feature.
Solution overview
Checkpointing is the method of saving an intermediate model’s state during the training process. You can resume training from a recent checkpoint in the event of an issue by saving the model’s parameters, optimizer states, and other metadata during training. Additionally, you can resolve training problems, such as irregular learning rates, without a full restart by loading an earlier checkpoint state.
Use the following formula to find a rough initial estimate of the total checkpoint size for your model without the optimizer state:

Model checkpoint size (GB) = (Number of parameters × Bytes per parameter) ÷ 1024³

For example, if you train a Meta Llama 3 70-billion-parameter model using BFloat16 as the parameter precision, the checkpoint size will be about 130 GB. If you train a DeepSeek-R1 671-billion-parameter model using BFloat16, the checkpoint size will be about 1.25 TB. Neither figure includes optimizer states.

Checkpoints also include optimizer states, training metadata (such as the step number), and other additional data, resulting in a larger than expected size. When using an Adam optimizer, the optimizer saves three additional float16 statistics per parameter, adding roughly 6 bytes per parameter. With the optimizer state saved, the Meta Llama 3 70B checkpoint grows to approximately 521 GB, and the DeepSeek-R1 671B checkpoint to approximately 5 TB. That is roughly a four-times increase in size, and handling checkpoints of that size becomes a challenge.
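The arithmetic above is easy to script. The following sketch applies the same formula, treating optimizer state as an assumed 6 extra bytes per parameter for Adam, as described above; exact sizes vary slightly with framework metadata.

def checkpoint_size_gb(num_params: float, bytes_per_param: int = 2, optimizer_bytes_per_param: int = 0) -> float:
    # Rough checkpoint size in GB, where 1 GB = 1024**3 bytes.
    return num_params * (bytes_per_param + optimizer_bytes_per_param) / 1024**3

# BFloat16 weights only (2 bytes per parameter):
print(checkpoint_size_gb(70e9))         # Meta Llama 3 70B  -> ~130 GB
print(checkpoint_size_gb(671e9))        # DeepSeek-R1 671B  -> ~1,250 GB

# With ~6 extra bytes per parameter of Adam optimizer state:
print(checkpoint_size_gb(70e9, 2, 6))   # -> ~521 GB
print(checkpoint_size_gb(671e9, 2, 6))  # -> ~5,000 GB (about 5 TB)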
The following table summarizes the checkpoint sizes for each model.

Model name        | Size of Checkpoint | Size of Checkpoint + Optimizer States
Meta Llama 3 70B  | 130 GB             | 521 GB
DeepSeek R1 671B  | 1.43 TB            | 5 TB

It’s also important to consider the training strategy. In a Fully Sharded Data Parallel (FSDP) scenario, each rank (a single GPU process in a distributed training job) saves its own part of the checkpoint. This reduces the amount of data each rank has to write, but it puts stress on the file system. On a Network File System (NFS) shared file system, those concurrent writes become a bottleneck; a distributed file system such as Amazon FSx for Lustre can relieve that pressure at a higher total cost. In a Distributed Data Parallel (DDP) scenario, a single rank writes the complete checkpoint, and all ranks read it when loading it back. At the file system level, this means a single writer and multiple readers. On an NFS file system, many concurrent readers are constrained by the file system, network stack, and queue sizes, while a single writer over the network cannot use all the available network throughput. Here again, a fast, distributed file system like FSx for Lustre can help solve those problems at a higher total cost of ownership.
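To make the write pattern concrete, a bare PyTorch DCP save in a sharded (FSDP-style) job looks roughly like the sketch below, which uses the upstream torch.distributed.checkpoint API with a plain file-system writer; it is a generic illustration, not the managed tiered checkpointing integration shown later in this post.

import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemWriter

def save_sharded_checkpoint(model, optimizer, step, base_path):
    # With sharded state dicts, every rank contributes its own shard, so each
    # checkpoint produces many files under base_path/step_<N>/ -- the
    # many-writers pattern that stresses NFS and favors a parallel file system.
    state_dict = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }
    dcp.save(state_dict, storage_writer=FileSystemWriter(f"{base_path}/step_{step}"))
    if dist.get_rank() == 0:
        print(f"Step {step}: checkpoint written by {dist.get_world_size()} ranks")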
As we can see, traditional checkpointing methods that rely solely on remote persistent storage create a computational overhead during checkpoint creation, because writing terabytes of model parameters to persistent storage might throttle it, consume expensive network bandwidth, and require complex orchestration across distributed systems. By storing checkpoints in fast-access in-memory locations, such as CPU RAM, while maintaining configurable backup to Amazon Simple Storage Service (Amazon S3) for persistence, the system delivers faster recovery times, and is a cost-effective solution compared to traditional disk-based approaches.
Managed tiered checkpointing works as follows:

When training your model, you define the checkpoint frequency.
Model training uses GPU HBM memory to store the model, its parameters, and intermediate results, and do the heavy computation.
Triggering a checkpoint briefly pauses model training. The GPU converts the model weights (tensors) into a state dictionary and copies the data to the instance’s CPU; training then resumes while managed tiered checkpointing copies the data to RAM.
Because RAM is volatile, managed tiered checkpointing copies the data asynchronously from the host RAM to adjacent nodes using RDMA over Elastic Fabric Adapter (EFA). If a node experiences an issue, its checkpoint data will be available on other nodes too.
From time to time, it copies the data to a second layer of persistent storage, such as Amazon S3. This helps both when writing to RAM fails and when you want to persistently store the checkpoint data for future use.

With managed tiered checkpointing, you can configure frequency and retention policies for both the in-memory and persistent storage tiers. You use the first layer (in-memory) to save checkpoints at high frequency for fast recovery, periodically saving to Amazon S3 for backup. Managed tiered checkpointing provides a file system that integrates seamlessly with PyTorch Distributed Checkpointing (DCP) training; adding it to your training script requires only a few lines of code. It improves checkpoint performance by using in-memory storage while relying on the other tiers for persistence.

PyTorch DCP solves the problem of saving a model’s checkpoint when training uses distributed resources, such as multiple GPUs across multiple compute nodes. Trainers, parameters, and the dataset are partitioned across those nodes and resources, and PyTorch DCP saves and loads from multiple ranks in parallel. PyTorch DCP produces multiple files per checkpoint, at least one per rank. Depending on the number and size of those files, shared and network file systems such as NFS struggle with inode and metadata management. Managed tiered checkpointing helps by making it possible to use multiple tiers, reducing the interruption to training while still providing the benefits of PyTorch DCP, such as deduplication of checkpoint data.
With managed tiered checkpointing in SageMaker HyperPod, you can maintain a high training throughput even in large-scale environments prone to failures. It uses your existing SageMaker HyperPod cluster orchestrated by Amazon Elastic Kubernetes Service (Amazon EKS) and compute nodes, and there are no additional costs to use the library.
In the following sections, we explore how to configure the SageMaker HyperPod cluster’s training scripts to use this new feature.
Configure your SageMaker HyperPod cluster for managed tiered checkpointing
SageMaker HyperPod provisions resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). By reducing the complex work of building and maintaining compute clusters using accelerators like AWS Trainium and NVIDIA H200/B200 GPUs, it speeds up the creation of foundation models. To create a new SageMaker HyperPod cluster, refer to the Amazon SageMaker HyperPod Developer Guide. If you want to accelerate your deployment by using field hardened assets, refer to the following GitHub repo.
The examples shared in this post are intended to help you learn more about this new feature. If you’re considering running them in a production environment, have your security team review the content and make sure they adhere to your security standards. At AWS, security is the top priority, and we understand that every customer has their own security framework.

Before creating or updating a cluster to add the managed tiered checkpointing feature, you must set up the EKS pods to access an S3 bucket either in your own account or across accounts. When working with buckets in the same account as the SageMaker HyperPod EKS cluster, you can use the following policy (change the bucket name before applying it):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket_name>",
                "arn:aws:s3:::<bucket_name>/*"
            ],
            "Effect": "Allow"
        }
    ]
}

If the bucket is in a different account, you must authorize an AWS Identity and Access Management (IAM) principal to access those buckets. The following IAM policy will do that for you. Be sure to change both the bucket name and the IAM principal (for example, your AWS account ID).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CheckPointCrossAccountAccess",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<account_id>:root"
            },
            "Action": [
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket_name>",
                "arn:aws:s3:::<bucket_name>/*"
            ]
        }
    ]
}

To create a new cluster with managed tiered checkpointing, pass the --tiered-storage-config parameter with Mode set to Enable in an AWS Command Line Interface (AWS CLI) command:

aws sagemaker create-cluster \
    --cluster-name "ml-cluster" \
    --tiered-storage-config '{ "Mode": "Enable" }' \
    --instance-groups '[{
        "InstanceCount": 1,
        ….
    }]'

You can also update it using the UpdateCluster API and pass the CachingConfig parameter with the required AllocatedMemory configuration. You can use the CachingConfiguration parameter to define a fixed value or a percentage of the CPU RAM for checkpointing.

aws sagemaker update-cluster \
    --cluster-name <my-training-cluster> \
    --tiered-storage-config '{
        "Mode": "Enable",
        "InstanceMemoryAllocationPercentage": <percent>
    }'

Now that your SageMaker HyperPod cluster has the managed tiered checkpointing feature, let’s prepare the training scripts and add them.
Install the managed tiered checkpoint libraries and integrate with your training script
Managed tiered checkpointing integrates with PyTorch DCP. You start by installing the sagemaker-checkpointing library. Then you create and configure a namespace to store the checkpoints based on the defined frequency. Finally, you add the checkpoint function inside your training loop.
To install the library, we simply use Python’s pip. Make sure you already have the dependencies installed: Python 3.10 or higher, PyTorch with DCP support, and the AWS credentials configured properly. To integrate Amazon S3 as another storage layer, you also need s3torchconnector installed.

# Install the pre-requisites
pip install torch boto3 botocore tenacity s3torchconnector

# Install the Managed Tiered Checkpointing library
pip install amzn-sagemaker-checkpointing

Now you can import the library on your script and configure the namespace and frequency for checkpointing:

import os
import time

import torch
import torch.distributed as dist
from torch.distributed.checkpoint import async_save, load
from amzn_sagemaker_checkpointing.config.sagemaker_checkpoint_config import SageMakerCheckpointConfig
from amzn_sagemaker_checkpointing.checkpointing.filesystem.filesystem import (
    SageMakerTieredStorageWriter,
    SageMakerTieredStorageReader
)

checkpoint_config = SageMakerCheckpointConfig(
    # Unique ID for your training job
    # Allowed characters in the ID include: alphanumeric, hyphens, and underscores
    namespace=os.environ.get('TRAINING_JOB_NAME', f'job-{int(time.time())}'),

    # Number of distributed processes/available GPUs
    world_size=dist.get_world_size(),

    # Amazon S3 storage location, required for SageMakerTieredStorageReader for read fallbacks
    # Required for SageMakerTieredStorageWriter when save_to_s3 is True
    s3_tier_base_path="s3://<my-bucket>/checkpoints"
)

In the preceding code snippet, we have configured managed tiered checkpointing with the same world_size as the number of ranks in our cluster. When you start a distributed training, each GPU in the cluster is assigned a rank number, and the total number of GPUs available is the world_size. We set up Amazon S3 as our backup persistent storage, setting managed tiered checkpointing to store data in Amazon S3 every 100 training steps. Both world_size and namespace are required parameters; the others are optional.
Now that the configuration is ready, let’s set up PyTorch DCP and integrate managed tiered checkpointing.
First, configure the storage writer. This component is passed to the PyTorch DCP async_save function alongside the model’s state dictionary. Use the SageMakerTieredStorageWriter when writing checkpoints and the SageMakerTieredStorageReader when restoring from them.
Inside your model training loop, you add the storage writer configuration and pass along both the managed tiered checkpointing configuration and the step number:

   state_dict = {
       "model": model.state_dict(),
       "optimizer": optimizer.state_dict(),
       "step": training_step,
       "epoch": epoch
   }

   # Create the storage writer for the current step and decide whether this
   # checkpoint should also be saved to persistent storage (Amazon S3)
   checkpoint_config.save_to_s3 = training_step % s3_ckpt_freq == 0
   storage_writer = SageMakerTieredStorageWriter(
       checkpoint_config=checkpoint_config,
       step=training_step
   )

You can define the step number explicitly for the storage writer, or you can let the storage writer identify the step number from the path where the checkpoint is being saved. If you want the storage writer to infer the step number from the base path, don’t set the step parameter and make sure your path contains the step number.
Now you can call the PyTorch DCP asynchronous save function and pass along the state dictionary and the storage writer configuration:

async_save(state_dict=state_dict, storage_writer=storage_writer)
We have set up managed tiered checkpointing to write checkpoints at our desired frequency and location (in-memory). Let’s use the storage reader to restore those checkpoints. First, pass the managed tiered checkpointing configuration to the SageMakerTieredStorageReader, then call the PyTorch DCP load function, passing the model state dictionary and the storage reader configuration:

storage_reader = SageMakerTieredStorageReader(checkpoint_config=checkpoint_config)
load(state_dict, storage_reader=storage_reader)

To work through a complete example, refer to the following GitHub repository, where we’ve created a simple training script, including the managed tiered checkpointing feature.
Clean up
After you have worked with managed tiered checkpointing, and you want to clean up the environment, simply remove the amzn-sagemaker-checkpointing library by running pip uninstall amzn-sagemaker-checkpointing.
If you installed the solution in a Python virtual environment, then just deleting the virtual environment will suffice.

Managed tiered checkpointing is a free feature that doesn’t require additional resources to run. You use your existing SageMaker HyperPod EKS cluster and compute nodes.
Best practices to optimize your checkpoint strategy with managed tiered checkpointing
Managed tiered checkpointing will attempt to write to the in-memory tier first. This optimizes the writing performance because in-memory provides ultra-low latency checkpoint access. You should configure managed tiered checkpointing to write to a second layer, such as Amazon S3, from time to time. For example, configure managed tiered checkpointing to write to the in-memory layer every 10 steps, and configure it to write to Amazon S3 every 100 steps.
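As a concrete example, the two frequencies can be expressed directly in the training loop. The sketch below reuses the SageMakerTieredStorageWriter and checkpoint_config objects introduced earlier; the frequencies, the helper name, and the model and optimizer arguments are illustrative assumptions.

from torch.distributed.checkpoint import async_save
from amzn_sagemaker_checkpointing.checkpointing.filesystem.filesystem import SageMakerTieredStorageWriter

def maybe_checkpoint(model, optimizer, checkpoint_config, training_step,
                     in_memory_freq=10, s3_freq=100):
    # Write to the fast in-memory tier every `in_memory_freq` steps and also
    # back the checkpoint up to Amazon S3 every `s3_freq` steps (illustrative values).
    if training_step % in_memory_freq != 0:
        return
    checkpoint_config.save_to_s3 = (training_step % s3_freq == 0)
    writer = SageMakerTieredStorageWriter(checkpoint_config=checkpoint_config, step=training_step)
    async_save(
        state_dict={
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": training_step,
        },
        storage_writer=writer,
    )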
If managed tiered checkpointing fails to write to the in-memory layer, and the node experiences an issue, then you still have your checkpoint saved on Amazon S3. While writing to Amazon S3, managed tiered checkpointing uses multiple TCP streams (chunks) to optimize Amazon S3 writes.
In terms of consistency, managed tiered checkpointing uses an all-or-nothing writing strategy. It implements a fallback mechanism that will seamlessly transition between the storage tiers. Checkpoint metadata, such as step number, is stored alongside the data for every tier.
When trying to troubleshoot managed tiered checkpointing, you can check the log written locally to /var/log/sagemaker_checkpointing/{namespace}_checkpointing.log. It publishes data about the training step, rank number, and the operation details. The following is an example output of that file:

[timestamp] [namespace] [logger_name] [INFO] [filename:451] [Rank 0] Step 240: Starting checkpoint write ([SavePlan Items Count] items)
[timestamp] [namespace] [logger_name] [INFO] [filename:498] [Rank 0] Step 240: In-memory write completed in [Latency]s ([Throughput] MB/s)
[timestamp] [namespace] [logger_name] [INFO] [filename:530] [Rank 0] Step 240: S3 batch write completed in [Latency]s ([Size] total, [Throughput] MB/s average)

Managed tiered checkpointing also writes those metrics to the console, so it’s straightforward to troubleshoot during development. They contain information on which step number is being written to which storage layer and the throughput and total time taken to write the data. With that information, you can monitor and troubleshoot managed tiered checkpointing thoroughly.
When you combine those tools with the SageMaker HyperPod observability stack, you get a complete view of all metrics of your training or inference workload.
Conclusion
The new managed tiered checkpointing feature in SageMaker HyperPod improves FM training efficiency by intelligently distributing checkpoints across multiple storage tiers. This approach places model states in fast-access locations such as CPU RAM, while using persistent storage such as Amazon S3 for cost-effective, long-term persistence. At the time of launch, managed tiered checkpointing is supported only on SageMaker HyperPod on Amazon EKS.
Managed tiered checkpointing delivers fast recovery times without increased storage costs, avoiding complex trade-offs between resiliency, training efficiency, and storage costs. It has been validated on large distributed training clusters ranging from hundreds of GPUs to more than 15,000 GPUs, with checkpoints saved within seconds.
Integrating managed tiered checkpointing on your training scripts is straightforward, with just a few lines of code, providing immediate access to sophisticated checkpoint management without requiring deep engineering expertise.
For more information on how managed tiered checkpointing works, how to set it up, and other details, refer to the SageMaker HyperPod managed tiered checkpointing documentation.

About the authors
Paulo Aragao is a Principal WorldWide Solutions Architect focused on Generative AI at the Specialist Organisation on AWS. He helps Enterprises and Startups to build their Foundation Models strategy and innovate faster by leveraging his extensive knowledge on High Perfomance Computing and Machine Learning. A long time bass player, and natural born rock fan, Paulo enjoys spending time travelling with his family, scuba diving, and playing real time strategy and role-playing games.
Kunal Jha is a Principal Product Manager at AWS. He is focused on building Amazon SageMaker Hyperpod as the best-in-class choice for Generative AI model’s training and inference. In his spare time, Kunal enjoys skiing and exploring the Pacific Northwest.
Mandar Kulkarni is a Software Development Engineer II at AWS, where he works on Amazon SageMaker. He specializes in building scalable and performant machine learning libraries and infrastructure solutions, particularly focusing on SageMaker HyperPod. His technical interests span machine learning, artificial intelligence, distributed systems and application security. When not architecting ML solutions, Mandar enjoys hiking, practicing Indian classical music, sports, and spending quality time with his young family.
Vinay Devadiga is a Software Development Engineer II at AWS with a deep passion for artificial intelligence and cloud computing. He focuses on building scalable, high-performance systems that enable the power of AI and machine learning to solve complex problems.Vinay enjoys staying at the forefront of technology, continuously learning, and applying new advancements to drive innovation. Outside of work, he likes playing sports and spending quality time with his family.
Vivek Maran is a Software Engineer at AWS. He currently works on the development of Amazon SageMaker HyperPod, a resilient platform for large scale distributed training and inference. His interests include large scale distributed systems, network systems, and artificial intelligence. Outside of work, he enjoys music, running, and keeping up to date with business & technology trends.