Scaling medical content review at Flo Health using Amazon Bedrock (Part 1)

This blog post is based on work co-developed with Flo Health.
Healthcare science is rapidly advancing. Maintaining accurate and up-to-date medical content directly impacts people’s lives, health decisions, and well-being. When someone searches for health information, they are often at their most vulnerable, making accuracy not just important, but potentially life-saving.
Flo Health creates thousands of medical articles every year, providing millions of users worldwide with medically credible information on women’s health. Verifying the accuracy and relevance of this vast content library is a significant challenge. Medical knowledge evolves continuously, and manual review of each article is not only time-consuming but also prone to human error. This is why the team at Flo Health, the company behind the leading women’s health app Flo, is using generative AI to facilitate medical content accuracy at scale. Through a partnership with the AWS Generative AI Innovation Center, Flo Health is developing an innovative approach, referred to in this post as the Medical Automated Content Review and Revision Optimization Solution (MACROS), to verify and maintain the accuracy of its extensive health information library. This AI-powered solution is capable of:

Efficiently processing large volumes of medical content based on credible scientific sources.
Identifying potential inaccuracies or outdated information based on credible scientific resources.
Proposing updates based on the latest medical research and guidelines, as well as incorporating user feedback.

The system, powered by Amazon Bedrock, enables Flo Health to conduct medical content reviews and revision assessments at scale, ensuring up-to-date accuracy and supporting more informed healthcare decision-making. It performs detailed content analysis, providing comprehensive insights on adherence to medical standards and guidelines for Flo’s medical experts to review. It is also designed for seamless integration with Flo’s existing tech infrastructure, facilitating automatic updates where appropriate.
This two-part series explores Flo Health’s journey with generative AI for medical content verification. Part 1 examines our proof of concept (PoC), including the initial solution, capabilities, and early results. Part 2 focuses on scaling challenges and real-world implementation. Each article stands alone while collectively showing how AI transforms medical content management at scale.
Proof of Concept goals and success criteria
Before diving into the technical solution, we established clear objectives for our PoC medical content review system:
Key Objectives:

Validate the feasibility of using generative AI for medical content verification
Determine accuracy levels compared to manual review
Assess processing time and cost improvements

Success Metrics:

Accuracy: Content piece recall of 90%
Efficiency: Reduce detection time from hours to minutes per guideline
Cost Reduction: Reduce expert review workload
Quality: Maintain Flo’s editorial standards and medical accuracy
Speed: 10x faster than manual review process

To verify that the solution meets Flo Health’s high standards for medical content, Flo Health’s medical experts and content teams worked closely with AWS technical specialists through regular review sessions, providing critical feedback and medical expertise to continuously enhance the AI model’s performance and accuracy. The result is MACROS, our custom-built solution for AI-assisted medical content verification.
Solution overview
In this section, we outline how the MACROS solution uses Amazon Bedrock and other AWS services to automate medical content review and revisions.

Figure 1. Medical Automated Content Review and Revision Optimization Solution Overview
As shown in Figure 1, the developed solution supports two major processes:

Content Review and Revision: Assesses existing medical articles at scale for adherence to pre-specified custom rules and guidelines, and proposes revisions that conform to the new medical standards as well as Flo’s style and tone guidelines.
Rule Optimization: MACROS accelerates the process of extracting new (medical) guidelines from (medical) research, pre-processing them into the format needed for content review, and optimizing their quality.

Both processes can be conducted through the user interface (UI) or a direct API call. The UI enables medical experts to directly see content review statistics, interact with changes, and make manual adjustments. The API support is intended for integration into a pipeline for periodic assessment.
Architecture
Figure 2 depicts the architecture of MACROS. It consists of two major parts: backend and frontend.

Figure 2. MACROS architecture

The flow across the major application components is as follows:
1. Users begin by gathering and preparing content that must meet medical standards and rules.
2. In the second step, the data is provided as PDF or TXT files, or as plain text, through the Streamlit UI hosted on Amazon Elastic Container Service (Amazon ECS). Authentication for file upload happens through Amazon API Gateway.
3. Alternatively, custom Flo Health JSON files can be uploaded directly to the Amazon Simple Storage Service (Amazon S3) bucket of the solution stack.
4. The ECS-hosted frontend has AWS Identity and Access Management (IAM) permissions to orchestrate tasks using AWS Step Functions.
5. Further, the ECS container has access to S3 for listing, downloading, and uploading files, either via pre-signed URLs or the Boto3 SDK.
6. Optionally, if the input file is uploaded via the UI, the solution invokes AWS Step Functions, which starts the pre-processing functionality hosted by an AWS Lambda function (see the sketch after this list). This Lambda function has access to Amazon Textract for extracting text from PDF files. The files are stored in S3 and also returned to the UI.
7-9. Hosted on AWS Lambda, the Rule Optimizer, Content Review, and Revision functions are orchestrated via AWS Step Functions. They have access to Amazon Bedrock for generative AI capabilities to perform rule extraction from unstructured data, content review, and revision, respectively. Furthermore, they have access to S3 via the Boto3 SDK to store the results.
10. The Compute Stats AWS Lambda function has access to S3 and can read and combine the results of individual revision and review runs.
11. The solution leverages Amazon CloudWatch for system monitoring and log management. For production deployments dealing with critical medical content, the monitoring capabilities could be extended with custom metrics and alarms to provide more granular insights into system performance and content processing patterns.
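The following is a minimal sketch of how the ECS-hosted frontend could start the Step Functions workflow described in steps 4 and 6 using the Boto3 SDK. The state machine ARN, execution name, bucket, and input fields are illustrative assumptions rather than values from the actual MACROS stack.

import json
import boto3

# Assumption: the state machine ARN and input schema below are illustrative placeholders.
sfn = boto3.client("stepfunctions")

response = sfn.start_execution(
    stateMachineArn="arn:aws:states:eu-west-1:111122223333:stateMachine:macros-review",  # hypothetical ARN
    name="review-run-article-42",  # hypothetical, unique execution name
    input=json.dumps({
        "bucket": "macros-input-bucket",   # hypothetical S3 bucket with the uploaded article
        "key": "uploads/article-42.pdf",   # hypothetical object key
        "rule_set": "vitamin-d",           # rule set to review against
        "multi_call": False                # one Bedrock call for all rules vs. one call per rule
    }),
)
print(response["executionArn"])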
Future enhancements
While our current architecture utilizes AWS Step Functions for workflow orchestration, we’re exploring the potential of Amazon Bedrock Flows for future iterations. Bedrock Flows offers promising capabilities for streamlining AI-driven workflows, potentially simplifying our architecture and enhancing integration with other Bedrock services. This alternative could provide more seamless management of our AI processes, especially as we scale and evolve our solution.
Content review and revision
At the core of MACROS lies its Content Review and Revision functionality with Amazon Bedrock foundation models. The Content Review and Revision block consists of five major components: 1) the optional Filtering stage, 2) Chunking, 3) Review, 4) Revision, and 5) Post-processing, as depicted in Figure 3.

Figure 3. Content review and revision pipeline

Here’s how MACROS processes the uploaded medical content:

Filtering (Optional): The journey begins with an optional filtering step. This smart feature checks whether the set of rules is relevant for the article, potentially saving time and resources on unnecessary processing.
Chunking: The source text is then split into paragraphs. This crucial step facilitates good quality assessment and helps prevent unintended revisions to unrelated text. Chunking can be conducted using heuristics, such as punctuation or regular expression-based splits, as well as using large language models (LLMs) to identify semantically complete chunks of text (a minimal sketch follows this list).
Review: Each paragraph or section undergoes a thorough review against the relevant rules and guidelines.
Revision: Only the paragraphs flagged as non-adherent move forward to the revision stage, streamlining the process and maintaining the integrity of adherent content. The AI suggests updates to bring non-adherent paragraphs in line with the latest guidelines and Flo’s style requirements.
Post-processing: Finally, the revised paragraphs are seamlessly integrated back into the original text, resulting in an updated, adherent document.
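As a minimal illustration of the heuristic chunking mentioned above, the following sketch splits an article into paragraph-level chunks with a regular expression; the exact splitting rules used by MACROS are not published, so treat this as one plausible approach.

import re

def chunk_paragraphs(text: str, min_chars: int = 40) -> list[str]:
    """Split an article into paragraph chunks using blank lines as boundaries."""
    # A simple regex heuristic: split on one or more blank lines.
    raw_chunks = re.split(r"\n\s*\n", text)
    # Drop very short fragments (headings, stray whitespace) so each chunk is reviewable on its own.
    return [chunk.strip() for chunk in raw_chunks if len(chunk.strip()) >= min_chars]

article = open("article.txt", encoding="utf-8").read()  # hypothetical input file
for i, section in enumerate(chunk_paragraphs(article)):
    print(i, section[:60])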

The Filtering step can be conducted using an additional LLM call via Amazon Bedrock that assesses each section separately with the following prompt structure:

Figure 4. Simplified LLM-based filtering step

Further, non-LLM approaches can also support the Filtering step, as illustrated in the sketch after this list:

Encoding the rules and the articles into dense embedding vectors and calculating the similarity between them. By setting a similarity threshold, we can identify which rule set is considered relevant for the input document.
Similarly, the direct keyword-level overlap between the document and the rule can be identified using BLEU or ROUGE metrics.
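The following sketch illustrates the embedding-based filtering variant, assuming a text embedding model available through Amazon Bedrock; the model ID, similarity threshold, and rule texts are illustrative assumptions.

import json
import math
import boto3

bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> list[float]:
    # Assumption: Amazon Titan Text Embeddings V2 is enabled in the account and Region.
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

article_vec = embed(open("article.txt", encoding="utf-8").read())  # hypothetical article
rule_sets = {
    "vitamin-d": "Vitamin D supplementation guidance ...",      # illustrative rule set summaries
    "breast-health": "Breast health screening guidance ...",
}

THRESHOLD = 0.35  # illustrative similarity threshold
relevant = [name for name, text in rule_sets.items() if cosine(article_vec, embed(text)) >= THRESHOLD]
print("Relevant rule sets:", relevant)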

Content review, as already mentioned, is conducted on a text section basis against a group of rules and produces a response in XML format, such as:

<xml>
<section_text> Section text without any changes </section_text>
<adherence> 0 </adherence>
<rule_name> Text of the non-adherent rule </rule_name>
<reason> Reason why the section is non-adherent to the rule </reason>
<rule_name> Text of the non-adherent rule </rule_name>
<reason> Reason why the section is non-adherent to the rule </reason>
<section_text> Section text without any changes </section_text>
<adherence> 1 </adherence>
<section_text> Section text without any changes </section_text>
<adherence> 1 </adherence>
</xml>

Here, 1 indicates adherence and 0 indicates non-adherence of the text to the specified rules. Using the XML format helps achieve reliable parsing of the output.
The Review step iterates over the sections in the text to make sure that the LLM pays attention to each section separately, which led to more robust results in our experimentation. To achieve higher non-adherent section detection accuracy, the user can also use the Multi-call mode, where instead of one Amazon Bedrock call assessing the adherence of the article against all rules, we make one independent call per rule (a minimal sketch of this mode follows).
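The following is a minimal sketch of the Multi-call review mode: one Amazon Bedrock call per rule and per section, with the adherence flag parsed from the XML-style response. The prompt wording, model ID, and parsing shown here are simplified assumptions rather than the exact MACROS implementation.

import re
import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # illustrative choice for the review step

sections = ["Vitamin D supports bone health.", "Take 10,000 IU of vitamin D daily."]  # illustrative chunks
rules = ["Dosage recommendations must match current vitamin D guidelines."]           # illustrative rule

def review_section(section: str, rule: str) -> dict:
    prompt = (
        "Review the section against the rule. Answer in XML with "
        "<adherence>1</adherence> or <adherence>0</adherence> and a <reason>.\n"
        f"<rule>{rule}</rule>\n<section>{section}</section>"
    )
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0},
    )
    answer = response["output"]["message"]["content"][0]["text"]
    adherence = re.search(r"<adherence>\s*([01])\s*</adherence>", answer)
    reason = re.search(r"<reason>(.*?)</reason>", answer, re.DOTALL)
    return {
        "section": section,
        "rule": rule,
        "adherent": adherence.group(1) == "1" if adherence else None,
        "reason": reason.group(1).strip() if reason else "",
    }

# Multi-call mode: one independent Bedrock call per rule for each section.
results = [review_section(section, rule) for section in sections for rule in rules]
non_adherent = [r for r in results if r["adherent"] is False]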
The Revision step receives the output of the Review (non-adherent sections and the reasons for non-adherence), as well as the instruction to create the revision in a similar tone. It then suggests revisions of the non-adherent sentences in a style similar to the original text. Finally, the Post-processing step combines the original text with new revisions, making sure that no other sections are changed.
Different steps of the flow require different levels of LLM complexity. While simpler tasks like chunking can be done efficiently with a relatively small model from the Claude Haiku family, more complex reasoning tasks like content review and revision require larger models from the Claude Sonnet or Claude Opus families to facilitate accurate analysis and high-quality content generation. This tiered approach to model selection optimizes both the performance and the cost-efficiency of the solution.
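One simple way to express this tiered selection is a per-task mapping that the pipeline consults before each Amazon Bedrock call; the model IDs below are illustrative choices, not the specific models used by Flo Health.

# Illustrative mapping of pipeline steps to Amazon Bedrock model IDs.
MODEL_BY_TASK = {
    "chunking": "anthropic.claude-3-haiku-20240307-v1:0",     # cheap and fast: split text into sections
    "filtering": "anthropic.claude-3-haiku-20240307-v1:0",    # relevance check per rule set
    "review": "anthropic.claude-3-5-sonnet-20240620-v1:0",    # deeper reasoning over rules
    "revision": "anthropic.claude-3-5-sonnet-20240620-v1:0",  # high-quality rewrite of flagged sections
}

def model_for(task: str) -> str:
    return MODEL_BY_TASK[task]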
Operating modes
The Content Review and Revision feature operates in two UI modes: Detailed Document Processing and Multi Document Processing, each catering to different scales of content management. The Detailed Document Processing mode offers a granular approach to content assessment and is depicted in Figure 5. Users can upload documents in various formats (PDF, TXT, or JSON) or paste text directly, and specify the guidelines against which the content should be evaluated.

Figure 5. Detailed Document Processing example
Users can choose from predefined rule sets (here, Vitamin D; Breast Health; and Premenstrual Syndrome and Premenstrual Dysphoric Disorder (PMS and PMDD)) or input custom guidelines. These custom guidelines can include rules such as “The title of the article must be medically accurate” as well as examples of content that is adherent and non-adherent to the rule.
The rulesets make sure that the assessment aligns with specific medical standards and Flo’s unique style guide. The interface allows for on-the-fly adjustments, making it ideal for thorough, individual document reviews. For larger-scale operations, the Multi Document Processing mode should be used. This mode is designed to handle numerous custom JSON files simultaneously, mimicking how Flo would integrate MACROS into their content management system.
Extracting rules and guidelines from unstructured data
Actionable and well-prepared guidelines are not always immediately available. Sometimes they are given in unstructured files or need to be found. Using the Rule Optimizer feature, we can extract and refine actionable guidelines from multiple complex documents.
Rule Optimizer processes raw PDF documents to extract text, which is then chunked into meaningful sections based on document headers. This segmented content is processed through Amazon Bedrock using specialized system prompts, with two distinct modes: Style/tonality and Medical mode.
Style/tonality mode focuses on extracting guidelines on how the text should be written: its style, and which formats and words can or cannot be used.
Rule Optimizer assigns a priority to each rule: high, medium, or low. The priority level indicates the rule’s importance, guiding the order of content review and focusing attention on critical areas first. Rule Optimizer includes a manual editing interface where users can refine rule text, adjust classifications, and manage priorities. If users update a given rule, the changes are stored in Amazon S3 for future use.
The Medical mode is designed to process medical documents and is adapted to a more scientific language. It allows grouping of extracted rules into three classes:

Medical condition guidelines
Treatment specific guidelines
Changes to advice and trends in health

Figure 6. Simplified medical rule optimization prompt

Figure 6 provides an example of a medical rule optimization prompt, consisting of three main components: role setting (medical AI expert), a description of what makes a good rule, and finally the expected output. We consider a rule to be of sufficiently good quality if it is:

Clear, unambiguous, and actionable
Relevant, consistent, and concise (max two sentences)
Written in active voice
Free of unnecessary jargon

Implementation considerations and challenges
During our PoC development, we identified several crucial considerations that would benefit others implementing similar solutions:

Data preparation: This emerged as a fundamental challenge. We learned the importance of standardizing input formats for both medical content and guidelines while maintaining consistent document structures. Creating diverse test sets across different medical topics proved essential for comprehensive validation.
Cost management: Monitoring and optimizing cost quickly became a key priority. We implemented token usage tracking and optimized prompt design and batch processing to balance performance and efficiency.
Regulatory and ethical compliance: Given the sensitive nature of medical content, strict regulatory and ethical safeguards were critical. We established robust documentation practices for AI decisions, implemented strict version control for medical guidelines and continuous human medical expert oversight for the AI-generated suggestions. Regional healthcare regulations were carefully considered throughout implementation.
Integration and scaling: We recommend starting with a standalone testing environment while planning for future content management system (CMS) integration through well-designed API endpoints. Building with modularity in mind proved valuable for future enhancements. Throughout the process, we faced common challenges such as maintaining context in long medical articles, balancing processing speed with accuracy, and facilitating consistent tone across AI-suggested revisions.
Model optimization: The diverse model selection capability of Amazon Bedrock proved particularly valuable. Through its platform, we can choose optimal models for specific tasks, achieve cost efficiency without sacrificing accuracy, and smoothly upgrade to newer models – all while maintaining our existing architecture.

Preliminary Results
Our Proof of Concept delivered strong results across the critical success metrics, demonstrating the potential of AI-assisted medical content review. The solution exceeded target processing speed improvements while maintaining 80% accuracy and over 90% recall in identifying content requiring updates. Most notably, the AI-powered system applied medical guidelines more consistently than manual reviews and significantly reduced the time burden on medical experts.
Key Takeaways
During implementation, we uncovered several insights critical for optimizing AI performance in medical content analysis. Content chunking was essential for accurate assessment across long documents, and expert validation of parsing rules helped maintain clinical precision. Most importantly, the project confirmed that human-AI collaboration – not full automation – is key to successful implementation. Regular expert feedback and clear performance metrics guided system refinements and incremental improvements. While the system significantly streamlines the review process, it works best as an augmentation tool, with medical experts remaining essential for final validation, creating a more efficient hybrid approach to medical content management.
Conclusion and next steps
This first part of our series demonstrates how generative AI can make the medical content review process faster, more efficient, and scalable while maintaining high accuracy. Stay tuned for Part 2 of this series, where we cover the production journey, diving deep into challenges and scaling strategies. Are you ready to move your AI initiatives into production?

Learn more about the AWS Generative AI Innovation Center and contact your AWS Account Manager to be connected to our expert guidance and support.
Visit the Amazon Bedrock documentation to learn more about available foundation models and their capabilities
Join our AWS Builder community to connect with others on a similar AI journey.

About the authors
Liza (Elizaveta) Zinovyeva, Ph.D., is an Applied Scientist at AWS Generative AI Innovation Center and is based in Berlin. She helps customers across different industries to integrate Generative AI into their existing applications and workflows. She is passionate about AI/ML, finance and software security topics. In her spare time, she enjoys spending time with her family, sports, learning new technologies, and table quizzes.
Callum Macpherson is a Data Scientist at the AWS Generative AI Innovation Center, where cutting-edge AI meets real-world business transformation. Callum partners directly with AWS customers to design, build, and scale generative AI solutions that unlock new opportunities, accelerate innovation, and deliver measurable impact across industries.
Arefeh Ghahvechi is a Senior AI Strategist at the AWS GenAI Innovation Center, specializing in helping customers realize rapid value from generative AI technologies by bridging innovation and implementation. She identifies high-impact AI opportunities while building the organizational capabilities needed for scaled adoption across enterprises and national initiatives.
Nuno Castro is a Sr. Applied Science Manager. He has 19 years of experience in the field, in industries such as finance, manufacturing, and travel, and has led ML teams for 11 years.
Dmitrii Ryzhov is a Senior Account Manager at Amazon Web Services (AWS), helping digital-native companies unlock business potential through AI, generative AI, and cloud technologies. He works closely with customers to identify high-impact business initiatives and accelerate execution by orchestrating strategic AWS support, including access to the right expertise, resources, and innovation programs.
Nikita Kozodoi, PhD, is a Senior Applied Scientist at the AWS Generative AI Innovation Center working on the frontier of AI research and business. Nikita builds and deploys generative AI and ML solutions that solve real-world problems and drive business impact for AWS customers across industries.
Aiham Taleb, PhD, is a Senior Applied Scientist at the Generative AI Innovation Center, working directly with AWS enterprise customers to leverage Gen AI across several high-impact use cases. Aiham has a PhD in unsupervised representation learning, and has industry experience that spans across various machine learning applications, including computer vision, natural language processing, and medical imaging.

Detect and redact personally identifiable information using Amazon Bedrock Data Automation and Amazon Bedrock Guardrails

Organizations handle vast amounts of sensitive customer information through various communication channels. Protecting personally identifiable information (PII), such as Social Security numbers (SSNs), driver’s license numbers, and phone numbers, has become increasingly critical for maintaining compliance with data privacy regulations and building customer trust. However, manually reviewing and redacting PII is time-consuming, error-prone, and scales poorly as data volumes grow.
Organizations face challenges when dealing with PII scattered across different content types, from text to images. Traditional approaches often require separate tools and workflows for handling text and image content, leading to inconsistent redaction practices and potential security gaps. This fragmented approach not only increases operational overhead but also raises the risk of accidental PII exposure.
This post shows an automated PII detection and redaction solution using Amazon Bedrock Data Automation and Amazon Bedrock Guardrails through a use case of processing text and image content in high volumes of incoming emails and attachments. The solution features a complete email processing workflow with a React-based user interface for authorized personnel to more securely manage and review redacted email communications and attachments. We walk through the step-by-step solution implementation procedures used to deploy this solution. Finally, we discuss the solution benefits, including operational efficiency, scalability, security and compliance, and adaptability.
Solution overview
The solution provides an automated system for protecting sensitive information in business communications through three main capabilities:

Automated PII detection and redaction for both email content and attachments using Amazon Bedrock Data Automation and Guardrails, making sure that sensitive data is consistently protected across different content types.
More secure data management workflows where processed communications are encrypted and stored with appropriate access controls, while maintaining a complete audit trail of operations.
Web-based interface options for authorized agents to efficiently manage redacted communications, supported by features like automated email categorization and customizable folder management.

This unified approach helps organizations maintain compliance with data privacy requirements while streamlining their communication workflows.
The following diagram outlines the solution architecture. 
The diagram illustrates the backend PII detection and redaction workflow and the frontend application user interface orchestrated by AWS Lambda and Amazon EventBridge. The process follows these steps:

The workflow starts with the user sending an email to the incoming email server hosted on Amazon Simple Email Service (Amazon SES). This is an optional step.
Alternatively, users can upload the emails and attachments directly into an Amazon Simple Storage Service (S3) landing bucket.
An S3 event notification triggers the initial processing AWS Lambda function that generates a unique case ID and creates a tracking record in Amazon DynamoDB.
Lambda orchestrates the PII detection and redaction workflow by extracting the email body and attachments from the email, saving them in a raw email bucket, and then invoking Amazon Bedrock Data Automation and Guardrails to detect and redact PII.
Amazon Bedrock Data Automation processes attachments to extract text from the files.
Amazon Bedrock Guardrails detects and redacts PII from both the email body and the text extracted from attachments, and stores the redacted content in another S3 bucket (a minimal sketch of this step follows the list).
DynamoDB tables are updated with email messages, folders metadata, and email filtering rules.
An Amazon EventBridge Scheduler runs the Rules Engine Lambda function on a schedule, which processes new emails that have yet to be categorized into folders based on the enabled email filtering rule criteria.
The Rules Engine Lambda also communicates with DynamoDB to access the messages table and the rules table.
Users can access the optional application user interface through Amazon API Gateway, which manages user API requests and routes requests to render the user interface through S3 static hosting. Users may choose to enable authentication for the user interface based on their security requirements. Alternatively, users can check the status of their email processing in the DynamoDB table and S3 bucket with PII redacted content.
A Portal API Lambda fetches the case details based on user requests.
The static assets served by API Gateway are stored in a private S3 bucket.
Optionally, users may enable Amazon CloudWatch and AWS CloudTrail to provide visibility into the PII detection and redaction process, while using Amazon Simple Notification Service to deliver real-time alerts for any failures, facilitating immediate attention to issues.
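The following is a minimal sketch of the Guardrails redaction call in step 6, assuming a guardrail configured to anonymize PII entities; the guardrail identifier and version are placeholders that you would replace with the values created by the ConsumerStack.

import boto3

bedrock = boto3.client("bedrock-runtime")

extracted_text = "Hi, my SSN is 123-45-6789 and my phone number is 555-0100."  # illustrative input

# Assumption: the guardrail has PII entities configured with the anonymize behavior.
response = bedrock.apply_guardrail(
    guardrailIdentifier="<<guardrail-id>>",  # placeholder: created by the ConsumerStack
    guardrailVersion="1",                    # placeholder version
    source="INPUT",
    content=[{"text": {"text": extracted_text}}],
)

if response["action"] == "GUARDRAIL_INTERVENED":
    redacted = "".join(output["text"] for output in response["outputs"])
else:
    redacted = extracted_text  # nothing to redact
print(redacted)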

In the following sections, we walk through the procedures for implementing this solution.
Walkthrough
The solution implementation involves infrastructure and optional portal setup.
Prerequisites
Before beginning the implementation, make sure to have the following components installed and configured.

An AWS account
Git
Python 3.7 or higher
Node v18 or higher
NPM v9.8 or higher
AWS CDK v2.166 or higher
Terminal/CLI such as macOS Terminal, PowerShell or Windows Terminal, or the Linux command line. AWS CloudShell can also be used when all code is located within an AWS account

Infrastructure setup and deployment process
Verify that an existing virtual private cloud (VPC) that contains three private subnets with no internet access is created in your AWS account. All AWS CloudFormation stacks need to be deployed within the same AWS account.
CloudFormation stacks
The solution contains three stacks (two required, one optional) that deploy in your AWS account:

S3Stack – Provisions the core infrastructure including S3 buckets for raw and redacted email storage with automatic lifecycle policies, a DynamoDB table for email metadata tracking with time-to-live (TTL) and global secondary indexes, and VPC security groups for more secure Lambda function access. It also creates AWS Identity and Access Management (IAM) roles with comprehensive permissions for S3, DynamoDB, and Amazon Bedrock services, forming a more secure foundation for the entire PII detection and redaction workflow.
ConsumerStack – Provisions the core processing infrastructure including Amazon Bedrock Data Automation projects for document text extraction and Bedrock Guardrails configured to anonymize comprehensive PII entities, along with Lambda functions for email and attachment processing with Amazon Simple Notification Service (SNS) topics for success/failure notifications. It also creates Amazon Simple Email Service (SES) receipt rules for incoming email handling when a domain is configured and S3 event notifications to trigger the email processing workflow automatically.
PortalStack (optional) – This is only needed when users want to use a web-based user interface for managing emails. It provisions the optional web interface including a regional API Gateway, DynamoDB tables for redacted message storage, and S3 buckets for static web assets.

Amazon SES (optional)
Skip directly to the Solution deployment section that follows if Amazon SES is not being used.
The following Amazon SES setup is optional; the code can be tested without it. Steps to test the application with or without Amazon SES are covered in the Testing section.
Set up Amazon SES with production access and verify the domain or email identities that the solution should work with. You also need to add the MX records in the DNS provider maintaining the domain. Refer to the following links:

Request SES Production Access
Setting up Amazon SES email receiving

Create SMTP credentials and save them in an AWS Secrets Manager secret named SmtpCredentials. An IAM user is created for this process.
If any other name is used for the secret, update the secret_name line in context.json with the name of the secret created.
When storing the credentials in AWS Secrets Manager, the key for the username should be smtp_username and the key for the password should be smtp_password (a minimal Boto3 sketch follows the link below).

Obtaining Amazon SES SMTP credentials
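The following sketch shows one way to create the SmtpCredentials secret with the expected key names using the Boto3 SDK; you could equally create it in the AWS console or with the AWS CLI, and the credential values below are placeholders.

import json
import boto3

secrets = boto3.client("secretsmanager")

# Key names must match what the solution expects: smtp_username and smtp_password.
secrets.create_secret(
    Name="SmtpCredentials",  # use the same name referenced by secret_name in context.json
    SecretString=json.dumps({
        "smtp_username": "<SES_SMTP_USERNAME>",  # placeholder
        "smtp_password": "<SES_SMTP_PASSWORD>",  # placeholder
    }),
)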

Solution deployment
Run the following commands from within a terminal/CLI environment.

Clone the repository

git clone https://github.com/aws-samples/sample-bda-redaction.git

The infra/cdk.json file tells the CDK Toolkit how to execute your app

cd sample-bda-redaction/infra/

Optional: Create and activate a new Python virtual environment (make sure to use Python 3.12, because the Lambda functions in the CDK code are configured for that runtime; if you use a different Python version, update the Lambda runtime in the CDK code accordingly).

python3 -m venv .venv
. .venv/bin/activate

Upgrade pip

pip install --upgrade pip

Install Python packages

pip install -r requirements.txt

Create context.json file

cp context.json.example context.json

Update the context.json file with the correct configuration options for the environment. An illustrative example follows the property tables below.

Property Name | Default | Description | When to Create
vpc_id | "" | VPC ID where resources are deployed | VPC needs to be created prior to execution
raw_bucket | "" | S3 bucket storing raw messages and attachments | Created during CDK deployment
redacted_bucket_name | "" | S3 bucket storing redacted messages and attachments | Created during CDK deployment
inventory_table_name | "" | DynamoDB table name storing redacted message details | Created during CDK deployment
resource_name_prefix | "" | Prefix used when naming resources during the stack creation | During stack creation
retention | 90 | Number of days for retention of the messages in the redacted and raw S3 buckets | During stack creation

The following properties are only required when the portal is being provisioned.

Property Name | Default | Description
environment | development | The type of environment where resources are provisioned. Values are development or production

Use cases that require the usage of Amazon SES to manage redacted email messages need to set the following configuration variables. Otherwise, these are optional.

Property Name | Description | Comment
domain | The verified domain or email name that is used for Amazon SES | This can be left blank if not setting up Amazon SES
auto_reply_from_email | Email address of the "from" field of the email message. Also used as the email address where emails are forwarded from the Portal application | This can be left blank if not setting up the Portal
secret_name | AWS Secrets Manager secret containing SMTP credentials for forward email functionality from the portal
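For illustration only, a context.json for a deployment without Amazon SES or the portal might look like the following; the property names come from the tables above, and the values shown are placeholders.

{
    "vpc_id": "vpc-0123456789abcdef0",
    "raw_bucket": "",
    "redacted_bucket_name": "",
    "inventory_table_name": "",
    "resource_name_prefix": "pii-demo",
    "retention": 90,
    "environment": "development",
    "domain": "",
    "auto_reply_from_email": "",
    "secret_name": "SmtpCredentials"
}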

Deploy the infrastructure by running the following commands from the root of the infra directory.

Bootstrap the AWS account to use AWS CDK

cdk bootstrap

Users can now synthesize the CloudFormation template for this code. Exporting the following environment variables before running cdk synth suppresses warnings. The deployment process should take approximately 10 minutes for a first-time deployment to complete.

export JSII_DEPRECATED=quiet
export JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=quiet
cdk synth --no-notices

Replace <<resource_name_prefix>> with its chosen value and then run:

export JSII_DEPRECATED=quiet
export JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=quiet
cdk deploy <<resource_name_prefix>>-S3Stack <<resource_name_prefix>>-ConsumerStack --no-notices

Testing

Testing the application with Amazon SES

Before starting the test, make sure the Amazon SES email receiving rule set that was created by the <<resource_name_prefix>>-ConsumerStack stack is active. You can check by executing the command aws ses describe-active-receipt-rule-set and making sure the name in the output is <<resource_name_prefix>>-rule-set. If the name does not match or the output is blank, execute the following to activate it:

# Replace <<resource_name_prefix>> with resource_name_prefix used in context.json

aws ses set-active-receipt-rule-set --rule-set-name <<resource_name_prefix>>-rule-set

Once we have the correct rule set active, we can test the application using Amazon SES by sending an email to the verified email or domain in Amazon SES, which automatically triggers the redaction pipeline. Progress can be tracked in the DynamoDB table <<inventory_table_name>>. The inventory table name can be found on the resources tab in the AWS CloudFormation Console for the <<resource_name_prefix>>-S3Stack stack and Logical ID EmailInventoryTable. A unique <<case_id>> is generated and used in the DynamoDB inventory table for each email being processed. Once redaction is complete, the redacted email body can be found in <<redacted_bucket_name>>/redacted/<<today_date>>/<<case_id>>/email_body/ and redacted attachments in <<redacted_bucket_name>>/redacted/<<today_date>>/<<case_id>>/attachments/.

Testing the application without Amazon SES

As described earlier, the solution is used to redact any PII data in the email body and attachments. Therefore, to test the application, we need to provide an email file which needs to be redacted. We can do that without Amazon SES by directly uploading an email file to the raw S3 bucket. The raw bucket name can be found on the output tab in the AWS CloudFormation Console for <<resource_name_prefix>>-S3Stack stack and Export Name RawBucket. This triggers the workflow of redacting the email body and attachments by S3 event notification triggering the Lambda. For your convenience, a sample email is available in the infra/pii_redaction/sample_email directory of the repository. Below are the steps to test the application without Amazon SES using the same email file.

# Replace <<raw_bucket>> with raw bucket name created during deployment

aws s3 cp pii_redaction/sample_email/ccvod0ot9mu6s67t0ce81f8m2fp5d2722a7hq8o1 s3://<<raw_bucket>>/domain_emails/

The above command triggers the email redaction process. You can track the progress in the DynamoDB table <<inventory_table_name>> (a minimal status-check sketch follows). A unique <<case_id>> is generated and used in the DynamoDB inventory table for each email being processed. The inventory table name can be found on the Resources tab in the AWS CloudFormation console for the <<resource_name_prefix>>-S3Stack stack under the Logical ID EmailInventoryTable. Once redaction is complete, the redacted email body can be found in <<redacted_bucket_name>>/redacted/<<today_date>>/<<case_id>>/email_body/ and redacted attachments in <<redacted_bucket_name>>/redacted/<<today_date>>/<<case_id>>/attachments/.
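For convenience, the following sketch checks the processing status of a case with the Boto3 SDK; the partition key name (case_id) is an assumption, so verify it against the actual table schema in the DynamoDB console.

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("<<inventory_table_name>>")  # placeholder: physical name of EmailInventoryTable

# Assumption: the table is keyed by a case_id attribute generated per processed email.
item = table.get_item(Key={"case_id": "<<case_id>>"}).get("Item")
print(item)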
Portal setup
The installation of the portal is completely optional. If you skip this section, you can check the console of the AWS account where the solution is deployed to view the resources created. The portal serves as a web interface to manage the PII-redacted emails processed by the backend AWS infrastructure, allowing users to view sanitized email content. The Portal can be used to:

List messages: View processed emails with redacted content
Message details: View individual email content and attachments

Portal Prerequisites: This portal requires the installation of the following software tools:

TypeScript
Node v18 or higher
NPM v9.8 or higher

Infrastructure Deployment

Synthesize the CloudFormation template for this code by going to the directory root of the solution. Run the following command:

cd sample-bda-redaction/infra/

Optional: Create and activate a new Python virtual environment (if the virtual environment has not been created previously):

python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

Users can now synthesize the CloudFormation template for this code.

export JSII_DEPRECATED=quiet
export JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=quiet
cdk synth --no-notices

Deploy the React-based portal. Replace <<resource_name_prefix>> with its chosen value:

export JSII_DEPRECATED=quiet
export JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=quiet
cdk deploy <<resource_name_prefix>>-PortalStack --no-notices

The first-time deployment should take approximately 10 minutes to complete.
Environment Variables

Create a new environment file by going to the root of the app directory, copying the .env.example file to .env, and then updating the following variables in the .env file. Use the following command in a terminal/CLI environment to create the .env file:

cp .env.example .env

The file can be created using your preferred text editor as well.

Environment Variable Name | Default | Description | Required
VITE_APIGW | "" | The API Gateway invoke URL (including protocol) without the path (remove /portal from the value). This value can be found in the output of the PortalStack after deploying through AWS CDK. It can also be found under the Outputs tab of the PortalStack CloudFormation stack under the export name of PiiPortalApiGatewayInvokeUrl | Yes
VITE_BASE | /portal | Specifies the path used to request the static files needed to render the portal | Yes
VITE_API_PATH | /api | Specifies the path needed to send requests to the API Gateway | Yes

Portal deployment
Run the following commands from within a terminal/CLI environment.

Before running any of the following commands, go to the root of the app directory to build this application for production by running the following commands:

Install NPM packages

npm install

Build the files

npm run build

After the build succeeds, transfer all of the files within the dist/ directory into the Amazon S3 bucket that is designated for these assets (specified in the PortalStack provisioned via CDK).

Example: aws s3 sync dist/ s3://<<name-of-s3-bucket>> --delete
Here, <<name-of-s3-bucket>> is the S3 bucket that has been created in the <<resource-name-prefix>>-PortalStack CloudFormation stack with the Logical ID of PrivateWebHostingAssets. This value can be obtained from the Resources tab of the CloudFormation stack in the AWS Console. This value is also output during the cdk deploy process when the PortalStack has completed successfully.

Accessing the portal
Use the API Gateway invoke URL from the API Gateway that has been created during the cdk deploy process to access the portal from a web browser. This URL can be found by following these steps:

Visit the AWS Console
Go to API Gateway and find the API Gateway that has been created during the cdk deploy process. The name of the API Gateway can be found in the Resources section of the <<resource-name-prefix>>-PortalStack CloudFormation stack.
Click on the Stages link in the left-hand menu.
Make sure that the portal stage is selected
Find the Invoke URL and copy that value
Enter that value in the address bar of your web browser

The portal’s user interface is now visible within the web browser. If any emails have been processed, they are listed on the home page of the portal.
Access control (optional)
For production deployments, we recommend enabling authentication and access controls for the Portal, as noted in the solution overview.
Clean up
To avoid incurring future charges, follow these steps to remove the resources created by this solution:

Delete the contents of the S3 buckets created by the solution:

Raw email bucket
Redacted email bucket
Portal static assets bucket (if portal was deployed)

Delete or disable the Amazon SES rule set created by the solution using the following CLI commands:

# To disable the rule set, use the following command
aws ses set-active-receipt-rule-set

# To delete the rule set, use the following command
# Replace <<resource_name_prefix>> with the resource_name_prefix used in context.json
aws ses delete-receipt-rule-set --rule-set-name <<resource_name_prefix>>-rule-set

Remove the CloudFormation stacks in the following order:

cdk destroy <<resource_name_prefix>>-PortalStack (if deployed)
cdk destroy <<resource_name_prefix>>-ConsumerStack
cdk destroy <<resource_name_prefix>>-S3Stack

CDK destroy does not remove the access log Amazon S3 bucket created as part of the deployment. Users can find the log bucket name on the Outputs tab of the <<resource_name_prefix>>-S3Stack stack under the export name AccessLogsBucket. Execute the following steps to delete the access log bucket:

To delete the contents of the access log bucket, follow the instructions for emptying an S3 bucket.
The access log bucket has versioning enabled, and deleting the contents of the bucket in the preceding step does not delete versioned objects in the bucket. Those need to be removed separately using the following AWS CLI commands:

# To remove versioned objects, use the following AWS CLI command
aws s3api delete-objects --bucket ${accesslogbucket} --delete "$(aws s3api list-object-versions --bucket ${accesslogbucket} --query='{Objects: Versions[].{Key:Key,VersionId:VersionId}}')"

# Once versioned objects are removed, remove the delete markers of the versioned objects using the following AWS CLI command
aws s3api delete-objects --bucket ${accesslogbucket} --delete "$(aws s3api list-object-versions --bucket ${accesslogbucket} --query='{Objects: DeleteMarkers[].{Key:Key,VersionId:VersionId}}')"

Delete the access log Amazon S3 bucket using the following AWS CLI command:

# Delete the access log bucket itself using the following AWS CLI command
aws s3api delete-bucket --bucket ${accesslogbucket}

If Amazon SES is configured:

Remove the verified domain/email identities
Delete the MX records from your DNS provider
Delete the SMTP credentials from AWS Secrets Manager

Delete any CloudWatch Log groups created by the Lambda functions

The VPC and its associated resources were created as prerequisites for this solution and are not removed by these steps; do not delete them if they are used by other applications.
Conclusion
In this post, we demonstrated how to automate the detection and redaction of PII across both text and image content using Amazon Bedrock Data Automation and Amazon Bedrock Guardrails. By centralizing and streamlining the redaction process, organizations can strengthen alignment with data privacy requirements, enhance security practices, and minimize operational overhead.
However, it is equally important to make sure that your solution is built with Amazon Bedrock Data Automation’s document processing constraints in mind. Amazon Bedrock Data Automation supports PDF, JPEG, and PNG file formats with a maximum console-processing size of 200 MB (500 MB via API), and single documents may not exceed 20 pages unless document splitting is enabled.
By using the centralized redaction capabilities of Amazon Bedrock Data Automation and Amazon Bedrock Guardrails, organizations can boost data privacy compliance management, cut operational overhead, and maintain stringent security across diverse workloads. This solution’s extensibility further enables integration with other AWS services, fine-tuning detection logic for more advanced PII patterns, and broadening support for additional file types or languages in the future, thereby evolving into a more robust, enterprise-scale data protection framework.
We encourage exploration of the provided GitHub repository to deploy this solution within your organization. In addition to delivering operational efficiency, scalability, security, and adaptability, the solution also provides a unified interface and robust audit trail that simplifies data governance. Users can refine detection rules, integrate additional file formats where possible, and build on the modular framework of Amazon Bedrock Data Automation and Amazon Bedrock Guardrails.
We invite you to implement this PII detection and redaction solution in the following GitHub repo to build a more secure, compliance-aligned, and highly adaptable data protection solution on Amazon Bedrock that addresses evolving business and regulatory requirements.

About the Authors
Himanshu Dixit is a Delivery Consultant at AWS Professional Services specializing in databases and analytics, bringing over 18 years of experience in technology. He is passionate about artificial intelligence, machine learning, and generative AI, leveraging these cutting-edge technologies to create innovative solutions that address real-world challenges faced by customers. Outside of work, he enjoys playing badminton, tennis, cricket, and table tennis, and spending time with his two daughters.
David Zhang is an Engagement Manager at AWS Professional Services, where he leads enterprise-scale AI/ML, cloud transformation initiatives for Fortune 100 customers in telecom, finance, media, and entertainment. Outside of work, he enjoys experimenting with new recipes in his kitchen, playing tenor saxophone, and capturing life’s moments through his camera.
Richard Session is a Lead User Interface Developer for AWS ProServe, bringing over 15 years of experience as a full-stack developer across marketing/advertising, enterprise technology, automotive, and ecommerce industries. With a passion for creating intuitive and engaging user experiences, he uses his extensive background to craft exceptional interfaces for AWS’s enterprise customers. When he’s not designing innovative user experiences, Richard can be found pursuing his love for coffee, spinning tracks as a DJ, or exploring new destinations around the globe.
Viyoma Sachdeva is a Principal Industry Specialist at AWS. She specializes in AWS DevOps, containerization, and IoT, helping customers accelerate their journey to the AWS Cloud.

Speed meets scale: Load testing SageMaker AI endpoints with Observe.AI …

This post is cowritten with Aashraya Sachdeva from Observe.ai.
You can use Amazon SageMaker to build, train and deploy machine learning (ML) models, including large language models (LLMs) and other foundation models (FMs). This helps you significantly reduce the time required for a range of generative AI and ML development tasks. An AI/ML development cycle typically involves data pre-processing, model development, training, testing and deployment lifecycles. By using SageMaker, your data science and ML engineering teams can offload a lot of the undifferentiated heavy lifting involved with model development.
While SageMaker can help teams offload a lot of heavy lifting, engineering teams still have to use manual steps to implement and fine-tune related services that are part of inference pipelines, such as queues and databases. In addition, teams have to test multiple GPU instance types to find the right balance between performance and cost.
Observe.ai provides a Conversation Intelligence (CI) product that integrates with contact center as a service (CCaaS) solutions. The tool analyzes calls in real time and after they’re complete to enable features such as call summarizations, agent feedback, and auto response. The CI features need to scale from customers that have fewer than 100 agents to customers that have thousands of agents, a tenfold increase in scale. To help with this, Observe.ai needed a mechanism to optimize their ML infrastructure and model serving costs. Without such a mechanism, developers had to write multiple test scripts and develop testing pipelines and debugging systems, which consumed a lot of time.
To solve this challenge, Observe.ai developed the One Load Audit Framework (OLAF), which integrates with SageMaker to identify bottlenecks and performance issues in ML services, offering latency and throughput measurements under both static and dynamic data loads. The framework also seamlessly incorporates ML performance testing into the software development lifecycle, facilitating accurate provisioning and cost savings. Using OLAF, Observe.AI’s ML team was able to reduce testing time from a week to a few hours. This helped Observe.AI scale up their frequency of endpoint deployment and customer onboarding multifold. The OLAF utility is available on GitHub and is free to use. It is open source and distributed under the Apache 2.0 license.
In this blog post, you will learn how to use the OLAF utility to test and validate your SageMaker endpoint.
Solution overview
After you’ve deployed your model for inference and verified that it’s functionally accurate, you’ll want to improve the performance of your model. The first step is to load test the inference endpoint. You can use the load test metrics to apply optimizations to your model, decide on GPU instances, and fine-tune the ML pipeline to increase performance without compromising on accuracy. Load testing needs to be repeated multiple times to measure the impact of any optimization. To load test, you need to configure load testing scripts to integrate with the relevant SageMaker APIs and extract metrics like latency, CPU, and memory utilization. You also need to set up a dashboard to view the results of the load test and export the load test metrics for further analysis, and you need a configurable framework to apply concurrent load to the endpoint.
How OLAF helps
OLAF saves you the heavy lifting by providing the preceding elements as a package. OLAF is integrated with Locust, a load testing framework, to provide the capability to create concurrent load and a dashboard to view the results as the test progresses. OLAF integrates with the SageMaker API to invoke the endpoint and to extract the metrics used to measure performance.
In the following solution, you will learn how to deploy OLAF on your workstation as a Docker container. Using the Load test setup UI (as shown in the following figure), the load test configuration is provided and the OLAF framework uses the Boto3 SDK to push inference requests to a SageMaker inference endpoint. OLAF monitors the latency and available performance metrics using the Performance reports dashboard provided by OLAF.

Prerequisites
For this solution walkthrough, you need the following:

An AWS account
Docker installed on your workstation
The AWS Command Line Interface (AWS CLI) installed and configured. If you’re using long term credentials such as access keys, see manage access keys for IAM users and secure access keys for best practices. This post uses temporary short term credentials generated by the AWS Security Token Service (AWS STS).

Generate your AWS credentials using AWS STS
To get started, use the AWS CLI to generate your credentials.
Note: Ensure that the role or user from which the access keys are generated has AmazonSageMakerFullAccess permission. Your AWS CLI role should have the necessary trust policy to assume the role from which the access keys are generated.
Getting the role-arn
In your AWS CLI type in the following command:

aws iam get-role --role-name sagemaker_role

The command generates JSON output similar to the following. The role ARN is the value of the Arn property in the JSON.

{
    "Role": {
        "Path": "/",
        "RoleName": "sagemaker_role",
        "RoleId": "AROA123456789EXAMPLE",
        "Arn": "arn:aws:iam::111122223333:role/sagemaker_role",
        "CreateDate": "2025-12-05T13:02:33+00:00",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Service": "ec2.amazonaws.com"
                    },
                    "Action": "sts:AssumeRole"
                }
            ]
        },
        "Description": "Allows EC2 instances to call AWS services on your behalf.",
        "MaxSessionDuration": 3600,
        "RoleLastUsed": {}
    }
}

Run the following command in your AWS CLI:

aws sts assume-role --role-arn <role arn to assume> --role-session-name <session name> --duration-seconds <timeout duration>

Set the role ARN value from the preceding step in the --role-arn argument.
Provide the value olaf_session to the --role-session-name argument, and set a value equivalent to how long you expect your load test to run in the --duration-seconds argument. In this blog we are setting it to 1800 seconds, which gives 30 minutes of load testing time.

The assume-role command generates temporary AWS credentials similar to the following:

{
    "Credentials": {
        "AccessKeyId": "ASIAIOSFODNN7EXAMPLE",
        "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "SessionToken": "IQoJb3JpZ2luX2VjEJf//////////wEaCXVzLWVhc3QtMSJFMEMCIFdzSaxLya/E71xi2SeG8KFDF46DkxvsWt6Ih0I5X2I6Ah9FYGmWi3fnQfyPQWuzE0Co44xp+qOAxbfaHJ53OYbBKpkCCF8QARoMNjE1NTE1NDU5MjM5IgyoWu5a5DJX3BMn7LYq9gHiRr2sQvStZT9tvvdS8QFjTntBYFEkDL636Crj4xw5rDieBoYFB9h+ozSqMXOtze79DHQLyCduT+McWOlB9Ic5x/xtzPT9HZsfMaEMUOPgI9LtKWUK367rVdcqBV8HH8wOwUS9RhwIyXg2vsGa+WanaS8o6sO8PVkvqOs4ea3CFguncGgSqIftJvgMg0OswzkAoUKXG6jMwL3Ppu13Dg9NV3YKOsS80vejhEJ8QFiKiTsJKX2QmQz/wUN4DN83y8qeFfYEpuYC92oZzv2gErrsXqFd+7/+2w97mInPlD6g1tyd8FlGdXg821WckmwdPu7TYqsCR9kwiM3LyQY6nwFM3U7f/sCre28o2Js31dig0WHb1iv3nTR6m/bIKqsQL4EtYXPGjHD6Ifsf9nQYtkPQC/PqzXg7anx6Q6OW5CzVvk4xU/G9+HcCej84MutK/hQGp3xnRPuJvUIs/q/QlddURk/MFZW9X3njLCn89FRmJ/tI1Mzy/yctwgLcBetE7RIPgaM/90HNXp62vBMK0tzqR1orm6/7eOGV5DXaprQ=",
        "Expiration": "2025-12-05T14:34:56+00:00"
    },
    "AssumedRoleUser": {
        "AssumedRoleId": "AROA123456789EXAMPLE:olaf-session",
        "Arn": "arn:aws:sts::111122223333:assumed-role/sm-blog-role/olaf-session"
    }
}

Make a note of the access key, secret key, and session token, which you will use to configure the test in the OLAF tool.

Set up your SageMaker inference endpoint
In this step, you set up a SageMaker inference endpoint. The following CloudFormation template sets up the endpoint. Copy the content and save it as a YAML file for use in the following steps.

Resources:
  SageMakerExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: sagemaker.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
        - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
  SageMakerModel:
    Type: AWS::SageMaker::Model
    Properties:
      ModelName: !Sub '${AWS::StackName}-flan-t5-model'
      ExecutionRoleArn: !GetAtt SageMakerExecutionRole.Arn
      EnableNetworkIsolation: true
      PrimaryContainer:
        Image: !Sub '763104351884.dkr.ecr.${AWS::Region}.amazonaws.com/huggingface-pytorch-inference:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04'
        Environment:
          HF_MODEL_ID: !Sub 'google/flan-t5-${ModelSize}'
  SageMakerEndpointConfig:
    Type: AWS::SageMaker::EndpointConfig
    Properties:
      EndpointConfigName: !Sub '${EndpointName}-config'
      ProductionVariants:
        - VariantName: AllTraffic
          ModelName: !GetAtt SageMakerModel.ModelName
          InstanceType: !Ref InstanceType
          InitialInstanceCount: 1
  SageMakerEndpoint:
    Type: AWS::SageMaker::Endpoint
    Properties:
      EndpointName: !Ref EndpointName
      EndpointConfigName: !GetAtt SageMakerEndpointConfig.EndpointConfigName
Parameters:
  ModelName:
    Type: String
    Default: flan-t5-model
    Description: Name of the SageMaker model
  EndpointName:
    Type: String
    Default: flan-t5-endpoint-blog
    Description: Name of the SageMaker endpoint
  InstanceType:
    Type: String
    Default: ml.g5.xlarge
    Description: Instance type for the SageMaker endpoint
    AllowedValues:
      - ml.g4dn.xlarge
      - ml.g4dn.2xlarge
      - ml.g5.xlarge
      - ml.g5.2xlarge
      - ml.p3.2xlarge
  ModelSize:
    Type: String
    Default: base
    Description: Size of the FLAN-T5 model
    AllowedValues:
      - small
      - base
      - large
      - xl
      - xxl
Outputs:
  SageMakerEndpointId:
    Description: ID of the SageMaker Endpoint
    Value: !Ref SageMakerEndpoint
  SageMakerEndpointName:
    Description: Name of the SageMaker Endpoint
    Value: !Ref EndpointName
  ModelName:
    Description: Name of the deployed model
    Value: !Ref ModelName
AWSTemplateFormatVersion: '2010-09-09'
Description: 'CloudFormation template for deploying FLAN-T5 model on Amazon SageMaker'

Open an AWS CloudShell window by selecting the CloudShell icon at the top of the AWS Management Console in the AWS Region where you want the endpoint to be created.

In your CloudShell window, choose Actions and select Upload file. Select and upload the CloudFormation YAML file shared at the start of this section.

Run the following command at the CloudShell prompt

aws cloudformation create-stack \
  --stack-name flan-t5-endpoint-stack \
  --template-body file://<YAML_FILE_NAME> \
  --capabilities CAPABILITY_IAM

Navigate to the Amazon SageMaker AI Studio console. You might need to change the Region to match where you have deployed your SageMaker endpoint. Select Inference and then Endpoints in the navigation pane to view the deployed endpoint. The SageMaker endpoint takes a few minutes to finish provisioning. When it is ready, the value of the Status field is InService. Note the endpoint name.
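Optionally, before load testing, you can send a single request to confirm the endpoint responds. The following sketch assumes the default flan-t5-endpoint-blog endpoint name from the template and the Hugging Face inference container's JSON input format.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="flan-t5-endpoint-blog",  # default EndpointName parameter from the template
    ContentType="application/json",
    Body=json.dumps({"inputs": "translate the following phrase in English to French : Hello, how are you"}),
)
print(response["Body"].read().decode("utf-8"))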

Install OLAF
You’re ready to install and configure OLAF to help you load test your SageMaker AI inference endpoint.

Clone the OLAF repository from the OLAF GitHub repo:

git clone https://github.com/Observeai-Research/olaf.git

Navigate to the olaf directory and build the docker image for OLAF:

cd olaf
docker build -t olaf .

Run OLAF:

docker run -p 80:8000 olaf

Open a browser window and enter the following URL to bring up the OLAF UI.

http://localhost

Enter olaf as the username and password to sign in to the OLAF dashboard. On the left is a series of radio buttons to select the resource to be tested, including SageMaker, S3, and so on. On the right is a setup screen that changes based on the resource selected.

OLAF supports additional options, including:

Multi-model
Enable batch mode

Test the SageMaker endpoint

Open the OLAF UI at http://localhost:80/.
Select SageMaker from the navigation pane and configure the test:

SageMaker endpoint – Enter the name of the SageMaker endpoint that you noted in the SageMaker console earlier.
Predictor type – OLAF supports pytorch, sklearn, and tensorflow predictors. Keep the default value.
Input Serializer – Serialization options are numpy and json. Keep the default value.
Output Serializer – Serialization options are numpy and json. Keep the default value.
AWS Region – Select the Region where the SageMaker endpoint is deployed.
AWS access key – Enter the AWS access key generated from AWS STS in the section “Generate your AWS credentials using AWS STS” above.
AWS secret key – Enter the AWS secret key generated from AWS STS in the same section.
AWS session token – Enter the session token generated from AWS STS in the same section.
Input query json – For this test, enter the following prompt to translate a phrase from English to French.

[
  {
    "inputs": "translate the following phrase in English to French : Hello, how are you"
  }
]

Choose START LOAD SESSION to start a load test session. The session starts and a link to the session appears at the bottom of the page. If the link doesn't appear within a few seconds, choose START LOAD SESSION again to generate the link to the session.

Selecting the link takes you to a Locust dashboard. Enter the number of concurrent users that you want the test to simulate in the Number of users field and the rate at which users are started (users per second) in the Spawn rate field. Choose Start swarming to start the load test.

When the test starts, a reporting page is presented that you can use to monitor various performance parameters as the test proceeds. This page provides a summary of the statistics, the p50 and p95 latency values, and the CPU and memory usage of the SageMaker workers.

Choose Charts at the top of the screen to view charts that show the Total Requests per Second and the Response Times in milliseconds. The Total Requests per Second chart shows the successful requests in green and the failed requests in red. The Response Times chart shows the fiftieth percentile response times in green and the ninety-fifth percentile response times in yellow.

Choose Workers at the top of the screen to view the worker statistics. Workers are created to generate the desired load. The # users column shows the number of users generated by each worker, and the CPU Usage and Memory Usage columns show the resource utilization of each worker.

You can view and download the final statistics for analysis. Choose Download Data at the top of the screen to view data download options. You can download the data as a CSV file from the Statistics, Failures, Exceptions, and Charts reporting pages.

You must stop the current load session before you can execute a new session. Choose STOP RUNNING LOAD SESSION to stop the session. If configured, the data can be uploaded to a specified Amazon Simple Storage Service (Amazon S3) bucket. Follow the instructions in Advanced OLAF Usage item 3, Automated Backup of Load Test Report, to configure the upload of test results to Amazon S3.

Hosting the client
For the solution described in this post, you used a desktop to host the OLAF container and set up the load tests. Choosing between your desktop and an Amazon Elastic Compute Cloud (Amazon EC2) instance affects the measured latency because it changes the network round-trip time, and network bandwidth can also affect latency. The key is to standardize the environment that you use to run the tests based on how your customers use the endpoints.
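
To compare candidate client environments before running a full load test, a rough single-request timing check such as the following can help. This is a minimal sketch that assumes the template's default endpoint name and the Hugging Face container's JSON input format; adjust both for your setup, and treat it as an illustration rather than part of the OLAF tooling.

import boto3, json, time

# Time one synchronous invocation from the client environment (not a load test)
runtime = boto3.client("sagemaker-runtime")
payload = {"inputs": "translate the following phrase in English to French : Hello, how are you"}

start = time.perf_counter()
response = runtime.invoke_endpoint(
    EndpointName="flan-t5-endpoint-blog",  # assumed default from the CloudFormation template
    ContentType="application/json",
    Body=json.dumps(payload),
)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Round trip: {elapsed_ms:.1f} ms")
print(response["Body"].read().decode("utf-8"))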
Clean up
When you’re done with this demonstration, remove any resources that you no longer need to avoid incurring future costs.

In the CloudShell terminal run the following command to delete the SageMaker endpoint:

aws cloudformation delete-stack --stack-name flan-t5-endpoint-stack

Run the following command to list the running Docker containers:

docker ps

Note the container_id and then run the following command to stop the OLAF container:

docker stop <container_id>

Conclusion
In this post, you’ve learned how to set up OLAF and use it to load test a SageMaker endpoint with a few basic steps. OLAF represents a significant step forward in streamlining the optimization of ML infrastructure and model serving costs. Through this demonstration, you’ve seen how OLAF seamlessly integrates with SageMaker to provide valuable insights into endpoint performance under various load conditions. Key benefits of OLAF include:

Straightforward setup and integration with existing SageMaker endpoints
Real-time monitoring of performance metrics including latency and throughput
Detailed statistics and downloadable reports for analysis
Ability to test different load patterns and concurrency levels
Support for multiple model types and serialization options

For organizations like Observe.ai that need to scale their ML operations efficiently, OLAF eliminates the need to develop custom testing infrastructure and debugging systems. This means that development teams can focus on their core product features while ensuring optimal performance and cost-effectiveness of their ML infrastructure. As the adoption of ML continues to grow, tools like OLAF become increasingly valuable in helping organizations optimize their ML operations. Whether you’re running a few models or managing a large-scale ML infrastructure, OLAF provides the insights needed to make informed decisions about instance types, scaling, and resource allocation.
In this sample solution, you used short term credentials generated by the AWS STS service to connect to SageMaker from OLAF. Ensure that the necessary steps are taken to secure your access keys and credentials in a production environment.
To get started with OLAF, visit the GitHub repository and follow the installation steps outlined in this post. The framework’s intuitive interface and comprehensive monitoring capabilities make it an essential tool for organizations that want to optimize their SageMaker deployments.

About the authors
Aashraya Sachdeva is a technology leader with deep expertise in genAI, product development, and platform engineering. As the Director of Engineering at Observe, he oversees teams building scalable, agentic solutions that enhance both customer experience and operational efficiency. With extensive experience guiding ML initiatives from early data exploration through deployment and large-scale operations, he brings a pragmatic, reliability-focused approach to delivering high-performing platforms. Throughout his career, he has played a key role in launching multiple products, leveraging his ML background to create innovative yet practical solutions, while consistently fostering collaboration, mentorship, and technical excellence across engineering teams.
Shibu Jacob is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers architect and implement cloud-native solutions. With over two decades of experience in software development and architecture, Shibu specializes in containerization, microservices, and event-driven architectures. He is particularly passionate about the transformative potential of AI in software development and architectural design. Prior to joining AWS, he spent 20 years working with enterprises and startups, bringing a wealth of practical experience to his current role. Outside of work, Shibu enjoys following Formula 1 racing, working on DIY automotive projects, going on long road trips, and spending time with his family.

A Coding Implementation to Build a Unified Apache Beam Pipeline Demons …

In this tutorial, we demonstrate how to build a unified Apache Beam pipeline that works seamlessly in both batch and stream-like modes using the DirectRunner. We generate synthetic, event-time–aware data and apply fixed windowing with triggers and allowed lateness to demonstrate how Apache Beam consistently handles both on-time and late events. By switching only the input source, we keep the core aggregation logic identical, which helps us clearly understand how Beam’s event-time model, windows, and panes behave without relying on external streaming infrastructure. Check out the FULL CODES here.

!pip -q install -U "grpcio>=1.71.2" "grpcio-status>=1.71.2"
!pip -q install -U apache-beam crcmod

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import AfterWatermark, AfterProcessingTime, AccumulationMode
from apache_beam.testing.test_stream import TestStream
import json
from datetime import datetime, timezone

We install the required dependencies and ensure version compatibility so that Apache Beam imports and runs correctly in the notebook environment. We import the core Beam APIs along with windowing, triggers, and TestStream utilities needed later in the pipeline. We also bring in standard Python modules for time handling and JSON formatting. Check out the FULL CODES here.

MODE = "stream"
WINDOW_SIZE_SECS = 60
ALLOWED_LATENESS_SECS = 120

def make_event(user_id, event_type, amount, event_time_epoch_s):
    return {"user_id": user_id, "event_type": event_type, "amount": float(amount), "event_time": int(event_time_epoch_s)}

base = datetime.now(timezone.utc).replace(microsecond=0)
t0 = int(base.timestamp())

BATCH_EVENTS = [
    make_event("u1", "purchase", 20, t0 + 5),
    make_event("u1", "purchase", 15, t0 + 20),
    make_event("u2", "purchase", 8, t0 + 35),
    make_event("u1", "refund", -5, t0 + 62),
    make_event("u2", "purchase", 12, t0 + 70),
    make_event("u3", "purchase", 9, t0 + 75),
    make_event("u2", "purchase", 3, t0 + 50),
]

We define the global configuration that controls window size, lateness, and execution mode. We create synthetic events with explicit event-time timestamps so that windowing behavior is deterministic and easy to reason about. We prepare a small dataset that intentionally includes out-of-order and late events to observe Beam’s event-time semantics. Check out the FULL CODES here.

def format_joined_record(kv):
    user_id, d = kv
    return {
        "user_id": user_id,
        "count": int(d["count"][0]) if d["count"] else 0,
        "sum_amount": float(d["sum_amount"][0]) if d["sum_amount"] else 0.0,
    }

class WindowedUserAgg(beam.PTransform):
    def expand(self, pcoll):
        stamped = pcoll | beam.Map(lambda e: beam.window.TimestampedValue(e, e["event_time"]))
        windowed = stamped | beam.WindowInto(
            FixedWindows(WINDOW_SIZE_SECS),
            allowed_lateness=ALLOWED_LATENESS_SECS,
            trigger=AfterWatermark(
                early=AfterProcessingTime(10),
                late=AfterProcessingTime(10),
            ),
            accumulation_mode=AccumulationMode.ACCUMULATING,
        )
        keyed = windowed | beam.Map(lambda e: (e["user_id"], e["amount"]))
        counts = keyed | beam.combiners.Count.PerKey()
        sums = keyed | beam.CombinePerKey(sum)
        return (
            {"count": counts, "sum_amount": sums}
            | beam.CoGroupByKey()
            | beam.Map(format_joined_record)
        )

We build a reusable Beam PTransform that encapsulates all windowed aggregation logic. We apply fixed windows, triggers, and accumulation rules, then group events by user and compute counts and sums. We keep this transform independent of the data source, so the same logic applies to both batch and streaming inputs. Check out the FULL CODES here.

class AddWindowInfo(beam.DoFn):
    def process(self, element, window=beam.DoFn.WindowParam, pane_info=beam.DoFn.PaneInfoParam):
        ws = float(window.start)
        we = float(window.end)
        yield {
            **element,
            "window_start_utc": datetime.fromtimestamp(ws, tz=timezone.utc).strftime("%H:%M:%S"),
            "window_end_utc": datetime.fromtimestamp(we, tz=timezone.utc).strftime("%H:%M:%S"),
            "pane_timing": str(pane_info.timing),
            "pane_is_first": pane_info.is_first,
            "pane_is_last": pane_info.is_last,
        }

def build_test_stream():
    return (
        TestStream()
        .advance_watermark_to(t0)
        .add_elements([
            beam.window.TimestampedValue(make_event("u1", "purchase", 20, t0 + 5), t0 + 5),
            beam.window.TimestampedValue(make_event("u1", "purchase", 15, t0 + 20), t0 + 20),
            beam.window.TimestampedValue(make_event("u2", "purchase", 8, t0 + 35), t0 + 35),
        ])
        .advance_processing_time(5)
        .advance_watermark_to(t0 + 61)
        .add_elements([
            beam.window.TimestampedValue(make_event("u1", "refund", -5, t0 + 62), t0 + 62),
            beam.window.TimestampedValue(make_event("u2", "purchase", 12, t0 + 70), t0 + 70),
            beam.window.TimestampedValue(make_event("u3", "purchase", 9, t0 + 75), t0 + 75),
        ])
        .advance_processing_time(5)
        .add_elements([
            beam.window.TimestampedValue(make_event("u2", "purchase", 3, t0 + 50), t0 + 50),
        ])
        .advance_watermark_to(t0 + 121)
        .advance_watermark_to_infinity()
    )

We enrich each aggregated record with window and pane metadata so we can clearly see when and why results are emitted. We convert Beam’s internal timestamps into human-readable UTC times for clarity. We also define a TestStream that simulates real streaming behavior using watermarks, processing-time advances, and late data. Check out the FULL CODES here.

def run_batch():
    with beam.Pipeline(options=PipelineOptions([])) as p:
        (
            p
            | beam.Create(BATCH_EVENTS)
            | WindowedUserAgg()
            | beam.ParDo(AddWindowInfo())
            | beam.Map(json.dumps)
            | beam.Map(print)
        )

def run_stream():
    opts = PipelineOptions([])
    opts.view_as(StandardOptions).streaming = True
    with beam.Pipeline(options=opts) as p:
        (
            p
            | build_test_stream()
            | WindowedUserAgg()
            | beam.ParDo(AddWindowInfo())
            | beam.Map(json.dumps)
            | beam.Map(print)
        )

run_stream() if MODE == "stream" else run_batch()

We wire everything together into executable batch and stream-like pipelines. We toggle between modes by changing a single flag while reusing the same aggregation transform. We run the pipeline and print the windowed results directly, making the execution flow and outputs easy to inspect.

In conclusion, we demonstrated that the same Beam pipeline can process both bounded batch data and unbounded, stream-like data while preserving identical windowing and aggregation semantics. We observed how watermarks, triggers, and accumulation modes influence when results are emitted and how late data updates previously computed windows. Also, we focused on the conceptual foundations of Beam’s unified model, providing a solid base for later scaling the same design to real streaming runners and production environments.

Check out the FULL CODES here. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, subscribe to our Newsletter, and join us on Telegram.

Check out our latest release of ai2025.dev, a 2025-focused analytics platform that turns model launches, benchmarks, and ecosystem activity into a structured dataset you can filter, compare, and export
The post A Coding Implementation to Build a Unified Apache Beam Pipeline Demonstrating Batch and Stream Processing with Event-Time Windowing Using DirectRunner appeared first on MarkTechPost.

TII Abu-Dhabi Released Falcon H1R-7B: A New Reasoning Model Outperform …

Technology Innovation Institute (TII), Abu Dhabi, has released Falcon-H1R-7B, a 7B parameter reasoning specialized model that matches or exceeds many 14B to 47B reasoning models in math, code and general benchmarks, while staying compact and efficient. It builds on Falcon H1 7B Base and is available on Hugging Face under the Falcon-H1R collection.

Falcon-H1R-7B is interesting because it combines 3 design choices in 1 system: a hybrid Transformer plus Mamba2 backbone, a very long context that reaches 256k tokens in standard vLLM deployments, and a training recipe that mixes supervised long form reasoning with reinforcement learning using GRPO.

Hybrid Transformer plus Mamba2 architecture with long context

Falcon-H1R-7B is a causal decoder only model with a hybrid architecture that combines Transformer layers and Mamba2 state space components. The Transformer blocks provide standard attention based reasoning, while the Mamba2 blocks give linear time sequence modeling and better memory scaling as context length grows. This design targets the 3 axes of reasoning efficiency that the team describes, speed, token efficiency and accuracy.

The model runs with a default --max-model-len of 262144 when served through vLLM, which corresponds to a practical 256k token context window. This allows very long chain of thought traces, multi step tool use logs and large multi document prompts in a single pass. The hybrid backbone helps control memory use at these sequence lengths and improves throughput compared with a pure Transformer 7B baseline on the same hardware.
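
As a rough illustration only, loading a checkpoint at this context length through vLLM's offline Python API could look like the sketch below; the Hugging Face repository id is a placeholder assumption, not a confirmed path from the Falcon-H1R collection.

from vllm import LLM, SamplingParams

# Placeholder repo id; replace with the actual checkpoint from the Falcon-H1R collection
llm = LLM(model="tiiuae/Falcon-H1R-7B", max_model_len=262144)  # roughly a 256k token window
params = SamplingParams(temperature=0.6, max_tokens=2048)

outputs = llm.generate(["Prove that the sum of two even integers is even."], params)
print(outputs[0].outputs[0].text)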

Training recipe for reasoning tasks

Falcon H1R 7B uses a 2 stage training pipeline:

In the first stage, the team runs cold start supervised fine tuning on top of Falcon-H1-7B Base. The SFT (supervised fine tuning) data mixes step by step long form reasoning traces in 3 main domains, mathematics, coding and science, plus non reasoning domains such as chat, tool calling and safety. Difficulty aware filtering upweights harder problems and downweights trivial ones. Targets can reach up to 48k tokens, so the model sees long derivations and full solution paths during training.

In the second stage, the SFT checkpoint is refined with GRPO, which is a group relative policy optimization method for reinforcement learning. Rewards are given when the generated reasoning chain is verifiably correct. For math problems, the system uses symbolic checks on the final answer. For code, it executes the generated program against unit tests. This RL stage pushes the model to keep useful intermediate steps while staying within a token budget.
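
As a simplified illustration of a verifiable reward (not the team's actual implementation), the sketch below returns a reward of 1 only when the final boxed answer in a generation matches the reference; production pipelines rely on much more robust symbolic checking for math and sandboxed unit-test execution for code.

import re

def math_reward(generation: str, reference_answer: str) -> float:
    # Toy verifiable reward: 1.0 if the last \boxed{...} answer matches the reference exactly
    matches = re.findall(r"\\boxed\{([^}]*)\}", generation)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == reference_answer.strip() else 0.0

print(math_reward("... therefore the answer is \\boxed{42}", "42"))  # 1.0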

The result is a 7B model that is tuned specifically for chain of thought reasoning, rather than general chat.

Benchmarks in math, coding and general reasoning

The Falcon-H1R-7B benchmark scores are grouped across math, code and agentic tasks, and general reasoning tasks.

In the math group, Falcon-H1R-7B reaches an aggregate score of 73.96%, ahead of Apriel-1.5-15B at 69.32% and larger models like Qwen3-32B and Nemotron-H-47B. On individual benchmarks:

AIME 24, 88.1%, higher than Apriel-1.5-15B at 86.2%

AIME 25, 83.1%, higher than Apriel-1.5-15B at 80%

HMMT 25, 64.9%, above all listed baselines

AMO Bench, 36.3%, compared with 23.3% for DeepSeek-R1-0528 Qwen3-8B

For code and agentic workloads, the model reaches 33.95% as a group score. On LiveCodeBench v6, Falcon-H1R-7B scores 68.6%, which is higher than Qwen3-32B and other baselines. It also scores 28.3% on the SciCode sub problem benchmark and 4.9% on Terminal Bench Hard, where it ranks second behind Apriel 1.5-15B but ahead of several 8B and 32B systems.

https://huggingface.co/blog/tiiuae/falcon-h1r-7b

On general reasoning, Falcon-H1R-7B achieves 49.48% as a group score. It records 61.3% on GPQA D, close to other 8B models, 72.1% on MMLU Pro, which is higher than all other 8B models in the above table, 11.1% on HLE and 53.4% on IFBench, where it is second only to Apriel 1.5 15B.

The key takeaway is that a 7B model can sit in the same performance band as many 14B to 47B reasoning models, if the architecture and training pipeline are tuned for reasoning tasks.

Inference throughput and test time scaling

The team also benchmarked Falcon-H1R-7B on throughput and test time scaling under realistic batch settings.

For a 512 token input and 32k token output, Falcon-H1R-7B reaches about 1,000 tokens per second per GPU at batch size 32 and about 1,500 tokens per second per GPU at batch size 64, nearly double the throughput of Qwen3-8B in the same configuration. For an 8k input and 16k output, Falcon-H1R-7B reaches around 1,800 tokens per second per GPU, while Qwen3-8B stays below 900. The hybrid Transformer along with Mamba architecture is a key factor in this scaling behavior, because it reduces the quadratic cost of attention for long sequences.

Falcon-H1R-7B is also designed for test time scaling using Deep Think with confidence, known as DeepConf. The idea is to run many chains of thought in parallel, then use the model’s own next token confidence scores to filter noisy traces and keep only high quality candidates.

On AIME 24 and AIME 25, Falcon-H1R-7B reaches 96.7% accuracy with fewer than 100 million generated tokens, which puts it on a favorable Pareto frontier of accuracy versus token cost compared with other 8B, 14B and 32B reasoning models. On the parser verifiable subset of AMO Bench, it reaches 35.9% accuracy with 217 million tokens, again ahead of the comparison models at similar or larger scale.
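
To make the test time scaling idea concrete, here is a minimal, framework-agnostic sketch of confidence-filtered voting in the spirit of DeepConf; the scoring by mean token log probability and the simple majority vote are assumptions for illustration, not the exact procedure used by the Falcon team.

from collections import Counter

def confidence_filtered_vote(samples, keep_ratio=0.5):
    # samples: list of (final_answer, mean_token_logprob) pairs from parallel reasoning chains
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)   # most confident chains first
    kept = ranked[: max(1, int(len(ranked) * keep_ratio))]       # drop low-confidence traces
    votes = Counter(answer for answer, _ in kept)
    return votes.most_common(1)[0][0]                            # majority answer among survivors

print(confidence_filtered_vote([("36", -0.21), ("35", -0.80), ("36", -0.25), ("12", -1.40)]))  # "36"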

Key Takeaways

Falcon-H1R-7B is a 7B parameter reasoning model that uses a hybrid Transformer along with Mamba2 architecture and supports a 256k token context for long chain of thought prompts.

The model is trained in 2 stages, supervised fine tuning on long reasoning traces in math, code and science up to 48k tokens, followed by GRPO based reinforcement learning with verifiable rewards for math and code.

Falcon-H1R-7B achieves strong math performance, including about 88.1% on AIME 24, 83.1% on AIME 25 and a 73.96% aggregate math score, which is competitive with or better than larger 14B to 47B models.

On coding and agentic tasks, Falcon-H1R-7B obtains 33.95% as a group score and 68.6% on LiveCodeBench v6, and it is also competitive on general reasoning benchmarks such as MMLU Pro and GPQA D.

The hybrid design improves throughput, reaching around 1,000 to 1,800 tokens per second per GPU in the reported settings, and the model supports test time scaling through Deep Think with confidence to improve accuracy using multiple reasoning samples under a controlled token budget.

Check out the Technical details and MODEL WEIGHTS here. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, subscribe to our Newsletter, and join us on Telegram.

Check out our latest release of ai2025.dev, a 2025-focused analytics platform that turns model launches, benchmarks, and ecosystem activity into a structured dataset you can filter, compare, and export
The post TII Abu-Dhabi Released Falcon H1R-7B: A New Reasoning Model Outperforming Others in Math and Coding with only 7B Params with 256k Context Window appeared first on MarkTechPost.

Implementing Softmax From Scratch: Avoiding the Numerical Stability Tr …

In deep learning, classification models don’t just need to make predictions—they need to express confidence. That’s where the Softmax activation function comes in. Softmax takes the raw, unbounded scores produced by a neural network and transforms them into a well-defined probability distribution, making it possible to interpret each output as the likelihood of a specific class. 

This property makes Softmax a cornerstone of multi-class classification tasks, from image recognition to language modeling. In this article, we’ll build an intuitive understanding of how Softmax works and why its implementation details matter more than they first appear. Check out the FULL CODES here.

Implementing Naive Softmax

import torch

def softmax_naive(logits):
    exp_logits = torch.exp(logits)
    return exp_logits / exp_logits.sum(dim=1, keepdim=True)

This function implements the Softmax activation in its most straightforward form. It exponentiates each logit and normalizes it by the sum of all exponentiated values across classes, producing a probability distribution for each input sample. 

While this implementation is mathematically correct and easy to read, it is numerically unstable—large positive logits can cause overflow, and large negative logits can underflow to zero. As a result, this version should be avoided in real training pipelines. Check out the FULL CODES here.

Sample Logits and Target Labels

This example defines a small batch with three samples and three classes to illustrate both normal and failure cases. The first and third samples contain reasonable logit values and behave as expected during Softmax computation. The second sample intentionally includes extreme values (1000 and -1000) to demonstrate numerical instability—this is where the naive Softmax implementation breaks down. 

The targets tensor specifies the correct class index for each sample and will be used to compute the classification loss and observe how instability propagates during backpropagation. Check out the FULL CODES here.

# Batch of 3 samples, 3 classes
logits = torch.tensor([
    [2.0, 1.0, 0.1],
    [1000.0, 1.0, -1000.0],
    [3.0, 2.0, 1.0]
], requires_grad=True)

targets = torch.tensor([0, 2, 1])

Forward Pass: Softmax Output and the Failure Case

During the forward pass, the naive Softmax function is applied to the logits to produce class probabilities. For normal logit values (first and third samples), the output is a valid probability distribution where values lie between 0 and 1 and sum to 1. 

However, the second sample clearly exposes the numerical issue: exponentiating 1000 overflows to infinity, while -1000 underflows to zero. This results in invalid operations during normalization, producing NaN values and zero probabilities. Once NaN appears at this stage, it contaminates all subsequent computations, making the model unusable for training. Check out the FULL CODES here.

# Forward pass
probs = softmax_naive(logits)

print("Softmax probabilities:")
print(probs)

Target Probabilities and Loss Breakdown

Here, we extract the predicted probability corresponding to the true class for each sample. While the first and third samples return valid probabilities, the second sample’s target probability is 0.0, caused by numerical underflow in the Softmax computation. When the loss is calculated using -log(p), taking the logarithm of 0.0 results in +∞. 

This makes the overall loss infinite, which is a critical failure during training. Once the loss becomes infinite, gradient computation becomes unstable, leading to NaNs during backpropagation and effectively halting learning. Check out the FULL CODES here.

# Extract target probabilities
target_probs = probs[torch.arange(len(targets)), targets]

print("\nTarget probabilities:")
print(target_probs)

# Compute loss
loss = -torch.log(target_probs).mean()
print("\nLoss:", loss)

Backpropagation: Gradient Corruption

When backpropagation is triggered, the impact of the infinite loss becomes immediately visible. The gradients for the first and third samples remain finite because their Softmax outputs were well-behaved. However, the second sample produces NaN gradients across all classes due to the log(0) operation in the loss. 

These NaNs propagate backward through the network, contaminating weight updates and effectively breaking training. This is why numerical instability at the Softmax–loss boundary is so dangerous—once NaNs appear, recovery is nearly impossible without restarting training. Check out the FULL CODES here.

loss.backward()

print("\nGradients:")
print(logits.grad)

Numerical Instability and Its Consequences

Separating Softmax and cross-entropy creates a serious numerical stability risk due to exponential overflow and underflow. Large logits can push probabilities to infinity or zero, causing log(0) and leading to NaN gradients that quickly corrupt training. At production scale, this is not a rare edge case but a certainty—without stable, fused implementations, large multi-GPU training runs would fail unpredictably. 

The core numerical problem comes from the fact that computers cannot represent infinitely large or infinitely small numbers. Floating-point formats like FP32 have strict limits on how big or small a value can be stored. When Softmax computes exp(x), large positive values grow so fast that they exceed the maximum representable number and turn into infinity, while large negative values shrink so much that they become zero. Once a value becomes infinity or zero, subsequent operations like division or logarithms break down and produce invalid results. Check out the FULL CODES here.
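
A two-line check makes these limits concrete; this small demonstration of FP32 saturation is independent of the classification example above.

import torch

x = torch.tensor([1000.0, -1000.0])    # FP32 by default
print(torch.exp(x))                    # tensor([inf, 0.]) -> overflow and underflow
print(torch.finfo(torch.float32).max)  # ~3.4e38, the largest representable FP32 value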

Implementing Stable Cross-Entropy Loss Using LogSumExp

This implementation computes cross-entropy loss directly from raw logits without explicitly calculating Softmax probabilities. To maintain numerical stability, the logits are first shifted by subtracting the maximum value per sample, ensuring exponentials stay within a safe range. 

The LogSumExp trick is then used to compute the normalization term, after which the original (unshifted) target logit is subtracted to obtain the correct loss. This approach avoids overflow, underflow, and NaN gradients, and mirrors how cross-entropy is implemented in production-grade deep learning frameworks. Check out the FULL CODES here.

def stable_cross_entropy(logits, targets):

    # Find max logit per sample
    max_logits, _ = torch.max(logits, dim=1, keepdim=True)

    # Shift logits for numerical stability
    shifted_logits = logits - max_logits

    # Compute LogSumExp
    log_sum_exp = torch.log(torch.sum(torch.exp(shifted_logits), dim=1)) + max_logits.squeeze(1)

    # Compute loss using ORIGINAL logits
    loss = log_sum_exp - logits[torch.arange(len(targets)), targets]

    return loss.mean()

Stable Forward and Backward Pass

Running the stable cross-entropy implementation on the same extreme logits produces a finite loss and well-defined gradients. Even though one sample contains very large values (1000 and -1000), the LogSumExp formulation keeps all intermediate computations in a safe numerical range. As a result, backpropagation completes successfully without producing NaNs, and each class receives a meaningful gradient signal. 

This confirms that the instability seen earlier was not caused by the data itself, but by the naive separation of Softmax and cross-entropy—an issue fully resolved by using a numerically stable, fused loss formulation. Check out the FULL CODES here.

logits = torch.tensor([
    [2.0, 1.0, 0.1],
    [1000.0, 1.0, -1000.0],
    [3.0, 2.0, 1.0]
], requires_grad=True)

targets = torch.tensor([0, 2, 1])

loss = stable_cross_entropy(logits, targets)
print("Stable loss:", loss)

loss.backward()
print("\nGradients:")
print(logits.grad)

Conclusion

In practice, the gap between mathematical formulas and real-world code is where many training failures originate. While Softmax and cross-entropy are mathematically well-defined, their naive implementation ignores the finite precision limits of IEEE 754 hardware, making underflow and overflow inevitable. 

The key fix is simple but critical: shift logits before exponentiation and operate in the log domain whenever possible. Most importantly, training rarely requires explicit probabilities—stable log-probabilities are sufficient and far safer. When a loss suddenly turns into NaN in production, it’s often a signal that Softmax is being computed manually somewhere it shouldn’t be.

Check out the FULL CODES here. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, subscribe to our Newsletter, and join us on Telegram.

Check out our latest release of ai2025.dev, a 2025-focused analytics platform that turns model launches, benchmarks, and ecosystem activity into a structured dataset you can filter, compare, and export
The post Implementing Softmax From Scratch: Avoiding the Numerical Stability Trap appeared first on MarkTechPost.

How to Design an Agentic AI Architecture with LangGraph and OpenAI Usi …

In this tutorial, we build a genuinely advanced Agentic AI system using LangGraph and OpenAI models by going beyond simple planner, executor loops. We implement adaptive deliberation, where the agent dynamically decides between fast and deep reasoning; a Zettelkasten-style agentic memory graph that stores atomic knowledge and automatically links related experiences; and a governed tool-use mechanism that enforces constraints during execution. By combining structured state management, memory-aware retrieval, reflexive learning, and controlled tool invocation, we demonstrate how modern agentic systems can reason, act, learn, and evolve rather than respond in a single pass. Check out the FULL CODES here.

!pip -q install -U langgraph langchain-openai langchain-core pydantic numpy networkx requests

import os, getpass, json, time, operator
from typing import List, Dict, Any, Optional, Literal
from typing_extensions import TypedDict, Annotated
import numpy as np
import networkx as nx
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.messages import SystemMessage, HumanMessage, ToolMessage, AnyMessage
from langchain_core.tools import tool
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import InMemorySaver

We set up the execution environment by installing all required libraries and importing the core modules. We bring together LangGraph for orchestration, LangChain for model and tool abstractions, and supporting libraries for memory graphs and numerical operations. Check out the FULL CODES here.

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OPENAI_API_KEY: ")

MODEL = os.environ.get("OPENAI_MODEL", "gpt-4o-mini")
EMB_MODEL = os.environ.get("OPENAI_EMBED_MODEL", "text-embedding-3-small")

llm_fast = ChatOpenAI(model=MODEL, temperature=0)
llm_deep = ChatOpenAI(model=MODEL, temperature=0)
llm_reflect = ChatOpenAI(model=MODEL, temperature=0)
emb = OpenAIEmbeddings(model=EMB_MODEL)

We securely load the OpenAI API key at runtime and initialize the language models used for fast, deep, and reflective reasoning. We also configure the embedding model that powers semantic similarity in memory. This separation allows us to flexibly switch reasoning depth while maintaining a shared representation space for memory. Check out the FULL CODES here.

class Note(BaseModel):
    note_id: str
    title: str
    content: str
    tags: List[str] = Field(default_factory=list)
    created_at_unix: float
    context: Dict[str, Any] = Field(default_factory=dict)

class MemoryGraph:
    def __init__(self):
        self.g = nx.Graph()
        self.note_vectors = {}

    def _cos(self, a, b):
        return float(np.dot(a, b) / ((np.linalg.norm(a) + 1e-9) * (np.linalg.norm(b) + 1e-9)))

    def add_note(self, note, vec):
        self.g.add_node(note.note_id, **note.model_dump())
        self.note_vectors[note.note_id] = vec

    def topk_related(self, vec, k=5):
        scored = [(nid, self._cos(vec, v)) for nid, v in self.note_vectors.items()]
        scored.sort(key=lambda x: x[1], reverse=True)
        return [{"note_id": n, "score": s, "title": self.g.nodes[n]["title"]} for n, s in scored[:k]]

    def link_note(self, a, b, w, r):
        if a != b:
            self.g.add_edge(a, b, weight=w, reason=r)

    def evolve_links(self, nid, vec):
        for r in self.topk_related(vec, 8):
            if r["score"] >= 0.78:
                self.link_note(nid, r["note_id"], r["score"], "evolve")

MEM = MemoryGraph()

We construct an agentic memory graph inspired by the Zettelkasten method, where each interaction is stored as an atomic note. We embed each note and connect it to semantically related notes using similarity scores. Check out the FULL CODES here.

@tool
def web_get(url: str) -> str:
    import urllib.request
    with urllib.request.urlopen(url, timeout=15) as r:
        return r.read(25000).decode("utf-8", errors="ignore")

@tool
def memory_search(query: str, k: int = 5) -> str:
    qv = np.array(emb.embed_query(query))
    hits = MEM.topk_related(qv, k)
    return json.dumps(hits, ensure_ascii=False)

@tool
def memory_neighbors(note_id: str) -> str:
    if note_id not in MEM.g:
        return "[]"
    return json.dumps([
        {"note_id": n, "weight": MEM.g[note_id][n]["weight"]}
        for n in MEM.g.neighbors(note_id)
    ])

TOOLS = [web_get, memory_search, memory_neighbors]
TOOLS_BY_NAME = {t.name: t for t in TOOLS}

We define the external tools the agent can invoke, including web access and memory-based retrieval. We integrate these tools in a structured way so the agent can query past experiences or fetch new information when necessary. Check out the FULL CODES here.

class DeliberationDecision(BaseModel):
    mode: Literal["fast", "deep"]
    reason: str
    suggested_steps: List[str]

class RunSpec(BaseModel):
    goal: str
    constraints: List[str]
    deliverable_format: str
    must_use_memory: bool
    max_tool_calls: int

class Reflection(BaseModel):
    note_title: str
    note_tags: List[str]
    new_rules: List[str]
    what_worked: List[str]
    what_failed: List[str]

class AgentState(TypedDict, total=False):
    run_spec: Dict[str, Any]
    messages: Annotated[List[AnyMessage], operator.add]
    decision: Dict[str, Any]
    final: str
    budget_calls_remaining: int
    tool_calls_used: int
    max_tool_calls: int
    last_note_id: str

DECIDER_SYS = "Decide fast vs deep."
AGENT_FAST = "Operate fast."
AGENT_DEEP = "Operate deep."
REFLECT_SYS = "Reflect and store learnings."

We formalize the agent’s internal representations using structured schemas for deliberation, execution goals, reflection, and global state. We also define the system prompts that guide behavior in fast and deep modes. This ensures the agent’s reasoning and decisions remain consistent, interpretable, and controllable. Check out the FULL CODES here.

def deliberate(st):
    spec = RunSpec.model_validate(st["run_spec"])
    d = llm_fast.with_structured_output(DeliberationDecision).invoke([
        SystemMessage(content=DECIDER_SYS),
        HumanMessage(content=json.dumps(spec.model_dump()))
    ])
    return {"decision": d.model_dump(), "budget_calls_remaining": st["budget_calls_remaining"] - 1}

def agent(st):
    spec = RunSpec.model_validate(st["run_spec"])
    d = DeliberationDecision.model_validate(st["decision"])
    llm = llm_deep if d.mode == "deep" else llm_fast
    sys = AGENT_DEEP if d.mode == "deep" else AGENT_FAST
    out = llm.bind_tools(TOOLS).invoke([
        SystemMessage(content=sys),
        *st.get("messages", []),
        HumanMessage(content=json.dumps(spec.model_dump()))
    ])
    return {"messages": [out], "budget_calls_remaining": st["budget_calls_remaining"] - 1}

def route(st):
    return "tools" if st["messages"][-1].tool_calls else "finalize"

def tools_node(st):
    msgs = []
    used = st.get("tool_calls_used", 0)
    for c in st["messages"][-1].tool_calls:
        obs = TOOLS_BY_NAME[c["name"]].invoke(c["args"])
        msgs.append(ToolMessage(content=str(obs), tool_call_id=c["id"]))
        used += 1
    return {"messages": msgs, "tool_calls_used": used}

def finalize(st):
    out = llm_deep.invoke(st["messages"] + [HumanMessage(content="Return final output")])
    return {"final": out.content}

def reflect(st):
    r = llm_reflect.with_structured_output(Reflection).invoke([
        SystemMessage(content=REFLECT_SYS),
        HumanMessage(content=st["final"])
    ])
    note = Note(
        note_id=str(time.time()),
        title=r.note_title,
        content=st["final"],
        tags=r.note_tags,
        created_at_unix=time.time()
    )
    vec = np.array(emb.embed_query(note.title + note.content))
    MEM.add_note(note, vec)
    MEM.evolve_links(note.note_id, vec)
    return {"last_note_id": note.note_id}

We implement the core agentic behaviors as LangGraph nodes, including deliberation, action, tool execution, finalization, and reflection. We orchestrate how information flows between these stages and how decisions affect the execution path. Check out the FULL CODES here.

g = StateGraph(AgentState)
g.add_node("deliberate", deliberate)
g.add_node("agent", agent)
g.add_node("tools", tools_node)
g.add_node("finalize", finalize)
g.add_node("reflect", reflect)

g.add_edge(START, "deliberate")
g.add_edge("deliberate", "agent")
g.add_conditional_edges("agent", route, ["tools", "finalize"])
g.add_edge("tools", "agent")
g.add_edge("finalize", "reflect")
g.add_edge("reflect", END)

graph = g.compile(checkpointer=InMemorySaver())

def run_agent(goal, constraints=None, thread_id="demo"):
    if constraints is None:
        constraints = []
    spec = RunSpec(
        goal=goal,
        constraints=constraints,
        deliverable_format="markdown",
        must_use_memory=True,
        max_tool_calls=6
    ).model_dump()

    return graph.invoke({
        "run_spec": spec,
        "messages": [],
        "budget_calls_remaining": 10,
        "tool_calls_used": 0,
        "max_tool_calls": 6
    }, config={"configurable": {"thread_id": thread_id}})

We assemble all nodes into a LangGraph workflow and compile it with checkpointed state management. We also define a reusable runner function that executes the agent while preserving memory across runs.

In conclusion, we showed how an agent can continuously improve its behavior through reflection and memory rather than relying on static prompts or hard-coded logic. We used LangGraph to orchestrate deliberation, execution, tool governance, and reflexion as a coherent graph, while OpenAI models provide the reasoning and synthesis capabilities at each stage. This approach illustrated how agentic AI systems can move closer to autonomy by adapting their reasoning depth, reusing prior knowledge, and encoding lessons as persistent memory, forming a practical foundation for building scalable, self-improving agents in real-world applications.

Check out the FULL CODES here. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, subscribe to our Newsletter, and join us on Telegram.

Check out our latest release of ai2025.dev, a 2025-focused analytics platform that turns model launches, benchmarks, and ecosystem activity into a structured dataset you can filter, compare, and export
The post How to Design an Agentic AI Architecture with LangGraph and OpenAI Using Adaptive Deliberation, Memory Graphs, and Reflexion Loops appeared first on MarkTechPost.

Liquid AI Releases LFM2.5: A Compact AI Model Family For Real On Devic …

Liquid AI has introduced LFM2.5, a new generation of small foundation models built on the LFM2 architecture and focused on on device and edge deployments. The model family includes LFM2.5-1.2B-Base and LFM2.5-1.2B-Instruct and extends to Japanese, vision language, and audio language variants. It is released as open weights on Hugging Face and exposed through the LEAP platform.

Architecture and training recipe

LFM2.5 keeps the hybrid LFM2 architecture that was designed for fast and memory efficient inference on CPUs and NPUs and scales the data and post training pipeline. Pretraining for the 1.2 billion parameter backbone is extended from 10T to 28T tokens. The instruct variant then receives supervised fine tuning, preference alignment, and large scale multi stage reinforcement learning focused on instruction following, tool use, math, and knowledge reasoning.

Text model performance at one billion scale

LFM2.5-1.2B-Instruct is the main general purpose text model. Liquid AI team reports benchmark results on GPQA, MMLU Pro, IFEval, IFBench, and several function calling and coding suites. The model reaches 38.89 on GPQA and 44.35 on MMLU Pro. Competing 1B class open models such as Llama-3.2-1B Instruct and Gemma-3-1B IT score significantly lower on these metrics.

https://www.liquid.ai/blog/introducing-lfm2-5-the-next-generation-of-on-device-ai

On IFEval and IFBench, which target multi step instruction following and function calling quality, LFM2.5-1.2B-Instruct reports 86.23 and 47.33. These values are ahead of the other 1B class baselines in the above Liquid AI table.

Japanese optimized variant

LFM2.5-1.2B-JP is a Japanese optimized text model derived from the same backbone. It targets tasks such as JMMLU, M-IFEval in Japanese, and GSM8K in Japanese. This checkpoint improves over the general instruct model on Japanese tasks and competes with or surpasses other small multilingual models like Qwen3-1.7B, Llama 3.2-1B Instruct, and Gemma 3-1B IT on these localized benchmarks.

Vision language model for multimodal edge workloads

LFM2.5-VL-1.6B is the updated vision language model in the series. It uses LFM2.5-1.2B-Base as the language backbone and adds a vision tower for image understanding. The model is tuned on a range of visual reasoning and OCR benchmarks, including MMStar, MM IFEval, BLINK, InfoVQA, OCRBench v2, RealWorldQA, MMMU, and multilingual MMBench. LFM2.5-VL-1.6B improves over the previous LFM2-VL-1.6B on most metrics and is intended for real world tasks such as document understanding, user interface reading, and multi image reasoning under edge constraints.

Audio language model with native speech generation

LFM2.5-Audio-1.5B is a native audio language model that supports both text and audio inputs and outputs. It is presented as an Audio to Audio model and uses an audio detokenizer that is described as eight times faster than the previous Mimi based detokenizer at the same precision on constrained hardware.

The model supports two main generation modes. Interleaved generation is designed for real time speech to speech conversational agents where latency dominates. Sequential generation is aimed at tasks such as automatic speech recognition and text to speech and allows switching the generated modality without reinitializing the model. The audio stack is trained with quantization aware training at low precision, which keeps metrics such as STOI and UTMOS close to the full precision baseline while enabling deployment on devices with limited compute.

https://www.liquid.ai/blog/introducing-lfm2-5-the-next-generation-of-on-device-ai

Key Takeaways

LFM2.5 is a 1.2B scale hybrid model family built on the LFM2 device optimized architecture, with Base, Instruct, Japanese, Vision Language, and Audio Language variants, all released as open weights on Hugging Face and LEAP.

Pretraining for LFM2.5 extends from 10T to 28T tokens and the Instruct model adds supervised fine tuning, preference alignment, and large scale multi stage reinforcement learning, which pushes instruction following and tool use quality beyond other 1B class baselines.

LFM2.5-1.2B-Instruct delivers strong text benchmark performance at the 1B scale, reaching 38.89 on GPQA and 44.35 on MMLU Pro and leading peer models such as Llama 3.2 1B Instruct, Gemma 3 1B IT, and Granite 4.0 1B on IFEval and IFBench.

The family includes specialized multimodal and regional variants, with LFM2.5-1.2B-JP achieving state of the art results for Japanese benchmarks at its scale and LFM2.5-VL-1.6B and LFM2.5-Audio-1.5B covering vision language and native audio language workloads for edge agents.

Check out the Technical details and Model weights. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, subscribe to our Newsletter, and join us on Telegram.

Check out our latest release of ai2025.dev, a 2025-focused analytics platform that turns model launches, benchmarks, and ecosystem activity into a structured dataset you can filter, compare, and export
The post Liquid AI Releases LFM2.5: A Compact AI Model Family For Real On Device Agents appeared first on MarkTechPost.

Marktechpost Releases ‘AI2025Dev’: A Structured Intelligence Layer …

Marktechpost has released AI2025Dev, its 2025 analytics platform (available to AI Devs and Researchers without any signup or login) designed to convert the year’s AI activity into a queryable dataset spanning model releases, openness, training scale, benchmark performance, and ecosystem participants. Marktechpost is a California based AI news platform covering machine learning, deep learning, and data science research.

What’s new in this release

The 2025 release of AI2025Dev expands coverage across two layers:

Release analytics, focusing on model and framework launches, license posture, vendor activity, and feature level segmentation.

Ecosystem indexes, including curated “Top 100” collections that connect models to papers and the people and capital behind them. This release includes dedicated sections for:

Top 100 research papers

Top 100 AI researchers

Top AI startups

Top AI founders

Top AI investors

Funding views that link investors and companies

These indexes are designed to be navigable and filterable, rather than static editorial lists, so teams can trace relationships across artifacts like company, model type, benchmark scores, and release timing.

AI Releases in 2025: year level metrics from the market map dataset

AI2025Dev’s ‘AI Releases in 2025’ overview is backed by a structured market map dataset covering 100 tracked releases and 39 active companies. The dataset normalizes each entry into a consistent schema: name, company, type, license, flagship, and release_date.

Key aggregate indicators in this release include:

Total releases: 100

Open share: 69%, computed as the combined share of Open Source and Open Weights releases (44 and 25 entries respectively), with 31 Proprietary releases

Flagship models: 63, enabling separation of frontier tier launches from derivative or narrow scope releases

Active companies: 39, reflecting a concentration of major releases among a relatively fixed set of vendors

Model category coverage in the market map is explicitly typed, enabling faceted queries and comparative analysis. The distribution includes LLM (58), Agentic Model (11), Vision Model (8), Tool (7), Multimodal (6), Framework (4), Code Model (2), Audio Model (2), plus Embedding Model (1) and Agent (1).
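
As an illustration of how this normalized schema supports such aggregates, the sketch below recomputes an open share and a category distribution from a few invented rows shaped like the platform's name/company/type/license/flagship/release_date schema; the sample entries are hypothetical and exist only to show the query pattern.

from collections import Counter

releases = [
    {"name": "ModelA", "company": "VendorX", "type": "LLM", "license": "Open Weights", "flagship": True, "release_date": "2025-03-01"},
    {"name": "ToolB", "company": "VendorY", "type": "Tool", "license": "Open Source", "flagship": False, "release_date": "2025-05-12"},
    {"name": "ModelC", "company": "VendorZ", "type": "LLM", "license": "Proprietary", "flagship": True, "release_date": "2025-07-20"},
]

open_share = sum(r["license"] in {"Open Source", "Open Weights"} for r in releases) / len(releases)
print(f"Open share: {open_share:.0%}")        # 67% for this toy sample
print(Counter(r["type"] for r in releases))   # category distribution
print(sum(r["flagship"] for r in releases), "flagship releases")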

Key Findings 2025: category level shifts captured as measurable signals

The release packages a ‘Key Findings 2025’ layer that surfaces year level shifts as measurable slices of the dataset rather than commentary. The platform highlights three recurring technical themes:

Open weights adoption, capturing the rising share of releases with weights available under open source or open weights terms, and the downstream implication that more teams can benchmark, fine tune, and deploy without vendor locked inference.

Agentic and tool using systems, tracking the growth of models and systems categorized around tool use, orchestration, and task execution, rather than pure chat interaction.

Efficiency and compression, reflecting a 2025 pattern where distillation and other model optimization techniques increasingly target smaller footprints while maintaining competitive benchmark behavior.

LLM Training Data Scale in 2025: token scale with timeline alignment

A dedicated visualization tracks LLM training data scale in 2025, spanning 1.4T to 36T tokens and aligning token budgets to a release timeline. By encoding token scale and date in a single view, the platform makes it possible to compare how vendors are allocating training budgets over time and how extreme scale relates to observed benchmark outcomes.

Performance Benchmarks: benchmark normalized scoring and inspection

The Analytics section includes a Performance Benchmarks view and an Intelligence Index derived from standard evaluation axes, including MMLU, HumanEval, and GSM8K. The objective is not to replace task specific evaluations, but to provide a consistent baseline for comparing vendor releases when public reporting differs in format and completeness.

The platform exposes:

Ranked performance summaries for quick scanning

Per benchmark columns to detect tradeoffs (for example, coding optimized models that diverge from reasoning centric performance)

Export controls to support downstream analysis workflows

Model Leaderboard and Model Comparison: operational evaluation workflows

To reduce the friction of model selection, AI2025Dev includes:

A Model Leaderboard that aggregates scores and metadata for a broader 2025 model set

A Model Comparison view that enables side by side evaluation across benchmarks and attributes, with search and filtering to build shortlists by vendor, type, and openness

These workflows are designed for engineering teams that need a structured comparison surface before committing to integration, inference spend, or fine tuning pipelines.

Top 100 indexes: papers, researchers, startups, and investors

Beyond model tracking, the release extends to ecosystem mapping. The platform adds navigable “Top 100” modules for:

Research papers, providing an entry point into the core technical work shaping 2025 systems

AI researchers, presented as an unranked, evidence backed index with conference anchored context

AI startups and founders, enabling linkage between product direction and released systems

AI investors and funding, enabling analysis of capital flows around model and tool categories

Availability

The updated platform is available now at AI2025Dev and you don’t need any signup or login to access the platform. The release is designed to support both fast scanning and analyst grade workflows, with normalized schemas, typed categories, and exportable views intended for quantitative comparison rather than narrative browsing.
The post Marktechpost Releases ‘AI2025Dev’: A Structured Intelligence Layer for AI Models, Benchmarks, and Ecosystem Signals appeared first on MarkTechPost.

LLM-Pruning Collection: A JAX Based Repo For Structured And Unstructur …

Zlab Princeton researchers have released LLM-Pruning Collection, a JAX based repository that consolidates major pruning algorithms for large language models into a single, reproducible framework. It targets one concrete goal, make it easy to compare block level, layer level and weight level pruning methods under a consistent training and evaluation stack on both GPUs and TPUs.

What LLM-Pruning Collection Contains?

It is described as a JAX based repo for LLM pruning. It is organized into three main directories:

pruning holds implementations for several pruning methods: Minitron, ShortGPT, Wanda, SparseGPT, Magnitude, Sheared Llama and LLM-Pruner.

training provides integration with FMS-FSDP for GPU training and MaxText for TPU training.

eval exposes JAX compatible evaluation scripts built around lm-eval-harness, with accelerate based support for MaxText that gives about 2 to 4 times speedup.

Pruning Methods Covered

LLM-Pruning Collection spans several families of pruning algorithms with different granularity levels:

Minitron

Minitron is a practical pruning and distillation recipe developed by NVIDIA that compresses Llama 3.1 8B and Mistral NeMo 12B to 4B and 8B while preserving performance. It explores depth pruning and joint width pruning of hidden sizes, attention and MLP, followed by distillation.

In LLM-Pruning Collection, the pruning/minitron folder provides scripts such as prune_llama3.1-8b.sh which run Minitron style pruning on Llama 3.1 8B.

ShortGPT

ShortGPT is based on the observation that many Transformer layers are redundant. The method defines Block Influence, a metric that measures the contribution of each layer and then removes low influence layers by direct layer deletion. Experiments show that ShortGPT outperforms previous pruning methods for multiple choice and generative tasks.

In the collection, ShortGPT is implemented through the Minitron folder with a dedicated script prune_llama2-7b.sh.

Wanda, SparseGPT, Magnitude

Wanda is a post training pruning method that scores weights by the product of weight magnitude and corresponding input activation on a per output basis. It prunes the smallest scores, requires no retraining and induces sparsity that works well even at billion parameter scale.

SparseGPT is another post training method that uses a second order inspired reconstruction step to prune large GPT style models at high sparsity ratios. Magnitude pruning is the classical baseline that removes weights with small absolute value.

In LLM-Pruning Collection, all three live under pruning/wanda with a shared installation path. The README includes a dense table of Llama 2 7B results that compares Wanda, SparseGPT and Magnitude across BoolQ, RTE, HellaSwag, Winogrande, ARC E, ARC C and OBQA, under unstructured and structured sparsity patterns such as 4:8 and 2:4.
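
To make the Wanda criterion concrete, here is a small PyTorch sketch of the per-output scoring and pruning step for a single linear layer, following the published description of the method; it is a simplified illustration under stated assumptions, not code taken from the repository.

import torch

def wanda_prune_linear(weight, activations, sparsity=0.5):
    # weight: (out_features, in_features); activations: (num_tokens, in_features) calibration inputs
    act_norm = activations.norm(p=2, dim=0)              # per input-feature L2 norm of activations
    scores = weight.abs() * act_norm.unsqueeze(0)        # score_ij = |W_ij| * ||X_j||_2
    k = int(weight.shape[1] * sparsity)                  # weights to remove in each output row
    prune_idx = torch.topk(scores, k, dim=1, largest=False).indices
    pruned = weight.clone()
    pruned.scatter_(1, prune_idx, 0.0)                   # zero out the lowest-scoring weights per row
    return pruned

W = torch.randn(8, 16)
X = torch.randn(128, 16)
print((wanda_prune_linear(W, X) == 0).float().mean())    # approximately 0.5 sparsity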

Sheared Llama

Sheared LLaMA is a structured pruning method that learns masks for layers, attention heads and hidden dimensions and then retrains the pruned architecture. The original release provides models at multiple scales including 2.7B and 1.3B.

The pruning/llmshearing directory in LLM-Pruning Collection integrates this recipe. It uses a RedPajama subset for calibration, accessed through Hugging Face, and helper scripts to convert between Hugging Face and MosaicML Composer formats.

LLM-Pruner

LLM-Pruner is a framework for structural pruning of large language models. It removes non critical coupled structures, such as attention heads or MLP channels, using gradient based importance scores and then recovers performance with a short LoRA tuning stage that uses about 50K samples. The collection includes LLM-Pruner under pruning/LLM-Pruner with scripts for LLaMA, LLaMA 2 and Llama 3.1 8B.

Key Takeaways

LLM-Pruning Collection is a JAX based, Apache-2.0 repo from zlab-princeton that unifies modern LLM pruning methods with shared pruning, training and evaluation pipelines for GPUs and TPUs.

The codebase implements block, layer and weight level pruning approaches, including Minitron, ShortGPT, Wanda, SparseGPT, Sheared LLaMA, Magnitude pruning and LLM-Pruner, with method specific scripts for Llama family models.

Training integrates FMS-FSDP on GPU and MaxText on TPU with JAX compatible evaluation scripts built on lm-eval-harness, giving roughly 2 to 4 times faster eval for MaxText checkpoints via accelerate.

The repository reproduces key results from prior pruning work, publishing side by side “paper vs reproduced” tables for methods like Wanda, SparseGPT, Sheared LLaMA and LLM-Pruner so engineers can verify their runs against known baselines.

Check out the GitHub Repo.
The post LLM-Pruning Collection: A JAX Based Repo For Structured And Unstructured LLM Compression appeared first on MarkTechPost.

A Coding Guide to Design and Orchestrate Advanced ReAct-Based Multi-Agent Workflows with AgentScope and OpenAI

In this tutorial, we build an advanced multi-agent incident response system using AgentScope. We orchestrate multiple ReAct agents, each with a clearly defined role such as routing, triage, analysis, writing, and review, and connect them through structured routing and a shared message hub. By integrating OpenAI models, lightweight tool calling, and a simple internal runbook, we demonstrate how complex, real-world agentic workflows can be composed in pure Python without heavy infrastructure or brittle glue code. Check out the FULL CODES here.

!pip -q install "agentscope>=0.1.5" pydantic nest_asyncio

import os, json, re
from getpass import getpass
from typing import Literal
from pydantic import BaseModel, Field
import nest_asyncio
nest_asyncio.apply()

from agentscope.agent import ReActAgent
from agentscope.message import Msg, TextBlock
from agentscope.model import OpenAIChatModel
from agentscope.formatter import OpenAIChatFormatter
from agentscope.memory import InMemoryMemory
from agentscope.tool import Toolkit, ToolResponse, execute_python_code
from agentscope.pipeline import MsgHub, sequential_pipeline

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY (hidden): ")

OPENAI_MODEL = os.environ.get("OPENAI_MODEL", "gpt-4o-mini")

We set up the execution environment and install all required dependencies so the tutorial runs reliably on Google Colab. We securely load the OpenAI API key and initialize the core AgentScope components that will be shared across all agents. Check out the FULL CODES here.

RUNBOOK = [
    {"id": "P0", "title": "Severity Policy", "text": "P0 critical outage, P1 major degradation, P2 minor issue"},
    {"id": "IR1", "title": "Incident Triage Checklist", "text": "Assess blast radius, timeline, deployments, errors, mitigation"},
    {"id": "SEC7", "title": "Phishing Escalation", "text": "Disable account, reset sessions, block sender, preserve evidence"},
]

def _score(q, d):
    # Fraction of document tokens that also appear in the query
    q = set(re.findall(r"[a-z0-9]+", q.lower()))
    d = re.findall(r"[a-z0-9]+", d.lower())
    return sum(1 for w in d if w in q) / max(1, len(d))

async def search_runbook(query: str, top_k: int = 2) -> ToolResponse:
    ranked = sorted(RUNBOOK, key=lambda r: _score(query, r["title"] + r["text"]), reverse=True)[: max(1, int(top_k))]
    text = "\n\n".join(f"[{r['id']}] {r['title']}\n{r['text']}" for r in ranked)
    return ToolResponse(content=[TextBlock(type="text", text=text)])

toolkit = Toolkit()
toolkit.register_tool_function(search_runbook)
toolkit.register_tool_function(execute_python_code)

We define a lightweight internal runbook and implement a simple relevance-based search tool over it. We register this function along with a Python execution tool, enabling agents to retrieve policy knowledge or compute results dynamically. It demonstrates how we augment agents with external capabilities beyond pure language reasoning. Check out the FULL CODES here.

def make_model():
    return OpenAIChatModel(
        model_name=OPENAI_MODEL,
        api_key=os.environ["OPENAI_API_KEY"],
        generate_kwargs={"temperature": 0.2},
    )

class Route(BaseModel):
    lane: Literal["triage", "analysis", "report", "unknown"] = Field(...)
    goal: str = Field(...)

router = ReActAgent(
    name="Router",
    sys_prompt="Route the request to triage, analysis, or report and output structured JSON only.",
    model=make_model(),
    formatter=OpenAIChatFormatter(),
    memory=InMemoryMemory(),
)

triager = ReActAgent(
    name="Triager",
    sys_prompt="Classify severity and immediate actions using runbook search when useful.",
    model=make_model(),
    formatter=OpenAIChatFormatter(),
    memory=InMemoryMemory(),
    toolkit=toolkit,
)

analyst = ReActAgent(
    name="Analyst",
    sys_prompt="Analyze logs and compute summaries using python tool when helpful.",
    model=make_model(),
    formatter=OpenAIChatFormatter(),
    memory=InMemoryMemory(),
    toolkit=toolkit,
)

writer = ReActAgent(
    name="Writer",
    sys_prompt="Write a concise incident report with clear structure.",
    model=make_model(),
    formatter=OpenAIChatFormatter(),
    memory=InMemoryMemory(),
)

reviewer = ReActAgent(
    name="Reviewer",
    sys_prompt="Critique and improve the report with concrete fixes.",
    model=make_model(),
    formatter=OpenAIChatFormatter(),
    memory=InMemoryMemory(),
)

We construct multiple specialized ReAct agents and a structured router that decides how each user request should be handled. We assign clear responsibilities to the triage, analysis, writing, and review agents, ensuring separation of concerns. Check out the FULL CODES here.

LOGS = """timestamp,service,status,latency_ms,error
2025-12-18T12:00:00Z,checkout,200,180,false
2025-12-18T12:00:05Z,checkout,500,900,true
2025-12-18T12:00:10Z,auth,200,120,false
2025-12-18T12:00:12Z,checkout,502,1100,true
2025-12-18T12:00:20Z,search,200,140,false
2025-12-18T12:00:25Z,checkout,500,950,true
"""

def msg_text(m: Msg) -> str:
    # Normalize an agent reply into plain text regardless of block format
    blocks = m.get_content_blocks("text")
    if blocks is None:
        return ""
    if isinstance(blocks, str):
        return blocks
    if isinstance(blocks, list):
        return "\n".join(str(x) for x in blocks)
    return str(blocks)

We introduce sample log data and a utility function that normalizes agent outputs into clean text. We ensure that downstream agents can safely consume and refine earlier responses without format issues. It focuses on making inter-agent communication robust and predictable. Check out the FULL CODES here.

async def run_demo(user_request: str):
    route_msg = await router(Msg("user", user_request, "user"), structured_model=Route)
    lane = (route_msg.metadata or {}).get("lane", "unknown")

    if lane == "triage":
        first = await triager(Msg("user", user_request, "user"))
    elif lane == "analysis":
        first = await analyst(Msg("user", user_request + "\n\nLogs:\n" + LOGS, "user"))
    elif lane == "report":
        draft = await writer(Msg("user", user_request, "user"))
        first = await reviewer(Msg("user", "Review and improve:\n\n" + msg_text(draft), "user"))
    else:
        first = Msg("system", "Could not route request.", "system")

    async with MsgHub(
        participants=[triager, analyst, writer, reviewer],
        announcement=Msg("Host", "Refine the final answer collaboratively.", "assistant"),
    ):
        await sequential_pipeline([triager, analyst, writer, reviewer])

    return {"route": route_msg.metadata, "initial_output": msg_text(first)}

result = await run_demo(
    "We see repeated 5xx errors in checkout. Classify severity, analyze logs, and produce an incident report."
)
print(json.dumps(result, indent=2))

We orchestrate the full workflow by routing the request, executing the appropriate agent, and running a collaborative refinement loop using a message hub. We coordinate multiple agents in sequence to improve the final output before returning it to the user. It brings together all earlier components into a cohesive, end-to-end agentic pipeline.

In conclusion, we showed how AgentScope enables us to design robust, modular, and collaborative agent systems that go beyond single-prompt interactions. We routed tasks dynamically, invoked tools only when needed, and refined outputs through multi-agent coordination, all within a clean and reproducible Colab setup. This pattern illustrates how we can scale from simple agent experiments to production-style reasoning pipelines while maintaining clarity, control, and extensibility in our agentic AI applications.

Check out the FULL CODES here.
The post A Coding Guide to Design and Orchestrate Advanced ReAct-Based Multi-Agent Workflows with AgentScope and OpenAI appeared first on MarkTechPost.

Tencent Researchers Release Tencent HY-MT1.5: New Translation Models Featuring 1.8B and 7B Models Designed for Seamless on-Device and Cloud Deployment

Tencent Hunyuan researchers have released HY-MT1.5, a multilingual machine translation family that targets both mobile devices and cloud systems with the same training recipe and metrics. HY-MT1.5 consists of 2 translation models, HY-MT1.5-1.8B and HY-MT1.5-7B, supports mutual translation across 33 languages with 5 ethnic and dialect variations, and is available on GitHub and Hugging Face under open weights.

Model family and deployment targets

HY-MT1.5-7B is an upgraded version of the WMT25 championship system Hunyuan-MT-7B. It is optimized for explanatory translation and mixed language scenarios, and adds native support for terminology intervention, contextual translation and formatted translation.

HY-MT1.5-1.8B is the compact variant. It has less than one third the parameters of HY-MT1.5-7B but delivers comparable translation performance in the reported benchmarks. After quantization, the 1.8B model can run on edge devices and support real time translation.

The quantized HY-MT1.5-1.8B operates on devices with about 1 GB of memory and reaches an average response time of about 0.18 seconds for Chinese inputs of around 50 tokens, while surpassing mainstream commercial translation APIs in quality. HY-MT1.5-7B targets server and high end edge deployment, where latency around 0.45 seconds is acceptable in exchange for higher quality.

Holistic training framework

The research team defines HY-MT1.5 as a translation specific language model trained with a multi stage pipeline.

The pipeline has 5 main components:

General pre training: The base model is first pre-trained on large scale multilingual text with a language modeling objective. This builds shared representations across languages.

MT oriented pre training: The model is then exposed to parallel corpora and translation oriented objectives. This step aligns the generation distribution with real translation tasks rather than open ended text generation.

Supervised fine tuning: High quality sentence and document level parallel data is used to fine tune the model with supervised loss. This stage sharpens literal correctness, domain coverage and direction specific behavior, such as ZH to EN versus EN to ZH.

On policy distillation from 7B to 1.8B: HY-MT1.5-7B is used as a teacher for HY-MT1.5-1.8B. The research team collects about 1 million monolingual prompts across the 33 languages, runs them through the teacher and uses reverse Kullback Leibler divergence on the student rollouts to match the teacher distribution. This yields a 1.8B student that inherits most of the 7B model’s translation behavior with much lower cost.

Reinforcement learning with rubrics based evaluation: In the final stage, both models are optimized with a group relative policy optimization style algorithm and a rubrics based reward model. Human reviewers score translations on multiple axes such as accuracy, fluency, idiomaticity and cultural appropriateness. The reward model distills those scores and guides the policy update.

This pipeline is specific to machine translation. It differs from chat oriented LLM training by combining translation centric supervised data, on policy distillation within the translation domain and RL tuned with fine grained translation rubrics.
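To make the on policy distillation stage described above more concrete, the toy sketch below computes a reverse KL objective between student and teacher distributions on tokens that the student itself generated; the synthetic logits, shapes and helper names are assumptions, not Tencent's training code.

# Toy reverse KL, KL(student || teacher), scored on the student's own rollout tokens.
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def reverse_kl(student_logits, teacher_logits):
    # student_logits, teacher_logits: (seq_len, vocab) evaluated on the student rollout
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return float(np.mean(np.sum(p_s * (np.log(p_s + 1e-9) - np.log(p_t + 1e-9)), axis=-1)))

rng = np.random.default_rng(0)
student = rng.normal(size=(10, 32))
teacher = student + 0.1 * rng.normal(size=(10, 32))  # teacher close to student -> small loss
print(reverse_kl(student, teacher))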

Benchmark results against open and commercial systems

HY-MT1.5 is evaluated on Flores 200, WMT25 and a Mandarin to minority language benchmark using XCOMET-XXL and CometKiwi.

Source: https://arxiv.org/pdf/2512.24092v1

Key results from the benchmark table in the report:

On Flores 200, HY-MT1.5-7B reaches XCOMET-XXL scores of 0.8690 for ZH to XX, 0.9093 for EN to XX and 0.8098 for XX to XX. It outperforms translation specialized models such as iFLYTEK Translator and Doubao Translator and matches or exceeds medium sized general models like Qwen3-235B-A22B.

On WMT25, HY-MT1.5-7B reaches XCOMET-XXL 0.6159. This is about 0.065 higher than Gemini 3.0 Pro and significantly above translation oriented models such as Seed-X-PPO-7B and Tower-Plus-72B. HY-MT1.5-1.8B scores 0.5308, which still exceeds many medium sized general models and translation systems.

On Mandarin to minority language pairs, HY-MT1.5-7B achieves 0.6174 in XCOMET-XXL, higher than all baselines including Gemini 3.0 Pro. The 1.8B variant reaches 0.5806 and still surpasses several very large models like DeepSeek-V3.2.

In human evaluation on a 0 to 4 scale for Chinese to English and English to Chinese, HY-MT1.5-1.8B achieves an average score of 2.74, which is higher than Baidu, iFLYTEK, Doubao, Microsoft and Google translator systems under the same protocol.

Practical features for product use

The models expose three prompt driven capabilities that matter in production systems:

Terminology intervention: A prompt template lets you inject term mappings such as “混元珠 → Chaos Pearl”. Without the mapping, the model outputs an ambiguous transliteration. With the mapping, it enforces a consistent domain specific term. This is critical for legal, medical or brand constrained content.

Context aware translation: A second template accepts a context block plus the sentence to translate. The report shows the word “pilot” misinterpreted as a person when context is absent. When a paragraph about TV series is added, the model correctly translates “pilot” as an episode.

Format preserving translation: A third template wraps the source in <source> tags and marks spans with <sn> tags. The instruction forces the model to keep tags and output inside <target> tags. This allows HTML or XML like text to survive translation with structure preserved.

These are implemented as prompt formats, so they are available even when you call the public weights through standard LLM stacks.
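As a purely hypothetical illustration of terminology intervention, the helper below injects term mappings into a translation prompt; the template wording and function name are assumptions for this article, not the official HY-MT1.5 prompt format.

# Hypothetical prompt construction for terminology intervention (illustrative only).
def build_terminology_prompt(source_text: str, term_map: dict, target_lang: str = "English") -> str:
    terms = "\n".join(f"- {src} -> {tgt}" for src, tgt in term_map.items())
    return (
        f"Translate the following text into {target_lang}.\n"
        f"Use these term mappings consistently:\n{terms}\n\n"
        f"Text:\n{source_text}"
    )

print(build_terminology_prompt("哪吒从混元珠中诞生。", {"混元珠": "Chaos Pearl"}))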

Quantization and edge deployment

HY-MT1.5-1.8B is evaluated with FP8 and Int4 post training quantization using GPTQ.

Source: https://arxiv.org/pdf/2512.24092v1

Table 4 in the report shows:

FP8 keeps XCOMET-XXL scores very close to the full precision model, for example 0.8379 versus 0.8361 for ZH to XX.

Int4 reduces size further but introduces clear quality drops on Flores 200.

On Hugging Face, Tencent publishes both FP8 and GPTQ Int4 variants for HY-MT1.5-1.8B and HY-MT1.5-7B, along with GGUF versions for local inference stacks. Quantization is the mechanism that enables the reported 1 GB memory deployment and low latency on consumer hardware.
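As a hedged sketch of server side use through a standard LLM stack, the snippet below loads a quantized checkpoint with Hugging Face transformers; the repository id is a placeholder and Int4 loading additionally requires a GPTQ backend, so check the official Hunyuan collection for the exact names and requirements.

# Placeholder repository id, verify the exact name on Hugging Face before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/HY-MT1.5-1.8B-GPTQ-Int4"  # assumption, not a verified repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Translate the following text into English.\n\nText:\n今天的会议改到下午三点。"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))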

Key Takeaways

HY-MT1.5 is a 2 model translation family, HY-MT1.5-1.8B and HY-MT1.5-7B, supporting mutual translation across 33 languages plus 5 dialect or variant forms, released with open weights on GitHub and Hugging Face.

HY-MT1.5-1.8B is a distillation based edge model that runs on about 1 GB memory with around 0.18 seconds latency for 50 token Chinese inputs, while achieving industry leading performance among models of similar size and surpassing most commercial translation APIs.

HY-MT1.5-7B is an upgraded WMT25 champion system that reaches roughly 95 percent of Gemini 3.0 Pro on Flores 200 and surpasses it on WMT25 and Mandarin minority benchmarks, competing with much larger open and closed models.

Both models are trained with a holistic translation specific pipeline that combines general and MT oriented pre training, supervised fine tuning, on policy distillation and reinforcement learning guided by rubric based human evaluation, which is critical to their quality and efficiency trade off.

HY-MT1.5 exposes production oriented features through prompts, including terminology intervention, context aware translation and format preserving translation, and ships FP8, Int4 and GGUF variants so teams can deploy on devices or servers with standard LLM stacks.

Check out the Paper, Model Weights on HF and GitHub Repo.
The post Tencent Researchers Release Tencent HY-MT1.5: A New Translation Models Featuring 1.8B and 7B Models Designed for Seamless on-Device and Cloud Deployment appeared first on MarkTechPost.

DeepSeek Researchers Apply a 1967 Matrix Normalization Algorithm to Fix Instability in Hyper Connections

DeepSeek researchers are tackling a specific stability issue in large language model training. Residual connections made very deep networks trainable, hyper connections widened that residual stream, and training then became unstable at scale. The new method, mHC, Manifold Constrained Hyper Connections, keeps the richer topology of hyper connections but locks the mixing behavior on a well defined manifold so that signals remain numerically stable in very deep stacks.

Source: https://www.arxiv.org/pdf/2512.24880

From Residual Connections To Hyper Connections

Standard residual connections, as in ResNets and Transformers, propagate activations with x_{l+1} = x_l + F(x_l, W_l). The identity path preserves magnitude and keeps gradients usable even when you stack many layers.

Hyper Connections generalize this structure. Instead of a single residual vector of size C, the model keeps an n stream buffer x_l ∈ R^{n×C}. Three learned mappings control how each layer reads and writes this buffer:

H_l^{pre} selects a mixture of streams as the layer input

F is the usual attention or feed forward sublayer

H_l^{post} writes results back into the n stream buffer

H_l^{res} ∈ R^{n×n} mixes streams between layers

The update has the form x_{l+1} = H_l^{res} x_l + (H_l^{post})^⊤ F(H_l^{pre} x_l, W_l).

With n set to 4, this design increases expressivity without a large increase in floating point cost, which is why hyper connections improve downstream performance in language models.
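A simplified NumPy sketch of this update, with a single read and write vector and a random map standing in for the attention or feed forward sublayer, looks like the following; it only illustrates the wiring, not the paper's parameterization.

# Hyper connection update with an n=4 stream residual buffer (illustrative wiring only).
import numpy as np

n, C = 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(n, C))          # n-stream residual buffer x_l
H_pre = rng.normal(size=(1, n))      # reads a mixture of streams as the layer input
H_post = rng.normal(size=(1, n))     # writes the layer output back into the streams
H_res = rng.normal(size=(n, n))      # mixes streams between layers
W = rng.normal(size=(C, C))

def F(h):                            # stand-in for the attention / MLP sublayer
    return np.tanh(h @ W)

x_next = H_res @ x + H_post.T @ F(H_pre @ x)   # x_{l+1} = H_res x_l + H_post^T F(H_pre x_l)
print(x_next.shape)                  # (4, 8), the buffer keeps its shape across layers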

Why Hyper Connections Become Unstable

The problem appears when you look at the product of residual mixers across many layers. In a 27B mixture of experts model, DeepSeek studies the composite mapping H_L^{res} H_{L-1}^{res} ⋯ H_1^{res}, the product of the per layer residual mixing matrices, and defines an Amax Gain Magnitude based on maximum row and column sums. This metric measures worst case amplification in the forward and backward signal paths. In the hyper connection model, this gain reaches peaks around 3000, far from the ideal value of 1 that you expect from a stable residual path.

This means small per layer deviations compound into very large amplification factors across depth. Training logs show loss spikes and unstable gradient norms relative to a baseline residual model. At the same time, keeping a multi stream buffer increases memory traffic for each token, which makes naive scaling of hyper connections unattractive for production large language models.
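One plausible reading of this gain metric, used here only for intuition, is the larger of the maximum absolute row sum and column sum of the composite mixing matrix; the sketch below shows how small per layer deviations from the identity compound across depth.

# Rough gain metric over a product of residual mixers (one interpretation, not the paper's exact definition).
import numpy as np

def amax_gain(matrices):
    # Composite residual mixing across layers: A = H_L ... H_2 H_1
    A = np.eye(matrices[0].shape[0])
    for H in matrices:
        A = H @ A
    row = np.abs(A).sum(axis=1).max()   # worst case forward amplification
    col = np.abs(A).sum(axis=0).max()   # worst case backward amplification
    return max(row, col)

rng = np.random.default_rng(0)
unconstrained = [np.eye(4) + 0.2 * rng.normal(size=(4, 4)) for _ in range(60)]
print(amax_gain(unconstrained))         # typically drifts well above 1 as deviations compound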

Manifold Constrained Hyper Connections

mHC keeps the multi stream residual idea but constrains the dangerous part. The residual mixing matrix H_l^{res} no longer lives in the full n by n space. Instead, it is projected onto the manifold of doubly stochastic matrices, also called the Birkhoff polytope. In that set, all entries are non negative and each row and each column sums to 1.

The DeepSeek team enforces this constraint with the classical Sinkhorn Knopp algorithm from 1967, which alternates row and column normalizations to approximate a doubly stochastic matrix. The research team uses 20 iterations per layer during training, which is enough to keep the mapping close to the target manifold while keeping cost manageable.
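A compact sketch of the projection step, assuming a plain Sinkhorn Knopp loop over exponentiated logits, is shown below; mHC's exact parameterization and fused kernels differ, but the row and column normalization pattern is the same.

# Sinkhorn-Knopp projection onto (approximately) doubly stochastic matrices.
import numpy as np

def sinkhorn_knopp(logits: np.ndarray, num_iters: int = 20) -> np.ndarray:
    M = np.exp(logits)                         # positive entries
    for _ in range(num_iters):
        M = M / M.sum(axis=1, keepdims=True)   # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)   # columns sum to 1
    return M

rng = np.random.default_rng(0)
H = sinkhorn_knopp(rng.normal(size=(4, 4)), num_iters=20)
print(H.sum(axis=1), H.sum(axis=0))            # both close to all-ones after 20 iterations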

Under these constraints, H_l^{res} x_l behaves like a convex combination of residual streams. Total feature mass is preserved and the norm is tightly regularized, which eliminates the explosive growth seen in plain hyper connections. The research team also parameterizes the input and output mappings so that coefficients are non negative, which avoids cancellation between streams and keeps the interpretation as averaging clear.

With mHC the composite Amax Gain Magnitude stays bounded and peaks at about 1.6 in the 27B model, compared with peaks near 3000 for the unconstrained variant. That is a reduction of about 3 orders of magnitude in worst case amplification, and it comes from a direct mathematical constraint rather than tuned tricks.

Systems Work And Training Overhead

Constraining every residual mixer with Sinkhorn style iterations adds cost on paper. The research team addresses this with several systems choices:

Fused kernels combine RMSNorm, projections and gating for the mHC mappings so that memory traffic stays low

Recompute based activation checkpointing trades compute for memory by recomputing mHC activations during backprop for blocks of layers

Integration with a DualPipe like pipeline schedule overlaps communication and recomputation, so that additional work does not stall the training pipeline

In large scale in house training runs, mHC with expansion rate n equal to 4 adds about 6.7 percent training time overhead relative to the baseline architecture. That figure already includes both the extra compute from Sinkhorn Knopp and the infrastructure optimizations.

Source: https://www.arxiv.org/pdf/2512.24880

Empirical Results

The research team trains 3B, 9B and 27B mixture of experts models and evaluates them on a standard language model benchmark suite, including tasks like BBH, DROP, GSM8K, HellaSwag, MMLU, PIQA and TriviaQA.

For the 27B model, the reported numbers on a subset of tasks show the pattern clearly:

Baseline: BBH 43.8, DROP F1 47.0

With hyper connections: BBH 48.9, DROP 51.6

With mHC: BBH 51.0, DROP 53.9

So hyper connections already provide a gain over the basic residual design, and manifold constrained hyper connections push performance further while restoring stability. Similar trends appear on other benchmarks and across model sizes, and scaling curves suggest that the advantage persists across compute budgets and through the full training trajectory rather than only at convergence.

Key Takeaways

mHC stabilizes widened residual streams: mHC, Manifold Constrained Hyper Connections, widens the residual pathway into 4 interacting streams like HC, but constrains the residual mixing matrices on a manifold of doubly stochastic matrices, so long range propagation remains norm controlled instead of exploding.

Exploding gain is reduced from ≈3000 to ≈1.6: For a 27B MoE model, the Amax Gain Magnitude of the composite residual mapping peaks near 3000 for unconstrained HC, while mHC keeps this metric bounded around 1.6, which removes the exploding residual stream behavior that previously broke training.

Sinkhorn Knopp enforces doubly stochastic residual mixing: Each residual mixing matrix is projected with about 20 Sinkhorn Knopp iterations so that rows and columns both sum to 1, making the mapping a convex combination of permutations, which restores an identity like behavior while still allowing rich cross stream communication.

Small training overhead, measurable downstream gains: Across 3B, 9B and 27B DeepSeek MoE models, mHC improves benchmark accuracy, for example about plus 2.1 percent on BBH for the 27B model, while adding only about 6.7 percent training time overhead through fused kernels, recompute and pipeline aware scheduling.

Introduces a new scaling axis for LLM design: Instead of only scaling parameters or context length, mHC shows that explicitly designing the topology and manifold constraints of the residual stream, for example residual width and structure, is a practical way to unlock better performance and stability in future large language models.

Check out the FULL PAPER here.
The post DeepSeek Researchers Apply a 1967 Matrix Normalization Algorithm to Fix Instability in Hyper Connections appeared first on MarkTechPost.

How to Build a Production-Ready Multi-Agent Incident Response System Using OpenAI Swarm and Tool-Augmented Agents

In this tutorial, we build an advanced yet practical multi-agent system using OpenAI Swarm that runs in Colab. We demonstrate how we can orchestrate specialized agents, such as a triage agent, an SRE agent, a communications agent, and a critic, to collaboratively handle a real-world production incident scenario. By structuring agent handoffs, integrating lightweight tools for knowledge retrieval and decision ranking, and keeping the implementation clean and modular, we show how Swarm enables us to design controllable, agentic workflows without heavy frameworks or complex infrastructure. Check out the FULL CODES HERE.

!pip -q install -U openai
!pip -q install -U "git+https://github.com/openai/swarm.git"

import os

def load_openai_key():
    try:
        from google.colab import userdata
        key = userdata.get("OPENAI_API_KEY")
    except Exception:
        key = None
    if not key:
        import getpass
        key = getpass.getpass("Enter OPENAI_API_KEY (hidden): ").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY not provided")
    return key

os.environ["OPENAI_API_KEY"] = load_openai_key()

We set up the environment and securely load the OpenAI API key so the notebook can run safely in Google Colab. We ensure the key is fetched from Colab secrets when available and fall back to a hidden prompt otherwise. This keeps authentication simple and reusable across sessions. Check out the FULL CODES HERE.

import json
import re
from typing import List, Dict
from swarm import Swarm, Agent

client = Swarm()

We import the core Python utilities and initialize the Swarm client that orchestrates all agent interactions. This snippet establishes the runtime backbone that allows agents to communicate, hand off tasks, and execute tool calls. It serves as the entry point for the multi-agent workflow. Check out the FULL CODES HERE.

KB_DOCS = [
    {
        "id": "kb-incident-001",
        "title": "API Latency Incident Playbook",
        "text": "If p95 latency spikes, validate deploys, dependencies, and error rates. Rollback, cache, rate-limit, scale. Compare p50 vs p99 and inspect upstream timeouts."
    },
    {
        "id": "kb-risk-001",
        "title": "Risk Communication Guidelines",
        "text": "Updates must include impact, scope, mitigation, owner, and next update. Avoid blame and separate internal vs external messaging."
    },
    {
        "id": "kb-ops-001",
        "title": "On-call Handoff Template",
        "text": "Include summary, timeline, current status, mitigations, open questions, next actions, and owners."
    },
]

def _normalize(s: str) -> List[str]:
    return re.sub(r"[^a-z0-9\s]", " ", s.lower()).split()

def search_kb(query: str, top_k: int = 3) -> str:
    # Rank documents by token overlap with the query; fall back to the top doc if nothing matches
    q = set(_normalize(query))
    scored = []
    for d in KB_DOCS:
        score = len(q.intersection(set(_normalize(d["title"] + " " + d["text"]))))
        scored.append((score, d))
    scored.sort(key=lambda x: x[0], reverse=True)
    docs = [d for s, d in scored[:top_k] if s > 0] or [scored[0][1]]
    return json.dumps(docs, indent=2)

We define a lightweight internal knowledge base and implement a retrieval function to surface relevant context during agent reasoning. By using simple token-based matching, we allow agents to ground their responses in predefined operational documents. This demonstrates how Swarm can be augmented with domain-specific memory without external dependencies. Check out the FULL CODES HERE.

def estimate_mitigation_impact(options_json: str) -> str:
    try:
        options = json.loads(options_json)
    except Exception as e:
        return json.dumps({"error": str(e)})
    ranking = []
    for o in options:
        conf = float(o.get("confidence", 0.5))
        risk = o.get("risk", "medium")
        penalty = {"low": 0.1, "medium": 0.25, "high": 0.45}.get(risk, 0.25)
        ranking.append({
            "option": o.get("option"),
            "confidence": conf,
            "risk": risk,
            "score": round(conf - penalty, 3)
        })
    ranking.sort(key=lambda x: x["score"], reverse=True)
    return json.dumps(ranking, indent=2)

We introduce a structured tool that evaluates and ranks mitigation strategies based on confidence and risk. This allows agents to move beyond free-form reasoning and produce semi-quantitative decisions. We show how tools can enforce consistency and decision discipline in agent outputs. Check out the FULL CODES HERE.

def handoff_to_sre():
    return sre_agent

def handoff_to_comms():
    return comms_agent

def handoff_to_handoff_writer():
    return handoff_writer_agent

def handoff_to_critic():
    return critic_agent

We define explicit handoff functions that enable one agent to transfer control to another. This snippet illustrates how we model delegation and specialization within Swarm. It makes agent-to-agent routing transparent and easy to extend. Check out the FULL CODES HERE.

triage_agent = Agent(
    name="Triage",
    model="gpt-4o-mini",
    instructions="""
    Decide which agent should handle the request.
    Use SRE for incident response.
    Use Comms for customer or executive messaging.
    Use HandoffWriter for on-call notes.
    Use Critic for review or improvement.
    """,
    functions=[search_kb, handoff_to_sre, handoff_to_comms, handoff_to_handoff_writer, handoff_to_critic]
)

sre_agent = Agent(
    name="SRE",
    model="gpt-4o-mini",
    instructions="""
    Produce a structured incident response with triage steps,
    ranked mitigations, ranked hypotheses, and a 30-minute plan.
    """,
    functions=[search_kb, estimate_mitigation_impact]
)

comms_agent = Agent(
    name="Comms",
    model="gpt-4o-mini",
    instructions="""
    Produce an external customer update and an internal technical update.
    """,
    functions=[search_kb]
)

handoff_writer_agent = Agent(
    name="HandoffWriter",
    model="gpt-4o-mini",
    instructions="""
    Produce a clean on-call handoff document with standard headings.
    """,
    functions=[search_kb]
)

critic_agent = Agent(
    name="Critic",
    model="gpt-4o-mini",
    instructions="""
    Critique the previous answer, then produce a refined final version and a checklist.
    """
)

We configure multiple specialized agents, each with a clearly scoped responsibility and instruction set. By separating triage, incident response, communications, handoff writing, and critique, we demonstrate a clean division of labor. Check out the FULL CODES HERE.

def run_pipeline(user_request: str):
    messages = [{"role": "user", "content": user_request}]
    r1 = client.run(agent=triage_agent, messages=messages, max_turns=8)
    messages2 = r1.messages + [{"role": "user", "content": "Review and improve the last answer"}]
    r2 = client.run(agent=critic_agent, messages=messages2, max_turns=4)
    return r2.messages[-1]["content"]

request = """
Production p95 latency jumped from 250ms to 2.5s after a deploy.
Errors slightly increased, DB CPU stable, upstream timeouts rising.
Provide a 30-minute action plan and a customer update.
"""

print(run_pipeline(request))

We assemble the full orchestration pipeline that executes triage, specialist reasoning, and critical refinement in sequence. This snippet shows how we run the end-to-end workflow with a single function call. It ties together all agents and tools into a coherent, production-style agentic system.

In conclusion, we established a clear pattern for designing agent-oriented systems with OpenAI Swarm that emphasizes clarity, separation of responsibilities, and iterative refinement. We showed how to route tasks intelligently, enrich agent reasoning with local tools, and improve output quality via a critic loop, all while maintaining a simple, Colab-friendly setup. This approach allows us to scale from experimentation to real operational use cases, making Swarm a powerful foundation for building reliable, production-grade agentic AI workflows.

Check out the FULL CODES HERE.
The post How to Build a Production-Ready Multi-Agent Incident Response System Using OpenAI Swarm and Tool-Augmented Agents appeared first on MarkTechPost.

Recursive Language Models (RLMs): From MIT’s Blueprint to Prime Intellect’s RLMEnv for Long Horizon LLM Agents

Recursive Language Models aim to break the usual trade off between context length, accuracy and cost in large language models. Instead of forcing a model to read a giant prompt in one pass, RLMs treat the prompt as an external environment and let the model decide how to inspect it with code, then recursively call itself on smaller pieces.

Source: https://arxiv.org/pdf/2512.24601

The Basics

The full input is loaded into a Python REPL as a single string variable. The root model, for example GPT-5, never sees that string directly in its context. Instead, it receives a system prompt that explains how to read slices of the variable, write helper functions, spawn sub LLM calls, and combine results. The model returns a final text answer, so the external interface stays identical to a standard chat completion endpoint.

The RLM design uses the REPL as a control plane for long context. The environment, usually written in Python, exposes tools such as string slicing, regex search and helper functions like llm_query that call a smaller model instance, for example GPT-5-mini. The root model writes code that calls these helpers to scan, partition and summarize the external context variable. The code can store intermediate results in variables and build up the final answer step by step. This structure makes the prompt size independent from the model context window and turns long context handling into a program synthesis problem.
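The following is a simplified sketch of that control loop, assuming an OpenAI style chat API and the model names mentioned in the paper; the real RLM implementation adds recursion limits, safety checks and richer REPL tooling.

# Minimal RLM-style loop: the long prompt lives in a REPL variable, not in the root context.
import io, contextlib
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def llm_query(prompt: str) -> str:
    # Sub-call used by the root model on small slices of the context
    r = client.chat.completions.create(model="gpt-5-mini", messages=[{"role": "user", "content": prompt}])
    return r.choices[0].message.content

def run_rlm(context: str, question: str, max_steps: int = 8) -> str:
    env = {"context": context, "llm_query": llm_query}   # REPL state exposed to the model's code
    messages = [
        {"role": "system", "content": "You control a Python REPL with a variable `context` (a long string) "
         "and a helper `llm_query(prompt)`. Reply with Python code to inspect `context`, or with "
         "FINAL: <answer> when done."},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        reply = client.chat.completions.create(model="gpt-5", messages=messages).choices[0].message.content
        if reply.strip().startswith("FINAL:"):
            return reply.split("FINAL:", 1)[1].strip()
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(reply, env)                              # run the model's code against the REPL state
        messages += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": "Output:\n" + buf.getvalue()[:4000]}]
    return "No answer produced"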

Source: https://arxiv.org/pdf/2512.24601

Where It Stands in Evaluation

The research paper evaluates this idea on four long context benchmarks with different computational structure. S-NIAH is a constant complexity needle in a haystack task. BrowseComp-Plus is a multi hop web style question answering benchmark over up to 1,000 documents. OOLONG is a linear complexity long context reasoning task where the model must transform many entries and then aggregate them. OOLONG Pairs increases the difficulty further with quadratic pairwise aggregation over the input. These tasks stress both context length and reasoning depth, not only retrieval.

On these benchmarks, RLMs give large accuracy gains over direct LLM calls and common long context agents. For GPT-5 on CodeQA, a long document question answering setup, the base model reaches 24.00 accuracy, a summarization agent reaches 41.33, while RLM reaches 62.00 and the RLM without recursion reaches 66.00. For Qwen3-Coder-480B-A35B, the base model scores 20.00, a CodeAct retrieval agent 52.00, and the RLM 56.00 with a REPL only variant at 44.66.

The gains are largest on the hardest setting, OOLONG Pairs. For GPT-5, the direct model is almost unusable with F1 equal to 0.04. Summarization and CodeAct agents sit near 0.01 and 24.67. The full RLM reaches 58.00 F1 and the non recursive REPL variant still achieves 43.93. For Qwen3-Coder, the base model stays below 0.10 F1, while the full RLM reaches 23.11 and the REPL only version 17.34. These numbers show that both the REPL and recursive sub calls are critical on dense quadratic tasks.

Source: https://arxiv.org/pdf/2512.24601

BrowseComp-Plus highlights effective context extension. The corpus ranges from about 6M to 11M tokens, which is 2 orders of magnitude beyond the 272k token context window of GPT-5. RLM with GPT 5 maintains strong performance even when given 1,000 documents in the environment variable, while standard GPT-5 baselines degrade as document count grows. On this benchmark, RLM GPT 5 achieves around 91.33 accuracy with an average cost of 0.99 USD per query, while a hypothetical model that reads the full context directly would cost between $1.50 and $2.75 at current pricing.

The research paper also analyzes the trajectories of RLM runs. Several behavior patterns emerge. The model often starts with a peek step where it inspects the first few thousand characters of the context. It then uses grep style filtering with regex or keyword search to narrow down relevant lines. For more complex queries, it partitions the context into chunks and calls recursive LMs on each chunk to perform labeling or extraction, followed by programmatic aggregation. On long output tasks, the RLM stores partial outputs in variables and stitches them together, which bypasses output length limits of the base model.

The new take from Prime Intellect

Prime Intellect team has turned this concept into a concrete environment, RLMEnv, integrated in their verifiers stack and Environments Hub. In their design, the main RLM has only a Python REPL, while sub LLMs receive the heavy tools such as web search or file access. The REPL exposes an llm_batch function so the root model can fan out many sub queries in parallel, and an answer variable where the final solution must be written and flagged as ready. This isolates token heavy tool outputs from the main context and lets the RLM delegate expensive operations to sub models.
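For illustration, the snippet below shows the kind of code a root model might write inside such a REPL; the llm_batch signature, the answer variable and the ready flag follow the description above, and stubs are included only so the sketch runs on its own.

# Illustrative root-model code in the style of an RLMEnv-like REPL; `context`, `llm_batch`
# and the `ready` flag are provided by the real environment, stubs here are assumptions.
context = "checkout latency spiked to 2.5s ... " * 1000          # stand-in for a huge prompt
def llm_batch(prompts):                                           # stub for the parallel sub-LLM helper
    return [f"(sub-LLM summary of {len(p)} chars)" for p in prompts]

chunks = [context[i:i + 20000] for i in range(0, len(context), 20000)]
prompts = [f"List every mention of 'checkout latency' in this text:\n\n{c}" for c in chunks]
partials = llm_batch(prompts)              # fan out many sub queries in parallel
answer = "\n".join(p for p in partials if p.strip())
ready = True                               # signal that the final solution is written to `answer`
print(answer[:200])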

Prime Intellect evaluates this implementation on four environments. DeepDive tests web research with search and open tools and very verbose pages. Math python exposes a Python REPL for difficult competition style math problems. Oolong reuses the long context benchmark inside RLMEnv. Verbatim copy focuses on exact reproduction of complex strings across content types such as JSON, CSV and mixed codes. Across these environments, GPT-5-mini and the INTELLECT-3-MoE model both gain from the RLM scaffold in success rate and in robustness to very long contexts, especially when tool output would otherwise swamp the model context.

The research paper’s author team and Prime Intellect team both stress that current implementations are not fully optimized. RLM calls are synchronous, recursion depth is limited and cost distributions have heavy tails due to very long trajectories. The real opportunity is to combine RLM scaffolding with dedicated reinforcement learning so that models learn better chunking, recursion and tool usage policies over time. If that happens, RLMs provide a framework where improvements in base models and in systems design convert directly into more capable long horizon agents that can consume 10M plus token environments without context rot.

Key Takeaways

RLMs reframe long context as an environment variable: Recursive Language Models treat the entire prompt as an external string in a Python style REPL, which the LLM inspects and transforms through code, instead of ingesting all tokens directly into the Transformer context.

Inference time recursion extends context to 10M plus tokens: RLMs let a root model recursively call sub LLMs on selected snippets of the context, which enables effective processing of prompts up to about 2 orders of magnitude longer than the base context window, reaching 10M plus tokens on BrowseComp-Plus style workloads.

RLMs outperform common long context scaffolds on hard benchmarks: Across S-NIAH, BrowseComp-Plus, OOLONG and OOLONG Pairs, RLM variants of GPT-5 and Qwen3-Coder improve accuracy and F1 over direct model calls, retrieval agents such as CodeAct, and summarization agents, while keeping per query cost comparable or lower.

REPL only variants already help, recursion is critical for quadratic tasks: An ablation that only exposes the REPL without recursive sub calls still boosts performance on some tasks, which shows the value of offloading context into the environment, but full RLMs are required to achieve large gains on information dense settings such as OOLONG Pairs.

Prime Intellect operationalizes RLMs through RLMEnv and INTELLECT 3: Prime Intellect team implements the RLM paradigm as RLMEnv, where the root LM controls a sandboxed Python REPL, calls tools via sub LMs and writes the final result to an answer variable, and reports consistent gains on DeepDive, math python, Oolong and verbatim copy environments with models such as INTELLECT-3.

Check out the Paper and Technical details.

The post Recursive Language Models (RLMs): From MIT’s Blueprint to Prime Intellect’s RLMEnv for Long Horizon LLM Agents appeared first on MarkTechPost.