Transforming network operations with AI: How Swisscom built a network …

In the telecommunications industry, managing complex network infrastructures requires processing vast amounts of data from multiple sources. Network engineers often spend considerable time manually gathering and analyzing this data, taking away valuable hours that could be spent on strategic initiatives. This challenge led Swisscom, Switzerland’s leading telecommunications provider, to explore how AI can transform their network operations.
Swisscom’s Network Assistant, built on Amazon Bedrock, represents a significant step forward in automating network operations. This solution combines generative AI capabilities with a sophisticated data processing pipeline to help engineers quickly access and analyze network data. Swisscom used AWS services to create a scalable solution that reduces manual effort and provides accurate and timely network insights.
In this post, we explore how Swisscom developed their Network Assistant. We discuss the initial challenges and how they implemented a solution that delivers measurable benefits. We examine the technical architecture, discuss key learnings, and look at future enhancements that can further transform network operations. We also highlight best practices for handling sensitive data so that Swisscom complies with the strict regulations governing the telecommunications industry. This post gives telecommunications providers and other organizations managing complex infrastructure valuable insights into how they can use AWS services to modernize operations through AI-powered automation.
The opportunity: Improve network operations
Network engineers at Swisscom faced the daily challenge of managing complex network operations while maintaining optimal performance and compliance. These skilled professionals were tasked with monitoring and analyzing vast amounts of data from multiple, decoupled sources, a repetitive process that demanded considerable time and attention to detail. In certain scenarios, these tasks consumed more than 10% of an engineer's available time. The manual nature of the work presented several critical pain points. Consolidating data from multiple network entities into a coherent overview was particularly challenging: engineers had to navigate various tools and systems, retrieve telemetry information about data sources and network parameters from extensive documentation, verify KPIs through complex calculations, and identify potential issues of diverse nature. This fragmented approach consumed valuable time and introduced the risk of human error in data interpretation and analysis. The situation called for a solution that addressed three primary concerns:

Efficiency in data retrieval and analysis
Accuracy in calculations and reporting
Scalability to accommodate growing data sources and use cases

The team required a streamlined approach to access and analyze network data, maintain compliance with defined metrics and thresholds, and deliver fast and accurate responses to events while maintaining the highest standards of data security and sovereignty.
Solution overview
Swisscom’s approach to developing the Network Assistant was methodical and iterative. The team chose Amazon Bedrock as the foundation for their generative AI application and implemented a Retrieval Augmented Generation (RAG) architecture using Amazon Bedrock Knowledge Bases to enable precise and contextual responses to engineer queries. The RAG approach is implemented in three distinct phases:

Retrieval – User queries are matched with relevant knowledge base content through embedding models
Augmentation – The context is enriched with retrieved information
Generation – The large language model (LLM) produces informed responses

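To make the three phases concrete, the following is a minimal sketch of a retrieve-and-generate call against an Amazon Bedrock knowledge base. The knowledge base ID, model ARN, and example question are placeholders, not values from Swisscom's deployment.

import boto3

# Placeholder identifiers -- replace with your own knowledge base ID and model ARN.
KB_ID = "YOUR_KNOWLEDGE_BASE_ID"
MODEL_ARN = "arn:aws:bedrock:eu-central-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0"

client = boto3.client("bedrock-agent-runtime")

# Retrieval and augmentation happen inside the managed API call; generation uses the chosen model.
response = client.retrieve_and_generate(
    input={"text": "Which KPIs are defined for cell availability?"},  # example engineer query
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": KB_ID,
            "modelArn": MODEL_ARN,
        },
    },
)
print(response["output"]["text"])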
The following diagram illustrates the solution architecture.

The solution architecture evolved through several iterations. The initial implementation established basic RAG functionality by feeding the Amazon Bedrock knowledge base with tabular data and documentation. However, the Network Assistant struggled to manage large input files containing thousands of rows with numerical values across multiple parameter columns. This complexity highlighted the need for a more selective approach that could identify only the rows relevant for specific KPI calculations. At that point, the retrieval process wasn’t returning the precise number of vector embeddings required to calculate the formulas, prompting the team to refine the solution for greater accuracy.
Subsequent iterations enhanced the assistant with agent-based processing and action groups. The team implemented AWS Lambda functions using Pandas or Spark for data processing, so that accurate numerical calculations could be derived from a user's natural language prompt.
A significant advancement was introduced with the implementation of a multi-agent approach, using Amazon Bedrock Agents, where specialized agents handle different aspects of the system:

Supervisor agent – Orchestrates interactions between documentation management and calculator agents to provide comprehensive and accurate responses.
Documentation management agent – Helps the network engineers access information in large volumes of data efficiently and extract insights about data sources, network parameters, configuration, or tooling.
Calculator agent – Supports the network engineers to understand complex network parameters and perform precise data calculations out of telemetry data. This produces numerical insights that help perform network management tasks; optimize performance; maintain network reliability, uptime, and compliance; and assist in troubleshooting.

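Once deployed, the supervisor agent receives the engineer's natural language question and coordinates the other agents. A minimal sketch of invoking it through the Amazon Bedrock Agents runtime API might look like the following; the agent ID, alias ID, and question are placeholders.

import uuid
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

# Placeholder identifiers -- use the supervisor agent's actual ID and alias.
response = agent_runtime.invoke_agent(
    agentId="SUPERVISOR_AGENT_ID",
    agentAliasId="AGENT_ALIAS_ID",
    sessionId=str(uuid.uuid4()),
    inputText="What was the average cell availability in my region yesterday?",
)

# The response is streamed; concatenate the completion chunks into the final answer.
answer = ""
for event in response["completion"]:
    if "chunk" in event:
        answer += event["chunk"]["bytes"].decode("utf-8")
print(answer)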
The following diagram illustrates how the enhanced data extract, transform, and load (ETL) pipeline interacts with Amazon Bedrock.

To achieve the desired accuracy in KPI calculations, the data pipeline was refined to achieve consistent and precise performance, which leads to meaningful insights. The team implemented an ETL pipeline with Amazon Simple Storage Service (Amazon S3) as the data lake to store input files following a daily batch ingestion approach, AWS Glue for automated data crawling and cataloging, and Amazon Athena for SQL querying. At this point, it became possible for the calculator agent to forgo the Pandas or Spark data processing implementation. Instead, by using Amazon Bedrock Agents, the agent translates natural language user prompts into SQL queries. In a subsequent step, the agent runs the relevant SQL queries, selected dynamically through analysis of various input parameters, providing the calculator agent an accurate result. This serverless architecture supports scalability and cost-effectiveness, and maintains high accuracy in KPI calculations. The system integrates with Swisscom's on-premises data lake through daily batch data ingestion, with careful consideration of data security and sovereignty requirements.
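The SQL execution step can be pictured with a short sketch like the following, which runs an agent-generated query through Amazon Athena with boto3. The database, table, output location, and query are illustrative only.

import time
import boto3

# Illustrative names -- the actual database, table, and results bucket are environment-specific.
DATABASE = "network_telemetry_db"
OUTPUT_S3 = "s3://example-athena-results/"

athena = boto3.client("athena")

# Example SQL produced by the calculator agent from a natural language prompt.
sql = "SELECT cell_id, AVG(availability) AS avg_availability FROM kpi_daily GROUP BY cell_id"

execution = athena.start_query_execution(
    QueryString=sql,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT_S3},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]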
To enhance data security and promote appropriate, ethical responses from the Network Assistant, a series of guardrails were defined in Amazon Bedrock. The application implements a comprehensive set of data security guardrails to protect against malicious inputs and safeguard sensitive information. These include content filters that block harmful categories such as hate, insults, violence, and prompt-based threats like SQL injection. Specific denied topics and sensitive identifiers (for example, IMSI, IMEI, MAC address, or GPS coordinates) are filtered through manual word filters and pattern-based detection, including regular expressions (regex). Sensitive data such as personally identifiable information (PII), AWS access keys, and serial numbers are blocked or masked. The system also uses contextual grounding and relevance checks to verify model responses are factually accurate and appropriate. In the event of restricted input or output, standardized messaging notifies the user that the request can't be processed. These guardrails help prevent data leaks, reduce the risk of DDoS-driven cost spikes, and maintain the integrity of the application's outputs.
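The following sketch shows how a subset of these guardrails could be expressed with the Amazon Bedrock create_guardrail API. The filter strengths, PII entity types, regex pattern, and messaging are illustrative choices, not Swisscom's exact configuration.

import boto3

bedrock = boto3.client("bedrock")

# Illustrative guardrail covering a subset of the policies described above.
response = bedrock.create_guardrail(
    name="network-assistant-guardrail",
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "INSULTS", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "VIOLENCE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"},
        ]
    },
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "MAC_ADDRESS", "action": "BLOCK"},
            {"type": "AWS_ACCESS_KEY", "action": "BLOCK"},
        ],
        "regexesConfig": [
            # Example pattern for IMSI-like 15-digit identifiers; tune for your own data.
            {"name": "imsi", "pattern": r"\b\d{15}\b", "action": "BLOCK"},
        ],
    },
    contextualGroundingPolicyConfig={
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": 0.75},
            {"type": "RELEVANCE", "threshold": 0.75},
        ]
    },
    blockedInputMessaging="Sorry, this request can't be processed.",
    blockedOutputsMessaging="Sorry, this response can't be provided.",
)
guardrail_id = response["guardrailId"]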
Results and benefits
The implementation of the Network Assistant is set to deliver substantial and measurable benefits to Swisscom's network operations. The most significant impact is time savings. Network engineers are estimated to see a 10% reduction in time spent on routine data retrieval and analysis tasks. This efficiency gain translates to nearly 200 hours saved per engineer annually and represents a significant improvement in operational efficiency. The financial impact is equally impressive. The solution is projected to provide substantial cost savings per engineer annually, with minimal operational costs at less than 1% of the total value generated. The return on investment increases as additional teams and use cases are incorporated into the system, demonstrating strong scalability potential.
Beyond the quantifiable benefits, the Network Assistant is expected to transform how engineers interact with network data. The enhanced data pipeline supports accurate KPI calculations, which are critical for network health tracking, and the multi-agent approach provides orchestrated, comprehensive responses to complex natural language queries.
As a result, engineers have instant access to a wide range of network parameters, data source information, and troubleshooting guidance from a single, personalized endpoint they can query in natural language. This enables them to focus on strategic tasks rather than routine data gathering and analysis, leading to a significant reduction in routine toil that aligns with Swisscom's SRE principles.
Lessons learned
Throughout the development and implementation of the Swisscom Network Assistant, several learnings emerged that shaped the solution. The team needed to address data sovereignty and security requirements for the solution, particularly when processing data on AWS. This led to careful consideration of data classification and compliance with applicable regulatory requirements in the telecommunications sector, to make sure that sensitive data is handled appropriately. In this regard, the application underwent a strict threat model evaluation, verifying the robustness of its interfaces against vulnerabilities and proactively hardening its security posture. The threat model was applied to assess worst-case scenarios, and data flow diagrams were created to depict the major data flows inside and beyond the application boundaries. The AWS architecture was specified in detail, and trust boundaries were set to indicate which portions of the application trusted each other. Threats were identified following the STRIDE methodology (Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege), and countermeasures, including Amazon Bedrock Guardrails, were defined to avoid or mitigate threats in advance.
A critical technical insight was that complex calculations involving significant data volume management required a different approach than mere AI model interpretation. The team implemented an enhanced data processing pipeline that combines the contextual understanding of AI models with direct database queries for numerical calculations. This hybrid approach facilitates both accuracy in calculations and richness in contextual responses.
The choice of a serverless architecture proved particularly beneficial: it minimizes the need to manage compute resources and provides automatic scaling capabilities. The pay-per-use model of AWS services helped keep operational costs low while maintaining high performance. Additionally, the team's decision to implement a multi-agent approach provided the flexibility needed to handle diverse types of queries and use cases effectively.
Next steps
Swisscom has ambitious plans to further enhance the Network Assistant's capabilities. A key upcoming feature is a network health tracker agent that provides proactive monitoring of network KPIs. This agent will automatically generate reports that categorize issues by criticality, enabling faster response times and improving the quality of resolution for potential network issues. The team is also exploring the integration of Amazon Simple Notification Service (Amazon SNS) to enable proactive alerting for critical network status changes. This can include direct integration with operational tools that alert on-call engineers, to further streamline the incident response process. The enhanced notification system will help engineers address potential issues before they critically impact network performance and will provide a detailed action plan covering the affected network entities, the severity of the event, and what exactly went wrong.
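If implemented with Amazon SNS, publishing such an alert could look roughly like the following sketch. The topic ARN and message payload are hypothetical and stand in for whatever schema the team ultimately adopts.

import json
import boto3

sns = boto3.client("sns")

# Hypothetical topic ARN and alert payload -- the real integration is still on the roadmap.
TOPIC_ARN = "arn:aws:sns:eu-central-1:123456789012:network-assistant-alerts"

alert = {
    "severity": "CRITICAL",
    "entity": "cell-4711",
    "kpi": "availability",
    "message": "Availability dropped below the defined threshold.",
}

sns.publish(
    TopicArn=TOPIC_ARN,
    Subject="Network Assistant alert: availability threshold breached",
    Message=json.dumps(alert),
)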
The roadmap also includes expanding the system’s data sources and use cases. Integration with additional internal network systems will provide more comprehensive network insights. The team is also working on developing more sophisticated troubleshooting features, using the growing knowledge base and agentic capabilities to provide increasingly detailed guidance to engineers.
Additionally, Swisscom is adopting infrastructure as code (IaC) principles by implementing the solution using AWS CloudFormation. This approach introduces automated and consistent deployments while providing version control of infrastructure components, facilitating simpler scaling and management of the Network Assistant solution as it grows.
Conclusion
The Network Assistant represents a significant advancement in how Swisscom manages its network operations. By using AWS services and implementing a sophisticated AI-powered solution, they have successfully addressed the challenges of manual data retrieval and analysis. As a result, they have boosted both accuracy and efficiency so network engineers can respond quickly and decisively to network events. The solution's success is reflected not only in the quantifiable benefits in time and cost savings but also in its potential for future expansion. The serverless architecture and multi-agent approach provide a solid foundation for adding new capabilities and scaling across different teams and use cases. As organizations worldwide grapple with similar challenges in network operations, Swisscom's implementation serves as a valuable blueprint for using cloud services and AI to transform traditional operations. The combination of Amazon Bedrock with careful attention to data security and accuracy demonstrates how modern AI solutions can help solve real-world engineering challenges.
As the complexity of managing network operations continues to grow, the lessons from Swisscom's journey can be applied to many engineering disciplines. We encourage you to consider how Amazon Bedrock and similar AI solutions might help your organization overcome its own comprehension and process improvement barriers. To learn more about implementing generative AI in your workflows, explore Amazon Bedrock Resources or contact AWS.
Additional resources
For more information about Amazon Bedrock Agents and its use cases, refer to the following resources:

Generative AI for telecom
Best practices for building robust generative AI applications with Amazon Bedrock Agents – Part 1
Best practices for building robust generative AI applications with Amazon Bedrock Agents – Part 2

About the authors
Pablo García Benedicto is an experienced Data & AI Cloud Engineer with strong expertise in cloud hyperscalers and data engineering. With a background in telecommunications, he currently works at Swisscom, where he leads and contributes to projects involving Generative AI applications and agents using Amazon Bedrock. Aiming for AI and data specialization, his latest projects focus on building intelligent assistants and autonomous agents that streamline business information retrieval, leveraging cloud-native architectures and scalable data pipelines to reduce toil and drive operational efficiency.
Rajesh Sripathi is a Generative AI Specialist Solutions Architect at AWS, where he partners with global Telecommunication and Retail & CPG customers to develop and scale generative AI applications. With over 18 years of experience in the IT industry, Rajesh helps organizations use cutting-edge cloud and AI technologies for business transformation. Outside of work, he enjoys exploring new destinations through his passion for travel and driving.
Ruben Merz is a Principal Solutions Architect at AWS. With a background in distributed systems and networking, his work with customers at AWS focuses on digital sovereignty, AI, and networking.
Jordi Montoliu Nerin is a Data & AI Leader currently serving as Senior AI/ML Specialist at AWS, where he helps worldwide telecommunications customers implement AI strategies after previously driving Data & Analytics business across EMEA regions. He has over 10 years of experience, where he has led multiple Data & AI implementations at scale, led executions of data strategy and data governance frameworks, and has driven strategic technical and business development programs across multiple industries and continents. Outside of work, he enjoys sports, cooking and traveling.

End-to-End model training and deployment with Amazon SageMaker Unified …

Although rapid advances in generative AI are revolutionizing how organizations handle natural language processing tasks, developers and data scientists face significant challenges when customizing these large models. These hurdles include managing complex workflows, efficiently preparing large datasets for fine-tuning, implementing sophisticated fine-tuning techniques while optimizing computational resources, consistently tracking model performance, and achieving reliable, scalable deployment. The fragmented nature of these tasks often leads to reduced productivity, increased development time, and potential inconsistencies in the model development pipeline. Organizations need a unified, streamlined approach that simplifies the entire process from data preparation to model deployment.
To address these challenges, AWS has expanded Amazon SageMaker with a comprehensive set of data, analytics, and generative AI capabilities. At the heart of this expansion is Amazon SageMaker Unified Studio, a centralized service that serves as a single integrated development environment (IDE). SageMaker Unified Studio streamlines access to familiar tools and functionality from purpose-built AWS analytics and artificial intelligence and machine learning (AI/ML) services, including Amazon EMR, AWS Glue, Amazon Athena, Amazon Redshift, Amazon Bedrock, and Amazon SageMaker AI. With SageMaker Unified Studio, you can discover data through Amazon SageMaker Catalog, access it from Amazon SageMaker Lakehouse, select foundation models (FMs) from Amazon SageMaker JumpStart or build them through JupyterLab, train and fine-tune them with SageMaker AI training infrastructure, and deploy and test models directly within the same environment. SageMaker AI is a fully managed service to build, train, and deploy ML models—including FMs—for different use cases by bringing together a broad set of tools to enable high-performance, low-cost ML. It’s available as a standalone service on the AWS Management Console, or through APIs. Model development capabilities from SageMaker AI are available within SageMaker Unified Studio.
In this post, we guide you through the stages of customizing large language models (LLMs) with SageMaker Unified Studio and SageMaker AI, covering the end-to-end process starting from data discovery to fine-tuning FMs with SageMaker AI distributed training, tracking metrics using MLflow, and then deploying models using SageMaker AI inference for real-time inference. We also discuss best practices to choose the right instance size and share some debugging best practices while working with JupyterLab notebooks in SageMaker Unified Studio.
Solution overview
The following diagram illustrates the solution architecture. There are three personas: admin, data engineer, and user, who can be a data scientist or an ML engineer.

AWS SageMaker Unified Studio ML workflow showing data processing, model training, and deployment stages

Setting up the solution consists of the following steps:

The admin sets up the SageMaker Unified Studio domain for the user and sets the access controls. The admin also publishes the data to SageMaker Catalog in SageMaker Lakehouse.
Data engineers can create and manage extract, transform, and load (ETL) pipelines directly within Unified Studio using Visual ETL. They can transform raw data sources into datasets ready for exploratory data analysis. The admin can then manage the publication of these assets to the SageMaker Catalog, making them discoverable and accessible to other team members or users such as data engineers in the organization.
Users or data engineers can log in to the Unified Studio web-based IDE using the login provided by the admin to create a project and create a managed MLflow server for tracking experiments. Users can discover available data assets in the SageMaker Catalog and request a subscription to an asset published by the data engineer. After the data engineer approves the subscription request, the user performs an exploratory data analysis of the content of the table with the query editor or with a JupyterLab notebook, then prepares the dataset by connecting with SageMaker Catalog through an AWS Glue or Athena connection.
You can explore models from SageMaker JumpStart, which hosts over 200 models for various tasks, and fine-tune directly with the UI, or develop a training script for fine-tuning the LLM in the JupyterLab IDE. SageMaker AI provides distributed training libraries and supports various distributed training options for deep learning tasks. For this post, we use the PyTorch framework and Hugging Face open source FMs for fine-tuning. We show you how to use parameter-efficient fine-tuning (PEFT) with Low-Rank Adaptation (LoRA), where you freeze the base model weights, train only the low-rank adapter matrices, and then merge these LoRA adapters back into the base model after distributed training.
You can track and monitor fine-tuning metrics directly in SageMaker Unified Studio using MLflow, by analyzing metrics such as loss to make sure the model is correctly fine-tuned.
You can deploy the model to a SageMaker AI endpoint after the fine-tuning job is complete and test it directly from SageMaker Unified Studio.

Prerequisites
Before starting this tutorial, make sure you have the following:

An AWS account with permissions to create SageMaker resources. For setup instructions, see Set up an AWS account and create an administrator user.
Familiarity with Python and PyTorch for distributed training and model customization.

Set up SageMaker Unified Studio and configure user access
SageMaker Unified Studio is built on top of Amazon DataZone capabilities such as domains to organize your assets and users, and projects to collaborate with other users, securely share artifacts, and seamlessly work across compute services.
To set up Unified Studio, complete the following steps:

As an admin, create a SageMaker Unified Studio domain, and note the URL.
On the domain's details page, on the User management tab, choose Configure SSO user access. For this post, we recommend setting up single sign-on (SSO) access using the URL.

For more information about setting up user access, see Managing users in Amazon SageMaker Unified Studio.
Log in to SageMaker Unified Studio
Now that you have created your new SageMaker Unified Studio domain, complete the following steps to access SageMaker Unified Studio:

On the SageMaker console, open the details page of your domain.
Choose the link for the SageMaker Unified Studio URL.
Log in with your SSO credentials.

Now you’re signed in to SageMaker Unified Studio.
Create a project
The next step is to create a project. Complete the following steps:

In SageMaker Unified Studio, choose Select a project on the top menu, and choose Create project.
For Project name, enter a name (for example, demo).
For Project profile, choose your profile capabilities. A project profile is a collection of blueprints, which are configurations used to create projects. For this post, we choose All capabilities, then choose Continue.

Creating a project in Amazon SageMaker Unified Studio

Create a compute space
SageMaker Unified Studio provides compute spaces for IDEs that you can use to code and develop your resources. By default, it creates a space for you to get started with your project. You can find the default space by choosing Compute in the navigation pane and choosing the Spaces tab. You can then choose Open to go to the JupyterLab environment and add members to this space. You can also create a new space by choosing Create space on the Spaces tab.

To use SageMaker Studio notebooks cost-effectively, use smaller, general-purpose instances (like the T or M families) for interactive data exploration and prototyping. For heavy lifting like training, large-scale processing, or deployment, use SageMaker AI training jobs and SageMaker AI prediction to offload the work to separate, more powerful instances such as the P5 family. We show in the notebook how you can run training jobs and deploy LLMs from the notebook with APIs. Running distributed workloads directly in notebook instances isn't recommended; JupyterLab notebooks aren't designed for large distributed data processing or ML training, and the chance of kernel failures is high.
The following screenshot shows the configuration options for your space. You can change your instance size from the default (ml.t3.medium) to ml.m5.xlarge for the JupyterLab IDE. You can also increase the Amazon Elastic Block Store (Amazon EBS) volume capacity from 16 GB to 50 GB for training LLMs.

Configure space in Amazon SageMaker Unified Studio

Set up MLflow to track ML experiments
You can use MLflow in SageMaker Unified Studio to create, manage, analyze, and compare ML experiments. Complete the following steps to set up MLflow:

In SageMaker Unified Studio, choose Compute in the navigation pane.
On the MLflow Tracking Servers tab, choose Create MLflow Tracking Server.
Provide a name and create your tracking server.
Choose Copy ARN to copy the Amazon Resource Name (ARN) of the tracking server.

You will need this MLflow ARN in your notebook to set up distributed training experiment tracking.
Set up the data catalog
For model fine-tuning, you need access to a dataset. After you set up the environment, the next step is to find the relevant data from the SageMaker Unified Studio data catalog and prepare the data for model tuning. For this post, we use the Stanford Question Answering Dataset (SQuAD). This reading comprehension dataset consists of questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
Download the SQuAD dataset and upload it to SageMaker Lakehouse by following the steps in Uploading data.

Adding data to Catalog in Amazon SageMaker Unified Studio

To make this data discoverable by the users or ML engineers, the admin needs to publish this data to the Data Catalog. For this post, you can directly download the SQuAD dataset and upload it to the catalog. To learn how to publish the dataset to SageMaker Catalog, see Publish assets to the Amazon SageMaker Unified Studio catalog from the project inventory.
Query data with the query editor and JupyterLab
In many organizations, data preparation is a collaborative effort. A data engineer might prepare an initial raw dataset, which a data scientist then refines and augments with feature engineering before using it for model training. In the SageMaker Lakehouse data and model catalog, publishers configure assets for automatic or manual subscription approval (where subscribers wait for admin approval). Because you already set up the data in the previous section, you can skip the following steps showing how to subscribe to the dataset.
To subscribe to another dataset like SQuAD, open the data and model catalog in Amazon SageMaker Lakehouse, choose SQuAD, and subscribe.

Subscribing to any asset or dataset published by Admin

Next, let’s use the data explorer to explore the dataset you subscribed to. Complete the following steps:

On the project page, choose Data.
Under Lakehouse, expand AwsDataCatalog.
Expand your database starting from glue_db_.
Choose the dataset you created (starting with squad) and choose Query with Athena.

Querying the data using Query Editor in Amazon SageMaker Unified Studio

Process your data through a multi-compute JupyterLab IDE notebook
SageMaker Unified Studio provides a unified JupyterLab experience across different languages, including SQL, PySpark, Python, and Scala Spark. It also supports unified access across different compute runtimes such as Amazon Redshift and Athena for SQL, Amazon EMR Serverless, Amazon EMR on EC2, and AWS Glue for Spark.
Complete the following steps to get started with the unified JupyterLab experience:

Open your SageMaker Unified Studio project page.
On the top menu, choose Build, and under IDE & APPLICATIONS, choose JupyterLab.
Wait for the space to be ready.
Choose the plus sign and for Notebook, choose Python 3.
Open a new terminal and enter git clone https://github.com/aws-samples/amazon-sagemaker-generativeai.
Go to the folder amazon-sagemaker-generativeai/3_distributed_training/distributed_training_sm_unified_studio/ and open the distributed training in unified studio.ipynb notebook to get started.
Enter the MLflow server ARN you created in the following code:

import os
os.environ["mlflow_uri"] = ""
os.environ["mlflow_experiment_name"] = "deepseek-r1-distill-llama-8b-sft"

Now you can visualize the data through the notebook.

On the project page, choose Data.
Under Lakehouse, expand AwsDataCatalog.
Expand your database starting from glue_db, copy the name of the database, and enter it in the following code:

db_name = "<enter your db name>"
table = "sqad"

You can now access the entire dataset directly by using the in-line SQL query capabilities of JupyterLab notebooks in SageMaker Unified Studio. You can follow the data preprocessing steps in the notebook.

%%sql project.athena
SELECT * FROM "<DATABASE_NAME>"."sqad";

The following screenshot shows the output.

We split the dataset into a training set and a test set for model training. When the data processing is done and we have split the data, the next step is to perform fine-tuning of the model using SageMaker distributed training.
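A minimal sketch of this split using the Hugging Face datasets library follows. The DataFrame name, split ratio, and output paths are assumptions for illustration; the notebook contains the authoritative preprocessing code.

from datasets import Dataset

# `df` is the pandas DataFrame produced by the preprocessing steps above (assumed name).
dataset = Dataset.from_pandas(df)

# Hold out 10% of the examples for evaluation; the exact ratio is a choice, not prescribed here.
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset, test_dataset = splits["train"], splits["test"]

# Persist both splits so they can be uploaded to Amazon S3 as training job input channels.
train_dataset.save_to_disk("./data/train")
test_dataset.save_to_disk("./data/test")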
Fine-tune the model with SageMaker Distributed training
You're now ready to fine-tune your model by using SageMaker AI capabilities for training. Amazon SageMaker Training is a fully managed ML service offered by SageMaker that helps you efficiently train a wide range of ML models at scale. The core of SageMaker AI jobs is the containerization of ML workloads and the capability of managing AWS compute resources. SageMaker Training takes care of the heavy lifting associated with setting up and managing infrastructure for ML training workloads.
We select one model directly from the Hugging Face Hub, DeepSeek-R1-Distill-Llama-8B, and develop our training script in the JupyterLab space. Because we want to distribute the training across all the available GPUs in our instance using PyTorch Fully Sharded Data Parallel (FSDP), we use the Hugging Face Accelerate library to run the same PyTorch code across distributed configurations. You can start the fine-tuning job directly in your JupyterLab notebook or use the SageMaker Python SDK to start the training job. We use the Trainer from transformers to fine-tune our model. We prepared the script train.py, which loads the dataset from disk, prepares the model and tokenizer, and starts the training.
For configuration, we use TrlParser, and provide hyperparameters in a YAML file. You can upload this file and provide it to SageMaker similar to your datasets. The following is the config file for fine-tuning the model on ml.g5.12xlarge. Save the config file as args.yaml and upload it to Amazon Simple Storage Service (Amazon S3).

cat > ./args.yaml <<EOF
model_id: "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"       # Hugging Face model id
mlflow_uri: "${mlflow_uri}"
mlflow_experiment_name: "${mlflow_experiment_name}"
# sagemaker specific parameters
output_dir: "/opt/ml/model"                       # path to where SageMaker will upload the model
train_dataset_path: "/opt/ml/input/data/train/"   # path to where FSx saves train dataset
test_dataset_path: "/opt/ml/input/data/test/"     # path to where FSx saves test dataset
# training parameters
lora_r: 8
lora_alpha: 16
lora_dropout: 0.1
learning_rate: 2e-4                    # learning rate scheduler
num_train_epochs: 1                    # number of training epochs
per_device_train_batch_size: 2         # batch size per device during training
per_device_eval_batch_size: 1          # batch size for evaluation
gradient_accumulation_steps: 2         # number of steps before performing a backward/update pass
gradient_checkpointing: true           # use gradient checkpointing
bf16: true                             # use bfloat16 precision
tf32: false                            # use tf32 precision
fsdp: "full_shard auto_wrap offload"
fsdp_config:
    backward_prefetch: "backward_pre"
    cpu_ram_efficient_loading: true
    offload_params: true
    forward_prefetch: false
    use_orig_params: true
merge_weights: true                    # merge weights in the base model
EOF

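The full train.py from the sample repository isn't reproduced in this post, but a compressed, illustrative sketch helps connect the pieces: the TrlParser mentioned above reads args.yaml (passed to the job through the config hyperparameter), LoRA adapters are attached with the peft library, and the Hugging Face Trainer runs the fine-tuning. Field names mirror the YAML above; treat this as a sketch under those assumptions rather than the exact script.

# Compressed, illustrative sketch of a train.py; the real script also handles
# tokenization, data collation, distributed setup, and MLflow logging.
from dataclasses import dataclass
from datasets import load_from_disk
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from trl import TrlParser

@dataclass
class ScriptArguments:
    model_id: str = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
    mlflow_uri: str = ""
    mlflow_experiment_name: str = ""
    train_dataset_path: str = "/opt/ml/input/data/train/"
    test_dataset_path: str = "/opt/ml/input/data/test/"
    lora_r: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.1
    merge_weights: bool = True

def main():
    # TrlParser reads both command-line flags and the YAML file passed via --config.
    parser = TrlParser((ScriptArguments, TrainingArguments))
    script_args, training_args = parser.parse_args_and_config()

    tokenizer = AutoTokenizer.from_pretrained(script_args.model_id)
    model = AutoModelForCausalLM.from_pretrained(script_args.model_id)

    # Wrap the base model with LoRA adapters so only the low-rank matrices are trained.
    peft_config = LoraConfig(
        r=script_args.lora_r,
        lora_alpha=script_args.lora_alpha,
        lora_dropout=script_args.lora_dropout,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, peft_config)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=load_from_disk(script_args.train_dataset_path),
        eval_dataset=load_from_disk(script_args.test_dataset_path),
    )
    trainer.train()

    if script_args.merge_weights:
        # Fold the LoRA adapters back into the base weights before saving to /opt/ml/model.
        model = model.merge_and_unload()
        model.save_pretrained(training_args.output_dir)
        tokenizer.save_pretrained(training_args.output_dir)

if __name__ == "__main__":
    main()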
Use the following code to use the native PyTorch container image, pre-built for SageMaker:

image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=sagemaker_session.boto_session.region_name,
    version="2.6.0",
    instance_type=instance_type,
    image_scope="training"
)

image_uri

Define the trainer as follows:

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    distributed=Torchrun(),
    stopping_condition=StoppingCondition(
        max_runtime_in_seconds=7200
    ),
    hyperparameters={
        "config": "/opt/ml/input/data/config/args.yaml" # path to TRL config which was uploaded to s3
    },
    output_data_config=OutputDataConfig(
        s3_output_path=output_path
    ),
)

Run the trainer with the following:

# starting the train job with our uploaded datasets as input
model_trainer.train(input_data_config=data, wait=True)

You can follow the steps in the notebook.
You can explore the job execution in SageMaker Unified Studio. The training job runs on the SageMaker training cluster by distributing the computation across the four available GPUs on the selected instance type ml.g5.12xlarge. We choose to merge the LoRA adapter with the base model. This decision was made during the training process by setting the merge_weights parameter to True in our train_fn() function. Merging the weights provides a single, cohesive model that incorporates both the base knowledge and the domain-specific adaptations we’ve made through fine-tuning.
Track training metrics and model registration using MLflow
You created an MLflow server in an earlier step to track experiments and registered models, and provided the server ARN in the notebook.
You can log MLflow models and automatically register them with Amazon SageMaker Model Registry using either the Python SDK or directly through the MLflow UI. Use mlflow.register_model() to automatically register a model with SageMaker Model Registry during model training. You can explore the MLflow tracking code in train.py and the notebook. The training code tracks MLflow experiments and registers the model to the MLflow model registry. To learn more, see Automatically register SageMaker AI models with SageMaker Model Registry.
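As a rough sketch, the tracking and registration calls inside the training code follow this pattern; the parameter values and run name are placeholders, and the actual logging lives in train.py.

import os
import mlflow

# Point MLflow at the SageMaker managed tracking server created earlier (ARN copied in the notebook).
mlflow.set_tracking_uri(os.environ["mlflow_uri"])
mlflow.set_experiment(os.environ["mlflow_experiment_name"])

with mlflow.start_run(run_name="deepseek-r1-distill-llama-8b-sft") as run:
    mlflow.log_params({"lora_r": 8, "learning_rate": 2e-4})
    mlflow.log_metric("train_loss", 0.42, step=100)  # illustrative value

    # train.py logs the fine-tuned model under the "model" artifact path before this call;
    # with the SageMaker MLflow integration, registration also surfaces it in SageMaker Model Registry.
    mlflow.register_model(f"runs:/{run.info.run_id}/model", "deepseek-r1-distill-llama-8b-sft")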
To see the logs, complete the following steps:

Choose Build, then choose Spaces.
Choose Compute in the navigation pane.
On the MLflow Tracking Servers tab, choose Open to open the tracking server.

You can see both the experiments and registered models.

Deploy and test the model using SageMaker AI Inference
When deploying a fine-tuned model on AWS, SageMaker AI Inference offers multiple deployment strategies. In this post, we use SageMaker real-time inference. The real-time inference endpoint is designed to give you full control over the inference resources. You can use a set of available instances and deployment options for hosting your model. By using the SageMaker built-in container DJL Serving, you can take advantage of the inference script and optimization options available directly in the container. In this post, we deploy the fine-tuned model to a SageMaker endpoint for running inference, which will be used for testing the model.
In SageMaker Unified Studio, in JupyterLab, we create the Model object, which is a high-level SageMaker model class for working with multiple container options. The image_uri parameter specifies the container image URI for the model, and model_data points to the Amazon S3 location containing the model artifact (automatically uploaded by the SageMaker training job). We also specify a set of environment variables to configure the specific inference backend option (OPTION_ROLLING_BATCH), the degree of tensor parallelism based on the number of available GPUs (OPTION_TENSOR_PARALLEL_DEGREE), and the maximum allowable length of input sequences (in tokens) for models during inference (OPTION_MAX_MODEL_LEN).

model = Model(
    image_uri=image_uri,
    model_data=f"s3://{bucket_name}/{job_prefix}/{job_name}/output/model.tar.gz",
    role=get_execution_role(),
    env={
        'HF_MODEL_ID': "/opt/ml/model",
        'OPTION_TRUST_REMOTE_CODE': 'true',
        'OPTION_ROLLING_BATCH': "vllm",
        'OPTION_DTYPE': 'bf16',
        'OPTION_TENSOR_PARALLEL_DEGREE': 'max',
        'OPTION_MAX_ROLLING_BATCH_SIZE': '1',
        'OPTION_MODEL_LOADING_TIMEOUT': '3600',
        'OPTION_MAX_MODEL_LEN': '4096'
    }
)

After you create the model object, you can deploy it to an endpoint using the deploy method. The initial_instance_count and instance_type parameters specify the number and type of instances to use for the endpoint. We selected the ml.g5.4xlarge instance for the endpoint. The container_startup_health_check_timeout and model_data_download_timeout parameters set the timeout values for the container startup health check and model data download, respectively.

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
endpoint_name = f"{model_id.split('/')[-1].replace('.', '-')}-sft-djl"
predictor = model.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
    container_startup_health_check_timeout=1800,
    model_data_download_timeout=3600
)

It takes a few minutes to deploy the model before it becomes available for inference and evaluation. You can test the endpoint invocation in JupyterLab, by using the AWS SDK with the boto3 client for sagemaker-runtime, or by using the SageMaker Python SDK and the predictor previously created, by using the predict API.

base_prompt = f"""<s> [INST] {{question}} [/INST] """

prompt = base_prompt.format(
    question="What statue is in front of the Notre Dame building?"
)

predictor.predict({
    "inputs": prompt,
    "parameters": {
        "max_new_tokens": 300,
        "temperature": 0.2,
        "top_p": 0.9,
        "return_full_text": False,
        "stop": ['</s>']
    }
})

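For reference, the equivalent invocation with the low-level boto3 sagemaker-runtime client might look like the following sketch, reusing the prompt from the previous cell and the endpoint created by model.deploy.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,  # endpoint created by model.deploy above
    ContentType="application/json",
    Body=json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": 300, "temperature": 0.2, "top_p": 0.9},
    }),
)
print(json.loads(response["Body"].read()))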
You can also test the model invocation in SageMaker Unified Studio, on the Inference endpoint page and Text inference tab.
Troubleshooting
You might encounter some of the following errors while running your model training and deployment:

Training job fails to start – If a training job fails to start, make sure your IAM role AmazonSageMakerDomainExecution has the necessary permissions, verify the instance type is available in your AWS Region, and check your S3 bucket permissions. This role is created when an admin creates the domain, and you can ask the admin to check your IAM access permissions associated with this role.
Out-of-memory errors during training – If you encounter out-of-memory errors during training, try reducing the batch size, use gradient accumulation to simulate larger batches, or consider using a larger instance.
Slow model deployment – For slow model deployment, make sure model artifacts aren't excessively large, use appropriate instance types for inference, and confirm that capacity is available for that instance type in your Region.

For more troubleshooting tips, refer to Troubleshooting guide.
Clean up
SageMaker Unified Studio by default shuts down idle resources such as JupyterLab spaces after 1 hour. However, you must delete the S3 bucket and the hosted model endpoint to stop incurring costs. You can delete the real-time endpoints you created using the SageMaker console. For instructions, see Delete Endpoints and Resources.
Conclusion
This post demonstrated how SageMaker Unified Studio serves as a powerful centralized service for data and AI workflows, showcasing its seamless integration capabilities throughout the fine-tuning process. With SageMaker Unified Studio, data engineers and ML practitioners can efficiently discover and access data through SageMaker Catalog, prepare datasets, fine-tune models, and deploy them—all within a single, unified environment. The service’s direct integration with SageMaker AI and various AWS analytics services streamlines the development process, alleviating the need to switch between multiple tools and environments. The solution highlights the service’s versatility in handling complex ML workflows, from data discovery and preparation to model deployment, while maintaining a cohesive and intuitive user experience. Through features like integrated MLflow tracking, built-in model monitoring, and flexible deployment options, SageMaker Unified Studio demonstrates its capability to support sophisticated AI/ML projects at scale.
To learn more about SageMaker Unified Studio, see An integrated experience for all your data and AI with Amazon SageMaker Unified Studio.
If this post helps you or inspires you to solve a problem, we would love to hear about it! The code for this solution is available on the GitHub repo for you to use and extend. Contributions are always welcome!

About the authors
Mona Mona currently works as a Sr World Wide Gen AI Specialist Solutions Architect at Amazon focusing on Gen AI Solutions. She was a Lead Generative AI specialist in Google Public Sector at Google before joining Amazon. She is a published author of two books – Natural Language Processing with AWS AI Services and Google Cloud Certified Professional Machine Learning Study Guide. She has authored 19 blogs on AI/ML and cloud technology and a co-author on a research paper on CORD19 Neural Search which won an award for Best Research Paper at the prestigious AAAI (Association for the Advancement of Artificial Intelligence) conference.
Bruno Pistone is a Senior Generative AI and ML Specialist Solutions Architect for AWS based in Milan. He works with large customers, helping them to deeply understand their technical needs and design AI and machine learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His expertise includes end-to-end machine learning, machine learning industrialization, and generative AI. He enjoys spending time with his friends and exploring new places, as well as traveling to new destinations.
Lauren Mullennex is a Senior GenAI/ML Specialist Solutions Architect at AWS. She has a decade of experience in DevOps, infrastructure, and ML. Her areas of focus include MLOps/LLMOps, generative AI, and computer vision.

Together AI Releases DeepSWE: A Fully Open-Source RL-Trained Coding Ag …

Together AI has released DeepSWE, a state-of-the-art, fully open-sourced software engineering agent that is trained entirely through reinforcement learning (RL). Built on top of the Qwen3-32B language model, DeepSWE achieves 59% accuracy on the SWEBench-Verified benchmark and 42.2% Pass@1, topping the leaderboard among open-weight models. This launch represents a significant shift for Together AI, from traditional pretraining pipelines toward creating autonomous language agents that continuously learn and improve via real-world feedback.

Reinforcement Learning Meets Code Generation

DeepSWE is the result of post-training the Qwen3-32B foundation model using rLLM, Agentica’s modular reinforcement learning framework tailored for language agents. Unlike conventional supervised fine-tuning approaches, rLLM enables agents to adapt to real-world workflows through experience. DeepSWE has been specifically trained to solve complex software engineering tasks using a feedback-driven loop rather than static datasets.

The training pipeline incorporates Agentica’s R2EGym dataset—a software engineering benchmark designed for RL-style agent development. The framework focuses on training language models with action-oriented objectives, such as fixing bugs, completing functions, and editing code, rather than merely predicting next-token distributions. This aligns DeepSWE more closely with how human engineers iterate and learn from outcomes.

Performance Benchmarks and Capabilities

On SWEBench-Verified, the most rigorous benchmark for software engineering agents, DeepSWE scores 59% with test-time scaling. This significantly outperforms previous open-weight models. In Pass@1 evaluations—which measure the probability that the agent solves a problem correctly on the first attempt—DeepSWE reaches an impressive 42.2%.

These results underscore the power of RL-based training in enhancing agentic behavior, particularly in domains requiring iterative reasoning and precise outputs, such as code synthesis. The model’s architecture, inherited from Qwen3-32B, enables it to scale effectively while remaining suitable for real-world applications.

Open Source and Reproducibility at Its Core

One of the standout features of this release is its full transparency. Together AI and Agentica have open-sourced not only the DeepSWE model but also the entire training recipe, including the rLLM framework, the R2EGym dataset, and training configuration scripts. This promotes reproducibility and invites the broader research and developer communities to extend or build upon DeepSWE without restrictions.

Developers can access DeepSWE and rLLM via the following:

Model Weights: Hugging Face – DeepSWE

Training Framework: rLLM GitHub Repository

Training Documentation: DeepSWE Training Overview

From Language Reasoners to Language Agents

DeepSWE marks a philosophical and practical shift: from building models that reason about language to building agents that learn through interaction. Traditional LLMs have shown strong reasoning capabilities, but often lack the ability to adapt to feedback or improve with use. Reinforcement learning enables these models to not only perform well at launch but to get better over time, adapting to new problem distributions and domains.

This approach also opens the door for local deployment. Because DeepSWE is fully open-source and modular, it can be extended and retrained for organization-specific use cases. Developers and researchers can build their own agents on top of DeepSWE using rLLM to serve diverse domains such as web navigation, robotics, or autonomous research assistance.

Conclusion

DeepSWE is a milestone in the evolution of generative AI for software engineering. By applying reinforcement learning to large language models like Qwen3-32B and releasing the entire training infrastructure, Together AI is enabling a future where agents are not just pretrained and deployed, but continually trained and improved. This leap from language understanding to action-oriented agency has significant implications across programming, automation, and intelligent system design.

All credit for this research goes to the researchers of this project.

Shanghai Jiao Tong Researchers Propose OctoThinker for Reinforcement L …

Introduction: Reinforcement Learning Progress through Chain-of-Thought Prompting

LLMs have shown excellent progress in complex reasoning tasks through CoT prompting combined with large-scale reinforcement learning (RL). Models like Deepseek-R1-Zero have shown strong reasoning capabilities by applying RL directly to base models. Similarly, methods such as SimpleRL and Open-ReasonerZero show improvements in smaller models like the Qwen series. However, achieving success across different base model families remains a challenge. Moreover, applying R1-Zero-style training to base models such as the Llama series faces difficulty, posing a fundamental question about the underlying factors that lead different base models to behave inconsistently during reinforcement learning.

Limitations of RL Scaling on Llama Models

Large-scale RL has driven advances in models like OpenAI's o1 and o3 and DeepSeek's R1 on competition-level mathematics problems, motivating the exploration of RL on smaller models with fewer than 100B parameters. However, these successes are largely limited to the Qwen model family, and replicating the results on families such as Llama is difficult. The lack of transparency in pre-training pipelines has made it difficult to understand how pre-training influences RL scaling. This has prompted unconventional studies, which found that one-shot prompting improves reasoning in Qwen but offers little benefit in Llama. Efforts to curate high-quality mathematical pre-training corpora through projects like OpenWebMath, MathPile, InfiMM-Web-Math, and FineMath have made progress but remain limited in scale to under 100B tokens.

Exploring Mid-Training with Stable-then-Decay Strategy

Researchers from Shanghai Jiao Tong University investigate how mid-training strategies shape RL dynamics, focusing on Qwen and Llama. The study presents several insights: First, high-quality mathematical corpora such as MegaMath-Web-Pro boost both base model and RL outcomes. Second, using QA-style data, especially those with long CoT reasoning, further enhances RL results. Third, long CoT introduces verbosity and instability in RL training. Lastly, applying scaling during mid-training results in stronger downstream RL performance. Researchers introduce a two-stage mid-training strategy called Stable-then-Decay, where base models are first trained on 200B tokens, followed by 20B tokens across three CoT-focused branches, resulting in OctoThinker models that show strong RL compatibility.

RL Configuration and Benchmark Evaluation

Researchers use the MATH8K dataset for RL training prompts. The configuration includes a global training batch size of 128, 16 rollout responses per query, and a PPO mini-batch size of 64, with experiments conducted on Llama-3.2-3B-Base and Qwen2.5-3B-Base models. For evaluation, few-shot prompting is used for base language models, and zero-shot for RL-tuned models across indicator tasks, including GSM8K, MATH500, OlympiadBench, and AMC23. During RL training, Qwen models exhibit increasing response lengths that remain reasonable throughout, whereas Llama displays abnormal behavior, with average response lengths escalating to 4,096 tokens. Evaluation further reveals that RL-tuned Qwen2.5-3B achieves improvements across benchmarks, while Llama-3.2-3B shows only marginal gains.

OctoThinker Outperforms Llama in RL Compatibility

Each OctoThinker branch demonstrates 10%-20% improvement over the original Llama base model and consistent gains over the stable-stage model across all sizes when evaluated on 13 mathematical benchmarks. The OctoThinker-Zero families reveal diverse thinking behaviors during RL scaling, with strong performance from the OctoThinker-Long variant. When comparing three 3B-scale base models during RL training, OctoThinker-Long-3B outperforms the original Llama-3.2-3B model and reaches performance parity with Qwen2.5-3B, a model known for strong reasoning capabilities and extensive pre-training. The hybrid and short branches show slightly lower performance, especially on challenging benchmarks.

Conclusion and Future Work: Toward RL-Ready Foundation Models

This paper investigates why base models such as Llama and Qwen exhibit divergent behaviors during RL for reasoning, showing that mid-training plays a major role in RL scalability. The two-stage mid-training strategy transforms Llama into a foundation model better suited for RL, resulting in OctoThinker models. Future research directions include:

Curating higher-quality mathematical corpora to improve mid-training.

Creating RL-friendly base models using open recipes without distillation from long CoT reasoning models.

Separating the QA format and content to understand their contributions individually.

Expanding the OctoThinker family with new branches, such as tool-integrated reasoning.

Check out the Paper, Hugging Face Page and GitHub Page. All credit for this research goes to the researchers of this project.

ReasonFlux-PRM: A Trajectory-Aware Reward Model Enhancing Chain-of-Tho …

Understanding the Role of Chain-of-Thought in LLMs

Large language models are increasingly being used to solve complex tasks such as mathematics and scientific reasoning through structured chain-of-thought approaches. These models do not just jump to answers—they reason through intermediate steps that simulate logical thought processes. This technique allows for improved reasoning accuracy and clearer error tracing. As models become more sophisticated, it has become essential to evaluate not just final responses but also the reasoning steps that lead to them.

Limitations of Traditional PRMs in Reasoning Evaluation

One pressing issue is that most current reward models only assess final answers, ignoring how those conclusions were reached. However, frontier models like Deepseek-R1 now output extensive reasoning paths before delivering final responses. These trajectory-response pairs are being reused to train smaller models. The problem is that current Process Reward Models (PRMs) are not built to evaluate these full trajectories. This mismatch leads to unreliable supervision, which can degrade the performance of smaller models trained on trajectory-response data.

Challenges in Handling Disorganized Reasoning Chains

Traditional PRMs are primarily calibrated for structured, clean outputs rather than the lengthy and sometimes disorganized reasoning chains generated by advanced LLMs. Even advanced PRMs, such as Qwen2.5-Math-PRM-72B, show a limited ability to distinguish between high- and low-quality intermediate reasoning. When applied to trajectory-response outputs from Gemini or Deepseek-R1, these models often produce overlapping reward scores, indicating weak discrimination. Their limited sensitivity leads to poor data selection for downstream fine-tuning, and experiments confirm that models trained on PRM-selected data perform worse than those trained on human-curated datasets.

Introducing ReasonFlux-PRM for Trajectory-Level Supervision

Researchers from the University of Illinois Urbana-Champaign (UIUC), Princeton University, Cornell University, and ByteDance Seed introduced ReasonFlux-PRM, a trajectory-aware reward model that evaluates both intermediate reasoning steps and final answers. It integrates step-level and trajectory-level scoring, enabling a more nuanced understanding of reasoning quality. ReasonFlux-PRM is trained on a 10,000-sample dataset of carefully curated math and science problems explicitly designed to mirror real-world trajectory-response formats.

Technical Framework of ReasonFlux-PRM

Technically, ReasonFlux-PRM operates by scoring each intermediate step in a trajectory concerning its contribution to the final answer. It uses a reference reward function that considers the prompt, prior reasoning steps, and final output to assign step-level scores. These are then aggregated to produce a total trajectory reward. The model supports multiple applications, including offline filtering of high-quality training data, dense reward provision during reinforcement learning using GRPO-based policy optimization, and Best-of-N test-time response selection to enhance inference quality. These capabilities make ReasonFlux-PRM more flexible and comprehensive than prior PRMs.
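As a purely schematic illustration of the idea of combining step-level scores with a final-answer score (not the paper's actual reference reward function), the aggregation can be pictured as follows:

from typing import List

def trajectory_reward(step_scores: List[float], final_score: float, alpha: float = 0.5) -> float:
    """Schematic aggregation of step-level scores and a final-answer score into a single
    trajectory-level reward. The averaging and weighting here are illustrative only;
    ReasonFlux-PRM's actual formulation is defined in the paper."""
    step_component = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return alpha * step_component + (1 - alpha) * final_score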

Empirical Results on Reasoning Benchmarks

In performance evaluations across tasks like AIME, MATH500, and GPQA-Diamond, ReasonFlux-PRM-7B outperformed Qwen2.5-Math-PRM-72B and human-curated data on several key metrics. Specifically, it achieved a 12.1% accuracy gain in supervised fine-tuning, a 4.5% improvement during reinforcement learning, and a 6.3% increase during test-time scaling. These gains are particularly notable given ReasonFlux-PRM's much smaller model size. Table 1 shows that the Qwen2.5-14B-Instruct model, when trained on data selected by ReasonFlux-PRM, achieved performance levels close to or exceeding human-curated baselines. In contrast, other PRMs resulted in significant drops of up to 26.6% in certain benchmarks.

Impact and Future Direction of ReasonFlux-PRM

This research addresses a crucial limitation in the training and evaluation of modern reasoning models. By enabling supervision over both thinking trajectories and final answers, ReasonFlux-PRM enhances the quality of training data and the reliability of model responses. It sets a new direction for systematically evaluating and improving reasoning processes in large models.

Check out the Paper and GitHub Page.

Optimize RAG in production environments using Amazon SageMaker JumpSta …

Generative AI has revolutionized customer interactions across industries by offering personalized, intuitive experiences powered by unprecedented access to information. This transformation is further enhanced by Retrieval Augmented Generation (RAG), a technique that allows large language models (LLMs) to reference external knowledge sources beyond their training data. RAG has gained popularity for its ability to improve generative AI applications by incorporating additional information, often preferred by customers over techniques like fine-tuning due to its cost-effectiveness and faster iteration cycles.
The RAG approach excels in grounding language generation with external knowledge, producing more factual, coherent, and relevant responses. This capability proves invaluable in applications such as question answering, dialogue systems, and content generation, where accuracy and informative outputs are crucial. For businesses, RAG offers a powerful way to use internal knowledge by connecting company documentation to a generative AI model. When an employee asks a question, the RAG system retrieves relevant information from the company’s internal documents and uses this context to generate an accurate, company-specific response. This approach enhances the understanding and usage of internal company documents and reports. By extracting relevant context from corporate knowledge bases, RAG models facilitate tasks like summarization, information extraction, and complex question answering on domain-specific materials, enabling employees to quickly access vital insights from vast internal resources. This integration of AI with proprietary information can significantly improve efficiency, decision-making, and knowledge sharing across the organization.
A typical RAG workflow consists of four key components: input prompt, document retrieval, contextual generation, and output. The process begins with a user query, which is used to search a comprehensive knowledge corpus. Relevant documents are then retrieved and combined with the original query to provide additional context for the LLM. This enriched input allows the model to generate more accurate and contextually appropriate responses. RAG’s popularity stems from its ability to use frequently updated external data, providing dynamic outputs without the need for costly and compute-intensive model retraining.
To implement RAG effectively, many organizations turn to platforms like Amazon SageMaker JumpStart. This service offers numerous advantages for building and deploying generative AI applications, including access to a wide range of pre-trained models with ready-to-use artifacts, a user-friendly interface, and seamless scalability within the AWS ecosystem. By using pre-trained models and optimized hardware, SageMaker JumpStart enables rapid deployment of both LLMs and embedding models, minimizing the time spent on complex scalability configurations.
In the previous post, we showed how to build a RAG application on SageMaker JumpStart using Facebook AI Similarity Search (Faiss). In this post, we show how to use Amazon OpenSearch Service as a vector store to build an efficient RAG application.
Solution overview
To implement our RAG workflow on SageMaker, we use a popular open source Python library known as LangChain. With LangChain, the RAG components are simplified into independent blocks that you can bring together using a chain object that will encapsulate the entire workflow. The solution consists of the following key components:

LLM (inference) – We need an LLM to perform the actual inference and answer the end user's prompt. For our use case, we use Meta Llama 3 for this component. LangChain provides a default wrapper class for SageMaker endpoints, so we can define an LLM object in the library simply by passing in the endpoint name.
Embeddings model – We need an embeddings model to convert our document corpus into textual embeddings. This is necessary when we perform a similarity search on the input text to find documents that are similar to, or contain information relevant to, the query so we can augment our response. For this post, we use the BGE Hugging Face Embeddings model available in SageMaker JumpStart.
Vector store and retriever – To house the different embeddings we have generated, we use a vector store. In this case, we use OpenSearch Service, which allows for similarity search using k-nearest neighbors (k-NN) as well as traditional lexical search. Within our chain object, we define the vector store as the retriever. You can tune this depending on how many documents you want to retrieve.
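
For example, with a LangChain vector store object like the one created later in this post, the number of retrieved documents can be set on the retriever (a sketch; the value of k is illustrative):

retriever = vectorstore_opensearch.as_retriever(search_kwargs={"k": 4})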

The following diagram illustrates the solution architecture.

In the following sections, we walk through setting up OpenSearch, followed by exploring the notebook that implements a RAG solution with LangChain, Amazon SageMaker AI, and OpenSearch Service.
Benefits of using OpenSearch Service as a vector store for RAG
In this post, we showcase how you can use a vector store such as OpenSearch Service as a knowledge base and embedding store. OpenSearch Service offers several advantages when used for RAG in conjunction with SageMaker AI:

Performance – Efficiently handles large-scale data and search operations
Advanced search – Offers full-text search, relevance scoring, and semantic capabilities
AWS integration – Seamlessly integrates with SageMaker AI and other AWS services
Real-time updates – Supports continuous knowledge base updates with minimal delay
Customization – Allows fine-tuning of search relevance for optimal context retrieval
Reliability – Provides high availability and fault tolerance through a distributed architecture
Analytics – Provides analytical features for data understanding and performance improvement
Security – Offers robust features such as encryption, access control, and audit logging
Cost-effectiveness – Serves as an economical solution compared to proprietary vector databases
Flexibility – Supports various data types and search algorithms, offering versatile storage and retrieval options for RAG applications

You can use SageMaker AI with OpenSearch Service to create powerful and efficient RAG systems. SageMaker AI provides the machine learning (ML) infrastructure for training and deploying your language models, and OpenSearch Service serves as an efficient and scalable knowledge base for retrieval.
OpenSearch Service optimization strategies for RAG
Based on our learnings from the hundreds of RAG applications deployed using OpenSearch Service as a vector store, we’ve developed several best practices:

If you are starting from a clean slate and want to move quickly with something simple, scalable, and high-performing, we recommend using an Amazon OpenSearch Serverless vector store collection. With OpenSearch Serverless, you benefit from automatic scaling of resources, decoupling of storage, indexing compute, and search compute, with no node or shard management, and you only pay for what you use.
If you have a large-scale production workload and want to take the time to tune for the best price-performance and the most flexibility, you can use an OpenSearch Service managed cluster. In a managed cluster, you pick the node type, node size, number of nodes, and number of shards and replicas, and you have more control over when to scale your resources. For more details on best practices for operating an OpenSearch Service managed cluster, see Operational best practices for Amazon OpenSearch Service.
OpenSearch supports both exact k-NN and approximate k-NN. Use exact k-NN if the number of documents or vectors in your corpus is less than 50,000 for the best recall. For use cases where the number of vectors is greater than 50,000, exact k-NN will still provide the best recall but might not provide sub-100 millisecond query performance. Use approximate k-NN in use cases above 50,000 vectors for the best performance.
OpenSearch uses algorithms from the NMSLIB, Faiss, and Lucene libraries to power approximate k-NN search. There are pros and cons to each k-NN engine, but we find that most customers choose Faiss due to its overall performance in both indexing and search as well as the variety of different quantization and algorithm options that are supported and the broad community support.
Within the Faiss engine, OpenSearch supports both Hierarchical Navigable Small World (HNSW) and Inverted File System (IVF) algorithms. Most customers find HNSW to have better recall than IVF and choose it for their RAG use cases. To learn more about the differences between these engine algorithms, see Vector search.
To reduce the memory footprint and lower the cost of the vector store while keeping recall high, you can start with Faiss HNSW 16-bit scalar quantization. This can also reduce search latencies and improve indexing throughput when used with SIMD optimization (see the index-mapping sketch after this list).
If using an OpenSearch Service managed cluster, refer to Performance tuning for additional recommendations.
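
To make the Faiss HNSW with 16-bit scalar quantization recommendation concrete, the following sketch creates a k-NN index with the opensearch-py client. The index name, vector dimension, and tuning parameters are illustrative and should be adjusted to your embedding model and workload; in this post's walkthrough, LangChain creates the index for you, so you only need this if you manage the index yourself.

from opensearchpy import OpenSearch  # assumes an authenticated client for your domain

index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,  # must match your embedding model (BGE large produces 1024-dim vectors)
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "l2",
                    "parameters": {
                        "m": 16,
                        "ef_construction": 128,
                        # 16-bit scalar quantization to roughly halve vector memory
                        "encoder": {"name": "sq", "parameters": {"type": "fp16"}},
                    },
                },
            }
        }
    },
}
# client = OpenSearch(hosts=[{"host": "<domain-endpoint>", "port": 443}], http_auth=..., use_ssl=True)
# client.indices.create(index="rag-documents", body=index_body)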

Prerequisites
Make sure you have access to one ml.g5.4xlarge and one ml.g5.2xlarge instance in your account. You also need to create a secret in the same AWS Region where you deploy the stack. Complete the following prerequisite steps to create a secret using AWS Secrets Manager:

On the Secrets Manager console, choose Secrets in the navigation pane.
Choose Store a new secret.

For Secret type, select Other type of secret.
For Key/value pairs, on the Plaintext tab, enter a complete password.
Choose Next.

For Secret name, enter a name for your secret.
Choose Next.

Under Configure rotation, keep the settings as default and choose Next.

Choose Store to save your secret.

On the secret details page, note the secret Amazon Resource Name (ARN) to use in the next step.

Create an OpenSearch Service cluster and SageMaker notebook
We use AWS CloudFormation to deploy our OpenSearch Service cluster, SageMaker notebook, and other resources. Complete the following steps:

Launch the following CloudFormation template.
Provide the ARN of the secret you created as a prerequisite and keep the other parameters as default.

Choose Create to create your stack, and wait about 20 minutes for stack creation to complete.
When the status of the stack is CREATE_COMPLETE, note the value of OpenSearchDomainEndpoint on the stack Outputs tab.
Locate SageMakerNotebookURL in the outputs and choose the link to open the SageMaker notebook.

Run the SageMaker notebook
After you have launched the notebook in JupyterLab, complete the following steps:

Go to genai-recipes/RAG-recipes/llama3-RAG-Opensearch-langchain-SMJS.ipynb.

You can also clone the notebook from the GitHub repo.

Update the value of OPENSEARCH_URL in the notebook with the value copied from OpenSearchDomainEndpoint in the previous step (look for os.environ['OPENSEARCH_URL'] = ""). The port needs to be 443.
Run the cells in the notebook.

The notebook provides a detailed explanation of all the steps. We explain some of the key cells in the notebook in this section.
For the RAG workflow, we deploy the huggingface-sentencesimilarity-bge-large-en-v1-5 embedding model and meta-textgeneration-llama-3-8b-instruct LLM from Hugging Face. SageMaker JumpStart simplifies this process because the model artifacts, data, and container specifications are all prepackaged for optimal inference. These are then exposed using the SageMaker Python SDK high-level API calls, which let you specify the model ID for deployment to a SageMaker real-time endpoint:

from sagemaker.jumpstart.model import JumpStartModel

model_id = "meta-textgeneration-llama-3-8b-instruct"
accept_eula = True
model = JumpStartModel(model_id=model_id)
llm_predictor = model.deploy(accept_eula=accept_eula)

model_id = "huggingface-sentencesimilarity-bge-large-en-v1-5"
text_embedding_model = JumpStartModel(model_id=model_id)
embedding_predictor = text_embedding_model.deploy()
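
Once the endpoints are deployed, you can send a quick test request directly with the SageMaker predictor. The following is a sketch; the prompt and parameter values are illustrative, and the payload shape mirrors the content handler shown later in this post:

# Quick smoke test of the LLM endpoint
response = llm_predictor.predict({
    "inputs": "What is Retrieval Augmented Generation?",
    "parameters": {"max_new_tokens": 256, "temperature": 0.6, "top_p": 0.9},
})
print(response)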

Content handlers are crucial for formatting data for SageMaker endpoints. They transform inputs into the format expected by the model and handle model-specific parameters like temperature and token limits. These parameters can be tuned to control the creativity and consistency of the model’s responses.

import json

# In recent LangChain releases this import lives under langchain_community;
# older versions expose it as langchain.llms.sagemaker_endpoint.
from langchain_community.llms.sagemaker_endpoint import LLMContentHandler

class Llama38BContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        payload = {
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": 1000,
                "top_p": 0.9,
                "temperature": 0.6,
                "stop": ["<|eot_id|>"],
            },
        }
        input_str = json.dumps(payload)
        return input_str.encode("utf-8")

    def transform_output(self, output) -> str:
        # Assumption: the endpoint returns JSON with a "generated_text" field;
        # adjust the key if your container's response format differs.
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["generated_text"]
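
The handler is then wired into LangChain's SageMaker LLM wrapper to create the llm object used later in the notebook. The following is a sketch under the assumption that the endpoint deployed earlier is reused and that the AWS Region is taken from the default session:

import boto3
from langchain_community.llms import SagemakerEndpoint

llm = SagemakerEndpoint(
    endpoint_name=llm_predictor.endpoint_name,
    region_name=boto3.Session().region_name,
    content_handler=Llama38BContentHandler(),
)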

We use PyPDFLoader from LangChain to load PDF files, attach metadata to each document fragment, and then use RecursiveCharacterTextSplitter to break the documents into smaller, manageable chunks. The text splitter is configured with a chunk size of 1,000 characters and an overlap of 100 characters, which helps maintain context between chunks. This preprocessing step is crucial for effective document retrieval and embedding generation, because it makes sure the text segments are appropriately sized for the embedding model and the language model used in the RAG system.

import numpy as np
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
documents = []
for idx, file in enumerate(filenames):
    loader = PyPDFLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata = metadata[idx]
    documents += document
# In our testing, character-based splitting works better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # 1,000-character chunks with 100-character overlap to preserve context between chunks
    chunk_size=1000,
    chunk_overlap=100,
)
docs = text_splitter.split_documents(documents)
print(docs[100])

The following block initializes a vector store using OpenSearch Service for the RAG system. It converts preprocessed document chunks into vector embeddings using a SageMaker model and stores them in OpenSearch Service. The process is configured with security measures like SSL and authentication to provide secure data handling. The bulk insertion is optimized for performance with a sizeable batch size. Finally, the vector store is wrapped with VectorStoreIndexWrapper, providing a simplified interface for operations like querying and retrieval. This setup creates a searchable database of document embeddings, enabling quick and relevant context retrieval for user queries in the RAG pipeline.

import os

from langchain.indexes.vectorstore import VectorStoreIndexWrapper
# Import paths may differ in older LangChain versions
from langchain_community.vectorstores import OpenSearchVectorSearch
from opensearchpy import RequestsHttpConnection

# Initialize OpenSearchVectorSearch
vectorstore_opensearch = OpenSearchVectorSearch.from_documents(
    docs,
    sagemaker_embeddings,
    opensearch_url=os.environ["OPENSEARCH_URL"],  # domain endpoint configured earlier (port 443)
    http_auth=awsauth,  # Auth will use the IAM role
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    bulk_size=2000,  # Increase this to accommodate the number of documents you have
)
# Wrap the OpenSearch vector store with the VectorStoreIndexWrapper
wrapper_store_opensearch = VectorStoreIndexWrapper(vectorstore=vectorstore_opensearch)

Next, we use the wrapper from the previous step along with the prompt template. We define the prompt template for interacting with the Meta Llama 3 8B Instruct model in the RAG system. The template uses specific tokens to structure the input in a way that the model expects. It sets up a conversation format with system instructions, user query, and a placeholder for the assistant’s response. The PromptTemplate class from LangChain is used to create a reusable prompt with a variable for the user’s query. This structured approach to prompt engineering helps maintain consistency in the model’s responses and guides it to act as a helpful assistant.

prompt_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.
<|eot_id|><|start_header_id|>user<|end_header_id|>
{query}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["query"]
)
query = "How did AWS perform in 2021?"

answer = wrapper_store_opensearch.query(question=PROMPT.format(query=query), llm=llm)
print(answer)

Similarly, the notebook also shows how to use RetrievalQA, where you can customize how the retrieved documents are added to the prompt using the chain_type parameter.
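
As a rough sketch of that pattern, assuming the llm and vectorstore_opensearch objects defined earlier (the chain type, k value, and question are illustrative):

from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # insert the retrieved documents directly into the prompt
    retriever=vectorstore_opensearch.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)
result = qa_chain.invoke({"query": "How did AWS perform in 2021?"})
print(result["result"])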
Clean up
Delete your SageMaker endpoints from the notebook to avoid incurring costs:

# Delete resources
llm_predictor.delete_model()
llm_predictor.delete_endpoint()
embedding_predictor.delete_model()
embedding_predictor.delete_endpoint()

Next, delete your OpenSearch Service cluster to stop incurring additional charges:

aws cloudformation delete-stack --stack-name rag-opensearch
Conclusion
RAG has revolutionized how businesses use AI by enabling general-purpose language models to work seamlessly with company-specific data. The key benefit is the ability to create AI systems that combine broad knowledge with up-to-date, proprietary information without expensive model retraining. This approach transforms customer engagement and internal operations by delivering personalized, accurate, and timely responses based on the latest company data.
The RAG workflow—comprising input prompt, document retrieval, contextual generation, and output—allows businesses to tap into their vast repositories of internal documents, policies, and data, making this information readily accessible and actionable. For businesses, this means enhanced decision-making, improved customer service, and increased operational efficiency. Employees can quickly access relevant information, while customers receive more accurate and personalized responses.
Moreover, RAG’s cost-efficiency and ability to rapidly iterate make it an attractive solution for businesses looking to stay competitive in the AI era without constant, expensive updates to their AI systems. By making general-purpose LLMs work effectively on proprietary data, RAG empowers businesses to create dynamic, knowledge-rich AI applications that evolve with their data, potentially transforming how companies operate, innovate, and engage with both employees and customers.
SageMaker JumpStart has streamlined the process of developing and deploying generative AI applications. It offers pre-trained models, user-friendly interfaces, and seamless scalability within the AWS ecosystem, making it straightforward for businesses to harness the power of RAG.
Furthermore, using OpenSearch Service as a vector store facilitates swift retrieval from vast information repositories. This approach not only enhances the speed and relevance of responses, but also helps manage costs and operational complexity effectively.
By combining these technologies, you can create robust, scalable, and efficient RAG systems that provide up-to-date, context-aware responses to customer queries, ultimately enhancing user experience and satisfaction.
To get started with implementing this Retrieval Augmented Generation (RAG) solution using Amazon SageMaker JumpStart and Amazon OpenSearch Service, check out the example notebook on GitHub. You can also learn more about Amazon OpenSearch Service in the developer guide.

About the authors
Vivek Gangasani is a Lead Specialist Solutions Architect for Inference at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Harish Rao is a Senior Solutions Architect at AWS, specializing in large-scale distributed AI training and inference. He empowers customers to harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.
Raghu Ramesha is an ML Solutions Architect. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.
Sohaib Katariwala is a Sr. Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service. His interests are in all things data and analytics. More specifically he loves to help customers use AI in their data strategy to solve modern day challenges.
Karan Jain is a Senior Machine Learning Specialist at AWS, where he leads the worldwide Go-To-Market strategy for Amazon SageMaker Inference. He helps customers accelerate their generative AI and ML journey on AWS by providing guidance on deployment, cost-optimization, and GTM strategy. He has led product, marketing, and business development efforts across industries for over 10 years, and is passionate about mapping complex service features to customer solutions.

Advancing AI agent governance with Boomi and AWS: A unified approach t …

Just as APIs became the standard for integration, AI agents are transforming workflow automation through intelligent task coordination. AI agents are already enhancing decision-making and streamlining operations across enterprises. But as adoption accelerates, organizations face growing complexity in managing them at scale. Organizations struggle with observability and lifecycle management, finding it difficult to monitor performance and manage versions effectively. Governance and security concerns arise as these agents process sensitive data, which requires strict compliance and access controls. Perhaps most concerning, without proper management, organizations face the risk of agent sprawl—the unchecked proliferation of AI agents leading to inefficiency and security vulnerabilities.
Boomi and AWS have collaborated to address the complexity surrounding AI agents with Agent Control Tower, an AI agent management solution developed by Boomi and tightly integrated with Amazon Bedrock. Agent Control Tower, part of the Boomi Agentstudio solution, provides the governance framework to manage this transformation, with capabilities that address both current and emerging compliance needs.
As a leader in enterprise iPaaS per Gartner’s Magic Quadrant, based on Completeness of Vision and Ability to Execute, Boomi serves over 20,000 enterprise customers, with three-quarters of these customers operating on AWS. This includes a significant presence among Fortune 500 and Global 2000 organizations across critical sectors such as healthcare, finance, technology, and manufacturing. Boomi is innovating with generative AI, with more than 2,000 customers using its AI agents. The convergence of capabilities that Boomi provides—spanning AI, integration, automation, API management, and data management—with AWS and its proven track record in reliability, security, and AI innovation creates a compelling foundation for standardized AI agent governance at scale. In this post, we share how Boomi partnered with AWS to help enterprises accelerate and scale AI adoption with confidence using Agent Control Tower.
A unified AI management solution
Built on AWS, Agent Control Tower uniquely delivers a single control plane for managing AI agents across multiple systems, including other cloud providers and on-premises environments. At its core, it offers comprehensive observability and monitoring, providing real-time performance tracking and deep visibility into agent decision-making and behavior.
The following screenshot showcases how users can view summary data across agent providers and add or manage providers.

The following screenshot shows an example of the Monitoring and Compliance dashboard.

Agent Control Tower also provides a single pane of glass for visibility into the tools used by each agent, as illustrated in the following screenshot.

Agent Control Tower provides key governance and security controls such as centralized policy enforcement and role-based access control, and enables meeting regulatory compliance with frameworks like GDPR and HIPAA. Furthermore, its lifecycle management capabilities enable automated agent discovery, version tracking, and operational control through features such as pause and resume functionality. Agent Control Tower is positioned as one of the first, if not the first, unified solutions that provides full lifecycle AI agent management with integrated governance and orchestration features. Although many vendors focus on releasing AI agents, there are few that focus on solutions for managing, deploying, and governing AI agents at scale.
The following screenshot shows an example of how users can review agent details and disable or enable an agent.

As shown in the following screenshot, users can drill down into details for each part of the agent.

Amazon Bedrock: Enabling and enhancing AI governance
Using Amazon Bedrock, organizations can implement security guardrails and content moderation while maintaining the flexibility to select and switch between AI models for optimized performance and accuracy. Organizations can create and enable access to curated knowledge bases and predefined action groups, enabling sophisticated multi-agent collaboration. Amazon Bedrock also provides comprehensive metrics and trace logs for agents to help facilitate complete transparency and accountability in agent operations. Through deep integration with Amazon Bedrock, Boomi’s Agent Control Tower enhances agent transparency and governance, offering a unified, actionable view of agent configurations and activities across environments.
The following diagram illustrates the Agent Control Tower architecture on AWS.

Business impact: Transforming enterprise AI operations
Consider a global manufacturer using AI agents for supply chain optimization. With Agent Control Tower, they can monitor agent performance across regions in real time, enforce consistent security policies, and enable regulatory compliance. When issues arise, they can quickly identify and resolve them while maintaining the ability to scale AI operations confidently. With this level of control and visibility, organizations can deploy AI agents more effectively while maintaining robust security and compliance standards.
Conclusion
Boomi customers have already deployed more than 33,000 agents and are seeing up to 80% less time spent on documentation and 50% faster issue resolution. With Boomi and AWS, enterprises can accelerate and scale AI adoption with confidence, backed by a product that puts visibility, governance, and security first. Discover how Agent Control Tower can help your organization manage AI agent sprawl and take advantage of scalable, compliance-aligned innovation. Take a guided tour and learn more about Boomi Agent Control Tower and Amazon Bedrock integration. Or, you can get started today with AI FastTrack.

About the authors
Deepak Chandrasekar is the VP of Software Engineering & User Experience and leads multidisciplinary teams at Boomi. He oversees flagship initiatives like Boomi’s Agent Control Tower, Task Automation, and Market Reach, while driving a cohesive and intelligent experience layer across products. Previously, Deepak held a key leadership role at Unifi Software, which was acquired by Boomi. With a passion for building scalable, and intuitive AI-powered solutions, he brings a commitment to engineering excellence and responsible innovation.
Sandeep Singh is Director of Engineering at Boomi, where he leads global teams building solutions that enable enterprise integration and automation at scale. He drives initiatives like Boomi Agent Control Tower, Marketplace, and Labs, empowering partners and customers with intelligent, trusted solutions. With leadership experience at GE and Fujitsu, Sandeep brings expertise in API strategy, product engineering, and AI/ML solutions. A former solution architect, he is passionate about designing mission-critical systems and driving innovation through scalable, intelligent solutions.
Santosh Ameti is a seasoned Engineering leader in the Amazon Bedrock team and has built Agents, Evaluation, Guardrails, and Prompt Management solutions. His team continuously innovates in the agentic space, delivering one of the most secure and managed agentic solutions for enterprises.
Greg Sligh is a Senior Solutions Architect at AWS with more than 25 years of experience in software engineering, software architecture, consulting, and IT and Engineering leadership roles across multiple industries. For the majority of his career, he has focused on creating and delivering distributed, data-driven applications with particular focus on scale, performance, and resiliency. Now he helps ISVs meet their objectives across technologies, with particular focus on AI/ML.
Padma Iyer is a Senior Customer Solutions Manager at Amazon Web Services, where she specializes in supporting ISVs. With a passion for cloud transformation and financial technology, Padma works closely with ISVs to guide them through successful cloud transformations, using best practices to optimize their operations and drive business growth. Padma has over 20 years of industry experience spanning banking, tech, and consulting.

Baidu Researchers Propose AI Search Paradigm: A Multi-Agent Framework …

The Need for Cognitive and Adaptive Search Engines

Modern search systems are evolving rapidly as the demand for context-aware, adaptive information retrieval grows. With the increasing volume and complexity of user queries, particularly those requiring layered reasoning, systems are no longer limited to simple keyword matching or document ranking. Instead, they aim to mimic the cognitive behaviors humans exhibit when gathering and processing information. This transition towards a more sophisticated, collaborative approach marks a fundamental shift in how intelligent systems are designed to respond to users.

Limitations of Traditional and RAG Systems

Despite these advances, current methods still face critical limitations. Retrieval-augmented generation (RAG) systems, while useful for direct question answering, often operate in rigid pipelines. They struggle with tasks that involve conflicting information sources, contextual ambiguity, or multi-step reasoning. For example, a query that compares the ages of historical figures requires understanding, calculating, and comparing information from separate documents—tasks that demand more than simple retrieval and generation. The absence of adaptive planning and robust reasoning mechanisms often leads to shallow or incomplete answers in such cases.

The Emergence of Multi-Agent Architectures in Search

Several tools have been introduced to enhance search performance, including Learning-to-Rank systems and advanced retrieval mechanisms utilizing Large Language Models (LLMs). These frameworks incorporate features like user behavior data, semantic understanding, and heuristic models. However, even advanced RAG methods, including ReAct and RQ-RAG, primarily follow static logic, which limits their ability to effectively reconfigure plans or recover from execution failures. Their dependence on one-shot document retrieval and single-agent execution further restricts their ability to handle complex, context-dependent tasks.

Introduction of the AI Search Paradigm by Baidu

Researchers from Baidu introduced a new approach called the “AI Search Paradigm,” designed to overcome the limitations of static, single-agent models. It comprises a multi-agent framework with four key agents: Master, Planner, Executor, and Writer. Each agent is assigned a specific role within the search process. The Master coordinates the entire workflow based on the complexity of the query. The Planner structures complex tasks into sub-queries. The Executor manages tool usage and task completion. Finally, the Writer synthesizes the outputs into a coherent response. This modular architecture enables flexibility and precise task execution that traditional systems lack.

Use of Directed Acyclic Graphs for Task Planning

The framework introduces a Directed Acyclic Graph (DAG) to organize complex queries into dependent sub-tasks. The Planner chooses relevant tools from the MCP servers to address each sub-task. The Executor then invokes these tools iteratively, adjusting queries and fallback strategies when tools fail or data is insufficient. This dynamic reassignment ensures continuity and completeness. The Writer evaluates the results, filters inconsistencies, and compiles a structured response. For example, in a query asking who is older than Emperor Wu of Han and Julius Caesar, the system retrieves birthdates from different tools, performs the age calculation, and delivers the result—all in a coordinated, multi-agent process.
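
The paper does not expose a public interface for this, but purely to illustrate the DAG idea, the Planner's decomposition of the age-comparison example might be represented as follows (all task names, tool names, and the structure are invented for illustration):

# Hypothetical Planner output: sub-tasks with tool assignments and dependencies.
sub_tasks = {
    "t1": {"tool": "web_search", "goal": "birth and death dates of Emperor Wu of Han", "deps": []},
    "t2": {"tool": "web_search", "goal": "birth and death dates of Julius Caesar", "deps": []},
    "t3": {"tool": "calculator", "goal": "compute and compare the two lifespans", "deps": ["t1", "t2"]},
}

def execution_order(tasks):
    # Simple topological sort so the Executor runs dependencies before dependents.
    ordered, done = [], set()
    while len(ordered) < len(tasks):
        for name, task in tasks.items():
            if name not in done and all(dep in done for dep in task["deps"]):
                ordered.append(name)
                done.add(name)
    return ordered

print(execution_order(sub_tasks))  # ['t1', 't2', 't3']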

Qualitative Evaluations and Workflow Configurations

The performance of this new system was evaluated using several case studies and comparative workflows. Unlike traditional RAG systems, which operate in a one-shot retrieval mode, the AI Search Paradigm dynamically replans and reflects on each sub-task. The system supports three team configurations based on complexity: Writer-Only, Executor-Inclusive, and Planner-Enhanced. For the Emperor age comparison query, the Planner decomposed the task into three sub-steps and assigned tools accordingly. The final output stated that Emperor Wu of Han lived for 69 years and Julius Caesar for 56 years, indicating a 13-year difference—an output accurately synthesized across multiple sub-tasks. While the paper focused more on qualitative insights than numeric performance metrics, it demonstrated strong improvements in user satisfaction and robustness across tasks.

Conclusion: Toward Scalable, Multi-Agent Search Intelligence

In conclusion, this research presents a modular, agent-based framework that enables search systems to surpass document retrieval and emulate human-style reasoning. The AI Search Paradigm represents a significant advancement by incorporating real-time planning, dynamic execution, and coherent synthesis. It not only solves current limitations but also offers a foundation for scalable, trustworthy search solutions driven by structured collaboration between intelligent agents.

Check out the Paper.

Baidu Open Sources ERNIE 4.5: LLM Series Scaling from 0.3B to 424B Par …

Baidu has officially open-sourced its latest ERNIE 4.5 series, a powerful family of foundation models designed for enhanced language understanding, reasoning, and generation. The release includes ten model variants ranging from compact 0.3B dense models to massive Mixture-of-Experts (MoE) architectures, with the largest variant totaling 424B parameters. These models are now freely available to the global research and developer community through Hugging Face, enabling open experimentation and broader access to cutting-edge Chinese and multilingual language technology.

Technical Overview of ERNIE 4.5 Architecture

The ERNIE 4.5 series builds on Baidu’s previous iterations of ERNIE models by introducing advanced model architectures, including both dense and sparsely activated MoE designs. The MoE variants are particularly notable for scaling parameter counts efficiently: the ERNIE 4.5-MoE-3B and ERNIE 4.5-MoE-47B variants activate only a subset of experts per input token (typically 2 of 64 experts), keeping the number of active parameters manageable while retaining model expressivity and generalization capabilities.
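
To make the sparse-activation idea concrete, the following is a generic top-2 expert-routing sketch, not ERNIE's actual gating code; the hidden size, expert count, and random weights are illustrative:

import numpy as np

def top2_moe_layer(x, gate_w, experts):
    # x: (hidden,) token representation; gate_w: (hidden, num_experts) router weights;
    # experts: one callable feed-forward block per expert.
    logits = x @ gate_w
    top2 = np.argsort(logits)[-2:]  # route the token to only 2 of num_experts experts
    weights = np.exp(logits[top2]) / np.exp(logits[top2]).sum()
    # Only the two selected experts run, so the active parameter count stays small.
    return sum(w * experts[i](x) for w, i in zip(weights, top2))

hidden, num_experts = 8, 64
rng = np.random.default_rng(0)
experts = [lambda v, W=rng.normal(size=(hidden, hidden)): v @ W for _ in range(num_experts)]
out = top2_moe_layer(rng.normal(size=hidden), rng.normal(size=(hidden, num_experts)), experts)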

ERNIE 4.5 models are trained using a mixture of supervised fine-tuning (SFT), reinforcement learning with human feedback (RLHF), and contrastive alignment techniques. The training corpus spans 5.6 trillion tokens across diverse domains in both Chinese and English, using Baidu’s proprietary multi-stage pretraining pipeline. The resulting models demonstrate high fidelity in instruction-following, multi-turn conversation, long-form generation, and reasoning benchmarks.

Model Variants and Open-Source Release

The ERNIE 4.5 release includes the following ten variants:

Dense Models: ERNIE 4.5-0.3B, 0.5B, 1.8B, and 4B

MoE Models: ERNIE 4.5-MoE-3B, 4B, 6B, 15B, 47B, and 424B total parameters (with varying active parameters)

The MoE-47B variant, for instance, activates only 3B parameters during inference while having a total of 47B. Similarly, the 424B model—the largest ever released by Baidu—employs sparse activation strategies to make inference feasible and scalable. These models support both FP16 and INT8 quantization for efficient deployment.

Performance Benchmarks

ERNIE 4.5 models show significant improvements on several key Chinese and multilingual NLP tasks. According to the official technical report:

On CMMLU, ERNIE 4.5 surpasses previous ERNIE versions and achieves state-of-the-art accuracy in Chinese language understanding.

On MMLU, the multilingual benchmark, ERNIE 4.5-47B demonstrates competitive performance with other leading LLMs like GPT-4 and Claude.

For long-form generation, ERNIE 4.5 achieves higher coherence and factuality scores when evaluated using Baidu’s internal metrics.

In instruction-following tasks, the models benefit from contrastive fine-tuning, showing improved alignment with user intent and reduced hallucination rates compared to earlier ERNIE versions.

Applications and Deployment

ERNIE 4.5 models are optimized for a broad range of applications:

Chatbots and Assistants: Multilingual support and instruction-following alignment make it suitable for AI assistants.

Search and Question Answering: High retrieval and generation fidelity allow for integration with RAG pipelines.

Content Generation: Long-form text and knowledge-rich content generation are improved with better factual grounding.

Code and Multimodal Extension: Although the current release focuses on text, Baidu indicates that ERNIE 4.5 is compatible with multimodal extensions.

With support for up to 128K context length in some variants, the ERNIE 4.5 family can be used in tasks requiring memory and reasoning across long documents or sessions.

Conclusion

The ERNIE 4.5 series represents a significant step in open-source AI development, offering a versatile set of models tailored for scalable, multilingual, and instruction-aligned tasks. Baidu’s decision to release models ranging from lightweight 0.3B variants to a 424B-parameter MoE model underscores its commitment to inclusive and transparent AI research. With comprehensive documentation, open availability on Hugging Face, and support for efficient deployment, ERNIE 4.5 is positioned to accelerate global advancements in natural language understanding and generation.

Check out the Paper and Models on Hugging Face.

OMEGA: A Structured Math Benchmark to Probe the Reasoning Limits of LL …

Introduction to Generalization in Mathematical Reasoning

Large-scale language models with long CoT reasoning, such as DeepSeek-R1, have shown good results on Olympiad-level mathematics. However, models trained through Supervised Fine-Tuning or Reinforcement Learning depend on limited techniques, such as repeating known algebra rules or defaulting to coordinate geometry in diagram problems. Since these models follow learned reasoning patterns rather than showing true mathematical creativity, they face challenges with complex tasks that demand original insights. Current math datasets are poorly suited for analyzing math skills that RL models can learn. Large-scale corpora integrate a range of math questions varying in topic and difficulty, making it challenging to isolate specific reasoning skills.

Limitations of Current Mathematical Benchmarks

Current methods, such as out-of-distribution generalization, focus on handling test distributions that differ from training data, which is crucial for mathematical reasoning, physical modeling, and financial forecasting. Compositional generalization techniques aim to help models systematically combine learned skills. Researchers have created datasets through various methods to benchmark mathematical abilities, which include hiring humans to write problems like GSM8K and MinervaMath, collecting exam questions such as AIME and OlympiadBench, and scraping and filtering exam corpora like NuminaMath and BigMath. However, these approaches either lack sufficient challenge for modern LLMs or fail to provide analysis granularity.

Introducing OMEGA: A Controlled Benchmark for Reasoning Skills

Researchers from the University of California, Ai2, the University of Washington, and dmodel.ai have proposed OMEGA, a benchmark designed to evaluate three dimensions of Out-of-Distribution generalization, inspired by Boden’s typology of creativity. It creates matched training and test pairs designed to isolate specific reasoning skills across three dimensions: Exploratory, Compositional, and Transformative. OMEGA’s test and train problems are constructed using carefully engineered templates, allowing precise control over diversity, complexity, and the specific reasoning strategies required for solutions. Moreover, it employs 40 templated problem generators across six mathematical domains: arithmetic, algebra, combinatorics, number theory, geometry, and logic & puzzles.
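
As a toy illustration of what such a templated generator looks like (not OMEGA's actual code; the domain, template, and complexity knob are invented for illustration):

import math
import random

def gcd_problem(max_value, rng):
    # Sample parameters, render the question text, and compute the answer programmatically.
    a, b = rng.randint(10, max_value), rng.randint(10, max_value)
    return {
        "domain": "arithmetic",
        "question": f"What is the greatest common divisor of {a} and {b}?",
        "answer": math.gcd(a, b),
    }

rng = random.Random(0)
train = [gcd_problem(100, rng) for _ in range(3)]     # lower-complexity training problems
test = [gcd_problem(10_000, rng) for _ in range(3)]   # higher-complexity test problems (exploratory axis)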

Evaluation on Frontier LLMs and Reinforcement Learning Setup

Researchers evaluate four frontier models, including DeepSeek-R1, Claude-3.7-Sonnet, OpenAI-o3-mini, and OpenAI-o4-mini, across different complexity levels. For RL generalization experiments, the framework applies the GRPO algorithm on 1,000 training problems using Qwen2.5-7B-Instruct and Qwen2.5-Math-7B models. Exploratory generalization trains on restricted complexity levels and evaluates on higher complexity problems. Compositional generalization involves training models on individual skills in isolation and testing their ability to combine and apply those skills effectively. Transformational generalization trains on conventional solution approaches and evaluates performance on problems that need unconventional strategies.

Performance Observations and Model Behavior Patterns

Reasoning LLMs tend to perform worse as problem complexity increases, often finding correct solutions early but spending too many tokens on unnecessary verification. RL applied only on low-complexity problems enhances generalization to medium-complexity problems, with larger gains on in-domain examples than out-of-distribution ones, indicating RL’s effectiveness at reinforcing familiar patterns. For instance, in the Zebra Logic domain, the base model achieves only 30% accuracy. However, RL training increased performance by 61 points on in-domain examples and 53 points on out-of-distribution examples without SFT.

Conclusion: Toward Advancing Transformational Reasoning

In conclusion, researchers introduced OMEGA, a benchmark that isolates and evaluates three axes of out-of-distribution generalization in mathematical reasoning: exploratory, compositional, and transformative. The empirical study reveals three insights: (a) RL fine-tuning significantly improves performance on in-distribution and exploratory generalization tasks, (b) RL's benefits for compositional tasks are limited, and (c) RL fails to induce genuinely new reasoning patterns. These findings highlight a fundamental limitation: RL can amplify problem-solving breadth and depth, but it falls short in enabling the creative leaps essential for transformational reasoning. Future work should explore curriculum scaffolding and meta-reasoning controllers.

Check out the Paper, Project Page and GitHub Page.

Use Amazon SageMaker Unified Studio to build complex AI workflows usin …

Organizations face the challenge of managing data, multiple artificial intelligence and machine learning (AI/ML) tools, and workflows across different environments, which impacts productivity and governance. A unified development environment consolidates data processing, model development, and AI application deployment into a single system. This integration streamlines workflows, enhances collaboration, and accelerates AI solution development from concept to production.
The next generation of Amazon SageMaker is the center for your data, analytics, and AI. SageMaker brings together AWS AI/ML and analytics capabilities and delivers an integrated experience for analytics and AI with unified access to data. Amazon SageMaker Unified Studio is a single data and AI development environment where you can find and access your data and act on it using AWS analytics and AI/ML services, for SQL analytics, data processing, model development, and generative AI application development.
With SageMaker Unified Studio, you can efficiently build generative AI applications in a trusted and secure environment using Amazon Bedrock. You can choose from a selection of high-performing foundation models (FMs) and advanced customization and tooling such as Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, Amazon Bedrock Agents, and Amazon Bedrock Flows. You can rapidly tailor and deploy generative AI applications, and share with the built-in catalog for discovery.
In this post, we demonstrate how you can use SageMaker Unified Studio to create complex AI workflows using Amazon Bedrock Flows.
Solution overview
Consider FinAssist Corp, a leading financial institution developing a generative AI-powered agent support application. The solution offers the following key features:

Complaint reference system – An AI-powered system providing quick access to historical complaint data, enabling customer service representatives to efficiently handle customer follow-ups, support internal audits, and aid in training new staff.
Intelligent knowledge base – A comprehensive data source of resolved complaints that quickly retrieves relevant complaint details, resolution actions, and outcome summaries.
Streamlined workflow management – Enhanced consistency in customer communications through standardized access to past case information, supporting compliance checks and process improvement initiatives.
Flexible query capability – A straightforward interface supporting various query scenarios, from customer inquiries about past resolutions to internal reviews of complaint handling procedures.

Let’s explore how SageMaker Unified Studio and Amazon Bedrock Flows, integrated with Amazon Bedrock Knowledge Bases and Amazon Bedrock Agents, address these challenges by creating an AI-powered complaint reference system. The following diagram illustrates the solution architecture.

The solution uses the following key components:

SageMaker Unified Studio – Provides the development environment
Flow app – Orchestrates the workflow, including:

Knowledge base queries
Prompt-based classification
Conditional routing
Agent-based response generation

The workflow processes user queries through the following steps:

A user submits a complaint-related question.
The knowledge base provides relevant complaint information.
The prompt classifies if the query is about resolution timing.
Based on the classification, the condition node determines which of the following actions the application takes:

Routes the query to an AI agent for specific resolution responses.
Returns general complaint information.

The application generates an appropriate response for the user.

Prerequisites
For this example, you need the following:

Access to SageMaker Unified Studio (you will need the SageMaker Unified Studio portal URL from your administrator). You can authenticate using either:

AWS Identity and Access Management (IAM) user credentials.
Single sign-on (SSO) credentials with AWS IAM Identity Center.

The IAM user or IAM Identity Center user must have appropriate permissions for:

SageMaker Unified Studio.

Amazon Bedrock (including Amazon Bedrock Flows, Amazon Bedrock Agents, Amazon Bedrock Prompt Management, and Amazon Bedrock Knowledge Bases).
For more information, refer to Identity-based policy examples.

Access to Amazon Bedrock FMs (make sure these are enabled for your account), for example:

Anthropic's Claude 3 Haiku (for the agent).
Amazon Titan Embedding (for the knowledge base).

Configure access to your Amazon Bedrock serverless models for Amazon Bedrock in SageMaker Unified Studio projects.
Sample complaint data prepared in CSV format for creating the knowledge base.

Prepare your data
We have created a sample dataset to use for Amazon Bedrock Knowledge Bases. This dataset contains information about complaints received by customer service representatives and how they were resolved. The following is an example from the sample dataset:

complaint_id,date_received,product,sub_product,issue,sub_issue,complaint_summary,action_taken,next_steps,financial_institution,state,submitted_via,resolution_type,timely_response
FIN-2024-001,04/26/24,"Mortgage","Conventional mortgage","Payment issue","Escrow dispute","Customer disputes mortgage payment increase after recent escrow analysis","Reviewed escrow analysis, explained property tax increase impact, provided detailed payment breakdown","1. Send written explanation of escrow analysis 2. Schedule annual escrow review 3. Provide payment assistance options","Financial Institution-1","TX","Web","Closed with explanation","Yes"
FIN-2024-002,04/26/24,"Money transfer","Wire transfer","Processing delay","International transfer","Wire transfer of $10,000 delayed, customer concerned about international payment deadline","Located wire transfer in system, expedited processing, waived wire fee","1. Confirm receipt with receiving bank 2. Update customer on delivery 3. Document process improvement needs","Financial Institution-2","FL","Phone","Closed with monetary relief","No"

Create a project
In SageMaker Unified Studio, users can use projects to collaborate on various business use cases. Within projects, you can manage data assets in the SageMaker Unified Studio catalog, perform data analysis, organize workflows, develop ML models, build generative AI applications, and more.
To create a project, complete the following steps:

Open the SageMaker Unified Studio landing page using the URL from your admin.
Choose Create project.
Enter a project name and optional description.
For Project profile, choose Generative AI application development.
Choose Continue.

Complete your project configuration, then choose Create project.

Create a prompt
Let’s create a reusable prompt to capture the instructions for FMs, which we will use later while creating the flow application. For more information, see Reuse and share Amazon Bedrock prompts.

In SageMaker Unified Studio, on the Build menu, choose Prompt under Machine Learning & Generative AI.

Provide a name for the prompt.
Choose the appropriate FM (for this example, we choose Claude 3 Haiku).
For Prompt message, we enter the following:

You are a complaint analysis classifier. You will receive complaint data from a knowledge base. Analyze the {{input}} and respond with a single letter:
T: If the input contains information about complaint resolution timing, response time, or processing timeline (whether timely or delayed)
F: For all other types of complaint information
Return only ‘T’ or ‘F’ based on whether the knowledge base response is about resolution timing. Do not add any additional text or explanation – respond with just the single letter ‘T’ or ‘F’.

Choose Save.

Choose Create version.

Create a chat agent
Let’s create a chat agent to handle specific resolution responses. Complete the following steps:

In SageMaker Unified Studio, on the Build menu, choose Chat agent under Machine Learning & Generative AI.
Provide a name for the agent.
Choose the appropriate FM (for this example, we choose Claude 3 Haiku).
For Enter a system prompt, we enter the following:

You are a Financial Complaints Assistant AI. You will receive complaint information from a knowledge base and questions about resolution timing.
When responding to resolution timing queries:
1. Use the provided complaint information to confirm if it was resolved within timeline
2. For timely resolutions, provide:
– Confirmation of timely completion
– Specific actions taken (from the provided complaint data)
– Next steps that were completed
3. For delayed resolutions, provide:
– Acknowledgment of delay
– Standard compensation package:
• $75 service credit
• Priority Status upgrade for 6 months
• Service fees waived for current billing cycle
– Actions taken (from the provided complaint data)
– Contact information for follow-up: Priority Line: **************
Always reference the specific complaint details provided in your input when discussing actions taken and resolution process.

Choose Save.

After the agent is saved, choose Deploy.
For Alias name, enter demoAlias.
Choose Deploy.

Create a flow
Now that we have our prompt and agent ready, let’s create a flow that will orchestrate the complaint handling process:

In SageMaker Unified Studio, on the Build menu, choose Flow under Machine Learning & Generative AI.

Create a new flow called demo-flow.

Add a knowledge base to your flow application
Complete the following steps to add a knowledge base node to the flow:

In the navigation pane, on the Nodes tab, choose Knowledge Base.
On the Configure tab, provide the following information:

For Node name, enter a name (for example, complaints_kb).
Choose Create new Knowledge Base.

In the Create Knowledge Base pane, enter the following information:

For Name, enter a name (for example, complaints).
For Description, enter a description (for example, user complaints information).
For Add data sources, select Local file and upload the complaints.txt file.
For Embeddings model, choose Titan Text Embeddings V2.
For Vector store, choose OpenSearch Serverless.
Choose Create.

After you create the knowledge base, choose it in the flow.
In the details pane, provide the following information:
For Response generation model, choose Claude 3 Haiku.
Connect the output of the flow input node with the input of the knowledge base node.
Connect the output of the knowledge base node with the input of the flow output node.

Choose Save.

Add a prompt to your flow application
Now let’s add the prompt you created earlier to the flow:

On the Nodes tab in the Flow app builder pane, add a prompt node.
On the Configure tab for the prompt node, provide the following information:
For Node name, enter a name (for example, demo_prompt).
For Prompt, choose financeAssistantPrompt.
For Version, choose 1.
Connect the output of the knowledge base node with the input of the prompt node.
Choose Save.

Add a condition to your flow application
The condition node determines how the flow handles different types of queries. It evaluates whether a query is about resolution timing or general complaint information, enabling the flow to route the query appropriately. When a query is about resolution timing, it will be directed to the chat agent for specialized handling; otherwise, it will receive a direct response from the knowledge base. Complete the following steps to add a condition:

On the Nodes tab in the Flow app builder pane, add a condition node.
On the Configure tab for the condition node, provide the following information:

For Node name, enter a name (for example, demo_condition).
Under Conditions, for Condition, enter conditionInput == "T".
Connect the output of the prompt node with the input of the condition node.

Choose Save.

Add a chat agent to your flow application
Now let’s add the chat agent you created earlier to the flow:

On the Nodes tab in the Flow app builder pane, add the agent node.
On the Configure tab for the agent node, provide the following information:

For Node name, enter a name (for example, demo_agent).
For Chat agent, choose DemoAgent.
For Alias, choose demoAlias.

Create the following node connections:

Connect the input of the condition node (demo_condition) to the output of the prompt node (demo_prompt).
Connect the output of the condition node:

Set If condition is true to the agent node (demo_agent).
Set If condition is false to the existing flow output node (FlowOutputNode).

Connect the output of the knowledge base node (complaints_kb) to the input of the following:

The agent node (demo_agent).
The flow output node (FlowOutputNode).

Connect the output of the agent node (demo_agent) to a new flow output node named FlowOutputNode_2.

Choose Save.

Test the flow application
Now that the flow application is ready, let’s test it. On the right side of the page, choose the expand icon to open the Test pane.

In the Enter prompt text box, we can ask a few questions related to the dataset created earlier. The following screenshots show some examples.
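Flow apps in SageMaker Unified Studio build on Amazon Bedrock Flows, so in addition to the Test pane you can try the flow programmatically if you have access to the underlying flow and a deployed alias. The following is a minimal sketch using the Amazon Bedrock Agents runtime invoke_flow API; the flow ID, alias ID, and input node name are placeholders you would replace with your own values:

import boto3

client = boto3.client("bedrock-agent-runtime")

FLOW_ID = "YOUR_FLOW_ID"            # placeholder
FLOW_ALIAS_ID = "YOUR_FLOW_ALIAS_ID"  # placeholder

response = client.invoke_flow(
    flowIdentifier=FLOW_ID,
    flowAliasIdentifier=FLOW_ALIAS_ID,
    inputs=[
        {
            "nodeName": "FlowInputNode",       # assumed input node name
            "nodeOutputName": "document",
            "content": {"document": "Was my complaint about billing resolved on time?"},
        }
    ],
)

# Results arrive as an event stream; collect the flow output events
for event in response["responseStream"]:
    if "flowOutputEvent" in event:
        print(event["flowOutputEvent"]["content"]["document"])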

Clean up
To clean up your resources, delete the flow, agent, prompt, knowledge base, and associated OpenSearch Serverless resources.
Conclusion
In this post, we demonstrated how to build an AI-powered complaint reference system using a flow application in SageMaker Unified Studio. By using the integrated capabilities of SageMaker Unified Studio with Amazon Bedrock features like Amazon Bedrock Knowledge Bases, Amazon Bedrock Agents, and Amazon Bedrock Flows, you can rapidly develop and deploy sophisticated AI applications without extensive coding.
As you build AI workflows using SageMaker Unified Studio, remember to adhere to the AWS Shared Responsibility Model for security. Implement SageMaker Unified Studio security best practices, including proper IAM configurations and data encryption. You can also refer to Secure a generative AI assistant with OWASP Top 10 mitigation for details on how to assess the security posture of a generative AI assistant using OWASP TOP 10 mitigations for common threats. Following these guidelines helps establish robust AI applications that maintain data integrity and system protection.
To learn more, refer to Amazon Bedrock in SageMaker Unified Studio, and join discussions and share your experiences in the AWS Generative AI Community.
We look forward to seeing the innovative solutions you will create with these powerful new features.

About the authors
Sumeet Tripathi is an Enterprise Support Lead (TAM) at AWS in North Carolina. He has over 17 years of experience in technology across various roles. He is passionate about helping customers reduce operational challenges and friction. His focus areas are AI/ML and the Energy & Utilities segment. Outside work, he enjoys traveling with family, watching cricket, and movies.
Vishal Naik is a Sr. Solutions Architect at Amazon Web Services (AWS). He is a builder who enjoys helping customers accomplish their business needs and solve complex challenges with AWS solutions and best practices. His core area of focus includes Generative AI and Machine Learning. In his spare time, Vishal loves making short films on time travel and alternate universe themes.

Accelerating AI innovation: Scale MCP servers for enterprise workloads …

Generative AI has been moving at a rapid pace, with new tools, offerings, and models released frequently. According to Gartner, agentic AI is one of the top technology trends of 2025, and organizations are performing prototypes on how to use agents in their enterprise environment. Agents depend on tools, and each tool might have its own mechanism to send and receive information. Model Context Protocol (MCP) by Anthropic is an open source protocol that attempts to solve this challenge. It provides a protocol and communication standard that is cross-compatible with different tools, and can be used by an agentic application’s large language model (LLM) to connect to enterprise APIs or external tools using a standard mechanism. However, large enterprise organizations like financial services tend to have complex data governance and operating models, which makes it challenging to implement agents working with MCP.
One major challenge is the siloed approach in which individual teams build their own tools, leading to duplication of efforts and wasted resources. This approach slows down innovation and creates inconsistencies in integrations and enterprise design. Furthermore, managing multiple disconnected MCP tools across teams makes it difficult to scale AI initiatives effectively. These inefficiencies hinder enterprises from fully taking advantage of generative AI for tasks like post-trade processing, customer service automation, and regulatory compliance.
In this post, we present a centralized MCP server implementation using Amazon Bedrock that offers an innovative approach by providing shared access to tools and resources. With this approach, teams can focus on building AI capabilities rather than spending time developing or maintaining tools. By standardizing access to resources and tools through MCP, organizations can accelerate the development of AI agents, so teams can reach production faster. Additionally, a centralized approach provides consistency and standardization and reduces operational overhead, because the tools are managed by a dedicated team rather than across individual teams. It also enables centralized governance that enforces controlled access to MCP servers, which reduces the risk of data exfiltration and prevents unauthorized or insecure tool use across the organization.
Solution overview
The following figure illustrates a proposed solution based on a financial services use case that uses MCP servers across multiple lines of business (LoBs), such as compliance, trading, operations, and risk management. Each LoB performs distinct functions tailored to their specific business. For instance, the trading LoB focuses on trade execution, whereas the risk LoB performs risk limit checks. For performing these functions, each division provides a set of MCP servers that facilitate actions and access to relevant data within their LoBs. These servers are accessible to agents developed within the respective LoBs and can also be exposed to agents outside LoBs.

The development of MCP servers is decentralized. Each LoB is responsible for developing the servers that support their specific functions. When the development of a server is complete, it’s hosted centrally and accessible across LoBs. It takes the form of a registry or marketplace that facilitates integration of AI-driven solutions across divisions while maintaining control and governance over shared resources.
In the following sections, we explore what the solution looks like on a conceptual level.
Agentic application interaction with a central MCP server hub
The following flow diagram showcases how an agentic application built using Amazon Bedrock interacts with one of the MCP servers located in the MCP server hub.
The flow consists of the following steps:

The application connects to the central MCP hub through the load balancer and requests a list of available tools from the specific MCP server. This can be fine-grained based on what servers the agentic application has access to.
The trade server responds with a list of available tools, including details such as tool name, description, and required input parameters (see the client sketch after these steps).
The agentic application invokes an Amazon Bedrock agent and provides the list of tools available.
Using this information, the agent determines what to do next based on the given task and the list of tools available to it.
The agent chooses the most suitable tool and responds with the tool name and input parameters. The control comes back to the agentic application.
The agentic application calls for the execution of the tool through the MCP server using the tool name and input parameters.
The trade MCP server executes the tool and returns the results of the execution back to the application.
The application returns the results of the tool execution back to the Amazon Bedrock agent.
The agent observes the tool execution results and determines the next step.
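To make the tool discovery and execution steps concrete, the following is a minimal sketch of how an agentic application could list and invoke tools on the trade MCP server over streamable HTTP using the open source MCP Python SDK. The server URL and tool arguments are placeholders; the actual client code for this post lives in the GitHub repository.

import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

MCP_SERVER_URL = "http://trade-mcp.example.internal/mcp"  # placeholder endpoint behind the load balancer


async def main():
    # Steps 1-2: connect to the trade MCP server and discover its tools
    async with streamablehttp_client(MCP_SERVER_URL) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])  # for example, executeTrade, sendTradeDetails

            # Steps 6-7: after the agent picks a tool, execute it with the chosen parameters
            result = await session.call_tool(
                "executeTrade",
                {"ticker": "AMZN", "quantity": 100, "price": 186},
            )
            print(result.content)


asyncio.run(main())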

Let’s dive into the technical architecture of the solution.
Architecture overview
The following diagram illustrates the architecture to host the centralized cluster of MCP servers for an LoB.

The architecture can be split into four sections:

MCP server discovery API
Agentic applications
Central MCP server hub
Tools and resources

Let’s explore each section in detail:

MCP server discovery API – This API is a dedicated endpoint for discovering various MCP servers. Different teams can call this API to find what MCP servers are available in the registry; read their description, tool, and resource details; and decide which MCP server would be the right one for their agentic application. When a new MCP server is published, it's added to an Amazon DynamoDB table (a registration sketch follows this list). MCP server owners are responsible for keeping the registry information up to date.
Agentic application – The agentic applications are hosted on AWS Fargate for Amazon Elastic Container Service (Amazon ECS) and built using Amazon Bedrock Agents. Teams can also use the newly released open source AWS Strands Agents SDK, or another agentic framework of their choice, to build the agentic application and their own containerized solution to host it. The agentic applications access Amazon Bedrock through a secure private virtual private cloud (VPC) endpoint and use private VPC endpoints to access the MCP servers.
Central MCP server hub – This is where the MCP servers are hosted. Access to the servers is enabled through an AWS Network Load Balancer. Technically, each server is a Docker container that is hosted on Amazon ECS, but you can choose your own container deployment solution. These servers can scale individually without impacting the other servers. The servers in turn connect to one or more tools using private VPC endpoints.
Tools and resources – This component holds the tools, such as databases, other applications, Amazon Simple Storage Service (Amazon S3), and other resources. For enterprises, access to the tools and resources is provided only through private VPC endpoints.
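As an illustration of how the registry behind the discovery API could be populated, the following sketch writes a new MCP server entry to the DynamoDB registry table. The attribute names mirror the discovery code shown later in this post; the table name and endpoint value are placeholders:

import boto3

dynamodb = boto3.resource("dynamodb")
# Same registry table that the discovery API scans (the name is configured per deployment)
registry = dynamodb.Table("mcp-server-registry")

registry.put_item(
    Item={
        "id": "trade-execution",
        "description": "Executes equity trades and sends trade details downstream",
        "server": "http://trade-mcp.example.internal/mcp",  # placeholder endpoint
    }
)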

Benefits of the solution
The solution offers the following key benefits:

Scalability and resilience – Because you’re using Amazon ECS on Fargate, you get scalability out of the box without managing infrastructure and handling scaling concerns. Amazon ECS automatically detects and recovers from failures by restarting failed MCP server tasks locally or reprovisioning containers, minimizing downtime. It can also redirect traffic away from unhealthy Availability Zones and rebalance tasks across healthy Availability Zones to provide uninterrupted access to the server.
Security – Access to MCP servers is secured at the network level through network controls such as PrivateLink. This makes sure the agentic application only connects to trusted MCP servers hosted by the organization, and vice versa. Each Fargate workload runs in an isolated environment. This prevents resource sharing between tasks. For application authentication and authorization, we propose using an MCP Auth Server (refer to the following GitHub repo) to hand off those tasks to a dedicated component that can scale independently.

At the time of writing, the MCP protocol doesn’t provide built-in mechanisms for user-level access control or authorization. Organizations requiring user-specific access restrictions must implement additional security layers on top of the MCP protocol. For a reference implementation, refer to the following GitHub repo.
Let’s dive deeper in the implementation of this solution.
Use case
The implementation is based on a financial services use case featuring post-trade execution. Post-trade execution refers to the processes and steps that take place after an equity buy/sell order has been placed by a customer. It involves many steps, including verifying trade details, the actual transfer of assets, providing a detailed report of the execution, running fraud checks, and more. For simplification of the demo, we focus on the order execution step.
Although this use case is tailored to the financial industry, you can apply the architecture and the approach to other enterprise workloads as well. The entire code of this implementation is available on GitHub. We use the AWS Cloud Development Kit (AWS CDK) for Python to deploy this solution, which creates an agentic application connected to tools through the MCP server. It also creates a Streamlit UI to interact with the agentic application.
The following code snippet provides access to the MCP discovery API:

import json

import boto3

# DDBTBL_MCP_SERVER_REGISTRY (the registry table name) and cors_headers
# are defined elsewhere in the repository
def get_server_registry():
    # Initialize DynamoDB client
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table(DDBTBL_MCP_SERVER_REGISTRY)

    try:
        # Scan the table to get all items
        response = table.scan()
        items = response.get('Items', [])

        # Format the items to include only id, description, server
        formatted_items = []
        for item in items:
            formatted_item = {
                'id': item.get('id', ''),
                'description': item.get('description', ''),
                'server': item.get('server', ''),
            }
            formatted_items.append(formatted_item)

        # Return the formatted items as JSON
        return {
            'statusCode': 200,
            'headers': cors_headers,
            'body': json.dumps(formatted_items)
        }
    except Exception as e:
        # Handle any errors
        return {
            'statusCode': 500,
            'headers': cors_headers,
            'body': json.dumps({'error': str(e)})
        }

The preceding code is invoked through an AWS Lambda function. The complete code is available in the GitHub repository. The following graphic shows the response of the discovery API.

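For orientation, the Lambda entry point that exposes this function could be as small as the following sketch. The event shape (API Gateway proxy style) and the routing shown are assumptions; the actual handler lives in the GitHub repository.

def lambda_handler(event, context):
    # Minimal dispatch: serve the registry listing for GET requests
    # (the full routing logic is in the GitHub repository)
    if event.get('httpMethod') == 'GET':
        return get_server_registry()
    return {
        'statusCode': 405,
        'headers': cors_headers,
        'body': json.dumps({'error': 'Method not allowed'}),
    }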
Let’s explore a scenario where the user submits a question: “Buy 100 shares of AMZN at USD 186, to be distributed equally between accounts A31 and B12.” To execute this task, the agentic application invokes the trade-execution MCP server. The following code is a sample implementation of the MCP server for trade execution:

from fastmcp import FastMCP
from starlette.requests import Request
from starlette.responses import PlainTextResponse

mcp = FastMCP("server")


@mcp.custom_route("/", methods=["GET"])
async def health_check(request: Request) -> PlainTextResponse:
    return PlainTextResponse("OK")


@mcp.tool()
async def executeTrade(ticker, quantity, price):
    """
    Execute a trade for the given ticker, quantity, and price.

    Sample input:
    {
        "ticker": "AMZN",
        "quantity": 1000,
        "price": 150.25
    }
    """
    # Simulate trade execution
    return {
        "tradeId": "T12345",
        "status": "Executed",
        "timestamp": "2025-04-09T22:58:00"
    }


@mcp.tool()
async def sendTradeDetails(tradeId):
    """
    Send trade details for the given tradeId.

    Sample input:
    {
        "tradeId": "T12345"
    }
    """
    return {
        "status": "Details Sent",
        "recipientSystem": "MiddleOffice",
        "timestamp": "2025-04-09T22:59:00"
    }


if __name__ == "__main__":
    mcp.run(host="0.0.0.0", transport="streamable-http")

The complete code is available in the following GitHub repo.
The following graphic shows the MCP server execution in action.

This is a sample implementation of the use case focusing on the deployment step. For a production scenario, we strongly recommend adding a human oversight workflow to monitor the execution and provide input at various steps of the trade execution.
Now you’re ready to deploy this solution.
Prerequisites
Prerequisites for the solution are available in the README.md of the GitHub repository.
Deploy the application
Complete the following steps to run this solution:

Navigate to the README.md file of the GitHub repository to find the instructions to deploy the solution. Follow these steps to complete deployment.

The successful deployment will exit with a message similar to the one shown in the following screenshot.

When the deployment is complete, access the Streamlit application.

You can find the Streamlit URL in the terminal output, similar to the following screenshot.

Enter the URL of the Streamlit application in a browser to open the application console.

On the application console, different sets of MCP servers are listed in the left pane under MCP Server Registry. Each set corresponds to an MCP server and includes the definition of the tools, such as the name, description, and input parameters.

In the right pane, Agentic App, a request is pre-populated: “Buy 100 shares of AMZN at USD 186, to be distributed equally between accounts A31 and B12.” This request is ready to be submitted to the agent for execution.

Choose Submit to invoke an Amazon Bedrock agent to process the request.

The agentic application will evaluate the request together with the list of tools it has access to, and iterate through a series of tool executions and evaluations to fulfill the request. You can view the trace output to see the tools that the agent used. For each tool used, you can see the values of the input parameters, followed by the corresponding results. In this case, the agent operated as follows:

The agent first used the function executeTrade with input parameters of ticker=AMZN, quantity=100, and price=186
After the trade was executed, it used the allocateTrade tool to allocate the trade position between the two portfolio accounts

Clean up
You will incur charges when you consume the services used in this solution. Instructions to clean up the resources are available in the README.md of the GitHub repository.
Summary
This solution offers a straightforward and enterprise-ready approach to implement MCP servers on AWS. With this centralized operating model, teams can focus on building their applications rather than maintaining the MCP servers. As enterprises continue to embrace agentic workflows, centralized MCP servers offer a practical solution for overcoming operational silos and inefficiencies. With the AWS scalable infrastructure and advanced tools like Amazon Bedrock Agents and Amazon ECS, enterprises can accelerate their journey toward smarter workflows and better customer outcomes.
Check out the GitHub repository to replicate the solution in your own AWS environment.
To learn more about how to run MCP servers on AWS, refer to the following resources:

Harness the power of MCP servers with Amazon Bedrock Agents
Unlocking the power of Model Context Protocol (MCP) on AWS
Amazon Bedrock Agents Samples GitHub repository

About the authors
Xan Huang is a Senior Solutions Architect with AWS and is based in Singapore. He works with major financial institutions to design and build secure, scalable, and highly available solutions in the cloud. Outside of work, Xan dedicates most of his free time to his family, where he lovingly takes direction from his two young daughters, aged one and four. You can find Xan on LinkedIn: https://www.linkedin.com/in/xanhuang/
Vikesh Pandey is a Principal GenAI/ML Specialist Solutions Architect at AWS, helping large financial institutions adopt and scale generative AI and ML workloads. He is the author of the book “Generative AI for financial services.” He has more than a decade of experience building enterprise-grade applications on generative AI/ML and related technologies. In his spare time, he plays an unnamed sport with his son that lies somewhere between football and rugby.

Choosing the right approach for generative AI-powered structured data …

Organizations want direct answers to their business questions without the complexity of writing SQL queries or navigating through business intelligence (BI) dashboards to extract data from structured data stores. Examples of structured data include tables, databases, and data warehouses that conform to a predefined schema. Large language model (LLM)-powered natural language query systems transform how we interact with data, so you can ask questions like “Which region has the highest revenue?” and receive immediate, insightful responses. Implementing these capabilities requires careful consideration of your specific needs—whether you need to integrate knowledge from other systems (for example, unstructured sources like documents), serve internal or external users, handle the analytical complexity of questions, or customize responses for business appropriateness, among other factors.
In this post, we discuss LLM-powered structured data query patterns in AWS. We provide a decision framework to help you select the best pattern for your specific use case.
Business challenge: Making structured data accessible
Organizations have vast amounts of structured data but struggle to make it effectively accessible to non-technical users for several reasons:

Business users lack the technical knowledge (like SQL) needed to query data
Employees rely on BI teams or data scientists for analysis, limiting self-service capabilities
Gaining insights often involves time delays that impact decision-making
Predefined dashboards constrain spontaneous exploration of data
Users might not know what questions are possible or where relevant data resides

Solution overview
An effective solution should provide the following:

A conversational interface that allows employees to query structured data sources without technical expertise
The ability to ask questions in everyday language and receive accurate, trustworthy answers
Automatic generation of visualizations and explanations to clearly communicate insights
Integration of information from different data sources (both structured and unstructured) presented in a unified manner
Ease of integration with existing investments and rapid deployment capabilities
Access restriction based on identities, roles, and permissions

In the following sections, we explore five patterns that can address these needs, highlighting the architecture, ideal use cases, benefits, considerations, and implementation resources for each approach.
Pattern 1: Direct conversational interface using an enterprise assistant
This pattern uses Amazon Q Business, a generative AI-powered assistant, to provide a chat interface on data sources with native connectors. When users ask questions in natural language, Amazon Q Business connects to the data source, interprets the question, and retrieves relevant information without requiring intermediate services. The following diagram illustrates this workflow.

This approach is ideal for internal enterprise assistants that need to answer business user-facing questions from both structured and unstructured data sources in a unified experience. For example, HR personnel can ask “What’s our parental leave policy and how many employees used it last quarter?” and receive answers drawn from both leave policy documentation and employee databases together in one interaction. With this pattern, you can benefit from the following:

Simplified connectivity through the extensive Amazon Q Business library of built-in connectors
Streamlined implementation with a single service to configure and manage
Unified search experience for accessing both structured and unstructured information
Built-in understanding of and respect for existing identities, roles, and permissions

You can define the scope of data to be pulled in the form of a SQL query. Amazon Q Business pre-indexes database content based on defined SQL queries and uses this index when responding to user questions. Similarly, you can define the sync mode and schedule to determine how often you want to update your index. Amazon Q Business does the heavy lifting of indexing the data using a Retrieval Augmented Generation (RAG) approach and using an LLM to generate well-written answers. For more details on how to set up Amazon Q Business with an Amazon Aurora PostgreSQL-Compatible Edition connector, see Discover insights from your Amazon Aurora PostgreSQL database using the Amazon Q Business connector. You can also refer to the complete list of supported data source connectors.
Pattern 2: Enhancing BI tool with natural language querying capabilities
This pattern uses Amazon Q in QuickSight to process natural language queries against datasets that have been previously configured in Amazon QuickSight. Users can ask questions in everyday language within the QuickSight interface and get visualized answers without writing SQL. This approach works with QuickSight (Enterprise or Q edition) and supports various data sources, including Amazon Relational Database Service (Amazon RDS), Amazon Redshift, Amazon Athena, and others. The architecture is depicted in the following diagram.

This pattern is well-suited for internal BI and analytics use cases. Business analysts, executives, and other employees can ask ad-hoc questions to get immediate visualized insights in the form of dashboards. For example, executives can ask questions like “What were our top 5 regions by revenue last quarter?” and immediately see responsive charts, reducing dependency on analytics teams. The benefits of this pattern are as follows:

It enables natural language queries that produce rich visualizations and charts
No coding or machine learning (ML) experience is needed—the heavy lifting like natural language interpretation and SQL generation is managed by Amazon Q in QuickSight
It integrates seamlessly within the familiar QuickSight dashboard environment

Existing QuickSight users might find this the most straightforward way to take advantage of generative AI benefits. You can optimize this pattern for higher-quality results by configuring topics like curated fields, synonyms, and expected question phrasing. This pattern will pull data only from a specific configured data source in QuickSight to produce a dashboard as an output. For more details, check out QuickSight DemoCentral to view a demo in QuickSight, see the generative BI learning dashboard, and view guided instructions to create dashboards with Amazon Q. Also refer to the list of supported data sources.
Pattern 3: Combining BI visualization with conversational AI for a seamless experience
This pattern merges BI visualization capabilities with conversational AI to create a seamless knowledge experience. By integrating Amazon Q in QuickSight with Amazon Q Business (with the QuickSight plugin enabled), organizations can provide users with a unified conversational interface that draws on both unstructured and structured data. The following diagram illustrates the architecture.

This is ideal for enterprises that want an internal AI assistant to answer a variety of questions—whether it’s a metric from a database or knowledge from a document. For example, executives can ask “What was our Q4 revenue growth?” and see visualized results from data warehouses through Amazon Redshift through QuickSight, then immediately follow up with “What is our company vacation policy?” to access HR documentation—all within the same conversation flow. This pattern offers the following benefits:

It unifies answers from structured data (databases and warehouses) and unstructured data (documents, wikis, emails) in a single application
It delivers rich visualizations alongside conversational responses in a seamless experience with real-time analysis in chat
There is no duplication of work—if your BI team has already built datasets and topics in QuickSight for analytics, you can reuse them in Amazon Q Business
It maintains conversational context when switching between data and document-based inquiries

For more details, see Query structured data from Amazon Q Business using Amazon QuickSight integration and Amazon Q Business now provides insights from your databases and data warehouses (preview).
Another variation of this pattern is recommended for BI users who want to expose unified data through rich visuals in QuickSight, as illustrated in the following diagram.

For more details, see Integrate unstructured data into Amazon QuickSight using Amazon Q Business.
Pattern 4: Building knowledge bases from structured data using managed text-to-SQL
This pattern uses Amazon Bedrock Knowledge Bases to enable structured data retrieval. The service provides a fully managed text-to-SQL module that alleviates common challenges in developing natural language query applications for structured data. This implementation uses Amazon Bedrock (Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases) along with your choice of data warehouse such as Amazon Redshift or Amazon SageMaker Lakehouse. The following diagram illustrates the workflow.

For example, a seller can use this capability embedded into an ecommerce application to ask a complex query like “Give me top 5 products whose sales increased by 50% last year as compared to previous year? Also group the results by product category.” The system automatically generates the appropriate SQL, executes it against the data sources, and delivers results or a summarized narrative. This pattern features the following benefits:

It provides fully managed text-to-SQL capabilities without requiring model training
It enables direct querying of data from the source without data movement
It supports complex analytical queries on warehouse data
It offers flexibility in foundation model (FM) selection through Amazon Bedrock
API connectivity, personalization options, and context-aware chat features make it better suited for customer-facing applications

Choose this pattern when you need a flexible, developer-oriented solution. This approach works well for applications (internal or external) where you control the UI design. Default outputs are primarily text or structured data. However, executing arbitrary SQL queries can be a security risk for text-to-SQL applications. It is recommended that you take precautions as needed, such as using restricted roles, read-only databases, and sandboxing. For more information on how to build this pattern, see Empower financial analytics by creating structured knowledge bases using Amazon Bedrock and Amazon Redshift. For a list of supported structured data stores, refer to Create a knowledge base by connecting to a structured data store.
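As a rough illustration of this pattern from application code, the following sketch sends a natural language question to a structured knowledge base through the Amazon Bedrock Agents runtime retrieve_and_generate API. The knowledge base ID and model ARN are placeholders, and production use should apply the precautions noted above:

import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.retrieve_and_generate(
    input={"text": "Give me the top 5 products whose sales grew the most last year, grouped by category"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)

# The generated answer, grounded in the structured data source
print(response["output"]["text"])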
Pattern 5: Custom text-to-SQL implementation with flexible model selection
This pattern represents a build-your-own solution using FMs to convert natural language to SQL, execute queries on data warehouses, and return results. Choose Amazon Bedrock when you want to quickly integrate this capability without deep ML expertise—it offers a fully managed service with ready-to-use FMs through a unified API, handling infrastructure needs with pay-as-you-go pricing. Alternatively, select Amazon SageMaker AI when you require extensive model customization to build specialized needs—it provides complete ML lifecycle tools for data scientists and ML engineers to build, train, and deploy custom models with greater control. For more information, refer to our Amazon Bedrock or Amazon SageMaker AI decision guide. The following diagram illustrates the architecture.

Use this pattern if your use case requires specific open-weight models, or you want to fine-tune models on your domain-specific data. For example, if you need highly accurate results for your query, then you can use this pattern to fine-tune models on specific schema structures, while maintaining the flexibility to integrate with existing workflows and multi-cloud environments. This pattern offers the following benefits:

It provides maximum customization in model selection, fine-tuning, and system design
It supports complex logic across multiple data sources
It offers complete control over security and deployment in your virtual private cloud (VPC)
It enables flexible interface implementation (Slack bots, custom web UIs, notebook plugins)
You can implement it for external user-facing solutions

For more information on steps to build this pattern, see Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources.
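To ground this pattern, the following is a minimal sketch of the generate-then-execute loop using the Amazon Bedrock Converse API and a generic database connection. The model ID, schema description, and connection object are placeholders, and a production implementation should add the guardrails discussed in Pattern 4 (read-only roles, validation, and self-correction):

import boto3

bedrock = boto3.client("bedrock-runtime")

SCHEMA_HINT = "Table sales(region TEXT, revenue NUMERIC, order_date DATE)"  # placeholder schema


def question_to_sql(question: str) -> str:
    # Ask the FM to translate the question into a single SQL statement
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # any supported FM
        system=[{"text": f"You translate questions into SQL. Schema: {SCHEMA_HINT}. Return only SQL."}],
        messages=[{"role": "user", "content": [{"text": question}]}],
    )
    return response["output"]["message"]["content"][0]["text"].strip()


def answer(question: str, connection) -> list:
    # Generate SQL, then run it against a read-only connection (DB-API style cursor assumed);
    # production code should validate or sanitize the generated SQL before executing it
    sql = question_to_sql(question)
    cursor = connection.cursor()
    cursor.execute(sql)
    rows = cursor.fetchall()
    cursor.close()
    return rows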
Pattern comparison: Making the right choice
To make effective decisions, let’s compare these patterns across key criteria.
Data workload suitability
Different out-of-the-box patterns handle transactional (operational) and analytical (historical or aggregated) data with varying degrees of effectiveness. Patterns 1 and 3, which use Amazon Q Business, work with indexed data and are optimized for lookup-style queries against previously indexed content rather than real-time transactional database queries. Pattern 2, which uses Amazon Q in QuickSight, produces visual output over transactional data for ad-hoc analysis. Pattern 4, which uses Amazon Bedrock structured data retrieval, is specifically designed for analytical systems and data warehouses, excelling at complex queries on large datasets. Pattern 5 is a self-managed text-to-SQL option that can be built to support both the transactional and analytical needs of users.
Target audience
Architectures highlighted in Patterns 1, 2, and 3 (using Amazon Q Business, Amazon Q in QuickSight, or a combination) are best suited for internal enterprise use. However, you can use Amazon QuickSight Embedded to embed data visuals, dashboards, and natural language queries into both internal and customer-facing applications. Amazon Q Business serves as an enterprise AI assistant for organizational knowledge, with subscription-based pricing tiers designed for internal employees. Pattern 4 (using Amazon Bedrock) can be used to build both internal and customer-facing applications. This is because, unlike the subscription-based model of Amazon Q Business, Amazon Bedrock provides API-driven services that avoid per-user subscription costs and identity management overhead for external customer scenarios. This makes it well-suited for customer-facing experiences where you need to serve potentially thousands of external users. The custom LLM solutions in Pattern 5 can similarly be tailored to external application requirements.
Interface and output format
Different patterns deliver answers through different interaction models:

Conversational experiences – Patterns 1 and 3 (using Amazon Q Business) provide chat-based interfaces. Pattern 4 (using Amazon Bedrock Knowledge Bases for structured data retrieval) naturally supports AI assistant integration, and Pattern 5 (a custom text-to-SQL solution) can be designed for a variety of interaction models.
Visualization-focused output – Pattern 2 (using Amazon Q in QuickSight) specializes in generating on-the-fly visualizations such as charts and tables in response to user questions.
API integration – For embedding capabilities into existing applications, Patterns 4 and 5 offer the most flexible API-based integration options.

The following figure is a comparison matrix of AWS structured data query patterns.

Conclusion
Between these patterns, your optimal choice depends on the following key factors:

Data location and characteristics – Is your data in operational databases, already in a data warehouse, or distributed across various sources?
User profile and interaction model – Are you supporting internal or external users? Do they prefer conversational or visualization-focused interfaces?
Available resources and expertise – Do you have ML specialists available, or do you need a fully managed solution?
Accuracy and governance requirements – Do you need strictly controlled semantics and curation, or is broader query flexibility acceptable with monitoring?

By understanding these patterns and their trade-offs, you can architect solutions that align with your business objectives.

About the authors
Akshara Shah is a Senior Solutions Architect at Amazon Web Services. She helps commercial customers build cloud-based generative AI services to meet their business needs. She has been designing, developing, and implementing solutions that leverage AI and ML technologies for more than 10 years. Outside of work, she loves painting, exercising and spending time with family.
Sanghwa Na is a Generative AI Specialist Solutions Architect at Amazon Web Services. Based in San Francisco, he works with customers to design and build generative AI solutions using large language models and foundation models on AWS. He focuses on helping organizations adopt AI technologies that drive real business value.

Building Advanced Multi-Agent AI Workflows by Leveraging AutoGen and S …

In this tutorial, we walk you through the seamless integration of AutoGen and Semantic Kernel with Google’s Gemini Flash model. We begin by setting up our GeminiWrapper and SemanticKernelGeminiPlugin classes to bridge the generative power of Gemini with AutoGen’s multi-agent orchestration. From there, we configure specialist agents, ranging from code reviewers to creative analysts, demonstrating how we can leverage AutoGen’s ConversableAgent API alongside Semantic Kernel’s decorated functions for text analysis, summarization, code review, and creative problem-solving. By combining AutoGen’s robust agent framework with Semantic Kernel’s function-driven approach, we create an advanced AI assistant that adapts to a variety of tasks with structured, actionable insights.

!pip install pyautogen semantic-kernel google-generativeai python-dotenv

import os
import asyncio
from typing import Dict, Any, List
import autogen
import google.generativeai as genai
from semantic_kernel import Kernel
from semantic_kernel.functions import KernelArguments
from semantic_kernel.functions.kernel_function_decorator import kernel_function

We start by installing the core dependencies: pyautogen, semantic-kernel, google-generativeai, and python-dotenv, ensuring we have all the necessary libraries for our multi-agent and semantic function setup. Then we import essential Python modules (os, asyncio, typing) along with autogen for agent orchestration, genai for Gemini API access, and the Semantic Kernel classes and decorators to define our AI functions.

GEMINI_API_KEY = "Use Your API Key Here"
genai.configure(api_key=GEMINI_API_KEY)

config_list = [
    {
        "model": "gemini-1.5-flash",
        "api_key": GEMINI_API_KEY,
        "api_type": "google",
        "api_base": "https://generativelanguage.googleapis.com/v1beta",
    }
]

We define our GEMINI_API_KEY placeholder and immediately configure the genai client so all subsequent Gemini calls are authenticated. Then we build a config_list containing the Gemini Flash model settings, model name, API key, endpoint type, and base URL, which we’ll hand off to our agents for LLM interactions.

class GeminiWrapper:
    """Wrapper for Gemini API to work with AutoGen"""

    def __init__(self, model_name="gemini-1.5-flash"):
        self.model = genai.GenerativeModel(model_name)

    def generate_response(self, prompt: str, temperature: float = 0.7) -> str:
        """Generate response using Gemini"""
        try:
            response = self.model.generate_content(
                prompt,
                generation_config=genai.types.GenerationConfig(
                    temperature=temperature,
                    max_output_tokens=2048,
                )
            )
            return response.text
        except Exception as e:
            return f"Gemini API Error: {str(e)}"

We encapsulate all Gemini Flash interactions in a GeminiWrapper class, where we initialize a GenerativeModel for our chosen model and expose a simple generate_response method. In this method, we pass the prompt and temperature into Gemini’s generate_content API (capped at 2048 tokens) and return the raw text or a formatted error.

class SemanticKernelGeminiPlugin:
    """Semantic Kernel plugin using Gemini Flash for advanced AI operations"""

    def __init__(self):
        self.kernel = Kernel()
        self.gemini = GeminiWrapper()

    @kernel_function(name="analyze_text", description="Analyze text for sentiment and key insights")
    def analyze_text(self, text: str) -> str:
        """Analyze text using Gemini Flash"""
        prompt = f"""
        Analyze the following text comprehensively:

        Text: {text}

        Provide analysis in this format:
        - Sentiment: [positive/negative/neutral with confidence]
        - Key Themes: [main topics and concepts]
        - Insights: [important observations and patterns]
        - Recommendations: [actionable next steps]
        - Tone: [formal/informal/technical/emotional]
        """

        return self.gemini.generate_response(prompt, temperature=0.3)

    @kernel_function(name="generate_summary", description="Generate comprehensive summary")
    def generate_summary(self, content: str) -> str:
        """Generate summary using Gemini's advanced capabilities"""
        prompt = f"""
        Create a comprehensive summary of the following content:

        Content: {content}

        Provide:
        1. Executive Summary (2-3 sentences)
        2. Key Points (bullet format)
        3. Important Details
        4. Conclusion/Implications
        """

        return self.gemini.generate_response(prompt, temperature=0.4)

    @kernel_function(name="code_analysis", description="Analyze code for quality and suggestions")
    def code_analysis(self, code: str) -> str:
        """Analyze code using Gemini's code understanding"""
        prompt = f"""
        Analyze this code comprehensively:

        ```
        {code}
        ```

        Provide analysis covering:
        - Code Quality: [readability, structure, best practices]
        - Performance: [efficiency, optimization opportunities]
        - Security: [potential vulnerabilities, security best practices]
        - Maintainability: [documentation, modularity, extensibility]
        - Suggestions: [specific improvements with examples]
        """

        return self.gemini.generate_response(prompt, temperature=0.2)

    @kernel_function(name="creative_solution", description="Generate creative solutions to problems")
    def creative_solution(self, problem: str) -> str:
        """Generate creative solutions using Gemini's creative capabilities"""
        prompt = f"""
        Problem: {problem}

        Generate creative solutions:
        1. Conventional Approaches (2-3 standard solutions)
        2. Innovative Ideas (3-4 creative alternatives)
        3. Hybrid Solutions (combining different approaches)
        4. Implementation Strategy (practical steps)
        5. Potential Challenges and Mitigation
        """

        return self.gemini.generate_response(prompt, temperature=0.8)

We encapsulate our Semantic Kernel logic in the SemanticKernelGeminiPlugin, where we initialize both the Kernel and our GeminiWrapper to power custom AI functions. Using the @kernel_function decorator, we declare methods like analyze_text, generate_summary, code_analysis, and creative_solution, each of which constructs a structured prompt and delegates the heavy lifting to Gemini Flash. This plugin lets us seamlessly register and invoke advanced AI operations within our Semantic Kernel environment.

class AdvancedGeminiAgent:
    """Advanced AI Agent using Gemini Flash with AutoGen and Semantic Kernel"""

    def __init__(self):
        self.sk_plugin = SemanticKernelGeminiPlugin()
        self.gemini = GeminiWrapper()
        self.setup_agents()

    def setup_agents(self):
        """Initialize AutoGen agents with Gemini Flash"""

        gemini_config = {
            "config_list": [{"model": "gemini-1.5-flash", "api_key": GEMINI_API_KEY}],
            "temperature": 0.7,
        }

        self.assistant = autogen.ConversableAgent(
            name="GeminiAssistant",
            llm_config=gemini_config,
            system_message="""You are an advanced AI assistant powered by Gemini Flash with Semantic Kernel capabilities.
            You excel at analysis, problem-solving, and creative thinking. Always provide comprehensive, actionable insights.
            Use structured responses and consider multiple perspectives.""",
            human_input_mode="NEVER",
        )

        self.code_reviewer = autogen.ConversableAgent(
            name="GeminiCodeReviewer",
            llm_config={**gemini_config, "temperature": 0.3},
            system_message="""You are a senior code reviewer powered by Gemini Flash.
            Analyze code for best practices, security, performance, and maintainability.
            Provide specific, actionable feedback with examples.""",
            human_input_mode="NEVER",
        )

        self.creative_analyst = autogen.ConversableAgent(
            name="GeminiCreativeAnalyst",
            llm_config={**gemini_config, "temperature": 0.8},
            system_message="""You are a creative problem solver and innovation expert powered by Gemini Flash.
            Generate innovative solutions, and provide fresh perspectives.
            Balance creativity with practicality.""",
            human_input_mode="NEVER",
        )

        self.data_specialist = autogen.ConversableAgent(
            name="GeminiDataSpecialist",
            llm_config={**gemini_config, "temperature": 0.4},
            system_message="""You are a data analysis expert powered by Gemini Flash.
            Provide evidence-based recommendations and statistical perspectives.""",
            human_input_mode="NEVER",
        )

        self.user_proxy = autogen.ConversableAgent(
            name="UserProxy",
            human_input_mode="NEVER",
            max_consecutive_auto_reply=2,
            is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
            llm_config=False,
        )

    def analyze_with_semantic_kernel(self, content: str, analysis_type: str) -> str:
        """Bridge function between AutoGen and Semantic Kernel with Gemini"""
        try:
            if analysis_type == "text":
                return self.sk_plugin.analyze_text(content)
            elif analysis_type == "code":
                return self.sk_plugin.code_analysis(content)
            elif analysis_type == "summary":
                return self.sk_plugin.generate_summary(content)
            elif analysis_type == "creative":
                return self.sk_plugin.creative_solution(content)
            else:
                return "Invalid analysis type. Use 'text', 'code', 'summary', or 'creative'."
        except Exception as e:
            return f"Semantic Kernel Analysis Error: {str(e)}"

    def multi_agent_collaboration(self, task: str) -> Dict[str, str]:
        """Orchestrate multi-agent collaboration using Gemini"""
        results = {}

        agents = {
            "assistant": (self.assistant, "comprehensive analysis"),
            "code_reviewer": (self.code_reviewer, "code review perspective"),
            "creative_analyst": (self.creative_analyst, "creative solutions"),
            "data_specialist": (self.data_specialist, "data-driven insights")
        }

        for agent_name, (agent, perspective) in agents.items():
            try:
                prompt = f"Task: {task}\n\nProvide your {perspective} on this task."
                response = agent.generate_reply([{"role": "user", "content": prompt}])
                results[agent_name] = response if isinstance(response, str) else str(response)
            except Exception as e:
                results[agent_name] = f"Agent {agent_name} error: {str(e)}"

        return results

    def run_comprehensive_analysis(self, query: str) -> Dict[str, Any]:
        """Run comprehensive analysis using all Gemini-powered capabilities"""
        results = {}

        analyses = ["text", "summary", "creative"]
        for analysis_type in analyses:
            try:
                results[f"sk_{analysis_type}"] = self.analyze_with_semantic_kernel(query, analysis_type)
            except Exception as e:
                results[f"sk_{analysis_type}"] = f"Error: {str(e)}"

        try:
            results["multi_agent"] = self.multi_agent_collaboration(query)
        except Exception as e:
            results["multi_agent"] = f"Multi-agent error: {str(e)}"

        try:
            results["direct_gemini"] = self.gemini.generate_response(
                f"Provide a comprehensive analysis of: {query}", temperature=0.6
            )
        except Exception as e:
            results["direct_gemini"] = f"Direct Gemini error: {str(e)}"

        return results

We implement our end-to-end AI orchestration in the AdvancedGeminiAgent class, where we initialize the Semantic Kernel plugin and Gemini wrapper and configure a suite of specialist AutoGen agents (assistant, code reviewer, creative analyst, data specialist, and user proxy). With simple methods for Semantic Kernel bridging, multi-agent collaboration, and direct Gemini calls, we enable a seamless, comprehensive analysis pipeline for any user query.

def main():
    """Main execution function for Google Colab with Gemini Flash"""
    print("Initializing Advanced Gemini Flash AI Agent...")
    print("Using Gemini 1.5 Flash for high-speed, cost-effective AI processing")

    try:
        agent = AdvancedGeminiAgent()
        print("Agent initialized successfully!")
    except Exception as e:
        print(f"Initialization error: {str(e)}")
        print("Make sure to set your Gemini API key!")
        return

    demo_queries = [
        "How can AI transform education in developing countries?",
        "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
        "What are the most promising renewable energy technologies for 2025?"
    ]

    print("\nRunning Gemini Flash Powered Analysis...")

    for i, query in enumerate(demo_queries, 1):
        print(f"\n{'='*60}")
        print(f"Demo {i}: {query}")
        print('=' * 60)

        try:
            results = agent.run_comprehensive_analysis(query)

            for key, value in results.items():
                if key == "multi_agent" and isinstance(value, dict):
                    print(f"\n{key.upper().replace('_', ' ')}:")
                    for agent_name, response in value.items():
                        print(f"  {agent_name}: {str(response)[:200]}...")
                else:
                    print(f"\n{key.upper().replace('_', ' ')}:")
                    print(f"  {str(value)[:300]}...")

        except Exception as e:
            print(f"Error in demo {i}: {str(e)}")

    print(f"\n{'='*60}")
    print("Gemini Flash AI Agent Demo Completed!")
    print("To use with your API key, replace 'your-gemini-api-key-here'")
    print("Get your free Gemini API key at: https://makersuite.google.com/app/apikey")


if __name__ == "__main__":
    main()

Finally, we run the main function that initializes the AdvancedGeminiAgent, prints out status messages, and iterates through a set of demo queries. As we run each query, we collect and display results from semantic-kernel analyses, multi-agent collaboration, and direct Gemini responses, ensuring a clear, step-by-step showcase of our multi-agent AI workflow.

In conclusion, we showcased how AutoGen and Semantic Kernel complement each other to produce a versatile, multi-agent AI system powered by Gemini Flash. We highlighted how AutoGen simplifies the orchestration of diverse expert agents, while Semantic Kernel provides a clean, declarative layer for defining and invoking advanced AI functions. By uniting these tools in a Colab notebook, we’ve enabled rapid experimentation and prototyping of complex AI workflows without sacrificing clarity or control.

Check out the code. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Building Advanced Multi-Agent AI Workflows by Leveraging AutoGen and Semantic Kernel appeared first on MarkTechPost.

TabArena: Benchmarking Tabular Machine Learning with Reproducibility a …

Understanding the Importance of Benchmarking in Tabular ML

Machine learning on tabular data focuses on building models that learn patterns from structured datasets, typically composed of rows and columns similar to those found in spreadsheets. These datasets are used in industries ranging from healthcare to finance, where accuracy and interpretability are essential. Techniques such as gradient-boosted trees and neural networks are commonly used, and recent advances have introduced foundation models designed to handle tabular data structures. Ensuring fair and effective comparisons between these methods has become increasingly important as new models continue to emerge.

Challenges with Existing Benchmarks

One challenge in this domain is that benchmarks for evaluating models on tabular data are often outdated or flawed. Many benchmarks continue to utilize obsolete datasets with licensing issues or those that do not accurately reflect real-world tabular use cases. Furthermore, some benchmarks include data leaks or synthetic tasks, which distort model evaluation. Without active maintenance or updates, these benchmarks fail to keep pace with advances in modeling, leaving researchers and practitioners with tools that cannot reliably measure current model performance.

Limitations of Current Benchmarking Tools

Several tools have attempted to benchmark models, but they typically rely on automatic dataset selection and minimal human oversight. This introduces inconsistencies in performance evaluation due to unverified data quality, duplication, or preprocessing errors. Furthermore, many of these benchmarks utilize only default model settings and avoid extensive hyperparameter tuning or ensemble techniques. The result is a lack of reproducibility and a limited understanding of how models perform under real-world conditions. Even widely cited benchmarks often fail to specify essential implementation details or restrict their evaluations to narrow validation protocols.

Introducing TabArena: A Living Benchmarking Platform

Researchers from Amazon Web Services, University of Freiburg, INRIA Paris, Ecole Normale Supérieure, PSL Research University, PriorLabs, and the ELLIS Institute Tübingen have introduced TabArena—a continuously maintained benchmark system designed for tabular machine learning. The research introduced TabArena to function as a dynamic and evolving platform. Unlike previous benchmarks that are static and outdated soon after release, TabArena is maintained like software: versioned, community-driven, and updated based on new findings and user contributions. The system was launched with 51 carefully curated datasets and 16 well-implemented machine-learning models.

Three Pillars of TabArena’s Design

The research team constructed TabArena on three main pillars: robust model implementation, detailed hyperparameter optimization, and rigorous evaluation. All models are built using AutoGluon and adhere to a unified framework that supports preprocessing, cross-validation, metric tracking, and ensembling. Hyperparameter tuning involves evaluating up to 200 different configurations for most models, except TabICL and TabDPT, which were tested for in-context learning only. For validation, the team uses 8-fold cross-validation and applies ensembling across different runs of the same model. Foundation models, due to their complexity, are trained on merged training-validation splits as recommended by their original developers. Each benchmarking configuration is evaluated with a one-hour time limit on standard computing resources.
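The following is a schematic sketch, not TabArena's actual code, of the evaluation protocol described above: k-fold cross-validation of a single configuration, with the per-fold models averaged as an ensemble at prediction time. The dataset, the stand-in model, and the fold count are illustrative assumptions.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold


def cross_validated_ensemble(X, y, X_test, n_splits=8):
    """Train one model per fold (NumPy arrays assumed) and average their predicted probabilities."""
    fold_predictions = []
    for train_idx, _ in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        model = GradientBoostingClassifier()  # stand-in for one tuned configuration
        model.fit(X[train_idx], y[train_idx])
        fold_predictions.append(model.predict_proba(X_test))
    # Ensemble across folds by averaging class probabilities
    return np.mean(fold_predictions, axis=0)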

Performance Insights from 25 Million Model Evaluations

Performance results from TabArena are based on an extensive evaluation involving approximately 25 million model instances. The analysis showed that ensemble strategies significantly improve performance across all model types. Gradient-boosted decision trees still perform strongly, but deep-learning models with tuning and ensembling are on par with, or even better than, them. For instance, AutoGluon 1.3 achieved marked results under a 4-hour training budget. Foundation models, particularly TabPFNv2 and TabICL, demonstrated strong performance on smaller datasets thanks to their effective in-context learning capabilities, even without tuning. Ensembles combining different types of models achieved state-of-the-art performance, although not all individual models contributed equally to the final results. These findings highlight the importance of both model diversity and the effectiveness of ensemble methods.

Significance of TabArena for the ML Community

The article identifies a clear gap in reliable, current benchmarking for tabular machine learning and offers a well-structured solution. By creating TabArena, the researchers have introduced a platform that addresses critical issues of reproducibility, data curation, and performance evaluation. The method relies on detailed curation and practical validation strategies, making it a significant contribution for anyone developing or evaluating models on tabular data.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post TabArena: Benchmarking Tabular Machine Learning with Reproducibility and Ensembling at Scale appeared first on MarkTechPost.