This AI Paper Introduces TableRAG: A Hybrid SQL and Text Retrieval Framework for Multi-Hop Question Answering over Heterogeneous Documents

Handling questions that involve both natural language and structured tables has become an essential task in building more intelligent and useful AI systems. These systems are often expected to process content that includes diverse data types, such as text mixed with numerical tables, which are commonly found in business documents, research papers, and public reports. Understanding such documents requires the AI to perform reasoning that spans both textual explanations and table-based details—a process that is inherently more complicated than traditional text-based question answering.

One of the major problems in this area is that current language models often fail to interpret documents accurately when tables are involved. Models tend to lose the relationships between rows and columns when the tables are flattened into plain text. This distorts the underlying structure of the data and reduces the accuracy of answers, especially when the task involves computations, aggregations, or reasoning that connects multiple facts across the document. Such limitations make it challenging to utilize standard systems for practical multi-hop question-answering tasks that require insights from both text and tables.

To address these problems, previous methods have applied Retrieval-Augmented Generation (RAG) techniques, which retrieve text segments and feed them into a language model for answer generation. However, these techniques fall short on tasks that require compositional or global reasoning across large tabular datasets. Tools like NaiveRAG and TableGPT2 try to approximate this process by converting tables into Markdown format or by executing generated Python code. Yet these methods still struggle with tasks where maintaining the table’s original structure is necessary for correct interpretation.

Researchers from Huawei Cloud BU proposed TableRAG, a method that directly addresses these limitations. TableRAG is a hybrid system that alternates between textual data retrieval and structured SQL-based execution, treating table-based queries as unified reasoning units. Because queries run against the tables themselves rather than flattened text, the system preserves the table structure and respects the relational nature of data organized in rows and columns. The researchers also created a dataset called HeteQA to benchmark the method across different domains and multi-step reasoning tasks.

TableRAG functions in two main stages. The offline stage involves parsing heterogeneous documents into structured databases by extracting tables and textual content separately. These are stored in parallel corpora—a relational database for tables and a chunked knowledge base for text. The online phase handles user questions through an iterative four-step process: query decomposition, text retrieval, SQL programming and execution, and intermediate answer generation. When a question is received, the system identifies whether it requires tabular or textual reasoning, dynamically chooses the appropriate strategy, and combines the outputs. SQL is used for precise symbolic execution, enabling better performance in numerical and logical computations.
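To make that loop concrete, here is a minimal, hypothetical sketch of how such an iterative controller could route sub-queries to either the text corpus or the relational store; the llm and text_retriever objects and the planner interface are illustrative stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of TableRAG's online phase (names are illustrative, not the authors' code).
import sqlite3

def answer_question(question, llm, text_retriever, db_path, max_steps=5):
    """Iteratively decompose a question, routing each sub-query to text retrieval or SQL."""
    conn = sqlite3.connect(db_path)            # relational store built in the offline stage
    context = []                               # accumulated intermediate answers
    for _ in range(max_steps):
        step = llm.plan(question, context)     # decompose: decide the next sub-query and its modality
        if step.kind == "sql":
            rows = conn.execute(step.sql).fetchall()         # symbolic execution over tables
            context.append(llm.summarize(step, rows))        # intermediate answer generation
        elif step.kind == "text":
            chunks = text_retriever.search(step.query, k=5)  # chunked knowledge base
            context.append(llm.summarize(step, chunks))
        else:                                  # planner signals the question is resolved
            break
    return llm.generate_answer(question, context)
```

The cap of five steps mirrors the iteration budget reported in the experiments below.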

During experiments, TableRAG was tested on several benchmarks, including HybridQA, WikiTableQuestions, and the newly constructed HeteQA. HeteQA consists of 304 complex questions across nine diverse domains and includes 136 unique tables, as well as over 5,300 Wikipedia-derived entities. The dataset challenges models with tasks like filtering, aggregation, grouping, calculation, and sorting. TableRAG outperformed all baseline methods, including NaiveRAG, ReAct, and TableGPT2. It achieved consistently higher accuracy, with document-level reasoning completed in up to five iterative steps, and the results were verified using models such as Claude-3.5-Sonnet and Qwen-2.5-72B.

The work presented a strong and well-structured solution to the challenge of reasoning over mixed-format documents. By maintaining structural integrity and adopting SQL for structured data operations, the researchers demonstrated an effective alternative to existing retrieval-based systems. TableRAG represents a significant step forward in question-answering systems that handle documents containing both tables and text, offering a viable method for more accurate, scalable, and interpretable document understanding.


Efficient and Adaptable Speech Enhancement via Pre-trained Generative Audioencoders and Vocoders

Recent advances in speech enhancement (SE) have moved beyond traditional mask or signal prediction methods, turning instead to pre-trained audio models for richer, more transferable features. These models, such as WavLM, extract meaningful audio embeddings that enhance the performance of SE. Some approaches use these embeddings to predict masks or combine them with spectral data for better accuracy. Others explore generative techniques, using neural vocoders to reconstruct clean speech directly from noisy embeddings. While effective, these methods often involve freezing pre-trained models or require extensive fine-tuning, which limits adaptability and increases computational costs, making transfer to other tasks more difficult. 

Researchers at MiLM Plus, Xiaomi Inc., present a lightweight and flexible SE method that uses pre-trained models. First, audio embeddings are extracted from noisy speech using a frozen audioencoder. These are then cleaned by a small denoise encoder and passed to a vocoder to generate clean speech. Unlike task-specific models, both the audioencoder and vocoder are pre-trained separately, making the system adaptable to tasks like dereverberation or separation. Experiments have shown that generative models outperform discriminative ones in terms of speech quality and speaker fidelity. Despite its simplicity, the system is highly efficient and even surpasses a leading SE model in listening tests. 

The proposed speech enhancement system is divided into three main components. First, noisy speech is passed through a pre-trained audioencoder, which generates noisy audio embeddings. A denoise encoder then refines these embeddings to produce cleaner versions, which are finally converted back into speech by a vocoder. While the denoise encoder and vocoder are trained separately, they both rely on the same frozen, pre-trained audioencoder. During training, the denoise encoder minimizes the difference between noisy and clean embeddings, both of which are generated in parallel from paired speech samples, using a Mean Squared Error loss. This encoder is built using a ViT architecture with standard activation and normalization layers.
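As a rough illustration of this objective, the sketch below shows a single training step that keeps the audioencoder frozen and fits the denoise encoder with an MSE loss on embeddings; the module interfaces and shapes are assumptions, not the paper's code.

```python
# Minimal sketch of the denoise-encoder training step described above
# (assumes a frozen pre-trained audioencoder and paired noisy/clean waveforms).
import torch
import torch.nn as nn

def train_step(audioencoder, denoise_encoder, optimizer, noisy_wav, clean_wav):
    audioencoder.eval()                                   # frozen feature extractor
    with torch.no_grad():
        noisy_emb = audioencoder(noisy_wav)               # embeddings of noisy speech
        clean_emb = audioencoder(clean_wav)               # target embeddings of clean speech
    pred_emb = denoise_encoder(noisy_emb)                 # ViT-style encoder refines the embeddings
    loss = nn.functional.mse_loss(pred_emb, clean_emb)    # MSE between denoised and clean embeddings
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```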

For the vocoder, training is done in a self-supervised way using clean speech data alone. The vocoder learns to reconstruct speech waveforms from audio embeddings by predicting Fourier spectral coefficients, which are converted back to audio through the inverse short-time Fourier transform. It adopts a slightly modified version of the Vocos framework, tailored to accommodate various audioencoders. A Generative Adversarial Network (GAN) setup is employed, where the generator is based on ConvNeXt, and the discriminators include both multi-period and multi-resolution types. The training also incorporates adversarial, reconstruction, and feature matching losses. Importantly, throughout the process, the audioencoder remains unchanged, using weights from publicly available models. 
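The reconstruction path can be pictured as a small spectral head that predicts real and imaginary Fourier coefficients from embeddings and inverts them with the inverse STFT. The sketch below is a simplified stand-in for the Vocos-style generator; the FFT size, hop length, and layer choices are assumptions, and the GAN discriminators and losses are omitted.

```python
# Simplified sketch of the vocoder's waveform reconstruction path (not the authors' model).
import torch
import torch.nn as nn

class SpectralHead(nn.Module):
    def __init__(self, emb_dim, n_fft=1024, hop_length=256):
        super().__init__()
        self.n_fft, self.hop_length = n_fft, hop_length
        self.proj = nn.Linear(emb_dim, 2 * (n_fft // 2 + 1))   # real + imaginary parts per frequency bin

    def forward(self, emb):                                    # emb: (batch, frames, emb_dim)
        coeffs = self.proj(emb)                                # (batch, frames, 2 * bins)
        real, imag = coeffs.chunk(2, dim=-1)
        spec = torch.complex(real, imag).transpose(1, 2)       # (batch, bins, frames)
        window = torch.hann_window(self.n_fft, device=emb.device)
        return torch.istft(spec, n_fft=self.n_fft,             # inverse STFT back to a waveform
                           hop_length=self.hop_length, window=window)
```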

The evaluation demonstrated that generative audioencoders, such as Dasheng, consistently outperformed discriminative ones. On the DNS1 dataset, Dasheng achieved a speaker similarity score of 0.881, whereas WavLM and Whisper scored 0.486 and 0.489, respectively. In terms of speech quality, non-intrusive metrics like DNSMOS and NISQAv2 indicated notable improvements, even with smaller denoise encoders. For instance, ViT3 reached a DNSMOS of 4.03 and a NISQAv2 score of 4.41. Subjective listening tests involving 17 participants showed that Dasheng produced a Mean Opinion Score (MOS) of 3.87, surpassing Demucs at 3.11 and LMS at 2.98, highlighting its strong perceptual performance. 

In conclusion, the study presents a practical and adaptable speech enhancement system that relies on pre-trained generative audioencoders and vocoders, avoiding the need for full model fine-tuning. By denoising audio embeddings using a lightweight encoder and reconstructing speech with a pre-trained vocoder, the system achieves both computational efficiency and strong performance. Evaluations show that generative audioencoders significantly outperform discriminative ones in terms of speech quality and speaker fidelity. The compact denoise encoder maintains high perceptual quality even with fewer parameters. Subjective listening tests further confirm that this method delivers better perceptual clarity than an existing state-of-the-art model, highlighting its effectiveness and versatility. 


Amazon Bedrock Knowledge Bases now supports Amazon OpenSearch Service managed clusters as a vector store

Amazon Bedrock Knowledge Bases has extended its vector store options by adding support for Amazon OpenSearch Service managed clusters, further strengthening its capabilities as a fully managed Retrieval Augmented Generation (RAG) solution. This enhancement builds on the core functionality of Amazon Bedrock Knowledge Bases, which is designed to seamlessly connect foundation models (FMs) with internal data sources. Amazon Bedrock Knowledge Bases automates critical processes such as data ingestion, chunking, embedding generation, and vector storage, along with the application of advanced indexing algorithms and retrieval techniques, empowering users to develop intelligent applications with minimal effort.
The latest update broadens the vector database options available to users. In addition to the previously supported vector stores such as Amazon OpenSearch Serverless, Amazon Aurora PostgreSQL-Compatible Edition, Amazon Neptune Analytics, Pinecone, MongoDB, and Redis Enterprise Cloud, users can now use OpenSearch Service managed clusters. This integration enables the use of an OpenSearch Service domain as a robust backend for storing and retrieving vector embeddings, offering greater flexibility and choice in vector storage solutions.
To help users take full advantage of this new integration, this post provides a comprehensive, step-by-step guide on integrating an Amazon Bedrock knowledge base with an OpenSearch Service managed cluster as its vector store.
Why use OpenSearch Service Managed Cluster as a vector store?
OpenSearch Service provides two complementary deployment options for vector workloads: managed clusters and serverless collections. Both harness the powerful vector search and retrieval capabilities of OpenSearch Service, though each excels in different scenarios. Managed clusters offer extensive configuration flexibility, performance tuning options, and scalability that make them particularly well-suited for enterprise-grade AI applications. Organizations seeking greater control over cluster configurations, compute instances, the ability to fine-tune performance and cost, and support for a wider range of OpenSearch features and API operations will find managed clusters a natural fit for their use cases. Alternatively, OpenSearch Serverless excels in use cases that require automatic scaling and capacity management, simplified operations without the need to manage clusters or nodes, automatic software updates, and built-in high availability and redundancy. The optimal choice depends entirely on your specific use case, operational model, and technical requirements. Here are some key reasons why OpenSearch Service managed clusters offer a compelling choice for organizations:

Flexible configuration – Managed clusters provide flexible and extensive configuration options that enable fine-tuning for specific workloads. This includes the ability to select instance types, adjust resource allocations, configure cluster topology, and implement specialized performance optimizations. For organizations with specific performance requirements or unique workload characteristics, this level of customization can be invaluable.
Performance and cost optimizations to meet your design criteria – Vector database performance is a trade-off between three key dimensions: accuracy, latency, and cost. Managed clusters provide the granular control to optimize along one or a combination of these dimensions and meet your specific design criteria.
Early access to advanced ML features – OpenSearch Service follows a structured release cycle, with new capabilities typically introduced first in the open source project, then in managed clusters, and later in serverless offerings. Organizations that prioritize early adoption of advanced vector search capabilities might benefit from choosing managed clusters, which often provide earlier exposure to new innovation. However, for customers using Amazon Bedrock Knowledge Bases, these features become beneficial only after they have been fully integrated into the knowledge bases. This means that even if a feature is available in a managed OpenSearch Service cluster, it might not be immediately accessible within Amazon Bedrock Knowledge Bases. Nonetheless, opting for managed clusters positions organizations to take advantage of the latest OpenSearch advancements more promptly after they’re supported within Bedrock Knowledge Bases.

Prerequisites
Before we dive into the setup, make sure you have the following prerequisites in place:

Data source – An Amazon S3 bucket (or custom source) with documents for knowledge base ingestion. We assume your bucket contains supported document types (PDFs, TXT files, and so on) for retrieval.
OpenSearch Service domain (optional) – For existing domains, make sure the domain is in the same Region and account where you’ll create your Amazon Bedrock knowledge base. As of this writing, Bedrock Knowledge Bases requires OpenSearch Service domains with public access; virtual private cloud (VPC)-only domains aren’t supported yet. Make sure you have the necessary permissions to create or configure domains. This guide covers setup for both new and existing domains.

Solution overview
This section covers the following high-level steps to integrate an OpenSearch Service managed cluster with Amazon Bedrock Knowledge Bases:

Create an OpenSearch Service domain – Set up a new OpenSearch Service managed cluster with public access, appropriate engine version, and security settings, including AWS Identity and Access Management (IAM) master user role and fine-grained access control. This step includes establishing administrative access by creating dedicated IAM resources and configuring Amazon Cognito authentication for secure dashboard access.
Configure a vector index in OpenSearch Service – Create a k-nearest neighbors (k-NN) enabled index on the domain with the appropriate mappings for vector, text chunk, and metadata fields to be compatible with Amazon Bedrock Knowledge Bases.
Configure the Amazon Bedrock knowledge base – Initiate the creation of an Amazon Bedrock knowledge base, enable your Amazon Simple Storage Service (Amazon S3) data source, and configure it to use your OpenSearch Service domain as the vector store with all relevant domain details.
Configure fine-grained access control permissions in OpenSearch Service – Configure fine-grained access control in OpenSearch Service by creating a role with specific permissions and mapping it to the Amazon Bedrock IAM service role, facilitating secure and controlled access for the knowledge base.
Complete knowledge base creation and ingest data – Initiate a sync operation in the Amazon Bedrock console to process S3 documents, generate embeddings, and store them in your OpenSearch Service index.

The following diagram illustrates these steps:

Solution walkthrough
Here are the steps to follow in the AWS console to integrate Amazon Bedrock Knowledge Bases with OpenSearch Service Managed Cluster.
Establish administrative access with IAM master user and role
Before creating an OpenSearch Service domain, you need to create two key IAM resources: a dedicated IAM admin user and a master role. This approach facilitates proper access management for your OpenSearch Service domain, particularly when implementing fine-grained access control, which is strongly recommended for production environments. This user and role will have the necessary permissions to create, configure, and manage the OpenSearch Service domain and its integration with Amazon Bedrock Knowledge Bases.
Create an IAM admin user
The administrative user serves as the principal account for managing the OpenSearch Service configuration. To create an IAM admin user, follow these steps:

Open the IAM console in your AWS account
In the left navigation pane, choose Users and then choose Create user
Enter a descriptive username, such as opensearch-admin
On the permissions configuration page, choose Attach policies directly
Search for and attach the AmazonOpenSearchServiceFullAccess managed policy, which grants comprehensive permissions for OpenSearch Service operations
Review your settings and choose Create user

After creating the user, copy and save the user’s Amazon Resource Name (ARN) for later use in the domain configuration. With <ACCOUNT_ID> replaced by your AWS account ID, the ARN will look like this:
arn:aws:iam::<ACCOUNT_ID>:user/opensearch-admin
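If you prefer to script this step, a boto3 sketch along these lines should produce the same user and policy attachment; the user name mirrors the walkthrough and can be changed.

```python
# Optional scripted alternative to the console steps above.
import boto3

iam = boto3.client("iam")
user = iam.create_user(UserName="opensearch-admin")
iam.attach_user_policy(
    UserName="opensearch-admin",
    PolicyArn="arn:aws:iam::aws:policy/AmazonOpenSearchServiceFullAccess",
)
print(user["User"]["Arn"])   # save this ARN for the domain configuration
```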
Create an IAM role to act as the OpenSearch Service master user
With OpenSearch Service, you can assign a master user for domains with fine-grained access control. By configuring an IAM role as the master user, you can manage access using trusted principles and avoid static usernames and passwords. To create the IAM role, follow these steps:

On the IAM console, in the left-hand navigation pane, choose Roles and then choose Create role
Choose Custom trust policy as the trusted entity type to precisely control which principals can assume this role
In the JSON editor, paste the following trust policy that allows entities, such as your opensearch-admin user, to assume this role

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<ACCOUNT_ID>:user/opensearch-admin"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Proceed to the Add permissions page and attach the same AmazonOpenSearchServiceFullAccess managed policy you used for your admin user
Provide a descriptive name such as OpenSearchMasterRole and choose Create role

After the role is created, navigate to its summary page and copy the role’s ARN. You’ll need this ARN when configuring your OpenSearch Service domain’s master user.
arn:aws:iam::<ACCOUNT_ID>:role/OpenSearchMasterRole
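The same role can also be created programmatically; the following boto3 sketch mirrors the console steps, with <ACCOUNT_ID> as a placeholder for your AWS account ID.

```python
# Optional scripted equivalent of creating the master role.
import json
import boto3

iam = boto3.client("iam")
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::<ACCOUNT_ID>:user/opensearch-admin"},
        "Action": "sts:AssumeRole",
    }],
}
role = iam.create_role(
    RoleName="OpenSearchMasterRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="OpenSearchMasterRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonOpenSearchServiceFullAccess",
)
print(role["Role"]["Arn"])   # master user ARN for the domain's security settings
```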
Create an OpenSearch Service domain for vector search
With the administrative IAM role established, the next step is to create the OpenSearch Service domain that will serve as the vector store for your Amazon Bedrock knowledge base. This involves configuring the domain’s engine, network access, and, most importantly, its security settings using fine-grained access control.

In the OpenSearch Service console, select Managed clusters as your deployment type. Then choose Create domain.
Configure your domain details:

Provide a domain name such as bedrock-kb-domain.
For a quick and straightforward setup, choose Easy create, as shown in the following screenshot. This option automatically selects suitable instance types and default configurations optimized for development or small-scale workloads. This way, you can quickly deploy a functional OpenSearch Service domain without manual configuration. Many of these settings can be modified later as your needs evolve, making this approach ideal for experimentation or nonproduction use cases while still providing a solid foundation.

If your workload demands higher input/output operations per second (IOPS) or throughput or involves managing substantial volumes of data, selecting Standard create is recommended. With this option enabled, you can customize instance types, storage configurations, and advanced security settings to optimize the speed and efficiency of data storage and retrieval operations, making it well-suited for production environments. For example, you can scale the baseline GP3 volume performance from 3,000 IOPS and 125 MiB/s throughput up to 16,000 IOPS and 1,000 MiB/s throughput for every 3 TiB of storage provisioned per data node. This flexibility means that you can align your OpenSearch Service domain performance with specific workload demands, facilitating efficient indexing and retrieval operations for high-throughput or large-scale applications. These settings should be fine-tuned based on the size and complexity of your OpenSearch Service workload to optimize both performance and cost.
However, although increasing your domain’s throughput and storage settings can help improve domain performance—and might help mitigate ingestion errors caused by storage or node-level bottlenecks—it doesn’t increase the ingestion speed into Amazon Bedrock Knowledge Bases as of this writing. Knowledge base ingestion operates at a fixed throughput rate for customers and vector databases, regardless of underlying domain configuration. AWS continues to invest in scaling and evolving the ingestion capabilities of Bedrock Knowledge Bases, and future improvements might offer greater flexibility.

For engine version, choose OpenSearch version 2.13 or higher. If you plan to store binary embeddings, select version 2.16 or above because it’s required for binary vector indexing. It’s recommended to use the latest available version to benefit from performance improvements and feature updates.
For network configuration, under Network, choose Public access, as shown in the following screenshot. This is crucial because, as of this writing, Amazon Bedrock Knowledge Bases doesn’t support connecting to OpenSearch Service domains that are behind a VPC. To maintain security, we implement IAM policies and fine-grained access controls to manage access at a granular level. Using these controls, you can define who can access your resources and what actions they can perform, adhering to the principle of least privilege. Select Dual-stack mode for network settings if prompted. This enables support for both IPv4 and IPv6, offering greater compatibility and accessibility.

For security, enable Fine-grained access control to secure your domain by defining detailed, role-based permissions at the index, document, and field levels. This feature offers more precise control compared to resource-based policies, which operate only at the domain level.

In the fine-grained access control implementation section, we guide you through creating a custom OpenSearch Service role with specific index and cluster permissions, then authorizing Amazon Bedrock Knowledge Bases by associating its service role with this custom role. This mapping establishes a trust relationship that restricts Bedrock Knowledge Bases to only the operations you’ve explicitly permitted when accessing your OpenSearch Service domain with its service credentials, facilitating secure and controlled integration.
When enabling fine-grained access control, you must select a master user to manage the domain. You have two options:

Create master user (Username and Password) – This option establishes credentials in the OpenSearch Service internal user database, providing quick setup and direct access to OpenSearch Dashboards using basic authentication. Although convenient for initial configuration or development environments, it requires careful management of these credentials as a separate identity from your AWS infrastructure.
Set IAM ARN as master user – This option integrates with the AWS identity landscape, allowing IAM-based authentication. This is strongly recommended for production environments where applications and services already rely on IAM for secure access and where you need auditability and integration with your existing AWS security posture.

For this walkthrough, we choose Set IAM ARN as master user. This is the recommended approach for production environments because it integrates with your existing AWS identity framework, providing better auditability and security management.
In the text box, paste the ARN of the OpenSearchMasterRole that you created in the first step, as shown in the following screenshot. This designates the IAM role as the superuser for your OpenSearch Service domain, granting it full permissions to manage users, roles, and permissions within OpenSearch Dashboards.

Although setting an IAM master user is ideal for programmatic access, it’s not convenient for allowing users to log in to the OpenSearch Dashboards. In a subsequent step, after the domain is created and we’ve configured Cognito resources, we’ll revisit this security configuration to enable Amazon Cognito authentication. Then you’ll be able to create a user-friendly login experience for the OpenSearch Dashboards, where users can sign in through a hosted UI and be automatically mapped to IAM roles (such as the OpenSearchMasterRole or more limited roles), combining ease of use with robust, role-based security. For now, proceed with the IAM ARN as the master user to complete the initial domain setup.

Review your settings and choose Create to launch the domain. The initialization process typically takes around 10–15 minutes. During this time, OpenSearch Service will set up the domain and apply your configurations.

After your domain becomes active, navigate to its detail page to retrieve the following information:

Domain endpoint – This is the HTTPS URL where your OpenSearch Service is accessible, typically following the format: https://search-<domain-name>-<unique-identifier>.<region>.es.amazonaws.com
Domain ARN – This uniquely identifies your domain and follows the structure: arn:aws:es:<region>:<account-id>:domain/<domain-name>

Make sure to copy and securely store both these details because you’ll need them when configuring your Amazon Bedrock knowledge base in subsequent steps. With the OpenSearch Service domain up and running, you now have an empty cluster ready to store your vector embeddings. Next, we move on to configuring a vector index within this domain.
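For teams that automate infrastructure, a boto3 sketch that approximates these settings is shown below; the instance type, storage size, and engine version are examples, and depending on your environment you might also need to attach a domain access policy for public access with fine-grained access control.

```python
# Approximate scripted equivalent of the domain creation above (values are examples).
import boto3

opensearch = boto3.client("opensearch")
opensearch.create_domain(
    DomainName="bedrock-kb-domain",
    EngineVersion="OpenSearch_2.17",                       # 2.13+ required; 2.16+ for binary vectors
    ClusterConfig={"InstanceType": "r6g.large.search", "InstanceCount": 1},
    EBSOptions={"EBSEnabled": True, "VolumeType": "gp3", "VolumeSize": 100},
    IPAddressType="dualstack",                             # IPv4 and IPv6 support
    EncryptionAtRestOptions={"Enabled": True},             # required for fine-grained access control
    NodeToNodeEncryptionOptions={"Enabled": True},
    DomainEndpointOptions={"EnforceHTTPS": True},
    AdvancedSecurityOptions={
        "Enabled": True,                                   # fine-grained access control
        "InternalUserDatabaseEnabled": False,              # master user is an IAM role
        "MasterUserOptions": {
            "MasterUserARN": "arn:aws:iam::<ACCOUNT_ID>:role/OpenSearchMasterRole"
        },
    },
)
```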
Create an Amazon Cognito user pool
Following the creation of your OpenSearch Service domain, the next step is to configure an Amazon Cognito user pool. This user pool will provide a secure and user-friendly authentication layer for accessing the OpenSearch Dashboards. Follow these steps:

Navigate to the Amazon Cognito console and choose User pools from the main dashboard. Choose Create user pool to begin the configuration process. The latest developer-focused console experience presents a unified application setup interface rather than the traditional step-by-step wizard.
For OpenSearch Dashboards integration, choose Traditional web application. This application type supports the authentication flow required for dashboard access and can securely handle the OAuth flows needed for the integration.
Enter a descriptive name in the Name your application field, such as opensearch-kb-app. This name will automatically become your app client name.
Configure how users will authenticate with your system. For OpenSearch integration, select Email as the primary sign-in option. This allows users to sign up and sign in using their email addresses, providing a familiar authentication method. Additional options include Phone number and Username if your use case requires alternative sign-in methods.
Specify the user information that must be collected during registration. At minimum, make sure Email is selected as a required attribute. This is essential for account verification and recovery processes.
This step is a critical security configuration that specifies where Cognito can redirect users after successful authentication. In the Add a return URL field, enter your OpenSearch Dashboards URL in the following format: https://search-<domain-name>-<unique-identifier>.aos.<region>.on.aws/_dashboards.
Choose Create user directory to provision your user pool and its associated app client.

The simplified interface automatically configures optimal settings for your selected application type, including appropriate security policies, OAuth flows, and hosted UI domain generation. Copy and save the User pool ID and App client ID values. You’ll need them to configure the Cognito identity pool and update the OpenSearch Service domain’s security settings.
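A rough boto3 equivalent of the user pool creation is shown below; note that the console's application setup also provisions the app client and hosted UI domain for you, and the pool name here is an example.

```python
# Rough scripted equivalent of creating the user pool (values are examples).
import boto3

cognito = boto3.client("cognito-idp")
pool = cognito.create_user_pool(
    PoolName="opensearch-kb-pool",
    UsernameAttributes=["email"],          # users sign in with their email address
    AutoVerifiedAttributes=["email"],      # email is required and verified
)
print(pool["UserPool"]["Id"])              # save the user pool ID for later steps
```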
Add an admin user to the user pool
After creating your Amazon Cognito user pool, you need to add an administrator user who will have access to OpenSearch Dashboards. Follow these steps:

In the Amazon Cognito console, select your newly created user pool
In the left navigation pane, choose Users
Choose Create user
Select Send an email invitation
Enter an Email address for the administrator, for example, admin@example.com
Choose whether to set a Temporary password or have Cognito generate one
Choose Create user

Upon the administrator’s first login, they’ll be prompted to create a permanent password. When all the subsequent setup steps are complete, this admin user will be able to authenticate to OpenSearch Dashboards.
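If you'd rather script this step, the following boto3 sketch creates the same invited user; the email address is an example value.

```python
# Scripted alternative for adding the Dashboards admin user.
import boto3

cognito = boto3.client("cognito-idp")
cognito.admin_create_user(
    UserPoolId="<USER_POOL_ID>",
    Username="admin@example.com",
    UserAttributes=[{"Name": "email", "Value": "admin@example.com"}],
    DesiredDeliveryMediums=["EMAIL"],      # sends the email invitation
)
```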
Configure app client settings
With your Amazon Cognito user pool created, the next step is to configure app client parameters that will enable seamless integration with your OpenSearch dashboard. The app client configuration defines how OpenSearch Dashboards will interact with the Cognito authentication system, including callback URLs, OAuth flows, and scope permissions. Follow these steps:

Navigate to your created user pool on the Amazon Cognito console and locate your app client in the applications list. Select your app client to access its configuration dashboard.
Choose the Login tab from the app client interface. This section displays your current managed login pages configuration, including callback URLs, identity providers, and OAuth settings.
To open the OAuth configuration interface, in the Managed login pages configuration section, choose Edit.
Add your OpenSearch Dashboards URL in the Allowed callback URLs section from the Create an Amazon Cognito user pool section.
To allow authentication using your user pool credentials, in the Identity providers dropdown list, select Cognito user pool.
Select Authorization code grant from the OAuth 2.0 grant types dropdown list. This provides the most secure OAuth flow for web applications by exchanging authorization codes for access tokens server-side.
Configure OpenID Connect scopes by selecting the appropriate scopes from the available options:

Email: Enables access to user email addresses for identification.
OpenID: Provides basic OpenID Connect (OIDC) functionality.
Profile: Allows access to user profile information.

Save the configuration by choosing Save changes at the bottom of the page to apply the OAuth settings to your app client. The system will validate your configuration and confirm the updates have been successfully applied.
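The equivalent OAuth settings can also be applied with boto3; in the sketch below, the user pool ID, app client ID, and callback URL are placeholders from the earlier steps.

```python
# Scripted equivalent of the OAuth settings above (IDs and URL are placeholders).
import boto3

cognito = boto3.client("cognito-idp")
cognito.update_user_pool_client(
    UserPoolId="<USER_POOL_ID>",
    ClientId="<APP_CLIENT_ID>",
    SupportedIdentityProviders=["COGNITO"],                  # authenticate with the user pool
    CallbackURLs=["https://<your-dashboards-url>/_dashboards"],
    AllowedOAuthFlows=["code"],                              # authorization code grant
    AllowedOAuthScopes=["openid", "email", "profile"],
    AllowedOAuthFlowsUserPoolClient=True,
)
```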
Update master role trust policy for Cognito integration
Before creating the Cognito identity pool, you must first update your existing OpenSearchMasterRole to trust the Cognito identity service. This is required because only IAM roles with the proper trust policy for cognito-identity.amazonaws.com will appear in the Identity pool role selection dropdown list. Follow these steps:

Navigate to IAM on the console.
In the left navigation menu, choose Roles.
Find and select OpenSearchMasterRole from the list of roles.
Choose the Trust relationships tab.
Choose Edit trust policy.
Replace the existing trust policy with the following configuration that includes both your IAM user access and Cognito federated access. Replace YOUR_ACCOUNT_ID with your AWS account number. Leave PLACEHOLDER_IDENTITY_POOL_ID as is for now; you’ll replace it after creating the identity pool in a later step:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::YOUR_ACCOUNT_ID:user/opensearch-admin"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "cognito-identity.amazonaws.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "cognito-identity.amazonaws.com:aud": "PLACEHOLDER_IDENTITY_POOL_ID"
        },
        "ForAnyValue:StringLike": {
          "cognito-identity.amazonaws.com:amr": "authenticated"
        }
      }
    }
  ]
}
```

Choose Update policy to save the trust relationship configuration.

Create and configure Amazon Cognito identity pool
The identity pool serves as a bridge between your Cognito user pool authentication and AWS IAM roles so that authenticated users can assume specific IAM permissions when accessing your OpenSearch Service domain. This configuration is essential for mapping Cognito authenticated users to the appropriate OpenSearch Service access permissions. This step primarily configures administrative access to the OpenSearch Dashboards, allowing domain administrators to manage users, roles, and domain settings through a secure web interface. Follow these steps:

Navigate to Identity pools on the Amazon Cognito console and choose Create identity pool to begin the configuration process.
In the Authentication section, configure the types of access your identity pool will support:

Select Authenticated access to enable your identity pool to issue credentials to users who have successfully authenticated through your configured identity providers. This is essential for Cognito authenticated users to be able to access AWS resources.
In the Authenticated identity sources section, choose Amazon Cognito user pool as the authentication source for your identity pool.

Choose Next to proceed to the permissions configuration.
For the Authenticated role, select Use an existing role and choose the OpenSearchMasterRole that you created in Establish administrative access with IAM master user and role. This assignment grants authenticated users the comprehensive permissions defined in your master role so that they can:

Access and manage your OpenSearch Service domain through the dashboards interface.
Configure security settings and user permissions.
Manage indices and perform administrative operations.
Create and modify OpenSearch Service roles and role mappings.

This configuration provides full administrative access to your OpenSearch Service domain. Users who authenticate through this Cognito setup will have master-level permissions, making this suitable for domain administrators who need to configure security settings, manage users, and perform maintenance tasks.

Choose Next to continue with identity provider configuration.
From the dropdown list, choose the User pool you created in Create an Amazon Cognito user pool.
Choose the app client you configured in the previous step from the available options in the App client dropdown list.
Keep the default role setting, which will assign the OpenSearchMasterRole to authenticated users from this user pool.
Choose Next.
Provide a descriptive name such as OpenSearchIdentityPool.
Review all configuration settings and choose Create identity pool. Amazon Cognito will provision the identity pool and establish the necessary trust relationships. After creation, copy the identity pool ID.

To update your master role’s trust policy with the identity pool ID, follow these steps:

On the IAM console in the left navigation menu, choose Roles
From the list of roles, find and select OpenSearchMasterRole
Choose the Trust relationships tab and choose Edit trust policy
Replace PLACEHOLDER_IDENTITY_POOL_ID with your identity pool ID from the previous step
To finalize the configuration, choose Update policy

Your authentication infrastructure is now configured to provide secure, administrative access to OpenSearch Dashboards through Amazon Cognito authentication. Users who authenticate through the Cognito user pool will assume the master role and gain full administrative capabilities for your OpenSearch Service domain.
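For reference, a boto3 sketch of the identity pool wiring follows; the Region, user pool ID, app client ID, and account ID are placeholders, and the trust policy placeholder still needs to be replaced as described above.

```python
# Rough scripted equivalent of creating the identity pool and assigning the authenticated role.
import boto3

identity = boto3.client("cognito-identity")
pool = identity.create_identity_pool(
    IdentityPoolName="OpenSearchIdentityPool",
    AllowUnauthenticatedIdentities=False,
    CognitoIdentityProviders=[{
        "ProviderName": "cognito-idp.<REGION>.amazonaws.com/<USER_POOL_ID>",
        "ClientId": "<APP_CLIENT_ID>",
    }],
)
identity_pool_id = pool["IdentityPoolId"]

# Authenticated users from the user pool assume the master role
identity.set_identity_pool_roles(
    IdentityPoolId=identity_pool_id,
    Roles={"authenticated": "arn:aws:iam::<ACCOUNT_ID>:role/OpenSearchMasterRole"},
)

# Use this value to replace PLACEHOLDER_IDENTITY_POOL_ID in the master role's trust policy
print(identity_pool_id)
```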
Enable Amazon Cognito authentication for OpenSearch Dashboards
After setting up your Cognito user pool, app client, and identity pool, the next step is to configure your OpenSearch Service domain to use Cognito authentication for OpenSearch Dashboards. Follow these steps:

Navigate to the Amazon OpenSearch Service console
Select the name of the domain that you previously created
Choose the Security configuration tab and choose Edit
Scroll to the Amazon Cognito authentication section and select Enable Amazon Cognito authentication, as shown in the following screenshot
You’ll be prompted to provide the following:

Cognito user pool ID: Enter the user pool ID you created in a previous step
Cognito identity pool ID: Enter the identity pool ID you created

Review your settings and choose Save changes

The domain will update its configuration, which might take several minutes. You’ll receive a progress pop-up, as shown in the following screenshot.
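The same change can be made with the update_domain_config API; in this sketch, the RoleArn is an IAM role that allows OpenSearch Service to configure Amazon Cognito on your behalf (the console can create this role for you), and the other values are placeholders from the earlier steps.

```python
# Scripted equivalent of enabling Cognito authentication for OpenSearch Dashboards.
import boto3

opensearch = boto3.client("opensearch")
opensearch.update_domain_config(
    DomainName="bedrock-kb-domain",
    CognitoOptions={
        "Enabled": True,
        "UserPoolId": "<USER_POOL_ID>",
        "IdentityPoolId": "<IDENTITY_POOL_ID>",
        "RoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/<cognito-access-role-for-opensearch>",
    },
)
```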

Create a k-NN vector index in OpenSearch Service
This step involves creating a vector search–enabled index in your OpenSearch Service domain for Amazon Bedrock to store document embedding vectors, text chunks, and metadata. The index must contain three essential fields: an embedding vector field that stores numerical representations of your content (in floating-point or binary format), a text field that holds the raw text chunks, and a field for Amazon Bedrock managed metadata where Amazon Bedrock tracks critical information such as document IDs and source attributions. With proper index mapping, Amazon Bedrock Knowledge Bases can efficiently store and retrieve the components of your document data.
You create this index using the Dev Tools feature in OpenSearch Dashboards. To access Dev Tools in OpenSearch Dashboards, follow these steps:

Sign in to your OpenSearch Dashboards account
Navigate to your OpenSearch Dashboards URL
You’ll be redirected to the Cognito sign-in page
Sign in using the admin user credentials you created in the Add an admin user to the user pool section
Enter the email address you provided (admin@example.com)
Enter your password (if this is your first sign-in, you’ll be prompted to create a permanent password)
After successful authentication, you’ll be directed to the OpenSearch Dashboards home page
In the left navigation pane under the Management group, choose Dev Tools
Confirm you’re on the Console page, as shown in the following screenshot, where you’ll enter API commands

To define and create the index, copy the following command into the Dev Tools console and replace bedrock-kb-index with your preferred index name if needed. If you’re setting up a binary vector index (for example, to use binary embeddings with Amazon Titan Text Embeddings V2), include the additional required fields in your index mapping:

Set "data_type": "binary" for the vector field
Set "space_type": "hamming" (instead of "l2", which is used for float embeddings)

For more details, refer to the Amazon Bedrock Knowledge Bases setup documentation.

PUT /bedrock-kb-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "embeddings": {
        "type": "knn_vector",
        "dimension": <<embeddings size depending on embedding model used>>,
        "space_type": "l2",
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      },
      "AMAZON_BEDROCK_TEXT_CHUNK": {
        "type": "text",
        "index": true
      },
      "AMAZON_BEDROCK_METADATA": {
        "type": "text",
        "index": false
      }
    }
  }
}

The key components of this index mapping are:

k-NN enablement – Activates k-NN functionality in the index settings, allowing the use of knn_vector field type.
Vector field configuration – Defines the embeddings field for storing vector data, specifying dimension, space type, and data type based on the chosen embedding model. It’s critical to match the dimension with the embedding model’s output. Amazon Bedrock Knowledge Bases offers models such as Amazon Titan Embeddings V2 (with 256, 512, or 1,024 dimensions) and Cohere Embed (1,024 dimensions). For example, using Amazon Titan Embeddings V2 with 1,024 dimensions requires setting dimension: 1024 in the mapping. A mismatch between the model’s vector size and index mapping will cause ingestion failures, so it’s crucial to verify this value.
Vector method setup – Configures the hierarchical navigable small world (HNSW) algorithm with the Faiss engine, setting parameters for balancing index build speed and accuracy. Amazon Bedrock Knowledge Bases integration specifically requires the Faiss engine for OpenSearch Service k-NN index.
Text chunk storage – Establishes a field for storing raw text chunks from documents, enabling potential full-text queries.
Metadata field – Creates a field for Amazon Bedrock managed metadata, storing essential information without indexing for direct searches.

After pasting the command into the Dev Tools console, choose Run. If successful, you’ll receive a response similar to the one shown in the following screenshot.

Now, you should have a new index (for example, named bedrock-kb-index) on your domain with the preceding mapping. Make a note of the index name you created, the vector field name (embeddings), the text field name (AMAZON_BEDROCK_TEXT_CHUNK), and the metadata field name (AMAZON_BEDROCK_METADATA). In the next steps, you’ll grant Amazon Bedrock permission to use this index and then plug these details into the Amazon Bedrock Knowledge Bases setup.
With the vector index successfully created, your OpenSearch Service domain is now ready to store and retrieve embedding vectors. Next, you’ll configure IAM roles and access policies to facilitate secure interaction between Amazon Bedrock and your OpenSearch Service domain.
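If you prefer to create the index outside Dev Tools, the same request can be issued with the opensearch-py client, signed with credentials that map to your master IAM role. The sketch below assumes Amazon Titan Text Embeddings V2 at 1,024 dimensions and reuses the mapping shown above; host, Region, and dimension are placeholders to adjust.

```python
# Programmatic alternative to the Dev Tools PUT request (a sketch; values are placeholders).
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

region = "<REGION>"
host = "search-<domain-name>-<unique-identifier>.<region>.es.amazonaws.com"
auth = AWSV4SignerAuth(boto3.Session().get_credentials(), region, "es")

client = OpenSearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=auth, use_ssl=True, verify_certs=True,
    connection_class=RequestsHttpConnection,
)
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {"properties": {
        "embeddings": {"type": "knn_vector", "dimension": 1024, "space_type": "l2",
                       "method": {"name": "hnsw", "engine": "faiss",
                                  "parameters": {"ef_construction": 128, "m": 24}}},
        "AMAZON_BEDROCK_TEXT_CHUNK": {"type": "text", "index": True},
        "AMAZON_BEDROCK_METADATA": {"type": "text", "index": False},
    }},
}
client.indices.create(index="bedrock-kb-index", body=index_body)
```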
Initiate Amazon Bedrock knowledge base creation
Now that your OpenSearch Service domain and vector index are ready, it’s time to configure an Amazon Bedrock knowledge base to use this vector store. In this step, you will:

Begin creating a new knowledge base in the Amazon Bedrock console
Configure it to use your existing OpenSearch Service domain as a vector store

We will pause the knowledge base creation midway to update OpenSearch Service access policies before finalizing the setup.
To create the Amazon Bedrock knowledge base in the console, follow these steps. For detailed instructions, refer to Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases in the AWS documentation. The following steps provide a streamlined overview of the general process:

On the Amazon Bedrock console, go to Knowledge Bases and choose Create with vector store.
Enter a name and description and choose Create and use a new service role for the runtime role. Choose Amazon S3 as the data source for the knowledge base.
Provide the details for the data source, including the data source name, location, and Amazon S3 URI, and keep the parsing and chunking strategies at their defaults.
Choose Amazon Titan Embeddings v2 as your embeddings model to convert your data. Make sure the embeddings dimensions match what you configured in your index mappings in the Create an OpenSearch Service domain for vector search section because mismatches will cause the integration to fail.

To configure OpenSearch Service Managed Cluster as the vector store, follow these steps:

Under Vector database, select Use an existing vector store and for Vector store, select OpenSearch Service Managed Cluster, as shown in the following screenshot

Enter the details from your OpenSearch Service domain setup in the following fields, as shown in the following screenshot:

Domain ARN: Provide the ARN of your OpenSearch Service domain.
Domain endpoint: Enter the endpoint URL of your OpenSearch Service domain.
Vector index name: Specify the name of the vector index created in your OpenSearch Service domain.
Vector field name: Enter the vector field name from your index mapping (for example, embeddings).
Text field name: Enter the text field name (for example, AMAZON_BEDROCK_TEXT_CHUNK).
Bedrock-managed metadata field name: Enter the metadata field name (for example, AMAZON_BEDROCK_METADATA).

Don’t choose Create yet. Amazon Bedrock is ready to create the knowledge base, but you need to configure OpenSearch Service access permissions first. Copy the ARN of the new IAM service role that Amazon Bedrock will use for this knowledge base (the console displays the role ARN you selected or just created). Keep this ARN handy and leave the Amazon Bedrock console open (pause the creation process here).
Configure fine-grained access control permissions in OpenSearch Service
With the IAM service role ARN copied, configure fine-grained permissions in the OpenSearch dashboard. Fine-grained access control provides role-based permission management at a granular level (indices, documents, and fields), so that your Amazon Bedrock knowledge base has precisely controlled access. Follow these steps:

On the OpenSearch Service console, navigate to your OpenSearch Service domain.
Choose the URL for OpenSearch Dashboards. It typically looks like: https://<your-domain-endpoint>/_dashboards/
From the OpenSearch Dashboards interface, in the left navigation pane, choose Security, then choose Roles.
Choose Create role and provide a meaningful name, such as bedrock-knowledgebase-role.
Under Cluster Permissions, enter the following permissions necessary for Amazon Bedrock operations, as shown in the following screenshot:

indices:data/read/msearch
indices:data/write/bulk*
indices:data/read/mget*

Under Index permissions:

Specify the exact vector index name you created previously (for example, bedrock-kb-index).
Choose Create new permission group, then choose Create new action group.
Add the following specific permissions, essential for Amazon Bedrock Knowledge Bases:

indices:admin/get
indices:data/read/msearch
indices:data/read/search
indices:data/write/index
indices:data/write/update
indices:data/write/delete
indices:data/write/delete/byquery
indices:data/write/bulk*
indices:admin/mapping/put
indices:data/read/mget*

Confirm by choosing Create.

To map the Amazon Bedrock IAM service role (copied earlier) to the newly created OpenSearch Service role, follow these steps:

In OpenSearch Dashboards, navigate to Security and then Roles.
Locate and open the role you created in the previous step (bedrock-knowledgebase-role).
Choose the Mapped users tab and choose Manage mapping, as shown in the following screenshot.
In the Backend roles section, paste the knowledge base’s service role ARN you copied from Amazon Bedrock (for example, arn:aws:iam::<accountId>:role/service-role/BedrockKnowledgeBaseRole). When mapping this IAM role to an OpenSearch Service role, the IAM role doesn’t need to exist in your AWS account at the time of mapping. You’re referencing its ARN to establish the association within the OpenSearch backend. This allows OpenSearch Service to recognize and authorize the role when it’s eventually created and used. Make sure that the ARN is correctly specified to facilitate proper permission mapping.
Choose Map to finalize the connection between the IAM role and OpenSearch Service permissions.
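The same role and mapping can also be created through the Security plugin's REST API. The sketch below uses opensearch-py with SigV4 signing, so the calling credentials must map to the master role; the endpoint, Region, index name, and service role ARN are placeholders.

```python
# Scripted equivalent of the role and role-mapping steps above (a sketch; values are placeholders).
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

auth = AWSV4SignerAuth(boto3.Session().get_credentials(), "<REGION>", "es")
client = OpenSearch(hosts=[{"host": "<your-domain-endpoint>", "port": 443}],
                    http_auth=auth, use_ssl=True, verify_certs=True,
                    connection_class=RequestsHttpConnection)

role_body = {
    "cluster_permissions": ["indices:data/read/msearch",
                            "indices:data/write/bulk*",
                            "indices:data/read/mget*"],
    "index_permissions": [{
        "index_patterns": ["bedrock-kb-index"],
        "allowed_actions": ["indices:admin/get", "indices:data/read/msearch",
                            "indices:data/read/search", "indices:data/write/index",
                            "indices:data/write/update", "indices:data/write/delete",
                            "indices:data/write/delete/byquery", "indices:data/write/bulk*",
                            "indices:admin/mapping/put", "indices:data/read/mget*"],
    }],
}
# Create the role, then map the Amazon Bedrock service role ARN as a backend role
client.transport.perform_request(
    "PUT", "/_plugins/_security/api/roles/bedrock-knowledgebase-role", body=role_body)
mapping_body = {"backend_roles": ["arn:aws:iam::<accountId>:role/service-role/BedrockKnowledgeBaseRole"]}
client.transport.perform_request(
    "PUT", "/_plugins/_security/api/rolesmapping/bedrock-knowledgebase-role", body=mapping_body)
```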

Complete knowledge base creation and verify resource-based policy
With fine-grained permissions in place, return to the paused Amazon Bedrock console to finalize your knowledge base setup. Confirm that all OpenSearch Service domain details are correctly entered, including the domain endpoint, domain ARN, index name, vector field name, text field name, and metadata field name. Choose Create knowledge base.
Amazon Bedrock will use the configured IAM service role to securely connect to your OpenSearch Service domain. After the setup is complete, the knowledge base status should change to Available, confirming successful integration.
Understanding access policies
When integrating OpenSearch Service Managed Cluster with Amazon Bedrock Knowledge Bases, it’s important to understand how access control works across different layers.
For same-account configurations (where both the knowledge base and OpenSearch Service domain are in the same AWS account), no updates to the OpenSearch Service domain’s resource-based policy are required as long as fine-grained access control is enabled and your IAM role is correctly mapped. In this case, IAM permissions and fine-grained access control mappings are sufficient to authorize access. However, if the domain’s resource-based policy includes deny statements targeting your knowledge base service role or principals, access will be blocked—regardless of IAM or fine-grained access control settings. To avoid unintended failures, make sure the policy doesn’t explicitly restrict access to the Amazon Bedrock Knowledge Bases service role.
For cross-account access (when the IAM role used by Amazon Bedrock Knowledge Bases belongs to a different AWS account than the OpenSearch Service domain), you must include an explicit allow statement in the domain’s resource-based policy for the external role. Without this, access will be denied even if all other permissions are correctly configured.

To begin using your knowledge base, select your configured data source and initiate the sync process. This action starts the ingestion of your Amazon S3 data. After synchronization is complete, your knowledge base is ready for information retrieval.
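Syncing can also be triggered programmatically with the StartIngestionJob API; in this short sketch, the knowledge base and data source IDs are placeholders from your setup.

```python
# Start a sync (ingestion job) for the knowledge base's data source.
import boto3

bedrock_agent = boto3.client("bedrock-agent")
job = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="<KNOWLEDGE_BASE_ID>",
    dataSourceId="<DATA_SOURCE_ID>",
)
print(job["ingestionJob"]["status"])   # poll until the job completes
```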
Conclusion
Integrating Amazon Bedrock Knowledge Bases with OpenSearch Service Managed Cluster offers a powerful solution for vector storage and retrieval in AI applications. In this post, we walked you through the process of setting up an OpenSearch Service domain, configuring a vector index, and connecting it to an Amazon Bedrock knowledge base. With this setup, you’re now equipped to use the full potential of vector search capabilities in your AI-driven applications, enhancing your ability to process and retrieve information from large datasets efficiently.
Get started with Amazon Bedrock Knowledge Bases and let us know your thoughts in the comments section.

About the authors
Manoj Selvakumar is a Generative AI Specialist Solutions Architect at AWS, where he helps startups design, prototype, and scale intelligent, agent-driven applications using Amazon Bedrock. He works closely with founders to turn ambitious ideas into production-ready solutions—bridging startup agility with the advanced capabilities of AWS’s generative AI ecosystem. Before joining AWS, Manoj led the development of data science solutions across healthcare, telecom, and enterprise domains. He has delivered end-to-end machine learning systems backed by solid MLOps practices—enabling scalable model training, real-time inference, continuous evaluation, and robust monitoring in production environments.
Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High-Performance Computing on AWS, and a member of the Board of Directors for Women in Manufacturing Education Foundation Board. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.
Dani Mitchell is a Generative AI Specialist Solutions Architect at AWS. He is focused on helping accelerate enterprises across the world on their generative AI journeys with Amazon Bedrock.
Juan Camilo Del Rio Cuervo is a Software Developer Engineer at Amazon Bedrock Knowledge Bases team. He is focused on building and improving RAG experiences for AWS customers.

Monitor agents built on Amazon Bedrock with Datadog LLM Observability

This post was co-written with Mohammad Jama, Yun Kim, and Barry Eom from Datadog.
The emergence of generative AI agents in recent years has transformed the AI landscape, driven by advances in large language models (LLMs) and natural language processing (NLP). The focus is shifting from simple AI assistants to Agentic AI systems that can think, iterate, and take actions to solve complex tasks. These Agentic AI systems may use multiple agents, interact with tools both within and outside organizational boundaries to make decisions, and connect with knowledge sources to learn about processes. While these autonomous systems help organizations improve workplace productivity, streamline business workflows, and transform research, they also introduce additional operational requirements. To ensure reliability, performance, and responsible AI use, teams need observability solutions purpose-built for tracking agent behavior, coordination, and execution flow.
The multi-agentic system collaboration capabilities of Amazon Bedrock Agents make it straightforward and fast to build these systems. Developers can configure a set of coordinated agents by breaking down complex user requests into multiple steps, calling internal APIs, accessing knowledge bases, and maintaining contextual conversations—all without managing the logic themselves.
For organizations to scale Agentic AI systems, they need robust observability solutions to ensure reliability, performance, and responsible use of AI technology.
Datadog LLM Observability helps teams operate production-grade LLM applications with confidence by monitoring performance, quality, and security issues—such as latency spikes, hallucinations, tool selection, or prompt injection attempts. With full visibility into model behavior and application context, developers can identify, troubleshoot, and resolve issues faster.
We’re excited to announce a new integration between Datadog LLM Observability and Amazon Bedrock Agents that helps monitor agentic applications built on Amazon Bedrock. Beyond tracking the overall health of agentic applications, developers can track step-by-step agent executions across complex workflows and monitor foundational model calls, tool invocations, and knowledge base interactions.
In this post, we’ll explore how Datadog’s LLM Observability provides the visibility and control needed to successfully monitor, operate, and debug production-grade agentic applications built on Amazon Bedrock Agents.
Solution Overview
Datadog’s integration with Amazon Bedrock Agents offers comprehensive observability tailored for agentic generative AI applications that programmatically invoke agents by using the InvokeAgent API. This integration captures detailed telemetry from each agent execution, enabling teams to monitor, troubleshoot, and optimize their LLM applications effectively.
Optimize Performance and Control Costs
As teams scale their agentic applications, each agent interaction—whether it’s retrieving knowledge, invoking tools, or calling models—can impact latency and cost. Without visibility into how these resources are used, it’s difficult to pinpoint inefficiencies or control spend as workflows grow more complex. For applications built on Bedrock Agents, Datadog automatically captures and provides:

Latency monitoring: Track the time taken for each step and overall execution to identify bottlenecks
Error rate tracking: Observe the frequency and types of errors encountered to improve reliability and debug issues
Token usage analysis: Monitor the number of tokens consumed during processing to manage costs
Tool invocation details: Gain insights into external API calls made by agents, such as Lambda functions or knowledge base queries

This LLM Observability dashboard presents a detailed overview of an AI-powered support chatbot’s performance and usage patterns.

Monitor Complex Agentic Workflows
Agents can perform specific tasks, invoke tools, access knowledge bases, and maintain contextual conversations. Datadog provides comprehensive visibility into agent workflows by capturing detailed telemetry from Amazon Bedrock Agents, so teams can monitor, troubleshoot, and optimize their LLM applications. The integration provides:

End-to-end execution visibility: Visualize each operation of an agent's workflow, from pre-processing through post-processing, including orchestration and guardrail evaluations
Efficient troubleshooting: Debug with detailed execution insights to quickly pinpoint failure points and understand error contexts

This LLM Observability trace details the execution of a travel agent bot using Amazon Bedrock.

Evaluate output, tool selection, and overall quality
In agentic applications, it's not enough to know that a task completed; you also need to know how well it was completed. For example, are generated summaries accurate and on-topic? Are user-facing answers clear, helpful, and free of harmful content? Did an agent select the right tool? Without visibility into these questions, silent failures can slip through and undercut intended outcomes—like reducing handoffs to human agents or automating repetitive decisions.
Datadog LLM Observability helps teams assess the quality and safety of their LLM applications by evaluating the inputs and outputs of model calls—both at the root level and within nested steps of a workflow. With this integration, you can:

Run built-in evaluations: Detect quality, safety, and security issues such as prompt injection, off-topic completions, or toxic content with Datadog LLM Observability Evaluations
Submit custom evaluations: Visualize domain-specific quality metrics, such as whether an output matched expected formats or adhered to policy guidelines
Monitor guardrails: Inspect when and why content filters are triggered during execution.

These insights appear directly alongside latency, cost, and trace data—helping teams identify not just how an agent behaved, but whether it produced the right result.
How to get started
Datadog Bedrock Agent Observability is initially available for Python applications, with additional language support on the roadmap. Tracing Bedrock Agent invocations is handled by integrating Datadog’s ddtrace library into your application.
Prerequisites

An AWS account with Bedrock access enabled.
A Python-based application using Amazon Bedrock. If needed, see the examples in amazon-bedrock-samples.
A Datadog account and API key.

Instrumentation takes just a few steps; consult the latest LLM Observability Python SDK Reference for full details. In most cases, only two lines are required to add ddtrace to your application:

from ddtrace.llmobs import LLMObs
LLMObs.enable()

The ddtrace library can be configured using environment variables or at runtime passing values to the enable function. Please consult the SDK reference above and also the setup documentation for more details and customization options.
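As a minimal sketch of what a traced agent call can look like (the agent ID, alias ID, and session ID below are placeholders, and the environment-variable configuration mentioned in the comments is one common approach; see the SDK reference for the authoritative options):

# Minimal sketch: invoke an Amazon Bedrock agent with ddtrace LLM Observability enabled.
# The agent ID, alias ID, and session ID are placeholders for illustration.
import boto3
from ddtrace.llmobs import LLMObs

# Configuration can come from environment variables (for example, DD_API_KEY)
# or be passed to enable(); consult the SDK reference for the supported options.
LLMObs.enable()

client = boto3.client("bedrock-agent-runtime")

response = client.invoke_agent(
    agentId="AGENT_ID",             # placeholder
    agentAliasId="AGENT_ALIAS_ID",  # placeholder
    sessionId="demo-session-1",
    inputText="Summarize my open support tickets.",
)

# InvokeAgent streams the completion back as events; collect the text chunks.
completion = ""
for event in response["completion"]:
    chunk = event.get("chunk")
    if chunk:
        completion += chunk["bytes"].decode("utf-8")
print(completion)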
Finally, be sure to stop or remove any applications when you are finished to manage costs.
Conclusion
Datadog is an AWS Specialization Partner and AWS Marketplace Seller that has been building integrations with AWS services for over a decade, amassing a growing catalog of 100+ integrations. This new Amazon Bedrock Agents integration builds upon Datadog's strong track record of AWS partnership success. For organizations looking to implement generative AI solutions, this capability provides essential observability tools to ensure their agentic AI applications built on Amazon Bedrock Agents perform optimally and deliver business value.
To get started, see Datadog LLM Observability.
To learn more about how Datadog integrates with Amazon AI/ML services, see Monitor Amazon Bedrock with Datadog and Monitoring Amazon SageMaker with Datadog.
If you don’t already have a Datadog account, you can sign up for a free 14-day trial today.

About the authors
Nina Chen is a Customer Solutions Manager at AWS specializing in leading software companies to leverage the power of the AWS cloud to accelerate their product innovation and growth. With over 4 years of experience working in the strategic Independent Software Vendor (ISV) vertical, Nina enjoys guiding ISV partners through their cloud transformation journeys, helping them optimize their cloud infrastructure, driving product innovation, and delivering exceptional customer experiences.
Sujatha Kuppuraju is a Principal Solutions Architect at AWS, specializing in cloud and generative AI security. She collaborates with software companies' leadership teams to architect secure, scalable solutions on AWS and guide strategic product development. Leveraging her expertise in cloud architecture and emerging technologies, Sujatha helps organizations optimize offerings, maintain robust security, and bring innovative products to market in an evolving tech landscape.
Jason Mimick is a Partner Solutions Architect at AWS supporting top customers and working closely with product, engineering, marketing, and sales teams daily. Jason focuses on enabling product development and sales success for partners and customers across all industries.
Mohammad Jama is a Product Marketing Manager at Datadog. He leads go-to-market for Datadog’s AWS integrations, working closely with product, marketing, and sales to help companies observe and secure their hybrid and AWS environments.
Yun Kim is a software engineer on Datadog's LLM Observability team, where he specializes in developing client-side SDKs and integrations. He is excited about the development of trustworthy, transparent Generative AI models and frameworks.
Barry Eom is a Product Manager at Datadog, where he has launched and leads the development of AI/ML and LLM Observability solutions. He is passionate about enabling teams to create and productionize ethical and humane technologies.

How PayU built a secure enterprise AI assistant using Amazon Bedrock

This is a guest post co-written with Rahul Ghosh, Sandeep Kumar Veerlapati, Rahmat Khan, and Mudit Chopra from PayU.
PayU offers a full-stack digital financial services system that serves the financial needs of merchants, banks, and consumers through technology.
As a Central Bank-regulated financial institution in India, we recently observed a surge in our employees’ interest in using public generative AI assistants. Our teams found these AI assistants helpful for a variety of tasks, including troubleshooting technical issues by sharing error or exception details, generating email responses, and rephrasing English content for internal and external communications. However, this growing reliance on public generative AI tools quickly raised red flags for our Information Security (Infosec) team. We became increasingly concerned about the risks of sensitive data—such as proprietary system information, confidential customer details, and regulated documentation—being transmitted to and processed by external, third-party AI providers. Given our strict compliance requirements and the critical importance of data privacy in the financial sector, we made the decision to restrict access to these public generative AI systems. This move was necessary to safeguard our organization against potential data leaks and regulatory breaches, but it also highlighted the need for a secure, compliance-aligned alternative that would allow us to harness the benefits of generative AI without compromising on security policies.
In this post, we explain how we equipped the PayU team with an enterprise AI solution and democratized AI access using Amazon Bedrock, without compromising on data residency requirements.
Solution overview
As a regulated entity, we were required to keep all our data within India and securely contained within our PayU virtual private cloud (VPC). Therefore, we sought a solution that could use the power of generative AI to foster innovation and enhance operational efficiency, while simultaneously enabling robust data security measures and geo-fencing of the utilized data. Beyond foundational use cases like technical troubleshooting, email drafting, and content refinement, we aimed to equip teams with a natural language interface to query enterprise data across domains. This included enabling self-service access to business-critical insights—such as loan disbursement trends, repayment patterns, customer demographics, and transaction analytics—as well as HR policy clarifications, through intuitive, conversational interactions. Our vision was to empower employees with instant, AI-driven answers derived from internal systems without exposing sensitive data to external systems, thereby aligning with India’s financial regulations and our internal governance frameworks.
We chose Amazon Bedrock because it is a fully managed service that provides access to a wide selection of high-performing foundation models (FMs) from industry leaders such as AI21 Labs, Anthropic, Cohere, DeepSeek, Luma, Meta, Mistral AI, poolside (coming soon), Stability AI, TwelveLabs (coming soon), Writer, and Amazon. The models are accessible through a single, unified API. Amazon Bedrock also offers a comprehensive suite of features that align with our requirements, including Amazon Bedrock Agents for workflow automation and Amazon Bedrock Knowledge Bases for enterprise data integration. In addition, Amazon Bedrock Guardrails provides essential safeguards across model, prompt, and application levels for blocking undesirable and harmful multimodal content and helped filter hallucinated responses in our Retrieval Augmented Generation (RAG) and agentic workflows.
For the frontend, we selected Open WebUI, an open-source solution known for its extensibility, rich feature set, and intuitive, user-friendly interface, so our teams can interact seamlessly with the AI capabilities we’ve integrated.
The following diagram illustrates the solution architecture.

In the following sections, we discuss the key components of the solution in more detail.
Open WebUI
We use Open WebUI as our browser-based frontend application. Open WebUI is an open source, self-hosted application designed to provide a user-friendly and feature-rich interface for interacting with large language models (LLMs). It supports integration with a wide range of models and can be deployed in private environments to help protect data privacy and security. Open WebUI supports enterprise features like single sign-on (SSO), so users can authenticate seamlessly using their organization's identity provider, streamlining access and reducing password-related risks. It also offers role-based access control (RBAC), allowing administrators to define granular user roles—such as admin and user—so that permissions, model access, and data visibility can be tailored to organizational needs. This supports the protection of sensitive information.
We connected Open WebUI with our identity provider for enabling SSO. RBAC was implemented by defining functional roles—such as loan operations or HR support—directly tied to user job functions. These roles govern permissions to specific agents, knowledge bases, and FMs so that teams only access tools relevant to their responsibilities. Configurations, user conversation histories, and usage metrics are securely stored in a persistent Amazon Relational Database Service (Amazon RDS) for PostgreSQL database, enabling audit readiness and supporting compliance. For deployment, we containerized Open WebUI and orchestrated it on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, using automatic scaling to dynamically adjust resources based on demand while maintaining high availability.
Access Gateway
Access Gateway serves as an intermediary between Open WebUI and Amazon Bedrock, translating Amazon Bedrock APIs to a compatible schema for Open WebUI. This component enables the frontend to access FMs, Amazon Bedrock Agents, and Amazon Bedrock Knowledge Bases.
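The gateway itself is beyond the scope of this post, but the translation idea can be sketched as a thin proxy that accepts OpenAI-style requests and forwards them to the Amazon Bedrock Converse API. The endpoint below is a simplified, hypothetical illustration (the model ID and response shape are assumptions), not PayU's implementation:

# Simplified sketch of an OpenAI-style chat endpoint that forwards to Amazon Bedrock.
# Illustrative only; a real gateway also handles auth, streaming, agents, and errors.
import boto3
from fastapi import FastAPI

app = FastAPI()
bedrock = boto3.client("bedrock-runtime")

@app.post("/v1/chat/completions")
def chat_completions(body: dict):
    # Map OpenAI-style messages to the Bedrock Converse API message format.
    messages = [
        {"role": m["role"], "content": [{"text": m["content"]}]}
        for m in body["messages"] if m["role"] != "system"
    ]
    result = bedrock.converse(
        modelId=body.get("model", "anthropic.claude-3-haiku-20240307-v1:0"),  # placeholder default
        messages=messages,
    )
    text = result["output"]["message"]["content"][0]["text"]
    # Return a minimal OpenAI-compatible response shape for the frontend.
    return {"choices": [{"message": {"role": "assistant", "content": text}}]}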
Amazon Bedrock
Amazon Bedrock offers a diverse selection of FMs, which we have integrated into the web UI to enable the PayU workforce to efficiently perform tasks such as text summarization, email drafting, and technical troubleshooting. In addition, we developed custom AI agents using Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases, using our organizational data. These tailored agents are also accessible through the frontend application.
To enable secure, role-based access to organizational insights, we deployed specialized agents tailored to specific business functions—including hr-policy-agent, credit-disbursal-agent, collections-agent, and payments-demographics-agent. Access to these agents is governed by user roles and job functions. These agents follow a combination of RAG and text-to-SQL approaches. For example, hr-policy-agent uses RAG, querying a vectorized knowledge base in Amazon OpenSearch Service, whereas credit-disbursal-agent uses a text-to-SQL pipeline, translating natural language queries into structured SQL commands to extract insights from an Amazon Simple Storage Service (Amazon S3) based data lake. These approaches provide precise, context-aware responses while maintaining data governance. Implementation details of the text-to-SQL workflow are described in the following diagram.

The workflow consists of the following steps:

We maintain our business-specific datamarts in the data lakehouse in Amazon S3, enriched with metadata and presented in a highly denormalized form. This data lakehouse, internally referred to as Luna, is built using Apache Spark and Apache Hudi. The datamarts are crucial for achieving higher accuracy and improved performance in our systems. The data is exposed as AWS Glue tables, with the AWS Glue Data Catalog functioning as the Hive metastore, and can be queried using Amazon Athena, enabling efficient access and analytical capabilities for our business needs.
HR policy documents are stored in another S3 bucket. Using Amazon Bedrock Knowledge Bases, these are vectorized and stored in OpenSearch Service.
Depending on their role, employees can access FMs and agents through the Open WebUI interface. They have the option to choose either an FM or an agent from a dropdown menu. When a user selects an FM, their question is answered directly using the model’s pre-trained knowledge, without involving an agent. If an agent is selected, the corresponding agent is invoked to handle the request.
To facilitate orchestration, an instruction prompt is supplied to the Amazon Bedrock agent. The agent interprets this prompt and manages the workflow by delegating specific actions to the underlying LLM, so that each step is handled appropriately based on the input received and the orchestration logic defined for the workflow. The orchestration step can extract context from the knowledge base or invoke an action group. For example, the text-to-SQL agent is instructed to check the query syntax first, fix the query by reading any error, and only then execute the final query.
The primary function of an action group in an Amazon Bedrock agent is to organize and execute multiple actions in response to a user’s input or request. This enables the agent to carry out a sequence of coordinated steps to effectively meet the user’s needs, rather than being limited to a single action. Each action group includes a schema, which defines the required format and parameters. This schema allows the agent to interact accurately with the compute layer, such as an AWS Lambda function, by supplying the required structure for communication.
The Lambda function serves as the execution engine, running SQL queries and connecting with Athena to process data (a minimal sketch of this step follows the list). To enable secure and efficient operation, it is essential to configure resource policies and permissions correctly, which helps maintain the integrity of the serverless compute environment.
Athena is a serverless query service that analyzes Amazon S3 data using standard SQL, with AWS Glue managing the data catalog. AWS Glue crawlers read data from Amazon S3 and create queryable tables for Athena, and Athena writes query results back to Amazon S3. This integration, supported by crawlers and the AWS Glue Data Catalog, streamlines data management and analysis.
For questions related to HR policies and other enterprise documents, the system uses Amazon Bedrock Knowledge Bases. These knowledge bases are built from the HR policy documents stored in Amazon S3, with semantic search capabilities powered by vector embeddings in OpenSearch Service.
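As referenced in the Lambda step above, the Athena execution can be sketched roughly as follows (the database name, output location, and event shape are placeholders, not PayU's configuration):

# Minimal sketch of the Athena execution step inside a Lambda handler.
# The database, output location, and event fields are placeholders for illustration.
import time
import boto3

athena = boto3.client("athena")

def lambda_handler(event, context):
    sql = event["generated_sql"]  # SQL produced by the text-to-SQL agent
    start = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "luna_datamart"},  # placeholder database
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},  # placeholder
    )
    query_id = start["QueryExecutionId"]

    # Poll until the query finishes; production code should bound retries and handle errors.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        return {"status": state}
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    return {"status": state, "rows": rows}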

Private access to foundation models
Given that our organizational data was included as context in prompts sent to Amazon Bedrock and the generated responses could contain sensitive information, we needed a robust solution to help prevent exposure of this data to the public internet. Our goal was to establish a secure data perimeter that would help mitigate potential risks associated with internet-facing communication. To achieve this, we implemented AWS PrivateLink, creating a private and dedicated connection between our VPC and Amazon Bedrock. With this configuration, Amazon Bedrock is accessible as though it resides within our own VPC, removing the need for an internet gateway or NAT gateway. By setting up an interface endpoint with PrivateLink, we provisioned a network interface directly in our VPC subnet, so that data remains securely within the AWS network. This architecture not only strengthens our security posture by minimizing external exposure but also streamlines connectivity for our internal applications.
The following diagram illustrates this architecture.
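As a rough illustration of this setup (the VPC, subnet, and security group IDs and the Region are placeholders, and the endpoint can equally be created through the console or infrastructure as code), an interface endpoint for the Bedrock runtime can be provisioned with boto3 like this:

# Minimal sketch: provision an interface VPC endpoint for the Bedrock runtime.
# All resource IDs and the Region below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="ap-south-1")

endpoint = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                        # placeholder
    ServiceName="com.amazonaws.ap-south-1.bedrock-runtime",
    SubnetIds=["subnet-0123456789abcdef0"],               # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],            # placeholder
    PrivateDnsEnabled=True,  # SDK calls to Bedrock resolve to the private endpoint
)
print(endpoint["VpcEndpoint"]["VpcEndpointId"])
# Note: the Bedrock Agents and Knowledge Bases runtimes expose their own endpoint
# services, which may need separate interface endpoints depending on the architecture.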

Conclusion
The introduction of this application has generated significant interest in generative AI within PayU. Employees are now more aware of AI’s potential to address complex business challenges. This enthusiasm has led to the addition of multiple business workflows to the application. Collaboration between business units and the technical team has accelerated digital transformation efforts. After the rollout, internal estimates revealed a 30% improvement in the productivity of the business analyst team. This boost in efficiency has made it possible for analysts to focus on more strategic tasks and reduced turnaround times. Overall, the application has inspired a culture of innovation and continuous learning across the organization.
Ready to take your organization’s AI capabilities to the next level? Dive into the technical details of Amazon Bedrock, Amazon Bedrock Agents, and Amazon Bedrock Guardrails in the Amazon Bedrock User Guide, and explore hands-on examples in the Amazon Bedrock Agent GitHub repo to kickstart your implementation.

About the authors
Deepesh Dhapola is a Senior Solutions Architect at AWS India, where he architects high-performance, resilient cloud solutions for financial services and fintech organizations. He specializes in using advanced AI technologies—including generative AI, intelligent agents, and the Model Context Protocol (MCP)—to design secure, scalable, and context-aware applications. With deep expertise in machine learning and a keen focus on emerging trends, Deepesh drives digital transformation by integrating cutting-edge AI capabilities to enhance operational efficiency and foster innovation for AWS customers. Beyond his technical pursuits, he enjoys quality time with his family and explores creative culinary techniques.
Rahul Ghosh is a seasoned Data & AI Engineer with deep expertise in cloud-based data architectures, large-scale data processing, and modern AI technologies, including generative AI, LLMs, Retrieval Augmented Generation (RAG), and agent-based systems. His technical toolkit spans across Python, SQL, Spark, Hudi, Airflow, Kubeflow, and other modern orchestration frameworks, with hands-on experience in AWS, Azure, and open source systems. Rahul is passionate about building reliable, scalable, and ethically grounded solutions at the intersection of data and intelligence. Outside of work, he enjoys mentoring budding technologists and doing social work rooted in his native rural Bengal.

Sandeep Kumar Veerlapati is an Associate Director – Data Engineering at PayU Finance, where he focuses on building strong, high-performing teams and defining effective data strategies. With expertise in cloud data systems, data architecture, and generative AI, Sandeep brings a wealth of experience in creating scalable and impactful solutions. He has a deep technical background with tools like Spark, Airflow, Hudi, and the AWS Cloud. Passionate about delivering value through data, he thrives on leading teams to solve real-world challenges. Outside of work, Sandeep enjoys mentoring, collaborating, and finding new ways to innovate with technology.
Mudit Chopra is a skilled DevOps Engineer and generative AI enthusiast, with expertise in automating workflows, building robust CI/CD pipelines, and managing cloud-based infrastructures across systems. With a passion for streamlining delivery pipelines and enabling cross-team collaboration, they facilitate seamless product deployments. Dedicated to continuous learning and innovation, Mudit thrives on using AI-driven tools to enhance operational efficiency and create smarter, more agile systems. Always staying ahead of tech trends, he is dedicated to driving digital transformation and delivering impactful solutions.
Rahmat Khan is a driven AI & Machine Learning Engineer and entrepreneur, with a deep focus on building intelligent, real-world systems. His work spans the full ML lifecycle—data engineering, model development, and deployment at scale—with a strong grounding in practical AI applications. Over the years, he has explored everything from generative models to multimodal systems, with an eye toward creating seamless user experiences. Driven by curiosity and a love for experimentation, he enjoys solving open-ended problems, shipping fast, and learning from the edge of what’s possible. Outside of tech, he’s equally passionate about nurturing ideas, mentoring peers, and staying grounded in the bigger picture of why we build.
Saikat Dey is a Technical Account Manager (TAM) at AWS India, supporting strategic fintech customers in harnessing the power of the cloud to drive innovation and business transformation. As a trusted advisor, he bridges technical and business teams, delivering architectural best practices, proactive guidance, and strategic insights that enable long-term success on AWS. With a strong passion for generative AI, Saikat partners with customers to identify high-impact use cases and accelerate their adoption of generative AI solutions using services like Amazon Bedrock and Amazon Q. Outside of work, he actively explores emerging technologies, follows innovation trends, and enjoys traveling to experience diverse cultures and perspectives.

Tracing OpenAI Agent Responses using MLFlow

MLflow is an open-source platform for managing and tracking machine learning experiments. When used with the OpenAI Agents SDK, MLflow automatically:

Logs all agent interactions and API calls

Captures tool usage, input/output messages, and intermediate decisions

Tracks runs for debugging, performance analysis, and reproducibility

This is especially useful when you're building multi-agent systems where different agents collaborate or call functions dynamically.

In this tutorial, we’ll walk through two key examples: a simple handoff between agents, and the use of agent guardrails — all while tracing their behavior using MLflow.

Setting up the dependencies

Installing the libraries

pip install openai-agents mlflow pydantic python-dotenv

OpenAI API Key

To get an OpenAI API key, visit https://platform.openai.com/settings/organization/api-keys and generate a new key. If you’re a new user, you may need to add billing details and make a minimum payment of $5 to activate API access.

Once the key is generated, create a .env file and enter the following:

OPENAI_API_KEY=<YOUR_API_KEY>

Replace <YOUR_API_KEY> with the key you generated.

Multi-Agent System (multi_agent_demo.py) 

In this script (multi_agent_demo.py), we build a simple multi-agent assistant using the OpenAI Agents SDK, designed to route user queries to either a coding expert or a cooking expert. We enable mlflow.openai.autolog(), which automatically traces and logs all agent interactions with the OpenAI API — including inputs, outputs, and agent handoffs — making it easy to monitor and debug the system. MLflow is configured to use a local file-based tracking URI (./mlruns) and logs all activity under the experiment name "Agent-Coding-Cooking".

import mlflow, asyncio
from agents import Agent, Runner
from dotenv import load_dotenv

load_dotenv()

mlflow.openai.autolog()  # Auto-trace every OpenAI call
mlflow.set_tracking_uri("./mlruns")
mlflow.set_experiment("Agent-Coding-Cooking")

coding_agent = Agent(name="Coding agent",
                     instructions="You only answer coding questions.")

cooking_agent = Agent(name="Cooking agent",
                      instructions="You only answer cooking questions.")

triage_agent = Agent(
    name="Triage agent",
    instructions="If the request is about code, handoff to coding_agent; "
                 "if about cooking, handoff to cooking_agent.",
    handoffs=[coding_agent, cooking_agent],
)

async def main():
    res = await Runner.run(triage_agent,
                           input="How do I boil pasta al dente?")
    print(res.final_output)

if __name__ == "__main__":
    asyncio.run(main())

MLFlow UI

To open the MLflow UI and view all the logged agent interactions, run the following command in a new terminal:

mlflow ui

This will start the MLflow tracking server and display a prompt indicating the URL and port where the UI is accessible — usually http://localhost:5000 by default.

We can view the entire interaction flow in the Tracing section — from the user’s initial input to how the assistant routed the request to the appropriate agent, and finally, the response generated by that agent. This end-to-end trace provides valuable insight into decision-making, handoffs, and outputs, helping you debug and optimize your agent workflows.

Tracing Guardrails (guardrails.py) 

In this example, we implement a guardrail-protected customer support agent using the OpenAI Agents SDK with MLflow tracing. The agent is designed to help users with general queries but is restricted from answering medical-related questions. A dedicated guardrail agent checks for such inputs, and if detected, blocks the request. MLflow captures the entire flow — including guardrail activation, reasoning, and agent response — providing full traceability and insight into safety mechanisms.

import mlflow, asyncio
from pydantic import BaseModel
from agents import (
    Agent, Runner,
    GuardrailFunctionOutput, InputGuardrailTripwireTriggered,
    input_guardrail, RunContextWrapper)

from dotenv import load_dotenv
load_dotenv()

mlflow.openai.autolog()
mlflow.set_tracking_uri("./mlruns")
mlflow.set_experiment("Agent-Guardrails")

class MedicalSymptoms(BaseModel):
    medical_symptoms: bool
    reasoning: str

guardrail_agent = Agent(
    name="Guardrail check",
    instructions="Check if the user is asking you for medical symptoms.",
    output_type=MedicalSymptoms,
)

@input_guardrail
async def medical_guardrail(
    ctx: RunContextWrapper[None], agent: Agent, input
) -> GuardrailFunctionOutput:
    result = await Runner.run(guardrail_agent, input, context=ctx.context)

    return GuardrailFunctionOutput(
        output_info=result.final_output,
        tripwire_triggered=result.final_output.medical_symptoms,
    )

agent = Agent(
    name="Customer support agent",
    instructions="You are a customer support agent. You help customers with their questions.",
    input_guardrails=[medical_guardrail],
)

async def main():
    try:
        await Runner.run(agent, "Should I take aspirin if I'm having a headache?")
        print("Guardrail didn't trip - this is unexpected")

    except InputGuardrailTripwireTriggered:
        print("Medical guardrail tripped")

if __name__ == "__main__":
    asyncio.run(main())

This script defines a customer support agent with an input guardrail that detects medical-related questions. It uses a separate guardrail_agent to evaluate whether the user’s input contains a request for medical advice. If such input is detected, the guardrail triggers and prevents the main agent from responding. The entire process, including guardrail checks and outcomes, is automatically logged and traced using MLflow.

MLFlow UI

To open the MLflow UI and view all the logged agent interactions, run the following command in a new terminal:

mlflow ui

In this example, we asked the agent, “Should I take aspirin if I’m having a headache?”, which triggered the guardrail. In the MLflow UI, we can clearly see that the input was flagged, along with the reasoning provided by the guardrail agent for why the request was blocked.

Check out the Codes. All credit for this research goes to the researchers of this project.

Fractional Reasoning in LLMs: A New Way to Control Inference Depth

What is included in this article:
The limitations of current test-time compute strategies in LLMs
Introduction of Fractional Reasoning (FR) as a training-free, model-agnostic framework
Techniques for latent state manipulation using reasoning prompts and adjustable scaling
Breadth- and depth-based scaling benefits demonstrated across GSM8K, MATH500, and GPQA
Evaluation results showing FR's superiority over Best-of-N and Majority Vote
Analysis of FR's behavior across different models, including DeepSeek-R1

Introduction: Challenges in Uniform Reasoning During Inference

LLMs have shown improvements in various domains, with test-time compute playing a crucial role in their performance. This approach enhances reasoning during inference by allocating extra computational resources, such as generating multiple candidate responses and selecting the most suitable one, or refining answers iteratively through self-reflection. However, current test-time compute strategies treat all problems uniformly, applying the same depth of reasoning regardless of query difficulty or structure. In reality, reasoning needs vary widely, and underthinking, overthinking, or excessive reflection can lead to degraded answers or unnecessary computational cost. Therefore, LLMs must be capable of adjusting their reasoning depth or level of reflection dynamically.

Prior Work: Latent Steering and Representation Control

Existing research has explored various methods to enhance LLM reasoning through inference-time scaling and latent state control. The Chain-of-Thought (CoT) prompting technique guides models to decompose complex problems into intermediate steps to improve reasoning performance. Outcome reward models (ORMs) and process reward models (PRMs) evaluate generated responses based on correctness or quality of internal reasoning. Moreover, representation engineering methods use steering vectors in LLM latent spaces for controlled generation, while methods like In-Context Vectors (ICV) extract latent vectors from demonstrations to steer internal states at inference time, and Representation Finetuning (ReFT) learns task-specific low-rank interventions over latent representations.

The Proposed Framework: Fractional Reasoning for Adaptive Inference

Researchers from Stanford University have proposed Fractional Reasoning (FR), a training-free and model-agnostic framework for improving test-time compute through adaptive reasoning control. FR adjusts reasoning behavior by directly modifying the model's internal representations, extracting the latent shift induced by reasoning-promoting inputs such as CoT or reflection prompts, and reapplying this shift with a tunable scaling factor. This enables models to adjust the depth of reasoning during inference without modifying the input text or requiring fine-tuning. FR supports and enhances two key forms of test-time scaling: (a) breadth-based scaling, such as Best-of-N and majority vote, and (b) depth-based scaling, such as self-reflection.
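Conceptually, the steering can be pictured as adding a scaled latent shift to the model's hidden states. The snippet below is a schematic illustration with toy arrays, not the authors' code; the array shapes and the name fractional_steer are assumptions:

# Schematic sketch of the latent-shift idea behind Fractional Reasoning.
# h_plain: hidden states for the query alone; h_prompted: hidden states for the
# query prefixed with a reasoning prompt (e.g., CoT). alpha scales the shift.
import numpy as np

def fractional_steer(h_plain: np.ndarray, h_prompted: np.ndarray, alpha: float) -> np.ndarray:
    delta = h_prompted - h_plain    # latent shift induced by the reasoning prompt
    return h_plain + alpha * delta  # alpha < 1 dampens, alpha > 1 amplifies the behavior

# Toy example: alpha = 0 recovers the plain behavior, alpha = 1 matches full prompting.
h = np.random.randn(16, 4096)       # (sequence length x hidden dimension)
h_cot = h + 0.1 * np.random.randn(16, 4096)
steered = fractional_steer(h, h_cot, alpha=1.5)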

Benchmarking: Performance Gains on Reasoning Tasks

FR is evaluated on three benchmarks that require multi-step reasoning: GSM8K, MATH500, and GPQA. The evaluation utilizes test sets for GSM8K and MATH500 while using the diamond split for GPQA. Main experiments use two competitive open-source instruction-tuned models: Qwen2.5-7B-Instruct and LLaMA-3.1-8B-Instruct, both of which demonstrate strong reasoning capabilities and provide access to the latent state representations required by the proposed method. FR outperforms standard test-time compute methods on all benchmarks and models, showing that it can strongly enhance performance. Adjusting the influence of prompts enables broader exploration of the solution space, increasing the efficiency of traditional test-time compute methods.

Behavior and Model-Agnostic Generality of Fractional Reasoning

Researchers further analyzed FR to understand its behavioral dynamics, generality across models, and other metrics. Analysis reveals that increasing the scaling parameter leads to longer outputs with more detailed multi-step reasoning, confirming the framework steers model behavior predictably and continuously. FR remains effective even when applied to reasoning-specialized models such as DeepSeek-R1-Distill-Qwen-7B, improving accuracy over standard prompting baselines and showing its generality across both general-purpose and specialized LLMs. Performance scaling analysis shows consistent improvements with an increasing number of generations, and FR shows higher accuracy across most sampling budgets compared to the majority vote baseline.

Conclusion: Towards More Dynamic and Efficient LLM Inference

In conclusion, researchers from Stanford University introduced Fractional Reasoning (FR), a training-free and model-agnostic framework that improves test-time compute through adaptive control of reasoning behavior in LLMs. It offers a general and interpretable approach for more precise and efficient allocation of computational effort during inference, overcoming the limitation of uniform reasoning application in current test-time compute strategies. However, the framework currently depends on predefined reasoning directions and lacks automatic selection of scaling factors, indicating future research directions toward adaptive policies for fully dynamic inference.

Check out the Paper. All credit for this research goes to the researchers of this project.


Liquid AI Open-Sources LFM2: A New Generation of Edge LLMs

What is included in this article:
Performance breakthroughs – 2x faster inference and 3x faster training
Technical architecture – Hybrid design with convolution and attention blocks
Model specifications – Three size variants (350M, 700M, 1.2B parameters)
Benchmark results – Superior performance compared to similar-sized models
Deployment optimization – Edge-focused design for various hardware
Open-source accessibility – Apache 2.0-based licensing
Market implications – Impact on edge AI adoption

The landscape of on-device artificial intelligence has taken a significant leap forward with Liquid AI’s release of LFM2, their second-generation Liquid Foundation Models. This new series of generative AI models represents a paradigm shift in edge computing, delivering unprecedented performance optimizations specifically designed for on-device deployment while maintaining competitive quality standards.

Revolutionary Performance Gains

LFM2 establishes new benchmarks in the edge AI space by achieving remarkable efficiency improvements across multiple dimensions. The models deliver 2x faster decode and prefill performance compared to Qwen3 on CPU architectures, a critical advancement for real-time applications. Perhaps more impressively, the training process itself has been optimized to achieve 3x faster training compared to the previous LFM generation, making LFM2 the most cost-effective path to building capable, general-purpose AI systems.

These performance improvements are not merely incremental but represent a fundamental breakthrough in making powerful AI accessible on resource-constrained devices. The models are specifically engineered to unlock millisecond latency, offline resilience, and data-sovereign privacy – capabilities essential for phones, laptops, cars, robots, wearables, satellites, and other endpoints that must reason in real time.

Hybrid Architecture Innovation

The technical foundation of LFM2 lies in its novel hybrid architecture that combines the best aspects of convolution and attention mechanisms. The model employs a sophisticated 16-block structure consisting of 10 double-gated short-range convolution blocks and 6 blocks of grouped query attention (GQA). This hybrid approach draws from Liquid AI’s pioneering work on Liquid Time-constant Networks (LTCs), which introduced continuous-time recurrent neural networks with linear dynamical systems modulated by nonlinear input interlinked gates.

At the core of this architecture is the Linear Input-Varying (LIV) operator framework, which enables weights to be generated on-the-fly from the input they are acting on. This allows convolutions, recurrences, attention, and other structured layers to fall under one unified, input-aware framework. The LFM2 convolution blocks implement multiplicative gates and short convolutions, creating linear first-order systems that converge to zero after a finite time.
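To make the shape of such a block concrete, here is a rough, illustrative PyTorch sketch of a double-gated short convolution; it is not Liquid AI's implementation, and the gating details, kernel size, and dimensions are assumptions:

# Illustrative sketch of a gated short-convolution block; not the LFM2 code.
import torch
import torch.nn as nn

class GatedShortConvBlock(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.in_gate = nn.Linear(dim, dim)   # input-dependent gate before the conv
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size - 1, groups=dim)
        self.out_gate = nn.Linear(dim, dim)  # input-dependent gate after the conv
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        gated = torch.sigmoid(self.in_gate(x)) * x
        # Causal short convolution over the sequence dimension (trim the right padding).
        y = self.conv(gated.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        y = torch.sigmoid(self.out_gate(x)) * y
        return x + self.proj(y)

block = GatedShortConvBlock(dim=64)
out = block(torch.randn(2, 10, 64))  # (batch=2, seq=10, dim=64)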

The architecture selection process utilized STAR, Liquid AI’s neural architecture search engine, which was modified to evaluate language modeling capabilities beyond traditional validation loss and perplexity metrics. Instead, it employs a comprehensive suite of over 50 internal evaluations that assess diverse capabilities including knowledge recall, multi-hop reasoning, understanding of low-resource languages, instruction following, and tool use.

Comprehensive Model Lineup

LFM2 is available in three strategically sized configurations: 350M, 700M, and 1.2B parameters, each optimized for different deployment scenarios while maintaining the core efficiency benefits. All models were trained on 10 trillion tokens drawn from a carefully curated pre-training corpus comprising approximately 75% English, 20% multilingual content, and 5% code data sourced from web and licensed materials.

The training methodology incorporates knowledge distillation using the existing LFM1-7B as a teacher model, with cross-entropy between LFM2’s student outputs and the teacher outputs serving as the primary training signal throughout the entire 10T token training process. The context length was extended to 32k during pretraining, enabling the models to handle longer sequences effectively.

Superior Benchmark Performance

Evaluation results demonstrate that LFM2 significantly outperforms similarly-sized models across multiple benchmark categories. The LFM2-1.2B model performs competitively with Qwen3-1.7B despite having 47% fewer parameters. Similarly, LFM2-700M outperforms Gemma 3 1B IT, while the smallest LFM2-350M checkpoint remains competitive with Qwen3-0.6B and Llama 3.2 1B Instruct.

Beyond automated benchmarks, LFM2 demonstrates superior conversational capabilities in multi-turn dialogues. Using the WildChat dataset and LLM-as-a-Judge evaluation framework, LFM2-1.2B showed significant preference advantages over Llama 3.2 1B Instruct and Gemma 3 1B IT while matching Qwen3-1.7B performance despite being substantially smaller and faster.

Edge-Optimized Deployment

The models excel in real-world deployment scenarios, having been exported to multiple inference frameworks including PyTorch’s ExecuTorch and the open-source llama.cpp library. Testing on target hardware including Samsung Galaxy S24 Ultra and AMD Ryzen platforms demonstrates that LFM2 dominates the Pareto frontier for both prefill and decode inference speed relative to model size.

The strong CPU performance translates effectively to accelerators such as GPU and NPU after kernel optimization, making LFM2 suitable for a wide range of hardware configurations. This flexibility is crucial for the diverse ecosystem of edge devices that require on-device AI capabilities.

Conclusion

The release of LFM2 addresses a critical gap in the AI deployment landscape where the shift from cloud-based to edge-based inference is accelerating. By enabling millisecond latency, offline operation, and data-sovereign privacy, LFM2 unlocks new possibilities for AI integration across consumer electronics, robotics, smart appliances, finance, e-commerce, and education sectors.

The technical achievements represented in LFM2 signal a maturation of edge AI technology, where the trade-offs between model capability and deployment efficiency are being successfully optimized. As enterprises pivot from cloud LLMs to cost-efficient, fast, private, and on-premises intelligence, LFM2 positions itself as a foundational technology for the next generation of AI-powered devices and applications.

Check out the Technical Details and Model on Hugging Face. All credit for this research goes to the researchers of this project.

Build AI-driven policy creation for vehicle data collection and automa …

Vehicle data is critical for original equipment manufacturers (OEMs) to drive continuous product innovation and performance improvements and to support new value-added services. Similarly, the increasing digitalization of vehicle architectures and adoption of software-configurable functions allow OEMs to add new features and capabilities efficiently. Sonatus’s Collector AI and Automator AI products address these two aspects of the move towards Software-Defined Vehicles (SDVs) in the automotive industry.
Collector AI lowers the barrier to using data across the entire vehicle lifecycle through data collection policies that can be created without changes to vehicle electronics or modifications to embedded code. However, OEM engineers and other consumers of vehicle data struggle to choose among the thousands of vehicle signals available to drive their specific use cases and outcomes. Likewise, Automator AI's no-code methodology for automating vehicle functions using intuitive if-then-style scripted workflows can also be challenging, especially for OEM users who aren't well-versed in the vehicle events and signals available to incorporate in a desired automated action.
To address these challenges, Sonatus partnered with the AWS Generative AI Innovation Center to develop a natural language interface to generate data collection and automation policies using generative AI. This innovation aims to reduce the policy generation process from days to minutes while making it accessible to both engineers and non-experts alike.
In this post, we explore how we built this system using Sonatus’s Collector AI and Amazon Bedrock. We discuss the background, challenges, and high-level solution architecture.
Collector AI and Automator AI
Sonatus has developed a sophisticated vehicle data collection and automation workflow tool, which comprises two main products:

Collector AI – Gathers and transmits precise vehicle data based on configurable trigger events
Automator AI – Executes automated actions within the vehicle based on analyzed data and trigger conditions

The current process requires engineers to create data collection or automation policies manually. Depending on the range of an OEM's use cases, there could be hundreds of policies for a given vehicle model. Also, identifying the correct data to collect for a given intent requires sifting through multiple layers of information and navigating organizational challenges. Our goal was to develop a more intelligent and intuitive way to accomplish the following:

Generate policies from the user’s natural language input
Significantly reduce policy creation time from days to minutes
Provide complete control over the intermediate steps in the generation process
Expand policy creation capabilities to non-engineers such as vehicle product owners, product planners, and even procurement
Implement a human-in-the-loop review process for both existing and newly created policies

Key challenges
During implementation, we encountered several challenges:

Complex event structures – Vehicle models and different policy entities use diverse representations and formats, requiring flexible policy generation
Labeled data limitations – Labeled data mapping natural language inputs to desired policies is limited
Format translation – The solution must handle different data formats and schemas across customers and vehicle models
Quality assurance – Generated policies must be accurate and consistent
Explainability – Clear explanations for how policies are generated can help build trust

Success metrics
We defined the following key metrics to measure the success of our solution:

Business metrics:

Reduced policy generation time
Increased number of policies per customer
Expanded user base for policy creation

Technical metrics:

Accuracy of generated policies
Quality of results for modified prompts

Operational metrics:

Reduced policy generation effort and turnaround time compared to manual process
Successful integration with existing systems

Solution overview
The Sonatus Advanced Technology team and Generative AI Innovation Center team built an automated policy generation system, as shown in the following diagram.

This is a chain of large language models (LLMs) that perform individual tasks, including entity extraction, signal translation, and signal parametrization.
Entity extraction
A fully generated vehicle policy consists of multiple parts, which could be captured within one single user statement. These are triggers and target data for collector policies, and triggers, actions, and associated tasks for automator policies. The user’s statement is first broken down into its entities using the following steps and rules:

Few-shot examples are provided for each entity
Trigger outputs must be self-contained with the appropriate signal value and comparison operator information:

Query example: “Generate an automation policy that locks the doors automatically when the car is moving”
Trigger output: <response>vehicle speed above 0, vehicle signal</response>

Triggers and actions are secondarily verified using a classification prompt
For Automator AI, triggers and actions must be associated with their corresponding tasks
The final output of this process is the intermediate structured XML representation of the user query in natural language:

Query example: “Generate an automation policy that locks the doors automatically when the car is moving”
Generated XML:

<response>
<task> Lock doors when moving </task>
<triggers> vehicle speed above 0, vehicle signal </triggers>
<actions> lock doors, vehicle signal </actions>
</response>

The following is a diagram of our improved solution, which converts a user query into XML output.

Signal translation and parametrization
To get to the final JSON policy structure from the intermediate structured XML output, the correct signals must be identified, the signal parameters need to be generated, and this information must be combined to follow the application’s expected JSON schema.
The output signal format of choice at this stage is the Vehicle Signal Specification (VSS), an industry-standard specification driven by COVESA. VSS defines vehicle signal naming conventions and strategies that make vehicle signals descriptive and understandable compared to their physical Controller Area Network (CAN) signal counterparts. This makes VSS not only suitable but essential for the generative AI process, because descriptive signal names and readily available meanings are necessary for accurate generation.
The VSS signals, along with their descriptions and other necessary metadata, are embedded into a vector index. For every XML structure requiring a lookup of a vehicle signal, the process of signal translation includes the following steps:

Available signal data is preprocessed and stored into a vector database.
Each XML representation—triggers, actions, and data—is converted into their corresponding embeddings. In some cases, the XML phrases can also be enhanced for better embedding representation.
For each of the preceding entities:

Top-k similar vector embeddings are identified (for example, k = 20).
Candidate signals are reranked based on name and descriptions.
The final signal is selected using an LLM selection prompt.

In the case of triggers, after the selection of the correct signal, the trigger value and condition comparator operator are also generated using few-shot examples.
This retrieved and generated information is combined into a predefined trigger, action, data, and task JSON object structure.
Individual JSON objects are assembled to construct the final JSON policy.
This is run through a policy schema validator before it is saved.

The following diagram illustrates the step-by-step process of signal translation. To generate the JSON output from the intermediate XML structure, correct signals are identified using vector-based lookups and reranking techniques.
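A simplified sketch of that retrieve-and-rerank step is shown below; the embed and llm helpers and the signal_index structure are placeholders, not the production vector store or prompts:

# Simplified sketch of VSS signal lookup: embed the phrase, retrieve top-k candidates,
# then let an LLM pick the final signal. embed() and llm() are placeholder callables,
# and signal_index maps signal names to (embedding vector, description) pairs.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def translate_signal(phrase, signal_index, embed, llm, k=20):
    query_vec = embed(phrase)                      # e.g., "vehicle speed above 0"
    scored = [(cosine(query_vec, vec), name, desc)
              for name, (vec, desc) in signal_index.items()]
    candidates = sorted(scored, reverse=True)[:k]  # top-k similar VSS signals

    # Rerank / select with an LLM prompt over candidate names and descriptions.
    listing = "\n".join(f"{name}: {desc}" for _, name, desc in candidates)
    prompt = (f"User intent: {phrase}\nCandidate signals:\n{listing}\n"
              "Return the single best VSS signal name.")
    return llm(prompt).strip()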

Solution highlights
In this section, we discuss key components and features of the solution.
Improvement of task adjacency
In automator policies, a task is a discrete unit of work within a larger process. It has a specific purpose and performs a defined set of actions—both within and outside a vehicle. It also optionally defines a set of trigger conditions that, when evaluated to be true, the defined actions start executing. The larger process—the workflow—defines a dependency graph of tasks and the order in which they are executed. The workflow follows the following rules:

Every automator policy starts with exactly one task
A task can point to one or more next tasks
One task can only initiate one other task
Multiple possible next tasks can exist, but only one can be triggered at a time
Each policy workflow runs one task at a given time
Tasks can be arranged in linear or branching patterns
If none of the conditions are satisfied, the default behavior is to keep monitoring the trigger conditions for the next available tasks

For example:

# Linear Task Adjacency
t1 → t2 → t3 → t4 → t1*
# Branching Task Adjacency
t1 → t2, t3, t4
t3 → t5
t5 → t4

*Loops back to start.
In some generated outputs, we identified adjacent tasks in which one lacked an action and the other lacked a trigger. To address this, we implemented task merging using Anthropic's Claude on Amazon Bedrock, combining such incomplete tasks into a single task. Our outcomes were as follows:

Solve the task merging issue, where multiple tasks with incomplete information are merged into one task
Properly generate tasks that point to multiple next tasks
Change the prompt style to decision tree-based planning to make it more flexible

Multi-agent approach for parameter generation
During the signal translation process, an exhaustive list of signals is fed into a vector store, and when corresponding triggers or actions are generated, they are used to search the vector store and select the signal with the highest relevancy. However, this sometimes generates less accurate or ambiguous results.
For example, the following policy asks to cool down the car:
Action: <response> cool down the car </response>
The corresponding signal should try to cool the car cabin, as shown in the following signal:
Vehicle.Cabin.HVAC.Station.Row1.Driver.Temperature
It should not cool the car engine, as shown in the following incorrect signal:
Vehicle.Powertrain.CombustionEngine.EngineCoolant.Temperature
We mitigated this issue by introducing a multi-agent approach. Our approach has two agents:

ReasoningAgent – Proposes initial signal names based on the query and knowledge base
JudgeAgent – Evaluates and refines the proposed signals

The agents interact iteratively up to a set cycle threshold before claiming success for signal identification.
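A minimal sketch of this loop is shown below; propose_signal and judge_signal are placeholders standing in for the LLM-backed ReasoningAgent and JudgeAgent:

# Illustrative sketch of the iterative two-agent loop for signal identification.
def resolve_signal(query, knowledge_base, propose_signal, judge_signal, max_cycles=3):
    candidate, feedback = None, None
    for _ in range(max_cycles):
        candidate = propose_signal(query, knowledge_base, feedback)  # ReasoningAgent proposes
        verdict = judge_signal(query, candidate)                     # JudgeAgent evaluates
        if verdict["accepted"]:
            return candidate
        feedback = verdict["critique"]  # fed back into the next proposal
    return candidate  # fall back to the last proposal after the cycle threshold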
Reduce redundant LLM calls
To reduce latency, we identified parts of the pipeline that could be merged into a single LLM call. For example, trigger condition value generation and trigger condition operator generation were individual LLM calls. We addressed this by introducing the faster Anthropic's Claude 3 Haiku model and merging prompts where possible. The following is an example of a set of prompts before and after merging. The first example is before merging, with the trigger set to when the temperature is above 20 degrees Celsius:

Operator response: <operator> > </operator>
Parameter response: <value> 20 </value>

The following is the combined response for the same trigger:

<response>
<operator> > </operator>
<value> 20 </value>
</response>
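A single merged call can then be parsed for both fields at once; a minimal sketch follows (the llm helper is a placeholder for the Claude 3 Haiku call on Amazon Bedrock):

# Sketch of parsing the merged trigger response (one LLM call instead of two).
import re

def generate_trigger_params(trigger_text, llm):
    prompt = (f"Trigger: {trigger_text}\n"
              "Return <response><operator>...</operator><value>...</value></response>")
    reply = llm(prompt)
    operator = re.search(r"<operator>\s*(.*?)\s*</operator>", reply, re.S).group(1)
    value = re.search(r"<value>\s*(.*?)\s*</value>", reply, re.S).group(1)
    return operator, value

# Example: "when the temperature is above 20 degrees Celsius" -> (">", "20")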

Context-driven policy generation
The goal here is to disambiguate the signal translation, similar to the multi-agent approach for parameter generation. To make policy generation more context-aware, we proposed a customer intent clarifier that carries out the following tasks:

Retrieves relevant subsystems using knowledge base lookups
Identifies the intended target subsystem
Allows user verification and override

This approach works by using external and preprocessed information like available vehicle subsystems, knowledge bases, and signals to guide the signal selection. Users can also clarify or override intent in cases of ambiguity early on to reduce wasted iterations and achieve the desired result more quickly. For example, in the case of the previously stated example on an ambiguous generation of “cool the car,” users are asked to clarify which subsystem they meant—to choose from “Engine” or “Cabin.”
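A minimal sketch of the clarifier flow follows; retrieve_subsystems and ask_user are placeholder hooks for the knowledge base lookup and the user-facing confirmation step:

# Sketch of the customer intent clarifier: narrow the target subsystem before signal lookup.
def clarify_intent(action_phrase, retrieve_subsystems, ask_user):
    candidates = retrieve_subsystems(action_phrase)  # e.g., ["Cabin", "Engine"] for "cool the car"
    if len(candidates) == 1:
        return candidates[0]
    # Ambiguous: let the user confirm or override before generation continues.
    return ask_user(f"Which subsystem did you mean for '{action_phrase}'?", candidates)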
Conclusion
Combining early feedback loops and a multi-agent approach has transformed Sonatus's policy creation system into a more automated and efficient solution. By using Amazon Bedrock, we created a system that not only automates policy creation, reducing creation time by 70%, but also provides accuracy through context-aware generation and validation. Organizations can achieve similar efficiency gains by implementing this multi-agent approach with Amazon Bedrock for their own complex policy creation workflows, and developers can use these techniques to build natural language interfaces that dramatically reduce technical complexity while maintaining precision in business-critical systems.

About the authors
Giridhar Akila Dhakshinamoorthy is the Senior Staff Engineer and AI/ML Tech Lead in the CTO Office at Sonatus.
Tanay Chowdhury is a Data Scientist at the Generative AI Innovation Center at Amazon Web Services who helps customers solve their business problems using generative AI and machine learning. He holds an MS with thesis in machine learning from the University of Illinois and has extensive experience solving customer problems in the field of data science.
Parth Patwa is a Data Scientist in the Generative AI Innovation Center at Amazon Web Services. He has co-authored research papers at top AI/ML venues and has 1000+ citations.
Yingwei Yu is an Applied Science Manager at Generative AI Innovation Center, AWS, where he leverages machine learning and generative AI to drive innovation across industries. With a PhD in Computer Science from Texas A&M University and years of working experience, Yingwei brings extensive expertise in applying cutting-edge technologies to real-world applications.
Hamed Yazdanpanah was a Data Scientist in the Generative AI Innovation Center at Amazon Web Services. He helps customers solve their business problems using generative AI and machine learning.

How Rapid7 automates vulnerability risk scores with ML pipelines using …

This post is cowritten with Jimmy Cancilla from Rapid7.
Organizations are managing increasingly distributed systems, which span on-premises infrastructure, cloud services, and edge devices. As systems become interconnected and exchange data, the potential pathways for exploitation multiply, and vulnerability management becomes critical to managing risk. Vulnerability management (VM) is the process of identifying, classifying, prioritizing, and remediating security weaknesses in software, hardware, virtual machines, Internet of Things (IoT) devices, and similar assets. When new vulnerabilities are discovered, organizations are under pressure to remediate them. Delayed responses can open the door to exploits, data breaches, and reputational harm. For organizations with thousands or millions of software assets, effective triage and prioritization for the remediation of vulnerabilities are critical.
To support this process, the Common Vulnerability Scoring System (CVSS) has become the industry standard for evaluating the severity of software vulnerabilities. CVSS v3.1, published by the Forum of Incident Response and Security Teams (FIRST), provides a structured and repeatable framework for scoring vulnerabilities across multiple dimensions: exploitability, impact, attack vector, and others. With new threats emerging constantly, security teams need standardized, near real-time data to respond effectively. CVSS v3.1 is used by organizations such as NIST and major software vendors to prioritize remediation efforts, support risk assessments, and comply with standards.
There is, however, a critical gap that emerges before a vulnerability is formally standardized. When a new vulnerability is disclosed, vendors aren’t required to include a CVSS score alongside the disclosure. Additionally, third-party organizations such as NIST aren’t obligated or bound by specific timelines to analyze vulnerabilities and assign CVSS scores. As a result, many vulnerabilities are made public without a corresponding CVSS score. This situation can leave customers uncertain about how to respond: should they patch the newly discovered vulnerability immediately, monitor it for a few days, or deprioritize it? Our goal with machine learning (ML) is to provide Rapid7 customers with a timely answer to this critical question.
Rapid7 helps organizations protect what matters most so innovation can thrive in an increasingly connected world. Rapid7’s comprehensive technology, services, and community-focused research remove complexity, reduce vulnerabilities, monitor for malicious behavior, and shut down attacks. In this post, we share how Rapid7 implemented end-to-end automation for the training, validation, and deployment of ML models that predict CVSS vectors. Rapid7 customers have the information they need to accurately understand their risk and prioritize remediation measures.
Rapid7’s solution architecture
Rapid7 built their end-to-end solution using Amazon SageMaker AI, the Amazon Web Services (AWS) fully managed ML service to build, train, and deploy ML models into production environments. SageMaker AI provides powerful compute for ephemeral tasks, orchestration tools for building automated pipelines, a model registry for tracking model artifacts and versions, and scalable deployment to configurable endpoints.
Rapid7 integrated SageMaker AI with their DevOps tools (GitHub for version control and Jenkins for build automation) to implement continuous integration and continuous deployment (CI/CD) for the ML models used for CVSS scoring. By automating model training and deployment, Rapid7’s CVSS scoring solutions stay up to date with the latest data without additional operational overhead.
The following diagram illustrates the solution architecture.

Orchestrating with SageMaker AI Pipelines
The first step in the journey toward end-to-end automation was removing manual activities previously performed by data scientists. This meant migrating experimental code from Jupyter notebooks to production-ready Python scripts. Rapid7 established a project structure to support both development and production. Each step in the ML pipeline—data download, preprocessing, training, evaluation, and deployment—was defined as a standalone Python module in a common directory.
Designing the pipeline
After refactoring, pipeline steps were moved to SageMaker Training and Processing jobs for remote execution. Steps in the pipeline were defined using Docker images with the required libraries, and orchestrated using SageMaker Pipelines in the SageMaker Python SDK.
CVSS v3.1 vectors consist of eight independent metrics combined into a single vector. To produce an accurate CVSS vector, eight separate models were trained in parallel. However, the data used to train these models was identical. This meant that the training process could share common download and preprocessing steps, followed by separate training, validation, and deployment steps for each metric. The following diagram illustrates the high-level architecture of the implemented pipeline.

Data loading and preprocessing
The data used to train the model comprised existing vulnerabilities and their associated CVSS vectors. This data source is updated constantly, so Rapid7 downloads the most recent data available at training time and uploads it to Amazon Simple Storage Service (Amazon S3) for use by subsequent steps. Once the data is downloaded, a preprocessing step runs to:

Structure the data to facilitate ingestion and use in training.
Split the data into three sets: training, validation, and testing (80%, 10%, and 10%).

The preprocessing step was defined with a dependency on the data download step so that the new dataset was available before a new preprocessing job was started. The outputs of the preprocessing job—the resulting training, validation, and test sets—are also uploaded to Amazon S3 to be consumed by the training steps that follow.
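As a reference point, the following is a minimal sketch (not Rapid7’s production code) of how such a preprocessing step can be wired to the download step with the SageMaker Python SDK; the script name, container image variable, output names, and instance type are illustrative assumptions.

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

preprocessor = ScriptProcessor(
    image_uri=preprocessing_image_uri,  # custom container with the required libraries
    command=["python3"],
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
step_preprocess = ProcessingStep(
    name="PreprocessCVSSData",
    processor=preprocessor,
    inputs=[
        ProcessingInput(
            # Consume the raw data uploaded by the download step, which makes the
            # dependency between the two steps explicit.
            source=step_download.properties.ProcessingOutputConfig.Outputs["raw"].S3Output.S3Uri,
            destination="/opt/ml/processing/input",
        )
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="src/preprocess.py",
)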
Model training, evaluation, and deployment
For the remaining pipeline steps, Rapid7 executed each step eight times—one time for each metric in the CVSS vector. Rapid7 iterated through each of the eight metrics to define the corresponding training, evaluation, and deployment steps using the SageMaker Pipelines SDK.
The loop follows a similar pattern for each metric. The process starts with a training job using PyTorch framework images provided by Amazon SageMaker AI. The following is a sample script for defining a training job.

from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

estimator = PyTorch(
    entry_point="train.py",
    source_dir="src",
    role=role,
    instance_count=1,
    instance_type=TRAINING_INSTANCE_TYPE,
    output_path=f"s3://{s3_bucket}/cvss/trained-model",
    framework_version="2.2",
    py_version="py310",
    disable_profiler=True,
    environment={"METRIC": cvss_metric}  # selects which CVSS metric this job trains
)
step_train = TrainingStep(
    name=f"TrainModel_{cvss_metric}",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            s3_data=<<INPUT_DATA_S3_URI>>,
            content_type="text/plain"
        ),
        "validation": TrainingInput(
            s3_data=<<VALIDATION_DATA_S3_URI>>,
            content_type="text/plain"
        )
    }
)
training_steps.append(step_train)

The PyTorch estimator creates model artifacts that are automatically uploaded to the Amazon S3 location defined in the output_path parameter. The same script is used for each of the CVSS v3.1 metrics; each training job focuses on a different metric by passing a different cvss_metric value to the training script as an environment variable.
The SageMaker Pipeline is configured to trigger the execution of a model evaluation step when the model training job for that CVSS v3.1 metric is finished. The model evaluation job takes the newly trained model and test data as inputs, as shown in the following step definition.

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.steps import ProcessingStep

# Processor definition elided in the original sample.
script_eval = Processor(…)
eval_args = script_eval.run(
    inputs=[
        ProcessingInput(
            source=<<MODEL_ARTIFACTS_IN_AMAZON_S3>>,
            destination="/opt/ml/processing/model"
        ),
        ProcessingInput(
            source=<<TEST_DATA_IN_AMAZON_S3>>,
            destination="/opt/ml/processing/test"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="evaluation",
            source="/opt/ml/processing/evaluation/",
            destination=f"s3://{s3_bucket}/cvss/evaluation/{cvss_metric}/"
        )
    ],
    source_dir="src",
    code="evaluate.py"
)
evaluation_report = PropertyFile(
    name="EvaluationReport",
    output_name="evaluation",
    path="evaluation.json"
)
step_eval = ProcessingStep(
    name=f"Evaluate_{cvss_metric}",
    step_args=eval_args,
    property_files=[evaluation_report],
)
evaluation_steps.append(step_eval)

The processing job is configured to create a PropertyFile object to store the results from the evaluation step. Here is a sample of what might be found in this file:

{
  "ac": {
    "metrics": {
      "accuracy": 99
    }
  }
}

This information is critical in the last step of the sequence followed for each metric in the CVSS vector. Rapid7 wants to ensure that models deployed in production meet quality standards, and they do that by using a ConditionStep that allows only models whose accuracy is above a critical value to be registered in the SageMaker Model Registry. This process is repeated for all eight models.

from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet

cond_gte = ConditionGreaterThanOrEqualTo(
    left=JsonGet(
        step_name=step_eval.name,
        property_file=evaluation_report,
        json_path=f"{cvss_metric}.metrics.accuracy"
    ),
    right=accuracy_threshold_param
)
step_cond = ConditionStep(
    name=f"CVSS_{cvss_metric}_Accuracy_Condition",
    conditions=[cond_gte],
    if_steps=[step_model_create],  # register only when accuracy clears the threshold
    else_steps=[]
)
conditional_steps.append(step_cond)

Defining the pipeline
With all the steps defined, a pipeline object is created containing the steps for all eight models. The graph of the pipeline definition is shown in the following image.

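For reference, assembling the pipeline object might look like the following minimal sketch; the pipeline name and the step list variables (including the hypothetical step_download and step_preprocess from the sketch above) are illustrative assumptions.

from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="cvss-scoring-pipeline",
    parameters=[accuracy_threshold_param],
    steps=[step_download, step_preprocess] + training_steps + evaluation_steps + conditional_steps,
)

# Create or update the pipeline definition, then start a run.
pipeline.upsert(role_arn=role)
execution = pipeline.start()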
Managing models with SageMaker Model Registry
SageMaker Model Registry is a repository for storing, versioning, and managing ML models throughout the machine learning operations (MLOps) lifecycle. The model registry enables the Rapid7 team to track model artifacts and their metadata (such as performance metrics), and to streamline model version management as their CVSS models evolve. Each time a new model is added, a new version is created under the same model group, which helps track model iterations over time. Because new versions must pass the accuracy check before registration, they are registered with an Approved status. If a model’s accuracy falls below the accuracy threshold, the automated deployment pipeline detects this and sends an alert to notify the team about the failed deployment. This enables Rapid7 to maintain an automated pipeline that serves the most accurate model available to date without requiring manual review of new model artifacts.
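The step_model_create step referenced in the condition step earlier could be defined roughly as follows; this is a hedged sketch rather than Rapid7’s actual code, and the inference script, instance types, and model package group naming are assumptions.

from sagemaker.pytorch import PyTorchModel
from sagemaker.workflow.model_step import ModelStep

model = PyTorchModel(
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    role=role,
    entry_point="inference.py",
    source_dir="src",
    framework_version="2.2",
    py_version="py310",
)
register_args = model.register(
    content_types=["text/plain"],
    response_types=["application/json"],
    inference_instances=["ml.m5.2xlarge"],
    model_package_group_name=f"cvss-{cvss_metric}",
    # Versions only reach this step after passing the accuracy condition,
    # so they can be registered as Approved directly.
    approval_status="Approved",
)
step_model_create = ModelStep(name=f"Register_{cvss_metric}", step_args=register_args)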
Deploying models with inference components
When a set of CVSS scoring models has been selected, they can be deployed in a SageMaker AI endpoint for real-time inference, allowing them to be invoked to calculate a CVSS vector as soon as new vulnerability data is available. SageMaker AI endpoints are accessible URLs where applications can send data and receive predictions. Internally, the CVSS v3.1 vector is prepared using predictions from the eight scoring models, followed by postprocessing logic. Because each invocation runs each of the eight CVSS scoring models one time, their deployment can be optimized for efficient use of compute resources.
When the deployment script runs, it checks the model registry for new versions. If it detects an update, it immediately deploys the new version to a SageMaker endpoint.
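A minimal sketch of that check, assuming a hypothetical model package group name, might query the registry for the newest approved version and compare it with what is currently deployed:

import boto3

sm = boto3.client("sagemaker")

def latest_approved_package(group_name: str) -> str:
    # Newest approved model package in the group.
    response = sm.list_model_packages(
        ModelPackageGroupName=group_name,
        ModelApprovalStatus="Approved",
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    return response["ModelPackageSummaryList"][0]["ModelPackageArn"]

latest_arn = latest_approved_package("cvss-ac")
# If latest_arn differs from the version currently serving traffic,
# update the endpoint with the new model package.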
Ensuring cost efficiency
Cost efficiency was a key consideration in designing this workflow. Usage patterns for vulnerability scoring are bursty, with periods of high activity followed by long idle intervals. Maintaining dedicated compute resources for each model would be unnecessarily expensive given these idle times. To address this issue, Rapid7 implemented Inference Components in their SageMaker endpoint. Inference components allow multiple models to share the same underlying compute resources, significantly improving cost efficiency—particularly for bursty inference patterns. This approach enabled Rapid7 to deploy all eight models on a single instance. Performance tests showed that inference requests could be processed in parallel across all eight models, consistently achieving sub-second response times (100-200ms).
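As an illustration of the pattern (names and resource sizes are assumptions, not Rapid7’s configuration), one model can be attached to a shared endpoint as an inference component like this:

import boto3

sm = boto3.client("sagemaker")

sm.create_inference_component(
    InferenceComponentName="cvss-ac-component",
    EndpointName="cvss-scoring-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "cvss-ac-model",
        "ComputeResourceRequirements": {
            # Share of the underlying instance reserved for this model copy.
            "NumberOfCpuCoresRequired": 1,
            "MinMemoryRequiredInMb": 2048,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)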
Monitoring models in production
Rapid7 continually monitors the models in production to ensure high availability and efficient use of compute resources. The SageMaker AI endpoint automatically uploads logs and metrics into Amazon CloudWatch, which are then forwarded and visualized in Grafana. As part of regular operations, Rapid7 monitors these dashboards to visualize metrics such as model latency, the number of instances behind the endpoint, and invocations and errors over time. Additionally, alerts are configured on response time metrics to maintain system responsiveness and prevent delays in the enrichment pipeline. For more information on the various metrics and their usage, refer to the AWS blog post, Best practices for load testing Amazon SageMaker real-time inference endpoints.
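For example, a latency alarm of the kind described above could be configured as in the following sketch; the endpoint name, threshold, and SNS topic are illustrative assumptions (ModelLatency is reported in microseconds):

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="cvss-endpoint-latency",
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "cvss-scoring-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=500000,  # 0.5 seconds, expressed in microseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ml-alerts"],
)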
Conclusion
End-to-end automation of vulnerability scoring model development and deployment has given Rapid7 a consistent, fully automated process. The previous manual process for retraining and redeploying these models was fragile, error-prone, and time-intensive. By implementing an automated pipeline with SageMaker, the engineering team now saves at least 2–3 days of maintenance work each month. By eliminating 20 manual operations, Rapid7 software engineers can focus on delivering higher-impact work for their customers. Furthermore, by using inference components, all models can be consolidated onto a single ml.m5.2xlarge instance, rather than deploying a separate endpoint (and instance) for each model. This approach nearly halves the hourly compute cost, resulting in approximately 50% cloud compute savings for this workload. In building this pipeline, Rapid7 benefited from features that reduced time and cost across multiple steps. For example, using custom containers with the necessary libraries improved startup times, while inference components enabled efficient resource utilization—both were instrumental in building an effective solution.
Most importantly, this automation means that Rapid7 customers always receive the most recently published CVEs with a CVSSv3.1 score assigned. This is especially important for InsightVM because Active Risk Scores, Rapid7’s latest risk strategy for understanding vulnerability impact, rely on the CVSSv3.1 score as a key component in their calculation. Providing accurate and meaningful risk scores is critical for the success of security teams, empowering them to prioritize and address vulnerabilities more effectively.
In summary, automating model training and deployment with Amazon SageMaker Pipelines has enabled Rapid7 to deliver scalable, reliable, and efficient ML solutions. By embracing these best practices and lessons learned, teams can streamline their workflows, reduce operational overhead, and remain focused on driving innovation and value for their customers.

About the authors
Jimmy Cancilla is a Principal Software Engineer at Rapid7, focused on applying machine learning and AI to solve complex cybersecurity challenges. He leads the development of secure, cloud-based solutions that use automation and data-driven insights to improve threat detection and vulnerability management. He is driven by a vision of AI as a tool to augment human work, accelerating innovation, enhancing productivity, and enabling teams to achieve more with greater speed and impact.
Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.
Steven Warwick is a Senior Solutions Architect at AWS, where he leads customer engagements to drive successful cloud adoption and specializes in SaaS architectures and Generative AI solutions. He produces educational content including blog posts and sample code to help customers implement best practices, and has led programs on GenAI topics for solution architects. Steven brings decades of technology experience to his role, helping customers with architectural reviews, cost optimization, and proof-of-concept development.

Build secure RAG applications with AWS serverless data lakes

Data is your generative AI differentiator, and successful generative AI implementation depends on a robust data strategy incorporating a comprehensive data governance approach. Traditional data architectures often struggle to meet the unique demands of generative AI applications. An effective generative AI data strategy requires several key components: seamless integration of diverse data sources, real-time processing capabilities, comprehensive data governance frameworks that maintain data quality and compliance, and secure access patterns that respect organizational boundaries. In particular, Retrieval Augmented Generation (RAG) applications have emerged as one of the most promising developments in this space. RAG is the process of optimizing the output of a foundation model (FM) so that it references a knowledge base outside of its training data sources before generating a response. Such systems require secure, scalable, and flexible data ingestion and access patterns to enterprise data.
In this post, we explore how to build a secure RAG application using serverless data lake architecture, an important data strategy to support generative AI development. We use Amazon Web Services (AWS) services including Amazon S3, Amazon DynamoDB, AWS Lambda, and Amazon Bedrock Knowledge Bases to create a comprehensive solution supporting unstructured data assets which can be extended to structured data. The post covers how to implement fine-grained access controls for your enterprise data and design metadata-driven retrieval systems that respect security boundaries. These approaches will help you maximize the value of your organization’s data while maintaining robust security and compliance.
Use case overview
As an example, consider a RAG-based generative AI application. The following diagram shows the typical conversational workflow that is initiated with a user prompt, for example, operation specialists in a retail company querying internal knowledge to get procurement and supplier details. Each user prompt is augmented with relevant contexts from data residing in an enterprise data lake.

In the solution, the user interacts with the Streamlit frontend, which serves as the application interface. Amazon Cognito enables identity provider (IdP) integration through IAM Identity Center, so that only authorized users can access the application. For production use, we recommend a more robust frontend framework such as AWS Amplify, which provides a comprehensive set of tools and services for building scalable and secure web applications. After the user has successfully signed in, the application retrieves the list of datasets associated with the user’s ID from the DynamoDB table. This list is used as a filter when querying the knowledge base, so answers come only from datasets the user is authorized to access. This is possible because, when the datasets are ingested, the knowledge base is prepopulated with metadata files containing the user principal-dataset mapping stored in Amazon S3. The knowledge base returns the relevant results, which are then sent back to the application and displayed to the user.
The datasets reside in a serverless data lake on Amazon S3 and are governed using Amazon S3 Access Grants with IAM Identity Center trusted identity propagation enabling automated data permissions at scale. When an access grant is created or deleted for a user or group, the information is added to the DynamoDB table through event-driven architecture using AWS CloudTrail and Amazon EventBridge.
The workflow includes the following key data foundation steps:

Apply access policies to extract permissions for the relevant data and filter results based on the prompting user’s role and permissions.
Enforce data privacy policies such as personally identifiable information (PII) redactions.
Enforce fine-grained access control.
Grant the user role permissions for sensitive information and compliance policies based on dataset classification in the data lake.
Extract, transform, and load multimodal data assets into a vector store.

In the following sections, we explain why a modern data strategy is important for generative AI and what challenges it solves.
Serverless data lakes powering RAG applications
Organizations implementing RAG applications face several critical challenges that impact both functionality and cost-effectiveness. At the forefront is security and access control. Applications must carefully balance broad data access with strict security boundaries. These systems need to allow access to data sources by only authorized users, apply dynamic filtering based on permissions and classifications, and maintain security context throughout the entire retrieval and generation process. This comprehensive security approach helps prevent unauthorized information exposure while still enabling powerful AI capabilities.
Data discovery and relevance present another significant hurdle. When dealing with petabytes of enterprise data, organizations must implement sophisticated systems for metadata management and advanced indexing. These systems need to understand query context and intent while efficiently ranking retrieval results to make sure users receive the most relevant information. Without proper attention to these aspects, RAG applications risk returning irrelevant or outdated information that diminishes their utility.
Performance considerations become increasingly critical as these systems scale. RAG applications must maintain consistent low latency while processing large document collections, handling multiple concurrent users, integrating data from distributed sources, and retrieving relevant data. The challenge of balancing real-time and historical data access adds another layer of complexity to maintaining responsive performance at scale.
Cost management represents a final key challenge that organizations must address. Without careful architectural planning, RAG implementations can lead to unnecessary expenses through duplicate data storage, excessive vector database operations, and inefficient data transfer patterns. Organizations need to optimize their resource utilization carefully to help prevent these costs from escalating while maintaining system performance and functionality.
A modern data strategy addresses the complex challenges of RAG applications through comprehensive governance frameworks and robust architectural components. At its core, the strategy implements sophisticated governance mechanisms that go beyond traditional data management approaches. These frameworks enable AI systems to dynamically access enterprise information while maintaining strict control over data lineage, access patterns, and regulatory compliance. By implementing comprehensive provenance tracking, usage auditing, and compliance frameworks, organizations can operate their RAG applications within established ethical and regulatory boundaries.
Serverless data lakes serve as the foundational component of this strategy, offering an elegant solution to both performance and cost challenges. Their inherent scalability automatically handles varying workloads without requiring complex capacity planning, and pay-per-use pricing models facilitate cost efficiency. The ability to support multiple data formats—from structured to unstructured—makes them particularly well-suited for RAG applications that need to process and index diverse document types.
To address security and access control challenges, the strategy implements enterprise-level data sharing mechanisms. These include sophisticated cross-functional access controls and federated access management systems that enable secure data exchange across organizational boundaries. Fine-grained permissions at the row, column, and object levels enforce security boundaries while maintaining necessary data access for AI systems.
Data discoverability challenges are met through centralized cataloging systems that help prevent duplicate efforts and enable efficient resource utilization. This comprehensive approach includes business glossaries, technical catalogs, and data lineage tracking, so that teams can quickly locate and understand available data assets. The catalog system is enriched with quality metrics that help maintain data accuracy and consistency across the organization.
Finally, the strategy implements a structured data classification framework that addresses security and compliance concerns. By categorizing information into clear sensitivity levels from public to restricted, organizations can create RAG applications that only retrieve and process information appropriate to user access levels. This systematic approach to data classification helps prevent unauthorized information disclosure while maintaining the utility of AI systems across different business contexts.
Our solution uses AWS services to create a secure, scalable foundation for enterprise RAG applications. The components are explained in the following sections.
Data lake structure using Amazon S3
Our data lake will use Amazon S3 as the primary storage layer, organized with the following structure:

s3://amzn-s3-demo-enterprise-datalake/
├── retail/
│   ├── product-catalog/
│   ├── customer-data/
│   └── sales-history/
├── finance/
│   ├── financial-statements/
│   ├── tax-documents/
│   └── budget-forecasts/
├── supply-chain/
│   ├── inventory-reports/
│   ├── supplier-contracts/
│   └── logistics-data/
└── shared/
    ├── company-policies/
    ├── knowledge-base/
    └── public-data/

Each business domain has dedicated folders containing domain-specific data, with common data stored in a shared folder.
Data sharing options
There are two options for data sharing. The first option is Amazon S3 Access Points, which provide a dedicated access endpoint policy for different applications or user groups. This approach enables fine-grained control without modifying the base bucket policy.
The following code is an example access point configuration. This policy grants the RetailAnalyticsRole read-only access (GetObject and ListBucket permissions) to data in both the retail-specific directory and the shared directory, but it restricts access to other business domain directories. The policy is attached to a dedicated S3 access point, allowing users with this role to retrieve only data relevant to retail operations and commonly shared resources:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/RetailAnalyticsRole"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:us-east-1:123456789012:accesspoint/retail-access-point/object/retail/*",
        "arn:aws:s3:us-east-1:123456789012:accesspoint/retail-access-point/object/shared/*"
      ]
    }
  ]
}

The second option for data sharing is using bucket policies with path-based access control. Bucket policies can implement path-based restrictions to control which user roles can access specific data directories. The following code is an example bucket policy. This bucket policy implements domain-based access control by granting different permissions based on user roles and data paths. The FinanceUserRole can only access data within the finance and shared directories, and the RetailUserRole can only access data within the retail and shared directories. This pattern enforces data isolation between business domains while facilitating access to common resources. Each role is limited to read-only operations (GetObject and ListBucket) on their authorized directories, which means users can only retrieve data relevant to their business functions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/FinanceUserRole"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-enterprise-datalake/finance/*",
        "arn:aws:s3:::amzn-s3-demo-enterprise-datalake/shared/*"
      ]
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/RetailUserRole"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-enterprise-datalake/retail/*",
        "arn:aws:s3:::amzn-s3-demo-enterprise-datalake/shared/*"
      ]
    }
  ]
}

As the number of datasets and use cases scales, you might require more policy space. Bucket policies work as long as the necessary policies fit within the size limits of S3 bucket policies (20 KB) and AWS Identity and Access Management (IAM) policies (5 KB), and within the number of IAM principals allowed per account. With an increasing number of datasets, access points offer a better alternative because each access point has its own dedicated policy. You can define granular access control patterns because you can have thousands of access points per AWS Region per account, each with a policy up to 20 KB in size. Although S3 Access Points increase the amount of available policy space, they require a mechanism for clients to discover the right access point for the right dataset. To manage scale, S3 Access Grants provides a simplified model to map identities in directories such as Active Directory, or IAM principals, to datasets in Amazon S3 by prefix, bucket, or object. With this simplified access scheme, you can grant read-only, write-only, or read-write access by Amazon S3 prefix to both IAM principals and directly to users or groups from a corporate directory. As a result, you can manage automated data permissions at scale.
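As a reference, a prefix-scoped grant can be created with the S3 Control API as in the following sketch; the account ID, location ID, prefix, and role ARN are illustrative assumptions:

import boto3

s3control = boto3.client("s3control")

s3control.create_access_grant(
    AccountId="123456789012",
    AccessGrantsLocationId="default",  # location registered for the data lake bucket
    AccessGrantsLocationConfiguration={
        "S3SubPrefix": "retail/*"  # scope the grant to the retail prefix
    },
    Grantee={
        "GranteeType": "IAM",
        "GranteeIdentifier": "arn:aws:iam::123456789012:role/RetailAnalyticsRole",
    },
    Permission="READ",
)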
An Amazon Comprehend PII redaction job identifies and redacts (or masks) sensitive data in documents residing in Amazon S3. After redaction, documents are verified for redaction effectiveness using Amazon Macie. Documents flagged by Macie are sent to another bucket for manual review, and cleared documents are moved to a redacted bucket ready for ingestion. For more details, refer to Protect sensitive data in RAG applications with Amazon Comprehend.
User-dataset mapping with DynamoDB
To dynamically manage access permissions, you can use DynamoDB to store mapping information between users or roles and datasets. You can automate the mapping from AWS Lake Formation to DynamoDB using CloudTrail and event-driven Lambda invocation. The DynamoDB structure consists of a table named UserDatasetAccess. Its primary key structure is:

Partition key – UserIdentifier (string) – IAM role Amazon Resource Name (ARN) or user ID
Sort key – DatasetID (string) – Unique identifier for each dataset

Additional attributes consist of:

DatasetPath (string) – S3 path to the dataset
AccessLevel (string) – READ, WRITE, or ADMIN
Classification (string) – PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED
Domain (string) – Business domain (such as retail or finance)
ExpirationTime (number) – Optional Time To Live (TTL) for temporary access

The following DynamoDB item represents an access mapping between a user role (RetailAnalyst) and a specific dataset (retail-products). It defines that this role has READ access to product catalog data in the retail domain with an INTERNAL security classification. When the RAG application processes a query, it references this mapping to determine which datasets the querying user can access, and the application only retrieves and uses data appropriate for the user’s permissions level.

{
  "UserIdentifier": "arn:aws:iam::123456789012:role/RetailAnalyst",
  "DatasetID": "retail-products",
  "DatasetPath": "s3://amzn-s3-demo-enterprise-datalake/retail/product-catalog/",
  "AccessLevel": "READ",
  "Classification": "INTERNAL",
  "Domain": "retail"
}

This approach provides a flexible, programmatic way to control which users can access specific datasets, enabling fine-grained permission management for RAG applications.
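For example, the application can look up the datasets for a signed-in user with a simple query against this table; the sketch below assumes the table and attribute names described above.

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("UserDatasetAccess")

def get_user_datasets(user_identifier: str) -> list:
    # All dataset mappings for this principal (partition key lookup).
    response = table.query(
        KeyConditionExpression=Key("UserIdentifier").eq(user_identifier)
    )
    # Each item carries DatasetID, DatasetPath, AccessLevel, and Classification,
    # which the application uses to build knowledge base filters.
    return response["Items"]

datasets = get_user_datasets("arn:aws:iam::123456789012:role/RetailAnalyst")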
Amazon Bedrock Knowledge Bases for unstructured data
Amazon Bedrock Knowledge Bases provides a managed solution for organizing, indexing, and retrieving unstructured data to support RAG applications. For our solution, we use this service to create domain-specific knowledge bases. With the metadata filtering feature provided by Amazon Bedrock Knowledge Bases, you can retrieve not only semantically relevant chunks but a well-defined subset of those relevant chunks based on applied metadata filters and associated values. In the next sections, we show how you can set this up.
Configuring knowledge bases with metadata filtering
We organize our knowledge bases to support filtering based on:

Business domain (such as finance, retail, or supply-chain)
Security classification (such as public, internal, confidential, or restricted)
Document type (such as policy, report, or guide)

Each document ingested into our knowledge base includes a standardized metadata structure:

{
  "source_uri": "s3://amzn-s3-demo-enterprise-datalake/retail/product-catalog/shoes-inventory-2023.pdf",
  "title": "Shoes Inventory Report 2023",
  "language": "en",
  "last_updated": "2023-12-15T14:30:00Z",
  "author": "Inventory Management Team",
  "business_domain": "retail",
  "security_classification": "internal",
  "document_type": "inventory_report",
  "departments": ["retail", "supply-chain"],
  "tags": ["footwear", "inventory", "2023"],
  "version": "1.2"
}

Code examples shown throughout this post are for reference only and highlight key API calls and logic. Additional implementation code is required for production deployments.
Amazon Bedrock Knowledge Bases API integration
To demonstrate how our RAG application will interact with the knowledge base, here’s a Python sample using the AWS SDK:

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# High-level logic for querying the knowledge base with security filters
def query_knowledge_base(query_text, user_role, business_domain=None):
    # Get permitted classifications based on user role
    permitted_classifications = get_permitted_classifications(user_role)

    # Build security filter expression
    filter_expression = build_security_filters(permitted_classifications, business_domain)

    # Key API call for retrieval with security filtering
    response = bedrock_agent_runtime.retrieve(
        knowledgeBaseId="your-kb-id",
        retrievalQuery={"text": query_text},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": 5,
                "filter": filter_expression  # Apply security filters here
            }
        }
    )

    return response["retrievalResults"]
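The build_security_filters helper referenced above is not shown in the sample; one way it could assemble an Amazon Bedrock Knowledge Bases metadata filter, assuming the metadata keys defined earlier, is the following (get_permitted_classifications remains an assumed lookup):

def build_security_filters(permitted_classifications, business_domain=None):
    conditions = [
        # Only return chunks whose classification the user is allowed to see.
        {"in": {"key": "security_classification", "value": permitted_classifications}}
    ]
    if business_domain:
        # Optionally narrow results to a single business domain.
        conditions.append({"equals": {"key": "business_domain", "value": business_domain}})
    # A single condition can be passed directly; multiple conditions are combined with andAll.
    return conditions[0] if len(conditions) == 1 else {"andAll": conditions}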

Conclusion
In this post, we’ve explored how to build a secure RAG application using a serverless data lake architecture. The approach we’ve outlined provides several key advantages:

Security-first design – Fine-grained access controls at scale mean that users only access data they’re authorized for
Scalability – Serverless components automatically handle varying workloads
Cost-efficiency – Pay-as-you-go pricing models optimize expenses
Flexibility – Seamless adaptation to different business domains and use cases

By implementing a modern data strategy with proper governance, security controls, and serverless architecture, organizations can make the most of their data assets for generative AI applications while maintaining security and compliance. The RAG architecture we’ve described enables contextualized, accurate responses that respect security boundaries, providing a powerful foundation for enterprise AI applications across diverse business domains.
For next steps, consider implementing monitoring and observability to track performance and usage patterns.
For performance and usage monitoring:

Deploy Amazon CloudWatch metrics and dashboards to track key performance indicators such as query latency, throughput, and error rates
Set up CloudWatch Logs Insights to analyze usage patterns and identify optimization opportunities
Implement AWS X-Ray tracing to visualize request flows across your serverless components

For security monitoring and defense:

Enable Amazon GuardDuty to detect potential threats targeting your S3 data lake, Lambda functions, and other application resources
Implement Amazon Inspector for automated vulnerability assessments of your Lambda functions and container images
Configure AWS Security Hub to consolidate security findings and measure cloud security posture across your RAG application resources
Use Amazon Macie for continuous monitoring of S3 data lake contents to detect sensitive data exposures

For authentication and activity auditing:

Analyze AWS CloudTrail logs to audit API calls across your application stack
Implement CloudTrail Lake to create SQL-queryable datasets for security investigations
Enable Amazon Cognito advanced security features to detect suspicious sign-in activities

For data access controls:

Set up CloudWatch alarms to send alerts about unusual data access patterns
Configure AWS Config rules to monitor for compliance with access control best practices
Implement AWS IAM Access Analyzer to identify unintended resource access

Other important considerations include:

Adding feedback loops to continuously improve retrieval quality
Exploring multi-Region deployment for improved resilience
Implementing caching layers to optimize frequently accessed content
Extending the solution to support structured data assets using AWS Glue and AWS Lake Formation for data transformation and data access

With these foundations in place, your organization will be well-positioned to use generative AI technologies securely and effectively across the enterprise.

About the authors
Venkata Sistla is a Senior Specialist Solutions Architect in the Worldwide team at Amazon Web Services (AWS), with over 12 years of experience in cloud architecture. He specializes in designing and implementing enterprise-scale AI/ML systems across financial services, healthcare, mining and energy, independent software vendors (ISVs), sports, and retail sectors. His expertise lies in helping organizations transform their data challenges into competitive advantages through innovative cloud solutions while mentoring teams and driving technological excellence. He focuses on architecting highly scalable infrastructures that accelerate machine learning initiatives and deliver measurable business outcomes.
Aamna Najmi is a Senior GenAI and Data Specialist in the Worldwide team at Amazon Web Services (AWS). She assists customers across industries and Regions in operationalizing and governing their generative AI systems at scale, ensuring they meet the highest standards of performance, safety, and ethical considerations, bringing a unique perspective of modern data strategies to complement the field of AI. In her spare time, she pursues her passion of experimenting with food and discovering new places.
 

Google DeepMind Releases GenAI Processors: A Lightweight Python Librar …

Google DeepMind recently released GenAI Processors, a lightweight, open-source Python library built to simplify the orchestration of generative AI workflows—especially those involving real-time multimodal content. Launched last week and available under an Apache‑2.0 license, the library provides a high-throughput, asynchronous stream framework for building advanced AI pipelines.

Stream‑Oriented Architecture

At the heart of GenAI Processors is the concept of processing asynchronous streams of ProcessorPart objects. These parts represent discrete chunks of data—text, audio, images, or JSON—each carrying metadata. By standardizing inputs and outputs into a consistent stream of parts, the library enables seamless chaining, combining, or branching of processing components while maintaining bidirectional flow. Internally, the use of Python’s asyncio enables each pipeline element to operate concurrently, dramatically reducing latency and improving overall throughput.

Efficient Concurrency

GenAI Processors is engineered to optimize latency by minimizing “Time To First Token” (TTFT). As soon as upstream components produce pieces of the stream, downstream processors begin work. This pipelined execution ensures that operations—including model inference—overlap and proceed in parallel, achieving efficient utilization of system and network resources.
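The effect is easiest to see with a small, self-contained asyncio example; this is a conceptual illustration of overlapped stream processing, not the GenAI Processors API itself:

import asyncio

async def produce_chunks():
    # Upstream producer: emits a chunk every 0.1 s.
    for i in range(5):
        await asyncio.sleep(0.1)
        yield f"chunk-{i}"

async def transform(parts):
    # Downstream stage: starts working on each chunk as soon as it arrives.
    async for part in parts:
        await asyncio.sleep(0.05)
        yield part.upper()

async def main():
    async for out in transform(produce_chunks()):
        # The first output appears after roughly 0.15 s instead of waiting
        # for the entire upstream stream to finish.
        print(out)

asyncio.run(main())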

Plug‑and‑Play Gemini Integration

The library comes with ready-made connectors for Google’s Gemini APIs, including both synchronous text-based calls and the Gemini Live API for streaming applications. These “model processors” abstract away the complexity of batching, context management, and streaming I/O, enabling rapid prototyping of interactive systems—such as live commentary agents, multimodal assistants, or tool-augmented research explorers.

Modular Components & Extensions

GenAI Processors prioritizes modularity. Developers build reusable units—processors—each encapsulating a defined operation, from MIME-type conversion to conditional routing. A contrib/ directory encourages community extensions for custom features, further enriching the ecosystem. Common utilities support tasks such as splitting/merging streams, filtering, and metadata handling, enabling complex pipelines with minimal custom code.

Notebooks and Real‑World Use Cases

Included with the repository are hands-on examples demonstrating key use cases:

Real‑Time Live agent: Connects audio input to Gemini and optionally a tool like web search, streaming audio output—all in real time.

Research agent: Orchestrates data collection, LLM querying, and dynamic summarization in sequence.

Live commentary agent: Combines event detection with narrative generation, showcasing how different processors sync to produce streamed commentary.

These examples, provided as Jupyter notebooks, serve as blueprints for engineers building responsive AI systems.

Comparison and Ecosystem Role

GenAI Processors complements tools like the google-genai SDK (the GenAI Python client) and Vertex AI, but elevates development by offering a structured orchestration layer focused on streaming capabilities. Unlike LangChain—which is focused primarily on LLM chaining—or NeMo—which constructs neural components—GenAI Processors excels in managing streaming data and coordinating asynchronous model interactions efficiently.

Broader Context: Gemini’s Capabilities

GenAI Processors leverages Gemini’s strengths. Gemini, DeepMind’s multimodal large language model, supports processing of text, images, audio, and video—most recently seen in the Gemini 2.5 rollout. GenAI Processors enables developers to create pipelines that match Gemini’s multimodal skillset, delivering low-latency, interactive AI experiences.

Conclusion

With GenAI Processors, Google DeepMind provides a stream-first, asynchronous abstraction layer tailored for generative AI pipelines. By enabling:

Bidirectional, metadata-rich streaming of structured data parts

Concurrent execution of chained or parallel processors

Integration with Gemini model APIs (including Live streaming)

Modular, composable architecture with an open extension model

…this library bridges the gap between raw AI models and deployable, responsive pipelines. Whether you’re developing conversational agents, real-time document extractors, or multimodal research tools, GenAI Processors offers a lightweight yet powerful foundation.

Check out the Technical Details and GitHub Page. All credit for this research goes to the researchers of this project.
The post Google DeepMind Releases GenAI Processors: A Lightweight Python Library that Enables Efficient and Parallel Content Processing appeared first on MarkTechPost.

Meta AI Introduces UMA (Universal Models for Atoms): A Family of Unive …

Density Functional Theory (DFT) serves as the foundation of modern computational chemistry and materials science. However, its high computational cost severely limits its usage. Machine Learning Interatomic Potentials (MLIPs) have the potential to closely approximate DFT accuracy while significantly improving performance, reducing computation time from hours to less than a second with O(n) versus O(n³) scaling. However, training MLIPs that generalize across different chemical tasks remains an open challenge, as traditional methods rely on smaller problem-specific datasets instead of using the scaling advantages that have driven significant advances in language and vision models.

Existing attempts to address these challenges have focused on developing Universal MLIPs trained on larger datasets, with datasets like Alexandria and OMat24 leading to improved performance on the Matbench-Discovery leaderboard. Moreover, researchers have explored scaling relations to understand relationships between compute, data, and model size, taking inspiration from empirical scaling laws in LLMs that motivated training on more tokens with larger models for predictable performance improvements. These scaling relations help in determining optimal resource allocation between the dataset and model size. However, their application to MLIPs remains limited compared to the transformative impact seen in language modeling.

Researchers from FAIR at Meta and Carnegie Mellon University have proposed a family of Universal Models for Atoms (UMA) designed to test the limits of accuracy, speed, and generalization for a single model across chemistry and materials science. To address these challenges, they developed empirical scaling laws relating compute, data, and model size to determine optimal model sizing and training strategies. This helped balance accuracy and efficiency when training on an unprecedented dataset of roughly 500 million atomic systems. Moreover, UMA performs similarly to or better than specialized models in both accuracy and inference speed on a wide range of material, molecular, and catalysis benchmarks, without fine-tuning to specific tasks.

The UMA architecture builds upon eSEN, an equivariant graph neural network, with crucial modifications to enable efficient scaling and to handle additional inputs, including total charge, spin, and DFT settings for emulation. It also incorporates a new embedding that allows UMA models to integrate charge, spin, and DFT-related tasks; each of these inputs generates an embedding of the same dimension as the spherical channels used. The training follows a two-stage approach: the first stage directly predicts forces for faster training, and the second stage removes the force head and fine-tunes the model to predict conserving forces and stresses using auto-grad, ensuring energy conservation and smooth potential energy landscapes.

The results show that UMA models exhibit log-linear scaling behavior across the tested FLOP ranges. This indicates that greater model capacity is required to fit the UMA dataset, with these scaling relationships used to select model sizes and to show the advantages of mixture-of-linear-experts (MoLE) layers over dense architectures. In multi-task training, a significant improvement in loss is observed when moving from 1 expert to 8 experts, smaller gains with 32 experts, and negligible improvements at 128 experts. Moreover, UMA models demonstrate exceptional inference efficiency despite their large parameter counts, with UMA-S capable of simulating 1,000 atoms at 16 steps per second and fitting system sizes of up to 100,000 atoms in memory on a single 80 GB GPU.

In conclusion, researchers introduced a family of Universal Models for Atoms (UMA) that shows strong performance across a wide range of benchmarks, including materials, molecules, catalysts, molecular crystals, and metal-organic frameworks. It achieves new state-of-the-art results on established benchmarks such as AdsorbML and Matbench Discovery. However, it fails to handle long-range interactions due to the standard 6Å cutoff distance. Moreover, it uses separate embeddings for discrete charge or spin values, which limits generalization to unseen charges or spins. Future research aims to advance toward universal MLIPs and unlock new possibilities in atomic simulations, while highlighting the need for more challenging benchmarks to drive future progress.

Check out the Paper, Models on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project.
The post Meta AI Introduces UMA (Universal Models for Atoms): A Family of Universal Models for Atoms appeared first on MarkTechPost.

Moonshot AI Releases Kimi K2: A Trillion-Parameter MoE Model Focused …

Kimi K2, launched by Moonshot AI in July 2025, is a purpose-built, open-source Mixture-of-Experts (MoE) model—1 trillion total parameters, with 32 billion active parameters per token. It’s trained using the custom MuonClip optimizer on 15.5 trillion tokens, achieving stable training at this unprecedented scale without the typical instabilities seen in ultra-large models.

Unlike traditional chatbots, K2 is architected specifically for agentic workflows. It features native Model Context Protocol (MCP) support and was trained on simulated multi-step tool interactions, enabling it to autonomously decompose tasks, execute tool sequences, write and debug code, analyze data, and orchestrate workflows—all with minimal human oversight.

Why Agentic over Conversational?

While advanced models like GPT-4 and Claude 4 Sonnet excel at language reasoning, Kimi K2 moves from reasoning to action. It doesn’t just respond—it executes. The core shift lies in enabling real-world workflows:

Autonomous code execution

Data analysis with charts and interfaces

End-to-end web application development

Orchestration of 17+ tools per session without human input

K2’s training incorporated millions of synthetic dialogues, each rated by an LLM-based evaluator. These dialogues simulate realistic tool-use scenarios, giving K2 a practical edge in tool selection and multi-step execution.

Architecture and Training Innovations

K2’s technical design demonstrates several novel elements:

MoE Transformer Design: 384 experts with routing to 8 active experts per token, plus 1 shared expert for global context. The model uses 64 attention heads and supports a 128K-token context window. (A toy sketch of this routing pattern follows this list.)

MuonClip Optimizer: A modified version of Muon that stabilizes training at scale. It uses qk-clipping to constrain attention scores by rescaling Q/K matrices, effectively preventing instability in deep layers.

Training Dataset: Over 15.5 trillion tokens from multilingual and multimodal sources, giving K2 robust generalization and tool-use reasoning across diverse domains.
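The routing idea in the first bullet can be illustrated with a toy PyTorch layer; the shapes and tiny linear experts below are purely illustrative and are not Kimi K2’s actual implementation:

import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=384, top_k=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.shared_expert = nn.Linear(d_model, d_model)  # always active for global context
        self.top_k = top_k

    def forward(self, x):  # x: [tokens, d_model]
        scores = self.router(x).softmax(dim=-1)         # routing probabilities per token
        weights, idx = scores.topk(self.top_k, dim=-1)  # 8 active experts per token
        outputs = []
        for x_tok, w_row, e_row in zip(x, weights, idx):
            y = self.shared_expert(x_tok)
            for w, e in zip(w_row, e_row):
                y = y + w * self.experts[e](x_tok)
            outputs.append(y)
        return torch.stack(outputs)

layer = ToyMoELayer()
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])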

The model comes in two variants: Kimi-K2-Base, the foundational model ideal for fine-tuning and building customized solutions; and Kimi-K2-Instruct, the post-trained version optimized for immediate use in general-purpose chat and tool-using agentic tasks. Instruct is reflex-grade—optimized for fast, low-latency interaction rather than long-form deliberation. On benchmarks, Kimi K2 matches or outperforms Claude Sonnet 4 and GPT-4.1 in coding and agentic reasoning, scoring 71.6% on SWE-bench Verified, 65.8% on agentic coding (Tau2), and 53.7% on LiveCodeBench.

Performance Benchmarks

Kimi K2 not only matches but often surpasses closed-source models on key benchmarks:

Benchmark | Kimi K2 | GPT-4.1 | Claude Sonnet 4
SWE-bench Verified | 71.6% | 54.6% | ~72.7%
Agentic Coding (Tau2) | 65.8% | 45.2% | ~61%
LiveCodeBench v6 (Pass@1) | 53.7% | 44.7% | 47.4%
MATH-500 | 97.4% | 92.4% | –
MMLU | 89.5% | ~90.4% | ~92.9%

Its performance in agentic benchmarks like Tau2 and LiveCodeBench demonstrates its superior capacity to handle multi-step, real-world coding tasks—outperforming many proprietary models.

Cost Efficiency

Perhaps the most disruptive element is pricing:

Claude 4 Sonnet: $3 input / $15 output per million tokens

Gemini 2.5 Pro: $2.5 input / $15 output

Kimi K2: $0.60 input / $2.50 output

Kimi K2 is roughly 5x cheaper than Claude or Gemini while offering equal or better performance on several metrics. The cost advantage, combined with open access and support for local deployment, positions K2 as an economically viable alternative for developers, enterprises, and research teams.

Strategic Shift: From Thinking to Acting

Kimi K2 marks a pivotal moment in AI’s evolution—from thinking agents to acting systems. With native tool-use capabilities and built-in support for multi-agent protocols, it goes far beyond static chat interfaces. It is capable of triggering workflows, making decisions, executing API calls, and delivering tangible outputs autonomously.

Moreover, its release comes at a time when most such capabilities are either locked behind expensive APIs or limited to research labs. K2 is:

Open-source, requiring no subscription

Globally accessible, not limited to US-based deployment

Designed for developers, not just end-users

Broader Implications

Will agentic architecture become the norm? K2’s strong performance on tool use tasks could push proprietary players to rethink their architectures.

Can open-source efforts from Asia compete at global scale? With K2, Moonshot AI joins others like DeepSeek in showing that top-tier performance doesn’t have to originate from Silicon Valley.

What’s next in the agentic evolution? Future models may combine video, robotics, and embodied reasoning to further expand the scope of what agentic AI can accomplish.

Conclusion

Kimi K2 isn’t just a bigger model—it’s a blueprint for what comes after the reasoning race: execution-first AI. By combining trillion-parameter scale, low inference costs, and deeply integrated agentic capabilities, Kimi K2 opens the door for AI systems that do more than generate—they build, act, and solve autonomously.

Check out the Models on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project.
The post Moonshot AI Releases Kimi K2: A Trillion-Parameter MoE Model Focused on Long Context, Code, Reasoning, and Agentic Behavior appeared first on MarkTechPost.

From Perception to Action: The Role of World Models in Embodied AI Sys …

Introduction to Embodied AI Agents

Embodied AI agents are systems that exist in physical or virtual forms, such as robots, wearables, or avatars, and can interact with their surroundings. Unlike static web-based bots, these agents perceive the world and act meaningfully within it. Their embodiment enhances physical interaction, human trust, and human-like learning. Recent advances in large language and vision-language models have powered more capable, autonomous agents that can plan, reason, and adapt to users’ needs. These agents understand context, retain memory, and can collaborate or request clarification when needed. Despite progress, challenges remain, especially with generative models that often prioritize detail over efficient reasoning and decision-making.

World Modeling and Applications

Researchers at Meta AI are exploring how embodied AI agents, such as avatars, wearables, and robots, can interact more naturally with users and their surroundings by sensing, learning, and acting within real or virtual environments. Central to this is “world modeling,” which combines perception, reasoning, memory, and planning to help agents understand both physical spaces and human intentions. These agents are reshaping industries such as healthcare, entertainment, and labor. The study highlights future goals, such as enhancing collaboration, social intelligence, and ethical safeguards, particularly around privacy and anthropomorphism, as these agents become increasingly integrated into our lives.

Types of Embodied Agents

Embodied AI agents come in three forms: virtual, wearable, and robotic, and are designed to interact with the world in much the same way as humans. Virtual agents, such as therapy bots or avatars in the metaverse, simulate emotions to foster empathetic interactions. Wearable agents, such as those in smart glasses, share the user’s view and assist with real-time tasks or provide cognitive support. Robotic agents operate in physical spaces, assisting with complex or high-risk tasks such as caregiving or disaster response. These agents not only enhance daily life but also push us closer to general AI by learning through real-world experience, perception, and physical interaction.

Importance of World Models

World models are crucial for embodied AI agents, enabling them to perceive, understand, and interact with their environment like humans. These models integrate various sensory inputs, such as vision, sound, and touch, with memory and reasoning capabilities to form a cohesive understanding of the world. This enables agents to anticipate outcomes, plan effective actions, and adapt to new situations. By incorporating both physical surroundings and user intentions, world models facilitate more natural and intuitive interactions between humans and AI agents, enhancing their ability to perform complex tasks autonomously.

To enable truly autonomous learning in Embodied AI, future research must integrate passive observation (such as vision-language learning) with active interaction (like reinforcement learning). Passive systems excel at understanding structure from data but lack grounding in real-world actions. Active systems learn through doing, but are often inefficient. By combining both, AI can gain abstract knowledge and apply it through goal-driven behavior. Looking ahead, collaboration among multiple agents adds complexity, requiring effective communication, coordination, and conflict resolution. Strategies like emergent communication, negotiation, and multi-agent reinforcement learning will be key. Ultimately, the aim is to build adaptable, interactive AI that learns like humans through experience.

Conclusion

In conclusion, the study examines how embodied AI agents, such as virtual avatars, wearable devices, and robots, can interact with the world more like humans by perceiving, learning, and acting within their environments. Central to their success is building “world models” that help them understand context, predict outcomes, and plan effectively. These agents are already reshaping areas like therapy, entertainment, and real-time assistance. As they become more integrated into daily life, ethical issues such as privacy and human-like behavior require careful attention. Future work will focus on improving learning, collaboration, and social intelligence, aiming for more natural, intuitive, and responsible human-AI interaction.

Check out the Paper here. All credit for this research goes to the researchers of this project.
The post From Perception to Action: The Role of World Models in Embodied AI Systems appeared first on MarkTechPost.