In-Context Learning Capabilities of Multi-Layer Perceptrons MLPs: A Comparative Study with Transformers

Recent years have seen significant advances in neural language models, particularly Large Language Models (LLMs) enabled by the Transformer architecture and increased scale. LLMs exhibit exceptional skills in generating grammatical text, answering questions, summarising content, creating imaginative outputs, and solving complex puzzles. A key capability is in-context learning (ICL), where the model uses novel task exemplars presented during inference to respond accurately without weight updates. ICL is typically attributed to Transformers and their attention-based mechanisms.

ICL has been shown for linear regression tasks with Transformers, which can generalize to new input/label pairs in-context. Transformers achieve this by potentially implementing gradient descent or replicating least-squares regression. Transformers interpolate between in-weight learning (IWL) and ICL, with diverse datasets enhancing ICL capabilities. While most studies focus on Transformers, some research explores recurrent neural networks (RNNs) and LSTMs, with mixed results. Recent findings highlight various causal sequence models and state space models also achieving ICL. However, MLPs’ potential for ICL remains underexplored despite their resurgence in complex tasks, prompted by the introduction of the MLP-Mixer model.

In this study, researchers from Harvard demonstrate that multi-layer perceptrons (MLPs) can learn in-context effectively. MLPs and MLP-Mixer models perform competitively with Transformers on ICL tasks under the same compute budget. In particular, MLPs outperform Transformers on relational reasoning ICL tasks, challenging the belief that ICL is unique to Transformers. This success suggests looking beyond attention-based architectures and indicates that Transformers, constrained by self-attention and positional encodings, may be biased away from certain task structures compared with MLPs.

The study investigates MLPs' behavior in ICL through two tasks: in-context regression and in-context classification. For ICL regression, the input is a sequence of linearly related value pairs (xi, yi), generated with varying weights β and added noise, plus a query xq. The model predicts the corresponding yq by inferring β from the context exemplars. For ICL classification, the input is a sequence of exemplars (xi, yi) followed by a query xq, sampled from a Gaussian mixture model. The model predicts the correct label for xq by referencing the context exemplars, with performance studied as a function of data diversity and burstiness (the number of repeats per cluster in the context).
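
A minimal sketch of how such an in-context regression batch might be generated; the dimensions, noise level, and flattening scheme below are illustrative assumptions rather than the paper's exact setup.

import numpy as np

def make_icl_regression_batch(batch=32, n_ctx=8, d=4, noise=0.1, rng=np.random.default_rng(0)):
    """Build (inputs, targets) for in-context linear regression."""
    beta = rng.normal(size=(batch, d))                      # a fresh weight vector per sequence
    x_ctx = rng.normal(size=(batch, n_ctx, d))              # context inputs x_i
    y_ctx = np.einsum("bnd,bd->bn", x_ctx, beta)            # y_i = <x_i, beta> ...
    y_ctx += noise * rng.normal(size=y_ctx.shape)           # ... plus Gaussian noise
    x_q = rng.normal(size=(batch, d))                       # query x_q
    y_q = np.einsum("bd,bd->b", x_q, beta)                  # target y_q the model must infer
    # For an MLP, the whole context plus the query is flattened into one input vector.
    inputs = np.concatenate([x_ctx.reshape(batch, -1), y_ctx, x_q], axis=1)
    return inputs, y_q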

MLPs and Transformers were compared on in-context regression and classification tasks. Both architectures, including MLP-Mixers, achieved near-optimal mean squared error (MSE) with sufficient compute, although Transformers slightly outperformed MLPs at smaller compute budgets. For longer context lengths, vanilla MLPs performed worse, while MLP-Mixers maintained optimal MSE. As data diversity increased, all models transitioned from IWL to ICL, with Transformers making the transition more quickly. In in-context classification, MLPs performed comparably to Transformers, maintaining relatively flat loss across context lengths and transitioning from IWL to ICL with increased data diversity.


Microsoft AI for Good Introduces Pytorch-Wildlife: An Open-Source Deep Learning Platform Built on PyTorch

Human activities increasingly threaten wildlife’s role in maintaining ecosystem balance, highlighting the critical need for large-scale biodiversity monitoring. Addressing the logistical challenges of fieldwork and data collection, especially in remote and biodiverse regions, has led to the deployment of automated data collection devices. These include camera traps, autonomous recording units, and overhead cameras on drones and satellites. While these tools have proven effective, they generate vast datasets that necessitate manual processing and annotation, creating a significant bottleneck in data management.

Deep learning technologies, particularly Convolutional Neural Networks (CNNs), have revolutionized the processing of large, complex datasets, such as those comprising wildlife images. These technologies have shown exceptional performance in animal detection and classification.

However, practical implementation in conservation efforts presents challenges. Effective integration of deep learning in conservation requires addressing accessibility, scalability, and transparency. Accessibility ensures models are easy to install and use, even for non-technical users. Scalability allows the framework to adapt to various needs and scenarios, and transparency involves providing open-source solutions that users can understand and build upon.

To tackle these challenges, Microsoft researchers developed Pytorch-Wildlife, an open-source deep learning framework tailored specifically for conservation efforts and emphasizing ease of use, adaptability, and openness. Because it is available via pip, the framework can be installed on any system that supports Python. Its modular architecture enables the seamless addition of new features, models, and datasets, ensuring that it remains versatile and applicable across different conservation tasks.

One of Pytorch-Wildlife's significant features is its comprehensive model zoo, which includes various models for animal detection and classification. This allows users to choose the models best suited to their specific needs. Additionally, Pytorch-Wildlife features a user-friendly interface designed to cater to non-technical users, making advanced deep-learning tools accessible to a broader audience within the conservation community. This interface simplifies interaction with the framework's capabilities, fostering wider adoption and more effective use of AI in wildlife monitoring.
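
As a rough illustration of the pip-based workflow and model zoo described above, here is a minimal sketch; the package name, import path, class, and method are assumptions inferred from the framework's description and may not match the actual API.

# pip install PytorchWildlife   (assumed package name on PyPI)
# All names below are illustrative assumptions, not verified against the library.
from PytorchWildlife.models import detection as pw_detection

# Pick an animal detection model from the model zoo.
detector = pw_detection.MegaDetectorV5()

# Run detection on a single camera-trap image and inspect the results.
results = detector.single_image_detection(img_path="camera_trap_frame.jpg")
print(results)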

Pytorch-wildlife also demonstrates its practical utility through real-world applications. For instance, it has been used to detect and recognize animals in specific conservation projects, such as monitoring opossums in the Galapagos Islands and identifying 36 animal genera in the Amazon Rainforest. These applications showcase the framework’s robustness and effectiveness in diverse environments, underscoring its potential to transform biodiversity monitoring and wildlife conservation efforts.

In conclusion, Pytorch-Wildlife represents a significant advancement in the use of deep learning for conservation. Its focus on accessibility, scalability, and transparency addresses the primary challenges of integrating AI into wildlife monitoring. As an open-source framework, it encourages collaboration and continuous improvement, enabling the conservation community to leverage cutting-edge technology in preserving biodiversity. Pytorch-Wildlife is a unified and versatile platform poised to enhance the efficiency and impact of conservation projects worldwide.


Advancing Ethical AI: Preference Matching Reinforcement Learning from Human Feedback RLHF for Aligning LLMs with Human Preferences

Large language models (LLMs) like ChatGPT-4 and Claude-3 Opus excel in tasks such as code generation, data analysis, and reasoning. Their growing influence in decision-making across various domains makes it crucial to align them with human preferences to ensure fairness and sound economic decisions. Human preferences vary widely due to cultural backgrounds and personal experiences, and LLMs often exhibit biases, favoring dominant viewpoints and frequent items. If LLMs do not accurately reflect these diverse preferences, biased outputs can lead to unfair and economically detrimental outcomes.

Existing methods, particularly reinforcement learning from human feedback (RLHF), suffer from algorithmic bias, leading to preference collapse where minority preferences are disregarded. This bias persists even with an oracle reward model, highlighting the limitations of current approaches in capturing diverse human preferences accurately.

Researchers have introduced a groundbreaking approach, Preference Matching RLHF, aimed at mitigating algorithmic bias and aligning LLMs with human preferences effectively. At the core of this innovative method lies the preference-matching regularizer, derived through solving an ordinary differential equation. This regularizer ensures the LLM strikes a balance between response diversification and reward maximization, enhancing the model’s ability to capture and reflect human preferences accurately. Preference Matching RLHF provides robust statistical guarantees and effectively eliminates the bias inherent in conventional RLHF approaches. The paper also details a conditional variant tailored for natural language generation tasks, improving the model’s capacity to generate responses that align closely with human preferences.
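
For orientation, a standard RLHF objective pairs reward maximization with a regularizer on the policy; a sketch of the familiar KL-regularized form is shown below. Preference Matching RLHF replaces the KL term with the preference-matching regularizer derived in the paper (its exact form is not reproduced here), so that the learned policy matches the human preference distribution rather than collapsing onto the majority preference.

% Standard KL-regularized RLHF objective, shown for context (not the paper's exact formulation):
\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)} \big[ r(x, y) \big] \;-\; \beta \, \mathrm{KL}\big( \pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
% Preference Matching RLHF swaps the KL term for a preference-matching regularizer
% R_{\mathrm{PM}}(\pi_{\theta}), obtained by solving an ordinary differential equation.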

The experimental validation of Preference Matching RLHF on the OPT-1.3B and Llama-2-7B models yielded compelling results, demonstrating significant improvements in aligning LLMs with human preferences. Performance metrics show a 29% to 41% improvement compared to standard RLHF methods, underscoring the approach’s capability to capture diverse human preferences and mitigate algorithmic bias. These results highlight the promising potential of Preference Matching RLHF in advancing AI research toward more ethical and effective decision-making processes.

In conclusion, Preference Matching RLHF offers a significant contribution by addressing algorithmic bias and enhancing the alignment of LLMs with human preferences. This advancement can improve decision-making processes, promote fairness, and mitigate biased outputs from LLMs, advancing the field of AI research.


CBRE and AWS perform natural language queries of structured data using …

This is a guest post co-written with CBRE.
CBRE is the world’s largest commercial real estate services and investment firm, with 130,000 professionals serving clients in more than 100 countries. Services range from financing and investment to property management.
CBRE is unlocking the potential of artificial intelligence (AI) to realize value across the entire commercial real estate lifecycle—from guiding investment decisions to managing buildings. The opportunity to unlock value with AI in the commercial real estate lifecycle starts with data at scale. CBRE’s data environment, with 39 billion data points from over 300 sources, combined with a suite of enterprise-grade technology, can support a range of AI solutions, from enabling individual productivity to driving broad-scale transformation. Although CBRE provides customers with curated, best-in-class dashboards, it wanted a solution that lets customers quickly run custom queries of their data using only natural language prompts.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon with a single API, along with a broad set of capabilities to build generative AI applications, simplifying development while maintaining privacy and security. With the comprehensive capabilities of Amazon Bedrock, you can experiment with a variety of FMs, privately customize them with your own data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and create managed agents that run complex business tasks—from booking travel and processing insurance claims to creating ad campaigns and managing inventory—all without the need to write code. Because Amazon Bedrock is serverless, you don’t have to manage infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.
In this post, we describe how CBRE partnered with AWS Prototyping to develop a custom query environment allowing natural language query (NLQ) prompts by using Amazon Bedrock, AWS Lambda, Amazon Relational Database Service (Amazon RDS), and Amazon OpenSearch Service. AWS Prototyping delivered a scalable prototype that solved CBRE’s business problem with a high accuracy rate (over 95%), supported reuse of embeddings for similar NLQs, and provided an API gateway for integration into CBRE’s dashboards.
Customer use case
Today, CBRE manages a standardized set of best-in-class client dashboards and reports, powered by various business intelligence (BI) tools, such as Tableau and Microsoft Power BI, and their proprietary UI, enabling CBRE clients to review core metrics and reports on occupancy, rent, energy usage, and more for various properties managed by CBRE.
The company’s Data & Analytics team regularly receives client requests for unique reports, metrics, or insights, which require custom development. CBRE wanted to enable clients to quickly query existing data using natural language prompts, all in a user-friendly environment. The prompts are managed through Lambda functions, which use OpenSearch Service and Anthropic Claude 2 on Amazon Bedrock to search the client’s database and generate an appropriate response to the client’s business question, including the answer in plain English, the reasoning, and the SQL code. A simple UI was developed that encapsulates the complexity and allows users to input questions and retrieve the results directly. This solution can be applied to other dashboards at a later stage.
Key use case and environment requirements
Generative AI is a powerful tool for analyzing and transforming vast datasets into usable summaries and text for end-users. Key requirements from CBRE included:

Natural language queries (common questions submitted in English) to be used as primary input
A scalable solution using a large language model (LLM) to generate and run SQL queries for business dashboards
Queries submitted to the environment return the following:

Result in plain English
Reasoning in plain English
SQL code generated

The ability to reuse existing embeddings of tables, columns, and SQL code if an input NLQ is similar to a previous query
Query response time of 3–5 seconds
Target 90% “good” responses to queries (based on customer User Acceptance Testing)
An API management layer for integration into CBRE’s dashboard
A straightforward UI and frontend for User Acceptance Testing (UAT)

Solution overview
CBRE and AWS Prototyping built an environment that allows a user to submit a query to structured data tables using natural language (in English), based on Anthropic Claude 2 on Amazon Bedrock with support for 100,000 maximum tokens. Embeddings were generated using Amazon Titan. The framework for connecting Anthropic Claude 2 and CBRE’s sample database was implemented using LangChain. AWS Prototyping developed an AWS Cloud Development Kit (AWS CDK) stack for deployment following AWS best practices.
The environment was developed over a period of multiple development sprints. CBRE, in parallel, completed UAT testing to confirm it performed as expected.
The following figure illustrates the core architecture for the NLQ capability.

The workflow for NLQ consists of the following steps:

0. A Lambda function writes schema JSON and table metadata CSV to an S3 bucket.
1. A user sends a question (NLQ) as a JSON event.
2. The Lambda wrapper function searches for similar questions in OpenSearch Service. If it finds any, it skips to Step 6. If not, it continues to Step 3.
3. The wrapper function reads the table metadata from the S3 bucket.
4. The wrapper function creates a dynamic prompt template and gets relevant tables using Amazon Bedrock and LangChain.
5. The wrapper function selects only relevant tables schema from the schema JSON in the S3 bucket.
6. The wrapper function creates a dynamic prompt template and generates a SQL query using Anthropic Claude 2.
7. The wrapper function runs the SQL query using psycopg2.
8. The wrapper function creates a dynamic prompt template to generate an English answer using Anthropic Claude 2.
9. The wrapper function uses Anthropic Claude 2 and OpenSearch Service to do the following:
   - It generates embeddings using Amazon Titan.
   - It stores the question and SQL query as a vector for reuse in the OpenSearch Service index.
10. The wrapper function consolidates the output and returns the JSON output.
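
Steps 2 and 9, the embedding-based reuse of previous questions, are the distinctive part of this workflow; the sketch below shows how they might be implemented with Amazon Titan embeddings and an OpenSearch Service k-NN index. The index name, field names, model ID, and similarity threshold are illustrative assumptions.

import json
import boto3
from opensearchpy import OpenSearch

bedrock_runtime = boto3.client("bedrock-runtime")

def embed_question(question: str) -> list:
    # Generate the question embedding with Amazon Titan (used in Step 9).
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": question}),
    )
    return json.loads(response["body"].read())["embedding"]

def find_similar_question(client: OpenSearch, question: str, threshold: float = 0.95):
    # Step 2: search the k-NN index for a previously answered, similar NLQ.
    query = {"size": 1, "query": {"knn": {"question_vector": {"vector": embed_question(question), "k": 1}}}}
    hits = client.search(index="nlq-embeddings", body=query)["hits"]["hits"]
    if hits and hits[0]["_score"] >= threshold:
        return hits[0]["_source"]          # stored question and SQL query, ready for reuse
    return None

def store_question(client: OpenSearch, question: str, sql_query: str):
    # Step 9: store the question, its embedding, and the generated SQL for future reuse.
    client.index(index="nlq-embeddings",
                 body={"question": question, "sql_code": sql_query,
                       "question_vector": embed_question(question)})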

Web UI and API management layer
AWS Prototyping built a web interface and API management layer to enable user testing during development and accelerate integration into CBRE’s existing BI capabilities. The following diagram illustrates the web interface and API management layer.

The workflow includes the following steps:

The user accesses the web portal hosted from their laptop through a web browser.
A low-latency Amazon CloudFront distribution is used to serve the static site, protected by an HTTPS certificate issued by AWS Certificate Manager (ACM).
An S3 bucket stores the website-related HTML, CSS, and JavaScript necessary to render the static site. The CloudFront distribution has its origin configured to this S3 bucket and remains in sync to serve the latest version of the site to users.
Amazon Cognito is used as a primary authentication and authorization provider with its user pools to allow user login, access to the API gateway, and access to the website bucket and response bucket.
An Amazon API Gateway endpoint with a REST API stage is secured by Amazon Cognito to only allow authenticated entities access to the Lambda function.
A Lambda function with business logic invokes the primary Lambda function.
An S3 bucket stores the generated response from the primary Lambda function; the frontend queries it periodically to display results on the web application.
A VPC endpoint is established to isolate the primary Lambda function.
VPC endpoints for both Lambda and Amazon S3 are imported and configured using the AWS CDK so the frontend stack can have adequate access permissions to reach resources within a VPC.
AWS Identity and Access Management (IAM) enforces the necessary permissions for the frontend application.
Amazon CloudWatch captures run logs across various resources, especially Lambda and API Gateway.

Technical approach
Amazon Bedrock is a fully managed service that makes FMs from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. With the Amazon Bedrock serverless experience, you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using AWS tools without having to manage any infrastructure.
Anthropic Claude 2 on Amazon Bedrock, a general-purpose LLM with 100,000 maximum token support, was selected to support the solution. LLMs demonstrate impressive abilities in automatically generating code. Relevant metadata can help guide the model’s output and in customizing SQL code generation for specific use cases. AWS offers tools like AWS Glue crawlers to automatically extract technical metadata from data sources. Business metadata can be constructed using services like Amazon DataZone. A lightweight approach was taken to quickly build the required technical and business catalogs using custom scripts. The metadata primed the model to generate tailored SQL code aligned with our database schema and business needs.
Input context files are needed for the Anthropic Claude 2 model to generate a SQL query according to the NLQ:

meta.csv – This is human-written metadata in a CSV file stored in an S3 bucket, which includes the names of the tables in the schema and a description for each table. The meta.csv file is sent as an input context to the model (refer to steps 3 and 4 in the end-to-end solution architecture diagram) to find the relevant tables according to the input NLQ. The S3 location of meta.csv is as follows:

s3://<dbSchemaGeneratorBucket>/<DB_Name>/table/meta.csv

schema.json – This JSON schema is generated by a Lambda function and stored in Amazon S3. Following steps 5 and 6 in the architecture, the relevant tables schema is sent as input context to the model to generate a SQL query according to the input NLQ. The S3 location of schema.json is as follows:

s3://<dbSchemaGeneratorBucket>/<DB_Name>/schema/schema.json

DB schema generator Lambda function
This function needs to be invoked manually. The following configurable environmental variables are managed by the AWS CDK during the deployment of this Lambda function:

dbSchemaGeneratorBucket – S3 bucket for schema.json
secretManagerKey – AWS Secrets Manager key for DB credentials
secretManagerRegion – AWS Region in which the Secrets Manager key exists

After a successful run, schema.json is written in an S3 bucket.
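
A minimal sketch of what this schema generator might do: read the database credentials from Secrets Manager, introspect the PostgreSQL catalog, and write schema.json to the configured bucket. The secret field names, catalog query, and output shape are assumptions for illustration; the real function also captures distinct values and relationships, as described later in this post.

import json
import os
import boto3
import psycopg2

def lambda_handler(event, context):
    # Resolve configuration from the environment variables listed above.
    bucket = os.environ["dbSchemaGeneratorBucket"]
    secret_key = os.environ["secretManagerKey"]
    region = os.environ["secretManagerRegion"]

    secrets = boto3.client("secretsmanager", region_name=region)
    creds = json.loads(secrets.get_secret_value(SecretId=secret_key)["SecretString"])

    conn = psycopg2.connect(host=creds["host"], dbname=creds["dbname"],
                            user=creds["username"], password=creds["password"])
    schema = {}
    with conn.cursor() as cur:
        # Introspect column names and data types for every table in the public schema.
        cur.execute("""SELECT table_name, column_name, data_type
                       FROM information_schema.columns
                       WHERE table_schema = 'public'""")
        for table, column, dtype in cur.fetchall():
            schema.setdefault(table, []).append({"column": column, "type": dtype})

    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=f"{creds['dbname']}/schema/schema.json",
        Body=json.dumps(schema),
    )
    return {"statusCode": 200}
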
Lambda wrapper function
This is the core component of the solution, which performs steps 2 through 10 as described in the end-to-end solution architecture. The following figure illustrates its code structure and workflow.

It runs the following scripts:

index.py – The Lambda handler (main) handles input/output and runs functions based on keys in the input context
langchain_bedrock.py – Get relevant tables, generate SQL queries, and convert SQL to English using Anthropic Claude 2
opensearch.py – Retrieve similar embeddings with existing index or generate new embeddings in OpenSearch Service
sql.py – Run SQL queries using psycopg2 and the opensearch.py module
boto3_bedrock.py – The Boto3 client for Amazon Bedrock
utils.py – The utilities function includes the OpenSearch Service client, Secrets Manager client, and formatting the final output response

The Lambda wrapper function has two layers for the dependencies:

LangChain layer – pip modules and dependencies of LangChain, boto3, and psycopg2
OpenSearch Service layer – OpenSearch Service Python client dependencies

AWS CDK manages the following configurable environmental variables during wrapper function deployment:

dbSchemaGeneratorBucket – S3 bucket for schema.json
opensearchDomainEndpoint – OpenSearch Service endpoint
opensearchMasterUserSecretKey – Secret key name for OpenSearch Service credentials
secretManagerKey – Secret key name for Amazon RDS credentials
secretManagerRegion – Region in which Secrets Manager key exists

The following code illustrates the JSON format for an input event:

{
  "useVectorDB": <0 or 1>,
  "input_queries": [
    <Question 1>,
    <Question 2>,
    <Question 3>
  ],
  "S3OutBucket": <Output response bucket>,
  "S3OutPrefix": <Output S3 Prefix>
}

It contains the following parameters:

input_queries is a list of one or more NLQ questions. If there is more than one NLQ, the additional questions are treated as follow-up questions to the first NLQ.
The useVectorDB key defines if OpenSearch Service is to be used as the vector database. If 0, it will run the end-to-end workflow without searching for similar embeddings in OpenSearch Service. If 1, it searches for similar embeddings. If similar embeddings are available, it directly runs the SQL code, otherwise it performs inference with the model. By default, useVectorDB is set to 1, and therefore this key is optional.
The S3OutBucket and S3OutPrefix keys are optional. These keys represent the S3 output location of the JSON response. These are primarily used by the frontend in asynchronous mode.

The following code illustrates the JSON format for an output response:

[
  statusCode: <200 or 400>,
  {
    "Question": <Input NLQ>,
    "sql_code": <SQL Query generated by Amazon Bedrock>,
    "SQL_Answer": <SQL Response>,
    "English_Answer": <English Answer>
  }
]

statusCode 200 indicates a successful run of the Lambda function; statusCode 400 indicates a failure with error.
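
The wrapper function can be exercised directly with an event in the input format above; a minimal sketch using boto3 follows, where the function name and the question are placeholders.

import json
import boto3

lambda_client = boto3.client("lambda")

event = {
    "useVectorDB": 1,
    "input_queries": ["What is the total occupancy across all properties this quarter?"],
}

# Synchronous invocation of the wrapper function (the name is a placeholder).
response = lambda_client.invoke(
    FunctionName="cbre-wrapper-lambda",
    Payload=json.dumps(event).encode("utf-8"),
)
result = json.loads(response["Payload"].read())
print(result)
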
Performance tuning approach
Performance tuning is an iterative approach across multiple layers. In this section, we discuss a performance tuning approach for this solution.
Input context for RAG
LLMs are mostly trained on general domain corpora, making them less effective on domain-specific tasks. In this scenario, when the expectation is to generate SQL queries based on a PostgreSQL DB schema, the schema becomes our input context to an LLM to generate a context-specific SQL query. In our solution, two input context files are critical for the best output, performance, and cost:

Get relevant tables – Because the entire PostgreSQL DB schema’s context length is high (over 16,000 tokens for our demo database), it’s necessary to include only the relevant tables in the schema rather than the entire DB schema with all tables to reduce the input context length of the model, which impacts not only the quality of the generated content, but also performance and cost. Because choosing the right tables according to the NLQ is a crucial step, it’s highly recommended to describe the tables in detail in meta.csv.
DB schema – schema.json is generated by the schema generator Lambda function, saved in Amazon S3, and passed as input context. It includes column names, data type, distinct values, relationships, and more. The output quality of the LLM-generated SQL query is highly dependent on the detailed schema. Input context length for each table’s schema for demo is between 2,000–4,000 tokens. A more detailed schema may provide fine results, but it’s also necessary to optimize the context length for performance and cost. As part of our solution, we already optimized the DB schema generator Lambda function to balance detailed schema and input context length. If required, you can further optimize the function depending on the complexity of the SQL query to be generated to include more details (for example, column metadata).

Prompt engineering and instruction tuning
Prompt engineering allows you to design the input to an LLM in order to generate an optimized output. A dynamic prompt template is created according to the input NLQ using LangChain (refer to steps 4, 6, and 8 in the end-to-end solution architecture). We combine the input NLQ (prompt) along with a set of instructions for the model to generate the content. It is necessary to optimize both the input NLQ and the instructions within the dynamic prompt template:

With prompt tuning, it’s vital to be descriptive of newer NLQs for the model to understand and generate a relevant SQL query.
For instruction tuning, the functions dyn_prompt_get_table, gen_sql_query, and sql_to_english in langchain_bedrock.py of the Lambda wrapper function have a set of purpose-specific instructions. These instructions are optimized for best performance and can be further optimized depending on the complexity of the SQL query to be generated (a simplified template sketch follows this list).
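
A simplified sketch of how a dynamic prompt template for SQL generation might be assembled with LangChain; the template wording, variable names, and example values are illustrative, not the production instructions used in langchain_bedrock.py.

from langchain.prompts import PromptTemplate

# Illustrative template: the real gen_sql_query instructions are more detailed.
sql_prompt = PromptTemplate(
    input_variables=["schema", "question"],
    template=(
        "\n\nHuman: You are a PostgreSQL expert. Using only the tables and columns "
        "in the schema below, write a single SQL query that answers the question.\n"
        "<schema>\n{schema}\n</schema>\n"
        "<question>{question}</question>\n"
        "Return only the SQL query.\n\nAssistant:"
    ),
)

prompt_text = sql_prompt.format(
    schema='{"properties": [{"column": "occupancy_rate", "type": "numeric"}]}',
    question="What is the average occupancy rate across all properties?",
)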

Inference parameters
Refer to Inference parameters for foundation models for more information on model inference parameters to influence the response generated by the model. We’ve used the following parameters specific to different inference steps to control maximum tokens to sample, randomness, probability distribution, and cutoff based on the sum of probabilities of the potential choices.
The following parameters are used to get relevant tables and to generate the SQL-to-English response:

inf_var_table = {
    "max_tokens_to_sample": 4096,
    "temperature": 1,
    "top_k": 250,
    "top_p": 0.999,
    "stop_sequences": ["\n\nHuman"],
}

The following parameters generate the SQL query:

inf_var_sql = {
    "max_tokens_to_sample": 4096,
    "temperature": 0.3,
    "top_k": 250,
    "top_p": 0.3,
    "stop_sequences": ["\n\nHuman"],
}
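
These parameters are passed to Anthropic Claude 2 through the Amazon Bedrock runtime API; a brief sketch of such an invocation follows, with a placeholder prompt.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

body = {
    "prompt": "\n\nHuman: Generate a SQL query for ...\n\nAssistant:",  # placeholder prompt text
    **inf_var_sql,  # the SQL-generation inference parameters defined above
}

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-v2",
    body=json.dumps(body),
)
completion = json.loads(response["body"].read())["completion"]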

Monitoring
You can monitor the solution components through Amazon CloudWatch logs and metrics. For example, the Lambda wrapper’s logs are available on the Log groups page of the CloudWatch console (cbre-wrapper-lambda-<account ID>-us-east-1), and provide step-by-step logs throughout the workflow. Similarly, Amazon Bedrock metrics are available by navigating to Metrics, Bedrock on the CloudWatch console. These metrics include input/output tokens count, invocation metrics, and errors.
AWS CDK stacks
We used the AWS CDK to provision all the resources mentioned. The AWS CDK defines the AWS Cloud infrastructure in a general-purpose programming language. Currently, the AWS CDK supports TypeScript, JavaScript, Python, Java, C#, and Go. We used TypeScript for the AWS CDK stacks and constructs.
AWS CodeCommit
The first AWS Cloud resource is an AWS CodeCommit repository. CodeCommit is a secure, highly scalable, fully managed source control service that hosts private Git repositories. The entire code base of this prototyping engagement resides in the CodeCommit repo provisioned by the AWS CDK in the us-east-1 Region.
Amazon Bedrock roles
A dedicated IAM policy is created to allow other AWS Cloud services to access Amazon Bedrock within the target AWS account. We used IAM to create a policy document and add the necessary roles. The roles and policy define the access constraints to Amazon Bedrock from other AWS services in the customer account.
It’s recommended to follow the Well-Architected Framework’s principle of least privilege for a production-ready security posture.
Amazon VPC
The prototype infrastructure was built within a virtual private cloud (VPC), which enables you to launch AWS resources in a logically isolated virtual network that you’ve defined.
Amazon Virtual Private Cloud (Amazon VPC) also isolates other resources, including publicly accessible AWS services like Secrets Manager, Amazon S3, and Lambda. A VPC endpoint enables you to privately connect to supported AWS services and VPC endpoint services powered by AWS PrivateLink. VPC endpoints create dynamic, scalable, and privately routable network connections between the VPC and supported AWS services. There are two types of VPC endpoints: interface endpoints and gateway endpoints. The following endpoints were created using the AWS CDK:

An Amazon S3 gateway endpoint to access several S3 buckets needed for this prototype
An Amazon VPC endpoint to allow private communication between AWS Cloud resources within the VPC and Amazon Bedrock with a policy to allow listing of FMs and to invoke an FM
An Amazon VPC endpoint to allow private communication between AWS Cloud resources within the VPC and the secrets stored in Secrets Manager only within the AWS account and the specific target Region of us-east-1

Provision OpenSearch Service clusters
OpenSearch Service makes it straightforward to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open source, distributed search and analytics suite derived from Elasticsearch. OpenSearch Service offers the latest versions of OpenSearch, support for 19 versions of Elasticsearch (1.5 to 7.10 versions), as well as visualization capabilities powered by OpenSearch Dashboards and Kibana (1.5 to 7.10 versions). OpenSearch Service currently has tens of thousands of active customers with hundreds of thousands of clusters under management, processing hundreds of trillions of requests per month.
The first step was setting up an OpenSearch Service security group that is restricted to only allow HTTPS connectivity to the index. Then we added this security group to the newly created VPC endpoints for Secrets Manager to allow OpenSearch Service to store and retrieve the credentials necessary to access the clusters. As a best practice, we don’t reuse or import a primary user; instead, we create a primary user with a unique user name and password automatically using the AWS CDK upon deployment. Because the OpenSearch Service security group is allowed on the Secrets Manager VPC endpoint, the primary user credentials can be stored directly in Secrets Manager while the AWS CDK stack is deployed.
The number of data nodes must be a multiple of the number of Availability Zones configured for the domain, so a list of three subnets from all the available VPC subnets is maintained.
Lambda wrapper function design and deployment
The Lambda wrapper function is the central Lambda function, which connects to every other AWS resource such as Amazon Bedrock, OpenSearch Service, Secrets Manager, and Amazon S3.
The first step is setting up two Lambda layers, one for LangChain and the other for OpenSearch Service dependencies. A Lambda layer is a .zip file archive that contains supplementary code or data. Layers usually contain library dependencies, a custom runtime, or configuration files.
Using the provided RDS database, the security groups were imported and linked to the Lambda wrapper function for Lambda to then reach out to the RDS instance. We used Amazon RDS Proxy to create a proxy to obscure the original domain details of the RDS instance. This RDS proxy interface was manually created from the AWS Management Console and not from the AWS CDK.
DB schema generator Lambda function
An S3 bucket is then created to store the RDS DB schema file with configurations to block public access with Amazon S3 managed encryptions, although customer managed key (CMK) backed encryption is recommended for enhanced security for production workloads.
The Lambda function was created with access to Amazon RDS using an RDS proxy endpoint. The credentials of the RDS instance are manually stored in Secrets Manager and access to the DB schema S3 bucket can be gained by adding an IAM policy to the Amazon S3 VPC endpoint (created earlier in the stack).
Website dashboard
The frontend provides an interface where users can log in and enter natural language prompts to get AI-generated responses. The various resources deployed through the website stack are as follows.
Imports
The website stack communicates with the infrastructure stack to deploy the resources within a VPC and trigger the Lambda wrapper function. The VPC and Lambda function objects were imported into this stack. This is the only link between the two stacks so they remain loosely coupled.
Auth stack
The auth stack is responsible for setting up Amazon Cognito user pools, identity pools, and the authenticated and unauthenticated IAM roles. User sign-in settings and password policies were set up with email as the primary authentication mechanism to help prevent new users from signing up from the web application itself. New users must be manually created from the console.
Bucket stack
The bucket stack is responsible for setting up the S3 bucket to store the response from the Lambda wrapper function. The Lambda wrapper function is smart enough to understand if it was invoked directly from the console or the website. The frontend code will reach out to this response bucket to pull the response for the respective natural language prompt. The S3 bucket endpoint is configured with an allow list to limit the I/O traffic of this bucket within the VPC only.
API stack
The API stack is responsible for setting up an API Gateway endpoint that is protected by Amazon Cognito to allow authenticated and authorized user entities. Also, a REST API stage was added, which then invokes the website Lambda function.
The website Lambda function is allowed to invoke the Lambda wrapper function. Invoking a Lambda function within a VPC by a non-VPC Lambda function is allowed but is not recommended for a production system.
The API Gateway endpoint is protected by an AWS WAF configuration. AWS WAF helps you protect against common web exploits and bots that can affect availability, compromise security, or consume excessive resources.
Hosting stack
The hosting stack uses CloudFront to serve the frontend website code (HTML, CSS, and JavaScript) stored in a dedicated S3 bucket. CloudFront is a content delivery network (CDN) service built for high performance, security, and developer convenience. When you serve static content that is hosted on AWS, the recommended approach is to use an S3 bucket as the origin and use CloudFront to distribute the content. There are two primary benefits of this solution. The first is the convenience of caching static content at edge locations. The second is that you can define web access control lists (ACLs) for the CloudFront distribution, which helps you secure requests to the content with minimal configuration and administrative overhead.
Users can visit the CloudFront distribution endpoint from their preferred web browser to access the login screen.
Home page
The home page has three sections to it. The first section is the NLQ prompt section, where you can add up to three user prompts and delete prompts as needed.

The prompts are then translated into a prompt input that will be sent to the Lambda wrapper function. This section is non-editable and only for reference. You can opt to use the OpenSearch Service vector DB store to get preprocessed queries for faster responses. Only prompts that were processed earlier and stored in the vector DB will return a valid response. For newer queries, we recommend leaving the switch in its default off position.

If you choose Get Response, you may see a progress bar, which waits approximately 100 seconds for the Lambda wrapper function to finish. If the response times out for reasons such as unexpected service delays with Amazon Bedrock or Lambda, you will see a timeout message and the prompts are reset.

When the Lambda wrapper function is complete, it outputs the AI generated response.

Conclusion
CBRE has taken pragmatic steps to adopt transformative AI technologies that enhance their business offerings and extend their leadership in the market. CBRE and the AWS Prototyping team developed an NLQ environment using Amazon Bedrock, Lambda, Amazon RDS, and OpenSearch Service, demonstrating a high accuracy rate (more than 95%), reuse of embeddings, and integration through an API gateway.
This project is a great starting point for organizations looking to break ground with generative AI in data analytics. CBRE stands poised and ready to continue using their intimate knowledge of their customers and the real estate industry to build the real estate solutions of tomorrow.
For more resources, refer to the following:

AWS Generative AI Innovation Center
Inside the AWS Prototyping and Innovation Lab
Guidance for Natural Language Queries of Relational Databases on AWS

About the Authors

Surya Rebbapragada is the VP of Digital & Technology at CBRE
Edy Setiawan is the Director of Digital & Technology at CBRE
Naveena Allampalli is a Sr. Principal Enterprise Architect at CBRE
Chakra Nagarajan is a Sr. Principal ML Prototyping Solutions Architect at AWS
Tamil Jayakumar is a Sr. Prototyping Engineer at AWS
Shane Madigan is a Sr. Engagement Manager at AWS
Maran Chandrasekaran is a Sr. Solutions Architect at AWS
VB Bakre is an Account Manager at AWS

Dynamic video content moderation and policy evaluation using AWS gener …

Organizations across media and entertainment, advertising, social media, education, and other sectors require efficient solutions to extract information from videos and apply flexible evaluations based on their policies. Generative artificial intelligence (AI) has unlocked fresh opportunities for these use cases. In this post, we introduce the Media Analysis and Policy Evaluation solution, which uses AWS AI and generative AI services to provide a framework to streamline video extraction and evaluation processes.
Popular use cases
Advertising tech companies own video content like ad creatives. When it comes to video analysis, priorities include brand safety, regulatory compliance, and engaging content. This solution, powered by AWS AI and generative AI services, meets these needs. Advanced content moderation makes sure ads appear alongside safe, compliant content, building trust with consumers. You can use the solution to evaluate videos against content compliance policies. You can also use it to create compelling headlines and summaries, boosting user engagement and ad performance.
Educational tech companies manage large inventories of training videos. An efficient way to analyze videos will help them evaluate content against industry policies, index videos for efficient search, and perform dynamic detection and redaction tasks, such as blurring student faces in a Zoom recording.
The solution is available on the GitHub repository and can be deployed to your AWS account using an AWS Cloud Development Kit (AWS CDK) package.
Solution overview

Media extraction – After a video is uploaded, the app starts preprocessing by extracting image frames from the video. Each frame is analyzed using Amazon Rekognition and Amazon Bedrock for metadata extraction. In parallel, the system extracts the audio transcription from the uploaded content using Amazon Transcribe.
Policy evaluation – Using the metadata extracted from the video, the system conducts an LLM evaluation. This allows you to take advantage of the flexibility of LLMs to evaluate the video against dynamic policies.

The following diagram illustrates the solution workflow and architecture.

The solution adopts microservice design principles, with loosely coupled components that can be deployed together to serve the video analysis and policy evaluation workflow, or independently to integrate into existing pipelines. The following diagram illustrates the microservice architecture.

The microservice workflow consists of the following steps:

Users access the frontend static website via Amazon CloudFront distribution. The static content is hosted on Amazon Simple Storage Service (Amazon S3).
Users log in to the frontend web application and are authenticated by an Amazon Cognito user pool.
Users upload videos to Amazon S3 directly from their browser using multi-part pre-signed Amazon S3 URLs.
The frontend UI interacts with the extract microservice through a RESTful interface provided by Amazon API Gateway. This interface offers CRUD (create, read, update, delete) features for video task extraction management.
An AWS Step Functions state machine oversees the analysis process. It transcribes audio using Amazon Transcribe, samples image frames from video using moviepy, and analyzes each image using Anthropic Claude Sonnet image summarization. It also generates text embedding and multimodal embedding on the frame level using Amazon Titan models.
An Amazon OpenSearch Service cluster stores the extracted video metadata and facilitates users’ search and discovery needs. The UI constructs evaluation prompts and sends them to Amazon Bedrock LLMs, retrieving evaluation results synchronously.
Using the solution UI, users select existing template prompts, customize them, and start the policy evaluation using Amazon Bedrock. The solution runs the evaluation workflow and displays the results back to the user.

In the following sections, we discuss the key components and microservices of the solution in more detail.
Website UI
The solution features a website that lets users browse videos and manage the uploading process through a user-friendly interface. It offers details of the extracted video information and includes a lightweight analytics UI for dynamic LLM analysis. The following screenshots show some examples.

Extract information from videos
The solution includes a backend extraction service to manage video metadata extraction asynchronously. This involves extracting information from both the visual and audio components, including identifying objects, scenes, text, and human faces. The audio component is particularly important for videos with active narratives and conversations, because it often contains valuable information.
Building a robust solution to extract information from videos poses challenges from both machine learning (ML) and engineering perspectives. From the ML standpoint, our goal is to achieve generic extraction of information to serve as factual data for downstream analysis. On the engineering side, managing video sampling with concurrency, providing high availability and flexible configuration options, and building an extendable architecture to support additional ML model plugins all require intensive effort.
The extraction service uses Amazon Transcribe to convert the audio portion of the video into text in subtitle formats. For visual extraction, there are a few major techniques involved:

Frame sampling – The classic method for analyzing the visual aspect of a video uses a sampling technique. This involves capturing screenshots at specific intervals and then applying ML models to extract information from each image frame. Our solution uses sampling with the following considerations:

The solution supports a configurable interval for the fixed sampling rate.
It also offers an advanced smart sampling option, which uses the Amazon Titan Multimodal Embeddings model to conduct similarity search against frames sampled from the same video. This process identifies similar images and discards redundant ones to optimize performance and cost.

Extract information from image frames – The solution will iterate through images sampled from a video and process them concurrently. For each image, it will apply the following ML features to extract information (a brief sketch follows this list):

Recognize celebrity faces using the Amazon Rekognition celebrity API.
Detect generic objects and labels using the Amazon Rekognition label detection API.
Detect text using the Amazon Rekognition text detection API.
Flag inappropriate content using the Amazon Rekognition moderation API.
Use the Anthropic Claude V3 Haiku model to generate a summary of the image frame.
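
A condensed sketch of what the per-frame extraction calls above might look like; the Rekognition and Bedrock APIs shown are real, but the orchestration, prompt text, and token limit are illustrative assumptions.

import base64
import json
import boto3

rekognition = boto3.client("rekognition")
bedrock_runtime = boto3.client("bedrock-runtime")

def extract_frame_metadata(frame_bytes):
    """Run the per-frame ML extractions on one sampled image frame (JPEG/PNG bytes)."""
    image = {"Bytes": frame_bytes}
    metadata = {
        "celebrities": rekognition.recognize_celebrities(Image=image)["CelebrityFaces"],
        "labels": rekognition.detect_labels(Image=image)["Labels"],
        "text": rekognition.detect_text(Image=image)["TextDetections"],
        "moderation": rekognition.detect_moderation_labels(Image=image)["ModerationLabels"],
    }

    # Frame-level caption via Claude 3 Haiku on Amazon Bedrock (Messages API).
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 300,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg",
                                             "data": base64.b64encode(frame_bytes).decode()}},
                {"type": "text", "text": "Describe this video frame in one or two sentences."},
            ],
        }],
    }
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps(body),
    )
    metadata["caption"] = json.loads(response["body"].read())["content"][0]["text"]
    return metadata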

The following diagram illustrates how the extraction service is implemented.

The extraction service uses Amazon Simple Queue Service (Amazon SQS) and Step Functions to manage concurrent video processing, allowing configurable settings. You can specify how many videos can be processed in parallel and how many frames for each video can be processed concurrently, based on your account’s service quota limits and performance requirements.
Search the videos
Efficiently identifying videos within your inventory is a priority, and an effective search capability is critical for video analysis tasks. Traditional video search methods rely on full-text keyword searches. With the introduction of text embedding and multimodal embedding, new search methods based on semantics and images have emerged.
The solution offers search functionality via the extraction service, available as a UI feature. It generates vector embeddings at the image frame level as part of the extraction process to serve video search. You can search videos and their underlying frames either through the built-in web UI or via the RESTful API interface directly. There are three search options you can choose from:

Full text search – Powered by OpenSearch Service, it uses a search index generated by text analyzers that is ideal for keyword search.
Semantic search – Powered by the Amazon Titan Text Embeddings model, generated based on transcription and image metadata extracted at the frame level.
Image search – Powered by the Amazon Titan Multimodal Embeddings model, generated using the same text used for the text embedding along with the image frame. This option is suitable for image search, allowing you to provide an image and find similar frames in videos (a brief query sketch follows this list).
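
A brief sketch of how the image search option might query the OpenSearch Service index; the index name, vector field name, and embedding helper are assumptions for illustration.

import base64
import json
import boto3
from opensearchpy import OpenSearch

bedrock_runtime = boto3.client("bedrock-runtime")

def embed_image(image_bytes, text=""):
    # Multimodal embedding for the query image via Amazon Titan.
    body = {"inputImage": base64.b64encode(image_bytes).decode()}
    if text:
        body["inputText"] = text
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["embedding"]

def search_similar_frames(client: OpenSearch, image_bytes, k=5):
    # Index and field names ("video-frames", "frame_mm_vector") are hypothetical.
    query = {"size": k,
             "query": {"knn": {"frame_mm_vector": {"vector": embed_image(image_bytes), "k": k}}}}
    return client.search(index="video-frames", body=query)["hits"]["hits"]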

The following screenshot of the UI showcases the use of multimodal embedding to search for videos containing the AWS logo. The web UI displays three videos with frames that have a high similarity score when compared with the provided AWS logo image. You can also find the other two text search options on the dropdown menu, giving you the flexibility to switch among search options.

Analyze the videos
After gathering rich insights from the videos, you can analyze the data. The solution features a lightweight UI, implemented as a static React web application, powered by a backend microservice called the evaluation service. This service acts as a proxy atop the Amazon Bedrock LLMs to provide real-time evaluation. You can use it as a sandbox feature to test out LLM prompts for dynamic video analysis. The web UI contains a few sample prompt templates to show how you can analyze video for different use cases, including the following:

Content moderation – Flag unsafe scenes, text, or speech that violate your trust and safety policy
Video summarization – Summarize the video into a concise description based on its audio or visual content cues
IAB classification – Classify the video content into advertising IAB categories for better organization and understanding

You can also choose from a collection of LLMs offered by Amazon Bedrock to test the evaluation results and find the most suitable one for your workload. LLMs can use the extraction data and perform analysis based on your instructions, making them flexible and extendable analytics tools that can support various use cases. The following are some examples of the prompt templates for video analysis. The placeholders wrapped in ## will be replaced by the corresponding video-extracted data at runtime.
The first example shows how to moderate a video based on the audio transcription and the object and moderation labels detected by Amazon Rekognition. This sample includes a basic inline policy. You can extend this section to add more rules. You can integrate longer trust and safety policy documents and runbooks in a Retrieval Augmented Generation (RAG) pattern using Knowledge Bases for Amazon Bedrock.

You are a specialist responsible for reviewing content to ensure compliance with company policies.
Your task involves evaluating videos.
The transcription of the video is within the <transcription> tag.
The detected label from the video is located in the <label> tag, and the moderation detection label is within the <moderation> tag.
You can find the company policy in the <policy> tag.

<transcription>##TRANSCRIPTION##</transcription>
<label>##LABEL##</label>
<moderation>##MODERATION##</moderation>
<policy>The content must not contain nudity, violence, suggestive content, hate symbols, hate speech, or similar material. Anything involving alcohol or smoking violates the policy.</policy>

Does the video violate the trust and safety policy?
Please consider and provide your analysis in the <analysis> tag, keeping the analysis within 100 words. Respond in the <answer> tag with either 'Y' or 'N'.
'Y' indicates that the video violates the trust and safety policy, while 'N' means the content is acceptable.

Summarizing videos into shorter descriptions is another popular use case. With the flexibility of the solution, you can instruct the LLMs to summarize the video based on selected extracted metadata. The following sample demonstrates a prompt that summarizes the video based on audio transcription and image frame captions:

Summarize the video using image frame descriptions and transcription subtitles.

The image descriptions and timestamps (in seconds) are provided here: ##IMAGECAPTION##.
The transcription subtitles are provided here: ##SUBTITLE##.

Classifying videos into IAB categories used to be challenging before generative AI became popular. It typically involved custom-trained text and image classification ML models, which often faced accuracy challenges. The following sample prompt uses the Amazon Bedrock Anthropic Claude V3 Sonnet model, which has built-in knowledge of the IAB taxonomy. Therefore, you don’t even need to include the taxonomy definitions as part of the LLM prompt.

Classify the video into IAB categories.

Transcription: ##TRANSCRIPTION##
Label: ##LABEL##
Text extracted from image frames:##TEXT##
Moderation categories: ##MODERATION##
Celebrities: ##CELEBRITY##
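
A short sketch of how the evaluation service might fill in these placeholders and call a Claude 3 model on Amazon Bedrock; the substitution helper, token limit, and model choice are illustrative assumptions.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def evaluate_video(prompt_template: str, extracted: dict) -> str:
    """Replace ##PLACEHOLDER## tokens with extracted video metadata and run the LLM evaluation."""
    prompt = prompt_template
    for key, value in extracted.items():          # e.g. {"TRANSCRIPTION": "...", "LABEL": "..."}
        prompt = prompt.replace(f"##{key}##", str(value))

    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
    }
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]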

Summary
Video analysis presents challenges that span technical difficulties in both ML and engineering. This solution provides a user-friendly UI to streamline the video analysis and policy evaluation processes. The backend components can serve as building blocks for integration into your existing analysis workflow, allowing you to focus on analytics tasks with greater business impact.
You can deploy the solution into your AWS account using the AWS CDK package available on the GitHub repo. For deployment details, refer to the step-by-step instructions.

About the Authors

Lana Zhang is a Senior Solutions Architect at AWS World Wide Specialist Organization AI Services team, specializing in AI and generative AI with a focus on use cases including content moderation and media analysis. With her expertise, she is dedicated to promoting AWS AI and generative AI solutions, demonstrating how generative AI can transform classic use cases with advanced business value. She assists customers in transforming their business solutions across diverse industries, including social media, gaming, e-commerce, media, advertising, and marketing.
Negin Rouhanizadeh is a Solutions Architect at AWS focusing on AI/ML in Advertising and Marketing. Beyond crafting solutions for her customers, Negin enjoys painting, coding, spending time with family and her furry boys, Simba and Huchi.

Vitech uses Amazon Bedrock to revolutionize information access with AI …

This post is co-written with Murthy Palla and Madesh Subbanna from Vitech.
Vitech is a global provider of cloud-centered benefit and investment administration software. Vitech helps group insurance, pension fund administration, and investment clients expand their offerings and capabilities, streamline their operations, and gain analytical insights. To serve their customers, Vitech maintains a repository of information that includes product documentation (user guides, standard operating procedures, runbooks), which is currently scattered across multiple internal platforms (for example, Confluence sites and SharePoint folders). The lack of a centralized and easily navigable knowledge system led to several issues, including:

Low productivity due to the lack of an efficient retrieval system, which often leads to information overload
Inconsistent information access because there was no singular, unified source of truth

To address these challenges, Vitech used generative artificial intelligence (AI) with Amazon Bedrock to build VitechIQ, an AI-powered chatbot for Vitech employees to access an internal repository of documentation.
For customers that are looking to build an AI-driven chatbot that interacts with an internal repository of documents, AWS offers Knowledge Bases for Amazon Bedrock, a fully managed capability that can implement the entire Retrieval Augmented Generation (RAG) workflow, from ingestion to retrieval and prompt augmentation, without having to build custom integrations to data sources or manage data flows. Alternatively, open-source technologies like LangChain can be used to orchestrate the end-to-end flow.
In this blog, we walk through the architectural components, the evaluation criteria for the components selected by Vitech, and the process flow of user interaction within VitechIQ.
Technical components and evaluation criteria
In this section, we discuss the key technical components and evaluation criteria for the components involved in building the solution.
Hosting large language models
Vitech explored the option of hosting large language models (LLMs) using Amazon SageMaker. Vitech needed a fully managed and secure experience to host LLMs and eliminate the undifferentiated heavy lifting associated with hosting third-party (3P) models. Amazon Bedrock is a fully managed service that makes FMs from leading AI startups and Amazon available via an API, so one can choose from a wide range of FMs to find the model that is best suited for their use case. With Bedrock’s serverless experience, one can get started quickly, privately customize FMs with their own data, and easily integrate and deploy them into applications using AWS tools without having to manage any infrastructure. Vitech therefore selected Amazon Bedrock to host LLMs and integrate seamlessly with their existing infrastructure.
Retrieval Augmented Generation vs. fine tuning
Traditional LLMs don’t have an understanding of Vitech’s processes and flow, making it imperative to augment the power of LLMs with Vitech’s knowledge base. Fine-tuning would allow Vitech to train the model on a small sample set, thereby allowing the model to provide responses using Vitech’s vocabulary. However, for this use case, the complexity and costs associated with fine-tuning were not warranted. Instead, Vitech opted for Retrieval Augmented Generation (RAG), in which the LLM uses vector embeddings to perform a semantic search and provide a more relevant answer to users when interacting with the chatbot.
Data store
Vitech’s product documentation is largely available in .pdf format, making it the standard format used by VitechIQ. In cases where a document is available in other formats, users preprocess the data and convert it into .pdf format. These documents are uploaded and stored in Amazon Simple Storage Service (Amazon S3), making it the centralized data store.
Data chunking
Chunking is the process of breaking down large text documents into smaller, more manageable segments (such as paragraphs or sections). Vitech chose a recursive chunking method that involves dynamically dividing text based on its inherent structure like chapters and sections, offering a more natural division of text. A chunk size of 1,000 tokens with a 200-token overlap provided the most optimal results.
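As a rough illustration, the following is a minimal sketch using LangChain’s RecursiveCharacterTextSplitter with the sizes above. Note that the splitter measures length in characters by default, so reproducing the token-based sizes described here would require a tokenizer-aware length function, which is omitted for brevity.

# Minimal sketch of recursive chunking; `documents` is assumed to be a list of
# already-loaded LangChain Document objects (for example, from a PDF loader).
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # target chunk size
    chunk_overlap=200,    # overlap between adjacent chunks to preserve context
    separators=["\n\n", "\n", " ", ""],  # fall back from sections to words
)

chunks = text_splitter.split_documents(documents)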
Large language models
VitechIQ uses two key LLM models to address the business challenge of providing efficient and accurate information retrieval:

Vector embedding – This process converts the documents into a numerical representation, making sure semantic relationships are captured (similar documents are represented numerically closer to each other), allowing for an efficient search. Vitech explored multiple vector embedding models and selected the Amazon Titan Embeddings text model offered by Amazon Bedrock.
Question answering – The core functionality of VitechIQ is to provide concise and trustworthy answers to user queries based on the retrieved context. Vitech chose the Anthropic Claude model, available from Amazon Bedrock, for this purpose. The high token limit of 200,000 (approximately 150,000 words) allows the model to process extensive context and maintain awareness of the ongoing conversation, enabling it to provide more accurate and relevant responses. Additionally, VitechIQ includes metadata from the vector database (for example, document URLs) in the model’s output, providing users with source attribution and enhancing trust in the generated answers.

Prompt engineering
Prompt engineering is crucial for the knowledge retrieval system. The prompt guides the LLM on how to respond and interact based on the user question. Prompts also help ground the model. As part of prompt engineering, VitechIQ configured the prompt with a set of instructions for the LLM to keep the conversations relevant and eliminate discriminatory remarks, and guided it on how to respond to open-ended conversations. The following is an example of a prompt used in VitechIQ:

"""You are Jarvis, a chatbot designed to assist and engage in conversations with humans.
Your primary functions are:
1. Friendly Greeting: Respond with a warm greeting when users initiate a conversation by
greeting you.
2. Open-Ended Conversations: Acknowledge and inquire when users provide random context or
open-ended statements to better understand their intent.
3. Honesty: If you don't know the answer to a user's question, simply state that you don't know,
and avoid making up answers.
Your name is Jarvis, and you should maintain a friendly and helpful tone throughout the
conversation.
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
{chat_history}
Human: {human_input}
Chatbot:"""

Vector store
Vitech explored vector stores like OpenSearch and Redis. However, Vitech has expertise in handling and managing Amazon Aurora PostgreSQL-Compatible Edition databases for their enterprise applications. Amazon Aurora PostgreSQL provides support for the open source pgvector extension to process vector embeddings, and Amazon Aurora Optimized Reads offers a cost-effective and performant option. These factors led to the selection of Amazon Aurora PostgreSQL as the store for vector embeddings.
Processing framework
LangChain offered seamless machine learning (ML) model integration, allowing Vitech to build custom automated AI components and be model agnostic. LangChain’s out-of-the-box chain and agents libraries have empowered Vitech to adopt features like prompt templates and memory management, accelerating the overall development process. Vitech used Python virtual environments to freeze a stable version of the LangChain dependencies and seamlessly move it from development to production environments. With support from the LangChain ConversationBufferMemory library, VitechIQ stores conversation information in a stateful session to maintain relevance in the conversation. The state is deleted after a configurable idle timeout elapses.
Multiple LangChain libraries were used across VitechIQ; the following are a few notable libraries and their usage:

langchain.llms (Bedrock) – Interact with LLMs provided by Amazon Bedrock
langchain.embeddings (BedrockEmbeddings) – Create embeddings
langchain.chains.question_answering (load_qa_chain) – Perform Q&A
langchain.prompts (PromptTemplate) – Create prompt templates
langchain.vectorstores.pgvector (PGVector) – Create vector embeddings and perform semantic search
langchain.text_splitter (RecursiveCharacterTextSplitter) – Split documents into chunks
langchain.memory (ConversationBufferMemory) – Manage conversational memory

They used the following versions:

langchain==0.0.306
langchain-experimental==0.0.24
langsmith==0.0.43
pgvector==0.2.3
streamlit==1.28.0
streamlit-extras==0.3.4
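
To show how these pieces might fit together, the following is a minimal sketch of a RAG question-answering flow using the libraries and versions above. The connection string, collection name, model IDs, prompt text, and example question are placeholders rather than values from the VitechIQ deployment, and exact keyword arguments can vary across LangChain releases.

# Minimal sketch only: the connection string, collection name, prompt, and
# question below are placeholders, not values from the VitechIQ deployment.
import boto3
from langchain.llms import Bedrock
from langchain.embeddings import BedrockEmbeddings
from langchain.vectorstores.pgvector import PGVector
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory

bedrock_runtime = boto3.client("bedrock-runtime")

embeddings = BedrockEmbeddings(client=bedrock_runtime)  # Amazon Titan Embeddings text model
llm = Bedrock(client=bedrock_runtime, model_id="anthropic.claude-v2")  # Anthropic Claude on Bedrock

# Aurora PostgreSQL with the pgvector extension as the vector store
CONNECTION_STRING = "postgresql+psycopg2://user:password@aurora-endpoint:5432/postgres"  # placeholder
vector_store = PGVector(
    connection_string=CONNECTION_STRING,
    collection_name="vitechiq_docs",  # placeholder collection name
    embedding_function=embeddings,
)

# Abbreviated stand-in for the Jarvis prompt shown earlier
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end.\n"
    "{context}\n{chat_history}\nHuman: {human_input}\nChatbot:"
)
prompt = PromptTemplate(
    input_variables=["context", "chat_history", "human_input"],
    template=PROMPT_TEMPLATE,
)
memory = ConversationBufferMemory(memory_key="chat_history", input_key="human_input")

# Stuff the retrieved chunks into the prompt and send it to the LLM
chain = load_qa_chain(llm, chain_type="stuff", prompt=prompt, memory=memory)

question = "How do I configure a new benefit plan?"  # example user question
docs = vector_store.similarity_search(question, k=10)
result = chain({"input_documents": docs, "human_input": question})
print(result["output_text"])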

User interface
The VitechIQ user interface is built using Streamlit. Streamlit offers a user-friendly experience to quickly build interactive and easily deployable solutions using the Python library (used widely at Vitech). The Streamlit app is hosted on an Amazon Elastic Compute Cloud (Amazon EC2) instance fronted with Elastic Load Balancing (ELB), allowing Vitech to scale as traffic increases.
Optimizing search results
To reduce hallucination and optimize the token size and search results, VitechIQ performs semantic search using the value k in the search function (similarity_search_with_score). VitechIQ filters embedding responses to the top 10 results and then limits the dataset to records with a score less than 0.48 (indicating close correlation), thereby identifying the most relevant responses and eliminating noise.
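A minimal sketch of that filtering step follows; vector_store is assumed to be the PGVector store holding the document embeddings, and lower scores indicate closer matches for distance-based scoring.

# Hedged sketch of the retrieval filtering described above.
user_question = "How do I configure a new benefit plan?"  # example query
results = vector_store.similarity_search_with_score(user_question, k=10)  # top 10 matches with scores
relevant_docs = [doc for doc, score in results if score < 0.48]           # keep only closely related chunks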
Amazon Bedrock VPC interface endpoints
Vitech wanted to make sure all communication is kept private and doesn’t traverse the public internet. VitechIQ uses an Amazon Bedrock VPC interface endpoint to make sure the connectivity is secured end to end.
Monitoring
VitechIQ application logs are sent to Amazon CloudWatch. This helps Vitech management get insights on current usage and trends on topics. Additionally, Vitech uses Amazon Bedrock runtime metrics to measure latency, performance, and number of tokens.

“We noted that the combination of Amazon Bedrock and Claude not only matched, but in some cases surpassed, in performance and quality and it conforms to Vitech security standards compared to what we saw with a competing generative AI solution.”
– Madesh Subbanna, VP Databases & Analytics at Vitech

Solution overview
Let’s look at how all these components come together to illustrate the end-user experience. The following diagram shows the solution architecture.

The VitechIQ user experience can be split into two process flows: document repository, and knowledge retrieval.
Document repository flow
This step involves the curation and collection of documents that will comprise the knowledge base. Internally, Vitech stakeholders conduct due diligence to review and approve a document before it is uploaded to VitechIQ. For each document uploaded to VitechIQ, the user provides an internal reference link (Confluence or SharePoint) to make sure any future revisions can be tracked and the most up-to-date information is available on VitechIQ. As new document versions become available, VitechIQ updates the embeddings so the recommendations remain relevant and up to date.
Vitech stakeholders conduct a manual review on a weekly basis of the documents and revisions that are being requested to be uploaded. As a result, the documents have a 1-week turnaround to be available in VitechIQ for user consumption.
The following screenshot illustrates the VitechIQ interface to upload documents.

The upload procedure includes the following steps:

The domain stakeholder uploads the documents to VitechIQ.
LangChain uses recursive chunking to parse the document and send it to the Amazon Titan Embeddings model.
The Amazon Titan Embeddings model generates vector embeddings.
These vector embeddings are stored in an Aurora PostgreSQL database.
The user receives notification of the success (or failure) of the upload.

Knowledge retrieval flow
In this flow, the user interacts with the VitechIQ chatbot, which provides a summarized and accurate response to their question. VitechIQ also provides source document attribution in response to the user question (it uses the URL of the document uploaded in the previous process flow).
The following screenshot illustrates a user interaction with VitechIQ.

The process includes the following steps:

The user interacts with VitechIQ by asking a question in natural language.
The question is sent through the Amazon Bedrock VPC interface endpoint to the Amazon Titan Embeddings model.
The Amazon Titan Embeddings model converts the question and generates vector embeddings.
The vector embeddings are sent to Amazon Aurora PostgreSQL to perform a semantic search on the knowledge base documents.
Using RAG, the prompt is enhanced with context and relevant documents, and then sent to Amazon Bedrock (Anthropic Claude) for summarization.
Amazon Bedrock generates a summarized response according to the prompt instructions and sends the response back to the user.

As additional questions are asked by the user, the context is passed back into the prompt, keeping it aware of the ongoing conversation.
Benefits offered by VitechIQ
By using the power of generative AI, VitechIQ has successfully addressed the critical challenges of information fragmentation and inaccessibility. The following are the key achievements and innovative impact of VitechIQ:

Centralized knowledge hub – This helps streamline the process of information retrieval, resulting in over 50% reduction in inquiries to product teams.
Enhanced productivity and efficiency – Users are provided quick and accurate access to information. VitechIQ is used on average by 50 users daily, which amounts to approximately 2,000 queries on a monthly basis.
Continuous evolution and learning – Vitech is able to expand its knowledge base on new domains. Vitech’s API documentation (spanning 35,000 documents with a document size up to 3 GB) was uploaded to VitechIQ, enabling development teams to seamlessly search for documentation.

Conclusion
VitechIQ stands as a testament to the company’s commitment to harnessing the power of AI for operational excellence and the capabilities offered by Amazon Bedrock. As Vitech iterates on the solution, a few of the top-priority roadmap items include using the LangChain Expression Language (LCEL), modernizing the Streamlit application to host on Docker, and automating the document upload process. Additionally, Vitech is exploring opportunities to build similar capabilities for their external customers. The success of VitechIQ is a stepping stone for further technological advancements, setting a new standard for how technology can augment human capabilities in the corporate world. Vitech continues to innovate by partnering with AWS on programs like the Generative AI Innovation Center and identifying additional customer-facing implementations. To learn more, visit Amazon Bedrock.

About the Authors
Samit Kumbhani is an AWS Senior Solutions Architect in the New York City area with over 18 years of experience. He currently collaborates with Independent Software Vendors (ISVs) to build highly scalable, innovative, and secure cloud solutions. Outside of work, Samit enjoys playing cricket, traveling, and biking.
Murthy Palla is a Technical Manager at Vitech with 9 years of extensive experience in data architecture and engineering. Holding certifications as an AWS Solutions Architect and AI/ML Engineer from the University of Texas at Austin, he specializes in advanced Python, databases like Oracle and PostgreSQL, and Snowflake. In his current role, Murthy leads R&D initiatives to develop innovative data lake and warehousing solutions. His expertise extends to applying generative AI in business applications, driving technological advancement and operational excellence within Vitech.
Madesh Subbanna is the Vice President at Vitech, where he leads the database team and has been a foundational figure since the early stages of the company. With two decades of technical and leadership experience, he has significantly contributed to the evolution of Vitech’s architecture, performance, and product design. Madesh has been instrumental in integrating advanced database solutions, DataInsight, AI, and ML technologies into the V3locity platform. His role transcends technical contributions, encompassing project management and strategic planning with senior management to ensure seamless project delivery and innovation. Madesh’s career at Vitech, marked by a series of progressive leadership positions, reflects his deep commitment to technological excellence and client success.
Ameer Hakme is an AWS Solutions Architect based in Pennsylvania. He collaborates with Independent Software Vendors (ISVs) in the Northeast region, assisting them in designing and building scalable and modern platforms on the AWS Cloud. An expert in AI/ML and generative AI, Ameer helps customers unlock the potential of these cutting-edge technologies. In his leisure time, he enjoys riding his motorcycle and spending quality time with his family.

Google’s Advanced AI Models: Gemini, PaLM, and Bard

With significant advancements through its Gemini, PaLM, and Bard models, Google has been at the forefront of AI development. Each model has distinct capabilities and applications, reflecting Google’s research in the LLM world to push the boundaries of AI technology.

Gemini: Google’s Multimodal Marvel

Gemini represents the pinnacle of Google’s AI research, developed by Google DeepMind. It is a multimodal large language model capable of understanding and generating text, code, audio, image, and video inputs. This makes Gemini particularly versatile for various applications, from natural language processing to complex multimedia tasks. The Gemini family includes three versions:

Gemini Ultra: The most powerful variant, designed for highly complex tasks.

Gemini Pro: Optimized for various tasks and scalable for enterprise use.

Gemini Nano: A more efficient model for on-device applications like smartphones.

Gemini has achieved state-of-the-art performance across numerous benchmarks. For example, it surpassed human experts on the Massive Multitask Language Understanding (MMLU) benchmark, highlighting its superior reasoning capabilities. Gemini’s multimodal nature allows it to process and integrate different types of information seamlessly, making it a robust tool for diverse AI applications.

Gemini 1.0 has a context length of 32,768 tokens, and it uses a mixture-of-experts approach to enhance its performance across different tasks. The model has been trained on a multimodal and multilingual dataset, including web documents, books, code, images, audio, and video data. This diverse training set enables Gemini to handle various inputs, further establishing its flexibility and robustness in multiple applications.

PaLM: The Pathways Language Model

PaLM (Pathways Language Model) and its successor, PaLM 2, are Google’s responses to the growing need for efficient, scalable, and multilingual AI models. PaLM 2 is built on compute-optimal scaling, balancing model size with the training dataset to enhance efficiency and performance.

Key Features:

Multilingual Capabilities: PaLM 2 is heavily trained on multilingual text, enabling it to understand and generate nuanced language across more than 100 languages. This makes it particularly effective for translation and multilingual tasks. PaLM 2 can handle idioms, poems, and riddles, showcasing its deep understanding of linguistic nuances.

Reasoning and Coding: The model excels in logical reasoning, common sense tasks, and coding, benefiting from a diverse training corpus that includes scientific papers and web pages with mathematical content. This broad training set includes datasets containing code, which helps PaLM 2 generate specialized code in languages like Prolog, Fortran, and Verilog.

Efficiency: PaLM 2 is designed to be more efficient than its predecessor, offering faster inference times and lower serving costs. It uses compute-optimal scaling to ensure that the model size and training dataset are balanced, making it both powerful and cost-effective.

PaLM 2 features an improved architecture and a larger context window, capable of handling up to one million tokens. This substantial context length allows it to manage extensive inputs like long documents or sequences of data, enhancing its application in various domains.

Bard: Google’s Conversational AI

Initially launched as a conversational AI, Bard has evolved significantly by integrating Gemini and PaLM models. Bard leverages these advanced models to enhance its natural language understanding and generation capabilities. This integration allows Bard to provide more accurate and contextually relevant responses, making it a powerful dialogue and information retrieval tool.

Bard’s capabilities are showcased in various Google products, from search enhancements to customer support solutions. Its ability to draw on real-time web data ensures that it provides up-to-date and high-quality responses, making it an invaluable resource for users. Bard’s integration with Gemini and PaLM enhances its performance in handling complex queries, making it a versatile tool for everyday users and professionals.

Conclusion

Google’s AI models, Gemini, PaLM, and Bard, demonstrate the company’s dedication to advancing AI technology. Gemini’s multimodal prowess, PaLM’s efficiency and multilingual strength, and Bard’s conversational abilities collectively contribute to a robust AI ecosystem that addresses various challenges and applications.

Gemini’s context length of 32,768 tokens and multimodal training data set it apart as a leader in AI innovation. PaLM 2’s ability to handle up to one million tokens and compute-optimal scaling makes it powerful and efficient. By integrating these advanced models, Bard provides high-quality conversational AI capabilities.

Sources

https://blog.google/technology/ai/google-gemini-ai/#scalable-efficient

https://ai.google/discover/palm2/

https://ai.google/static/documents/google-about-bard.pdf

The post Google’s Advanced AI Models: Gemini, PaLM, and Bard appeared first on MarkTechPost.

Genie 2: Transforming Protein Design with Advanced Multi-Motif Scaffol …

Protein design is a rapidly advancing field leveraging computational models to create proteins with novel structures and functions. This technology has significant applications in therapeutics and industrial processes, revolutionizing how proteins are engineered for specific tasks. Researchers in this field aim to develop methods that accurately predict and generate protein structures that perform desired functions efficiently. The complexity of protein folding and interaction dynamics presents a significant challenge, making it crucial to innovate in this space.

Designing proteins with precise structural and functional properties remains challenging. The primary objective is to create proteins that perform specific functions, such as enzyme catalysis or molecular recognition, essential in various biological and industrial applications. The intricate nature of protein structures, composed of amino acids folding into three-dimensional shapes, necessitates advanced computational tools to accurately predict and design these configurations.

Current methods in protein design include sequence-based and structure-based approaches. Sequence-based models, such as EvoDiff, predict amino acid sequences that fold into functional proteins, while structure-based models like ProteinMPNN propose plausible sequences for given structures. However, these methods often struggle to design proteins involving multiple interaction sites. For example, RFDiffusion integrates sequence information as a condition of a structure-based diffusion process, and FrameFlow combines a structural flow with a sequence flow. Designing proteins with multiple independent motifs remains a significant hurdle despite these advancements.

Researchers from Columbia University and Rutgers University introduced Genie 2, an advanced protein design model that extends the capabilities of its predecessor, Genie. Genie 2 incorporates architectural innovations and data augmentation to capture a broader protein structure space and enables multi-motif scaffolding for complex protein designs. The new model represents proteins as point clouds of C-alpha atoms in the forward process and clouds of reference frames in the reverse process, enhancing its ability to design complex protein structures.

Genie 2 utilizes SE(3)-equivariant attention mechanisms and asymmetric protein representations in its forward and reverse diffusion processes. It encodes motifs using pairwise distance matrices and integrates these into the diffusion model, allowing the generation of proteins with multiple, independent functional sites without predefined inter-motif positions. This approach sidesteps challenges in multi-motif scaffolding, enabling the design of proteins with complex interaction patterns and multiple functional motifs. The training process involves data augmentation using a subset of the AlphaFold database, consisting of approximately 214 million predictions, significantly enhancing the model’s capabilities.

Genie 2 achieves state-of-the-art results in designability, diversity, and novelty. It outperforms existing models like RFDiffusion and FrameFlow in unconditional protein generation and motif scaffolding tasks. For example, Genie 2 achieves a designability score of 0.96, compared to RFDiffusion’s 0.63, and exhibits higher structural diversity and novelty. The model also solves motif scaffolding problems with unique and varied solutions, demonstrating its superior ability to generate complex protein designs.

In conclusion, Genie 2 addresses significant challenges in protein design by introducing a robust model capable of generating complex, multifunctional proteins. It sets a new standard in the field, offering promising tools for future applications in biotechnology and medicine. The researchers’ advancements in architectural innovations and data augmentation techniques have resulted in a model that achieves high performance and broadens the potential for designing novel proteins with specific functional properties. 

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Genie 2: Transforming Protein Design with Advanced Multi-Motif Scaffolding and Enhanced Structural Diversity appeared first on MarkTechPost.

Beyond High-Level Features: Dense Connector Boosts Multimodal Large La …

Multimodal Large Language Models (MLLMs) represent an advanced field in artificial intelligence where models integrate visual and textual information to understand and generate responses. These models have evolved from large language models (LLMs) that excelled in text comprehension and generation to now also processing and understanding visual data, enhancing their overall capabilities significantly.

The main problem addressed in this research is the underutilization of visual information in current MLLMs. Despite advancements in language processing, the visual component is often limited to the high-level features extracted by a frozen visual encoder. This study explores how leveraging more detailed, multi-layer visual features can improve the performance of MLLMs, addressing the gap in fully utilizing visual signals for better multimodal understanding.

Current research includes various frameworks and models for MLLMs, such as CLIP, SigLIP, and Q-former, which connect visual and language models using pre-trained visual encoders and linear projections. Approaches like LLaVA and Mini-Gemini utilize high-resolution visual representations and instruction tuning to enhance performance. Methods such as Sparse Token Integration and Dense Channel Integration efficiently leverage multi-layer visual features to improve the robustness and scalability of MLLMs across diverse datasets and architectures.

Researchers from Tsinghua University, Baidu Inc., The University of Sydney, Amazon Web Services, and The Chinese University of Hong Kong have introduced the Dense Connector, a vision-language connector that enhances MLLMs by leveraging multi-layer visual features. This approach involves minimal additional computational overhead and can be integrated seamlessly with existing MLLMs. This innovative connector addresses the limitations of current MLLMs by providing a more comprehensive integration of visual data into the language model.

The Dense Connector uses a plug-and-play mechanism that incorporates visual features from various layers of the frozen visual encoder, enhancing the input to the LLM. It offers three instantiations: Sparse Token Integration (STI), Sparse Channel Integration (SCI), and Dense Channel Integration (DCI). Each method utilizes visual tokens effectively to improve the robustness of visual embeddings fed into the LLM. STI increases the number of visual tokens by aggregating them from different layers and mapping them into the text space. SCI concatenates visual tokens from other layers in the feature dimension, reducing feature dimensionality while maintaining the number of tokens. DCI incorporates features from all layers, combining adjacent layers to avoid redundancy and high dimensionality.
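
To make the channel-integration idea concrete, the following is an illustrative PyTorch sketch of Sparse Channel Integration: hidden states from a few layers of a frozen vision encoder are concatenated along the channel dimension and projected into the LLM embedding space. The layer indices, dimensions, and projector design are assumptions for illustration, not values from the paper.

# Illustrative sketch only; layer_ids, vis_dim, and llm_dim are assumed values.
import torch
import torch.nn as nn

class SparseChannelConnector(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, layer_ids=(8, 16, 24)):
        super().__init__()
        self.layer_ids = layer_ids
        # Project the concatenated multi-layer features into the LLM embedding space
        self.proj = nn.Sequential(
            nn.Linear(vis_dim * len(layer_ids), llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, hidden_states):
        # hidden_states: list of per-layer tensors, each (batch, num_tokens, vis_dim)
        selected = [hidden_states[i] for i in self.layer_ids]
        fused = torch.cat(selected, dim=-1)   # token count unchanged, channels grow
        return self.proj(fused)               # (batch, num_tokens, llm_dim)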

The Dense Connector demonstrated remarkable zero-shot capabilities in video understanding and achieved state-of-the-art performance across 19 image and video benchmarks. It was tested with various vision encoders, image resolutions, and LLM sizes, ranging from 2.7 billion to 70 billion parameters, validating its versatility and scalability. Experimental results highlighted the Dense Connector’s ability to enhance visual representations in MLLMs with minimal computational cost. The model achieved significant improvements across various datasets, with pronounced enhancements of 2.9% on MMBench and 1.7% on GQA. The research team also conducted extensive empirical studies demonstrating its compatibility with different visual encoders, such as CLIP-ViT-L and SigLIP-ViT-SO, and varying training dataset scales.

Furthermore, the Dense Connector outperformed existing methods by leveraging high-resolution representations and integrating them using the DCI method. This approach yielded substantial performance gains across multiple benchmarks, including MathVista, MMBench, and MM-Vet, with improvements of 1.1%, 1.4%, and 1.4%, respectively. By applying the Dense Connector to high-resolution methods like Mini-Gemini, the researchers showcased its plug-and-play capability, significantly enhancing detail expression in MLLMs.

In conclusion, this research introduces the Dense Connector, a novel method that enhances MLLMs by effectively utilizing multi-layer visual features. This approach overcomes limitations in current MLLMs, where visual information is often restricted to high-level features. The Dense Connector offers several instantiations, each integrating visual data from different layers of the visual encoder. This improves the quality of visual information fed into the LLM without significant computational cost. Experiments demonstrate that the Dense Connector significantly improves MLLM performance on various image and video benchmarks, highlighting its potential to advance multimodal understanding in AI. 

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Beyond High-Level Features: Dense Connector Boosts Multimodal Large Language Models MLLMs with Multi-Layer Visual Integration appeared first on MarkTechPost.

Enhance image search experiences with Amazon Personalize, Amazon OpenS …

A variety of different techniques have been used for returning images relevant to search queries. Historically, the idea of creating a joint embedding space to facilitate image captioning or text-to-image search has been of interest to machine learning (ML) practitioners and businesses for quite a while. Contrastive Language–Image Pre-training (CLIP) and Bootstrapping Language-Image Pre-training (BLIP) were the first two open source models that achieved near-human results on the task. More recently, however, there has been a trend to use the same techniques used to train powerful generative models to create multimodal models that map text and images to the same embedding space to achieve state-of-the-art results.
In this post, we show how to use Amazon Personalize in combination with Amazon OpenSearch Service and Amazon Titan Multimodal Embeddings from Amazon Bedrock to enhance a user’s image search experience by using learned user preferences to further personalize image searches in accordance with a user’s individual style.
Solution overview
Multimodal models are being used in text-to-image searches across a variety of industries. However, one area where these models fall short is in incorporating individual user preferences into their responses. A user searching for images of a bird, for example, could have many different desired results.

In an ideal world, we can learn a user’s preferences from their previous interactions with images they either viewed, favorited, or downloaded, and use that to return contextually relevant images in line with their recent interactions and style preferences.
Implementing the proposed solution includes the following high-level steps:

Create embeddings for your images.
Store embeddings in a data store.
Create a cluster for the embeddings.
Update the image interactions dataset with the image cluster.
Create an Amazon Personalize personalized ranking solution.
Serve user search requests.

Prerequisites
To implement the proposed solution, you should have the following:

An AWS account and familiarity with Amazon Personalize, Amazon SageMaker, OpenSearch Service, and Amazon Bedrock.
The Amazon Titan Multimodal Embeddings model enabled in Amazon Bedrock. You can confirm it’s enabled on the Model access page of the Amazon Bedrock console. If Amazon Titan Multimodal Embeddings is enabled, the access status will show as Access granted, as shown in the following screenshot. You can enable access to the model by choosing Manage model access, selecting Amazon Titan Multimodal Embeddings G1, and then choosing Save Changes.

A SageMaker domain. You can onboard a SageMaker domain by using the Set up for single users procedure from the SageMaker console.
Either an OpenSearch Service collection or a domain. For Amazon OpenSearch Serverless, you can create a collection. For OpenSearch Service, you can create a domain.

Create embeddings for your images
Embeddings are a mathematical representation of a piece of information such as a text or an image. Specifically, they are a vector or ordered list of numbers. This representation helps capture the meaning of the image or text in such a way that you can use it to determine how similar images or text are to each other by taking their distance from each other in the embedding space.

For example, an image of a bird might map to an embedding vector such as [-0.020802604, -0.009943095, 0.0012887075, ...].

As a first step, you can use the Amazon Titan Multimodal Embeddings model to generate embeddings for your images. With the Amazon Titan Multimodal Embeddings model, we can use an actual bird image or text like “bird” as an input to generate an embedding. Furthermore, these embeddings will be close to each other when the distance is measured by an appropriate distance metric in a vector database.
The following code snippet shows how to generate embeddings for an image or a piece of text using Amazon Titan Multimodal Embeddings:

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")


class EmbedError(Exception):
    """Raised when the embeddings generation request returns an error."""


def generate_embeddings_with_titan(image=None, text=None):
    user_input = {}

    if image is not None:
        user_input["inputImage"] = image
    if text is not None:
        user_input["inputText"] = text

    if not user_input:
        raise ValueError("One user input of an image or a text is required")

    body = json.dumps(user_input)

    response = bedrock_runtime.invoke_model(
        body=body,
        modelId="amazon.titan-embed-image-v1",
        accept="application/json",
        contentType="application/json"
    )

    response_body = json.loads(response.get("body").read())

    # If the service returned an error message, surface it instead of an embedding
    embedding_error = response_body.get("message")

    if embedding_error is not None:
        raise EmbedError(f"Embeddings generation error: {embedding_error}")

    return response_body.get("embedding")

It’s expected that the image is base64 encoded in order to create an embedding. For more information, see Amazon Titan Multimodal Embeddings G1. You can create this encoded version of your image for many image file types as follows:

import base64

with open(Image_Filepath + "/" + image, "rb") as image_file:
    input_image = base64.b64encode(image_file.read()).decode("utf8")

In this case, input_image can be directly fed to the embedding function you generated.
Create a cluster for the embeddings
As a result of the previous step, a vector representation for each image has been created by the Amazon Titan Multimodal Embeddings model. Because the goal is to create a more personalized image search influenced by the user’s previous interactions, you create a cluster out of the image embeddings to group similar images together. This is useful because it will force the downstream re-ranker, in this case an Amazon Personalize personalized ranking model, to learn user preferences for specific image styles as opposed to their preferences for individual images.
In this post, to create our image clusters, we use an algorithm made available through the fully managed ML service SageMaker, specifically the K-Means clustering algorithm. You can use any clustering algorithm that you are familiar with. K-Means clustering is a widely used method for clustering where the aim is to partition a set of objects into K clusters in such a way that the sum of the squared distances between the objects and their assigned cluster mean is minimized. The appropriate value of K depends on the data structure and the problem being solved. Make sure to choose the right value of K, because a small value can result in under-clustered data, and a large value can cause over-clustering.
The following code snippet is an example of how to create and train a K-Means cluster for image embeddings. In this example, the choice of 100 clusters is arbitrary—you should experiment to find a number that is best for your use case. The instance type represents the Amazon Elastic Compute Cloud (Amazon EC2) compute instance that runs the SageMaker K-Means training job. For detailed information on which instance types fit your use case, and their performance capabilities, see Amazon Elastic Compute Cloud instance types. For information about pricing for these instance types, see Amazon EC2 Pricing. For information about available SageMaker notebook instance types, see CreateNotebookInstance.
For most experimentation, you should use an ml.t3.medium instance. This is the default instance type for CPU-based SageMaker images, and is available as part of the AWS Free Tier.

import numpy as np
import sagemaker
from sagemaker import KMeans

role = sagemaker.get_execution_role()
num_clusters = 100

kmeans = KMeans(
    role=role,
    instance_count=1,
    instance_type="ml.t3.medium",
    output_path="s3://your_unique_s3bucket_name/",
    k=num_clusters,
    num_trials=num_clusters,
    epochs=10
)

kmeans.fit(kmeans.record_set(np.asarray(image_embeddings_list, dtype=np.float32)))

Store embeddings and their clusters in a data store
As a result of the previous step, a vector representation for each image has been created and assigned to an image cluster by our clustering model. Now, you need to store this vector such that the other vectors that are nearest to it can be returned in a timely manner. This allows you to input a text such as “bird” and retrieve images that prominently feature birds.
Vector databases provide the ability to store and retrieve vectors as high-dimensional points. They add additional capabilities for efficient and fast lookup of nearest neighbors in the N-dimensional space. They are typically powered by nearest neighbor indexes and built with algorithms like the Hierarchical Navigable Small World (HNSW) and Inverted File Index (IVF) algorithms. Vector databases provide additional capabilities like data management, fault tolerance, authentication and access control, and a query engine.
AWS offers many services for your vector database requirements. OpenSearch Service is one example; it makes it straightforward for you to perform interactive log analytics, real-time application monitoring, website search, and more. For information about using OpenSearch Service as a vector database, see k-Nearest Neighbor (k-NN) search in OpenSearch Service.
For this post, we use OpenSearch Service as a vector database to store the embeddings. To do this, you need to create an OpenSearch Service cluster or use OpenSearch Serverless. Regardless of which approach you use, you need to create a vector index. Indexing is the method by which search engines organize data for fast retrieval. To use a k-NN vector index for OpenSearch Service, you need to add the index.knn setting and add one or more fields of the knn_vector data type. This lets you search for points in a vector space and find the nearest neighbors for those points by Euclidean distance or cosine similarity, either of which is acceptable for Amazon Titan Multimodal Embeddings.
The following code snippet shows how to create an OpenSearch Service index with k-NN enabled to serve as a vector datastore for your embeddings:

def create_index(opensearch_client, index_name, vector_field_name):
    settings = {
        "settings": {
            "index": {
                "knn": True
            }
        },
        "mappings": {
            "properties": {
                vector_field_name: {
                    "type": "knn_vector",
                    "dimension": 1024,
                    "method": {
                        "name": "hnsw",
                        "space_type": "l2",
                        "engine": "faiss",
                        "parameters": {
                            "m": 32
                        }
                    }
                }
            }
        }
    }
    response = opensearch_client.indices.create(index=index_name, body=settings)
    return bool(response["acknowledged"])

The following code snippet shows how to store an image embedding into the OpenSearch Service index you just created:

embedding_vector = {
    "_index": index_name,
    "name": image_name,
    "type": "Image",
    "embedding": image_embedding,
    "cluster": image_cluster
}
# opensearch_client is your Amazon OpenSearch Service cluster client
opensearch_client.index(
    index=index_name,
    body=embedding_vector,
    id=str(index),
    refresh=True
)

Update the image interactions dataset with the image cluster
When creating an Amazon Personalize re-ranker, the item interactions dataset represents the user interaction history with your items. Here, the images represent the items and the interactions could consist of a variety of events, such as a user downloading an image, favoriting it, or even viewing a higher resolution version of it. For our use case, we train our recommender on the image clusters instead of the individual images. This gives the model the opportunity to recommend based on the cluster-level interactions and understand the user’s overall stylistic preferences as opposed to preferences for an individual image in the moment.
To do so, update the interactions dataset to include the image cluster instead of the image ID, and store the file in an Amazon Simple Storage Service (Amazon S3) bucket, at which point it can be brought into Amazon Personalize.
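A minimal sketch of that rewrite follows; the column names mirror Amazon Personalize’s interactions schema conventions, and image_to_cluster is an assumed mapping from each image ID to the K-Means cluster assigned in the previous step.

# Hedged sketch: replace ITEM_ID (image ID) with the image's cluster ID.
import pandas as pd

interactions = pd.read_csv("image_interactions.csv")  # USER_ID, ITEM_ID, TIMESTAMP, EVENT_TYPE
interactions["ITEM_ID"] = interactions["ITEM_ID"].map(image_to_cluster).astype(str)
# Writing directly to S3 assumes s3fs is installed; alternatively write locally and upload with boto3
interactions.to_csv("s3://your_unique_s3bucket_name/cluster_interactions.csv", index=False)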
Create an Amazon Personalize personalized ranking campaign
The Personalized-Ranking recipe generates personalized rankings of items. A personalized ranking is a list of recommended items that are re-ranked for a specific user. This is useful if you have a collection of ordered items, such as search results, promotions, or curated lists, and you want to provide a personalized re-ranking for each of your users. Refer to the following example available on GitHub for complete step-by-step instructions on how to create an Amazon Personalize recipe. The high-level steps are as follows:

Create a dataset group.
Prepare and import data.
Create recommenders or custom resources.
Get recommendations.

We create and deploy a personalized ranking campaign. First, you need to create a personalized ranking solution. A solution is a combination of a dataset group and a recipe, which is basically a set of instructions for Amazon Personalize to prepare a model to solve a specific type of business use case. Then you train a solution version and deploy it as a campaign.
The following code snippet shows how to create a Personalized-Ranking solution resource:

personalized_ranking_create_solution_response = personalize_client.create_solution(
    name="personalized-image-reranker",
    datasetGroupArn=dataset_group_arn,
    recipeArn=personalized_ranking_recipe_arn
)
personalized_ranking_solution_arn = personalized_ranking_create_solution_response["solutionArn"]

The following code snippet shows how to create a Personalized-Ranking solution version resource:

personalized_ranking_create_solution_version_response = personalize_client.create_solution_version(
    solutionArn=personalized_ranking_solution_arn
)

personalized_ranking_solution_version_arn = personalized_ranking_create_solution_version_response["solutionVersionArn"]

The following code snippet shows how to create a Personalized-Ranking campaign resource:

create_campaign_response = personalize_client.create_campaign(
    name="personalized-image-reranker-campaign",
    solutionVersionArn=personalized_ranking_solution_version_arn,
    minProvisionedTPS=1
)

personalized_ranking_campaign_arn = create_campaign_response["campaignArn"]

Serve user search requests
Now our solution flow is ready to serve a user search request and provide personalized ranked results based on the user’s previous interactions. The search query will be processed as shown in the following diagram.

To set up personalized multimodal search, you would execute the following steps:

Multimodal embeddings are created for the image dataset.
A clustering model is created in SageMaker, and each image is assigned to a cluster.
The unique image IDs are replaced with cluster IDs in the image interactions dataset.
An Amazon Personalize personalized ranking model is trained on the cluster interaction dataset.
Separately, the image embeddings are added to an OpenSearch Service vector index.

The following workflow would be executed to process a user’s query:

Amazon API Gateway calls an AWS Lambda function when the user enters a query.
The Lambda function calls the same multimodal embedding function to generate an embedding of the query.
A k-NN search is performed for the query embedding on the vector index.
A personalized score for the cluster ID for each retrieved image is obtained from the Amazon Personalize personalized ranking model.
The scores from OpenSearch Service and Amazon Personalize are combined through a weighted mean. The images are re-ranked and returned to the user.

The weight given to each score could be tuned based on the available data, the desired outcomes, and the desired degree of personalization vs. contextual relevance.
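
The following is a hedged sketch of that re-ranking step, combining OpenSearch k-NN scores with Amazon Personalize personalized-ranking scores through a weighted mean. The weight, score normalization, and hit field names are assumptions for illustration.

# knn_hits is assumed to be a list of dicts such as
# {"image_id": ..., "cluster": ..., "knn_score": ...} built from the OpenSearch response.
import boto3

personalize_runtime = boto3.client("personalize-runtime")

def rerank(knn_hits, user_id, campaign_arn, weight=0.5):
    cluster_ids = list({str(hit["cluster"]) for hit in knn_hits})  # dedupe clusters before ranking
    response = personalize_runtime.get_personalized_ranking(
        campaignArn=campaign_arn,
        userId=user_id,
        inputList=cluster_ids,
    )
    cluster_scores = {item["itemId"]: item["score"] for item in response["personalizedRanking"]}
    for hit in knn_hits:
        personalize_score = cluster_scores.get(str(hit["cluster"]), 0.0)
        # Weighted mean of retrieval relevance and personalization
        hit["combined_score"] = weight * hit["knn_score"] + (1 - weight) * personalize_score
    return sorted(knn_hits, key=lambda h: h["combined_score"], reverse=True)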

To see what this looks like in practice, let’s explore a few examples. In our example dataset, all users would, in the absence of any personalization, receive the following images if they search for “cat”.

However, a user who has a history of viewing the following images (let’s call them comic-art-user) clearly has a certain style preference that isn’t addressed by the majority of the previous images.

By combining Amazon Personalize with the vector database capabilities of OpenSearch Service, we are able to return the following results for cats to our user:

In the following example, a user has been viewing or downloading the following images (let’s call them neon-punk-user).

They would receive the following personalized results instead of the mostly photorealistic cats that all users would receive absent any personalization.

Finally, a user viewed or downloaded the following images (let’s call them origami-clay-user).

They would receive the following images as their personalized search results.

These examples illustrate how the search results have been influenced by the users’ previous interactions with other images. By combining the power of Amazon Titan Multimodal Embeddings, OpenSearch Service vector indexing, and Amazon Personalize personalization, we are able to deliver each user relevant search results in alignment with their style preferences as opposed to showing all of them the same generic search result.
Furthermore, because Amazon Personalize is capable of updating based on changes in the user style preference in real time, these search results would update as the user’s style preferences change, for example if they were a designer working for an ad agency who switched mid-browsing session to working on a different project for a different brand.
Clean up
To avoid incurring future charges, delete the resources created while building this solution:

Delete the OpenSearch Service domain or OpenSearch Serverless collection.
Delete the SageMaker resources.
Delete the Amazon Personalize resources.

Conclusion
By combining the power of Amazon Titan Multimodal Embeddings, OpenSearch Service vector indexing and search capabilities, and Amazon Personalize ML recommendations, you can boost the user experience with more relevant items in their search results by learning from their previous interactions and preferences.
For more details on Amazon Titan Multimodal Embeddings, refer to Amazon Titan Multimodal Embeddings G1 model. For more details on OpenSearch Service, refer to Getting started with Amazon OpenSearch Service. For more details on Amazon Personalize, refer to the Amazon Personalize Developer Guide.

About the Authors
Maysara Hamdan is a Partner Solutions Architect based in Atlanta, Georgia. Maysara has over 15 years of experience in building and architecting Software Applications and IoT Connected Products in Telecom and Automotive Industries. In AWS, Maysara helps partners in building their cloud practices and growing their businesses. Maysara is passionate about new technologies and is always looking for ways to help partners innovate and grow.
Eric Bolme is a Specialist Solution Architect with AWS based on the East Coast of the United States. He has 8 years of experience building out a variety of deep learning and other AI use cases and focuses on Personalization and Recommendation use cases with AWS.

End-to-end LLM training on instance clusters with over 100 nodes using …

Llama is Meta AI’s large language model (LLM), with variants ranging from 7 billion to 70 billion parameters. Llama uses a transformers-based decoder-only model architecture, which specializes in language token generation. To train a model from scratch, a dataset containing trillions of tokens is required. The Llama family is one of the most popular LLMs. However, training Llama models can be technically challenging, prolonged, and costly.
In this post, we show you how to accelerate the full pre-training of LLM models by scaling up to 128 trn1.32xlarge nodes, using a Llama 2-7B model as an example. We share best practices for training LLMs on AWS Trainium, scaling the training on a cluster with over 100 nodes, improving efficiency of recovery from system and hardware failures, improving training stability, and achieving convergence. We demonstrate that Llama 2-7B trained on Trainium is of comparable quality to the open source version on multiple tasks, ranging from multi-task language understanding and math reasoning to code generation. We also demonstrate the scaling benefits of Trainium.
What makes distributed training across over 100 nodes so challenging?
Training large-scale LLMs requires distributed training across over 100 nodes, and getting elastic access to large clusters of high-performance compute is difficult. Even if you manage to get the required accelerated compute capacity, it’s challenging to manage a cluster of over 100 nodes, maintain hardware stability, and achieve model training stability and convergence. Let’s look at these challenges one by one and how we address them with Trainium clusters during the end-to-end training:

Distributed training infrastructure efficiency and scalability – Training LLMs is both computation and memory intensive. In this post, we show you how to enable the different parallel training algorithms on Trainium and select the best hyperparameters to achieve the highest throughput of Llama 2-7B on the Trainium cluster. We also demonstrate the implementations of other memory and computation optimization techniques such as coalescing layers and data type selection on Trainium. Empirically, we have proven that Trainium clusters can reduce costs by up to 46% compared to comparable Amazon Elastic Compute Cloud (Amazon EC2) instances.
Efficient hardware and system recovery – End-to-end LLM training at this scale will inevitably encounter hardware or system failures. We demonstrate how to efficiently enable checkpoint saving and automatically recover using the NeuronX Distributed library. Empirically, we demonstrate that with automatic failure recovery, the effective utilization of hardware computing hours reaches 98.81% compared to 77.83% with a manual recovery method.
Training stability and convergence – Finally, frequent occurrence of spikes of loss functions in pre-training deep neural networks such as Llama 2 can lead to catastrophic divergence. Due to the large computation cost required for training LLMs, we want to reduce loss function spikes, improve training stability, and achieve convergence of training. We demonstrate best practices and implementation of techniques such as scaled initialization, gradient clipping, and cache management on Trainium clusters to achieve this. We also show how to monitor and debug for training stability.

Llama 2-7B pre-training setup
In this section, we discuss the steps for setting up Llama 2-7B pre-training.
Infrastructure
Setting up the Llama 2-7B infrastructure consists of the following components:

EC2 cluster – The training cluster includes 128 trn1.32xlarge instances (nodes), totaling 2048 Trainium accelerators. The instances are interconnected through 8x100 Gbps Elastic Fabric Adapter (EFA) links. We mounted 56 TB of Amazon FSx storage for immediate data storage and checkpoint saving and loading. The raw training data was saved in Amazon Simple Storage Service (Amazon S3) buckets.
Orchestration – We first trained the Llama 2-7B from scratch using a trn1.32xlarge cluster that is managed through Amazon Elastic Kubernetes Service (Amazon EKS). For details about the setup procedure, refer to Train Llama2 with AWS Trainium on Amazon EKS. We followed the same procedure but set up the cluster at a much larger scale with 128 trn1.32xlarge instances.
Container build – We used a custom Docker image that was built based on the following training containers and included the Llama 2-7B training source files. We stored the custom Docker image in an Amazon Elastic Container Registry (Amazon ECR) registry and deployed it in EKS pods. The following diagram shows the architecture of the cluster and container setup.

Data preparation
The original format of the training dataset contains a large number of compressed files. To use this dataset, we first converted them into a format compatible with the Hugging Face dataset package. We used the Apache Arrow format (the default storage format for datasets) to combine all data into a single file and a single block of a file. This method significantly reduces load times for TB-sized datasets compared to the default method of loading many separate files.
We first downloaded the preprocessed training dataset, a small subset of the full dataset that contains 12 trillion tokens, using a special EC2 instance with 20–30 TB of memory. The data download script is as follows:

import os

# Cache and tmpdir can be large. Make sure ~/ has enough disk space.
os.environ["HF_DATASETS_CACHE"] = "~/dataset/cache"
os.environ["TMPDIR"] = "~/dataset/tmpdir"

import datasets
from datasets import load_dataset

save_path = "~/<data path>/arrow"
save_path = os.path.expanduser(save_path)
os.makedirs(save_path, exist_ok=True)

raw_datasets = load_dataset("togethercomputer/<1T data file name>", "default", num_proc=448)
raw_datasets["train"].save_to_disk(
    save_path,
    num_shards=1,
    num_proc=448,
)

The dataset is processed for optimized storage and access:
import pyarrow as pa
import time

a = time.time()
stream = pa.memory_map("~/<data path>/arrow/train.arrow")
stream = pa.ipc.open_stream(stream)
table = stream.read_all()
print("completed step 1 in seconds: ", time.time() - a)

ca = table["text"]
l = ca.to_pylist()
schema = pa.schema({"text": pa.large_string()})
arr = pa.array(l, type=pa.large_string())

with pa.OSFile("~/<data path>/arrow/train.arrow", "wb") as sink:
    with pa.ipc.new_stream(sink, schema=schema) as writer:
        batch = pa.record_batch([arr], schema=schema)
        writer.write(batch)
print("completed step 2 in seconds: ", time.time() - a)

On the same instance, we cleaned up the dataset and uploaded the clean dataset to an S3 bucket. We then used a 128 trn1.32xlarge cluster to perform tokenization and packaging (such as dynamically filling sequences and applying masking mechanisms) online during training. Compared with offline packaging methods, this online method saves tremendous development time and computing resources, especially for multiple experiments that use different large datasets and tokenizers.
Model hyperparameters
We adopted the same training hyperparameters as the Llama models. Specifically, we used a cosine learning rate scheduler with the same maximum learning rate of 3e-4 and the same minimum learning rate of 3e-5. We followed the same linear warmup of 2,000 steps. The following figure shows a plot of the overall learning rate scheduler.
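
The schedule itself is simple to write down. The following is a minimal sketch using the values above; the total step count is an assumption chosen for illustration only.

# Linear warmup for 2,000 steps to 3e-4, then cosine decay down to 3e-5.
# TOTAL_STEPS is an assumed value for illustration.
import math

MAX_LR, MIN_LR, WARMUP_STEPS, TOTAL_STEPS = 3e-4, 3e-5, 2_000, 500_000

def learning_rate(step):
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))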

We used the AdamW optimizer with β1 = 0.9 and β2 = 0.95. We used a weight decay value of 0.1 for all parameters, including normalization weights. For training stability, gradient-norm clipping of 1.0 was applied. For a different model setup, such as Llama 3, these parameters need to be tuned to achieve optimal performance.
Distributed training infrastructure efficiency and scalability
During the training, we applied general optimization techniques, such as activation checkpointing, model and data parallelism, and computation and communication overlapping in Trainium through the Neuron SDK, as well as some unique enhancements such as BF16 with stochastic rounding. In this section, we list the key features and configurations used in our model pre-training to improve training efficiency.
Model and data parallelism
Neuron supports tensor parallelism (TP), pipeline parallelism (PP), sequence parallelism (SP), and data parallelism (DP). For the 7B model with a sequence length of 4,096, we found that a TP degree of 8, a PP degree of 1, an SP degree of 8, and a DP degree of 512 yield the highest training throughput. On a trn1.32xlarge instance cluster, this leads to having four model copies per instance.
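As a quick sanity check, these parallelism degrees and the global batch size of 1,024 sequences described in the next paragraph fit together as follows (plain arithmetic, no Neuron APIs involved):

nodes, cores_per_node = 128, 32       # trn1.32xlarge exposes 32 NeuronCores
tp, pp = 8, 1
total_cores = nodes * cores_per_node  # 4,096 NeuronCores in the cluster
dp = total_cores // (tp * pp)         # 512 data-parallel model replicas
copies_per_instance = cores_per_node // (tp * pp)  # 4 model copies per instance

gbs, seq_len, grad_accum = 1024, 4096, 2
tokens_per_step = gbs * seq_len                 # 4,194,304 (~4 million) tokens per step
micro_batch_per_core = gbs // dp // grad_accum  # 1 sequence per NeuronCore per micro-step

print(dp, copies_per_instance, tokens_per_step, micro_batch_per_core)  # 512 4 4194304 1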
We used a global batch size of 1,024 sequences with a maximum sequence length of 4,096 tokens, so each step covered about 4 million tokens. With 2 gradient accumulation steps, the actual batch size per NeuronCore is 1. The following figure illustrates the data parallelism and tensor parallelism we applied in the training.

Neuron Distributed library
AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and Trainium-based instances. It includes the compiler, runtime, and profiling tools. It supports a variety of data types, including FP32, BF16, and FP16, as well as stochastic rounding. The Neuron SDK enables tensor parallelism, pipeline parallelism, and data parallelism distributed strategies through the NeuronX Distributed library. This allows trading off between preserving the high accuracy of trained models and training efficiency in terms of throughput and memory consumption. We applied the following features in the training process:

Selective activation checkpointing – We used selective activation checkpointing to improve training efficiency. It has a slightly higher memory cost than full activation checkpointing, but increases the overall training throughput.
BF16 with stochastic rounding – We compared three precision settings: BF16, BF16 with SR, and mixed precision training. Empirically, we found that BF16 with SR showed the same convergence behavior as mixed precision training, with higher training throughput and lower memory footprint; whereas the training loss of BF16 diverged. Therefore, we chose BF16 with SR in our pre-training exercise.
Coalescing layers with the same inputs – We coalesced linear layers with the same inputs to reduce communication in tensor and sequence parallelism and improve the efficiency of matrix operations. Specifically, the Q, K, and V layers in an attention block are coalesced, and the two linear projection layers in SwiGLU are also coalesced. This optimization technique is generic to LLMs. The following are example code snippets:

# q_proj, k_proj, and v_proj are merged into qkv_proj
if not self.config.separate_qkv and self.num_heads == self.num_key_value_heads and self.config.kv_shared_group_size == 1:
    qkv_states = self.qkv_proj(hidden_states)
    query_states, key_states, value_states = qkv_states.split(self.split_size, dim=2)
elif self.config.qkv_linear:
    query_states, key_states, value_states = self.qkv_proj(hidden_states)
else:
    query_states = self.q_proj(hidden_states)
    key_states = self.k_proj(hidden_states)
    value_states = self.v_proj(hidden_states)

# gate_proj and up_proj are merged into gate_up_proj
gate_proj, up_proj = self.gate_up_proj(x).split(self.split_size, dim=2)

Compiler optimization – We used the compiler flag --distribution-strategy=llm-training to enable the compiler to perform optimizations applicable to LLM training runs that shard parameters, gradients, and optimizer states across data parallel workers. We also used --model-type=transformer, which performs optimizations specific to transformer models. We set the Neuron environment variable NEURON_FUSE_SOFTMAX=1 to enable compiler optimizations on the custom lowering of the Softmax operation. Finally, we used NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3 to reduce training latency with asynchronous runs, which overlaps some execution on the accelerators and the host (CPU). One way to set these options is sketched below.
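The following minimal sketch shows one way to set these options before launching the training script, following the os.environ pattern used earlier in this post. NEURON_CC_FLAGS is the usual mechanism for passing Neuron compiler flags, but confirm the exact variable names against the Neuron SDK version you use.

import os

# Compiler flags for LLM-training and transformer-specific optimizations
# (assumed to be passed through NEURON_CC_FLAGS; verify against your Neuron SDK documentation).
os.environ["NEURON_CC_FLAGS"] = "--distribution-strategy=llm-training --model-type=transformer"

# Runtime settings named in this section.
os.environ["NEURON_FUSE_SOFTMAX"] = "1"
os.environ["NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS"] = "3"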

The following table summarizes all hyperparameters used in our pre-training exercise.

Category                | Parameter                    | Trn – NxD
Optimization parameters | Seq_len                      | 4096
Optimization parameters | Precision                    | bf16
Optimization parameters | GBS                          | 1024
Optimization parameters | Learning rate                | 3.00E-04
Optimization parameters | min_lr                       | 3.00E-05
Optimization parameters | Weight decay                 | 0.1
Optimization parameters | grad_clip                    | 1.0
Optimization parameters | LR scheduler                 | cosine
Optimization parameters | Warmup steps                 | 2,000
Optimization parameters | Constant steps               | 0
Optimization parameters | AdamW (beta1, beta2)         | (0.9, 0.95)
Optimization parameters | AdamW eps                    | 1.00E-05
Distributed parameters  | Number of nodes              | 128
Distributed parameters  | TP                           | 8
Distributed parameters  | PP                           | 1
Distributed parameters  | DP                           | 512
Distributed parameters  | GBS                          | 1024
Distributed parameters  | Per-NeuronCore batch size    | 1
Distributed parameters  | Gradient accumulation steps  | 2
Distributed parameters  | Sequence parallel            | Yes
Steps                   | LR decay steps               | 480,000
Steps                   | Training steps               | 500,000

Hardware and system recovery
Training a billion-parameter LLM often requires running on a cluster with over 100 nodes for multiple days or even weeks. The following are best practices for sanity checking and monitoring cluster health and for efficiently recovering from hardware and system failures:

Health sanity check and monitoring – It’s important to monitor the health of the compute nodes. In the initial setup, we first ran a thorough check using the Neuron standard test library to make sure the networking bandwidth performed as expected. During training, the process can be interrupted by hardware failures, communication timeouts, and so on. We used Amazon EKS settings to monitor the behavior of the compute nodes; EKS sends a warning message if a node or the network becomes unhealthy. After that, the cluster stops all the instances and restarts them after the health sanity check passes.
Efficient recovery with Neuron automatic fault recovery – To improve the efficiency of fault recovery, NeuronX Distributed supports checkpoint saving and loading. In particular, it optimizes the checkpoint saving time by supporting asynchronous checkpoint saving. To reduce the overhead of manual intervention, NeuronX Distributed provides an API that automatically loads the latest checkpoint saved before a failure and restarts the training. These APIs are important for achieving high system uptime and therefore for finishing end-to-end training. With the automatic node failure recovery and resuming methods, the effective utilization of hardware computing hours reached 98.81%, compared to 77.83% with the manual recovery method. The comparison was based on another experimental training run (over 600 billion tokens) without automatic fault recovery, where we observed on average about 20% lower system uptime.

Training stability and convergence
During the training process, we found that training convergence depends on initialization, weight normalization, and gradient synchronization, all of which can be monitored continuously during training. Stability also depends on reducing frequent access to the distributed file system. In this section, we discuss the best practices we exercised to improve numeric stability and achieve convergence of the model.
Initialization
We used a scaled initialization strategy for initializing model parameters. Specifically, the initial standard deviation of the output layers in attention blocks and MLP layers was scaled down by the square root of twice the number of hidden layers. Similar to what is discussed in the following whitepaper, we found better numerical stability and convergence with smaller initial variance on deeper layers. Additionally, all parameters were initialized on CPU and then moved to Trainium. The following figure shows that without the scaled initialization (plotted in green and black), the training loss diverged after 22,000–23,000 steps. In contrast, the training loss (plotted in yellow) converges after enabling the scaled initialization. The default initialization is replaced by this code:

scaled_init_method = partial(
    _init_normal, config.initializer_range / math.sqrt(2.0 * config.num_hidden_layers)
)
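The _init_normal helper is not shown in the snippet above. A hypothetical stand-in, assuming it simply draws weights from a zero-mean normal distribution with the given standard deviation (the actual helper in the training code base may differ in name and signature), could look like this:

import math
from functools import partial

import torch

def _init_normal(std, weight):
    # Hypothetical helper: initialize the weight tensor from N(0, std^2).
    return torch.nn.init.normal_(weight, mean=0.0, std=std)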

Gradient synchronization with all-reduce
The gradient all-reduce in torch/xla normalizes the global gradient by world_size instead of the data parallelism degree. When we applied hybrid parallelism, including both model parallelism (tensor parallelism and pipeline parallelism) and data parallelism, the world_size was larger than the data parallelism degree. This led to divergence because of the incorrect gradient normalization. To fix this, we modified the gradient normalization to use bucket_allreduce_gradients based on the data parallelism degree in NeuronX Distributed. The recommended way is to use neuronx_distributed.parallel_layers.grads.bucket_allreduce_gradients.
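Conceptually, the fix amounts to summing gradients over the data-parallel process group only and dividing by the data parallelism degree. The following is a minimal PyTorch sketch of that principle, not the NeuronX Distributed implementation itself; dp_group is assumed to be a process group containing only the data-parallel replicas.

import torch.distributed as dist

def allreduce_gradients_over_dp(model, dp_group):
    # Average gradients across data-parallel replicas only, so the normalization
    # factor is the data parallelism degree rather than the global world_size.
    dp_size = dist.get_world_size(group=dp_group)
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=dp_group)
            param.grad.div_(dp_size)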
Neuron persistent cache on a local worker
When we set up the training cluster, all nodes in the 128 trn1.32xlarge instances shared the same file system, using Amazon FSx for storing data, checkpoints, logs, and so on. Storing the Neuron persistent cache generated from the model compilation on Amazon FSx caused a communication bottleneck because those cached graphs are frequently checked by all Trainium devices in the cluster. Such bottlenecks led to a communication timeout and affected training stability. Therefore, we instead stored Neuron persistent caches (compiled graph binary) in the root volume of each local worker.
Training stability monitoring
During the training, we monitored the training loss, L2-norm of gradients, and L2-norm of parameters for debugging the training stability.
Monitoring the training loss curve gives us the first high-level stability signal. We used TensorBoard to monitor the training loss curve and validation loss curve, as shown in the following figure. The entire model was trained on 1.8 trillion tokens. We observed that the training loss decreases fast for the initial 250 billion tokens and enters a log-linear decrease afterwards.

Monitoring the gradient norm and parameter norms
We monitored the gradient norm as an early signal of divergence. Rapid growth of the gradient norm (more than three times its lowest value) or persistent spikes (benign spikes should return to normal values within a few iterations) can lead to divergence issues. In our training, we observed a well-behaved gradient norm trend even with BF16, as illustrated in the following figure.

The spikes in our gradient norm often last for a single step and don’t impact the overall training convergence. Specifically, we first tracked a running average (r) of the gradient norm over a window of 20 steps to smooth out the natural fluctuations due to batching. We defined the occurrence of a gradient spike as the current gradient norm being higher than r + 0.1. Next, we tracked the number of steps it took for the gradient norm to return to less than r + 0.1. In over 86% of cases, the spike deviates from the running average for only a single step, as shown in the following figure.
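The spike-counting rule described above can be implemented in a few lines; this is a minimal sketch (a post-hoc analysis over logged gradient norms, not the exact script we used):

from collections import deque

def spike_durations(grad_norms, window=20, threshold=0.1):
    # For each spike (gradient norm above the running average plus the threshold),
    # count how many consecutive steps it stays above that level.
    history, durations, current = deque(maxlen=window), [], 0
    for g in grad_norms:
        running_avg = sum(history) / len(history) if history else g
        if g > running_avg + threshold:
            current += 1
        elif current:
            durations.append(current)
            current = 0
        history.append(g)
    return durations  # the fraction of durations equal to 1 gives the share of single-step spikes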

Finally, we also monitored the parameter norm. This metric is a good way to monitor convergence during the initialization stage. For this setup, the initial values are around 1,600, which is expected based on empirical training results from other hardware.

Training results
In this section, we present the results for model quality evaluation and throughput scalability.
Model quality evaluation
The whole training process took a few weeks. With the saved pre-trained model, we benchmarked the model quality on different tasks and compared it with OpenLlama 2-7B. The following table benchmarks the accuracy over a variety of tasks: MMLU, BBH, common reasoning, world knowledge, reading comprehension, math, and code. For OpenLlama 2, we used the available pre-trained weights and evaluated them with the same evaluation pipeline as our pre-trained model. Overall, the model trained on Trn1 shows better or comparable accuracy for all tasks except common reasoning.

Task                   | Shots       | Metric                 | Llama2-7B on trn1 | OpenLlama-2
MMLU                   | 5           | accuracy               | 41.318 (3.602)    | 41.075 (3.611)
BBH                    | 3           | multiple_choice_grade  | 36.565 (1.845)    | 35.502 (1.861)
Common Reasoning       | 0           | accuracy               | 56.152 (1.194)    | 56.893 (1.195)
Common Reasoning       | 0           | accuracy_norm          | 59.455 (1.206)    | 61.262 (1.19)
World Knowledge        | 5 (average) | exact match            | 38.846 (0.534)    | 37.023 (0.52)
Reading Comprehension  | 0           | accuracy               | 72.508 (0.781)    | 72.416 (0.782)
Math                   | 8           | accuracy               | 9.401 (0.804)     | 5.231 (0.613)
Code                   | 0           | pass@1                 | 7.62              | 9.06
Code                   | 0           | pass@10                | 19.83             | 23.58
Code                   | 0           | pass@100               | 34.15             | 40.24

We also verified that the model accuracy keeps increasing by training more tokens in the dataset. For comparison, we tracked the model accuracy using saved intermediate checkpoints for different tasks, as shown in the following figures.
The first figure shows the model accuracy for world knowledge.

The following figure shows the model accuracy for common reasoning.

The following figure shows the model accuracy for math.

We observed that the accuracy increases with more training tokens for different tasks.
The model quality could be further improved by fine-tuning for specific tasks on domain-specific datasets.
Throughput scalability
In addition to the model quality, we checked the training throughput scaling and got more than 90% scaling efficiency for Llama 2-70B for 64 instances, as shown in the following figure. The Llama 2-7B scaling efficiency is slightly lower because the model size is relatively small for a cluster at this scale.

Clean up
To clean up all the provisioned resources for this post, use the following code and the cleanup script described in Train Llama2 with AWS Trainium on Amazon EKS:

./cleanup.sh

Conclusion
This post showed the end-to-end training example for the Llama 2-7B model on a dataset of up to 1.8 trillion tokens using a cluster of 128 trn1.32xlarge instances. We discussed best practices to overcome the challenges associated with this type of large model training: hardware stability and recovery, model training stability and convergence, and throughput optimization. The saved model demonstrated good quality on general tasks and showed a strong cost benefit from training on purpose-built Trainium AI accelerators. To learn more about the model architectures supported for training on Trainium and access tutorials, refer to Training Samples/Tutorials.
Reference
HLAT: High-quality Large Language Model Pre-trained on AWS Trainium, https://arxiv.org/pdf/2404.10630

About the Authors
Jianying Lang is a Principal Solutions Architect at AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI field. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining the techniques in HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.
Fei Chen has 15 years of industry experience leading teams in developing and productizing AI/ML at internet scale. At AWS, she leads the worldwide solution teams in Advanced Compute, including AI accelerators, HPC, IoT, visual and spatial compute, and emerging technology focusing on technical innovations (AI and generative AI) in the aforementioned domains.
Haozheng Fan is a software engineer at AWS. He is interested in large language models (LLMs) in production, including pre-training, fine-tuning, and evaluation. His works span from framework application level to hardware kernel level. He currently works on LLM training on novel hardware, with a focus on training efficiency and model quality.
Hao Zhou is a Research Scientist with Amazon SageMaker. Before that, he worked on developing machine learning methods for fraud detection for Amazon Fraud Detector. He is passionate about applying machine learning, optimization, and generative AI techniques to various real-world problems. He holds a PhD in Electrical Engineering from Northwestern University.
Yida Wang is a principal scientist in the AWS AI team of Amazon. His research interests are in systems, high-performance computing, and big data analytics. He currently works on deep learning systems, with a focus on compiling and optimizing deep learning models for efficient training and inference, especially large-scale foundation models. The mission is to bridge the high-level models from various frameworks and low-level hardware platforms including CPUs, GPUs, and AI accelerators, so that different models can run with high performance on different devices.
Jun (Luke) Huan is a Principal Scientist at AWS AI Labs. Dr. Huan works on AI and data science. He has published more than 160 peer-reviewed papers in leading conferences and journals and has graduated 11 PhD students. He was a recipient of the NSF Faculty Early Career Development Award in 2009. Before joining AWS, he worked at Baidu Research as a distinguished scientist and the head of Baidu Big Data Laboratory. He founded StylingAI Inc., an AI startup, and worked as the CEO and Chief Scientist from 2019–2021. Before joining the industry, he was the Charles E. and Mary Jane Spahr Professor in the EECS Department at the University of Kansas. From 2015–2018, he worked as a program director at the US NSF, in charge of its big data program.

Fine-tune large multimodal models using Amazon SageMaker

Large multimodal models (LMMs) integrate multiple data types into a single model. By combining text data with images and other modalities during training, multimodal models such as Claude3, GPT-4V, and Gemini Pro Vision gain more comprehensive understanding and improved ability to process diverse data types. The multimodal approach allows models to handle a wider range of real-world tasks that involve both text and non-text inputs. In this way, multimodality helps overcome the restrictions of pure text models. LMMs have the potential to profoundly impact various industries, such as healthcare, business analysis, autonomous driving, and so on.
However, a general-purpose language model can only process relatively simple visual tasks such as answering basic questions about an image or generating short captions. This is primarily due to the lack of access to detailed pixel-level information, object segmentation data, and other granular annotations that would allow the model to precisely understand and reason about the various elements, relationships, and context within an image. Without this fine-grained visual understanding, the language model is constrained to more superficial, high-level analysis and generation capabilities related to images. Fine-tuning LMMs on domain-specific data can significantly improve their performance for targeted tasks. The prospect of fine-tuning open source multimodal models like LLaVA is highly appealing because of their cost-effectiveness, scalability, and impressive performance on multimodal benchmarks. For those seeking flexible and economical solutions, the ability to use and customize these powerful models holds immense potential.
In this blog post, we demonstrate how to fine-tune and deploy the LLaVA model on Amazon SageMaker. The source code is available in this GitHub repository.
LLaVA overview
LLaVA is trained end-to-end to enable general-purpose understanding across both visual and textual data. In the LLaVA model architecture, pre-trained language models such as Vicuna or LLaMA are combined with visual models such as CLIP’s visual encoder. The integration converts the visual features from images into a format that matches the language model’s embeddings through a projection layer.
LLaVA training happens in two stages, as shown in Figure 1 that follows. The first stage is pre-training, which uses image-text pairs to align the visual features with the language model’s embeddings. In this stage, the visual encoder and language model weights are kept frozen, and only the projection matrix is trained. The second stage is fine-tuning the whole model end-to-end. Here, the visual encoder’s weights are frozen, while the projection layer and language model are updated.

Figure 1: LLaVA architecture
Prepare data
When it comes to fine-tuning the LLaVA model for specific tasks or domains, data preparation is of paramount importance because having high-quality, comprehensive annotations enables the model to learn rich representations and achieve human-level performance on complex visual reasoning challenges. In this post, we focus on preparing an instruction dataset.
Data annotation
The dataset should contain image-text pairs that involve reasoning to answer questions about images. To help the model gain comprehensive understanding during the training process, text data should be enriched with contextual nuances. For example, instead of simply asking the model to describe the image, ask specific questions about the image that relate to its content.
To demonstrate LLaVA’s capabilities, we created a small synthetic dataset focused on understanding and interpreting infographics and charts. We used Amazon Bedrock and Python for this task. Specifically, we employed the Amazon Bedrock LLaMA2-70B model to generate text descriptions and question-answer pairs based on those descriptions. Subsequently, we used Python to generate different types of visual presentation such as pie charts and funnel charts based on the text descriptions. If you already have an existing dataset, this method can be used as a data augmentation technique to expand your dataset and potentially enhance the fine-tuning outcome. By creating synthetic examples of text descriptions, question-answer pairs, and corresponding charts, you can augment your dataset with multimodal examples tailored to your specific use case.
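As an illustration of the chart-generation step, the following sketch renders a pie chart from a generated description using matplotlib; the labels and values are hypothetical stand-ins for what the LLM-produced description would contain.

import matplotlib.pyplot as plt

# Hypothetical categories and percentages parsed from an LLM-generated description.
labels = ["< 2 hours", "2-4 hours", "4-6 hours", "> 6 hours"]
values = [15, 35, 30, 20]

fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(values, labels=labels, autopct="%1.0f%%", startangle=90)
ax.set_title("Distribution of daily screen time")
fig.savefig("screen_time.png", bbox_inches="tight")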
The dataset we created consists of image-text pairs, with each image being an infographic, chart, or other data visualization. The corresponding text is a series of questions about the infographic along with ground truth answers, formatted in a question-answer style intended to resemble how a human might ask the model about the information contained in the image. Some examples of generated questions for images as shown in Figure 2 include:

What is the percentage of people who spend less than 2 hours a day on screen time?
What proportion of people do not exercise at all weekly?
How many people are teachers?

Figure 2: Example charts in the training dataset (left is a pie chart of distribution of daily screen time, right is a funnel chart of occupation)
Data structure
These image-text pairs must be formatted in JSON lines (.jsonl) format, where each line is a training sample. An example training sample follows. Specifically, the id field is the unique identifier of a training sample, the image field specifies the name of the image, and the conversations field provides a question-and-answer pair.

{
    "id": "1",
    "image": "screen_time.png",
    "conversations": [
        {
            "from": "human",
            "value": "What is the percentage of people who spend less than 2 hours a day on screen time?"
        },
        {
            "from": "gpt",
            "value": "15%"
        }
    ]
}

By training the model to answer in-depth and analytical questions about infographics it hasn’t seen before, we aim to strengthen the model’s ability to generalize its understanding of data visualizations and draw accurate insights.
Fine tune the model
After the data is prepared, we upload it to Amazon Simple Storage Service (Amazon S3) as the SageMaker training input. In configuring the SageMaker training job, we use the TrainingInput object to specify the input data location in Amazon S3 and define how SageMaker should handle it during training. In this case, input_mode="FastFile" indicates the use of S3 fast file mode, which is ideal for scenarios where the dataset is stored as individual files in S3. S3 fast file mode is also advantageous when working with large datasets or when fast access to data is critical for training performance.

from sagemaker.inputs import TrainingInput

training_input = TrainingInput(
    s3_data_type="S3Prefix",  # Available options: S3Prefix | ManifestFile | AugmentedManifestFile
    s3_data=s3uri,
    distribution="FullyReplicated",  # Available options: FullyReplicated | ShardedByS3Key
    input_mode="FastFile",
)

We will reuse the training script from LLaVA, which uses DeepSpeed for training efficiency. DeepSpeed is a library that helps train very large deep learning models faster and more efficiently. ZeRO, short for Zero Redundancy Optimizer, is a memory optimization technique in DeepSpeed that reduces the required memory footprint for data parallelism by partitioning optimizer states and gradients across data-parallel processes, enabling larger model sizes and batch sizes within limited GPU memory. This allows you to train much larger models on the same hardware. ZeRO Stage 2 reduces memory usage by splitting the optimizer states and gradients across multiple processes, so each process only stores a part of them. If you run into CUDA memory errors with this configuration, try the Stage 3 configuration instead. Stage 3 additionally partitions the model parameters themselves and can offload states to the CPU, which slows training but might solve the memory issue. The training command follows. See LLaVA: Large Language and Vision Assistant on GitHub for more details about the training parameters.

#!/bin/bash
# Set the prompt and model versions directly in the command
deepspeed /root/LLaVA/llava/train/train_mem.py \
    --deepspeed /root/LLaVA/scripts/zero2.json \
    --lora_enable True \
    --lora_r 128 \
    --lora_alpha 256 \
    --mm_projector_lr 2e-5 \
    --bits 4 \
    --model_name_or_path /root/LLaVA/llava/llava-v1.5-7b \
    --version llava_llama_2 \
    --data_path /root/dataset/train/dataset.json \
    --validation_data_path /root/dataset/validation/dataset.json \
    --image_folder /root/dataset/images/ \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir /root/LLaVA/llava/checkpoints/llama-2-7b-chat-task-qlora \
    --num_train_epochs 500 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "epoch" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

LLaVA allows you to fine-tune all parameters of the base model or use LoRA to tune a smaller number of parameters. LoRA’s strategy keeps the original pre-trained model backbone unchanged and adds new, easier-to-train layers. This allows quick adaptation to new tasks without retraining the whole network. You can use the lora_enable parameter to specify the fine-tuning method. For full parameter fine-tuning, ml.p4d.24xlarge is recommended, while ml.g5.12xlarge is sufficient for LoRA fine-tuning if the LLaMA-13B language model is used.
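To make the LoRA idea concrete, here is a minimal sketch of a LoRA-augmented linear layer, simplified from what libraries such as PEFT implement (the r and alpha defaults mirror the lora_r and lora_alpha flags in the training command above): the pre-trained weight is frozen and only the two low-rank factors are trained.

import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 128, alpha: int = 256):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # freeze the pre-trained projection
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # start as a zero update
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen base projection plus a small trainable low-rank update.
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))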
The following code initializes a SageMaker Estimator using the HuggingFace SDK. It sets up a SageMaker training job to run the custom training script from LLaVA. This allows the script to be run within the SageMaker managed environment, benefiting from its scalability. Then we bring our own Docker container to run the SageMaker training job. You can download the Docker image from this code repo, where the dependencies of the training LLaVA model are installed. To learn more about how to adapt your own Docker container to work with SageMaker, see adapting your own training container.

huggingface_estimator = HuggingFace(
    entry_point="finetune-lora-piechart-QA.sh",
    source_dir="./LLaVA",
    instance_type=instance_type,
    instance_count=instance_count,
    py_version=PYTHON_VERSION,
    image_uri=CONTAINER_URI,
    role=ROLE,
    metric_definitions=metric_definitions,
    environment=environment,
    use_spot_instances=use_spot_instances,
    max_run=max_run,
    max_wait=max_wait,
    output_path=output_uri,
    checkpoint_s3_uri=checkpoint_uri,
)

For logging purposes, you can use metric definitions to extract key metrics from the training script’s printed logs and send them to Amazon CloudWatch. The following is an example metric definition that logs training loss at each epoch, the model’s learning rate, and training throughput.

metric_definitions = [
    {"Name": "loss", "Regex": "'loss': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "learning_rate", "Regex": "'learning_rate': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "epoch", "Regex": "'epoch': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "train_runtime", "Regex": "'train_runtime': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "train_samples_per_second", "Regex": "'train_samples_per_second': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "train_steps_per_second", "Regex": "'train_steps_per_second': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "train_loss", "Regex": "'train_loss': ([0-9]+(.|e-)[0-9]+),?"},
]

Deploy and test
After the training job finishes, the fine-tuned model is uploaded to Amazon S3. You can then use the following code to deploy the model on SageMaker.

HF_TASK = "question-answering"
config = dict(HF_TASK=HF_TASK)

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=s3_model_path,
    role=get_execution_role(),
    transformers_version=TRANSFORMERS_VERSION,
    pytorch_version=PYTORCH_VERSION,
    py_version=PYTHON_VERSION,
    model_server_workers=1,
    env=config,
)

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=instance_count, instance_type=instance_type
)

For testing, provide an image and question pair and make an inference call against the SageMaker endpoint as follows:

prompt = "what is this chart about?"
data = {
    "image": http_img_path,
    "question": prompt,
    "temperature": 0.1,
}
output = predictor.predict(data)

Conclusion
Our exploration into fine-tuning the LLaVA visual language model on SageMaker for a custom visual question answering task has shed light on the advancements made in bridging the gap between textual and visual comprehension. LLaVA represents a significant step forward in multimodal AI, demonstrating the ability to jointly understand and reason about textual and visual information in a unified model. By using large-scale pretraining on image-text pairs, LLaVA has acquired robust visiolinguistic representations that can be effectively adapted to downstream tasks through fine-tuning. This enables LLaVA to excel at tasks that require deep comprehension of both modalities, such as visual question answering, image captioning, and multimodal information retrieval. However, the fine-tuning mechanism has limitations. In particular, the adjustment of the projection layer and language model themselves while freezing the vision model presents a set of challenges, such as the requirement for a massive amount of data and the lack of capability in handling challenging vision tasks. Confronting these challenges directly allows us to unlock the full potential of multimodal models, paving the way for more sophisticated applications.
Acknowledgement
The authors extend their gratitude to Manoj Ravi, Jenny Vega, and Santhosh Kuriakose for their insightful feedback and review of the post.
Reference

LLaVA Visual Instruction Tuning (pdf)

About the Authors
Dr. Changsha Ma is an AI/ML Specialist at AWS. She is a technologist with a PhD in Computer Science, a master’s degree in Education Psychology, and years of experience in data science and independent consulting in AI/ML. She is passionate about researching methodological approaches for machine and human intelligence. Outside of work, she loves hiking, cooking, hunting food, and spending time with friends and families.
Jun Shi is a Senior Solutions Architect at Amazon Web Services (AWS). His current areas of focus are AI/ML infrastructure and applications. He has over a decade experience in the FinTech industry as software engineer.
Alfred Shen is a Senior AI/ML Specialist at AWS. He has been working in Silicon Valley, holding technical and managerial positions in diverse sectors including healthcare, finance, and high-tech. He is a dedicated applied AI/ML researcher, concentrating on CV, NLP, and multimodality. His work has been showcased in publications such as EMNLP, ICLR, and Public Health.

Inductive Biases in Deep Learning: Understanding Feature Representatio …

Machine learning research aims to learn representations that enable effective downstream task performance. A growing subfield seeks to interpret these representations’ roles in model behaviors or modify them to enhance alignment, interpretability, or generalization. Similarly, neuroscience examines neural representations and their behavioral correlations. Both fields focus on understanding or improving system computations, abstract behavior patterns on tasks, and their implementations. However, the relationship between representation and computation is complex and far from straightforward.

Highly over-parameterized deep networks often generalize well despite their capacity for memorization, suggesting an implicit inductive bias towards simplicity in their architectures and gradient-based learning dynamics. Networks biased towards simpler functions facilitate easier learning of simpler features, which can impact internal representations even for complex features. Representational biases favor simple, common features influenced by factors such as feature prevalence and output position in transformers. Shortcut learning and disentangled representation research highlight how these biases affect network behavior and generalization.

In this work, DeepMind researchers investigate dissociations between representation and computation by creating datasets that match the computational roles of features while manipulating their properties. Various deep learning architectures are trained to compute multiple abstract features from inputs. Results show systematic biases in feature representation based on properties like feature complexity, learning order, and feature distribution. Simpler or earlier-learned features are more strongly represented than complex or later-learned ones. These biases are influenced by architectures, optimizers, and training regimes, such as transformers favoring features decoded earlier in the output sequence.

Their approach involves training networks to classify multiple features either through separate output units (e.g., MLP) or as a sequence (e.g., Transformer). The datasets are constructed to ensure statistical independence among features, with models achieving high accuracy (>95%) on held-out test sets, confirming the correct computation of features. The study investigates how properties such as feature complexity, prevalence, and position in the output sequence affect feature representation. Families of training datasets are created to systematically manipulate these properties, with corresponding validation and test datasets ensuring expected generalization.
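A minimal sketch of this kind of setup (not the authors’ code) is shown below: inputs encode several statistically independent binary features, and an MLP predicts all of them through separate output units, after which representation analyses can ask how strongly each feature is encoded.

import torch
import torch.nn as nn

n_features, dim, n = 4, 64, 10_000
feats = torch.randint(0, 2, (n, n_features)).float()   # independent binary features
directions = torch.randn(n_features, dim)
x = feats @ directions + 0.1 * torch.randn(n, dim)      # embed the features into the input

model = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, n_features))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(model(x), feats)   # one output unit per abstract feature
    loss.backward()
    opt.step()

# Feature complexity, prevalence, or output position would then be manipulated across
# dataset variants, and each feature decoded from hidden activations to measure how
# strongly it is represented.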

Training various deep learning architectures to compute multiple abstract features reveals systematic biases in feature representation. These biases depend on extraneous properties like feature complexity, learning order, and feature distribution. Simpler or earlier-learned features are represented more strongly than complex or later-learned ones, even if all are learned equally well. The choice of architecture, optimizer, and training regime also influences these biases, with transformers, for example, favoring features decoded earlier in the output sequence. These findings characterize the inductive biases of gradient-based representation learning and highlight challenges in disentangling extraneous biases from computationally important aspects for interpretability and comparison with brain representations.

In this work, researchers trained deep learning models to compute multiple input features, revealing substantial biases in their representations. These biases depend on feature properties like complexity, learning order, dataset prevalence, and output sequence position. Representational biases may relate to implicit inductive biases in deep learning. Practically, these biases pose challenges for interpreting learned representations and comparing them across different systems in machine learning, cognitive science, and neuroscience.


The Rise of Agentic Retrieval-Augmented Generation (RAG) in Artificial …

In the rapidly developing fields of data science and Artificial Intelligence (AI), the search for increasingly effective systems continues to intensify. The development of Agentic Retrieval-Augmented Generation (RAG) is among the most revolutionary developments of recent times. This strategy is set to transform the way information is used and managed, offering a substantial improvement over current RAG systems.

Retrieval-Augmented Generation (RAG)

Retrieval-augmented generation (RAG) is an architectural strategy that enhances the effectiveness of Large Language Model (LLM) applications by utilizing custom data. Conventional RAG consults external authoritative knowledge bases before response generation to improve the output of LLMs. This methodology tackles a number of significant constraints inherent to LLMs, including the presentation of inaccurate or out-of-date information as a result of static training data.

Principal Benefits of RAG 

Cost-effective: RAG is a cost-effective solution for many applications because it permits the use of current LLMs without requiring significant retraining. 

Current Information: RAG makes sure that the information is current by establishing connections with live streams and regularly updated sources. 

Enhanced Trust: Users’ confidence and trust in AI-generated content are increased when accurate information and source attributions are provided. 

Better Control: By having more control over the information sources, developers provide more intelligent and pertinent answers.

Agentic RAG

By adding autonomous agents that contribute a new degree of intelligence and decision-making, agentic RAG expands on the capabilities of traditional RAG. Through this transition, a static RAG system becomes a dynamic, context-aware AI that can answer complicated questions with amazing coherence and precision.

Characteristics of Agentic RAG 

Context Awareness: Agentic RAG agents are made to be aware of the broader context of conversations, in contrast to traditional RAG, which could have trouble doing so. They are able to understand the subtleties of a conversation and modify their actions accordingly, producing more thoughtful and pertinent answers. 

Intelligent Retrieval Techniques: Traditional RAG systems frequently use static rules to facilitate retrieval. On the other hand, agentic RAG agents use intelligent techniques that dynamically evaluate the user’s query and contextual clues to decide on the best retrieval action. 

Multi-Agent Orchestration: This technique manages complex searches that traverse several documents or data sources. Experts in their respective fields and specialized agents work together to combine knowledge and deliver thorough answers.

Agentic Reasoning: These agents do more than retrieve data; they also assess, correct, and verify the quality of the result, guaranteeing its accuracy and dependability.

Post-Generation Verification: To ensure high-quality outputs, agentic RAG agents can choose the best outcome from several generations and even confirm the accuracy of generated content. 

Adaptability and Learning: With each encounter, these agents learn from their experiences and adjust accordingly, growing in intelligence and productivity over time.

Agentic RAG Architecture

The Agentic RAG Agent, an intelligent orchestrator that interprets user queries and chooses the best course of action, is at the heart of the Agentic RAG architecture. This agent manages a group of specialized tools that are all connected to different data sources, such as financial statements or consumer information. Within their area, document agents are committed to organizing certain documents or data sources, analyzing data, and producing pertinent outputs. 

The interactions between various document agents are managed by a top-level Meta-Agent, which guarantees smooth integration and a cohesive response. In order to handle complicated queries spanning various domains and produce accurate and contextually relevant information synthesis, this dynamic, multi-agent system makes use of intelligent reasoning, context awareness, and post-generation verification.
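A highly simplified sketch of this orchestration pattern is shown below in plain Python; the retrieval and generation calls are stubs standing in for real vector-store and LLM APIs, and the keyword-based routing is a placeholder for an LLM-driven router.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class DocumentAgent:
    name: str
    retrieve: Callable[[str], List[str]]      # stub: fetch passages from one data source
    answer: Callable[[str, List[str]], str]   # stub: LLM call over the retrieved context

class MetaAgent:
    def __init__(self, agents: Dict[str, DocumentAgent]):
        self.agents = agents

    def route(self, query: str) -> List[str]:
        # Placeholder routing by keyword; a production system would use an LLM or classifier.
        matches = [name for name in self.agents if name in query.lower()]
        return matches or list(self.agents)

    def run(self, query: str) -> str:
        partial_answers = []
        for name in self.route(query):
            agent = self.agents[name]
            context = agent.retrieve(query)
            partial_answers.append(agent.answer(query, context))
        # A real meta-agent would add a synthesis and post-generation verification pass here.
        return "\n".join(partial_answers)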

Applications of Agentic RAG

Customer service and support: Improving communications with customers by comprehending their intricate needs and offering precise, tailored answers gleaned from several information bases. 

Conversational AI and intelligent assistants: Enhancing user experiences by enabling virtual assistants to have natural, contextually appropriate dialogues.

Content Creation and Creative Writing: Producing excellent, contextually appropriate content to support writers and content developers. 

Education and e-learning: Creating customized explanations and obtaining pertinent educational resources to personalize learning experiences. 

Healthcare and Medical Informatics: Enabling medical practitioners to make informed decisions by combining medical knowledge from many sources. 

Legal and Regulatory Compliance: Gathering and evaluating pertinent legal data to support legal research and compliance oversight.

Challenges

Data curation and quality: Producing trustworthy results requires guaranteeing the correctness, relevance, and completeness of the underlying data sources. 

Scalability and Efficiency: As a system grows, performance must be maintained through resource management and retrieval process optimization. 

Interpretability and Explainability: Building methods and models that shed light on the agent’s motivations and sources promotes responsibility and confidence. 

Security and privacy: Securing sensitive data and preserving user privacy need the implementation of strong data protection mechanisms. 

Ethical Considerations: Using rigorous testing and ethical norms to address potential misuse, bias, and fairness.

Conclusion

Combining the inventive powers of autonomous agents with the advantages of classical RAG, agentic RAG is a major breakthrough in AI technology. Its capacity to respond intelligently and contextually to sophisticated queries makes it an essential tool for the future. As development and research proceed, Agentic RAG will open up new avenues for business, spurring creativity and transforming the way humans use and interact with information. 



Deep Learning in Healthcare: Challenges, Applications, and Future Dire …

Biomedical data is increasingly complex, high-dimensional, and heterogeneous, encompassing sources such as electronic health records (EHRs), imaging, -omics data, sensors, and text. Traditional data mining and statistical methods struggle to keep up with this complexity, often requiring extensive feature engineering and domain expertise to extract meaningful insights. Recent advancements in deep learning offer a transformative approach by enabling end-to-end learning models that can directly process raw biomedical data. These models, known for their success in fields like computer vision and natural language processing, can revolutionize healthcare by facilitating the translation of vast biomedical data into actionable health outcomes. However, challenges remain, including the need for models that are interpretable by healthcare professionals and adaptable to the unique characteristics of medical data, such as its sparsity, heterogeneity, and temporal dependencies.

Despite the promise of deep learning in healthcare, its adoption has been limited due to several challenges. These include the high-dimensional nature of biomedical data, inconsistencies across different medical ontologies, and the need for comprehensive integration into clinical workflows. Nevertheless, ongoing efforts and planned applications, such as those by Google DeepMind and Enlitic, indicate a growing interest in leveraging deep learning for tasks like disease detection and predictive analysis. The future of healthcare lies in developing deep learning models that perform robustly and offer interpretability and ease of use for medical practitioners, thereby advancing precision medicine and improving patient outcomes.

Deep Learning in Medical Imaging:

Deep learning, particularly through CNNs, has significantly advanced computer vision in medical imaging. CNNs excel in tasks like object classification, detection, and segmentation, achieving human-level accuracy in diagnosing conditions from radiographs, dermatology images, retinal scans, and more. These models, often trained on large datasets and fine-tuned for specific medical tasks, assist physicians by flagging potential issues in images and providing second opinions. Despite their success, challenges remain, such as the need for large labeled datasets and incorporating clinical context for more accurate diagnostics.
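As an illustration of the fine-tuning pattern described above, the following sketch adapts an ImageNet-pretrained backbone to a hypothetical two-class radiograph task with torchvision (a recent torchvision version is assumed for the weights API):

import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                # optionally freeze the pretrained backbone
model.fc = nn.Linear(model.fc.in_features, 2)  # new head trained on the medical dataset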

Advancements in Natural Language Processing for Healthcare:

NLP leverages deep learning to analyze and understand text and speech, significantly impacting fields such as machine translation, text generation, and image captioning. RNNs are pivotal in this domain because they can process sequential data effectively. In healthcare, NLP is instrumental in managing EHRs, which compile extensive medical data across patient histories. Deep learning models can use this data to answer complex medical questions, enhance diagnostic accuracy, and predict patient outcomes. Techniques like supervised and unsupervised learning and auto-encoders help extract meaningful insights from the vast amounts of structured and unstructured data in EHRs.

Future developments in NLP for healthcare include creating clinical voice assistants that transcribe patient visits accurately, reducing physician burnout by minimizing time spent on documentation. These voice assistants could use RNN-based language translation to convert conversations directly into EHR entries. Another focus area is combining structured and unstructured data using large-scale RNNs to make comprehensive predictions about patient health, such as mortality risk and length of hospital stay. As these technologies evolve, they promise to revolutionize medical practice by providing timely, data-driven insights and enhancing the overall quality of care.


Deep Learning Applications in Healthcare Domains:

Deep learning has revolutionized healthcare across various domains, notably clinical imaging, EHRs, genomics, and mobile health monitoring. In clinical imaging, CNNs analyze MRI scans for Alzheimer’s disease prediction and segment knee cartilage for osteoarthritis risk assessment. In EHR analysis, RNNs predict diseases from patient records, while deep patient representations aid in risk prediction. Genomic studies leverage CNNs for DNA sequence analysis. In mobile health, CNNs and RNNs detect gait freezing in Parkinson’s patients and predict energy expenditure from wearable sensor data. These applications demonstrate deep learning’s potential in advancing healthcare diagnostics and monitoring.

Challenges and Opportunities in Applying Deep Learning to Healthcare:

Despite the successes in applying deep learning to healthcare, several challenges still need to be addressed, including data volume, quality, temporality, domain complexity, and interpretability. These challenges present opportunities for future research, such as enriching features, federated inference, ensuring model privacy, incorporating expert knowledge, temporal modeling, and making models interpretable. Deep learning offers powerful methods for analyzing healthcare data and can pave the way for predictive healthcare systems that integrate diverse data sources, support clinicians, and advance medical research. Deep learning could revolutionize healthcare by scaling to large datasets and providing comprehensive patient representations.

