FlashSigmoid: A Hardware-Aware and Memory-Efficient Implementation of Sigmoid Attention Yielding a 17% Inference Kernel Speed-Up over FlashAttention-2 on H100 GPUs

Large Language Models (LLMs) have gained significant prominence in modern machine learning, largely due to the attention mechanism. This mechanism employs a sequence-to-sequence mapping to construct context-aware token representations. Traditionally, attention relies on the softmax function (SoftmaxAttn) to generate token representations as data-dependent convex combinations of values. However, despite its widespread adoption and effectiveness, SoftmaxAttn faces several challenges. One key issue is the tendency of the softmax function to concentrate attention on a limited number of features, potentially overlooking other informative aspects of the input data. Also, the application of SoftmaxAttn necessitates a row-wise reduction along the input sequence length, which can significantly slow down computations, particularly when using efficient attention kernels.

Recent research in machine learning has explored alternatives to the traditional softmax function in various domains. In supervised image classification and self-supervised learning, there’s a trend towards using richer pointwise Bernoulli conditionals parameterized by sigmoid functions, moving away from the output conditional categorical distributions typically parameterized by softmax. Some studies have investigated replacing softmax with ReLU activation in both practical and theoretical contexts. Other explorations include the use of squared ReLU (ReLU²) activation, purely linear attention, and cosine-similarity-based attention mechanisms. A notable approach scaled various activation functions by n^(-α), where n is the sequence length and α is a hyperparameter, to replace softmax. However, this method faced performance issues without proper initialization and the use of LayerScale. These diverse approaches aim to address the limitations of softmax-based attention, seeking more efficient and effective alternatives for context-aware token representation.

Apple researchers introduce a robust approach to attention mechanisms by replacing the row-wise softmax operation with an element-wise sigmoid nonlinearity. The researchers identify that the main challenge with naive sigmoid attention (SigmoidAttn) lies in its large initial attention norms. To address this, they propose several solutions and make significant contributions to the field. First, they demonstrate that SigmoidAttn is a universal function approximator for sequence-to-sequence tasks. Second, they provide an analysis of SigmoidAttn’s regularity and establish its worst-case Jacobian bound. Third, they enhance the FlashAttention-2 algorithm with a sigmoid kernel, resulting in substantial reductions in kernel inference wall-clock time and real-world inference time. Lastly, they show that SigmoidAttn performs comparably to SoftmaxAttn across various tasks and domains, highlighting its potential as a viable alternative in attention mechanisms.
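To make the change concrete, the following is a minimal PyTorch sketch of sigmoid attention next to standard softmax attention. It is illustrative only: the bias value b ≈ -log(seq_len) is an assumption based on the paper's goal of keeping initial attention norms small, and the real FlashSigmoid kernel fuses this computation on the GPU rather than materializing the full score matrix as done here.

import math
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return F.softmax(scores, dim=-1) @ v          # row-wise reduction over the sequence length

def sigmoid_attention(q, k, v, b=None):
    # An element-wise sigmoid replaces the row-wise softmax; b is a scalar bias
    # (assumed here to be roughly -log(seq_len)) that keeps initial attention norms small.
    n = q.shape[-2]
    if b is None:
        b = -math.log(n)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.sigmoid(scores + b) @ v          # no normalization across the row

q = k = v = torch.randn(1, 4, 128, 64)
out_soft, out_sig = softmax_attention(q, k, v), sigmoid_attention(q, k, v)
print(out_soft.shape, out_sig.shape)              # both (1, 4, 128, 64)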

SigmoidAttn, the proposed alternative to traditional softmax attention, is analyzed from two crucial perspectives. First, the researchers demonstrate that transformers using SigmoidAttn retain the Universal Approximation Property (UAP), ensuring their ability to approximate continuous sequence-to-sequence functions with arbitrary precision. This property is vital for maintaining the architecture’s generalizability and representation capability. The proof adapts the framework used for classical transformers, with key modifications to accommodate the sigmoid function. Notably, SigmoidAttn requires at least four attention heads and shifts in both query and key definitions to approximate the necessary selective shift operation, compared to softmax attention’s requirement of two heads and shifts only in the query definition.

Second, the study examines the regularity of SigmoidAttn by computing its Lipschitz constant. The analysis reveals that SigmoidAttn’s local Lipschitz constant is significantly lower than the worst-case scenario for softmax attention. This implies that SigmoidAttn exhibits better regularity, potentially leading to improved robustness and optimization ease in neural networks. The bound for SigmoidAttn depends on the average squared-norm of the input sequence rather than the largest value, allowing for application to unbounded distributions with bounded second moments.

The researchers conducted comprehensive evaluations of SigmoidAttn across various domains to validate its effectiveness. These evaluations encompassed supervised image classification using vision transformers, self-supervised image representation learning with methods like SimCLR, BYOL, and MAE, as well as automatic speech recognition (ASR) and auto-regressive language modeling (LM). Also, they tested sequence length generalization on TED-LIUM v3 for ASR and in small-scale synthetic experiments.

Results demonstrate that SigmoidAttn consistently matches the performance of SoftmaxAttn across all tested domains and algorithms. This performance parity is achieved while offering training and inference speed improvements, as detailed in earlier sections. Key observations from the empirical studies include:

1. For vision tasks, SigmoidAttn proves effective without requiring a bias term, except in the case of MAE. However, it relies on LayerScale to match SoftmaxAttn’s performance in a hyperparameter-free manner.

2. In language modeling and ASR tasks, performance is sensitive to the initial norm of the attention output. To address this, modulation is necessary through either relative positional embeddings like ALiBi, which shifts logit mass to the zero regime under SigmoidAttn, or appropriate initialization of the b parameter to achieve a similar effect.
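For intuition, here is a minimal sketch of how an ALiBi-style relative positional bias pushes most logits toward large negative values, so that the element-wise sigmoid starts out near zero and the initial attention norm stays small. The slope formula follows the original ALiBi recipe and is illustrative rather than the paper's exact setup.

import torch

def alibi_bias(seq_len, num_heads):
    # Geometric slopes as in the ALiBi paper: head h gets slope 2^(-8h/num_heads).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).float()   # rel[i, j] = j - i, negative for past tokens
    return slopes[:, None, None] * rel            # (num_heads, seq_len, seq_len), more negative with distance

bias = alibi_bias(seq_len=8, num_heads=4)
# scores = q @ k.transpose(-2, -1) / sqrt(d) + bias; under SigmoidAttn, sigmoid(scores)
# is close to zero for distant tokens (future positions are masked in causal attention).
print(bias[0])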

These findings suggest that SigmoidAttn is a viable alternative to SoftmaxAttn, offering comparable performance across a wide range of applications while potentially providing computational advantages.

This study presents a comprehensive analysis of sigmoid attention as a potential replacement for softmax attention in transformer architectures. The researchers provide both theoretical foundations and empirical evidence to support the viability of this alternative approach. They demonstrate that transformers using sigmoid attention retain the crucial property of being universal function approximators while also exhibiting improved regularity compared to their softmax counterparts. The study identifies two key factors for the successful implementation of sigmoid attention: the use of LayerScale and the prevention of large initial attention norms. These insights contribute to establishing best practices for applying sigmoid attention in transformer models. The researchers also introduce FlashSigmoid, a memory-efficient variant of sigmoid attention that achieves a significant 17% speed-up in inference kernel performance. Extensive experiments conducted across various domains, including language processing, computer vision, and speech recognition, show that properly normalized sigmoid attention consistently matches the performance of softmax attention across diverse tasks and scales.


LLM-CI: A New Machine Learning Framework to Assess Privacy Norms Encoded in LLMs

Large language models (LLMs) are widely implemented in sociotechnical systems like healthcare and education. However, these models often encode societal norms from the data used during training, raising concerns about how well they align with expectations of privacy and ethical behavior. The central challenge is ensuring that these models adhere to societal norms across varying contexts, model architectures, and datasets. Additionally, prompt sensitivity—where small changes in input prompts lead to different responses—complicates assessing whether LLMs reliably encode these norms. Addressing this challenge is critical to preventing ethical issues such as unintended privacy violations in sensitive domains.

Traditional methods for evaluating LLMs focus on technical capabilities like fluency and accuracy, neglecting the encoding of societal norms. Some approaches attempt to assess privacy norms using specific prompts or datasets, but these often fail to account for prompt sensitivity, leading to unreliable outcomes. Additionally, variations in model hyperparameters and optimization strategies—such as capacity, alignment, and quantization—are seldom considered, which results in incomplete evaluations of LLM behavior. These limitations leave a gap in assessing the ethical alignment of LLMs with societal norms.

A team of researchers from York University and the University of Waterloo introduces LLM-CI, a novel framework grounded in Contextual Integrity (CI) theory, to assess how LLMs encode privacy norms across different contexts. It employs a multi-prompt assessment strategy to mitigate prompt sensitivity, selecting prompts that yield consistent outputs across various variants. This provides a more accurate evaluation of norm adherence across models and datasets. The approach also incorporates real-world vignettes that represent privacy-sensitive situations, ensuring a thorough evaluation of model behavior in diverse scenarios. This method is a significant advancement in evaluating the ethical performance of LLMs, particularly in terms of privacy and societal norms.
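To illustrate the multi-prompt idea, here is a small, hypothetical sketch of a consistency filter: an answer for a vignette counts toward the norm evaluation only if the model gives the same answer across several paraphrased prompt variants. The query_model helper and the agreement threshold are assumptions for illustration, not the authors' implementation.

from collections import Counter

def consistent_answer(vignette, prompt_variants, query_model, min_agreement=0.8):
    # Ask the same question through several paraphrased prompts.
    answers = [query_model(variant.format(vignette=vignette)) for variant in prompt_variants]
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= min_agreement:
        return top_answer      # consistent: use this answer in the norm evaluation
    return None                # inconsistent: exclude to reduce prompt-sensitivity noise

prompt_variants = [
    "Is it acceptable to share {vignette}? Answer yes or no.",
    "Should {vignette} be shared? Reply yes or no.",
    "Answer yes or no: is sharing {vignette} appropriate?",
]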

LLM-CI was tested on datasets such as IoT vignettes and COPPA vignettes, which simulate real-world privacy scenarios. These datasets were used to assess how models handle contextual factors like user roles and information types in various privacy-sensitive contexts. The evaluation also examined the influence of hyperparameters (e.g., model capacity) and optimization techniques (e.g., alignment and quantization) on norm adherence. The multi-prompt methodology ensured that only consistent outputs were considered in the evaluation, minimizing the effect of prompt sensitivity and improving the robustness of the analysis.

The LLM-CI framework demonstrated a marked improvement in evaluating how LLMs encode privacy norms across varying contexts. By applying the multi-prompt assessment strategy, more consistent and reliable results were achieved than with single-prompt methods. Models optimized using alignment techniques showed up to 92% contextual accuracy in adhering to privacy norms. Furthermore, the new assessment approach resulted in a 15% increase in response consistency, confirming that tuning model properties such as capacity and applying alignment strategies significantly improved LLMs’ ability to align with societal expectations. This validated the robustness of LLM-CI in norm adherence evaluations.

LLM-CI offers a comprehensive and robust approach for assessing how LLMs encode privacy norms by leveraging a multi-prompt assessment methodology. It provides a reliable evaluation of model behavior across different datasets and contexts, addressing the challenge of prompt sensitivity. This method significantly advances the understanding of how well LLMs align with societal norms, particularly in sensitive areas such as privacy. By improving the accuracy and consistency of model responses, LLM-CI represents a vital step toward the ethical deployment of LLMs in real-world applications.


Google AI Introduces DataGemma: A Set of Open Models that Utilize Data Commons through Retrieval Interleaved Generation (RIG) and Retrieval Augmented Generation (RAG)

Google has introduced a groundbreaking innovation called DataGemma, designed to tackle one of modern artificial intelligence’s most significant problems: hallucinations in large language models (LLMs). Hallucinations occur when AI confidently generates information that is either incorrect or fabricated. These inaccuracies can undermine AI’s utility, especially for research, policy-making, or other important decision-making processes. In response, Google’s DataGemma aims to ground LLMs in real-world, statistical data by leveraging the extensive resources available through its Data Commons.

They have introduced two specific variants designed to enhance the performance of LLMs further: DataGemma-RAG-27B-IT and DataGemma-RIG-27B-IT. These models represent cutting-edge advancements in both Retrieval-Augmented Generation (RAG) and Retrieval-Interleaved Generation (RIG) methodologies. The RAG-27B-IT variant leverages Google’s extensive Data Commons to incorporate rich, context-driven information into its outputs, making it ideal for tasks that need deep understanding and detailed analysis of complex data. On the other hand, the RIG-27B-IT model focuses on integrating real-time retrieval from trusted sources to fact-check and validate statistical information dynamically, ensuring accuracy in responses. These models are tailored for tasks that demand high precision and reasoning, making them highly suitable for research, policy-making, and business analytics domains. 
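As a rough sketch, the released checkpoints can be loaded with Hugging Face transformers along the following lines; the model ID below is assumed from the release naming, and the 27B checkpoint requires a large GPU or quantization.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/datagemma-rag-27b-it"   # assumed ID; the RIG variant would follow the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)

prompt = "How has renewable energy adoption changed in the last decade?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))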

The Rise of Large Language Models and Hallucination Problems

LLMs, the engines behind generative AI, are becoming increasingly sophisticated. They can process enormous amounts of text, create summaries, suggest creative outputs, and even draft code. However, one of the critical shortcomings of these models is their occasional tendency to present incorrect information as fact. This phenomenon, known as hallucination, has raised concerns about the reliability and trustworthiness of AI-generated content. To address these challenges, Google has made significant research efforts to reduce hallucinations. These advancements culminate in the release of DataGemma, an open model specifically designed to anchor LLMs in the vast reservoir of real-world statistical data available in Google’s Data Commons.

Data Commons: The Bedrock of Factual Data

At the heart of DataGemma’s mission is Data Commons, a comprehensive repository of publicly available, reliable data points. This knowledge graph contains over 240 billion data points across many statistical variables drawn from trusted sources such as the United Nations, the WHO, the Centers for Disease Control and Prevention, and various national census bureaus. By consolidating data from these authoritative organizations into one platform, Google empowers researchers, policymakers, and developers with a powerful tool for deriving accurate insights.

The scale and richness of the Data Commons make it an indispensable asset for any AI model that seeks to improve the accuracy and relevance of its outputs. Data Commons covers various topics, from public health and economics to environmental data and demographic trends. Users can interact with this vast dataset through a natural language interface, asking questions such as how income levels correlate with health outcomes in specific regions or which countries have made the most significant strides in expanding access to renewable energy.


The Dual Approach of DataGemma: RIG and RAG Methodologies

Google’s innovative DataGemma model employs two distinct approaches to enhancing the accuracy and factuality of LLMs: Retrieval-Interleaved Generation (RIG) and Retrieval-Augmented Generation (RAG). Each method has unique strengths.

The RIG methodology builds on existing AI research by integrating proactive querying of trusted data sources within the model’s generation process. Specifically, when DataGemma is tasked with generating a response that involves statistical or factual data, it cross-references the relevant data within the Data Commons repository. This methodology ensures that the model’s outputs are grounded in real-world data and fact-checked against authoritative sources.

For example, in response to a query about the global increase in renewable energy usage, DataGemma’s RIG approach would pull statistical data directly from Data Commons, ensuring that the answer is based on reliable, real-time information.

On the other hand, the RAG methodology expands the scope of what language models can do by incorporating relevant contextual information beyond their training data. DataGemma leverages the capabilities of the Gemini model, particularly its long context window, to retrieve essential data before generating its output. This method ensures that the model’s responses are more comprehensive, informative, and less hallucination-prone. 

When a query is posed, the RAG method first retrieves pertinent statistical data from Data Commons before producing a response, thus ensuring that the answer is accurate and enriched with detailed context. This is particularly useful for complex questions that require more than a straightforward factual answer, such as understanding trends in global environmental policies or analyzing the socioeconomic impacts of a particular event.
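A conceptual sketch of the two methodologies is shown below with stand-in functions; this is not Google's implementation, and the helpers are stubs you would replace with a real LLM call and the Data Commons API.

def llm_generate(prompt):
    # Stub for an LLM call; a real DataGemma RIG model emits statistical placeholders itself.
    return "Global renewable capacity grew by [DC:renewable_growth] over the decade."

def query_data_commons(query):
    # Stub for a Data Commons lookup returning verified statistics.
    return {"renewable_growth": "roughly 10% per year (illustrative placeholder)"}

def answer_with_rig(question):
    # RIG: draft an answer with placeholders, then resolve each one against Data Commons.
    draft = llm_generate(question)
    for key, value in query_data_commons(question).items():
        draft = draft.replace(f"[DC:{key}]", value)
    return draft

def answer_with_rag(question):
    # RAG: retrieve the relevant statistics first, then condition generation on them.
    context = query_data_commons(question)
    return llm_generate(f"Using only this data: {context}\nQuestion: {question}")

print(answer_with_rig("How fast is renewable energy growing?"))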

Initial Results and Promising Future

Although the RIG and RAG methodologies are still in their early stages, preliminary research suggests promising improvements in the accuracy of LLMs when handling numerical facts. By reducing the risk of hallucinations, DataGemma holds significant potential for various applications, from academic research to business decision-making. Google is optimistic that the enhanced factual accuracy achieved through DataGemma will make AI-powered tools more reliable, trustworthy, and indispensable for anyone seeking informed, data-driven decisions.

Google’s research and development team continues to refine RIG and RAG, with plans to scale up these efforts and subject them to more rigorous testing. The ultimate goal is to integrate these improved functionalities into the Gemma and Gemini models through a phased approach. For now, Google has made DataGemma available to researchers and developers, providing access to the models and quick-start notebooks for both the RIG and RAG methodologies.


Broader Implications for AI’s Role in Society

The release of DataGemma marks a significant step forward in the journey to make LLMs more reliable and grounded in factual data. As generative AI becomes increasingly integrated into various sectors, ranging from education and healthcare to governance and environmental policy, addressing hallucinations is crucial to ensuring that AI empowers users with accurate information.

Google’s commitment to making DataGemma an open model reflects its broader vision of fostering collaboration and innovation in the AI community. By making this technology available to developers, researchers, and policymakers, Google aims to drive the adoption of data-grounding techniques that enhance AI’s trustworthiness. This initiative advances the field of AI and underscores the importance of fact-based decision-making in today’s data-driven world.

In conclusion, DataGemma is an innovative leap in addressing AI hallucinations by grounding LLMs in the vast, authoritative datasets of Google’s Data Commons. By combining the RIG and RAG methodologies, Google has created a robust tool that enhances the accuracy and reliability of AI-generated content. This release is a significant step toward ensuring that AI becomes a trusted partner in research, decision-making, and knowledge discovery, all while empowering individuals and organizations to make more informed choices based on real-world data.


Unlock AWS Cost and Usage insights with generative AI powered by Amazon Bedrock

Managing cloud costs and understanding resource usage can be a daunting task, especially for organizations with complex AWS deployments. AWS Cost and Usage Reports (AWS CUR) provides valuable data insights, but interpreting and querying the raw data can be challenging.
In this post, we explore a solution that uses generative artificial intelligence (AI) to generate a SQL query from a user’s question in natural language. This solution can simplify the process of querying CUR data stored in an Amazon Athena database using SQL query generation, running the query on Athena, and representing it on a web portal for ease of understanding.
The solution uses Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
Challenges addressed
The following challenges can hinder organizations from effectively analyzing their CUR data, leading to potential inefficiencies, overspending, and missed opportunities for cost-optimization. We aim to address and simplify them using generative AI with Amazon Bedrock.

Complexity of SQL queries – Writing SQL queries to extract insights from CUR data can be complex, especially for non-technical users or those unfamiliar with the CUR data structure (unless you’re a seasoned database administrator)
Data accessibility – To gain insights from structured data in databases, users need to get access to databases, which can be a potential threat to overall data protection
User-friendliness – Traditional methods of analyzing CUR data often lack a user-friendly interface, making it challenging for non-technical users to take advantage of the valuable insights hidden within the data

Solution overview
The solution that we discuss is a web application (chatbot) that allows you to ask questions related to your AWS costs and usage in natural language. The application generates SQL queries based on the user’s input, runs them against an Athena database containing CUR data, and presents the results in a user-friendly format. The solution combines the power of generative AI, SQL generation, database querying, and an intuitive web interface to provide a seamless experience for analyzing CUR data.
The solution uses the following AWS services:

Amazon Athena
Amazon Bedrock
AWS Billing and Cost Management for cost and usage reports
Amazon Simple Storage Service (Amazon S3)
The compute service of your choice on AWS to call Amazon Bedrock APIs. This could be Amazon Elastic Compute Cloud (Amazon EC2), AWS Lambda, AWS SDK, Amazon SageMaker notebooks, or your workstation if you are doing a quick proof of concept. For the purpose of this post, this code is running on a t3a.micro EC2 instance with Amazon Linux 2023.

 The following diagram illustrates the solution architecture.

Figure 1. Architecture of Solution

The data flow consists of the following steps:

The CUR data is stored in Amazon S3.
Athena is configured to access and query the CUR data stored in Amazon S3.
The user interacts with the Streamlit web application and submits a natural language question related to AWS costs and usage.

Figure 2. Shows the Chatbot Dashboard to ask question

The Streamlit application sends the user’s input to Amazon Bedrock, and the LangChain application facilitates the overall orchestration.
The LangChain code uses the BedrockChat class from LangChain to invoke the FM and interact with Amazon Bedrock to generate a SQL query based on the user’s input.

Figure 3. Shows initialization of SQL chain

The generated SQL query is run against the Athena database, which queries the CUR data stored in Amazon S3.
The query results are returned to the LangChain application.

Figure 4. Shows generated Query in the application output logs

LangChain sends the SQL query and query results back to the Streamlit application.
The Streamlit application displays the SQL query and query results to the user in a formatted and user-friendly manner.

Figure 5. Shows final output presented on the chat bot webapp including SQL Query and the Query results

Prerequisites
To set up this solution, you should have the following prerequisites:

An AWS account with access to AWS Cost Explorer, Athena, and Amazon S3. (This is a proof of concept setup. You should follow the least privilege model when using AWS Identity and Access Management (IAM), refer to the IAM security best practices documentation, and conduct your own due diligence when setting this up.)
CUR data stored in an S3 bucket. For instructions, see Creating Cost and Usage Reports.
Athena set up to analyze the data from your S3 bucket holding CUR data. For instructions, see Querying Cost and Usage Reports using Amazon Athena.
An AWS compute environment created to host the code and call the Amazon Bedrock APIs.
An AWS Identity and Access Management (IAM) role with permission to access Amazon Bedrock and access to FMs in Amazon Bedrock.

Configure the solution
Complete the following steps to set up the solution:

Create an Athena database and table to store your CUR data. Make sure the necessary permissions and configurations are in place for Athena to access the CUR data stored in Amazon S3.
Set up your compute environment to call Amazon Bedrock APIs. Make sure you associate an IAM role with this environment that has IAM policies that grant access to Amazon Bedrock.
When your instance is up and running, install the following libraries that are used for working within the environment:

pip install langchain==0.2.0 langchain-experimental==0.0.59 langchain-community==0.2.0 langchain-aws==0.1.4 pyathena==3.8.2 sqlalchemy==2.0.30 streamlit==1.34.0

Use the following code to establish a connection to the Athena database using the langchain library and the pyathena driver, and to configure the language model on Amazon Bedrock to generate SQL queries based on user input. You can save this file as cur_lib.py.

from langchain_experimental.sql import SQLDatabaseChain
from langchain_community.utilities import SQLDatabase
from sqlalchemy import create_engine, URL
from langchain_aws import ChatBedrock as BedrockChat
from pyathena.sqlalchemy.rest import AthenaRestDialect

class CustomAthenaRestDialect(AthenaRestDialect):
    def import_dbapi(self):
        import pyathena
        return pyathena

# DB Variables
connathena = "athena.us-west-2.amazonaws.com"
portathena = "443"
schemaathena = "mycur"
s3stagingathena = "s3://cur-data-test01/athena-query-result/"
wkgrpathena = "primary"
connection_string = f"awsathena+rest://@{connathena}:{portathena}/{schemaathena}?s3_staging_dir={s3stagingathena}/&work_group={wkgrpathena}"
url = URL.create(
    "awsathena+rest",
    host=connathena,
    port=int(portathena),
    database=schemaathena,
    query={"s3_staging_dir": s3stagingathena, "work_group": wkgrpathena},
)
engine_athena = create_engine(url, dialect=CustomAthenaRestDialect(), echo=False)
db = SQLDatabase(engine_athena)

# Setup LLM
model_kwargs = {"temperature": 0, "top_k": 250, "top_p": 1, "stop_sequences": ["\n\nHuman:"]}
llm = BedrockChat(model_id="anthropic.claude-3-sonnet-20240229-v1:0", model_kwargs=model_kwargs)

# Create the prompt
QUERY = """
Create a syntactically correct athena query for AWS Cost and Usage report to run on the my_c_u_r table in mycur database based on the question, then look at the results of the query and return the answer as SQLResult like a human
{question}
"""
db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)

def get_response(user_input):
    question = QUERY.format(question=user_input)
    result = db_chain.invoke(question)
    query = result["result"].split("SQLQuery:")[1].strip()
    rows = db.run(query)
    return f"SQLQuery: {query}\nSQLResult: {rows}"

Create a Streamlit web application to provide a UI for interacting with the LangChain application. Include the input fields for users to enter their natural language questions and display the generated SQL queries and query results. You can name this file cur_app.py.

import streamlit as st
from cur_lib import get_response
import os

st.set_page_config(
    page_title="AWS Cost and Usage Chatbot",
    page_icon="chart_with_upwards_trend",
    layout="centered",
    initial_sidebar_state="auto",
    menu_items={
        "Get Help": "https://docs.aws.amazon.com/cur/latest/userguide/cur-create.html",
        # "Report a bug": ,
        "About": "# The purpose of this app is to help you get better understanding of your AWS Cost and Usage report!",
    },
)
# HTML title
st.title("_:orange[Simplify] CUR data_ :sunglasses:")

def format_result(result):
    parts = result.split("\nSQLResult: ")
    if len(parts) > 1:
        sql_query = parts[0].replace("SQLQuery: ", "")
        sql_result = parts[1].strip("[]").split("), (")
        formatted_result = []
        for row in sql_result:
            formatted_result.append(tuple(item.strip("(),'") for item in row.split(", ")))
        return sql_query, formatted_result
    else:
        return result, []

def main():
    # Get the current directory
    current_dir = os.path.dirname(os.path.abspath(__file__))
    st.markdown("<div class='main'>", unsafe_allow_html=True)
    st.title("AWS Cost and Usage chatbot")
    st.write("Ask a question about your AWS Cost and Usage Report:")

Connect the LangChain application and Streamlit web application by calling the get_response function, then format and display the SQL query and result in the Streamlit web application. Append the following code to the preceding application code:

    # Create a session state variable to store the chat history
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = []

    user_input = st.text_input("You:", key="user_input")

    if user_input:
        try:
            result = get_response(user_input)
            sql_query, sql_result = format_result(result)
            st.code(sql_query, language="sql")
            if sql_result:
                st.write("SQLResult:")
                st.table(sql_result)
            else:
                st.write(result)
            st.session_state.chat_history.append({"user": user_input, "bot": result})
            st.text_area(
                "Conversation:",
                value="\n".join([f"You: {chat['user']}\nBot: {chat['bot']}" for chat in st.session_state.chat_history]),
                height=300,
            )
        except Exception as e:
            st.error(str(e))

    st.markdown("</div>", unsafe_allow_html=True)

if __name__ == "__main__":
    main()

Deploy the Streamlit application and LangChain application to your hosting environment, such as Amazon EC2 or a Lambda function.

Clean up
You incur charges for Amazon Bedrock with this solution only when you invoke it. To avoid ongoing charges for Amazon S3 storage for saving the CUR reports, you can remove the CUR data and S3 bucket. If you set up the solution using Amazon EC2, make sure you stop or delete the instance when you’re done.
Benefits
This solution offers the following benefits:

Simplified data analysis – You can analyze CUR data using natural language using generative AI, eliminating the need for advanced SQL knowledge
Increased accessibility – The web-based interface makes it efficient for non-technical users to access and gain insights from CUR data without needing credentials for the database
Time-saving – You can quickly get answers to your cost and usage questions without manually writing complex SQL queries
Enhanced visibility – The solution provides visibility into AWS costs and usage, enabling better cost-optimization and resource management decisions

Summary
The AWS CUR chatbot solution combines SQL query generation with Anthropic Claude on Amazon Bedrock, database querying with Amazon Athena, and a user-friendly web interface to simplify the analysis of CUR data. By allowing you to ask natural language questions, the solution removes barriers and empowers both technical and non-technical users to gain valuable insights into AWS costs and resource usage. With this solution, organizations can make more informed decisions, optimize their cloud spending, and improve overall resource utilization. We recommend that you do due diligence while setting this up, especially for production; you can choose other programming languages and frameworks to set it up according to your preference and needs.
Amazon Bedrock enables you to build powerful generative AI applications with ease. Accelerate your journey by following the quick start guide on GitHub and using Amazon Bedrock Knowledge Bases to rapidly develop cutting-edge Retrieval Augmented Generation (RAG) solutions or enable generative AI applications to run multistep tasks across company systems and data sources using Amazon Bedrock Agents.

About the Author
Anutosh is a Solutions Architect at AWS India. He loves to dive deep into his customers’ use cases to help them navigate through their journey on AWS. He enjoys building solutions in the cloud to help customers. He is passionate about migration and modernization, data analytics, resilience, cybersecurity, and machine learning.

Streamline workflow orchestration of a system of enterprise APIs using chaining with Amazon Bedrock Agents

Intricate workflows that require dynamic and complex API orchestration can be difficult to manage. In industries like insurance, where unpredictable scenarios are the norm, traditional automation falls short, leading to inefficiencies and missed opportunities. With the power of intelligent agents, you can simplify these challenges. In this post, we explore how chaining domain-specific agents using Amazon Bedrock Agents can transform a system of complex API interactions into streamlined, adaptive workflows, empowering your business to operate with agility and precision.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
Benefits of chaining Amazon Bedrock Agents
Designing agents is like designing other software components—they tend to work best when they have a focused purpose. When you have focused, single-purpose agents, combining them into chains can allow them to solve significantly complex problems together. Using natural language processing (NLP) and OpenAPI specs, Amazon Bedrock Agents dynamically manages API sequences, minimizing dependency management complexities. Additionally, agents enable conversational context management in real-time scenarios, using session IDs and, if necessary, backend databases like Amazon DynamoDB for extended context storage. By using prompt instructions and API descriptions, agents collect essential information from API schemas to solve specific problems efficiently. This approach not only enhances agility and flexibility, but also demonstrates the value of chaining agents to simplify complex workflows and solve larger problems effectively.
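As a minimal, hedged sketch of how a chained conversation with an agent can be driven programmatically, the following uses the bedrock-agent-runtime invoke_agent API with a reused session ID so follow-up turns share conversational context; the agent ID and alias ID are placeholders for the agents created in your account.

import uuid
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
session_id = str(uuid.uuid4())   # reuse the same ID for every turn in the conversation

def ask_agent(text, agent_id="AGENT_ID", agent_alias_id="AGENT_ALIAS_ID"):
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,
        inputText=text,
    )
    # The completion is returned as an event stream of chunks.
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )

print(ask_agent("I want to file a claim for my auto policy."))
print(ask_agent("Yes, I can upload photos of the damage."))   # same session, context preserved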
In this post, we explore an insurance claims use case, where we demonstrate the concept of chaining with Amazon Bedrock Agents. This involves an orchestrator agent calling and interacting with other agents to collaboratively perform a series of tasks, enabling efficient workflow management.
Solution overview
For our use case, we develop a workflow for an insurance digital assistant focused on streamlining tasks such as filing claims, assessing damages, and handling policy inquiries. The workflow simulates API sequencing dependencies, such as conducting fraud checks during claim creation and analyzing uploaded images for damage assessment if the user provides images. Guided by OpenAPI specifications and natural language prompts, the orchestration dynamically adapts to user scenarios, such as users opting in or out of image uploads for damage assessment, failing fraud checks, or asking a variety of questions related to their insurance policies and coverages. This flexibility is achieved by chaining domain-specific agents: an insurance orchestrator agent, a policy information agent, and a damage analysis notification agent.
Traditionally, insurance processes are rigid, with fixed steps for tasks like fraud detection. However, agent chaining allows for greater flexibility and adaptability, enabling the system to respond to real-time user inputs and variations in scenarios. For instance, instead of strictly adhering to predefined thresholds for fraud checks, the agents can dynamically adjust the workflow based on user interactions and context. Similarly, when users choose to upload images while filing a claim, the workflow can perform real-time damage analysis and immediately send a summary to claims adjusters for further review. This enables a quicker response and more accurate decision-making. This approach not only streamlines the claims process but also allows for a more nuanced and efficient handling of tasks, providing the necessary balance between automation and human intervention. By chaining Amazon Bedrock Agents, we create a system that is adaptable. This system caters to diverse user needs while maintaining the integrity of business processes.
The following diagram illustrates the end-to-end insurance claims workflow using chaining with Amazon Bedrock Agents.

The diagram shows how specialized agents use various tools to streamline the entire claims process—from filing claims and assessing damages to answering customer questions about insurance policies.
Prerequisites
Before proceeding, make sure you have the following resources set up:

An AWS account. If you don’t have an account, you can sign up for one.
Access as an AWS Identity and Access Management (IAM) administrator or an IAM user that has permissions for:

Deploying AWS CloudFormation
Creating and managing Amazon Simple Storage Service (Amazon S3) buckets and uploading objects.
Creating and updating Amazon Simple Queue Service (Amazon SQS) queues, AWS Lambda functions, and Amazon API Gateway.
Creating and managing IAM roles.
Access to Amazon Bedrock, Anthropic’s Claude models on Amazon Bedrock, and the Cohere Embed English embedding model on Amazon Bedrock. You must explicitly enable access to models before they can be used with Amazon Bedrock. For instructions, refer to Model access.

Deploy the solution with AWS CloudFormation
Complete the following steps to set up the solution resources:

Sign in to the AWS Management Console as an IAM administrator or appropriate IAM user.
Choose Launch Stack to deploy the CloudFormation template.
Provide the necessary parameters and create the stack.

For this setup, we use us-east-1 as our AWS Region, the Anthropic Claude 3 Haiku model for orchestrating the flow between the different agents, the Anthropic Claude 3 Sonnet model for damage analysis of the uploaded images, and the Cohere Embed English V3 model as an embedding model to translate text from the insurance policy documents into numerical vectors, which allows for efficient search, comparison, and categorization of the documents.
If you want to choose other models on Amazon Bedrock, you can do so by making appropriate changes in the CloudFormation template. Check for appropriate model support in the Region and the features that are supported by the models.
This will take about 15 minutes to deploy the solution. After the stack is deployed, you can view the various outputs of the CloudFormation stack on the Outputs tab, as shown in the following screenshot.

The following screenshot shows the three Amazon Bedrock agents that were deployed in your account.

Test the claims creation, damage detection, and notification workflows
The first part of the deployed solution mimics filing a new insurance claim, fraud detection, optional damage analysis of uploaded images, and subsequent notification to claims adjusters. This is a small-scale example of task automation for a particular business problem achieved by chaining agents, each performing a set of specific tasks. The agents work in harmony to solve the larger function of insurance claims handling.
Let’s explore the architecture of the claim creation workflow, where the insurance orchestrator agent and the damage analysis notification agent work together to simulate filing new claims, assessing damages, and sending a summary of damages to the claim adjusters for human oversight. The following diagram illustrates this workflow.

In this workflow, the insurance orchestrator agent mimics fraud detection and claims creation as well as orchestrates handing off the responsibility to other task-specific agents. The image damage analysis notification agent is responsible for doing a preliminary analysis of the images uploaded for a damage. This agent invokes a Lambda function that internally calls the Anthropic Claude Sonnet large language model (LLM) on Amazon Bedrock to perform preliminary analysis on the images. The LLM generates a summary of the damage, which is sent to an SQS queue, and is subsequently reviewed by the claim adjusters.
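The following is an illustrative sketch, not the Lambda function from the deployed stack, of the kind of call that performs the preliminary image analysis: it sends a base64-encoded image to Anthropic Claude 3 Sonnet on Amazon Bedrock using the messages format and forwards the generated summary to an SQS queue for the claims adjusters. The queue URL and image format are placeholders.

import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
sqs = boto3.client("sqs", region_name="us-east-1")

def analyze_damage(image_bytes, queue_url="https://sqs.us-east-1.amazonaws.com/123456789012/damage-queue"):
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg",
                                             "data": base64.b64encode(image_bytes).decode()}},
                {"type": "text", "text": "Summarize the visible vehicle damage for a claims adjuster."},
            ],
        }],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    summary = json.loads(response["body"].read())["content"][0]["text"]
    sqs.send_message(QueueUrl=queue_url, MessageBody=summary)   # queued for human review
    return summary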
The NLP instruction prompts combined with the OpenAPI specifications for each action group guide the agents in their decision-making process, determining which action group to invoke, the sequence of invocation, and the required parameters for calling specific APIs.
Use the UI to invoke the claims processing workflow
Complete the following steps to invoke the claims processing workflow:

From the outputs of the CloudFormation stack, choose the URL for HttpApiEndpoint.

You can ask the chatbot sample questions to start exploring the functionality of filing a new claim.

In the following example, we ask to file a new claim and upload images as evidence for the claim.

On the Amazon SQS console, you can view the SQS queue that has been created by the CloudFormation stack and check the message that shows the damage analysis from the image performed by our LLM.

Test the policy information workflow
The following diagram shows the architecture of just the policy information agent. The policy agent accesses the Policy Information API to extract answers to insurance-related questions from unstructured policy documents such as PDF files.

The policy information agent is responsible for doing a lookup against the insurance policy documents stored in the knowledge base. The agent invokes a Lambda function that will internally invoke the knowledge base to find answers to policy-related questions.
Set up the policy documents and metadata in the data source for the knowledge base
We use Amazon Bedrock Knowledge Bases to manage our documents and metadata. As part of deploying the solution, the CloudFormation stack created a knowledge base. Complete the following steps to set up its data source:

On the Amazon Bedrock console, navigate to the deployed knowledge base and navigate to the S3 bucket that is mentioned as its data source.

Upload a few insurance policy documents and metadata documents to the S3 bucket to mimic the naming conventions as shown in the following screenshot.

The naming conventions are <Type of Policy>_PolicyNumber.pdf for the insurance policy PDF documents and <Type of Policy>_PolicyNumber.pdf.metadata.json for the metadata documents.

The following screenshot shows an example of what a sample metadata.json file looks like.
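Because the screenshot is not reproduced here, the following sketch writes an illustrative metadata file and uploads a document pair following the naming convention above; the attribute names and bucket are examples rather than values from the stack, and the metadata is nested under the metadataAttributes key that Amazon Bedrock Knowledge Bases expects.

import json
import boto3

s3 = boto3.client("s3")
bucket = "my-policy-documents-bucket"          # replace with the knowledge base data source bucket

# Example attributes for an auto policy document; adjust keys to your filtering needs.
metadata = {"metadataAttributes": {"policy_type": "Auto", "policy_number": "123456"}}
with open("Auto_123456.pdf.metadata.json", "w") as f:
    json.dump(metadata, f)

s3.upload_file("Auto_123456.pdf", bucket, "Auto_123456.pdf")
s3.upload_file("Auto_123456.pdf.metadata.json", bucket, "Auto_123456.pdf.metadata.json")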

After the documents are uploaded to Amazon S3, navigate to the deployed knowledge base, select the data source, and choose Sync.

To understand more about how metadata support in Knowledge Bases on Amazon Bedrock helps you get accurate results, refer to Amazon Bedrock Knowledge Bases now supports metadata filtering to improve retrieval accuracy.

Now you can go back to the UI and start asking questions related to the policy documents.

The following screenshot shows the set of questions we asked for finding answers related to policy coverage.

Clean up
To avoid unexpected charges, complete the following steps to clean up your resources:

Delete the contents from the S3 buckets corresponding to the ImageBucketName and PolicyDocumentsBucketName keys from the outputs of the CloudFormation stack.
Delete the deployed stack using the AWS CloudFormation console.

Best practices
The following are some additional best practices that you can follow for your agents:

Automated testing – Implement automated tests using tools to regularly test the orchestration workflows. You can use mock APIs to simulate various scenarios and validate the agent’s decision-making process.
Version control – Maintain version control for your agent configurations and prompts in a repository. This provides traceability and quick rollback if needed.
Monitoring and logging – Use Amazon CloudWatch to monitor agent interactions and API calls. Set up alarms for unexpected behaviors or failures.
Continuous integration – Set up a continuous integration and delivery (CI/CD) pipeline that integrates automated testing, prompt validation, and deployment to maintain smooth updates without disrupting ongoing workflows.

Conclusion
In this post, we demonstrated the power of chaining Amazon Bedrock agents, offering a fresh perspective on integrating back-office automation workflows and enterprise APIs. This solution offers several benefits: as new enterprise APIs emerge, dependencies in existing ones can be minimized, reducing coupling. Moreover, Amazon Bedrock Agents can maintain conversational context, enabling follow-up queries to use conversation history. For extended contextual memory, a more persistent backend implementation can be considered.
To learn more, refer to Amazon Bedrock Agents.

About the Author
Piyali Kamra is a seasoned enterprise architect and a hands-on technologist with over two decades of experience building and executing large-scale enterprise IT projects across geographies. She believes that building large-scale enterprise systems is not an exact science but more like an art, where you can’t always choose the best technology that comes to mind; rather, tools and technologies must be carefully selected based on the team’s culture, strengths, weaknesses, and risks, in tandem with a futuristic vision of how you want to shape your product a few years down the road.

Build ultra-low latency multimodal generative AI applications using sticky session routing in Amazon SageMaker

Amazon SageMaker is a fully managed machine learning (ML) service. With SageMaker, data scientists and developers can quickly and confidently build, train, and deploy ML models into a production-ready hosted environment. SageMaker provides a broad selection of ML infrastructure and model deployment options to help meet your ML inference needs. It also helps scale your model deployment, manage models more effectively in production, and reduce operational burden.
Although early large language models (LLMs) were limited to processing text inputs, the rapid evolution of these AI systems has enabled LLMs to expand their capabilities to handle a wide range of media types, including images, video, and audio, ushering in the era of multimodal models. Multimodal deep learning uses multiple modalities of data, such as text, audio, or images. Multimodal inference adds the challenges of large data transfer overhead and slow response times. For instance, in a typical chatbot scenario, users initiate the conversation by providing a multimedia file or a link as input payload, followed by a back-and-forth dialogue, asking questions or seeking information related to the initial input. However, transmitting large multimedia files with every request to a model inference endpoint can significantly impact the response times and latency, leading to an unsatisfactory user experience. For example, sending a 500 MB input file could potentially add 3–5 seconds to the response time, which is unacceptable for a chatbot aiming to deliver a seamless and responsive interaction.
We are announcing the availability of sticky session routing on Amazon SageMaker Inference which helps customers improve the performance and user experience of their generative AI applications by leveraging their previously processed information. Amazon SageMaker makes it easier to deploy ML models including foundation models (FMs) to make inference requests at the best price performance for any use case.
By enabling sticky session routing, all requests from the same session are routed to the same instance, allowing your ML application to reuse previously processed information to reduce latency and improve user experience. This is particularly valuable when you want to use large data payloads or need seamless interactive experiences. By using your previous inference requests, you can now take advantage of this feature to build innovative state-aware AI applications on SageMaker. To do so, you create a session ID with your first request, and then use that session ID to indicate that SageMaker should route all subsequent requests to the same instance. Sessions can also be deleted when done to free up resources for new sessions.
This feature is available in all AWS Regions where SageMaker is available. To learn more about deploying models on SageMaker, see Amazon SageMaker Model Deployment. For more about this feature, refer to Stateful sessions with Amazon SageMaker models.
Solution overview
SageMaker simplifies the deployment of models, enabling chatbots and other applications to use their multimodal capabilities with ease. SageMaker has implemented a robust solution that combines two key strategies: sticky session routing in SageMaker with load balancing, and stateful sessions in TorchServe. Sticky session routing makes sure all requests from a user session are serviced by the same SageMaker server instance. Stateful sessions in TorchServe cache the multimedia data in GPU memory from the session start request and minimize loading and unloading of this data from GPU memory for improved response times.
With this focus on minimizing data transfer overhead and improving response time, our approach makes sure the initial multimedia file is loaded and processed only one time, and subsequent requests within the same session can use the cached data.
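Conceptually, the stateful handler behaves like the simplified sketch below, which is illustrative rather than the handler in the example repository: the image tensor is decoded once per session, kept resident in GPU memory, and reused for every prompt in that session.

import torch

class StatefulLlavaHandler:
    def __init__(self):
        self.session_cache = {}                       # session_id -> image tensor on GPU

    def open_session(self, session_id, image_tensor):
        # Move the decoded image to the GPU once, at session start.
        self.session_cache[session_id] = image_tensor.to("cuda", non_blocking=True)

    def predict(self, session_id, prompt, model):
        image = self.session_cache[session_id]        # already resident in GPU memory
        return model.generate(image, prompt)          # hypothetical model interface

    def close_session(self, session_id):
        self.session_cache.pop(session_id, None)      # free GPU memory for new sessions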
Let’s look at the sequence of events when a client initiates a sticky session on SageMaker:

In the first request, you call the Boto3 SageMaker runtime invoke_endpoint with session-id=NEW_SESSION in the header and a payload indicating an open session type of request. SageMaker then creates a new session and stores the session ID. The router initiates an open session (this API is defined by the client; it could be some other name like start_session) with the model server, in this case TorchServe, and responds back with 200 OK along with the session ID and time to live (TTL), which is sent back to the client.

Whenever you need to use the same session to perform subsequent actions, you pass the session ID as part of the invoke_endpoint call, which allows SageMaker to route all the subsequent requests to the same model server instance.
To close or delete a session, you can use invoke_endpoint with a payload indicating a close session type of request along with the session ID. The SageMaker router first checks if the session exists. If it does, the router initiates a close session call to the model server, which responds back with a successful 200 OK along with the session ID, which is sent back to the client. If the session ID doesn’t exist, the router responds with a 400 response.
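The lifecycle above can be sketched with the SageMaker runtime client as follows. The endpoint name and request payloads are placeholders for this example's custom open_session, prompt, and close_session APIs, and the SessionId request field and NewSessionId response field follow the stateful sessions documentation, so verify them against your boto3 version.

import json
import boto3

smr = boto3.client("sagemaker-runtime", region_name="us-west-2")
endpoint = "llava-stateful-endpoint"   # placeholder endpoint name

# 1. Open a session: pass the special NEW_SESSION value and a payload your handler
#    recognizes as an open-session request (here, a hypothetical image URL payload).
open_resp = smr.invoke_endpoint(
    EndpointName=endpoint,
    SessionId="NEW_SESSION",
    ContentType="application/json",
    Body=json.dumps({"type": "open_session", "image_url": "https://example.com/car.jpg"}),
)
session_id = open_resp["NewSessionId"]   # the response also carries a time to live (TTL)

# 2. Follow-up requests carry the session ID so they land on the same instance.
answer = smr.invoke_endpoint(
    EndpointName=endpoint,
    SessionId=session_id,
    ContentType="application/json",
    Body=json.dumps({"type": "prompt", "text": "What damage do you see in the image?"}),
)
print(answer["Body"].read().decode())

# 3. Close the session to free the cached data on the instance.
smr.invoke_endpoint(
    EndpointName=endpoint,
    SessionId=session_id,
    ContentType="application/json",
    Body=json.dumps({"type": "close_session"}),
)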

In the following sections, we walk through an example of how you can use sticky routing in SageMaker to achieve stateful model inference. For this post, we use the LLaVA: Large Language and Vision Assistant model. LLaVa is a multimodal model that accepts images and text prompts.
We use LLaVa to upload an image and then ask questions about the image without having to resend the image for every request. The image is cached in the GPU memory as opposed to the CPU memory, so we don’t have to incur the latency cost of moving this image from CPU memory to GPU memory on every call.
We use TorchServe as our model server for this example. TorchServe is a performant, flexible, and easy-to-use tool for serving PyTorch models in production. TorchServe supports a wide array of advanced features, including dynamic batching, microbatching, model A/B testing, streaming, Torch XLA, TensorRT, ONNX, and IPEX. Moreover, it seamlessly integrates PyTorch’s large model solution, PiPPy, enabling efficient handling of large models. Additionally, TorchServe extends its support to popular open-source libraries like DeepSpeed, Accelerate, Fast Transformers, and more, expanding its capabilities even further.
The following are the main steps to deploy the LLava model. The section below introduces the steps conceptually, so you’ll have a better grasp of the overall deployment workflow before diving into the practical implementation details in the subsequent section.
Build a TorchServe Docker container and push it to Amazon ECR
The first step is to build a TorchServe Docker container and push it to Amazon Elastic Container Registry (Amazon ECR). Because we’re using a custom model, we use the bring your own container approach. We use one of the AWS provided deep learning containers as our base, namely pytorch-inference:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker.
Build TorchServe model artifacts and upload them to Amazon S3
We use torch-model-archiver to gather all the artifacts, like custom handlers, the LlaVa model code, the data types for request and response, model configuration, prediction API, and other utilities. Then we upload the model artifacts to Amazon Simple Storage Service (Amazon S3).
Create the SageMaker endpoint
To create the SageMaker endpoint, complete the following steps:

To create the model, use the SageMaker Python SDK Model class. As inputs, specify the S3 bucket where you uploaded the TorchServe model artifacts and the image_uri of the Docker container you created.

SageMaker expects the session ID in X-Amzn-SageMaker-Session-Id format; you can specify that in the environment properties to the model.

To deploy the model and create the endpoint, specify the initial instance count to match the load, instance type, and timeouts.
Lastly, create a SageMaker Python SDK Predictor by passing in the endpoint name.
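A condensed sketch of these steps with the SageMaker Python SDK is shown below. The bucket, image URI, instance type, and in particular the environment variable name are placeholders or assumptions; take the exact values from the example notebook.

import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor

role = sagemaker.get_execution_role()
model = Model(
    name="llava-stateful",
    model_data="s3://YOUR_BUCKET/llava/model.tar.gz",                    # TorchServe model artifacts
    image_uri="ACCOUNT.dkr.ecr.us-west-2.amazonaws.com/llava-torchserve:latest",
    role=role,
    env={"SAGEMAKER_SESSION_HEADER": "X-Amzn-SageMaker-Session-Id"},     # assumed key name
)

model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",                                       # example GPU instance
    endpoint_name="llava-stateful-endpoint",
    container_startup_health_check_timeout=600,
)

predictor = Predictor(endpoint_name="llava-stateful-endpoint")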

Run inference
Complete the following steps to run inference:

Use an open session to send a URL to the image you want to ask questions about.

This is a custom API we have defined for our use case (see inference_api.py). You can define the inputs, outputs, and APIs to suit your business use case. For this use case, we use an open session to send a URL to the image we want to ask questions about. For the session ID header value, use the special string NEW_SESSION to indicate this is the start of a session. The custom handler you wrote downloads the image, converts it to a tensor, and caches that in the GPU memory. We do this because we have access to the LLaVa source code; we could also modify the original predict.py file from LLaVa model to accept a tensor instead of a PIL image. By caching the tensor in GPU, we have saved some inference time by not moving the image from CPU memory to GPU memory for every call. If you don’t have access to the model source code, you have to cache the image in CPU memory. Refer to inference_api.py for this source code. The open session API call returns a session ID, which you use for the rest of the calls in this session.

To send a text prompt, get the session ID from the open session and send it along with the text prompt.

inference_api.py looks up the cache in GPU for the image based on the session ID and uses that for inference. This returns the LLaVa model output as a string.

Repeat the previous step to send a different text prompt.
When you’re done with all the text prompts, use the session ID to close the session.

In inference_api.py, we release the image cache held in GPU memory.
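
The following is a minimal client sketch of this session flow. It assumes the sticky-session ID is passed through the SessionId parameter of the sagemaker-runtime InvokeEndpoint API and returned as NewSessionId, and it uses a hypothetical JSON payload shape; check the notebook and inference_api.py for the exact request and response format your handler expects.

import json
import boto3

smr = boto3.client("sagemaker-runtime")
endpoint = "llava-stateful-endpoint"  # the endpoint created earlier

# Open a session: NEW_SESSION indicates this is the start of a session.
open_resp = smr.invoke_endpoint(
    EndpointName=endpoint,
    SessionId="NEW_SESSION",
    ContentType="application/json",
    Body=json.dumps({"type": "open_session", "image_url": "https://example.com/cat.jpg"}),  # hypothetical payload
)
session_id = open_resp.get("NewSessionId")  # assumed response field

# Send a text prompt against the image cached for this session.
pred_resp = smr.invoke_endpoint(
    EndpointName=endpoint,
    SessionId=session_id,
    ContentType="application/json",
    Body=json.dumps({"type": "predict", "prompt": "What is in this image?"}),  # hypothetical payload
)
print(pred_resp["Body"].read().decode())

# Close the session when done so the handler releases the cached tensor.
smr.invoke_endpoint(
    EndpointName=endpoint,
    SessionId=session_id,
    ContentType="application/json",
    Body=json.dumps({"type": "close_session"}),  # hypothetical payload
)
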
The source code for this example is in the GitHub repo. You can run the steps using the following notebook.
Prerequisites
Use the following code to deploy an AWS CloudFormation stack that creates an AWS Identity and Access Management (IAM) role to deploy the SageMaker endpoints:

aws cloudformation create-stack --stack-name sm-stateful-role \
--template-body https://raw.githubusercontent.com/aws-samples/sagemaker-genai-hosting-examples/main/LLava/torchserve/workspace/sm_role.yaml \
--capabilities CAPABILITY_NAMED_IAM \
--region us-west-2

Create a SageMaker notebook instance
Complete the following steps to create a notebook instance for LLaVa model deployment:

On the SageMaker console, choose Notebooks in the navigation pane.

Choose Create notebook instance.

In the Notebook instance settings section, under Additional configuration, choose at least 500 GB for the storage volume.

In the Permissions and encryption section, choose to use an existing IAM role, and choose the role you created in the prerequisites (sm-stateful-role-xxx).

You can get the full name of the role on the AWS CloudFormation console, on the Resources tab of the stack sm-stateful-role.

In the Git repositories section, for Git repository URL, enter https://github.com/aws-samples/sagemaker-genai-hosting-examples.git.

Choose Create notebook instance.

Run the notebook
When the notebook is ready, complete the following steps:

On the SageMaker console, choose Notebooks in the navigation pane.
Choose Open JupyterLab for this new instance.

In JupyterLab, navigate to LLava using the file explorer.

Navigate to torchserve/workspace/ and open the notebook llava_stateful_deploy_infer.ipynb.

Run the notebook.

The ./build_and_push.sh script takes approximately 30 minutes to run; you can also run it in a terminal for better feedback. Note the input parameters from the previous step and make sure you’re in the right directory (sagemaker-genai-hosting-examples/LLava/torchserve/workspace).

The model.deploy() step also takes 20–30 minutes to complete.

When you’re done, run the last cleanup cell.

Additionally, delete the SageMaker notebook instance.

Troubleshooting
When you run ./build_and_push.sh, you might get the following error:

./build_and_push.sh: line 48: docker: command not found

This means you’re not using a SageMaker notebook instance, and are probably using Amazon SageMaker Studio. Docker is not installed in SageMaker Studio by default.
Create and use a SageMaker notebook instance instead, as described in the prerequisites.

Conclusion
In this post, we explained how the new sticky routing feature in Amazon SageMaker allows you to achieve ultra-low latency and enhance your end-user experience when serving multi-modal models. You can use the provided notebook and create stateful endpoints for your multimodal models to enhance your end-user experience.
Try out this solution for your own use case, and let us know your feedback and questions in the comments.

About the authors
Harish Rao is a senior solutions architect at AWS, specializing in large-scale distributed AI training and inference. He empowers customers to harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.
Raghu Ramesha is a Senior GenAI/ML Solutions Architect on the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in computer science from UT Dallas. In his free time, he enjoys traveling and photography.
Lingran Xia is a software development engineer at AWS. He currently focuses on improving inference performance of machine learning models. In his free time, he enjoys traveling and skiing.
Naman Nandan is a software development engineer at AWS, specializing in enabling large scale AI/ML inference workloads on SageMaker using TorchServe, a project jointly developed by AWS and Meta. In his free time, he enjoys playing tennis and going on hikes.
Li Ning is a senior software engineer at AWS with a specialization in building large-scale AI solutions. As a tech lead for TorchServe, a project jointly developed by AWS and Meta, her passion lies in leveraging PyTorch and AWS SageMaker to help customers embrace AI for the greater good. Outside of her professional endeavors, Li enjoys swimming, traveling, following the latest advancements in technology, and spending quality time with her family.
Frank Liu is a Principal Software Engineer for AWS Deep Learning. He focuses on building innovative deep learning tools for software engineers and scientists. Frank has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.
Deepika Damojipurapu is a Senior Technical Account Manager at AWS, specializing in distributed AI training and inference. She helps customers unlock the full potential of AWS by providing consultative guidance on architecture and operations, tailored to their specific applications and use cases. When not immersed in her professional responsibilities, Deepika finds joy in spending quality time with her family – exploring outdoors, traveling to new destinations, cooking wholesome meals together, creating cherished memories.
Alan Tan is a Principal Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to building novel solutions. Outside of work, he enjoys the outdoors.

How healthcare payers and plans can empower members with generative AI

In this post, we discuss how generative artificial intelligence (AI) can help health insurance plan members get the information they need. Many health insurance plan beneficiaries find it challenging to navigate through the complex member portals provided by their insurance plans. These portals often require multiple clicks, filters, and searches to find specific information about their benefits, deductibles, claim history, and other important details. This can lead to dissatisfaction, confusion, and increased calls to customer service, resulting in a suboptimal experience for both members and providers.
The problem arises from the inability of traditional UIs to understand and respond to natural language queries effectively. Members are forced to learn and adapt to the system’s structure and terminology, rather than the system being designed to understand their natural language questions and provide relevant information seamlessly. Generative AI technology, such as conversational AI assistants, can potentially solve this problem by allowing members to ask questions in their own words and receive accurate, personalized responses. By integrating generative AI powered by Amazon Bedrock and purpose-built AWS data services such as Amazon Relational Database Service (Amazon RDS) into member portals, healthcare payers and plans can empower their members to find the information they need quickly and effortlessly, without navigating through multiple pages or relying heavily on customer service representatives. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a unified API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
The solution presented in this post not only enhances the member experience by providing a more intuitive and user-friendly interface, but also has the potential to reduce call volumes and operational costs for healthcare payers and plans. By addressing this pain point, healthcare organizations can improve member satisfaction, reduce churn, and streamline their operations, ultimately leading to increased efficiency and cost savings.

Figure 1: Solution Demo

Solution overview
In this section, we dive deep to show how you can use generative AI and large language models (LLMs) to enhance the member experience by transitioning from a traditional filter-based claim search to a prompt-based search, which allows members to ask questions in natural language and get the desired claims or benefit details. From a broad perspective, the complete solution can be divided into four distinct steps: text-to-SQL generation, SQL validation, data retrieval, and data summarization. The following diagram illustrates this workflow.

Figure 2: Logical Workflow

Let’s dive deep into each step one by one.
Text-to-SQL generation
This step takes the user’s questions as input and converts that into a SQL query that can be used to retrieve the claim- or benefit-related information from a relational database. A pre-configured prompt template is used to call the LLM and generate a valid SQL query. The prompt template contains the user question, instructions, and database schema along with key data elements, such as member ID and plan ID, which are necessary to limit the query’s result set.
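
A minimal sketch of this call with boto3 and Anthropic Claude 3 Sonnet on Amazon Bedrock might look like the following. The placeholder names and the prompt population are illustrative; the @@ delimiters mirror the prompt template shown later in this post.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def generate_sql(prompt_template: str, member_id: str, user_question: str) -> str:
    # Populate the pre-configured prompt template with the member ID and the user question.
    prompt = prompt_template.replace("{text1}", member_id).replace("{text2}", user_question)

    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    completion = json.loads(response["body"].read())["content"][0]["text"]

    # The template asks the model to wrap the executable query in @@ markers.
    return completion.split("@@")[1].strip() if "@@" in completion else ""
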
SQL validation
This step validates the SQL query generated in the previous step and makes sure it’s complete and safe to run on a relational database. Some of the checks that are performed include:

No delete, drop, update, or insert operations are present in the generated query
The query starts with select
WHERE clause is present
Key conditions are present in the WHERE clause (for example, member-id = "78687576501" or member-id like "786875765%%")
Query length (string length) is in expected range (for example, not more than 250 characters)
Original user question length is in expected range (for example, not more than 200 characters)

If a check fails, the query isn’t run; instead, a user-friendly message suggesting that the user contact customer service is sent.
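
A minimal sketch of these guardrails in Python follows; the exact member ID condition format and the friendly message are assumptions for illustration.

import re

def validate_sql(sql: str, user_question: str, member_id: str) -> bool:
    sql_lower = sql.strip().lower()
    checks = [
        # No data-modifying statements anywhere in the generated query.
        not re.search(r"\b(delete|drop|update|insert)\b", sql_lower),
        # The query must start with SELECT and contain a WHERE clause.
        sql_lower.startswith("select"),
        " where " in sql_lower,
        # The WHERE clause must be scoped to the logged-in member (assumed condition format).
        f"member_id = '{member_id}'" in sql_lower or f"member_id like '{member_id}" in sql_lower,
        # Length guards on both the query and the original question.
        len(sql) <= 250,
        len(user_question) <= 200,
    ]
    return all(checks)

# If validation fails, skip execution and return a friendly message, for example:
# "We could not process this request. Please contact customer service."
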
Data retrieval
After the query has been validated, it is used to retrieve the claims or benefits data from a relational database. The retrieved data is converted into a JSON object, which is used in the next step to create the final answer using an LLM. This step also checks if no data or too many rows are returned by the query. In both cases, a user-friendly message is sent to the user, suggesting they provide more details.
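
The following is a minimal sketch of this step using psycopg2, assuming the database credentials are available (for example, from AWS Secrets Manager) and that the row-count limit is a configurable threshold.

import json
import psycopg2

def run_query(sql: str, max_rows: int = 50):
    conn = psycopg2.connect(host="<rds-endpoint>", dbname="claims", user="<user>", password="<password>")
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            columns = [desc[0] for desc in cur.description]
            rows = cur.fetchmany(max_rows + 1)
    finally:
        conn.close()

    if not rows:
        return None, "No data found for the search criteria."
    if len(rows) > max_rows:
        return None, "Your question matched too many claims. Please provide more details."

    # Convert the rows into a JSON object for the data summarization step.
    result = [dict(zip(columns, map(str, row))) for row in rows]
    return json.dumps({"count": len(result), "rows": result}), None
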
Data summarization
Finally, the JSON object retrieved in the data retrieval step, along with the user’s question, is sent to the LLM to get the summarized response. A pre-configured prompt template is used to call the LLM and generate a user-friendly summarized response to the original question.
Architecture
The solution uses Amazon API Gateway, AWS Lambda, Amazon RDS, Amazon Bedrock, and Anthropic Claude 3 Sonnet on Amazon Bedrock to implement the backend of the application. The backend can be integrated with an existing web application or portal, but for the purpose of this post, we use a single page application (SPA) hosted on Amazon Simple Storage Service (Amazon S3) for the frontend and Amazon Cognito for authentication and authorization. The following diagram illustrates the solution architecture.

Figure 3: Solution Architecture

The workflow consists of the following steps:

A single page application (SPA) is hosted using Amazon S3 and loaded into the end-user’s browser using Amazon CloudFront.
User authentication and authorization is done using Amazon Cognito.
After a successful authentication, a REST API hosted on API Gateway is invoked.
The Lambda function, exposed as a REST API using API Gateway, orchestrates the logic to perform the functional steps: text-to-SQL generation, SQL validation, data retrieval, and data summarization. The Amazon Bedrock API endpoint is used to invoke the Anthropic Claude 3 Sonnet LLM. Claim and benefit data is stored in a PostgreSQL database hosted on Amazon RDS. Another S3 bucket is used for storing prompt templates that will be used for SQL generation and data summarizations. This solution uses two distinct prompt templates:

The text-to-SQL prompt template contains the user question, instructions, database schema along with key data elements, such as member ID and plan ID, which are necessary to limit the query’s result set.
The data summarization prompt template contains the user question, raw data retrieved from the relational database, and instructions to generate a user-friendly summarized response to the original question.

Finally, the summarized response generated by the LLM is sent back to the web application running in the user’s browser using API Gateway.

Sample prompt templates
In this section, we present some sample prompt templates.
The following is an example of a text-to-SQL prompt template:

<role>
You are a data analyst and expert in writing PostgreSQL DB queries and healthcare claims data.
</role>
<task>
Your task is to generate a SQL query based on the provided DDL, instructions, user_question, examples, and member_id.
Always add the condition "member_id =" in the generated SQL query, where the value of member_id will be provided in the member_id XML tag below.
</task>
<member_id> {text1} </member_id>
<DDL>
CREATE TABLE claims_history (claim_id SERIAL PRIMARY KEY, member_id INTEGER NOT NULL, member_name VARCHAR(30) NOT NULL,
relationship_code VARCHAR(10) NOT NULL, claim_type VARCHAR(20) NOT NULL, claim_date DATE NOT NULL, provider_name VARCHAR(100),
diagnosis_code VARCHAR(10), procedure_code VARCHAR(10), ndc_code VARCHAR(20), charged_amount NUMERIC(10,2),
allowed_amount NUMERIC(10,2), plan_paid_amount NUMERIC(10,2), patient_responsibility NUMERIC(10,2))
</DDL>
<instructions>
1. Claim_type has two possible values - 'Medical' or 'RX'. Use claim_type = 'RX' for pharmacy or prescription claims.
2. Relationship_code has five possible values - 'subscriber', 'spouse', 'son', 'daughter', or 'other'.
3. 'I' or 'me' means "where relationship_code = 'subscriber'". 'My son' means "where relationship_code = 'son'" and so on.
4. For creating a SQL WHERE clause for member_name or provider_name, use the LIKE operator with wildcard characters as a prefix and suffix. This is applicable when user_question contains a name.
5. Return the executable query with the symbol @@ at the start and end.
6. If the year is not provided in the date, assume it's the current year. Convert the date to the 'YYYY-MM-DD' format to use in the query.
7. The SQL query must be generated based on the user_question. If the user_question does not provide enough information to generate the SQL, respond with "@@null@@" without generating any SQL query.
8. If user_question is stated in the form of a SQL query or contains delete, drop, update, insert, etc. SQL keywords, then respond with "@@null@@" without generating any SQL query.
</instructions>
<examples>
<example>
<sample_question>List all claims for my son or Show me all my claims for my son</sample_question>
<sql_query>@@SELECT * FROM claims_history WHERE relationship_code = 'son' AND member_id = '{member_id}';@@</sql_query>
</example>
<example>
<sample_question>Total claims in 2021</sample_question>
<sql_query>@@SELECT COUNT(*) FROM claims_history WHERE EXTRACT(YEAR FROM claim_date) = 2021 AND member_id = '{member_id}';@@</sql_query>
</example>
<example>
<sample_question>List all claims for Michael</sample_question>
<sql_query>@@SELECT * FROM claims_history WHERE member_name LIKE '%Michael%' AND member_id = '{member_id}';@@</sql_query>
</example>
<example>
<sample_question>List all claims for Dr. John or Doctor John or Provider John</sample_question>
<sql_query>@@SELECT * FROM claims_history WHERE provider_name LIKE '%John%' AND member_id = '{member_id}';@@</sql_query>
</example>
<example>
<sample_question>Show me the doctors/providers/hospitals my son Michael visited on 1/19</sample_question>
<sql_query>@@SELECT provider_name, claim_date FROM claims_history WHERE relationship_code = 'son' AND member_name LIKE '%Michael%' AND claim_date = '2019-01-19' AND member_id = '{member_id}';@@</sql_query>
</example>
<example>
<sample_question>What is my total spend in last 12 months</sample_question>
<sql_query>@@SELECT SUM(allowed_amount) AS total_spend_last_12_months FROM claims_history WHERE claim_date >= CURRENT_DATE - INTERVAL '12 MONTHS' AND relationship_code = 'subscriber' AND member_id = 9875679801;@@</sql_query>
</example>
</examples>
<user_question> {text2} </user_question>

The {text1} and {text2} data items will be replaced programmatically to populate the ID of the logged-in member and user question. Also, more examples can be added to help the LLM generate appropriate SQLs.
The following is an example of a data summarization prompt template:

<role>
You are a customer service agent working for a health insurance plan and helping to answer questions asked by a customer.
</role>
<task>
Use the result_dataset containing healthcare claims data to answer the user_question. This result_dataset is the output of the sql_query.
</task>
<instructions>
1. To answer a question, use simple non-technical language, just like a customer service agent talking to a 65-year-old customer.
2. Use a conversational style to answer the question precisely.
3. If the JSON contains a "count" field, it means the count of claims. For example, "count": 6 means there are 6 claims, and "count": 11 means there are 11 claims.
4. If the result_dataset does not contain meaningful claims data, then respond with one line only: "No data found for the search criteria."
</instructions>
<user_question> {text1} </user_question>
<sql_query> {text2} </sql_query>
<result_dataset> {text3} </result_dataset>

The {text1}, {text2}, and {text3} data items will be replaced programmatically to populate the user question, the SQL query generated in the previous step, and data formatted in JSON and retrieved from Amazon RDS.
Security
Amazon Bedrock is in scope for common compliance standards such as Service and Organization Control (SOC), International Organization for Standardization (ISO), and Health Insurance Portability and Accountability Act (HIPAA) eligibility, and you can use Amazon Bedrock in compliance with the General Data Protection Regulation (GDPR). The service enables you to deploy and use LLMs in a secured and controlled environment. The Amazon Bedrock VPC endpoints powered by AWS PrivateLink allow you to establish a private connection between the virtual private cloud (VPC) in your account and the Amazon Bedrock service account. It enables VPC instances to communicate with service resources without the need for public IP addresses. We define the different accounts as follows:

Customer account – This is the account owned by the customer, where they manage their AWS resources such as RDS instances and Lambda functions, and interact with the Amazon Bedrock hosted LLMs securely using Amazon Bedrock VPC endpoints. You should manage access to Amazon RDS resources and databases by following the security best practices for Amazon RDS.
Amazon Bedrock service accounts – This set of accounts is owned and operated by the Amazon Bedrock service team, which hosts the various service APIs and related service infrastructure.
Model deployment accounts – The LLMs offered by various vendors are hosted and operated by AWS in separate accounts dedicated for model deployment. Amazon Bedrock maintains complete control and ownership of model deployment accounts, making sure no LLM vendor has access to these accounts.

When a customer interacts with Amazon Bedrock, their requests are routed through a secured network connection to the Amazon Bedrock service account. Amazon Bedrock then determines which model deployment account hosts the LLM model requested by the customer, finds the corresponding endpoint, and routes the request securely to the model endpoint hosted in that account. The LLM models are used for inference tasks, such as generating text or answering questions.
No customer data is stored within Amazon Bedrock accounts, nor is it ever shared with LLM providers or used for tuning the models. Communications and data transfers occur over private network connections using TLS 1.2+, minimizing the risk of data exposure or unauthorized access.
By implementing this multi-account architecture and private connectivity, Amazon Bedrock provides a secure environment, making sure customer data remains isolated and secure within the customer’s own account, while still allowing them to use the power of LLMs provided by third-party providers.
Conclusion
Empowering health insurance plan members with generative AI technology can revolutionize the way they interact with their insurance plans and access essential information. By integrating conversational AI assistants powered by Amazon Bedrock and using purpose-built AWS data services such as Amazon RDS, healthcare payers and insurance plans can provide a seamless, intuitive experience for their members. This solution not only enhances member satisfaction, but can also reduce operational costs by streamlining customer service operations. Embracing innovative technologies like generative AI becomes crucial for organizations to stay competitive and deliver exceptional member experiences.
To learn more about how generative AI can accelerate health innovations and improve patient experiences, refer to Payors on AWS and Transforming Patient Care: Generative AI Innovations in Healthcare and Life Sciences (Part 1). For more information about using generative AI with AWS services, refer to Build generative AI applications with Amazon Aurora and Knowledge Bases for Amazon Bedrock and the Generative AI category on the AWS Database Blog.

About the Authors
Sachin Jain is a Senior Solutions Architect at Amazon Web Services (AWS) with a focus on helping healthcare and life sciences customers in their cloud journey. He has over 20 years of experience in the technology, healthcare, and engineering space.
Sanjoy Thanneer is a Sr. Technical Account Manager with AWS based out of New York. He has over 20 years of experience working in the database and analytics domains. He is passionate about helping enterprise customers build scalable, resilient, and cost-efficient applications.
Sukhomoy Basak is a Sr. Solutions Architect at Amazon Web Services, with a passion for Data, Analytics, and GenAI solutions. Sukhomoy works with enterprise customers to help them architect, build, and scale applications to achieve their business outcomes.

Understanding the Hidden Layers in Large Language Models (LLMs)

Hebrew University researchers addressed the challenge of understanding how information flows through different layers of decoder-based large language models (LLMs). Specifically, the study investigates whether the hidden states of previous tokens in higher layers are as crucial as believed. Current LLMs, such as transformer-based models, use the attention mechanism to process tokens by attending to all previous tokens in every layer. While each transformer layer applies this attention uniformly, prior research indicates that different layers capture different types of information. The study builds on the idea that not all layers may equally rely on the hidden states of previous tokens, especially in higher layers.

The research team hypothesized that while lower layers focus on aggregating information from previous tokens, higher layers may rely less on this information. They propose various manipulations in the hidden states of previous tokens in different layers of the model. These include replacing hidden states with random vectors, freezing hidden states at specific layers, and swapping the hidden states of one token with another from a different prompt. They conduct experiments on four open-source LLMs (Llama2-7B, Mistral-7B, Yi-6B, and Llemma-7B) and four tasks, including question answering and summarization, to evaluate the impact of these manipulations on model performance.

One technique involves introducing noise by replacing hidden states with random vectors, which allows researchers to evaluate whether the content of these hidden states still matters at certain layers. The second method, freezing, locks the hidden states at a particular layer and reuses them for the subsequent layers, reducing the computational load.
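
To make the manipulations concrete, the following toy PyTorch sketch applies the random-vector replacement at a chosen layer while leaving the last token's state intact; the freezing variant would instead reuse the layer's output for all later layers. The encoder stack and all sizes are placeholders standing in for the decoder LLMs used in the paper.

import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_layers, manipulate_at = 64, 8, 5  # toy sizes; layer index where the manipulation happens

layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True) for _ in range(n_layers)
)

h = torch.randn(1, 16, d_model)  # (batch, tokens, dim) toy hidden states for a 16-token prompt
for i, layer in enumerate(layers):
    if i == manipulate_at:
        # Replace every previous token's hidden state with a random vector,
        # keeping only the last (current) token's state intact.
        noise = torch.randn_like(h)
        noise[:, -1, :] = h[:, -1, :]
        h = noise
    h = layer(h)

print(h.shape)  # the last position's representation is what drives next-token prediction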

The researchers found that when these manipulations were applied to the top 30-50% of the model, performance across multiple tasks experienced little to no drop, suggesting that the top layers rely less on the hidden representations of previous tokens. For example, when freezing up to 50% of the layers, the models retained performance similar to that of the baseline. Additionally, swapping hidden states from different prompts further confirmed this observation; the model ignored changes made in the top layers, while changes in lower layers significantly altered the output. Additional experiments skipped the attention block in higher layers to test whether attention was needed there at all. This test demonstrated that skipping attention in the upper layers had minimal impact on tasks like summarization and question answering, while doing so in lower layers led to severe performance degradation.

In conclusion, the study reveals a two-phase process in transformer-based LLMs: the early layers gather information from previous tokens, while the higher layers primarily process that information internally. The findings suggest that higher layers are less dependent on the detailed representation of previous tokens, offering potential optimizations, such as skipping attention in these layers to reduce computational costs. Overall, the paper dives deep into the hierarchical nature of information processing in LLMs and leads to more informed and efficient model designs.

Check out the Paper. All credit for this research goes to the researchers of this project.


MAPF-GPT: A Decentralized and Scalable AI Approach to Multi-Agent Pathfinding

Multi-agent pathfinding (MAPF), within computer science and robotics, deals with the problem of routing multiple agents, such as robots, to their individual goals within a shared environment. These agents must find collision-free paths while maintaining a high level of efficiency. MAPF is crucial for applications such as automated warehouses, traffic management, and drone fleets. The complexity of the problem increases with the number of agents, making real-time solutions necessary for practical use.

A significant challenge researchers face in MAPF is managing the complexity and computational demand of routing multiple agents without collisions, especially as the number of agents increases. Solving MAPF optimally is NP-hard, so it is practically impossible to find a perfectly optimal solution in a reasonable time for large-scale problems. Traditional methods struggle with these issues, often relying on oversimplified assumptions or excessive computational resources. Another challenge is the agents’ limited view of their environment, which makes decentralized decision-making difficult without a global view or real-time communication.

Over the years, researchers have explored several approaches to solve MAPF. Rule-based solvers, graph-based methods, and optimization techniques such as minimum flow on graphs are common. These approaches attempt to simplify the problem by transforming it into another, more solvable type of problem or by applying graph search techniques to find paths. More recent methods have incorporated machine learning and deep reinforcement learning, where agents learn from their environment and adjust their paths accordingly. However, these methods often require communication between agents or rely on heuristics to enhance performance, which adds layers of complexity to an already difficult problem.

The research team from AIRI, the Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, and the Moscow Institute of Physics and Technology introduced an innovative approach to solving MAPF called MAPF-GPT. This method stands out because it utilizes a transformer-based model trained through imitation learning. MAPF-GPT is decentralized, meaning each agent makes decisions independently, relying on local observations. Unlike previous methods, MAPF-GPT does not require communication between agents or additional planning steps, making it more scalable and efficient. The team also used a large dataset of expert trajectories, allowing the model to learn from sub-optimal solutions and still perform well in unseen environments.

In developing MAPF-GPT, the researchers created a comprehensive dataset of sub-optimal MAPF solutions generated by existing solvers. These solutions were converted into sequences of observations and actions, referred to as tokens, from which the model could learn. Using a neural network architecture known as transformers, MAPF-GPT could predict the correct actions for agents based on their observations. The local observation for each agent included the current map layout and the agent’s position relative to obstacles and other agents. The model was trained using cross-entropy loss, which allowed it to optimize its decision-making process based on the observed actions from expert data. The researchers ensured that the dataset was diverse, containing over 1 billion observation-action pairs from a variety of MAPF scenarios, including mazes and random maps.
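
MAPF-GPT itself is a large transformer trained on over a billion expert observation-action pairs; the following toy sketch only illustrates the training signal described here, i.e. cross-entropy imitation learning over tokenized observations. The model, vocabulary, and sizes are placeholders, not the authors' architecture.

import torch
import torch.nn as nn

vocab_size, n_actions, seq_len, d_model = 256, 5, 32, 128  # toy sizes

class TinyPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(d_model, n_actions)

    def forward(self, obs_tokens):
        h = self.encoder(self.embed(obs_tokens))
        return self.head(h[:, -1, :])  # predict the action from the final token position

model = TinyPolicy()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# Fake batch of tokenized local observations and the expert actions to imitate.
obs = torch.randint(0, vocab_size, (8, seq_len))
expert_actions = torch.randint(0, n_actions, (8,))

logits = model(obs)
loss = loss_fn(logits, expert_actions)  # imitation learning: match the expert's action
loss.backward()
optimizer.step()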

The performance of MAPF-GPT was thoroughly evaluated against other state-of-the-art decentralized MAPF solvers, including DCC and SCRIMP. In terms of success rate, MAPF-GPT outperformed these models across several scenarios. For example, the largest version of the model, MAPF-GPT-85M, achieved a significantly higher success rate on random maps and maze-like environments compared to its competitors. It was also shown that MAPF-GPT-85M solved problems involving up to 192 agents with linear scalability, meaning its computational requirements increased predictably as the number of agents grew. The model proved to be 13 times faster than SCRIMP and 8 times faster than DCC in high-agent environments. This was particularly evident in large-scale warehouse simulations, where MAPF-GPT demonstrated both speed and efficiency.

MAPF-GPT’s zero-shot learning capabilities were another remarkable achievement. The model solved MAPF problems it had not encountered during its training, demonstrating an ability to generalize to new environments. In a lifelong MAPF scenario, where agents receive new goals after reaching their initial ones, MAPF-GPT performed impressively without further training. The model outperformed traditional solvers like RHCR and learning-based models like FOLLOWER, particularly in warehouse simulations, where its decentralized nature allowed it to maintain high throughput.

Overall, the research introduced a promising new approach to solving the complex problem of multi-agent pathfinding. By relying on imitation learning and a transformer-based architecture, MAPF-GPT demonstrated significant advantages in speed, scalability, and generalization over existing methods. The model’s ability to operate without inter-agent communication or additional heuristics offers a streamlined solution for real-world applications, particularly in environments with large numbers of agents.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


SuRF: An Unsupervised Surface-Centric Framework for High-Fidelity 3D Reconstruction with Region Sparsification

Reconstructing high-fidelity surfaces from multi-view images, especially with sparse inputs, is a critical challenge in computer vision. This task is essential for various applications, including autonomous driving, robotics, and virtual reality, where accurate 3D models are necessary for effective decision-making and interaction with real-world environments. However, achieving this level of detail and accuracy is difficult due to constraints in memory, computational resources, and the ability to capture intricate geometric information from limited data. Overcoming these challenges is vital for advancing AI technologies that demand both precision and efficiency, particularly in resource-constrained settings.

Current approaches for neural surface reconstruction are divided into multi-stage pipelines and end-to-end neural implicit methods. Multi-stage pipelines, like those used by SparseNeuS, involve separate stages for depth estimation, filtering, and meshing. These methods tend to accumulate errors across stages and are inefficient in optimizing coarse and fine stages together. End-to-end methods, such as those employing neural implicit functions, streamline the process by extracting geometry directly using techniques like Marching Cubes. However, these methods face significant memory limitations, particularly when working with high-resolution volumes, and they require a large number of input views to achieve satisfactory results. Additionally, view-dependent methods like C2F2NeuS, which construct separate cost volumes for each view, are computationally expensive and impractical for scenarios with numerous input views. These limitations hinder the application of these methods in real-time and resource-constrained environments.

A team of researchers from Peking University, Peng Cheng Laboratory, University of Birmingham, and  Alibaba propose SuRF, a novel surface-centric framework designed to overcome the limitations of existing methods by enabling efficient, high-resolution surface reconstruction from sparse input views. The innovation lies in SuRF’s end-to-end sparsification strategy, which is unsupervised and surface-centric, reducing memory consumption and computational load while enhancing the model’s ability to capture detailed geometric features. A key component of SuRF is the Matching Field module, which efficiently locates surface regions by leveraging weight distribution along rays, allowing the model to concentrate computational resources on regions near the surface. The Region Sparsification strategy further optimizes this process by retaining only the voxels within the identified surface regions, thus reducing the volume size and enabling the use of higher-resolution features. This approach provides a significant advancement in surface reconstruction by offering a scalable, efficient, and accurate solution, particularly in scenarios with limited input data.

SuRF is constructed using multi-scale feature volumes generated through a feature pyramid network (FPN) and an adaptive cross-scale fusion strategy. The model first extracts multi-scale features from the input images and aggregates them using a fusion network that integrates both global and local features. The Matching Field module identifies surface regions by creating a single-channel matching volume at each scale, which estimates the rough position of the surface along a ray, refined through region sparsification. This strategy ensures that only voxels within the surface regions are retained for higher-resolution scales, significantly reducing memory and computational demands. Training the model involves a combination of color loss, feature consistency loss, eikonal loss, and a warping loss from the matching field. The overall loss function is designed to optimize both the surface prediction and the matching field, allowing the model to efficiently locate and reconstruct high-fidelity surfaces even from sparse inputs.
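
The following toy sketch only illustrates the surface-centric intuition described here, i.e. using the weight distribution along a ray to keep a narrow surface region and discard samples outside it. The window size and weight definition are placeholders, not the paper's matching field formulation.

import torch

def surface_region(weights: torch.Tensor, keep: int = 8) -> torch.Tensor:
    # 'weights' is the per-sample weight distribution along one ray (sums to 1).
    # Keep only a small window of samples around the most likely surface location.
    center = int(torch.argmax(weights))
    lo = max(center - keep // 2, 0)
    hi = min(center + keep // 2, weights.numel())
    mask = torch.zeros_like(weights, dtype=torch.bool)
    mask[lo:hi] = True
    return mask

# Toy example: 64 samples along a ray, with weight concentrated near the surface.
weights = torch.softmax(-((torch.arange(64.0) - 40.0) ** 2) / 20.0, dim=0)
mask = surface_region(weights)
print(mask.nonzero().flatten())  # indices of samples retained for the finer, high-resolution scale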

SuRF demonstrates substantial improvements in accuracy and efficiency across multiple benchmarks, including DTU, BlendedMVS, Tanks and Temples, and ETH3D. Specifically, SuRF achieves a 46% improvement in accuracy while reducing memory consumption by 80% compared to previous methods. It consistently outperforms existing state-of-the-art approaches, achieving lower chamfer distances, which indicates finer and more detailed surface reconstructions. These results confirm that SuRF offers a more efficient and accurate solution for high-fidelity surface reconstruction, particularly when working with sparse input views, making it highly suitable for applications requiring both precision and resource efficiency.

SuRF introduces a significant advancement in neural surface reconstruction by providing a novel surface-centric approach that combines unsupervised end-to-end sparsification with efficient memory usage. Through the Matching Field and Region Sparsification strategies, SuRF directs computational resources toward high-resolution surface reconstruction, even with sparse input views. The experimental results validate SuRF’s effectiveness, highlighting its potential to set a new standard in high-fidelity surface reconstruction within AI research. This approach not only addresses a critical challenge in the field but also opens the door to more scalable and efficient AI systems suitable for deployment in resource-constrained environments.

Check out the Paper. All credit for this research goes to the researchers of this project.


Introducing Amazon EKS support in Amazon SageMaker HyperPod

We are thrilled to introduce Amazon Elastic Kubernetes Service (Amazon EKS) support in Amazon SageMaker HyperPod, a purpose-built infrastructure engineered with resilience at its core. This capability allows for the seamless addition of SageMaker HyperPod managed compute to EKS clusters, using automated node and job resiliency features for foundation model (FM) development.
FMs are typically trained on large-scale compute clusters with hundreds or thousands of accelerators. Under such circumstances, hardware failures pose a significant challenge, because a single accelerator failure among thousands can halt the entire training process. For example, Meta Llama 3 405B pre-training over 54 days on 16K NVIDIA H100 Tensor Core GPUs experienced 419 unexpected interruptions, with 78% attributed to confirmed or suspected hardware issues, and with 58.7% of these interruptions being GPU-related problems, including NVLink failures and HBM3 memory failures.
Since its inception, SageMaker HyperPod was designed with a focus on managed resiliency features to mitigate such hardware failures, enabling FM builders such as Thomson Reuters, Perplexity AI, and Hugging Face to scale their FM training and inference on Slurm clusters. With the EKS support in HyperPod, you can now also benefit from the resiliency features on Kubernetes clusters by managing machine learning (ML) workloads using the HyperPod compute and managed Kubernetes control plane on the EKS cluster.
AI startups like Observea and Articul8, and enterprises like Thomson Reuters use this new feature set to manage their ML model development lifecycle:

“Through our use of SageMaker HyperPod, our customers and internal teams no longer have to worry about operating and configuring the Kubernetes control plane, and SageMaker HyperPod provides the network performance and optimized configurations to support complex HPC workloads. With Amazon EKS support in SageMaker HyperPod, we can reduce time we spent for undifferentiated heavy lifting in infrastructure management and reduce operational costs by over 30%.”
– Observea

“As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us as it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. In addition, this also helps our end customers as we are now able to package and productize this capability into our GenAI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.”
– Articul8 AI

This post is designed for Kubernetes cluster administrators and ML scientists, providing an overview of the key features that SageMaker HyperPod introduces to facilitate large-scale model training on an EKS cluster.
The post is organized into the following three sections:

Overview of Amazon EKS support in SageMaker HyperPod – This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introducing three key resiliency features HyperPod compute provides on the EKS cluster. Additionally, this section explains how HyperPod provides a smooth developer experience for admins and scientists.
HyperPod cluster setup and node resiliency features – This section provides a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, emphasizing how its built-in resiliency features provide infrastructure stability. This section is especially beneficial for admins.
Training job resiliency with the job auto resume functionality – In this section, we demonstrate how scientists can submit and manage their distributed training jobs using either the native Kubernetes CLI (kubectl) or optionally the new HyperPod CLI (hyperpod) with automatic job recovery enabled.

Overview of EKS support in SageMaker HyperPod
This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introduces three key resiliency features HyperPod compute provides on the EKS cluster, and discusses how SageMaker HyperPod provides smooth user experiences for admins and scientists.
Architecture overview
Amazon EKS support in HyperPod supports a 1-to-1 mapping between an EKS cluster (serving as a Kubernetes control plane) and a HyperPod compute (attached as a group of worker nodes). You have three virtual private clouds (VPCs) in this architecture, hosting different types of resources:

Amazon EKS VPC – An AWS managed VPC hosts the EKS control plane. This VPC doesn’t appear in the customer account. Amazon EKS creates a highly available endpoint for the managed Kubernetes API server that you use to communicate with your cluster (using tools like kubectl). The managed endpoint uses Network Load Balancer to load balance Kubernetes API servers.
HyperPod VPC – An AWS managed VPC hosts the HyperPod compute. This VPC doesn’t appear in the customer account. The nodes connect to the EKS control plane through a cross-account elastic network interface (ENI).
SageMaker user VPC – A user-managed VPC hosts resources such as Amazon FSx for Lustre, which is optionally associated with Amazon Simple Storage Service (Amazon S3) using a data repository association, in your account.

Cross-account ENIs also bridge communication between HyperPod compute instances and other AWS services on your account, such as Amazon Elastic Container Registry (Amazon ECR) and Amazon CloudWatch.
The following diagram illustrates the high-level architecture of Amazon EKS support in HyperPod.

HyperPod-managed resiliency features
Amazon EKS support in HyperPod provides the following three capabilities to make sure the cluster stays healthy and training jobs continue under unexpected interruptions:

Deep health checks – This is a managed health check for stress testing GPUs and AWS Trainium instances, as well as performing Elastic Fabric Adapter (EFA) checks. These checks can be run during the cluster creation, update, or node replacement phases, and can be enabled or disabled through HyperPod APIs.
Automated node recovery – HyperPod performs managed, lightweight, and non-invasive checks, coupled with automated node replacement capability. The HyperPod monitoring agent continuously monitors and detects potential issues, including memory exhaustion, disk failures, GPU anomalies, kernel deadlocks, container runtime issues, and out-of-memory (OOM) crashes. Based on the underlying issue, the monitoring agent either replaces or reboots the node.
Job auto resume – SageMaker HyperPod provides a job auto resume capability using the Kubeflow Training Operator for PyTorch to provide recovery and continuation of training jobs in the event of interruptions or failures. The extension makes sure the job waits and restarts after the node is replaced.

User experiences
In addition to the aforementioned managed resiliency features, SageMaker HyperPod provides smooth user experiences for both admins and scientists that are critical for managing a large cluster and running large-scale training jobs on them as part of the Amazon EKS integration:

Admin experience – SageMaker HyperPod provides APIs and a console experience to create and manage node groups in the EKS cluster, along with the ability to SSH into the cluster nodes. SageMaker HyperPod also provides a mechanism to install additional dependencies on the cluster nodes using lifecycle scripts, and an API-based mechanism to provide cluster software updates and improve overall observability.
Scientist experience – Along with enabling scientists to train FMs using Amazon EKS as the orchestrator, SageMaker HyperPod provides additional capabilities for scientists to effortlessly train models. With the HyperPod CLI, scientists can submit training jobs by providing a .yaml file and manage jobs (list, describe, view, cancel) without needing to use kubectl. Scientists can use open source tools like Kueue (a Kubernetes tool for job queuing) and adjacent SageMaker capabilities like managed MLflow to manage their experiments and training runs. Scientists can also access native SageMaker distributed training libraries that provide performance improvements of up to 20%. You can also enable SageMaker HyperPod compute with Amazon EKS support using third-party tools like KubeRay, which runs on the Kubernetes API. This allows you to bring your preferred job submission and management capabilities used with other Kubernetes clusters into your HyperPod environment.

HyperPod compute setup and node resiliency features
In this section, we provide a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, and discuss how its built-in resiliency features provide infrastructure stability.
Prerequisites
You need to have the following in place prior to the HyperPod compute deployment:

EKS cluster – You can associate HyperPod compute with an existing EKS cluster that satisfies the set of prerequisites. Alternatively, you can deploy a ready-made EKS cluster with a single AWS CloudFormation template. Refer to the architecture guide for step-by-step setup instructions.
Custom resources – Running multi-node distributed training requires various components, such as device plugins, CSI drivers, and Training Operators, to be pre-deployed on the EKS cluster. You also need to deploy additional resources for the health monitoring agent and deep health checks. HyperPodHelmCharts simplify the process using Helm, one of the most commonly used package managers for Kubernetes. Refer to the developer guide for installation.

HyperPod compute setup
With the aforementioned resources successfully deployed, you’re now prepared to create the HyperPod compute. The cluster configuration is specified using a JSON file; the following code provides an example:

cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://${BUCKET_NAME}",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "$SECURITY_GROUP"
        ],
        "Subnets": [
            "$SUBNET_ID"
        ]
    },
    "NodeRecovery": "Automatic"
}
EOL

The provided configuration file contains two key highlights:

"OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"] – Instructs HyperPod to conduct a deep health check whenever new GPU or Trainium instances are added
"NodeRecovery": "Automatic" – Enables HyperPod’s automated node recovery functionality

You can create a HyperPod compute with the following AWS CLI command (you need AWS CLI version 2.17.47 or newer):

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json

{
    "ClusterArn": "arn:aws:sagemaker:us-east-2:xxxxxxxxxx:cluster/wccy5z4n4m49"
}

To verify the cluster status, you can use the following command:

aws sagemaker list-clusters --output table

This command displays the cluster details, including the cluster name, status, and creation time:

------------------------------------------------------------------------------------------------------------------------
|                                                    ListClusters                                                       |
+------------------------------------------------------------------------------------------------------------------------+
||                                                 ClusterSummaries                                                    ||
|+-----------------------------------------------------------------+--------------+----------------+------------------+|
||                           ClusterArn                            | ClusterName  | ClusterStatus  |  CreationTime    ||
|+-----------------------------------------------------------------+--------------+----------------+------------------+|
||  arn:aws:sagemaker:us-east-2:111111111111:cluster/wccy5z4n4m49  |  ml-cluster  |  Creating      |  1723724079.337  ||
|+-----------------------------------------------------------------+--------------+----------------+------------------+|

Alternatively, you can verify the cluster status through the SageMaker console. After a brief period, you can observe that the status for all nodes transitions to Running.
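
You can also perform the same check programmatically. The following sketch uses the boto3 SageMaker client and assumes the cluster name from the configuration file shown earlier; the printed fields match the CLI output above.

import boto3

sm = boto3.client("sagemaker")

# List all HyperPod clusters in the account and Region.
for summary in sm.list_clusters()["ClusterSummaries"]:
    print(summary["ClusterName"], summary["ClusterStatus"])

# Describe a single cluster by name.
cluster = sm.describe_cluster(ClusterName="ml-cluster")
print(cluster["ClusterStatus"])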

Node resiliency features
To gain further insight into the instances, you can use kubectl get nodes and examine the node labels. The sagemaker.amazonaws.com/node-health-status label reveals the life stage of each node. For instance, nodes with the ml.m5.2xlarge instance type are labeled as Schedulable, indicating that they have successfully passed the regular HyperPod health check. Conversely, nodes with the ml.p5.48xlarge instance type are labeled as Unschedulable, indicating that they have entered the initial deep health checks. The following code shows an example:

# kubectl get nodes --show-labels=true
NAME                         …  LABELS
hyperpod-i-023cfe933b3b34369 …  beta.kubernetes.io/instance-type=ml.m5.2xlarge,sagemaker.amazonaws.com/node-health-status=Schedulable,  …
hyperpod-i-045961b6424401838 …  beta.kubernetes.io/instance-type=ml.p5.48xlarge,sagemaker.amazonaws.com/node-health-status=Unschedulable, …
hyperpod-i-074b81fdb5bf52e19 …  beta.kubernetes.io/instance-type=ml.p5.48xlarge,sagemaker.amazonaws.com/node-health-status=Unschedulable, …
hyperpod-i-0ae97710b3033cdb1 …  beta.kubernetes.io/instance-type=ml.m5.2xlarge,sagemaker.amazonaws.com/node-health-status=Schedulable,  …

The deep health check logs are stored in the CloudWatch log group at /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>. The log streams are logged at DeepHealthCheckResults/<log_stream_id>. When the deep health checks identify an issue, the output log provides detailed information, including the instance ID that failed the deep health checks and the specific failure reason. For example:

# Example1
{
"level": "error",
"ts": "2024-08-15T21:15:22Z",
"msg": "Encountered FaultyInstance. Replace the Instance. Region: us-east-2,
InstanceType: p5.48xlarge. ERROR:Bandwidth has less than threshold: Expected minimum
threshold :80,NCCL Test output Bw: 30"
}
# Example2
{
"level": "error",
"ts": "2024-08-15T21:15:22Z",
"msg": "Encountered Unknownerror. Replace the Instance. Region: us-east-2,
InstanceType: p5.48xlarge. ERROR: Crash detected in dcgm test"
}
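
To pull these results programmatically rather than through the console, you can filter the log group with the CloudWatch Logs API. A minimal sketch follows, assuming the cluster name and ID from your deployment are substituted into the log group name.

import boto3

logs = boto3.client("logs")
log_group = "/aws/sagemaker/Clusters/ml-cluster/<cluster_id>"  # substitute your cluster ID

# Retrieve the deep health check results for all nodes in the cluster.
events = logs.filter_log_events(
    logGroupName=log_group,
    logStreamNamePrefix="DeepHealthCheckResults/",
)
for event in events["events"]:
    print(event["message"])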

You can check the progress of the deep health check with the following values for the sagemaker.amazonaws.com/deep-health-check label on each node:

sagemaker.amazonaws.com/deep-health-check: InProgress
sagemaker.amazonaws.com/deep-health-check: Passed
sagemaker.amazonaws.com/deep-health-check: Failed

If a node fails the deep health checks, it will be replaced. Otherwise, it will be marked with the Schedulable label:
sagemaker.amazonaws.com/node-health-status: Schedulable
When you want to manually replace a specific node in your cluster, you can do so by manually modifying the label.
For the complete list of resilience-related Kubernetes labels, refer to the AWS documentation.
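
If you prefer to inspect these labels programmatically instead of with kubectl, a minimal sketch with the official Kubernetes Python client follows; it only reads the labels discussed above and does not modify them.

from kubernetes import client, config

config.load_kube_config()  # uses the same kubeconfig as kubectl
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    labels = node.metadata.labels or {}
    print(
        node.metadata.name,
        labels.get("sagemaker.amazonaws.com/node-health-status"),
        labels.get("sagemaker.amazonaws.com/deep-health-check"),
    )
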
Even after the initial deep health checks, HyperPod periodically runs regular health checks. To view the health events detected by the HyperPod health monitoring agent, you can check the CloudWatch stream log:

Example log group name – /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>
Example log stream name – SagemakerHealthMonitoringAgent/<your_node_group_name>/<instance_id>

The SagemakerHealthMonitoringAgent log stream for each node contains only the detection events from the health monitoring agent. For example:

# Example1
{
    "level": "info",
    "ts": "2024-09-06T03:15:11Z",
    "msg": "NPD caught ",
    "condition type: ": "KernelDeadlock",
    "with condition details ": {
        "type": "KernelDeadlock",
        "status": "False",
        "transition": "2024-09-06T03:15:11.539932213Z",
        "reason": "KernelHasNoDeadlock",
        "message": "kernel has no deadlock"
    },
    "HealthMonitoringAgentDetectionEvent": "HealthEvent"
}
# Example2
{
    "level": "info",
    "ts": "2024-09-06T03:15:11Z",
    "msg": "NPD caught ",
    "condition type: ": "NvidiaErrorTerminate",
    "with condition details ": {
        "type": "NvidiaErrorTerminate",
        "status": "False",
        "transition": "2024-09-06T03:15:11.539932283Z",
        "reason": "NvidiaNoErrorRequiredTerminate",
        "message": "Nvidia no error required terminate"
    },
    "HealthMonitoringAgentDetectionEvent": "HealthEvent"
}

When the deep health checks or the health monitoring agent identify issues in a node, the node is labeled with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplace:NoSchedule to avoid scheduling pods, and then the node is replaced or rebooted.
You can monitor the health status of HyperPod nodes through CloudWatch Container Insights, now with enhanced observability for Amazon EKS. Container Insights helps collect, aggregate, and summarize metrics and logs from containerized applications and microservices, providing detailed insights into performance, health, and status metrics for CPU, GPU, Trainium, EFA, and file system up to the container level. For the complete list of metrics tracked, see Amazon EKS and Kubernetes Container Insights metrics. With the Container Insights integration with SageMaker HyperPod, you can also check the individual node health status and the total number of schedulable and unschedulable nodes, as shown in the following screenshots.

You can find the Container Insights set up guide in Amazon EKS Support in Amazon SageMaker HyperPod Workshop.
Training job resiliency with the job auto resume functionality
In addition to infrastructure resiliency features, you can use the job auto resume capability with the Kubeflow Training Operator for PyTorch to maintain the recovery and continuation of training jobs in the event of interruptions or failures. The job auto resume feature attempts to continue the job, whereas the HyperPod node auto recovery functionality works on resolving node failures (node reboot or replacement as needed) to minimize training downtime. This section demonstrates the job auto resume feature using a PyTorch FSDP example on the awsome-distributed-training repository.
To enable the job auto resume feature, you create a PyTorchJob with the fsdp.yaml manifest, which includes the following annotations and nodeSelector:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
    name: fsdpjob
    namespace: kubeflow
    # config for HyperPod job auto-resume
    annotations: {
        sagemaker.amazonaws.com/enable-job-auto-resume: "true",
        sagemaker.amazonaws.com/job-max-retry-count: "2"
    }
spec:
  pytorchReplicaSpecs:
  ...
  Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
          spec:
            nodeSelector:
              sagemaker.amazonaws.com/node-health-status: Schedulable
...

With the annotations sagemaker.amazonaws.com/enable-job-auto-resume: "true" and sagemaker.amazonaws.com/job-max-retry-count: "2", SageMaker HyperPod resumes interrupted training jobs up to two times and schedules the resumed jobs onto healthy nodes. These healthy nodes are identified by the node selector label sagemaker.amazonaws.com/node-health-status: Schedulable, ensuring that only nodes that have passed basic health checks and are available for running workloads are used for resumed jobs.
Submit the PyTorchJob using the kubectl command:

kubectl apply -f fsdp.yaml

With the job auto resume feature enabled, if a job fails due to a hardware failure or any transient issues during training, SageMaker HyperPod initiates the node replacement workflow and restarts the job after the faulty nodes are replaced. You can verify the status of job auto resume by describing the PyTorchJob:

kubectl describe pytorchjob -n kubeflow <job-name>

In the event of a hardware failure, the Kubeflow training job restarts as follows:

Start Time: 2024-07-11T05:53:10Z
Enable job auto-resume 27

Events:
  Type     Reason                   Age                    From                   Message
  ----     ------                   ----                   ----                   -------
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-0
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-1
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m59s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-0
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-1
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m58s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed

When you submit a training job with the HyperPod CLI, you can also request the job to be auto resumed in the following way:

hyperpod start-job \
    --config-file ./config.yaml \
    --auto-resume true \
    --max-retry 2

Refer to config.yaml for the full configuration. For other CLI options, refer to the documentation in the GitHub repository.
Clean up
To delete your SageMaker HyperPod compute, use either the SageMaker console or the following AWS Command Line Interface (AWS CLI) command:

aws sagemaker delete-cluster --cluster-name <cluster_name>

Cluster deletion can take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker console.
Conclusion
With the support for Amazon EKS in SageMaker HyperPod, customers who have standardized their FM development workflows on Kubernetes can adopt SageMaker HyperPod and manage their cluster resources through a familiar Kubernetes interface. When training an FM, SageMaker HyperPod automatically monitors cluster health, and when an infrastructure fault such as a GPU failure occurs, it automatically remediates the issue and restarts the training process from the last saved checkpoint, without any human intervention. Amazon EKS further enhances this capability by running deep health checks: whenever a new instance is added to the SageMaker HyperPod compute, it undergoes a deep health check process to identify and replace potentially problematic instances. SageMaker HyperPod then automatically replaces or reboots nodes identified as faulty and resumes training in the event of unexpected interruptions, handling node replacement and job resubmission.
For an end-to-end tutorial on cluster management and FM training, visit the Amazon EKS Support in Amazon SageMaker HyperPod Workshop. For more information on infrastructure deployment and additional distributed training test cases, refer to the awsome-distributed-training repository. If you’re interested in deploying HyperPod with step-by-step commands, you can start from the aws-do-hyperpod repository.

About the authors
Keita Watanabe is a Senior GenAI Specialist Solutions Architect in the world-wide specialist organization at Amazon Web Services, where he helps develop machine learning solutions using OSS projects such as Slurm and Kubernetes. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. Keita holds a PhD in Science from the University of Tokyo.
Alex Iankoulski is a full-stack software and infrastructure architect who likes to do deep, hands-on work. He is currently a Principal Solutions Architect in the world-wide specialist organization at AWS. In his role, he focuses on helping customers with the orchestration and scaling of ML and AI workloads on container-powered AWS services. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. During the past 10 years, Alex has worked on democratizing generative AI and ML, combating climate change, and making travel safer, healthcare better, and energy smarter.
Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he leverages his in-depth skills in cloud-side technology. In his free time, he enjoys playing video games, reading books, and writing software.
Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.
Manoj Ravi is a Senior Product Manager on the Amazon SageMaker team. He is passionate about building next-gen AI products and works on applications and tools to make foundation model development and deployment effortless for customers. He holds an MBA from the Haas School of Business and a master’s degree from Carnegie Mellon University. In his spare time, Manoj enjoys playing tennis and pursuing landscape photography.

A review of purpose-built accelerators for financial services

Data contains information, and information can be used to predict future behaviors, from the buying habits of customers to securities returns. Businesses are seeking a competitive advantage by being able to use the data they hold, apply it to their unique understanding of their business domain, and then generate actionable insights from it. The financial services industry (FSI) is no exception to this, and is a well-established producer and consumer of data and analytics. As in every industry, FSI has its own nuances and ways of doing business; here, considerations such as regulation and zero-sum competitive pressures loom large. This mostly non-technical post is written for FSI business leader personas such as the chief data officer, chief analytics officer, chief investment officer, head quant, head of research, and head of risk. These personas are faced with making strategic decisions on issues such as infrastructure investment, product roadmap, and competitive approach. The aim of this post is to level-set and inform in a rapidly advancing field, helping readers understand competitive differentiators and formulate an associated business strategy.
Accelerated computing is a generic term that is often used to refer to specialist hardware called purpose-built accelerators (PBAs). In financial services, nearly every type of activity, from quant research to fraud prevention to real-time trading, can benefit from reducing runtime. By performing a calculation more quickly, the user may be able to solve an equation more accurately, provide a better customer experience, or gain an informational edge over a competitor. These activities cover disparate fields such as basic data processing, analytics, and machine learning (ML). And finally, some activities, such as those involved with the latest advances in artificial intelligence (AI), are simply not practically possible without hardware acceleration. ML is often associated with PBAs, so we start this post with an illustrative figure. The ML paradigm is learning followed by inference. Typically, learning is offline (not streaming real-time data, but historical data) on large volumes of data, whereas inference is online on small volumes of streaming data. Learning means identifying and capturing historical patterns from the data, and inference means mapping a current value to the historical pattern. PBAs, such as graphics processing units (GPUs), have an important role to play in both these phases. The following figure illustrates the idea of a large cluster of GPUs being used for learning, followed by a smaller number for inference. The distinct computational nature of the learning and inference phases means some hardware providers have developed independent solutions for each phase, whereas others have a single solution for both phases.

As shown in the preceding figure, the ML paradigm is learning (training) followed by inference. PBAs, such as GPUs, can be used for both these steps. In this example figure, features are extracted from raw historical data, which are then fed into a neural network (NN). Due to model and data size, learning is distributed over multiple PBAs in an approach called parallelism. Labeled data is used to learn the model structure and weights. Unseen new streaming data is then applied to the model, and an inference (prediction) on that data is made.
This post starts by looking at the background of hardware accelerated computing, followed by reviewing the core technologies in this space. We then consider why and how accelerated computing is important for data processing. Then we review four important FSI use cases for accelerated computing. Key problem statements are identified and potential solutions given. The post finishes by summarizing three key takeaways and suggesting actionable next steps.
Background on accelerated computing
CPUs are designed for processing small volumes of sequential data, whereas PBAs are suited for processing large volumes of parallel data. PBAs can perform some functions, such as some floating-point (FP) calculations, more efficiently than is possible by software running on CPUs. This can result in advantages such as reduced latency, increased throughput, and decreased energy consumption. The three types of PBAs are the easily reprogrammable chips such as GPUs, and two types of fixed-function acceleration: field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). Fixed or semi-fixed function acceleration is practical when no updates are needed to the data processing logic. FPGAs are reprogrammable, albeit not very easily, whereas ASICs are custom designed, fully fixed for a specific application, and not reprogrammable. As a general rule, the less user-friendly the speedup, the faster it is. In terms of resulting speedups, the approximate order is programming hardware, then programming against PBA APIs, then programming in an unmanaged language such as C++, then a managed language such as Python. Analysis of publications containing accelerated compute workloads by Zeta-Alpha shows a breakdown of 91.5% GPU PBAs, 4% other PBAs, 4% FPGA, and 0.5% ASICs. This post is focused on the easily reprogrammable PBAs.
The recent history of PBAs begins in 1999, when NVIDIA released its first product expressly marketed as a GPU, designed to accelerate computer graphics and image processing. By 2007, GPUs became more generalized computing devices, with applications across scientific computing and industry. In 2018, other forms of PBAs became available, and by 2020, PBAs were being widely used for parallel problems, such as the training of NNs. Examples of other PBAs now available include AWS Inferentia and AWS Trainium, Google TPU, and Graphcore IPU. Around this time, industry observers reported NVIDIA's strategy pivoting from its traditional gaming and graphics focus to moving into scientific computing and data analytics.
The union of advances in hardware and ML has led us to the current day. Work by Hinton et al. in 2012 is now widely referred to as ML's "Cambrian Explosion." Although NNs had been around since the 1960s and had never really worked, Hinton noted three key changes. Firstly, they added more layers to their NNs, improving their performance. Secondly, there was a massive increase in the volume of labeled data available for training. Thirdly, the presence of GPUs enabled the labeled data to be processed. Together, these elements led to the start of a period of dramatic progress in ML, with NNs being redubbed deep learning. In 2017, the landmark paper "Attention is all you need" was published, which laid out a new deep learning architecture based on the transformer. In order to train transformer models on internet-scale data, huge quantities of PBAs were needed. In November 2022, ChatGPT was released, a large language model (LLM) that used the transformer architecture; it is widely credited with starting the current generative AI boom.
Review of the technology
In this section, we review different components of the technology.
Parallel computing
Parallel computing refers to carrying out multiple processes simultaneously, and can be categorized according to the granularity at which parallelism is supported by the hardware: for example, a grid of connected instances, multiple processors within a single instance, multiple cores within a single processor, PBAs, or a combination of these approaches. Parallel computing uses these multiple processing elements simultaneously to solve a problem. This is accomplished by breaking the problem into independent parts so that each processing element can complete its part of the workload algorithm simultaneously. Parallelism is suited to workloads that are repetitive, fixed tasks involving little conditional branching and often large amounts of data. This also means that not all workloads are equally suitable for acceleration.
In parallel computing, the granularity of a task is a measure of the amount of communication overhead between the processing functional units. Granularity is typically split into the categories of fine-grained and coarse-grained. Fine-grained parallelism refers to a workload being split into a large number of small tasks, whereas coarse-grained refers to splitting into a small number of large tasks. The key difference between the two categories is the degree of communication and synchronization required between the processing units. A thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, and is typically a component of a process. The multiple threads of a given process may be run concurrently by multithreading, while sharing resources such as memory. An application can achieve parallelism by using multithreading to split data and tasks into parallel subtasks and let the underlying architecture manage how the threads run, either concurrently on one core or in parallel on multiple cores. Here, each thread performs the same operation on different segments of memory so that they can operate in parallel. This, in turn, enables better system utilization and provides faster program execution.
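As a minimal illustration of coarse-grained data parallelism (not tied to any specific FSI workload), the following Python sketch splits a dataset into a few large chunks and processes them on separate CPU cores; the function and chunk sizes are illustrative only:

from concurrent.futures import ProcessPoolExecutor
import math

def partial_sum(chunk):
    # Each worker applies the same operation to its own segment of the data.
    return sum(math.sqrt(x) for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    chunk_size = len(data) // n_workers
    # Coarse-grained parallelism: a few large tasks with little communication between them.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)
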
Purpose-built accelerators
Flynn’s taxonomy is a classification of computer architectures helpful in understanding PBAs. Two classifications of relevance are single instruction stream, multiple data streams (SIMD), and the SIMD sub-classification of single instruction, multiple thread (SIMT). SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. SIMT describes processors that are able to operate on data vectors and arrays (as opposed to just scalars), and therefore handle big data workloads efficiently. Each SIMT core has multiple threads that run in parallel, thereby giving true simultaneous parallel hardware-level execution. CPUs have a relatively small number of complex cores and are designed to run a sequence of operations (threads) as fast as possible, and can run a few tens of these threads in parallel. GPUs, in contrast, feature smaller cores and are designed to run thousands of threads in parallel in the SIMT paradigm. It is this design that primarily distinguishes GPUs from CPUs and allows GPUs to excel at regular, dense, numerical, data-flow-dominated workloads.
Suppliers of data center GPUs include NVIDIA, AMD, Intel, and others. The AWS P5 EC2 instance type range is based on the NVIDIA H100 chip, which uses the Hopper architecture. The Hopper H100 GPU (SXM5 variant) architecture includes 8 GPU processing clusters (GPCs), 66 texture processing clusters (TPCs), 2 Streaming Multiprocessors (SMs)/TPC, 528 Tensor cores/GPU, and 128 CUDA cores/SM. Additionally, it features 80 GB HBM3 GPU memory, 900 GBps NVLink GPU-to-GPU interconnect, and a 50 MB L2 cache minimizing HBM3 trips. An NVIDIA GPU is assembled in a hierarchical manner: the GPU contains multiple GPCs, and the role of each GPC is to act as a container to hold all the components together. Each GPC has a raster engine for graphics and several TPCs. Inside each TPC is a texture unit, some logic control, and multiple SMs. Inside each SM are multiple CUDA and Tensor cores, and it is here that the compute work happens. The ratio of units GPU:GPC:TPC:SM:CUDA core/Tensor core varies according to release and version. This hierarchical architecture is illustrated in the following figure.

SMs are the fundamental building blocks of an NVIDIA GPU, and consist of CUDA cores, Tensor cores, distributed shared memory, and instructions to support dynamic programming. When a CUDA program is invoked, work is distributed to the multithreaded SMs with available execution capacity. The CUDA core, released in 2007, is a GPU core approximately equal to a CPU core. Although it’s not as powerful as a CPU core, the CUDA core advantage is its ability to be used for large-scale parallel computing. Like a CPU core, each CUDA core still only runs one operation per clock cycle; however, the GPU SIMD architecture enables large numbers of CUDA cores to simultaneously address one data point each. CUDA cores are split into support for different precision, meaning that in the same clock cycle, multiple precision work can be done. The CUDA core is well suited for high-performance computing (HPC) use cases, but is not so well suited for the matrix math found in ML. The Tensor core, released in 2017, is another NVIDIA proprietary GPU core that enables mixed-precision computing, and is designed to support the matrix math of ML. Tensor cores support mixed FP accuracy matrix math in a computationally efficient manner by treating matrices as primitives and being able to perform multiple operations in one clock cycle. This makes GPUs well suited for data-heavy, matrix math-based, ML training workloads, and real-time inference workloads needing synchronicity at scale. Both use cases require the ability to move data around the chip quickly and controllably.
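To make the SIMT idea concrete, the following is a minimal sketch of a GPU kernel written in Python with the Numba library (CUDA C would work equally well); it assumes an NVIDIA GPU and the numba package are available. Each thread handles exactly one element of the arrays:

import numpy as np
from numba import cuda  # assumes the numba package and an NVIDIA GPU

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)          # global thread index: one thread per array element
    if i < out.size:
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)   # thousands of threads run in parallel
print(out[:3])
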
From 2010 onwards, other PBAs have started becoming available to consumers, such as AWS Trainium, Google's TPU, and Graphcore's IPU. While an in-depth review of other PBAs is beyond the scope of this post, the core principle is one of designing a chip from the ground up, based around ML-style workloads. Specifically, ML workloads are typified by irregular and sparse data access patterns. This means there is a requirement to support fine-grained parallelism based on irregular computation with aperiodic memory access patterns. Other PBAs tackle this problem statement in a variety of ways that differ from NVIDIA GPUs, including having cores and supporting architecture complex enough to run completely distinct programs, and decoupling thread data access from the instruction flow by placing distributed memory next to the cores.
AWS accelerator hardware
AWS currently offers a range of 68 Amazon Elastic Compute Cloud (Amazon EC2) instance types for accelerated compute. Examples include F1 Xilinx FPGAs, P5 NVIDIA Hopper H100 GPUs, G4ad AMD Radeon Pro V520 GPUs, DL2q Qualcomm AI 100, DL1 Habana Gaudi, Inf2 powered by Inferentia2, and Trn1 powered by Trainium. In March 2024, AWS announced it will offer the new NVIDIA Blackwell platform, featuring the new GB200 Grace Blackwell chip. Each EC2 instance type has a number of variables associated with it, such as price, chip maker, Regional availability, amount of memory, amount of storage, and network bandwidth.
AWS chips are produced by our own Annapurna Labs team, a chip and software designer, which is a wholly owned subsidiary of Amazon. The Inferentia chip became generally available (GA) in December 2019, followed by Trainium GA in October 2022, and Inferentia2 GA in April 2023. In November 2023, AWS announced the next generation Trainium2 chip. By owning the supply and manufacturing chain, AWS is able to offer high levels of availability of its own chips. Availability by AWS Region is shown in the following table, with more Regions coming soon. Both Inferentia2 and Trainium use the same basic components, but with differing layouts, accounting for the different workloads they are designed to support. Both chips use two NeuronCore-v2 cores each, connected by a variable number of NeuronLink-v2 interconnects. The NeuronCores contain four engines: the first three are a ScalarEngine for scalar calculations, a VectorEngine for vector calculations, and a TensorEngine for matrix calculations. By analogy to an NVIDIA GPU, the first two are comparable to CUDA cores, and the latter is equivalent to Tensor cores. And finally, there is a C++ programmable GPSIMD-engine allowing for custom operations. The silicon architecture of the two chips is very similar, meaning that the same software can be used for both, minimizing changes on the user side, and this similarity can be mapped back to their two roles. In general, the learning phase of ML is typically bounded by bandwidth associated with moving large volumes of data to the chip and about the chip. The inference phase of ML is typically bounded by memory, not compute. To maximize absolute performance and price-performance, Trainium chips have twice as many NeuronLink-v2 interconnects as Inferentia2, and Trainium instances also contain more chips per instance than Inferentia2 instances. All these differences are implemented at the server level. AWS customers such as Databricks and Anthropic use these chips to train and run their ML models.
The following figures illustrate the chip-level schematic for the architectures of Inferentia2 and Trainium.

The following table shows the metadata of three of the largest accelerated compute instances.

Instance Name   | GPU NVIDIA H100 Chips | Trainium Chips | Inferentia Chips | vCPU Cores | Chip Memory (GiB) | Host Memory (GiB) | Instance Storage (TB) | Instance Bandwidth (Gbps) | EBS Bandwidth (Gbps) | PBA Chip Peer-to-Peer Bandwidth (GBps)
--------------- | --------------------- | -------------- | ---------------- | ---------- | ----------------- | ----------------- | --------------------- | ------------------------- | -------------------- | --------------------------------------
p5.48xlarge     | 8                     | 0              | 0                | 192        | 640               | 2048              | 8 x 3.84 SSD          | 3,200                     | 80                   | 900 NVSwitch
inf2.48xlarge   | 0                     | 0              | 12               | 192        | 384               | 768               | EBS only              | 100                       | 60                   | 192 NeuronLink-v2
trn1n.32xlarge  | 0                     | 16             | 0                | 128        | 512               | 512               | 4 x 1.9 SSD           | 1,600                     | 80                   | 768 NeuronLink-v2

The following table summarizes performance and cost.

Instance Name   | On-Demand Rate ($/hr) | 3Yr RI Rate ($/hr) | FP8 TFLOPS | FP16 TFLOPS | FP32 TFLOPS | $/TFLOPS (FP16, theoretical) | Source Reference
--------------- | --------------------- | ------------------ | ---------- | ----------- | ----------- | ---------------------------- | ----------------
p5.48xlarge     | 98.32                 | 43.18              | 16,000     | 8,000       | 8,000       | $5.40                        | URL
inf2.48xlarge   | 12.98                 | 5.19               | 2,280      | 2,280       | 570         | $2.28                        | URL
trn1n.32xlarge  | 24.78                 | 9.29               | 3,040      | 3,040       | 760         | $3.06                        | URL

The following table summarizes Region availability.

Instance Name   | Number of AWS Regions Supported In | AWS Regions Supported In                                                                                                                                          | Default Quota Limit
--------------- | ---------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------
p5.48xlarge     | 4                                  | us-east-2; us-east-1; us-west-2; eu-north-1                                                                                                                       | 0
inf2.48xlarge   | 13                                 | us-east-2; us-east-1; us-west-2; ap-south-1; ap-southeast-1; ap-southeast-2; ap-northeast-1; eu-central-1; eu-west-1; eu-west-2; eu-west-3; eu-north-1; sa-east-1 | 0
trn1n.32xlarge  | 3                                  | us-east-2; us-east-1; us-west-2; eu-north-1; ap-northeast-1; ap-south-1; ap-southeast-4                                                                           | 0

After a user has selected the EC2 instance type, it can then be combined with AWS services designed to support large-scale accelerated computing use cases, including high-bandwidth networking (Elastic Fabric Adapter), virtualization (AWS Nitro Enclaves), hyper-scale clustering (Amazon EC2 UltraClusters), low-latency storage (Amazon FSx for Lustre), and encryption (AWS Key Management Service), while noting not all services are available for all instances in all Regions.
The following figure shows an example of a large-scale deployment of P5 EC2 instances, including UltraCluster support for 20,000 H100 GPUs, with non-blocking petabit-scale networking and high-throughput, low-latency storage. Using the same architecture, UltraCluster supports Trainium scaling to over 60,000 chips.

In summary, we see two general trends in the hardware acceleration space. Firstly, improving price-performance to handle increasing data processing volumes and model sizes, coupled with a need to serve more users, more quickly, and at reduced cost. Secondly, improving security of the associated workloads by preventing unauthorized users from being able to access training data, code, or model weights.
Accelerator software
CPUs and GPUs are designed for different types of workloads. However, CPU workloads can run on GPUs, a process called general-purpose computing on graphics processing units (GPGPU). In order to run a CPU workload on a GPU, the work needs to be reformulated in terms of graphics primitives supported by the GPU. This reformulation can be carried out manually, though it is difficult programming, requiring writing code in a low-level language to map data to graphics, process it, and then map it back. Instead, it is commonly carried out by a GPGPU software framework, allowing the programmer to ignore the underlying graphical concepts, and enabling straightforward coding against the GPU using standard programming languages such as Python. Such frameworks are designed for sequential parallelism against GPUs (or other PBAs) without requiring concurrency or threads. Examples of GPGPU frameworks are the vendor-neutral open source OpenCL and the proprietary NVIDIA CUDA.
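As a minimal illustration of the GPGPU framework idea, the following sketch uses the CuPy library, which exposes NumPy-style arrays backed by CUDA; it assumes an NVIDIA GPU and the cupy package are available, and the array sizes are arbitrary:

import cupy as cp  # assumes the cupy package and an NVIDIA GPU

# NumPy-style code that the framework executes on the GPU, without writing kernels by hand.
x = cp.random.standard_normal(10_000_000, dtype=cp.float32)
y = cp.random.standard_normal(10_000_000, dtype=cp.float32)

z = 0.5 * x + 0.5 * y       # element-wise math runs as GPU kernels
print(float(z.mean()))      # reduction on the GPU; only the scalar result is copied back to the host
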
For the Amazon PBA chips Inferentia2 and Trainium, the SDK is AWS Neuron. This SDK enables development, profiling, and deployment of workloads onto these PBAs. Neuron has various native integrations with third-party ML frameworks like PyTorch, TensorFlow, and JAX. Additionally, Neuron includes a compiler, a runtime driver, and debug and profiling utilities. This toolset includes neuron-top for real-time visualization of NeuronCore and vCPU utilization, host and device memory usage, and a breakdown of memory allocation; the same information is available in JSON format through neuron-monitor, and neuron-ls provides device discovery and topology information. With Neuron, users can use inf2 and trn1n instances with a range of AWS compute services, such as Amazon SageMaker, Amazon Elastic Container Service, Amazon Elastic Kubernetes Service, AWS Batch, and AWS ParallelCluster. The usability, tooling, and integrations of the Neuron SDK have made Amazon PBAs extremely popular with users. For example, over 90% of the top 100 Hugging Face models (now over 100,000 AI models) now run on AWS using Optimum Neuron, which enables Hugging Face transformer models to run natively on Neuron. In summary, the Neuron SDK allows developers to easily parallelize ML algorithms, such as those commonly found in FSI. The following figure illustrates the Neuron software stack.

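As an illustrative sketch of the Neuron workflow, the following compiles a toy PyTorch model for NeuronCores with torch_neuronx; it assumes an inf2 or trn1 instance with the Neuron SDK installed, and the model itself is a placeholder rather than a realistic workload:

import torch
import torch_neuronx  # part of the AWS Neuron SDK; assumes an inf2 or trn1 instance

# Placeholder model; a real workload would load, for example, a Hugging Face transformer.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
).eval()
example_input = torch.rand(1, 128)

# Compile for NeuronCores; the traced artifact behaves like a TorchScript module.
neuron_model = torch_neuronx.trace(model, example_input)
print(neuron_model(example_input).shape)
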
The CUDA API and SDK were first released by NVIDIA in 2007. CUDA offers high-level parallel programming concepts that can be compiled to the GPU, giving direct access to the GPU's virtual instruction set and therefore the ability to specify thread-level parallelism. To achieve this, CUDA added one extension to the C language to let users declare functions that could run and compile on the GPU, and a lightweight way to call those functions. The core idea behind CUDA was to remove the barrier to entry for programmers coding against GPUs by allowing use of existing skills and tools as much as possible, while being more user friendly than OpenCL. The CUDA platform includes drivers, runtime kernels, compilers, libraries, and developer tools. This includes a wide and impressive range of ML libraries like cuDNN and NCCL. The CUDA platform is used through compiler directives and extensions to standard languages, such as the Python cuNumeric library. CUDA has been continuously optimized over the years, using its proprietary nature to improve performance on NVIDIA hardware relative to vendor-neutral solutions like OpenCL. Over time, the CUDA programming paradigm and stack have become deeply embedded in all aspects of the ML ecosystem, from academia to open source ML repositories.
To date, alternative GPU platforms to CUDA have not seen widespread adoption. There are three key reasons for this. Firstly, CUDA has had a decades-long head start, and benefits from the networking effect of its mature ecosystem, from organizational inertia of change, and from risk aversion to change. Secondly, migrating CUDA code to a different GPU platform can be technically difficult, given the complexity of the ML models typically being accelerated. Thirdly, CUDA has integrations with major third-party ML libraries, such as TensorFlow and PyTorch.
Despite the central role CUDA plays in the AI/ML community, there is a movement by users to diversify their accelerated workflows towards a Pythonic programming layer to make training more open. A number of such efforts are underway, including projects like Triton and OneAPI, and cloud service features such as Amazon SageMaker Neo. Triton is an open source project led by OpenAI that enables developers to use different acceleration hardware using entirely open source code. Triton uses an intermediate compiler to convert models written in supported frameworks into an intermediate representation that can then be lowered into highly optimized code for PBAs. Triton is therefore a hardware-agnostic convergence layer that hides chip differences.
Soon to be released is the AWS Neuron Kernel Interface (NKI). NKI is a Python-based programming environment for the Neuron compiler that adopts commonly used Triton-like syntax and tile-level semantics. NKI provides customization capabilities to fully optimize performance by enabling users to write custom kernels, bypassing almost all of the AWS compiler layers.
OneAPI is an open source project led by Intel for a unified API across different accelerators, including GPUs, other PBAs, and FPGAs. Intel believes that future competition in this space will happen for inference, unlike in the learning phase, where there is no software dependency. To this end, OneAPI toolkits support CUDA code migration, analysis, and debug tools. Other efforts are building on top of OneAPI; for example, the Unified Acceleration Foundation's (UXL) goal is a new open standard accelerator software ecosystem. UXL consortium members include Intel, Google, and ARM.
Amazon SageMaker is an AWS service providing an ML development environment, where the user can select chip type from the service’s fleet of Intel, AMD, NVIDIA, and AWS hardware, offering varied cost-performance-accuracy trade-offs. Amazon contributes to Apache TVM, an open source ML compiler framework for GPUs and PBAs, enabling computations on any hardware backend. SageMaker Neo uses Apache TVM to perform static optimizations on trained models for inference for any given hardware target. Looking to the future, the accelerator software field is likely to evolve; however, this may be slow to happen.
Accelerator supply-demand imbalances
It has been widely reported for the last few years that GPUs are in short supply. Such shortages have led to industry leaders speaking out. For example, Sam Altman said “We’re so short on GPUs the less people use our products the better… we don’t have enough GPUs,” and Elon Musk said “It seems like everyone and their dog is buying GPUs at this point.”
The factors leading to this have been high demand coupled with low supply. High demand has risen from a range of sectors, including crypto mining, gaming, generic data processing, and AI. Omdia Research estimates 49% of GPUs go to the hyper-clouds (such as AWS or Azure), 27% go to big tech (such as Meta and Tesla), 20% go to GPU clouds (such as Coreweave and Lambda) and 6% go to other companies (such as OpenAI and FSI firms). The State of AI Report gives the size and owners of the largest A100 clusters, the top few being Meta with 21,400, Tesla with 16,000, XTX with 10,000, and Stability AI with 5,408. GPU supply has been limited by factors including lack of manufacturing competition and ability at all levels in the supply chain, and restricted supply of base components such as rare metals and circuit boards. Additionally, rate of manufacturing is slow, with an H100 taking 6 months to make. Socio-political events have also caused delays and issues, such as a COVID backlog, and with inert gases for manufacturing coming from Russia. A final issue impacting supply is that chip makers strategically allocate their supply to meet their long-term business objectives, which may not always align with end-users’ needs.
Supported workloads
In order to benefit from hardware acceleration, a workload needs to be parallelizable. An entire branch of science is dedicated to parallelizable problems. In The Landscape of Parallel Computing Research, 13 fields (termed dwarfs) are found to be fundamentally parallelizable, including dense and sparse linear algebra, Monte Carlo methods, and graphical models. The authors also call out a series of fields they term "embarrassingly sequential" for which the opposite holds. In FSI, one of the main data structures dealt with is time series, a series of sequential observations. Many time series algorithms have the property that each subsequent observation depends on previous observations. This means only some time series workloads can be efficiently computed in parallel. A moving average, for example, is a computation that seems inherently sequential, but for which there is an efficient parallel algorithm. Sequential models, such as Recurrent Neural Networks (RNN) and Neural Ordinary Differential Equations, also have parallel implementations. In FSI, non-time series workloads are also underpinned by algorithms that can be parallelized. For example, Markowitz portfolio optimization requires the computationally intensive inversion of large covariance matrices, for which GPU implementations exist.
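To make the moving-average point above concrete, the following NumPy sketch reformulates the computation in terms of a cumulative sum, for which parallel prefix-scan algorithms and GPU implementations exist; the window length and data are illustrative only:

import numpy as np

def moving_average(x: np.ndarray, window: int) -> np.ndarray:
    # Rewritten in terms of a cumulative sum, which has well-known parallel (prefix-scan)
    # algorithms and GPU implementations, rather than a sequential rolling loop.
    c = np.concatenate(([0.0], np.cumsum(x)))
    return (c[window:] - c[:-window]) / window

prices = np.random.rand(1_000_000)   # illustrative data
print(moving_average(prices, 20)[:5])
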
In computer science, a number can be represented with different levels of precision, such as double precision (FP64), single precision (FP32), and half-precision (FP16). Different chips support different representations, and different representations are suitable for different use cases. The lower the precision, the less storage is required, and the faster the number is to process for a given amount of computational power. FP64 is used in HPC fields, such as the natural sciences and financial modeling, resulting in minimal rounding errors. FP32 provides a balance between accuracy and speed, is used in applications such as graphics, and is the standard for GPUs. FP16 is used in deep learning, where computational speed is valued and the lower precision won't drastically affect the model's performance. More recently, other number representations have been developed which aim to improve the balance between acceleration and precision, such as OCP Standard FP8, Google BFloat16, and Posits. An example of a mixed representation use case is the updating of model parameters by gradient descent, part of the backpropagation algorithm, as used in deep learning. Typically this is done using FP32 to reduce rounding errors; however, in order to reduce memory load, the parameters and gradients can be stored in FP16, meaning there is a conversion requirement. In this case, BFloat16 is a good choice because it prevents float overflow errors while keeping enough precision for the algorithm to work.
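The following NumPy sketch illustrates the overflow and precision behavior described above; the specific values are illustrative only:

import numpy as np

# FP16 has a narrow range: values above roughly 65504 overflow to infinity.
print(np.float16(70000.0))      # inf
print(np.float32(70000.0))      # 70000.0

# FP16 also has limited precision: a small update to a parameter can be lost entirely.
print(np.float16(1.0) + np.float16(1e-4))   # 1.0 -- the update vanishes
print(np.float32(1.0) + np.float32(1e-4))   # ~1.0001
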
As lower-precision workloads become more important, hardware and infrastructure trends are changing accordingly. For example, comparing the latest NVIDIA GB200 chip against the previous generation NVIDIA H100 chip, lower representation FP8 performance has increased 505%, but FP64 performance has only increased 265%. Likewise, in the forthcoming Trainium2 chip, the focus has been on lower-bit performance increases, giving a 400% performance increase over the previous generation. Looking to the future, we might expect to see a convergence between HPC and AI workloads, as AI starts to become increasingly important in solving what were traditionally HPC FP64 precision problems.
Accelerator benchmarking
When considering compute services, users benchmark measures such as price-performance, absolute performance, availability, latency, and throughput. Price-performance means how much compute can be done for $1, or what is the equivalent dollar cost for a given number of FP operations. For a perfect system, the price-performance ratio increases linearly as the size of a job scales up. A complicating factor when benchmarking compute grids on AWS is that EC2 instances come in a range of system parameters and a grid might contain more than one instance type, therefore systems are benchmarked at the grid level rather than on a more granular basis. Users often want to complete a job as quickly as possible and at the lowest cost; the constituent details of the system that achieves this aren’t as important.
A second benchmarking measure is absolute-performance, meaning how quickly can a given job be completed independent of price. Given linear scaling, job completion time can be reduced by simply adding more compute. However, it might be that the job isn’t infinitely divisible, and that only a single computational unit is required. In this case, the absolute performance of that computational unit is important. In an earlier section, we provided a table with one performance measure, the $/TFLOP ratio based on the chip specifications. However, as a rule of thumb, when such theoretical values are compared against experimental values, only around 45% is realized.
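As a rough, illustrative calculation only (using the on-demand rates and theoretical FP16 TFLOPS from the earlier table, and the approximately 45% realization rule of thumb above), the following sketch estimates an effective cost per TFLOP-hour; the realized fraction is an assumption, not a measured value:

# Rough illustration only: on-demand rates and theoretical FP16 TFLOPS from the table above,
# with an assumed ~45% realization factor (a rule of thumb, not a measured value).
instances = {
    "p5.48xlarge": (98.32, 8000),
    "inf2.48xlarge": (12.98, 2280),
    "trn1n.32xlarge": (24.78, 3040),
}
REALIZED_FRACTION = 0.45

for name, (rate_per_hr, theoretical_tflops) in instances.items():
    effective_tflops = theoretical_tflops * REALIZED_FRACTION
    print(f"{name}: ~{effective_tflops:,.0f} effective FP16 TFLOPS, "
          f"~${rate_per_hr / effective_tflops:.4f} per effective TFLOP-hour (on-demand)")
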
There are a few different ways to calculate price-performance. The first is to use a standard benchmark, such as LINPACK, HPL-MxP, or MFU (Model FLOPS Utilization). These can run a wide range of calculations that are representative of varying use cases, such as general use, HPC, and mixed HPC and AI workloads. From this, the TFLOP/s at a given FP precision for the system can be measured, along with the dollar-cost of running the system. However, it might be that the user has specific use cases in mind. In this case, the best data will come from price-performance data on a more representative benchmark.
There are various types of representative benchmark commonly seen. Firstly, the user can use real production data and applications with the hardware being benchmarked. This option gives the most reliable results, but can be difficult to achieve due to operational and compliance hurdles. Secondly, the user can replicate their existing use case with a synthetic data generator, avoiding the challenges of getting production data into new test systems. Thirdly, the user can employ a third-party benchmark for the use case, if one exists. For example, STAC is a company that coordinates an FSI community called the STAC Benchmark Council, which maintains a selection of accelerator benchmarks, including A2, A3, and ML and AI (LLM). A2 is designed for compute-intensive analytic workloads involved in pricing and risk management. Specifically, the A2 workload uses option price discovery by Monte Carlo estimation of Heston-based Greeks for a path-dependent, multi-asset option with early exercise. STAC members can access A2 benchmarking reports, for example for EC2 c5.metal with oneAPI. STAC-ML benchmarks the latency of NN inference, that is, the time from receiving new input data until the model output is computed. STAC-A3 benchmarks the backtesting of trading algorithms to determine how strategies would have performed on historical data. This benchmark supports accelerator parallelism to run many backtesting experiments simultaneously for the same security. For each benchmark, there exists a series of software packages (termed STAC Packs), which are accelerator-API specific. For some of the preceding benchmarks, STAC Packs are maintained by providers such as NVIDIA (CUDA) and Intel (oneAPI).
Some FSI market participants are performing in-house benchmarking at the microarchitecture level, in order to optimize performance as far as possible. Citadel has published microbenchmarks for NVIDIA GPU chips, dissecting the microarchitecture to achieve “bare-metal performance tuning,” noting that peak performance is inaccessible to software written in plain CUDA. Jane Street has looked at performance optimization through functional programming techniques, while PDT Partners has supported work on the Nixpkgs repository of ML packages using CUDA.
Some AWS customers have benchmarked the AWS PBAs against other EC2 instance types. ByteDance, the technology company that runs the video-sharing app TikTok, benchmarked Inf1 against a comparable EC2 GPU instance type. With Inf1, they were able to reduce their inference latency by 25%, and costs by 65%. In a second example, Inf2 is benchmarked against a comparable inference-optimized EC2 instance. The benchmark used is the RoBERTa-Base, a popular model used in natural language processing (NLP) applications, that uses the transformer architecture. In the following figure, on the x-axis we plotted throughput (the number of inferences that are completed in a set period of time), and on the y-axis we plotted latency (the time it takes the deep learning model to provide an output). The figure shows that Inf2 gives higher throughput and lower latency than the comparable EC2 instance type.

In a third benchmark example, Hugging Face benchmarked the trn1.32xlarge instance (16 Trainium chips) and two comparable EC2 instance types. For the first instance type, they ran fine-tuning for the BERT Large model on the full Yelp review dataset, using the BF16 data format with the maximum sequence length supported by the model (512). The benchmark results show the Trainium job is five times faster while being only 30% more expensive, resulting in a “huge improvement in cost-performance.” For the latter instance type, they ran three tests: language pretraining with GPT2, token classification with BERT Large, and image classification with the Vision Transformer. These results showed trn1 to be 2–5 times faster and 3–8 times cheaper than the comparable EC2 instance types.
FSI use cases
As with other industry sectors, there are two reasons why FSI uses acceleration. The first is to get a fixed result in the lowest time possible, for example parsing a dataset. The second is to get the best result in a fixed time, for example overnight parameter re-estimation. Use cases for acceleration exist across the FSI, including banking, capital markets, insurance, and payments. However, the most pressing demand comes from capital markets, because acceleration speeds up workloads and time is one of the easiest edges people can get in the financial markets. Put differently, a time advantage in financial services often equates to an informational advantage.
We begin by providing some definitions:

Parsing is the process of converting between data formats
Analytics is data processing using either deterministic or simple statistical methods
ML is the science of learning models from data, using a variety of different methods, and then making decisions and predictions
AI is an application able to solve problems using ML

In this section, we review some of the FSI use cases of PBAs. As many FSI activities can be parallelized, most of what is done in FSI can be sped up with PBAs. This includes most modeling, simulation, and optimization problems; currently in FSI, deep learning is only a small part of the landscape. We identify four classes of FSI use cases and look at applications in each class: parsing financial data, analytics on financial data, ML on financial data, and low-latency applications. To try and show how these classes relate to each other, the following figure shows a simplified representation of a typical capital market's workflow. In this figure, acceleration categories have been assigned to the workflow steps. However, in reality, every step in the process may be able to benefit from one or more of the defined acceleration categories.

Parsing
A typical capital markets workflow consists of receiving data and then parsing it into a usable form. This data is commonly market data, as output from a trading venue's matching engine, or onward from a market data vendor. Market participants who are receiving either live or historical data feeds need to ingest this data and perform one or more steps, such as parsing the message out of a binary protocol, rebuilding the limit order book (LOB), or combining multiple feeds into a single normalized format. Any of these parsing steps that run in parallel could be sped up relative to sequential processing. To give an idea of scale, the largest financial data feed is the consolidated US equity options feed, termed OPRA. This feed comes from 18 different trading venues, with 1.5 million contracts broadcast across 96 channels, with a supported peak message rate of 400 billion messages per day, equating to approximately 12 TB per day, or 3 PB per year. As well as maintaining real-time feeds, participants need to maintain a historical repository, sometimes spanning several years. Processing of historical repositories is done offline, but is often a source of major cost. Overall, a large consumer of market data, such as an investment bank, might consume 200 feeds from across public and private trading venues, vendors, and redistributors.
Any point in this data processing pipeline that can be parallelized, can potentially be sped up by acceleration. For example:

Trading venues broadcast on channels, which can be groupings of alphabetical tickers or products.
On a given channel, update messages for different tickers are broadcast sequentially. These can then be parsed out into unique streams per ticker.
For a given LOB, some events might be applicable to individual price levels independently.
Historical data is normally (but not always) independent inter-day, meaning that days can be parsed independently.

In GPU Accelerated Data Preparation for Limit Order Book Modeling, the authors describe a GPU pipeline handling data collection, LOB pre-processing, data normalization, and batching into training samples. The authors note their LOB pre-processing relies on the previous LOB state, and must be done sequentially. For LOB building, FPGAs seem to be used more commonly than GPUs because of the fixed nature of the workload; see examples from Xilinx and Algo-Logic. For example code for a build lab, using the AWS FPGA F1 instance type, refer to the following GitHub repo.
An important part of the data pipeline is the production of features, both online and offline. Features (also called alphas, signals, or predictors) are statistical representations of the data, which can then be used in downstream model building. A current trend in the FSI prediction space is the large-scale automation of dataset ingestion, curation, processing, feature extraction, feature combination, and model building. An example of this approach is given by WorldQuant, an algorithmic trading firm. The WSJ reports “a data group scours the globe for interesting and new data sets, including everything from detailed market pricing data to shipping statistics to footfall in stores captured by apps on smartphones”. WorldQuant states “in 2007 we had two data sets—today [2022] we have more than 1,400.” The general idea being if they could buy, consume, create, and web scrape more data than anyone else, they could create more alphas, and find more opportunities. Such an approach is based on performance being proportional to √N, where N is the number of alphas. Therefore, as long as an alpha is not perfectly correlated with another, there is value in adding it to the set. In 2010, WorldQuant was producing several thousand alphas per year, by 2016 had one million alphas, by 2022, had multiple millions, with a stated ambition to get to 100 million alphas. Although traditional quant finance mandates the importance of an economic rationale behind an alpha, the data-driven approach is led purely by the patterns in the data. After alphas have been produced, they can be intelligently merged together in a time-variant manner. Examples of signal combination methodologies which can benefit from PBA speed-up include Mean Variance Optimization and Bayesian Model Averaging. The same WSJ article states “No one alpha is important. Our edge is putting things together, it’s the implementation…. The idea is that with so many ‘alphas,’ even weak signals can be useful. If counting cars in parking lots next to big box retailers has only a tiny predictive power for those retailers’ stock prices, it can still be used to enhance a bigger prediction if combined with other weak signals. For example, an uptick in cars at Walmart parking lots—itself a relatively weak signal—could combine with similar trends captured by mobile phone apps and credit-card receipts harvested by companies that scan emails to create a more reliable prediction.” The automated process of data ingestion, processing, packaging, combination, and prediction is referred to by WorldQuant as their “alpha factory.”
From examples such as those we've discussed, it seems clear that parallelization, speed-up, and scale-up of such huge data pipelines is potentially an important differentiator. All the way through this pipeline, activities could be accelerated using PBAs. For example, for use at the signal combination phase, the Shapley value is a metric that can be used to compute the contribution of a given feature to a prediction. Shapley value computation has PBA-acceleration support in the Python XGBoost library.
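As a minimal sketch of the XGBoost support mentioned above, the following trains a small model on synthetic data and computes per-feature Shapley-style contributions with pred_contribs; the device="cuda" setting assumes XGBoost 2.0 or later with a GPU available, and the data is purely illustrative:

import numpy as np
import xgboost as xgb

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)

params = {
    "objective": "binary:logistic",
    "tree_method": "hist",
    "device": "cuda",   # assumes XGBoost >= 2.0 and an available GPU; drop for CPU-only runs
}
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(params, dtrain, num_boost_round=50)

# pred_contribs=True returns one Shapley-style contribution per feature, plus a bias column.
contribs = booster.predict(xgb.DMatrix(X[:5]), pred_contribs=True)
print(contribs.shape)   # (5, 6)
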
Analytics
In this section, we consider the applicability of accelerator parallelism to analytics workloads. One of the parallelizable dwarfs is Monte Carlo, and for FSI and time series work in general, this is an important method. Monte Carlo is a way to compute expected values by generating random scenarios and then averaging them. By using GPUs, a simulated path can be assigned to each thread, allowing simulation of thousands of paths in parallel.
Post the 2008 credit crunch, new regulations require banks to run credit valuation adjustment (CVA) calculations every 24 hours. CVA is an adjustment to a derivatives price as charged by a bank to a counterparty. CVA is one of a family of related valuation adjustments collectively known as xVA, which include debt valuation adjustment (DVA), initial margin valuation adjustment (MVA), capital valuation adjustment (KVA), and funding valuation adjustment (FVA). Because this adjustment calculation can happen over large portfolios of complex, non-linear instruments, closed-form analytical solutions aren’t possible, and as such an empirical approximation by a technique such as Monte Carlo is required. The downside of Monte Carlo here is how computationally demanding it is, due to the size of the search space. The advent of this new regulation coincided with the coming of age of GPUs, and as such banks commonly use GPU grids to run their xVA calculations. In XVA principles, nested Monte Carlo strategies, and GPU optimizations, the authors find a nested simulation time of about an hour for a billion scenarios on the bank portfolio, and a GPU speedup of 100 times faster relative to CPUs. Rather than develop xVA applications internally, banks often use third-party independent software vendor (ISV) solutions to run their xVA calculations, such as Murex M3 or S&P Global XVA. Banking customers can choose to run such ISV software as a service (SaaS) solutions inside their own AWS accounts, and often on AWS accelerated instances.
A second use of PBAs in FSI Monte Carlo is in option pricing, especially for exotic options whose payoff is sometimes too complex to solve in closed-form. The core idea is using a random number generator (RNG) to simulate the stochastic components in a formula and then average the results, leading to the expected value. The more paths that are simulated, the more accurate the result is. In Quasi-Monte Carlo methods for calculating derivatives sensitivities on the GPU, the authors find 200-times greater speedup over CPUs, and additionally develop a number of refinements to reduce variance, leading to fewer paths needing to be simulated. In High Performance Financial Simulation Using Randomized Quasi-Monte Carlo Methods, the authors survey quasi Monte Carlo sequences in GPU libraries and review commercial software tools to help migrate Monte Carlo pricing models to GPU. In GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model, the author computes a volatility measure using Hybrid Monte Carlo (HMC) applied to realized stochastic volatility (RSV), parallelized on a GPU, resulting in a 17-times faster speedup. Finally, in Derivatives Sensitivities Computation under Heston Model on GPU, the authors achieve a 200-times faster speedup; however, the accuracy of the GPU method is inferior for some Greeks relative to CPU.
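To illustrate the basic idea (not the variance-reduction or quasi-Monte Carlo refinements in the papers above), the following sketch prices a European call under geometric Brownian motion by simulating one terminal value per path; the parameter values are arbitrary, and swapping the NumPy import for CuPy would run the same vectorized code on a GPU:

import math
import numpy as np   # replacing this with "import cupy as np" runs the same vectorized code on a GPU

# Illustrative parameters for a European call under geometric Brownian motion.
S0, K, r, sigma, T = 100.0, 105.0, 0.03, 0.2, 1.0
n_paths = 1_000_000

z = np.random.standard_normal(n_paths)                 # one random draw per simulated path
ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * z)
payoff = np.maximum(ST - K, 0.0)
price = math.exp(-r * T) * float(payoff.mean())
print(f"Estimated call price: {price:.4f}")
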
A third use of PBAs in FSI Monte Carlo is in LOB simulations. We can categorize different types of LOB simulations: replay of the public historical data, replay of the mapped public-private historical data, replay of synthetic LOB data, and replay of a mix of historical and synthetic data to simulate the effects of a feedback loop. For each of these types of simulation, there are multiple ways in which hardware acceleration could occur. For example, for the simple replay case, each accelerator thread could have a different LOB. For the synthetic data case, each thread could have a different version of the same LOB, thereby allowing multiple realizations of a single LOB. In Limit Order Book Simulations: A Review, the authors provide their own simulator classification scheme based on the mathematical modeling technique used—point processes, agent based, deep learning, stochastic differential equations. In JAX-LOB: A GPU-Accelerated limit order book simulator to unlock large scale reinforcement learning for trading, the authors use GPU accelerated training, processing thousands of LOBs in parallel, giving a “notably reduced per message processing time.”
Machine learning
Generative AI is the most topical ML application at this point in time. Generative AI has four main applications: classification, prediction, understanding, and data generation, which in turn map to use cases such as customer experience, knowledge worker productivity, surfacing information and sentiment, and innovation and automation. FSI examples exist for all of these; however, a thorough review of these is beyond the scope of this post. For this post, we remain focused on PBA applicability and look at two of these topics: chatbots and time series prediction.
In 2017, the publication of the paper Attention is all you need resulted in a new wave of interest in ML. The transformer architecture presented in this paper allowed for a highly parallelizable network structure, meaning more data could be processed than before, allowing patterns to be better captured. This has driven impressive real-world performance, as seen in popular public foundation models (FMs) such as OpenAI ChatGPT and Anthropic Claude. These factors in turn have driven new demand for PBAs for training and inference on these models.
FMs, also termed LLMs, or chatbots when text-focused, are models typically trained on a broad spectrum of generalized and unlabeled data, and are capable of performing a wide variety of general tasks in FSI. One example is the Bridgewater Associates LLM-powered Investment Analyst Assistant, which generates charts, computes financial indicators, and summarizes results. FSI LLMs are reviewed in Large Language Models in Finance: A Survey and A Survey of Large Language Models for Financial Applications: Progress, Prospects and Challenges. FMs are often used as base models for developing more specialized downstream applications.
PBAs are used in three different types of FM training. The first is to train an FM from scratch. In BloombergGPT: A Large Language Model for Finance, the training dataset was 51% financial data from their systems and 49% public data, such as Wikipedia and the Pile. SageMaker was used to train and evaluate their FM, specifically 64 p4d.24xlarge instances, giving a total of 512 A100 GPUs. SageMaker model parallelism was also used, enabling the automatic distribution of the large model across multiple GPU devices and instances. The authors started with a compute budget of 1.3 million GPU hours, and noted that training took approximately 53 days.
The second training approach is to fine-tune an existing FM. This requires using an FM whose model parameters are exposed, and updating them in light of new data. This approach can be effective when the data corpus differs significantly from the FM training data. Fine-tuning is cheaper and quicker than training an FM from scratch, because the volume of data is likely to be much smaller. As with the larger-scale training from scratch, fine-tuning benefits significantly from hardware acceleration. In an FSI example, Efficient Continual Pre-training for Building Domain Specific Large Language Models, the authors fine-tune an FM and find that their approach outperforms standard continual pre-training performance with just 10% of the corpus size and cost, without any degradation on open-domain standard tasks.
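As an illustration of the mechanics only (not the approach used in the cited paper), the following sketch continues training an open-weights causal language model on a domain text file using the Hugging Face transformers and datasets libraries; the base model name and file path are placeholders, and fp16 assumes a GPU instance.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "gpt2"                                   # placeholder open-weights model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Placeholder domain corpus: one document per line in a plain text file.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                           per_device_train_batch_size=4, fp16=True),  # fp16 assumes a GPU
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()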
The third training approach is to perform Retrieval Augmented Generation (RAG). To equip FMs with up-to-date and proprietary information, organizations use RAG, a technique that fetches data from company data sources and enriches the prompt to provide more relevant and accurate responses. The two-step workflow consists of ingesting and vectorizing data, followed by runtime orchestration. Although hardware acceleration is less common in RAG applications, latency of search is a key component, and as such the inference step of RAG can be hardware optimized. For example, the performance of OpenSearch, a vectorized database available on AWS, can be improved by using PBAs, with both NVIDIA GPUs and Inferentia being supported.
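The runtime half of that workflow can be shown with a deliberately simplified sketch: embed the query, retrieve the nearest stored chunks by cosine similarity, and prepend them to the prompt. The embed() function below is a stand-in for whatever embedding model and vector store (for example, OpenSearch) the deployment actually uses.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model; returns a deterministic unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

documents = ["Q2 earnings rose 8% year on year.",
             "The risk committee meets on Thursdays.",
             "Settlement for bond trades is T+1."]
index = np.stack([embed(d) for d in documents])           # ingest and vectorize step

def retrieve(query: str, k: int = 2) -> list:
    scores = index @ embed(query)                          # cosine similarity (unit vectors)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

query = "When does settlement happen?"
context = "\n".join(retrieve(query))
prompt = f"Use the context to answer.\nContext:\n{context}\nQuestion: {query}"
# `prompt` is then sent to the FM; at scale, the retrieval step above is what
# benefits from hardware-accelerated vector search.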
For these three training approaches, the role of PBAs varies. For processing the huge data volumes of FM building, PBAs are essential. Then, as the training volumes reduce, so does the value-add role of the PBA. Independent of how the model has been trained, PBAs have a key role in LLM inference, again because they are optimized for memory bandwidth and parallelism. The specifics of how to optimally use an accelerator depend on the use case—for example, a paid-for-service chatbot might be latency sensitive, whereas for a free version, a delay of a few milliseconds might be acceptable. If a delay is acceptable, then batching the queries together could help make sure a given chip's processes are saturated, giving better dollar usage of the resource. Dollar costs are particularly important in inference, because unlike training, which is a one-time cost, inference is a recurring cost.
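The following is a minimal sketch of that batching idea: requests are buffered until either a maximum batch size or a maximum wait time is reached, trading a few milliseconds of latency for better accelerator utilization. The request queue and run_model() call are placeholders for a real serving stack.

import queue
import time

def run_model(batch):
    # Placeholder for the real batched inference call on the accelerator.
    return [f"response to: {p}" for p in batch]

def collect_batch(requests, max_batch=8, max_wait_s=0.005):
    """Block for the first request, then gather more until the size or time limit is hit."""
    batch = [requests.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

# Usage (illustrative): one accelerator call then serves the whole batch.
# responses = run_model(collect_batch(request_queue))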
Using ML for financial time series prediction is nothing new; a large body of public research on these methods and applications dates back to the 1970s and earlier, and for approximately the last decade PBAs have been applied to this field. As discussed earlier, most ML approaches can be accelerated with hardware; however, the attention-based architecture using the transformer model is currently the most topical. We consider three areas of FSI application: time series FMs, NNs for securities prediction, and reinforcement learning (RL).
The initial work on LLMs was conducted on text-based models. This was followed by multi-modal models, able to handle images and other data structures. Subsequent to this, publications have started to appear on time series FMs, including Amazon Chronos, Nixtla TimeGEN-1, and Google TimesFM. The behavior of the time series models appears to be similar to that of the language models. For example, in Scaling-laws for Large Time-series Models, the authors observe the models follow the same scaling laws. A review of these models is provided in Foundation Models for Time Series Analysis: A Tutorial and Survey. As with leading LLMs, time series FMs are likely to be successfully trained on large clusters of PBAs. In terms of size, GPT-3 was trained on a cluster of 10,000 V100s. The size of the GPT-4 training cluster is not public, but GPT-4 is speculated to have been trained on a cluster of 10,000–25,000 A100s. This is analogous in size to one algorithmic trading firm's statement, "our dedicated research cluster contains … 25,000 A/V100 GPUs (and growing fast)."
Looking to the future, one possible outcome might be that time series FMs, trained at huge expense by a few large corporates, become the base models for all financial prediction. Financial services firms then modify these FMs through additional training with private data or their own insights. Examples of private labeled data might be knowledge of which orders and executions in the public feed belonged to them, or similarly which (meta)orders and executions had parent-child relationships.
Although such financial time series FMs trained on PBA clusters may offer enhanced predictive capabilities, they also bring risks. For example, the EU's AI Act, adopted in March 2024, states that if a model has been trained with a total compute in excess of 10^25 FLOPs, then that model is considered to pose "systemic risk" and is subject to enhanced regulation, including fines of 3% of global turnover; on this basis, Meta announced in June 2024 that it will not enable some of its models inside Europe. This legislation assumes that training compute is a direct proxy for model capabilities. EpochAI provides an analysis of the training compute required for a wide range of FMs; for example, GPT-4 took an estimated 2.1×10^25 FLOPs to train (exceeding the threshold by a factor of 2.1), whereas BloombergGPT took 2.4×10^23 FLOPs (under the threshold by a factor of 0.02). It seems possible that in the future, similar legislation may apply to financial FMs, or even to the PBA clusters themselves, with some market participants choosing not to operate in legislative regimes that are subject to such risks.
Feature engineering plays a key role in building NN models, because features are fed into the NN model. As seen earlier in this post, some participants have generated large numbers of features. Examples of features derived from market time series data include bid-ask spreads, weighted mid-points, imbalance measures, decompositions, liquidity predictions, trends, change-points, and mean-reversions. Together, the features are called the feature space. A transformer assigns more importance to part of the input feature space, even though it might only be a small part of the data. Learning which part of the data is more important than another depends on the context of the features. The true power of FMs in time series prediction is the ability to capture these conditional probabilities (the context) across the feature space. To give a simple example, based on historical data, trends might reduce in strength as they go on, leading to a change-point, and then reversion to the mean. A transformer potentially offers the ability to recognize this pattern and capture the relationship between the features more accurately than other approaches. An informative visualization of this for the textual case is given by the FT article Generative AI exists because of the transformer. In order to build and train such FMs on PBAs, access to high-quality historical data tightly coupled with scalable compute to generate the features is an essential prerequisite.
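To make this concrete, the following pandas sketch derives three of the features named above (bid-ask spread, weighted mid-point, and top-of-book imbalance) from an illustrative top-of-book table; the column names and values are assumptions, not a real feed format.

import pandas as pd

# Illustrative top-of-book snapshots; real inputs would come from the historical LOB feed.
quotes = pd.DataFrame({
    "bid_px": [99.98, 99.99, 100.00],
    "ask_px": [100.02, 100.02, 100.03],
    "bid_sz": [500, 300, 800],
    "ask_sz": [400, 600, 200],
})

quotes["spread"] = quotes["ask_px"] - quotes["bid_px"]
# Size-weighted mid-point: leans toward the side with more resting liquidity.
quotes["weighted_mid"] = (
    (quotes["bid_px"] * quotes["ask_sz"] + quotes["ask_px"] * quotes["bid_sz"])
    / (quotes["bid_sz"] + quotes["ask_sz"])
)
# Top-of-book imbalance in [-1, 1]: positive when the bid side dominates.
quotes["imbalance"] = (quotes["bid_sz"] - quotes["ask_sz"]) / (quotes["bid_sz"] + quotes["ask_sz"])
print(quotes)

At production scale, these calculations are run across thousands of instruments and years of history, which is where accelerated, scalable compute earns its keep.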
Prior to the advent of the transformer, NNs had been applied to securities prediction with varying degrees of success. Deep Learning for Limit Order Books uses a cluster of 50 GPUs to predict the sign of the future return by mapping the price levels of the LOB to the visible input layer of a NN, resulting in a trinomial output layer. Conditional on the sign of the return, the magnitude of the return is estimated using regression. Deep Learning Financial Market Data uses raw LOB data pre-processed into discrete, fixed-length features for training a recurrent autoencoder, whose recurrent structure allows learning patterns on different time scales. Inference occurs by generating the decoded LOB and nearest-matching it to the real-time data.
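As a hedged illustration of that kind of architecture (not the cited authors' exact model), the following PyTorch sketch maps a flattened set of LOB price levels to a three-class (trinomial) output for the sign of the future return; the number of levels and layer sizes are assumptions.

import torch
import torch.nn as nn

n_levels = 20                                   # assumed number of LOB levels fed to the network
features_per_level = 4                          # e.g., bid/ask price and size at each level

model = nn.Sequential(
    nn.Linear(n_levels * features_per_level, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 3),                           # trinomial output: down / flat / up
)

x = torch.randn(32, n_levels * features_per_level)   # dummy batch of flattened LOB snapshots
logits = model(x)
probs = torch.softmax(logits, dim=-1)                # class probabilities for the return sign
print(probs.shape)                                   # torch.Size([32, 3])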
In Multi-Horizon Forecasting for Limit Order Books: Novel Deep Learning Approaches and Hardware Acceleration using Intelligent Processing Units, the authors benchmark the performance of Graphcore IPUs against an NVIDIA GPU on an encoder-decoder NN model. Because encoder-decoder models rely on recurrent neural layers, they generally suffer from slow training. The authors address this with the IPU, which they find offers a significant training speedup over the GPU, 694% on average, analogous to the speedup a transformer architecture would provide. In some examples of post-transformer work in this space, Generative AI for End-to-End Limit Order Book Modelling and A Generative Model Of A Limit Order Book Using Recurrent Neural Networks have trained LLM analogues on historical LOB data, interpreting each LOB event (such as insertions, cancellations, and executions) as a word and predicting the series of events following a given word history. However, the authors find the prediction horizon for LOB dynamics appears to be limited to a few tens of events, possibly because of the high dimensionality of the problem and the presence of long-range correlations in order sign. These results have been improved in the work “Microstructure Modes” — Disentangling the Joint Dynamics of Prices & Order Flow, by down-sampling the data and reducing its dimensionality, allowing identification of stable components.
RL is an ML technique where an algorithm interacts with a dynamic environment that provides feedback, allowing the algorithm to iteratively optimize a reward metric. Because RL closely mimics how human traders interact with the world, there are various areas of applicability in FSI. In JAX-LOB: A GPU-Accelerated limit order book simulator to unlock large scale reinforcement learning for trading, the authors use GPUs for end-to-end RL training; RL agent training on the GPU achieves a 7-times speedup relative to a CPU-based simulation implementation. The authors then apply this to the problem of optimal trade execution. A second FSI application of RL to optimal trade execution has been reported by JPMorgan in an algorithm called LOXM.
Latency-sensitive, real-time workloads
Being able to transmit, process, and act on data more quickly than others gives an informational advantage. In the financial markets, this is directly equivalent to being able to profit from trading. These real-time, latency-sensitive workloads exist on a spectrum, from the most sensitive to the least sensitive. The specific numbers in the following table are open to debate, but present the general idea.

Band | Latency | Application Examples
1 | Less than 1 microsecond | Low-latency trading strategy. Tick-to-trade.
2 | 1–4 microseconds | Feed handler. Raw or normalized format.
3 | 40 microseconds | Normalized format and symbology.
4 | 4–200 milliseconds | Consolidated feed. Full tick.
5 | 1 second to daily | Intraday and EOD. Reference, Corp, FI, derivatives.

The most latency-sensitive use cases are typically handled by FPGAs or custom ASICs. These react to incoming network traffic, such as market data, and put triggering logic directly into the network interface controller. Easily reprogrammable PBAs play little to no role in this most latency-sensitive work, because their SIMD architecture is designed for parallel processing of large amounts of data, with the bandwidth of getting data onto the chip as the bottleneck.
However, three factors may be driving change in the role hardware acceleration plays in the low-latency space. Firstly, as PBAs mature, some of their previous barriers are being reduced. For example, NVIDIA's new NVLink design enables significantly higher bandwidth relative to previous chip interconnects, meaning that data can get onto the chip far more quickly than before. Comparing the latest NVIDIA GB200 chip against the previous-generation NVIDIA H100 chip, NVLink bandwidth has increased 4-fold, from 900 GBps to 3.6 TBps.
Secondly, some observers believe the race for speed is shifting to a "race for intelligence." With approximately only ten major firms competing in the top-tier low-latency space, the barrier to entry seems almost insurmountable for other parties. At some point, low-latency hardware and techniques might slowly diffuse through technology supplier offerings, eventually leveling the playing field, perhaps having been driven by new regulations.
Thirdly, although FPGAs and ASICs undoubtedly provide the fastest performance, they come at the cost of being a drain on resources. Their developers are hard to hire, the work has long deployment cycles, and it results in a significant maintenance burden, with bugs that are difficult to diagnose and triage. Firms are keen to identify alternatives.
Although the most latency-sensitive work will remain on FPGA/ASIC, there may be a shift of less latency-sensitive work from FPGA/ASIC to GPUs and other PBAs as users weigh the trade-off between speed and other factors. In comparison, easily reprogrammable PBA processors are now simple to hire for, are straightforward to code against and maintain, and allow for relatively rapid innovation. Looking to the future, we may see innovation at the language level, for example, through functional programming with array-languages such as the Co-dfns project, as well as further innovation at the hardware level, with future chips tightly integrating the best components of today’s FPGAs, GPUs and CPUs.
Key Takeaways
In this section, we present three key takeaways. Firstly, the global supply-demand ratio for GPUs is low, meaning prices can be high and availability low. This can be a constraining factor for end-user businesses wanting to innovate in this space. AWS helps address this on behalf of its customers in three ways:

Through economies of scale, AWS is able to offer significant availability of the PBAs, including GPUs.
Through in-house research and development, AWS is able to offer its own PBAs, developed and manufactured in-house, which are not subject to the constraints of the wider market, while also having optimized price-performance.
AWS innovates at the software level to improve allocation to the end-user. Therefore, although total capacity might be fixed, by using intelligent allocation algorithms, AWS is better able to meet customers’ needs. For example, Amazon EC2 Capacity Blocks for ML enables guaranteed access to the required PBAs at the point in time they are needed.

The second takeaway is that proprietary software can lock users in to a single supplier and end up acting as a barrier to innovation. In the case of PBAs, chips that rely on proprietary software make it difficult for users to move between chip manufacturers, whereas open source software supports multiple chip manufacturers. Any future supply constraints, such as regional armed conflict, could further exacerbate existing supply-demand imbalances. Although migrating existing legacy workloads from an acceleration chip with proprietary software can be challenging, new greenfield workloads can be built on open source libraries without difficulty. In the FSI space, examples of legacy workloads might include risk calculations, and examples of greenfield workloads might include time series prediction using FMs. In the long term, business leaders need to consider and formulate their strategy for moving away from software lock-in, to enable access to wider acceleration hardware offerings, with the cost benefits that can bring.
The final takeaway is that financial services, and the subsection of capital markets in particular, is subject to constant and evolving competitive pressures. Over time, the industry has seen the race for differentiation move from data access rights, to latency, and now to an increased focus on predictive power. Looking to the future, if the world of financial prediction is based in part on a small number of expensive and complex FMs built and trained by a few large global corporates, where will the differentiation come from? Speculative areas could range from at-scale feature engineering to being able to better handle increased regulatory burdens. Whichever field it comes from, it is certain to include data processing and analytics at its core, and therefore benefit from hardware acceleration.
Conclusion
This post aimed to provide business leaders with a non-technical overview of PBAs and their role within the FSI. With this technology currently being regularly discussed in the mainstream media, it is essential business leaders understand the basis of this technology and its potential future role. Nearly every organization is now looking to a data-centric future, enabled by cloud-based infrastructure and real-time analytics, to support revenue-generating AI and ML use cases. One of the ways organizations will be differentiated in this race will be by making the right strategic decisions about technologies, partners, and approaches. This includes topics such as open source versus closed source, build versus buy, tool complexity and associated ease of use, hiring and retention challenges, and price-performance. Such topics are not just technology decisions within a business, but also cultural and strategic ones.
Business leaders are encouraged to reach out to their AWS point of contact and ask how AWS can help their business win in the long term using PBAs. This might result in a range of outcomes, from a short proof of concept against an existing well-defined business problem, to a written strategy document that can be consumed and debated by peers, to onsite technical workshops and business briefing days. Whatever the outcome, the future of this space is sure to be exciting!
Acknowledgements
I would like to thank the following parties for their kind input and guidance in writing this post: Andrea Rodolico, Alex Kimber, and Shruti Koparkar. Any errors are mine alone.

About the Author
Dr. Hugh Christensen works at Amazon Web Services with a specialization in data analytics. He holds undergraduate and master’s degrees from Oxford University, the latter in computational biophysics, and a PhD in Bayesian inference from Cambridge University. Hugh’s areas of interest include time series data, data strategy, data leadership, and using analytics to drive revenue generation. You can connect with Hugh on LinkedIn.

Anomaly detection in streaming time series data with online learning u …

Time series data is a distinct category that incorporates time as a fundamental element in its structure. In a time series, data points are collected sequentially, often at regular intervals, and they typically exhibit certain patterns, such as trends, seasonal variations, or cyclical behaviors. Common examples of time series data include sales revenue, system performance data (such as CPU utilization and memory usage), credit card transactions, sensor readings, and user activity analytics.
Time series anomaly detection is the process of identifying unexpected or unusual patterns in data that unfold over time. An anomaly, also known as an outlier, occurs when a data point deviates significantly from an expected pattern.
For some time series, like those with well-defined expected ranges such as machine operating temperatures or CPU usage, a threshold-based approach might suffice. However, in areas like fraud detection and sales, where simple rules fall short due to their inability to catch anomalies across complex relationships, more sophisticated techniques are required to identify unexpected occurrences.
In this post, we demonstrate how to build a robust real-time anomaly detection solution for streaming time series data using Amazon Managed Service for Apache Flink and other AWS managed services.
Solution overview
The following diagram illustrates the core architecture of the Anomaly Detection Stack solution.

This solution employs machine learning (ML) for anomaly detection, and doesn’t require users to have prior AI expertise. It offers an AWS CloudFormation template for straightforward deployment in an AWS account. With the CloudFormation template, you can deploy an application stack with the necessary AWS resources required for detecting anomalies. Setting up one stack creates an application with one anomaly detection task or detector. You can set up multiple such stacks to run them simultaneously, with each one analyzing the data and reporting back the anomalies.
The application, once deployed, constructs an ML model using the Random Cut Forest (RCF) algorithm. It initially sources input time series data from Amazon Managed Streaming for Apache Kafka (Amazon MSK) using this live stream for model training. Post-training, the model continues to process incoming data points from the stream. It evaluates these points against the historical trends of the corresponding time series. The model also generates an initial raw anomaly score while processing and maintains an internal threshold to eliminate noisy data points. Subsequently, the model generates a normalized anomaly score for each data point that the model treats as an anomaly. These scores, ranging from 0–100, indicate the deviation from typical patterns; scores closer to 100 signify higher anomaly levels. You have the flexibility to set a custom threshold on these anomaly scores, allowing you to define what you consider anomalous.
This solution uses a CloudFormation template, which takes inputs such as MSK broker endpoint and topics, AWS Identity and Access Management (IAM) roles, and other parameters related to virtual private cloud (VPC) configuration. The template creates the essential resources like the Apache Flink application and Amazon SageMaker real-time endpoint in the customer account.
To request access to this solution, send an email to anomalydetection-support-canvas@amazon.com.
In this post, we outline how you can build an end-to-end solution with the Anomaly Detection Stack. Consider a hypothetical sales scenario where AnyBooks, an on-campus bookstore at a large university, sells various supplies to college students. Due to the timing of class schedules, their seasonality is such that they sell around 20 Item-A units and 30 Item-B units during even hours, and approximately half that during odd hours throughout the day. Recently, there have been some unexplained spikes in the quantity of items sold, and the management team wants to start tracking these quantity anomalies so that they can better plan their staffing and inventory levels.
The following diagram shows the detailed architecture for the end-to-end solution.

In the following sections, we discuss each layer shown in the preceding diagram.
Ingestion
In the ingestion layer, an AWS Lambda function retrieves sales transactions for the current minute from a PostgreSQL transactional database, transforms each record into a JSON message, and publishes it to an input Kafka topic. This Lambda function is configured to run every minute using Amazon EventBridge Scheduler.
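The following is a minimal sketch of such an ingestion function, assuming the psycopg2 and kafka-python client libraries; the table name, columns, topic, and connection settings are placeholders, and authentication configuration for Amazon MSK is omitted for brevity.

import json
import os
import psycopg2
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=os.environ["MSK_BOOTSTRAP_SERVERS"],
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)

def lambda_handler(event, context):
    conn = psycopg2.connect(os.environ["PG_CONNECTION_STRING"])
    with conn, conn.cursor() as cur:
        # Placeholder query: sales rows recorded in the last minute.
        cur.execute(
            "SELECT product_name, quantity, order_ts FROM sales "
            "WHERE order_ts >= now() - interval '1 minute'"
        )
        for product_name, quantity, order_ts in cur.fetchall():
            producer.send(os.environ["INPUT_TOPIC"],
                          {"product_name": product_name,
                           "quantity": quantity,
                           "timestamp": order_ts})
    producer.flush()
    return {"statusCode": 200}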
Anomaly detection stack
The Flink application reads raw data from the input MSK topic, trains the model, and begins detecting anomalies, ultimately recording them to the MSK output topic. The following is the output results JSON:

{
  "detectorName": "canvas-ad-blog-demo-1",
  "measure": "quantity",
  "timeseriesId": "f3c7f14e7a445b79a3a9877dfa02064d56533cc29fb0891945da4512c103e893",
  "anomalyDecisionThreshold": 70,
  "dimensionList": [{"name": "product_name", "value": "item-A"}],
  "aggregatedMeasureValue": 14.0,
  "anomalyScore": 0.0,
  "detectionPeriodStartTime": "2024-08-29 13:35:00",
  "detectionPeriodEndTime": "2024-08-29 13:36:00",
  "processedDataPoints": 1261,
  "anomalyConfidenceScore": 80.4674989791107,
  "anomalyDecision": 0,
  "modelStage": "INFERENCE",
  "expectedValue": 0.0
}

The following is a brief explanation of the output fields:

measure – This represents the metric we are tracking for anomalies. In our case, the measure field is the quantity of sales for Item-A.
aggregatedMeasureValue – This represents the aggregated value of quantity in the time window.
timeseriesId – This unique identifier corresponds to a combination of unique values for the dimensions and the metric. In this scenario, it's the product name, Item-A, within the product_name dimension.
anomalyConfidenceScore – As the model evolves through learning and inference, this confidence score will progressively improve.
anomalyScore – This field represents the score for anomaly detection. With the anomalyDecisionThreshold set at 70, any value exceeding 70 is considered a potential anomaly.
modelStage – When the model is in the learning phase, the anomalyScore is 0.0 and the value of this field is set to LEARNING. After the learning is complete, the value of this field changes to INFERENCE.
anomalyDecisionThreshold – The decision threshold is provided as input in the CloudFormation stack. If you determine there are too many false positives, you can increase this threshold to change the sensitivity.
anomalyDecision – If the anomalyScore exceeds the anomalyDecisionThreshold, this field is set to 1, indicating an anomaly is detected.

Transform
In the transformation layer, an Amazon Data Firehose stream is configured to consume data from the output Kafka topic and invoke a Lambda function for transformation. The Lambda function flattens the nested JSON data from the Kafka topic. The transformed results are then partitioned by date and stored in an Amazon Simple Storage Service (Amazon S3) bucket in Parquet format. An AWS Glue crawler is used to crawl the data in the Amazon S3 location and catalog it in the AWS Glue Data Catalog, making it ready for querying and analysis.
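The following is a hedged sketch of that transformation function, using the standard Firehose record contract (base64-encoded data in and out); the exact fields flattened by the real solution may differ.

import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Flatten the nested dimensionList into top-level columns.
        flat = {k: v for k, v in payload.items() if k != "dimensionList"}
        for dim in payload.get("dimensionList", []):
            flat[dim["name"]] = dim["value"]
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode((json.dumps(flat) + "\n").encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}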
Visualize
To visualize the data, we’ve created an Amazon QuickSight dashboard that connects to the data in Amazon S3 through the Data Catalog and queries it using Amazon Athena. The dashboard can be refreshed to display the latest detected anomalies, as shown in the following screenshot.

In this example, the darker blue line in the line graph represents the seasonality of the quantity measure for Item-A over time, showing higher values during even hours and lower values during odd hours. The pink line represents the anomaly detection score, plotted on the right Y-axis. The anomaly score approaches 100 when the quantity value significantly deviates from its seasonal pattern. The blue line represents the anomaly threshold, set at 70. When anomalyScore exceeds this threshold, anomalyDecision is set to 1.
The “Number of Timeseries Tracked” KPI displays how many time series the model is currently monitoring. In this case, because we’re tracking two products (Item-A and Item-B), the count is 2. The “Number of Datapoints Processed” KPI shows the total number of data points the model has processed, and the “Anomaly Confidence Score” indicates the confidence level in predicting anomalies. Initially, this score is low, but will approach 100 as the model matures over time.
Notification
Although visualization is valuable for investigating anomalies, data analysts often prefer to receive near real-time notifications for critical anomalies. This is achieved by adding a Lambda function that reads results from the output Kafka topic and analyzes them. If the anomalyScore value exceeds the defined threshold, the function invokes an Amazon Simple Notification Service (Amazon SNS) topic to send email or SMS notifications to a designated list, alerting the team about the anomaly in near real time.
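The following is a minimal sketch of that notification function, assuming the Lambda is triggered by an Amazon MSK event source mapping; the SNS topic ARN and threshold are supplied through placeholder environment variables.

import base64
import json
import os
import boto3

sns = boto3.client("sns")
THRESHOLD = float(os.environ.get("ANOMALY_SCORE_THRESHOLD", "70"))

def lambda_handler(event, context):
    # MSK event source mappings deliver base64-encoded record values grouped by topic-partition.
    for records in event["records"].values():
        for record in records:
            result = json.loads(base64.b64decode(record["value"]))
            if result.get("anomalyScore", 0) > THRESHOLD:
                sns.publish(
                    TopicArn=os.environ["SNS_TOPIC_ARN"],
                    Subject=f"Anomaly detected for {result['measure']}",
                    Message=json.dumps(result, indent=2),
                )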
Conclusion
This post demonstrated how to build a robust real-time anomaly detection solution for streaming time series data using Managed Service for Apache Flink and other AWS services. We walked through an end-to-end architecture that ingests data from a source database, passes it through an Apache Flink application that trains an ML model and detects anomalies, and then lands the anomaly data in an S3 data lake. The anomaly scores and decisions are visualized through a QuickSight dashboard connected to the Amazon S3 data using AWS Glue and Athena. Additionally, a Lambda function analyzes the results and sends notifications in near real time.
With AWS managed services like Amazon MSK, Data Firehose, Lambda, and SageMaker, you can rapidly deploy and scale this anomaly detection solution for your own time series use cases. This allows you to automatically identify unexpected behaviors or patterns in your data streams in real time without manual rules or thresholds.
Give this solution a try, and explore how real-time anomaly detection on AWS can unlock insights and optimize operations across your business!

About the Authors
Noah Soprala is a Solutions Architect based out of Dallas. He is a trusted advisor to his customers and helps them build innovative solutions using AWS technologies. Noah has over 20 years of experience in consulting, development, and solution architecture and delivery.
Dan Sinnreich is a Sr. Product Manager for Amazon SageMaker, focused on expanding no-code / low-code services. He is dedicated to making ML and generative AI more accessible and applying them to solve challenging problems. Outside of work, he can be found playing hockey, scuba diving, and reading science fiction.
Syed Furqhan is a Senior Software Engineer for AI and ML at AWS. He was part of many AWS service launches, such as Amazon Lookout for Metrics, Amazon SageMaker, and Amazon Bedrock. Currently, he is focusing on generative AI initiatives as part of Amazon Bedrock Core Systems. He is a clean code advocate and a subject-matter expert on serverless and event-driven architecture. You can follow him on LinkedIn: syedfurqhan.
Nirmal Kumar is Sr. Product Manager for the Amazon SageMaker service. Committed to broadening access to AI/ML, he steers the development of no-code and low-code ML solutions. Outside work, he enjoys travelling and reading non-fiction.

AtScale Open-Sourced Semantic Modeling Language (SML): Transforming An …

AtScale has made a significant move by announcing the open-source release of its Semantic Modeling Language (SML). This initiative aims to provide an industry-standard semantic modeling language that can be adopted across various platforms, fostering greater collaboration and interoperability in the analytics community. The introduction of SML marks a major step in the company’s decade-long journey of democratizing data analytics and advancing semantic layer technology. 

AtScale’s journey began with a vision to create a business-friendly interface for users to interact with data. This led to the creation of an independent semantic layer that sits on top of technical data platforms, enabling business users to query data in terms they understand. Since its inception, AtScale has been committed to advancing the concept of a universal semantic layer that can operate across different analytics tools and data platforms, making it easier for business users to derive insights without deep technical knowledge.

The Need for an Open Standard

Semantic layers are vital to modern analytics platforms, bridging the gap between raw data and business insights. When AtScale was founded in 2013, no other vendors offered semantic layer platforms. However, the industry has seen a proliferation of semantic layer platforms from various vendors over the past decade. With this growing diversity of tools, a need for a unified, standard language for semantic modeling emerged.

AtScale has now open-sourced SML. The company aims to promote model portability, enabling users to build semantic models that can be shared across platforms. A key motivation behind this move is to foster a community where model builders can create and share a library of reusable semantic models that can be plugged into any platform. This will lead to time savings for users, allowing them to consume business data with minimal technical configuration.

What SML Offers

SML results from more than a decade of hands-on development. It is designed to handle complex, multidimensional data across various industries like finance, healthcare, retail, manufacturing, and more. The language supports metrics, dimensions, hierarchies, and semi-additive measures, crucial for building sophisticated analytics models.

SML offers several benefits to developers and organizations:

Object-Oriented Structure: SML is designed to be object-oriented, so its semantic objects can be reused across different models, promoting consistency and efficiency in model building.

Comprehensive Scope: It is a superset of existing semantic modeling languages, incorporating more than a decade’s experience and use cases across different verticals. This makes SML versatile enough to cater to a wide range of applications.

Familiar Syntax: SML is built on YAML, a widely adopted, human-readable syntax, making it easier for developers to adopt the language without steep learning curves.

CI/CD Friendly: Being code-based, SML integrates well with modern software development practices, including Git for version control, and supports continuous integration and continuous deployment (CI/CD) workflows.

Extensibility and Open Access: SML is Apache open-sourced, which means it is free to use and can be extended by the community. This open nature allows for innovation and collaboration, ensuring the language evolves to meet new demands.

What Is Being Open-Sourced

AtScale is making several components available as part of its open-source initiative:

SML Language Specification: This includes tabular and multidimensional constructs, providing a comprehensive framework for model building.

Pre-built Semantic Models: These models, available on GitHub, cover standard data schemas such as TPC-DS and other common training models. AtScale plans to release models for popular SaaS applications like Salesforce and Jira.

Helper Classes and Translators (coming soon): These will include programmatic tools to facilitate the reading and writing of SML syntax and translators for migrating models from other semantic languages, such as those used by dbt Labs and Power BI.

AtScale’s decision to open-source SML represents a significant step towards fostering greater collaboration in the analytics industry. By creating a standard semantic modeling language, the company hopes to accelerate the adoption of semantic layers and promote the development of reusable, interoperable models. With the introduction of SML, AtScale is positioning itself at the forefront of the movement to standardize business logic expression and facilitate seamless data and analytics interoperability across platforms.

In conclusion, the open sourcing of SML underscores AtScale’s commitment to democratizing analytics and building a vibrant community around semantic modeling. As more organizations adopt the standard, the hope is that it will spur innovation and make analytics more accessible and efficient for all industry stakeholders.

Check out the Details and GitHub.

NVIDIA Researchers Introduce Order-Preserving Retrieval-Augmented Gene …

Retrieval-augmented generation (RAG), a technique that enhances the efficiency of large language models (LLMs) in handling extensive amounts of text, is critical in natural language processing, particularly in applications such as question-answering, where maintaining the context of information is crucial for generating accurate responses. As language models evolve, researchers strive to push the boundaries by improving how these models process and retrieve relevant information from large-scale textual data.

One main problem with existing LLMs is their difficulty in managing long contexts. As the context length increases, the models struggle to maintain a clear focus on relevant information, which can lead to a significant drop in the quality of their answers. This issue is particularly pronounced in question-answering tasks, where precision is paramount. The models tend to get overwhelmed by the sheer volume of information, which can cause them to retrieve irrelevant data, diluting the answers’ accuracy.

In recent developments, LLMs like GPT-4 and Gemini have been designed to handle much longer text sequences, with some models supporting up to 1 million tokens in context. However, these advancements come with their own set of challenges. While long-context LLMs can theoretically handle larger inputs, they often introduce unnecessary or irrelevant chunks of information into the process, resulting in a lower precision rate. Thus, researchers are still seeking better solutions to effectively manage long contexts while maintaining answer quality and efficiently using computational resources.

Researchers from NVIDIA, based in Santa Clara, California, proposed an order-preserve retrieval-augmented generation (OP-RAG) approach to address these challenges. OP-RAG offers a substantial improvement over the traditional RAG methods by preserving the order of the text chunks retrieved for processing. Unlike existing RAG systems, which prioritize chunks based on relevance scores, the OP-RAG mechanism retains the original sequence of the text, ensuring that context and coherence are maintained throughout the retrieval process. This advancement allows for a more structured retrieval of relevant information, avoiding the pitfalls of traditional RAG systems that might retrieve highly relevant but out-of-context data.

The OP-RAG method introduces an innovative mechanism that restructures how information is processed. First, the large-scale text is split into smaller, sequential chunks. These chunks are then evaluated based on their relevance to the query. Instead of ranking them solely by relevance, OP-RAG ensures that the chunks are kept in their original order as they appeared in the source document. This sequential preservation helps the model focus on retrieving the most contextually relevant data without introducing irrelevant distractions. The researchers demonstrated that this approach significantly enhances answer generation quality, particularly in long-context scenarios, where maintaining coherence is essential.
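The following simplified sketch (not the authors' code) illustrates the retrieval step: chunks are ranked by a relevance score, the top-k are kept, and the selected chunks are concatenated in their original document order rather than by score. The score() function here is a token-overlap stand-in for the real relevance model.

def score(chunk: str, query: str) -> float:
    """Stand-in relevance score: token overlap between chunk and query."""
    c, q = set(chunk.lower().split()), set(query.lower().split())
    return len(c & q) / (len(q) or 1)

def op_rag_retrieve(document: str, query: str, chunk_size: int = 128, k: int = 4) -> str:
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

    # Rank chunk indices by relevance and keep the top-k...
    top = sorted(range(len(chunks)), key=lambda i: score(chunks[i], query), reverse=True)[:k]
    # ...but concatenate them by original position, not by score.
    return "\n".join(chunks[i] for i in sorted(top))

# The returned context is then passed to the LLM together with the query; the
# sorted(top) step is what distinguishes order-preserving retrieval from
# relevance-ordered retrieval in this sketch.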

The performance of the OP-RAG method was thoroughly tested against other leading models. The researchers from NVIDIA conducted experiments using public datasets, such as the EN.QA and EN.MC benchmarks from ∞Bench. Their results showed a marked improvement in both precision and efficiency compared to traditional long-context LLMs without RAG. For example, in the EN.QA dataset, which contains an average of 150,374 words per context, OP-RAG achieved a peak F1 score of 47.25 when using 48K tokens as input, a significant improvement over models like GPT-4O. Similarly, on the EN.MC dataset, OP-RAG outperformed other models by a considerable margin, achieving an accuracy of 88.65 with only 24K tokens, whereas the traditional Llama3.1 model without RAG could only attain 71.62 accuracy using 117K tokens.

Further comparisons showed that OP-RAG improved the quality of the generated answers and dramatically reduced the number of tokens needed, making the model more efficient. Traditional long-context LLMs, such as GPT-4O and Gemini-1.5-Pro, required nearly double the number of tokens compared to OP-RAG to achieve lower performance scores. This efficiency is particularly valuable in real-world applications, where computational costs and resource allocation are critical factors in deploying large-scale language models.

In conclusion, OP-RAG presents a significant breakthrough in the field of retrieval-augmented generation, offering a solution to the limitations of long-context LLMs. By preserving the order of the retrieved text chunks, the method allows for more coherent and contextually relevant answer generation, even in large-scale question-answering tasks. The researchers at NVIDIA have shown that this innovative approach outperforms existing methods in terms of quality and efficiency, making it a promising solution for future advancements in natural language processing.

Check out the Paper.