Researchers from MIT Developed a Machine Learning Technique that Enables Deep-Learning Models to Efficiently Adapt to New Sensor Data Directly on an Edge Device

With the rapid advancement of technology, edge devices have become an essential part of our everyday lives, seamlessly integrating into our networked world. These widely used devices produce an unprecedented amount of data at the edge of our networks.

The demand for smart, customized, and private AI is increasing because a single model cannot meet the diverse requirements of different users. And although edge devices routinely run deep learning inference, the training of deep neural networks still usually happens on powerful cloud GPU servers.

However, existing training frameworks are designed for powerful cloud servers with accelerators and must be optimized before they can enable effective learning on edge devices.

Customized deep learning models could enable an AI chatbot to adapt to a user’s accent or a smart keyboard to continuously improve its word predictions based on previous typing activity.

Because smartphones and other edge devices frequently lack the memory and processing power required for fine-tuning, user data is typically sent to cloud servers, which have the resources to carry out the demanding task of updating the model.

Consequently, researchers at MIT and elsewhere have developed PockEngine, a technique that allows deep-learning models to adapt efficiently to fresh sensor data directly on an edge device. PockEngine stores and computes only the precise portions of a large machine-learning model that need to be updated to improve accuracy.

Most of these calculations are completed while the model is being prepared, before runtime, which reduces computational overhead and speeds up fine-tuning. In the researchers’ experiments, PockEngine accelerated on-device training by up to 15 times on some hardware platforms without any loss of model accuracy, and their fine-tuning technique allowed a well-known AI chatbot to answer challenging questions more accurately.

PockEngine also integrates an extensive set of training graph optimizations that further accelerate the training process.

The benefits of on-device fine-tuning include enhanced privacy, lower costs, customization, and lifelong learning, but edge devices’ limited resources make the process difficult to carry out.

The researchers explain that PockEngine generates a backpropagation graph while the model is being compiled and prepared for deployment. It prunes redundant sections of layers, producing a streamlined graph that is used at runtime, and then applies additional optimizations to improve efficiency.
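
The core idea, computing and storing gradients for only part of the network, can be loosely sketched in PyTorch by freezing the parameters that are not selected for updating, so no gradients or optimizer state are kept for them. This is only an illustration of the general principle with a stand-in model, not MIT's compile-time implementation:

import torch
from torch import nn

# Stand-in for a pre-trained network (hypothetical architecture).
model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Freeze everything, then re-enable gradients only for the parts judged worth
# fine-tuning. Frozen tensors need no gradient storage and are skipped in backward.
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():  # e.g., only the final layer
    p.requires_grad = True

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
loss = model(torch.randn(4, 128)).sum()
loss.backward()   # gradients are computed only for the unfrozen parameters
optimizer.step()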

This method is especially useful for models that require many examples to fine-tune, such as the large language model Llama-V2, to which the researchers applied it. PockEngine fine-tunes each layer separately on a given task, one at a time, and measures the accuracy improvement each layer yields. By weighing these per-layer contributions against the corresponding fine-tuning cost, PockEngine automatically determines what percentage of each layer needs to be fine-tuned.
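
As a rough sketch of that layer-by-layer analysis (illustrative Python only, assuming user-supplied fine-tuning and evaluation routines; this is not the paper's actual procedure):

# Estimate each layer's contribution by fine-tuning it in isolation and
# measuring the change in validation accuracy over the frozen baseline.
def layer_contributions(model, layer_names, finetune_one_layer, evaluate, baseline_acc):
    contributions = {}
    for name in layer_names:
        tuned = finetune_one_layer(model, name)   # update only this layer
        contributions[name] = evaluate(tuned) - baseline_acc
    return contributions

# Choose the layers that give the most accuracy per unit of fine-tuning cost
# (memory/compute), staying within a cost budget.
def select_layers(contributions, costs, budget):
    ranked = sorted(contributions, key=lambda n: contributions[n] / max(costs[n], 1e-9), reverse=True)
    chosen, spent = [], 0.0
    for name in ranked:
        if spent + costs[name] <= budget:
            chosen.append(name)
            spent += costs[name]
    return chosen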

PockEngine delivered a 15× speedup over the pre-built TensorFlow on a Raspberry Pi and a 5.6× memory saving during backpropagation on the NVIDIA Jetson AGX Orin. Notably, PockEngine also enables effective fine-tuning of LLaMA-V2-7B on the Jetson AGX Orin.

Check out the Paper and MIT Blog. All credit for this research goes to the researchers of this project.

Amazon EC2 DL2q instance for cost-efficient, high-performance AI inference is now generally available

This is a guest post by A.K Roy from Qualcomm AI.
Amazon Elastic Compute Cloud (Amazon EC2) DL2q instances, powered by Qualcomm AI 100 Standard accelerators, can be used to cost-efficiently deploy deep learning (DL) workloads in the cloud. They can also be used to develop and validate the performance and accuracy of DL workloads that will be deployed on Qualcomm devices. DL2q instances are the first instances to bring Qualcomm’s artificial intelligence (AI) technology to the cloud.
With eight Qualcomm AI 100 Standard accelerators and 128 GiB of total accelerator memory, customers can also use DL2q instances to run popular generative AI applications, such as content generation, text summarization, and virtual assistants, as well as classic AI applications for natural language processing and computer vision. Additionally, Qualcomm AI 100 accelerators feature the same AI technology used across smartphones, autonomous driving, personal computers, and extended reality headsets, so DL2q instances can be used to develop and validate these AI workloads before deployment.
New DL2q instance highlights
Each DL2q instance incorporates eight Qualcomm Cloud AI100 accelerators, with an aggregated performance of over 2.8 PetaOps of Int8 inference performance and 1.4 PetaFlops of FP16 inference performance. The instance has an aggregate of 112 AI cores, 128 GB of accelerator memory, and 1.1 TB per second of memory bandwidth.
Each DL2q instance has 96 vCPUs and 768 GB of system memory, and supports 100 Gbps of networking bandwidth as well as 19 Gbps of Amazon Elastic Block Store (Amazon EBS) bandwidth.

| Instance name | vCPUs | Cloud AI100 accelerators | Accelerator memory | Accelerator memory BW (aggregated) | Instance memory | Instance networking | Storage (Amazon EBS) bandwidth |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DL2q.24xlarge | 96 | 8 | 128 GB | 1.088 TB/s | 768 GB | 100 Gbps | 19 Gbps |

Qualcomm Cloud AI100 accelerator innovation
The Cloud AI100 accelerator system-on-chip (SoC) is a purpose-built, scalable multi-core architecture, supporting a wide range of deep-learning use-cases spanning from the datacenter to the edge. The SoC employs scalar, vector, and tensor compute cores with an industry-leading on-die SRAM capacity of 126 MB. The cores are interconnected with a high-bandwidth low-latency network-on-chip (NoC) mesh.
The AI100 accelerator supports a broad and comprehensive range of models and use-cases. The table below highlights the range of the model support.

| Model category | Number of models | Examples |
| --- | --- | --- |
| NLP | 157 | BERT, BART, FasterTransformer, T5, Z-code MOE |
| Generative AI – NLP | 40 | LLaMA, CodeGen, GPT, OPT, BLOOM, Jais, Luminous, StarCoder, XGen |
| Generative AI – Image | 3 | Stable Diffusion v1.5 and v2.1, OpenAI CLIP |
| CV – Image classification | 45 | ViT, ResNet, ResNext, MobileNet, EfficientNet |
| CV – Object detection | 23 | YOLO v2, v3, v4, v5, and v7, SSD-ResNet, RetinaNet |
| CV – Other | 15 | LPRNet, Super-resolution/SRGAN, ByteTrack |
| Automotive networks* | 53 | Perception and LIDAR, pedestrian, lane, and traffic light detection |
| Total | >300 | |

* Most automotive networks are composite networks consisting of a fusion of individual networks.
The large on-die SRAM on the DL2q accelerator enables efficient implementation of advanced performance techniques such as MX6 micro-exponent precision for storing the weights and MX9 micro-exponent precision for accelerator-to-accelerator communication. The micro-exponent technology is described in the following Open Compute Project (OCP) industry announcement: AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Standardize Next-Generation Narrow Precision Data Formats for AI » Open Compute Project.
The instance user can use the following strategies to maximize performance per cost:

Store weights using MX6 micro-exponent precision in the on-accelerator DDR memory. Using MX6 precision maximizes the utilization of the available memory capacity and memory bandwidth to deliver best-in-class throughput and latency.
Compute in FP16 to deliver the required use-case accuracy, while using the large on-chip SRAM and spare TOPs on the card to implement high-performance, low-latency MX6-to-FP16 kernels.
Use an optimized batching strategy and a higher batch size, made possible by the large on-chip SRAM, to maximize the reuse of weights while keeping the activations on-chip as much as possible.

DL2q AI Stack and toolchain
The DL2q instance is accompanied by the Qualcomm AI Stack that delivers a consistent developer experience across Qualcomm AI in the cloud and other Qualcomm products. The same Qualcomm AI stack and base AI technology runs on the DL2q instances and Qualcomm edge devices, providing customers a consistent developer experience, with a unified API across their cloud, automotive, personal computer, extended reality, and smartphone development environments.
The toolchain enables the instance user to quickly onboard a previously trained model, compile and optimize the model for the instance capabilities, and subsequently deploy the compiled models for production inference use-cases in three steps shown in the following figure.

To learn more about tuning the performance of a model, see the Cloud AI 100 Key Performance Parameters Documentation.
Get started with DL2q instances
In this example, you compile and deploy a pre-trained BERT model from Hugging Face on an EC2 DL2q instance using a pre-built DL2q AMI, in four steps.
You can use either a pre-built Qualcomm DLAMI on the instance or start with an Amazon Linux2 AMI and build your own DL2q AMI with the Cloud AI 100 Platform and Apps SDK available in this Amazon Simple Storage Service (Amazon S3) bucket: s3://ec2-linux-qualcomm-ai100-sdks/latest/.
The steps that follow use the pre-built DL2q AMI, Qualcomm Base AL2 DLAMI.
Use SSH to access your DL2q instance with the Qualcomm Base AL2 DLAMI and follow steps 1 through 4.
Step 1. Set up the environment and install required packages

Install Python 3.8.

sudo amazon-linux-extras install python3.8

Set up the Python 3.8 virtual environment.

python3.8 -m venv /home/ec2-user/userA/pyenv

Activate the Python 3.8 virtual environment.

source /home/ec2-user/userA/pyenv/bin/activate

Install the required packages, shown in the requirements.txt document available at the Qualcomm public Github site.

pip3 install -r requirements.txt

Import the necessary libraries.

import transformers
from transformers import AutoTokenizer, AutoModelForMaskedLM
import sys
import qaic
import os
import torch
import onnx
from onnxsim import simplify
import argparse
import numpy as np

Step 2. Import the model

Import and tokenize the model.

model_card = 'bert-base-cased'
model = AutoModelForMaskedLM.from_pretrained(model_card)
tokenizer = AutoTokenizer.from_pretrained(model_card)

Define a sample input and extract the inputIds and attentionMask.

sentence = "The dog [MASK] on the mat."
encodings = tokenizer(sentence, max_length=128, truncation=True, padding="max_length", return_tensors='pt')
inputIds = encodings["input_ids"]
attentionMask = encodings["attention_mask"]

Convert the model to ONNX, which can then be passed to the compiler.

# Set dynamic dims and axes.
dynamic_dims = {0: 'batch', 1: 'sequence'}
dynamic_axes = {
    "input_ids": dynamic_dims,
    "attention_mask": dynamic_dims,
    "logits": dynamic_dims,
}
input_names = ["input_ids", "attention_mask"]
inputList = [inputIds, attentionMask]

torch.onnx.export(
    model,
    args=tuple(inputList),
    f=f"{gen_models_path}/{model_base_name}.onnx",
    verbose=False,
    input_names=input_names,
    output_names=["logits"],
    dynamic_axes=dynamic_axes,
    opset_version=11,
)

You’ll run the model in FP16 precision, so you need to check whether the model contains any constants beyond the FP16 range. Pass the model to the fix_onnx_fp16 function to generate a new ONNX file with the required fixes.

from onnx import numpy_helper

def fix_onnx_fp16(
    gen_models_path: str,
    model_base_name: str,
) -> str:
    finfo = np.finfo(np.float16)
    fp16_max = finfo.max
    fp16_min = finfo.min
    model = onnx.load(f"{gen_models_path}/{model_base_name}.onnx")
    fp16_fix = False
    for tensor in onnx.external_data_helper._get_all_tensors(model):
        nptensor = numpy_helper.to_array(tensor, gen_models_path)
        if nptensor.dtype == np.float32 and (
            np.any(nptensor > fp16_max) or np.any(nptensor < fp16_min)
        ):
            # print(f'tensor value : {nptensor} above {fp16_max} or below {fp16_min}')
            nptensor = np.clip(nptensor, fp16_min, fp16_max)
            new_tensor = numpy_helper.from_array(nptensor, tensor.name)
            tensor.CopyFrom(new_tensor)
            fp16_fix = True

    if fp16_fix:
        # Save FP16 model
        print("Found constants out of FP16 range, clipped to FP16 range")
        model_base_name += "_fix_outofrange_fp16"
        onnx.save(model, f=f"{gen_models_path}/{model_base_name}.onnx")
        print(f"Saving modified onnx file at {gen_models_path}/{model_base_name}.onnx")
    return model_base_name

fp16_model_name = fix_onnx_fp16(gen_models_path=gen_models_path, model_base_name=model_base_name)

Step 3. Compile the model
The qaic-exec command line interface (CLI) compiler tool is used to compile the model. The input to this compiler is the ONNX file generated in step 2. The compiler produces a binary file (called QPC, for Qualcomm program container) in the path defined by the -aic-binary-dir argument.
In the compile command below, you use four AI compute cores and a batch size of one to compile the model.

/opt/qti-aic/exec/qaic-exec \
-m=bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16.onnx \
-aic-num-cores=4 \
-convert-to-fp16 \
-onnx-define-symbol=batch,1 -onnx-define-symbol=sequence,128 \
-aic-binary-dir=bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc \
-aic-hw -aic-hw-version=2.0 \
-compile-only

The QPC is generated in the bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc folder.
Step 4. Run the model
Set up a session to run the inference on a Cloud AI100 Qualcomm accelerator in the DL2q instance.
The Qualcomm qaic Python library is a set of APIs that provides support for running inference on the Cloud AI100 accelerator.

Use the Session API call to create an instance of session. The Session API call is the entry point to using the qaic Python library.

qpcPath = 'bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc'

bert_sess = qaic.Session(model_path=qpcPath+'/programqpc.bin', num_activations=1)
bert_sess.setup() # Loads the network to the device.

# Here we are reading out all the input and output shapes/types
input_shape, input_type = bert_sess.model_input_shape_dict['input_ids']
attn_shape, attn_type = bert_sess.model_input_shape_dict['attention_mask']
output_shape, output_type = bert_sess.model_output_shape_dict['logits']

# Create the input dictionary for the given input sentence
input_dict = {"input_ids": inputIds.numpy().astype(input_type), "attention_mask": attentionMask.numpy().astype(attn_type)}

# Run inference on Cloud AI 100
output = bert_sess.run(input_dict)

Restructure the data from output buffer with output_shape and output_type.

token_logits = np.frombuffer(output['logits'], dtype=output_type).reshape(output_shape)

Decode the output produced.

# Locate the [MASK] position in the input so the corresponding logits can be decoded.
mask_token_index = torch.where(inputIds == tokenizer.mask_token_id)[1].item()

mask_token_logits = torch.from_numpy(token_logits[0, mask_token_index, :]).unsqueeze(0)
top_5_results = torch.topk(mask_token_logits, 5, dim=1)
print("Model output (top5) from Qualcomm Cloud AI 100:")
for i in range(5):
    idx = top_5_results.indices[0].tolist()[i]
    val = top_5_results.values[0].tolist()[i]
    word = tokenizer.decode([idx])
    print(f"{i+1} :(word={word}, index={idx}, logit={round(val,2)})")

Here are the outputs for the input sentence “The dog [MASK] on the mat.”

1 :(word=sat, index=2068, logit=11.46)
2 :(word=landed, index=4860, logit=11.11)
3 :(word=spat, index=15732, logit=10.95)
4 :(word=settled, index=3035, logit=10.84)
5 :(word=was, index=1108, logit=10.75)

That’s it. With just a few steps, you compiled and ran a PyTorch model on an Amazon EC2 DL2q instance. To learn more about onboarding and compiling models on the DL2q instance, see the Cloud AI100 Tutorial Documentation.
To learn more about which DL model architectures are a good fit for AWS DL2q instances and the current model support matrix, see the Qualcomm Cloud AI100 documentation.
Available now
You can launch DL2q instances today in the US West (Oregon) and Europe (Frankfurt) AWS Regions as On-demand, Reserved, and Spot Instances, or as part of a Savings Plan. As usual with Amazon EC2, you pay only for what you use. For more information, see Amazon EC2 pricing.
DL2q instances can be deployed using AWS Deep Learning AMIs (DLAMI), and container images are available through managed services such as Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS ParallelCluster.
To learn more, visit the Amazon EC2 DL2q instance page, and send feedback to AWS re:Post for EC2 or through your usual AWS Support contacts.

About the authors
A.K Roy is a Director of Product Management at Qualcomm for Cloud and Datacenter AI products and solutions. He has over 20 years of experience in product strategy and development, with a current focus on best-in-class performance and performance-per-dollar end-to-end solutions for AI inference in the cloud, across a broad range of use cases including GenAI, LLMs, automotive, and hybrid AI.
Jianying Lang is a Principal Solutions Architect at AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI fields. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining the techniques in the HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.

Your guide to generative AI and ML at AWS re:Invent 2023

Yes, the AWS re:Invent season is upon us and as always, the place to be is Las Vegas! You marked your calendars, you booked your hotel, and you even purchased the airfare. Now all you need is some guidance on generative AI and machine learning (ML) sessions to attend at this twelfth edition of re:Invent. And although generative AI has appeared in previous events, this year we’re taking it to the next level. In addition to several exciting announcements during keynotes, most of the sessions in our track will feature generative AI in one form or another, so we can truly call our track “Generative AI and ML.” In this post, we give you a sense of how the track is organized and highlight a few sessions we think you’ll like. And although our track focuses on generative AI, many other tracks have related sessions. Use the “Generative AI” tag as you are browsing the session catalog to find them.
The technical sessions in our track are divided into five areas. First, we’ll have a few foundational sessions related to various aspects of Amazon Bedrock—a fully managed generative AI service we launched earlier this year. These will help you understand the building blocks of your generative AI applications. Second, we’ll have sessions covering the common generative AI use cases and applications. Here you’ll also have a chance to discover novel use cases and techniques. Third, a number of sessions will be of interest to ML practitioners who build, deploy, and operationalize both traditional and generative AI models. This year, learn about LLMOps, not just MLOps! Then, as we started doing last re:Invent, we’ll be offering several sessions on how to build AI responsibly. The greater the power of latest transformer-based models, the greater the responsibility of all ML practitioners to do this right. Be sure to check out the session on the just launched PartyRock, an educational tool for providing any builder with low-friction access to learn through experimentation in a foundation model playground built on Amazon Bedrock. And last but not least (and always fun!) are the sessions dedicated to AWS DeepRacer!
Generative AI is at the heart of the AWS Village this year. Interact with several demos that feature new applications, including a competition that involves using generative AI tech to pilot a drone around an obstacle course. Talk with AWS experts in 14 different industries and explore industry-specific generative AI use cases, including demos from advertising and marketing, aerospace and satellite, manufacturing, and more. The Emerging Tech Zone within the Expo features innovative startups that were selected into the AWS Generative AI Accelerator and the NVIDIA Inception 100 programs.
If you’re new to re:Invent, you can attend sessions of the following types:

Keynotes – Join in person or virtually and learn about all the exciting announcements.
Innovation Talks – Learn about the latest cloud technology from AWS technology leaders and discover how these advancements can help you push your business forward. These sessions will be livestreamed, recorded, and published to YouTube.
Breakout sessions – These 60-minute sessions are expected to have broad appeal, are delivered to larger audiences, and will be recorded. If you miss them, you can watch them on demand after re:Invent.
Chalk talks – Enjoy 60 minutes of content delivered to smaller audiences with an interactive whiteboarding session. Chalk talks are where discussions happen, and these offer you the greatest opportunity to ask questions or share your opinion.
Workshops – In these hands-on learning opportunities, in the course of 2 hours, you’ll be able to build a solution to a problem, and understand the inner workings of the resulting infrastructure and cross-service interaction. Bring your laptop and be ready to learn!
Builders’ sessions – These highly interactive 60-minute mini-workshops are conducted in small groups of fewer than 10 attendees. Some of these appeal to beginners, and others are on specialized topics.
NEW! Code talks – In this new session type for re:Invent 2023, code talks are similar to our popular chalk talk format, but instead of focusing on an architecture solution with whiteboarding, the speakers lead an interactive discussion featuring live coding or code samples. These 60-minute sessions focus on the actual code that goes into building a solution. Attendees are encouraged to ask questions and follow along.

If you have reserved your seat at any of the sessions, great! If not, we always set aside some spots for walk-ins, so make a plan and come to the session early.
To help you plan your agenda for this year’s re:Invent, here are some highlights of the generative AI and ML track. So buckle up, and start registering for your favorite sessions.
Visit the session catalog to learn about all our generative AI and ML sessions.
Keynotes

Adam Selipsky, Chief Executive Officer, Amazon Web Services – Keynote
Tuesday November 28 | 8:30 AM – 10:30 AM (PST) | The Venetian
Join Adam Selipsky, CEO of Amazon Web Services, as he shares his perspective on cloud transformation. He highlights innovations in data, infrastructure, and artificial intelligence and machine learning that are helping AWS customers achieve their goals faster, mine untapped potential, and create a better future.

Swami Sivasubramanian, Vice President of AWS Data and Machine Learning – Keynote
Wednesday November 29 | 8:30 AM – 10:30 AM (PST) | The Venetian
A powerful relationship between humans, data, and AI is unfolding right before us. Generative AI is augmenting our productivity and creativity in new ways, while also being fueled by massive amounts of enterprise data and human intelligence. Join Swami Sivasubramanian, Vice President of Data and AI at AWS, to discover how you can use your company data to build differentiated generative AI applications and accelerate productivity for employees across your organization. Also hear from customer speakers with real-world examples of how they’ve used their data to support their generative AI use cases and create new experiences for their customers.
Innovation Talks

Dr. Bratin Saha, VP of AWS AI and ML Services | AIM245-INT | Innovate faster with generative AI
Wednesday November 29 | 1:00 PM – 2:00 PM (PST) | Venetian | Level 5 | Palazzo Ballroom B
With the emergence of generative AI, we are at a tipping point in the widespread adoption of machine learning. Join Dr. Bratin Saha, VP of AWS AI and ML Services, to hear how customers across industries are transforming their business with the latest breakthroughs in AI and ML, including generative AI. Discover the latest AWS innovations, hear from top customers, and explore where AI/ML is headed.

Francessca Vasquez, Vice President of Professional Services | ARC217-INT | From hype to impact: Building a generative AI architecture
Wednesday November 29 | 11:30 AM – 12:30 PM (PST) | Venetian | Level 5 | Palazzo Ballroom B
Generative AI represents a paradigm shift for how companies operate today. Generative AI is empowering developers to reimagine customer experiences and applications while transforming virtually every industry. Organizations are rapidly innovating to create the right architecture for scaling generative AI securely, economically, and responsibly to deliver business value. In this talk, learn how leaders are modernizing their data foundation, selecting industry-leading foundation models, and deploying purpose-built accelerators to unlock the possibilities of generative AI.

Shaown Nandi, AWS Director of Technology for Industries and Strategic Accounts | AIM248-INT | Unlocking the industry potential of generative AI
Wednesday November 29 | 4:00 PM – 5:00 PM (PST) | Venetian | Level 5 | Palazzo Ballroom B
Generative AI has captured the imagination of many industries and is poised to bring in the next wave of technological advancements. In this innovation talk, hear how the largest industries, from healthcare and financial services to automotive and media and entertainment, are using generative AI to drive outcomes for their customers. Join Shaown Nandi, AWS Director of Technology for Industries and Strategic Accounts, and industry leaders to hear how generative AI is accelerating content creation and helping organizations reimagine customer experiences.

Mai-Lan Tomsen Bukovec, Vice President, Technology | AIM250-INT | Putting your data to work with generative AI
Thursday November 30 | 12:30 PM – 1:30 PM (PST) | Venetian | Level 5 | Palazzo Ballroom B
How can you turn your data lake into a business advantage with generative AI? In this talk, explore strategies for putting your proprietary datasets to work when building unique, differentiated generative AI solutions. Learn how to utilize your datasets using Amazon SageMaker and Amazon Bedrock as well as popular frameworks like PyTorch with AWS compute, storage, and analytics. Hear best practices for using unstructured (video, image, PDF), semi-structured (Parquet), and table-formatted (Iceberg) data for training, fine-tuning, checkpointing, and prompt engineering. Also hear different architectural patterns that customers use today to harness their business data for customized generative AI solutions.
Breakout sessions

AIM218 (LVL 200) | Build your first generative AI application with Amazon Bedrock
Monday November 27 | 2:30 PM – 3:30 PM (PST)
We are truly at an exciting inflection point in the widespread adoption of ML with the growth of generative AI applications. In this session, learn how to build your first generative AI application with key services such as Amazon Bedrock. Get hints and tips for getting started fast, and see example reference architectures for common use cases built with AWS AI and ML such as self-service customer support, text analysis, report generation, post-call analysis, and forecasting trends.
Reserve your seat now!
AIM225 (LVL 200) | Drive personalized CX using generative AI and Amazon Personalize
Tuesday November 28 | 5:00 PM – 6:00 PM (PST)
Delivering the best experience is critical to capture and retain customers today. With generative AI, it is possible to hyper-personalize targeted recommendations for shopping and streaming. While standard taglines like “People who bought this also bought . . .” or “Because you watched . . .” entice some, they don’t fully address individual interests. Companies must find ways to dynamically generate compelling, highly customized content. Amazon Personalize delivers capabilities powered by ML and generative AI to help brands create meaningful experiences. Join this session to hear from powerhouse AWS media customer FOX and learn how hyper-personalized experiences can be used to build engagement and drive revenue.
Reserve your seat now!
AIM327 (LVL 300) | Scaling FM inference to hundreds of models with Amazon SageMaker
Wednesday November 29 | 4:30 PM – 5:30 PM (PST)
Companies need robust and cost-effective solutions to deploy foundation models at scale. Additionally, SaaS providers need scalable and cost-effective ways to serve hundreds of models to their customers. This session explores how to use Amazon SageMaker to roll out hundreds of FMs cost-effectively at scale. Get a detailed overview of deployment strategies to support large-scale generative AI inferencing for SaaS, and learn how to architect solutions that maximize scaling capabilities for performance and cost.
Reserve your seat now!
AIM333 (LVL 300) | Explore text-generation FMs for top use cases with Amazon Bedrock
Tuesday November 28| 2:00 PM – 3:00 PM (PST)
Foundation models can be used for natural language processing tasks such as summarization, text generation, classification, open-ended Q&A, and information extraction. With Amazon Bedrock, you can choose powerful FMs from AI21 Labs, Anthropic, and Cohere to find the right FM for your use case such as the Jurassic-2, Claude, and Command families of text-generation FMs. Join this session to learn which FM is best suited for your use case.
Reserve your seat now!
AIM332 (LVL 300) | Explore image generation and search with FMs on Amazon Bedrock
Thursday November 30 | 11:00 AM – 12:00 PM (PST)
Foundation models understand multiple forms of input, such as images and texts. Join this session to learn how to build transformational experiences using images in Amazon Bedrock.
Reserve your seat now!
AIM377 (LVL 300) | Prompt engineering best practices for LLMs on Amazon Bedrock
Monday November 27 | 9:00 AM – 10:00 AM (PST)
Prompt engineering is the process of guiding large language models to produce desired outputs. In this session, get an overview of prompt engineering best practices and learn how to choose the most appropriate formats, phrases, words, and symbols to get the most out of generative AI solutions while improving accuracy and performance. This session uses the Claude 2 LLM as an example of how prompt engineering helps to solve complex customer use cases. Also learn how prompts can be integrated with your architecture and how to use API parameters for tuning the model parameters using Amazon Bedrock.
Reserve your seat now!
Chalk talks

AIM341 (LVL 300) | Deliver customized search capabilities using Amazon Bedrock
Wednesday November 29 | 5:30 PM – 6:30 PM (PST)
Vector embeddings are numerical representations of your text, image, audio, and video data that can be used to understand the relationship between sentences or words to find more relevant and contextual information in response to a user query. Embeddings can be stored in a database and are used to enable streamlined and more accurate searches. You can use an embeddings model in Amazon Bedrock to create vectors of your organization’s data, which can then be used to enable semantic search. Join this hands-on chalk talk to learn how.
Reserve your seat now!
AIM340-R (LVL 300) | Customize your FMs securely to deliver differentiated experiences
Wednesday November 29 | 6:00 PM – 7:00 PM (PST)
Foundation model customizations help you build differentiated generative AI applications using your own data. It’s easy to securely customize models in Amazon Bedrock. You can point Amazon Bedrock at a few labeled examples in Amazon S3, and the service can fine-tune the FM for a particular task without having to annotate large volumes of data; none of your data is used to train the original base FMs. Join this chalk talk for a deep dive on FM customizations through an interactive demo.
Reserve your seat now!
This session will be repeated Thursday, November 30 11:00 AM – 12:00 PM (PST), and Friday, December 1 8:30 AM – 9:30 AM PST.
AIM342 (LVL 300) | Advancing responsible AI: Assessing and mitigating risk
Wednesday November 29 | 4:30 PM – 5:30 PM (PST)
Risk assessment is an essential part of developing AI solutions responsibly, especially with emerging industry standards and laws regarding AI risk, such as ISO 42001 and the EU AI Act. This chalk talk provides an introduction to best practices for risk assessment related to fairness, robustness, explainability, privacy and security, transparency, and governance. Explore examples to estimate the severity and likelihood of potential events that could be harmful. Learn about Amazon SageMaker tooling for model governance, bias, explainability, and monitoring, and about transparency in the form of service cards as potential risk mitigation strategies.
Reserve your seat now!
AIM347-R (LVL 300) | Next-generation ML builder experience
Thursday November 30 | 4:00 PM – 5:00 PM (PST)
Amazon SageMaker offers different integrated development environments (IDEs) that are purpose-built for machine learning. In this chalk talk, learn how to select and use your preferred environment to perform end-to-end ML development steps, from preparing data to building, training, and deploying your ML models. Discover how you can quickly upload data, create new notebooks, train and tune models, move back and forth between steps to adjust experiments, collaborate seamlessly within your organization, and deploy models to production all in one place.
Reserve your seat now!
This session will be repeated Friday, December 1 9:00 AM – 10:00 AM (PST), and Friday, December 1 11:30 AM – 12:00 PM (PST).
AIM352-R (LVL 300) | Securely build generative AI apps and control data with Amazon Bedrock
Monday November 27 | 11:30 AM – 12:30 PM (PST)
Generative AI applications have captured widespread attention; however, they have also introduced new security challenges, especially around the handling of customer data. Organizations want to ensure that their data remains safe and secure while working with foundation models and don’t want to worry about their data being used to train an FM. Amazon Bedrock provides comprehensive data protection and privacy. In this chalk talk, explore architectures, data flows, and security-related aspects of model fine-tuning, as well as prompting and inference, while you learn about Amazon Bedrock’s security capabilities.
Reserve your seat now!
This session will be repeated Wednesday, November 29 6:00 PM – 7:00 PM (PST), and Thursday, November 30 4:00 PM – 5:00 PM (PST).
AIM404 (LVL 400) | Train and deploy FMs on Amazon EC2 and Amazon SageMaker, feat. Flip AI
Wednesday November 29 | 2:30 PM – 3:30 PM (PST)
Organizations that are running machine learning systems and generative AI applications on their local laptops/servers want to take advantage of the scalability and performance of the AWS Cloud. In this chalk talk, hear about compute and ML services from self-managed Amazon EC2 to fully managed Amazon SageMaker that you can use to build, train, and deploy foundation models. See a demo of how you can fine-tune a Stable Diffusion model on Amazon EC2 and then deploy it on SageMaker using the AWS Deep Learning AMIs (DLAMI) and AWS Deep Learning Containers. Also, hear how Flip AI built their own models using these AWS services.
Reserve your seat now!
Workshops

AIM302 (LVL 300) | Use generative AI to extract insights from contact center recordings
Monday November 27 | 8:30 AM – 10:30 AM (PST)
Learn how to derive insights from contact center recordings and other media using Amazon Transcribe and generative AI. In this workshop, learn how to combine automatic call recording, transcription, post-call analysis, sentiment analysis, issue detection, and call summarization from your own telephony recordings (Cisco, Genesys, Talkdesk, Avaya, and more) using AWS Contact Center Intelligence (CCI) solutions and generative AI. See demos on how to build analytics dashboards and integrations between LLMs and Amazon QuickSight to visualize your key metrics. You must bring your laptop to participate.
Reserve your seat now!
AIM307 (LVL 300) | Retrieval Augmented Generation with Amazon Bedrock
Wednesday November 29 | 8:30 AM – 10:30 AM (PST)
Large language models are often limited by the data they were trained on and don’t always provide up-to-date responses—or worse, they make things up. To overcome this limitation, you can supplement prompts with up-to-date information using embeddings stored in vector databases, a process known as Retrieval Augmented Generation (RAG). With supplemental information in the prompt providing more context, the LLM can respond more accurately and is less likely to hallucinate. In this workshop, learn how to use vector databases with Amazon Bedrock, a service that makes foundation models from Amazon and leading AI companies available via a single API. You must bring your laptop to participate.
Reserve your seat now!
AIM304 (LVL 300) | How to generate text responsibly using foundation models on AWS
Wednesday November 29 | 5:30 PM – 7:30 PM (PST)
Foundation models such as Claude are commonly used to create new pieces of original content, such as short stories, essays, social media posts, and webpage copy, and also to summarize text from articles, blog posts, books, and more. In this workshop, learn how you can generate text in minutes using foundation models available through Amazon Bedrock in a responsible way. You must bring your laptop to participate.
Reserve your seat now!
Code talks

AIM364-R (LVL 300) | Boost ML development with Amazon SageMaker Studio notebooks
Tuesday November 28 | 4:00 PM – 5:00 PM (PST)
Amazon SageMaker Studio notebooks are collaborative notebooks that you can launch quickly and that can help you integrate with purpose-built ML tools in SageMaker and other AWS services for complete ML development. In this code talk, learn how to prepare data at scale using built-in data preparation assistance, co-edit the same notebook in real time, and automate conversion of notebook code to production-ready jobs. This talk also introduces the new generative AI-powered features that can help you maximize productivity, write higher-quality code, and improve security.
Reserve your seat now!
This session will be repeated Wednesday, November 29 12:00 PM – 1:00 PM (PST).
Builders’ sessions

AIM219-R (LVL 200) | Learn and experiment with LLMs in Amazon SageMaker Studio Lab
Monday November 27 | 10:00 AM – 11:00 AM (PST)
Machine learning can sound complicated, but Amazon SageMaker Studio Lab makes it easier for anyone to get started at no cost. In this hands-on builders’ session, be guided through the basics of experimenting with large language models in Amazon SageMaker Studio Lab. No prior machine learning experience is required. You must bring your laptop to participate.
This session will be repeated Monday, November 27 4:00 PM – 5:00 PM (PST), Tuesday, November 28 3:30 PM – 4:30 PM (PST), Wednesday, November 29, 12:00 PM – 1:00 PM (PST), and Thursday, November 30 11:30 AM – 12:30 PM (PST).
Reserve your seat now!
AWS DeepRacer

Get ready to race with AWS DeepRacer at re:Invent 2023!
Developers, fasten your seatbelts—AWS DeepRacer is bringing ML to everyone at re:Invent! Whether you’re looking to get started with ML or improve your skills, AWS DeepRacer offers an exciting way to get hands-on with ML.
Watch the world’s top 72 racers of the AWS DeepRacer 2023 League battle it out Monday through Wednesday at our Championship Stadium at the Venetian Expo. It will all come down to the finale on Wednesday (November 29) at 2:30 PM (PST) as the eight finalists compete for the cup and $44,000 in prize money. You can also get behind the wheel yourself on November 30, when the track opens for the 2024 Open Racing. Post the fastest time and you’ll win a ticket back to Vegas for the 2024 Championship!
Dive into 10 not-to-miss workshops where you’ll learn to train reinforcement learning models, solve business problems with generative AI, and more. Want to learn tips and tricks from the best racers in the world? Be sure to check out our DPR301 workshop featuring five of our top AWS DeepRacer League Champions who will be sharing their approaches for training their AWS DeepRacer models and answering questions during an open Q&A.
Don’t forget to check out the rest of the AWS DeepRacer workshops before they fill up to reserve your spot! Whether you take a workshop, take a spin in our gamified virtual racing simulator, catch the global competition, or test your own ML model on the track, AWS DeepRacer brings the thrill of high-speed racing to hands-on machine learning at re:Invent. Let the countdown begin. We can’t wait to see you in Las Vegas!
See you at re:Invent!
Make sure to check out the re:Invent content catalog and the generative AI at re:Invent guide for more gen AI and ML content at re:Invent. We’ll see you there!

About the authors
Denis V. Batalov is a 17-year Amazon veteran with a PhD in Machine Learning. Denis worked on such exciting projects as Search Inside the Book, Amazon Mobile apps, and Kindle Direct Publishing. Since 2013 he has helped AWS customers adopt AI/ML technology as a Solutions Architect. Currently, Denis is a Worldwide Tech Leader for AI/ML, responsible for the functioning of AWS ML Specialist Solutions Architects globally. Denis is a frequent public speaker; you can follow him on Twitter @dbatalov.
Paxton Hall is a Marketing Program Manager for the AWS AI/ML Community on the AI/ML Education team at AWS. He has worked in retail and experiential marketing for the past 7 years, focused on developing communities and marketing campaigns. Out of the office, he’s passionate about public lands access and conservation, and enjoys backcountry skiing, climbing, biking, and hiking throughout Washington’s Cascade mountains.

Build a contextual chatbot for financial services using Amazon SageMaker JumpStart, Llama 2, and Amazon OpenSearch Serverless with Vector Engine

The financial service (FinServ) industry has unique generative AI requirements related to domain-specific data, data security, regulatory controls, and industry compliance standards. In addition, customers are looking for choices to select the most performant and cost-effective machine learning (ML) model and the ability to perform necessary customization (fine-tuning) to fit their business use cases. Amazon SageMaker JumpStart is ideally suited for generative AI use cases for FinServ customers because it provides the necessary data security controls and meets compliance standards requirements.
In this post, we demonstrate question answering tasks using a Retrieval Augmented Generation (RAG)-based approach with large language models (LLMs) in SageMaker JumpStart using a simple financial domain use case. RAG is a framework for improving the quality of text generation by combining an LLM with an information retrieval (IR) system. The LLM generates the text, while the IR system retrieves relevant information from a knowledge base. The retrieved information is then used to augment the LLM’s input, which can help improve the accuracy and relevance of the model-generated text. RAG has been shown to be effective for a variety of text generation tasks, such as question answering and summarization. It is a promising approach for improving the quality and accuracy of text generation models.
Advantages of using SageMaker JumpStart
With SageMaker JumpStart, ML practitioners can choose from a broad selection of state-of-the-art models for use cases such as content writing, image generation, code generation, question answering, copywriting, summarization, classification, information retrieval, and more. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment.
SageMaker JumpStart is ideally suited for generative AI use cases for FinServ customers because it offers the following:

Customization capabilities – SageMaker JumpStart provides example notebooks and detailed posts for step-by-step guidance on domain adaptation of foundation models. You can follow these resources for fine-tuning, domain adaptation, and instruction of foundation models or to build RAG-based applications.
Data security – Ensuring the security of inference payload data is paramount. With SageMaker JumpStart, you can deploy models in network isolation with single-tenancy endpoint provision. Furthermore, you can manage access control to selected models through the private model hub capability, aligning with individual security requirements.
Regulatory controls and compliance – Compliance with standards such as HIPAA BAA, SOC 1/2/3, PCI, and HITRUST CSF is a core feature of SageMaker, ensuring alignment with the rigorous regulatory landscape of the financial sector.
Model choices – SageMaker JumpStart offers a selection of state-of-the-art ML models that consistently rank among the top in industry-recognized HELM benchmarks. These include, but are not limited to, Llama 2, Falcon 40B, AI21 J2 Ultra, AI21 Summarize, Hugging Face MiniLM, and BGE models.

In this post, we explore building a contextual chatbot for financial services organizations using a RAG architecture with the Llama 2 foundation model and the Hugging Face GPTJ-6B-FP16 embeddings model, both available in SageMaker JumpStart. We also use Vector Engine for Amazon OpenSearch Serverless (currently in preview) as the vector data store to store embeddings.
Limitations of large language models
LLMs have been trained on vast volumes of unstructured data and excel in general text generation. Through this training, LLMs acquire and store factual knowledge. However, off-the-shelf LLMs present limitations:

Their offline training renders them unaware of up-to-date information.
Their training on predominantly generalized data diminishes their efficacy in domain-specific tasks. For instance, a financial firm might prefer its Q&A bot to source answers from its latest internal documents, ensuring accuracy and compliance with its business rules.
Their reliance on embedded information compromises interpretability.

To use specific data in LLMs, three prevalent methods exist:

Embedding data within the model prompts, allowing it to utilize this context during output generation. This can be zero-shot (no examples), few-shot (limited examples), or many-shot (abundant examples). Such contextual prompting steers models towards more nuanced results.
Fine-tuning the model using pairs of prompts and completions.
RAG, which retrieves external data (non-parametric) and integrates this data into the prompts, enriching the context.

However, the first method grapples with model constraints on context size, making it tough to input lengthy documents and possibly increasing costs. The fine-tuning approach, while potent, is resource-intensive, particularly with ever-evolving external data, leading to delayed deployments and increased costs. RAG combined with LLMs offers a solution to the previously mentioned limitations.
Retrieval Augmented Generation
RAG retrieves external data (non-parametric) and integrates this data into ML prompts, enriching the context. Lewis et al. introduced RAG models in 2020, conceptualizing them as a fusion of a pre-trained sequence-to-sequence model (parametric memory) and a dense vector index of Wikipedia (non-parametric memory) accessed via a neural retriever.
Here’s how RAG operates:

Data sources – RAG can draw from varied data sources, including document repositories, databases, or APIs.
Data formatting – Both the user’s query and the documents are transformed into a format suitable for relevancy comparisons.
Embeddings – To facilitate this comparison, the query and the document collection (or knowledge library) are transformed into numerical embeddings using language models. These embeddings numerically encapsulate textual concepts.
Relevancy search – The user query’s embedding is compared to the document collection’s embeddings, identifying relevant text through a similarity search in the embedding space.
Context enrichment – The identified relevant text is appended to the user’s original prompt, thereby enhancing its context.
LLM processing – With the enriched context, the prompt is fed to the LLM, which, due to the inclusion of pertinent external data, produces relevant and precise outputs.
Asynchronous updates – To ensure the reference documents remain current, they can be updated asynchronously along with their embedding representations. This ensures that future model responses are grounded in the latest information, guaranteeing accuracy.

In essence, RAG offers a dynamic method to infuse LLMs with real-time, relevant information, ensuring the generation of precise and timely outputs.
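
Conceptually, the retrieval and context-enrichment steps reduce to an embedding similarity search followed by prompt construction. The following minimal sketch (plain NumPy, with embed standing in for any embedding model; the production path later in this post uses the GPT-J endpoint and OpenSearch Serverless instead) illustrates the flow:

import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query, documents, embed, top_k=3):
    # Rank documents by similarity between their embeddings and the query embedding.
    q_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine_sim(q_vec, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, context_docs):
    # Append the retrieved text to the user's question to enrich the context.
    context = "\n\n".join(context_docs)
    return f"Use only the following context to answer.\n\n{context}\n\nQuestion: {query}\nAnswer:"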
The following diagram shows the conceptual flow of using RAG with LLMs.

Solution overview
The following steps are required to create a contextual question answering chatbot for a financial services application:

Use the SageMaker JumpStart GPT-J-6B embedding model to generate embeddings for each PDF document in the Amazon Simple Storage Service (Amazon S3) upload directory.
Identify relevant documents using the following steps:

Generate an embedding for the user’s query using the same model.
Use OpenSearch Serverless with the vector engine feature to search for the top K most relevant document indexes in the embedding space.
Retrieve the corresponding documents using the identified indexes.

Combine the retrieved documents as context with the user’s prompt and question. Forward this to the SageMaker LLM for response generation.

We employ LangChain, a popular framework, to orchestrate this process. LangChain is specifically designed to bolster applications powered by LLMs, offering a universal interface for various LLMs. It streamlines the integration of multiple LLMs, ensuring seamless state persistence between calls. Moreover, it boosts developer efficiency with features like customizable prompt templates, comprehensive application-building agents, and specialized indexes for search and retrieval. For an in-depth understanding, refer to the LangChain documentation.
Prerequisites
You need the following prerequisites to build our context-aware chatbot:

An AWS account with appropriate AWS Identity and Access Management (IAM) permissions.
An Amazon SageMaker Studio domain and user. For setup instructions, refer to Onboard to Amazon SageMaker Domain using Quick setup.
An OpenSearch Serverless collection.
A SageMaker execution role with access to OpenSearch Serverless.

For instructions on how to set up an OpenSearch Serverless vector engine, refer to Introducing the vector engine for Amazon OpenSearch Serverless, now in preview.
For a comprehensive walkthrough of the following solution, clone the GitHub repo and refer to the Jupyter notebook.
Deploy the ML models using SageMaker JumpStart
To deploy the ML models, complete the following steps:

Deploy the Llama 2 LLM from SageMaker JumpStart:

from sagemaker.jumpstart.model import JumpStartModel
llm_model = JumpStartModel(model_id="meta-textgeneration-llama-2-7b-f")
llm_predictor = llm_model.deploy()
llm_endpoint_name = llm_predictor.endpoint_name

Deploy the GPT-J embeddings model:

embeddings_model = JumpStartModel(model_id="huggingface-textembedding-gpt-j-6b-fp16")
embed_predictor = embeddings_model.deploy()
embeddings_endpoint_name = embed_predictor.endpoint_name

Chunk data and create a document embeddings object
In this section, you chunk the data into smaller documents. Chunking is a technique for splitting large texts into smaller chunks. It’s an essential step because it optimizes the relevance of the search query for our RAG model, which in turn improves the quality of the chatbot. The chunk size depends on factors such as the document type and the model used. A chunk size of 1,600 characters (chunk_size=1600) has been selected because this is the approximate size of a paragraph. As models improve, their context window size will increase, allowing for larger chunk sizes.
Refer to the Jupyter notebook in the GitHub repo for the complete solution.
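
A minimal sketch of the chunking step, assuming LangChain's PyPDFLoader and RecursiveCharacterTextSplitter and a hypothetical PDF file name (the notebook may use a different loader or splitter):

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load a PDF into LangChain Document objects (one per page).
pages = PyPDFLoader("sample-earnings-report.pdf").load()  # hypothetical file name

# Split into ~1,600-character chunks with some overlap so sentences that
# span a boundary are not lost.
splitter = RecursiveCharacterTextSplitter(chunk_size=1600, chunk_overlap=200)
docs = splitter.split_documents(pages)
print(f"Split {len(pages)} pages into {len(docs)} chunks")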

Extend the LangChain SagemakerEndpointEmbeddings class to create a custom embeddings function that uses the gpt-j-6b-fp16 SageMaker endpoint you created earlier when you deployed the embeddings model:

import json
import logging
import time
from typing import List

from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler

logger = logging.getLogger(__name__)

# Extend the SagemakerEndpointEmbeddings class from LangChain to provide a custom embedding function.
class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
    def embed_documents(
        self, texts: List[str], chunk_size: int = 1
    ) -> List[List[float]]:
        """Compute doc embeddings using a SageMaker Inference Endpoint.

        Args:
            texts: The list of texts to embed.
            chunk_size: The chunk size defines how many input texts will
                be grouped together as a request. If None, will use the
                chunk size specified by the class.

        Returns:
            List of embeddings, one for each text.
        """
        results = []
        _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size
        st = time.time()
        for i in range(0, len(texts), _chunk_size):
            response = self._embedding_func(texts[i : i + _chunk_size])
            results.extend(response)
        time_taken = time.time() - st
        logger.info(
            f"got results for {len(texts)} in {time_taken}s, length of embeddings list is {len(results)}"
        )
        print(
            f"got results for {len(texts)} in {time_taken}s, length of embeddings list is {len(results)}"
        )
        return results

# Class for serializing/deserializing requests/responses to/from the embeddings model.
class ContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        embeddings = response_json["embedding"]
        if len(embeddings) == 1:
            return [embeddings[0]]
        return embeddings

def create_sagemaker_embeddings_from_js_model(
    embeddings_endpoint_name: str, aws_region: str
) -> SagemakerEndpointEmbeddingsJumpStart:
    content_handler = ContentHandler()
    embeddings = SagemakerEndpointEmbeddingsJumpStart(
        endpoint_name=embeddings_endpoint_name,
        region_name=aws_region,
        content_handler=content_handler,
    )
    return embeddings

Create the embeddings object and batch the creation of the document embeddings:

embeddings = create_sagemaker_embeddings_from_js_model(embeddings_endpoint_name, aws_region)

These embeddings are stored in the vector engine using the LangChain OpenSearchVectorSearch class. You’re now ready to iterate over the chunked documents, create the embeddings, and store them in the OpenSearch Serverless vector index created in your vector search collection. See the following code:

from langchain.vectorstores import OpenSearchVectorSearch
from opensearchpy import RequestsHttpConnection

docsearch = OpenSearchVectorSearch.from_texts(
    texts=[d.page_content for d in docs],
    embedding=embeddings,
    opensearch_url=[{'host': _aoss_host, 'port': 443}],
    http_auth=awsauth,
    timeout=300,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    index_name=_aos_index,
)

Question and answering over documents
So far, you have chunked a large document into smaller ones, created vector embeddings, and stored them in a vector engine. Now you can answer questions regarding this document data. Because you created an index over the data, you can do a semantic search; this way, only the most relevant documents required to answer the question are passed via the prompt to the LLM. This allows you to save time and money by only passing relevant documents to the LLM. For more details on using document chains, refer to Documents.
Complete the following steps to answer questions using the documents:

To use the SageMaker LLM endpoint with LangChain, you use langchain.llms.sagemaker_endpoint.SagemakerEndpoint, which abstracts the endpoint. You perform a transformation for the request and response payloads as shown in the following code for the LangChain SageMaker integration. Note that you may need to adjust the code in ContentHandler based on the content_type and accepts format of the LLM model you choose to use.

from langchain import SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        payload = {
            "inputs": [
                [
                    {
                        "role": "system",
                        "content": prompt,
                    },
                    {"role": "user", "content": prompt},
                ],
            ],
            "parameters": {
                "max_new_tokens": 1000,
                "top_p": 0.9,
                "temperature": 0.6,
            },
        }
        input_str = json.dumps(payload)
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        content = response_json[0]["generation"]["content"]
        return content

content_handler = ContentHandler()

sm_jumpstart_llm = SagemakerEndpoint(
    endpoint_name=llm_endpoint_name,
    region_name=aws_region,
    model_kwargs={"max_new_tokens": 300},
    endpoint_kwargs={"CustomAttributes": "accept_eula=true"},
    content_handler=content_handler,
)

Now you’re ready to interact with the financial document.

Use the following query and prompt template to ask questions regarding the document:

from langchain import PromptTemplate, SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler

query = "Summarize the earnings report and also what year is the report for"
prompt_template = """Only use context to answer the question at the end.

{context}

Question: {question}
Answer:"""

prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        payload = {
            "inputs": [
                [
                    {
                        "role": "system",
                        "content": prompt,
                    },
                    {"role": "user", "content": prompt},
                ],
            ],
            "parameters": {
                "max_new_tokens": 1000,
                "top_p": 0.9,
                "temperature": 0.6,
            },
        }
        input_str = json.dumps(payload)
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        content = response_json[0]["generation"]["content"]
        return content

content_handler = ContentHandler()

chain = load_qa_chain(
    llm=SagemakerEndpoint(
        endpoint_name=llm_endpoint_name,
        region_name=aws_region,
        model_kwargs={"max_new_tokens": 300},
        endpoint_kwargs={"CustomAttributes": "accept_eula=true"},
        content_handler=content_handler,
    ),
    prompt=prompt,
)
sim_docs = docsearch.similarity_search(query, include_metadata=False)
chain({"input_documents": sim_docs, "question": query}, return_only_outputs=True)
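The chain call returns a dictionary; with return_only_outputs=True, the generated answer is available under the output_text key. A small usage sketch (assuming the standard LangChain QA chain output format):

# Print only the generated answer from the QA chain's output dictionary.
result = chain({"input_documents": sim_docs, "question": query}, return_only_outputs=True)
print(result["output_text"])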

Cleanup
To avoid incurring future costs, delete the SageMaker inference endpoints that you created in this notebook. You can do so by running the following in your SageMaker Studio notebook:

# Delete LLM
llm_predictor.delete_model()
llm_predictor.delete_predictor(delete_endpoint_config=True)

# Delete Embeddings Model
embed_predictor.delete_model()
embed_predictor.delete_predictor(delete_endpoint_config=True)

If you created an OpenSearch Serverless collection for this example and no longer require it, you can delete it via the OpenSearch Serverless console.
Conclusion
In this post, we discussed using RAG as an approach to provide domain-specific context to LLMs. We showed how to use SageMaker JumpStart to build a RAG-based contextual chatbot for a financial services organization using Llama 2 and OpenSearch Serverless with a vector engine as the vector data store. This method refines text generation using Llama 2 by dynamically sourcing relevant context. We’re excited to see you bring your custom data and innovate with this RAG-based strategy on SageMaker JumpStart!

About the authors
Sunil Padmanabhan is a Startup Solutions Architect at AWS. As a former startup founder and CTO, he is passionate about machine learning and focuses on helping startups leverage AI/ML for their business outcomes and design and deploy ML/AI solutions at scale.
Suleman Patel is a Senior Solutions Architect at Amazon Web Services (AWS), with a special focus on Machine Learning and Modernization. Leveraging his expertise in both business and technology, Suleman helps customers design and build solutions that tackle real-world business problems. When he’s not immersed in his work, Suleman loves exploring the outdoors, taking road trips, and cooking up delicious dishes in the kitchen.

A Deep Dive into Google and Yahoo’s Email Spam Filter Changes

Google and Yahoo recently announced significant changes to their spam filter algorithms, prompting a reevaluation of strategies for email marketers and sales teams engaged in both inbound and outbound marketing efforts.

Here at Customers.ai, we see this as an opportunity for better email marketing campaigns.

Because the reality is, when it comes to email, less is more. The more targeted you get, the better. The more intent signals you have, the better. The more customized your campaigns are, the better they will perform.

If you are doing email the right way, these updates shouldn’t have a significant impact on your business. 

We have a lot more to dig into here. To help our customers and email marketers as a whole better understand what these changes mean, we are going to explain key algorithm changes, discuss their impact, and provide strategic considerations to help navigate these changes successfully.

Key Algorithm Changes

Understanding & Analyzing Email Dynamics

Strategic Considerations for the Google & Yahoo Spam Filter Updates

Takeaways: Who Does This Impact?

Customers.ai for Better Email Deliverability 

Two Key Algorithm Changes

Mandatory Digital Email Signing. One noteworthy adjustment is the requirement for senders who send more than 5,000 emails a day to use DomainKeys Identified Mail (DKIM) for digital signing. While this is already a best practice, non-signed emails are now treated more suspiciously. Digital signing enhances email authenticity, providing a layer of trust that is crucial.

New Complaint Rate Threshold. A complaint rate over 0.3% now poses the risk of being blocked. For context, popular platforms like Mailchimp suspend accounts for a 0.01% complaint rate. It’s also essential to clarify that a complaint occurs when a user marks an email as spam, not when they unsubscribe. 

Understanding & Analyzing Email Dynamics

Understanding email dynamics is essential for devising effective strategies and understanding what these changes mean for performance. 

For instance, many promotional emails land in the promotions tab, which effectively reduces the visible complaint rate. If 80% of your emails go to the promotions tab, the complaint rate on the remaining 20% that reach the primary inbox could be five times higher and your overall rate would still be acceptable.

In another instance, if emails are going to spam, recipients never see them and can't complain. Low deliverability actually makes it harder to rack up spam complaints.

Now, we aren't advising you to lower your deliverability rates. What we are saying is that you have to understand the dynamics at work to truly understand how this change impacts your campaigns.

Strategic Considerations for the Google & Yahoo Spam Filter Updates

It boils down to this: if the complaint rate equals the number of complaints divided by the total emails sent, the strategy is simple: reduce spam complaints and increase total emails sent.
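As a quick illustration of that math (the numbers below are hypothetical, purely to show how the ratio works):

# Hypothetical numbers to illustrate the complaint-rate math described above.
emails_sent = 100_000          # total emails sent in the period
spam_complaints = 250          # recipients who clicked "mark as spam"

complaint_rate = spam_complaints / emails_sent
print(f"Complaint rate: {complaint_rate:.3%}")   # 0.250%

THRESHOLD = 0.003  # the 0.3% limit disclosed by Google and Yahoo
print("Over threshold" if complaint_rate > THRESHOLD else "Under threshold")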

Let’s dig into these two factors:

Reducing Spam Complaints

When it comes to reducing spam complaints, there are a number of things to consider. 

Clear unsubscribe options are crucial. You may think it's a good idea to hide your unsubscribe link, but we'll be the first to tell you it's not. If a user can't find the unsubscribe button, they are more likely to complain and mark you as spam.

Multiple unsubscribe links in emails can help reduce complaints. Again, not only is it best to make your unsubscribe link visible, but we also recommend giving your users several options. Make it easy! Include one in the body and one in the footer.

Use shorter, engagement-based outreach cadences. It’s not about blasting your customers for everything, all the time. Focus on engaging them through timely messaging and relevant content.

Diversify outreach communication channels. Go beyond email. Focus on channels like direct messaging or even phone calls.

Tighten your target audience. You can create better audiences by layering additional data filters. Better audiences mean more engagement and fewer complaints.

Enhanced email outreach customization is key. The more customized your email is to the individual, the less likely it is they will complain. Utilize segmentation, AI email writers, and make sure you are giving your audience what they want.

Convert Website Visitors into Real Contacts!

Identify who is visiting your site with name, email and more. Get 50 contacts for free!


Optimizing Email Volume

If the way to avoid penalties is to reduce spam complaints and increase total emails sent, we have to optimize our email volume. That doesn’t mean buying huge lists and going overboard. In fact, you should never do that.

What it does mean, is being strategic about the database you have and emails you are sending. A few ideas:

Maintain inactive contacts. By maintaining some inactive contacts, you can strategically increase email volume without causing complaints. Since you know those people won't open the emails, they won't mark you as spam. It's worth noting here that we believe maintaining a clean email database is important. But if you are toeing that 0.3% complaint rate line, maybe don't delete your emails so aggressively.

Increase helpful transactional emails. Emails like order confirmations, tracking information, or purchase follow-ups can help increase email volume and won’t result in complaints. We have seen companies send 3-4 tracking emails alone! 

Reduce the use of Slack and online chat. Because every email reply from a customer sends positive signals to the algorithms, moving all of your conversations away from email could result in a loss of email deliverability. Again, this doesn't mean you should make your customers' lives harder, but we see many companies moving away from email support when keeping it could actually benefit them in the long run.

Strategic Takeaways: Who Does This Impact?

The biggest takeaway is that we have a number: 0.3%. Both Google and Yahoo have disclosed the threshold for complaint rates. It's an unusual move and, in actuality, it helps email marketers.

Previously, those consistently sending emails, who feared getting penalized, didn’t know what the line was. Now it’s a math equation that can easily be monitored and adjusted.

The question comes down to: "Who does this impact?"

The reality is that it’s not the big businesses with established customer bases and large inbound lists. This update actually helps those companies.

They likely have low existing complaint rates and can more easily and safely expand their outbound efforts, especially now that they know the target number.

The challenge will be for start-ups or smaller, less established companies. Specifically, those in the B2B space that may be using more aggressive outbound strategies or have been leaning on ABM to establish their brand. These companies must adhere pretty closely to best practices to avoid hitting the 0.3% threshold. Risk analysis will be crucial in adapting to these changes effectively. 

Customers.ai for Better Email Deliverability 

As mentioned previously, we believe that when done the right way, email is an amazingly effective tool. 

Looking at our own metrics, where we typically send 300,000 emails, our complaint rate is 0.008%, 36 times below the new 0.3% threshold. Here’s a small sample of those (Note 0 abuse reports):

If you are doing outbound marketing based only on an ideal customer profile (ICP), you are in danger. What you should be focusing on is the people who are already aware of you, those who are familiar with your brand or have been to your website.

That’s why we launched our Website Visitor ID X-Ray pixel. It identifies who is on your site and who has interacted with your brand. It prevents the cold outreach that so often results in spam complaints. 


On top of that, it allows you to go beyond email. When in doubt, remarket!

If someone has visited your site, specifically an internal landing page, you can feel pretty safe remarketing to that person. Take those emails and put them into your Facebook or Google Ads remarketing lists. You are now reaching out to them but in a less intrusive way that won’t result in spam complaints.

It’s an Exciting Time

The changes in Google and Yahoo’s email spam filters present an exciting opportunity for those who can adapt. We know where the boundaries are! By aligning strategies and tools with these new guidelines, email marketers and sales teams can showcase their expertise and commitment to navigating and succeeding in their email marketing efforts. 

Embracing these changes positions businesses to thrive in the dynamic world of email marketing.

Important Next Steps

See what targeted outbound marketing is all about. Capture and engage your first 50 website visitor leads with Customers.ai X-Ray website visitor identification for free.

Talk and learn about sales outreach automation with other growth enthusiasts. Join Customers.ai Island, our Facebook group of 40K marketers and entrepreneurs who are ready to support you.

Advance your marketing performance with Sales Outreach School, a free tutorial and training area for sales pros and marketers.

The post A Deep Dive into Google and Yahoo’s Email Spam Filter Changes appeared first on Customers.ai.

NVIDIA AI Researchers Present an Artificial Intelligence Approach for …

Researchers from Nvidia introduce a neural radiance field formulation for view synthesis that efficiently transitions between volumetric and surface-based rendering. The method adapts the rendering process based on scene characteristics by constructing an explicit mesh envelope around a neural volumetric representation. The approach significantly accelerates rendering speed, particularly in solid regions where a single sample per pixel suffices. The proposed method demonstrates high-fidelity rendering through experiments and introduces possibilities for downstream applications such as animation and simulation.

The study extends NeuS, a neural radiance field (NeRF) formulation, by introducing adaptive shells for efficient rendering. The method can adapt its rendering approach based on scene characteristics by utilizing a learned spatially varying kernel size, significantly reducing the required number of samples. It addresses the computational complexity of NeRFs, explores acceleration strategies, and compares performance with surface-based approaches. The proposed method demonstrates comparable results with significantly faster inference, making it suitable for animation and physical simulation applications.

The approach addresses the computational cost of NeRFs in real-time high-resolution novel-view synthesis. It introduces an adaptive shell approach that combines explicit geometry with NeRFs, assigning different rendering styles to distinct scene regions. This approach significantly reduces the number of samples needed for rendering while preserving or enhancing perceptual quality. The goal is to improve the efficiency of NeRFs without compromising their high visual fidelity, allowing for more practical and real-time applications in 3D scene representation and synthesis.

Utilizing explicit mesh envelopes around surfaces reduces the required samples for rendering while maintaining quality. The proposed method, represented by triangle meshes, delineates significant regions for appearance rendering. Evaluation metrics include PSNR, LPIPS, SSIM, and the number of samples per pixel along rays, providing insights into rendering quality and computational complexity. The approach demonstrates improvements in efficiency and visual fidelity for 3D scene rendering.

The proposed adaptive shell approach reduces required rendering samples while maintaining high fidelity, facilitating downstream applications like animation and simulation. Outperforming baselines across all metrics showcases its effectiveness, particularly on the MipNeRF360 dataset. A gallery of results on the DTU dataset further illustrates rendered image quality. The comprehensive use of different metrics provides insights into the method's computational complexity and overall performance.

The research achieves performance comparable to baselines in PSNR, LPIPS, and SSIM metrics while being far more efficient. Combining NeuS with a spatially varying kernel size enhances NeRF rendering. The authors suggest further speedups by precomputing neural field outputs. Acknowledging limitations in capturing thin structures, they propose iterative procedures for future work. The study envisions real-time advancements in computer graphics through the synergy of neural representations and high-performance techniques.

Future work can include exploring iterative procedures to enhance reconstruction and adapt the shell iteratively. Investigating the synergy of recent neural representations with real-time graphics techniques is recommended. Further improvements in surface accuracy using SDF and global kernel size, potentially through regularization, are proposed. Combining the adaptive shell approach with precomputed neural field outputs on a discrete grid for additional speedups is suggested. Addressing limitations in capturing thin structures and reducing artifacts through iterative procedures and algorithmic advancements is identified as an avenue for future research.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post NVIDIA AI Researchers Present an Artificial Intelligence Approach for Efficiently Rendering NeRF by Restricting Volumetric Rendering to a Narrow Band Around the Object appeared first on MarkTechPost.

Meet ECOGEN: A Novel Deep Learning Approach Designed to Generate Reali …

The advent of deep learning has significantly influenced various fields, extending its impact across diverse domains. One notable application is its use in monitoring rare birds through their songs.

Differentiating between birds by their songs has become more accessible due to the availability of numerous mobile applications and software for ecologists and the general public. However, a major problem occurs when identification software comes across a species of bird with which it is unfamiliar or for which there are few reference recordings.

Ecologists and conservationists face the problem of monitoring some of the world’s rarest birds. To overcome this problem, researchers at the University of Moncton, Canada, have developed ECOGEN, which can generate lifelike bird sounds to enhance the samples of underrepresented species. These can then be used to train audio identification tools used in ecological monitoring.

Generating audio poses several challenges, including the substantial number of samples required for synthesis. Different formats are utilized for processing audio files, and many of these representations result in a loss of information, which complicates the production of high-quality audio samples. The waveform representation, which records sound pressure amplitude in the time domain, emerges as one of the most prevalent formats that maintains information integrity without loss.

To tackle this, ECOGEN has created novel instances of bird sounds to improve AI models. Essentially, ECOGEN enables the expansion of sound libraries for species with limited wild recordings without harming the animals or necessitating additional fieldwork.

The researchers found that adding synthetic bird song samples produced by ECOGEN to a bird song identifier improved bird song classification accuracy by 12% on average. One of the lead researchers, Dr. Nicolas Lecomte, underlined the urgent need for automated instruments, like acoustic monitoring, to track changes in biodiversity brought on by notable worldwide fluctuations in animal populations. However, thorough reference libraries are frequently absent from current AI models used for species identification in acoustic monitoring.

The researchers emphasized that creating synthetic bird songs can contribute to the conservation of endangered bird species and provide valuable insight into their vocalizations, behaviors, and habitat preferences.

Dr. Lecomte said the tool could benefit other types of animals as well: while ECOGEN was developed for birds, the team is confident it could be applied to mammals, fish, insects, and amphibians.

ECOGEN operates by transforming bird song recordings into spectrograms, visual representations of sound. It then generates new synthetic spectrogram images, augmenting the dataset specifically for rare species with limited recordings. These newly generated spectrograms are then converted back into audio to train bird sound identification models. In this study, the researchers utilized a dataset comprising 23,784 wild bird recordings sourced globally, encompassing 264 different species.
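To make the spectrogram round trip concrete, here is a minimal sketch of the general idea using the librosa library. This is an illustration only, not the authors' implementation, and the file paths and parameters are assumptions:

import librosa
import soundfile as sf

# Load a bird song recording (the path is hypothetical).
audio, sr = librosa.load("bird_song.wav", sr=22050)

# Convert the waveform to a mel spectrogram, the visual representation
# that a generative image model can be trained on.
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128)

# ... a generative model would produce new spectrograms at this point ...

# Convert a (real or generated) mel spectrogram back to audio so it can be
# used to train a bird sound identification model.
reconstructed = librosa.feature.inverse.mel_to_audio(mel, sr=sr)
sf.write("reconstructed_bird_song.wav", reconstructed, sr)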

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Meet ECOGEN: A Novel Deep Learning Approach Designed to Generate Realistic Bird Songs for Biologists and Ecologists appeared first on MarkTechPost.

This AI Research Helps Microbiologists to Identify Bacteria

A new AI research presents DeepColony, a comprehensive framework for colony identification and analysis in microbiology laboratories. The system utilizes high-resolution digital scans of cultured plates and employs a hierarchical structure with five levels for colony analysis and identification of bacteria. The levels range from identifying the locations and quantifying the number of colonies on a plate to assessing the clinical significance of the entire plate.

At level 0, DeepColony determines colony locations and quantities, providing essential spatial distribution information. Level 1 identifies isolated colonies for further analysis, considering criteria similar to those used by microbiologists. The core of DeepColony lies in levels 2 to 4, where the system performs initial species identification, refines identification rankings, and assesses the clinical significance of the entire plate.

The system’s architecture involves convolutional neural networks (CNNs) organized in a hierarchical structure. The CNN for single colony identification comprises four convolutional layers and one fully connected layer. DeepColony’s unique approach includes context-based identification, where a Siamese neural network is employed for a non-linear similarity-driven embedding. This embedding, combined with mean-shift clustering, enhances the identification of pathogenic species based on visual data.
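For intuition, a four-convolution, one-fully-connected classifier of the kind described could look like the following PyTorch sketch. This is not the authors' exact architecture; the channel widths, kernel sizes, input resolution, and number of classes are assumptions:

import torch
import torch.nn as nn

class SingleColonyCNN(nn.Module):
    """Illustrative four-conv, one-FC colony classifier (configuration assumed)."""
    def __init__(self, num_classes: int = 32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(128 * 4 * 4, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)                      # (N, 128, 4, 4) for 64x64 inputs
        return self.classifier(torch.flatten(x, 1))

# Example: classify a batch of eight 64x64 colony crops.
logits = SingleColonyCNN()(torch.randn(8, 3, 64, 64))
print(logits.shape)  # torch.Size([8, 32])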

The datasets used in the study include colony-level and plate-level data obtained from high-resolution digital scans of cultured plates. The system’s evaluation focused on urine cultures, and the dataset includes a diverse range of organisms.

DeepColony demonstrates the potential to improve the efficiency and quality of routine activities in microbiological laboratories. It can reduce the workload, make coherent decisions aligned with interpretation guidelines, and enhance the role of microbiologists. While the system has limitations, such as difficulty in identifying species in confluent areas, its safety-by-design feature minimizes the impact on result consistency.

In conclusion, DeepColony emerges as a unique framework with the capacity to refine and reinforce the critical role of microbiologists in high-throughput laboratories, offering significant potential for improving decision-making processes in microbiological analyses.
The post This AI Research Helps Microbiologists to Identify Bacteria appeared first on MarkTechPost.

How Amazon Music uses SageMaker with NVIDIA to optimize ML training an …

In the dynamic world of streaming on Amazon Music, every search for a song, podcast, or playlist holds a story, a mood, or a flood of emotions waiting to be unveiled. These searches serve as a gateway to new discoveries, cherished experiences, and lasting memories. The search bar is not just about finding a song; it’s about the millions of active users starting their personal journey into the rich and diverse world that Amazon Music has to offer.
Delivering a superior customer experience to instantly find the music that users search for requires a platform that is both smart and responsive. Amazon Music uses the power of AI to accomplish this. However, optimizing the customer experience while managing cost of training and inference of AI models that power the search bar’s capabilities, like real-time spellcheck and vector search, is difficult during peak traffic times.
Amazon SageMaker provides an end-to-end set of services that allow Amazon Music to build, train, and deploy on the AWS Cloud with minimal effort. By taking care of the undifferentiated heavy lifting, SageMaker allows you to focus on working on your machine learning (ML) models, and not worry about things such as infrastructure. As part of the shared responsibility model, SageMaker makes sure that the services it provides are reliable, performant, and scalable, while you make sure the application of the ML models makes the best use of the capabilities that SageMaker provides.
In this post, we walk through the journey Amazon Music took to optimize performance and cost using SageMaker and NVIDIA Triton Inference Server and TensorRT. We dive deep into showing how that seemingly simple, yet intricate, search bar works, ensuring an unbroken journey into the universe of Amazon Music with little-to-zero frustrating typo delays and relevant real-time search results.
Amazon SageMaker and NVIDIA: Delivering fast and accurate vector search and spellcheck capabilities
Amazon Music offers a vast library of over 100 million songs and millions of podcast episodes. However, finding the right song or podcast can be challenging, especially if you don’t know the exact title, artist, or album name, or the searched query is very broad, such as “news podcasts.”
Amazon Music has taken a two-pronged approach to improve the search and retrieval process. The first step is to introduce vector search (also known as embedding-based retrieval), an ML technique that can help users find the most relevant content they’re looking for by using semantics of the content. The second step involves introducing a Transformer-based Spell Correction model in the search stack. This can be especially helpful when searching for music, because users may not always know the exact spelling of a song title or artist name. Spell correction can help users find the music they’re looking for even if they make a spelling mistake in their search query.
Introducing Transformer models in a search and retrieval pipeline (in query embedding generation needed for vector search and the generative Seq2Seq Transformer model in Spell Correction) may lead to significant increase in overall latency, affecting customer experience negatively. Therefore, it became a top priority for us to optimize the real-time inference latency for vector search and spell correction models.
Amazon Music and NVIDIA have come together to bring the best possible customer experience to the search bar, using SageMaker to implement both fast and accurate spellcheck capabilities and real-time semantic search suggestions using vector search-based techniques. The solution includes using SageMaker hosting powered by G5 instances that uses NVIDIA A10G Tensor Core GPUs, SageMaker-supported NVIDIA Triton Inference Server Container, and the NVIDIA TensorRT model format. By reducing the inference latency of the spellcheck model to 25 milliseconds at peak traffic, and reducing search query embedding generation latency by 63% on average and cost by 73% compared to CPU based inference, Amazon Music has elevated the search bar’s performance.
Additionally, when training the AI model to deliver accurate results, Amazon Music achieved a 12-fold acceleration in training time for their BART sequence-to-sequence spell corrector transformer model by optimizing GPU utilization, saving them both time and money.
Amazon Music partnered with NVIDIA to prioritize the customer search experience and craft a search bar with well-optimized spellcheck and vector search functionalities. In the following sections, we share more about how these optimizations were orchestrated.
Optimizing training with NVIDIA Tensor Core GPUs
Gaining access to an NVIDIA Tensor Core GPU for large language model training is not enough to capture its true potential. There are key optimization steps that must happen during training in order to fully maximize the GPU's utilization. An underutilized GPU will undoubtedly lead to inefficient use of resources, prolonged training durations, and increased operational costs.
During the initial phases of training the spell corrector BART (bart-base) transformer model on a SageMaker ml.p3.24xlarge instance (8 NVIDIA V100 Tensor Core GPUs), Amazon Music’s GPU utilization was around 35%. To maximize the benefits of NVIDIA GPU-accelerated training, AWS and NVIDIA solution architects supported Amazon Music in identifying areas for optimizations, particularly around the batch size and precision parameters. These two crucial parameters influence the efficiency, speed, and accuracy of training deep learning models.
The resulting optimizations yielded a new and improved V100 GPU utilization, steady at around 89%, drastically reducing Amazon Music’s training time from 3 days to 5–6 hours. By switching the batch size from 32 to 256 and using optimization techniques like running automatic mixed precision training instead of only using FP32 precision, Amazon Music was able to save both time and money.
The following chart illustrates the 54-percentage-point increase in GPU utilization after optimizations.

The following figure illustrates the acceleration in training time.

This increase in batch size enabled the NVIDIA GPU to process significantly more data concurrently across multiple Tensor Cores, resulting in accelerated training time. However, it’s important to maintain a delicate balance with memory, because larger batch sizes demand more memory. Both increasing batch size and employing mixed precision can be critical in unlocking the power of NVIDIA Tensor Core GPUs.
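As an illustration of the two levers described above, an automatic mixed precision training loop in PyTorch might look like the following. This is not Amazon Music's actual training code; the model, data, and learning rate are placeholders, with only the batch size of 256 taken from the text:

import torch
from torch.cuda.amp import GradScaler, autocast

# Placeholders: in practice this would be the BART spell-correction model and its dataset.
model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = [(torch.randn(256, 512), torch.randn(256, 512)) for _ in range(10)]  # batch size 256

scaler = GradScaler()  # scales the loss so FP16 gradients don't underflow
for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    with autocast():                      # run the forward pass in mixed precision
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()         # backward pass on the scaled loss
    scaler.step(optimizer)
    scaler.update()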
After the model was trained to convergence, it was time to optimize for inference deployment on Amazon Music’s search bar.
Spell Correction: BART model inferencing
With the help of SageMaker G5 instances and NVIDIA Triton Inference Server (an open source inference serving software), as well as NVIDIA TensorRT, an SDK for high-performance deep learning inference that includes an inference optimizer and runtime, Amazon Music limits their spellcheck BART (bart-base) model server inference latency to just 25 milliseconds at peak traffic. This includes overheads like load balancing, preprocessing, model inferencing, and postprocessing times.
NVIDIA Triton Inference Server provides two kinds of backends here: one for hosting models on the GPU, and a Python backend where you can bring your own custom code to be used in preprocessing and postprocessing steps. The following figure illustrates the model ensemble scheme.

Amazon Music built its BART inference pipeline by running both preprocessing (text tokenization) and postprocessing (tokens to text) steps on CPUs, whereas the model execution step runs on NVIDIA A10G Tensor Core GPUs. A Python backend sits in the middle of the preprocessing and postprocessing steps, and is responsible for communicating with the TensorRT-converted BART models as well as the encoder/decoder networks. TensorRT boosts inference performance with precision calibration, layer and tensor fusion, kernel auto-tuning, dynamic tensor memory, multi-stream execution, and time fusion.
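For context, a Triton Python backend model is a model.py file that implements a TritonPythonModel class. The following is a generic sketch of such a preprocessing step; the tensor names and the tokenizer are assumptions for illustration, not Amazon Music's actual pipeline:

import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer

class TritonPythonModel:
    """Generic preprocessing backend: turns raw text into token IDs that a
    downstream TensorRT model in the ensemble can consume."""

    def initialize(self, args):
        # Tokenizer choice is illustrative; a BART pipeline would use a BART tokenizer.
        self.tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

    def execute(self, requests):
        responses = []
        for request in requests:
            # "TEXT" and "INPUT_IDS" are assumed tensor names for this sketch.
            text = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            decoded = [t.decode("utf-8") for t in text.flatten()]
            ids = self.tokenizer(decoded, return_tensors="np", padding=True)["input_ids"]
            out = pb_utils.Tensor("INPUT_IDS", ids.astype(np.int64))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses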
The following figure illustrates the high-level design of the key modules that make up the spell corrector BART model inferencing pipeline.

Vector search: Query embedding generation sentence BERT model inferencing
The following chart illustrates the 60% improvement in latency (serving p90 800–900 TPS) when using the NVIDIA AI Inference Platform compared to a CPU-based baseline.

The following chart shows a 70% improvement in cost when using the NVIDIA AI Inference Platform compared to a CPU-based baseline.

The following figure illustrates NVIDIA TensorRT, an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.

To achieve these results, Amazon Music experimented with several different Triton deployment parameters using Triton Model Analyzer, a tool that helps find the best NVIDIA Triton model configuration to deploy efficient inference. To optimize model inference, Triton offers features like dynamic batching and concurrent model execution, along with broad framework support and other flexible capabilities. Dynamic batching gathers inference requests and seamlessly groups them together into cohorts in order to maximize throughput, all while ensuring real-time responses for Amazon Music users. The concurrent model execution capability further enhances inference performance by hosting multiple copies of the model on the same GPU. Finally, by utilizing Triton Model Analyzer, Amazon Music was able to carefully fine-tune the dynamic batching and model concurrency inference hosting parameters to find optimal settings that maximize inference performance using simulated traffic.
Conclusion
Optimizing configurations with Triton Inference Server and TensorRT on SageMaker allowed Amazon Music to achieve outstanding results for both training and inference pipelines. The SageMaker platform is the end-to-end open platform for production AI, providing quick time to value and the versatility to support all major AI use cases across both hardware and software. By optimizing V100 GPU utilization for training and switching from CPUs to G5 instances using NVIDIA A10G Tensor Core GPUs, as well as by using optimized NVIDIA software like Triton Inference Server and TensorRT, companies like Amazon Music can save time and money while boosting performance in both training and inference, directly translating to a better customer experience and lower operating costs.
SageMaker handles the undifferentiated heavy lifting for ML training and hosting, allowing Amazon Music to deliver reliable, scalable ML operations across both hardware and software.
We encourage you to check that your workloads are optimized using SageMaker by always evaluating your hardware and software choices to see if there are ways you can achieve better performance with decreased costs.
To learn more about NVIDIA AI in AWS, refer to the following:

How Amazon Search achieves low-latency, high-throughput T5 inference with NVIDIA Triton on AWS
Deploy fast and scalable AI with NVIDIA Triton Inference Server in Amazon SageMaker
Achieve hyper-scale performance for model serving using NVIDIA Triton Inference Server on Amazon SageMaker
NVIDIA H100 Tensor Core GPUs Now Available on AWS Cloud

About the authors
Siddharth Sharma is a Machine Learning Tech Lead at the Science & Modeling team at Amazon Music. He specializes in search, retrieval, ranking, and NLP-related modeling problems. Siddharth has a rich background working on large-scale, latency-sensitive machine learning problems such as ads targeting, multimodal retrieval, and search query understanding. Prior to working at Amazon Music, Siddharth worked at companies like Meta, Walmart Labs, and Rakuten on e-commerce-centric ML problems. He spent the early part of his career working with Bay Area ad-tech startups.
Tarun Sharma is a Software Development Manager leading Amazon Music Search Relevance. His team of scientists and ML engineers is responsible for providing contextually relevant and personalized search results to Amazon Music customers.
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.
Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.
Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.
Tugrul Konuk is a Senior Solution Architect at NVIDIA, specializing at large-scale training, multimodal deep learning, and high-performance scientific computing. Prior to NVIDIA, he worked at the energy industry, focusing on developing algorithms for computational imaging. As part of his PhD, he worked on physics-based deep learning for numerical simulations at scale. In his leisure time, he enjoys reading, playing the guitar and the piano.
Rohil Bhargava is a Product Marketing Manager at NVIDIA, focused on deploying NVIDIA application frameworks and SDKs on specific CSP platforms.
Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA empowering Amazon’s AI MLOps, DevOps, Scientists and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing Generative AI Foundation models spanning from data curation, GPU training, model inference and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.

Machine Learning with MATLAB and Amazon SageMaker

This post is written in collaboration with Brad Duncan, Rachel Johnson and Richard Alcock from MathWorks.
MATLAB is a popular programming tool for a wide range of applications, such as data processing, parallel computing, automation, simulation, machine learning, and artificial intelligence. It's heavily used in many industries such as automotive, aerospace, communication, and manufacturing. In recent years, MathWorks has brought many product offerings into the cloud, especially on Amazon Web Services (AWS). For more details about MathWorks cloud products, see MATLAB and Simulink in the Cloud or email MathWorks.
In this post, we bring MATLAB’s machine learning capabilities into Amazon SageMaker, which has several significant benefits:

Compute resources: Using the high-performance computing environment offered by SageMaker can speed up machine learning training.
Collaboration: MATLAB and SageMaker together provide a robust platform that teams can use to collaborate effectively on building, testing, and deploying machine learning models.
Deployment and accessibility: Models can be deployed as SageMaker real-time endpoints, making them readily accessible for other applications to process live streaming data.

We show you how to train a MATLAB machine learning model as a SageMaker training job and then deploy the model as a SageMaker real-time endpoint so it can process live, streaming data.
To do this, we’ll use a predictive maintenance example where we classify faults in an operational pump that’s streaming live sensor data. We have access to a large repository of labeled data generated from a Simulink simulation that has three possible fault types in various possible combinations (for example, one healthy and seven faulty states). Because we have a model of the system and faults are rare in operation, we can take advantage of simulated data to train our algorithm. The model can be tuned to match operational data from our real pump using parameter estimation techniques in MATLAB and Simulink.

Our objective is to demonstrate the combined power of MATLAB and Amazon SageMaker using this fault classification example.
We start by training a classifier model on our desktop with MATLAB. First, we extract features from a subset of the full dataset using the Diagnostic Feature Designer app, and then run the model training locally with a MATLAB decision tree model. Once we’re satisfied with the parameter settings, we can generate a MATLAB function and send the job along with the dataset to SageMaker. This allows us to scale up the training process to accommodate much larger datasets. After training our model, we deploy it as a live endpoint which can be integrated into a downstream app or dashboard, such as a MATLAB Web App.

This example will summarize each step, providing a practical understanding of how to leverage MATLAB and Amazon SageMaker for machine learning tasks. The full code and description for the example is available in this repository.
Prerequisites

Working environment of MATLAB 2023a or later with MATLAB Compiler and the Statistics and Machine Learning Toolbox on Linux. Here is a quick guide on how to run MATLAB on AWS.
Docker set up in an Amazon Elastic Compute Cloud (Amazon EC2) instance where MATLAB is running, on either Ubuntu or another Linux distribution.
Installation of AWS Command-Line Interface (AWS CLI), AWS Configure, and Python3.

The AWS CLI should already be installed if you followed the installation guide in step 1.
Set up AWS Configure to interact with AWS resources.
Verify your Python 3 installation by running the python -V or python --version command in your terminal. Install Python if necessary.

Copy this repo to a folder in your Linux machine by running:

git clone https://github.com/mathworks/Machine-Learning-with-MATLAB-and-Amazon-Sagemaker-Demo.git

Check the permission on the repo folder. If it does not have write permission, run the following shell command:

sudo chmod -R 777 Machine-Learning-with-MATLAB-and-Amazon-Sagemaker-Demo

Build the MATLAB training container and push it to the Amazon Elastic Container Registry (Amazon ECR).

Navigate to the docker folder.
Create an Amazon ECR repo using the AWS CLI (replace REGION with your preferred AWS region)

aws ecr create-repository \
    --repository-name sagemaker-matlab-training \
    --image-scanning-configuration scanOnPush=true \
    --region REGION

Run the following docker command:

docker build -t sagemaker-matlab-training-r2023a .

docker tag sagemaker-matlab-training-r2023a ACCOUNT.dkr.ecr.REGION.amazonaws.com/sagemaker-matlab-training-r2023a:latest

aws ecr get-login-password --region REGION | docker login --username AWS --password-stdin ACCOUNT.dkr.ecr.REGION.amazonaws.com

docker push ACCOUNT.dkr.ecr.REGION.amazonaws.com/sagemaker-matlab-training-r2023a:latest

Open MATLAB and open the live script called PumpFaultClassificationMATLABSageMaker.mlx in folder examples/PumpFaultClassification. Make this folder your current working folder in MATLAB.

Part 1: Data preparation & feature extraction 
The first step in any machine learning project is to prepare your data. MATLAB provides a wide range of tools for importing, cleaning, and extracting features from your data:

load SensorData.mat

The SensorData.mat dataset contains 240 records. Each record has two timetables: flow and pressure. The target column is faultcode, which is a binary representation of three possible fault combinations in the pump. Each of these time series tables has 1,201 rows, which mimic 1.2 seconds of pump flow and pressure measurements sampled at 0.001-second increments.

Next, the Diagnostic Feature Designer app allows you to extract, visualize, and rank a variety of features from the data. Here, you use Auto Features, which quickly extracts a broad set of time and frequency domain features from the dataset and ranks the top candidates for model training. You can then export a MATLAB function that will recompute the top 15 ranked features from new input data. Let’s call this function extractFeaturesTraining. This function can be configured to take in data all in one batch or as streaming data.

This function produces a table of features with associated fault codes, as shown in the following figure:

Part 2: Organize data for SageMaker 
Next, you need to organize the data in a way that SageMaker can use for machine learning training. Typically, this involves splitting the data into training and validation sets and splitting the predictor data from the target response.
In this stage, other more complex data cleaning and filtering operations might be required. In this example, the data is already clean. Potentially, if the data processing is very complex and time consuming, SageMaker processing jobs can be used to run these jobs apart from SageMaker training so that they can be separated into two steps.
trainPredictors = trainingData(:,2:end);
trainResponse = trainingData(:,1);
Part 3: Train and test a machine learning model in MATLAB 
Before moving to SageMaker, it’s a good idea to build and test the machine learning model locally in MATLAB. This allows you to quickly iterate and debug the model. You can set up and train a simple decision tree classifier locally.
classifierModel = fitctree( ...
    trainPredictors, ...
    trainResponse, ...
    OptimizeHyperparameters='auto');
The training job here should take less than a minute to finish and generates some graphs to indicate the training progress. After the training is finished, a MATLAB machine learning model is produced. The Classification Learner app can be used to try many types of classification models and tune them for best performance, then produce the needed code to replace the model training code above.

After checking the accuracy metrics for the locally-trained model, we can move the training into Amazon SageMaker.
Part 4: Train the model in Amazon SageMaker 
After you’re satisfied with the model, you can train it at scale using SageMaker. To begin calling SageMaker SDKs, you need to initiate a SageMaker session.
session = sagemaker.Session();
Specify a SageMaker execution IAM role that training jobs and endpoint hosting will use.
role = "arn:aws:iam::ACCOUNT:role/service-role/AmazonSageMaker-ExecutionRole-XXXXXXXXXXXXXXX";
From MATLAB, save the training data as a .csv file to an Amazon Simple Storage Service (Amazon S3) bucket.
writetable(trainingData, 'pump_training_data.csv');
trainingDataLocation = "s3://" + session.DefaultBucket + "/cooling_system/input/pump_training";
copyfile("pump_training_data.csv", trainingDataLocation);
Create a SageMaker Estimator
Next, you need to create a SageMaker estimator and pass all the necessary parameters to it, such as a training docker image, training function, environment variables, training instance size, and so on. The training image URI should be the Amazon ECR URI you created in the prerequisite step with the format ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/sagemaker-matlab-training-r2023a:latest. The training function should be provided at the bottom of the MATLAB live script.

trainingImage = "ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/sagemaker-matlab-training-r2023a:latest";

est = sagemaker.MATLABEstimator( ...
    role, ...
    Image=trainingImage, ...
    Session=session, ...
    BaseJobName="PumpDecisionTreeMatlab", ...
    Environment=loadenv(fullfile(rootFolder, "training.env")), ...
    TrainingFunction=@trainingFunction, ...
    HyperParameters=struct(), ... % named args to train_decision_tree
    InstanceType="ml.m5.large", ...
    MaxRunTime=minutes(10), ...
    MaxWaitTime=minutes(20), ...
    UseSpotInstances=true);

Submit SageMaker training job
Calling the fit method from the estimator submits the training job into SageMaker.
est.fit(training=struct(Location=trainingDataLocation, ContentType="text/csv"))
You can also check the training job status from the SageMaker console:

After the training jobs finishes, selecting the job link takes you to the job description page where you can see the MATLAB model saved in the dedicated S3 bucket:

Part 5: Deploy the model as a real-time SageMaker endpoint 
After training, you can deploy the model as a real-time SageMaker endpoint, which you can use to make predictions in real time. To do this, call the deploy method from the estimator. This is where you can set up the desired instance size for hosting depending on the workload.

predictor = est.deploy(role, "ClassificationTreeInferenceHandler", uint8(1), "ml.m5.large")

Behind the scenes, this step builds an inference docker image and pushes it to the Amazon ECR repository; nothing is required from the user to build the inference container. The image contains all the necessary information to serve the inference request, such as model location, MATLAB authentication information, and algorithms. After that, Amazon SageMaker creates a SageMaker endpoint configuration and finally deploys the real-time endpoint. The endpoint can be monitored in the SageMaker console and can be terminated anytime if it's no longer used.

Part 6: Test the endpoint 
Now that the endpoint is up and running, you can test the endpoint by giving it a few records to predict. Use the following code to select 10 records from the training data and send them to the endpoint for prediction. The prediction result is sent back from the endpoint and shown in the following image.

input = trainPredictors(10:19,:)
prediction = predictor.predict(input)

Part 7: Dashboard integration 
The SageMaker endpoint can be called by many native AWS services. It can also be used as a standard REST API if deployed together with an AWS Lambda function and Amazon API Gateway, which can be integrated with any web application. For this particular use case, you can use streaming ingestion with Amazon SageMaker Feature Store and Amazon Managed Streaming for Apache Kafka (Amazon MSK) to make machine learning-backed decisions in near real time. Another possible integration is using a combination of Amazon Kinesis, SageMaker, and Apache Flink to build a managed, reliable, scalable, and highly available application that's capable of real-time inferencing on a data stream.
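For example, any Python-based service (such as a Lambda function) could invoke the deployed endpoint through the AWS SDK. The following is a minimal sketch; the endpoint name and CSV payload shape are assumptions for illustration and must match what the deployed MATLAB inference handler expects:

import boto3

runtime = boto3.client("sagemaker-runtime")

# Endpoint name and feature values below are placeholders for this sketch.
response = runtime.invoke_endpoint(
    EndpointName="pump-fault-classifier-endpoint",
    ContentType="text/csv",
    Body="0.12,0.34,0.56",  # one row of extracted features
)
prediction = response["Body"].read().decode("utf-8")
print(prediction)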
After algorithms are deployed to a SageMaker endpoint, you might want to visualize them using a dashboard that displays streaming predictions in real time. In the custom MATLAB web app that follows, you can see pressure and flow data by pump, and live fault predictions from the deployed model.
This dashboard also includes a remaining useful life (RUL) model that predicts the time to failure for each pump in question. To learn how to train RUL algorithms, see Predictive Maintenance Toolbox.

Clean Up
After you run this solution, make sure you clean up any unneeded AWS resources to avoid unexpected costs. You can clean up these resources using the SageMaker Python SDK or the AWS Management Console for the specific services used here (SageMaker, Amazon ECR, and Amazon S3). By deleting these resources, you prevent further charges for resources you’re no longer using.
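For instance, the endpoint created above could be removed programmatically with the AWS SDK for Python; a short sketch follows, where the resource names are placeholders for whatever the deployment step created:

import boto3

sm = boto3.client("sagemaker")

# Replace these placeholder names with the endpoint, endpoint config,
# and model names produced by the deployment step.
sm.delete_endpoint(EndpointName="pump-fault-classifier-endpoint")
sm.delete_endpoint_config(EndpointConfigName="pump-fault-classifier-endpoint-config")
sm.delete_model(ModelName="pump-fault-classifier-model")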
Conclusion
We’ve demonstrated how you can bring MATLAB to SageMaker for a pump predictive maintenance use case with the entire machine learning lifecycle. SageMaker provides a fully managed environment for running machine learning workloads and deploying models with a great selection of compute instances serving various needs.
Disclaimer: The code used in this post is owned and maintained by MathWorks. Refer to the license terms in the GitHub repo. For any issues with the code or feature requests, please open a GitHub issue in the repository.
References

https://github.com/mathworks/Machine-Learning-with-MATLAB-and-Amazon-Sagemaker-Demo
https://aws.amazon.com/blogs/machine-learning/use-streaming-ingestion-with-amazon-sagemaker-feature-store-and-amazon-msk-to-make-ml-backed-decisions-in-near-real-time/
https://aws.amazon.com/blogs/architecture/realtime-in-stream-inference-kinesis-sagemaker-flink/
https://github.com/mathworks-ref-arch/matlab-on-aws
https://www.mathworks.com/products/matlab.html
https://www.mathworks.com/solutions/cloud.html
https://docs.docker.com/engine/install/ubuntu/
https://docs.docker.com/engine/install/linux-postinstall/

About the Authors
Brad Duncan is the product manager for machine learning capabilities in the Statistics and Machine Learning Toolbox at MathWorks. He works with customers to apply AI in new areas of engineering such as incorporating virtual sensors in engineered systems, building explainable machine learning models, and standardizing AI workflows using MATLAB and Simulink. Before coming to MathWorks he led teams for 3D simulation and optimization of vehicle aerodynamics, user experience for 3D simulation, and product management for simulation software. Brad is also a guest lecturer at Tufts University in the area of vehicle aerodynamics.
Richard Alcock is the senior development manager for Cloud Platform Integrations at MathWorks. In this role, he is instrumental in seamlessly integrating MathWorks products into cloud and container platforms. He creates solutions that enable engineers and scientists to harness the full potential of MATLAB and Simulink in cloud-based environments. He was previously a software engineer at MathWorks, developing solutions to support parallel and distributed computing workflows.
Rachel Johnson is the product manager for predictive maintenance at MathWorks, and is responsible for overall product strategy and marketing. She was previously an application engineer directly supporting the aerospace industry on predictive maintenance projects. Prior to MathWorks, Rachel was an aerodynamics and propulsion simulation engineer for the US Navy. She also spent several years teaching math, physics, and engineering.
Shun Mao is a Senior AI/ML Partner Solutions Architect in the Emerging Technologies team at Amazon Web Services. He is passionate about working with enterprise customers and partners to design, deploy and scale AI/ML applications to derive their business values. Outside of work, he enjoys fishing, traveling and playing Ping-Pong.
Ramesh Jatiya is a Solutions Architect in the Independent Software Vendor (ISV) team at Amazon Web Services. He is passionate about working with ISV customers to design, deploy and scale their applications in cloud to derive their business values. He is also pursuing an MBA in Machine Learning and Business Analytics from Babson College, Boston. Outside of work, he enjoys running, playing tennis and cooking.

Text embedding and sentence similarity retrieval at scale with Amazon …

Text vectors or embeddings are numerical vector representations of text that are generated by large language models (LLMs). After LLMs are fully pre-trained on a large dataset or fine-tuned from different tasks, including text completion, question answering, and translations, text embeddings capture semantic information of the input text. Different downstream applications are made possible by text embeddings, including similarity searching, information retrieval, recommendations and personalization, multilingual translations, and more.
Before intelligent applications could be built from embeddings, enterprises and organizations had to embed their existing documents, which can be expensive and technically complicated. Amazon SageMaker JumpStart is a machine learning (ML) hub that helps accelerate this journey. With SageMaker JumpStart, you can access pre-trained, cutting-edge text embedding models from various model providers, including Hugging Face, AI21 Labs, Cohere, and Meta AI. You can seamlessly deploy these models into production with the SageMaker JumpStart user interface or SDK. In addition, none of your data is used to train the underlying models. Because all data is encrypted and doesn't leave its own VPC, you can trust your data remains private and confidential.
In this post, we demonstrate how to use the SageMaker Python SDK for text embedding and sentence similarity. Sentence similarity involves assessing the likeness between two pieces of text after they are converted into embeddings by the LLM, which is a foundation step for applications like Retrieval Augmented Generation (RAG). We demonstrate how to do the following:

Run inference on a text embedding model deployed from SageMaker JumpStart
Find the nearest neighbors for an input sentence with your own dataset
Run the batch transform on large documents to minimize costs

All the code is available on GitHub.
Deploy a text embedding model via SageMaker JumpStart
To host a model on Amazon SageMaker, the first step is to set up and authenticate the use of AWS services. In Amazon SageMaker Studio, we use the execution role associated with the notebook instance. See the following code:

import sagemaker, boto3, json
from sagemaker.session import Session
sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

On Hugging Face, the Massive Text Embedding Benchmark (MTEB) is provided as a leaderboard for diverse text embedding tasks. It currently provides 129 benchmarking datasets across 8 different tasks on 113 languages. The top text embedding models from the MTEB leaderboard are made available from SageMaker JumpStart, including bge, gte, e5, and more. In this post, we use huggingface-sentencesimilarity-bge-large-en as an example. We can use the SageMaker SDK to deploy this state-of-the-art text embedding model:

from sagemaker.jumpstart.model import JumpStartModel

model_id = "huggingface-sentencesimilarity-bge-large-en"
text_embedding_model = JumpStartModel(model_id=model_id)
predictor = text_embedding_model.deploy()

Text embedding model query
Let’s look at the text embedding model query in more detail.
Text to embedding
If you have already deployed a SageMaker endpoint before, the predictor can be restored as follows:

from sagemaker.predictor import Predictor
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import IdentitySerializer

predictor = Predictor(
    endpoint_name=<YOUR_ENDPOINT_NAME>,
    deserializer=JSONDeserializer(),
    serializer=IdentitySerializer(),
)
predictor.content_type = "application/x-text"

After the model is successfully deployed, you can query the endpoint with a batch of input texts within a JSON payload:

sentences = [
    # Pets
    "Your dog is so cute.",
    "How cute your dog is!",
    "You have such a cute dog!",
    # Cities
    "Sydney is the place where I work.",
    "I work in Sydney.",
    # Color
    "What colour do you like the most?",
    "What is your favourite colour?",
]

predictor.predict(json.dumps(sentences).encode('utf-8'))

The correlation of the embeddings of these sentences is plotted in the following figure.

As shown in the preceding figure, sentences on the same subject (Pets, Cities, or Color) are highly correlated with one another, while sentences on different subjects are much less similar. This indicates that the embeddings generated by the LLM (in this case, bge) represent the semantic information accurately.
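Since the plotting code is not included in the post, the following is a minimal sketch of how the underlying similarity matrix could be computed with NumPy. It assumes the endpoint returns one embedding vector per input sentence under an embedding key; the actual response schema of the deployed model may differ.

import numpy as np

# Assumption: the response carries one vector per input sentence under "embedding";
# adjust the key to the actual response schema of the deployed model.
response = predictor.predict(json.dumps(sentences).encode("utf-8"))
embeddings = np.array(response["embedding"])

# Cosine similarity between every pair of sentence embeddings
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity_matrix = normalized @ normalized.T
print(np.round(similarity_matrix, 3))
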
For this post, we used the preceding sample and compared the latency across different sentence embedding models currently available from SageMaker JumpStart. Latency is the amount of time from the moment that a user sends a request until the time that the application indicates that the request has been completed. The numbers in the following table represent the average latency for a total of 100 requests using the same batch of input texts on the ml.g5.2xlarge and ml.c6i.xlarge instances.

| Model | g5.2xlarge Average Latency (ms) | c6i.xlarge Average Latency (ms) | Language Support |
| --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 19.5 | 27.9 | English |
| BGE Base En | 21.2 | 114 | English |
| BGE Small En | 28.3 | 45.6 | English |
| BGE Large En | 34.7 | 337 | English |
| Multilingual E5 Base | 22.1 | 118 | Multilingual |
| Multilingual E5 Large | 39.8 | 360 | Multilingual |
| E5 Base | 25.6 | 117 | English |
| E5 Base V2 | 25.2 | 123 | English |
| E5 Large | 32.2 | 339 | English |
| E5 Large V2 | 32.5 | 331 | English |
| GTE Base | 22.2 | 112 | English |
| GTE Small | 19.7 | 46 | English |
| GTE Large | 39.7 | 347 | English |

Get the nearest neighbors
The deployed model from SageMaker JumpStart can also facilitate the process of identifying the nearest neighbors to queries within the corpus. When provided with queries and a corpus, the model will produce the corpus_id, which denotes the position of the relevant corpus entry in the input corpus list, and a score indicating the degree of proximity to the query. It uses the following parameters:

corpus – Provides the list of inputs from which to find the nearest neighbor
queries – Provides the list of inputs for which to find the nearest neighbor from the corpus
top_k – The number of nearest neighbors to find from the corpus
mode – Set as nn_corpus for getting the nearest neighbors to input queries within the corpus

See the following code:

corpus = [
    "Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.",
    "Amazon SageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted at rest.",
    "Amazon SageMaker provides a full end-to-end workflow, but you can continue to use your existing tools with SageMaker. You can easily transfer the results of each stage in and out of SageMaker as your business requirements dictate."
]
queries = [
    "What is Amazon SageMaker?",
    "How does Amazon SageMaker secure my code?",
    "What if I have my own notebook, training, or hosting environment in my own business environment?"
]

payload_nearest_neighbor = {"corpus": corpus, "queries": queries, "top_k": 3, "mode": "nn_corpus"}
query_response = predictor.predict(payload_nearest_neighbor)

We get the following output:

[
    [
        {'corpus_id': 0, 'score': 0.8992230892181396},
        {'corpus_id': 2, 'score': 0.8664969205856323},
        {'corpus_id': 1, 'score': 0.8456423282623291}
    ],
    [
        {'corpus_id': 1, 'score': 0.8919335603713989},
        {'corpus_id': 0, 'score': 0.840064525604248},
        {'corpus_id': 2, 'score': 0.8145401477813721}
    ],
    [
        {'corpus_id': 2, 'score': 0.7712811231613159},
        {'corpus_id': 1, 'score': 0.7564010620117188},
        {'corpus_id': 0, 'score': 0.7525666356086731}
    ]
]

This result means the first query is most similar to the first corpus entry, the second query is closest to the second entry, and so on. These are the correct matches in this example.
We also took the preceding sample and compared the latency across different sentence embedding models currently available from SageMaker JumpStart. The numbers in the following table represent the average latency for a total of 100 requests using the same payload on the ml.g5.2xlarge and ml.c6i.xlarge instances.

| Model | g5.2xlarge Average Latency (ms) | c6i.xlarge Average Latency (ms) | Language Support |
| --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 21.7 | 69.1 | English |
| BGE Base En | 29.1 | 372 | English |
| BGE Small En | 29.2 | 124 | English |
| BGE Large En | 47.2 | 1240 | English |
| Multilingual E5 Base | 30 | 389 | Multilingual |
| Multilingual E5 Large | 47.1 | 1380 | Multilingual |
| E5 Base | 30.4 | 373 | English |
| E5 Base V2 | 31 | 409 | English |
| E5 Large | 45.9 | 1230 | English |
| E5 Large V2 | 49.6 | 1220 | English |
| GTE Base | 30.3 | 375 | English |
| GTE Small | 28.5 | 129 | English |
| GTE Large | 46.6 | 1320 | English |

Get the nearest neighbors on a large dataset
When making requests to the SageMaker invoke endpoint, payloads are restricted to approximately 5 MB, and the request timeout is set to 1 minute. If the corpus size exceeds these limits, you can use a SageMaker training job, which generates embeddings for your large dataset and persists them alongside the model inside the SageMaker endpoint, so they don't have to be passed as part of the invocation payload. The nearest neighbors are found using SentenceTransformer and its utility function, based on the cosine similarity between the input sentence embedding and the sentence embeddings precomputed during the training job.
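As a rough local illustration of that mechanism (not the code the training job itself runs), the sentence-transformers library exposes a semantic_search utility that ranks a precomputed corpus by cosine similarity. The model name and data below are placeholders chosen for the sketch.

from sentence_transformers import SentenceTransformer, util

# Illustrative only: the SageMaker training job performs the equivalent steps
# (embed the corpus once, then rank it by cosine similarity for each query).
model = SentenceTransformer("BAAI/bge-large-en")  # placeholder model name
corpus = [
    "Amazon SageMaker is a fully managed machine learning service.",
    "Amazon SageMaker stores code in ML storage volumes, encrypted at rest.",
]
queries = ["What is Amazon SageMaker?"]

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embeddings = model.encode(queries, convert_to_tensor=True)

hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=1)
print(hits)  # e.g. [[{'corpus_id': 0, 'score': 0.89}]]
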
In the following example, we fetch and prepare the Amazon_SageMaker_FAQs dataset to use it in finding the nearest neighbor to an input question:

!aws s3 cp s3://jumpstart-cache-prod-us-west-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv Amazon_SageMaker_FAQs.csv

import pandas as pd

data = pd.read_csv("Amazon_SageMaker_FAQs.csv", names=["Questions", "Answers"])
data["id"] = data.index
data_req = data[["id", "Answers"]]
data_req.to_csv("data.csv", index=False, header=False)

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-ss-training"

s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
training_dataset_s3_path = f"s3://{output_bucket}/{output_prefix}/data/data.csv"

!aws s3 cp data.csv {training_dataset_s3_path}

Algorithm-specific training hyperparameters can be fetched through the SageMaker SDK and overridden as needed:

from sagemaker import hyperparameters

hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version="*")
hyperparameters["batch_size"] = "64"
print(hyperparameters)
>>> {'max_seq_length': 'None', 'batch_size': '64', 'store_text_with_embedding': 'True'}

The SageMaker training consists of two steps: create the estimator object and launch the training job. The output is a model prepackaged with embeddings of your large dataset used as training data, which can be deployed for inference to get the nearest neighbor for any input sentence. See the following code:

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id=model_id,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

estimator.fit(
    {"training": f"s3://{output_bucket}/{output_prefix}/data"}
)
predictor = estimator.deploy()

The query syntax to convert text into embeddings is the same as before. The code to get the nearest neighbor, however, can be simplified as follows:

payload_nearest_neighbour = {
    "queries": ["Is R supported with Amazon SageMaker?"],
    "top_k": 1,
    "mode": "nn_train_data",
}

response = predictor.predict(payload_nearest_neighbour)
>>> [[{'id': '9', 'score': 0.9240573048591614}]]

data["Answers"].iloc[int(response[0][0]["id"])]
>>> "Yes, R is supported with Amazon SageMaker. You can use R within SageMaker notebook instances, which include a preinstalled R kernel and the reticulate library. Reticulate offers an R interface for the Amazon SageMaker Python SDK, enabling ML practitioners to build, train, tune, and deploy R models."

We can also query the endpoint with the questions in the Amazon_SageMaker_FAQs dataset and compare how many of the correct corresponding answers are returned. In the following example, we measure top-3 accuracy, because there can be similar question-answer pairs: if the correct answer is returned as one of the top 3 results, it's treated as a correct query.

total_correct_answers = 0

for i in range(len(data)):
    question = data["Questions"].iloc[i]
    payload_nearest_neighbor = {
        "queries": [question],
        "top_k": 3,
        "mode": "nn_train_data",
    }
    response = predictor.predict(payload_nearest_neighbor)
    response_ids = [int(res["id"]) for res in response[0]]

    if i in response_ids:
        total_correct_answers += 1
    else:
        pred_answer = [data["Answers"].iloc[response_id] for response_id in response_ids]

print(total_correct_answers*100/len(data))
>>>
81.16883116883118

Run a batch transform to get embeddings on large datasets
For enterprises and organizations with a large volume of historical documents that exceeds the memory of a single endpoint instance, you can use SageMaker batch transform to save cost. When you start a batch transform job, SageMaker launches the necessary compute resources to process the data. During the job, SageMaker automatically provisions and manages the compute resources. When the batch transform job is complete, those resources are automatically cleaned up, which minimizes costs. By dividing a large dataset into smaller chunks and using more instances, you can scale out the compute for faster inference at a similar cost, without managing infrastructure. The maximum payload for batch transform is 100 MB, and the timeout is 1 hour.
The input format for our batch transform job is a JSONL file, with each entry being a line of JSON that consists of id and text_inputs. See the following code:

test_data_file_name = "test.jsonl"
test_data = []

for i in range(len(data)):
    answer = data.loc[i, "Answers"]
    payload = {"id": i, "text_inputs": answer}
    test_data.append(payload)

with open(test_data_file_name, "w") as outfile:
    for entry in test_data:
        outfile.write(f"{json.dumps(entry)}\n")

s3 = boto3.client("s3")
s3.upload_file(test_data_file_name, output_bucket, f"{output_prefix}/batch_input/test.jsonl")

When the data is ready in Amazon Simple Storage Service (Amazon S3), you can create the batch transform object from the SageMaker JumpStart model, which triggers the transform job:

s3_input_data_path = f"s3://{output_bucket}/{output_prefix}/batch_input/"
s3_output_data_path = f"s3://{output_bucket}/{output_prefix}/batch_output/"

batch_transformer = text_embedding_model.transformer(
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path=s3_output_data_path,
    assemble_with="Line",
    accept="text/csv",
    max_payload=1,
)

batch_transformer.transform(
    s3_input_data_path,
    content_type="application/jsonlines",
    split_type="Line"
)

batch_transformer.wait()

After the batch transform job is complete, you can download the result from Amazon S3:

s3 = boto3.client("s3")
s3.download_file(
    output_bucket, output_prefix + "/batch_output/" + "test.jsonl.out", "predict.jsonl"
)

with open("predict.jsonl", "r") as json_file:
    json_list = list(json_file)
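
Each line of the output file is a JSON object. Assuming it carries the original id and an embedding field (the exact key names depend on the model's output schema), the embeddings can be loaded into a NumPy array as follows:

import numpy as np

# Assumed output schema: one JSON object per line with "id" and "embedding" keys.
records = [json.loads(line) for line in json_list]
ids = [record["id"] for record in records]
embedding_matrix = np.array([record["embedding"] for record in records])
print(embedding_matrix.shape)  # (number of documents, embedding dimension)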

Conclusion
SageMaker JumpStart provides a straightforward way to use state-of-the-art large language foundation models for text embedding and semantic search. With the user interface or just a few lines of code, you can deploy a highly accurate text embedding model and find semantic matches across large datasets, at scale and cost-efficiently. SageMaker JumpStart removes the barriers to implementing semantic search by providing instant access to cutting-edge models like the ones benchmarked on the MTEB leaderboard. Businesses and developers can build intelligent search and recommendation systems faster.
This post demonstrated how to find semantically similar questions and answers, which could be applied to RAG use cases, recommendations and personalization, multilingual translations, and more. With continued advances in language models and the simplicity of SageMaker JumpStart, more organizations can infuse generative AI capabilities into their products. As the next step, you can try text-embedding models from SageMaker JumpStart on your own dataset to test and benchmark the results for your RAG use cases.

About the Authors
Dr. Baichuan Sun, currently serving as a Sr. AI/ML Solution Architect at AWS, focuses on generative AI and applies his knowledge in data science and machine learning to provide practical, cloud-based business solutions. With experience in management consulting and AI solution architecture, he addresses a range of complex challenges, including robotics computer vision, time series forecasting, and predictive maintenance, among others. His work is grounded in a solid background of project management, software R&D, and academic pursuits. Outside of work, Dr. Sun enjoys the balance of traveling and spending time with family and friends, reflecting a commitment to both his professional growth and personal well-being.
Hemant Singh is an Applied Scientist with experience in Amazon SageMaker JumpStart. He got his master's degree from the Courant Institute of Mathematical Sciences and his B.Tech from IIT Delhi. He has experience working on a diverse range of machine learning problems within the domains of natural language processing, computer vision, and time series analysis.
Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from the University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers at NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

How to Turn Anonymous Website Visitors into Sales [Marketing Playbook]

In the fast-paced realm of marketing, where every click holds the potential for a sale, sophisticated marketers face a formidable challenge—converting anonymous website visitors into loyal customers. 

Consider this: on average, nearly 98% of website visitors remain anonymous, navigating through product and offer pages, engaging with content, yet leaving no trace of their identity. In an era where data is king, this substantial segment represents a goldmine of untapped potential. 

As a savvy marketer, you’re keenly aware that a substantial portion of your website traffic operates in the shadows of anonymity. The challenge before you is to unravel the mystery of who these anonymous website visitors are and transform them into valued customers.

To help, we’ve crafted this comprehensive guide specifically tailored to identifying your website visitors and converting them into sales using the data and tools at your disposal.

Our aim is to equip you with not just strategic insights but also practical steps that demystify the enigma surrounding anonymous visitors and pave the way for your unparalleled success in the dynamic landscape of marketing.

Navigating and Overcoming Key Hurdles in the Marketing Landscape

Before we dive into the world of anonymous visitors, we first have to understand the key challenges impacting the marketing landscape. 

Intense Competition and Market Saturation: With a superabundance of online stores and businesses vying for consumers’ attention and wallets, it’s becoming increasingly difficult for businesses to differentiate themselves. You must find innovative ways to stand out, whether through unique branding, personalized customer experiences, or strategic partnerships. Effectively navigating this crowded landscape requires a keen understanding of the target audience and a commitment to delivering value that goes beyond just the product.

Rapid Technological Advancements: The rapid evolution of technology has transformed the way consumers interact with online platforms. Businesses are constantly challenged to keep pace with emerging trends, such as voice search, augmented reality, and artificial intelligence while at the same time trying to learn new platforms and navigate changes on existing platforms.  

Data Privacy and Security Concerns: In an era where data breaches and privacy concerns make daily headlines, tech giants are digging their heels in on privacy, making it harder and harder for marketers to reach customers. Between cookie updates from Apple and email changes from Google and Yahoo, the focus on consumer privacy is adding a new layer of complexity.

Omni-Channel Customer Experience: The modern consumer journey is often multi-faceted, involving interactions across various channels, both online and offline. Businesses face the challenge of creating a seamless omni-channel experience that ensures consistency and cohesion throughout the customer’s journey. Coordinating marketing and sales efforts across platforms, integrating data from different touchpoints, and optimizing the user experience regardless of the channel are all critical.

Adapting to Shifting Consumer Behavior and Preferences: Consumer behavior and preferences are continually evolving, driven by factors such as cultural shifts, economic changes, and global events. You have to be agile and responsive to these shifts to stay relevant. This requires a deep understanding of the target audience, ongoing market research, and a willingness to adjust strategies as needed. 

These challenges are real and they are changing how you have to think about your marketing and sales efforts. Evolve or die, right?

So where do the anonymous website visitors come in? 

Unidentified visitors represent a concealed opportunity that has been in plain sight all along, and the key lies in identifying them. Anonymous visitor identification unlocks the door to enhanced personalization, better marketing, and helps address the challenges we just explored. More importantly, it empowers you to boost sales.

Convert Website Visitors into Real Contacts!

Identify who is visiting your site with name, email and more. Get 50 contacts for free!


How to Increase Sales by Identifying Anonymous Website Visitors 

Identifying your anonymous visitors is great but if you don’t know what to do with the information, you can’t be successful.

Let’s look at how you can use website visitor data to make informed decisions and improve your marketing efforts.

1. Precision Targeting

One of the foremost advantages of utilizing a website visitor identification tool is the ability to precisely target your audience. By understanding the demographics and behaviors of your visitors, you can tailor your marketing campaigns with laser-like precision, ensuring that your message resonates with the right people at the right time. For example:

Behavioral Insights: Utilizing anonymous visitor data allows you to track user behavior, such as pages viewed or purchase intent. This data can then be used to create email or retargeting campaigns based on those behaviors. 

Demographic Segmentation: Anonymous visitor identification tools enable you to categorize visitors based on demographics. If you sell winter coats, your message to visitors in Florida is going to be different than your message to users in Maine. 

This leads to better messaging and a better customer experience. Plus, 80% of companies that use market segmentation report increased sales.

Timing and Frequency: Understanding the time of day or week when anonymous visitors are most active allows for strategic timing of marketing messages. When are customers visiting your site? When are they opening their emails? In a world where the average person sees 10,000 ads in a day, you have to find a way to stand out.

2. Enhanced Personalization 

Personalization drives growth. Companies that grow faster drive 40% more of their revenue from personalization than their slower-growing counterparts. There is real money in personalization.

Visitor identification provides a new level of personalization in your marketing efforts. 

Think about it: in a typical scenario, a person comes to your site, they browse around for a bit, and they leave.

If they clicked an ad to get there, you have data that can be used for remarketing, but if they only subscribed to your blog or filled out a form for a coupon code, you have no idea what they actually want. Even worse, what if they didn't take any action that gives you information about their intent? There is no option for personalization!

Anonymous visitor identification gives you detailed insights into individual preferences and past interactions, allowing you to create highly personalized experiences for each visitor. And with 71% of consumers expecting personalization, it’s your job to give them what they want.

3. Streamlined and Efficient Marketing

Efficiency is the key to successful marketing, and customer identification plays a pivotal role in streamlining your efforts. With a clear understanding of your audience, you can optimize your marketing channels, focus on high-potential leads, and allocate resources where they will yield the greatest return on investment. Here’s how:

Seamless Integration of Data: Identifying anonymous visitors allows you to seamlessly integrate valuable data into your marketing ecosystem. By consolidating insights on user behavior, preferences, and interactions, your marketing efforts gain a cohesive foundation, ensuring your messaging aligns with the unique attributes of each visitor and fostering a more meaningful connection.

Targeted Content Delivery: Understanding the demographics and behaviors of your anonymous visitors empowers you to tailor content. Create targeted campaigns that speak directly to the interests and needs of specific audience segments. Whether it’s curating product recommendations or crafting compelling narratives, the ability to deliver content tailored to individual preferences enhances the overall customer experience.

Automated Campaign Optimization: Automation is key to efficiency. Anonymous visitor identification enables the creation of automated workflows that respond dynamically to user interactions. From triggered emails based on specific actions to personalized offers aligned with individual preferences, automation ensures that your marketing stays agile and responsive without constant manual intervention.

Efficient Resource Allocation: By focusing on identified anonymous visitors, your marketing resources are allocated more efficiently. Rather than employing a one-size-fits-all approach, targeted campaigns reduce waste and amplify the impact of your efforts. 

Enhanced Customer Journey Mapping: Anonymous visitor identification provides the missing pieces of the puzzle in mapping the customer journey. With a clearer understanding of touchpoints, preferences, and engagement levels, you can refine your customer journey map for enhanced coherence. 

4. Warm Outreach and Relationship Building

Knowing your customers allows you to initiate warm outreach strategies. By reaching out to potential customers who have expressed interest in your products, you can build a connection and nurture relationships. This approach fosters trust and increases the chances of turning a casual visitor into a loyal customer. It leads to:

Personalized Communication Initiatives: Identifying anonymous visitors allows you to infuse a personalized touch into your communication initiatives. Leverage the data to tailor outreach messages that resonate with individual preferences. This ensures your communication not only feels relevant but is also genuinely tailored to the individual. Whether it's a personalized email campaign or targeted social media interaction, the ability to speak directly to the needs and interests of your audience fosters a sense of connection.

Strategic Follow-Ups Based on Visitor Behavior: As we mentioned earlier, the name of the game is personalization. If a visitor consistently engages with specific product categories or spends considerable time on certain pages, you can initiate targeted follow-up communications. Whether it’s providing additional information, exclusive offers, or personalized recommendations, these strategic follow-ups demonstrate attentiveness and contribute to building a rapport with the visitor.

Responsive Engagement Across Channels: We know the customer journey spans multiple channels. In fact, the average customer hits nine touchpoints! Anonymous visitor identification ensures a responsive and cohesive engagement across this spectrum. Whether a visitor interacts with your website, subscribes to newsletters, or engages on social media, the ability to recognize and consolidate these touchpoints allows for a seamless and consistent communication flow. This unified approach contributes to a more holistic and authentic relationship-building process.

Building Trust Through Consistent Interaction: Trust is key to sales and key to long-lasting customer relationships. Anonymous visitor identification facilitates consistent interaction, creating touchpoints that contribute to the gradual establishment of trust. By demonstrating a true understanding of who your customers are and consistently delivering value through personalized interactions, you lay the groundwork for a relationship built on trust and mutual benefit.

5. Expanded Retargeting Audiences

With retargeting lists shrinking by the day thanks to privacy changes by Apple and Google, marketers have to act fast. How can you grow your lists if you don’t know who is visiting your site? Identifying anonymous website visitors with contact info you own, can analyze, and sync to other systems is the key to unlocking the full potential of your retargeting efforts, allowing you to cast a wider net and rekindle connections with a broader audience. Here’s how:

Audience List Enhancement: Anonymous visitor identification gives you the names and email addresses of those visiting your website. And with a tool like Customers.ai, that list can be automatically synced to platforms like Facebook to enhance remarketing campaigns. With a list that is continually updated and growing, you can reach more customers than ever before.

Diversification of Retargeting Segments: Identifying anonymous website visitors enables the creation of diverse retargeting segments based on a range of criteria such as behavior, demographics, and engagement levels. Rather than limiting retargeting efforts to a generic audience, you can tailor segments to specific visitor attributes. For example, retargeting visitors who showed interest in a particular product category or those who spent a significant amount of time browsing your site.

Customized Ad Content for Enhanced Relevance: Understanding the preferences and behaviors of anonymous visitors allows for the creation of highly customized ad content. Tailor your retargeting ads to align with individual interests and past interactions, increasing the relevance of your messaging. Whether showcasing products left in the cart, promoting complementary items, or offering exclusive discounts, personalized retargeting content significantly enhances the chances of re-engagement.

Dynamic Retargeting Based on Visitor Behavior: Utilizing anonymous visitor data enables dynamic retargeting that adapts to individual visitor behavior. For instance, if a visitor showed interest in specific products but didn’t make a purchase, dynamic retargeting allows you to showcase those exact products in subsequent ads. This tailored approach not only reinforces the visitor’s initial interest but also increases the likelihood of conversion by presenting the most relevant offerings.

6. Pre-Cart Outreach

Traditional methods often hinge on visitors completing forms or progressing to the checkout stage before capturing their information. However, customer identification dismantles these barriers, enabling you to establish connections with potential customers who may not have filled out forms or abandoned their carts. 

This innovative approach broadens your reach and taps into a previously untapped market, allowing you to proactively engage with visitors before they even consider adding items to their cart. By understanding the preferences and behaviors of these potential customers early on, you can implement pre-cart outreach strategies that lay the foundation for a personalized and compelling shopping experience.

How to Identify Anonymous Visitors with Customers.ai

To turn anonymous website visitors into known users, you need the right tools and technologies. That’s where the Customers.ai Website Visitor ID X-Ray Pixel comes in. By placing this pixel on your site, you can begin identifying who is coming to your site. 

Step 1: To install the Website Visitor ID X-Ray Pixel, sign up (for FREE!), go to your dashboard, and navigate to My Automations. 

Step 2: Select + New Automation and get your pixel. We have easy install options for Google Tag Manager, WordPress, and Shopify, or you can install the pixel manually.

Step 3: From there, you can integrate with systems like Klaviyo, Mailchimp, Facebook, and more!

Elevate Your Marketing with Anonymous Visitor Identification & Personalized Outreach

In the competitive world of marketing, staying ahead requires innovation and adaptability. Embracing tools like Customers.ai Website Visitor ID X-Ray Pixel powered anonymous visitor identification empowers you to transform anonymous visitors into loyal customers.

By unlocking the potential of personalized, targeted marketing, you can navigate the challenges of the modern marketing landscape and drive sustained growth in sales.

Remember, in the realm of marketing, knowledge is power, and anonymous visitor identification is the key to unlocking the full potential of your website. Elevate your strategies, deepen your connections, and watch as your sales soar to new heights.

Convert Website Visitors into Real Contacts!

Identify who is visiting your site with name, email and more. Get 50 contacts for free!


Important Next Steps

See what targeted outbound marketing is all about. Capture and engage your first 50 website visitor leads with Customers.ai X-Ray website visitor identification for free.

Talk and learn about sales outreach automation with other growth enthusiasts. Join Customers.ai Island, our Facebook group of 40K marketers and entrepreneurs who are ready to support you.

Advance your marketing performance with Sales Outreach School, a free tutorial and training area for sales pros and marketers.

The post How to Turn Anonymous Website Visitors into Sales [Marketing Playbook] appeared first on Customers.ai.

Tencent AI Lab Introduces Chain-of-Noting (CoN) to Improve the Robustn …

Tencent AI Lab researchers address challenges in the reliability of retrieval-augmented language models (RALMs), which may retrieve irrelevant information, leading to misguided responses. The proposed approach, CHAIN-OF-NOTING (CON), aims to enhance the robustness and reliability of RALMs. CON-equipped RALMs exhibit substantial performance improvements across open-domain QA benchmarks, achieving notable gains in Exact Match (EM) scores and rejection rates for out-of-scope questions.

The research addresses limitations in RALMs, emphasizing noise robustness and reduced dependence on retrieved documents. The CON approach generates sequential reading notes for retrieved documents, enabling a comprehensive relevance evaluation. The case studies highlight that CON enhances the model’s understanding of document relevance, resulting in more accurate, contextually relevant responses by filtering out irrelevant or less trustworthy content.

Outperforming standard RALMs, CON achieves higher Exact Match scores and rejection rates for out-of-scope questions. It balances direct retrieval, inferential reasoning, and acknowledging knowledge gaps, resembling human information processing. CON’s implementation involves designing reading notes, data collection, and model training, offering a solution to current RALM limitations and enhancing reliability.

CON, a framework that generates sequential reading notes for retrieved documents, enhances the performance of RALMs. Trained on a LLaMa-2 7B model with ChatGPT-created training data, CON outperforms standard RALMs, especially in high-noise scenarios. It classifies reading notes into direct answers, useful context, and unknown scenarios, demonstrating a robust mechanism for assessing document relevance. Comparisons with LLaMa-2 without retrieval (w/o IR), a baseline method, showcase CON's ability to filter irrelevant content, improving response accuracy and contextual relevance.

RALMs equipped with CON demonstrate substantial improvements, achieving a remarkable +7.9 average increase in EM score for entirely noisy retrieved documents. CON exhibits a notable +10.5 improvement in rejection rates for real-time questions beyond pre-training knowledge. Evaluation metrics include EM score, F1 score, and reject rate for open-domain QA. Case studies highlight CON’s efficacy in deepening RALMs’ understanding, addressing challenges of noisy, irrelevant documents, and improving overall robustness.

The CON framework significantly enhances RALMs. By generating sequential reading notes for retrieved documents and integrating this information into the final answer, RALMs equipped with CON outperform standard RALMs, showing a notable average improvement. CON addresses the limitations of standard RALMs, fostering a deeper understanding of relevant information and improving overall performance on various open-domain QA benchmarks.

Future research may extend the CON framework’s application to diverse domains and tasks, evaluating its generalizability and efficacy in fortifying RALMs. Investigating varied retrieval strategies and document ranking methods can optimize the retrieval process, enhancing the relevance of retrieved documents. User studies should assess the usability and satisfaction of RALMs with CON in real-world scenarios, considering response quality and trustworthiness. Exploring additional external knowledge sources and combining CON with techniques like pre-training or fine-tuning can further enhance RALM performance and adaptability.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Tencent AI Lab Introduces Chain-of-Noting (CoN) to Improve the Robustness and Reliability of Retrieval-Augmented Language Models appeared first on MarkTechPost.

Stanford University Researchers Introduce FlashFFTConv: A New Artifici …

Reasoning efficiently across extended sequences is a major difficulty in machine learning. Recently, convolutions have emerged as a critical primitive for sequence modeling, supporting state-of-the-art performance in language modeling, time-series analysis, computer vision, DNA modeling, and more. Despite these impressive quality findings and additional advantages, such as improved stability and better scalability as the sequence length increases, convolutional sequence models are still significantly slower than Transformers. 

One main cause is poor hardware support. Convolutions for sequence modeling frequently employ filters as long as the input sequence, in contrast to the short filters used in classical convolutions for visual applications. The Fast Fourier Transform (FFT) convolution algorithm computes the convolution between an input u and a convolution kernel k by mapping both into frequency space, multiplying them pointwise, and mapping the result back.
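
For reference, the following is a minimal NumPy sketch of this basic FFT convolution between an input u and a kernel k. It illustrates the primitive that FlashFFTConv optimizes, not the FlashFFTConv algorithm itself.

import numpy as np

def fft_conv(u, k):
    # Map u and k to frequency space, multiply pointwise, and map back.
    n = len(u) + len(k) - 1
    y = np.fft.irfft(np.fft.rfft(u, n=n) * np.fft.rfft(k, n=n), n=n)
    return y[: len(u)]  # keep an output of the same length as the input

u = np.random.randn(4096)
k = np.random.randn(4096)  # filter as long as the input sequence
assert np.allclose(fft_conv(u, k), np.convolve(u, k)[: len(u)])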

Despite being asymptotically efficient, the FFT convolution algorithm achieves poor wall-clock utilization on contemporary accelerators. Meanwhile, progress in systems has allowed Transformers to reach the limits of current accelerators, with an end-to-end FLOP usage of over 72% when using FlashAttention-v2.

To offer longer-context capabilities, a new research from Stanford University investigates how to optimize the FFT convolution method on contemporary accelerators. The researchers believe that, as advances in systems like FlashAttention led to better models and new attention algorithms, optimizing the FFT convolution will lead to new and better algorithms, boosting the quality of convolutional sequence models. 

The FFT convolution can be easily optimized for short sequences. It is common practice to reuse kernel filters over multiple batches, which makes it possible to precompute the FFT of the filter before reusing it. Thus, the FFT convolution is parallel across batches and filters, and kernel fusion allows intermediate convolution outputs to be cached in SRAM or registers.

However, the team highlights two major bottlenecks that appear as the sequence length grows. First, FFT convolutions do not make optimal use of the specialized matrix-matrix multiply units on current accelerators.

Second, kernel fusion fails as sequences grow too long to fit in SRAM, and costly I/O operations are required. Padding operations for causality and conversions from real-valued inputs/outputs to complex-valued FFT intermediates might increase these I/O costs further.

In response, the researchers offer FlashFFTConv, a novel algorithm that employs a Monarch decomposition of the FFT to optimize the FFT convolution for extended sequences. The FFT can be effectively transferred onto hardware thanks to a Monarch decomposition of order p, which rewrites the FFT as a series of p matrix-matrix multiply operations. Higher p values incur less FLOP cost due to smaller matrices but call for more I/O to convey intermediate results. Hence, there is a tradeoff involved.

The study demonstrates how to choose p to balance FLOP cost against I/O cost on a GPU using a straightforward cost model based on sequence length. In addition to facilitating kernel fusion at greater sequence lengths, this decomposition reduces the amount of the sequence that must be maintained in SRAM. Therefore, FlashFFTConv can handle sequences anywhere from 256 to 4 million elements long. By using a real-valued FFT algorithm and skipping parts of the matrix-multiply operations when the input is zero-padded, FlashFFTConv can reduce the length of the FFT operation by as much as half.

Last but not least, the matrix view of the FFT convolution provides a simple interface for implementing two architectural modifications: partial convolutions, which learn with a convolution kernel that is shorter than the input sequence, and frequency-sparse convolutions, which zero out sections of the kernel in frequency space. Both approaches can be implemented simply by omitting sections of the matrix decomposition, lowering the memory footprint and wall-clock runtime, and can be thought of as convolutional parallels of sparse/approximate attention in Transformers.
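
To make the frequency-sparse idea concrete, here is a didactic NumPy sketch that zeroes out the upper portion of the kernel's spectrum before the pointwise multiply. The real FlashFFTConv realizes the savings by skipping blocks of its matrix decomposition rather than by literally multiplying by zeros.

import numpy as np

def freq_sparse_conv(u, k, keep_fraction=0.25):
    # Keep only the lowest keep_fraction of the kernel's frequencies (sketch only).
    n = len(u) + len(k) - 1
    U = np.fft.rfft(u, n=n)
    K = np.fft.rfft(k, n=n)
    K[int(len(K) * keep_fraction):] = 0.0  # frequency-sparse kernel
    return np.fft.irfft(U * K, n=n)[: len(u)]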

The researchers demonstrate that FlashFFTConv accelerates the FFT convolution, resulting in better quality, more efficient, and longer sequence models. 

FlashFFTConv improves the quality of convolutional sequence models via better efficiency: for the same compute budget, FlashFFTConv allows Hyena-GPT-s to achieve 2.3 points better perplexity and allows M2-BERT-base to achieve up to a 3.3 point higher average GLUE score, a gain in performance equivalent to doubling the parameters of the model.

FlashFFTConv improves the efficiency of convolutions by up to 7.93× and delivers up to 5.60× memory savings compared to PyTorch, and this efficiency holds over four orders of magnitude in sequence length. FlashFFTConv is faster in wall-clock time than FlashAttention-v2 end-to-end for sequence lengths of 2K and longer due to lower FLOP costs, and achieves up to 62.3% end-to-end FLOP usage, which is only 10% less than FlashAttention-v2.

Longer-sequence models are possible with FlashFFTConv. FlashFFTConv has produced the only model capable of completing the Long Range Arena benchmark's Path-512 task (sequence length 256K) for high-resolution image classification. FlashFFTConv is the first model to embed the longest human genes (up to 2.3M base pairs) at single-nucleotide resolution; it extends HyenaDNA to a 4M sequence length via partial convolutions.

The team hopes that FlashFFTConv will pave the way for wider use of convolutional sequence models and that the lessons learned will lead to more resource-efficient computer architectures.

Check out the Paper, Github, and Blog Article. All credit for this research goes to the researchers of this project.

The post Stanford University Researchers Introduce FlashFFTConv: A New Artificial Intelligence System for Optimizing FFT Convolutions for Long Sequences appeared first on MarkTechPost.

A New AI Research Releases SWIM-IR: A Large-Scale Synthetic Multilingu …

Researchers from Google Research, Google DeepMind, and the University of Waterloo introduce SWIM-IR, a synthetic retrieval training dataset encompassing 33 languages, addressing the challenge of limited human-labeled training pairs in multilingual retrieval. Leveraging the SAP (summarize-then-ask prompting) method, SWIM-IR is constructed to enable synthetic fine-tuning of multilingual dense retrieval models without human supervision. SWIM-X models, trained on SWIM-IR, demonstrate competitiveness with human-supervised dense retrieval models across various benchmarks, including XOR-Retrieve, XTREME-UP, and MIRACL.

The study addresses limitations in multilingual dense retrieval models. Existing multilingual retrieval models face challenges due to scarce or uneven training data. SWIM-IR employs SAP to assist LLMs in generating informative queries in the target language. SWIM-X models, trained on SWIM-IR, exhibit competitive performance with human-supervised models across various benchmarks, highlighting the potential of synthetic datasets as a cost-effective alternative to human-labeled training data for multilingual dense retrieval models.

The research addresses the limited success of multilingual dense retrieval models, attributing it to insufficient supervised training data for non-English languages. This synthetic dataset enables fine-tuning of multilingual dense retrieval models, evaluated on benchmarks like XOR-Retrieve, XTREME-UP, and MIRACL. Results demonstrate SWIM-IR’s efficacy in substituting expensive human-labeled training data, establishing competitive performance for multilingual dense retrieval models against human-supervised counterparts.

SWIM-IR, a synthetic retrieval training dataset spanning 33 languages, was generated through the SAP technique. Employing SWIM-IR, the study explores the synthetic fine-tuning of multilingual dense retrieval models, adapting the Dense Passage Retrieval (DPR) model. Utilizing the T5X Retrieval framework, it replicates mContriever and mDPR zero-shot baselines by initializing from a multilingual T5-base checkpoint and fine-tuning on the English MS MARCO dataset. Pretraining on the mC4 dataset and employing contrastive loss for in-batch negatives, the researchers use the PaLM 2 Small model for cross-language query generation.

Fine-tuned on synthetic training data from SWIM-IR, SWIM-X models exhibit competitive performance in multilingual dense retrieval tasks. SWIM-X (7M) outperforms mContriever-X, the best fine-tuned model, by 7.1 points on Recall@5kt in the XOR-Retrieve benchmark. Even the limited-budget baseline, SWIM-X (500k), surpasses mContriever-X by 3.6 points. SWIM-X (180K) competes well on the MIRACL benchmark, outperforming the best zero-shot model by 6.6 points on nDCG@10, although it falls short of mContriever-X, which benefits from human-labeled training pairs with hard negatives. The synthetic baselines SWIM-X (120K) and SWIM-X (120K) MT show promising results against cross-lingual supervised baselines, outperforming existing models in terms of Recall@5kt. The study emphasizes the importance of optimized training techniques, including better sampling of hard negatives with SWIM-IR, to further enhance the performance of synthetic models.

The SWIM-IR dataset employed in the study exhibits limitations, including decontextualization, code-switching, passage quality and length, and factual inconsistencies in LLM generation. The study acknowledges that LLMs may generate text lacking sufficient grounding to knowledge sources, posing risks of misinformation and hallucination in generated outputs. While these limitations may impact the quality and accuracy of generated queries, they do not directly affect the downstream multilingual retrieval task. However, it does not extensively discuss the methods’ limitations, such as the SAP approach or the fine-tuning process.

SWIM-IR is a synthetic multilingual retrieval training dataset created using the SAP approach to generate informative queries in multiple languages. With 28 million query-passage training pairs across 33 languages, SWIM-IR facilitates fine-tuning multilingual dense retrieval models without requiring human-labeled training data. The resulting SWIM-X models exhibit competitive performance in multilingual retrieval tasks, outperforming existing models on recall and mean reciprocal rank on both cross-lingual and monolingual benchmarks. This underscores SWIM-IR's potential as a cost-effective substitute for expensive human-labeled retrieval training data, enabling the development of robust multilingual dense retrieval models.

Check out the Paper and Github. All credit for this research goes to the researchers of this project.

The post A New AI Research Releases SWIM-IR: A Large-Scale Synthetic Multilingual Retrieval Dataset with 28 Million Training Pairs over 33 Languages appeared first on MarkTechPost.