Alibaba Speech Lab Releases ClearerVoice-Studio: An Open-Sourced Voice …

Clear communication can be surprisingly difficult in today’s audio environments. Background noise, overlapping conversations, and the mix of audio and video signals often create challenges that disrupt clarity and understanding. These issues impact everything from personal calls to professional meetings and even content production. Despite improvements in audio technology, most existing solutions struggle to consistently provide high-quality results in complex scenarios. This has led to an increasing need for a framework that not only handles these challenges but also adapts to the demands of modern applications like virtual assistants, video conferencing, and creative media production.

To address these challenges, Alibaba Speech Lab has introduced ClearerVoice-Studio, a comprehensive voice processing framework. It brings together advanced features such as speech enhancement, speech separation, and audio-video speaker extraction. These capabilities work in tandem to clean up noisy audio, separate individual voices from complex soundscapes, and isolate target speakers by combining audio and visual data.

Developed by Tongyi Lab, ClearerVoice-Studio aims to support a wide range of applications. Whether it’s improving daily communication, enhancing professional audio workflows, or advancing research in voice technology, this framework offers a robust solution. The tools are accessible through platforms like GitHub and Hugging Face, inviting developers and researchers to explore its potential.

Technical Highlights

ClearerVoice-Studio incorporates several innovative models designed to tackle specific voice processing tasks. The FRCRN model is one of its standout components, recognized for its exceptional ability to enhance speech by removing background noise while preserving the natural quality of the audio. This model’s success was validated when it earned second place in the IEEE/INTERSPEECH 2022 Deep Noise Suppression (DNS) Challenge.

Another key feature is the MossFormer series models, which excel at separating individual voices from complex audio mixtures. These models have surpassed previous benchmarks, such as SepFormer, and have extended their utility to include speech enhancement and target speaker extraction. This versatility makes them particularly effective in diverse scenarios.

For applications requiring high fidelity, ClearerVoice-Studio offers a 48kHz speech enhancement model based on MossFormer2. This model ensures minimal distortion while effectively suppressing noise, delivering clear and natural sound even in challenging conditions. The framework also provides fine-tuning tools, enabling users to customize models for their specific needs. Additionally, its integration of audio-video modeling allows precise target speaker extraction, a critical feature for multi-speaker environments.

ClearerVoice-Studio has demonstrated strong results across benchmarks and real-world applications. The FRCRN model’s recognition in the IEEE/INTERSPEECH DNS Challenge highlights its capability to enhance speech clarity and suppress noise effectively. Similarly, the MossFormer models have proven their value by handling overlapping audio signals with precision.

The 48kHz speech enhancement model stands out for its ability to maintain audio fidelity while reducing noise. This ensures that speakers’ voices retain their natural tone, even after processing. Users can explore these capabilities through ClearerVoice-Studio’s open platforms, which offer tools for experimentation and deployment in varied contexts. This flexibility makes the framework suitable for tasks like professional audio editing, real-time communication, and AI-driven applications that require top-tier voice processing.

Conclusion

ClearerVoice-Studio marks an important step forward in voice processing technology. By seamlessly integrating speech enhancement, separation, and audio-video speaker extraction, Alibaba Speech Lab has created a framework that addresses a wide array of audio challenges. Its thoughtful design and proven performance make it a valuable resource for developers, researchers, and professionals alike.

As the demand for high-quality audio continues to grow, ClearerVoice-Studio provides an efficient and adaptable solution. With its ability to tackle complex audio environments and deliver reliable results, it sets a promising direction for the future of voice technology.

Check out the GitHub Page and Demo on Hugging Face. All credit for this research goes to the researchers of this project.

The post Alibaba Speech Lab Releases ClearerVoice-Studio: An Open-Sourced Voice Processing Framework Supporting Speech Enhancement, Separation, and Target Speaker Extraction appeared first on MarkTechPost.

NVIDIA AI Introduces NVILA: A Family of Open Visual Language Models VL …

Visual language models (VLMs) have come a long way in integrating visual and textual data. Yet, they come with significant challenges. Many of today’s VLMs demand substantial resources for training, fine-tuning, and deployment. For instance, training a 7-billion-parameter model can take over 400 GPU days, which makes it inaccessible to many researchers. Fine-tuning is equally demanding, often requiring over 64GB of GPU memory, far exceeding what consumer hardware can handle. Deploying these models in environments with limited computational resources, such as edge devices or robotics, is another hurdle. These limitations highlight the urgent need for VLMs that are not only powerful but also efficient and scalable.

To tackle these challenges, NVIDIA has introduced NVILA, a family of open VLMs designed with efficiency and accuracy in mind. Building on the VILA model, NVILA adopts a “scale-then-compress” approach. This method increases spatial and temporal resolutions to preserve details in visual inputs and then compresses them into fewer, denser tokens. This combination allows NVILA to handle high-resolution images and long video sequences effectively.

NVILA’s design optimizes every stage of the model lifecycle. It reduces training costs by 4.5×, cuts fine-tuning memory requirements by 3.4×, and improves inference speeds by 1.6 to 2.8× compared to other VLMs. Importantly, these gains do not come at the expense of accuracy: NVILA performs on par with or better than leading VLMs across many benchmarks, excelling in visual question answering, video understanding, and document processing tasks. NVIDIA also plans to release NVILA’s code and models, fostering greater accessibility and reproducibility.

Technical Details

At the heart of NVILA’s efficiency is its “scale-then-compress” strategy. Spatial scaling increases image resolutions to dimensions like 896×896 pixels, compared to the usual 448×448. To mitigate the computational cost of scaling, NVILA uses token compression to retain essential information while reducing the number of tokens. For video inputs, the model processes more frames by applying temporal compression, balancing accuracy and computational efficiency.
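To make the idea concrete, the sketch below shows one common form of spatial token compression: each 2×2 neighborhood of visual tokens is merged into a single, denser token via concatenation and a linear projection. This is an illustrative approximation rather than NVIDIA's exact implementation; the grid size, hidden dimension, and projection are assumptions.

import torch
import torch.nn as nn

class SpatialTokenCompressor(nn.Module):
    """Merge each 2x2 block of visual tokens into one denser token (illustrative sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        # Four neighboring tokens are concatenated, then projected back to `dim`.
        self.proj = nn.Linear(4 * dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, h, w, d = tokens.shape  # token grid produced by the vision encoder
        tokens = tokens.reshape(b, h // 2, 2, w // 2, 2, d)
        tokens = tokens.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // 2) * (w // 2), 4 * d)
        return self.proj(tokens)   # 4x fewer tokens, each carrying more information

# Example: an 896x896 image with 14x14 patches yields a 64x64 token grid.
grid = torch.randn(1, 64, 64, 1024)
compressed = SpatialTokenCompressor(1024)(grid)
print(compressed.shape)  # torch.Size([1, 1024, 1024]): 1,024 tokens instead of 4,096

Scaling first and compressing second is what lets the model see fine detail without multiplying the number of tokens the language model has to attend over.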

NVILA incorporates further innovations to streamline training and fine-tuning. Techniques like FP8 mixed precision and dataset pruning accelerate training and lower memory usage. Adaptive learning rates and parameter-efficient fine-tuning ensure the model can handle domain-specific tasks without excessive resource demands. During deployment, NVILA uses advanced quantization—W8A8 for the vision tower and W4A16 for language components—to speed up inference while maintaining performance.
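As a rough picture of what W8A8/W4A16-style weight quantization involves, the snippet below quantizes a weight matrix to 8-bit and 4-bit integers with a single symmetric scale and reports the reconstruction error. It is a simplified sketch for intuition only; NVILA's actual pipeline involves per-channel scales, activation quantization, and calibration.

import numpy as np

def quantize_weights(w: np.ndarray, bits: int):
    """Symmetric per-tensor quantization: w is approximated by scale * q with integer q."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for int8, 7 for int4
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024)).astype(np.float32)

for bits in (8, 4):                             # 8-bit weights (vision side) vs 4-bit weights (language side)
    q, scale = quantize_weights(w, bits)
    err = np.abs(w - q * scale).mean()
    print(f"{bits}-bit weights: mean abs reconstruction error = {err:.4f}")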

Performance Highlights

NVILA’s value lies in making advanced VLMs more accessible while addressing the need for efficient AI systems. Some key metrics include:

Training Efficiency: NVILA reduces GPU training time by 4.5× compared to leading models, making it more viable for institutions with limited resources.

Fine-Tuning Memory Usage: Memory requirements drop by 3.4×, allowing fine-tuning on standard hardware.

Inference Performance: Decoding latency improves by up to 2.8×, supporting real-time applications.

Benchmark Results: NVILA achieves up to 30% better accuracy on tasks like DocVQA and TextVQA. Its long-context capabilities outperform proprietary models like GPT-4o and Gemini 1.5.

NVILA’s potential spans diverse fields, including robotics and healthcare. For example, its temporal localization capabilities make it ideal for robotic navigation, while its NVILA-M3 framework integrates expert models to improve diagnostic accuracy in medical imaging.

Conclusion

NVILA represents a meaningful step forward in the development of visual language models. By rethinking architecture and optimizing the entire lifecycle, NVIDIA has created a model that balances efficiency and accuracy. NVILA addresses the limitations of traditional VLMs and expands their applicability to resource-constrained and specialized environments. With NVIDIA’s commitment to open access, NVILA is set to inspire further research and innovation in AI.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

The post NVIDIA AI Introduces NVILA: A Family of Open Visual Language Models VLMs Designed to Optimize both Efficiency and Accuracy appeared first on MarkTechPost.

Advancing Large Multimodal Models: DocHaystack, InfoHaystack, and the …

Large multimodal models (LMMs) have made significant strides in vision-language understanding but still struggle to reason over large-scale image collections, which limits real-world applications such as visual search and querying extensive datasets like personal photo libraries. Existing benchmarks for multi-image question answering are constrained, typically involving up to 30 images per question, and fail to capture the complexities of large-scale retrieval. To overcome these limitations, new benchmarks like DocHaystack and InfoHaystack have been introduced, requiring models to retrieve and reason across collections of up to 1,000 documents. This shift presents new challenges and significantly expands the scope of visual question answering and retrieval tasks.

Retrieval-augmented generation (RAG) frameworks enhance LMMs by integrating retrieval systems with generative models, enabling them to process extensive image-text datasets effectively. While RAG approaches have been widely explored in text-based tasks, their application in vision-language contexts has gained momentum with models like MuRAG, RetVQA, and MIRAGE. These frameworks utilize advanced retrieval methods, such as relevance encoders and CLIP-based training, to filter and process large image collections. Building on these advancements, the proposed V-RAG framework leverages multiple vision encoders and introduces a question-document relevance module, offering superior performance on the DocHaystack and InfoHaystack benchmarks. This sets a new standard for large-scale visual retrieval and reasoning, addressing critical gaps in existing LMM capabilities.

Researchers from KAUST, the University of Sydney, and IHPC, A*STAR, introduced two benchmarks, DocHaystack and InfoHaystack, to evaluate LMMs on large-scale visual document retrieval and reasoning tasks. These benchmarks simulate real-world scenarios by requiring models to process up to 1,000 documents per query, addressing the limitations of smaller datasets. They also proposed V-RAG, a vision-centric retrieval-augmented generation framework that combines specialized vision encoders and a relevance assessment module. V-RAG achieved a 9% and 11% improvement in Recall@1 on the DocHaystack-1000 and InfoHaystack-1000 benchmarks, significantly advancing retrieval and reasoning capabilities for LMMs.

To improve document retrieval and reasoning, the DocHaystack and InfoHaystack benchmarks ensure each question yields a unique, document-specific answer. These benchmarks address ambiguity using a three-step curation pipeline: filtering general questions with an LLM, manual review for specificity, and removing questions answerable through general knowledge. The Vision-centric Retrieval-Augmented Generation (V-RAG) framework enhances retrieval from extensive datasets using a vision encoder ensemble and an LLM-based filtering module. Relevant documents are ranked and refined to focus on specific subsets. Questions and selected documents are then processed by LLMs for accurate answers, emphasizing vision-based understanding.
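The retrieval portion of this pipeline can be pictured as a coarse-to-fine loop: an ensemble of encoders scores every document against the question, the top candidates are re-checked by a relevance filter, and only the survivors are passed to the LMM. The toy code below uses random embeddings and a stubbed filter purely to show the control flow; the encoder names, dimensions, and filter logic are assumptions, not the authors' implementation.

import numpy as np

rng = np.random.default_rng(42)

def cosine(query, docs):
    """Cosine similarity between one query embedding and a matrix of document embeddings."""
    return (query @ docs.T) / (np.linalg.norm(query) * np.linalg.norm(docs, axis=-1))

# Stand-ins for embeddings from an ensemble of vision encoders (e.g., CLIP-like models).
num_docs, dim = 1000, 512
doc_embs = {"encoder_a": rng.normal(size=(num_docs, dim)),
            "encoder_b": rng.normal(size=(num_docs, dim))}
query_embs = {name: rng.normal(size=dim) for name in doc_embs}

# Stage 1: coarse ranking, averaging similarity scores across the encoder ensemble.
scores = np.mean([cosine(query_embs[k], doc_embs[k]) for k in doc_embs], axis=0)
top_k = np.argsort(-scores)[:20]

# Stage 2: an LLM-based relevance module would re-check each candidate; stubbed here.
def judged_relevant(doc_id: int) -> bool:
    return scores[doc_id] > scores[top_k].mean()   # placeholder for the real filtering module

selected = [int(d) for d in top_k if judged_relevant(d)]
print(f"{len(selected)} of {len(top_k)} candidates survive filtering and go to the LMM.")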

The experiments section details the training setup, metrics, baselines, and results for evaluating the V-RAG framework. Metrics include Recall@1, @3, and @5 for document retrieval and a GPT-4o-mini-based model evaluation for VQA tasks. V-RAG outperforms baselines like BM25, CLIP, and OpenCLIP across DocHaystack and InfoHaystack benchmarks, achieving superior recall and accuracy scores. Fine-tuning with curated distractor images enhances VQA robustness. Ablation studies reveal the importance of combining multiple encoders and the VLM-filter module, significantly improving retrieval accuracy. V-RAG’s top performance across challenging benchmarks highlights its effectiveness in large-scale multimodal document understanding and retrieval tasks.
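For reference, Recall@k in this setting, where each question has exactly one relevant document, simply checks whether the ground-truth document appears among the top-k retrieved candidates. A minimal implementation (with illustrative names) looks like this:

def recall_at_k(ranked_doc_ids, gold_doc_id, k):
    """1.0 if the single relevant document appears in the top-k results, else 0.0."""
    return float(gold_doc_id in ranked_doc_ids[:k])

def mean_recall_at_k(all_rankings, all_gold, k):
    """Average Recall@k over every question in the benchmark."""
    return sum(recall_at_k(r, g, k) for r, g in zip(all_rankings, all_gold)) / len(all_gold)

print(recall_at_k(["doc_17", "doc_4", "doc_9"], "doc_4", k=1))  # 0.0
print(recall_at_k(["doc_17", "doc_4", "doc_9"], "doc_4", k=3))  # 1.0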

In conclusion, the study introduces DocHaystack and InfoHaystack, benchmarks designed to assess LMMs in large-scale document retrieval and reasoning tasks. Current benchmarks for multi-image question-answering are limited to small datasets, failing to reflect real-world complexities. The proposed V-RAG framework integrates multiple vision encoders and a relevance filtering module to address this, enhancing retrieval precision and reasoning capabilities. V-RAG outperforms baseline models, achieving up to 11% higher Recall@1 on the DocHaystack-1000 and InfoHaystack-1000 benchmarks. By enabling efficient processing of thousands of images, V-RAG significantly improves LMM performance in large-scale image retrieval and complex reasoning scenarios.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Advancing Large Multimodal Models: DocHaystack, InfoHaystack, and the Vision-Centric Retrieval-Augmented Generation Framework appeared first on MarkTechPost.

Google DeepMind’s Patent Transforming Protein Design Through Advance …

Protein design is crucial in biotechnology and pharmaceutical sciences. Google DeepMind, with its patent, WO2024240774A1, unveils a cutting-edge system that harnesses diffusion models operating on full atom representations. This innovative framework redefines the approach to protein design, achieving unprecedented precision and efficiency.

DeepMind’s system is a breakthrough in computational biology, combining advanced neural networks with a diffusion-based methodology to deliver a comprehensive solution for atomic-level protein design. Earlier methods often rely on distinct steps for structure prediction and sequence optimization, leading to increased complexity and computational burden. In contrast, this patent describes an integrated approach where structure and sequence predictions are unified into a single forward pass, streamlining the entire process and setting a new benchmark for the field.


This patent offers precise atomic-level representation control, iterative refinement via advanced denoising processes, and conditional design frameworks for specific functional and structural requirements. These features ensure the system’s relevance across various applications, including drug discovery, synthetic biology, and enzyme engineering.

Core Innovations of this patent are as follows:

Full Atomic Representation Management: The model introduces a sophisticated framework for managing atomic-level data. By employing “throw-away spatial positions,” the system efficiently controls atoms within each protein residue. This approach eliminates the complexity associated with traditional phasing mechanisms and enables precise atomic-level control, significantly improving the efficiency of the design process.

Unified Structure-Sequence Prediction: Unlike traditional systems that require separate processes for structure and sequence prediction, this model integrates both in a single forward pass. The result is a streamlined prediction mechanism that enhances computational efficiency and simplifies implementation.

Conditional Design Framework: The system incorporates conditional denoising processes that rely on structural information from target molecules. This enables the design of proteins with specific functional and binding properties, paving the way for custom-tailored protein development.

Advanced Denoising Process: An iterative refinement process ensures high-quality protein designs. The denoising mechanism integrates throw-away position management, enabling dynamic optimization and maintaining computational efficiency.

The patented system consists of three main components:

The diffusion model system

The atomic control framework 

The integration system

The diffusion model system employs neural network-based denoising techniques and dynamic spatial optimization, ensuring a seamless integration of structure and sequence prediction. The atomic control framework provides a robust mechanism for managing atomic representations, ensuring that only relevant atomic data is considered during design. The integration system enables the conditioning of protein designs based on specific target molecule data, optimizing resources and ensuring quality assurance.


The operational workflow begins with generating noisy molecular data, incorporating target molecule information, and initializing spatial positions with throw-away positions for unused atomic data. The denoising process follows, systematically reducing noise and dynamically optimizing atomic positions through iterative refinement. This process integrates joint structure-sequence predictions, reducing computational redundancies. Finally, the refined protein structure is generated with optimized atomic-level precision, sequence accuracy is verified, and quality checks validate the structural stability of the final design.
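Read at a high level, this workflow resembles a standard diffusion sampling loop over a fixed-size atom array, with unused slots masked out as throw-away positions. The sketch below is only one interpretation of the patent’s description; the network, update rule, schedules, and data layout are all assumptions, and the denoiser is a trivial placeholder.

import numpy as np

rng = np.random.default_rng(0)

MAX_ATOMS = 14          # fixed atom slots per residue; unused slots are "throw-away" positions
NUM_RESIDUES = 120
STEPS = 50

def denoise_step(coords, seq_logits, atom_mask, target_features, t):
    """Placeholder for the joint structure-and-sequence denoiser (a neural network in the patent)."""
    coords = coords - 0.02 * coords * atom_mask[..., None]   # toy update; the real model predicts denoised atoms
    return coords, seq_logits

# Noisy initial state: atomic coordinates, residue-type logits, and a mask of real vs. throw-away atoms.
coords = rng.normal(size=(NUM_RESIDUES, MAX_ATOMS, 3))
seq_logits = rng.normal(size=(NUM_RESIDUES, 20))
atom_mask = (rng.random((NUM_RESIDUES, MAX_ATOMS)) > 0.3).astype(float)
target_features = None   # optional conditioning on a target molecule

for t in reversed(range(STEPS)):
    coords, seq_logits = denoise_step(coords, seq_logits, atom_mask, target_features, t)

sequence = seq_logits.argmax(axis=-1)   # structure and sequence emerge from the same refinement loop
print(coords.shape, sequence.shape)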

This system offers numerous advantages, such as enhanced efficiency by eliminating separate models and computational redundancies, superior performance with advanced denoising processes, and practical scalability for diverse applications. The throw-away position framework significantly reduces complexity while ensuring the system remains efficient and precise.

The analogy of a master LEGO builder effectively illustrates the functionality of this system. The system performs unified structure and sequence predictions like a builder who visualizes and assembles a structure simultaneously. It organizes unused atomic positions like unused LEGO pieces, progressively refining the structure through advanced denoising. Incorporating specific pieces mirrors how the system integrates target molecule requirements into the design process.

In conclusion, by addressing long-standing challenges such as atomic-level precision and computational inefficiency, this system opens new avenues in biotechnology. Its ability to unify structure and sequence prediction, optimize atomic management, and integrate functional requirements into the design process positions it as a cornerstone for research and applications.

Check out the Patent Details and PDF Copy. All credit for this research goes to the researchers of this project.

The post Google DeepMind’s Patent Transforming Protein Design Through Advanced Atomic-Level Precision and AI Integration appeared first on MarkTechPost.

Mistral-NeMo-Instruct-2407 and Mistral-NeMo-Base-2407 are now availabl …

Today, we are excited to announce that Mistral-NeMo-Base-2407 and Mistral-NeMo-Instruct-2407—twelve-billion-parameter large language models from Mistral AI that excel at text generation—are available for customers through Amazon SageMaker JumpStart. You can try these models with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models that can be deployed with one click for running inference. In this post, we walk through how to discover, deploy, and use the Mistral-NeMo-Instruct-2407 and Mistral-NeMo-Base-2407 models for a variety of real-world use cases.
Mistral-NeMo-Instruct-2407 and Mistral-NeMo-Base-2407 overview
Mistral NeMo, a powerful 12B parameter model developed through collaboration between Mistral AI and NVIDIA and released under the Apache 2.0 license, is now available on SageMaker JumpStart. This model represents a significant advancement in multilingual AI capabilities and accessibility.
Key features and capabilities
Mistral NeMo features a 128k token context window, enabling processing of extensive long-form content. The model demonstrates strong performance in reasoning, world knowledge, and coding accuracy. Both pre-trained base and instruction-tuned checkpoints are available under the Apache 2.0 license, making it accessible for researchers and enterprises. The model’s quantization-aware training facilitates optimal FP8 inference performance without compromising quality.
Multilingual support
Mistral NeMo is designed for global applications, with strong performance across multiple languages including English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi. This multilingual capability, combined with built-in function calling and an extensive context window, helps make advanced AI more accessible across diverse linguistic and cultural landscapes.
Tekken: Advanced tokenization
The model uses Tekken, an innovative tokenizer based on tiktoken. Trained on over 100 languages, Tekken offers improved compression efficiency for natural language text and source code.
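A quick way to see this effect yourself is to compare token counts between tokenizers on the same text. The snippet below assumes the Mistral NeMo tokenizer is published on Hugging Face under the repository id shown (the model is gated, so you may need to accept the license and log in first); the comparison baseline is arbitrary.

from transformers import AutoTokenizer

# Repository id is an assumption; gated repos require `huggingface-cli login` and accepting the terms.
tekken = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
baseline = AutoTokenizer.from_pretrained("gpt2")   # any reference tokenizer for comparison

samples = {
    "prose": "Clear communication can be surprisingly difficult in noisy environments.",
    "code": "def insert(self, key):\n    if not self.root:\n        self.root = Node(key)",
}

for name, text in samples.items():
    print(f"{name}: {len(tekken.encode(text))} Tekken tokens vs {len(baseline.encode(text))} GPT-2 tokens")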
SageMaker JumpStart overview
SageMaker JumpStart is a fully managed service that offers state-of-the-art foundation models for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is the Model Hub, which offers a vast catalog of pre-trained models, such as DBRX, for a variety of tasks.
You can now discover and deploy both Mistral NeMo models with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and machine learning operations (MLOps) controls with Amazon SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your virtual private cloud (VPC) controls, helping to support data security.
Prerequisites
To try out both NeMo models in SageMaker JumpStart, you will need the following prerequisites:

An AWS account that will contain all your AWS resources.
An AWS Identity and Access Management (IAM) role to access SageMaker. To learn more about how IAM works with SageMaker, see Identity and Access Management for Amazon SageMaker.
Access to Amazon SageMaker Studio, a SageMaker notebook instance, or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
Access to accelerated instances (GPUs) for hosting the model.
This model requires an ml.g6.12xlarge instance. SageMaker JumpStart provides a simplified way to access and deploy over 100 different open source and third-party foundation models. In order to launch an endpoint to host Mistral NeMo from SageMaker JumpStart, you may need to request a service quota increase to access an ml.g6.12xlarge instance for endpoint usage. You can request service quota increases through the console, AWS Command Line Interface (AWS CLI), or API to allow access to those additional resources.
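If you prefer to script the quota check and request instead of using the console, a boto3 sketch like the following can work. The quota code shown is a placeholder that you would look up with list_service_quotas first, and any increase is subject to AWS approval.

import boto3

quotas = boto3.client("service-quotas")

# Look up the quota code for the g6.12xlarge endpoint quota (the exact quota name may differ).
for page in quotas.get_paginator("list_service_quotas").paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if "g6.12xlarge" in quota["QuotaName"] and "endpoint" in quota["QuotaName"].lower():
            print(quota["QuotaName"], quota["QuotaCode"], quota["Value"])

# Request an increase once you know the quota code (placeholder value shown).
quotas.request_service_quota_increase(
    ServiceCode="sagemaker",
    QuotaCode="L-XXXXXXXX",   # replace with the code printed above
    DesiredValue=1,
)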

Discover Mistral NeMo models in SageMaker JumpStart
You can access NeMo models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.
SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, see Amazon SageMaker Studio.
In SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane.

Then choose HuggingFace.

From the SageMaker JumpStart landing page, you can search for NeMo in the search box. The search results will list Mistral NeMo Instruct and Mistral NeMo Base.
You can choose the model card to view details about the model such as license, data used to train, and how to use the model. You will also find the Deploy button to deploy the model and create an endpoint.

Deploy the model in SageMaker JumpStart
Deployment starts when you choose the Deploy button. After deployment finishes, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK. When you select the option to use the SDK, you will see example code that you can use in the notebook editor of your choice in SageMaker Studio.
Deploy the model with the SageMaker Python SDK
To deploy using the SDK, we start by selecting the Mistral NeMo Base model, specified by the model_id with the value huggingface-llm-mistral-nemo-base-2407. You can deploy your choice of the selected models on SageMaker with the following code. Similarly, you can deploy NeMo Instruct using its own model ID.

from sagemaker.jumpstart.model import JumpStartModel

accept_eula = True

model = JumpStartModel(model_id="huggingface-llm-mistral-nemo-base-2407")
predictor = model.deploy(accept_eula=accept_eula)

This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. The EULA value must be explicitly set to True to accept the end-user license agreement (EULA). Also make sure that your account-level service quota allows one or more ml.g6.12xlarge instances for endpoint usage. You can follow the instructions in AWS service quotas to request a service quota increase. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
    "messages": [
        {
            "role": "user",
            "content": "Hello"
        }
    ],
    "max_tokens": 1024,
    "temperature": 0.3,
    "top_p": 0.9,
}

response = predictor.predict(payload)['choices'][0]['message']['content'].strip()
print(response)

An important thing to note here is that we’re using the djl-lmi v12 inference container, so we’re following the large model inference chat completions API schema when sending a payload to both Mistral-NeMo-Base-2407 and Mistral-NeMo-Instruct-2407.
Mistral-NeMo-Base-2407
You can interact with the Mistral-NeMo-Base-2407 model like other standard text generation models, where the model processes an input sequence and outputs predicted next words in the sequence. In this section, we provide some example prompts and sample output. Keep in mind that the base model is not instruction fine-tuned.
Text completion
Tasks involving predicting the next token or filling in missing tokens in a sequence:

payload = {
    "messages": [
        {
            "role": "user",
            "content": "The capital of France is ___."
        }
    ],
    "max_tokens": 10,
    "temperature": 0.3,
    "top_p": 0.9,
}

response = predictor.predict(payload)['choices'][0]['message']['content'].strip()
print(response)

The following is the output:

Paris
The capital of France is Paris.

Mistral NeMo Instruct
The Mistral-NeMo-Instruct-2407 model is a quick demonstration that the base model can be fine-tuned to achieve compelling performance. You can follow the steps provided to deploy the model and use the model_id value of huggingface-llm-mistral-nemo-instruct-2407 instead.
The instruction-tuned NeMo model can be tested with the following tasks:
Code generation
Mistral NeMo Instruct demonstrates benchmarked strengths for coding tasks. Mistral states that their Tekken tokenizer for NeMo is approximately 30% more efficient at compressing source code. For example, see the following code:

payload = {
    "messages": [
        {
            "role": "user",
            "content": """Create a Binary Search Tree class with methods for insertion, searching, and in-order traversal."""
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.3,
    "top_p": 0.9,
}

# Call the endpoint once and reuse the result so the token usage matches the printed response.
result = predictor.predict(payload)
text_response = result['choices'][0]['message']['content'].strip()
tokens = result['usage']
print(text_response)
print(tokens)

The following is the output:

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

class BinarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, key):
        if not self.root:
            self.root = Node(key)
        else:
            self._insert(self.root, key)

    def _insert(self, node, key):
        if key < node.key:
            if node.left:
                self._insert(node.left, key)
            else:
                node.left = Node(key)
        elif key > node.key:
            if node.right:
                self._insert(node.right, key)
            else:
                node.right = Node(key)

    def search(self, key):
        return self._search(self.root, key)

    def _search(self, node, key):
        if not node or node.key == key:
            return node

        if key < node.key:
            return self._search(node.left, key)
        else:
            return self._search(node.right, key)

    def inorder_traversal(self):
        self._inorder_traversal(self.root)
        print()

    def _inorder_traversal(self, node):
        if node:
            self._inorder_traversal(node.left)
            print(node.key, end=" ")
            self._inorder_traversal(node.right)

# Example usage:
bst = BinarySearchTree()
bst.insert(50)
bst.insert(30)
bst.insert(20)
bst.insert(40)
bst.insert(70)
bst.insert(60)
bst.insert(80)

print("In-order traversal:")
bst.inorder_traversal()  # Output: 20 30 40 50 60 70 80

print(f"Search 40: {bst.search(40).key if bst.search(40) else 'Not found'}")
print(f"Search 90: {bst.search(90).key if bst.search(90) else 'Not found'}")
{'prompt_tokens': 22, 'completion_tokens': 433, 'total_tokens': 455}

The model demonstrates strong performance on code generation tasks, with the completion_tokens offering insight into how the tokenizer’s code compression effectively optimizes the representation of programming languages using fewer tokens.
Advanced math and reasoning
The model also demonstrates strengths in mathematical and reasoning accuracy. For example, see the following code:

payload = {
    "messages": [
        {
            "role": "system",
            "content": "You are an expert in mathematics and reasoning. Your role is to provide examples, explanations, and insights related to mathematical concepts, problem-solving techniques, and logical reasoning."
        },
        {
            "role": "user",
            "content": """Calculating the orbital period of an exoplanet:
Given: An exoplanet orbits its star at a distance of 2.5 AU (Astronomical Units). The star has a mass of 1.2 solar masses.
Task: Calculate the orbital period of the exoplanet in Earth years."""
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.3,
    "top_p": 0.9,
}

response = predictor.predict(payload)['choices'][0]['message']['content'].strip()
print(response)

The following is the output:

To calculate the orbital period of an exoplanet, we can use Kepler’s Third Law, which states that the square of the orbital period (P) is directly proportional to the cube of the semi-major axis (a) of the orbit and inversely proportional to the mass (M) of the central body. The formula is:

P^2 = (4 * π^2 * a^3) / (G * M)

where:
– P is the orbital period in years,
– a is the semi-major axis in AU (Astronomical Units),
– G is the gravitational constant (6.67430 × 10^-11 m^3 kg^-1 s^-2),
– M is the mass of the star in solar masses.

First, we need to convert the mass of the star from solar masses to kilograms. The mass of the Sun is approximately 1.98847 × 10^30 kg. So, the mass of the star is:

M = 1.2 * 1.98847 × 10^30 kg = 2.386164 × 10^30 kg

Now, we can plug the values into Kepler’s Third Law:

P^2 = (4 * π^2 * (2.5 AU)^3) / (G * M)

Since 1 AU is approximately 1.496 × 10^11 meters, the semi-major axis in meters is:

a = 2.5 AU * 1.496 × 10^11 m/AU = 3.74 × 10^12 m

Now, we can calculate P^2:

P^2 = (4 * π^2 * (3.74 × 10^12 m)^3) / (6.67430 × 10^-11 m^3 kg^-1 s^-2 * 2.386164 × 10^30 kg)

P^2 = (4 * π^2 * 5.62 × 10^36 m^3) / (1.589 × 10^20 m^3 kg^-1 s^-2)

P^2 = (4 * π^2 * 5.62 × 10^36 m^3) / (1.589 × 10^20 m^3 kg^-1 s^-2)

P^2 = 4.15 × 10^16 s^2

Now, we take the square root to find the orbital period in seconds:

P = √(4.15 × 10^16 s^2) ≈ 2.04 × 10^8 s

Finally, we convert the orbital period from seconds to Earth years (1 Earth year = 31,557,600 seconds):

P = (2.04 × 10^8 s) / (31,557,600 s/year) ≈ 6.47 years

Therefore, the orbital period of the exoplanet is approximately 6.47 Earth years.

Language translation task

In this task, let’s test Mistral’s new Tekken tokenizer. Mistral states that the tokenizer is two times and three times more efficient at compressing Korean and Arabic, respectively.
Here, we use some text for translation:

text = """
"How can our business leverage Mistral NeMo with our new RAG application?"
"What is our change management strategy once we roll out this new application to the field?"
"""

We set our prompt to instruct the model on the translation to Korean and Arabic:

prompt = f"""
text={text}

Translate the following text into these languages:

1. Korean
2. Arabic

Label each language section accordingly"""

We can then set the payload:

payload = {
    "messages": [
        {
            "role": "system",
            "content": "You are an expert in language translation."
        },
        {
            "role": "user",
            "content": prompt
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.3,
    "top_p": 0.9,
}

# Call the endpoint once and reuse the result so the token usage matches the printed response.
result = predictor.predict(payload)
text_response = result['choices'][0]['message']['content'].strip()
tokens = result['usage']
print(text_response)
print(tokens)

The following is the output:

**1. Korean**

– “우리의 비즈니스가 Mistral NeMo를 어떻게 활용할 수 있을까요?”
– “이 새 애플리케이션을 현장에 롤아웃할 때 우리의 변화 관리 전략은 무엇입니까?”

**2. Arabic**

– “كيف يمكن لعمليتنا الاست من Mistral NeMo مع تطبيق RAG الجديد؟”
– “ما هو استراتيجيتنا في إدارة التغيير بعد تفعيل هذا التطبيق الجديد في الميدان؟”
{'prompt_tokens': 61, 'completion_tokens': 243, 'total_tokens': 304}

The translation results demonstrate how the number of completion_tokens used is significantly reduced, even for tasks that are typically token-intensive, such as translations involving languages like Korean and Arabic. This improvement is made possible by the optimizations provided by the Tekken tokenizer. Such a reduction is particularly valuable for token-heavy applications, including summarization, language generation, and multi-turn conversations. By enhancing token efficiency, the Tekken tokenizer allows for more tasks to be handled within the same resource constraints, making it an invaluable tool for optimizing workflows where token usage directly impacts performance and cost.
Clean up
After you’re done running the notebook, make sure to delete all resources that you created in the process to avoid additional billing. Use the following code:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion
In this post, we showed you how to get started with Mistral NeMo Base and Instruct in SageMaker Studio and deploy the model for inference. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.
For more Mistral resources on AWS, check out the Mistral-on-AWS GitHub repository.

About the authors
Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics.
Preston Tuggle is a Sr. Specialist Solutions Architect working on generative AI.
Shane Rai is a Principal Generative AI Specialist with the AWS World Wide Specialist Organization (WWSO). He works with customers across industries to solve their most pressing and innovative business needs using the breadth of cloud-based AI/ML services provided by AWS, including model offerings from top tier foundation model providers.

Solving Ad Spend and Inventory Challenges with Nick Shackelford 

Welcome to our DTC Next Ecommerce Growth Virtual Summit Series where we are recapping all of the goodness from our big-time event. Miss the show? Catch the replay here and get the hottest tips from the top ecommerce pros in the industry along with an exclusive gift!

Don’t have time to watch? Get the full recap below:

Ecommerce is a battlefield, and if anyone knows that, it's Nick Shackelford, ecommerce legend and founder of Brez and Structured Agency.

Whether it’s Meta throttling your ad spend, inventory issues leaving you scrambling, or outdated urgency tactics driving customers away, the challenges are real. 

But as Nick Shackelford says, “You can’t just shut things off and hope for the best. Adaptation is the name of the game.”

Here’s how to take Nick’s advice and turn these obstacles into opportunities.

When Meta Fails, AppLovin Saves the Day

Meta is the golden child of ad platforms until it’s not. 

During Black Friday/Cyber Monday, Nick Shackelford’s team ran into a wall when Meta refused to scale. 

“We couldn’t spend, CPMs were outrageous, and we needed volume.” Their solution? AppLovin.

Why AppLovin Works

AppLovin delivers cheaper traffic and significant volume. Nick highlighted the unique buying behavior of its users:

Engaged Audience: Consumers are incentivized to watch ads (think points or rewards).

Credit Cards Linked: Many users already have payment info attached, making them primed for quick purchases.

Affordable Traffic: While Meta CPMs hovered at $65–$70 (on a good day), AppLovin offered a cost-effective alternative.

Nick Shackelford’s Strategy

AppLovin is likely best as a discovery platform. Nick’s team is planning on pairing AppLovin traffic with Meta for remarketing and using the Customers.ai visitor ID pixel to get a better idea of AppLovin users.

Even if AppLovin’s traffic seems cheaper or unqualified, it works when paired with Meta. You just need to be strategic.

The takeaway? When Meta gets stingy, AppLovin is your fallback hero.

Don’t Pause Ads—Go Pre-Order

When inventory issues hit, it’s tempting to pause your ads. But Shackelford insists that’s the wrong move as turning off ads can kill momentum. On top of that, getting back to where you were is next to impossible.

Instead, look at pre-orders to keep the sales flowing.

How Nick Shackelford’s Team Nailed It

Product Page Transparency: Updated product pages to show pre-order availability—clear messaging like ‘Ships in October.’

Retention Alignment: Make sure customer service and retention teams are coordinated to ensure consistent messaging.

Post-Purchase Follow-Ups: Send regular updates to keep customers informed and build trust.

Why It Matters

Pre-orders let you maintain revenue while managing expectations. It’s all about keeping the cash flowing without misleading your customers.

Inventory issues don’t mean you have to hit pause. Pre-orders keep your ads running and your customers happy.

Kill the Countdown Timers

The days of countdown timers are over. Shackelford doesn’t mince words here: “Not only are they ethically questionable, but they’re also a legal risk. One complaint, and you’re in hot water.”

Why Nick Shackelford Says No to Countdown Timers

Legal Risks: False urgency (e.g., a timer that runs out but doesn’t actually end the sale) can lead to complaints or lawsuits.

Erosion of Trust: “Consumers are smarter now. They know when they’re being played, and it damages your brand.”

Nick Shackelford’s Better Approach

Ditch the gimmicks and focus on genuine incentives:

Limited-time offers with clear, honest messaging.

Build urgency with storytelling instead of cheap tricks.

Remember that transparency wins every time. Forget the countdown timers and earn trust instead.

Wrapping It Up: Nick Shackelford’s Winning Formula

Nick’s advice boils down to smart adaptation:

Switch it up when platforms like Meta don’t deliver. Test AppLovin and pair it with Meta for retargeting.

Keep selling, even when inventory is tight. Pre-orders save the day.

Build trust with honest urgency tactics. No more countdown timer gimmicks.

Ecommerce success isn’t about avoiding problems, it’s about solving them strategically. 

So get strategic and start putting these tips into action! 

Ready to see how Customers.ai can help? Start your free trial today and get 500 free contacts!

The post Solving Ad Spend and Inventory Challenges with Nick Shackelford  appeared first on Customers.ai.

Smash, Marry, or Pass: Tamanna Bawa’s DTC Advertising Playbook for 2 …

Welcome to our DTC Next Ecommerce Growth Virtual Summit Series where we are recapping all of the goodness from our big-time event. Miss the show? Catch the replay here and get the hottest tips from the top ecommerce pros in the industry along with an exclusive gift!

Don’t have time to watch? Get the full recap below:

Let’s face it, ecommerce life comes at you fast. 

One minute you’re experimenting with the latest ad platform and the next, you’re drowning in data spreadsheets that make no sense! 

Thankfully Tamanna Bawa, Tech Partner Manager at Triple Whale, breaks down where the DTC ad space is headed in 2025, giving you the cheat sheet you didn’t know you needed. 

From TikTok ads to Meta’s steady performance and why it’s time to ditch outdated processes, here are her do’s and don’ts for the coming year.

Smash: Why TikTok Ads Are Crushing It

TikTok is the wild card you can’t afford to ignore anymore. Sure, it’s fun for viral dances and late-night scrolling, but it’s also a powerhouse for ecommerce. 

Especially if you’re selling low-cost, impulse-buy products. Here’s why TikTok ads are turning heads:

1. Perfect for Lower AOV Products

Tamanna Bawa shared a stat that’ll make you rethink your ad strategy: TikTok’s average order value (AOV) sweet spot is $40, compared to Meta’s $90 and Google’s $70. 

If your product falls in the “grab it and go” category, TikTok is your playground.

2. ROAS on the Rise

Return on ad spend for TikTok ads has shot up by 23%, while cost per acquisition (CPA) continues to drop. 

Translation? It’s getting cheaper to acquire customers who are more likely to buy.

3. Built for Impulse Buys

TikTok is the digital equivalent of a checkout aisle – fast-paced, visual, and irresistible. Its format naturally lends itself to impulse-driven shopping, making it perfect for brands that thrive on quick decisions.

Tamanna Bawa’s Key Takeaway: 

If your product doesn’t require a lot of research or hand-wringing, TikTok is a no-brainer. It’s affordable, effective, and primed to make your brand a hit with scroll-happy shoppers.

Marry: Meta Ads for Stability & Scale

TikTok might be the shiny new thing, but Meta remains the solid, dependable partner you can rely on. 

It’s not flashy, but it gets the job done, especially when you’re planning big holiday campaigns or scaling your ad efforts.

1. Dominating Ad Budgets

Ad budgets on Meta grew by 30% from 2023 to 2024, with a slight improvement in ROAS. While the numbers aren’t as dramatic as TikTok’s, Meta offers something TikTok can’t yet – consistency.

2. Ideal for High-Stakes Seasons

If TikTok is the impulse-buy expert, Meta is the strategic planner. For key shopping periods like Black Friday or Christmas, Meta’s stability and broad audience reach make it the MVP.

3. Scalable for All Budgets

Meta is perfect for scaling your campaigns. Whether you’re a small business dipping your toes into ads or an enterprise-level brand going all in, Meta’s tools and audience segmentation features make it a reliable choice.

Tamanna Bawa’s Key Takeaway: 

Marry Meta for the long haul. It’s not as exciting as TikTok, but it’s the platform you can count on for steady growth and high-performing campaigns.

Pass: Messy Data Processes

Let’s talk about what’s holding you back – messy, outdated data management. If you’re still relying on siloed in-platform data or juggling spreadsheets like a circus act, it’s time to swipe left.

1. Why In-Platform Data Isn’t Enough

Platforms like TikTok and Meta offer plenty of insights, but they’re often siloed. This means you’re not getting the full picture of your customer journey or properly attributing your results.

2. The Problem with Spreadsheets

Spreadsheets might feel safe and familiar, but they’re riddled with inefficiencies. They’re time-consuming, error-prone, and, let’s be honest, a total headache when you’re trying to make sense of all that data.

3. Real-Time Data Is a Game Changer

In 2025, relying on outdated methods is like showing up to a Zoom call with a dial-up modem. Real-time, AI-driven insights are where it’s at. They help you identify what’s working (and what’s not) so you can pivot quickly and stay ahead of the competition.

Why Real-Time Data Matters

People want what they want and they want it now. Seconds can mean the difference between hitting your sales goals or watching your budget burn. 

Tamanna Bawa’s advice? 

Stop looking at yesterday’s numbers and start acting on today’s insights! Tools like Customers.ai or Triple Whale that offer real-time data empower you to:

Adjust ad spend on the fly.

Optimize campaigns for maximum ROI.

Stay nimble in an ever-changing market.

The result? Smarter decisions, faster growth, and a lot less stress.

The 2025 DTC Advertising Playbook Is All About Data

Tamanna’s insights make one thing clear – the 2025 DTC advertising playbook is all about data and being smart with that data. 

Whether you’re smashing it with TikTok, marrying Meta for stability, or passing on old-school data practices, the goal is the same – drive growth without the guesswork.

So, are you ready to crush it?

If so, then it’s time to see how Customers.ai can help. Start your free trial today and get 500 free contacts!

The post Smash, Marry, or Pass: Tamanna Bawa’s DTC Advertising Playbook for 2025 appeared first on Customers.ai.

Melanie Balke’s Rules for Standing Out in a Crowded Inbox

Welcome to our DTC Next Ecommerce Growth Virtual Summit Series where we are recapping all of the goodness from our big-time event. Miss the show? Catch the replay here and get the hottest tips from the top ecommerce pros in the industry along with an exclusive gift!

Don’t have time to watch? Get the full recap below:

Let’s talk email. You know, that old-school, unsexy marketing channel that somehow still generates a jaw-dropping $42 for every $1 spent. 

Here’s the thing – email isn’t the quiet, predictable corner of the marketing world it used to be. As Melanie Balke, CEO and Founder of The Email Marketers, tells us, “In 2025, email is going to get tougher, noisier, and way more competitive.”

So, how do you cut through the chaos, stay out of the dreaded spam folder, and keep your customers engaged? Melanie Balke’s got the playbook, and we’re breaking it down for you.

Deliverability: Your Inbox Reputation Matters

First up, deliverability. 

Think of it as your email program’s street cred. If you don’t have it, your messages aren’t getting seen. 

According to Melanie Balke, platforms like Google, Yahoo, and (soon) Apple are making it harder to hit the inbox. New standards demand high open and click rates while keeping spam complaints below 0.02%.

The biggest problem is that once you’re flagged as a “bad sender,” crawling out of the corner of shame is nearly impossible.

Melanie Balke’s Tips to Stay in the Game:

Stop sending to your entire list. Mass blasts are dead. Instead, segment your audience and send smarter.

Clean your list regularly. Dead weight (aka inactive subscribers) will tank your reputation faster than you can say “spam folder.”

Monitor metrics religiously. Know your open rates, click rates, and complaint rates—and optimize based on the data.

The TL;DR? Your email program is only as good as your deliverability. Protect it like your marketing life depends on it.

Plain Text Emails: The Underdog Strategy

In a world of flashy, overdesigned emails, plain text is having a moment. Why? Because it’s real, relatable, and stands out in the cluttered inbox.

Melanie Balke’s take: “Plain text emails are pattern interrupters—they catch attention because they feel personal.” 

But don’t mistake plain text for plain boring. The magic lies in the details. You still need storytelling, emotion, and a strong persona.

Real-Life Example: Original Grain

Nate, the VP of Marketing at Original Grain, crushes plain text emails by creating a persona that feels like a friend dropping into your inbox. His emails use humor, storytelling, and relatable vibes to build genuine connections.

How to Nail Plain Text Emails:

Develop a persona. Who’s writing the email? Make it someone your audience can connect with.

Tell a story. Skip the “Our sale ends tonight!” snooze fest. Share something that hooks readers emotionally.

Keep it human. Think less “corporate update” and more “text from a friend.”

The goal is to make your audience smile, laugh, or feel something and then click “open” the next time they see your name in their inbox.

Send More Emails (Yes, Really)

If you’re clutching your pearls at the idea of sending more emails, take a deep breath. Melanie Balke’s advice is clear: “In 2025, it’s not about value vs. quantity—it’s about delivering value at quantity.”

The data backs her up. 

Brands are sending 4x more emails than they did last year to achieve the same results. 

Why? Because inbox competition is fiercer than ever and the first hurdle is simply being seen.

How to Strike the Balance:

Start with visibility. More emails mean more chances to land in front of your audience.

Deliver value every time. If someone opens your email and feels like it was worth their time, they’ll open the next one.

Track engagement. Keep an eye on metrics to make sure your increased frequency isn’t tanking your deliverability.

As a marketer, you need to look at it this way – the more touchpoints you create, the better chance you have of catching your audience at the right moment.

Put It All Together: The Holistic Email Strategy

Here’s the key to Melanie’s email rules – none of these strategies exist in a vacuum. High deliverability, plain text emails, and increased frequency all work together to amplify your results.

Melanie Balke calls it the “magic sauce” of email marketing: “When you combine these tactics into a cohesive strategy, you’re looking at a 20-30% boost in performance. It’s about putting all the pieces together to make your program unstoppable.”

Why It Works:

Great deliverability means your emails actually get seen.

Plain text adds a human touch that builds relationships.

Increased frequency ensures you stay top of mind in a crowded inbox.

Email Marketing Done Right 

Email isn’t going anywhere, it’s just evolving. In 2025, the brands that succeed will be the ones that adapt, combining the fundamentals with fresh tactics to cut through the noise.

Start cleaning your lists, crafting those plain text stories, and ramping up your send schedule (without sacrificing value). As Melanie Balke says, “If you think you’re sending enough emails, send more. Just make sure they’re worth opening.”

Now, go make your inbox game stronger than ever.

Ready to see how Customers.ai can help amp up those inboxes? Start your free trial today and get 500 free contacts!

The post Melanie Balke’s Rules for Standing Out in a Crowded Inbox appeared first on Customers.ai.

Google DeepMind Just Released PaliGemma 2: A New Family of Open-Weight …

Vision-language models (VLMs) have come a long way, but they still face significant challenges when it comes to effectively generalizing across different tasks. These models often struggle with diverse input data types, like images of various resolutions or text prompts that require subtle understanding. On top of that, finding a balance between computational efficiency and model scalability is no easy feat. These challenges make it hard for VLMs to be practical for many users, especially those who need adaptable solutions that perform consistently well across a wide range of real-world applications, from document recognition to detailed image captioning.

Google DeepMind recently introduced PaliGemma 2, a new family of open-weight vision-language models (VLMs) with parameter sizes of 3 billion (3B), 10 billion (10B), and 28 billion (28B). The models support resolutions of 224×224, 448×448, and 896×896 pixels. This release includes nine pre-trained models with different combinations of sizes and resolutions, making them versatile for a variety of use cases. Two of these models are also fine-tuned on the DOCCI dataset, which contains image-text caption pairs, and support parameter sizes of 3B and 10B at a resolution of 448×448 pixels. Since these models are open-weight, they can be easily adopted as a direct replacement or upgrade for the original PaliGemma, offering users more flexibility for transfer learning and fine-tuning.

Technical Details

PaliGemma 2 builds on the original PaliGemma model by incorporating the SigLIP-So400m vision encoder along with the Gemma 2 language models. The models are trained in three stages, using different image resolutions (224px, 448px, and 896px) to allow for flexibility and scalability based on the specific needs of each task. PaliGemma 2 has been tested on more than 30 transfer tasks, including image captioning, visual question answering (VQA), video tasks, and OCR-related tasks like table structure recognition and molecular structure identification. The different variants of PaliGemma 2 excel under different conditions, with larger models and higher resolutions generally performing better. For example, the 28B variant offers the highest performance, though it requires more computational resources, making it suitable for more demanding scenarios where latency is not a major concern.

The PaliGemma 2 series is notable for several reasons. First, offering models at different scales and resolutions allows researchers and developers to adapt performance according to their specific needs, computational resources, and desired balance between efficiency and accuracy. Second, the models have shown strong performance across a range of challenging tasks. For instance, PaliGemma 2 has achieved top scores in benchmarks involving text detection, optical music score recognition, and radiography report generation. In the HierText benchmark for OCR, the 896px variant of PaliGemma 2 outperformed previous models in word-level recognition accuracy, showing improvements in both precision and recall. Benchmark results also suggest that increasing model size and resolution generally leads to better performance across diverse tasks, highlighting the effective combination of visual and textual data representation.

Conclusion

Google’s release of PaliGemma 2 represents a meaningful step forward in vision-language models. By providing nine models across three scales with open-weight availability, PaliGemma 2 addresses a wide range of applications and user needs, from resource-constrained scenarios to high-performance research tasks. The versatility of these models and their ability to handle diverse transfer tasks make them valuable tools for both academic and industry applications. As more use cases integrate multimodal inputs, PaliGemma 2 is well-positioned to provide flexible and effective solutions for the future of AI.

Check out the Paper and Models on Hugging Face. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter. Don't forget to join our 60k+ ML SubReddit.

The post Google DeepMind Just Released PaliGemma 2: A New Family of Open-Weight Vision Language Models (3B, 10B and 28B) appeared first on MarkTechPost.

China’s AI Unicorn ‘Moonshot AI’ Open-Sources its Core Reasoning …

Large Language Models (LLMs) have grown in complexity and demand, creating significant challenges for companies aiming to provide scalable and cost-effective Model-as-a-Service (MaaS). The rapid adoption of LLMs in various applications has led to highly variable workloads in terms of input/output lengths, arrival frequencies, and service requirements. Balancing resource utilization to meet these diverse needs has become a critical challenge. Achieving this balance requires sophisticated strategies to meet different Service Level Objectives (SLOs) for latency and throughput. Additionally, conventional LLM serving architectures often assume sufficient resources are available to handle all requests, which is increasingly difficult with rising demand, especially during peak usage times.

The primary challenge is to maximize throughput without compromising latency—particularly as operational costs rise and GPU resources remain limited. To address these issues, Moonshot AI developed a new architecture.

Moonshot AI Open-Sources its Core Reasoning Architecture: Mooncake

China-based AI company Moonshot AI has officially open-sourced its core reasoning architecture, named Mooncake. Mooncake aims to address key scalability and efficiency challenges in LLM serving. Moonshot AI employs a KVCache-centric disaggregated architecture, which sets Mooncake apart from traditional LLM serving platforms. The first open-source component of Mooncake, called the Transfer Engine, is now available on GitHub, with more components planned for future release.

The core of Mooncake is its KVCache-centric approach to handling computational workloads. By separating the prefill and decoding clusters, Mooncake can dynamically optimize resources, making use of underutilized CPU, DRAM, and SSD resources for efficient caching. This separation is crucial for addressing the diverse computational characteristics of LLM serving stages. The decision to open source Mooncake reflects a commitment to transparency and community-driven improvements in LLM scalability.

Technical Details

Mooncake leverages a KVCache-centric Prefill-Decoding (PD) separation technique and a storage-computation disaggregated architecture, which have significantly improved the inference throughput of Moonshot AI’s LLM service, Kimi. The KVCache mechanism is central to optimizing both throughput and latency. Instead of keeping GPU resources engaged with all aspects of model serving, Mooncake isolates KVCache usage from computational tasks, allowing it to be managed by underutilized hardware like CPUs and SSDs.

Mooncake’s architecture divides LLM serving into two stages—Prefill and Decoding. During the prefill stage, reusable cache is transferred to prefill instances, which optimizes the first token generation while reducing redundant computations. Then, during the decoding stage, the KVCache is aggregated, allowing for efficient batching. This separation has led to substantial performance improvements.
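As a conceptual sketch only (this is not Mooncake's code), the following Python toy shows the shape of that separation: prefill and decode are independent functions that exchange state through a cache keyed by the prompt, with the cache store standing in for the underutilized CPU, DRAM, and SSD tier described above.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class KVCacheStore:
    """Toy stand-in for a disaggregated KV cache held on CPU/DRAM/SSD."""
    entries: Dict[str, List[str]] = field(default_factory=dict)

    def get(self, prefix: str):
        return self.entries.get(prefix)

    def put(self, prefix: str, kv: List[str]) -> None:
        self.entries[prefix] = kv

def prefill(prompt: str, cache: KVCacheStore) -> List[str]:
    """Prefill stage: reuse cached state for a known prefix, otherwise compute and store it."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    kv = [f"kv({token})" for token in prompt.split()]  # placeholder for real attention KV tensors
    cache.put(prompt, kv)
    return kv

def decode(kv: List[str], max_new_tokens: int = 3) -> List[str]:
    """Decode stage: consume the aggregated KV cache and emit new tokens."""
    return [f"token_{i}" for i in range(max_new_tokens)]

cache = KVCacheStore()
kv_state = prefill("summarize this quarterly report", cache)  # runs on the prefill cluster
print(decode(kv_state))                                       # runs on the decoding cluster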

By implementing a prediction-based early rejection policy, Mooncake also helps prevent system overload during peak request periods. This approach has been instrumental in maintaining Service Level Objectives (SLOs) for time to first token (TTFT) and time between tokens (TBT), even under high workloads. Experimental results have shown that compared to the baseline, Mooncake achieved up to a fivefold increase in throughput in simulated scenarios and enabled 75% more request handling under real-world workloads.

The significance of Mooncake’s open-source release is multi-layered. It represents progress in the decentralization of LLM inference workloads, ensuring that no single hardware component becomes a bottleneck. The KVCache-centric scheduling model balances resource loads effectively, enabling service providers to maximize throughput without violating latency requirements. This efficiency is essential given the growing demand for LLM capabilities across industries.

Experimental results demonstrate that Mooncake achieved a fivefold increase in throughput in some simulated long-context scenarios while maintaining the required SLOs. In real-world settings, Mooncake enabled Kimi to handle 75% more requests compared to previous architectures. These improvements highlight Mooncake’s ability to scale efficiently and reduce costs. The disaggregation approach also provides greater flexibility in adding computational resources on-the-fly, which addresses variability in LLM workloads more efficiently than traditional coupled systems.

The phased open-source rollout also encourages collaborative development. By starting with the Transfer Engine, Moonshot AI aims to gather community insights before releasing additional components. This phased approach is intended to lead to further optimizations and broader adoption across various sectors that need efficient LLM serving solutions.

Conclusion

Moonshot AI’s decision to open source Mooncake reflects a broader industry trend towards transparent and scalable AI development practices. By focusing on KVCache-centric separation, Mooncake addresses the key challenges of LLM serving—latency, efficiency, and scalability. It has already shown significant performance gains, making it a promising framework for LLM serving. Mooncake’s architecture balances computational and caching demands effectively, improving resource utilization, reducing latency, and enhancing overall throughput. The phased open-source approach underscores Moonshot AI’s commitment to continuous improvement and community collaboration.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter. Don't forget to join our 60k+ ML SubReddit.

The post China’s AI Unicorn ‘Moonshot AI’ Open-Sources its Core Reasoning Architecture: ‘Mooncake’ appeared first on MarkTechPost.

ZipNN: A New Lossless Compression Method Tailored to Neural Networks

The rapid advancement of large language models (LLMs) has exposed critical infrastructure challenges in model deployment and communication. As models scale in size and complexity, they encounter significant storage, memory, and network bandwidth bottlenecks. The exponential growth of model sizes creates computational and infrastructural strains, particularly in data transfer and storage mechanisms. Current models like Mistral demonstrate the magnitude of these challenges, generating over 40 PBs of transferred information monthly and requiring extensive network resources. The storage requirements for model checkpoints and distributed updates can accumulate hundreds or thousands of times the original model size. 

Existing research in model compression has developed multiple approaches to reduce model sizes while attempting to maintain performance. Four primary model-compression methods have emerged: pruning, network architecture modification, knowledge distillation, and quantization. Among these techniques, quantization remains the most popular, deliberately trading accuracy for storage efficiency and computational speed. These methods share the goal of reducing model complexity, but each approach introduces inherent limitations. Pruning can potentially remove critical model information, distillation may not perfectly capture original model nuances, and quantization introduces entropy variations. Researchers have also begun exploring hybrid approaches that combine multiple compression techniques.

Researchers from IBM Research, Tel Aviv University, Boston University, MIT, and Dartmouth College have proposed ZipNN, a lossless compression technique specifically designed for neural networks. This method shows great potential in model size reduction, achieving significant space savings across popular machine learning models. ZipNN can compress neural network models by up to 33%, with some instances showing reductions exceeding 50% of the original model size. When applied to models like Llama 3, ZipNN outperforms vanilla compression techniques by over 17%, improving compression and decompression speeds by 62%. The method has the potential to save an ExaByte of network traffic monthly from large model distribution platforms like Hugging Face. 

ZipNN’s architecture is designed to enable efficient, parallel neural network model compression. The implementation is primarily written in C (2000 lines) with Python wrappers (4000 lines), utilizing the Zstd v1.5.6 library and its Huffman implementation. The core methodology revolves around a chunking approach that allows independent processing of model segments, making it particularly suitable for GPU architectures with multiple concurrent processing cores. The compression strategy operates at two granularity levels: chunk level and byte-group level. To enhance user experience, the researchers implemented seamless Hugging Face Transformers library integration, enabling automatic model decompression, metadata updates, and local cache management with optional manual compression controls.
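To make the byte-group idea concrete, the following is an illustrative sketch (not ZipNN's implementation) that regroups the bytes of float32 weights by position before compressing with the zstandard package; because the sign/exponent bytes of typical weights are highly repetitive, the grouped layout generally compresses better than the raw buffer.

import numpy as np
import zstandard as zstd

rng = np.random.default_rng(0)
weights = rng.standard_normal(1_000_000, dtype=np.float32)  # stand-in for model parameters

raw = weights.tobytes()

# Byte-group level: gather byte 0 of every float together, then byte 1, and so on,
# so the repetitive sign/exponent bytes end up in contiguous, easily compressible runs
planes = weights.view(np.uint8).reshape(-1, 4).T.copy()
grouped = planes.tobytes()

cctx = zstd.ZstdCompressor(level=3)
print("plain zstd ratio:  ", len(cctx.compress(raw)) / len(raw))
print("grouped zstd ratio:", len(cctx.compress(grouped)) / len(raw))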

Experimental evaluations of ZipNN were conducted on an Apple M1 Max machine with 10 cores and 64GB RAM, running macOS Sonoma 14.3. Model compressibility significantly influenced performance variations, with the FP32 regular model having approximately 3/4 non-compressible content, compared to 1/2 in the BF16 model and even less in the clean model. Comparative tests with LZ4 and Snappy revealed that while these alternatives were faster, they provided zero compression savings. Download speed measurements showed interesting patterns: initial downloads ranged from 10-40 MBps, while cached downloads exhibited significantly higher speeds of 40-130 MBps, depending on the machine and network infrastructure.

The research on ZipNN highlights a critical insight into the contemporary landscape of machine learning models: despite exponential growth and overparametrization, significant inefficiencies persist in model storage and communication. The study reveals substantial redundancies in model architectures that can be systematically addressed through targeted compression techniques. While current trends favor large models, the findings suggest that considerable space and bandwidth can be saved without compromising model integrity. By tailoring compression to neural network architectures, improvements can be achieved with minimal computational overhead, offering a solution to the growing challenges of model scalability and infrastructure efficiency.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter. Don't forget to join our 60k+ ML SubReddit.

The post ZipNN: A New Lossless Compression Method Tailored to Neural Networks appeared first on MarkTechPost.

Advancing AI trust with new responsible AI tools, capabilities, and re …

As generative AI continues to drive innovation across industries and our daily lives, the need for responsible AI has become increasingly important. At AWS, we believe the long-term success of AI depends on the ability to inspire trust among users, customers, and society. This belief is at the heart of our long-standing commitment to building and using AI responsibly. Responsible AI goes beyond mitigating risks and aligning to relevant standards and regulations. It’s about proactively building trust and unlocking AI’s potential to drive business value. A comprehensive approach to responsible AI empowers organizations to innovate boldly and achieve transformative business outcomes. New joint research conducted by Accenture and AWS underscores this, highlighting responsible AI as a key driver of business value — boosting product quality, operational efficiency, customer loyalty, brand perception, and more. Nearly half of the surveyed companies acknowledge responsible AI as pivotal in driving AI-related revenue growth. Why? Responsible AI builds trust, and trust accelerates adoption and innovation.
With trust as a cornerstone of AI adoption, we are excited to announce at AWS re:Invent 2024 new responsible AI tools, capabilities, and resources that enhance the safety, security, and transparency of our AI services and models and help support customers’ own responsible AI journeys.

Taking proactive steps to manage AI risks and foster trust and interoperability
AWS is the first major cloud service provider to announce ISO/IEC 42001 accredited certification for AI services, covering Amazon Bedrock, Amazon Q Business, Amazon Textract, and Amazon Transcribe. ISO/IEC 42001 is an international management system standard that outlines the requirements for organizations to manage AI systems responsibly throughout their lifecycle. Technical standards, such as ISO/IEC 42001, are significant because they provide a common framework for responsible AI development and deployment, fostering trust and interoperability in an increasingly global and AI-driven technological landscape. Achieving ISO/IEC 42001 certification means that an independent third party has validated that AWS is taking proactive steps to manage risks and opportunities associated with AI development, deployment, and operation. With this certification, we reinforce our commitments to providing AI services that help you innovate responsibly with AI.
Expanding safeguards in Amazon Bedrock Guardrails to improve transparency and safety
In April 2024, we announced the general availability of Amazon Bedrock Guardrails, which makes it easier to apply safety and responsible AI checks for your gen AI applications. Amazon Bedrock Guardrails delivers industry-leading safety protections by blocking up to 85% more harmful content on top of native protections provided by foundation models (FMs) and filtering over 75% of hallucinated responses from models using contextual grounding checks for Retrieval Augmented Generation (RAG) and summarization use cases. The ability to implement these safeguards was a big step forward in building trust in AI systems. Despite the advancements in FMs, models can still produce hallucinations—a challenge many of our customers face. For use cases where accuracy is critical, customers need the use of mathematically sound techniques and explainable reasoning to help generate accurate FM responses.
To address this need, we are adding new safeguards to Amazon Bedrock Guardrails to help prevent factual errors due to FM hallucinations and offer verifiable proofs. With the launch of the Automated Reasoning checks in Amazon Bedrock Guardrails (preview), AWS becomes the first and only major cloud provider to integrate automated reasoning in our generative AI offerings. Automated Reasoning checks help prevent factual errors from hallucinations using sound mathematical, logic-based algorithmic verification and reasoning processes to verify the information generated by a model, so outputs align with provided facts and aren’t based on hallucinated or inconsistent data. Used alongside other techniques such as prompt engineering, RAG, and contextual grounding checks, Automated Reasoning checks add a more rigorous and verifiable approach to enhancing the accuracy of LLM-generated outputs. Encoding your domain knowledge into structured policies helps your conversational AI applications provide reliable and trustworthy information to your users.

As organizations increasingly use applications with multimodal data to drive business value, improve decision-making, and enhance customer experiences, the need for content filters extends beyond text. Amazon Bedrock Guardrails now supports multimodal toxicity detection (in preview) with support for image content, helping organizations to detect and filter undesirable and potentially harmful image content while retaining safe and relevant visuals. Multimodal toxicity detection helps remove the heavy lifting required to build your own safeguards for image data or invest time in manual evaluation that can be error-prone and tedious. Amazon Bedrock Guardrails helps you to responsibly create AI applications, helping build trust with your users.
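To show what this looks like in practice, here is a hedged Boto3 sketch that attaches an existing guardrail to a model call through the Bedrock Converse API; the guardrail ID, guardrail version, model ID, and Region are placeholders you would replace with your own values.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")  # placeholder Region

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Summarize our refund policy."}]}],
    guardrailConfig={
        "guardrailIdentifier": "your-guardrail-id",  # placeholder: ID of a guardrail you created
        "guardrailVersion": "1",
    },
)
print(response["output"]["message"]["content"][0]["text"])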
Improving generative AI application responses and quality with new Amazon Bedrock evaluation capabilities
With more general-purpose FMs to choose from, organizations now have a wide range of options to power their generative AI applications. However, selecting the optimal model for a specific use case requires efficiently comparing models based on an organization’s preferred quality and responsible AI metrics. While evaluation is an important part of building trust and transparency, it demands substantial time, expertise, and resources for every new use case, making it challenging to choose the model that delivers the most accurate and safe customer experience. Amazon Bedrock Evaluations addresses this by helping you evaluate, compare, and select the best FMs for your use case. You can now use an LLM-as-a-judge (in preview) for model evaluations to perform tests and evaluate other models with human-like quality on your dataset. You can choose from LLMs hosted on Amazon Bedrock to be the judge, with a variety of quality and responsible AI metrics such as correctness, completeness, and harmfulness. You can also bring your own prompt dataset to customize the evaluation with your data, and compare results across evaluation jobs to make decisions faster. Previously, you had a choice between human-based model evaluation and automatic evaluation with exact string matching and other traditional natural language processing (NLP) metrics. These methods, though fast, didn’t provide a strong correlation with human evaluators. Now, with LLM-as-a-judge, you can get human-like evaluation quality at a much lower cost than full human-based evaluations while saving up to weeks of time. Many organizations still want the final assessment to be from expert human annotators. For this, Amazon Bedrock still offers full human-based evaluations with an option to bring your own workforce or have AWS manage your custom evaluation.
To equip FMs with up-to-date and proprietary information, organizations use RAG, a technique that fetches data from company data sources and enriches the prompt to provide more relevant and accurate responses. However, evaluating and optimizing RAG applications can be challenging due to the complexity of optimizing retrieval and generation components. To address this, we’ve introduced RAG evaluation support in Amazon Bedrock Knowledge Bases (in preview). This new evaluation capability now allows you to assess and optimize RAG applications conveniently and quickly, right where your data and LLMs already reside. Powered by LLM-as-a-judge technology, RAG evaluations offer a choice of several judge models and metrics, such as context relevance, context coverage, correctness, and faithfulness (hallucination detection). This seamless integration promotes regular assessments, fostering a culture of continuous improvement and transparency in AI application development. By saving both cost and time compared to human-based evaluations, these tools empower organizations to enhance their AI applications, building trust through consistent improvement.
The model and RAG evaluation capabilities both provide natural language explanations for each score in the output file and on the AWS Management Console. The scores are normalized from 0 to 1 for ease of interpretability. Rubrics are published in full with the judge prompts in the documentation so non-scientists can understand how scores are derived. To learn more about model and RAG evaluation capabilities, see the News Blog.
Introducing Amazon Nova, built with responsible AI at the core
Amazon Nova is a new generation of state-of-the-art FMs that deliver frontier intelligence and industry-leading price performance. Amazon Nova FMs incorporate built-in safeguards to detect and remove harmful content from data, reject inappropriate user inputs, and filter model outputs. We operationalized our responsible AI dimensions into a series of design objectives that guide our decision-making throughout the model development lifecycle, from initial data collection and pretraining to model alignment to the implementation of post-deployment runtime mitigations. Amazon Nova Canvas and Amazon Nova Reel come with controls to support safety, security, and IP needs with responsible AI. This includes watermarking, content moderation, and C2PA support (available in Amazon Nova Canvas) to add metadata by default to generated images. Amazon's safety measures to combat the spread of misinformation, child sexual abuse material (CSAM), and chemical, biological, radiological, or nuclear (CBRN) risks also extend to Amazon Nova models. For more information on how Amazon Nova was built responsibly, read the Amazon Science blog.

Enhancing transparency with new resources to advance responsible generative AI
At re:Invent 2024, we announced the availability of new AWS AI Service Cards for Amazon Nova Reel, Amazon Nova Canvas, Amazon Nova Micro, Lite, and Pro, Amazon Titan Image Generator, and Amazon Titan Text Embeddings to increase transparency of Amazon FMs. These cards provide comprehensive information on the intended use cases, limitations, responsible AI design choices, and best practices for deployment and performance optimization. A key component of Amazon's responsible AI documentation, AI Service Cards offer customers and the broader AI community a centralized resource to understand the development process we undertake to build our services in a responsible way that addresses fairness, explainability, privacy and security, safety, controllability, veracity and robustness, governance, and transparency. As generative AI continues to grow and evolve, transparency on how technology is developed, tested, and used will be a vital component to earn the trust of organizations and their customers alike. You can explore all 16 AI Service Cards on the Responsible AI Tools and Resources page.
We also updated the AWS Responsible Use of AI Guide. This document offers considerations for designing, developing, deploying, and operating AI systems responsibly, based on our extensive learnings and experience in AI. It was written with a set of diverse AI stakeholders and perspectives in mind—including, but not limited to, builders, decision-makers, and end-users. At AWS, we are committed to continuing to bring transparency resources like these to the broader community—and to iterate and gather feedback on the best ways forward.
Delivering breakthrough innovation with trust at the forefront
At AWS, we’re dedicated to fostering trust in AI, empowering organizations of all sizes to build and use AI effectively and responsibly. We are excited about the responsible AI innovations announced at re:Invent this week. From new safeguards and evaluation techniques in Amazon Bedrock to state-of-the-art Amazon Nova FMs to fostering trust and transparency with ISO/IEC 42001 certification and new AWS AI Service Cards, you have more tools, resources and built-in protections to help you innovate responsibly and unlock value with generative AI.
We encourage you to explore these new tools and resources:

AWS achieves ISO/IEC 42001 AI Management System accredited certification
Prevent factual errors from LLM hallucinations with mathematically sound Automated Reasoning checks (preview)
Amazon Bedrock Guardrails supports multimodal toxicity detection with image support
New RAG evaluation and LLM-as-a-judge capabilities in Amazon Bedrock
Amazon Nova and our commitment to responsible AI
Responsible AI at AWS website
AWS AI Service Cards
AWS Responsible Use of AI Guide

About the author
Dr. Baskar Sridharan is the Vice President for AI/ML and Data Services & Infrastructure at AWS, where he oversees the strategic direction and development of key services, including Bedrock, SageMaker, and essential data platforms like EMR, Athena, and Glue.

Deploy RAG applications on Amazon SageMaker JumpStart using FAISS

Generative AI has empowered customers with their own information in unprecedented ways, reshaping interactions across various industries by enabling intuitive and personalized experiences. This transformation is significantly enhanced by Retrieval Augmented Generation (RAG), which is a generative AI pattern where the large language model (LLM) being used references a knowledge corpus outside of its training data to generate a response. RAG has become a popular choice to improve performance of generative AI applications by taking advantage of additional information in the knowledge corpus to augment an LLM. Customers often prefer RAG for optimizing generative AI output over other techniques like fine-tuning due to cost benefits and quicker iteration.
In this post, we show how to build a RAG application on Amazon SageMaker JumpStart using Facebook AI Similarity Search (FAISS).
RAG applications on AWS
RAG models have proven useful for grounding language generation in external knowledge sources. By retrieving relevant information from a knowledge base or document collection, RAG models can produce responses that are more factual, coherent, and relevant to the user’s query. This can be particularly valuable in applications like question answering, dialogue systems, and content generation, where incorporating external knowledge is crucial for providing accurate and informative outputs.
Additionally, RAG has shown promise for improving understanding of internal company documents and reports. By retrieving relevant context from a corporate knowledge base, RAG models can assist with tasks like summarization, information extraction, and question answering on complex, domain-specific documents. This can help employees quickly find important information and insights buried within large volumes of internal materials.
A RAG workflow typically has four components: the input prompt, document retrieval, contextual generation, and output. A workflow begins with a user providing an input prompt, which is searched in a large knowledge corpus, and the most relevant documents are returned. These returned documents, along with the original query, are then fed into the LLM, which uses the additional conditional context to produce a more accurate output for users. RAG has become a popular technique to optimize generative AI applications because it relies on external data that can be modified frequently, so responses stay current without the need to retrain the model, which is both costly and compute intensive.
The next component in this pattern that we have chosen is SageMaker JumpStart. It provides significant advantages for building and deploying generative AI applications, including access to a wide range of pre-trained models with prepackaged artifacts, ease of use through a user-friendly interface, and scalability with seamless integration to the broader AWS ecosystem. By using pre-trained models and optimized hardware, SageMaker JumpStart allows you to quickly deploy both LLMs and embeddings models without spending too much time on configurations for scalability.
Solution overview
To implement our RAG workflow on SageMaker JumpStart, we use a popular open source Python library known as LangChain. Using LangChain, the RAG components are simplified into independent blocks that you can bring together using a chain object that will encapsulate the entire workflow. Let’s review these different components and how we bring them together:

LLM (inference) – We need an LLM that will do the actual inference and answer our end-user’s initial prompt. For our use case, we use Meta Llama 3 for this component. LangChain comes with a default wrapper class for SageMaker endpoints that allows you to simply pass in the endpoint name to define an LLM object in the library.
Embeddings model – We need an embeddings model to convert our document corpus into textual embeddings. This is necessary for when we are doing a similarity search on the input text to see what documents share similarities and possess the knowledge to help augment our response. For this example, we use the BGE Hugging Face embeddings model available through SageMaker JumpStart.
Vector store and retriever – To house the different embeddings we have generated, we use a vector store. In this case, we use FAISS, which allows for similarity search as well. Within our chain object, we define the vector store as the retriever. You can tune this depending on how many documents you want to retrieve. Other vector store options include Amazon OpenSearch Service as you scale your experiments.

The following architecture diagram illustrates how you can use a vector index such as FAISS as a knowledge base and embeddings store.

Standalone vector indexes like FAISS can significantly improve the search and retrieval of vector embeddings, but they lack capabilities that exist in any database. The following is an overview of the primary benefits to using a vector index for RAG workflows:

Efficiency and speed – Vector indexes are highly optimized for fast, memory-efficient similarity search. Because vector databases are built on top of vector indexes, they add features that typically introduce additional latency. To build a highly efficient and low-latency RAG workflow, you can use a vector index (such as FAISS) deployed on a single machine with GPU acceleration.
Simplified deployment and maintenance – Because vector indexes don’t require the effort of spinning up and maintaining a database instance, they’re a great option to quickly deploy a RAG workflow if continuous updates, high concurrency, or distributed storage aren’t a requirement.
Control and customization – Vector indexes offer granular control over parameters, the index type, and performance trade-offs, letting you optimize for exact or approximate searches based on the RAG use case.
Memory efficiency – You can tune a vector index to minimize memory usage, especially when using data compression techniques such as quantization. This is advantageous in scenarios where memory is limited and high scalability is required so that more data can be stored in memory on a single machine.

In short, a vector index like FAISS is advantageous when trying to maximize speed, control, and efficiency with minimal infrastructure components and stable data.
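To make this concrete, here is a minimal standalone FAISS example (assuming the faiss-cpu and numpy packages are installed); it builds an exact in-memory index over placeholder embeddings and runs a nearest-neighbor search with no database service involved.

import numpy as np
import faiss

d = 768                                                    # embedding dimension (for example, from a BGE model)
corpus = np.random.random((10_000, d)).astype("float32")   # placeholder embeddings
query = np.random.random((1, d)).astype("float32")

index = faiss.IndexFlatL2(d)             # exact L2 search; IndexIVFFlat or IndexHNSWFlat trade accuracy for speed
index.add(corpus)                        # add the corpus vectors to the in-memory index
distances, ids = index.search(query, 3)  # retrieve the 3 nearest neighbors
print(ids[0], distances[0])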
In the following sections, we walk through the following notebook, which implements FAISS as the vector store in the RAG solution. In this notebook, we use several years of Amazon’s Letter to Shareholders as a text corpus and perform Q&A on the letters. We use this notebook to demonstrate advanced RAG techniques with Meta Llama 3 8B on SageMaker JumpStart using the FAISS embedding store.
We explore the code using the simple LangChain vector store wrapper, RetrievalQA, and ParentDocumentRetriever. RetrievalQA is more advanced than a LangChain vector store wrapper and offers more customizations. ParentDocumentRetriever helps with advanced RAG options like invocation of parent documents for response generation, which enriches the LLM's outputs with a layered and thorough context. We will see how the responses progressively get better as we move from simple to advanced RAG techniques.
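The following is a minimal sketch of the ParentDocumentRetriever pattern, assuming the sagemaker_embeddings object and the documents list created in the later sections of this post; the chunk sizes and the in-memory docstore are illustrative choices, not the notebook's exact settings.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Large parent chunks are returned to the LLM; small child chunks are what gets embedded and searched
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# Seed a small FAISS index; sagemaker_embeddings is the embeddings endpoint wrapper defined below
vectorstore = FAISS.from_texts(["placeholder"], sagemaker_embeddings)
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=InMemoryStore(),          # maps child chunks back to their parent documents
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

retriever.add_documents(documents)     # documents is the list of loaded PDF pages
parent_docs = retriever.invoke("How did AWS perform in 2021?")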
Prerequisites
To run this notebook, you need access to an ml.t3.medium instance.
To deploy the endpoints for Meta Llama 3 8B model inference, you need the following:

At least one ml.g5.12xlarge instance for Meta Llama 3 endpoint usage
At least one ml.g5.2xlarge instance for embedding endpoint usage

Additionally, you may need to request a Service Quota increase.
Set up the notebook
Complete the following steps to create a SageMaker notebook instance (you can also use Amazon SageMaker Studio with JupyterLab):

On the SageMaker console, choose Notebooks in the navigation pane.
Choose Create notebook instance.

For Notebook instance type, choose ml.t3.medium.
Under Additional configuration, for Volume size in GB, enter 50.

This configuration might need to change depending on the RAG solution you are working with and the amount of data you will have on the file system itself.

For IAM role, choose Create a new role.

Create an AWS Identity and Access Management (IAM) role with SageMaker full access and any other service-related policies that are necessary for your operations.

Expand the Git repositories section and for Git repository URL, enter https://github.com/aws-samples/sagemaker-genai-hosting-examples.git.

Accept defaults for the rest of the configurations and choose Create notebook instance.
Wait for the notebook to be InService and then choose the Open JupyterLab link to launch JupyterLab.

Open genai-recipes/RAG-recipes/llama3-rag-langchain-smjs.ipynb to work through the notebook.

Deploy the model
Before you start building the end-to-end RAG workflow, it’s necessary to deploy the LLM and embeddings model of your choice. SageMaker JumpStart simplifies this process because the model artifacts, data, and container specifications are all pre-packaged for optimal inference. These are then exposed using SageMaker Python SDK high-level API calls, which let you specify the model ID for deployment to a SageMaker real-time endpoint:

from sagemaker.jumpstart.model import JumpStartModel

# Deploy Llama: specify the model ID for the Hugging Face Llama 3 8B Instruct LLM
model_id = "meta-textgeneration-llama-3-8b-instruct"
accept_eula = True
model = JumpStartModel(model_id=model_id)
predictor = model.deploy(accept_eula=accept_eula)

# Deploy the embeddings model: specify the model ID for the Hugging Face BGE Large EN embedding model
model_id = "huggingface-sentencesimilarity-bge-large-en-v1-5"
text_embedding_model = JumpStartModel(model_id=model_id)
embedding_predictor = text_embedding_model.deploy()
embedding_predictor.endpoint_name

LangChain comes with built-in support for SageMaker JumpStart and endpoint-based models, so you can encapsulate the endpoints with these constructs so they can later be fit into the encompassing RAG chain:

from langchain_community.llms import SagemakerEndpoint
from langchain_community.embeddings import SagemakerEndpointEmbeddings

# Set up the Llama 3 8B model with a SageMaker endpoint
llm = SagemakerEndpoint(
    endpoint_name=llm_endpoint_name,
    region_name=region,
    model_kwargs={"max_new_tokens": 1024, "top_p": 0.9, "temperature": 0.7},
    content_handler=llama_content_handler,
)

# Set up the embeddings model
sagemaker_embeddings = SagemakerEndpointEmbeddings(
    endpoint_name=embedding_endpoint_name,
    region_name=region,
    model_kwargs={"mode": "embedding"},
    content_handler=bge_content_handler,
)

After you have set up the models, you can focus on the data preparation and setup of the FAISS vector store.
Data preparation and vector store setup
For this RAG use case, we take public documents of Amazon’s Letter to Shareholders as the text corpus and document source that we will be working with:

# Public data to retrieve from
from urllib.request import urlretrieve

urls = [
    'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/c7c14359-36fa-40c3-b3ca-5bf7f3fa0b96.pdf',
    'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/d2fde7ee-05f7-419d-9ce8-186de4c96e25.pdf',
    'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/f965e5c3-fded-45d3-bbdb-f750f156dcc9.pdf',
    'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/336d8745-ea82-40a5-9acc-1a89df23d0f3.pdf'
]
filenames = [
    'AMZN-2024-10-K-Annual-Report.pdf',
    'AMZN-2023-10-K-Annual-Report.pdf',
    'AMZN-2022-10-K-Annual-Report.pdf',
    'AMZN-2021-10-K-Annual-Report.pdf'
]

LangChain comes with built-in processing for PDF documents, and you can use this to load the data from the text corpus. You can also tune or iterate over parameters such as chunk size depending on the documents that you’re working with for your use case.

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

documents = []

# Process the PDF data (data_root and metadata are assumed to be defined earlier in the notebook)
for idx, file in enumerate(filenames):
    loader = PyPDFLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata = metadata[idx]
    documents += document

# In our testing, character splitting works better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
)
docs = text_splitter.split_documents(documents)
print(docs[100])

You can then combine the documents and embeddings models and point towards FAISS as your vector store. LangChain has widespread support for different LLMs such as SageMaker JumpStart, and also has built-in API calls for integrating with FAISS, which we use in this case:

from langchain_community.vectorstores import FAISS
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
vectorstore_faiss = FAISS.from_documents(
    docs, # doc corpus
    sagemaker_embeddings, # embeddings endpoint
)
wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)

You can then make sure the vector store is performing as expected by sending a few sample queries and reviewing the output that is returned:

query = "How did AWS perform in 2021?"
# Generates an answer grounded in the documents retrieved from the vector store
# (PROMPT is a prompt template defined in the accompanying notebook)
answer = wrapper_store_faiss.query(question=PROMPT.format(query=query), llm=llm)
print(answer)

LangChain inference
Now that you have set up the vector store and models, you can encapsulate this into a singular chain object. In this case, we use a RetrievalQA Chain tailored for RAG applications provided by LangChain. With this chain, you can customize the document fetching process and control parameters such as number of documents to retrieve. We define a prompt template and pass in our retriever as well as these tertiary parameters:

from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt_template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
This is a conversation between an AI assistant and a Human.
<|eot_id|><|start_header_id|>user<|end_header_id|>
Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
#### Context ####
{context}
#### End of Context ####
Question: {question}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT},
)

You can then test some sample inference and trace the relevant source documents that helped answer the query:

query = "How did AWS perform in 2023?"
result = qa({"query": query})
print(result['result'])
print(f"\n{result['source_documents']}")

Optionally, if you want to further augment or enhance your RAG applications for more advanced use cases with larger documents, you can also explore using options such as a parent document retriever chain. Depending on your use case, it’s crucial to identify the different RAG processes and architectures that can optimize your generative AI application.
Clean up
After you have built the RAG application with FAISS as a vector index, make sure to clean up the resources that were used. You can delete the LLM endpoint using the delete_endpoint Boto3 API call. In addition, make sure to stop your SageMaker notebook instance to not incur any further charges.
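For example, if the predictor objects from the deployment step are still in scope, a minimal cleanup sketch looks like the following (you can equally call the delete_endpoint API from Boto3 with the endpoint names):

# Delete the models and endpoints created earlier so you stop incurring charges
predictor.delete_model()
predictor.delete_endpoint()
embedding_predictor.delete_model()
embedding_predictor.delete_endpoint()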
Conclusion
RAG can revolutionize customer interactions across industries by providing personalized and intuitive experiences. RAG’s four-component workflow—input prompt, document retrieval, contextual generation, and output—allows for dynamic, up-to-date responses without the need for costly model retraining. This approach has gained popularity due to its cost-effectiveness and ability to quickly iterate.
In this post, we saw how SageMaker JumpStart has simplified the process of building and deploying generative AI applications, offering pre-trained models, user-friendly interfaces, and seamless scalability within the AWS ecosystem. We also saw how using FAISS as a vector index can enable quick retrieval from a large corpus of information, while keeping costs and operational overhead low.
To learn more about RAG on SageMaker, see Retrieval Augmented Generation, or contact your AWS account team to discuss your use cases.

About the Authors
Raghu Ramesha is an ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.
Ram Vegiraju is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on SageMaker. In his spare time, he loves traveling and writing.
Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Harish Rao is a Senior Solutions Architect at AWS, specializing in large-scale distributed AI training and inference. He empowers customers to harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.
Ankith Ede is a Solutions Architect at Amazon Web Services based in New York City. He specializes in helping customers build cutting-edge generative AI, machine learning, and data analytics-based solutions for AWS startups. He is passionate about helping customers build scalable and secure cloud-based solutions.
Sid Rampally is a Customer Solutions Manager at AWS, driving generative AI acceleration for life sciences customers. He writes about topics relevant to his customers, focusing on data engineering and machine learning. In his spare time, Sid enjoys walking his dog in Central Park and playing hockey.

Speed up your cluster procurement time with Amazon SageMaker HyperPod …

Today, organizations are constantly seeking ways to use advanced large language models (LLMs) for their specific needs. These organizations are engaging in both pre-training and fine-tuning massive LLMs, with parameter counts in the billions. This process aims to enhance model efficacy for a wide array of applications across diverse sectors, including healthcare, financial services, and marketing. However, customizing these larger models requires access to the latest and accelerated compute resources.
In this post, we demonstrate how you can address this requirement by using Amazon SageMaker HyperPod training plans, which can bring down your training cluster procurement wait time. A training plan provides simple and predictable access to accelerated compute resources (supporting P4d, P5, P5e, P5en, and trn2 as of the time of writing), allowing you to use this compute capacity to run model training on either Amazon SageMaker training jobs or SageMaker HyperPod.
We guide you through a step-by-step implementation of how you can use the AWS Command Line Interface (AWS CLI) or the AWS Management Console to find, review, and create optimal training plans for your specific compute and timeline needs. We further guide you through using the training plan to submit SageMaker training jobs or create SageMaker HyperPod clusters.
You can check out the launch of this new feature in Meet your training timelines and budget with new Amazon SageMaker HyperPod flexible training plans.
Business challenges
As organizations strive to harness the power of LLMs for competitive advantage, they face a significant hurdle: securing sufficient and reliable compute capacity for model training. The scale of these models demands cutting-edge accelerated compute hardware. However, the high cost and limited availability of such resources create a bottleneck for many businesses. This scarcity not only impacts timelines, but also stretches budgets, potentially delaying critical AI initiatives. As a result, organizations are seeking solutions that can provide consistent, scalable, and cost-effective access to high-performance computing resources, enabling them to train and fine-tune LLMs without compromising on speed or quality.
Solution overview
SageMaker HyperPod training plans, a new SageMaker capability, address this challenge by offering you a simple-to-use console UI or AWS CLI experience to search, review, create, and manage training plans.
Capacity provisioned through SageMaker training plans can be used with either SageMaker training jobs or SageMaker HyperPod. If you want to focus on model development rather than infrastructure management and prefer ease of use with a managed experience, SageMaker training jobs are an excellent choice. For organizations requiring granular control over training infrastructure and extensive customization options, SageMaker HyperPod is the ideal solution. To better understand these services and choose the one most appropriate for your use case, refer to Generative AI foundation model training on Amazon SageMaker, which provides detailed information about both options.
The following diagram provides an overview of the main steps involved in requesting capacity using SageMaker training plans for SageMaker training jobs.

Figure 1: The main steps involved in procuring capacity via SageMaker HyperPod training plans. Note: This workflow arbitrarily uses SageMaker training jobs as the target; you may choose to use SageMaker HyperPod too.

At a high level, the steps to create a training plan are as follows:

Search the training plans that best match your capacity requirements, such as instance type, instance count, start time, and duration. SageMaker finds the optimal plans across one or more segments.
After reviewing the available training plan offerings, you can reserve the plan that meets your requirements.
Schedule your SageMaker training jobs by using a training plan with a training-job target resource (we use training-job here for illustration; you can also use hyperpod-cluster as your target resource). See the sketch after this list for an example.
Describe and list your existing training plans. When the capacity is available, it will be allocated to the scheduled training job.
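As an illustration of step 3, the sketch below submits a training job against a reserved plan using Boto3. It assumes that the CreateTrainingJob ResourceConfig accepts a TrainingPlanArn field for this purpose (check the current API reference for the exact field name); the job name, container image, role, bucket, and plan ARN are placeholders.

import boto3

sagemaker_client = boto3.client("sagemaker", region_name="us-west-2")

sagemaker_client.create_training_job(
    TrainingJobName="llm-training-on-plan",  # placeholder job name
    AlgorithmSpecification={
        "TrainingImage": "<your-training-image-uri>",  # placeholder training container
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role
    OutputDataConfig={"S3OutputPath": "s3://your-bucket/output/"},  # placeholder bucket
    ResourceConfig={
        "InstanceType": "ml.p5.48xlarge",
        "InstanceCount": 2,
        "VolumeSizeInGB": 500,
        # Assumed field: associates the job with the reserved capacity in your training plan
        "TrainingPlanArn": "arn:aws:sagemaker:us-west-2:111122223333:training-plan/p5-training-plan",
    },
    StoppingCondition={"MaxRuntimeInSeconds": 86400},
)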

In the following sections, we shift our focus to the solution walkthrough associated with training plans.
Prerequisites
Complete the following prerequisite steps:

If you’re using an AWS Identity and Access Management (IAM) user for this solution, make sure that your user has the AmazonSageMakerFullAccess policy attached to it. To learn more about how to attach a policy to an IAM user, see Adding IAM identity permissions (console).
If you’re setting up the AWS CLI for the first time, follow the instructions at Getting started with the AWS CLI.
If you choose to use the AWS CLI, make sure you are on the most up-to-date AWS CLI version.

Create a training plan
In this post, we discuss two ways to create a training plan: using the SageMaker console or the AWS CLI.
Create a SageMaker training plan using the SageMaker console
The SageMaker console user experience for creating a training plan is similar for both training jobs and SageMaker HyperPod. In this post, for demonstration purposes, we show how to create a training plan for a SageMaker HyperPod cluster.

On the SageMaker console, choose Training plans in the navigation pane.
Create a new training plan.
For Target, select HyperPod cluster.
Under Instance attributes, specify your instance type (ml.p5.48xlarge) and instance count (16).
Under Date settings to search for an available plan, choose your preferred training date and duration (for example, 10 days).
Choose Find training plan.

Figure 2: You can search for available training plan offerings via the SageMaker console! Choose your target, select your instance type and count, and specify duration.

SageMaker suggests a training plan that is split into two 5-day segments. This includes the total upfront price for the plan as well as the estimated data transfer cost based on the data location you provided.

Figure 3: SageMaker suggests a training plan based on your inputs. In this example, SageMaker suggests a training plan split across two 5-day segments. You will also see the total upfront price.

Review and purchase your plan.

Figure 4: Once you’re happy with your selection, you can review and purchase your training plan!

After you create the training plan, you can see the list of training plans created. The plan initially enters a Pending state, awaiting payment. Once the payment is processed (unless the payment cycle has changed), the plan will transition to the Scheduled state. At this point, you can begin queuing jobs or creating clusters using the plan. On the plan’s start date, it becomes Active, and resources are allocated. Your training tasks can then start running (pending resource availability).
Make sure you pay for the training plan using the AWS Billing and Cost Management console for your plan to show up on your SageMaker console. You will receive an invoice that you need to resolve before you can proceed.

Figure 5: You can list out your training plans on the SageMaker console. You can start using your plan once it transitions to the Active state.

Create a SageMaker training plan using the AWS CLI
Complete the following steps to create a training plan using the AWS CLI:

Start by calling the API, passing your capacity requirements as input parameters, to search for all matching training plan offerings.

The following example searches for training plan offerings suitable for two ml.p5.48xlarge instances for 96 hours in the us-west-2 Region. We also filter by the time frame in which we want to use the training plan, and we use the target-resources parameter to restrict results to training plans that can be used for SageMaker HyperPod cluster workloads:

# Required: instance type and instance count, target resources, region
# Optional: duration hours, start time after, and end time before

aws sagemaker search-training-plan-offerings \
    --region "us-west-2" \
    --instance-type 'ml.p5.48xlarge' \
    --instance-count 2 \
    --target-resources 'hyperpod-cluster' \
    --duration-hours 96 \
    --start-time-after "2025-01-01T00:00:00" \
    --end-time-before "2025-12-31T23:59:59"

Each TrainingPlanOffering returned in the response is identified by a unique TrainingPlanOfferingId. The first offering in the list represents the best match for your requirements. In this case, the SageMaker SearchTrainingPlanOfferings API returns a single available TrainingPlanOffering that matches the specified capacity requirements:

{
    'TrainingPlanOfferings': [
        {
            'TrainingPlanOfferingId': 'tpo-abc123',
            'TargetResources': ['hyperpod-cluster'],
            'RequestedStartTimeAfter': datetime.datetime(2024, 11, 18, 11, 40, 47, 928000, tzinfo=tzlocal()),
            'DurationHours': 96,
            'DurationMinutes': 0,
            'Upfront': 'xx.yy',
            'CurrencyCode': 'USD',
            'ReservedCapacityOfferings': [
                {
                    'InstanceType': 'ml.p5.48xlarge',
                    'InstanceCount': 2,
                    'AvailabilityZone': 'us-west-2a',
                    'DurationHours': 96,
                    'DurationMinutes': 0,
                    'StartTime': datetime.datetime(2024, 11, 21, 3, 30, tzinfo=tzlocal()),
                    'EndTime': datetime.datetime(2024, 11, 22, 3, 30, tzinfo=tzlocal())
                }
            ]
        }
    ]
}

Make sure that your SageMaker HyperPod training job subnets are in the same Availability Zone as your training plan.

After you choose the training plan that best suits your schedule and requirements, you can reserve it by calling the CreateTrainingPlan API as follows:

# Required: training-plan-offering-id, training-plan-name
# Optional: target-services (uses training-job by default)
aws sagemaker create-training-plan \
    --training-plan-offering-id "tpo-abc123" \
    --training-plan-name "p5-training-plan" \
    --region "us-west-2"

You will see an output that looks like the following:

{
    "TrainingPlanArn": "arn:aws:sagemaker:us-west-2:123456789123:training-plan/p5-training-plan"
}

After you create the training plan, you will need to pay for it. Be on the lookout for an invoice, which you can also find on the AWS Billing and Cost Management console.

You can list all the training plans that are created in your AWS account (and Region) by calling the ListTrainingPlans API:

aws sagemaker list-training-plans

This will give you a summary of the training plans in your account. After you have your training plan (the newly created p5-training-plan), you can check its details using either the console or the DescribeTrainingPlan API as follows:

export TRAINING_PLAN="p5-training-plan"
TRAINING_PLAN_DESCRIPTION=$(aws sagemaker describe-training-plan --training-plan-name "$TRAINING_PLAN")
echo "$TRAINING_PLAN_DESCRIPTION"

# Pick out individual parameters from the DescribeTrainingPlan API response
TRAINING_PLAN_ARN=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.TrainingPlanArn')
AVAILABLE_INSTANCE_COUNT=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.AvailableInstanceCount')
TOTAL_INSTANCE_COUNT=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.TotalInstanceCount')

# Note: Your training plan may span multiple Availability Zones, so adjust the jq command below accordingly.
TRAINING_PLAN_AZ=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.ReservedCapacitySummaries[0].AvailabilityZone')
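The same lookup can be done with boto3 if you're working in Python. This is a minimal sketch that assumes the response fields match the CLI output used above (TrainingPlanArn, AvailableInstanceCount, TotalInstanceCount, ReservedCapacitySummaries):

import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

# Plan name taken from the CLI example above
plan = sm.describe_training_plan(TrainingPlanName="p5-training-plan")

training_plan_arn = plan["TrainingPlanArn"]
available_instance_count = plan["AvailableInstanceCount"]
total_instance_count = plan["TotalInstanceCount"]

# A plan may span multiple Availability Zones; this takes only the first one
training_plan_az = plan["ReservedCapacitySummaries"][0]["AvailabilityZone"]

print(training_plan_arn, available_instance_count, total_instance_count, training_plan_az)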

Use a training plan with SageMaker HyperPod
When your training plan status transitions to Scheduled, you can use it for new instance groups in either a new or existing SageMaker HyperPod cluster. Use the CreateCluster API to create a new SageMaker HyperPod cluster with your training plan, or the UpdateCluster API to add it to an existing cluster. You can also work directly in the SageMaker console.
For a given SageMaker HyperPod cluster, training plans are attached at the instance group level, one per instance group. If desired, a single SageMaker HyperPod cluster can have multiple training plans attached across different instance groups. For other instance groups, you can omit a training plan and continue using On-Demand capacity as before. However, you can't mix training plan capacity with On-Demand capacity within the same instance group. You can also choose a partial cluster launch for any instance group, which means that even if all of the requested capacity isn't available, you can still spin up a cluster with the capacity that is already available to you.
While a training plan is active, the TrainingPlanOfferings within it start and stop according to their schedule. Each time a TrainingPlanOffering starts, the associated instance group automatically scales up to the specified count and its TrainingPlanStatus shows as Active. When a TrainingPlanOffering is scheduled to stop, the instance group automatically scales down to zero and its TrainingPlanStatus shows as Expired.
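You can monitor this behavior programmatically. The following is a minimal boto3 sketch that lists each instance group's training plan status for a cluster; it assumes the TrainingPlanStatus field described above is returned per instance group in the DescribeCluster response:

import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

# Cluster name assumed from the example configuration later in this post
cluster = sm.describe_cluster(ClusterName="ml-cluster")

for group in cluster["InstanceGroups"]:
    # TrainingPlanStatus is assumed to surface the Active/Expired states described above
    print(group["InstanceGroupName"], group.get("TrainingPlanStatus"))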
Use a training plan with SageMaker HyperPod on the console
You can either create a new cluster and instance group, or edit an existing cluster and instance group. In the configuration, choose the same instance type that you chose for the training plan and specify the desired instance count. The Instance capacity option appears only when you choose an instance type that is supported for training plans. Use the dropdown menu to scroll through valid training plans. The available selections are listed by name and are filtered to those that match the chosen instance type, have at least the specified instance count, were created with hyperpod-cluster as the target resource, and currently have a status of Scheduled or Active. If you don't see an expected training plan name, double-check these conditions and make sure that the plan was created in the same account and Region. The default selection is to use no training plan. Repeat the process for each instance group that should have a training plan.

Figure 6: You can create an instance group for a SageMaker HyperPod cluster with the instances in your training plan. Make sure to choose the right training plan listed under “Instance capacity”

Use a training plan with SageMaker HyperPod with the AWS CLI
Complete the following steps to use your training plan with the AWS CLI:

Create a SageMaker HyperPod cluster from scratch. For instructions, refer to the Amazon SageMaker HyperPod workshop or the Amazon EKS Support in Amazon SageMaker HyperPod workshop.

The following cluster configuration file defines a SageMaker HyperPod SLURM cluster named ml-cluster. The steps for using training plans are the same regardless of whether you choose SLURM or Amazon Elastic Kubernetes Service (Amazon EKS) as the orchestrator. The cluster contains an instance group named controller-machine with one ml.m5.12xlarge instance that serves as the SLURM head node; this instance group does not use a training plan. We also define a worker instance group named worker-group-1 with two ml.p5.48xlarge instances, which are sourced from your training plan. Note the "TrainingPlanArn" line: this is where you specify your training plan by its full Amazon Resource Name (ARN). If you followed the steps in the prior sections, this is the value of the environment variable TRAINING_PLAN_ARN. The following cluster configuration also omits some configuration parameters, such as VPCConfig and InstanceStorageConfig. Refer to the workshop or the following script for a complete SageMaker HyperPod cluster configuration file.

source env_vars
cat > cluster-config.json << EOF
{
    "ClusterName": "ml-cluster",
    "InstanceGroups": [
        {
            "InstanceGroupName": "controller-machine",
            "InstanceType": "ml.m5.12xlarge",
            "InstanceCount": 1,
            ...
        },
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 2,
            "TrainingPlanArn": "<ENTER TRAINING PLAN ARN HERE>",
            ...
        }
    ],
    ...
}
EOF

You can then create the cluster using the following code:

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json \
    --region $AWS_REGION

These next steps assume that you already have a SageMaker HyperPod cluster created. This section is relevant if you’d like to add an instance group that uses your training plan reserved instances to your existing cluster.

To update an existing cluster, you can define another file called update-cluster-config.json as follows. If you followed the instructions in the workshop to provision the cluster, you can use the provided create_config.sh to get the values for your env_vars before sourcing them.

# Source environment variables
source env_vars

# Create additional worker group configuration
additional_worker_group=$(cat <<EOF
{
    "InstanceGroupName": "worker-group-2",
    "InstanceType": "ml.p5.48xlarge",
    "InstanceCount": 2,
    "TrainingPlanArn": "<ENTER TRAINING PLAN ARN HERE>", ...
}
EOF
)

# Copy cluster-config.json to a temporary file
cp cluster-config.json temp-cluster-config.json

# Add additional worker group and remove VpcConfig section
jq --argjson additional_worker_group "$additional_worker_group" '.InstanceGroups += [$additional_worker_group] | del(.VpcConfig)' temp-cluster-config.json > update-cluster-config.json

# Remove the temporary file
rm temp-cluster-config.json

In this file, we define an additional worker group named worker-group-2 consisting of two ml.p5.48xlarge instances. Again, notice the "TrainingPlanArn" line: this is where you specify your training plan by its full ARN.
Make sure that you also update provisioning_parameters.json and upload the updated file to your S3 bucket, because SageMaker uses it while provisioning the new worker group.

Because this file lives in Amazon Simple Storage Service (Amazon S3), first copy it down from your bucket:

aws s3 cp s3://${BUCKET}/src/provisioning_parameters.json provisioning_parameters.json

Assuming your existing cluster has a controller machine group and a worker group of ml.g5.48xlarge instances, add the following worker-group-2 entry to your existing provisioning_parameters.json file:

{
    ...
    "controller_group": "controller-machine",
    "worker_groups": [
        {
            "instance_group_name": "worker-group-1",
            "partition_name": "ml.g5.48xlarge"
        },
        {
            "instance_group_name": "worker-group-2",
            "partition_name": "ml.p5.48xlarge"
        }
    ],
    ...
}

This step adds in the new worker group that you just created, which consists of your 2 ml.p5.48xlarge nodes from your training plan.

Now you can re-upload the updated provisioning_parameters.json file to Amazon S3:

# copy to the S3 Bucket
aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/

Now, with both update-cluster-config.json and provisioning_parameters.json updated, you can add the training plan nodes to the cluster:

aws sagemaker update-cluster \
    --cli-input-json file://update-cluster-config.json \
    --region $AWS_REGION

Use a training plan with a SageMaker training job
SageMaker training jobs offer two primary methods for execution: an AWS CLI command and the Python SDK. The AWS CLI approach provides direct control and is ideal for scripting, allowing you to create training jobs with a single command. The Python SDK offers a more programmatic interface, enabling seamless integration with existing Python workflows and using the high-level features in SageMaker. In this section, we look at how you can use a training plan with both options.
Run a training job on a training plan using the AWS CLI
The following example demonstrates how to create a SageMaker training job and associate it with a provided training plan by specifying the TrainingPlanArn in the ResourceConfig of the create-training-job AWS CLI command:

# Create a training job
aws sagemaker create-training-job \
    --training-job-name training-job-name \
    ... \
    --resource-config '{
        "InstanceType": "ml.p5.48xlarge",
        "InstanceCount": 8,
        "VolumeSizeInGB": 10,
        "TrainingPlanArn": "Enter training plan arn"
    }'

After creating the training job, you can verify that it was properly assigned to the training plan by calling the DescribeTrainingJob API:

aws sagemaker describe-training-job --training-job-name training-job-name

Run a training job on a training plan using the SageMaker Python SDK
The following example demonstrates how to create a SageMaker training job using the SageMaker Python SDK's Training estimator, and how to associate the job with a provided training plan by setting the training_plan attribute on the estimator.
For more information on the SageMaker estimator, see Use a SageMaker estimator to run a training job.
Make sure your SageMaker Python SDK is updated to the latest version.

# Create the estimator
from sagemaker.estimator import Estimator

estimator = Estimator(
    entry_point="train.py",
    image_uri="123456789123.dkr.ecr.{}.amazonaws.com/image:tag",
    role=role,
    instance_count=4,
    instance_type="ml.p5.48xlarge",
    training_plan="Enter training plan arn",
    # ... additional estimator arguments ...
)

# Run the training job
estimator.fit(inputs=trainingInput, job_name=job_name)

After creating the training job, you can verify that it was properly assigned to the training plan by calling the DescribeTrainingJob API:

# Check job details
sagemaker_session.describe_training_job(TrainingJobName=job_name)

Clean up
To clean up your resources to avoid incurring more charges, complete the following steps:

Delete the SageMaker HyperPod cluster and associated resources such as storage, VPC, and IAM roles.

If using SLURM, refer to Cleanup.
If using Amazon EKS, refer to Cleanup.

Delete any S3 buckets created.
Make sure that the training plan you created is used and completes its fulfillment lifecycle.

Conclusion
SageMaker training plans represent a significant leap forward in addressing the compute capacity challenges faced by organizations working with LLMs. By providing quick access to high-performance GPU resources, they streamline the process of model training and fine-tuning. This solution not only reduces wait times for cluster provisioning, but also offers flexibility in choosing between SageMaker training jobs and SageMaker HyperPod, catering to diverse organizational needs. Ultimately, SageMaker training plans empower businesses to overcome resource constraints and accelerate their AI initiatives, leading to more efficient and effective use of advanced language models across various industries.
To get started with a SageMaker training plan and explore its capabilities for your specific LLM training needs, refer to Reserve capacity with training plans and try out the step-by-step implementation guide provided in this post.
Special thanks to Fei Ge, Oscar Hsu, Takuma Yoshitani, and Yiting Li for their support in the launch of this post.

About the Authors
Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML Training and Inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.
Kanwaljit Khurmi is an AI/ML Principal Solutions Architect at Amazon Web Services. He works with AWS product teams, engineering, and customers to provide guidance and technical assistance for improving the value of their hybrid ML solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.
Sean Smith is a Sr Specialist Solution Architect at AWS for HPC and generative AI. Prior to that, Sean worked as a Software Engineer on AWS Batch and CfnCluster, becoming the first engineer on the team that created AWS ParallelCluster.
Ty Bergstrom is a Software Engineer at Amazon Web Services. He works on the HyperPod Clusters platform for Amazon SageMaker.

BFCM 2024 Results: 61 Million Klaviyo Signals, 30 Million Emails & …

BFCM isn’t just a few big sales days anymore. It’s a whole week of pre-party madness that kicks off the holiday shopping season. 

And for our customers? 

They were already seeing massive wins even before the official Black Friday buzzer! 

Spoiler alert: If you weren’t sending pre-Black Friday emails, you missed out. But no worries, we’re here to help you crush it next year.

Let’s take a look at some of the BFCM trends we saw.

Big Picture: Ecommerce Is On Fire

Before we go deep into Customers.ai data, let’s zoom out for a sec. The whole ecommerce industry is on a hot streak this year!

Black Friday alone raked in $74.4 billion globally (up 5% from last year) with Shopify accounting for a record $11.5 billion. 

Then there was Cyber Monday, where shoppers dropped a jaw-dropping $13.3 billion – a 7.3% increase from last year. 

And it doesn’t appear to be slowing down!

Since November 1, shoppers have already spent $131.5 billion, which is a 9% jump from 2023. 

By the time the dust settles, total spending is expected to hit a whopping $240.8 billion. Talk about a holiday season boom.

People were shopping and they were shopping big. This isn’t just a trend. It’s a full-on movement!

Pre-Black Friday Vibes: Traffic Was Already Spiking

All right, let’s get into what we saw here at Customers.ai. We’re talking big numbers here! 

Even before the weekend sales frenzy, brands using Customers.ai were already raking in the traffic. 

We saw 20-25% of our clients experiencing some serious traffic spikes thanks to their pre-Black Friday emails.

We also saw some other interesting pre-Black Friday data: 

Best Sunday in Months: Yeah, you read that right. Sunday before Black Friday? 75% of our customers had their best Sunday in the last six months. Talk about setting the stage for a big week.

Conversions Up: Those pre-holiday deals? They were hitting. Conversion rates were up 15-20% from the last six months. Basically, your customers were ready to drop cash early.

Return Visitors? Already back. Among brands that saw a 20-25% traffic spike in the week leading up to BFCM, 10-15% of those visitors were already returning for a second look. That’s some serious staying power.

The Customers.ai BFCM Scorecard is Here!

We are psyched to say this was the best BFCM we’ve seen in company history. Our customers weren’t just sitting back watching the action – they were leading it. 

Let’s look at a few of the highlights:

12.5 Million Emails Captured

Our customers didn’t just send emails, they built serious lists, capturing 12.5 million new emails for future campaigns.

30 Million Emails Sent

No one was holding back. There were 30 million emails sent during BFCM, each one pushing customers closer to the buy button.

$2 Million in Post-Email Revenue

What happens when Customers.ai identifies a visitor and triggers them into an email flow? You make money. $2 million in revenue was captured from Customers.ai triggered emails. That’s a pretty good ROI right there.

61 Million Signals Sent to Klaviyo

Data is power and with 61 million signals sent to Klaviyo, our customers were able to supercharge their campaigns with laser-targeted segmentation and capitalize on both new and returning visitors. 

90 Million Events Sent to Facebook

Facebook ads? Optimized. We saw 90 million new events tracked and sent straight to the platform. That’s precision marketing right there.

8,000 Purchases Made from CAI Emails

All that email magic turned into cold hard conversions – with almost 8,000 purchases. Not too shabby.

Why It All Matters for You

Our customers crushed it and it’s because they were able to capture all of those extra people with Customers.ai.

These brands weren’t just doing the status quo. They were using smarter tools. We’re talking visitor identification, email segmentation, Facebook Super CAPI, and Signal for Klaviyo. 

For our customers, this holiday season was about more than just blasting emails. It was about using real-time insights from our tools to reach the right customers, with the right message, at exactly the right moment. 

The result? Numbers that don’t lie.

Can’t Stop the BFCM Feeling!

Okay, so we’ve shown you what’s possible. The good news? The best is still ahead! 

As we roll into the final stretch of 2024, these results set the stage for even more success in 2025. 

It’s time to double down on what worked, optimize what didn’t, and keep that customer love going strong. Keep your emails fresh, your data tight, and your strategy on point.

Ready to make next year even bigger? 

Start your free trial of Customers.ai today and make some big bucks before the holiday season is over. We’re even giving you 500 free contacts!


Important Next Steps

See what targeted outbound marketing is all about. Capture and engage your first 500 website visitor leads with Customers.ai X-Ray website visitor identification for free.

Talk and learn about sales outreach automation with other growth enthusiasts. Join Customers.ai Island, our Facebook group of 40K marketers and entrepreneurs who are ready to support you.

Advance your marketing performance with Sales Outreach School, a free tutorial and training area for sales pros and marketers.
