Tsinghua University Researchers Released the GLM-Edge Series: A Family of AI Models Ranging from 1.5B to 5B Parameters Designed Specifically for Edge Devices

The rapid development of artificial intelligence (AI) has produced models with powerful capabilities, such as language understanding and vision processing. However, deploying these models on edge devices remains challenging due to limitations in computational power, memory, and energy efficiency. The need for lightweight models that can run effectively on edge devices, while still delivering competitive performance, is growing as AI use cases extend beyond the cloud into everyday devices. Traditional large models are often resource-intensive, making them impractical for smaller devices and creating a gap in edge computing. Researchers have been seeking effective ways to bring AI to edge environments without significantly compromising model quality and efficiency.

Tsinghua University researchers recently released the GLM-Edge series, a family of models ranging from 1.5 billion to 5 billion parameters designed specifically for edge devices. The GLM-Edge models offer a combination of language processing and vision capabilities, emphasizing efficiency and accessibility without sacrificing performance. This series includes models that cater to both conversational AI and vision applications, designed to address the limitations of resource-constrained devices.

GLM-Edge includes multiple variants optimized for different tasks and device capabilities, providing a scalable solution for various use cases. The series is based on General Language Model (GLM) technology, extending its performance and modularity to edge scenarios. As AI-powered IoT devices and edge applications continue to grow in popularity, GLM-Edge helps bridge the gap between computationally intensive AI and the limitations of edge devices.

Technical Details

The GLM-Edge series builds upon the structure of GLM, optimized with quantization techniques and architectural changes that make them suitable for edge deployments. The models have been trained using a combination of knowledge distillation and pruning, which allows for a significant reduction in model size while maintaining high accuracy levels. Specifically, the models leverage 8-bit and even 4-bit quantization to reduce memory and computational demands, making them feasible for small devices with limited resources.
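As an illustration of how such quantization is typically applied in practice, the following is a minimal sketch of loading a GLM-Edge chat model in 4-bit precision with Hugging Face transformers and bitsandbytes. The repository name THUDM/glm-edge-1.5b-chat and the need for trust_remote_code are assumptions, so check the official model cards before relying on them.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "THUDM/glm-edge-1.5b-chat"  # assumed Hugging Face repository name

# 4-bit weights keep the memory footprint within edge-class budgets
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Summarize the benefits of on-device AI in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))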

The GLM-Edge series has two primary focus areas: conversational AI and visual tasks. The language models are capable of carrying out complex dialogues with reduced latency, while the vision models support various computer vision tasks, such as object detection and image captioning, in real-time. A notable advantage of GLM-Edge is its modularity—it can combine language and vision capabilities into a single model, offering a solution for multi-modal applications. The practical benefits of GLM-Edge include efficient energy consumption, reduced latency, and the ability to run AI-powered applications directly on mobile devices, smart cameras, and embedded systems.

The significance of GLM-Edge lies in its ability to make sophisticated AI capabilities accessible to a wider range of devices beyond powerful cloud servers. By reducing the dependency on external computational power, the GLM-Edge models allow for AI applications that are both cost-effective and privacy-friendly, as data can be processed locally on the device without needing to be sent to the cloud. This is particularly relevant for applications where privacy, low latency, and offline operation are important factors.

The results from GLM-Edge’s evaluation demonstrate strong performance despite the reduced parameter count. For example, the GLM-Edge-1.5B achieved comparable results to much larger transformer models when tested on general NLP and vision benchmarks, highlighting the efficiency gains through careful design optimizations. The series also showcased strong performance in edge-relevant tasks, such as keyword spotting and real-time video analysis, offering a balance between model size, latency, and accuracy.

https://github.com/THUDM/GLM-Edge/blob/main/README_en.md

Conclusion

Tsinghua University’s GLM-Edge series represents an advancement in the field of edge AI, addressing the challenges of resource-limited devices. By providing models that blend efficiency with conversational and visual capabilities, GLM-Edge enables new edge AI applications that are practical and effective. These models help bring the vision of ubiquitous AI closer to reality, allowing AI computations to happen on-device and making it possible to deliver faster, more secure, and cost-effective AI solutions. As AI adoption continues to expand, the GLM-Edge series stands out as an effort that addresses the unique challenges of edge computing, providing a promising path forward for AI in the real world.

Check out the GitHub Page and Models on Hugging Face. All credit for this research goes to the researchers of this project.


Microsoft Researchers Present a Novel Implementation of MH-MoE: Achieving FLOPs and Parameter Parity with Sparse Mixture-of-Experts Models

Machine learning is advancing rapidly, particularly in areas requiring extensive data processing, such as natural language understanding and generative AI. Researchers are constantly striving to design algorithms that maximize computational efficiency while improving the accuracy and performance of large-scale models. These efforts are critical for building systems capable of managing the complexities of language representation, where precision and resource optimization are key.

One persistent challenge in this field is balancing computational efficiency with model accuracy, especially as neural networks scale to handle increasingly complex tasks. Sparse Mixture-of-Experts (SMoE) architectures have shown promise by using dynamic parameter selection to improve performance. However, these models often struggle to process multi-representation spaces effectively, which limits their ability to fully exploit the available data. This inefficiency has created demand for methods that can leverage diverse representation spaces without additional computational cost.

SMoE architectures traditionally use gating mechanisms to route tokens to specific experts, optimizing the use of computational resources. These models have succeeded in various applications, particularly through top-1 and top-2 gating methods. However, while these methods excel at parameter efficiency, they cannot harness the full potential of multi-representational data. Furthermore, the standard approach of embedding sparse layers within a Transformer framework limits their capacity to scale effectively while maintaining operational efficiency.

Researchers from Microsoft have presented a novel implementation of the MH-MoE framework. This design builds on the foundations of SMoE while addressing its limitations. The MH-MoE implementation allows for the efficient processing of diverse representation spaces by introducing a multi-head mechanism and integrating projection layers. This approach ensures that the computational and parameter efficiency of traditional SMoE models is preserved while significantly enhancing their representational capacity.

The methodology behind MH-MoE is centered on enhancing the information flow through a refined multi-head mechanism. Input tokens are split into sub-tokens, routed to distinct heads, and then processed in parallel. This process is facilitated by linear projection layers that transform the tokens before and after passing through the mixture-of-experts layer. By adjusting the intermediate dimensions and optimizing the gating mechanism, the model ensures FLOPs parity with traditional SMoE models. In one configuration, the researchers used two heads with an intermediate dimension of 768 and top-2 gating, increasing the number of experts to 40. Another configuration employed three heads with an intermediate dimension of 512, utilizing top-3 gating and 96 experts. These adjustments illustrate the adaptability of MH-MoE in aligning its computational efficiency with performance goals.
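To make that data flow concrete, here is a minimal PyTorch sketch of the general multi-head mixture-of-experts idea described above: project tokens, split them into per-head sub-tokens, route each sub-token with top-k gating, then merge the heads back. It is an illustrative toy, not Microsoft’s implementation, and the dense expert loop trades efficiency for readability.

import torch
import torch.nn as nn


class MHMoELayer(nn.Module):
    """Toy multi-head MoE layer. Defaults loosely mirror the two-head configuration
    mentioned above (intermediate dimension 768, top-2 gating, 40 experts)."""

    def __init__(self, d_model=768, n_heads=2, d_inter=768, n_experts=40, top_k=2):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.top_k = n_heads, top_k
        self.d_head = d_model // n_heads
        self.head_proj = nn.Linear(d_model, d_model)   # "head" projection before routing
        self.merge_proj = nn.Linear(d_model, d_model)  # "merge" projection after the experts
        self.router = nn.Linear(self.d_head, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(self.d_head, d_inter), nn.GELU(), nn.Linear(d_inter, self.d_head))
            for _ in range(n_experts)
        )

    def forward(self, x):                                    # x: (batch, seq, d_model)
        b, s, d = x.shape
        sub = self.head_proj(x).view(b * s * self.n_heads, self.d_head)  # sub-tokens, one per head
        weights, idx = self.router(sub).softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)                # renormalize top-k gates
        out = torch.zeros_like(sub)
        for e, expert in enumerate(self.experts):            # dense loop, kept simple for clarity
            for k in range(self.top_k):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(sub[mask])
        return self.merge_proj(out.view(b, s, d))            # merge sub-tokens back into tokens


layer = MHMoELayer()
print(layer(torch.randn(2, 16, 768)).shape)                  # torch.Size([2, 16, 768])

A production implementation would vectorize the routing and add load-balancing terms, but the shapes above show how FLOPs parity can be kept by shrinking the per-head dimension as the number of heads grows.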

Experiments demonstrated that MH-MoE consistently outperformed existing SMoE models across various benchmarks. In language modeling tasks, the model achieved significant improvements in perplexity, a measure of model accuracy. For example, after 100,000 training steps, the three-head MH-MoE achieved a perplexity of 10.51 on the RedPajama dataset compared to 10.74 for fine-grained SMoE and 10.90 for standard SMoE. On the Wiki dataset, the three-head MH-MoE achieved a perplexity of 9.18, further underscoring its superior performance. Further, in experiments involving 1-bit quantization using BitNet, MH-MoE maintained its performance advantage, achieving a perplexity of 26.47 after 100,000 steps on the RedPajama dataset compared to 26.68 for fine-grained SMoE and 26.78 for standard SMoE.

Ablation studies conducted by the research team highlighted the importance of the head and merge layers in MH-MoE’s design. These studies demonstrated that both components contribute significantly to model performance, with the head layer offering a more substantial improvement than the merge layer. For example, adding the head layer reduced perplexity on the RedPajama dataset from 11.97 to 11.74. These findings emphasize the critical role of these layers in enhancing the model’s ability to integrate and utilize multi-representational data.

The researchers’ efforts have resulted in a model that addresses key limitations of traditional SMoE frameworks while setting a new benchmark for performance and efficiency. MH-MoE offers a robust solution for effectively scaling neural networks by leveraging multi-head mechanisms and optimizing computational design. This innovation marks a significant step in developing efficient and powerful machine-learning models.

Check out the Paper. All credit for this research goes to the researchers of this project.


Andrew Ng’s Team Releases ‘aisuite’: A New Open Source Python Library for Generative AI

Generative AI (Gen AI) is transforming the landscape of artificial intelligence, opening up new opportunities for creativity, problem-solving, and automation. Despite its potential, several challenges arise for developers and businesses when implementing Gen AI solutions. One of the most prominent issues is the lack of interoperability between different large language models (LLMs) from multiple providers. Each model has unique APIs, configurations, and specific requirements, making it difficult for developers to switch between providers or use different models in the same application. This fragmented landscape often leads to increased complexity, extended development time, and challenges for engineers aiming to create effective Gen AI applications.

Andrew Ng’s team has released a new open source Python library for Gen AI called aisuite. This library aims to address the issue of interoperability and simplify the process of building applications that utilize large language models from different providers. With aisuite, developers can switch between models from OpenAI, Anthropic, Ollama, and others by changing a single string in their code. The library introduces a standard interface that allows users to choose a “provider:model” combination, such as “openai:gpt-4o,” “anthropic:claude-3-5-sonnet-20241022,” or “ollama:llama3.1:8b,” enabling an easy switch between different language models without needing to rewrite significant parts of the code.
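Based on the interface described above, a call might look like the following sketch (it assumes the relevant provider API keys are already set as environment variables):

import aisuite as ai

client = ai.Client()

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain LoRA fine-tuning in one sentence."},
]

# Switching providers is just a change of the "provider:model" string
for model in ["openai:gpt-4o", "anthropic:claude-3-5-sonnet-20241022"]:
    response = client.chat.completions.create(model=model, messages=messages)
    print(model, "->", response.choices[0].message.content)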

Technical Details

From a technical perspective, aisuite offers a straightforward interface for managing various LLMs, making it a useful tool for developers. By abstracting the complexities associated with multiple APIs, it provides a unified framework that handles different types of requests and responses. Developers can leverage aisuite to integrate multiple models into their applications with just a few lines of code. This results in lower barriers to entry for Gen AI projects, quicker prototyping, and faster deployment. Another key feature is its extensibility—aisuite allows developers to add new models and providers as they emerge in the market, ensuring that applications can stay up to date with the latest AI capabilities.

The significance of aisuite lies in its ability to streamline the development process, saving time and reducing costs. For teams that need flexibility, aisuite’s capability to switch between models based on specific tasks and requirements provides a valuable tool for optimizing performance. For instance, developers might use OpenAI’s GPT-4 for creative content generation but switch to a specialized model from Anthropic for more constrained, factual outputs. Early benchmarks and community feedback indicate that using aisuite can reduce integration time for multi-model applications, highlighting its impact on improving developer efficiency and productivity.

Conclusion

Aisuite represents a meaningful advancement for the Gen AI community, simplifying the complexities involved in using large language models from different providers. By providing a unified interface, Andrew Ng’s team has lowered the barriers for integrating advanced AI capabilities into applications, making it easier for developers to experiment and build. As the Gen AI ecosystem continues to expand, tools like aisuite will play an important role in driving accessibility and adoption, enabling more individuals and organizations to leverage the power of AI without being constrained by the technical hurdles that have traditionally accompanied model integration.

Check out the GitHub Page. All credit for this research goes to the researchers of this project.


Easily deploy and manage hundreds of LoRA adapters with SageMaker efficient multi-adapter inference

The new efficient multi-adapter inference feature of Amazon SageMaker unlocks exciting possibilities for customers using fine-tuned models. This capability integrates with SageMaker inference components to allow you to deploy and manage hundreds of fine-tuned Low-Rank Adaptation (LoRA) adapters through SageMaker APIs. Multi-adapter inference handles the registration of fine-tuned adapters with a base model and dynamically loads them from GPU memory, CPU memory, or local disk in milliseconds, based on the request. This feature provides atomic operations for adding, deleting, or updating individual adapters across a SageMaker endpoint’s running instances without affecting performance or requiring a redeployment of the endpoint.
The efficiency of LoRA adapters allows for a wide range of hyper-personalization and task-based customization which had previously been too resource-intensive and costly to be feasible. For example, marketing and software as a service (SaaS) companies can personalize artificial intelligence and machine learning (AI/ML) applications using each of their customer’s images, art style, communication style, and documents to create campaigns and artifacts that represent them. Similarly, enterprises in industries like healthcare or financial services can reuse a common base model with task-based adapters to efficiently tackle a variety of specialized AI tasks. Whether it’s diagnosing medical conditions, assessing loan applications, understanding complex documents, or detecting financial fraud, you can simply swap in the appropriate fine-tuned LoRA adapter for each use case at runtime. This flexibility and efficiency unlocks new opportunities to deploy powerful, customized AI across your organization. With this new efficient multi-adapter inference capability, SageMaker reduces the complexity of deploying and managing the adapters that power these applications.
In this post, we show how to use the new efficient multi-adapter inference feature in SageMaker.
Problem statement
You can use powerful pre-trained foundation models (FMs) without needing to build your own complex models from scratch. However, these general-purpose models might not always align with your specific needs or your unique data. To make these models work for you, you can use Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA.
The benefit of PEFT and LoRA is that it lets you fine-tune models quickly and cost-effectively. These methods are based on the idea that only a small part of a large FM needs updating to adapt it to new tasks or domains. By freezing the base model and just updating a few extra adapter layers, you can fine-tune models much faster and cheaper, while still maintaining high performance. This flexibility means you can quickly customize pre-trained models at low cost to meet different requirements. When inferencing, the LoRA adapters can be loaded dynamically at runtime to augment the results from the base model for best performance. You can create a library of task-specific, customer-specific, or domain-specific adapters that can be swapped in as needed for maximum efficiency. This allows you to build AI tailored exactly to your business.
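As a rough illustration of what “a few extra adapter layers” means in code (not taken from this post), the Hugging Face peft library attaches small LoRA matrices to selected projection layers and freezes everything else:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,                                 # adapter rank; controls adapter size
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections that receive adapters (illustrative choice)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # typically well under 1% of the base model's parameters
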
Although fine-tuned LoRA adapters can effectively address targeted use cases, managing these adapters can be challenging at scale. You can use open-source libraries, or the AWS managed Large Model Inference (LMI) deep learning container (DLC) to dynamically load and unload adapter weights. Current deployment methods use fixed adapters or Amazon Simple Storage Service (Amazon S3) locations, making post-deployment changes impossible without updating the model endpoint and adding unnecessary complexity. This deployment method also makes it impossible to collect per-adapter metrics, making the evaluation of their health and performance a challenge.
Solution overview
In this solution, we show how to use efficient multi-adapter inference in SageMaker to host and manage multiple LoRA adapters with a common base model. The approach is based on an existing SageMaker capability, inference components, where you can have multiple containers or models on the same endpoint and allocate a certain amount of compute to each container. With inference components, you can create and scale multiple copies of the model, each of which retains the compute that you have allocated. With inference components, deploying multiple models that have specific hardware requirements becomes a much simpler process, allowing for the scaling and hosting of multiple FMs. An example deployment would look like the following figure.

This feature extends inference components to a new type of component, inference component adapters, which you can use to allow SageMaker to manage your individual LoRA adapters at scale while having a common inference component for the base model that you’re deploying. In this post, we show how to create, update, and delete inference component adapters and how to call them for inference. You can envision this architecture as the following figure.

Prerequisites
To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created. For details, refer to Create an AWS account.
If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain. Additionally, you may need to request a service quota increase for the corresponding SageMaker hosting instances. In this example, you host the base model and multiple adapters on the same SageMaker endpoint, so you will use an ml.g5.12xlarge SageMaker hosting instance.
In this example, you learn how to deploy a base model (Meta Llama 3.1 8B Instruct) and LoRA adapters on a SageMaker real-time endpoint using inference components. You can find the example notebook in the GitHub repository.

import sagemaker
import boto3
import json

role = sagemaker.get_execution_role() # execution role for the endpoint
sess = sagemaker.session.Session() # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket() # bucket to house artifacts
region = sess._region_name

sm_client = boto3.client(service_name='sagemaker')
sm_rt_client = boto3.client(service_name='sagemaker-runtime')

Download the base model from the Hugging Face model hub. Because Meta Llama 3.1 8B Instruct is a gated model, you will need a Hugging Face access token and to submit a request for model access on the model page. For more details, see Accessing Private/Gated Models.

from huggingface_hub import snapshot_download

model_name = sagemaker.utils.name_from_base("llama-3-1-8b-instruct")

HF_TOKEN = "<<YOUR_HF_TOKEN>>"
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model_id_pathsafe = model_id.replace("/", "-")
local_model_path = f"./models/{model_id_pathsafe}"
s3_model_path = f"s3://{bucket}/models/{model_id_pathsafe}"

snapshot_download(repo_id=model_id, use_auth_token=HF_TOKEN, local_dir=local_model_path, allow_patterns=["*.json", "*.safetensors"])

Copy your model artifact to Amazon S3 to improve model load time during deployment:
!aws s3 cp --recursive {local_model_path} {s3_model_path}
Select one of the available LMI container images for hosting. Efficient adapter inference capability is available in 0.31.0-lmi13.0.0 and higher.
inference_image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"
Create a container environment for the hosting container. LMI container parameters can be found in the LMI Backend User Guides.
The parameters OPTION_MAX_LORAS and OPTION_MAX_CPU_LORAS control how adapters move between GPU, CPU, and disk. OPTION_MAX_LORAS sets a limit on the number of adapters concurrently stored in GPU memory, with excess adapters offloaded to CPU memory.  OPTION_MAX_CPU_LORAS determines how many adapters are staged in CPU memory, offloading excess adapters to local SSD storage.
In the following example, 30 adapters can live in GPU memory and 70 adapters in CPU memory before going to local storage.

env = {
    "HF_MODEL_ID": f"{s3_model_path}",
    "OPTION_ROLLING_BATCH": "lmi-dist",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_MAX_LORAS": "30",
    "OPTION_MAX_CPU_LORAS": "70",
    "OPTION_DTYPE": "fp16",
    "OPTION_MAX_MODEL_LEN": "6000"
}

With your container image and environment defined, you can create a SageMaker model object that you will use to create an inference component later:

model_name = sagemaker.utils.name_from_base("llama-3-1-8b-instruct")

create_model_response = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = {
        "Image": inference_image_uri,
        "Environment": env,
    },
)

Set up a SageMaker endpoint
To create a SageMaker endpoint, you need an endpoint configuration. When using inference components, you don’t specify a model in the endpoint configuration. You load the model as a component later on.

endpoint_config_name = f"{model_name}"
variant_name = "AllTraffic"
instance_type = "ml.g5.12xlarge"
model_data_download_timeout_in_seconds = 900
container_startup_health_check_timeout_in_seconds = 900

initial_instance_count = 1

sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ExecutionRoleArn = role,
    ProductionVariants = [
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": initial_instance_count,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ]
)

Create the SageMaker endpoint with the following code:

endpoint_name = f"{model_name}-endpoint"  # any unique endpoint name (assumed; not defined in the original snippet)

create_endpoint_response = sm_client.create_endpoint(
    EndpointName = endpoint_name, EndpointConfigName = endpoint_config_name
)

With your endpoint created, you can now create the inference component for the base model. This will be the base component that the adapter components you create later will depend on.
A notable parameter here is ComputeResourceRequirements, a component-level configuration that determines the amount of resources the component needs (memory, vCPUs, accelerators). The adapters will share these resources with the base component.

base_inference_component_name = f"base-{model_name}"

variant_name = "AllTraffic"

initial_copy_count = 1
min_memory_required_in_mb = 32000
number_of_accelerator_devices_required = 4

sm_client.create_inference_component(
    InferenceComponentName = base_inference_component_name,
    EndpointName = endpoint_name,
    VariantName = variant_name,
    Specification={
        "ModelName": model_name,
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
        },
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": min_memory_required_in_mb,
            "NumberOfAcceleratorDevicesRequired": number_of_accelerator_devices_required,
        },
    },
    RuntimeConfig={
        "CopyCount": initial_copy_count,
    },
)

In this example, you create a single adapter, but you could host up to hundreds of them per endpoint. They need to be compressed and uploaded to Amazon S3.
The adapter package must contain the adapter files at the root of the archive, with no sub-folders.

For this example, an adapter was fine-tuned using QLoRA and Fully Sharded Data Parallel (FSDP) on the training split of the ECTSum dataset. Training took 21 minutes on an ml.p4d.24xlarge and cost approximately $13 using current on-demand pricing.
For each adapter you are going to deploy, you need to specify an InferenceComponentName, an ArtifactUrl with the S3 location of the adapter archive, and a BaseInferenceComponentName to create the connection between the base model inference component and the new adapter inference components. You repeat this process for each additional adapter.

adapter_ic1_name = f"adapter-ectsum-{base_inference_component_name}"
adapter_s3_uri = "<<S3_PATH_FOR_YOUR_ADAPTER>>"

sm_client.create_inference_component(
    InferenceComponentName = adapter_ic1_name,
    EndpointName = endpoint_name,
    Specification={
        "BaseInferenceComponentName": base_inference_component_name,
        "Container": {
            "ArtifactUrl": adapter_s3_uri
        },
    },
)

Use the deployed adapter
First, you build a prompt to invoke the model for earnings summarization, filling in the source text with a random item from the ECTSum dataset. Then you store the ground truth summary from the item for comparison later.

from datasets import load_dataset
dataset_name = "mrSoul7766/ECTSum"

test_dataset = load_dataset(dataset_name, trust_remote_code=True, split="test")

test_item = test_dataset.shuffle().select(range(1))[0]  # take a single random example as a dict

prompt = f"""
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You are an AI assistant trained to summarize earnings calls.
    Provide a concise summary of the call, capturing the key points and overall context.
    Focus on quarter over quarter revenue, earnings per share, changes in debt, highlighted risks, and growth opportunities.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    Summarize the following earnings call:

    {test_item["text"]}
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

ground_truth_response = test_item["summary"]

To test the base model, specify the EndpointName for the endpoint you created earlier and the name of the base inference component as InferenceComponentName, along with your prompt and other inference parameters in the Body parameter:

component_to_invoke = base_inference_component_name

response_model = sm_rt_client.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = component_to_invoke,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"max_new_tokens": 100, "temperature": 0.9}
        }
    ),
    ContentType = "application/json",
)

base_model_response = json.loads(response_model["Body"].read().decode("utf8"))["generated_text"]

To invoke the adapter, use the adapter inference component name in your invoke_endpoint call:

component_to_invoke = adapter_ic1_name

response_model = sm_rt_client.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = component_to_invoke,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"max_new_tokens": 100, "temperature": 0.9}
        }
    ),
    ContentType = "application/json",
)

adapter_response = json.loads(response_model["Body"].read().decode("utf8"))["generated_text"]

Compare outputs
Compare the outputs of the base model and adapter to ground truth. While the base model might appear subjectively better in this test, the adapter’s response is actually much closer to the ground truth, as the metrics in the next section show.

Ground Truth:

q3 non-gaap earnings per share $3.71.
q3 sales rose 15.4 percent to $747 million.
bio rad laboratories – now anticipates 2021 non-gaap currency-neutral revenue growth between 12 to 13 percent.
sees 2021 estimated non-gaap operating margin of about 19.5 percent.
qtrly gaap income per share $129.96.

———————————-

Base Model Response:

Here is a summary of the earnings call:

**Key Points:**

* Revenue: $747 million, up 15.4% year-over-year (13.8% on a currency-neutral basis)
* Earnings per share: $129.96, up from $3 per share in Q3 2020
* Gross margin: 58.6% on a GAAP basis, 57.9% on a non-GAAP basis
* Operating income: $156.8 million, up from $109.6 million in Q3 2020
* Net income: $3.928

———————————-

Adapter Model Response:

                Here is a concise summary of the call:

                q3 revenue $747.6 million versus refinitiv ibes estimate of $753.9 million.
q3 earnings per share $3.71.
sees fy earnings per share $11.85 to $12.05.
sees fy 2021 non-gaap revenue growth to be 12% to 13%.
sees fy 2021 non-gaap gross margin to be 57.5% to 57.8%.
sees fy 2021 non-gaap operating margin to be 19.5%.

To validate the true adapter performance, you can use a tool like fmeval to run an evaluation of summarization accuracy. This will calculate the METEOR, ROUGE, and BertScore metrics for the adapter vs. the base model. Doing so against the test split of ECTSum yields the following results.

The fine-tuned adapter shows a 59% increase in METEOR score, 159% increase in ROUGE score, and 8.6% increase in BertScore.
The following diagram shows the frequency distribution of scores for the different metrics, with the adapter consistently scoring better more often in all metrics.
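For reference, a per-sample evaluation with fmeval can be sketched as follows. The SummarizationAccuracy interface shown here reflects our understanding of the library and may differ from the exact code used to produce the results above.

from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy

eval_algo = SummarizationAccuracy()

# Compare one adapter response against the ground-truth summary
scores = eval_algo.evaluate_sample(
    target_output=ground_truth_response,
    model_output=adapter_response,
)

for score in scores:
    print(score.name, score.value)  # reported metrics include METEOR, ROUGE, and BERTScore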

We observed an end-to-end latency difference of up to 10% between invoking the base model and invoking the adapter in our tests. If the adapter is loaded from CPU memory or disk, the first load to GPU incurs an additional cold start delay. These values may vary depending on your container configuration and the instance type chosen.
Update an existing adapter
Because adapters are managed as inference components, you can update them on a running endpoint. SageMaker unloads and deregisters the old adapter, then loads and registers the new adapter onto every copy of the base inference component across all instances running for this endpoint. To update an adapter inference component, use the update_inference_component API and supply the existing inference component name and the Amazon S3 path to the new compressed adapter archive.
You can train a new adapter, or re-upload the existing adapter artifact to test this functionality.

update_inference_component_response = sm_client.update_inference_component(
    InferenceComponentName = adapter_ic1_name,
    Specification={
        "Container": {
            "ArtifactUrl": new_adapter_s3_uri
        },
    },
)

Remove adapters
If you need to delete an adapter, call the delete_inference_component API with the inference component name to remove it:

sess = sagemaker.session.Session()
sess.delete_inference_component(adapter_ic1_name, wait = True)

Deleting the base inference component automatically deletes it along with any associated adapter inference components:

sess.delete_inference_component(base_inference_component_name, wait = True)

Pricing
SageMaker multi-adapter inference is generally available in AWS Regions US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Jakarta, Mumbai, Seoul, Singapore, Sydney, Tokyo), Canada (Central), Europe (Frankfurt, Ireland, London, Stockholm), Middle East (UAE), and South America (São Paulo), and is available at no extra cost.
Conclusion
The new efficient multi-adapter inference feature in SageMaker opens up exciting possibilities for customers with fine-tuning use cases. By allowing the dynamic loading of fine-tuned LoRA adapters, you can quickly and cost-effectively customize AI models to your specific needs. This flexibility unlocks new opportunities to deploy powerful, customized AI across organizations in industries like marketing, healthcare, and finance. The ability to manage these adapters at scale through SageMaker inference components makes it effortless to build tailored generative AI solutions.

About the Authors
Dmitry Soldatkin is a Senior Machine Learning Solutions Architect at AWS, helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. He has a passion for continuous innovation and using data to drive business outcomes. Prior to joining AWS, Dmitry was an architect, developer, and technology leader in data analytics and machine learning fields in the financial services industry.
Giuseppe Zappia is a Principal AI/ML Specialist Solutions Architect at AWS, focused on helping large enterprises design and deploy ML solutions on AWS. He has over 20 years of experience as a full stack software engineer, and has spent the past 5 years at AWS focused on the field of machine learning.
Ram Vegiraju is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Improve the performance of your Generative AI applications with Prompt Optimization on Amazon Bedrock

Prompt engineering refers to the practice of writing instructions to get the desired responses from foundation models (FMs). You might have to spend months experimenting and iterating on your prompts, following the best practices for each model, to achieve your desired output. Furthermore, these prompts are specific to a model and task, and performance isn’t guaranteed when they are used with a different FM. This manual effort required for prompt engineering can slow down your ability to test different models.
Today, we are excited to announce the availability of Prompt Optimization on Amazon Bedrock. With this capability, you can now optimize your prompts for several use cases with a single API call or a click of a button on the Amazon Bedrock console.
In this post, we discuss how you can get started with this new feature using an example use case in addition to discussing some performance benchmarks.
Solution overview
At the time of writing, Prompt Optimization on Amazon Bedrock supports Anthropic’s Claude 3 Haiku, Claude 3 Sonnet, Claude 3 Opus, and Claude 3.5 Sonnet models; Meta’s Llama 3 70B and Llama 3.1 70B models; Mistral’s Large model; and Amazon’s Titan Text Premier model. Prompt Optimization can result in significant improvements for Generative AI tasks. We ran example performance benchmarks for several tasks and discuss them later in this post.
In the following sections, we demonstrate how to use the Prompt Optimization feature. For our use case, we want to optimize a prompt that looks at a call or chat transcript, and classifies the next best action.
Use automatic prompt optimization
To get started with this feature, complete the following steps:

On the Amazon Bedrock console, choose Prompt management in the navigation pane.
Choose Create prompt.
Enter a name and optional description for your prompt, then choose Create.

For User message, enter the prompt template that you want to optimize.

For example, we want to optimize a prompt that looks at a call or chat transcript and classifies the next best action as one of the following:

Wait for customer input
Assign agent
Escalate

The following screenshot shows what our prompt looks like in the prompt builder.

In the Configurations pane, for Generative AI resource, choose Models and choose your preferred model. For this example, we use Anthropic’s Claude 3.5 Sonnet.
Choose Optimize.

A pop-up appears that indicates that your prompt is being optimized.

When optimization is complete, you should see a side-by-side view of the original and the optimized prompt for your use case.

Add values to your test variables (in this case, transcript) and choose Run.

You can then see the output from the model in the desired format.

As we can see in this example, the prompt is more explicit, with clear instructions on how to process the original transcript provided as a variable. This results in the correct classification, in the required output format. Once a prompt has been optimized, it can be deployed into an application by creating a version, which takes a snapshot of its configuration. Multiple versions can be stored to enable switching between different use-case prompt configurations. See prompt management for more details on prompt version control and deployment.
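The same optimization can also be triggered programmatically. The following is a hedged sketch that assumes the OptimizePrompt operation on the bedrock-agent-runtime client and its streaming response shape; confirm the field names against the current API reference before relying on them.

import boto3

client = boto3.client("bedrock-agent-runtime")

prompt = "Classify the next best action for this transcript: {{transcript}}"

# Request and response field names below are assumptions based on our reading of the API;
# the target model ID is shown for illustration only.
response = client.optimize_prompt(
    input={"textPrompt": {"text": prompt}},
    targetModelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
)

# The optimized prompt is returned as an event stream
for event in response["optimizedPrompt"]:
    if "optimizedPromptEvent" in event:
        print(event["optimizedPromptEvent"]["optimizedPrompt"]["textPrompt"]["text"])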
Performance benchmarks
We ran the Prompt Optimization feature on several open source datasets. We are excited to share the improvements seen in a few important and common use cases that we see our customers working with:

Summarization (XSUM)
RAG-based dialog continuation (DSTC)
Function calling (GLAIVE)

To measure performance improvement with respect to the baseline prompts, we use ROUGE-2 F1 for the summarization use case, HELM-F1 for the dialog continuation use case, and HELM-F1 and JSON matching for function calling. We saw a performance improvement of 18% on the summarization use case, 8% on dialog completion, and 22% on function calling benchmarks. The following table contains the detailed results.

Use Case
Original Prompt
Optimized Prompt
Performance Improvement

Summarization
First, please read the article below. {context}  Now, can you write me an extremely short abstract for it?
<task> Your task is to provide a concise 1-2 sentence summary of the given text that captures the main points or key information. </task><context> {context} </context><instructions> Please read the provided text carefully and thoroughly to understand its content. Then, generate a brief summary in your own words that is much shorter than the original text while still preserving the core ideas and essential details. The summary should be concise yet informative, capturing the essence of the text in just 1-2 sentences. </instructions><result_format> Summary: [WRITE YOUR 1-2 SENTENCE SUMMARY HERE] </result_format>
18.04%

Function calling
Functions available: {available_functions} Examples of calling functions: Input: Functions: [{“name”: “calculate_area”, “description”: “Calculate the area of a shape”, “parameters”: {“type”: “object”, “properties”: {“shape”: {“type”: “string”, “description”: “The type of shape (e.g. rectangle, triangle, circle)”}, “dimensions”: {“type”: “object”, “properties”: {“length”: {“type”: “number”, “description”: “The length of the shape”}, “width”: {“type”: “number”, “description”: “The width of the shape”}, “base”: {“type”: “number”, “description”: “The base of the shape”}, “height”: {“type”: “number”, “description”: “The height of the shape”}, “radius”: {“type”: “number”, “description”: “The radius of the shape”}}}}, “required”: [“shape”, “dimensions”]}}] Conversation history: USER: Can you calculate the area of a rectangle with a length of 5 and width of 3? Output: {“name”: “calculate_area”, “arguments”: {“shape”: “rectangle”, “dimensions”: {“length”: 5, “width”: 3}}}Input: Functions: [{“name”: “search_books”, “description”: “Search for books based on title or author”, “parameters”: {“type”: “object”, “properties”: {“search_query”: {“type”: “string”, “description”: “The title or author to search for”}}, “required”: [“search_query”]}}] Conversation history: USER: I am looking for books by J.K. Rowling. Can you help me find them? Output: {“name”: “search_books”, “arguments”: {“search_query”: “J.K. Rowling”}}Input: Functions: [{“name”: “calculate_age”, “description”: “Calculate the age based on the birthdate”, “parameters”: {“type”: “object”, “properties”: {“birthdate”: {“type”: “string”, “format”: “date”, “description”: “The birthdate”}}, “required”: [“birthdate”]}}] Conversation history: USER: Hi, I was born on 1990-05-15. Can you tell me how old I am today? Output: {“name”: “calculate_age”, “arguments”: {“birthdate”: “1990-05-15”}} Current chat history: {conversation_history} Respond to the last message. Call a function if necessary.
Task: Respond to the user’s message in the given conversation by calling appropriate functions if necessary. Instructions: 1. Review the list of available functions: <available_functions> {available_functions} </available_functions> 2. Study the examples of how to call these functions: <fewshot_examples> <example> H: <context>Functions: [{“name”: “calculate_area”, “description”: “Calculate the area of a shape”, “parameters”: {“type”: “object”, “properties”: {“shape”: {“type”: “string”, “description”: “The type of shape (e.g. rectangle, triangle, circle)”}, “dimensions”: {“type”: “object”, “properties”: {“length”: {“type”: “number”, “description”: “The length of the shape”}, “width”: {“type”: “number”, “description”: “The width of the shape”}, “base”: {“type”: “number”, “description”: “The base of the shape”}, “height”: {“type”: “number”, “description”: “The height of the shape”}, “radius”: {“type”: “number”, “description”: “The radius of the shape”}}}}, “required”: [“shape”, “dimensions”]}}]</context> <question>USER: Can you calculate the area of a rectangle with a length of 5 and width of 3?</question> A: <output>{“name”: “calculate_area”, “arguments”: {“shape”: “rectangle”, “dimensions”: {“length”: 5, “width”: 3}}}</output> </example> <example> H: <context>Functions: [{“name”: “search_books”, “description”: “Search for books based on title or author”, “parameters”: {“type”: “object”, “properties”: {“search_query”: {“type”: “string”, “description”: “The title or author to search for”}}, “required”: [“search_query”]}}]</context> <question>USER: I am looking for books by J.K. Rowling. Can you help me find them?</question> A: <output>{“name”: “search_books”, “arguments”: {“search_query”: “J.K. Rowling”}}</output> </example> <example> H: <context>Functions: [{“name”: “calculate_age”, “description”: “Calculate the age based on the birthdate”, “parameters”: {“type”: “object”, “properties”: {“birthdate”: {“type”: “string”, “format”: “date”, “description”: “The birthdate”}}, “required”: [“birthdate”]}}]</context> <question>USER: Hi, I was born on 1990-05-15. Can you tell me how old I am today?</question> A: <output>{“name”: “calculate_age”, “arguments”: {“birthdate”: “1990-05-15”}}</output> </example> </fewshot_examples> 3. Carefully read the current conversation history: <conversation_history> {conversation_history} </conversation_history> 4. Analyze the last message from the user and determine if any of the available functions need to be called to provide an appropriate response. 5. If a function call is necessary, follow the format demonstrated in the examples to invoke the relevant function with the required arguments. 6. If no function call is needed, provide a direct response to the user’s message. 7. Your response should be concise, relevant, and tailored to the specific context of the conversation. 8. Enclose your final response in <response></response> tags, without any additional preamble or explanation. Provide your response immediately after these instructions, following the specified format.
22.03%

Dialog continuation
## Retrieved knowledge snippets Domain: Restaurant Entity: THE COPPER KETTLE Review: My friend Mark took me to the copper kettle to celebrate my promotion. I decided to treat myself to Shepherds Pie. It was not as flavorful as I’d have liked and the consistency was just runny, but the servers were awesome and I enjoyed the view from the patio. I may come back to try the strawberries and cream come time for Wimbledon.. Highlight: It was not as flavorful as I’d have liked and the consistency was just runny, but the servers were awesome and I enjoyed the view from the patio. Domain: Restaurant Entity: THE COPPER KETTLE Review: Last week, my colleagues and I visited THE COPPER KETTLE that serves British cuisine. We enjoyed a nice view from inside of the restaurant. The atmosphere was enjoyable and the restaurant was located in a nice area. However, the food was mediocre and was served in small portions. Highlight: We enjoyed a nice view from inside of the restaurant. ## Dialog user: Howdy, I’m looking for a British restaurant for breakfast. agent: There are several British restaurants available. Would you prefer a moderate or expensive price range? user: Moderate price range please. agent: Five restaurants match your criteria. Four are in Centre area and one is in the West. Which area would you prefer? user: I would like the Center of town please. agent: How about The Copper Kettle? user: Do they offer a good view? agent: Given the above information from a retrieval augmented generation (RAG) system, respond with the appropriate answer to the question from the user.
<task_description> You are an advanced question-answering system that utilizes information from a retrieval augmented generation (RAG) system to provide accurate and relevant responses to user queries. </task_description><instructions> 1. Carefully review the provided context information: <context> Domain: Restaurant Entity: THE COPPER KETTLE Review: My friend Mark took me to the copper kettle to celebrate my promotion. I decided to treat myself to Shepherds Pie. It was not as flavorful as I’d have liked and the consistency was just runny, but the servers were awesome and I enjoyed the view from the patio. I may come back to try the strawberries and cream come time for Wimbledon.. Highlight: It was not as flavorful as I’d have liked and the consistency was just runny, but the servers were awesome and I enjoyed the view from the patio.Domain: Restaurant Entity: THE COPPER KETTLE Review: Last week, my colleagues and I visited THE COPPER KETTLE that serves British cuisine. We enjoyed a nice view from inside of the restaurant. The atmosphere was enjoyable and the restaurant was located in a nice area. However, the food was mediocre and was served in small portions. Highlight: We enjoyed a nice view from inside of the restaurant. </context>2. Analyze the user’s question: <question> user: Howdy, I’m looking for a British restaurant for breakfast.agent: There are several British restaurants available. Would you prefer a moderate or expensive price range?user: Moderate price range please.agent: Five restaurants match your criteria. Four are in Centre area and one is in the West. Which area would you prefer?user: I would like the Center of town please.agent: How about The Copper Kettle?user: Do they offer a good view? agent: </question> 3. Leverage the context information and your knowledge to generate a concise and accurate answer to the user’s question. 4. Ensure your response directly addresses the specific query while incorporating relevant details from the context. 5. Provide your answer in a clear and easy-to-understand manner, without any unnecessary preamble or explanation. </instructions> <output_format> Answer: [Insert your concise answer here] </output_format> <example> Context: The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower. Constructed from 1887 to 1889 as the centerpiece of the 1889 World’s Fair, it was initially criticized by some of France’s leading artists and intellectuals for its design, but it has become a global cultural icon of France and one of the most recognizable structures in the world. Question: What is the Eiffel Tower? Answer: The Eiffel Tower is a wrought-iron lattice tower in Paris, France, named after its designer Gustave Eiffel, and constructed as the centerpiece of the 1889 World’s Fair. </example>
8.23%

The consistent improvements across different tasks highlight the robustness and effectiveness of Prompt Optimization in enhancing prompt performance for various natural language processing (NLP) tasks. This shows that Prompt Optimization can save you considerable time and effort while achieving better outcomes, because it lets you test models with optimized prompts that implement the best practices for each model.
Conclusion
Prompt Optimization on Amazon Bedrock empowers you to effortlessly enhance your prompt’s performance across a wide range of use cases with just a single API call or a few clicks on the Amazon Bedrock console. The substantial improvements demonstrated on open-source benchmarks for tasks like summarization, dialog continuation, and function calling underscore this new feature’s capability to streamline the prompt engineering process significantly. Prompt Optimization on Amazon Bedrock enables you to easily test many different models for your generative AI application, following the best prompt engineering practices for each model. The reduced manual effort will greatly accelerate the development of generative AI applications in your organization.
We encourage you to try out Prompt Optimization with your own use cases and reach out to us for feedback and collaboration.

About the Authors
Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.
Chris Pecora is a Generative AI Data Scientist at Amazon Web Services. He is passionate about building innovative products and solutions while also focusing on customer-obsessed science. When not running experiments and keeping up with the latest developments in generative AI, he loves spending time with his kids.
Zhengyuan Shen is an Applied Scientist at Amazon Bedrock, specializing in foundational models and ML modeling for complex tasks including natural language and structured data understanding. He is passionate about leveraging innovative ML solutions to enhance products or services, thereby simplifying the lives of customers through a seamless blend of science and engineering. Outside work, he enjoys sports and cooking.
Shipra Kanoria is a Principal Product Manager at AWS. She is passionate about helping customers solve their most complex problems with the power of machine learning and artificial intelligence. Before joining AWS, Shipra spent over 4 years at Amazon Alexa, where she launched many productivity-related features on the Alexa voice assistant.

Search enterprise data assets using LLMs backed by knowledge graphs

Enterprises are facing challenges in accessing their data assets scattered across various sources because of the increasing complexity of managing vast amounts of data. Traditional search methods often fail to provide comprehensive and contextual results, particularly for unstructured data or complex queries.
Search solutions in modern big data management must facilitate efficient and accurate search of enterprise data assets that can adapt to the arrival of new assets. Customers want to search through all of the data and applications across their organization, and they want to see the provenance information for all of the documents retrieved. The application needs to search through the catalog and show the metadata information related to all of the data assets that are relevant to the search context. To accomplish all of these goals, the solution should include the following features:

Provide connections between related entities and data sources
Consolidate fragmented data cataloging systems that contain metadata
Provide reasoning behind the search outputs

In this post, we present a generative AI-powered semantic search solution that empowers business users to quickly and accurately find relevant data assets across various enterprise data sources. In this solution, we integrate large language models (LLMs) hosted on Amazon Bedrock with a knowledge base derived from a knowledge graph built on Amazon Neptune. The result is a search paradigm that lets natural language questions span documents stored in Amazon Simple Storage Service (Amazon S3), data lake tables hosted in the AWS Glue Data Catalog, and enterprise assets in Amazon DataZone.
Foundation models (FMs) on Amazon Bedrock provide powerful generative models for text and language tasks. However, FMs lack domain-specific knowledge and reasoning capabilities. Knowledge graphs available on Neptune provide a means to represent interconnected facts and entities with inferencing and reasoning abilities for domains. Equipping FMs with structured reasoning abilities using domain-specific knowledge graphs harnesses the best of both approaches. This allows FMs to retain their inductive abilities while grounding their language understanding and generation in well-structured domain knowledge and logical reasoning. In the context of enterprise data asset search powered by a metadata catalog hosted on services such as Amazon DataZone, AWS Glue, and other third-party catalogs, knowledge graphs can help integrate this linked data and also enable a scalable search paradigm that integrates metadata that evolves over time.
Solution overview
The solution integrates with your existing data catalogs and repositories, creating a unified, scalable semantic layer across the entire data landscape. When users ask questions in plain English, the search is not just for keywords; it comprehends the query’s intent and context, relating it to relevant tables, documents, and datasets across your organization. This semantic understanding enables more accurate, contextual, and insightful search results, making the entire company’s data as accessible and simple to search as using a consumer search engine, but with the depth and specificity your business demands. This significantly enhances decision-making, efficiency, and innovation throughout your organization by unlocking the full potential of your data assets. The following video shows the sample working solution.

Using graph data processing and the integration of natural language-based search on embedded graphs, these hybrid systems can unlock powerful insights from complex data structures.
The solution presented in this post consists of an ingestion pipeline and a search application UI that the user can submit queries to in natural language while searching for data assets.
The following diagram illustrates the end-to-end architecture, consisting of the metadata API layer, ingestion pipeline, embedding generation workflow, and frontend UI.

The ingestion pipeline (3) ingests metadata (1) from services (2), including Amazon DataZone, AWS Glue, and Amazon Athena, to a Neptune database after converting the JSON response from the service APIs into an RDF triple format. The RDF is converted into text and loaded into an S3 bucket, which is accessed by Amazon Bedrock (4) as the source of the knowledge base. You can extend this solution to include metadata from third-party cataloging solutions as well. The end-users access the application, which is hosted on Amazon CloudFront (5).
A state machine in AWS Step Functions defines the workflow of the ingestion process by invoking AWS Lambda functions, as illustrated in the following figure.

The functions perform the following actions:

Read metadata from services (Amazon DataZone, AWS Glue, and Athena) in JSON format. Convert the JSON metadata to JSON-LD format by adding context, and load the data into an Amazon Neptune Serverless database as RDF triples. The following is an example of RDF triples in N-Triples file format:

<arn:aws:glue:us-east-1:440577664410:table/default/market_sales_table#sales_qty_sold>
<http://www.w3.org/2000/01/rdf-schema#label> "sales_qty_sold" .
<arn:aws:glue:us-east-1:440577664410:table/sampleenv_pub_db/mkt_sls_table#disnt>
<http://www.w3.org/2000/01/rdf-schema#label> "disnt" .
<arn:aws:glue:us-east-1:440577664410:table/sampleenv_pub_db/mkt_sls_table>
<http://www.amazonaws.com/datacatalog/hasColumn>
<arn:aws:glue:us-east-1:440577664410:table/sampleenv_pub_db/mkt_sls_table#item_id> .
<arn:aws:glue:us-east-1:440577664410:table/sampledata_pub_db/raw_customer>
<http://www.w3.org/2000/01/rdf-schema#label> "raw_customer" .
For more details about RDF data format, refer to the W3C documentation.
Run SPARQL queries in the Neptune database to populate additional triples from inference rules. This step enriches the metadata by using the graph inferencing and reasoning capabilities. The following is a SPARQL query that inserts new metadata inferred from existing triples:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
INSERT
{
    ?asset <http://www.amazonaws.com/datacatalog/exists_in_aws_account> ?account
}
WHERE
{
    ?asset <http://www.amazonaws.com/datacatalog/isTypeOf> "GlueTableAssetType" .
    ?asset <http://www.amazonaws.com/datacatalog/catalogId> ?account .
}

Read triples from the Neptune database and convert them into text format using an LLM hosted on Amazon Bedrock. This solution uses Anthropic’s Claude 3 Haiku v1 for RDF-to-text conversion, storing the resulting text files in an S3 bucket.
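The following is a minimal sketch of how this RDF-to-text conversion could be implemented against the Amazon Bedrock runtime API. The prompt wording and batching are illustrative assumptions, not the exact logic used by the solution's Lambda function.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def triples_to_text(ntriples: str) -> str:
    """Ask Anthropic's Claude 3 Haiku to describe a batch of RDF triples in plain English."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": "Convert the following RDF triples describing data catalog assets "
                       "into concise natural-language sentences:\n\n" + ntriples,
        }],
    }
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]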

Amazon Bedrock Knowledge Bases is configured to use the preceding S3 bucket as a data source to create a knowledge base. Amazon Bedrock Knowledge Bases creates vector embeddings from the text files using the Amazon Titan Text Embeddings v2 model.
A Streamlit application is hosted in Amazon Elastic Container Service (Amazon ECS) as a task, which provides a chatbot UI for users to submit queries against the knowledge base in Amazon Bedrock.
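For reference, the following is a minimal sketch of how such a chatbot can query the knowledge base using the Amazon Bedrock RetrieveAndGenerate API; the knowledge base ID is a placeholder for the value created by the stack, and the model ARN should match a model you have enabled.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "How to query sales data?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<knowledge-base-id>",  # placeholder for the ID created by the stack
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
)
print(response["output"]["text"])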
Prerequisites
The following are prerequisites to deploy the solution:

An AWS account.
The following models must be enabled in Amazon Bedrock. For more information, see Add or remove access to Amazon Bedrock foundation models.

Anthropic’s Claude 3 Haiku v1 – For converting triples to text.
Anthropic’s Claude 3 Sonnet v1 – Used in the app for response generation.
Amazon Titan Text Embeddings v2 – For embedding the documents and storing them in the knowledge base.

An AWS Identity and Access Management (IAM) role that has the privileges to deploy the AWS CloudFormation stacks.
A user pool created in Amazon Cognito, with a user pool application client created for it. Select User name for the Cognito user pool sign-in options.

Capture the user pool ID and application client ID, which will be required while launching the CloudFormation stack for building the web application.
Create an Amazon Cognito user (for example, username=test_user) in your Amazon Cognito user pool; this user will be used to log in to the application. An email address must be included when creating the user.

Prepare the test data
A sample dataset is needed for testing the functionalities of the solution. In your AWS account, prepare a table using Amazon DataZone and Athena by completing Step 1 through Step 8 in Amazon DataZone QuickStart with AWS Glue data. This will create a table and capture its metadata in the Data Catalog and Amazon DataZone.
To test how the solution is combining metadata from different data catalogs, create another table only in the Data Catalog, not in Amazon DataZone. On the Athena console, open the query editor and run the following query to create a new table:

CREATE TABLE raw_customer AS SELECT 203 AS cust_id, 'John Doe' AS cust_name

Deploy the application
Complete the following steps to deploy the application:

To launch the CloudFormation template, choose Launch Stack or download the template file (yaml) and launch the CloudFormation stack in your AWS account.
Modify the stack name or leave as default, then choose Next.
In the Parameters section, input the Amazon Cognito user pool ID (CognitoUserPoolId) and application client ID (CognitoAppClientId). This is required for successful deployment of the stacks.
Review and update other AWS CloudFormation parameters if required. You can use the default values for all of the parameters and continue with the stack deployment. The following are the parameters for the CloudFormation template and their default values:

EnvironmentName – Unique name to distinguish different web applications in the same AWS account (min length 1 and max length 4). Default: dev
S3DataPrefixKB – S3 object prefix where the knowledge base source documents (metadata files) should be stored. Default: knowledge_base
Cpu – CPU configuration of the ECS task. Default: 512
Memory – Memory configuration of the ECS task. Default: 1024
ContainerPort – Port for the ECS task host and container. Default: 80
DesiredTaskCount – Desired number of ECS tasks. Default: 1
MinContainers – Minimum containers for auto scaling; should be less than or equal to DesiredTaskCount. Default: 1
MaxContainers – Maximum containers for auto scaling; should be greater than or equal to DesiredTaskCount. Default: 3
AutoScalingTargetValue – CPU utilization target percentage for ECS task auto scaling. Default: 80

Launch the stack.

The CloudFormation stack creates the required resources to launch the application by invoking a series of nested stacks. It deploys the following resources in your AWS account:

An S3 bucket to save metadata details from AWS Glue, Athena, and Amazon DataZone, and its corresponding text data
An additional S3 bucket to store code, artifacts, and logs related to the deployment
A virtual private cloud (VPC), subnets, and network infrastructure
An Amazon OpenSearch Serverless index
An Amazon Bedrock knowledge base
A data source for the knowledge base that connects to the S3 data bucket provisioned, with an event rule to sync the data
A Lambda function that watches for objects dropped under the S3 prefix configured as parameter S3DataPrefixKB and starts an ingestion job using Amazon Bedrock Knowledge Bases APIs, which reads data from Amazon S3, chunks it, converts the chunks into embeddings using the Amazon Titan Embeddings model, and stores these embeddings in OpenSearch Serverless (see the sketch after this list)
A Neptune Serverless database to store the RDF triples
A Step Functions state machine that invokes a series of Lambda functions that read from the different AWS services, generate RDF triples, and convert them to text documents
An ECS cluster and service to host the Streamlit web application
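As a reference for the ingestion Lambda function noted in the preceding list, the following is a minimal sketch of how such a function might start a sync using the Amazon Bedrock Knowledge Bases API. The environment variable names are assumptions; the deployed function resolves these IDs from the stack.

import os
import boto3

bedrock_agent = boto3.client("bedrock-agent")

def lambda_handler(event, context):
    # The knowledge base and data source IDs are assumed to be injected by the stack.
    response = bedrock_agent.start_ingestion_job(
        knowledgeBaseId=os.environ["KNOWLEDGE_BASE_ID"],
        dataSourceId=os.environ["DATA_SOURCE_ID"],
        description="Sync metadata text files written under the S3DataPrefixKB prefix",
    )
    return {"ingestionJobId": response["ingestionJob"]["ingestionJobId"]}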

After the CloudFormation stack is deployed, a Step Functions workflow runs automatically to orchestrate the metadata extract, transform, and load (ETL) job and store the final results in Amazon S3. You can view the execution status and details of the workflow by fetching the state machine Amazon Resource Name (ARN) from the CloudFormation stack. If AWS Lake Formation is enabled for the AWS Glue databases and tables in the account, complete the following steps after the CloudFormation stack is deployed to update the permissions, extract the metadata details from AWS Glue, and load the metadata into the knowledge base:

Add a role to the AWS Glue Lambda function that grants access to the AWS Glue database.
Fetch the state machine ARN from the CloudFormation stack.
Run the state machine with default input values to extract the metadata details and write to Amazon S3.

You can search for the application stack name <MainStackName>-deploy-<EnvironmentName> (for example, mm-enterprise-search-deploy-dev) on the AWS CloudFormation console. Locate the web application URL in the stack outputs (CloudfrontURL). Launch the web application by choosing the URL link.

Use the application
You can access the application from a web browser using the domain name of the Amazon CloudFront distribution created in the deployment steps. Log in using a user credential that exists in the Amazon Cognito user pool.

Now you can submit a query using a text input. The AWS account used in this example contains sample tables related to sales and marketing. We ask the question, “How to query sales data?” The answer includes metadata on the table mkt_sls_table that was created in the previous steps.

We ask another question: “How to get customer names from sales data?” In the previous steps, we created the raw_customer table, which wasn’t published as a data asset in Amazon DataZone. The table only exists in the Data Catalog. The application returns an answer that combines metadata from Amazon DataZone and AWS Glue.

This powerful solution opens up exciting possibilities for enterprise data discovery and insights. We encourage you to deploy it in your own environment and experiment with different types of queries across your data assets. Try combining information from multiple sources, asking complex questions, and see how the semantic understanding improves your search experience.
Clean up
The total cost of running this setup is less than $10 per day. However, we recommend deleting the CloudFormation stack after use because the deployed resources incur costs. Deleting the main stack also deletes all of the nested stacks except the VPC, because of resource dependencies. You also need to delete the VPC from the Amazon VPC console.
Conclusion
In this post, we presented a comprehensive and extendable multimodal search solution for enterprise data assets. The integration of LLMs and knowledge graphs shows that by combining the strengths of these technologies, organizations can unlock new levels of data discovery, reasoning, and insight generation, ultimately driving innovation and progress across a wide range of domains.
To learn more about LLM and knowledge graph use cases, refer to the following resources:

Building commonsense knowledge graphs to aid product recommendation
Using knowledge graphs to build GraphRAG applications with Amazon Bedrock and Amazon Neptune
Build a knowledge graph on Amazon Neptune with AI-powered video analysis using Media2Cloud

About the Authors
Sudipta Mitra is a Generative AI Specialist Solutions Architect at AWS, who helps customers across North America use the power of data and AI to transform their businesses and solve their most challenging problems. His mission is to enable customers to achieve their business goals and create value with data and AI. He helps architect solutions across AI/ML applications, enterprise data platforms, data governance, and unified search in enterprises.
Gi Kim is a Data & ML Engineer with the AWS Professional Services team, helping customers build data analytics solutions and AI/ML applications. With over 20 years of experience in solution design and development, he has a background in multiple technologies, and he works with specialists from different industries to develop new innovative solutions using his skills. When he is not working on solution architecture and development, he enjoys playing with his dogs at a beach under the San Francisco Golden Gate Bridge.
Surendiran Rangaraj is a Data & ML Engineer at AWS who helps customers unlock the power of big data, machine learning, and generative AI applications for their business solutions. He works closely with a diverse range of customers to design and implement tailored strategies that boost efficiency, drive growth, and enhance customer experiences.

Embodied AI Chess with Amazon Bedrock

Generative AI continues to transform numerous industries and activities, with one such application being the enhancement of chess, a traditional human game, with sophisticated AI and large language models (LLMs). Using the Custom Model Import feature in Amazon Bedrock, you can now create engaging matches between foundation models (FMs) fine-tuned for chess gameplay, combining classical strategy with generative AI capabilities.
Amazon Bedrock provides managed access to leading FMs from Anthropic, Meta, Mistral AI, AI21 Labs, Cohere, Stability AI, and Amazon, enabling developers to build sophisticated AI-powered applications. These models demonstrate remarkable capabilities in understanding complex game patterns, strategic decision-making, and adaptive learning. With the Custom Model Import feature, you can now seamlessly deploy your customized chess models fine-tuned on specific gameplay styles or historical matches, eliminating the need to manage infrastructure while enabling serverless, on-demand inference. This capability allows you to experiment with fascinating matchups between:

Base FMs vs. custom fine-tuned models
Custom fine-tuned models trained on distinct grandmaster playing styles

In this post, we demonstrate Embodied AI Chess with Amazon Bedrock, bringing a new dimension to traditional chess through generative AI capabilities. Our setup features a smart chess board that can detect moves in real time, paired with two robotic arms executing those moves. Each arm is controlled by different FMs—base or custom. This physical implementation allows you to observe and experiment with how different generative AI models approach complex gaming strategies in real-world chess matches.
Solution overview
The chess demo uses a broad spectrum of AWS services to create an interactive and engaging gaming experience. The following architecture diagram illustrates the service integration and data flow in the demo.

On the frontend, AWS Amplify hosts a responsive React TypeScript application while providing secure user authentication through Amazon Cognito using the Amplify SDK. This authentication layer connects users to backend services through GraphQL APIs, managed by AWS AppSync, allowing for real-time data synchronization and game state management.
The application’s core backend functionality is handled by a combination of Unit and Pipeline Resolvers. Unit Resolvers manage lightweight operations such as game state management, creation, and deletion, whereas the critical move-making processes are orchestrated through Pipeline Resolvers. These resolvers queue moves for processing by AWS Step Functions, providing reliable and scalable game flow management.
For generative AI-powered gameplay, Amazon Bedrock integration enables access to both FMs and custom fine-tuned models. The FMs fine-tuned using Amazon SageMaker are then imported into Amazon Bedrock through the Custom Model Import feature, making them available alongside FMs for on-demand access during gameplay. More details on fine-tuning and importing a fine-tuned FM into Amazon Bedrock can be found in the blog post Import a question answering fine-tuned model into Amazon Bedrock as a custom model.
The execution of chess moves on the board is coordinated by a custom component called Chess Game Manager, running on AWS IoT Greengrass. This component bridges the gap between the cloud infrastructure and the physical hardware.
When processing a move, the Step Functions workflow publishes a move request to an AWS IoT Core topic and pauses, awaiting confirmation. The Chess Game Manager component consumes the message and implements a three-phase validation system to make sure moves are executed accurately. First, it validates the intended move with the smart chessboard, which can detect piece positions. Second, it sends requests to the two robotic arms to physically move the chess pieces. Finally, it confirms with the smart chessboard that the pieces are in their correct positions after the move. This third-phase validation embodies the “trust but verify” principle in Embodied AI: the physical state of something may differ from what a dashboard shows, so the workflow only proceeds after the physical move is registered. After a move has been confirmed, the component publishes a response message back to AWS IoT Core on a separate topic, which signals the Step Functions workflow to continue.
The demo offers a few gameplay options. Players can choose from the following list of opponents:

Generative AI models available on Amazon Bedrock
Custom fine-tuned models deployed to Amazon Bedrock
Chess engines
Human opponents
Random moves

An infrastructure as code (IaC) approach was taken when constructing this project. You will use the AWS Cloud Development Kit (AWS CDK) when building the components for deployment into any AWS account. After you download the code base, you can deploy the project following the instructions outlined in the GitHub repo.
Prerequisites
This post assumes you have the following:

An AWS account
The AWS Command Line Interface (AWS CLI) installed
The AWS CDK Toolkit (cdk command) installed
Node
PNPM
Access to models in Amazon Bedrock

Chess with fine-tuned models
Traditional approaches to chess AI have focused on handcrafted rules and search algorithms. These methods, though effective, often struggle to capture the nuanced decision-making and long-term strategic thinking characteristic of human grandmasters. More recently, reinforcement learning (RL) has shown promise in mastering chess by allowing AI agents to learn through self-play and trial and error. RL models can discover strategies and evaluate board positions, but they often require extensive computational resources and training time—typically several weeks to months of continuous learning to reach grandmaster-level play.
Fine-tuning generative AI FMs offers a compelling alternative by learning the underlying patterns and principles of chess in just a few days using standard GPU instances, making it a more resource-efficient approach for developing specialized chess AI. The fine-tuning process significantly reduces the time and computational resources needed because the model already understands basic patterns and structures, allowing it to focus on learning chess-specific strategies and tactics.
Prepare the dataset
This section dives into the process of preparing a high-quality dataset for fine-tuning a chess-playing model, focusing on extracting valuable insights from games played by grandmasters and world championship games.
At the heart of our dataset lies the Portable Game Notation (PGN), a standard chess format that records every aspect of a chess game. PGN includes Forsyth–Edwards Notation (FEN), which captures the exact position of pieces on the board at any given moment. Together, these formats store both the moves played and important game details like player names and dates, giving our model comprehensive data to learn from.
Dataset preparation consists of the following key steps:

Data acquisition – We begin by downloading a collection of games in PGN format from publicly available PGN files on the PGN mentor program website. We used the games played by Magnus Carlsen, a renowned chess grandmaster. You can download a similar dataset using the following commands:

# Download games zip file to the target directory – You may choose a different set of games – replace filename with the name of the file you want to download
curl -o /data/filename.zip https://www.pgnmentor.com/players/filename.zip

# Unzip the file in the target directory
unzip filename.zip

Filtering for success – To train a model focused on winning strategies, we filter the games to include only games where the player emerged victorious. This allows the model to learn from successful games.
PGN to FEN conversion – Each move in a PGN file represents a transition in the chessboard state. To capture these states effectively, we convert PGN notation to FEN format. This conversion process involves iterating through the moves in the PGN, updating the board state accordingly, and generating the corresponding FEN for each move.

The following is a sample game in a PGN file:
[Event "Titled Tue DDth MMM Late"]
[Site "chess.com INT"]
[Date "YYYY.MM.DD"]
[Round "10"]
[White "Player 1 last name,Player 1 first name"]
[Black "Player 2 last name, Player 2 first name "]
[Result "0-1"]
[WhiteElo "2xxx"]
[BlackElo "2xxx"]
[ECO "A00"]

1.e4 c5 2.d4 cxd4 3.c3 Nc6 4.cxd4 d5 5.exd5 Qxd5 6.Nf3 e5 7.Nc3 Bb4 8.Bd2 Bxc3 9.Bxc3 e4 10.Nd2 Nf6 11.Bc4 Qg5 12.Qb3 O-O 13.O-O-O Bg4 14.h4 Bxd1 15.Rxd1 Qf5 16.g4 Nxg4 17.Rg1 Nxf2 18.d5 Ne5 19.Rg5 Qd7 20.Bxe5 f5 21.d6+ 1-0
The following are sample JSON records with FEN, capturing next move and next color to move. We followed two approaches for the JSON record creation. For models that have good understanding of FEN format, we used a more concise record:

{
    "move": "d4",
    "fen": "rnbqkbnr/pp1ppppp/8/2p5/4P3/8/PPPP1PPP/RNBQKBNR w KQkq - 0 2",
    "nxt_color": "WHITE"
}

For models with limited understanding of FEN format, we used a more detailed record:

{
    "move": "d4",
    "fen": "rnbqkbnr/pp1ppppp/8/2p5/4P3/8/PPPP1PPP/RNBQKBNR w KQkq - 0 2",
    "nxt_color": "WHITE",
    "move_history": "e4, c5"
}

The records include the following parameters:

move – A valid next move for the given FEN state.
fen – The current board position in FEN.
nxt_color – Which color has the next turn to move.
move_history – The history of game moves performed until the current board state.

For each game in the PGN file, multiple records similar to the preceding examples are created to capture the FEN, next move, and next move color.
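The following is a minimal sketch of this conversion using the chess Python library; the file names are illustrative, and the record layout follows the concise format shown earlier.

import json
import chess.pgn

def pgn_to_records(pgn_path: str, output_path: str) -> None:
    """Write one JSON record per move: the FEN before the move, the move in SAN, and the side to move."""
    with open(pgn_path) as pgn_file, open(output_path, "w") as out_file:
        while True:
            game = chess.pgn.read_game(pgn_file)
            if game is None:
                break
            board = game.board()
            for move in game.mainline_moves():
                record = {
                    "move": board.san(move),                        # next move in SAN
                    "fen": board.fen(),                             # board state before the move
                    "nxt_color": "WHITE" if board.turn else "BLACK",
                }
                out_file.write(json.dumps(record) + "\n")
                board.push(move)                                    # push also validates legality

pgn_to_records("data/Carlsen.pgn", "data/train_records.jsonl")  # illustrative paths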

Move validation – We validate the legality of each move captured in the records in the preceding format. This step maintains data integrity and prevents the model from learning incorrect or impossible chess moves.
Dataset splitting – We split the processed dataset into two parts: a training set and an evaluation set. The training set is used to train the model, and the evaluation set is used to assess the model’s performance on unseen data. This splitting helps us understand how well the model generalizes to new chess positions.

By following these steps, we create a comprehensive and refined dataset that enables our chess AI to learn from successful games, understand legal moves, and grasp the nuances of strategic chess play. This approach to data preparation creates the foundation for fine-tuning a model that can play chess at a high level.
Fine-tune a model
With our refined dataset prepared from successful games and legal moves, we now proceed to fine-tune a model using Amazon SageMaker JumpStart. The fine-tuning process requires clear instructions through a structured prompt template. Here again, based on the FM, we followed two approaches.
For fine-tuning an FM that understands FEN format, we used a more concise prompt template:

template = {
    "prompt": (
        "<s>[INST] You are a chess engine. Given a chess position in FEN notation and the color to move, provide the next best valid move in SAN (Standard Algebraic Notation) format to progress towards winning the game of chess. Your response must be a single move wrapped in <move></move> tags.\n\n"
        "Chess Position (FEN): {fen}\n"
        "Color to Move: {nxt_color} [/INST]"
    ),
    "completion": " <move>{move}</move> </s>"
}

Alternatively, for models with limited FEN knowledge, we provide a prompt template similar to the following:

template = {
    "prompt": (
        "<s>[INST]\nYou are a chess engine that provides the next best valid move in SAN format based on:\n- FEN position where:\n Black pieces: p=pawn, r=rook, n=knight, b=bishop, q=queen, k=king (lowercase)\n White pieces: P=pawn, R=rook, N=knight, B=bishop, Q=queen, K=king (uppercase)\n Numbers 1-8 indicate consecutive empty squares\n- Color to move\n- Move history\n\nAnalyze these inputs to recommend a legal move that progresses toward winning. Respond with a single move in <move></move> tags.\n\n"
        "Chess Position (FEN): {fen}\n"
        "Color to Move: {nxt_color}\n"
        "Move History: {move_history}\n"
    ),
    "completion": " <move>{move}</move> </s>"
}

Training and evaluation datasets along with the template.json file created using one of the preceding templates are then uploaded to an Amazon Simple Storage Service (Amazon S3) bucket so they are ready for the fine-tuning job that will be submitted using SageMaker JumpStart.
Now that the dataset is prepared and our model is selected, we submit a SageMaker training job with the following code:

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    environment={"accept_eula": "true"},
    disable_output_compression=True,
    instance_type="ml.g5.24xlarge",
)

# By default, instruction tuning is set to false.
estimator.set_hyperparameters(instruction_tuned=True, epoch="3", max_input_length="1024")
estimator.fit({"training": train_test_data_location})

Let’s break down the preceding code, and look at some important sections:

estimator – This is the SageMaker object used to accept all training parameters while launching and orchestrating the training job.
model_id – This is the SageMaker JumpStart model ID for the LLM that you need to fine-tune.
accept_eula – This EULA varies from provider to provider and must be accepted when deploying or fine-tuning models from SageMaker JumpStart.
instance_type – This is the compute instance the fine-tuning job will take place on. In this case, it’s a g5.24xlarge. This specific instance contains 4 NVIDIA A10G GPUs with 96 GiB of GPU memory. When deciding on an instance type, select the one that best balances your computational needs with your budget to maximize value.
fit – The .fit method is the actual line of code that launches the SageMaker training job. All of the algorithm metrics and instance usage metrics can be viewed in Amazon CloudWatch logs, which are directly integrated with SageMaker.

When the SageMaker training job is complete, the model artifacts will be stored in an S3 bucket specified either by the user or the system default.
The notebook we use for fine-tuning one of the models can be accessed in the following GitHub repo.
Challenges and best practices for fine-tuning
In this section, we discuss common challenges and best practices for fine-tuning.
Automated optimizations with SageMaker JumpStart
Fine-tuning an LLM for chess move prediction using SageMaker presents unique opportunities and challenges. We used SageMaker JumpStart to do the fine-tuning because it provides automated optimizations for different model sizes when fine-tuning for chess applications. SageMaker JumpStart automatically applies appropriate quantization techniques and resource allocations based on model size. For example:

3B–7B models – Enables FSDP with full precision training
13B models – Configures FSDP with optional 8-bit quantization
70B models – Automatically implements 8-bit quantization and disables FSDP for stability

This means if you create a SageMaker JumpStart Estimator without explicitly specifying the int8_quantization parameter, it will automatically use these default values based on the model size you’re working with. This design choice is made because larger models (like 70B) require significant computational resources, so quantization is enabled by default to reduce the memory footprint during training.
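As an example, a sketch of overriding these defaults on the estimator might look like the following; the hyperparameter names int8_quantization and enable_fsdp are assumptions based on the JumpStart Llama fine-tuning recipes, so verify them against the hyperparameters your chosen model ID actually supports.

# Assumed hyperparameter names; confirm with your JumpStart model's documented hyperparameters.
estimator.set_hyperparameters(
    instruction_tuned=True,
    epoch="3",
    max_input_length="1024",
    int8_quantization="False",  # opt out of the 8-bit default applied to larger models
    enable_fsdp="True",         # keep FSDP enabled if the instance has enough memory
)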
Data preparation and format
Dataset identification and preparation can be a challenge. We used readily available PGN datasets from world championships and grandmaster matches to streamline the data preparation process for chess LLM fine-tuning, significantly reducing the complexity of dataset curation.
Choosing the right chess format that produces optimal results with an LLM is critical for successful results post-fine-tuning. We discovered that Standard Algebraic Notation (SAN) significantly outperforms Universal Chess Interface (UCI) format in terms of training convergence and model performance.
Prompt consistency
Using consistent prompt templates during fine-tuning helps the model learn the expected input-output patterns more effectively, and Amazon Bedrock Prompt Management provides robust tools to create and manage these templates systematically. We recommend using the prompt template suggestions provided by the model providers for improved performance.
Model size and resource allocation
Successful LLM training requires balancing cost across multiple dimensions, with instance selection being a primary one. You can start with the following recommended instances and scale up depending on the model quality and training time you need.

3B–7B models (24 GB memory requirement) – Fits on g5.2xlarge with QLoRA 4-bit quantization
8B–13B models (48 GB memory requirement) – Requires g5.4xlarge with efficient memory management
70B models (400 GB memory requirement) – Needs g5.48xlarge or p4d.24xlarge with a multi-GPU setup

Import the fine-tuned model into Amazon Bedrock
After the model is fine-tuned and the model artifacts are in the designated S3 bucket, it’s time to import it to Amazon Bedrock using Custom Model Import.
The following section outlines two ways to import the model: using the SDK or the Amazon Bedrock console.
The following is a code snippet showing how the model can be imported using the SDK:

create_model_import_job_resp = br_client.create_model_import_job(
    jobName=rivchess_imp_jb_nm,
    importedModelName=rivchess_model_nm,
    roleArn=role_arn,
    modelDataSource=rivchess_model_src)

In the code snippet, a create model import job is submitted to import the fine-tuned model into Amazon Bedrock. The parameters in the job are as follows:

JobName – The name of the import job so it may be identified using the SDK or Amazon Bedrock console
ImportedModelName – The name of the imported model, which will be used to invoke inference using the SDK and identify said model on the Amazon Bedrock console
roleArn – The role with the correct permissions to import a model onto Amazon Bedrock
modelDataSource – The S3 location where the model artifacts were stored when the training job completed

To use the Amazon Bedrock console, complete the following steps:

On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Imported models.

Choose Import model.

Provide the following information:

For Model name, enter a name for your model.
For Import job name, enter a name for your import job.
For Model import settings, select Amazon S3 bucket and enter your bucket location.
Create an IAM role or use an existing one.

Choose Import.

After the job is submitted, the job will populate the queue on the Imported models page.

When the model import job is complete, the model may now be called for inference using the Amazon Bedrock console or SDK.
Test the fine-tuned model to play chess
To test the fine-tuned model imported into Amazon Bedrock, we use the AWS SDK for Python (Boto3) to invoke the imported model. We simulated games between the fine-tuned model and the Stockfish engine, playing up to 50 moves or until the game is won either by the fine-tuned model or by Stockfish.
The Stockfish Python library requires the appropriate version of the Stockfish executable, downloaded from the Stockfish website, and essentially simulates a chess player at a particular Elo rating (a numerical value that represents a player's strength). We also use the chess Python library to visualize the status of the board.
The Stockfish engine and the chess Python library are GPL-3.0 licensed, and any usage, modification, or distribution of these libraries must comply with the GPL-3.0 license terms. Review the license agreements before using the Stockfish and chess Python libraries.
The first step is to install the chess and Stockfish libraries:

!pip install chess stockfish --upgrade --quiet

We then initialize the Stockfish library. The path to the command line executable needs to be provided:

from stockfish import Stockfish

stockfish = Stockfish(path='/home/sagemaker-user/riv2024-chess/stockfish/stockfish-ubuntu-x86-64-sse41-popcnt')
stockfish.update_engine_parameters({"Hash": 2048, "UCI_Chess960": "true"})
stockfish.set_elo_rating(1350)
fen_state = stockfish.get_fen_position()

We set the Elo rating, using Stockfish API methods (set_elo_rating). Additional configuration can be provided by following the Stockfish Python Library documentation.
We initialize the chess Python library similarly with equivalent code to the Stockfish Python library initialization. Further configuration can be provided to the chess library following the chess Python library documentation.

import chess

board = chess.Board()
board.reset_board()
board.chess960 = True
stockfish.set_fen_position(board.fen())

Upon initialization, we initiate the fine-tuned model imported into Amazon Bedrock against the Stockfish library. In the following code, the first move is performed by Stockfish. Then the fine-tuned model is invoked using the Amazon Bedrock invoke_model API, wrapped in a helper function, by providing the current FEN position of the chess board. We continue playing each side until one side wins or a total of 50 moves have been played. We check whether each move proposed by the fine-tuned model is legal, and we re-invoke the fine-tuned model up to five times if a proposed move is illegal.
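The get_llm_next_move helper is used in the loop that follows; a hypothetical sketch of it, calling the imported model through the Amazon Bedrock runtime, might look like this. The model ARN is a placeholder, and the response parsing assumes a Llama-style generation field that you may need to adjust to your imported model's output format.

import json
import re
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
IMPORTED_MODEL_ARN = "arn:aws:bedrock:us-east-1:111122223333:imported-model/abcd1234"  # placeholder

def get_llm_next_move(fen, next_turn, illegal_move=None, move_history=None):
    """Ask the imported model for the next move; returns a SAN move string or None."""
    prompt = (
        "<s>[INST] You are a chess engine. Given a chess position in FEN notation and the color to move, "
        "provide the next best valid move in SAN format wrapped in <move></move> tags.\n\n"
        f"Chess Position (FEN): {fen}\n"
        f"Color to Move: {next_turn}"
    )
    if illegal_move:
        prompt += f"\nYour previous suggestion {illegal_move} was illegal; choose a different legal move."
    if move_history:
        prompt += f"\nMove History: {move_history}"
    prompt += " [/INST]"

    response = bedrock_runtime.invoke_model(
        modelId=IMPORTED_MODEL_ARN,
        body=json.dumps({"prompt": prompt, "max_tokens": 32, "temperature": 0.1}),
    )
    payload = json.loads(response["body"].read())
    text = payload.get("generation", str(payload))  # response shape depends on the imported model architecture
    match = re.search(r"<move>(.*?)</move>", text)
    return match.group(1).strip() if match else None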
# move_count, move_list, and the separator string s (for example, s = ", ") are
# assumed to be initialized before this loop.
while True:

    sfish_move = stockfish.get_best_move()
    try:
        move_color = 'WHITE' if board.turn else 'BLACK'
        uci_move = board.push_san(sfish_move).uci()
        stockfish.set_fen_position(board.fen())
        move_count += 1
        move_list.append(f"{sfish_move}")
        print(f'SF Move – {sfish_move} | {move_color} | Is Move Legal: {stockfish.is_fen_valid(board.fen())} | FEN: {board.fen()} | Move Count: {move_count}')
    except (chess.InvalidMoveError, chess.IllegalMoveError) as e:
        print(f"Stockfish Error for {move_color}: {e}")
        print(f"### Move Count: {move_count} ###")
        print(f'Moves list – {s.join(move_list)}')
        break

    if board.is_checkmate():
        print("Stockfish won!")
        print(f"### Move Count: {move_count} ###")
        print(f'Moves list – {s.join(move_list)}')
        break

    if board.is_stalemate():
        print("Draw!")
        print(f"### Move Count: {move_count} ###")
        print(f'Moves list – {s.join(move_list)}')
        break

    next_turn = 'WHITE' if board.turn else 'BLACK'
    llm_next_move = get_llm_next_move(board.fen(), next_turn, None)
    if llm_next_move is None:
        print("Failed to get a move from LLM. Ending the game.")
        break

    ill_mov_cnt = 0
    while True:
        try:
            is_llm_move_legal = True
            prev_fen = board.fen()
            uci_move = board.push_san(llm_next_move).uci()
            is_llm_move_legal = stockfish.is_fen_valid(board.fen())
            if is_llm_move_legal:
                print(f'LLM Move – {llm_next_move} | {next_turn} | Is Move Legal: {stockfish.is_fen_valid(board.fen())} | FEN: {board.fen()} | Move Count: {move_count}')
                stockfish.set_fen_position(board.fen())
                move_count += 1
                move_list.append(f"{llm_next_move}")
                break
            else:
                board.pop()
                print('Popping board and retrying LLM Next Move!!!')
                llm_next_move = get_llm_next_move(board.fen(), next_turn, llm_next_move, s.join(move_list))
        except (chess.AmbiguousMoveError, chess.IllegalMoveError, chess.InvalidMoveError) as e:
            print(f"LLM Error #{ill_mov_cnt}: {llm_next_move} for {next_turn} is illegal move!!! for {prev_fen} | FEN: {board.fen()}")
            if ill_mov_cnt == 5:
                print(f"{ill_mov_cnt} illegal moves so far, exiting....")
                break
            ill_mov_cnt += 1
            llm_next_move = get_llm_next_move(board.fen(), next_turn, llm_next_move)

    if board.is_checkmate():
        print("LLM won!")
        print(f"### Move Count: {move_count} ###")
        print(f'Moves list – {s.join(move_list)}')
        break

    if board.is_stalemate():
        print("Draw!")
        print(f"### Move Count: {move_count} ###")
        print(f'Moves list – {s.join(move_list)}')
        break

    if move_count == 50:
        print("Played 50 moves hence quitting!!!!")
        break

board
We observe and measure the effectiveness of the model by counting the number of legal moves it is able to propose.
The notebook we use for testing the fine-tuned model can be accessed from the following GitHub repo.
Deploy the project
You can initiate the deployment of the project using instructions outlined in the GitHub repo, starting with the following command:
pnpm cdk deploy
This will initiate an AWS CloudFormation stack to run. After the stack is successfully deployed to your AWS account, you can begin setting up user access. Navigate to the newly created Amazon Cognito user pool, where you can create your own user account for logging in to the application. After creating your account, you can add yourself to the admin group to gain administrative privileges within the application.
After you complete the user setup, navigate to Amplify, where your chess application should now be visible. You’ll find a published URL for your hosted demo—simply choose this link to access the application. Use the login credentials you created in the Amazon Cognito user pool to access and explore the application.
After you’re logged in with admin privileges, you’ll be automatically directed to the /admin page. You can perform the following actions on this page:

Create a session (game instance) by selecting from various gameplay options.
Start the game from the admin panel.
Choose the session to load the necessary cookie data.
Navigate to the participants screen to view and test the game. The interface is intuitive, but following these steps in order will provide proper game setup and functionality.

Set up the AWS IoT Core resources
Configuring the solution for IoT gameplay follows a similar process to the previous section—you’ll still need to deploy the UI stack. However, this deployment includes an additional IoT flag that signals the stack to deploy the AWS IoT rules in charge of handling game requests and responses. The specific deployment steps are outlined in this section.
Follow the steps from before, but add the following flag when deploying:
pnpm cdk deploy -c iotDevice=true
This will deploy the solution, adding a critical step to the Step Functions workflow, which publishes a move request message to the topic of an AWS IoT rule and then waits for a response.
Users will need to configure an IoT edge device to consume game requests from this topic. This involves setting up a device capable of publishing and subscribing to topics using the MQTT protocol, processing move requests, and sending success messages back to the topic of the AWS IoT rule that is waiting for responses, which then feeds back into the Step Functions workflow. Although the configuration is flexible and can be customized to your needs, we recommend using AWS IoT Greengrass on your edge device. AWS IoT Greengrass is an open source edge runtime and cloud service for building, deploying, and managing device software. This enables secure topic communication between your IoT devices and the AWS Cloud, allowing you to perform edge verifications such as controlling the robotic arms and synchronizing with the physical board before publishing either a success or failure message back to the cloud.
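As a rough illustration of the edge side of this flow, the following sketch uses the AWS IoT Device SDK v2 for Python to consume move requests and publish confirmations over MQTT. The endpoint, certificate paths, and topic names are placeholders; the actual component receives its topics from the Greengrass-managed configuration described in the next section.

import json
import threading
from awscrt import mqtt
from awsiot import mqtt_connection_builder

REQUEST_TOPIC = "chess/moves/requests"    # placeholder topic names
RESPONSE_TOPIC = "chess/moves/responses"

connection = mqtt_connection_builder.mtls_from_path(
    endpoint="<your-iot-endpoint>",
    cert_filepath="device.pem.crt",
    pri_key_filepath="private.pem.key",
    ca_filepath="AmazonRootCA1.pem",
    client_id="chess-game-manager-client",
)
connection.connect().result()

def on_move_request(topic, payload, **kwargs):
    request = json.loads(payload)
    # Drive the robotic arms and verify the board state here before confirming.
    connection.publish(
        topic=RESPONSE_TOPIC,
        payload=json.dumps({"requestId": request.get("requestId"), "status": "SUCCESS"}),
        qos=mqtt.QoS.AT_LEAST_ONCE,
    )

subscribe_future, _ = connection.subscribe(
    topic=REQUEST_TOPIC, qos=mqtt.QoS.AT_LEAST_ONCE, callback=on_move_request
)
subscribe_future.result()
threading.Event().wait()  # keep the client running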
Set up a Greengrass core device and client devices
To set up an AWS IoT Greengrass V2 core device, you can deploy the Chess Game Manager component to it by following the instructions in the GitHub repo for the Greengrass component. The component contains a recipe, where you'll need to define the configuration that is required for your IoT devices. The default configuration contains a list of topics used to process game requests and responses, perform board validations and notifications of new moves, and coordinate move requests and responses from the robotic arms. You also need to update the names of the client devices that will connect to the component; these client devices must be registered as AWS IoT things in AWS IoT Core.
You will also need a client application that controls the robotic arms and a client application that fetches information from the smart chess board. Both client applications need to connect and communicate with the Greengrass core device running the Chess Game Manager component. In our demo, we tested with two separate robotic arm client applications: for the first, we used a pair of CR10A arms from Dobot Robotics and communicated with them using the TCP-IP-CR-Python-V4 SDK; for the second, we used a pair of RO1 arms from Standard Bots, using the Standard Bots API. For the smart chess board client application, we used a DGT smart board, which comes with a USB cable that allows us to fetch piece move updates over serial communication.

Preventing illegal moves
When using FMs in Amazon Bedrock to generate the next move, the system employs a retry mechanism that makes three distinct attempts with the generative AI model, each providing more context than the last:

First attempt – The model is prompted to predict the next best move based on the current board state.
Second attempt – If the first move was illegal, the model is informed of its failure and prompted to try again, including the context of why the previous attempt failed.
Third attempt – If still unsuccessful, the model is provided with information on previous illegal moves, with an explanation of past failures. However, this attempt includes a list of all legal moves available. The model is then prompted to select from this list the next logical move.

If all three generative AI attempts fail, the system automatically falls back to a chess engine for a guaranteed valid move.
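A simplified sketch of this escalating retry logic, using hypothetical helper names for the model call and the chess engine fallback, could look like the following.

def next_move_with_fallback(board, model_id):
    """Three generative AI attempts with growing context, then fall back to a chess engine."""
    context = ""
    for attempt in range(3):
        if attempt == 2:
            legal_moves = ", ".join(board.san(m) for m in board.legal_moves)
            context += f"\nChoose one of these legal moves: {legal_moves}"
        candidate = ask_model_for_move(model_id, board.fen(), context)  # hypothetical Bedrock call
        if candidate in [board.san(m) for m in board.legal_moves]:
            return candidate
        context += f"\nThe move {candidate} was illegal for position {board.fen()}."
    return chess_engine_best_move(board)  # hypothetical fallback, for example Stockfish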
For the custom fine-tuned models imported into Amazon Bedrock, the system employs a retry mechanism that makes five distinct attempts with the model. If all five attempts fail, the system automatically falls back to a chess engine for a guaranteed move.
During chess evaluation tests, models that underwent fine-tuning with over 100,000 training records demonstrated notable effectiveness. These enhanced models prevailed in 80% of their matches against base versions, and the remaining 20% ended in draws.
Clean up
To clean up and remove all deployed resources, run the following command from the AWS CLI:

pnpm cdk destroy

To clean up the imported models in Amazon Bedrock, use the following code:

aws bedrock delete-imported-model \
    --model-identifier <your-model-name> \
    --region <your aws region>

You can also delete the imported models by going to the Amazon Bedrock console and selecting the imported model on the Imported models page.
To clean up the imported models in the S3 bucket, use the following commands after replacing the values corresponding to your environment:
# Delete a single model file
aws s3 rm s3://bucket-name/path/to/model/file

# Delete multiple model files in a directory
aws s3 rm s3://bucket-name/models/ --recursive

# Delete specific model files using include/exclude patterns
aws s3 rm s3://bucket-name/ --recursive --exclude "*" --include "model*.tar.gz"
This code uses the following parameters:

--recursive – Required when deleting multiple files or directories
--dryrun – Tests the deletion command without actually removing files

Conclusion
This post demonstrated how you can fine-tune FMs to create Embodied AI Chess, showcasing the seamless integration of cloud services, IoT capabilities, and physical robotics. With the comprehensive suite of AWS services, including Amazon Bedrock Custom Model Import, Amazon S3, AWS Amplify, AWS AppSync, AWS Step Functions, AWS IoT Core, and AWS IoT Greengrass, developers can create immersive chess experiences that bridge the digital and physical realms.
Give this solution a try and let us know your feedback in the comments.
References
More information is available at the following resources:

AWS Internet of Things
SageMaker JumpStart pretrained models
Import a customized model into Amazon Bedrock
Import a question answering fine-tuned model into Amazon Bedrock as a custom model
Emerging Architecture Patterns for Integrating IoT and generative AI on AWS
Robotic arm partners: Dobot Robotics and Standard Bots

About the Authors
Channa Samynathan is a Senior Worldwide Specialist Solutions Architect for AWS Edge AI & Connected Products, bringing over 28 years of diverse technology industry experience. Having worked in over 26 countries, his extensive career spans design engineering, system testing, operations, business consulting, and product management across multinational telecommunication firms. At AWS, Channa uses his global expertise to design IoT applications from edge to cloud, educate customers on the value proposition of AWS, and contribute to customer-facing publications.
Dwaragha Sivalingam is a Senior Solutions Architect specializing in generative AI at AWS, serving as a trusted advisor to customers on cloud transformation and AI strategy. With seven AWS certifications including ML Specialty, he has helped customers in many industries, including insurance, telecom, utilities, engineering, construction, and real estate. A machine learning enthusiast, he balances his professional life with family time, enjoying road trips, movies, and drone photography.
Daniel Sánchez is a senior generative AI strategist based in Mexico City with over 10 years of experience in cloud computing, specializing in machine learning and data analytics. He has worked with various developer groups across Latin America and is passionate about helping companies accelerate their businesses using the power of data.
Jay Pillai is a Principal Solutions Architect at AWS. In this role, he functions as the Lead Architect, helping partners ideate, build, and launch Partner Solutions. As an Information Technology Leader, Jay specializes in artificial intelligence, generative AI, data integration, business intelligence, and user interface domains. He holds 23 years of extensive experience working with several clients across supply chain, legal technologies, real estate, financial services, insurance, payments, and market research business domains.
Mohammad Tahsin is an AI/ML Specialist Solutions Architect at Amazon Web Services. He lives for staying up to date with the latest technologies in AI/ML and helping guide customers to deploy bespoke solutions on AWS. Outside of work, he loves all things gaming, digital art, and cooking.
Nicolai van der Smagt is a Senior Solutions Architect at AWS. Since joining in 2017, he’s worked with startups and global customers to build innovative solutions using AI on AWS. With a strong focus on real-world impact, he helps customers bring generative AI projects from concept to implementation. Outside of work, Nicolai enjoys boating, running, and exploring hiking trails with his family.
Patrick O’Connor is a WorldWide Prototyping Engineer at AWS, where he assists customers in solving complex business challenges by developing end-to-end prototypes in the cloud. He is a creative problem-solver, adept at adapting to a wide range of technologies, including IoT, serverless tech, HPC, distributed systems, AI/ML, and generative AI.
Paul Vincent is a Principal Prototyping Architect on the AWS Prototyping and Cloud Engineering (PACE) team. He works with AWS customers to bring their innovative ideas to life. Outside of work, he loves playing drums and piano, talking with others through Ham radio, all things home automation, and movie nights with the family.
Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.
Sam Castro is a Sr. Prototyping Architect on the AWS Prototyping and Cloud Engineering (PACE) team. With a strong background in software delivery, IoT, serverless technologies, and generative AI, he helps AWS customers solve complex challenges and explore innovative solutions. Sam focuses on demystifying technology and demonstrating the art of the possible. In his spare time, he enjoys mountain biking, playing soccer, and spending time with friends and family.
Tamil Jayakumar is a Specialist Solutions Architect & Prototyping Engineer with AWS specializing in IoT, robotics, and generative AI. He has over 14 years of proven experience in software development, creating minimum viable products (MVPs) and end-to-end prototypes. He is a hands-on technologist, passionate about solving technology challenges using innovative solutions both on software and hardware, aligning business needs to IT capabilities.

Efficiently train models with large sequence lengths using Amazon Sage …

Large language models (LLMs) have witnessed an unprecedented surge in popularity, with customers increasingly using publicly available models such as Llama, Stable Diffusion, and Mistral. Across diverse industries—including healthcare, finance, and marketing—organizations are now engaged in pre-training and fine-tuning these increasingly larger LLMs, which often boast billions of parameters and larger input sequence length. Although these advancements offer remarkable capabilities, they also present significant challenges. Longer sequence lengths and the sheer number of trainable parameters demand innovative approaches to model development and deployment. To maximize performance and optimize training, organizations frequently need to employ advanced distributed training strategies.
In this post, we demonstrate how the Amazon SageMaker model parallel library (SMP) addresses this need through support for new features such as 8-bit floating point (FP8) mixed-precision training for accelerated training performance and context parallelism for processing large input sequence lengths, expanding the list of its existing features.
We guide you through a step-by-step implementation, demonstrating how to accelerate workloads with FP8 and work with longer sequence lengths using context parallelism, with minimal code changes to your existing training workflow.
The implementation of these new SMP features promises several advantages for customers working with LLMs. First, it can lead to lower costs to convergence, allowing for more efficient use of resources during the training process. This results in reduced time to market, allowing organizations to deploy their optimized models more quickly and gain a competitive edge. Second, it enables training with larger dataset records, expanding the scope and complexity of tasks that can be tackled.
The following sections take a deeper look into this.
Business challenge
Businesses today face a significant challenge when training LLMs efficiently and cost-effectively. As models grow larger and more complex, organizations are using fine-tuning and continuous pre-training strategies to train these models with domain-specific data, using larger sequence lengths that can range from 8K to 128K tokens. These longer sequence lengths allow models to better understand long-range dependencies in text, generate more globally coherent outputs, and handle tasks requiring analysis of lengthy documents.
Although various strategies exist to effectively train models with billions of parameters, such as Fully Sharded Data Parallel (FSDP), tensor parallelism (TP), and pipeline parallelism, these methods are primarily designed to distribute model parameters, gradients, and optimizer states across GPUs; they don't focus on input data-related optimizations. Distributing model state in this way reduces memory pressure and enables efficient training of large models, but none of these techniques addresses partitioning along the sequence dimension. As a result, training with longer sequence lengths can still lead to out-of-memory (OOM) errors, despite using FSDP.
Working with larger sequence lengths therefore creates additional memory pressure and often requires innovative approaches such as FP8 and context parallelism.
How do SMP context parallelism and FP8 help accelerate model training?
SMP addresses the challenges of memory pressure by providing an implementation of context parallelism, which is a parallelization technique that partitions on the dimension of sequence length. Furthermore, it can work together with other parallelism techniques such as FSDP and TP. SMP also implements FP8 for supported models such as Llama. FP8 is a reduced-precision floating-point format that boosts efficiency by enabling faster matrix multiplications without significant accuracy loss. You can use these techniques together to train complex models that are orders of magnitude faster and rapidly iterate and deploy innovative AI solutions that drive business value.
The following sections dive deep into the implementation details for each of these features in SMP.
Context parallelism
Context parallelism is a model parallelism technique to allow the model to train with long sequences. It’s a parallelization scheme that partitions a model’s activations along the sequence dimension. During training with SMP context parallel strategy, the inputs are partitioned along the sequence dimension before being fed to the model. With activations being partitioned along the sequence dimension, we need to consider how our model’s computations are affected. For layers that don’t have inter-token dependency during computation, we don’t require special considerations. In a transformer architecture, such layers are the embedding layers and the multilayer perceptron (MLP) layers. The layers that have inter-token dependency are the attention layers. For the attention layer, as we see from the attention computation, Query projections (Q) need to interact with the tokens of key (K) and value (V) projections.

Because we only have a partition of K and V, we require an AllGather operation to collect the keys and values from other ranks. As detailed in the following figure, we consider a context parallel scheme with context parallel degree 2 for a causal language model. Thus GPU 0 has the first half of the input sequence and GPU 1 has the other half. During the forward pass, the non-attention layers compute their activations as normal. For attention computation, an AllGather operation is performed for K and V across the context parallel ranks belonging to GPU 0 and GPU 1. To conserve memory, the K and V tensors obtained from the AllGather operation are discarded after the attention computation is completed. Consequently, during the backward pass, we require the same AllGather operation for K and V. Additionally, after the attention backward pass, a ReduceScatter operation is performed to scatter the gradients to the corresponding context parallel ranks.

Unlike other model parallel schemes such as tensor parallelism, context parallelism keeps the model parameters intact. Thus, there are no additional communication collectives for parameters required for context parallelism.
Supported models
SMP supports context parallelism using NVIDIA Transformer Engine, and it integrates seamlessly with other model parallelism techniques such as Fully Sharded Data Parallel (FSDP) and tensor parallelism. SMP v2.6 supports the Llama 3.1 (and prior Llama) and Mistral model architectures for context parallelism.
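A sketch of what enabling context parallelism could look like when launching a SageMaker training job follows; the distribution keys and the context_parallel_degree parameter name are based on the SMP v2 documentation, and the framework versions and parallelism degrees are assumptions to adjust for your setup.

from sagemaker.pytorch import PyTorch

smp_parameters = {
    "context_parallel_degree": 2,  # partition activations along the sequence dimension
    "hybrid_shard_degree": 8,      # combine with FSDP-style sharding
}

estimator = PyTorch(
    entry_point="train.py",        # training script that calls torch.sagemaker.init()
    role="<execution-role-arn>",
    instance_type="ml.p5.48xlarge",
    instance_count=1,
    framework_version="2.3.1",
    py_version="py311",
    distribution={
        "torch_distributed": {"enabled": True},
        "smdistributed": {"modelparallel": {"enabled": True, "parameters": smp_parameters}},
    },
)
estimator.fit({"training": "s3://<bucket>/<tokenized-dataset-prefix>/"})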
Mixed precision training with FP8
As shown in the following figure, FP8, a data type supported by NVIDIA’s H100 and H200 GPUs, enables efficient deep learning workloads. The FP8 format occupies only 8 bits of memory, half that of its BF16 or FP16 counterparts, significantly reducing computational costs for operations such as matrix multiplication. The compute throughput for running matrix operations such as multiplications and convolutions is significantly higher on 8-bit float tensors compared to 32-bit float tensors. FP8 precision reduces the data footprint and computational requirements, making it ideal for large-scale models where memory and speed are critical.

Delving deeper into FP8’s architecture, we discover two distinct subtypes: E4M3 and E5M2. The E4M3 configuration, with its 1 sign bit, 4 exponent bits, and 3 mantissa bits, offers superior precision but a limited dynamic range. This makes it ideal for the forward pass in model training. Conversely, E5M2, featuring 1 sign bit, 5 exponent bits, and 2 mantissa bits, boasts a broader dynamic range at the expense of reduced precision. This configuration excels in the backward pass, where precision is less critical, but a wider range proves advantageous.
The transition to mixed precision training with FP16 or BF16 has historically required static or dynamic loss scaling to address convergence issues stemming from reduced precision in gradient flow. This challenge is amplified further in FP8 because of its narrower range. To address it, Transformer Engine introduced DelayedScaling, a technique that selects scaling factors based on the maximum value observed for each tensor over previous iterations. Although DelayedScaling maximizes the performance benefits of FP8 computation, it incurs a memory overhead for storing each tensor's maximum-value history. Despite this overhead, the improved throughput observed with 8-bit tensor computations makes the approach worthwhile.
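For readers who want to see what this looks like outside of SMP, the following is a minimal sketch of FP8 training with Transformer Engine's DelayedScaling recipe. It assumes an H100 or H200 class GPU and the transformer_engine package; the recipe arguments are illustrative and are not the values SMP configures internally.

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Hybrid format: E4M3 for forward-pass tensors, E5M2 for gradients in the backward pass
fp8_recipe = recipe.DelayedScaling(
    fp8_format=recipe.Format.HYBRID,
    amax_history_len=16,       # window of past max values used to pick scaling factors
    amax_compute_algo="max",
)

layer = te.Linear(4096, 4096, bias=True).cuda()
optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-4)
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(x)  # the matrix multiplication runs in FP8 with delayed scaling
loss = out.float().pow(2).mean()
loss.backward()
optimizer.step()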
Supported models
SMP supports FP8 mixed precision training using NVIDIA Transformer Engine and keeps compatibility with PyTorch MixedPrecision. This means that you can use FP8 training for supported layers and half-precision using PyTorch Automatic Mixed Precision for others. SMP v2.6 supports the following model architectures for FP8 training: Llama 3.1 (and prior Llama models), Mixtral, and Mistral.
More details about FP8 can be found at FP8 Formats For Deep Learning.
Solution overview
We can use SMP with both Amazon SageMaker Model training jobs and Amazon SageMaker HyperPod.
For this post, we demonstrate SMP implementation on SageMaker training jobs.
Launching a machine learning (ML) training cluster with Amazon SageMaker training jobs is a seamless process that begins with a straightforward API call, AWS Command Line Interface (AWS CLI) command, or AWS SDK interaction. After they’re initiated, SageMaker training jobs spin up the cluster, provisioning the specified number and type of compute instances.
In our example, we use a single ml.p5.48xlarge instance, though we’re illustrating the use of four GPUs for demonstration purposes. The training data, securely stored in Amazon Simple Storage Service (Amazon S3), is copied to the cluster. Each record sequence (Seq0) is strategically split into multiple subsequences and assigned to each GPU in our cluster.
Our implementation uses the FP8 capabilities of SMP to execute model training on NVIDIA H100 GPUs and showcases context parallelism capabilities. Because of the flexibility of SageMaker, you can scale your compute resources as needed, accommodating workloads across a range of sizes. SageMaker creates a resilient training cluster, handles orchestration, closely monitors the infrastructure, and recovers from faults, providing a smooth and uninterrupted training experience. Furthermore, the cost-effective design of SageMaker training jobs automatically terminates the cluster upon completion of the training job, with billing calculated down to the second of actual training time used. This combination of power, flexibility, and cost-efficiency makes SageMaker an ideal service for ML practitioners of all levels.
The following diagram shows the solution architecture.

The following walkthrough shows you how you can train a Llama 3.1 8B Instruct model using the PubMed tokenized dataset with a sequence length of approximately 16K tokens. We use SMP context parallelism implementation to enable training for this large sequence length. We compare two approaches: one without context parallelism and another one with it. This comparison highlights the importance of context parallelism when working with LLMs and datasets containing long sequences.
Additionally, we conduct a comparative run on p5.48xlarge instances with context parallelism enabled, both with FP8 enabled and disabled. This demonstration will showcase the incremental throughput benefits we can achieve by enabling FP8-based training alongside context parallelism.
In summary, the implementation follows these four steps:

Set up libraries and process data
Run training without context parallelism
Run training with context parallelism enabled to track memory optimizations
Run training with FP8 enabled to gain further performance

The following flow diagram shows these four steps.

Prerequisites
To implement the solution, you need to have the following prerequisites in place:

Create a Hugging Face User Access Token and get access to the gated repository meta-llama/Llama-3.1-8B on Hugging Face.
Request a Service Quota increase for 1x ml.p4d.24xlarge and 1x ml.p5.48xlarge on Amazon SageMaker. To request a service quota increase, on the AWS Service Quotas console, choose AWS services, Amazon SageMaker, and then choose one ml.p4d.24xlarge and one ml.p5.48xlarge training job usage.
Create an AWS Identity and Access Management (IAM) role with the managed policies AmazonSageMakerFullAccess and AmazonEC2FullAccess to give SageMaker the required access to run the examples.

This walkthrough is for demonstration purposes only. You should adjust this to your specific security requirements for production. Adhere to the principle of least privilege while defining IAM policies in production.

Create an Amazon SageMaker Studio domain (refer to Quick setup to Amazon SageMaker) to access Jupyter notebooks.

Solution walkthrough
To implement the solution, use the instructions in the following steps.
Set up libraries and process data
To set up libraries and process data, follow these instructions. The following flow diagram shows step 1 highlighted.

Enter the following command to install the relevant Hugging Face and SageMaker libraries:

%pip install --upgrade "sagemaker>=2.233"
%pip install "datasets==2.14.5"
%pip install transformers

Load the PubMed dataset and tokenize it

In this example, we use the PubMed Scientific Papers dataset, which contains 133,215 biomedical research articles. For our experiment, we select 1,000 papers and split them 80/20 for training and validation. Using the Meta-Llama-3 tokenizer, we process each paper into sequences of 16,384 tokens.
The dataset undergoes two main processing steps: tokenization with the Llama tokenizer and grouping into fixed-length chunks of 16,384 tokens using the utility function group_texts. This uniform sequence length enables even distribution across GPUs while maintaining the natural structure of the scientific papers.

import datasets
from datasets import load_dataset, DatasetDict

# Load the PubMed dataset
pubmed_dataset = load_dataset(
    "scientific_papers",
    "pubmed",
    cache_dir="/home/ec2-user/SageMaker/datasets",
    download_mode="force_redownload"
)

# Create a smaller subset of the dataset for our experiment
train_test = pubmed_dataset["train"].shuffle(seed=42).select(range(1000)).train_test_split(
    test_size=0.2,
    seed=42
)

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    desc=f"Grouping texts in chunks of {block_size}",
)
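The snippet above references tokenized_datasets, group_texts, and block_size, which are defined earlier in the workshop notebook and not shown here. The following is a sketch of what those pieces typically look like for causal language modeling preprocessing; the exact arguments in the notebook may differ.

from itertools import chain
from transformers import AutoTokenizer

block_size = 16384
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # gated repository; requires an access token

def tokenize_function(examples):
    # The "scientific_papers" records expose the full paper body in the "article" field
    return tokenizer(examples["article"])

def group_texts(examples):
    # Concatenate all tokenized articles, then split into fixed 16,384-token chunks
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

tokenized_datasets = train_test.map(
    tokenize_function,
    batched=True,
    remove_columns=train_test["train"].column_names,
)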

Prepare data for the training job

In this section, we prepare the PubMed dataset for SageMaker training by managing data transfers to Amazon S3. Both training and validation splits are converted to JSON format and uploaded to designated S3 buckets, with separate paths for input data and output artifacts.

if lm_datasets["train"] is not None:
    train_dataset = lm_datasets["train"]
    train_dataset.to_json("./training.json")
    training_dataset_location = f"s3://{default_bucket}/dataset/train/"

if lm_datasets["validation"] is not None:
    eval_dataset = lm_datasets["validation"]
    eval_dataset.to_json("./validation.json")
    validation_dataset_location = f"s3://{default_bucket}/dataset/validation/"

Set up training hyperparameters

In this configuration, we define hyperparameters for training Llama on PubMed, covering memory optimizations, training parameters, model architecture settings, and performance tuning. Starting with conservative settings (batch size=1, BF16 precision), we establish a baseline configuration that will be modified to test different optimization strategies, particularly for context parallelism experiments.

hyperparameters = {
    # Memory and optimization settings
    "activation_checkpointing": 1,
    "auto_wrap_policy": "transformer_auto_wrap_policy",

    # Training settings
    "train_batch_size": 1,
    "val_batch_size": 1,

    # Model configuration
    "vocab_size": 128256,  # Vocab size from the Llama 3.1 config file on Hugging Face
    "hf_pretrained_model_name_or_dir": model_id,
}

Run training without context parallelism

To run training without context parallelism, follow these instructions. The following flow diagram shows step 2 highlighted.
In this setup, we configure a baseline training job with context parallelism and FP8 disabled. Each GPU processes the full 16,384-token sequence without splitting, and memory-saving features are turned off to demonstrate the limitations and memory constraints you run into without advanced optimizations such as context parallelism and FP8.

instance_type = "p4d.24xlarge"
instance_count = 1
hybrid_shard_degree = 8

hyperparameters.update({
    "use_smp_implementation": 0,  # Disable SMP/CP. Only FSDP is active
    "train_batch_size": 1,        # Batch size
    "max_context_width": 16384,   # Full sequence length
    "clean_cache": 0,
    "bf16": 1,                    # Use BF16
})

from sagemaker.pytorch import PyTorch

smp_estimator = PyTorch(
    entry_point="train.py",
    hyperparameters=hyperparameters,
    instance_type=instance_type,
    instance_count=instance_count,
    volume_size=400,
    distribution={
        "torch_distributed": {
            "enabled": True,
        },
        "smdistributed": {
            "modelparallel": {
                "enabled": True,  # Enable model parallelism but with minimal parameters
                "parameters": {
                    "hybrid_shard_degree": hybrid_shard_degree,
                    "delayed_parameter_initialization": True
                }
            }
        }
    },
)

smp_estimator.fit(inputs=data_channels)

Not using context parallelism with a large context width (16,384) results in a CUDA out-of-memory error:

AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage “[rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.83 GiB. GPU 3 has a total capacity of 39.38 GiB of which 5.53 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use.

Run training with context parallelism enabled to track memory optimizations

To run training with context parallelism enabled to track memory optimizations, follow these instructions. The following flow diagram shows step 3 highlighted.
In this configuration, we enable context parallelism while keeping FP8 disabled. By setting the context parallel degree to 8, we distribute the 16,384-token sequence across all available GPUs for efficient processing. The setup includes the essential context parallelism parameters and launches the training job in a background thread, allowing unblocked notebook execution while keeping the job clearly identifiable for comparison with other configurations.

instance_type = "p4d.24xlarge"
instance_count = 1
hybrid_shard_degree = 8
context_parallel_degree = 8

smp_estimator = PyTorch(
    entry_point="train.py",
    instance_type=instance_type,
    instance_count=instance_count,
    distribution={
        "torch_distributed": {
            "enabled": True,
        },
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "context_parallel_degree": context_parallel_degree,
                    "hybrid_shard_degree": hybrid_shard_degree,
                    "delayed_parameter_initialization": True,
                }
            }
        }
    },
)

smp_estimator.fit(inputs=data_channels)

The result of using context parallelism with such a large context width is that the job successfully completes, as shown in the following screenshot.

We also enabled delayed parameter initialization and hybrid sharding capabilities from SMP for both preceding configurations. Delayed parameter initialization allows initializing large models on a meta device without attaching data, which can resolve limited GPU memory issues when you first load the model. This approach is particularly useful for training LLMs with tens of billions of parameters, where even CPU memory might not be sufficient for initialization. Hybrid sharding is a memory-saving technique that shards parameters within the hybrid shard degree (HSD) group and replicates parameters across groups. The HSD controls sharding across GPUs and can be set to an integer from 0 to world_size. This reduces communication volume because expensive AllGather and ReduceScatter operations are performed only within a node, which performs better for medium-sized models.
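The following is a small plain-PyTorch illustration of the idea behind delayed parameter initialization (not SMP's internal mechanism): the module is built on the meta device so no weight storage is allocated, and the parameters are materialized and initialized later on the target device.

import torch
import torch.nn as nn

with torch.device("meta"):
    # No parameter memory is allocated here, so even very large modules are cheap to construct
    big_layer = nn.Linear(16384, 16384)

print(big_layer.weight.device)                # meta
big_layer = big_layer.to_empty(device="cpu")  # allocate real (uninitialized) storage later
nn.init.normal_(big_layer.weight, std=0.02)   # initialize weights once materialized
nn.init.zeros_(big_layer.bias)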
Run training with FP8 enabled to gain further performance

To run training with FP8 enabled to gain further performance, follow these instructions. The following flow diagram shows step 4 highlighted.
In this fully optimized configuration, we enable both context parallelism and FP8 training on an NVIDIA P5 instance (ml.p5.48xlarge). This setup combines sequence splitting across GPUs with FP8 precision training, creating a highly efficient training environment. P5 instances provide the necessary hardware support for FP8 computation, so we can maximize the benefits of both memory-saving techniques.

instance_type = "p5.48xlarge"
instance_count = 1
hybrid_shard_degree = 8
context_parallel_degree = 8

hyperparameters.update({
    "use_smp_implementation": 1,   # Enable SMP/CP
    "max_context_width": 16384,    # Full sequence length
    "fp8": 1,                      # Enable FP8 flag
    "distributed_backend": "nccl"  # Add this line to explicitly use NCCL
})

smp_estimator = PyTorch(
    entry_point="train.py",
    instance_type=instance_type,
    instance_count=instance_count,
    distribution={
        "torch_distributed": {
            "enabled": True,
        },
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "context_parallel_degree": context_parallel_degree,
                    "hybrid_shard_degree": hybrid_shard_degree,
                    "delayed_parameter_initialization": True,
                }
            }
        }
    },
)

smp_estimator.fit(inputs=data_channels)

Start training with context parallelism, without FP8 (on a P5 instance)
To do a fair comparison with and without FP8, we do another run without FP8 but with context parallelism on an ml.p5.48xlarge instance and compare the throughput of both runs.

instance_type = "p5.48xlarge"
instance_count = 1
hybrid_shard_degree = 8
context_parallel_degree = 8

hyperparameters.update({
    "use_smp_implementation": 1,   # Enable SMP/CP
    "max_context_width": 16384,    # Full sequence length
    "bf16": 1,                     # Use BF16
    "distributed_backend": "nccl"  # Add this line to explicitly use NCCL
})

# This remains the same as in the previous step
smp_estimator = PyTorch(

)

smp_estimator.fit(inputs=data_channels)

Comparing both runs, the same context parallelism-enabled job runs almost 10 times faster with FP8 enabled.
With FP8, speed is around 14.6 samples/second, as shown in the following screenshot.

Without FP8, speed is around 1.4 samples/second, as shown in the following screenshot.

The following table depicts the throughput increase you get in each of the listed cases. All these cases were run on an ml.p5.48xlarge instance.
Throughput may vary based on factors such as the context width or batch size. The following numbers are what we observed in our testing.

Configuration (ml.p5.48xlarge; CP on 8 GPUs, train batch size 4) | Observed samples speed | Observed throughput
No context parallelism and no FP8 | torch.OutOfMemoryError: CUDA out of memory | torch.OutOfMemoryError: CUDA out of memory
Context parallelism only | 2.03 samples/sec | 247 TFLOPS/GPU
Context parallelism + FP8 | 3.05 samples/sec | 372 TFLOPS/GPU

Cleanup
To clean up your resources to avoid incurring more charges, follow these steps:

Delete any unused SageMaker Studio resources.
Optionally, delete the SageMaker Studio domain.
Delete any S3 buckets created.
Verify that your training job is no longer running. To do so, on the SageMaker console, choose Training and check Training jobs.

To learn more about cleaning up your resources provisioned, check out Clean up.
Conclusion
In this post, we demonstrated the process of setting up and running training jobs for the PubMed dataset using the Llama 3.1 8B Instruct model, both with and without context parallelism. We also showcased how to enable FP8 based training for even faster throughputs.
Key takeaways:

For datasets that have long sequence lengths, we observe that using context parallelism helps avoid OOM errors.
For faster training, we can enable FP8-based training and combine it with context parallelism to further increase throughput. In this notebook, we observed that throughput goes up tenfold when we enable FP8 with context parallelism.

As next steps, try out the preceding example by following the notebook steps at sagemaker-distributed-training-workshop.
Special thanks to Roy Allela, Senior AI/ML Specialist Solutions Architect, for his support on the launch of this post.

About the Authors
Kanwaljit Khurmi is a Principal Worldwide Generative AI Solutions Architect at AWS. He collaborates with AWS product teams, engineering departments, and customers to provide guidance and technical assistance, helping them enhance the value of their hybrid machine learning solutions on AWS. Kanwaljit specializes in assisting customers with containerized applications and high-performance computing solutions.
Surya Kari is a Senior Generative AI Data Scientist at AWS. With a background in computer vision and AI devices, his current specializations include LLM training, multi-modal RAG, vision-language models, and edge computing.
Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in LLM training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.
Suhit Kodgule is a Software Development Engineer with the AWS Artificial Intelligence group working on deep learning frameworks. In his spare time, he enjoys hiking, traveling, and cooking.
Anirudh Viswanathan is a Sr Product Manager, Technical – External Services with the SageMaker Training team. He holds a Masters in Robotics from Carnegie Mellon University, an MBA from the Wharton School of Business, and is named inventor on over 40 patents. He enjoys long-distance running, visiting art galleries, and Broadway shows.

10 Overlooked Ecommerce Store Conversion Optimizations You Need to Try …

You’ve got visitors rolling into your ecommerce store but how many are actually pulling out their wallets? 

Not as many as you’d like, right? Ouch. 

It’s not because your products aren’t amazing (they are). It’s because your site isn’t doing enough to turn curious browsers into loyal buyers.

Here’s the thing…most ecommerce store owners focus on the obvious stuff like adding flashy CTAs and running discount ads. 

But what about the sneaky, overlooked optimizations that actually move the needle? The tweaks you don’t hear about because everyone’s too busy recycling the same tired tips?

That’s where we come in. We’re diving into 10 fresh, overlooked ecommerce store conversion optimizations you probably haven’t tried yet. 

These aren’t your basic “add free shipping” tips. They’re strategies that are practical, creative, and proven to squeeze more juice out of your existing traffic.

So buckle up, because by the time you’re done reading, you’ll be armed with real, useful, and actionable ideas to turn those “meh” conversion rates into a sales explosion. 


1. Know Your Visitors: The Power of Visitor Identification

How much do you really know about the people landing on your site? 

Sure, you’ve got traffic numbers and maybe some heatmaps showing where they click but unless they’re actually logging in or filling out forms, most visitors are just faceless stats in your analytics dashboard. 

That’s a problem.

Website visitor identification flips the script, showing you who’s lurking behind all those “anonymous” sessions. And with the right tools, you can figure out who’s browsing your site – even if they don’t take obvious actions. It’s not magic but it is one of the most overlooked ways to optimize your ecommerce store.

Why it Matters

Ecommerce store conversion is all about connection. If you know who is visiting, you can tailor your site, messaging, and offers to match what they’re looking for. 

Visitor identification gives you the insights to stop guessing and start personalizing.

How to Optimize This

Use Visitor Identification Tools: Tools like Customers.ai can help you uncover anonymous visitors by linking their activity to known company domains or profiles.

Segment Smarter: Once you identify visitors, segment them based on behavior (e.g., frequent cart abandoners, category browsers) or traits (e.g., location, company size).

Personalize Everything: Sync this data with your CRM or dynamic content tools to deliver tailored experiences. Think personalized product recommendations or localized promotions that make visitors feel like your store was built just for them.

Quick Win

Add the Customers.ai pixel to your site and start capturing more customers. Here is how to identify your website visitors with Customers.ai:

1. Sign up for a free account

If you don’t already have a Customers.ai account, sign up here (no credit card is required) and connect your business.

2. Install the x-ray pixel on your site

Installing the website identification x-ray pixel is easy and can be done through Tag Manager, Shopify, WordPress, and more.

3. Verify the x-ray pixel is firing

4. Start identifying your website visitors

That’s it! Once the pixel is installed and verified, you can start identifying your website visitors.

And with the highest capture rate in the industry, you should start seeing visitors immediately.

Knowing your visitors is the first step to converting them. The more you understand about who’s visiting and what they want, the easier it is to turn them into paying customers. And hey, isn’t that the whole point?

2. The Microcopy Magic Trick

You know those tiny pieces of text scattered across your site? The ones that explain a button, guide someone through a form, or reassure them during checkout? 

Yeah, that’s microcopy. 

It’s small, but it’s mighty. In fact, good microcopy can be the difference between “meh, maybe later” and “TAKE MY MONEY!”

Most ecommerce stores don’t give microcopy the love it deserves. They slap generic text on buttons (“Submit”), leave form fields unexplained, and miss golden opportunities to build trust and boost conversions. Let’s fix that.

Why Microcopy Matters

Microcopy works its magic in the background. It reassures shoppers they’re making a good decision, answers silent doubts, and gives them the confidence to move forward. 

Think of it as the friendly salesperson for your online store, always there, but never pushy.

Where to Optimize Microcopy

Buttons: Swap vague CTAs like “Submit” with action-driven ones like “Get My Discount” or “Complete My Order.”

Form Fields: Add clarifying text (e.g., “We’ll never spam you” under an email field) to ease friction.

Checkout Pages: Use trust-building phrases like “Secure Checkout” or “You can edit this later” to reduce cart abandonment.

Product Pages: Answer common objections subtly, e.g., “Why this product?” links or “Hurry! Only 3 left in stock.”

Before-and-After Microcopy Makeover

Before

Button: “Submit”

Form Field: “Email” (no explanation)

Checkout: No messaging about security or returns

After

Button: “Get My 10% Off!”

Form Field: “Email (We’ll send you order updates—no spam, promise!)”

Checkout: “Your payment is secure. Free returns within 30 days.”

Quick Win

Run an A/B test on one key piece of microcopy—like your checkout button. Swap “Submit” for something benefit-driven, and track how it impacts conversions. You’ll likely see that a small tweak makes a big difference.

Microcopy may be tiny but it packs a punch when optimizing your ecommerce store conversion strategy. So go ahead and give those little words some love!

3. Kill the Dead Ends: Smart 404 and Thank You Pages

Your 404 error pages and “Thank You” pages are basically the forgotten corners of your ecommerce store. 

They’re like the junk drawer in your kitchen…ignored until someone stumbles upon them and wonders why they even exist. But what if these “dead ends” could actually drive sales and engagement? 

Spoiler alert: they can.

The Problem

Most 404 pages look like this – “Oops! Page not found.” That’s it. No directions, no value, just a dead stop. 

And “Thank You” pages? They usually say something like, “Thanks for your order!” and leave it at that. Both are massive missed opportunities!

The Solution

Turn these neglected pages into high-performing engagement tools. Instead of letting visitors hit a wall, give them something valuable that keeps them in your store or brings them back later.

How to Optimize

For 404 Pages:

Add Helpful Links: Suggest popular categories or products to redirect lost visitors.

Throw in a Discount: “Lost? Let us make it up to you. Here’s 10% off!” works wonders for turning frustration into delight.

Make It Fun: Add some humor or creativity. Think quirky visuals or clever copy that aligns with your brand voice.

For Thank You Pages:

Upsell or Cross-Sell: Suggest complementary products. “Loved this? You might also like…” is a classic but effective approach.

Encourage Social Sharing: Add a share button with a pre-written message like, “Just snagged this amazing deal at [Store Name]!”

Loyalty Program Hook: Invite them to join your rewards program or offer points for their purchase.

Real-World Impact

A clothing brand revamped its 404 page by turning it into a product recommendation hub with a playful message like, “Lost your way? No worries, let’s get you back on track.” 

The result? A 30% decrease in bounce rate and a 15% lift in conversions from people who ended up buying redirected products.

Similarly, a DTC skincare brand added a referral program link to its Thank You page, leading to a 20% increase in new customers from referrals.

Quick Win

Add a “Continue Shopping” button and a product carousel to your 404 page. For your Thank You page, experiment with an exclusive post-purchase offer like, “Unlock 15% off your next order—valid for 24 hours!”

Who knew getting lost or saying thanks could be so profitable? A simple but brilliant ecommerce store conversion optimization tactic.

4. Leverage Exit-Intent Popups Wisely

Exit-intent popups get a bad rep. Why? 

Because most stores use them like a last-ditch “WAIT, DON’T LEAVE ME!” plea, throwing generic offers at visitors without rhyme or reason. 

But when done right, these popups can be awesome revenue drivers, capturing attention at the exact moment someone’s about to bounce and turning hesitation into action.

Why Most Stores Fail

You’re browsing a site for a new pair of running shoes. You hover near the top of the screen, and BOOM—a popup screams, “Sign up for our newsletter!” 

Uh, no thanks.

The issue? Poor timing and irrelevant offers. If your popup doesn’t match the visitor’s intent or interests, it’s just noise.

The Overlooked Approach

Dynamic exit-intent popups change the game. Instead of serving the same message to everyone, tailor your popups based on user behavior. 

For example:

Browsing a Specific Category? Offer a discount on items in that category.

About to Abandon a Full Cart? Highlight free shipping or an extra discount to seal the deal.

First-Time Visitor? Showcase an irresistible new-customer offer like “10% off your first order!”

How to Optimize Your Popups

Get the Timing Right: Use tools to detect exit intent—like when the cursor moves toward the browser bar.

Customize the Message: Match the popup content to the visitor’s behavior or location on your site.

Limit the Noise: Don’t overdo it—show popups strategically, not every time someone even thinks about leaving.

Quick Win

Try a tool like OptinMonster, Privy, or Sleeknote to create behavior-driven popups. Many of these platforms offer templates designed to fit different goals, like cart recovery or list building. 

Bonus: most of them come with A/B testing features, so you can experiment and see what works best.

Real-Life Example

A fitness apparel brand implemented dynamic exit popups that offered a free gift (like socks or a water bottle) with purchases over $50. Result? A 25% reduction in cart abandonment and a noticeable boost in average order value.

Stop begging your visitors to stay and start giving them a reason to stick around.

5. Bring Reviews and Social Proof Into the Spotlight

Shoppers don’t always trust you but they do trust other customers. 

That’s why reviews and social proof are essential for your ecommerce store. Unfortunately, too many brands treat them like an afterthought. 

Burying reviews at the bottom of a product page or making shoppers dig for them? Rookie move. Let’s fix that.

The Problem

Too many ecommerce stores fail to use reviews effectively. Either they’re hidden away in the depths of product pages, overwhelmed by outdated feedback, or worse, there’s no way to know if the reviewer actually bought the product. These missteps create doubt and cost you sales.

The Solution

Social proof should be front and center, especially at key decision-making moments. Shoppers want reassurance that they’re making a good choice, and well-placed reviews or testimonials provide that confidence boost.

How to Optimize Your Reviews

Showcase Reviews in Strategic Spots:

Add a review carousel on your homepage to highlight bestsellers or new arrivals.

Include top reviews on product recommendation widgets or upsell sections.

Show reviews directly in the cart (e.g., “Loved by 1,200+ happy customers!”).

Filter for Trustworthy Feedback:

Use tags like “Verified Buyer” or “Recent Purchase” to build credibility.

Highlight reviews that address common objections or showcase specific use cases.

Incorporate UGC (User-Generated Content):

Display customer photos and videos alongside text reviews to make the experience more authentic.

Use social media shoutouts or tagged photos to create a “community vibe.”

Pro Tip

Make reviews interactive. Allow customers to sort them by most helpful, newest, or even by product features (“Best for summer” or “Great for sensitive skin”). This makes it easier for shoppers to find the info they care about most.

Quick Win

Add a review summary at the top of your product pages. Include the average star rating, total number of reviews, and a snippet of a glowing testimonial. Bonus points if you make it clickable, so visitors can dive into the full review section with one click.

By putting social proof where it matters most, you’ll build trust, ease doubts, and give shoppers that extra nudge to hit “Buy Now.” It’s the ultimate ecommerce store conversion optimization strategy!

6. Invest in Killer Mobile Navigation

If your ecommerce store doesn’t work flawlessly on mobile, you’re practically begging customers to bounce. If shoppers can’t find what they need quickly and easily on their phones, they’re out and they’re not coming back.

Why Mobile-First Matters

Over 50% of ecommerce traffic now comes from mobile devices. 

Yet, many stores still cling to desktop-first designs, leaving mobile users frustrated with endless scrolling, tiny buttons, or menus that feel like a treasure hunt. Don’t let your store be part of the problem.

What to Fix

Simplify Menus:

Keep your menu short, sweet, and easy to navigate. Group items logically and avoid overwhelming shoppers with too many options.

Use expandable categories (think dropdowns or accordions) to save screen space.

Make CTAs Thumb-Friendly:

Ensure call-to-action (CTA) buttons are big enough to tap without fat-finger mishaps.

Position CTAs within thumb-reach zones, so users don’t have to perform digital gymnastics to click “Add to Cart.”

Add Sticky Headers:

Keep your navigation bar and search functionality accessible as users scroll. A sticky header means shoppers can quickly jump to other sections without having to scroll all the way back up.

Bonus Features to Wow Mobile Shoppers

Voice Search: Make life easy for multitaskers by adding voice search capabilities. Let’s face it, sometimes typing on a tiny screen is a pain.

Tap-to-Call Buttons: If your store involves services or high-ticket items, adding a “Tap to Call” button can be a game-changer for high-intent buyers.

Pro Tip

Test your mobile navigation on actual devices, not just simulators. Try using your site one-handed while walking – if it’s not intuitive, it’s time to rethink the design.

Quick Win

Run a heatmap test for mobile users to see where they’re tapping (or struggling). Use the insights to tweak your menu layout and button placement for a smoother experience.

Remember: if your navigation isn’t killing it, it’s killing your sales. Fix it, and watch your sales numbers soar.

7. Use Urgency Without the Desperation

You’ve seen it before, the “Only 1 left in stock!” warning plastered across every product page, only to see the same thing a week later. 

Shoppers aren’t stupid and overusing fake urgency tactics is just lazy. 

Real urgency, when used authentically, however, is a powerful tool that can nudge hesitant buyers into action without making them roll their eyes.

The Common Mistake

Many stores lean too hard on fake scarcity, like countdown timers for sales that magically reset or claims of “limited stock” that no one believes. 

These tactics can actually hurt your brand’s credibility.

What Works: Authentic Urgency

The secret to effective urgency is honesty. Instead of slapping urgency on everything, focus on real, time-sensitive opportunities:

Limited-Time Bundles: Create product bundles that are only available for a short period, like holiday-themed sets or exclusive collaborations.

Real-Time Stock Counters: Highlight genuine low-stock items with live updates (and make sure it’s actually accurate).

Shipping Cutoffs: Use clear messaging like “Order within 2 hours to get it by Friday!” to push fast decisions.

How to Do It Right

Be Transparent: If something is limited, explain why. For example: “This product is handcrafted, and only 100 are made each month.”

Focus on Value: Urgency should enhance the customer’s experience, not pressure them. Pair it with benefits, like exclusive discounts or freebies.

Use Visual Cues: Bright colors, timers, or banners can grab attention, but don’t go overboard. Subtlety matters.

Real-Life Examples

Patagonia: They use urgency for seasonal stock changes—“Last chance to grab winter gear before it’s gone!” feels honest and actionable.

Amazon: Real-time updates like “Only 3 left—order soon!” show shoppers exactly how fast inventory is moving.

Quick Win

Add a shipping deadline banner to your product pages during key shopping periods. Something as simple as “Order by 3 PM today for next-day delivery” can create a sense of urgency without feeling forced.

Urgency done right builds excitement, not skepticism. Keep it genuine.

8. Personalize the Journey (Without Being Creepy)

Nobody wants to feel like your store is watching them, but let’s be honest, shoppers do want a personalized experience. 

The trick is finding that sweet spot between “Wow, this is exactly what I need!” and “Why does this site know I was looking for socks at 2 AM?”

Personalization is all about serving relevant, helpful suggestions without being invasive.

What to Optimize

Dynamic Product Recommendations

Show “You May Also Like” or “Recently Viewed” sections tailored to individual browsing habits.

Highlight complementary products (“Bought running shoes? Check out these no-show socks!”).

Personalized Email Campaigns

Segment your audience by purchase history, behavior, or demographics to send hyper-relevant emails.

Include dynamic content like “We thought you’d love these!” with products tied to their preferences.

Location-Specific Promotions

Use geotargeting to display relevant offers. Example: Free shipping for local customers or promotions tied to regional holidays.

Optimize timing for campaigns based on time zones to catch customers when they’re most likely to shop.

Overlooked Tools to Make This Seamless

Klaviyo: Perfect for personalized email flows and behavioral targeting.

Nosto: Specialized in dynamic on-site personalization and product recommendations.

Justuno: Great for delivering targeted popups based on visitor behavior and demographics.

The Data Doesn’t Lie

According to studies, 80% of shoppers are more likely to buy from a brand offering personalized experiences, and personalized product recommendations can boost average order value by up to 30%.

Pro Tip

Use personalization to solve pain points. For example, if a customer abandoned their cart, send a follow-up email with their specific items plus a discount or free shipping offer to bring them back.

Remember, the goal isn’t to be a mind reader. It’s to create an experience that feels effortless and intuitive. 

9. Simplify Your Checkout Flow

The longer and more complicated your checkout process, the fewer customers actually complete it. 

If you want shoppers to follow through, you need to make paying as easy as possible.

Hidden Blockers That Tank Conversions

Too Many Form Fields: Do you really need their middle name, date of birth, or pet’s favorite snack? No. Trim the fat.

Forced Account Creation: Nothing screams “abandon cart” louder than “Sign up before you can buy.”

Poor Payment Options: Limited or outdated payment methods alienate customers who expect flexibility.

Fixes to Streamline Your Checkout

Autofill Capabilities

Enable address autofill using tools like Google Places API.

Save returning customer information so they don’t have to re-enter details.

Express Checkout Buttons

Offer PayPal, Apple Pay, Google Pay, or Shop Pay for one-click purchases.

Highlight these options early in the process to cater to impatient shoppers.

Clear Error Messaging

Use real-time error detection with specific instructions (e.g., “Your ZIP code must be 5 digits”).

Prevent frustration by showing errors as they type, not after they hit submit.

Progress Indicators

Show shoppers exactly where they are in the process with a simple step-by-step progress bar.

Real-Life Example: The Checkout Glow-Up

An online pet supply store reduced its checkout fields from 15 to 7 and added express checkout buttons. They also removed mandatory account creation, offering a “Guest Checkout” option instead. 

The result? A 25% increase in completed purchases within the first month.

Quick Win

Audit your current checkout flow. Remove any unnecessary steps and add at least one express payment option. Even small tweaks, like reordering fields for logical flow, can make a big difference in optimizing your ecommerce store conversion flow.

10. Gamify the Shopping Experience

When shopping is fun, people spend more. That’s the magic of gamification. 

By turning your store into a mini game with rewards, progress bars, and milestones, you can keep shoppers engaged, excited, and coming back for more.

Why Gamification Works

It’s all about dopamine! Gamification taps into our brain’s reward system, giving shoppers a little hit of satisfaction every time they achieve something – even something as minor as unlocking free shipping or leveling up in a loyalty program. 

This sense of accomplishment keeps them hooked and encourages more spending.

Fresh Gamification Ideas

Progress Bars for Freebies

Show a dynamic progress bar on the cart page: “You’re $12 away from free shipping!”

Create goals for discounts, like “Spend $50 to unlock 10% off your next order.”

Loyalty Program Milestones

Use tiers or levels (e.g., “Bronze, Silver, Gold”) to encourage repeat purchases.

Offer exclusive perks for higher tiers, like early access to new products or bigger discounts.

Spin-to-Win Popups

Add a fun element to email sign-ups or promotions with a wheel-of-fortune style popup offering discounts, free shipping, or gifts.

Limited-Time Challenges

Run gamified promotions, like “Buy two items today and get a mystery gift!”

Incorporate countdown timers to add urgency.

Tools to Implement Gamification

Smile.io: Build loyalty programs with tiers, points, and rewards.

Gameball: Create personalized customer journeys and achievements.

WooCommerce Points and Rewards: Perfect for setting up gamified point systems for repeat purchases.

Wheelio: Easy spin-to-win popups to engage shoppers.

Real-Life Example

A cosmetics brand introduced a progress bar on its cart page with the message: “Add $20 more to unlock a free gift!” This simple addition increased their average order value by 18%. 

Quick Win

Add a “Spend $X more for free shipping” progress bar to your cart or checkout page. It’s an easy way to increase order value without feeling pushy.

Start Optimizing Your Ecommerce Store Conversions Today

Your ecommerce store doesn’t need a total overhaul to start seeing better conversions. Sometimes, it’s the little tweaks, the ones everyone else forgets about, that make the biggest difference.

We’ve covered 10 ecommerce store conversion optimization tactics that go beyond the basics. From making your 404 page actually useful to turning shopping into a dopamine-packed game, these strategies are here to boost your sales without the stress.

But don’t overthink it. 

Pick one or two ideas that stood out to you. Maybe it’s sprucing up your microcopy or adding a progress bar for free shipping – test them out. Small changes add up fast, and the sooner you start, the sooner you’ll see results.

Oh, and let’s keep this fun. If you try something and it works (or even if it flops spectacularly), I want to hear about it. Share your wins, your fails, or your own genius ideas with us on social. Let’s swap notes and help each other crush it.

Now go out there and make your store the conversion powerhouse it’s meant to be! And don’t forget to start your free Customers.ai trial today and get 500 contacts free!



The post 10 Overlooked Ecommerce Store Conversion Optimizations You Need to Try Now appeared first on Customers.ai.

Hugging Face Releases SmolVLM: A 2B Parameter Vision-Language Model fo …

In recent years, there has been a growing demand for machine learning models capable of handling visual and language tasks effectively, without relying on large, cumbersome infrastructure. The challenge lies in balancing performance with resource requirements, particularly for devices like laptops, consumer GPUs, or mobile devices. Many vision-language models (VLMs) require significant computational power and memory, making them impractical for on-device applications. Models such as Qwen2-VL, although performant, require expensive hardware and substantial GPU RAM, limiting their accessibility and practicality for real-time, on-device tasks. This has created a need for lightweight models that can provide strong performance with minimal resources.

Hugging Face recently released SmolVLM, a 2B parameter vision-language model specifically designed for on-device inference. SmolVLM outperforms other models with comparable GPU RAM usage and token throughput. The key feature of SmolVLM is its ability to run effectively on smaller devices, including laptops or consumer-grade GPUs, without compromising performance. It strikes a balance between performance and efficiency that has been difficult to attain in models of similar size and capability. Compared with Qwen2-VL 2B, SmolVLM generates tokens 7.5 to 16 times faster thanks to an optimized architecture that favors lightweight inference. This efficiency translates into practical advantages for end users.

Technical Overview

From a technical standpoint, SmolVLM has an optimized architecture that enables efficient on-device inference. It can be fine-tuned easily using Google Colab, making it accessible for experimentation and development even to those with limited resources. It is lightweight enough to run smoothly on a laptop or process millions of documents using a consumer GPU. One of its main advantages is its small memory footprint, which makes it feasible to deploy on devices that could not handle similarly sized models before. The efficiency is evident in its token generation throughput: SmolVLM produces tokens 7.5 to 16 times faster than Qwen2-VL. This performance gain is primarily due to SmolVLM's streamlined architecture that optimizes image encoding and inference speed. Even though it has the same number of parameters as Qwen2-VL, SmolVLM's efficient image encoding prevents it from overloading devices—an issue that frequently causes Qwen2-VL to crash systems like the MacBook Pro M3.
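As a rough idea of what on-device usage looks like, here is a minimal inference sketch with the transformers library. The checkpoint name and chat-template usage follow the pattern published for SmolVLM but should be treated as assumptions; check the model card for the exact API.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("example.jpg")  # any local image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image briefly."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])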

The significance of SmolVLM lies in its ability to provide high-quality visual-language inference without the need for powerful hardware. This is an important step for researchers, developers, and hobbyists who wish to experiment with vision-language tasks without investing in expensive GPUs. In tests conducted by the team, SmolVLM demonstrated its efficiency when evaluated with 50 frames from a YouTube video, producing results that justified further testing on CinePile, a benchmark that assesses a model’s ability to understand cinematic visuals. The results showed SmolVLM scoring 27.14%, placing it between two more resource-intensive models: InternVL2 (2B) and Video LlaVa (7B). Notably, SmolVLM wasn’t trained on video data, yet it performed comparably to models designed for such tasks, demonstrating its robustness and versatility. Moreover, SmolVLM achieves these efficiency gains while maintaining accuracy and output quality, highlighting that it is possible to create smaller models without sacrificing performance.

Conclusion

In conclusion, SmolVLM represents a significant advancement in the field of vision-language models. By enabling complex VLM tasks to be run on everyday devices, Hugging Face has addressed an important gap in the current landscape of AI tools. SmolVLM competes well with other models in its class and often surpasses them in terms of speed, efficiency, and practicality for on-device use. With its compact design and efficient token throughput, SmolVLM will be a valuable tool for those needing robust vision-language processing without access to high-end hardware. This development has the potential to broaden the use of VLMs, making sophisticated AI systems more accessible. As AI becomes more personalized and ubiquitous, models like SmolVLM pave the way for making powerful machine learning accessible to a wider audience.

Check out the Models on Hugging Face, Details, and Demo. All credit for this research goes to the researchers of this project.

The post Hugging Face Releases SmolVLM: A 2B Parameter Vision-Language Model for On-Device Inference appeared first on MarkTechPost.

Anthropic Open Sourced Model Context Protocol (MCP): Transforming AI I …

Anthropic has open-sourced the Model Context Protocol (MCP), a major step toward improving how AI systems connect with real-world data. By providing a universal standard, MCP simplifies the integration of AI with data sources, enabling smarter, more context-aware responses and making AI systems more effective and accessible.

Despite remarkable advances in AI’s reasoning capabilities and response quality, even the most sophisticated models struggle to operate effectively when isolated from real-world data. Each new integration between AI systems and data repositories often necessitates bespoke, labor-intensive implementations, limiting scalability and efficiency. Recognizing this bottleneck, Anthropic developed MCP as a universal, open standard to connect AI systems to data sources, replacing fragmented integrations with a streamlined protocol. This innovation promises a more reliable and efficient mechanism for AI systems to access the necessary data.

The MCP is designed to provide developers with tools for building secure, two-way connections between data repositories and AI-powered applications. Its architecture is flexible yet straightforward: data can be exposed through MCP servers, while AI applications, known as MCP clients, connect to these servers to access and utilize the data.
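As a sketch of what this looks like in practice, the short server below assumes the MCP Python SDK's FastMCP helper; the server name, tool, and data source are purely illustrative and are not part of Anthropic's released connectors.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("notes-search")  # the server name shown to MCP clients

@mcp.tool()
def search_notes(query: str) -> str:
    """Return lines from a local notes file that match the query."""
    with open("notes.txt", encoding="utf-8") as f:
        hits = [line.strip() for line in f if query.lower() in line.lower()]
    return "\n".join(hits) or "No matches found."

if __name__ == "__main__":
    # Serves over stdio so an MCP client such as Claude Desktop can launch and call it
    mcp.run()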

Anthropic has introduced three core components to facilitate the adoption of MCP:

The MCP Specification and SDKs: These resources provide detailed guidelines and software development kits for implementing MCP.

Local MCP Server Support: This feature, integrated into Claude Desktop apps, enables developers to experiment with local MCP server configurations.

Open-Source Repository: Anthropic has released pre-built MCP servers compatible with popular systems such as Google Drive, Slack, GitHub, and Postgres, simplifying the process for organizations to connect their data with AI tools.

Several organizations have already embraced MCP. Companies like Block and Apollo have integrated the protocol into their systems, and development tool providers such as Zed, Replit, Codeium, and Sourcegraph are leveraging MCP to enhance their platforms. These collaborations underscore MCP’s potential to make AI tools more context-aware, especially in complex environments like coding. By enabling AI agents to retrieve relevant data and comprehend contextual nuances, MCP is helping developers produce more functional and efficient code with fewer iterations.

The enthusiasm for MCP among early adopters reflects its transformative potential. Dhanji R. Prasanna, Chief Technology Officer at Block, emphasized the importance of open technologies like MCP in fostering innovation and collaboration. He remarked, “Open technologies like the Model Context Protocol are the bridges that connect AI to real-world applications, ensuring innovation is accessible, transparent, and rooted in collaboration.”

With MCP's open standard, developers no longer need to maintain separate connectors for each data source. Instead, they can build against a universal protocol, significantly reducing complexity and fostering sustainability. As MCP's ecosystem grows, AI systems will be able to maintain context across diverse datasets and tools, eliminating the fragmentation that plagues current integrations.

Developers are encouraged to explore MCP through various avenues:

Installing pre-built MCP servers via the Claude Desktop app.

Following the quickstart guide to build their first MCP server.

Contributing to the open-source repositories of connectors and implementations.

Anthropic’s decision to open-source MCP reflects its commitment to fostering an inclusive and collaborative ecosystem. The company invites AI developers, enterprises, and innovators to join in shaping the future of context-aware AI. By building on a shared foundation, MCP aims to create a robust network of tools and protocols that will empower AI applications to interact seamlessly with the systems and data they need.

In conclusion, Anthropic’s open-sourcing of the Model Context Protocol represents a paradigm shift in how AI systems interact with data. MCP can transform AI applications across industries by addressing critical integration challenges and providing a universal standard. Its success will depend on continued collaboration, innovation, and community engagement, but the groundwork laid by Anthropic positions MCP as a cornerstone for the next generation of AI technologies.

Check out the Details and Documentation. All credit for this research goes to the researchers of this project.

The post Anthropic Open Sourced Model Context Protocol (MCP): Transforming AI Integration with Universal Data Connectivity for Smarter, Context-Aware, and Scalable Applications Across Industries appeared first on MarkTechPost.

This AI Paper Introduces HARec: A Hyperbolic Framework for Balancing E …

Recommender systems are essential in modern digital platforms, enabling personalized user experiences by predicting preferences based on interaction data. These systems help users navigate the vast online content by suggesting relevant items critical to addressing information overload. By analyzing user-item interactions, they generate recommendations that aim to be accurate and diverse. However, as the digital ecosystem evolves, so do user preferences, underscoring the need for methods that adapt to these changes while promoting personalization and diversity.

One major challenge in recommendation systems is the tendency to create information cocoons, where users are repeatedly exposed to similar content, limiting their exploration of new or diverse options. Balancing the exploration of fresh, unexpected items with the exploitation of known user preferences is complex but necessary. This balance requires sophisticated models capable of simultaneously managing hierarchical structures inherent in user-item relationships and aligning semantic relationships from textual data. Existing approaches, though effective to some extent, need more adaptability to address these intricacies.

Current methodologies include collaborative filtering, which predicts preferences from user interaction data, and hyperbolic geometric models, which excel at capturing hierarchical relationships. Collaborative filtering is limited by its inability to integrate semantic insights from textual descriptions. Hyperbolic models, while addressing some hierarchical challenges, struggle with semantic alignment because they rely on Euclidean encoders for text data. These limitations reduce the models' robustness, adaptability, and ability to enhance diversity in recommendations.

The researchers, associated with Snap Inc., Yale University, and the University of Hong Kong, introduced HARec, a hyperbolic representation learning framework designed to tackle these challenges. HARec innovatively combines hyperbolic geometry with graph neural networks (GNNs) and large language models (LLMs). Using a hierarchical tree structure, HARec allows users to customize the balance between exploration and exploitation in recommendations. This user-adjustable mechanism ensures a dynamic and tailored approach, setting HARec apart from traditional systems.

HARec’s methodology is a comprehensive blend of hyperbolic graph collaborative filtering and semantic embedding integration. The framework begins by generating hyperbolic embeddings for user-item interactions using a Lorentz representation model, which excels at modeling tree-like, hierarchical structures. These embeddings are aligned with semantic embeddings derived from textual descriptions through pre-trained LLMs such as BERT. The semantic data undergoes dimensional adjustment and is projected into hyperbolic space to align with collaborative embeddings. This alignment is crucial to integrating both semantic and hierarchical insights seamlessly.
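As a small, self-contained illustration of the geometry involved (not HARec's actual code), the snippet below computes distances in the Lorentz model of hyperbolic space, where points satisfy <x, x>_L = -1/c and the distance is acosh(-c <x, y>_L) / sqrt(c).

import torch

def lorentz_inner(x, y):
    # Minkowski inner product: the time-like first coordinate enters with a minus sign
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(dim=-1)

def lorentz_distance(x, y, c=1.0):
    inner = torch.clamp(-c * lorentz_inner(x, y), min=1.0 + 1e-6)
    return torch.acosh(inner) / c ** 0.5

def project_to_hyperboloid(v, c=1.0):
    # Lift a Euclidean vector onto the hyperboloid by solving for the time coordinate
    x0 = torch.sqrt(1.0 / c + (v * v).sum(dim=-1, keepdim=True))
    return torch.cat([x0, v], dim=-1)

users = project_to_hyperboloid(0.1 * torch.randn(4, 16))
items = project_to_hyperboloid(0.1 * torch.randn(4, 16))
print(lorentz_distance(users, items))  # smaller distance suggests stronger predicted affinity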

Further, the hierarchical tree structure organizes user-item preferences into layers, with higher layers representing broader interests and lower layers focusing on specific preferences. This setup facilitates dynamic navigation through user preferences. Exploration and exploitation are managed via parameters controlling the degree of recommendation diversity. For instance, temperature and hierarchy level parameters allow users to determine how many recommendations should include novel or familiar content. This flexibility empowers users to influence the trade-off between diversity and specificity in recommendations.

Extensive experiments validated HARec’s effectiveness. Using datasets like Amazon books, Yelp, and Google reviews, the researchers measured utility and diversity metrics, demonstrating HARec’s superiority over existing models. In utility metrics, HARec achieved a Recall@20 score of 16.82% for Amazon books, outperforming the best baseline (11.13%) by a significant margin. Similarly, the NDCG@20 score reached 10.69%, reflecting its ability to prioritize relevant recommendations effectively. Regarding diversity, HARec marked an 11.39% improvement in metrics such as Shannon Entropy and Expected Popularity Complement, highlighting its capability to enhance recommendation variety.

Further analysis showed HARec’s strength in addressing the cold-start problem, which affects items with limited interaction data. HARec demonstrated a performance boost of over 14% for tail items in Recall@20 compared to baseline hyperbolic models, underscoring its ability to incorporate semantic alignment effectively. The researchers also conducted ablation studies to evaluate individual components of the framework. Results indicated that removing either the hyperbolic margin ranking loss or the semantic alignment loss significantly reduced the model’s utility metrics, proving the necessity of these innovations.

HARec represents a substantial advancement in recommender systems by addressing the dual challenges of exploration and exploitation. Its integration of hyperbolic space and semantic alignment offers a novel solution to hierarchical modeling and semantic understanding. The user-adjustable framework ensures adaptability and relevance, making HARec a versatile tool in personalized recommendation systems. By achieving state-of-the-art results in both accuracy and diversity, HARec sets a new benchmark for balancing user preferences and exploration in digital content platforms.


Unleash your Salesforce data using the Amazon Q Salesforce Online conn …

Thousands of companies worldwide use Salesforce to manage their sales, marketing, customer service, and other business operations. The Salesforce cloud-based platform centralizes customer information and interactions across the organization, providing sales reps, marketers, and support agents with a unified 360-degree view of each customer. With Salesforce at the heart of their business, companies accumulate vast amounts of customer data within the platform over time. This data is incredibly valuable for gaining insights into customers, improving operations, and guiding strategic decisions. However, accessing and analyzing the blend of structured data and unstructured data can be challenging. With the Amazon Q Salesforce Online connector, companies can unleash the value of their Salesforce data.
Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely take actions based on data and information in your enterprise systems. It empowers employees to be more data-driven, efficient, prepared, and productive.
Amazon Q Business offers pre-built connectors for over 40 data sources, including Amazon Simple Storage Service (Amazon S3), Microsoft SharePoint, Salesforce, Google Drive, Atlassian Confluence, Atlassian Jira, and many more. For a full list of data source connectors, see Amazon Q Business connectors.
In this post, we walk you through configuring and setting up the Amazon Q Salesforce Online connector.
Overview of the Amazon Q Salesforce Online connector
Amazon Q Business supports its own index where you can add and sync documents. Amazon Q connectors make it straightforward to synchronize data from multiple content repositories with your Amazon Q index. You can set up connectors to automatically sync your index with your data source based on a schedule, so you’re always securely searching through up-to-date content.
The Amazon Q Salesforce Online connector provides a simple, seamless integration between Salesforce and Amazon Q. With a few clicks, you can securely connect your Salesforce instance to Amazon Q and unlock a robust self-service conversational AI assistant for your Salesforce data.
The following diagram illustrates this architecture.

Types of documents
When you connect Amazon Q Business to a data source like Salesforce, what Amazon Q considers and crawls as a document varies by connector type.
The Amazon Q Salesforce Online connector crawls and indexes the following content types:

Account
Campaign
Case
Chatter
Contact
Contract
Custom object
Document
Group
Idea
Knowledge articles
Lead
Opportunity
Partner
Pricebook
Product
Profile
Solution
Task
User

The Amazon Q Salesforce Online connector also supports field mappings to enrich index data with additional field data. Field mappings allow you to map Salesforce field names to Amazon Q index field names. This includes both default field mappings created automatically by Amazon Q, and custom field mappings that you can create and edit.
Authentication
The Amazon Q Salesforce Online connector supports OAuth 2.0 with the Resource Owner Password Flow.
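If you want to verify your connected app credentials before configuring the connector, you can exercise this flow directly against Salesforce. The following is a minimal sketch for a production org (sandboxes use test.salesforce.com); values in angle brackets are placeholders, and Salesforce expects the password field to be the user’s password concatenated with the security token.

# Salesforce username-password (resource owner password) token request
curl -X POST https://login.salesforce.com/services/oauth2/token \
  -d "grant_type=password" \
  -d "client_id=<consumer key>" \
  -d "client_secret=<consumer secret>" \
  -d "username=<user name>" \
  -d "password=<password><security token>"

A successful response returns an access_token, which confirms that the consumer key, consumer secret, user name, password, and security token you provide to the connector are valid.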
ACL crawling
To securely index documents, the Amazon Q Salesforce Online connector supports crawling access control lists (ACLs) with role hierarchy by default. With ACL crawling, the information can be used to filter chat responses to your end-user’s document access level. You can apply ACL-based chat filtering using Salesforce standard objects and chatter feeds. ACL-based chat filtering isn’t available for Salesforce knowledge articles.
If you index documents without ACLs, all documents are considered public. If you want to index documents without ACLs, make sure the documents are marked as public in your data source.
Solution overview
In this post, we guide you through connecting an existing Amazon Q application to Salesforce Online. You configure authentication, map fields, sync data between Salesforce and Amazon Q, and then deploy your AI assistant using the Amazon Q web experience.
We also demonstrate how to use Amazon Q to have a conversation about Salesforce accounts, opportunities, tasks, and other supported data types.
Prerequisites
You need the following prerequisites:

Authentication – If you are creating a Salesforce connected application for the first time, refer to Connected Apps. The following information is required for Salesforce OAuth 2.0 authentication:

Salesforce authentication URL
User name
Password
Security token
Consumer key
Consumer secret

AWS IAM Identity Center – AWS IAM Identity Center should be set up by your administrator for groups or users who will get access to the Amazon Q AI assistant. To add users to IAM Identity Center, refer to Add users to your Identity Center directory.
Amazon Q application – This post assumes that an Amazon Q Business application has already been created. If you haven’t created an application, refer to Build private and secure enterprise generative AI apps with Amazon Q Business and AWS IAM Identity Center to create and configure an Amazon Q Business application.

Set up Salesforce authentication
To set up authentication and allow external programs to Salesforce, complete the following steps to configure your connected application settings:

In Salesforce, in the Quick Find box, search and choose App Manager.
Choose New Connected App.
For Connected App Name, enter a name.
For API name, enter an API name used when referring to the connected application.
Enter your contact email address and phone.
If you are using OAuth, select the right scope for OAuth.

Choose Save and wait for the connected application to be created.
On the Connected Apps page, select the application, and on the drop-down menu, choose View.
On the details page, next to Consumer Key and Secret, choose Manage Consumer Details.

Copy the consumer key (client ID) and consumer secret; you will use them when configuring the Amazon Q connector.
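Optionally, if you prefer to create the secret yourself rather than using the Create and add new secret option shown later in the connector setup, you can store these credentials in AWS Secrets Manager with the AWS CLI. The following is a sketch only; the JSON field names are illustrative placeholders, and the console-generated secret defines the exact structure the connector expects.

# Illustrative only: the field names below are placeholders, not the connector's required schema
aws secretsmanager create-secret \
  --name QBusiness-Salesforce-secret \
  --secret-string '{
    "authenticationUrl": "https://login.salesforce.com/services/oauth2/token",
    "userName": "<user name>",
    "password": "<password>",
    "securityToken": "<security token>",
    "consumerKey": "<consumer key>",
    "consumerSecret": "<consumer secret>"
  }'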

Set up the Amazon Q Salesforce Online connector
Complete the following steps to set up the Amazon Q Salesforce Online connector:

On the Amazon Q Business console, choose Applications in the navigation pane.
Select your application and on the Actions menu, choose Edit.

On the Update application page, leave settings as default and choose Update.
On the Update retriever page, leave settings as default and choose Update.
On the Connect data sources page, on the All tab, search for Salesforce.
Choose the plus sign for the Salesforce Online connector.

In the Name and description section, enter a name and description.
In the Source section, for Salesforce URL, enter your Salesforce server URL in the format https://yourcompany.my.salesforce.com/

In the Authentication section, choose Create and add new secret.
Enter the Salesforce connected application authentication information and choose Save.

In the IAM role section, choose Create a new service role (recommended).

In the Sync scope section, select All standard objects.

If you choose to sync only specific objects, then select each object type accordingly.

In the Sync mode section, select New, modified, or deleted content sync.

Under Sync run schedule, choose the desired frequency. For testing purposes, we choose Run on demand.

Choose Add data source and wait for the connector to be created.
After the Salesforce connector is created, you’re redirected back to the Connect data sources page, where you can add additional data sources if needed.
Choose Next.
On the Update groups and users page, assign users or groups from IAM Identity Center set up by your administrator. Optionally, if you have permissions to add new users, you can select Add new users.
Choose Next.

Choose a user or group from the list to give them access to the Amazon Q web experience.
Choose Done.

Choose Update application to complete setting up the Salesforce data connector for Amazon Q Business.

Additional Salesforce field mappings
When you connect Amazon Q to a data source, Amazon Q automatically maps specific data source document attributes to fields within an Amazon Q index. If a document attribute in your data source doesn’t have an attribute mapping already available, or if you want to map additional document attributes to index fields, use the custom field mappings to specify how a data source attribute maps to an Amazon Q index field. You create field mappings by editing your data source after your application and retriever are created.
To update the field mapping, complete the following steps:

On the Amazon Q console, navigate to your Amazon Q application.
Under Data sources, select your data source and on the Actions menu, choose Edit.

In the Field mappings section, find the item that you want to add fields to and choose Add field. (For this post, we add the postalCode field to Lead.)
Add any other fields that you want to be included in the Amazon Q index and then choose Update.

The setup process is complete.

In the application details, choose Sync now to start the Amazon Q crawling and indexing process.

The initial sync may take a few minutes to get started.
When the sync process is complete, you can see a summary of ingested data on the connector’s Sync history tab. Check Total items scanned and Added to confirm that the right number of documents are included in the index.

Mapping custom fields
Salesforce allows you to store your unique business data by creating and using custom fields. When you need to fetch a custom field to generate answers, additional steps are needed for mapping and crawling the field. For example, knowledge articles in Salesforce use custom fields to store content of articles.
Make sure the initial sync process for the connector is complete. On the initial sync, the connector gets a list of all fields and objects in Salesforce, which is needed for custom fields mapping.
Complete the following steps to index contents of knowledge articles:

Navigate to Salesforce Setup and search and open Object Manager.
In Object Manager, choose the Knowledge object.

In the Fields & Relationships section, find the field name (for this example, we’re looking for Article Body and the field name is Article_Body__c) and record this field name.

On the Amazon Q Business console, navigate back to your application and choose Data sources in the navigation pane.
Select the Salesforce data source and on the Actions menu, choose Edit.

In the Field mappings section, under Knowledge Articles, choose Add field.
For Salesforce field name, enter Article_Body__c and map it to _document_body for Index field name.
Select your object type.
Choose Update to save the changes.

Return to the Data sources page of the application and choose Sync now.

When the sync process is complete, you can chat with the Salesforce data source about the default fields and also the Salesforce custom field that you added.

Talk with your Salesforce data using the Amazon Q web experience
When the synchronization process is complete, you can start using the Amazon Q web experience. To access the Amazon Q application UI, select your application and choose Customize web experience, which opens a preview of the UI and options to customize it.

You can customize the values for Title, Subtitle, and Welcome message in the UI. After you make changes, choose Save and then choose View web experience.

After signing in, you can start chatting with your generative AI assistant. To verify answers, check the citation links included in the answers. If you need to improve answers, add more details and context to the questions.

The results aren’t limited to cases and activities. You can also include other objects, like knowledge bases. If a field isn’t included in the default mapped fields, you can still add it in the retriever settings and update the content index.
Let’s look at opportunities in Salesforce for a specific company and ask Amazon Q about these opportunities.

After opportunities, check a sample knowledge article from Salesforce.

When you chat with Amazon Q, you can see the exact article is referenced as the primary source.

As you can see, each answer has a thumbs up/thumbs down button to provide feedback. Amazon Q uses this feedback to improve responses for all your organization users.
Metadata fields
In Salesforce, document metadata refers to the information that describes the properties and characteristics of documents stored in Salesforce. The Amazon Q data source connector crawls relevant metadata or attributes associated with a document. To use metadata search, go to the Amazon Q application page and choose Metadata controls in the navigation pane. Select the metadata fields that are needed, for instance sf_subject and sf_status. This allows you to ask metadata lookup queries such as “Summarize case titled as supply chain vendors cost optimization” or “Give me status of case with subject as cloud modernization project.” Here, the sf_status and sf_subject metadata fields will be used to query and generate the relevant answer.

Frequently asked questions
In this section, we discuss some frequently asked questions.
Amazon Q Business is unable to answer your questions
If you get the response “Sorry, I could not find relevant information to complete your request,” this may be due to a few reasons:

No permissions – ACLs applied to your account don’t allow you to query certain data sources. If this is the case, reach out to your application administrator to make sure your ACLs are configured to access the data sources.
Data connector sync failed – Your data connector may have failed to sync information from the source to the Amazon Q Business application. Verify the data connector’s sync run schedule and sync history to confirm the sync is successful.
No subscriptions – Make sure that logged-in users have a subscription for Amazon Q.

If none of these reasons apply to your use case, open a support case and work with your technical account manager to get this resolved.
Custom fields aren’t showing up in fields mappings
A custom fields list is retrieved after the initial full synchronization. After a successful synchronization, you can add field mappings for custom fields.
Clean up
To prevent incurring additional costs, clean up and remove the resources created during the implementation of this solution. Specifically, delete the Amazon Q application, which also removes the associated index and data connectors. However, any AWS Identity and Access Management (IAM) roles and secrets created during the Amazon Q application setup need to be removed separately, or they may result in ongoing charges.
Complete the following steps to delete the Amazon Q application, secret, and IAM role:

On the Amazon Q Business console, select the application that you created.
On the Actions menu, choose Delete and confirm the deletion.
On the Secrets Manager console, select the secret that was created for the connector.
On the Actions menu, choose Delete.
Set the waiting period as 7 days and choose Schedule deletion.

On the IAM console, select the role that was created during the Amazon Q application creation.
Choose Delete and confirm the deletion.

Conclusion
In this post, we provided an overview of the Amazon Q Salesforce Online connector and how you can use it for a safe and seamless integration of generative AI assistance with Salesforce. By using a single interface for the variety of data sources in the organization, you can enable employees to be more data-driven, efficient, prepared, and productive.
To learn more about the Amazon Q Salesforce Online connector, refer to Connecting Salesforce Online to Amazon Q Business.

About the Author
Mehdy Haghy is a Senior Solutions Architect at the AWS WWCS team, specializing in AI and ML on AWS. He works with enterprise customers, helping them migrate, modernize, and optimize their workloads for the AWS Cloud. In his spare time, he enjoys cooking Persian food and tinkering with circuit boards.

Reducing hallucinations in large language models with custom intervent …

Hallucinations in large language models (LLMs) refer to the phenomenon where the LLM generates an output that is plausible but factually incorrect or made-up. This can occur when the model’s training data lacks the necessary information or when the model attempts to generate coherent responses by making logical inferences beyond its actual knowledge. Hallucinations arise because of the inherent limitations of the language modeling approach, which aims to produce fluent and contextually appropriate text without necessarily ensuring factual accuracy.
Remediating hallucinations is crucial for production applications that use LLMs, particularly in domains where incorrect information can have serious consequences, such as healthcare, finance, or legal applications. Unchecked hallucinations can undermine the reliability and trustworthiness of the system, leading to potential harm or legal liabilities. Strategies to mitigate hallucinations can include rigorous fact-checking mechanisms, integrating external knowledge sources using Retrieval Augmented Generation (RAG), applying confidence thresholds, and implementing human oversight or verification processes for critical outputs.
RAG is an approach that aims to reduce hallucinations in language models by incorporating the capability to retrieve external knowledge and making it part of the prompt that’s used as input to the model. The retriever module is responsible for retrieving relevant passages or documents from a large corpus of textual data based on the input query or context. The retrieved information is then provided to the LLM, which uses this external knowledge in conjunction with prompts to generate the final output. By grounding the generation process in factual information from reliable sources, RAG can reduce the likelihood of hallucinating incorrect or made-up content, thereby enhancing the factual accuracy and reliability of the generated responses.
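The following is a minimal, generic sketch of this retrieve-then-generate pattern, not the implementation used later in this post; the toy lexical retriever and the prompt wording are illustrative.

# Generic RAG sketch: retrieve relevant passages, then ground the prompt in them
from typing import List

def retrieve(query: str, corpus: List[str], top_k: int = 2) -> List[str]:
    """Toy lexical retriever: rank passages by word overlap with the query."""
    query_terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: len(query_terms & set(p.lower().split())), reverse=True)
    return ranked[:top_k]

def build_grounded_prompt(query: str, passages: List[str]) -> str:
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

corpus = [
    "Amazon Bedrock is a fully managed service for building generative AI applications.",
    "Amazon Bedrock Agents orchestrate multistep tasks using foundation models.",
    "Amazon S3 is an object storage service.",
]
question = "What is Amazon Bedrock?"
prompt = build_grounded_prompt(question, retrieve(question, corpus))
# The grounded prompt is what gets sent to the LLM in place of the raw question
print(prompt)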
Amazon Bedrock Guardrails offer hallucination detection with contextual grounding checks, which can be seamlessly applied using Amazon Bedrock APIs (such as Converse or InvokeModel) or embedded into workflows. After an LLM generates a response, these workflows perform a check to see if hallucinations occurred. This setup can be achieved through Amazon Bedrock Prompt Flows or with custom logic using AWS Lambda functions. Customers can also do batch evaluation with human reviewers using Amazon Bedrock model evaluation’s human-based evaluation feature. However, these are static workflows; updating the hallucination detection logic requires modifying the entire workflow, which limits adaptability.
To address this need for flexibility, Amazon Bedrock Agents enables dynamic workflow orchestration. With Amazon Bedrock Agents, organizations can implement scalable, customizable hallucination detection that adjusts to specific needs. New detection techniques and additional API calls can be incorporated without restructuring the entire workflow, and the LLM decides the plan of action used to orchestrate it.
In this post, we will set up our own custom agentic AI workflow using Amazon Bedrock Agents to intervene when LLM hallucinations are detected and route the user query to customer service agents through a human-in-the-loop process. Imagine this to be a simpler implementation of calling a customer service agent when the chatbot is unable to answer the customer query. The chatbot is based on a RAG approach, which reduces hallucinations to a large extent, and the agentic workflow provides a customizable mechanism in how to measure, detect, and mitigate hallucinations that might occur.
Agentic workflows offer a new approach to building dynamic and complex business workflows, with LLMs serving as the reasoning engine or brain. These workflows decompose natural language tasks into multiple actionable steps with iterative feedback loops and self-reflection, using tools and APIs to produce the final result.
Amazon Bedrock Agents helps accelerate generative AI application development by orchestrating multistep tasks. Amazon Bedrock Agents uses the reasoning capability of LLMs to break down user-requested tasks into multiple steps. They use the given instruction to create an orchestration plan and then carry out the plan by invoking company APIs or accessing knowledge bases using RAG to provide a final response to the user. This offers tremendous use case flexibility, enables dynamic workflows, and reduces development cost. Amazon Bedrock Agents is instrumental in customizing applications to help meet specific project requirements while protecting private data and helping to secure applications. These agents work with AWS managed infrastructure capabilities such as Lambda and Amazon Bedrock, reducing infrastructure management overhead. Additionally, agents streamline workflows and automate repetitive tasks. With the power of AI automation, you can boost productivity and reduce costs.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
Use case overview
In this post, we add our own custom intervention to a RAG-powered chatbot in the event that hallucinations are detected. We use Retrieval Augmented Generation Assessment (RAGAS) metrics, such as answer correctness and answer relevancy, to develop a custom hallucination score. If the hallucination score for a particular LLM response is less than a custom threshold, it indicates that the generated model response is not well aligned with the ground truth. In this situation, we notify a pool of human agents through an Amazon Simple Notification Service (Amazon SNS) notification to assist with the query instead of providing the customer with the hallucinated LLM response.
The RAG-based chatbot we use ingests the Amazon Bedrock User Guide to assist customers on queries related to Amazon Bedrock.
Dataset
The dataset used in the notebook is the latest Amazon Bedrock User Guide PDF file, which is publicly available to download. Alternatively, you can use other PDFs of your choice to create the knowledge base from scratch and use it in this notebook.
If you use a custom PDF, you will need to curate a supervised dataset of ground truth answers to multiple questions to test this approach. The custom hallucination detector uses RAGAS metrics, which are generated using a CSV file containing question-answer pairs. For custom PDFs, it is necessary to replace this CSV file and re-run the notebook for a different dataset.
In addition to the dataset in the notebook, we ask the agent multiple questions, a few of them from the PDF and a few not part of the PDF. The ground truth answers are manually curated based on the PDF contents if relevant.

Prerequisites
To run this solution in your AWS account, complete the following prerequisites:

Clone the GitHub repository and follow the steps explained in the README.
Set up an Amazon SageMaker notebook on an ml.t3.medium Amazon Elastic Compute Cloud (Amazon EC2) instance.
Acquire access to models hosted on Amazon Bedrock. Choose Manage model access in the navigation pane of the Amazon Bedrock console and choose from the list of available options. We use Anthropic’s Claude v3 (Sonnet) on Amazon Bedrock and Amazon Titan Embeddings Text v2 on Amazon Bedrock for this post.

Implement the solution
The following illustrates the solution architecture:

Architecture Diagram for Custom Hallucination Detection and Mitigation

The overall workflow involves the following steps:

Data ingestion involving raw PDFs stored in an Amazon Simple Storage Service (Amazon S3) bucket synced as a data source with Amazon Bedrock Knowledge Bases.
User asks questions relevant to the Amazon Bedrock User Guide, which are handled by an Amazon Bedrock agent that is set up to handle user queries.

User query: What models are supported by bedrock agents?

The agent creates a plan and identifies the need to use a knowledge base. It then sends a request to the knowledge base, which retrieves relevant data from the underlying vector database. The agent retrieves an answer through RAG using the following steps:

The search query is directed to the vector database (Amazon OpenSearch Serverless).
Relevant answer chunks are retrieved.
The knowledge base response is generated from the retrieved answer chunks and sent back to the agent.

Generated Answer: Amazon Bedrock supports foundation models from various providers including Anthropic (Claude models), AI21 Labs (Jamba models), Cohere (Command models), Meta (Llama models), Mistral AI

The user query and knowledge base response are used together to invoke the correct action group.
The user question and knowledge base response are passed as inputs to a Lambda function that calculates a hallucination score.

The generated answer contains some correct and some incorrect information, because it picks up general Amazon Bedrock model support rather than Amazon Bedrock Agents-specific model support. Therefore, a hallucination is detected, with a score of 0.4.

An SNS notification is sent if the answer score is lower than the custom threshold.

Because the answer score (0.4) is below the hallucination threshold (0.9), the SNS notification is triggered.

If the answer score is higher than the custom threshold, the hallucination detector set up in Lambda responds with a final knowledge base response. Otherwise, it returns a pre-defined response asking the user to wait until a customer service agent joins the conversation shortly.

Customer service human agent queue is notified and the next available agent joins or emails back if it is an offline response mechanism.

The final agent response is shown in the chatbot UI (user interface).

In the GitHub repository notebook, we cover the following learning objectives:

Measure and detect hallucinations with an agentic AI workflow that can notify humans in the loop to remediate hallucinations when they are detected.
Custom hallucination detector with pre-defined thresholds based on select evaluation metrics in RAGAS.
To remediate, we will send an SNS notification to the customer service queue and wait for a human to help us with the question.

Step 1: Setting up Amazon Bedrock Knowledge Bases with Amazon Bedrock Agents
In this section, we will integrate Amazon Bedrock Knowledge Bases with Amazon Bedrock Agents to create a RAG workflow. RAG systems use external knowledge sources to augment the LLM’s output, improving factual accuracy and reducing hallucinations. We create the agent with the following high-level instruction encouraging it to take a question-answering role.

agent_instruction = """
You are a question answering agent that helps customers answer questions from the Amazon Bedrock User Guide inside the associated knowledge base.
Next you will always use the knowledge base search result to detect and measure any hallucination using the functions provided.
"""

Step 2: Invoke Amazon Bedrock Agents with user questions about Amazon Bedrock documentation
We are using a supervised dataset with predefined questions and ground truth answers to invoke Amazon Bedrock Agents which triggers the custom hallucination detector based on the agent response from the knowledge base. In the notebook, we demonstrate how the answer score based on RAGAS metrics can notify a human customer service representative if it does not meet a pre-defined custom threshold score.
We use RAGAS metrics such as answer correctness and answer relevancy to determine the custom threshold score. Depending on the use case and dataset, the list of applicable RAGAS metrics can be customized accordingly.
To change the threshold score, you can modify the measure_hallucination() method inside the Lambda function lambda_hallucination_detection().
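As a rough sketch of what such a scoring function might look like (this is not the repository’s actual Lambda code), the following combines the RAGAS answer correctness and answer relevancy metrics into a single score and compares it against the custom threshold. The column names, the averaging choice, and the evaluate() configuration vary across ragas versions and are illustrative.

# Illustrative sketch of a RAGAS-based hallucination score; not the repository's code
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, answer_relevancy

HALLUCINATION_THRESHOLD = 0.9  # custom threshold used in this post

def measure_hallucination(question: str, answer: str, contexts: list, ground_truth: str) -> float:
    data = Dataset.from_dict({
        "question": [question],
        "answer": [answer],
        "contexts": [contexts],        # list of retrieved text chunks
        "ground_truth": [ground_truth],
    })
    # evaluate() also needs an evaluator LLM and embeddings configured
    # (this post uses Amazon Bedrock models); omitted here for brevity.
    row = evaluate(data, metrics=[answer_correctness, answer_relevancy]).to_pandas().iloc[0]
    # Averaging the two metrics is an illustrative choice for the combined score.
    return float((row["answer_correctness"] + row["answer_relevancy"]) / 2)

# score = measure_hallucination(question, agent_answer, retrieved_chunks, ground_truth)
# hallucination_detected = score < HALLUCINATION_THRESHOLD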
The agent is prompted with the following template. The user_question in the template is iterated from the supervised dataset CSV file that contains the question and ground truth answers.

USER_PROMPT_TEMPLATE = """Question: {user_question}

Given an input question, you will search the Knowledge Base on Amazon Bedrock User Guide to answer the user question.
If the knowledge base search results do not return any answer, you can try answering it to the best of your ability, but do not answer anything you do not know. Do not hallucinate.
Using this knowledge base search result you will ALWAYS execute the appropriate action group API to measure and detect the hallucination on that knowledge base search result.

Remove any XML tags from the knowledge base search results and final user response.

Some samples for `user_question` parameter:

What models are supported by bedrock agents?
Which models can I use with Amazon Bedrock Agents?
Which are the dates for reinvent 2024?
What is Amazon Bedrock?
"""

Step 3: Trigger human-in-the-loop in case of hallucination
If the custom hallucination score threshold is not met by the agent response, a human in the loop is notified using SNS notifications. These notifications can be sent to the customer service representative queue or Amazon Simple Queue Service (Amazon SQS) queues for email and text notifications. These representatives can respond to the email (offline) or ongoing chat (online) based on their training, knowledge of the system, and additional resources. The exact flow depends on the specific product workflow design.
To view the actual SNS messages sent out, we can view the latest Lambda AWS CloudWatch logs following the instructions as given in viewing CloudWatch logs for Lambda functions. You can search for the string Received SNS message :: inside the CloudWatch logs for the Lambda function LambdaAgentsHallucinationDetection().
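A minimal sketch of this notification branch (not the repository’s exact Lambda code) might look like the following; the SNS topic ARN environment variable name is an illustrative assumption.

# Illustrative human-in-the-loop branch; the environment variable name is a placeholder
import json
import os

import boto3

sns = boto3.client("sns")
HALLUCINATION_THRESHOLD = 0.9

def route_response(user_question: str, kb_response: str, score: float) -> str:
    if score >= HALLUCINATION_THRESHOLD:
        return kb_response  # answer is well grounded; return it to the user
    # Otherwise, notify the customer service queue and hold the LLM answer back
    sns.publish(
        TopicArn=os.environ["CUSTOMER_SERVICE_TOPIC_ARN"],
        Subject="Possible hallucination detected",
        Message=json.dumps({"question": user_question, "score": score}),
    )
    return "Please wait while a customer service agent joins the conversation shortly."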
Cost considerations
The following are important cost considerations:

This current implementation has no separate charges for building resources using Amazon Bedrock Knowledge Bases or Amazon Bedrock Agents.
You will incur charges for the embedding model and text model invocation on Amazon Bedrock. For more details, see Amazon Bedrock pricing.
You will incur charges for Amazon S3 and vector database usage. For more details, see Amazon S3 pricing and Amazon OpenSearch Service pricing, respectively.

Clean up
To avoid incurring unnecessary costs, the implementation has the option to clean up resources after an entire run of the notebook. You can check the instructions in the cleanup_infrastructure() method for how to avoid the automatic cleanup and experiment with different prompts and datasets.
The order of resource cleanup is as follows:

Disable the action group.
Delete the action group.
Delete the alias.
Delete the agent.
Delete the Lambda function.
Empty the S3 bucket.
Delete the S3 bucket.
Delete AWS Identity and Access Management (IAM) roles and policies.
Delete the vector DB collection policies.
Delete the knowledge bases.

Key considerations
Amazon Bedrock Agents can increase overall latency compared to using just Amazon Bedrock Guardrails and Amazon Bedrock Prompt Flows. This is a trade-off between LLM-generated workflows and static or deterministic workflows: with agents, the LLM generates the workflow orchestration in real time using the available knowledge bases, tools, and APIs, whereas with prompt flows and guardrails, the workflow has to be orchestrated and designed offline.
For evaluation, we chose the LLM-based evaluation framework RAGAS, but you can swap out the components in the hallucination detection Lambda function for another framework.
Conclusion
This post demonstrated how to use Amazon Bedrock Agents, Amazon Bedrock Knowledge Bases, and the RAGAS evaluation metrics to build a custom hallucination detector and remediate detected hallucinations with a human-in-the-loop process. The agentic workflow can be extended to custom use cases through different hallucination remediation techniques and offers the flexibility to detect and mitigate hallucinations using custom actions.
For more information on creating agents to orchestrate workflows, see Amazon Bedrock Agents. To learn about multiple RAGAS metrics for LLM evaluations see RAGAS: Getting Started.

About the Authors
Shayan Ray is an Applied Scientist at Amazon Web Services. His area of research is all things natural language (like NLP, NLU, and NLG). His work has been focused on conversational AI, task-oriented dialogue systems, and LLM-based agents. His research publications are on natural language processing, personalization, and reinforcement learning.
Bharathi Srinivasan is a Generative AI Data Scientist at AWS WWSO where she works building solutions for Responsible AI challenges. She is passionate about driving business value from machine learning applications by addressing broad concerns of Responsible AI. Outside of building new AI experiences for customers, Bharathi loves to write science fiction and challenge herself with endurance sports.

Deploy Meta Llama 3.1-8B on AWS Inferentia using Amazon EKS and vLLM

With the rise of large language models (LLMs) like Meta Llama 3.1, there is an increasing need for scalable, reliable, and cost-effective solutions to deploy and serve these models. AWS Trainium and AWS Inferentia based instances, combined with Amazon Elastic Kubernetes Service (Amazon EKS), provide a performant and low cost framework to run LLMs efficiently in a containerized environment.
In this post, we walk through the steps to deploy the Meta Llama 3.1-8B model on Inferentia 2 instances using Amazon EKS.
Solution overview
The steps to implement the solution are as follows:

Create the EKS cluster.
Set up the Inferentia 2 node group.
Install the Neuron device plugin and scheduling extension.
Prepare the Docker image.
Deploy the Meta Llama 3.1-8B model.

We also demonstrate how to test the solution and monitor performance, and discuss options for scaling and multi-tenancy.
Prerequisites
Before you begin, make sure you have the following utilities installed on your local machine or development environment. If you don’t have them installed, follow the instructions provided for each tool.

The AWS Command Line Interface (AWS CLI) installed
eksctl
kubectl
docker

In this post, the examples use an inf2.48xlarge instance; make sure you have a sufficient service quota to use this instance. For more information on how to view and increase your quotas, refer to Amazon EC2 service quotas.
Create the EKS cluster
If you don’t have an existing EKS cluster, you can create one using eksctl. Adjust the following configuration to suit your needs, such as the Amazon EKS version, cluster name, and AWS Region. Before running the following commands, make sure you are authenticated with AWS:

export AWS_REGION=us-east-1
export CLUSTER_NAME=my-cluster
export EKS_VERSION=1.30
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

Then complete the following steps:

Create a new file named eks_cluster.yaml with the following command:

cat > eks_cluster.yaml <<EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: $CLUSTER_NAME
  region: $AWS_REGION
  version: "$EKS_VERSION"

addons:
- name: vpc-cni
  version: latest

cloudWatch:
  clusterLogging:
    enableTypes: ["*"]

iam:
  withOIDC: true
EOF

This configuration file contains the following parameters:

metadata.name – Specifies the name of your EKS cluster, which is set to my-cluster in this example. You can change it to a name of your choice.
metadata.region – Specifies the Region where you want to create the cluster. In this example, it resolves to us-east-1 through the AWS_REGION variable. Change this to your desired Region. Because we’re using Inf2 instances, you should choose a Region where those instances are available.
metadata.version – Specifies the Kubernetes version to use for the cluster. In this example, it’s set to 1.30. You can change this to a different version if needed, but make sure to use a version that is supported by Amazon EKS. For a list of supported versions, see Review release notes for Kubernetes versions on standard support.
addons.vpc-cni – Specifies the version of the Amazon VPC CNI (Container Network Interface) add-on to use. Setting it to latest will install the latest available version.
cloudWatch.clusterLogging – Enables cluster logging, which sends logs from the control plane to Amazon CloudWatch Logs.
iam.withOIDC – Enables the OpenID Connect (OIDC) provider for the cluster, which is required for certain AWS services to interact with the cluster.

After you create the eks_cluster.yaml file, you can create the EKS cluster by running the following command:

eksctl create cluster --config-file eks_cluster.yaml

This command will create the EKS cluster based on the configuration specified in the eks_cluster.yaml file. The process will take approximately 15–20 minutes to complete.
During the cluster creation process, eksctl will also create a default node group with a recommended instance type and configuration. However, in the next section, we create a separate node group with Inf2 instances, specifically for running the Meta Llama 3.1-8B model.

To complete the setup of kubectl, run the following code:

aws eks update-kubeconfig --region $AWS_REGION --name $CLUSTER_NAME

Set up the Inferentia 2 node group
To run the Meta Llama 3.1-8B model, you’ll need to create an Inferentia 2 node group. Complete the following steps:

First, retrieve the latest Amazon EKS optimized accelerated AMI ID:

export ACCELERATED_AMI=$(aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/$EKS_VERSION/amazon-linux-2-gpu/recommended/image_id \
  --region $AWS_REGION \
  --query "Parameter.Value" \
  --output text)

Create the Inferentia 2 node group using eksctl:

cat > eks_nodegroup.yaml <<EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: $CLUSTER_NAME
  region: $AWS_REGION
  version: "$EKS_VERSION"

managedNodeGroups:
- name: neuron-group
  instanceType: inf2.48xlarge
  desiredCapacity: 1
  volumeSize: 512
  ami: "$ACCELERATED_AMI"
  amiFamily: AmazonLinux2
  iam:
    attachPolicyARNs:
    - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
    - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
  overrideBootstrapCommand: |
    #!/bin/bash
    /etc/eks/bootstrap.sh $CLUSTER_NAME
EOF

Run eksctl create nodegroup --config-file eks_nodegroup.yaml to create the node group.

This will take approximately 5 minutes.
Install the Neuron device plugin and scheduling extension
To set up your EKS cluster for running workloads on Inferentia chips, you need to install two key components: the Neuron device plugin and the Neuron scheduling extension.
The Neuron device plugin is essential for exposing Neuron cores and devices as resources in Kubernetes. The Neuron scheduling extension facilitates the optimal scheduling of pods requiring multiple Neuron cores or devices.
For detailed instructions on installing and verifying these components, refer to Kubernetes environment setup for Neuron. Following these instructions will help you make sure your EKS cluster is properly configured to schedule and run workloads that require worker nodes, such as the Meta Llama 3.1-8B model.
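As an optional check (assuming the device plugin installed successfully), you can confirm that the Inferentia node advertises Neuron cores, the aws.amazon.com/neuroncore resource requested later in the deployment, as allocatable resources:

# List the Inf2 nodes, then inspect one of them for Neuron capacity and allocatable resources
kubectl get nodes -l node.kubernetes.io/instance-type=inf2.48xlarge -o name
kubectl describe node <inf2-node-name> | grep -i neuron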
Prepare the Docker image
To run the model, you’ll need to prepare a Docker image with the required dependencies. We use the following code to create an Amazon Elastic Container Registry (Amazon ECR) repository and then build a custom Docker image based on the AWS Deep Learning Container (DLC).

Set up environment variables:

export ECR_REPO_NAME=vllm-neuron

Create an ECR repository:

aws ecr create-repository --repository-name $ECR_REPO_NAME --region $AWS_REGION

Although the base Docker image already includes TorchServe, to keep things simple, this implementation uses the server provided by the vLLM repository, which is based on FastAPI. In your production scenario, you can connect TorchServe to vLLM with your own custom handler.

Create the Dockerfile:

cat > Dockerfile <<EOF
FROM public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04
# Clone the vllm repository
RUN git clone https://github.com/vllm-project/vllm.git
# Set the working directory
WORKDIR /vllm
RUN git checkout v0.6.0
# Set the environment variable
ENV VLLM_TARGET_DEVICE=neuron
# Install the dependencies
RUN python3 -m pip install -U -r requirements-neuron.txt
RUN python3 -m pip install .
# Modify the arg_utils.py file to support larger block_size option
RUN sed -i "/parser.add_argument('--block-size',/ {N;N;N;N;N;s/\[8, 16, 32\]/\[8, 16, 32, 128, 256, 512, 1024, 2048, 4096, 8192\]/}" vllm/engine/arg_utils.py
# Install ray
RUN python3 -m pip install ray
RUN python3 -m pip install -U "triton>=3.0.0"
# Set the entry point
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
EOF

Use the following commands to authenticate Docker to your ECR registry, build your Docker image, and push it to the repository you created. The account ID and Region are dynamically set using AWS CLI commands, making the process more flexible and avoiding hard-coded values.

# Authenticate Docker to your ECR registry
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
# Build the Docker image
docker build -t ${ECR_REPO_NAME}:latest .

# Tag the image
docker tag ${ECR_REPO_NAME}:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/${ECR_REPO_NAME}:latest
# Push the image to ECR
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/${ECR_REPO_NAME}:latest

Deploy the Meta Llama 3.1-8B model
With the setup complete, you can now deploy the model using a Kubernetes deployment. The following is an example deployment specification that requests specific resources and sets up multiple replicas:

cat > neuronx-vllm-deployment.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: neuronx-vllm-deployment
  labels:
    app: neuronx-vllm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: neuronx-vllm
  template:
    metadata:
      labels:
        app: neuronx-vllm
    spec:
      schedulerName: my-scheduler
      containers:
      - name: neuronx-vllm
        image: <replace with the url to the docker image you pushed to the ECR>
        resources:
          limits:
            cpu: 32
            memory: "64G"
            aws.amazon.com/neuroncore: "8"
          requests:
            cpu: 32
            memory: "64G"
            aws.amazon.com/neuroncore: "8"
        ports:
        - containerPort: 8000
        env:
        - name: HF_TOKEN
          value: <your huggingface token>
        - name: FI_EFA_FORK_SAFE
          value: "1"
        args:
        - "--model"
        - "meta-llama/Meta-Llama-3.1-8B"
        - "--tensor-parallel-size"
        - "8"
        - "--max-num-seqs"
        - "64"
        - "--max-model-len"
        - "8192"
        - "--block-size"
        - "8192"
EOF

Apply the deployment specification with kubectl apply -f neuronx-vllm-deployment.yaml.
This deployment configuration sets up multiple replicas of the Meta Llama 3.1-8B model using tensor parallelism (TP) of 8. In the current setup, we’re hosting three copies of the model across the available Neuron cores. This configuration allows for the efficient utilization of the hardware resources while enabling multiple concurrent inference requests.
The use of TP=8 helps in distributing the model across multiple Neuron cores, which improves inference performance and throughput. The specific number of replicas and cores used may vary depending on your particular hardware setup and performance requirements.
To modify the setup, update the neuronx-vllm-deployment.yaml file, adjusting the replicas field in the deployment specification and the aws.amazon.com/neuroncore resource requests in the container specification. Always verify that the total number of cores used (replicas * cores per replica) doesn’t exceed your available hardware resources and that the number of attention heads is evenly divisible by the TP degree for optimal performance.
The deployment also includes environment variables for the Hugging Face token and EFA fork safety. The args section (see the preceding code) configures the model and its parameters, including an increased max model length and block size of 8192.
Test the deployment
After you deploy the model, it’s important to monitor its progress and verify its readiness. Complete the following steps:

Check the deployment status:

kubectl get deployments

This will show you the desired, current, and up-to-date number of replicas.

Monitor the pods:

kubectl get pods -l app=neuronx-vllm -w

The -w flag will watch for changes. You’ll see the pods transitioning from “Pending” to “ContainerCreating” to “Running”.

Check the logs of a specific pod:

kubectl logs <pod-name>

The initial startup process takes around 15 minutes. During this time, the model is being compiled for the Neuron cores. You’ll see the compilation progress in the logs.
To support proper management of your vLLM pods, you should configure Kubernetes probes in your deployment. These probes help Kubernetes determine when a pod is ready to serve traffic, when it’s alive, and when it has successfully started.

Add the following probe configurations to your container spec in the deployment YAML:

spec:
  containers:
  - name: neuronx-vllm
    # ... other container configurations ...
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      periodSeconds: 15
    startupProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      failureThreshold: 30
      periodSeconds: 10

The configuration comprises three probes:

Readiness probe – Checks if the pod is ready to serve traffic. With the values shown, checks begin after the 1,800-second initial delay and repeat every 10 seconds.
Liveness probe – Verifies if the pod is still running correctly. It uses the same 1,800-second initial delay and checks every 15 seconds.
Startup probe – Gives the application time to start up before the other probes take effect. With a failureThreshold of 30 and a periodSeconds of 10, it allows roughly 5 additional minutes of failed checks after the initial delay before the pod is considered failed.

These probes assume that your vLLM application exposes a /health endpoint. If it doesn’t, you’ll need to implement one or adjust the probe configurations accordingly.
With these probes in place, Kubernetes will do the following:

Only send traffic to pods that are ready
Restart pods that are no longer alive
Allow sufficient time for initial startup and compilation

This configuration helps facilitate high availability and proper functioning of your vLLM deployment.
Now you’re ready to access the pods.

Identify the pod that is running your inference server. You can use the following command to list the pods with the neuronx-vllm label:

kubectl get pods -l app=neuronx-vllm

This command will output a list of pods, and you’ll need the name of the pod you want to forward.

Use kubectl port-forward to forward the port from the Kubernetes pod to your local machine. Use the name of your pod from the previous step:

kubectl port-forward <pod-name> 8000:8000

This command forwards port 8000 on the pod to port 8000 on your local machine. You can now access the inference server at http://localhost:8000.
Because we’re forwarding a port directly from a single pod, requests will only be sent to that specific pod. As a result, traffic won’t be balanced across all replicas of your deployment. This is suitable for testing and development purposes, but it doesn’t utilize the deployment efficiently in a production scenario where load balancing across multiple replicas is crucial to handle higher traffic and provide fault tolerance.
In a production environment, a proper solution like a Kubernetes service with a LoadBalancer or Ingress should be used to distribute traffic across available pods. This facilitates the efficient utilization of resources, a balanced load, and improved reliability of the inference service.
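As a sketch of that production-style alternative (the Service name and port choice are illustrative), a Kubernetes Service of type LoadBalancer that targets the deployment’s pod label might look like the following:

cat > neuronx-vllm-service.yaml <<EOF
apiVersion: v1
kind: Service
metadata:
  name: neuronx-vllm-service
spec:
  type: LoadBalancer
  selector:
    app: neuronx-vllm
  ports:
  - port: 80
    targetPort: 8000
    protocol: TCP
EOF

kubectl apply -f neuronx-vllm-service.yaml

This distributes incoming requests across all ready replicas instead of pinning traffic to a single pod.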

You can test the inference server by making a request from your local machine. The following code is an example of how to make an inference call using curl:

curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B",
"prompt": "Explain the theory of relativity.",
"max_tokens": 100
}'

This setup allows you to test and interact with your inference server locally without needing to expose your service publicly or set up complex networking configurations. For production use, make sure that load balancing and scalability considerations are addressed appropriately.
For more information about routing, see Route application and HTTP traffic with Application Load Balancers.
Monitor performance
AWS offers powerful tools to monitor and optimize your vLLM deployment on Inferentia chips. The AWS Neuron Monitor container, used with Prometheus and Grafana, provides advanced visualization of your ML application performance. Additionally, CloudWatch Container Insights for Neuron offers deep, Neuron-specific analytics.
These tools allow you to track Inferentia chip utilization, model performance, and overall cluster health. By analyzing this data, you can make informed decisions about resource allocation and scaling to meet your workload requirements.
Remember that the initial 15-minute startup time for model compilation is a one-time process per deployment, with subsequent restarts being faster due to caching.
To learn more about setting up and using these monitoring capabilities, see Scale and simplify ML workload monitoring on Amazon EKS with AWS Neuron Monitor container.
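As an optional quick check, and assuming the Neuron tools (such as neuron-ls and neuron-top) are available in the container image, you can also inspect Neuron device allocation and utilization from inside a running pod:

# Assumes the Neuron SDK tools are installed in the image; adjust if they are not
kubectl exec -it <pod-name> -- neuron-ls
kubectl exec -it <pod-name> -- neuron-top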
Scaling and multi-tenancy
As your application’s demand grows, you may need to scale your deployment to handle more requests. Scaling your Meta Llama 3.1-8B deployment on Amazon EKS with Neuron cores involves two coordinated steps:

Increasing the number of nodes in your EKS node group to provide additional Neuron cores
Increasing the number of replicas in your deployment to utilize these new resources

You can scale your deployment manually. Use the AWS Management Console or AWS CLI to increase the size of your EKS node group. When new nodes are available, scale your deployment with the following code:

kubectl scale deployment neuronx-vllm-deployment --replicas=<new-number>

Alternatively, you can set up auto scaling:

Configure auto scaling for your EKS node group to automatically add nodes based on resource demands
Use Horizontal Pod Autoscaling (HPA) to automatically adjust the number of replicas in your deployment (a minimal sketch follows this list)
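The following is a minimal HPA sketch; the CPU utilization target, replica bounds, and scale-down stabilization window are illustrative, and the Kubernetes Metrics Server must be installed in the cluster. In practice, the replica ceiling is bounded by the Neuron cores available in your node group.

cat > neuronx-vllm-hpa.yaml <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: neuronx-vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: neuronx-vllm-deployment
  minReplicas: 3
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 900  # avoid thrashing given the long model compilation time
EOF

kubectl apply -f neuronx-vllm-hpa.yaml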

You can configure the node group’s auto scaling to respond to increased CPU, memory, or custom metric demands, automatically provisioning new nodes with Neuron cores as needed. This makes sure that as the number of incoming requests grows, both your infrastructure and your deployment can scale accordingly.
Example scaling solutions include:

Cluster Autoscaler with Karpenter – Though not currently installed in this setup, Karpenter offers more flexible and efficient auto scaling for future consideration. It can dynamically provision the right number of nodes needed for your Neuron workloads based on pending pods and custom scheduling constraints. For more details, see Scale cluster compute with Karpenter and Cluster Autoscaler.
Multi-cluster federation – For even larger scale, you could set up multiple EKS clusters, each with its own Neuron-equipped nodes, and use a multi-cluster federation tool to distribute traffic among them.

You should consider the following when scaling:

Alignment of resources – Make sure that your scaling strategy for both nodes and pods aligns with the Neuron core requirements (multiples of 8 for optimal performance). This is model dependent and unique for the Meta Llama 3.1 model.
Compilation time – Remember the 15-minute compilation time for new pods when planning your scaling strategy. Consider pre-warming pods during off-peak hours.
Cost management – Monitor costs closely as you scale, because Neuron-equipped instances can be expensive.
Performance testing – Conduct thorough performance testing as you scale to verify that increased capacity translates to improved throughput and reduced latency.

By coordinating the scaling of both your node group and your deployment, you can effectively handle increased request volumes while maintaining optimal performance. The auto scaling capabilities of both your node group and deployment can work together to automatically adjust your cluster’s capacity based on incoming request volumes, providing a more responsive and efficient scaling solution.
Clean up
Use the following code to delete the cluster created in this solution:

eksctl delete cluster --name $CLUSTER_NAME --region $AWS_REGION

Conclusion
Deploying LLMs like Meta Llama 3.1-8B at scale poses significant computational challenges. Using Inferentia 2 instances and Amazon EKS can help overcome these challenges by enabling efficient model deployment in a containerized, scalable, and multi-tenant environment.
This solution combines the exceptional performance and cost-effectiveness of Inferentia 2 chips with the robust and flexible landscape of Amazon EKS. Inferentia 2 chips deliver high throughput and low latency inference, ideal for LLMs. Amazon EKS provides dynamic scaling, efficient resource utilization, and multi-tenancy capabilities.
The process involves setting up an EKS cluster, configuring an Inferentia 2 node group, installing Neuron components, and deploying the model as a Kubernetes pod. This approach facilitates high availability, resilience, and efficient resource sharing for language model services, while allowing for automatic scaling, load balancing, and self-healing capabilities.
For the complete code and detailed implementation steps, visit the GitHub repository.

About the Authors
Dmitri Laptev is a Senior GenAI Solutions Architect at AWS, based in Munich. With 17 years of experience in the IT industry, his interest in AI and ML dates back to his university years, fostering a long-standing passion for these technologies. Dmitri is enthusiastic about cloud computing and the ever-evolving landscape of technology.
Maurits de Groot is a Solutions Architect at Amazon Web Services, based out of Amsterdam. He specializes in machine learning-related topics and has a predilection for startups. In his spare time, he enjoys skiing and bouldering.
Ziwen Ning is a Senior Software Development Engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with kickboxing, badminton, and other various sports, and immersing himself in music.
Jianying Lang is a Principal Solutions Architect at the AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI fields. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining the techniques in HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.