Fine-tune Llama 3 for text generation on Amazon SageMaker JumpStart

Generative artificial intelligence (AI) models have become increasingly popular and powerful, enabling a wide range of applications such as text generation, summarization, question answering, and code generation. However, despite their impressive capabilities, these models often struggle with domain-specific tasks or use cases due to their general training data. To address this challenge, fine-tuning these models on specific data is crucial for achieving optimal performance in specialized domains.
In this post, we demonstrate how to fine-tune the recently released Llama 3 models from Meta, specifically the llama-3-8b and llama-3-70b variants, using Amazon SageMaker JumpStart. The fine-tuning process is based on the scripts provided in the llama-recipes repo from Meta, utilizing techniques like PyTorch FSDP, PEFT/LoRA, and Int8 quantization for efficient fine-tuning of these large models on domain-specific datasets.
By fine-tuning the Meta Llama 3 models with SageMaker JumpStart, you can harness their improved reasoning, code generation, and instruction following capabilities tailored to your specific use cases.
Meta Llama 3 overview
Meta Llama 3 comes in two parameter sizes, 8B and 70B, both with an 8,000-token context length, and supports a broad range of use cases with improvements in reasoning, code generation, and instruction following. Meta Llama 3 uses a decoder-only transformer architecture and a new tokenizer with a 128K-token vocabulary that provides improved model performance. In addition, Meta improved post-training procedures that substantially reduced false refusal rates, improved alignment, and increased diversity in model responses. You can now derive the combined advantages of Meta Llama 3 performance and MLOps controls with Amazon SageMaker features such as Amazon SageMaker Pipelines and Amazon SageMaker Debugger. In addition, the model will be deployed in an AWS secure environment under your virtual private cloud (VPC) controls, helping provide data security.
SageMaker JumpStart
SageMaker JumpStart is a powerful feature within the SageMaker machine learning (ML) environment that provides ML practitioners a comprehensive hub of publicly available and proprietary foundation models (FMs). With this managed service, ML practitioners get access to a growing list of cutting-edge models from leading model hubs and providers that they can deploy to dedicated SageMaker instances within a network isolated environment, and customize models using SageMaker for model training and deployment.
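If you prefer to explore the catalog programmatically, the SageMaker Python SDK exposes a helper for listing JumpStart model IDs. The following is a minimal sketch; the local substring filter for Llama 3 models is just one convenient way to narrow the list.

from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# List every JumpStart model ID, then keep only the Llama 3 text generation models
all_model_ids = list_jumpstart_models()
llama3_model_ids = [m for m in all_model_ids if "llama-3" in m]
print(llama3_model_ids)
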
Prerequisites
To try out this solution using SageMaker JumpStart, you’ll need the following prerequisites:

An AWS account that will contain all of your AWS resources.
An AWS Identity and Access Management (IAM) role to access SageMaker. To learn more about how IAM works with SageMaker, refer to Identity and Access Management for Amazon SageMaker.
Access to Amazon SageMaker Studio or a SageMaker notebook instance, or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
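If you plan to follow along with the SDK examples later in this post, the following sketch shows a typical session setup. It assumes you run inside SageMaker Studio or a notebook instance; from a local IDE, substitute the ARN of an IAM role that has SageMaker permissions (the placeholder ARN below is illustrative).

import sagemaker
from sagemaker.session import Session

session = Session()
try:
    # Works inside SageMaker Studio and SageMaker notebook instances
    role = sagemaker.get_execution_role()
except ValueError:
    # From a local IDE, supply your own SageMaker execution role ARN
    role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"
print(f"Using role {role} in region {session.boto_region_name}")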

Fine-tune Meta Llama 3 models
In this section, we discuss the steps to fine-tune Meta Llama 3 models. We’ll cover two approaches: using the SageMaker Studio UI for a no-code solution, and utilizing the SageMaker Python SDK.
No-code fine-tuning through the SageMaker Studio UI
SageMaker JumpStart provides access to publicly available and proprietary foundation models from third-party and proprietary providers. Data scientists and developers can quickly prototype and experiment with various ML use cases, accelerating the development and deployment of ML applications. It helps reduce the time and effort required to build ML models from scratch, allowing teams to focus on fine-tuning and customizing the models for their specific use cases. These models are released under different licenses designated by their respective sources. It’s essential to review and adhere to the applicable license terms before downloading or using these models to make sure they’re suitable for your intended use case.
You can access the Meta Llama 3 FMs through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we cover how to discover these models in SageMaker Studio.
SageMaker Studio is an IDE that offers a web-based visual interface for performing the ML development steps, from data preparation to model building, training, and deployment. For instructions on getting started and setting up SageMaker Studio, refer to Amazon SageMaker Studio.
When you’re in SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane.

In the JumpStart view, you’re presented with the list of public models offered by SageMaker. You can explore other models from other providers in this view. To start using the Meta Llama 3 models, under Providers, choose Meta.

You’re presented with a list of the models available. Choose the Meta-Llama-3-8B-Instruct model.

Here you can view the model details, as well as train, deploy, optimize, and evaluate the model. For this demonstration, we choose Train.

On this page, you can point to the Amazon Simple Storage Service (Amazon S3) bucket containing the training and validation datasets for fine-tuning. In addition, you can set the deployment configuration, hyperparameters, and security settings for fine-tuning. Choose Submit to start the training job on a SageMaker ML instance.
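Although this path is no-code, the training and validation datasets still need to be in Amazon S3 first. The following sketch shows one way to stage a locally prepared file with the SageMaker Python SDK; the bucket and prefix names are placeholders.

import sagemaker

session = sagemaker.Session()
# Upload a locally prepared training file to the default SageMaker bucket
train_data_location = session.upload_data(
    path="train.jsonl",                    # local dataset file
    bucket=session.default_bucket(),       # or your own bucket name
    key_prefix="llama3-finetuning/train",  # illustrative prefix
)
print(train_data_location)  # pass this S3 URI to the Studio UI or to fit()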

Deploy the model
After the model is fine-tuned, you can deploy it using the model page on SageMaker JumpStart. The option to deploy the fine-tuned model will appear when fine-tuning is finished, as shown in the following screenshot.

You can also deploy the model from this view. You can configure endpoint settings such as the instance type, number of instances, and endpoint name. You will need to accept the End User License Agreement (EULA) before you can deploy the model.

Fine-tune using the SageMaker Python SDK
You can also fine-tune Meta Llama 3 models using the SageMaker Python SDK. A sample notebook with the full instructions can be found on GitHub. The following code example demonstrates how to fine-tune the Meta Llama 3 8B model:

import os
import boto3
from sagemaker.session import Session
from sagemaker.jumpstart.estimator import JumpStartEstimator

# To fine-tune the Llama 3 70B model available on JumpStart, change model_id to "meta-textgeneration-llama-3-70b".
model_id = "meta-textgeneration-llama-3-8b"
accept_eula = "true"
estimator = JumpStartEstimator(
    model_id=model_id, environment={"accept_eula": accept_eula}
)

# By default, instruction tuning is set to False. To use an instruction tuning dataset, set instruction_tuned="True".
estimator.set_hyperparameters(instruction_tuned="True", epoch="5")
# Placeholder: S3 URI of the prefix that contains your training data
train_data_location = "s3://<your-bucket>/<train-prefix>/"
estimator.fit({"training": train_data_location})

The code sets up a SageMaker JumpStart estimator for fine-tuning the Meta Llama 3 large language model (LLM) on a custom training dataset. It configures the estimator with the desired model ID, accepts the EULA, enables instruction tuning by setting instruction_tuned="True", sets the number of training epochs, and initiates the fine-tuning process.
When the fine-tuning job is complete, you can deploy the fine-tuned model directly from the estimator, as shown in the following code. As part of the deploy settings, you can define the instance type you want to deploy the model on. For the full list of deployment parameters, refer to the deploy parameters in the SageMaker SDK documentation.

# For Llama 3 70B models, you can deploy to the ml.g5.12xlarge instance type; otherwise deployment defaults to ml.p4d.24xlarge.
finetuned_predictor = estimator.deploy(instance_type="ml.g5.12xlarge")

After the endpoint is up and running, you can perform an inference request against it using the predictor object as follows:

prompt = "Your prompt goes here"
payload = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 256},
}
response = finetuned_predictor.predict(payload)
response.get("generated_text")

For the full list of predictor parameters, refer to the predictor object in the SageMaker SDK documentation.
Fine-tuning technique
Language models such as Meta Llama are more than 10 GB or even 100 GB in size. Fine-tuning such large models requires instances with significantly higher CUDA memory. Furthermore, training these models can be very slow due to their size. Therefore, for efficient fine-tuning, we use the following optimizations:

Low-Rank Adaptation (LoRA) – This is a type of parameter-efficient fine-tuning (PEFT) for efficient fine-tuning of large models. With this method, we freeze the whole model and add only a small set of adjustable parameters or layers to it. For instance, instead of training all 8 billion parameters of Llama 3 8B, we can fine-tune less than 1% of the parameters. This significantly reduces the memory requirement because we only need to store gradients, optimizer states, and other training-related information for that 1% of the parameters, and it also helps reduce both training time and cost. For more details on this method, refer to LoRA: Low-Rank Adaptation of Large Language Models. A brief code sketch of a LoRA configuration follows this list.
Int8 quantization – Even with optimizations such as LoRA, models like Meta Llama 70B require significant computational resources for training. To reduce the memory footprint during training, we can employ Int8 quantization. Quantization typically reduces the precision of the floating-point data types. Although this decreases the memory required to store model weights, it can potentially degrade the performance due to loss of information. However, Int8 quantization uses only a quarter of the bit width of full-precision (32-bit) training and doesn't incur significant degradation in performance. Instead of simply dropping bits, it rounds the data from one type to another, preserving the essential information while optimizing memory usage. To learn about Int8 quantization, refer to LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.
Fully Sharded Data Parallel (FSDP) – This is a type of data parallel training algorithm that shards the model’s parameters across data parallel workers and can optionally offload part of the training computation to the CPUs. Although the parameters are sharded across different GPUs, computation of each microbatch is local to the GPU worker. It shards parameters more uniformly and achieves optimized performance through communication and computation overlapping during training.
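To make the LoRA idea concrete, the following is a minimal sketch using the Hugging Face peft library, which the llama-recipes scripts build on. It is illustrative only (it is not the exact code that JumpStart runs internally), and loading the base model requires access to the gated meta-llama/Meta-Llama-3-8B repository on Hugging Face.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model (illustrative; requires access to the gated Hugging Face repo)
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# LoRA configuration mirroring the JumpStart hyperparameters described later in this post
lora_config = LoraConfig(
    r=8,                                  # lora_r
    lora_alpha=32,                        # lora_alpha
    target_modules=["q_proj", "v_proj"],  # target_modules
    lora_dropout=0.05,                    # lora_dropout
    task_type="CAUSAL_LM",
)

# Freeze the base weights and attach the small set of trainable LoRA matrices
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters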

The following table compares different methods with the two Meta Llama 3 models.

| Model | Default Instance Type | Supported Instance Types with Default Configuration | Default Setting | LoRA + FSDP | LoRA + No FSDP | Int8 Quantization + LoRA + No FSDP |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 3 8B | ml.g5.12xlarge | ml.g5.12xlarge, ml.g5.24xlarge, ml.g5.48xlarge | LoRA + FSDP | Yes | Yes | Yes |
| Llama 3 70B | ml.g5.48xlarge | ml.g5.48xlarge | Int8 + LoRA + No FSDP | No | No | Yes |

Fine-tuning of Meta Llama 3 models is based on the scripts provided in the llama-recipes GitHub repo from Meta.
Training dataset format
SageMaker JumpStart currently supports datasets in both the domain adaptation format and the instruction tuning format. In this section, we specify an example dataset in both formats. For more details, refer to the Dataset formatting section in the appendix.
Domain adaptation format
The Meta Llama 3 text generation model can be fine-tuned on domain-specific datasets, enabling it to generate relevant text and tackle various natural language processing (NLP) tasks within a particular domain using few-shot prompting. This fine-tuning process involves providing the model with a dataset specific to the target domain. The dataset can be in various formats, such as CSV, JSON, or TXT files. For example, if you want to fine-tune the model for the domain of financial reports and filings, you could provide it with a text file containing SEC filings from a company like Amazon. The following is an excerpt from such a filing:

This report includes estimates, projections, statements relating to our
business plans, objectives, and expected operating results that are “forward-
looking statements” within the meaning of the Private Securities Litigation
Reform Act of 1995, Section 27A of the Securities Act of 1933, and Section 21E
of the Securities Exchange Act of 1934. Forward-looking statements may appear
throughout this report, including the following sections: “Business” (Part I,
Item 1 of this Form 10-K), “Risk Factors” (Part I, Item 1A of this Form 10-K),
and “Management’s Discussion and Analysis of Financial Condition and Results
of Operations” (Part II, Item 7 of this Form 10-K). These forward-looking
statements generally are identified by the words “believe,” “project,”
“expect,” “anticipate,” “estimate,” “intend,” “strategy,” “future,”
“opportunity,” “plan,” “may,” “should,” “will,” “would,” “will be,” “will
continue,” “will likely result,” and similar expressions.

Instruction tuning format
In instruction fine-tuning, the model is fine-tuned for a set of NLP tasks described using instructions. This helps improve the model's performance for unseen tasks with zero-shot prompts. With the instruction tuning dataset format, you specify a template.json file describing the input and output formats and a train.jsonl file with one training data item per line.
The template.json file always has the following JSON format:

{
  "prompt": "<<Prompt goes here along with question or context or instruction>>",
  "completion": "<<completion goes here depending on the activity, for ex: answer for Q&A or summary for Summarization task>>"
}

For instance, the following examples show the template.json and train.jsonl files for the Dolly and Dialogsum datasets.

Dolly (question answering)

template.json:

{
  "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n",
  "completion": " {response}"
}

train.jsonl (one record per line):

{ "instruction": "Who painted the Two Monkeys", "context": "Two Monkeys or Two Chained Monkeys is a 1562 painting by Dutch and Flemish Renaissance artist Pieter Bruegel the Elder. The work is now in the Gemäldegalerie (Painting Gallery) of the Berlin State Museums.", "response": "The two Monkeys or Two Chained Monkeys is a 1562 painting by Dutch and Flemish Renaissance artist Pieter Bruegel the Elder. The work is now in the Gemaeldegalerie (Painting Gallery) of the Berlin State Museums." }

Dialogsum (text summarization)

template.json:

{
  "prompt": "Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.\n\n### Instruction:\n{dialogue}\n\n",
  "completion": " {summary}"
}

train.jsonl (one record per line):

{ "dialogue": "#Person1#: Where do these flower vases come from? \n#Person2#: They are made a town nearby. The flower vases are made of porcelain and covered with tiny bamboo sticks. \n#Person1#: Are they breakable? \n#Person2#: No. They are not only ornmamental, but also useful. \n#Person1#: No wonder it's so expensive. ", "summary": "#Person2# explains the flower vases' materials and advantages and #Person1# understands why they're expensive." }
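As a practical note, both files are typically uploaded to the same S3 prefix that you pass as the training channel. The following sketch (with placeholder records and paths) shows one way to write train.jsonl and stage both files.

import json

# Placeholder records; replace with your own data. The keys must match the
# placeholders used in template.json ({instruction}, {context}, {response}).
records = [
    {"instruction": "...", "context": "...", "response": "..."},
]

# Write one JSON object per line
with open("train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Upload train.jsonl and template.json to the same S3 prefix, for example:
# aws s3 cp template.json s3://<your-bucket>/<train-prefix>/
# aws s3 cp train.jsonl s3://<your-bucket>/<train-prefix>/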

Supported hyperparameters for training
The fine-tuning process for Meta Llama 3 models allows you to customize various hyperparameters, each of which can influence factors such as memory consumption, training speed, and the performance of the fine-tuned model. At the time of writing this post, the following are the default hyperparameter values. For the most up-to-date information, refer to the SageMaker Studio console, because these values may be subject to change. An example of overriding several of these hyperparameters on the estimator follows the list.

epoch – The number of passes that the fine-tuning algorithm takes through the training dataset. Must be an integer greater than 1. Default is 5.
learning_rate – The rate at which the model weights are updated after working through each batch of training examples. Must be a positive float greater than 0. Default is 0.0001.
lora_r – Lora R dimension. Must be a positive integer. Default is 8.
lora_alpha – Lora Alpha. Must be a positive integer. Default is 32.
target_modules – Target modules for LoRA fine-tuning. You can specify a subset of ['q_proj','v_proj','k_proj','o_proj','gate_proj','up_proj','down_proj'] modules as a string separated by a comma without any spaces. Default is q_proj,v_proj.
lora_dropout – Lora Dropout. Must be a positive float between 0 and 1. Default is 0.05.
instruction_tuned – Whether to instruction-train the model or not. At most one of instruction_tuned and chat_dataset can be True. Must be True or False. Default is False.
chat_dataset – If True, dataset is assumed to be in chat format. At most one of instruction_tuned and chat_dataset can be True. Default is False.
add_input_output_demarcation_key – For an instruction tuned dataset, if this is True, a demarcation key ("### Response:\n") is added between the prompt and completion before training. Default is True.
per_device_train_batch_size – The batch size per GPU core/CPU for training. Default is 1.
per_device_eval_batch_size – The batch size per GPU core/CPU for evaluation. Default is 1.
max_train_samples – For debugging purposes or quicker training, truncate the number of training examples to this value. Value -1 means using all of the training samples. Must be a positive integer or -1. Default is -1.
max_val_samples – For debugging purposes or quicker training, truncate the number of validation examples to this value. Value -1 means using all of the validation samples. Must be a positive integer or -1. Default is -1.
seed – Random seed that will be set at the beginning of training. Default is 10.
max_input_length – Maximum total input sequence length after tokenization. Sequences longer than this will be truncated. If -1, max_input_length is set to the minimum of 1024 and the maximum model length defined by the tokenizer. If set to a positive value, max_input_length is set to the minimum of the provided value and the model_max_length defined by the tokenizer. Must be a positive integer or -1. Default is -1.
validation_split_ratio – If the validation channel is None, the ratio of the train-validation split from the training data. Must be between 0 and 1. Default is 0.2.
train_data_split_seed – If validation data is not present, this fixes the random splitting of the input training data to training and validation data used by the algorithm. Must be an integer. Default is 0.
preprocessing_num_workers – The number of processes to use for preprocessing. If None, the main process is used for preprocessing. Default is None.
int8_quantization – If True, the model is loaded with 8-bit precision for training. Default for 8B is False. Default for 70B is True.
enable_fsdp – If True, training uses FSDP. Default for 8B is True. Default for 70B is False.
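As referenced above, you pass these hyperparameters to the JumpStart estimator created earlier as strings before calling fit(). The values below are illustrative, not recommendations.

# Override a few of the hyperparameters listed above (values are illustrative)
estimator.set_hyperparameters(
    instruction_tuned="True",
    epoch="3",
    learning_rate="0.0001",
    lora_r="8",
    lora_alpha="32",
    target_modules="q_proj,v_proj",
    lora_dropout="0.05",
    per_device_train_batch_size="1",
    max_input_length="512",
)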

Instance types and compatible hyperparameters
The memory requirement during fine-tuning may vary based on several factors:

Model type – The 8B model has the smallest GPU memory requirement and the 70B model has the largest memory requirement
Max input length – A higher value of input length leads to processing more tokens at a time and as such requires more CUDA memory
Batch size – A larger batch size requires larger CUDA memory and therefore requires larger instance types
Int8 quantization – If using Int8 quantization, the model is loaded into low precision mode and therefore requires less CUDA memory

To help you get started, we provide a set of combinations of different instance types, hyperparameters, and model types that can be successfully fine-tuned. You can select a configuration based on your requirements and the availability of instance types. We fine-tune both models on a variety of settings with three epochs on a subset of the Dolly dataset with summarization examples.
8B model

| Instance Type | Max Input Length | Per Device Batch Size | Int8 Quantization | Enable FSDP | Time Taken (Minutes) |
| --- | --- | --- | --- | --- | --- |
| ml.g4dn.12xlarge | 1024 | 2 | TRUE | FALSE | 202 |
| ml.g4dn.12xlarge | 2048 | 2 | TRUE | FALSE | 192 |
| ml.g4dn.12xlarge | 1024 | 2 | FALSE | TRUE | 98 |
| ml.g4dn.12xlarge | 1024 | 4 | TRUE | FALSE | 200 |
| ml.g5.12xlarge | 2048 | 2 | TRUE | FALSE | 73 |
| ml.g5.12xlarge | 1024 | 2 | TRUE | FALSE | 88 |
| ml.g5.12xlarge | 2048 | 2 | FALSE | TRUE | 24 |
| ml.g5.12xlarge | 1024 | 2 | FALSE | TRUE | 35 |
| ml.g5.12xlarge | 2048 | 4 | TRUE | FALSE | 72 |
| ml.g5.12xlarge | 1024 | 4 | TRUE | FALSE | 83 |
| ml.g5.12xlarge | 1024 | 4 | FALSE | TRUE | 25 |
| ml.g5.12xlarge | 1024 | 8 | TRUE | FALSE | 83 |
| ml.g5.24xlarge | 2048 | 2 | TRUE | FALSE | 73 |
| ml.g5.24xlarge | 1024 | 2 | TRUE | FALSE | 86 |
| ml.g5.24xlarge | 2048 | 2 | FALSE | TRUE | 24 |
| ml.g5.24xlarge | 1024 | 2 | FALSE | TRUE | 35 |
| ml.g5.24xlarge | 2048 | 4 | TRUE | FALSE | 72 |
| ml.g5.24xlarge | 1024 | 4 | TRUE | FALSE | 83 |
| ml.g5.24xlarge | 1024 | 4 | FALSE | TRUE | 25 |
| ml.g5.24xlarge | 1024 | 8 | TRUE | FALSE | 82 |
| ml.g5.48xlarge | 2048 | 2 | TRUE | FALSE | 73 |
| ml.g5.48xlarge | 1024 | 2 | TRUE | FALSE | 87 |
| ml.g5.48xlarge | 2048 | 2 | FALSE | TRUE | 27 |
| ml.g5.48xlarge | 1024 | 2 | FALSE | TRUE | 48 |
| ml.g5.48xlarge | 2048 | 4 | TRUE | FALSE | 71 |
| ml.g5.48xlarge | 1024 | 4 | TRUE | FALSE | 82 |
| ml.g5.48xlarge | 1024 | 4 | FALSE | TRUE | 32 |
| ml.g5.48xlarge | 1024 | 8 | TRUE | FALSE | 81 |
| ml.p3dn.24xlarge | 2048 | 2 | TRUE | FALSE | 104 |
| ml.p3dn.24xlarge | 1024 | 2 | TRUE | FALSE | 114 |

70B model

| Instance Type | Max Input Length | Per Device Batch Size | Int8 Quantization | Enable FSDP | Time Taken (Minutes) |
| --- | --- | --- | --- | --- | --- |
| ml.g5.48xlarge | 1024 | 1 | TRUE | FALSE | 461 |
| ml.g5.48xlarge | 2048 | 1 | TRUE | FALSE | 418 |
| ml.g5.48xlarge | 1024 | 2 | TRUE | FALSE | 423 |

Recommendations on instance types and hyperparameters
When fine-tuning the model’s accuracy, keep in mind the following:

Larger models such as 70B provide better performance than 8B
Performance without Int8 quantization is better than performance with Int8 quantization

Note the following training time and CUDA memory requirements:

Setting int8_quantization=True decreases the memory requirement, but it leads to slower training (compare the Int8 rows with the FSDP rows in the preceding tables).
Decreasing per_device_train_batch_size and max_input_length reduces the memory requirement and therefore can be run on smaller instances. However, setting very low values may increase the training time.
If you're not using Int8 quantization (int8_quantization=False), use FSDP (enable_fsdp=True) for faster and more efficient training (see the example after this list).
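For example, to follow the last recommendation when fine-tuning the 8B model, you could set the corresponding hyperparameters on the estimator as follows (illustrative values):

# Disable Int8 quantization and enable FSDP for the 8B model
estimator.set_hyperparameters(int8_quantization="False", enable_fsdp="True")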

When choosing the instance type, consider the following:

At the time of writing this post, the G5 instances provided the most efficient training among the supported instance types. However, because AWS regularly updates and introduces new instance types, we recommend that you validate the recommended instance type for Meta Llama 3 fine-tuning in the SageMaker documentation or SageMaker console before proceeding.
Training time largely depends on the number of GPUs and the CUDA memory available. Therefore, training on instances with the same number of GPUs (for example, ml.g5.2xlarge and ml.g5.4xlarge) takes roughly the same time, so you can use the more cost-effective instance for training (ml.g5.2xlarge).

To learn about the cost of training per instance, refer to Amazon EC2 G5 Instances.
If your dataset is in instruction tuning format, where each sample consists of an instruction (input) and the desired model response (completion), and these input+completion sequences are short (for example, 50–100 words), using a high value for max_input_length can lead to poor performance. This is because the model may struggle to focus on the relevant information when dealing with a large number of padding tokens, and it can also lead to inefficient use of computational resources. The default value of -1 corresponds to a max_input_length of 1024 for Llama models. We recommend setting max_input_length to a smaller value (for example, 200–400) when working with datasets containing shorter input+completion sequences to mitigate these issues and potentially improve the model’s performance and efficiency.
Lastly, due to the high demand of the G5 instances, you may experience unavailability of these instances in your AWS Region with the error “CapacityError: Unable to provision requested ML compute capacity. Please retry using a different ML instance type.” If you experience this error, retry the training job or try a different Region.
Issues when fine-tuning large models
In this section, we discuss two issues when fine-tuning very large models.
Disable output compression
By default, the output of a training job is a trained model that is compressed in a .tar.gz format before it’s uploaded to Amazon S3. However, for large models like the 70B model, this compression step can be time-consuming, taking more than 4 hours. To mitigate this delay, it’s recommended to use the disable_output_compression feature supported by the SageMaker training environment. When disable_output_compression is set to True, the model is uploaded without any compression, which can significantly reduce the time taken for large model artifacts to be uploaded to Amazon S3. The uncompressed model can then be used directly for deployment or further processing. The following code shows how to pass this parameter into the SageMaker JumpStart estimator:

estimator = JumpStartEstimator(
    model_id=model_id,
    environment={"accept_eula": "true"},
    disable_output_compression=True,
)

SageMaker Studio kernel timeout issue
Due to the size of the Meta Llama 3 70B model, the training job may take several hours to complete. The SageMaker Studio kernel is only used to initiate the training job, and its status doesn’t affect the ongoing training process. After the training job starts, the compute resources allocated for the job will continue running the training process, regardless of whether the SageMaker Studio kernel remains active or times out. If the kernel times out during the lengthy training process, you can still deploy the endpoint after training is complete using the training job name with the following code:

from sagemaker.jumpstart.estimator import JumpStartEstimator

# Use the same model ID that you fine-tuned, and substitute your training job name
model_id = "meta-textgeneration-llama-3-8b"
training_job_name = "<<<INSERT_TRAINING_JOB_NAME>>>"

attached_estimator = JumpStartEstimator.attach(training_job_name, model_id)
attached_estimator.logs()
predictor = attached_estimator.deploy()

To find the training job name, navigate to the SageMaker console and under Training in the navigation pane, choose Training jobs. Identify the training job name and substitute it in the preceding code.
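You can also list recent training jobs programmatically instead of using the console. The following sketch uses the SageMaker API through boto3 to print the most recent job names and statuses.

import boto3

sm_client = boto3.client("sagemaker")
response = sm_client.list_training_jobs(
    SortBy="CreationTime", SortOrder="Descending", MaxResults=10
)
for job in response["TrainingJobSummaries"]:
    print(job["TrainingJobName"], job["TrainingJobStatus"])
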
Clean up
To prevent incurring unnecessary charges, it’s recommended to clean up the deployed resources when you’re done using them. You can remove the deployed model with the following code:

predictor.delete_predictor()
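If your version of the SageMaker Python SDK doesn't expose delete_predictor(), or if you want to remove the endpoint and model separately, the standard Predictor methods shown below achieve the same cleanup.

# Delete the model artifact and the endpoint individually
predictor.delete_model()
predictor.delete_endpoint()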

Conclusion
In this post, we discussed fine-tuning Meta Llama 3 models using SageMaker JumpStart. We showed that you can use the SageMaker JumpStart console in SageMaker Studio or the SageMaker Python SDK to fine-tune and deploy these models. We also discussed the fine-tuning technique, instance types, and supported hyperparameters. In addition, we outlined recommendations for optimized training based on various tests we carried out.
The results for fine-tuning the two models over two datasets are shown in the appendix at the end of this post. As these results show, fine-tuning improves summarization compared to non-fine-tuned models.
As a next step, you can try fine-tuning these models on your own dataset using the code provided in the GitHub repository to test and benchmark the results for your use cases.

About the Authors
Ben Friebe is a Senior Solutions Architect at Amazon Web Services, based in Brisbane, Australia. He likes computers.
Pavan Kumar Rao Navule is a Solutions Architect at Amazon Web Services, where he works with ISVs in India to help them innovate on the AWS platform. He is specialized in architecting AI/ML and generative AI services at AWS. Pavan is a published author for the book “Getting Started with V Programming.” In his free time, Pavan enjoys listening to the great magical voices of Sia and Rihanna.
Khush Patel is a Solutions Architect at Amazon Web Services based out of Houston, Texas. He's passionate about working with customers to deliver business value using technology. He has extensive experience working with customers on machine learning and generative AI workloads. In his free time, Khush enjoys watching sports and reading.
Dr. Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.

Appendix
This appendix provides additional information about performance benchmarking and dataset formatting.
Performance benchmarking
In this section, we provide results for fine-tuning the two Meta Llama 3 models (8B and 70B) on two different datasets: Dolly and Dialogsum. For the Dolly dataset, our task is to summarize a paragraph of text, whereas for Dialogsum, we fine-tune the model to summarize a discussion between two people. In the following tables, we show the input to the model (prompt and instructions), ground truth (summary), response from the pre-trained Meta Llama 3 model, and response from the fine-tuned Meta Llama 3 model for each of the models. We show inference results for five data points. You can see from the following tables that the summaries improve for both datasets when we fine-tune the models.
Results for fine-tuning the Meta Llama 3 8B text generation model on the Dolly dataset

Inputs
Ground Truth
Response from Non-Fine-Tuned Model
Response from Fine-Tuned Model

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nCan you generate a 2 lines abstract of this text?nn### Input:nIn physics, gravity (from Latin gravitas ‘weight’) is a fundamental interaction which causes mutual attraction between all things with mass or energy. Gravity is, by far, the weakest of the four fundamental interactions, approximately 1038 times weaker than the strong interaction, 1036 times weaker than the electromagnetic force and 1029 times weaker than the weak interaction. As a result, it has no significant influence at the level of subatomic particles. However, gravity is the most significant interaction between objects at the macroscopic scale, and it determines the motion of planets, stars, galaxies, and even light.nnOn Earth, gravity gives weight to physical objects, and the Moon’s gravity is responsible for sublunar tides in the oceans (the corresponding antipodal tide is caused by the inertia of the Earth and Moon orbiting one another). Gravity also has many important biological functions, helping to guide the growth of plants through the process of gravitropism and influencing the circulation of fluids in multicellular organisms. Investigation into the effects of weightlessness has shown that gravity may play a role in immune system function and cell differentiation within the human body.nnThe gravitational attraction between the original gaseous matter in the universe allowed it to coalesce and form stars which eventually condensed into galaxies, so gravity is responsible for many of the large-scale structures in the universe. Gravity has an infinite range, although its effects become weaker as objects get farther away.nnGravity is most accurately described by the general theory of relativity (proposed by Albert Einstein in 1915), which describes gravity not as a force, but as the curvature of spacetime, caused by the uneven distribution of mass, and causing masses to move along geodesic lines. The most extreme example of this curvature of spacetime is a black hole, from which nothing—not even light—can escape once past the black hole’s event horizon However, for most applications, gravity is well approximated by Newton’s law of universal gravitation, which describes gravity as a force causing any two bodies to be attracted toward each other, with magnitude proportional to the product of their masses and inversely proportional to the square of the distance between them.nnnn### Response:n
In general physics theory (Newton), Gravity is a mutual attraction between objects based on their respective masses and the square of the distance which separates them. In the relativity theory (Einstein), it’s a curvature of the spacetime generated by the objects themselves.
Gravity is a fundamental interaction which causes mutual attraction between all things with mass or energy. It is the weakest of the four fundamental interactions, but it is the most significant interaction between objects at the macroscopic scale, and it determines the motion of planets, stars, galaxies, and even light. Gravity gives weight to physical objects, and the Moon’s gravity is responsible for sublunar tides in the oceans. It also has many important biological functions, helping to guide the growth of plants through the process
Gravity is a fundamental interaction which causes mutual attraction between all things with mass or energy. Gravity is, by far, the weakest of the four fundamental interactions, approximately 1038 times weaker than the strong interaction, 1036 times weaker than the electromagnetic force and 1029 times weaker than the weak interaction. Gravity is most accurately described by the general theory of relativity (proposed by Albert Einstein in 1915), which describes gravity not as a force, but as the curvature of spacetime

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nWhat is ProSavinnn### Input:nProSavin is an experimental drug believed to be of use in the treatment of Parkinson’s disease. It is administered to the striatum in the brain, inducing production of dopamine.nnIt is manufactured by Oxford BioMedica. Results from a Phase I/II clinical trial were published in the Lancet and showed safety, but little efficacy. ProSavin was superseded by AXO-Lenti-PD (OXB-102), an optimized version of the drug.nnnn### Response:n
ProSavin is an experimental drug believed to be of use in the treatment of Parkinson’s disease. It is administered to the striatum in the brain, inducing production of dopamine.
ProSavin is an experimental drug believed to be of use in the treatment of Parkinson’s disease. It is administered to the striatum in the brain, inducing production of dopamine.nnIt is manufactured by Oxford BioMedica. Results from a Phase I/II clinical trial were published in the Lancet and showed safety, but little efficacy. ProSavin was superseded by AXO-Lenti-PD (OXB-102), an optimized version of the drug.
ProSavin is an experimental drug believed to be of use in the treatment of Parkinson’s disease. It is administered to the striatum in the brain, inducing production of dopamine.nnIt is manufactured by Oxford BioMedica. Results from a Phase I/II clinical trial were published in the Lancet and showed safety, but little efficacy. ProSavin was superseded by AXO-Lenti-PD (OXB-102), an optimized version of the drug.

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nWho were the Beatles?nn### Input:nThe Beatles were an English rock band, formed in Liverpool in 1960, that comprised John Lennon, Paul McCartney, George Harrison and Ringo Starr. They are regarded as the most influential band of all time and were integral to the development of 1960s counterculture and popular music’s recognition as an art form. Rooted in skiffle, beat and 1950s rock ‘n’ roll, their sound incorporated elements of classical music and traditional pop in innovative ways; the band also explored music styles ranging from folk and Indian music to psychedelia and hard rock. As pioneers in recording, songwriting and artistic presentation, the Beatles revolutionised many aspects of the music industry and were often publicised as leaders of the era’s youth and sociocultural movements.nnnn### Response:n
The Beatles were a rock band comprised of 4 musicians. They were pioneers who had a huge influence on the music industry and wider sociocultural movements.
The Beatles were an English rock band, formed in Liverpool in 1960, that comprised John Lennon, Paul McCartney, George Harrison and Ringo Starr. They are regarded as the most influential band of all time and were integral to the development of 1960s counterculture and popular music’s recognition as an art form. Rooted in skiffle, beat and 1950s rock ‘n’ roll, their sound incorporated elements of classical music and traditional pop in innovative ways; the
The Beatles were an English rock band, formed in Liverpool in 1960, that comprised John Lennon, Paul McCartney, George Harrison and Ringo Starr. They are regarded as the most influential band of all time and were integral to the development of 1960s counterculture and popular music’s recognition as an art form. Rooted in skiffle, beat and 1950s rock ‘n’ roll, their sound incorporated elements of classical music and traditional pop in innovative ways; the

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nSummarize what Alpha Phi is.nn### Input:nAlpha Phi International Women’s Fraternity (ΑΦ, also known as APhi) is an international sorority with 172 active chapters and over 250,000 initiated members.nnFounded at Syracuse University in Syracuse, New York on September 18, 1872, it is the fourth Greek-letter organization founded for women, and the first women’s fraternity founded in the northeast.nnAlpha Phi is a member of the National Panhellenic Conference, the governing council of 26 women’s fraternities. Alpha Phi’s international headquarters are located in Evanston, Illinois.nnnn### Response:n
Alpha Phi is one of the first women’s fraternities founded in the Northeast. They have over 172 active chapters at College campuses and over 250,000 initiated members. Alpha Phi was founded at Syracuse University in Syracuse, New York on September 18, 1872.
Alpha Phi is an international sorority with 172 active chapters and over 250,000 initiated members. Founded at Syracuse University in Syracuse, New York on September 18, 1872, it is the fourth Greek-letter organization founded for women, and the first women’s fraternity founded in the northeast. Alpha Phi is a member of the National Panhellenic Conference, the governing council of 26 women’s fraternities. Alpha Phi’s international headquarters are located in Evanston, Illinois
Alpha Phi is an international sorority with 172 active chapters and over 250,000 initiated members. It was founded at Syracuse University in Syracuse, New York on September 18, 1872, and is the fourth Greek-letter organization founded for women, and the first women’s fraternity founded in the northeast. Alpha Phi is a member of the National Panhellenic Conference, the governing council of 26 women’s fraternities. Alpha Phi’s international headquarters are located in Evanston

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nplease give me a short description about SVB collapsenn### Input:nOn Friday, March 10, 2023, Silicon Valley Bank (SVB) failed after a bank run, marking the second-largest bank failure in United States history and the largest since the 2007–2008 financial crisis. It was one of three March 2023 United States bank failures.nnnn### Response:n
It’s due to risk mismanagement of SVB Bank, where the bank bet on long dated treasury bond. When the bank runs happen, SVB don’t have a choice but to sell those long dated bond at a steep discount and incur huge lossses.
Silicon Valley Bank (SVB) was a bank that failed in 2023. It was one of the largest banks in the United States, and its failure had a significant impact on the financial markets. The bank was founded in 1983 and was headquartered in Santa Clara, California. It was known for its focus on the technology industry and its support of startups and venture capital firms. However, in 2023, the bank faced a number of challenges, including a decline in the value
Silicon Valley Bank (SVB) failed after a bank run, marking the second-largest bank failure in United States history and the largest since the 2007–2008 financial crisis. It was one of three March 2023 United States bank failures.

Results for fine-tuning the Meta Llama 3 8B text generation model on the Dialogsum dataset

Inputs
Ground Truth
Response from Non-Fine-Tuned Model
Response from Fine-Tuned Model

Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.nn### Instruction:n#Person1#: Good moring, madam. Can I help you? n#Person2#: Good moring. I am Sunfang. Yesterday Mr. Smith told me to come for a job interview this morning. Is he available now? n#Person1#: Yes, he is waiting for you in Room 123. n#Person2#: Thank you very much. May I come in? n#Person3#: Come in, please. n#Person2#: Good moring, Mr. Smith. I am Sunfang coming for an interview. n#Person3#: Nice to see you. Take a seat, please. n#Person2#: Thank you. n#Person3#: We’ve been looking over your application. I see you’ve got several years experience in secretarial work. What kind of work did you do in your previous job? n#Person2#: I mainly did typing, filing, answering telephone calls and such routine work in the beginning. I started the job as a junior secretary, but within the year I was promoted to senior secretary of Southeast Company. I drafted business correspondence for the general manager. n#Person3#: Then what’s the reason that promoted you to leave Southeast Company? n#Person2#: Last month the company moved to Guangzhou for further development. I was reluctant to relocate because I love Nanjing so much. n#Person3#: Tell me what you know about our company. n#Person2#: Your company is the leading dealer in the import and export of cosmetics. Also, you have a reputation for being a good company to work for and you are in last year’s top 500 Employers list. n#Person3#: Tell me three positive qualities your coworkers would use to describe you and your work style. n#Person2#: First they’d say I am very punctual. They’d also say I am a cheerful person to work with I don’t bring my problems to work. I try to have a positive attitude about work. And thirdly, they’d say I am flexible I accept change easily. n#Person3#: Tell me something about your education. n#Person2#: I graduated from Yangzhou Junior College. Now I am attending evening classes in Nanjing Radio & TV University three times a week. My major is International Trade. I’ve already passed over half of the compulsory courses and five optional courses, and expect to receive a B. A. degree next year. n#Person3#: I see you are an ambitious young lady. I wish you can make greater progress in the future. nnnn### Response:n
#Person1# leads Sunfang to Mr. Smith for an interview. Sunfang tells Mr. Smith she was promoted fast in her previous job. She wants to work for Mr. Smith because her old company moved but she didn’t want to relocate, and because She thinks Mr. Smith’s company plays a leading role. Sunfang thinks her colleagues might describe her as being punctual, positive and flexible. Mr. Smith is impressed.
#Person1#: Good moring, madam. Can I help you? n#Person2#: Good moring. I am Sunfang. Yesterday Mr. Smith told me to come for a job interview this morning. Is he available now? n#Person1#: Yes, he is waiting for you in Room 123. n#Person2#: Thank you very much. May I come in? n#Person3#: Come in, please. n#Person2#: Good moring, Mr. Smith. I am Sunfang coming for an interview. n#Person3#: Nice to see you. Take a seat, please. n#Person2#: Thank you. n#Person3#: We’ve been looking over your application. I see you’ve got several years experience in secretarial work. What kind of work did you do in your previous job? n#Person2#: I mainly did typing, filing, answering telephone calls and such routine work in the beginning. I started the job as a junior secretary, but within the year I was promoted to senior secretary of Southeast Company. I drafted business correspondence for the general manager. n#Person3#: Then what’s the reason that promoted you to leave Southeast Company? n#Person2#: Last month the company moved to Guangzhou for further development. I was reluctant to relocate because I love Nanjing so much. n#Person3#: Tell me what you know about our company. n#Person2#: Your company
Sunfang comes for a job interview with Mr. Smith. She tells Mr. Smith about her previous job and her work experience. She also tells Mr. Smith about her company and her education.

Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.nn### Instruction:n#Person1#: Well, Jenny, the school year is almost over. We just have two more weeks before exams. What do you plan to do this summer?n#Person2#: I’m going to teach English to some immigrants in the university’s community service program.n#Person1#: That sounds interesting. Don’t you need to speak a foreign language for a job like that?n#Person2#: No, you don’t. You just have to present the language simply and give the students a chance to practice speaking.n#Person1#: Come to think of it, that’s the way I was taught to speak Chinese. But speaking didn’t help me learn to read and write Chinese.n#Person2#: My students won’t want to read and write English, at lease not now. They are more interested in speaking.n#Person1#: You sound very knowledgeable about all this. How do you know so much?n#Person2#: I took a Teaching English as a Second Language course last year when you were in China. I’ve also talked with the experienced teachers quite a lot. I think I would like to be an ESL teacher when I graduate.nnnn### Response:n
Jenny plans to teach English to some immigrants in the university’s community service program this summer and she tells #Person1# that students are more interested in speaking.
#Person1#: Well, Jenny, the school year is almost over. We just have two more weeks before exams. What do you plan to do this summer?n#Person2#: I’m going to teach English to some immigrants in the university’s community service program.n#Person1#: That sounds interesting. Don’t you need to speak a foreign language for a job like that?n#Person2#: No, you don’t. You just have to present the language simply and give the students a chance to practice speaking.n#Person1#: Come to think of it, that’s the way I was taught to speak Chinese. But speaking didn’t help me learn to read and write Chinese.n#Person2#: My students won’t want to read and write English, at lease not now. They are more interested in speaking.n#Person1#: You sound very knowledgeable about all this. How do you know so much?n#Person2#: I took a Teaching English as a Second Language course last year when you were in China. I’ve also talked with the experienced teachers quite a lot. I think I would like to be an ESL teacher when I graduate.n
Jenny tells #Person1# she’s going to teach English to some immigrants in the university’s community service program. Jenny thinks speaking English is the most important part and she’s knowledgeable about it.

Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.nn### Instruction:n#Person1#: Can I help you?n#Person2#: Yes, I am looking for a pair of gloves. I think I wear size six.n#Person1#: The gloves are on this counter. Let’s see… How about the blue pair? The color suits you and they are washable, too.n#Person2#: Oh, I’ll try them on. They seem to fit. How much shall I pay you?n#Person1#: They are on sales this week. They have been reduced to twenty-five dollars.n#Person2#: All right. I will take them.n#Person1#: They are very nice.n#Person2#: Thank you.nnnn### Response:n
#Person2# buys a pair of blue gloves of size six with #Person1#’s assistance.
#Person1#: Can I help you?n#Person2#: Yes, I am looking for a pair of gloves. I think I wear size six.n#Person1#: The gloves are on this counter. Let’s see… How about the blue pair? The color suits you and they are washable, too.n#Person2#: Oh, I’ll try them on. They seem to fit. How much shall I pay you?n#Person1#: They are on sales this week. They have been reduced to twenty-five dollars.n#Person2#: All right. I will take them.n#Person1#: They are very nice.n#Person2#: Thank you.n
#Person2# buys a pair of gloves with #Person1#’s help.

Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.nn### Instruction:n#Person1#: How is your business on the Internet, Steven?n#Person2#: Don’t mention it. I lost a bunch of money on it this year. I think it is on the brink of bankruptcy.n#Person1#: I am sorry to hear that. It shows that you should buy companies, not ideas.n#Person2#: I changed my strategy around too. I am investing in good companies now.n#Person1#: Yes, I agree with you. I am going to find a good company and invest in it soon.n#Person2#: It’s a good idea. But please be careful when you invest this time.n#Person1#: Yes, I will. I should learn more from you. After all I am green hand on it.n#Person2#: If you have any question, you can ask me.n#Person1#: OK. My wife sent me a message and asked me to go home just now. I am leaving.n#Person2#: All right. Let’s chat next time. Bye-bye.n#Person1#: Bye.nnnn### Response:n
Steven lost much money on Internet business so he’s now investing in good companies. #Person1# wants to invest too but #Person1# is green hand on investment. Steven’s willing to help.
#Person1#: How is your business on the Internet, Steven?n#Person2#: Don’t mention it. I lost a bunch of money on it this year. I think it is on the brink of bankruptcy.n#Person1#: I am sorry to hear that. It shows that you should buy companies, not ideas.n#Person2#: I changed my strategy around too. I am investing in good companies now.n#Person1#: Yes, I agree with you. I am going to find a good company and invest in it soon.n#Person2#: It’s a good idea. But please be careful when you invest this time.n#Person1#: Yes, I will. I should learn more from you. After all I am green hand on it.n#Person2#: If you have any question, you can ask me.n#Person1#: OK. My wife sent me a message and asked me to go home just now. I am leaving.n#Person2#: All right. Let’s chat next time. Bye-bye.n#Person1#: Bye.n
Steven lost a bunch of money on the Internet and changed his strategy to invest in good companies. #Person1# agrees and will be careful when investing.

Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.nn### Instruction:n#Person1#: Which language do you speak?n#Person2#: Well, I studied French at school, but I’ve forgotten most of it. I learnt some Russian while I was studying there and now I study Spanish in evening classes. Do you speak Japanese at home with your parents.n#Person1#: Yes, I do. I’m learning Chinese now on the internet.n#Person2#: How do you find studying a language on the internet?n#Person1#: It’s not easy, particularly with listening and speaking, because sometimes the connection isn’t too good.n#Person2#: I bet you find the character difficult to write!n#Person1#: Actually, many of the characters are the same or very similar in both Chinese and Japanese, so they’re easy for me.n#Person2#: So, you can read and write much better than you speak and understand other speakers, right?n#Person1#: Yes. How are the Spanish classes going?n#Person2#: In a way, it’s similar to your experience of learning Chinese. Some Spanish words are very similar to English ones. But I need to practise listening to native speakers and saying the words myself.nnnn### Response:n
#Person1# is learning Chinese on the internet and finds listening and speaking difficult, but thinks writing is easy. #Person2# thinks #Person2#’s experience of learning Spanish is similar to #Person1#’s.
#Person1#: Which language do you speak?n#Person2#: Well, I studied French at school, but I’ve forgotten most of it. I learnt some Russian while I was studying there and now I study Spanish in evening classes. Do you speak Japanese at home with your parents.n#Person1#: Yes, I do. I’m learning Chinese now on the internet.n#Person2#: How do you find studying a language on the internet?n#Person1#: It’s not easy, particularly with listening and speaking, because sometimes the connection isn’t too good.n#Person2#: I bet you find the character difficult to write!n#Person1#: Actually, many of the characters are the same or very similar in both Chinese and Japanese, so they’re easy for me.n#Person2#: So, you can read and write much better than you speak and understand other speakers, right?n#Person1#: Yes. How are the Spanish classes going?n#Person2#: In a way, it’s similar to your experience of learning Chinese. Some Spanish words are very similar to English ones. But I need to practise listening to native speakers and saying the words myself.n
#Person1# and #Person2# are talking about the languages they speak. #Person1# is learning Chinese on the internet and #Person2# is learning Spanish. They both find it difficult to learn the language.

Results for fine-tuning the Meta Llama 3 70B text generation model on the Dolly dataset

Inputs
Ground Truth
Response from Non-Fine-Tuned Model
Response from Fine-Tuned Model

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nCan you give me a short summary of the follow paragraph about the history of electronic music?nn### Input:nDuring the 1960s, digital computer music was pioneered, innovation in live electronics took place, and Japanese electronic musical instruments began to influence the music industry. In the early 1970s, Moog synthesizers and drum machines helped popularize synthesized electronic music. The 1970s also saw electronic music begin to have a significant influence on popular music, with the adoption of polyphonic synthesizers, electronic drums, drum machines, and turntables, through the emergence of genres such as disco, krautrock, new wave, synth-pop, hip hop, and EDM. In the early 1980s mass-produced digital synthesizers, such as the Yamaha DX7, became popular, and MIDI (Musical Instrument Digital Interface) was developed. In the same decade, with a greater reliance on synthesizers and the adoption of programmable drum machines, electronic popular music came to the fore. During the 1990s, with the proliferation of increasingly affordable music technology, electronic music production became an established part of popular culture. In Berlin starting in 1989, the Love Parade became the largest street party with over 1 million visitors, inspiring other such popular celebrations of electronic music.nnnn### Response:n
With the advent of new electronic instruments in the 60s and 70s, namely synthesizers and drum machines, electronic music quickly gained mainstream popularity through genres like disco and hip hop. As the technology progressed through the 80s and 90s, more affordable instruments led electronic music to be even more widely embraced, cementing its status in pop culture and leading to the rise of large electronic music events around the globe.
The history of electronic music can be traced back to the 1960s, when digital computer music was pioneered, innovation in live electronics took place, and Japanese electronic musical instruments began to influence the music industry. In the early 1970s, Moog synthesizers and drum machines helped popularize synthesized electronic music. The 1970s also saw electronic music begin to have a significant influence on popular music, with the adoption of polyphonic synthesizers, electronic drums, drum machines, and
Electronic music has been around since the 1960s. In the 1970s, Moog synthesizers and drum machines helped popularize synthesized electronic music. In the 1980s, digital synthesizers became popular and MIDI was developed. In the 1990s, electronic music production became an established part of popular culture.

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nWho was Nike in Greek Mythology?nn### Input:nIn Greek mythology, Nike (/ˈnaɪki/ (listen); Ancient Greek: Νίκη, lit. ‘victory’, ancient: [nǐː.kɛː], modern: [ˈni.ci]) was a goddess who personified victory in any field including art, music, war, and athletics. She is often portrayed in Greek art as Winged Victory in the motion of flight; however, she can also appear without wings as “Wingless Victory” when she is being portrayed as an attribute of another deity such as Athena. In Greek literature Nike is described as both an attribute and attendant to the gods Zeus and Athena. Nike gained this honored role beside Zeus during the Titanomachy where she was one of the first gods to offer her allegiance to Zeus. At Athens, Nike became a servant to Athena as well as an attribute of her due to the prominent status Athena held in her patron city. The fusion of the two goddesses at Athens has contributed to the ambiguity surrounding Nike’s origins. It is unclear whether she originated from a character trait of the Greek goddess Athena or has always existed as an independent deity. Her origin story in Greek mythology is also slightly ambiguous, with the Theogony claiming Nike to be the daughter of Styx and Pallas while the Homeric Hymns describe Ares as being Nike’s father. Her Roman equivalent was Victoria.nnnn### Response:n
Nike is the god of ‘victory’ in Greek mythology, and is often portrayed in art Winged Victory in the motion of flight.
Nike was the Greek goddess of victory. She was the daughter of the Titan Pallas and the goddess Styx, and the sister of Kratos (Strength), Bia (Force), and Zelus (Rivalry). She was depicted as a winged goddess, and was often shown holding a wreath of victory or a palm branch. She was also associated with the goddess Athena, and was often depicted as her companion. Nike was worshipped throughout Greece, and her cult was particularly strong
Nike was the Greek goddess of victory. She was often portrayed in Greek art as Winged Victory in the motion of flight. Nike gained this honored role beside Zeus during the Titanomachy where she was one of the first gods to offer her allegiance to Zeus. At Athens, Nike became a servant to Athena as well as an attribute of her due to the prominent status Athena held in her patron city.

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nGive me some details about the MV Cape Race pleasenn### Input:nThe MV Cape Race (T-AKR-9960) was originally built as the MV Seaspeed America for the Transocean RO-RO Corp. She was sold to various other corporations until she was purchased by the United States Navy in April 1993. She underwent conversion to fit a more active role in and serve a military need better and was placed back in service April 1993. She has served on multiple occasions but is currently in the US Ready Reserve Fleet.nnnn### Response:n
The MV Cape Race (T-AKR-9960) was originally built as the MV Seaspeed America for the Transocean RO-RO Corp. She was sold to various other corporations until she was purchased by the United States Navy in April 1993. nnShe underwent conversion to fit a more active role in and serve a military need better and was placed back in service April 1993. She has served on multiple occasions but is currently in the US Ready Reserve Fleet.
The MV Cape Race (T-AKR-9960) was originally built as the MV Seaspeed America for the Transocean RO-RO Corp. She was sold to various other corporations until she was purchased by the United States Navy in April 1993. She underwent conversion to fit a more active role in and serve a military need better and was placed back in service April 1993. She has served on multiple occasions but is currently in the US Ready Reserve Fleet.
The MV Cape Race (T-AKR-9960) was originally built as the MV Seaspeed America for the Transocean RO-RO Corp. She was sold to various other corporations until she was purchased by the United States Navy in April 1993. She underwent conversion to fit a more active role in and serve a military need better and was placed back in service April 1993. She has served on multiple occasions but is currently in the US Ready Reserve Fleet.

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nWhat is sweep rowing?nn### Input:nRowing, sometimes called crew in the United States, is the sport of racing boats using oars. It differs from paddling sports in that rowing oars are attached to the boat using oarlocks, while paddles are not connected to the boat. Rowing is divided into two disciplines: sculling and sweep rowing. In sculling, each rower holds two oars—one in each hand, while in sweep rowing each rower holds one oar with both hands. There are several boat classes in which athletes may compete, ranging from single sculls, occupied by one person, to shells with eight rowers and a coxswain, called eights. There are a wide variety of course types and formats of racing, but most elite and championship level racing is conducted on calm water courses 2 kilometres (1.2 mi) long with several lanes marked using buoys.nnnn### Response:n
Sweep rowing is a rowing discipline in which rowers hold one oar with both hands.
Sweep rowing is a type of rowing where each rower holds one oar with both hands. There are several boat classes in which athletes may compete, ranging from single sculls, occupied by one person, to shells with eight rowers and a coxswain, called eights. There are a wide variety of course types and formats of racing, but most elite and championship level racing is conducted on calm water courses 2 kilometres (1.2 mi) long with several lanes
Sweep rowing is a type of rowing where each rower holds one oar with both hands.

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nWhat is the Pandemis corylanann### Input:nPandemis corylana, the chequered fruit-tree tortrix, hazel tortrix moth, filbert tortricid or barred fruit tree moth, is a moth of the family Tortricidae. It is found from northern and central Europe to Siberia, Korea and Japan.nnnn### Response:n
The Pandemis corylana, the chequered fruit-tree tortrix, hazel tortrix moth, filbert tortricid or barred fruit tree moth, is a moth of the family Tortricidae. It is found from northern and central Europe to Siberia, Korea and Japan.
Pandemis corylana, the chequered fruit-tree tortrix, hazel tortrix moth, filbert tortricid or barred fruit tree moth, is a moth of the family Tortricidae. It is found from northern and central Europe to Siberia, Korea and Japan.
The Pandemis corylana, also known as the chequered fruit-tree tortrix, hazel tortrix moth, filbert tortricid or barred fruit tree moth, is a moth of the family Tortricidae. It is found from northern and central Europe to Siberia, Korea and Japan.

Results for fine-tuning the Meta Llama 3 70B text generation model on the Dialogsum dataset

Inputs
Ground Truth
Response from Non-Fine-Tuned Model
Response from Fine-Tuned Model

Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.nn### Instruction:n#Person1#: Would you like some more coffee, James?n#Person2#: Oh, yes, please. Another day’s work. I like the job, but I hate traveling around London.n#Person1#: Oh, come on, James. How do you get to work?n#Person2#: I walk to the station and take the underground. And sometimes I take the taxi if I get up late.n#Person1#: How long does the journey take?n#Person2#: It takes about 45 minutes on a good day. It’s just awful.n#Person1#: Well, how much does it cost?n#Person2#: It’s twenty pounds a week. I think that’s quite a lot.n#Person1#: Oh, James. Stop complaining. And hurry up. You’ll be late.nnnn### Response:n
James likes the job but hates the commute because it takes a long time and costs a lot.
#Person1#: Would you like some more coffee, James?n#Person2#: Oh, yes, please. Another day’s work. I like the job, but I hate traveling around London.n#Person1#: Oh, come on, James. How do you get to work?n#Person2#: I walk to the station and take the underground. And sometimes I take the taxi if I get up late.n#Person1#: How long does the journey take?n#Person2#: It takes about 45 minutes on a good day. It’s just awful.n#Person1#: Well, how much does it cost?n#Person2#: It’s twenty pounds a week. I think that’s quite a lot.n#Person1#: Oh, James. Stop complaining. And hurry up. You’ll be late.nn#Person1# asked #Person2# if he would like some more coffee. #Person2# said yes and that he liked his job but hated traveling around London. #Person1# asked how he got to work. #Person2# said he walked to the station and took the underground, and sometimes took a taxi if he got up late. #Person1# asked how long the journey took. #Person2# said it took about 45 minutes on a good day and that it was awful. #Person1# asked how much it cost. #Person2# said it was twenty pounds a week and that he thought that was quite a lot
James complains to #Person1# about the traveling around London. #Person1# asks him to hurry up.

Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.nn### Instruction:n#Person1#: What is so special about tomorrow?n#Person2#: Tomorrow is October 24th, United Nations Day.n#Person1#: Why is October 24th United Nations Day?n#Person2#: That’s the date in 1945 when the United Nations began.n#Person1#: How does your school celebrate United Nations Day?n#Person2#: The UN Day celebration includes the songs and dances of different countries in the United Nations and a traditional dress parade. Our school library displays students’ works of art from different countries and cultures. We also hold a ‘Tastes from Around the World’ food festival.n#Person1#: What’s the purpose of the celebration?n#Person2#: To help us students learn about other cultures and understand and respect other cultures.n#Person1#: It is really worthwhile.nnnn### Response:n
#Person1# and #Person2# talk about the United Nations Day. #Person2# tells #Person1# how the school will celebrate the day and the purpose of it.
#Person1#: What is so special about tomorrow?n#Person2#: Tomorrow is October 24th, United Nations Day.n#Person1#: Why is October 24th United Nations Day?n#Person2#: That’s the date in 1945 when the United Nations began.n#Person1#: How does your school celebrate United Nations Day?n#Person2#: The UN Day celebration includes the songs and dances of different countries in the United Nations and a traditional dress parade. Our school library displays students’ works of art from different countries and cultures. We also hold a ‘Tastes from Around the World’ food festival.n#Person1#: What’s the purpose of the celebration?n#Person2#: To help us students learn about other cultures and understand and respect other cultures.n#Person1#: It is really worthwhile.n
#Person2# tells #Person1# United Nations Day is October 24th. #Person2#’s school celebrates it to help students understand and respect other cultures.

Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.nn### Instruction:n#Person1#: Have you heard about Alice and David?n#Person2#: No, have they had another fight?n#Person1#: No, they’ve got engaged.n#Person2#: You must be joking. Those two?n#Person1#: Well, my dear, I didn’t believe it either. But I got it straight from the horse’s mouth. David called me this morning.n#Person2#: So when did this all happen?n#Person1#: Last weekend, while they were on their ski trip.n#Person2#: Well, I believe it now, and when are they are getting married?n#Person1#: Next June.n#Person2#: I can hardly believe it.nnnn### Response:n
#Person1# and #Person2# can’t believe that Alice and David, who fought with each other, got engaged.
#Person1# told #Person2# that Alice and David had got engaged. #Person2# was surprised and asked when it had happened. #Person1# said that it had happened last weekend, while they were on their ski trip. #Person2# asked when they were getting married. #Person1# said that they were getting married next June.
#Person1# tells #Person2# that Alice and David have got engaged. #Person2# is surprised.

Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.nn### Instruction:n#Person1#: We did a survey on your job performance.n#Person2#: How did I do?n#Person1#: Not so good. Lots of complains.n#Person2#: Really? I thought I was doing well.n#Person1#: Your performance so far is not quite in line with what we expect for our executives. I hope I’Ve made my point.nnnn### Response:n
#Person1# tells #Person2# #Person2#’s job performance isn’t satisfying, despite #Person2#’s opposite opinion.
#Person1# told #Person2# that they did a survey on their job performance. #Person2# asked how they did, and #Person1# said that they did not do so good and that there were lots of complains. #Person2# was surprised and said that they thought they were doing well. #Person1# said that their performance so far was not quite in line with what they expect for their executives, and they hoped that they had made their point.
#Person1# tells #Person2# that #Person2#’s job performance is not good.

Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.nn### Instruction:n#Person1#: Hi, Francis, morning! How are you doing today?n#Person2#: I had a bad start of the day. I was stuck in a bad traffic jam on my way and I missed my morning meeting.n#Person1#: Indeed, I agree that is not a good start. How about the meeting? You missed anything important?n#Person2#: It is ok. My boss briefs me later on.n#Person1#: Was he unhappy with you not attending the meeting?n#Person2#: Err, kind of. You know this traffic jam is really bad with so many new cars hitting the streets these days.nnnn### Response:n
Francis was stuck in a traffic jam and missed the morning meeting. His boss briefs him.
#Person1#: Hi, Francis, morning! How are you doing today?n#Person2#: I had a bad start of the day. I was stuck in a bad traffic jam on my way and I missed my morning meeting.n#Person1#: Indeed, I agree that is not a good start. How about the meeting? You missed anything important?n#Person2#: It is ok. My boss briefs me later on.n#Person1#: Was he unhappy with you not attending the meeting?n#Person2#: Err, kind of. You know this traffic jam is really bad with so many new cars hitting the streets these days.nn#Person1# asked #Person2# how he was doing today. #Person2# said he had a bad start of the day because he was stuck in a bad traffic jam on his way and he missed his morning meeting. #Person1# agreed that it was not a good start and asked if #Person2# missed anything important in the meeting. #Person2# said it was ok because his boss would brief him later on. #Person1# asked if his boss was unhappy with him not attending the meeting. #Person2# said he was kind of unhappy because the traffic jam was really bad with so many new cars hitting the streets these days.
Francis tells #Person1# he missed the meeting because of the traffic jam.

Dataset formatting
We currently offer two types of fine-tuning: instruction fine-tuning and domain adaptation fine-tuning. You can switch between the two training methods by specifying the parameter instruction_tuned as True or False.
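The following minimal sketch shows how this parameter might be set through the SageMaker Python SDK; the model ID, hyperparameter values, and S3 path are placeholders to adapt for your account and use case:

# Minimal sketch (placeholder model ID, hyperparameters, and S3 path) of
# selecting the training method with the instruction_tuned hyperparameter.
from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-8b",  # or the 70B variant
    environment={"accept_eula": "true"},        # acknowledge the Meta Llama 3 EULA
)

# "True" selects instruction fine-tuning; "False" selects domain adaptation fine-tuning.
estimator.set_hyperparameters(instruction_tuned="True", epoch="3")

# The training channel points to the S3 prefix holding your dataset
# (and, for instruction fine-tuning, an optional template.json).
estimator.fit({"training": "s3://your-bucket/path/to/train/"})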
Domain adaptation format
The text generation model can be fine-tuned on any domain-specific dataset to incorporate domain-specific knowledge and language patterns. After fine-tuning on the domain-specific dataset, the model is expected to generate more relevant and accurate text within that domain. Although few-shot prompting can also guide the model towards domain-specific generation, the fine-tuning process plays a crucial role in adapting the model’s understanding and generation capabilities to the target domain. The combination of fine-tuning on domain data and effective prompting techniques can enable the model to perform various NLP tasks within that specific domain more effectively.
For input to the model, use a training and optional validation directory. Each directory contains a CSV, JSON, or TXT file. For CSV and JSON files, the train or validation data is read from the column called text, or from the first column if no column called text is found. The number of files under train and validation (if provided) should each equal 1.
The output is a trained model that can be deployed for inference.
The following is an example of a TXT file for fine-tuning the text generation model. The TXT file consists of Amazon’s SEC filings from 2021–2022:

This report includes estimates, projections, statements relating to our
business plans, objectives, and expected operating results that are “forward-
looking statements” within the meaning of the Private Securities Litigation
Reform Act of 1995, Section 27A of the Securities Act of 1933, and Section 21E
of the Securities Exchange Act of 1934. Forward-looking statements may appear
throughout this report, including the following sections: “Business” (Part I,
Item 1 of this Form 10-K), “Risk Factors” (Part I, Item 1A of this Form 10-K),
and “Management’s Discussion and Analysis of Financial Condition and Results
of Operations” (Part II, Item 7 of this Form 10-K). These forward-looking
statements generally are identified by the words “believe,” “project,”
“expect,” “anticipate,” “estimate,” “intend,” “strategy,” “future,”
“opportunity,” “plan,” “may,” “should,” “will,” “would,” “will be,” “will
continue,” “will likely result,” and similar expressions. Forward-looking
statements are based on current expectations and assumptions that are subject
to risks and uncertainties that may cause actual results to differ materially.
We describe risks and uncertainties that could cause actual results and events
to differ materially in “Risk Factors,” “Management’s Discussion and Analysis
of Financial Condition and Results of Operations,” and “Quantitative and
Qualitative Disclosures about Market Risk” (Part II, Item 7A of this Form
10-K). Readers are cautioned not to place undue reliance on forward-looking
statements, which speak only as of the date they are made. We undertake no
obligation to update or revise publicly any forward-looking statements,
whether because of new information, future events, or otherwise.

GENERAL

Embracing Our Future …

Instruction fine-tuning
The text generation model can be instruction-tuned on any text data provided that the data is in the expected format. The instruction-tuned model can be further deployed for inference.
For input, use a training and optional validation directory. The train and validation directories should contain one or multiple JSON lines (.jsonl) formatted files. In particular, the train directory can also contain an optional *.json file describing the input and output formats.
The best model is selected according to the validation loss, calculated at the end of each epoch. If a validation set is not given, an (adjustable) percentage of the training data is automatically split and used for validation.
The training data must be formatted in a JSON lines (.jsonl) format, where each line is a dictionary representing a single data sample. All training data must be in a single folder; however, it can be saved in multiple .jsonl files. The .jsonl file extension is mandatory. The training folder can also contain a template.json file describing the input and output formats. If no template file is given, the following template will be used:

{
    "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n",
    "completion": "{response}"
}

In this case, the data in the JSON lines entries must include prompt and completion fields. If a custom template is provided, it must also use prompt and completion keys to define the input and output templates. The following is a sample custom template:

{
    "prompt": "question: {question} context: {context}",
    "completion": "{answer}"
}

Here, the data in the JSON lines entries must include the question, context, and answer fields.
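As an illustrative sketch (the file names and sample content here are hypothetical), the training data and a custom template could be prepared as follows before uploading both files to the same Amazon S3 prefix used as the training channel:

import json

samples = [
    {
        "question": "What is sweep rowing?",
        "context": "Rowing is divided into two disciplines: sculling and sweep rowing. ...",
        "answer": "Sweep rowing is a rowing discipline in which each rower holds one oar with both hands.",
    },
    # ... more samples
]

# One JSON object per line; the .jsonl extension is mandatory.
with open("train.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Optional custom template using the prompt and completion keys.
template = {
    "prompt": "question: {question} context: {context}",
    "completion": "{answer}",
}
with open("template.json", "w") as f:
    json.dump(template, f)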
The output is a trained model that can be deployed for inference.
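Continuing the earlier estimator sketch, deployment and a test invocation might look like the following; the payload fields shown are assumptions to verify against the documentation for your model version:

# Deploy the fine-tuned model behind a SageMaker endpoint and query it.
predictor = estimator.deploy()

payload = {
    "inputs": "question: What is sweep rowing? context: Rowing is divided into two disciplines ...",
    "parameters": {"max_new_tokens": 128},
}
print(predictor.predict(payload))

# Clean up the endpoint when you are done to avoid ongoing charges.
predictor.delete_model()
predictor.delete_endpoint()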
We provide a subset of SEC filings data of Amazon. It is downloaded from publicly available EDGAR. For instructions on accessing the data, refer to Accessing EDGAR Data.
License: Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0)

Ground truth curation and metric interpretation best practices for eva …

Generative artificial intelligence (AI) applications powered by large language models (LLMs) are rapidly gaining traction for question answering use cases. From internal knowledge bases for customer support to external conversational AI assistants, these applications use LLMs to provide human-like responses to natural language queries. However, building and deploying such assistants with responsible AI best practices requires a robust ground truth and evaluation framework to make sure they meet quality standards and user experience expectations, as well as clear evaluation interpretation guidelines to make the quality and responsibility of these systems intelligible to business decision-makers.
This post focuses on evaluating and interpreting metrics using FMEval for question answering in a generative AI application. FMEval is a comprehensive evaluation suite from Amazon SageMaker Clarify, providing standardized implementations of metrics to assess quality and responsibility. To learn more about FMEval, refer to Evaluate large language models for quality and responsibility.
In this post, we discuss best practices for working with FMEval in ground truth curation and metric interpretation for evaluating question answering applications for factual knowledge and quality. Ground truth data in AI refers to data that is known to be true, representing the expected outcome for the system being modeled. By providing a true expected outcome to measure against, ground truth data unlocks the ability to deterministically evaluate system quality. Ground truth curation and metric interpretation are tightly coupled, and the implementation of the evaluation metric must inform ground truth curation to achieve best results. By following these guidelines, data scientists can quantify the user experience delivered by their generative AI pipelines and communicate meaning to business stakeholders, facilitating ready comparisons across different architectures, such as Retrieval Augmented Generation (RAG) pipelines, off-the-shelf or fine-tuned LLMs, or agentic solutions.
Solution overview
We use an example ground truth dataset (referred to as the golden dataset, shown in the following table) of 10 question-answer-fact triplets. Each triplet describes a fact, and an encapsulation of the fact as a question-answer pair to emulate an ideal response, derived from a knowledge source document. We used Amazon’s Q2 2023 10Q report as the source document from the SEC’s public EDGAR dataset to create 10 question-answer-fact triplets. The 10Q report contains details on company financials and operations over the Q2 2023 business quarter. The golden dataset applies the ground truth curation best practices discussed in this post for most questions, but not all, to demonstrate the downstream impact of ground truth curation on metric results.

Question
Answer
Fact

Who is Andrew R. Jassy?
Andrew R. Jassy is the President and Chief Executive Officer of Amazon.com, Inc.
Chief Executive Officer of Amazon<OR>CEO of Amazon<OR>President of Amazon

What were Amazon’s total net sales for the second quarter of 2023?
Amazon’s total net sales for the second quarter of 2023 were $134.4 billion.
134.4 billion<OR>134,383 million<OR>134383 million<OR>134.383 billion

Where is Amazon’s principal office located?
Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210.
410 Terry Avenue North

What was Amazon’s operating income for the six months ended June 30, 2023?
Amazon’s operating income for the six months ended June 30, 2023 was $12.5 billion.
12.5 billion<OR>12,455 million<OR>12.455 billion

When did Amazon acquire One Medical?
Amazon acquired One Medical on February 22, 2023 for cash consideration of approximately $3.5 billion, net of cash acquired.
Feb 22 2023<OR>February 22nd 2023<OR>2023-02-22<OR>February 22, 2023

What was a key challenge faced by Amazon’s business in the second quarter of 2023?
Changes in foreign exchange rates reduced Amazon’s International segment net sales by $180 million for Q2 2023.
foreign exchange rates

What was Amazon’s total cash, cash equivalents and restricted cash as of June 30, 2023?
Amazon’s total cash, cash equivalents, and restricted cash as of June 30, 2023 was $50.1 billion.
50.1 billion<OR>50,067 million<OR>50.067 billion

What were Amazon’s AWS sales for the second quarter of 2023?
Amazon’s AWS sales for the second quarter of 2023 were $22.1 billion.
22.1 billion<OR>22,140 million<OR>22.140 billion<OR>22140 million

As of June 30, 2023, how many shares of Rivian’s Class A common stock did Amazon hold?
As of June 30, 2023, Amazon held 158 million shares of Rivian’s Class A common stock.
158 million

How many shares of common stock were outstanding as of July 21, 2023?
There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.
10317750796<OR>10,317,750,796

We generated responses from three generative AI RAG pipelines (anonymized as Pipeline1, Pipeline2, Pipeline3, as shown in the following figure) and calculated factual knowledge and QA accuracy metrics, evaluating them against the golden dataset. The fact key of the triplet is used for the Factual Knowledge metric ground truth, and the answer key is used for the QA Accuracy metric ground truth. With this, factual knowledge is measured against the fact key, and the ideal user experience in terms of style and conciseness is measured against the question-answer pairs.
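To make this concrete, the following minimal sketch (assuming the open source fmeval package and pre-generated pipeline responses; the response text shown is hypothetical) scores each triplet with both metrics, using the fact key for Factual Knowledge and the answer key for QA Accuracy:

from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy, QAAccuracyConfig

# The <OR> delimiter separates acceptable variants of the golden fact.
factual = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))
qa_accuracy = QAAccuracy(QAAccuracyConfig(target_output_delimiter="<OR>"))

golden_triplets = [
    {
        "question": "Who is Andrew R. Jassy?",
        "answer": "Andrew R. Jassy is the President and Chief Executive Officer of Amazon.com, Inc.",
        "fact": "Chief Executive Officer of Amazon<OR>CEO of Amazon<OR>President of Amazon",
    },
    # ... the remaining triplets of the golden dataset
]

# Responses previously generated by one of the pipelines, keyed by question.
pipeline_responses = {
    "Who is Andrew R. Jassy?": "Andrew R. Jassy is the CEO of Amazon.",  # hypothetical
}

for triplet in golden_triplets:
    response = pipeline_responses[triplet["question"]]
    fact_scores = factual.evaluate_sample(target_output=triplet["fact"], model_output=response)
    qa_scores = qa_accuracy.evaluate_sample(target_output=triplet["answer"], model_output=response)
    print(triplet["question"])
    for score in fact_scores + qa_scores:
        print(f"  {score.name}: {score.value:.3f}")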

Evaluation for question answering in a generative AI application
A generative AI pipeline can have many subcomponents, such as a RAG pipeline. RAG is a methodology to improve the accuracy of LLM responses to a user query by retrieving and inserting relevant domain knowledge into the language model prompt. RAG quality depends on the configurations of the retriever (chunking, indexing) and generator (LLM selection and hyperparameters, prompt), as illustrated in the following figure. Tuning chunking and indexing in the retriever makes sure the correct content is available in the LLM prompt for generation. The chunk size and chunk splitting method, as well as the means of embedding and ranking relevant document chunks as vectors in the knowledge store, impact whether the actual answer to the query is ultimately inserted in the prompt. In the generator, selecting an appropriate LLM to run the prompt, and tuning its hyperparameters and prompt template, all control how the retrieved information is interpreted for the response. With this, when a final response from a RAG pipeline is evaluated, the preceding components may be adjusted to improve response quality.

Alternatively, question answering can be powered by a fine-tuned LLM, or through an agentic approach. Although we demonstrate the evaluation of final responses from RAG pipelines, the final responses from a generative AI pipeline for question answering can be similarly evaluated because the prerequisites are a golden dataset and the generative answers. With this approach, changes in the generative output due to different generative AI pipeline architectures can be evaluated to inform the best design choices (comparing RAG and knowledge retrieval agents, comparing LLMs used for generation, retrievers, chunking, prompts, and so on).
Although evaluating each sub-component of a generative AI pipeline is important in development and troubleshooting, business decisions rely on having an end-to-end, side-by-side data view, quantifying how a given generative AI pipeline will perform in terms of user experience. With this, business stakeholders can understand expected quality changes in terms of end-user experience by switching LLMs, and adhere to legal and compliance requirements, such as ISO42001 AI Ethics. There are further financial benefits to realize; for example, quantifying expected quality changes on internal datasets when switching a development LLM to a cheaper, lightweight LLM in production. The overall evaluation process for the benefit of decision-makers is outlined in the following figure. In this post, we focus our discussion on ground truth curation, evaluation, and interpreting evaluation scores for entire question answering generative AI pipelines using FMEval to enable data-driven decision-making on quality.

A useful mental model for ground truth curation and improvement of a golden dataset is a flywheel, as shown in the following figure. The ground truth experimentation process involves querying your generative AI pipeline with the initial golden dataset questions and evaluating the responses against initial golden answers using FMEval. Then, the quality of the golden dataset must be reviewed by a judge. The judge review of the golden dataset quality accelerates the flywheel towards an ever-improving golden dataset. The judge role in the workflow can be assumed by another LLM to enable scaling against established, domain-specific criteria for high-quality ground truth. Maintaining a human-in-the-loop component to the judge function remains essential to sample and verify results, as well as to increase the quality bar with increasing task complexity. Improvement to the golden dataset fosters improvement to the quality of the evaluation metrics, until sufficient measurement accuracy in the flywheel is met by the judge, using the established criteria for quality. To learn more about AWS offerings on human review of generations and data labeling, such as Amazon Augmented AI (Amazon A2I) and Amazon SageMaker Ground Truth Plus, refer to Using Amazon Augmented AI for Human Review and High-quality human feedback for your generative AI applications from Amazon SageMaker Ground Truth Plus. When using LLMs as a judge, make sure to apply prompt safety best practices.

However, to conduct reviews of golden dataset quality as part of the ground truth experiment flywheel, human reviewers must understand the evaluation metric implementation and its coupling to ground truth curation.
FMEval metrics for question answering in a generative AI application
The Factual Knowledge and QA Accuracy metrics from FMEval provide a way to evaluate custom question answering datasets against ground truth. For a full list of metrics implemented with FMEval, refer to Using prompt datasets and available evaluation dimensions in model evaluation jobs.
Factual Knowledge
The Factual Knowledge metric evaluates whether the generated response contains factual information present in the ground truth answer. It is a binary (0 or 1) score based on a string match. Factual knowledge also reports a quasi-exact string match which performs matching after normalization. For simplicity, we focus on the exact match Factual Knowledge score in this post.
For each golden question:

0 indicates the lowercased factual ground truth is not present in the model response
1 indicates the lowercased factual ground truth is present in the response

QA Accuracy
The QA Accuracy metric measures a model’s question answering accuracy by comparing its generated answers against ground truth answers. The metrics are computed by counting true positive, false positive, and false negative word matches between the QA ground truth answers and the generated answers.
It includes several sub-metrics:

Recall Over Words – Scores from 0 (worst) to 1 (best), measuring how much of the QA ground truth is contained in the model output
Precision Over Words – Scores from 0 (worst) to 1 (best), measuring how many words in the model output match the QA ground truth
F1 Over Words – The harmonic mean of precision and recall, providing a balanced score from 0 to 1
Exact Match – Binary 0 or 1, indicating if the model output exactly matches the QA ground truth
Quasi Exact Match – Similar to Exact Match, but with normalization (lowercasing and removing articles)

Because QA Accuracy metrics are calculated on an exact match basis (for more details, see Accuracy), they may be less reliable for questions where the answer can be rephrased without modifying its meaning. To mitigate this, we propose applying Factual Knowledge as the assessment of factual correctness, motivating the use of a dedicated factual ground truth with minimal word expression, together with QA Accuracy as a measure of idealized user experience in terms of response verbosity and style. We elaborate on these concepts later in this post. The BERTScore is also computed as part of QA Accuracy, which provides a measure of semantic match quality against the ground truth.
Proposed ground truth curation best practices for question answering with FMEval
In this section, we share best practices for curating your ground truth for question answering with FMEval.
Understanding the Factual Knowledge metric calculation
A factual knowledge score is a binary measure of whether a real-world fact was correctly retrieved by the generative AI pipeline. 0 indicates the lower-cased expected answer is not part of the model response, whereas 1 indicates it is. Where there is more than one acceptable answer, and either answer is considered correct, apply a logical operator for OR. A configuration for a logical AND can also be applied for cases where the factual material encompasses multiple distinct entities. In the present examples, we demonstrate a logical OR, using the <OR> delimiter. See Use SageMaker Clarify to evaluate large language models for information about logical operators. An example curation of a golden question and golden fact is shown in the following table.

Golden Question
“How many shares of common stock were outstanding as of July 21, 2023?”

Golden Fact
10,317,750,796<OR>10317750796

Fact detection is useful for assessing hallucination in a generative AI pipeline. The two sample responses in the following table illustrate fact detection. The first example correctly states the fact in the example response, and receives a 1.0 score. The second example hallucinates a number instead of stating the fact, and receives a 0 score.

Metric
Example Response
Score
Calculation Approach

Factual Knowledge
“Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”
1.0
String match to golden fact

“Based on the documents provided, Amazon had 22,003,237,746 shares of common stock outstanding as of July 21, 2023.”
0.0
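To illustrate the mechanics (this is a simplified stand-in, not the FMEval implementation), the exact match variant of the score behaves like a lowercased substring check across the <OR> alternatives, reproducing the 1 and 0 scores in the preceding table:

def factual_knowledge_exact(response: str, golden_fact: str) -> int:
    # 1 if any lowercased alternative of the golden fact appears in the
    # lowercased response, 0 otherwise.
    alternatives = [alt.strip().lower() for alt in golden_fact.split("<OR>")]
    return int(any(alt in response.lower() for alt in alternatives))

golden_fact = "10,317,750,796<OR>10317750796"
print(factual_knowledge_exact(
    "Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.",
    golden_fact))  # 1
print(factual_knowledge_exact(
    "Amazon had 22,003,237,746 shares of common stock outstanding as of July 21, 2023.",
    golden_fact))  # 0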

In the following example, we highlight the importance of units in ground truth for Factual Knowledge string matching. The golden question and golden fact represent Amazon’s total net sales for the second quarter of 2023.

Golden Question
“What were Amazon’s total net sales for the second quarter of 2023?”

Golden Fact
134.4 billion<OR>134,383 million

The first response hallucinates the fact, using units of billions, and correctly receives a score of 0.0. The second response correctly represents the fact, in units of millions. Both units should be represented in the golden fact. The third response was unable to answer the question, flagging a potential issue with the information retrieval step.

Metric
Example Response
Score
Calculation Approach

Factual Knowledge
Amazon’s total net sales for the second quarter of 2023 were $170.0 billion.
0.0
String match to golden fact

The total consolidated net sales for Q2 2023 were $134,383 million according to this report.
1.0

Sorry, the provided context does not include any information about Amazon’s total net sales for the second quarter of 2023. Would you like to ask another question?
0.0

Interpreting Factual Knowledge scores
Factual knowledge scores are a useful flag for challenges in the generative AI pipeline such as hallucination or information retrieval problems. Factual knowledge scores can be curated in the form of a Factual Knowledge Report for human review, as shown in the following table, to visualize pipeline quality in terms of fact detection side by side.

User Question
QA Ground Truth
Factual Ground Truth
Pipeline 1
Pipeline 2
Pipeline 3

As of June 30, 2023, how many shares of Rivian’s Class A common stock did Amazon hold?
As of June 30, 2023, Amazon held 158 million shares of Rivian’s Class A common stock.
158 million
1
1
1

How many shares of common stock were outstanding as of July 21, 2023?
There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.
10317750796<OR>10,317,750,796
1
1
1

What was Amazon’s operating income for the six months ended June 30, 2023?
Amazon’s operating income for the six months ended June 30, 2023 was $12.5 billion.
12.5 billion<OR>12,455 million<OR>12.455 billion
1
1
1

What was Amazon’s total cash, cash equivalents and restricted cash as of June 30, 2023?
Amazon’s total cash, cash equivalents, and restricted cash as of June 30, 2023 was $50.1 billion.
50.1 billion<OR>50,067 million<OR>50.067 billion
1
0
0

What was a key challenge faced by Amazon’s business in the second quarter of 2023?
Changes in foreign exchange rates reduced Amazon’s International segment net sales by $180 million for Q2 2023.
foreign exchange rates
0
0
0

What were Amazon’s AWS sales for the second quarter of 2023?
Amazon’s AWS sales for the second quarter of 2023 were $22.1 billion.
22.1 billion<OR>22,140 million<OR>22.140 billion<OR>22140 million
1
0
0

What were Amazon’s total net sales for the second quarter of 2023?
Amazon’s total net sales for the second quarter of 2023 were $134.4 billion.
134.4 billion<OR>134,383 million<OR>134383 million<OR>134.383 billion
1
0
0

When did Amazon acquire One Medical?
Amazon acquired One Medical on February 22, 2023 for cash consideration of approximately $3.5 billion, net of cash acquired.
Feb 22 2023<OR>February 22nd 2023<OR>2023-02-22<OR>February 22, 2023
1
0
1

Where is Amazon’s principal office located?
Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210.
410 Terry Avenue North
0
0
0

Who is Andrew R. Jassy?
Andrew R. Jassy is the President and Chief Executive Officer of Amazon.com, Inc.
Chief Executive Officer of Amazon<OR>CEO of Amazon<OR>President of Amazon
1
1
1

Curating Factual Knowledge ground truth
Consider the impact of string matching between your ground truth and LLM responses when curating ground truth for Factual Knowledge. Best practices for curation in consideration of string matching are the following:

Use a minimal version of the QA Accuracy ground truth for a factual ground truth containing the most important facts – Because the Factual Knowledge metric uses exact string matching, curating minimal ground truth facts distinct from the QA Accuracy ground truth is imperative. Using QA Accuracy ground truth will not yield a string match unless the response is identical to the ground truth. Apply logical operators as is best suited to represent your facts.
Zero factual knowledge scores across the benchmark can indicate a poorly formed golden question-answer-fact triplet – If a golden question doesn’t contain an obvious singular answer, or can be equivalently interpreted multiple ways, reframe the golden question or answer to be specific. In the Factual Knowledge table, a question such as “What was a key challenge faced by Amazon’s business in the second quarter of 2023?” can be subjective, and interpreted with multiple possible acceptable answers. Factual Knowledge scores were 0.0 for all entries because each LLM interpreted a unique answer. A better question would be: “How much did foreign exchange rates reduce Amazon’s International segment net sales?” Similarly, “Where is Amazon’s principal office located?” renders multiple acceptable answers, such as “Seattle,” “Seattle, Washington,” or the street address. The question could be reframed as “What is the street address of Amazon’s principal office?” if this is the desired response.
Generate many variations of fact representation in terms of units and punctuation – Different LLMs will use different language to present facts (date formats, engineering units, financial units, and so on). The factual ground truth should accommodate the units expected from the LLMs being evaluated as part of the pipeline. Experimenting with LLMs to automate fact generation from the QA ground truth can help.
Avoid false positive matches – Avoid curating ground truth facts that are overly simple. Short, unpunctuated number sequences, for example, can be matched with years, dates, or phone numbers and can generate false positives.

Understanding QA Accuracy metric calculation
We use the following question answer pair to demonstrate how FMEval metrics are calculated, and how this informs best practices in QA ground truth curation.

Golden Question
“How many shares of common stock were outstanding as of July 21, 2023?”

Golden Answer
“There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”

In calculating QA Accuracy metrics, the responses and the ground truth are first normalized (lowercased, punctuation removed, articles removed, excess whitespace removed). Then, true positive, false positive, and false negative word matches are computed between the LLM response and the ground truth. The QA Accuracy metrics returned by FMEval include recall, precision, and F1. By assessing exact matching, the Exact Match and Quasi-Exact Match metrics are returned. A detailed walkthrough of the calculation and scores is shown in the following tables.
The first table illustrates the accuracy metric calculation mechanism.

Metric
Definition
Example
Score

True Positive (TP)
The number of words in the model output that are also contained in the ground truth.
Golden Answer: “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” Example Response: “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”
11

False Positive (FP)
The number of words in the model output that are not contained in the ground truth.
Golden Answer: “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” Example Response: “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”
7

False Negative (FN)
The number of words that are missing from the model output, but are included in the ground truth.
Golden Answer: “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” Example Response: “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”
3

The following table lists the accuracy scores.

Metric
Score
Calculation Approach

Recall Over Words
0.786
TP / (TP + FN) = 11 / (11 + 3)

Precision Over Words
0.611
TP / (TP + FP) = 11 / (11 + 7)

F1
0.688
2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.611 × 0.786) / (0.611 + 0.786)

Exact Match
0.0
(Non-normalized) Binary score that indicates whether the model output is an exact match for the ground truth answer.

Quasi-Exact Match
0.0
(Normalized) Binary score that indicates whether the model output is an exact match for the ground truth answer.
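The word-overlap arithmetic above can be reproduced with a short, self-contained sketch. The normalization used here (lowercasing and stripping punctuation) is a simplification rather than FMEval’s actual code, but it yields the same counts and scores as the tables:

import re
from collections import Counter

def normalize(text: str) -> list[str]:
    # Lowercase, strip punctuation, and split into words.
    return re.sub(r"[^\w\s]", "", text.lower()).split()

golden_answer = ("There were 10,317,750,796 shares of Amazon's common stock "
                 "outstanding as of July 21, 2023.")
response = ("Based on the documents provided, Amazon had 10,317,750,796 shares "
            "of common stock outstanding as of July 21, 2023.")

gt, out = Counter(normalize(golden_answer)), Counter(normalize(response))
tp = sum((gt & out).values())  # words shared by response and ground truth -> 11
fp = sum(out.values()) - tp    # extra words in the response               -> 7
fn = sum(gt.values()) - tp     # ground truth words missing from response  -> 3

precision = tp / (tp + fp)                            # ~0.611
recall = tp / (tp + fn)                               # ~0.786
f1 = 2 * precision * recall / (precision + recall)    # ~0.688
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")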

Interpreting QA Accuracy scores
The following are best practices for interpreting QA accuracy scores:

Interpret recall as closeness to ground truth – The recall metric in FMEval measures the fraction of ground truth words that are in the model response. With this, we can interpret recall as closeness to ground truth.

The higher the recall score, the more ground truth is included in the model response. If the entire ground truth is included in the model response, recall will be perfect (1.0), and if no ground truth is included in the model response, recall will be zero (0.0).
Low recall in response to a golden question can indicate a problem with information retrieval, as shown in the example in the following table. A high recall score, however, doesn’t by itself indicate a correct response. Hallucinations of facts can present as a single deviated word between model response and ground truth, while still yielding a high true positive rate in word matching. For such cases, you can complement QA Accuracy scores with Factual Knowledge assessments of golden questions in FMEval (we provide examples later in this post).

 Interpretation
Question
Curated Ground Truth
High Closeness to Ground Truth
Low Closeness to Ground Truth

Interpreting Closeness to Ground Truth Scores
“How many shares of common stock were outstanding as of July 21, 2023?”
“There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”
“As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.”
0.923
“Sorry, I do not have access to documents containing common stock information about Amazon.”
0.111

Interpret precision as conciseness to ground truth – The higher the score, the closer the LLM response is to the ground truth in terms of conveying ground truth information in the fewest number of words. By this definition, we recommend interpreting precision scores as a measure of conciseness to the ground truth. The following table demonstrates LLM responses that show high conciseness to the ground truth and low conciseness. Both answers are factually correct, but the reduction in precision is derived from the higher verbosity of the LLM response relative to the ground truth.

 Interpretation
Question
Curated Ground Truth
High Conciseness to Ground Truth
Low Conciseness to Ground Truth

Interpreting Conciseness to Ground Truth
“How many shares of common stock were outstanding as of July 21, 2023?”
“There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”
As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.
1.0
“Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023. Specifically, in the first excerpt from the quarterly report for the quarter ending June 30, 2023, it states: ‘10,317,750,796 shares of common stock, par value $0.01 per share, outstanding as of July 21, 2023’ Therefore, the number of shares of Amazon common stock outstanding as of July 21, 2023 was 10,317,750,796 according to this statement.”
0.238

Interpret F1 score as combined closeness and conciseness to ground truth – F1 score is the harmonic mean of precision and recall, and so represents a joint measure that equally weights closeness and conciseness for a holistic score. The highest-scoring responses contain all the words in the curated ground truth and remain similarly concise. The lowest-scoring responses differ in verbosity relative to the ground truth and contain a large number of words that are not present in the ground truth. Due to the intermixing of these qualities, F1 score interpretation is subjective. Reviewing recall and precision independently will clearly indicate the qualities of the generative responses in terms of closeness and conciseness. Some examples of high and low F1 scores are provided in the following table.

 Interpretation
Question
Curated Ground Truth
High Combined Closeness x Conciseness
Low Combined Closeness x Conciseness

Interpreting Closeness and Conciseness to Ground Truth
“How many shares of common stock were outstanding as of July 21, 2023?”
“There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”
“As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.”
0.96
“Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023. Specifically, in the first excerpt from the quarterly report for the quarter ending June 30, 2023, it states: ‘10,317,750,796 shares of common stock, par value $0.01 per share, outstanding as of July 21, 2023’ Therefore, the number of shares of Amazon common stock outstanding as of July 21, 2023 was 10,317,750,796 according to this statement.”
0.364

Combine factual knowledge with recall for detection of hallucinated facts and false fact matches – Factual Knowledge scores can be interpreted in combination with recall metrics to distinguish likely hallucinations and false positive facts. For example, the following cases can be caught, with examples in the following table:

High recall with zero factual knowledge suggests a hallucinated fact.
Zero recall with positive factual knowledge suggests an accidental match between the factual ground truth and an unrelated entity such as a document ID, phone number, or date.
Low recall and zero factual knowledge may also suggest a correct answer that has been expressed with alternative language to the QA ground truth. Improved ground truth curation (increased question specificity, more ground truth fact variants) can remediate this problem. The BERTScore can also provide semantic context on match quality.

Interpretation
QA Ground Truth
Factual Ground Truth
Factual Knowledge
Recall Score
LLM response

Hallucination detection
Amazon’s total net sales for the second quarter of 2023 were $134.4 billion.
134.4 billion<OR>134,383 million
0
0.92
Amazon’s total net sales for the second quarter of 2023 were $170.0 billion.

Detect false positive facts
There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.
10317750796<OR> 10,317,750,796
1.0
0.0
Document ID: 10317750796

Correct answer, expressed in different words to ground truth question-answer-fact
Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210.
410 Terry Avenue North
0
0.54
Amazon’s principal office is located in Seattle, Washington.
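As a review aid, these interpretation rules can be encoded in a small helper for flagging suspicious score combinations; the thresholds are illustrative assumptions to tune on your own golden dataset:

def flag_response(factual_knowledge: int, recall: float,
                  high_recall: float = 0.7, low_recall: float = 0.3) -> str:
    # Combine the binary Factual Knowledge score with recall to flag likely issues.
    if factual_knowledge == 0 and recall >= high_recall:
        return "possible hallucinated fact"
    if factual_knowledge == 1 and recall <= low_recall:
        return "possible false positive fact match"
    if factual_knowledge == 0:
        return "possible paraphrase or retrieval miss; review the ground truth"
    return "consistent with ground truth"

print(flag_response(0, 0.92))  # hallucination example from the table
print(flag_response(1, 0.0))   # accidental fact match example
print(flag_response(0, 0.54))  # correct answer worded differently than the ground truth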

Curating QA Accuracy ground truth
Consider the impact of true positive, false positive, and false negative matches between your golden answer and LLM responses when curating your ground truth for QA Accuracy. Best practices for curation in consideration of string matching are as follows:

Use LLMs to generate initial golden questions and answers – This is beneficial in terms of speed and level of effort; however, outputs must be reviewed and further curated if necessary before acceptance (see Step 3 of the ground truth experimentation flywheel earlier in this post). Furthermore, applying an LLM to generate your ground truth may bias correct answers towards that LLM, for example, due to string matching of filler words that the LLM commonly uses in its language expression but that other LLMs may not. Keeping ground truth expressed in an LLM-agnostic manner is a gold standard.
Have humans review golden answers for proximity to the desired output – Your golden answers should reflect your standard for the user-facing assistant in terms of factual content and verbiage. Consider the desired level of verbosity and choice of words you expect as outputs based on your production RAG prompt template. Overly verbose ground truths, and ground truths that adopt language unlikely to be in the model output, will increase false negative scores unnecessarily. Human curation of generated golden answers should reflect the desired verbosity and word choice in addition to accuracy of information, before accepting LLM generated golden answers, to make sure evaluation metrics are computed relative to a true golden standard. Apply guardrails on the verbosity of ground truth, such as controlling word count, as part of the generation process.
Compare LLM accuracy using recall – Closeness to ground truth is the best indicator of word agreement between the model response and the ground truth. When golden answers are curated properly, a low recall suggests strong deviation between the ground truth and the model response, whereas a high recall suggests strong agreement.
Compare verbosity using precision – When golden answers are curated properly, verbose LLM responses decrease precision scores due to false positives present, and concise LLM responses are rewarded by high precision scores. If the golden answer is highly verbose, however, concise model responses will incur false negatives.
Experiment to determine recall acceptability thresholds for generative AI pipelines – A recall threshold for the golden dataset can be set to determine cutoffs for pipeline quality acceptability.
Interpret QA accuracy metrics in conjunction with other metrics to pass judgement on accuracy – Metrics such as Factual Knowledge can be combined with QA Accuracy scores to judge factual knowledge in addition to ground truth word matching.

Key takeaways
Curating appropriate ground truth and interpreting evaluation metrics in a feedback loop is crucial for effective business decision-making when deploying generative AI pipelines for question answering.
There were several key takeaways from this experiment:

Ground truth curation and metric interpretation are a cyclical process – Understanding how the metrics are calculated should inform the ground truth curation approach to achieve the desired comparison.
Low-scoring evaluations can indicate problems with ground truth curation in addition to generative AI pipeline quality – Using golden datasets that don’t reflect true answer quality (misleading questions, incorrect answers, or ground truth answers that don’t reflect the expected response style) can be the root cause of poor evaluation results for a successful pipeline. When golden dataset curation is in place, low-scoring evaluations will correctly flag pipeline problems.
Balance recall, precision, and F1 scores – Find the balance between acceptable recall (closeness to ground truth), precision (conciseness to ground truth), and F1 scores (combined) through iterative experimentation and data curation. Pay close attention to what scores quantify your ideal closeness to ground truth and conciseness to the ground truth based on your data and business objectives.
Design ground truth verbosity to the level desired in your user experience – For QA Accuracy evaluation, curate ground truth answers that reflect the desired level of conciseness and word choice expected from the production assistant. Overly verbose or unnaturally worded ground truths can unnecessarily decrease precision scores.
Use recall and factual knowledge for setting accuracy thresholds – Interpret recall in conjunction with factual knowledge to assess overall accuracy, and establish thresholds by experimentation on your own datasets. Factual knowledge scores can complement recall to detect hallucinations (high recall, false factual knowledge) and accidental fact matches (zero recall, true factual knowledge).
Curate distinct QA and factual ground truths – For a Factual Knowledge evaluation, curate minimal ground truth facts distinct from the QA Accuracy ground truth. Generate comprehensive variations of fact representations in terms of units, punctuation, and formats.
Golden questions should be unambiguous – Zero factual knowledge scores across the benchmark can indicate poorly formed golden question-answer-fact triplets. Reframe subjective or ambiguous questions to have a specific, singular acceptable answer.
Automate, but verify, with LLMs – Use LLMs to generate initial ground truth answers and facts, with a human review and curation to align with the desired assistant output standards. Recognize that applying an LLM to generate your ground truth may bias correct answers towards that LLM during evaluation due to matching filler words, and strive to keep ground truth language LLM-agnostic.

Conclusion
In this post, we outlined best practices for ground truth curation and metric interpretation when evaluating generative AI question answering using FMEval. We demonstrated how to curate ground truth question-answer-fact triplets in consideration of the Factual Knowledge and QA Accuracy metrics calculated by FMEval. To validate our approach, we curated a golden dataset of 10 question-answer-fact triplets from Amazon’s Q2 2023 10Q report. We generated responses from three anonymized generative AI pipelines and calculated QA Accuracy and Factual Knowledge metrics.
Our primary findings emphasize that ground truth curation and metric interpretation are tightly coupled. Ground truth should be curated with the measurement approach in mind, and metrics can update the ground truth during golden dataset development. We further recommend curating separate ground truths for QA accuracy and factual knowledge, particularly emphasizing setting a desired level of verbosity according to user experience goals, and setting golden questions with unambiguous interpretations. Closeness and conciseness to ground truth are valid interpretations of FMEval recall and precision metrics, and factual knowledge scores can be used to detect hallucinations. Ultimately, the quantification of the expected user experience in the form of a golden dataset for pipeline evaluation with FMEval supports business decision-making, such as choosing between pipeline options, projecting quality changes from development to production, and adhering to legal and compliance requirements.
Whether you are building an internal application, a customer-facing virtual assistant, or exploring the potential of generative AI for your business, this post can help you use FMEval to make sure your projects meet the highest standards of quality and responsibility. We encourage you to adopt these best practices and start evaluating your generative AI question answering pipelines with the FMEval toolkit today.

About the Authors
Samantha Stuart is a Data Scientist with AWS Professional Services, and has delivered for customers across generative AI, MLOps, and ETL engagements. Samantha has a research master’s degree in engineering from the University of Toronto, where she authored several publications on data-centric AI for drug delivery system design. Outside of work, she is most likely spotted playing music, spending time with friends and family, at the yoga studio, or exploring Toronto.
Rahul Jani is a Data Architect with AWS Professional Services. He collaborates closely with enterprise customers building modern data platforms, generative AI applications, and MLOps. He is specialized in the design and implementation of big data and analytical applications on the AWS platform. Beyond work, he values quality time with family and embraces opportunities for travel.
Ivan Cui is a Data Science Lead with AWS Professional Services, where he helps customers build and deploy solutions using ML and generative AI on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, healthcare, IoT, and entertainment and media. In his free time, he enjoys reading, spending time with his family, and traveling.
Andrei Ivanovic is a Data Scientist with AWS Professional Services, with experience delivering internal and external solutions in generative AI, AI/ML, time series forecasting, and geospatial data science. Andrei has a Master’s in CS from the University of Toronto, where he was a researcher at the intersection of deep learning, robotics, and autonomous driving. Outside of work, he enjoys literature, film, strength training, and spending time with loved ones.

DetoxBench: Comprehensive Evaluation of Large Language Models for Effective Detection of Fraud and Abuse Across Diverse Real-World Scenarios

Several significant benchmarks have been developed to evaluate language understanding and specific applications of large language models (LLMs). Notable benchmarks include GLUE, SuperGLUE, ANLI, LAMA, TruthfulQA, and Persuasion for Good, which assess LLMs on tasks such as sentiment analysis, commonsense reasoning, and factual accuracy. However, limited work has specifically targeted fraud and abuse detection using LLMs, with challenges stemming from restricted data availability and the prevalence of numeric datasets unsuitable for LLM training.

The scarcity of public datasets and the difficulty of representing fraud patterns textually have underscored the need for a specialized evaluation framework. These limitations have driven the development of more targeted research and resources to enhance the detection and mitigation of malicious language using LLMs. New AI research from Amazon introduces a novel approach to address these gaps and advance LLM capabilities in fraud and abuse detection.

Researchers present “DetoxBench,” a comprehensive evaluation of LLMs for fraud and abuse detection, addressing their potential and challenges. The paper emphasizes LLMs’ capabilities in natural language processing but highlights the need for further exploration in high-stakes applications like fraud detection. The paper underscores the societal harm caused by fraud, the current reliance on traditional models, and the lack of holistic benchmarks for LLMs in this domain. The benchmark suite aims to evaluate LLMs’ effectiveness, promote ethical AI development, and mitigate real-world harm.

DetoxBench’s methodology involves developing a benchmark suite tailored to assess LLMs in detecting and mitigating fraudulent and abusive language. The suite includes tasks like spam detection, hate speech, and misogynistic language identification, reflecting real-world challenges. Several state-of-the-art LLMs, including those from Anthropic, Mistral AI, and AI21, were selected for evaluation, ensuring a comprehensive assessment of different models’ capabilities in fraud and abuse detection.

The experimentation emphasizes task diversity to evaluate LLMs’ generalization across various fraud and abuse detection scenarios. Performance metrics are analyzed to identify model strengths and weaknesses, particularly in tasks requiring nuanced understanding. Comparative analysis reveals variability in LLM performance, indicating the need for further refinement for high-stakes applications. The findings highlight the importance of ongoing development and responsible deployment of LLMs in critical areas like fraud detection.

The DetoxBench evaluation of eight large language models (LLMs) across various fraud and abuse detection tasks revealed significant differences in performance. The Mistral Large model achieved the highest F1 scores in five out of eight tasks, demonstrating its effectiveness. Anthropic Claude models exhibited high precision, exceeding 90% in some tasks, but had notably low recall, dropping below 10% for toxic chat and hate speech detection. Cohere models displayed high recall, with 98% for fraud email detection, but lower precision, at 64%, leading to a higher false positive rate. Inference times varied, with AI21 models being the fastest at 1.5 seconds per instance, while Mistral Large and Anthropic Claude models took approximately 10 seconds per instance.

Few-shot prompting offered a limited improvement over zero-shot prompting, with specific gains in tasks like fake job detection and misogyny detection. The imbalanced datasets, which had fewer abusive cases, were addressed by random undersampling, creating balanced test sets for better evaluation. Format compliance issues excluded models like Cohere’s Command R from final results. These findings highlight the importance of task-specific model selection and suggest that fine-tuning LLMs could further enhance their performance in fraud and abuse detection.

In conclusion, DetoxBench establishes the first systematic benchmark for evaluating LLMs in fraud and abuse detection, revealing key insights into model performance. Larger models, such as the 200-billion-parameter Anthropic and 176-billion-parameter Mistral AI model families, excelled, particularly in contextual understanding. The study found that few-shot prompting often did not outperform zero-shot prompting, suggesting variability in prompting effectiveness. Future research aims to fine-tune LLMs and explore advanced techniques, emphasizing the importance of careful model selection and strategy to enhance detection capabilities in this critical area.

Check out the Paper. All credit for this research goes to the researchers of this project.


Anthropic Released Claude for Enterprise: A Powerful and Ethical AI Solution Prioritizing Safety, Transparency, and Compliance for Modern Business Transformation

Anthropic, a company known for its commitment to creating AI systems that prioritize safety, transparency, and alignment with human values, has introduced Claude for Enterprise to meet the growing demands of businesses seeking reliable, ethical AI solutions. As organizations increasingly adopt AI technologies to enhance productivity and streamline operations, Claude for Enterprise emerges as a powerful tool designed to address key challenges faced by enterprises in leveraging AI effectively and safely.

Background on Anthropic and Claude

Founded by former OpenAI researchers, Anthropic has been at the forefront of AI safety and ethics. The company was established to create powerful AI systems aligned with human goals, reducing the risk of unintended consequences. Their approach has consistently emphasized building an interpretable AI capable of being effectively controlled by its users. Claude, named after Claude Shannon, the father of information theory, is the result of Anthropic’s efforts to develop an AI that prioritizes safety while maintaining high-performance levels.

The enterprise version of Claude builds on its previous iterations, enhancing its capabilities to cater to business environments. While earlier versions of Claude were available for public use, Claude for Enterprise is specifically tailored to fulfill the needs of businesses and companies that require scalable, secure, and compliant AI solutions. The introduction of Claude for Enterprise is seen as a strategic move by Anthropic to position itself as a pioneer in the enterprise AI market, competing with other major players like OpenAI, Microsoft, and Google.

Features and Capabilities of Claude for Enterprise

Claude for Enterprise has several key features that appeal to businesses. One of the standout features is its focus on safety and ethical use. Anthropic has incorporated robust safety mechanisms into Claude, ensuring that the AI is less likely to generate harmful or biased outputs. This is particularly important in enterprise settings, where the consequences of biased or inappropriate AI behavior can be significant, potentially leading to reputational damage or legal challenges.

Claude for Enterprise is also built to be scalable and flexible, making it suitable for various applications across different industries. Whether a company uses AI for customer support, data analysis, or process automation, Claude can be customized to meet specific business needs. Its ability to quickly and accurately process large volumes of data makes it a valuable asset for companies looking to improve operational efficiency and decision-making.

Another important feature of Claude for Enterprise is its transparency. One of the main concerns organizations have when adopting AI is the “black box” nature of many AI systems, where the internal workings of the model are opaque and difficult to understand. Anthropic has addressed this issue by making Claude more interpretable, allowing users to better understand how the AI arrives at its conclusions. This helps businesses ensure that the AI functions correctly and enables them to comply with regulatory requirements regarding AI transparency and accountability.

Addressing Security and Compliance

Safety, transparency, security, and compliance are major considerations for enterprises adopting AI technologies. Claude for Enterprise is designed with these concerns in mind, offering enterprise-grade security features to protect sensitive data. As organizations increasingly handle large amounts of personal and proprietary information, securing AI systems has become paramount. Claude’s architecture ensures that data is dealt with securely, minimizing the risk of breaches or unauthorized access.

Compliance with industry regulations is another area in which Claude for Enterprise excels. Many organizations are subject to strict rules regarding data privacy and the use of technology. Claude for Enterprise is designed to meet these regulatory requirements, making it smoother for businesses in regulated industries to adopt AI without running afoul of the law. By offering a compliant AI solution, Anthropic addresses one of the major barriers to AI adoption in the enterprise space.

The Role of AI in Business Transformation

The introduction of Claude for Enterprise reflects the broader trend of AI transforming how businesses operate. AI technologies can change industries by automating routine tasks, enhancing decision-making, and enabling personalized customer interactions. However, the full potential can only be realized if businesses have access to safe, reliable, and scalable AI systems.

Claude for Enterprise is positioned as a tool to help businesses and organizations navigate the complexities of AI adoption. For example, companies can use Claude to enhance customer service by deploying AI-powered chatbots that can handle various inquiries, freeing human agents to focus on more complex issues. Claude can assist with data analysis in the financial sector, identifying patterns and trends that human analysts might miss. Meanwhile, Claude can be used in healthcare to analyze medical records and provide insights that help doctors make more informed decisions.

Challenges and Future Outlook

While the release of Claude for Enterprise is a significant step forward, challenges still need to be addressed for widespread AI adoption in the business world. One of the primary concerns is the potential for job displacement. As AI systems become more capable, there is a growing fear and anxiety that they will replace human workers in certain roles, particularly in industries like customer service and data entry. Like other AI companies, Anthropic will need to work with businesses to ensure that AI is implemented to complement human workers rather than replace them.

Another challenge is the ethical use of AI. Despite the safety mechanisms built into Claude, there is always the risk that AI could be used in harmful or unethical ways. Anthropic’s commitment to ethical AI development is commendable, but it will require ongoing vigilance to ensure that Claude is used responsibly in enterprise settings.

Looking ahead, the future of AI in business looks promising. As more companies and businesses realize the benefits of AI, demand for enterprise-grade AI solutions like Claude is likely to grow. Anthropic’s focus on safety, transparency, and compliance positions it well to meet this demand, and Claude for Enterprise could become a key player in the AI market.

Conclusion

The release of Claude for Enterprise by Anthropic represents a major step forward in the development of safe, reliable, and scalable AI for businesses. With its emphasis on transparency, safety, and compliance, Claude is well-suited to meet the needs of organizations across various industries. Tools like Claude for Enterprise will play an important role in helping companies harness the power of AI while mitigating risks. Anthropic’s commitment to ethical AI development ensures that Claude for Enterprise is not only a powerful tool but also a responsible one.

Check out the Details. All credit for this research goes to the researchers of this project.


Yi-Coder Released by 01.AI: A Powerful Small-Scale Code LLM Series, Delivering Exceptional Performance in Code Generation, Editing, and Long-Context Comprehension

The landscape of large language models (LLMs) for coding has been enriched with the release of Yi-Coder by 01.AI, a series of open-source models designed for efficient and powerful coding performance. Despite its relatively small size, Yi-Coder delivers state-of-the-art results, positioning itself as a formidable player in code generation and completion. Available in two configurations, 1.5 billion and 9 billion parameters, Yi-Coder proves that bigger isn’t always better, offering an impressive range of capabilities tailored for developers seeking high-performance models with minimal resource overhead. The four variants open-sourced on Hugging Face so far are:

Yi-Coder-9B-Chat: This model is designed for text generation, focusing on code-related tasks and offering interactive and conversational capabilities. It delivers state-of-the-art performance in competitive programming and long-context code generation and was recently updated to enhance its efficiency.

Yi-Coder-9B: The larger base model in the series, Yi-Coder-9B, offers powerful code generation and comprehension across 52 programming languages. Updated to optimize its long-context processing further, it excels at precisely handling complex, project-level tasks.

Yi-Coder-1.5B-Chat: A smaller, lightweight model designed for chat-based coding tasks, Yi-Coder-1.5B-Chat delivers impressive results in code editing and interactive code completion. The recent update focuses on improving its real-time performance and accuracy in conversational coding applications.

Yi-Coder-1.5B: This base model offers an efficient solution for developers needing fast code generation with fewer computational resources. The recent update enhances its ability to tackle basic programming tasks, making it a highly versatile tool for developers with limited hardware.
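To experiment with one of these variants locally, the following is a minimal sketch using the Hugging Face Transformers library; the repository ID, prompt, and generation settings are illustrative assumptions rather than an official quickstart.

# Minimal sketch: generate code with a Yi-Coder chat variant via Transformers.
# The model ID and generation settings are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "01-ai/Yi-Coder-1.5B-Chat"  # assumed Hugging Face repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "user", "content": "Write a Python function that reverses a linked list."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))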

Yi-Coder-9B, the larger of the two models, stands out due to its advanced training. It builds upon Yi-9B with an additional 2.4 trillion high-quality tokens sourced from a comprehensive repository-level code corpus on GitHub and code-related data filtered from CommonCrawl. These tokens cover 52 major programming languages, enabling Yi-Coder to offer unmatched proficiency across various coding environments. The ability to handle long-context modeling with a maximum context window of 128K tokens makes Yi-Coder ideal for handling complex, project-level code generation and comprehension tasks.

One of Yi-Coder’s most impressive aspects is its competitive performance, particularly with the Yi-Coder-9B-Chat model. In rigorous evaluations, Yi-Coder-9B-Chat achieved a 23.4% pass rate on LiveCodeBench, a platform designed to benchmark LLMs using real-time competitive programming problems sourced from LeetCode, AtCoder, and CodeForces. Notably, Yi-Coder’s performance surpassed much larger models, including DeepSeek-Coder-33B-Instruct and CodeGeex4-All-9B, making it the only model under 10 billion parameters to break the 20% threshold.

In addition to its competitive programming strengths, Yi-Coder excelled in standard code generation benchmarks such as HumanEval, MBPP, and CRUXEval-O. With an 85.4% pass rate on HumanEval and a 73.8% pass rate on MBPP, Yi-Coder-9B-Chat outperformed many of its peers, showcasing its ability to handle basic and complex coding tasks. It also became the first open-source LLM to surpass 50% accuracy on CRUXEval-O, further cementing its status as a high-performing model in the coding community.

Yi-Coder is not limited to code generation; it also excels in code editing tasks. Using CodeEditorBench, a benchmark designed to evaluate a model’s ability to perform debugging, translation, language switching, and code polishing, Yi-Coder consistently outperformed its competitors. The model demonstrated impressive win rates against other open-source models, particularly in debugging and code translation. This makes Yi-Coder attractive for developers looking to streamline their code refinement processes.

Another critical area where Yi-Coder shines is cross-file code completion, a key requirement for modern Integrated Development Environments (IDEs). On the CrossCodeEval benchmark, which tests models’ ability to understand and complete code with cross-file dependencies, Yi-Coder outperformed similarly sized models in both retrieval and non-retrieval contexts. This result can be attributed to its extensive training on repository-level code corpora, allowing it to capture long-term dependencies and efficiently complete code tasks that span multiple files.

Long-context comprehension is one of Yi-Coder’s most unique strengths. In a synthetic task called “Needle in the code,” Yi-Coder demonstrated its ability to handle sequences as long as 128K tokens, twice the length used in comparable evaluations like those of CodeQwen1.5. The model flawlessly completed this task, demonstrating its proficiency in extracting key information from extensive codebases, a crucial skill for developers working on large-scale projects.

In addition to its coding capabilities, Yi-Coder has shown promise in mathematical reasoning. By leveraging program-aided language models (PAL), Yi-Coder-9B achieved an average accuracy of 70.3% across seven mathematical reasoning benchmarks, surpassing the performance of the larger DeepSeek-Coder-33B. This demonstrates that strong coding abilities can translate into other domains, such as solving complex mathematical problems.

In conclusion, Yi-Coder’s release marks an important step forward in the evolution of code-focused LLMs. Despite its relatively small parameter count, the model offers a competitive edge over larger alternatives, excelling in long-context comprehension, mathematical reasoning, and code editing. Its availability in base and chat versions provides flexibility for users seeking efficient inference and training options. By open-sourcing Yi-Coder, 01.AI has significantly contributed to the development community. The model’s remarkable performance across various coding tasks and its efficient architecture position Yi-Coder as a powerful tool for developers looking to push the boundaries of what small LLMs can achieve in software development.

Check out the Details and Model Series. All credit for this research goes to the researchers of this project.


Build powerful RAG pipelines with LlamaIndex and Amazon Bedrock

This post was co-written with Jerry Liu from LlamaIndex.
Retrieval Augmented Generation (RAG) has emerged as a powerful technique for enhancing the capabilities of large language models (LLMs). By combining the vast knowledge stored in external data sources with the generative power of LLMs, RAG enables you to tackle complex tasks that require both knowledge and creativity. Today, RAG techniques are used in every enterprise, small and large, where generative artificial intelligence (AI) is used as an enabler for solving document-based question answering and other types of analysis.
Although building a simple RAG system is straightforward, building production RAG systems using advanced patterns is challenging. A production RAG pipeline typically operates over larger data volumes and greater data complexity, and must meet a higher quality bar compared to a proof of concept. A broad challenge that developers face is low response quality; the RAG pipeline is not able to sufficiently answer a large number of questions. This can be due to a variety of reasons; the following are some of the most common:

Bad retrievals – The relevant context needed to answer the question is missing.
Incomplete responses – The relevant context is partially there but not completely. The generated output doesn’t fully answer the input question.
Hallucinations – The relevant context is there but the model is not able to extract the relevant information in order to answer the question.

This necessitates more advanced RAG techniques on the query understanding, retrieval, and generation components in order to handle these failure modes.
This is where LlamaIndex comes in. LlamaIndex is an open source library with both simple and advanced techniques that enables developers to build production RAG pipelines. It provides a flexible and modular framework for building and querying document indexes, integrating with various LLMs, and implementing advanced RAG patterns.
Amazon Bedrock is a managed service providing access to high-performing foundation models (FMs) from leading AI providers through a unified API. It offers a wide range of large models to choose from, along with capabilities to securely build and customize generative AI applications. Key advanced features include model customization with fine-tuning and continued pre-training using your own data, as well as RAG to augment model outputs by retrieving context from configured knowledge bases containing your private data sources. You can also create intelligent agents that orchestrate FMs with enterprise systems and data. Other enterprise capabilities include provisioned throughput for guaranteed low-latency inference at scale, model evaluation to compare performance, and AI guardrails to implement safeguards. Amazon Bedrock abstracts away infrastructure management through a fully managed, serverless experience.
In this post, we explore how to use LlamaIndex to build advanced RAG pipelines with Amazon Bedrock. We discuss how to set up the following:

Simple RAG pipeline – Set up a RAG pipeline in LlamaIndex with Amazon Bedrock models and top-k vector search
Router query – Add an automated router that can dynamically do semantic search (top-k) or summarization over data
Sub-question query – Add a query decomposition layer that can decompose complex queries into multiple simpler ones, and run them with the relevant tools
Agentic RAG – Build a stateful agent that can do the preceding components (tool use, query decomposition), but also maintain state, such as conversation history and reasoning, over time

Simple RAG pipeline
At its core, RAG involves retrieving relevant information from external data sources and using it to augment the prompts fed to an LLM. This allows the LLM to generate responses that are grounded in factual knowledge and tailored to the specific query.
For RAG workflows in Amazon Bedrock, documents from configured knowledge bases go through preprocessing, where they are split into chunks, embedded into vectors, and indexed in a vector database. This allows efficient retrieval of relevant information at runtime. When a user query comes in, the same embedding model is used to convert the query text into a vector representation. This query vector is compared against the indexed document vectors to identify the most semantically similar chunks from the knowledge base. The retrieved chunks provide additional context related to the user’s query. This contextual information is appended to the original user prompt before being passed to the FM to generate a response. By augmenting the prompt with relevant data pulled from the knowledge base, the model’s output is able to use and be informed by an organization’s proprietary information sources. This RAG process can also be orchestrated by agents, which use the FM to determine when to query the knowledge base and how to incorporate the retrieved context into the workflow.
The following diagram illustrates this workflow.
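Programmatically, this managed retrieve-then-generate flow can be exercised with a single API call. The following is a minimal boto3 sketch; the knowledge base ID and model ARN are placeholders you would replace with your own values.

import boto3

# Minimal sketch of managed RAG against a Bedrock knowledge base
# (the knowledge base ID and model ARN are placeholders).
client = boto3.client("bedrock-agent-runtime")

response = client.retrieve_and_generate(
    input={"text": "What was Lyft's revenue growth in 2021?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
)
print(response["output"]["text"])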

The following is a simplified example of a RAG pipeline using LlamaIndex:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load documents
documents = SimpleDirectoryReader("data/").load_data()

# Create a vector store index
index = VectorStoreIndex.from_documents(documents)

# Query the index through a query engine
query_engine = index.as_query_engine()
response = query_engine.query("What is the capital of France?")

# Print the response
print(response)

The pipeline includes the following steps:

Use the SimpleDirectoryReader to load documents from the data/ directory.
Create a VectorStoreIndex from the loaded documents. This type of index converts documents into numerical representations (vectors) that capture their semantic meaning.
Create a query engine from the index and query it with the question “What is the capital of France?” The index uses similarity measures to identify the documents most relevant to the query.
The retrieved documents are then used to augment the prompt for the LLM, which generates a response based on the combined information.

LlamaIndex goes beyond simple RAG and enables the implementation of more sophisticated patterns, which we discuss in the following sections.
Router query
RouterQueryEngine allows you to route queries to different indexes or query engines based on the nature of the query. For example, you could route summarization questions to a summary index and factual questions to a vector store index.
The following is a code snippet from the example notebooks demonstrating RouterQueryEngine:

from llama_index.core import SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine

# Create summary and vector indices
summary_index = SummaryIndex.from_documents(documents)
vector_index = VectorStoreIndex.from_documents(documents)

# Define query engines
summary_query_engine = summary_index.as_query_engine()
vector_query_engine = vector_index.as_query_engine()

# Create router query engine
query_engine = RouterQueryEngine(
    # Define logic for routing queries
    # ...
    query_engine_tools=[
        summary_query_engine,
        vector_query_engine,
    ],
)

# Query the engine
response = query_engine.query("What is the main idea of the document?")
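The routing logic elided above is typically supplied through a selector plus QueryEngineTool wrappers whose descriptions the selector reasons over. The following is a minimal sketch of one way to complete the wiring; the tool descriptions are illustrative assumptions.

from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

# Wrap each query engine in a tool with a description the selector can reason over
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description="Useful for summarization questions about the document",
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Useful for answering specific factual questions about the document",
)

# An LLM-based selector routes each query to the most appropriate tool
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[summary_tool, vector_tool],
)

response = query_engine.query("What is the main idea of the document?")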

Sub-question query
SubQuestionQueryEngine breaks down complex queries into simpler sub-queries and then combines the answers from each sub-query to generate a comprehensive response. This is particularly useful for queries that span across multiple documents. It first breaks down the complex query into sub-questions for each relevant data source, then gathers the intermediate responses and synthesizes a final response that integrates the relevant information from each sub-query. For example, if the original query was “What is the population of the capital city of the country with the highest GDP in Europe,” the engine would first break it down into sub-queries like “What is the highest GDP country in Europe,” “What is the capital city of that country,” and “What is the population of that capital city,” and then combine the answers to those sub-queries into a final comprehensive response.
The following is an example of using SubQuestionQueryEngine:

from llama_index.core.query_engine import SubQuestionQueryEngine

# Create sub-question query engine
sub_question_query_engine = SubQuestionQueryEngine.from_defaults(
    # Define tools for generating sub-questions and answering them
    # ...
)

# Query the engine
response = sub_question_query_engine.query(
    "Compare the revenue growth of Uber and Lyft from 2020 to 2021"
)
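The tools elided above are QueryEngineTool objects, one per data source, so the engine knows which index each sub-question should be routed to. A minimal sketch, assuming separate Uber and Lyft indexes (for example, one per 10-K filing) have already been built, could look like the following.

from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Assumes uber_index and lyft_index were built separately beforehand
query_engine_tools = [
    QueryEngineTool(
        query_engine=uber_index.as_query_engine(),
        metadata=ToolMetadata(name="uber_10k", description="Financial information about Uber for 2021"),
    ),
    QueryEngineTool(
        query_engine=lyft_index.as_query_engine(),
        metadata=ToolMetadata(name="lyft_10k", description="Financial information about Lyft for 2021"),
    ),
]

sub_question_query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
)

response = sub_question_query_engine.query(
    "Compare the revenue growth of Uber and Lyft from 2020 to 2021"
)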

Agentic RAG
An agentic approach to RAG uses an LLM to reason about the query and determine which tools (such as indexes or query engines) to use and in what sequence. This allows for a more dynamic and adaptive RAG pipeline. The following architecture diagram shows how agentic RAG works on Amazon Bedrock.

Agentic RAG in Amazon Bedrock combines the capabilities of agents and knowledge bases to enable RAG workflows. Agents act as intelligent orchestrators that can query knowledge bases during their workflow to retrieve relevant information and context to augment the responses generated by the FM.
After the initial preprocessing of the user input, the agent enters an orchestration loop. In this loop, the agent invokes the FM, which generates a rationale outlining the next step the agent should take. One potential step is to query an attached knowledge base to retrieve supplemental context from the indexed documents and data sources.
If a knowledge base query is deemed beneficial, the agent invokes an InvokeModel call specifically for knowledge base response generation. This fetches relevant document chunks from the knowledge base based on semantic similarity to the current context. These retrieved chunks provide additional information that is included in the prompt sent back to the FM. The model then generates an observation response that is parsed and can trigger further orchestration steps, like invoking external APIs (through action group AWS Lambda functions) or providing a final response to the user. This agentic orchestration augmented by knowledge base retrieval continues until the request is fully handled.
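To illustrate, invoking such a knowledge base-backed agent programmatically could look like the following minimal boto3 sketch; the agent ID and alias ID are placeholders, and error handling is omitted.

import uuid
import boto3

# Minimal sketch of invoking a Bedrock agent backed by a knowledge base
# (agent and alias IDs are placeholders).
client = boto3.client("bedrock-agent-runtime")

response = client.invoke_agent(
    agentId="YOUR_AGENT_ID",
    agentAliasId="YOUR_AGENT_ALIAS_ID",
    sessionId=str(uuid.uuid4()),
    inputText="What was Lyft's revenue growth in 2021?",
)

# The response is an event stream; concatenate the returned text chunks
answer = ""
for event in response["completion"]:
    if "chunk" in event:
        answer += event["chunk"]["bytes"].decode("utf-8")
print(answer)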
One example of an agent orchestration loop is the ReAct agent, which was initially introduced by Yao et al. ReAct interleaves chain-of-thought and tool use. At every stage, the agent takes in the input task along with the previous conversation history and decides whether to invoke a tool (such as querying a knowledge base) with the appropriate input or not.
The following is an example of using the ReAct agent with the LlamaIndex SDK:

from llama_index.core.agent import ReActAgent

# Create ReAct agent with defined tools
agent = ReActAgent.from_tools(
    query_engine_tools,
    llm=llm,
)

# Chat with the agent
response = agent.chat("What was Lyft's revenue growth in 2021?")

The ReAct agent will analyze the query and decide whether to use the Lyft 10K tool or another tool to answer the question. To try out agentic RAG, refer to the GitHub repo.
LlamaCloud and LlamaParse
LlamaCloud represents a significant advancement in the LlamaIndex landscape, offering a comprehensive suite of managed services tailored for enterprise-grade context augmentation within LLM and RAG applications. This service empowers AI engineers to concentrate on developing core business logic by streamlining the intricate process of data wrangling.
One key component is LlamaParse, a proprietary parsing engine adept at handling complex, semi-structured documents replete with embedded objects like tables and figures, seamlessly integrating with LlamaIndex’s ingestion and retrieval pipelines. Another key component is the Managed Ingestion and Retrieval API, which facilitates effortless loading, processing, and storage of data from diverse sources, including LlamaParse outputs and LlamaHub’s centralized data repository, while accommodating various data storage integrations.
Collectively, these features enable the processing of vast production data volumes, culminating in enhanced response quality and unlocking unprecedented capabilities in context-aware question answering for RAG applications. To learn more about these features, refer to Introducing LlamaCloud and LlamaParse.
For this post, we use LlamaParse to showcase the integration with Amazon Bedrock. LlamaParse is an API created by LlamaIndex to parse and represent files for efficient retrieval and context augmentation using LlamaIndex frameworks. What is unique about LlamaParse is that it is the world’s first generative AI native document parsing service, which allows users to submit documents along with parsing instructions. The key insight behind parsing instructions is that you know what kind of documents you have, so you already know what kind of output you want. The following figure shows a comparison of parsing a complex PDF with LlamaParse vs. two popular open source PDF parsers.

A green highlight in a cell means that the RAG pipeline correctly returned the cell value as the answer to a question over that cell. A red highlight means that the question was answered incorrectly.
Integrate Amazon Bedrock and LlamaIndex to build an advanced RAG pipeline
In this section, we show you how to build an advanced RAG stack combining LlamaParse and LlamaIndex with Amazon Bedrock services – LLMs, embedding models, and knowledge bases.
To use LlamaParse with Amazon Bedrock, you can follow these high-level steps:

Download your source documents.
Send the documents to LlamaParse using the Python SDK:

import os

from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

parser = LlamaParse(
    api_key=os.environ.get("LLAMA_CLOUD_API_KEY"),  # set via api_key param or in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    num_workers=4,  # if multiple files passed, split into `num_workers` API calls
    verbose=True,
    language="en",  # optionally define a language; default is "en"
)

file_extractor = {".pdf": parser}
reader = SimpleDirectoryReader(
    input_dir="data/10k/",
    file_extractor=file_extractor,
)

Wait for the parsing job to finish and upload the resulting Markdown documents to Amazon Simple Storage Service (Amazon S3).
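A minimal sketch of this step is shown below; reader is the SimpleDirectoryReader configured above, and the bucket name and key prefix are placeholders.

import boto3

# Loading the data triggers the LlamaParse job; then push the Markdown output to S3
# (the bucket name and key prefix are placeholders).
documents = reader.load_data()

s3 = boto3.client("s3")
for i, doc in enumerate(documents):
    s3.put_object(
        Bucket="your-kb-source-bucket",
        Key=f"parsed/10k-doc-{i}.md",
        Body=doc.text.encode("utf-8"),
    )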
Create an Amazon Bedrock knowledge base using the source documents.
Choose your preferred embedding and generation model from Amazon Bedrock using the LlamaIndex SDK:

from llama_index.llms.bedrock import Bedrock
from llama_index.embeddings.bedrock import BedrockEmbedding

llm = Bedrock(model="anthropic.claude-v2")
embed_model = BedrockEmbedding(model="amazon.titan-embed-text-v1")

Implement an advanced RAG pattern using LlamaIndex. In the following example, we use SubQuestionQueryEngine and a retriever specially created for Amazon Bedrock knowledge bases:

from llama_index.retrievers.bedrock import AmazonKnowledgeBasesRetriever
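The snippet below sketches one way to finish this wiring before querying; the knowledge base ID and retrieval configuration are placeholders, and the tool metadata is illustrative.

from llama_index.core.query_engine import RetrieverQueryEngine, SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Retrieve chunks from the Bedrock knowledge base created earlier (ID is a placeholder)
retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id="YOUR_KB_ID",
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
)
kb_query_engine = RetrieverQueryEngine(retriever=retriever)

# Expose the knowledge base as a tool the sub-question engine can decompose queries against
query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[
        QueryEngineTool(
            query_engine=kb_query_engine,
            metadata=ToolMetadata(
                name="10k_filings",
                description="Uber and Lyft 10-K filings indexed in the knowledge base",
            ),
        )
    ],
    llm=llm,
)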

Finally, query the index with your question:

response = await query_engine.aquery("Compare revenue growth of Uber and Lyft from 2020 to 2021")

We tested LlamaParse on a challenging real-world example: asking questions about a document containing Bank of America Q3 2023 financial results. An example slide from the full slide deck (48 complex slides!) is shown below.

Using the procedure outlined above, we asked “What is the trend in digital households/relationships from 3Q20 to 3Q23?”; take a look at the answer generated using LlamaIndex tools vs. the reference answer from human annotation.

LlamaIndex + LlamaParse answer – The trend in digital households/relationships shows a steady increase from 3Q20 to 3Q23. In 3Q20, the number of digital households/relationships was 550K, which increased to 645K in 3Q21, then to 672K in 3Q22, and further to 716K in 3Q23. This indicates consistent growth in the adoption of digital services among households and relationships over the reported quarters.

Reference answer – The trend shows a steady increase in digital households/relationships from 645,000 in 3Q20 to 716,000 in 3Q23. The digital adoption percentage also increased from 76% to 83% over the same period.

The following are example notebooks to try out these steps on your own data. Note the prerequisite steps, and clean up the resources after testing.

Ingest with LlamaParse into S3 for KB
Agentic RAG with Bedrock KB and LlamaIndex SubQuestionQueryEngine

Conclusion
In this post, we explored various advanced RAG patterns with LlamaIndex and Amazon Bedrock. To delve deeper into the capabilities of LlamaIndex and its integration with Amazon Bedrock, check out the following resources:

LlamaIndex documentation
Amazon Bedrock User Guide
LlamaIndex examples GitHub repo

By combining the power of LlamaIndex and Amazon Bedrock, you can build robust and sophisticated RAG pipelines that unlock the full potential of LLMs for knowledge-intensive tasks.

About the Author
Shreyas Subramanian is a Principal data scientist and helps customers by using Machine Learning to solve their business challenges using the AWS platform. Shreyas has a background in large scale optimization and Machine Learning, and in use of Machine Learning and Reinforcement Learning for accelerating optimization tasks.
Jerry Liu is the co-founder/CEO of LlamaIndex, a data framework for building LLM applications. Before this, he has spent his career at the intersection of ML, research, and startups. He led the ML monitoring team at Robust Intelligence, did self-driving AI research at Uber ATG, and worked on recommendation systems at Quora.

Evaluating prompts at scale with Prompt Management and Prompt Flows for Amazon Bedrock

As generative artificial intelligence (AI) continues to revolutionize every industry, the importance of effective prompt optimization through prompt engineering techniques has become key to efficiently balancing the quality of outputs, response time, and costs. Prompt engineering refers to the practice of crafting and optimizing inputs to the models by selecting appropriate words, phrases, sentences, punctuation, and separator characters to effectively use foundation models (FMs) or large language models (LLMs) for a wide variety of applications. A high-quality prompt maximizes the chances of having a good response from the generative AI models.
A fundamental part of the optimization process is evaluation, and there are multiple elements involved in evaluating a generative AI application. Beyond the most common evaluation of FMs, prompt evaluation is a critical, yet often challenging, aspect of developing high-quality AI-powered solutions. Many organizations struggle to consistently create and effectively evaluate their prompts across their various applications, leading to inconsistent performance, poor user experiences, and undesired responses from the models.
In this post, we demonstrate how to implement an automated prompt evaluation system using Amazon Bedrock so you can streamline your prompt development process and improve the overall quality of your AI-generated content. For this, we use Amazon Bedrock Prompt Management and Amazon Bedrock Prompt Flows to systematically evaluate prompts for your generative AI applications at scale.
The importance of prompt evaluation
Before we explain the technical implementation, let’s briefly discuss why prompt evaluation is crucial. The key aspects to consider when building and optimizing a prompt are typically:

Quality assurance – Evaluating prompts helps make sure that your AI applications consistently produce high-quality, relevant outputs for the selected model.
Performance optimization – By identifying and refining effective prompts, you can improve the overall performance of your generative AI models in terms of lower latency and ultimately higher throughput.
Cost efficiency – Better prompts can lead to more efficient use of AI resources, potentially reducing costs associated with model inference. A good prompt allows for the use of smaller and lower-cost models, which wouldn’t give good results with a bad quality prompt.
User experience – Improved prompts result in more accurate, personalized, and helpful AI-generated content, enhancing the end user experience in your applications.

Optimizing prompts for these aspects is an iterative process that requires evaluation to drive the adjustments to the prompts. It is, in other words, a way to understand how well a given prompt and model combination works for achieving the desired answers.
In our example, we implement a method known as LLM-as-a-judge, where an LLM is used for evaluating the prompts based on the answers it produced with a certain model, according to predefined criteria. The evaluation of prompts and their answers for a given LLM is a subjective task by nature, but a systematic prompt evaluation using LLM-as-a-judge allows you to quantify it with an evaluation metric in a numerical score. This helps to standardize and automate the prompting lifecycle in your organization and is one of the reasons why this method is one of the most common approaches for prompt evaluation in the industry.
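Before walking through the managed building blocks, the following minimal sketch illustrates the LLM-as-a-judge idea directly with the Amazon Bedrock Converse API; the judge model ID, evaluation template, and score parsing are simplified assumptions rather than the prompt template used later in this post.

import json
import boto3

# Minimal LLM-as-a-judge sketch: ask a judge model to score another model's answer.
# The model ID and evaluation template are simplified assumptions for illustration.
bedrock = boto3.client("bedrock-runtime")

JUDGE_TEMPLATE = (
    "You are evaluating an AI assistant's answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    'Respond only with JSON: {{"score": <0-100>, "justification": "..."}}'
)

def judge(question, answer):
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    response = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0},
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])

print(judge("What is cloud computing?", "Cloud computing delivers IT resources on demand over the internet."))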

Let’s explore a sample solution for evaluating prompts with LLM-as-a-judge with Amazon Bedrock. You can also find the complete code example in amazon-bedrock-samples.
Prerequisites
For this example, you need the following:

An AWS account and a user with an AWS Identity and Access Management (IAM) role authorized to use Amazon Bedrock. For guidance, refer to the instructions in Getting started with Amazon Bedrock. Make sure the role includes the permissions for using Prompt Flows for Amazon Bedrock, as explained in Prerequisites for Prompt flows for Amazon Bedrock.
Access provided to the models you use for invocation and evaluation. For guidance, follow the instructions in Manage access to Amazon Bedrock foundation models.

Set up the evaluation prompt
To create an evaluation prompt using Amazon Bedrock Prompt Management, follow these steps:

On the Amazon Bedrock console, in the navigation pane, choose Prompt management and then choose Create prompt.
Enter a Name for your prompt such as prompt-evaluator and a Description such as “Prompt template for evaluating prompt responses with LLM-as-a-judge.” Choose Create.

In the Prompt field, write your prompt evaluation template. In the example, you can use a template like the following or adjust it according to your specific evaluation requirements.

You’re an evaluator for the prompts and answers provided by a generative AI model.
Consider the input prompt in the <input> tags, the output answer in the <output> tags, the prompt evaluation criteria in the <prompt_criteria> tags, and the answer evaluation criteria in the <answer_criteria> tags.

<input>
{{input}}
</input>

<output>
{{output}}
</output>

<prompt_criteria>
– The prompt should be clear, direct, and detailed.
– The question, task, or goal should be well explained and grammatically correct.
– The prompt is better if it contains examples.
– The prompt is better if it specifies a role or sets a context.
– The prompt is better if it provides details about the format and tone of the expected answer.
</prompt_criteria>

<answer_criteria>
– The answers should be correct, well structured, and technically complete.
– The answers should not have any hallucinations, made up content, or toxic content.
– The answer should be grammatically correct.
– The answer should be fully aligned with the question or instruction in the prompt.
</answer_criteria>

Evaluate the answer the generative AI model provided in the <output> with a score from 0 to 100 according to the <answer_criteria> provided; any hallucinations, even if small, should dramatically impact the evaluation score.
Also evaluate the prompt passed to that generative AI model provided in the <input> with a score from 0 to 100 according to the <prompt_criteria> provided.
Respond only with a JSON having:
– An 'answer-score' key with the score number you evaluated the answer with.
– A 'prompt-score' key with the score number you evaluated the prompt with.
– A 'justification' key with a justification for the two evaluations you provided to the answer and the prompt; make sure to explicitly include any errors or hallucinations in this part.
– An 'input' key with the content of the <input> tags.
– An 'output' key with the content of the <output> tags.
– A 'prompt-recommendations' key with recommendations for improving the prompt based on the evaluations performed.
Skip any preamble or any other text apart from the JSON in your answer.

Under Configurations, select a model to use for running evaluations with the prompt. In our example we selected Anthropic Claude Sonnet. The quality of the evaluation will depend on the model you select in this step. Make sure you balance the quality, response time, and cost accordingly in your decision.
Set the Inference parameters for the model. We recommend setting Temperature to 0 to keep the evaluation factual and avoid hallucinations.

You can test your evaluation prompt with sample inputs and outputs using the Test variables and Test window panels.

Now that you have a draft of your prompt, you can also create versions of it. Versions allow you to quickly switch between different configurations for your prompt and update your application with the most appropriate version for your use case. To create a version, choose Create version at the top.

The following screenshot shows the Prompt builder page.

Set up the evaluation flow
Next, you need to build an evaluation flow using Amazon Bedrock Prompt Flows. In our example, we use prompt nodes. For more information on the types of nodes supported, check the Node types in prompt flow documentation. To build an evaluation flow, follow these steps:

On the Amazon Bedrock console, under Prompt flows, choose Create prompt flow.
Enter a Name such as prompt-eval-flow. Enter a Description such as “Prompt Flow for evaluating prompts with LLM-as-a-judge.” Choose Use an existing service role to select a role from the dropdown. Choose Create.
This will open the Prompt flow builder. Drag two Prompts nodes to the canvas and configure the nodes as per the following parameters:

Flow input
  Output:
    Name: document, Type: String

Invoke (Prompts)
  Node name: Invoke
  Define in node
  Select model: A preferred model to be evaluated with your prompts
  Message: {{input}}
  Inference configurations: As per your preferences
  Input:
    Name: input, Type: String, Expression: $.data
  Output:
    Name: modelCompletion, Type: String

Evaluate (Prompts)
  Node name: Evaluate
  Use a prompt from your Prompt Management
  Prompt: prompt-evaluator
  Version: Version 1 (or your preferred version)
  Select model: Your preferred model to evaluate your prompts with
  Inference configurations: As set in your prompt
  Input:
    Name: input, Type: String, Expression: $.data
    Name: output, Type: String, Expression: $.data
  Output:
    Name: modelCompletion, Type: String

Flow output
  Node name: End
  Input:
    Name: document, Type: String, Expression: $.data
To connect the nodes, drag the connecting dots, as shown in the following diagram.

Choose Save.

You can test your prompt evaluation flow by using the Test prompt flow panel. Pass an input, such as the question, “What is cloud computing in a single paragraph?” It should return a JSON with the result of the evaluation similar to the following example. In the code example notebook, amazon-bedrock-samples, we also included the information about the models used for invocation and evaluation to our result JSON.

{
    "answer-score": 95,
    "prompt-score": 90,
    "justification": "The answer provides a clear and technically accurate explanation of cloud computing in a single paragraph. It covers key aspects such as scalability, shared resources, pay-per-use model, and accessibility. The answer is well-structured, grammatically correct, and aligns with the prompt. No hallucinations or toxic content were detected. The prompt is clear, direct, and explains the task well. However, it could be improved by providing more details on the expected format, tone, or length of the answer.",
    "input": "What is cloud computing in a single paragraph?",
    "output": "Cloud computing is a model for delivering information technology services where resources are retrieved from the internet through web-based tools. It is a highly scalable model in which a consumer can access a shared pool of configurable computing resources, such as applications, servers, storage, and services, with minimal management effort and often with minimal interaction with the provider of the service. Cloud computing services are typically provided on a pay-per-use basis, and can be accessed by users from any location with an internet connection. Cloud computing has become increasingly popular in recent years due to its flexibility, cost-effectiveness, and ability to enable rapid innovation and deployment of new applications and services.",
    "prompt-recommendations": "To improve the prompt, consider adding details such as the expected length of the answer (e.g., 'in a single paragraph of approximately 100-150 words'), the desired tone (e.g., 'in a professional and informative tone'), and any specific aspects that should be covered (e.g., 'including examples of cloud computing services or providers').",
    "modelInvoke": "amazon.titan-text-premier-v1:0",
    "modelEval": "anthropic.claude-3-sonnet-20240229-v1:0"
}

As the example shows, we asked the FM to evaluate with separate scores the prompt and the answer the FM generated from that prompt. We asked it to provide a justification for the score and some recommendations to further improve the prompts. All this information is valuable for a prompt engineer because it helps guide the optimization experiments and helps them make more informed decisions during the prompt life cycle.
Implementing prompt evaluation at scale
To this point, we’ve explored how to evaluate a single prompt. Often, medium to large organizations work with tens, hundreds, or even thousands of prompt variations across their applications, making prompt evaluation a perfect opportunity for automation at scale. For this, you can run the flow on full datasets of prompts stored in files, as shown in the example notebook.
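As an illustration of what that batch run can look like, the following sketch invokes the flow once per prompt in a local file using the bedrock-agent-runtime client; the flow and alias identifiers are placeholders, and the input node name and output parsing assume the default flow input configuration shown earlier.

import json
import boto3

# Minimal sketch: run the evaluation flow over a file of prompts
# (flow and alias IDs are placeholders; node names are assumptions).
client = boto3.client("bedrock-agent-runtime")

with open("prompts.jsonl") as f:
    prompts = [json.loads(line)["prompt"] for line in f]

results = []
for prompt in prompts:
    response = client.invoke_flow(
        flowIdentifier="YOUR_FLOW_ID",
        flowAliasIdentifier="YOUR_FLOW_ALIAS_ID",
        inputs=[{
            "content": {"document": prompt},
            "nodeName": "FlowInputNode",
            "nodeOutputName": "document",
        }],
    )
    # The response is an event stream; collect the flow output documents
    for event in response["responseStream"]:
        if "flowOutputEvent" in event:
            results.append(event["flowOutputEvent"]["content"]["document"])

print(f"Evaluated {len(results)} prompts")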
Alternatively, you can also rely on other node types in Amazon Bedrock Prompt Flows for reading from and storing to Amazon Simple Storage Service (Amazon S3) files and for implementing iterator- and collector-based flows. The following diagram shows this type of flow. Once you have established a file-based mechanism for running the prompt evaluation flow on datasets at scale, you can also automate the whole process by connecting it to your preferred continuous integration and continuous delivery (CI/CD) tools. The details are beyond the scope of this post.

Best practices and recommendations
Based on our evaluation process, here are some best practices for prompt refinement:

Iterative improvement – Use the evaluation feedback to continuously refine your prompts. The prompt optimization is ultimately an iterative process.
Context is key – Make sure your prompts provide sufficient context for the AI model to generate accurate responses. Depending on the complexity of the tasks or questions that your prompt will answer, you might need to use different prompt engineering techniques. You can check the Prompt engineering guidelines in the Amazon Bedrock documentation and other resources on the topic provided by the model providers.
Specificity matters – Be as specific as possible in your prompts and evaluation criteria. Specificity guides the models towards desired outputs.
Test edge cases – Evaluate your prompts with a variety of inputs to verify robustness. You might also want to run multiple evaluations on the same prompt for comparing and testing output consistency, which might be important depending on your use case.

Conclusion and next steps
By using the LLM-as-a-judge method with Amazon Bedrock Prompt Management and Amazon Bedrock Prompt Flows, you can implement a systematic approach to prompt evaluation and optimization. This not only improves the quality and consistency of your AI-generated content but also streamlines your development process, potentially reducing costs and improving user experiences.
We encourage you to explore these features further and adapt the evaluation process to your specific use cases. As you continue to refine your prompts, you’ll be able to unlock the full potential of generative AI in your applications. To get started, check out the full code samples used in this post. We’re excited to see how you’ll use these tools to enhance your AI-powered solutions!
For more information on Amazon Bedrock and its features, visit the Amazon Bedrock documentation.

About the Author

Antonio Rodriguez is a Sr. Generative AI Specialist Solutions Architect at Amazon Web Services. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.

CrisperWhisper: A Breakthrough in Speech Recognition Technology with E …

Accurately transcribing spoken language into written text is becoming increasingly essential in speech recognition. This technology is crucial for accessibility services, language processing, and clinical assessments. However, the challenge lies in capturing the words and the intricate details of human speech, including pauses, filler words, and other disfluencies. These nuances provide valuable insights into cognitive processes and are particularly important in clinical settings where accurate speech analysis can aid in diagnosing and monitoring speech-related disorders. As the demand for more precise transcription grows, so does the need for innovative methods to address these challenges effectively.

One of the most significant challenges in this domain is the precision of word-level timestamps. This is especially important in scenarios with multiple speakers or background noise, where traditional methods often fall short. Accurate transcription of disfluencies, such as filled pauses, word repetitions, and corrections, is difficult yet crucial. These elements are not mere speech artifacts; they reflect underlying cognitive processes and are key indicators in assessing conditions like aphasia. Existing transcription models often struggle with these nuances, leading to errors in both transcription and timing. These inaccuracies limit their effectiveness, particularly in clinical and other high-stakes environments where precision is paramount.

Current methods, like the Whisper and WhisperX models, attempt to tackle these challenges using advanced techniques such as forced alignment and dynamic time warping (DTW). WhisperX, for instance, employs a VAD-based cut-and-merge approach that enhances both speed and accuracy by segmenting audio before transcription. While this method offers some improvements, it still faces significant challenges in noisy environments and with complex speech patterns. The reliance on multiple models, like WhisperX’s use of Wav2Vec2.0 for phoneme alignment, adds complexity and can lead to further degradation of timestamp precision in less-than-ideal conditions. Despite these advancements, there remains a clear need for more robust solutions.

Researchers at Nyra Health introduced a new model, CrisperWhisper. The model refines the Whisper architecture, improving noise robustness and single-speaker focus. The researchers significantly enhanced the accuracy of word-level timestamps by carefully adjusting the tokenizer and fine-tuning the model. CrisperWhisper employs a dynamic time warping algorithm that aligns speech segments with greater precision, even in the presence of background noise. This adjustment improves the model’s performance in noisy environments and reduces errors in transcribing disfluencies, making it particularly useful for clinical applications.

CrisperWhisper’s improvements are largely due to several key innovations. The model strips unnecessary tokens and optimizes the vocabulary to better detect pauses and filler words, such as ‘uh’ and ‘um.’ It introduces heuristics that cap pause durations at 160 ms, distinguishing between meaningful speech pauses and insignificant artifacts. CrisperWhisper employs a cost matrix constructed from normalized cross-attention vectors to ensure that each word’s timestamp is as accurate as possible. This method allows the model to produce transcriptions that are not only more precise but also more reliable in noisy conditions. The result is a model that can accurately capture the timing of speech, which is crucial for applications that require detailed speech analysis.
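As a rough illustration of the alignment idea (not the authors’ exact implementation), the sketch below runs a standard dynamic time warping pass over a cost matrix derived from normalized cross-attention scores; word-to-frame correspondences, and hence timestamps, can be read off the resulting path. The attention matrix here is random and purely hypothetical.

import numpy as np

def dtw_path(cost):
    """Classic DTW over a (tokens x audio_frames) cost matrix.
    Returns the list of (token_idx, frame_idx) pairs on the optimal path."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # move to next token
                                                 acc[i, j - 1],      # stretch token over frames
                                                 acc[i - 1, j - 1])  # advance both
    # Backtrack from the bottom-right corner to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Hypothetical example: cost = 1 - normalized cross-attention (tokens x frames).
attention = np.random.rand(8, 50)
attention /= attention.sum(axis=1, keepdims=True)
alignment = dtw_path(1.0 - attention)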

The performance of CrisperWhisper is impressive when compared to previous models. It achieves an F1 score of 0.975 on the synthetic dataset and significantly outperforms WhisperX and WhisperT in noise robustness and word segmentation accuracy. For instance, CrisperWhisper achieves an F1 score of 0.90 on the AMI disfluency subset, compared to WhisperX’s 0.85. The model also demonstrates superior noise resilience, maintaining high mIoU and F1 scores even under conditions with a signal-to-noise ratio of 1:5. In tests involving verbatim transcription datasets, CrisperWhisper reduced the word error rate (WER) on the AMI Meeting Corpus from 16.82% to 9.72%, and on the TED-LIUM dataset from 11.77% to 4.01%. These results underscore the model’s capability to deliver precise and reliable transcriptions, even in challenging environments.

In conclusion, Nyra Health introduced CrisperWhisper, which addresses timestamp accuracy and noise robustness. CrisperWhisper provides a robust solution that enhances the precision of speech transcriptions. Its ability to accurately capture disfluencies and maintain high performance in noisy conditions makes it a valuable tool for various applications, particularly in clinical settings. The improvements in word error rate and overall transcription accuracy highlight CrisperWhisper’s potential to set a new standard in speech recognition technology.

Check out the Paper. All credit for this research goes to the researchers of this project.


A Novel Hybrid Approach Combining Hyperdimensional Vector Computing and Tsetlin Machines for Efficient Sequence Learning, Classification, and Forecasting in High-Dimensional Time Series Data

Artificial intelligence is rapidly advancing, with a significant focus on improving models that process and interpret complex datasets, particularly time series data. This type of data involves sequences of data points collected over time and is critical in various fields, including finance, healthcare, and environmental science. The ability to accurately predict and classify time series data can lead to more informed decisions and better outcomes in these domains. Researchers are dedicated to developing methods that enhance the learning and generation of sequences, thereby making AI systems more effective in handling real-world data.

A major challenge in sequence learning is dealing with high-dimensional, noisy data, which is often difficult to interpret and process. Traditional machine learning models struggle to extract meaningful patterns from such data, leading to suboptimal predictions and classifications. This issue is particularly pronounced in time series analysis, where understanding the order and relationships between data points is crucial. Researchers have been striving to create models that can overcome these limitations and more accurately capture the complexities of sequential data.

Existing methods for time series analysis, such as Dynamic Time Warping (DTW) and traditional Tsetlin Machines (TMs), have their strengths and weaknesses. DTW is a widely used technique for measuring the similarity between sequences, but it is computationally intensive and can be challenging to implement on large datasets. Tsetlin Machines, known for their simplicity and interpretability, offer a different approach but require extensive parameter tuning to achieve optimal performance. These limitations highlight the need for more advanced and efficient methods to handle a broader range of sequence learning tasks.

Researchers from the University of Agder introduced a novel approach that combines Hyperdimensional Vector Computing (HVC) with Tsetlin Machines. This hybrid model pairs the robustness of HVC in high-dimensional spaces with the interpretability and learning capabilities of Tsetlin Machines. The research team designed a system that encodes sequences into hyperdimensional vectors, effectively capturing the temporal and spatial relationships within the data. This innovative approach aims to provide a more powerful and efficient tool for sequence learning and generation.

The method proposed by the researchers involves encoding sequences into hyperdimensional vectors, which Tsetlin Machines then process. The model utilizes operations such as binding, bundling, and perturbation within the hyperdimensional vector space to represent and analyze sequences. This approach allows the system to generate new sequences that maintain the characteristics of the original data while being computationally efficient. For instance, the model can encode a sequence of time series data into a hyperdimensional vector of 10,000 bits, requiring just 1.22 MB of memory, even when scaled up to handle larger datasets.
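For intuition, here is a minimal sketch of the three hypervector operations mentioned above using bipolar vectors. The 10,000-dimensional size follows the paper, but the specific encoding scheme below is illustrative rather than the authors’ exact method.

import numpy as np

D = 10_000  # hypervector dimensionality, as reported in the paper
rng = np.random.default_rng(0)

def random_hv():
    """Random bipolar hypervector in {-1, +1}^D."""
    return rng.choice([-1, 1], size=D)

def bind(a, b):
    """Binding: element-wise multiplication associates two hypervectors."""
    return a * b

def bundle(vectors):
    """Bundling: element-wise majority vote superimposes a set of hypervectors
    (ties produce zeros, which is acceptable for this sketch)."""
    return np.sign(np.sum(vectors, axis=0))

def permute(v, shift=1):
    """Permutation (perturbation): cyclic shift encodes position in a sequence."""
    return np.roll(v, shift)

# Encode a short symbol sequence: permute each symbol by its position, then bundle.
symbols = {s: random_hv() for s in "abc"}
sequence = "abcab"
encoded = bundle([permute(symbols[s], shift=i) for i, s in enumerate(sequence)])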

The hybrid model was rigorously tested on the UCR Time Series Classification Archive, a comprehensive benchmark comprising 128 different time series datasets. The results were impressive, with the model outperforming or matching state-of-the-art benchmarks in approximately 78% of the datasets. The researchers reported that the HVTM method achieved accuracy improvements or maintained competitiveness within a 2% cutoff compared to the optimal benchmarks provided by DTW-based methods. Specifically, the model excelled in datasets involving motion, images, and ECGs, outperforming DTW benchmarks by at least 60% in these categories. However, it faced challenges with very short series (24-80 data points) and exhibited comparable performance to DTW for mid-length series (277-500 data points).

The hybrid model demonstrated strong performance in forecasting tasks. The researchers experimented with deterministic and stochastic time series models, including harmonic series, AR(1), ARMA(1,1), and seasonal AR models. The forecasting experiments involved generating 24-step ahead predictions, where the HVTM demonstrated a mean error rate of approximately 4% with a 5 N-Gram encoding on harmonic series data. The error rates for AR(1) models with coefficients of 0.4 and 0.7 were around 15% and 14%, respectively. Seasonal AR models, which presented more significant challenges, had error rates of approximately 31%, reflecting the complexity of capturing seasonal patterns.
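To make the forecasting setup concrete, the sketch below generates a synthetic AR(1) series like those used in the experiments and holds out a 24-step horizon. The persistence baseline and error measure here are illustrative only; the paper’s exact evaluation metric may differ.

import numpy as np

rng = np.random.default_rng(1)

def ar1(n, phi, sigma=1.0):
    """Generate an AR(1) series x_t = phi * x_{t-1} + eps_t
    (phi = 0.4 and 0.7 in the paper's experiments)."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(scale=sigma)
    return x

series = ar1(500, phi=0.7)
train, test = series[:-24], series[-24:]   # hold out a 24-step forecast horizon

# Naive persistence forecast as a trivial baseline (not the HVTM model).
forecast = np.repeat(train[-1], 24)
mae = np.mean(np.abs(forecast - test))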

In summary, the University of Agder’s research on sequence learning introduces a hybrid model that combines Hyperdimensional Vector Computing with Tsetlin Machines. This approach enhances the accuracy and efficiency of time series analysis, making it a promising tool for many applications. The model’s ability to handle complex datasets with minimal memory requirements makes it suitable for deployment in resource-constrained environments. As the researchers continue to refine and expand their approach, this hybrid model could serve as a valuable alternative to more resource-intensive methods, offering a new direction for the future of AI in sequence learning.

Check out the Paper. All credit for this research goes to the researchers of this project.


Enhancing Segmentation Efficiency: A Unified Approach for Label-Limited Learning Across 2D and 3D Data Modalities

Label-efficient segmentation has emerged as a crucial area of research, particularly in point cloud semantic segmentation. While deep learning techniques have advanced this field, the reliance on large-scale datasets with point-wise annotations remains a significant challenge. Recent methods have explored weak supervision, human annotations, and techniques such as perturbed self-distillation, consistency regularization, and self-supervised learning to address this issue. Pseudo-labeling has also gained prominence as an effective strategy for utilizing unlabeled data.

Despite these advancements, existing methods often involve complex training processes and focus primarily on 2D image segmentation. The 3D domain, which frequently deals with highly sparse labels, remains underexplored. Semi-supervised segmentation approaches, including entropy minimization and consistency regularization, have shown promise. However, the unique challenges posed by 3D point clouds necessitate the development of more generic, modality-agnostic segmentation methods that can effectively handle both 2D and 3D data while improving noise reduction and label efficiency.

Label-efficient segmentation addresses the challenge of performing effective segmentation using limited ground-truth labels, a critical issue in both 3D point cloud and 2D image data. Pseudo-labels have been widely used to facilitate training with sparse annotations, but they often suffer from noise and variation in the unlabeled data. Recent research proposes novel learning strategies to regularize pseudo-labels, aiming to narrow the gap between generated labels and model predictions. Entropy-Regularized Distribution Alignment (ERDA) incorporates entropy regularization and distribution alignment techniques to optimize pseudo-label generation and segmentation model training simultaneously. Such methods demonstrate superior performance across various label-efficient settings, often outperforming fully supervised baselines with minimal true annotations, and represent significant progress toward modality-agnostic label-efficient segmentation.

Researchers have developed a novel approach called ERDA to enhance label-efficient segmentation across 2D images and 3D point clouds. ERDA addresses challenges of noise and discrepancies in pseudo-labels generated from unlabeled data by incorporating Entropy Regularization (ER) and Distribution Alignment (DA) components. ER reduces the entropy of pseudo-labels, encouraging more confident and reliable predictions, while DA aligns the distribution of pseudo-labels with model predictions using Kullback-Leibler divergence. This combination refines pseudo-labels, improving the model’s learning process and overall segmentation performance.
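Conceptually, the two components can be written as an entropy penalty on the pseudo-label distribution plus a KL divergence between pseudo-labels and model predictions. The sketch below is an interpretation of that description, not the authors’ code; weighting and the exact divergence direction are assumptions.

import torch
import torch.nn.functional as F

def erda_style_loss(pseudo_logits, pred_logits, lambda_er=1.0, lambda_da=1.0):
    """Conceptual sketch of ERDA's two ingredients:
    - entropy regularization (ER) pushes pseudo-labels toward confident predictions,
    - distribution alignment (DA) penalizes KL divergence between pseudo-labels
      and the segmentation model's predictions.
    Shapes: (num_points, num_classes) logits for unlabeled points."""
    p = F.softmax(pseudo_logits, dim=-1)      # pseudo-label distribution
    q = F.softmax(pred_logits, dim=-1)        # model prediction distribution
    entropy = -(p * torch.log(p + 1e-8)).sum(dim=-1).mean()
    kl = (p * (torch.log(p + 1e-8) - torch.log(q + 1e-8))).sum(dim=-1).mean()
    return lambda_er * entropy + lambda_da * kl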

The methodology introduces a query-based pseudo-labeling approach, generating high-quality, modality-agnostic pseudo-labels suitable for both 2D and 3D data. ERDA’s flexibility allows application to various label-efficient segmentation tasks, including semi-supervised, sparse labels, and unsupervised settings. Implementation is straightforward, reducing to a cross-entropy-based loss for simplified training. Experimental results demonstrate ERDA’s superior performance compared to previous methods across various settings and datasets, showcasing its effectiveness in both 2D and 3D modalities and marking a significant contribution to the field of label-efficient segmentation.

Experimental results demonstrate ERDA’s effectiveness in label-efficient segmentation across 2D and 3D modalities. In 2D segmentation, ERDA significantly improves performance in unsupervised settings. For 3D tasks, notable improvements are achieved, with models like RandLA-Net and CloserLook showing increases of +3.7 and +3.4 in mean Intersection over Union (mIoU), respectively. ERDA outperforms many fully supervised methods with only 1% of labels, highlighting its robustness in limited-data scenarios. Ablation studies validate the contributions of different components, while statistical properties evaluation supports the reliability of generated pseudo-labels. Overall, ERDA advances label-efficient learning, achieving state-of-the-art performance across various datasets and modalities.

In conclusion, this paper introduces ERDA, a novel approach for modality-agnostic label-efficient segmentation. ERDA addresses challenges of insufficient supervision and varying data processing techniques across 2D and 3D modalities. By reducing noise in pseudo-labels and aligning them with model predictions, ERDA enables better utilization of unlabeled data. The method’s query-based pseudo-labels contribute to its modality-agnostic nature. Experimental results demonstrate ERDA’s superior performance across various datasets and modalities, even surpassing fully-supervised baselines. While limitations exist, such as assuming complete coverage of semantic classes, ERDA shows promise for generalization to medical images and unsupervised settings, suggesting potential for future research combining label-efficient methods with large foundation models.

Check out the Paper. All credit for this research goes to the researchers of this project.


Deploy Amazon SageMaker pipelines using AWS Controllers for Kubernetes

Kubernetes is a popular orchestration platform for managing containers. Its scalability and load-balancing capabilities make it ideal for handling the variable workloads typical of machine learning (ML) applications. DevOps engineers often use Kubernetes to manage and scale ML applications, but before an ML model is available, it must be trained and evaluated and, if the quality of the obtained model is satisfactory, uploaded to a model registry.
Amazon SageMaker provides capabilities to remove the undifferentiated heavy lifting of building and deploying ML models. SageMaker simplifies the process of managing dependencies, container images, auto scaling, and monitoring. Specifically for the model building stage, Amazon SageMaker Pipelines automates the process by managing the infrastructure and resources needed to process data, train models, and run evaluation tests.
A challenge for DevOps engineers is the additional complexity that comes from using Kubernetes to manage the deployment stage while resorting to other tools (such as the AWS SDK or AWS CloudFormation) to manage the model building pipeline. One alternative to simplify this process is to use AWS Controllers for Kubernetes (ACK) to manage and deploy a SageMaker training pipeline. ACK allows you to take advantage of managed model building pipelines without needing to define resources outside of the Kubernetes cluster.
In this post, we introduce an example to help DevOps engineers manage the entire ML lifecycle—including training and inference—using the same toolkit.
Solution overview
We consider a use case in which an ML engineer configures a SageMaker model building pipeline using a Jupyter notebook. This configuration takes the form of a Directed Acyclic Graph (DAG) represented as a JSON pipeline definition. The JSON document can be stored and versioned in an Amazon Simple Storage Service (Amazon S3) bucket. If encryption is required, it can be implemented using an AWS Key Management Service (AWS KMS) managed key for Amazon S3. A DevOps engineer with access to fetch this definition file from Amazon S3 can load the pipeline definition into an ACK service controller for SageMaker, which is running as part of an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. The DevOps engineer can then use the Kubernetes APIs provided by ACK to submit the pipeline definition and initiate one or more pipeline runs in SageMaker. This entire workflow is shown in the following solution diagram.

Prerequisites
To follow along, you should have the following prerequisites:

An EKS cluster where the ML pipeline will be created.
A user with access to an AWS Identity and Access Management (IAM) role that has IAM permissions (iam:CreateRole, iam:AttachRolePolicy, and iam:PutRolePolicy) to allow creating roles and attaching policies to roles.
The following command line tools on the local machine or cloud-based development environment used to access the Kubernetes cluster:

The AWS Command Line Interface (AWS CLI) installed for interacting with AWS services.
kubectl for working with Kubernetes clusters.
eksctl for working with EKS clusters.
yq for YAML processing. (For Linux environments, use the wget plain binary installation.)
Helm 3.7+ for installing and managing Kubernetes applications.
curl for transmitting data with URLs.

Install the SageMaker ACK service controller
The SageMaker ACK service controller makes it straightforward for DevOps engineers to use Kubernetes as their control plane to create and manage ML pipelines. To install the controller in your EKS cluster, complete the following steps:

Configure IAM permissions to make sure the controller has access to the appropriate AWS resources.
Install the controller using a SageMaker Helm Chart to make it available on the client machine.

The following tutorial provides step-by-step instructions with the required commands to install the ACK service controller for SageMaker.
Generate a pipeline JSON definition
In most companies, ML engineers are responsible for creating the ML pipeline in their organization. They often work with DevOps engineers to operate those pipelines. In SageMaker, ML engineers can use the SageMaker Python SDK to generate a pipeline definition in JSON format. A SageMaker pipeline definition must follow the provided schema, which includes base images, dependencies, steps, and instance types and sizes that are needed to fully define the pipeline. This definition then gets retrieved by the DevOps engineer for deploying and maintaining the infrastructure needed for the pipeline.
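As an illustration, an ML engineer might produce a definition like the one shown next with the SageMaker Python SDK roughly as follows. This is a sketch: the training image, role ARN, and S3 paths reuse the placeholders from the sample definition, and the exact estimator settings are assumptions.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

session = sagemaker.Session()
role = "<<YOUR_SAGEMAKER_ROLE_ARN>>"  # placeholder

xgb = Estimator(
    image_uri="683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1",
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path="s3://<<YOUR_BUCKET_NAME>>/sagemaker/",
    sagemaker_session=session,
)
xgb.set_hyperparameters(max_depth=5, eta=0.2, objective="multi:softmax",
                        num_class=10, num_round=10)

train_step = TrainingStep(
    name="AbaloneTrain",
    estimator=xgb,
    inputs={
        "train": TrainingInput(
            s3_data="s3://<<YOUR_BUCKET_NAME>>/sagemaker/xgboost/train/",
            content_type="text/libsvm",
        ),
        "validation": TrainingInput(
            s3_data="s3://<<YOUR_BUCKET_NAME>>/sagemaker/xgboost/validation/",
            content_type="text/libsvm",
        ),
    },
)

pipeline = Pipeline(name="my-kubernetes-pipeline", steps=[train_step],
                    sagemaker_session=session)
print(pipeline.definition())  # JSON document to hand off to the DevOps engineer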
The following is a sample pipeline definition with one training step:

{
  "Version": "2020-12-01",
  "Steps": [
    {
      "Name": "AbaloneTrain",
      "Type": "Training",
      "Arguments": {
        "RoleArn": "<<YOUR_SAGEMAKER_ROLE_ARN>>",
        "HyperParameters": {
          "max_depth": "5",
          "gamma": "4",
          "eta": "0.2",
          "min_child_weight": "6",
          "objective": "multi:softmax",
          "num_class": "10",
          "num_round": "10"
        },
        "AlgorithmSpecification": {
          "TrainingImage": "683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1",
          "TrainingInputMode": "File"
        },
        "OutputDataConfig": {
          "S3OutputPath": "s3://<<YOUR_BUCKET_NAME>>/sagemaker/"
        },
        "ResourceConfig": {
          "InstanceCount": 1,
          "InstanceType": "ml.m4.xlarge",
          "VolumeSizeInGB": 5
        },
        "StoppingCondition": {
          "MaxRuntimeInSeconds": 86400
        },
        "InputDataConfig": [
          {
            "ChannelName": "train",
            "DataSource": {
              "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://<<YOUR_BUCKET_NAME>>/sagemaker/xgboost/train/",
                "S3DataDistributionType": "FullyReplicated"
              }
            },
            "ContentType": "text/libsvm"
          },
          {
            "ChannelName": "validation",
            "DataSource": {
              "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://<<YOUR_BUCKET_NAME>>/sagemaker/xgboost/validation/",
                "S3DataDistributionType": "FullyReplicated"
              }
            },
            "ContentType": "text/libsvm"
          }
        ]
      }
    }
  ]
}

With SageMaker, ML model artifacts and other system artifacts are encrypted in transit and at rest. SageMaker encrypts these by default using the AWS managed key for Amazon S3. You can optionally specify a custom key using the KmsKeyId property of the OutputDataConfig argument. For more information on how SageMaker protects data, see Data Protection in Amazon SageMaker.
Furthermore, we recommend securing access to the pipeline artifacts, such as model outputs and training data, to a specific set of IAM roles created for data scientists and ML engineers. This can be achieved by attaching an appropriate bucket policy. For more information on best practices for securing data in Amazon S3, see Top 10 security best practices for securing data in Amazon S3.
Create and submit a pipeline YAML specification
In the Kubernetes world, objects are the persistent entities in the Kubernetes cluster used to represent the state of your cluster. When you create an object in Kubernetes, you must provide the object specification that describes its desired state, as well as some basic information about the object (such as a name). Then, using tools such as kubectl, you provide the information in a manifest file in YAML (or JSON) format to communicate with the Kubernetes API.
Refer to the following Kubernetes YAML specification for a SageMaker pipeline. DevOps engineers need to modify the .spec.pipelineDefinition key in the file and add the ML engineer-provided pipeline JSON definition. They then prepare and submit a separate pipeline execution YAML specification to run the pipeline in SageMaker. There are two ways to submit a pipeline YAML specification:

Pass the pipeline definition inline as a JSON object to the pipeline YAML specification.
Convert the JSON pipeline definition into String format using the command line utility jq. For example, you can use the following command to convert the pipeline definition to a JSON-encoded string:

jq -r tojson <pipeline-definition.json>

In this post, we use the first option and prepare the YAML specification (my-pipeline.yaml) as follows:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Pipeline
metadata:
  name: my-kubernetes-pipeline
spec:
  parallelismConfiguration:
    maxParallelExecutionSteps: 2
  pipelineName: my-kubernetes-pipeline
  pipelineDefinition: |
    {
      "Version": "2020-12-01",
      "Steps": [
        {
          "Name": "AbaloneTrain",
          "Type": "Training",
          "Arguments": {
            "RoleArn": "<<YOUR_SAGEMAKER_ROLE_ARN>>",
            "HyperParameters": {
              "max_depth": "5",
              "gamma": "4",
              "eta": "0.2",
              "min_child_weight": "6",
              "objective": "multi:softmax",
              "num_class": "10",
              "num_round": "30"
            },
            "AlgorithmSpecification": {
              "TrainingImage": "683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1",
              "TrainingInputMode": "File"
            },
            "OutputDataConfig": {
              "S3OutputPath": "s3://<<YOUR_S3_BUCKET>>/sagemaker/"
            },
            "ResourceConfig": {
              "InstanceCount": 1,
              "InstanceType": "ml.m4.xlarge",
              "VolumeSizeInGB": 5
            },
            "StoppingCondition": {
              "MaxRuntimeInSeconds": 86400
            },
            "InputDataConfig": [
              {
                "ChannelName": "train",
                "DataSource": {
                  "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://<<YOUR_S3_BUCKET>>/sagemaker/xgboost/train/",
                    "S3DataDistributionType": "FullyReplicated"
                  }
                },
                "ContentType": "text/libsvm"
              },
              {
                "ChannelName": "validation",
                "DataSource": {
                  "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://<<YOUR_S3_BUCKET>>/sagemaker/xgboost/validation/",
                    "S3DataDistributionType": "FullyReplicated"
                  }
                },
                "ContentType": "text/libsvm"
              }
            ]
          }
        }
      ]
    }
  pipelineDisplayName: my-kubernetes-pipeline
  roleARN: <<YOUR_SAGEMAKER_ROLE_ARN>>

Submit the pipeline to SageMaker
To submit your prepared pipeline specification, apply the specification to your Kubernetes cluster as follows:

kubectl apply -f my-pipeline.yaml

Create and submit a pipeline execution YAML specification
Refer to the following Kubernetes YAML specification for a SageMaker pipeline execution. Prepare the pipeline execution YAML specification (pipeline-execution.yaml) as follows:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: PipelineExecution
metadata:
  name: my-kubernetes-pipeline-execution
spec:
  parallelismConfiguration:
    maxParallelExecutionSteps: 2
  pipelineExecutionDescription: "My first pipeline execution via Amazon EKS cluster."
  pipelineName: my-kubernetes-pipeline

To start a run of the pipeline, use the following code:

kubectl apply -f pipeline-execution.yaml

Review and troubleshoot the pipeline run
To list all pipelines created using the ACK controller, use the following command:

kubectl get pipeline

To list all pipeline runs, use the following command:

kubectl get pipelineexecution

To get more details about the pipeline after it’s submitted, like checking the status, errors, or parameters of the pipeline, use the following command:

kubectl describe pipeline my-kubernetes-pipeline

To troubleshoot a pipeline run by reviewing more details about the run, use the following command:

kubectl describe pipelineexecution my-kubernetes-pipeline-execution

Clean up
Use the following command to delete the pipeline you created:

kubectl delete pipeline my-kubernetes-pipeline

Use the following command to cancel the pipeline run you started:

kubectl delete pipelineexecution my-kubernetes-pipeline-execution

Conclusion
In this post, we presented an example of how ML engineers familiar with Jupyter notebooks and SageMaker environments can efficiently work with DevOps engineers familiar with Kubernetes and related tools to design and maintain an ML pipeline with the right infrastructure for their organization. This enables DevOps engineers to manage all the steps of the ML lifecycle with the same set of tools and environment they are used to, which enables organizations to innovate faster and more efficiently.
Explore the GitHub repository for ACK and the SageMaker controller to start managing your ML operations with Kubernetes.

About the Authors
Pratik Yeole is a Senior Solutions Architect working with global customers, helping customers build value-driven solutions on AWS. He has expertise in MLOps and containers domains. Outside of work, he enjoys time with friends, family, music, and cricket.
Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Effectively manage foundation models for generative AI applications wi …

Generative artificial intelligence (AI) foundation models (FMs) are gaining popularity with businesses due to their versatility and potential to address a variety of use cases. The true value of FMs is realized when they are adapted for domain specific data. Managing these models across the business and model lifecycle can introduce complexity. As FMs are adapted to different domains and data, operationalizing these pipelines becomes critical.
Amazon SageMaker, a fully managed service to build, train, and deploy machine learning (ML) models, has seen increased adoption to customize and deploy FMs that power generative AI applications. SageMaker provides rich features to build automated workflows for deploying models at scale. One of the key features that enables operational excellence around model management is the Model Registry. Model Registry helps catalog and manage model versions and facilitates collaboration and governance. When a model is trained and evaluated for performance, it can be stored in the Model Registry for model management.
Amazon SageMaker has released new features in Model Registry that make it easy to version and catalog FMs. Customers can use SageMaker to train or tune FMs, including Amazon SageMaker JumpStart and Amazon Bedrock models, and also manage these models within Model Registry. As customers begin to scale generative AI applications across various use cases such as fine-tuning for domain-specific tasks, the number of models can quickly grow. To keep track of models, versions, and associated metadata, SageMaker Model Registry can be used as an inventory of models.
In this post, we explore the new features of Model Registry that streamline FM management: you can now register unzipped model artifacts and pass an End User License Agreement (EULA) acceptance flag without needing users to intervene.
Overview
Model Registry has worked well for traditional models, which are smaller in size. For FMs, there were challenges because of their size and the requirement for user intervention to accept EULAs. With the new features in Model Registry, it’s now easier to register a fine-tuned FM, which can then be deployed for actual use.
A typical model development lifecycle is an iterative process. We conduct many experimentation cycles to achieve the expected performance from the model. Once trained, these models can be registered in the Model Registry, where they are cataloged as versions. The models can be organized in groups, versions can be compared on their quality metrics, and each model can have an associated approval status indicating whether it’s deployable.
Once the model is manually approved, a continuous integration and continuous deployment (CI/CD) pipeline can be triggered to deploy these models to production. Optionally, Model Registry can be used as a repository of models that are approved for use by an enterprise. Various teams can then deploy these approved models from Model Registry and build applications around it.
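For example, once a version has been reviewed, its approval status can be flipped with a single API call, which is typically what triggers the downstream CI/CD deployment. The following is a minimal sketch using boto3; the model package ARN and description are placeholders.

import boto3

sm_client = boto3.client("sagemaker")

# Placeholder ARN of the model package version to approve.
model_package_arn = "arn:aws:sagemaker:us-east-1:111122223333:model-package/model_group_name/1"

sm_client.update_model_package(
    ModelPackageArn=model_package_arn,
    ModelApprovalStatus="Approved",
    ApprovalDescription="Passed offline evaluation; ready for deployment",
)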
An example workflow could follow these steps and is shown in the following diagram:

Select a SageMaker JumpStart model and register it in Model Registry
Alternatively, you can fine-tune a SageMaker JumpStart model
Evaluate the model with SageMaker model evaluation. SageMaker allows for human evaluation if desired.
Create a model group in the Model Registry. For each run, create a model version. Add your model group into one or more Model Registry Collections, which can be used to group registered models that are related to each other. For example, you could have a collection of large language models (LLMs) and another collection of diffusion models.
Deploy the models as SageMaker Inference endpoints that can be consumed by generative AI applications.

Figure 1: Model Registry workflow for foundation models
To better support generative AI applications, Model Registry released two new features: ModelDataSource and source model URI. The following sections explore these features and how to use them.
ModelDataSource speeds up deployment and provides access to EULA dependent models
Until now, model artifacts had to be stored in a compressed format along with the inference code when a model was registered in Model Registry. This posed challenges for generative AI applications because FMs are very large, with billions of parameters. Storing FMs as zipped models increased latency at SageMaker endpoint startup, because decompressing these models at runtime took a long time. The model_data_source parameter can now accept the location of the unzipped model artifacts in Amazon Simple Storage Service (Amazon S3), simplifying the registration process. This also eliminates the need for endpoints to unzip the model weights, reducing latency during endpoint startup.
Additionally, public JumpStart models and certain FMs from independent providers, such as Llama 2, require that their EULA be accepted before the models can be used. As a result, when public models from SageMaker JumpStart were fine-tuned, they could not be stored in the Model Registry because a user needed to accept the license agreement. Model Registry added a new feature: EULA acceptance flag support within the model_data_source parameter, allowing the registration of such models. Now customers can catalog, version, associate metadata such as training metrics, and more in Model Registry for a wider variety of FMs.
Register unzipped models stored in Amazon S3 using the SageMaker Python SDK.

from sagemaker.model import Model

# Point model_data at the unzipped artifacts in S3; no decompression is needed at
# endpoint startup. AcceptEula covers models that require a license agreement.
model_data_source = {
    "S3DataSource": {
        "S3Uri": "s3://bucket/model/prefix/",
        "S3DataType": "S3Prefix",
        "CompressionType": "None",
        "ModelAccessConfig": {
            "AcceptEula": True
        },
    }
}

model = Model(
    sagemaker_session=sagemaker_session,
    image_uri=IMAGE_URI,
    model_data=model_data_source,
)
model.register()

Register models requiring a EULA.

from sagemaker.jumpstart.model import JumpStartModel

model_id = "meta-textgeneration-llama-2-7b"
my_model = JumpStartModel(model_id=model_id)
registered_model = my_model.register(accept_eula=True)
predictor = registered_model.deploy()

Source model URI provides simplified registration and proprietary model support
Model Registry now supports automatic population of inference specification files for some recognized model IDs, including select AWS Marketplace models, hosted models, or versioned model packages in Model Registry. Because of SourceModelURI’s support for automatic population, you can register proprietary JumpStart models from providers such as AI21 Labs, Cohere, and LightOn without needing the inference specification file, allowing your organization to use a broader set of FMs in Model Registry.
Previously, to register a trained model in the SageMaker Model Registry, you had to provide the complete inference specification required for deployment, including an Amazon Elastic Container Registry (Amazon ECR) image and the trained model file. With the launch of source_uri support, SageMaker has made it easy to register any model by providing a source model URI, which is a free-form field that stores a model ID or location, such as a proprietary JumpStart model ID, an Amazon Bedrock model ID, an S3 location, or an MLflow model ID. Rather than supplying the details required for deployment to SageMaker hosting at registration time, you can add the artifacts later. After registration, to deploy a model, you can package it with an inference specification and update Model Registry accordingly.
For example, you can register a model in Model Registry with a model Amazon Resource Name (ARN) SourceURI.

model_arn = "<arn of the model to be registered>"
registered_model_package = model.register(
    model_package_group_name="model_group_name",
    source_uri=model_arn,
)

Later, you can update the registered model with the inference specification, making it deployable on SageMaker.

from sagemaker import ModelPackage, get_execution_role

model_package = sagemaker_session.sagemaker_client.create_model_package(
    ModelPackageGroupName="model_group_name",
    SourceUri="source_uri",
)
mp = ModelPackage(
    role=get_execution_role(sagemaker_session),
    model_package_arn=model_package["ModelPackageArn"],
    sagemaker_session=sagemaker_session,
)
mp.update_inference_specification(image_uris=["ecr_image_uri"])

Register a SageMaker JumpStart proprietary FM.

from sagemaker.jumpstart.model import JumpStartModel

model_id = "ai21-contextual-answers"
my_model = JumpStartModel(model_id=model_id)
model_package = my_model.register()

Conclusion
As organizations continue to adopt generative AI in different parts of their business, having robust model management and versioning becomes paramount. With Model Registry, you can achieve version control, tracking, collaboration, lifecycle management, and governance of FMs.
In this post, we explored how Model Registry can now more effectively support managing generative AI models across the model lifecycle, empowering you to better govern and adopt generative AI to achieve transformational outcomes.
To learn more about Model Registry, see Register and Deploy Models with Model Registry. To get started, visit the SageMaker console.

About the Authors
Chaitra Mathur serves as a Principal Solutions Architect at AWS, where her role involves advising clients on building robust, scalable, and secure solutions on AWS. With a keen interest in data and ML, she assists clients in leveraging AWS AI/ML and generative AI services to address their ML requirements effectively. Throughout her career, she has shared her expertise at numerous conferences and has authored several blog posts in the ML area.
Kait Healy is a Solutions Architect II at AWS. She specializes in working with startups and enterprise automotive customers, where she has experience building AI/ML solutions at scale to drive key business outcomes.
Saumitra Vikaram is a Senior Software Engineer at AWS. He is focused on AI/ML technology, ML model management, ML governance, and MLOps to improve overall organizational efficiency and productivity.
Siamak Nariman is a Senior Product Manager at AWS. He is focused on AI/ML technology, ML model management, and ML governance to improve overall organizational efficiency and productivity. He has extensive experience automating processes and deploying various technologies.

Build an ecommerce product recommendation chatbot with Amazon Bedrock …

Many ecommerce applications want to provide their users with a human-like chatbot that guides them to choose the best product as a gift for their loved ones or friends. To enhance the customer experience, the chatbot needs to engage in a natural, conversational manner to understand the user’s preferences and requirements, such as the recipient’s gender, the occasion for the gift, and the desired product category. Based on the discussion with the user, the chatbot should be able to query the ecommerce product catalog, filter the results, and recommend the most suitable products.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
Amazon Bedrock Agents is a feature that enables generative AI applications to run multistep tasks across company systems and data sources. In this post, we show you how to build an ecommerce product recommendation chatbot using Amazon Bedrock Agents and FMs available in Amazon Bedrock.
Solution overview
Traditional rule-based chatbots often struggle to handle the nuances and complexities of open-ended conversations, leading to frustrating experiences for users. Furthermore, manually coding all the possible conversation flows and product filtering logic is time-consuming and error-prone, especially as the product catalog grows.
To address this challenge, you need a solution that uses the latest advancements in generative AI to create a natural conversational experience. The solution should seamlessly integrate with your existing product catalog API and dynamically adapt the conversation flow based on the user’s responses, reducing the need for extensive coding.
With Amazon Bedrock Agents, you can build intelligent chatbots that can converse naturally with users, understand their preferences, and efficiently retrieve and recommend the most relevant products from the catalog. Amazon Bedrock Agents simplifies the process of building and deploying generative AI models, enabling businesses to create engaging and personalized conversational experiences without the need for extensive machine learning (ML) expertise.
For our use case, we create a recommender chatbot using Amazon Bedrock Agents that prompts users to describe who they want to buy the gift for and the relevant occasion. The agent queries the product information stored in an Amazon DynamoDB table, using an API implemented as an AWS Lambda function. The agent adapts the API inputs to filter products based on its discussion with the user, for example gender, occasion, and category. After obtaining the user’s gift preferences by asking clarifying questions, the agent responds with the most relevant products that are available in the DynamoDB table based on user preferences.
The following diagram illustrates the solution architecture.

As shown in the preceding diagram, the ecommerce application first uses the agent to drive the conversation with users and generate product recommendations. The agent uses an API backed by Lambda to get product information. Lastly, the Lambda function looks up product data from DynamoDB.
Prerequisites
You need to have an AWS account with a user or role that has at minimum the following AWS Identity and Access Management (IAM) policies and permissions:

AWS managed policies:

AmazonBedrockFullAccess
AWSMarketplaceManageSubscriptions
AWSLambda_ReadOnlyAccess
AmazonDynamoDBReadOnlyAccess

IAM actions:

iam:CreateRole
iam:CreatePolicy
iam:AttachRolePolicy

Deploy the solution resources with AWS CloudFormation
Before you create your agent, you need to set up the product database and API. We use an AWS CloudFormation template to create a DynamoDB table to store product information and a Lambda function to serve as the API for retrieving product details.
At the time of writing this post, you can use any of the following AWS Regions to deploy the solution: US East (N. Virginia), US West (Oregon), Asia Pacific (Mumbai, Sydney), Europe (Frankfurt, Paris), Canada (Central), or South America (São Paulo). Visit Supported regions and models for Amazon Bedrock Agents for updates.
To deploy the template, choose Launch Stack:

This template creates a DynamoDB table named Products with the following attributes: product_name (partition key), category, gender, and occasion. It also defines a global secondary index (GSI) for each of these attributes to enable efficient querying.
Additionally, the template sets up a Lambda function named GetProductDetailsFunction that acts as an API for retrieving product details. This Lambda function accepts query parameters such as category, gender, and occasion. It constructs a filter expression based on the provided parameters and scans the DynamoDB table to retrieve matching products. If no parameters are provided, it retrieves all the products in the table and returns the first 100 products.
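As a simplified sketch of what such a function might look like (the actual GetProductDetailsFunction created by the CloudFormation template may differ), the helper below builds a DynamoDB filter expression from whichever of the category, gender, and occasion parameters are present and scans the Products table:

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Products")

def get_products(category=None, gender=None, occasion=None):
    """Build a filter expression from whichever parameters were supplied
    and scan the Products table, returning at most 100 matches."""
    filter_expr = None
    for name, value in (("category", category), ("gender", gender), ("occasion", occasion)):
        if value:
            condition = Attr(name).eq(value)
            filter_expr = condition if filter_expr is None else filter_expr & condition

    if filter_expr is None:
        response = table.scan()            # no filters: return the first 100 products
    else:
        response = table.scan(FilterExpression=filter_expr)
    return response.get("Items", [])[:100]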
The template also creates another Lambda function called PopulateProductsTableFunction that generates sample data to store in the Products table. The CloudFormation template includes a custom resource that will run the PopulateProductsTableFunction function one time as part of the template deployment, to add 100 sample product entries in the products DynamoDB table, with various combinations of product names, descriptions, categories, genders, and occasions.
You can optionally update the sample product entries or replace them with your own product data. To do so, open the DynamoDB console, choose Explore items, and select the Products table. Choose Scan and choose Run to view and edit the current items, or choose Create item to add a new item. If your data has different attributes than the sample product entries, you need to adjust the code of the GetProductDetailsFunction Lambda function, the OpenAPI schema, and the instructions for the agent that are used in the following section.
Create the agent
Now that you have the infrastructure in place, you can create the agent. The first step is to request model access.

On the Amazon Bedrock console, choose Model access in the navigation pane.
Choose Enable specific models.

Select the model you need access to (for this post, we select Claude 3 Sonnet).

Wait for the model access status to change to Access granted.

Now you can create your agent. We use a CloudFormation template to create the agent and the action group that will invoke the Lambda function.

To deploy the template, choose Launch Stack:

Now you can check the details of the agent that was created by the stack.

On the Amazon Bedrock console, choose Agents under Builder tools in the navigation pane.
Choose the agent product-recommendation-agent, then choose Edit in Agent Builder.
The Instructions for the Agent section includes a set of instructions that guides the agent in how to communicate with the user and use the API. You can adjust the instructions based on different use cases and business scenarios as well as the available APIs.

The agent’s primary goal is to engage in a conversation with the user to gather information about the recipient’s gender, the occasion for the gift, and the desired category. Based on this information, the agent will query the Lambda function to retrieve and recommend suitable products.
Your next step is to check the action group that enables the agent to invoke the Lambda function.

In the Action groups section, choose the Get-Product-Recommendations action group.

You can see the GetProductDetailsFunction Lambda function is selected in the Action group invocation section.

In the Action group schema section, you can see the OpenAPI schema, which enables the agent to understand the description, inputs, outputs, and the actions of the API that it can use during the conversation with the user.

Now you can use the Test Agent pane to have conversations with the chatbot.
Test the chatbot
The following screenshots show example conversations, with the chatbot recommending products after calling the API.

In the sample conversation, the chatbot asks relevant questions to determine the gift recipient’s gender, the occasion, and the desired category. After it has gathered enough information, it queries the API and presents a list of recommended products matching the user’s preferences.
You can see the rationale for each response by choosing Show trace. The following screenshots show how the agent decided to use different API filters based on the discussion.

You can see in the rationale field how the agent made its decision for each interaction. This trace data can help you understand the reasons behind a recommendation. Logging this information can be beneficial for future refinements of your agent’s recommendations.
Clean up
Complete the following steps to clean up your resources:

On the AWS CloudFormation console, delete the stack AgentStack.
Then delete the stack Productstableandapi.

Conclusion
This post showed you how to use Amazon Bedrock Agents to create a conversational chatbot that can assist users in finding the perfect gift. The chatbot intelligently gathers user preferences, queries a backend API to retrieve relevant product details, and presents its recommendations to the user. This approach demonstrates the power of Agents for Amazon Bedrock in building engaging and context-aware conversational experiences.
We recommend you follow best practices while using Amazon Bedrock Agents. For instance, using AWS CloudFormation to create and configure the agent allows you to minimize human error and recreate the agent across different environments and Regions. Also, automating your agent testing using a set of golden questions and their expected answers enables you to test the quality of the instructions for the agent and compare the outputs of the different models on Amazon Bedrock in relation to your use case.
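A lightweight way to automate such testing is to replay a list of golden questions against the agent with the Bedrock Agents runtime API and compare the answers against the expected ones. The following is a sketch; the agent ID, alias ID, and questions are placeholders, not values from this post.

import uuid
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

# Placeholders: substitute your agent ID, alias ID, and golden question set.
AGENT_ID = "XXXXXXXXXX"
AGENT_ALIAS_ID = "TSTALIASID"
golden_questions = [
    "I need a gift for my sister's birthday, ideally something in the jewelry category.",
    "What would you recommend for a male colleague's retirement?",
]

def ask_agent(question):
    response = agent_runtime.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=str(uuid.uuid4()),
        inputText=question,
    )
    # The response is a stream of events; concatenate the returned text chunks.
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )

for question in golden_questions:
    print(f"Q: {question}\nA: {ask_agent(question)}\n")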
Visit Amazon Bedrock Agents to learn more about features and details.

About the Author
Mahmoud Salaheldin is a Senior Solutions Architect in AWS, working with customers in the Middle East, North Africa, and Turkey, where he helps enterprises, digital-centered businesses, and independent software vendors innovate new products that can enhance their customer experience and increase their business efficiency. He is a generative AI ambassador as well as a containers community member. He lives in Dubai, United Arab Emirates, and enjoys riding motorcycles and traveling.

Sleigh BFCM: Top Meta Ads Strategies from Our 2024 Holiday Guide

The holidays are coming, and if you’re looking to make this Black Friday and Cyber Monday your most successful ever, it’s time to sharpen your Meta Ads strategy. 

Yes, inflation is biting, and yes, consumers are being more cautious—but that doesn’t mean they’re not shopping! 

In fact, 80% of consumers plan to do more online shopping this year. 

So, how do you stand out? 

That’s where our 2024 Holiday Guide: Advanced Meta Ads Strategies to Sleigh Black Friday comes in.

Packed with actionable tips, expert insights, BFCM statistics, and quick wins, this guide is your playbook for winning the holiday season. 

But we’re not here to just repeat what’s in the guide—we’re giving you a sneak peek of some of the most valuable strategies to get you pumped for what’s inside. 

Ready to unwrap these holiday gems?

Start Early and Warm Up Your Audience

First things first: timing. 

According to Jen Root, VP of Partnerships at Hawke Media, this is not the year to procrastinate. 

With competition heating up due to both the holiday season and election year ad spend, you need to start warming up your audience now. 

By starting as early as September, you’ll build retargeting audiences that’ll give your ads a head start. 

This early bird approach helps increase Return on Ad Spend (ROAS) and lowers Customer Acquisition Cost (CAC), setting you up for a smooth Black Friday.

Fun Fact: 1 in 4 holiday shoppers will start as early as September!

Get Your Meta Tracking Q4-Ready

You wouldn’t bake a holiday pie without making sure your oven works, right? 

The same goes for your Meta tracking. 

Brett Fish, Founder of TagHero, says you need to ensure your tracking setup is top-notch to get the best results from Meta’s machine learning. 

Dive into Meta’s Event Manager, and test your customer journey to ensure everything’s running smoothly. The better your data, the better your results.

Pro Tip: Meta’s AI needs clean data to perform its magic—don’t leave your tracking to chance!

Don’t Let Conversion Blockers Steal Your Holiday Spirit

The holidays can get chaotic, but your website doesn’t have to be! 

Sean McCarthy from Lucky Orange recommends using heatmaps and session recordings to spot and remove conversion blockers before the holiday rush. 

Whether it’s a poorly placed button or a clunky form, smoothing out the experience now means more sales later.

Did You Know? 54% of ecommerce sales will come from mobile devices this holiday season, so make sure your mobile game is strong!

Boost Your Holiday Sales with Data

The holidays are all about giving—and what better gift to give yourself than some first-party data? 

Larry Kim, our Founder and CEO, recommends using tools like Customers.ai to capture and leverage your website visitor data. 

Not everyone buys on their first visit, but with the right retargeting ads, you can reel them back in with personalized offers that match their interests.

Surprising Stat: There are only 27 shopping days between Thanksgiving and Christmas this year—don’t waste a single one!

Deck the Halls with Creativity (and Some UGC)

Want your ads to sparkle? Turn your best-performing 2024 creative into holiday gold! 

Courtney Fritts, Owner of Axiom Marketing Solutions, suggests giving your top creatives a festive twist. Add a Santa hat to your video star, update your copy with gifting language, or simply swap your background for a cozy holiday setting. 

Small tweaks can make a big impact and keep your winning ads fresh.

Bonus Tip: UGC (User-Generated Content) ads get 4x the click-through rate and cost 50% less per click compared to traditional ads. Talk about a holiday miracle!

These tips are just the beginning. Our 2024 Holiday Guide is filled with 10 more actionable strategies, from advanced audience targeting to countdown campaigns that create urgency. 

Whether you’re a seasoned pro or just getting started, this guide will help you sleigh the holidays and crush your goals.

Ready to unwrap the full guide?

Download it here and make sure your Meta Ads are ready to light up the holiday season!

Important Next Steps

See what targeted outbound marketing is all about. Capture and engage your first 500 website visitor leads with Customers.ai X-Ray website visitor identification for free.

Talk and learn about sales outreach automation with other growth enthusiasts. Join Customers.ai Island, our Facebook group of 40K marketers and entrepreneurs who are ready to support you.

Advance your marketing performance with Sales Outreach School, a free tutorial and training area for sales pros and marketers.


miniG Released by CausalLM: A Groundbreaking Scalable AI-Language Model Trained on a Synthesis Dataset of 120 Million Entries

CausalLM has released miniG, a groundbreaking language model designed to bridge the gap between performance and efficiency. This innovative model stands out for its powerful capabilities and compact design, making advanced AI technology more accessible to a wider audience. As industries increasingly seek cost-effective and scalable AI solutions, miniG emerges as a transformative tool, setting a new standard in developing and deploying AI models.

Background and Development of miniG

miniG, the latest creation by CausalLM, represents a substantial leap in the field of AI language models. CausalLM, known for its expertise in developing advanced AI models, has once again demonstrated its prowess with the release of miniG. The development of miniG was driven by the need for a more efficient, scalable, and versatile language model that could perform at a level comparable to its larger counterparts while maintaining a smaller footprint.


The creation of miniG involved a meticulous research and development process, during which the team at CausalLM focused on optimizing the model’s architecture. The objective was to build a model that could deliver high performance with fewer computational resources. This goal was achieved by leveraging state-of-the-art techniques in model compression, fine-tuning, and knowledge distillation. The result is a language model that is powerful and accessible to a broader range of users, from large enterprises to individual developers.

Key Features and Capabilities of miniG

One of the most remarkable aspects of miniG is its ability to perform complex language tasks with impressive accuracy. Despite its smaller size, miniG does not compromise performance. It excels in natural language processing (NLP) tasks such as text generation, sentiment analysis, translation, and summarization. The model’s architecture is designed to handle large datasets efficiently, making it suitable for various real-world applications.

Another feature that sets miniG apart is its scalability. CausalLM has ensured that miniG can be easily integrated into different platforms, whether deployed on cloud services or edge devices. This flexibility is crucial for industries and businesses that require real-time processing and analysis, such as finance, healthcare, and customer service. miniG’s ability to function seamlessly across different environments makes it an important tool for developers who need to build AI-powered applications with limited resources.

In addition to its technical capabilities, miniG is designed with user-friendliness in mind. CausalLM has provided comprehensive documentation and support to help users get started with the model quickly. The company has also made the model available through various interfaces, including APIs and open-source libraries, ensuring that developers can integrate miniG into their projects with minimal effort.

Impact on the AI Community and Industry

The release of miniG is expected to have a profound impact on the AI community and various industries. For the AI research community, miniG provides a new benchmark for model efficiency and performance. It challenges the notion that bigger models are always better by demonstrating that smaller, well-optimized models can achieve comparable results. This shift in perspective will likely influence future research directions, encouraging the development of more efficient models accessible to a wider audience.

In the industry, miniG’s release comes with a growing demand for powerful, cost-effective AI solutions. Businesses are increasingly looking for AI models that can be deployed at scale without incurring prohibitive costs. miniG addresses this need by offering a model that delivers high performance at a fraction of the cost of larger models. This affordability and versatility make miniG an attractive option for companies and businesses looking to integrate AI into their operations.

miniG’s release is likely to spur innovation in AI applications. With a powerful yet accessible model, developers and businesses can explore new use cases for AI that were previously considered too resource-intensive. This could lead to the development of novel AI-powered products and services, driving growth in the tech industry and beyond.

Ethical Considerations and Future Prospects

As with any AI model, the release of miniG also raises important ethical considerations. CausalLM has emphasized the importance of responsible AI development and has taken steps to ensure that miniG is used in a manner that aligns with ethical standards. The company has implemented safeguards to prevent model misuse, such as limiting access to certain features and providing guidelines on responsible AI usage. CausalLM has already hinted at future updates and iterations of the model, which could include enhancements in performance, security, and user experience. The company’s commitment to innovation suggests that miniG is just the beginning of a new era in AI development, where efficiency and accessibility are prioritized alongside power and performance.

Conclusion

CausalLM’s release of miniG combines high performance with efficiency and accessibility. miniG can potentially revolutionize how AI is used across various industries. Its impact will likely be felt in the tech sector and fields such as healthcare, finance, and customer service, where AI is becoming an integral part of operations.

Check out the Model Card. All credit for this research goes to the researchers of this project.
