Gradient makes LLM benchmarking cost-effective and effortless with AWS …

This is a guest post co-written with Michael Feil at Gradient.
Evaluating the performance of large language models (LLMs) is an important step in pre-training and fine-tuning before deployment. The faster and more frequently you can validate performance, the better your chances of improving the model.
At Gradient, we work on custom LLM development, and recently launched our AI Development Lab, offering enterprise organizations a personalized, end-to-end development service to build private, custom LLMs and artificial intelligence (AI) co-pilots. As part of this process, we regularly evaluate the performance of our models (tuned, trained, and open) against open and proprietary benchmarks. While working with the AWS team to train our models on AWS Trainium, we realized we were constrained by both VRAM and the availability of GPU instances when running the mainstream tool for LLM evaluation, lm-evaluation-harness. This open source framework lets you score different generative language models across various evaluation tasks and benchmarks. It is used for public benchmarking by leaderboards such as the Open LLM Leaderboard on Hugging Face.
To overcome these challenges, we decided to build and open source our own solution: an integration of AWS Neuron, the SDK behind AWS Inferentia and Trainium, into lm-evaluation-harness. This integration made it possible to benchmark v-alpha-tross, an early version of our Albatross model, against other public models both during and after training.
For context, this integration runs as a new model class within lm-evaluation-harness, abstracting the inference of tokens and log-likelihood estimation of sequences without affecting the actual evaluation task. Moving our internal testing pipeline to Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances (powered by AWS Inferentia2) gave us access to up to 384 GB of shared accelerator memory, which comfortably fits all of our current public architectures. By using AWS Spot Instances, we were able to take advantage of unused EC2 capacity in the AWS Cloud, with cost savings of up to 90% compared with On-Demand prices. This shortened our testing time and let us test more frequently, because we could run tests across multiple readily available instances and release them when we were finished.
In this post, we give a detailed breakdown of our tests, the challenges that we encountered, and an example of using the testing harness on AWS Inferentia.
Benchmarking on AWS Inferentia2
The goal of this project was to reproduce the scores shown in the Open LLM Leaderboard (for many CausalLM models available on Hugging Face), while retaining the flexibility to run the harness against private benchmarks. To see more examples of available models, see AWS Inferentia and Trainium on Hugging Face.
The code changes required to port a model from Hugging Face transformers to the Hugging Face Optimum Neuron Python library were minimal. Because lm-evaluation-harness uses AutoModelForCausalLM, NeuronModelForCausalLM serves as a drop-in replacement. Without a precompiled model, the model is compiled automatically on the fly, which can add 15–60 minutes to a job. This gave us the flexibility to run tests on any AWS Inferentia2 instance with any supported CausalLM model.
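As a rough illustration of that drop-in swap, the sketch below chooses between the two loader classes. The helper names `neuron_export_kwargs` and `load_causal_lm` are hypothetical, and the keyword arguments forwarded to `NeuronModelForCausalLM.from_pretrained` (`export`, `batch_size`, `sequence_length`, `num_cores`, `auto_cast_type`) are assumptions based on early-2024 Optimum Neuron releases, so check them against the version you have installed. This is not the harness's actual implementation.

```python
from typing import Any, Dict


def neuron_export_kwargs(batch_size: int, sequence_length: int,
                         num_cores: int, dtype: str = "bf16") -> Dict[str, Any]:
    """Collect the static shapes Neuron needs at compile time (hypothetical helper)."""
    return {
        "export": True,  # assumption: asks Optimum Neuron to compile the model on load
        "batch_size": batch_size,
        "sequence_length": sequence_length,
        "num_cores": num_cores,
        "auto_cast_type": dtype,
    }


def load_causal_lm(model_id: str, use_neuron: bool, **neuron_kwargs: Any):
    """Load a causal LM from either library.

    Imports are lazy so that only the library actually used must be installed.
    """
    if use_neuron:
        from optimum.neuron import NeuronModelForCausalLM  # requires optimum-neuron
        return NeuronModelForCausalLM.from_pretrained(model_id, **neuron_kwargs)
    from transformers import AutoModelForCausalLM
    return AutoModelForCausalLM.from_pretrained(model_id)
```

On an Inf2 instance, a call might look like `load_causal_lm("mistralai/Mistral-7B-v0.1", True, **neuron_export_kwargs(1, 2048, 2))`, where the sequence length of 2048 is an illustrative value rather than one taken from our runs.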
Results
Because of the way the benchmarks and models work, we didn’t expect the scores to match exactly across different runs. However, they should agree to within the reported standard deviation, and we have consistently seen that, as shown in the following table. The initial benchmarks we ran on AWS Inferentia2 were all confirmed by the Hugging Face leaderboard.
In lm-evaluation-harness, there are two main streams used by different tests: generate_until and loglikelihood. The gsm8k test primarily uses generate_until to generate responses just like during inference. Loglikelihood is mainly used in benchmarking and testing, and examines the probability of different outputs being produced. Both work in Neuron, but the loglikelihood method in SDK 2.16 uses additional steps to determine the probabilities and can take extra time.
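To make the difference between the two streams concrete, here is a toy sketch in plain Python. The "model" is a hypothetical function that maps a token prefix to logits over a three-token vocabulary; the function names mirror the harness's request types, but the code is a simplified illustration and not the harness or Neuron implementation (a real implementation would score the whole continuation in one forward pass rather than token by token).

```python
import math
from typing import Callable, List, Tuple

Model = Callable[[List[int]], List[float]]  # prefix tokens -> next-token logits


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one logits vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def loglikelihood(model: Model, context: List[int],
                  continuation: List[int]) -> Tuple[float, bool]:
    """Score a fixed continuation: summed log-probability and whether it is greedy."""
    prefix, total, is_greedy = list(context), 0.0, True
    for token in continuation:
        logits = model(prefix)
        total += log_softmax(logits)[token]
        is_greedy = is_greedy and token == argmax(logits)
        prefix.append(token)  # teacher forcing: feed the reference token back in
    return total, is_greedy


def generate_until(model: Model, context: List[int], stop_token: int,
                   max_new_tokens: int = 16) -> List[int]:
    """Greedy decoding that stops at stop_token or after max_new_tokens."""
    prefix, output = list(context), []
    for _ in range(max_new_tokens):
        token = argmax(model(prefix))
        if token == stop_token:
            break
        output.append(token)
        prefix.append(token)
    return output


def toy_model(prefix: List[int]) -> List[float]:
    """Hypothetical model that prefers (last token + 1) mod 3."""
    favored = (prefix[-1] + 1) % 3
    return [2.0 if i == favored else 0.0 for i in range(3)]
```

With this toy model, `generate_until(toy_model, [0], stop_token=0)` produces tokens one at a time as in inference, whereas `loglikelihood(toy_model, [0], [1, 2])` only scores a continuation that is already given, which is how multiple-choice style tasks compare candidate answers.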

lm-evaluation-harness results

| Hardware configuration | Original system | AWS Inferentia inf2.48xlarge |
|---|---|---|
| Time with batch_size=1 to evaluate mistralai/Mistral-7B-Instruct-v0.1 on gsm8k | 103 minutes | 32 minutes |
| Score on gsm8k (get-answer – exact_match with std) | 0.3813 – 0.3874 (± 0.0134) | 0.3806 – 0.3844 (± 0.0134) |
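As a quick sanity check on these numbers, the sketch below recomputes the standard error of an exact_match score and the runtime ratio. It assumes the gsm8k test split has 1,319 items (the count shown in the progress bar of the sample run later in this post) and uses the plain binomial formula sqrt(p(1 - p)/n), which is an approximation and may differ slightly from how the harness computes its value.

```python
import math


def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of a mean of n independent 0/1 outcomes (approximation)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def within_one_stderr(score_a: float, score_b: float, stderr: float) -> bool:
    """True when two scores differ by no more than one standard error."""
    return abs(score_a - score_b) <= stderr


N_GSM8K = 1319  # assumption: item count taken from the sample run's progress bar

stderr = binomial_stderr(0.3806, N_GSM8K)   # rounds to 0.0134, matching the table
speedup = 103 / 32                          # original system time / inf2.48xlarge time

print(f"stderr ~ {stderr:.4f}")
print(f"speedup ~ {speedup:.2f}x")
print(within_one_stderr(0.3813, 0.3806, stderr))  # lower ends of the two ranges
print(within_one_stderr(0.3874, 0.3844, stderr))  # upper ends of the two ranges
```

Both comparisons fall well inside one standard error, which is the sense in which the two hardware configurations produce matching scores.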

Get started with Neuron and lm-evaluation-harness
The code in this section can help you use lm-evaluation-harness and run it against supported models on Hugging Face. To see some available models, visit AWS Inferentia and Trainium on Hugging Face.
If you’re familiar with running models on AWS Inferentia2, you might notice that there is no num_cores setting passed in. Our code detects how many cores are available and automatically passes that number in as a parameter. This lets you run the test using the same code regardless of what instance size you are using. You might also notice that we are referencing the original model, not a Neuron compiled version. The harness automatically compiles the model for you as needed.
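One way such core detection could work is sketched below. It assumes the `neuron-ls` tool from the Neuron SDK is on the PATH, that it accepts a `--json-output` flag, and that it returns a JSON list of devices each carrying an `nc_count` field; treat all three as assumptions to verify on your instance. The function names are hypothetical.

```python
import json
import subprocess
from typing import Optional


def count_neuron_cores(neuron_ls_json: str) -> int:
    """Sum the per-device core counts from neuron-ls style JSON (assumed schema)."""
    devices = json.loads(neuron_ls_json)
    return sum(device["nc_count"] for device in devices)


def detect_neuron_cores() -> Optional[int]:
    """Return the number of NeuronCores on this machine, or None if undetectable."""
    try:
        result = subprocess.run(
            ["neuron-ls", "--json-output"],  # assumption: flag provided by Neuron tools
            capture_output=True, text=True, check=True,
        )
        return count_neuron_cores(result.stdout)
    except (OSError, subprocess.CalledProcessError, ValueError, KeyError, TypeError):
        # Tool missing, command failed, or output not in the assumed format.
        return None
```

For example, a payload describing 12 devices with 2 cores each, as on an inf2.48xlarge, would yield 24, and that value can then be passed along as the core count without the user specifying it.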
The following steps show you how to deploy the Gradient gradientai/v-alpha-tross model we tested. If you want to test with a smaller example on a smaller instance, you can use the mistralai/Mistral-7B-v0.1 model.

1. The default quota for running On-Demand Inf instances is 0, so you should request an increase via Service Quotas. Add another request for all Inf Spot Instance requests so you can test with Spot Instances. You will need a quota of 192 vCPUs for this example using an inf2.48xlarge instance, or a quota of 4 vCPUs for a basic inf2.xlarge (if you are deploying the Mistral model). Quotas are AWS Region specific, so make sure you request in us-east-1 or us-west-2.
2. Decide on your instance based on your model. Because v-alpha-tross is a 70B architecture, we decided to use an inf2.48xlarge instance. For the 7B Mistral model, deploy an inf2.xlarge. If you are testing a different model, you may need to adjust your instance depending on the size of your model.
3. Deploy the instance using the Hugging Face DLAMI version 20240123, so that all the necessary drivers are installed. (The price shown includes the instance cost and there is no additional software charge.)
4. Adjust the drive size to 600 GB (100 GB for Mistral 7B).
5. Clone and install lm-evaluation-harness on the instance. We pin a specific revision so that we know any variance is due to model changes, not test or code changes.

git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
# optional: pick a specific revision from the main branch to reproduce the exact results
git checkout 756eeb6f0aee59fc624c81dcb0e334c1263d80e3
# install the repository without overwriting the existing torch and torch-neuronx installation
pip install --no-deps -e .
pip install peft evaluate jsonlines numexpr pybind11 pytablewriter rouge-score sacrebleu sqlitedict tqdm-multiprocess zstandard hf_transfer
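Before launching a long evaluation, it can help to sanity-check the instance choice made in the steps above. The sketch below estimates the memory needed for the model weights alone and compares it with the accelerator memory of the two instance types used in this post. The 384 GB figure comes from this post; the 32 GB figure for inf2.xlarge, the nominal parameter counts, and the helper names are assumptions for illustration. The estimate ignores the KV cache, activations, and compiler overhead, so treat a pass as necessary rather than sufficient.

```python
BYTES_PER_PARAM = {"bfloat16": 2, "float16": 2, "float32": 4}

# Accelerator memory per instance type in GB (inf2.xlarge value is an assumption).
ACCELERATOR_MEMORY_GB = {"inf2.xlarge": 32, "inf2.48xlarge": 384}


def weights_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Approximate size of the model weights in GB (weights only)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def weights_fit(num_params: float, instance_type: str,
                dtype: str = "bfloat16") -> bool:
    """True when the weights alone fit in the instance's accelerator memory."""
    return weights_gb(num_params, dtype) <= ACCELERATOR_MEMORY_GB[instance_type]


for name, params in [("70B model", 70e9), ("7B model", 7e9)]:
    for instance in ACCELERATOR_MEMORY_GB:
        verdict = "fits" if weights_fit(params, instance) else "does not fit"
        print(f"{name}: {weights_gb(params):.0f} GB of weights {verdict} on {instance}")
```

Under these assumptions, a 70B model in bfloat16 needs about 140 GB for weights, which rules out the smaller instance, while a 7B model needs about 14 GB.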

Run lm_eval with the neuronx model type, and make sure the pretrained argument points to the model's path on Hugging Face:

# e.g., use mistralai/Mistral-7B-v0.1 if you are on inf2.xlarge
MODEL_ID=gradientai/v-alpha-tross

python -m lm_eval --model "neuronx" --model_args "pretrained=$MODEL_ID,dtype=bfloat16" --batch_size 1 --tasks gsm8k

If you run the preceding example with Mistral, you should see output similar to the following (on the smaller inf2.xlarge, it could take 250 minutes to run):

███████████████████████| 1319/1319 [32:52<00:00, 1.50s/it]
neuronx (pretrained=mistralai/Mistral-7B-v0.1,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version| Filter |n-shot| Metric |Value | |Stderr|
|-----|------:|----------|-----:|-----------|-----:|---|-----:|
|gsm8k| 2|get-answer| 5|exact_match|0.3806|± |0.0134|
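If you want to track scores across runs, the printed results table can be parsed with a few lines of Python. The sketch below is a hypothetical helper, not part of lm-evaluation-harness; it assumes the pipe-delimited layout shown above, with one header row, one separator row, and one row per task and metric.

```python
from typing import Dict, List


def parse_results_table(text: str) -> List[Dict[str, str]]:
    """Turn a pipe-delimited results table into a list of row dictionaries."""
    header = None
    rows: List[Dict[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip progress bars and other non-table output
        if not any(ch.isalnum() for ch in line):
            continue  # separator row made only of dashes, colons, and pipes
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


sample_output = """
|Tasks|Version| Filter |n-shot| Metric |Value | |Stderr|
|-----|------:|----------|-----:|-----------|-----:|---|-----:|
|gsm8k| 2|get-answer| 5|exact_match|0.3806|± |0.0134|
"""

for row in parse_results_table(sample_output):
    print(row["Tasks"], row["Metric"], float(row["Value"]), float(row["Stderr"]))
```

Each row comes back keyed by the column headers, so the value and standard error can be stored or compared against an earlier run.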

Clean up
When you are done, be sure to stop the EC2 instances via the Amazon EC2 console.
Conclusion
The Gradient and Neuron teams are excited to see broader adoption of LLM evaluation with this release. Try it out yourself and run the most popular evaluation framework on AWS Inferentia2 instances. You can now benefit from the on-demand availability of AWS Inferentia2 when you’re using custom LLM development from Gradient. Get started hosting models on AWS Inferentia with these tutorials.

About the Authors
Michael Feil is an AI engineer at Gradient and previously worked as an ML engineer at Rohde & Schwarz and a researcher at Max Planck Institute for Intelligent Systems and Bosch Rexroth. Michael is a leading contributor to various open source inference libraries for LLMs and open source projects such as StarCoder. Michael holds a bachelor’s degree in mechatronics and IT from KIT and a master’s degree in robotics from Technical University of Munich.
Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups like Gradient. Jim is a CISSP, part of the AWS AI/ML Technical Field Community, a Neuron Ambassador, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor’s degree in mathematics from Carnegie Mellon University and a master’s degree in economics from the University of Virginia.
