Cohere Rerank 3.5 is now available in Amazon Bedrock through Rerank API

We are excited to announce the availability of Cohere’s advanced reranking model Rerank 3.5 through our new Rerank API in Amazon Bedrock. This powerful reranking model enables AWS customers to significantly improve their search relevance and content ranking capabilities. The model is also available to Amazon Bedrock Knowledge Bases users. By incorporating Cohere’s Rerank 3.5 in Amazon Bedrock, we’re making enterprise-grade search technology more accessible and empowering organizations to enhance their information retrieval systems with minimal infrastructure management.
In this post, we discuss the need for reranking, the capabilities of Cohere’s Rerank 3.5, and how to get started using it on Amazon Bedrock.
Reranking for advanced retrieval
Reranking is a vital enhancement to Retrieval Augmented Generation (RAG) systems that adds a sophisticated second layer of analysis to improve search result relevance beyond what traditional vector search can achieve. Unlike embedding models that rely on pre-computed static vectors, rerankers perform dynamic query-time analysis of document relevance, enabling more nuanced and contextual matching. This capability allows RAG systems to effectively balance between broad document retrieval and precise context selection, ultimately leading to more accurate and reliable outputs from language models while reducing the likelihood of hallucinations.
Existing search systems significantly benefit from reranking technology by providing more contextually relevant results that directly impact user satisfaction and business outcomes. Unlike traditional keyword matching or basic vector search, reranking performs an intelligent second-pass analysis that considers multiple factors, including semantic meaning, user intent, and business rules to optimize search result ordering. In ecommerce specifically, reranking helps surface the most relevant products by understanding nuanced relationships between search queries and product attributes, while also incorporating crucial business metrics like conversion rates and inventory levels. This advanced relevance optimization leads to improved product discovery, higher conversion rates, and enhanced customer satisfaction across digital commerce platforms, making reranking an essential component for any modern enterprise search infrastructure.
Introducing Cohere Rerank 3.5
Cohere’s Rerank 3.5 is designed to enhance search and RAG systems. This intelligent cross-encoding model takes a query and a list of potentially relevant documents as input, then returns the documents sorted by semantic similarity to the query. Cohere Rerank 3.5 excels in understanding complex information requiring reasoning and is able to understand the meaning behind enterprise data and user questions. Its ability to comprehend and analyze enterprise data and user questions across over 100 languages including Arabic, Chinese, English, French, German, Hindi, Japanese, Korean, Portuguese, Russian, and Spanish, makes it particularly valuable for global organizations in sectors such as finance, healthcare, hospitality, energy, government, and manufacturing.
One of the key advantages of Cohere Rerank 3.5 is its ease of implementation. Through a single Rerank API call in Amazon Bedrock, you can integrate Rerank into existing systems at scale, whether keyword-based or semantic. Reranking strictly improves first-stage retrievals on standard text retrieval benchmarks.
Cohere Rerank 3.5 is state of the art in the financial domain, as illustrated in the following figure.

Cohere Rerank 3.5 is also state of the art in the ecommerce domain, as illustrated in the following figure. Cohere’s ecommerce benchmarks revolve around retrieval on various products, including fashion, electronics, food, and more.

Products were structured as strings in a key-value pair format such as the following:

"Title": "Title"
"Description": "Long-form description"
"Type": <Some categorical data>
etc.

Cohere Rerank 3.5 also excels in hospitality, as shown in the following figure. Hospitality benchmarks revolve around retrieval on hospitality experiences and lodging options.

Documents were structured as strings in a key-value pair format such as the following:

"Listing Title": "Rental unit in Toronto"
"Location": "171 John Street, Toronto, Ontario, Canada"
"Description": "Escape to our serene villa with stunning downtown views…."

We see noticeable gains in project management performance across all types of issue tracking tasks, as illustrated in the following figure.

Cohere’s project management benchmarks span a variety of retrieval tasks, such as:

Search through engineering tickets from various project management and issue tracking software tools
Search through GitHub issues on popular open source repos

Get started with Cohere Rerank 3.5
To start using Cohere Rerank 3.5 with the Rerank API and Amazon Bedrock Knowledge Bases, navigate to the Amazon Bedrock console and choose Model access in the navigation pane. Choose Modify model access, select Cohere Rerank 3.5, choose Next, and then choose Submit.
Get started with the Amazon Bedrock Rerank API
The Cohere Rerank 3.5 model, invoked through the Amazon Bedrock Rerank API, allows you to rerank input documents directly based on their semantic relevance to a user query, without requiring a preconfigured knowledge base. This flexibility makes it a powerful tool for various use cases.
To begin, set up your environment by importing the necessary libraries and initializing Boto3 clients:

import boto3
import json

region = boto3.Session().region_name

bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name=region)

modelId = "cohere.rerank-v3-5:0"
model_package_arn = f"arn:aws:bedrock:{region}::foundation-model/{modelId}"

Next, define a main function that reorders a list of text documents by computing relevance scores based on the user query:

def rerank_text(text_query, text_sources, num_results, model_package_arn):
    response = bedrock_agent_runtime.rerank(
        queries=[
            {
                "type": "TEXT",
                "textQuery": {
                    "text": text_query
                }
            }
        ],
        sources=text_sources,
        rerankingConfiguration={
            "type": "BEDROCK_RERANKING_MODEL",
            "bedrockRerankingConfiguration": {
                "numberOfResults": num_results,
                "modelConfiguration": {
                    "modelArn": model_package_arn,
                }
            }
        }
    )
    return response['results']

For instance, imagine a scenario where you need to identify emails related to returning items from a multilingual dataset. The example below demonstrates this process:

example_query = "What emails have been about returning items?"

documents = [
    "Hola, llevo una hora intentando acceder a mi cuenta y sigue diciendo que mi contraseña es incorrecta. ¿Puede ayudarme, por favor?",
    "Hi, I recently purchased a product from your website but I never received a confirmation email. Can you please look into this for me?",
    "مرحبًا، لدي سؤال حول سياسة إرجاع هذا المنتج. لقد اشتريته قبل بضعة أسابيع وهو معيب",
    "Good morning, I have been trying to reach your customer support team for the past week but I keep getting a busy signal. Can you please help me?",
    "Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken.",
    "Hello, I have been trying to reach your customer support team for the past hour but I keep getting a busy signal. Can you please help me?",
    "Hi, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective.",
    "早上好,关于我最近的订单,我有一个问题。我收到了错误的商品",
    "Hello, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."
]

Now, prepare the list of text sources that will be passed into the rerank_text() function:

text_sources = []
for text in documents:
    text_sources.append({
        "type": "INLINE",
        "inlineDocumentSource": {
            "type": "TEXT",
            "textDocument": {
                "text": text,
            }
        }
    })

You can then invoke rerank_text() by specifying the user query, the text resources, the desired number of top-ranked results, and the model ARN:

response = rerank_text(example_query, text_sources, 3, model_package_arn)
print(response)

The output generated by the Amazon Bedrock Rerank API with Cohere Rerank 3.5 for this query is:

[{'index': 4, 'relevanceScore': 0.1122397780418396},
 {'index': 8, 'relevanceScore': 0.07777658104896545},
 {'index': 2, 'relevanceScore': 0.0770234540104866}]

The relevance scores provided by the API are normalized to a range of [0, 1], with higher scores indicating higher relevance to the query. Here, the fifth document in the list (index 4) is the most relevant. (Translated from German to English: Hello, I have a question about my last order. I received the wrong item and need to return it.)
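To inspect the top-ranked emails, you can map the returned indices back to the original documents:

for result in response:
    print(f"{result['relevanceScore']:.4f}  {documents[result['index']]}")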
You can also get started using Cohere Rerank 3.5 with Amazon Bedrock Knowledge Bases by completing the following steps:

In the Amazon Bedrock console, choose Knowledge bases under Builder tools in the navigation pane.
Choose Create knowledge base.
Provide your knowledge base details, such as name, permissions, and data source.

To configure your data source, specify the location of your data.
Select an embedding model to convert the data into vector embeddings, and have Amazon Bedrock create a vector store in your account to store the vector data.

When you select this option (available only in the Amazon Bedrock console), Amazon Bedrock creates a vector index in Amazon OpenSearch Serverless (by default) in your account, removing the need to manage anything yourself.

Review your settings and create your knowledge base.
In the Amazon Bedrock console, choose your knowledge base and choose Test knowledge base.
Choose the icon for additional configuration options for testing your knowledge base.
Choose your model (for this post, Cohere Rerank 3.5) and choose Apply.

The configuration pane shows a new Reranking section with additional configuration options. The Number of reranked source chunks setting returns the specified number of most relevant chunks.
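If you prefer to configure reranking programmatically, you can attach the same model to a knowledge base query through the Retrieve API. The following is a minimal sketch that reuses the client and model ARN from earlier; the knowledge base ID is a placeholder, and you should verify the exact field names against the current API reference:

kb_response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="YOUR_KB_ID",  # placeholder for your knowledge base ID
    retrievalQuery={"text": "What emails have been about returning items?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 25,  # candidate chunks fetched before reranking
            "rerankingConfiguration": {
                "type": "BEDROCK_RERANKING_MODEL",
                "bedrockRerankingConfiguration": {
                    "modelConfiguration": {"modelArn": model_package_arn},
                    "numberOfRerankedResults": 5,  # reranked chunks to return
                },
            },
        }
    },
)
for item in kb_response["retrievalResults"]:
    print(item["score"], item["content"]["text"])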

Conclusion
In this post, we explored how to use Cohere’s Rerank 3.5 model in Amazon Bedrock, demonstrating its powerful reranking capabilities for enterprise applications and how it enhances search relevance, improves user experience, and optimizes information retrieval workflows. Start improving your search relevance today with Cohere’s Rerank model on Amazon Bedrock.
Cohere Rerank 3.5 in Amazon Bedrock is available in the following AWS Regions: us-west-2 (US West – Oregon), ca-central-1 (Canada – Central), eu-central-1 (Europe – Frankfurt), and ap-northeast-1 (Asia Pacific – Tokyo).
Share your feedback on AWS re:Post for Amazon Bedrock or through your usual AWS Support contacts.
To learn more about Cohere Rerank 3.5’s features and capabilities, visit the Cohere in Amazon Bedrock product page.

About the Authors
Karan Singh is a Generative AI Specialist for third-party models at AWS, where he works with top-tier third-party foundation model (FM) providers to develop and execute joint Go-To-Market strategies, enabling customers to effectively train, deploy, and scale FMs to solve industry-specific challenges. Karan holds a Bachelor of Science in Electrical and Instrumentation Engineering from Manipal University and a Master of Science in Electrical Engineering from Northwestern University, and is currently an MBA candidate at the Haas School of Business at the University of California, Berkeley.
James Yi is a Senior AI/ML Partner Solutions Architect at Amazon Web Services. He spearheads AWS’s strategic partnerships in Emerging Technologies, guiding engineering teams to design and develop cutting-edge joint solutions in generative AI. He enables field and technical teams to seamlessly deploy, operate, secure, and integrate partner solutions on AWS. James collaborates closely with business leaders to define and execute joint Go-To-Market strategies, driving cloud-based business growth. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.

AWS DeepRacer: How to master physical racing?

As developers gear up for re:Invent 2024, they again face the unique challenges of physical racing. What are the obstacles? Let’s have a look.
In this blog post, I will look at what makes physical AWS DeepRacer racing—a real car on a real track—different from racing in the virtual world—a model in a simulated 3D environment. I will cover the basics, the differences between virtual and physical racing, and the steps I have taken to get a deeper understanding of the challenge.
The AWS DeepRacer League is wrapping up. In two days, 32 racers will face off in Las Vegas for one last time. This year, the qualification has been all-virtual, so the transition from virtual to physical racing will be a challenge.
The basics
AWS DeepRacer relies on the racer training a model within the simulator, a 3D environment built around ROS and Gazebo, originally built on AWS RoboMaker.
The trained model is subsequently used for either virtual or physical races. The model comprises a convolutional neural network (CNN) and an action space that translates class labels into steering and speed commands. In the basic scenario involving a single camera, a 160 x 120 pixel, 8-bit grayscale image (similar to the following figure) is captured 15 times per second, passed through the neural network, and the action with the highest weight (probability) is executed.
The small piece of AI magic is that during model evaluation (racing) there’s no context; each image is processed independently of the image before it, and without knowledge of the state of the car itself. If you process the images in reverse order, the results remain the same!
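To make this concrete, the following is a schematic sketch of what happens for every frame (not the actual device code); the three-entry action space is an illustrative assumption, and real action spaces are typically larger:

import numpy as np

action_space = [
    {"steering_angle": 30.0, "speed": 1.0},   # hard left, slow
    {"steering_angle": 0.0, "speed": 3.0},    # straight, fast
    {"steering_angle": -30.0, "speed": 1.0},  # hard right, slow
]

def infer(gray_frame_160x120, cnn):
    probs = cnn(gray_frame_160x120)    # one forward pass, no memory of previous frames
    best = int(np.argmax(probs))       # class with the highest weight (probability)
    return action_space[best]          # translate the class label into steering and speed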
Virtual compared to physical
The virtual worlds are 3D worlds created in Gazebo, and the software is written in Python and C++ using ROS as the framework. As shown in the following image, the 3D simulation is fairly flat, with basic textures and surfaces. There are few or no reflections or shine, and the environment is as visually clean as you make it. Input images are captured 15 times per second.

Within this world, a small car is simulated. Compared to a real car, the model is very basic and lacks quite a few of the things that make a real car work: there is no suspension, the tires are rigid cylinders, there is no Ackermann steering, and there are no differentials. It’s almost surprising that this car can drive at all. On the positive side, the camera is perfect; irrespective of lighting conditions, you get crisp, clear pictures with no motion blur.
A typical virtual car drives at speeds between 0.5 and 4.0 meters per second, depending on the shape of the track. If you go too fast, it will often oversteer and spin out of the turn because of the relatively low grip.
In contrast, the real world is less perfect—simulation-to-real gap #1 is around visual noise created by light, reflections (if the track is printed on reflective material), and background noise (for example, if the barriers around the track are too low and the car sees people and objects in the background). Input images are captured 30 times per second.
The car itself—based on the readily available WLToys A979—has all the things the model car doesn’t: proper tires, suspension, and differential. One problem is that the car is heavy—around 1.5 kg—and the placement of some components causes the center of gravity to be very high. This causes simulation-to-real gap #2: Roll and pitch during corners at high speeds cause the camera to rotate, confusing the neural network as the horizon moves.
Gap #3 comes from motion blur when the light is too dim; the blur can cause the dashed centerline to look like a solid line, making it hard to distinguish the centerline from the solid inner and outer lines, as shown in the following figure.

The steering geometry, the differentials, the lack of engineering precision of the A979, and the corresponding difficulty in calibrating it cause gap #4. Even if the model wants to go straight, the car still pulls left or right, needing constant correction to stay on track. This is most noticeable when the car is unable to drive down the straights in a straight line.
The original AWS DeepRacer, without modifications, has a smaller speed range, topping out at about 2 meters per second. It has better grip but suffers from the previously mentioned roll movements. If you go too fast, it will understeer and potentially roll over. Since 2023, the AWS pit crews have operated their fleets of AWS DeepRacers with shock spacers to stiffen the suspension, reduce the roll, and increase the maximum effective speed.
Four questions
Looking at the sim-to-real gaps there are four questions that we want to explore:

How can we train the model to better handle the real world? This includes altering the simulator to close some of the gaps, combined with adapting reward function, action space, and training methodology to make better use of this simulator.
How can we better evaluate what the car does, and why? In the virtual world, we can perform log analysis to investigate; in the real world this has not yet been possible.
How can we evaluate our newly trained models? A standard AWS DeepRacer track, with its size of 8 meters x 6 meters, is prohibitively large. Is it possible to downscale the track to fit in a home?
Will a modified car perform better? Upgrade my AWS DeepRacer with better shocks? Add ball bearings and shims to improve steering precision? Or build a new lighter car based on a Raspberry Pi?

Solutions
To answer these questions, some solutions are required to support the experiments. The following assumes that you’re using Deepracer-for-Cloud to run the training locally or in an Amazon Elastic Compute Cloud (Amazon EC2) instance. We won’t go into the details but provide references that will enable you to try things out on your own.
Customized simulator
The first thing to look at is how you can alter the simulator. The simulator code is available, and modifying it doesn’t require too many skills. You can alter the car and the physics of the world or adjust the visual environment.
Change the environment
Changing the environments means altering the 3D world. This can be done by altering the features in a pre-existing track by adding or removing track parts (such as lines), changing lighting, adding background features (such as walls or buildings), swapping out textures, and so on. Making changes to the world will require building a new Docker image, which can take quite some time, but there are ways to speed that up. Going a step further, it’s also possible to make the world programmatically (command line or code) alterable during run-time.
The starting point is the track COLLADA (.dae) file found in the meshes folder. You can import it into Blender (shown in the following figure), make your changes, and export the file again. Note that lights and camera positions from Blender aren’t considered by Gazebo. To alter the lighting conditions, you have to alter the .world file in the worlds folder; these files are XML files in sdformat.
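As an illustration, lighting lives in light elements of the sdformat world file; the snippet below sketches the kind of element you would edit, with illustrative values rather than the actual track settings:

<light type="directional" name="sun">
  <pose>0 0 10 0 0 0</pose>
  <diffuse>0.8 0.8 0.8 1</diffuse>        <!-- dim or tint to mimic venue lighting -->
  <specular>0.2 0.2 0.2 1</specular>
  <direction>-0.5 0.1 -0.9</direction>    <!-- angle of the light over the track -->
  <cast_shadows>true</cast_shadows>
</light>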

See Custom Tracks for some examples of tuned tracks.
Car and physics
The competition cars owned by AWS can’t be altered, so the objective of tuning the car in the simulator is to make it behave in ways more similar to the real one. Trained neural networks have an embedded expectation of what will happen next, which means that the simulated car has learned that by taking a specific action, it gets a turn of a given radius. If the simulator car steers more or less than the physical one in a given situation, the outcome becomes unpredictable.
The simulated car lacks Ackermann steering and differentials, but its wheels can deflect up to 30 degrees, whereas the real wheels only reach a bit more than 20 degrees outward and even less inward. My experience is that the real car, surprisingly enough, still has a shorter turning radius than the virtual one.
The car models are found in the urdf folder. There are three different cars, relating to the different versions of physics, which you configure in your action space (model_metadata.json). Today, only the deepracer (v3 and v4 physics) and deepracer_kinematics (v5 physics) models are relevant. There are variant models for single camera and for stereo camera, both with and without the LIDAR.
Each physics version is different; the big question is what impact, if any, each version has on the behavior of the physical car.

Version 3: Steering and throttle are managed through a PID controller, making speed and steering changes smooth (and slow). The simulation environment runs at all times—including during image processing and inference—leading to a higher latency between image capture and action taking effect.
Version 4: Steering and throttle are managed through a PID controller, but the world is put on hold during inference, reducing the latency.
Version 5: Steering and throttle are managed through a position and velocity controller, and the world is put on hold during inference, almost eliminating latency. (This is very unnatural; the car can take alternating 30 degree left and right turns and will go almost straight ahead.)

The PID controller for v3 and v4 can be changed in the racecar control file. By changing the P, I, and D values, you can tune how fast or how slow the car accelerates and steers.
You can also tune the friction. In our simulator, friction is defined for the wheels, not the surfaces that the car drives on. The values (called mu and mu2) are found in racecar.gazebo; increasing them (once per tire!) will allow the car to drive faster without spinning.
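For orientation, such a friction definition follows the usual SDF surface/friction structure; the snippet below is a sketch with illustrative values, and the exact wrapping inside racecar.gazebo may differ:

<surface>
  <friction>
    <ode>
      <mu>2.0</mu>     <!-- primary friction coefficient; higher means more grip -->
      <mu2>2.0</mu2>   <!-- friction in the secondary direction -->
    </ode>
  </friction>
</surface>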
Finally, I implemented an experimental version of the Ackermann steering geometry including differentials. Why? When turning, a car’s wheels follow two circles with the same center point, the inner one having a smaller radius than the outer one. In short, the inner wheels have to steer more (larger curvature), but rotate more slowly (smaller circumference) than the outer wheels.
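The underlying geometry is easy to sketch: for a turn of radius R, the inner wheel steers more and rotates more slowly than the outer wheel. The wheelbase and track values below are assumptions for illustration, not measurements of the car:

import math

wheelbase = 0.165  # distance between front and rear axles, in meters (assumed)
track = 0.162      # distance between left and right wheels, in meters (assumed)

def ackermann_angles(turn_radius):
    inner = math.degrees(math.atan(wheelbase / (turn_radius - track / 2)))
    outer = math.degrees(math.atan(wheelbase / (turn_radius + track / 2)))
    return inner, outer  # the inner wheel steers more than the outer wheel

def wheel_speed_ratio(turn_radius):
    # The inner wheel travels a smaller circle, so it must rotate more slowly.
    return (turn_radius - track / 2) / (turn_radius + track / 2)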
Customized car software
The initial work to create an altered software stack for the original AWS DeepRacer started in 2022. The first experiments included operating the AWS DeepRacer with an R/C controller and capturing the camera images and IMU data to create an in-car video. There was a lot to learn about ROS2, including creating a custom node for publishing IMU sensor data and capturing and creating videos on the fly. During the Berlin Summit in 2022, I also got to give my modified car a spin on the track!
In the context of physical racing, the motivation for customizing the car software is to obtain more information—what does the car do, and why. Watching the following video, you can clearly see the rolling movement in the turns, and the blurring of certain parts of the image discussed earlier.

The work required altering several of the open source AWS DeepRacer packages and included optimizing the pipeline from camera to inference by compressing images and enabling GPU and compute stick acceleration of the inference. This turned into several scripts comprising all the changes to the different nodes, creating an upgraded software package that could be installed on an original AWS DeepRacer car.
The work evolved, and a logging mechanism using ROS Bag allowed us to analyze not only pictures, but also the actions that the car took. Using the deepracer-viz library of Jochem Lugtenburg, a fellow AWS DeepRacer community leader, I added a GradCam overlay on the video feed (shown in the following video), which gives a better understanding of what’s going on.
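For reference, the core of the Grad-CAM technique itself fits in a few lines. The sketch below assumes the trained network is available as a Keras model handle with a known last convolutional layer name, so it illustrates the idea rather than the deepracer-viz implementation:

import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name="last_conv"):  # hypothetical layer name
    # Map the input image to the last conv activations and the action probabilities.
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(conv_layer_name).output, model.output]
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        top_action = tf.argmax(preds[0])      # the action the car would take
        top_score = preds[:, top_action]
    grads = tape.gradient(top_score, conv_out)            # sensitivity of that action
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))       # per-channel importance
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)   # weighted sum of feature maps
    cam = tf.nn.relu(cam) / (tf.reduce_max(cam) + 1e-8)   # normalize to [0, 1]
    return cam.numpy()  # upscale and blend onto the camera frame for the overlay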

The outcome of this has evolved into the community AWS DeepRacer Custom Car repository, which allows anyone to upgrade their AWS DeepRacer with improved software with two commands and without having to compile the modules themselves!
Benefits are:

Performance improvement by using compressed image transport for the main processing pipeline.
Inference using OpenVINO with Intel GPU (original AWS DeepRacer), OpenVINO with Myriad Neural Compute Stick (NCS2), or TensorFlow Lite.
Model Optimizer caching, speeding up switching of models.
Capture in-car camera and inference results to a ROS Bag for logfile analysis.
UI tweaks and fixes.
Support for Raspberry Pi 4, enabling us to create the DeepRacer Pi!

Testing on a custom track
Capturing data is great, but you need a way to test it all—bringing models trained in a customized environment onto a track to see what works and what doesn’t.
The question turned out to be: How hard is it to make a track that has the same design as the official tracks, but that takes up less space than the 8m x 6m of the re:Invent 2018 track? After re:Invent 2023, I started to investigate. The goal was to create a custom track that would fit in my garage with a theoretical maximum size of 5.5m x 4.5m. The track should be printable on vinyl in addition to being available in the Simulator for virtual testing.
After some trial and error, it proved to be quite straightforward, even if it requires multiple steps, starting in a Jupyter Notebook, moving into a vector drawing program (Inkscape), and finalizing in Blender (to create the simulator meshes).
The trapezoid track shown in the following two figures (center line and final sketch) is a good example of how to create a brand new track. The notebook starts with eight points in an array and builds out the track step by step, adding the outer line, center line, and color.
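The following sketch captures the idea of that first notebook step; the control points, track width, and use of shapely are illustrative assumptions rather than the actual notebook code:

import numpy as np
from shapely.geometry import Polygon

control_points = np.array(
    [[0, 0], [4, 0], [5, 1], [5, 2], [4, 3], [1, 3], [0, 2], [0, 1]], dtype=float
)  # eight points sketching a trapezoid-like loop, in meters

track_width = 0.76  # assumed track width in meters
center = Polygon(control_points)
center_line = center.exterior
outer_line = center.buffer(track_width / 2, join_style=2).exterior
inner_line = center.buffer(-track_width / 2, join_style=2).exterior

# These rings can be exported (for example, as SVG for Inkscape) and later
# extruded into meshes in Blender for the simulator version of the track.
print(center_line.length, outer_line.length, inner_line.length)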

In the end I chose to print a narrower version of Trapezoid—Trapezoid Narrow, shown in the following figure—to fit behind my garage, with dimensions of 5.20m x 2.85m including the green borders around the track. I printed it on PVC with a weight of 500 grams per square meter. The comparatively heavy material was a good choice: it prevents folds and wrinkles and generally ensures that the track stays in place even when you walk on it.

Around the track, I added a boundary of mesh PVC mounted on some 20 x 20 centimeter aluminum poles. This was not entirely a success, because the light shone through, and I needed to add a lining of black fleece. The following image shows the completed track before the addition of black fleece.

Experiments and conclusions
re:Invent is just days away. Experiments are still running, and because I need to fight my way through the Wildcard race, this is not the time to include all the details. Let’s just say that things aren’t always as straightforward as expected.
As a preview of what’s going on, I’ll end this post with the latest iteration of the in-car video, showing an AWS DeepRacer Pi doing laps in the garage. Check back after re:Invent for the big reveal!

About the author
Lars Lorentz Ludvigsen is a technology enthusiast who was introduced to AWS DeepRacer in late 2019 and was instantly hooked. Lars works as a Managing Director at Accenture where he helps clients to build the next generation of smart connected products. In addition to his role at Accenture, he’s an AWS Community Builder who focuses on developing and maintaining the AWS DeepRacer community’s software solutions.

Meta AI Releases Llama Guard 3-1B-INT4: A Compact and High-Performance AI Moderation Model for Human-AI Conversations

Generative AI systems transform how humans interact with technology, offering groundbreaking natural language processing and content generation capabilities. However, these systems pose significant risks, particularly in generating unsafe or policy-violating content. Addressing this challenge requires advanced moderation tools that ensure outputs are safe and adhere to ethical guidelines. Such tools must be effective and efficient, particularly for deployment on resource-constrained hardware such as mobile devices.

One persistent challenge in deploying safety moderation models is their size and computational requirements. While powerful and accurate, large language models (LLMs) demand substantial memory and processing power, making them unsuitable for devices with limited hardware capabilities. Deploying these models can lead to runtime bottlenecks or failures for mobile devices with restricted DRAM, severely limiting their usability. To address this, researchers have focused on compressing LLMs without sacrificing performance.

Existing methods for model compression, including pruning and quantization, have been instrumental in reducing model size and improving efficiency. Pruning involves selectively removing less important model parameters, while quantization reduces the precision of the model weights to lower-bit formats. Despite these advancements, many solutions struggle to effectively balance size, computational demands, and safety performance, particularly when deployed on edge devices.

Researchers at Meta introduced Llama Guard 3-1B-INT4, a safety moderation model designed to address these challenges. The model, unveiled during Meta Connect 2024, is just 440MB, making it seven times smaller than its predecessor, Llama Guard 3-1B. This was accomplished through advanced compression techniques such as decoder block pruning, neuron-level pruning, and quantization-aware training. The researchers also employed distillation from a larger Llama Guard 3-8B model to recover lost quality during compression. Notably, the model achieves a throughput of at least 30 tokens per second with a time-to-first-token of less than 2.5 seconds on a standard Android mobile CPU.

Several key methodologies underpin the technical advancements in Llama Guard 3-1B-INT4. Pruning techniques reduced the model’s decoder blocks from 16 to 12 and the MLP hidden dimensions from 8192 to 6400, achieving a parameter count of 1.1 billion, down from 1.5 billion. Quantization further compressed the model by reducing the precision of weights to INT4 and activations to INT8, cutting its size by a factor of four compared to a 16-bit baseline. Also, unembedding layer pruning reduced the output layer size by focusing only on 20 necessary tokens while maintaining compatibility with existing interfaces. These optimizations ensured the model’s usability on mobile devices without compromising its safety standards.
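A rough back-of-envelope check, using only the figures quoted above, shows how pruning and 4-bit weights combine toward the reported 440MB; the remaining gap is presumably explained by the unembedding pruning and packaging details:

params_before = 1.5e9  # Llama Guard 3-1B parameter count before pruning
params_after = 1.1e9   # after decoder-block and MLP-dimension pruning

size_fp16_mb = params_after * 16 / 8 / 1e6  # pruned model at a 16-bit baseline
size_int4_mb = params_after * 4 / 8 / 1e6   # pruned model with INT4 weights

print(f"pruned model at 16 bits: ~{size_fp16_mb:.0f} MB")  # roughly 2200 MB
print(f"pruned model at INT4:    ~{size_int4_mb:.0f} MB")  # roughly 550 MB vs. 440 MB reported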

The performance of Llama Guard 3-1B-INT4 underscores its effectiveness. It achieves an F1 score of 0.904 for English content, outperforming its larger counterpart, Llama Guard 3-1B, which scores 0.899. For multilingual capabilities, the model performs on par with or better than larger models in five out of eight tested non-English languages, including French, Spanish, and German. Compared to GPT-4, tested in a zero-shot setting, Llama Guard 3-1B-INT4 demonstrated superior safety moderation scores in seven languages. Its reduced size and optimized performance make it a practical solution for mobile deployment, and it has been demonstrated running on a Moto Razr phone.

The research highlights several important takeaways, summarized as follows:

Compression Techniques: Advanced pruning and quantization methods can reduce LLM size by over 7× without significant loss in accuracy.

Performance Metrics: Llama Guard 3-1B-INT4 achieves an F1 score of 0.904 for English and comparable scores for multiple languages, surpassing GPT-4 in specific safety moderation tasks.

Deployment Feasibility: The model operates at 30 tokens per second on commodity Android CPUs with a time-to-first-token of less than 2.5 seconds, showcasing its potential for on-device applications.

Safety Standards: The model maintains robust safety moderation capabilities, balancing efficiency with effectiveness across multilingual datasets.

Scalability: The model enables scalable deployment on edge devices by reducing computational demands, broadening its applicability.

In conclusion, Llama Guard 3-1B-INT4 represents a significant advancement in safety moderation for generative AI. It addresses the critical challenges of size, efficiency, and performance, offering a compact model for mobile deployment yet robust enough to ensure high safety standards. Through innovative compression techniques and meticulous fine-tuning, researchers have created a tool that is both scalable and reliable, paving the way for safer AI systems in diverse applications.

Check out the Paper and Codes. All credit for this research goes to the researchers of this project.


Chameleon: An AI System for Efficient Large Language Model Inference Using Adaptive Caching and Multi-Level Scheduling Techniques

Large language models (LLMs) have transformed the landscape of natural language processing, becoming indispensable tools across industries such as healthcare, education, and technology. These models perform complex tasks, including language translation, sentiment analysis, and code generation. However, their exponential growth in scale and adoption has introduced significant computational challenges. Each task often requires fine-tuned versions of these models, leading to high memory and energy demands. Efficiently managing the inference process in environments with concurrent queries for diverse tasks is crucial for sustaining their usability in production systems.

Inference clusters serving LLMs face fundamental issues of workload heterogeneity and memory inefficiencies. Current systems encounter high latency due to frequent adapter loading and scheduling inefficiencies. Adapter-based fine-tuning techniques, such as Low-Rank Adaptation (LoRA), enable models to specialize in tasks by modifying smaller portions of the base model parameters. While LoRA substantially reduces memory requirements, it introduces new challenges. These include increased contention on memory bandwidth during adapter loads and delays from head-of-line blocking when requests of varying complexities are processed sequentially. These inefficiencies limit the scalability and responsiveness of inference clusters under heavy workloads.

Existing solutions attempt to address these challenges but fall short in critical areas. For instance, methods like S-LoRA store base model parameters in GPU memory and load adapters on-demand from host memory. This approach leads to performance penalties due to adapter fetch times, particularly in high-load scenarios where PCIe link bandwidth becomes a bottleneck. Scheduling policies such as FIFO (First-In, First-Out) and SJF (Shortest-Job-First) have been explored to manage the diversity in request sizes, but both approaches fail under extreme load. FIFO often causes head-of-line blocking for smaller requests, while SJF leads to starvation of longer requests, resulting in missed service level objectives (SLOs).

Researchers from the University of Illinois Urbana-Champaign and IBM Research introduced Chameleon, an innovative LLM inference system designed to optimize environments with numerous task-specific adapters. Chameleon combines adaptive caching and a sophisticated scheduling mechanism to mitigate inefficiencies. It employs GPU memory more effectively by caching frequently used adapters, thus reducing the time required for adapter loading. Also, the system uses a multi-level queue scheduling policy that dynamically prioritizes tasks based on resource needs and execution time.

Chameleon leverages idle GPU memory to cache popular adapters, dynamically adjusting cache size based on system load. This adaptive cache eliminates the need for frequent data transfers between CPU and GPU, significantly reducing contention on the PCIe link. The scheduling mechanism categorizes requests into size-based queues and allocates resources proportionally, ensuring no task is starved. This approach accommodates heterogeneity in task sizes and prevents smaller requests from being blocked by larger ones. The scheduler dynamically recalibrates queue priorities and quotas, optimizing performance under varying workloads.
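To make the two mechanisms concrete, the toy sketch below pairs an LRU-style adapter cache with size-based queues served under per-queue quotas. All names and numbers are illustrative; this is not Chameleon’s actual implementation:

from collections import OrderedDict, deque

class AdapterCache:
    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb  # resized as GPU memory pressure changes
        self.used_mb = 0
        self.entries = OrderedDict()    # adapter_id -> size in MB, in LRU order

    def ensure(self, adapter_id, size_mb, load_fn):
        if adapter_id in self.entries:                 # cache hit: no PCIe transfer
            self.entries.move_to_end(adapter_id)
            return
        while self.used_mb + size_mb > self.capacity_mb and self.entries:
            _, evicted_mb = self.entries.popitem(last=False)  # evict least recently used
            self.used_mb -= evicted_mb
        load_fn(adapter_id)                            # fetch adapter from host memory
        self.entries[adapter_id] = size_mb
        self.used_mb += size_mb

def schedule_round(queues, quotas):
    """queues: size-based classes, e.g. {"small": deque, "medium": deque, "large": deque};
    quotas: how many requests each class may admit per round."""
    batch = []
    for name, queue in queues.items():
        for _ in range(quotas[name]):
            if queue:
                batch.append(queue.popleft())
    return batch  # small requests are not blocked by large ones, and none starve

queues = {"small": deque(["q1", "q2"]), "medium": deque(["q3"]), "large": deque(["q4"])}
print(schedule_round(queues, {"small": 4, "medium": 2, "large": 1}))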

The system was evaluated using real-world production workloads and open-source LLMs, including the Llama-7B model. Results show that Chameleon reduces the P99 time-to-first-token (TTFT) latency by 80.7% and P50 TTFT latency by 48.1%, outperforming baseline systems like S-LoRA. Throughput improved by 1.5 times, allowing the system to handle higher request rates without violating SLOs. Notably, Chameleon demonstrated scalability, efficiently handling adapter ranks ranging from 8 to 128 while minimizing the latency impact of larger adapters.

Key Takeaways from the Research:

Performance Gains: Chameleon reduced tail latency (P99 TTFT) by 80.7% and median latency (P50 TTFT) by 48.1%, significantly improving response times under heavy workloads.

Enhanced Throughput: The system achieved 1.5x higher throughput than baseline methods, allowing for more concurrent requests.

Dynamic Resource Management: Adaptive caching effectively utilized idle GPU memory, dynamically resizing the cache based on system demand to minimize adapter reloads.

Innovative Scheduling: The multi-level queue scheduler eliminated head-of-line blocking and ensured fair resource allocation, preventing starvation of larger requests.

Scalability: Chameleon efficiently supported adapter ranks from 8 to 128, demonstrating its suitability for diverse task complexities in multi-adapter settings.

Broader Implications: This research sets a precedent for designing inference systems that balance efficiency and scalability, addressing real-world production challenges in deploying large-scale LLMs.

In conclusion, Chameleon introduces significant advancements for LLM inference in multi-adapter environments. Leveraging adaptive caching and a non-preemptive multi-level queue scheduler optimizes memory utilization and task scheduling. The system efficiently addresses adapter loading and heterogeneous request handling issues, delivering substantial performance improvements.

Check out the Paper. All credit for this research goes to the researchers of this project.


How Perplexity AI is Transforming Search: Recent Innovations, Strategic Partnerships, and Market Advancements in 2024

Founded in 2022, Perplexity AI has quickly emerged as a significant player in artificial intelligence, particularly in AI-driven search technologies. With a strong focus on innovation and user-centric features, the company has introduced groundbreaking advancements while securing notable investments to expand its operations. Recent developments in Perplexity AI’s portfolio highlight its commitment to redefining how users interact with search engines and AI technologies.

One of Perplexity AI’s most notable releases in 2024 is its suite of AI-powered shopping features aimed at streamlining the online shopping experience. Among these features, “Product Cards” stand out for their ability to display detailed product information, including images, pricing, and AI-generated summaries of reviews and features. The “Buy with Pro” feature, available to Pro subscribers in the United States, allows users to purchase products directly through Perplexity’s platform, leveraging stored billing and shipping information for a seamless checkout process. Also, “Snap to Shop” enables users to upload images of products and receive AI-powered details, including purchasing options, making it a strong competitor to Google Lens in terms of visual search capabilities. These innovations mark Perplexity’s entry into the e-commerce sector, positioning it as a competitor to established platforms.

On the financial front, Perplexity AI has demonstrated impressive growth through strategic funding rounds. In January 2024, it raised $73.6 million in Series B funding, led by IVP and supported by NEA, Databricks Ventures, and other notable investors. This round brought the company’s valuation to $520 million, underscoring investor confidence in its technology and vision. By October 2024, Perplexity AI reportedly discussed raising an additional $500 million in funding, potentially pushing its valuation to $8 billion. These financial milestones fuel Perplexity’s research and development and bolster its ability to compete in the crowded AI and search engine markets.

In May 2024, Perplexity AI expanded its functionality by introducing “Perplexity Pages,” a feature that allows users to create and share web pages based on AI-generated content. These customizable pages enable users to compile detailed reports, articles, or guides in a visually engaging format. This tool reflects the company’s focus on providing value-added services beyond traditional search capabilities, further diversifying its offerings.

Perplexity AI introduced “Enterprise Search” for enterprise clients in April 2024. This secure AI search tool allows organizations to integrate internal databases with web-based content, delivering a comprehensive and customized search experience. The tool is particularly useful for companies seeking to enhance productivity by leveraging AI to unify diverse information sources. The company also secured a strategic partnership with Telefónica’s investment arm, Wayra, which invested in Perplexity and facilitated its introduction to Telefónica’s customers in key markets like Brazil, the UK, and Spain. This partnership aims to integrate Perplexity’s advanced AI capabilities into Telefónica’s services, further expanding its market reach.

However, the company’s rapid growth has been challenging. In October 2024, Perplexity faced a lawsuit filed by News Corp entities, including The Wall Street Journal and the New York Post, alleging unauthorized use of copyrighted content for AI-generated responses. The lawsuit demands that Perplexity cease using the articles and destroy any associated data. To address these concerns, the company has initiated revenue-sharing agreements with publishers to resolve content usage disputes while fostering collaborative relationships with content creators. These steps indicate Perplexity’s willingness to adapt and align its practices with industry standards.

Perplexity AI’s user base has grown significantly over the past year, a testament to its innovative approach to search. In July 2024 alone, the platform answered approximately 250 million queries, compared with the 500 million questions answered throughout all of 2023. This rapid increase reflects its rising popularity and solidifies it as a formidable competitor to established search engines like Google. The company’s ability to combine advanced AI technology with user-friendly features has resonated well with its audience, driving engagement and growth.

As Perplexity AI continues its upward trajectory, its recent developments highlight a clear strategy: innovation, user engagement, and strategic partnerships. By integrating cutting-edge shopping tools, expanding enterprise solutions, and proactively addressing legal and ethical challenges, Perplexity sets a strong foundation for its future. The company’s focus on enhancing user experience through AI-driven technologies ensures that it remains at the forefront of the evolving search and AI landscape.

In conclusion, Perplexity AI has made significant strides in redefining search technologies through its recent innovations. From introducing AI-powered shopping features and customizable pages to addressing enterprise needs and navigating legal challenges, the company is charting a path of growth and relevance. With its strong financial backing and commitment to user-centric development, Perplexity AI is well-positioned to continue shaping the future of AI-powered search and beyond.

Sources

https://www.theverge.com/2024/11/18/24299574/perplexity-ai-search-engine-buy-products

AI-powered search engine Perplexity AI, now valued at $520M, raises $73.6M

https://www.wsj.com/tech/ai/ai-startup-perplexity-to-triple-valuation-to-9-billion-in-new-funding-round-f2fb8c2c

https://nypost.com/2024/10/21/business/ny-post-wall-street-journal-sue-jeff-bezos-backed-perplexity-ai/ 

