Accelerate your generative AI distributed training workloads with the NVIDIA NeMo Framework on Amazon EKS

In today’s rapidly evolving landscape of artificial intelligence (AI), training large language models (LLMs) poses significant challenges. These models often require enormous computational resources and sophisticated infrastructure to handle the vast amounts of data and complex algorithms involved. Without a structured framework, the process can become prohibitively time-consuming, costly, and complex. Enterprises struggle with managing distributed training workloads, efficient resource utilization, and model accuracy and performance. This is where the NVIDIA NeMo Framework comes into play. In this post, we present a step-by-step guide to run distributed training workloads on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.
NVIDIA NeMo Framework
NVIDIA NeMo is an end-to-end cloud-centered framework for training and deploying generative AI models with billions and trillions of parameters at scale. The NVIDIA NeMo Framework provides a comprehensive set of tools, scripts, and recipes to support each stage of the LLM journey, from data preparation to training and deployment. It offers a variety of customization techniques and is optimized for at-scale inference of models for both language and image applications, using multi-GPU and multi-node configurations. NVIDIA NeMo simplifies generative AI model development, making it more cost-effective and efficient for enterprises. By providing end-to-end pipelines, advanced parallelism techniques, memory-saving strategies, and distributed checkpointing, NVIDIA NeMo makes sure AI model training is streamlined, scalable, and high-performing.
The following are benefits of using NVIDIA NeMo for distributed training:

End-to-end pipelines for different stages such as data preparation, training, and more, which allows for a plug-and-play approach for your custom data
Parallelism techniques, including the following:

Data parallelism
Tensor parallelism
Pipeline parallelism
Sequence parallelism
Expert parallelism
Context parallelism

Memory saving techniques, including the following:

Selective activation recompute
CPU offloading (activation, weights)
Attention, including Flash Attention (FA 1/2, FA-cuDNN), Grouped Query Attention, Multi-Query Attention, and Sliding Window Attention
Distributed optimizers, including Torch FSDP, Distributed Optimizer (zero-1)

Data loaders for different architectures
Distributed checkpointing

Solution overview
You can deploy and manage NVIDIA NeMo using either Slurm or Kubernetes orchestration platforms. Amazon EKS is a managed Kubernetes service that makes it straightforward to run Kubernetes clusters on AWS. It manages the availability and scalability of the Kubernetes control plane, and it provides compute node auto scaling and lifecycle management support to help you run highly available container applications.
Amazon EKS is an ideal platform for running distributed training workloads due to its robust integrations with AWS services and performance features. It seamlessly integrates with Amazon FSx for Lustre, a high-throughput file system, enabling fast data access and management using persistent volume claims with the FSx CSI driver. Amazon EKS also integrates with Amazon CloudWatch for comprehensive logging and monitoring, providing insights into cluster performance and resource utilization. It supports Amazon Simple Storage Service (Amazon S3) for scalable and durable data storage and management, providing accessibility for large datasets. Enhanced network performance is achieved with Elastic Fabric Adapter (EFA), which offers low-latency, high-throughput connectivity between nodes. These features collectively make Amazon EKS a powerful and efficient choice for optimizing AI and machine learning (ML) training workflows.
The following diagram shows the solution architecture.

In this post, we present the steps to run distributed training workloads on an EKS cluster. The high-level steps are as follows:

Set up an EFA-enabled, 2-node p4de.24xlarge cluster.
Set up an FSx for Lustre file system so you can have a shared data repository for storing the training dataset and model checkpoints.
Set up an environment for NVIDIA NeMo.
Modify the NVIDIA NeMo Kubernetes manifests to prepare a dataset and train a model.

Prerequisites
You need to be able to launch a CPU-based Amazon Elastic Compute Cloud (Amazon EC2) instance that you’ll use to create the EKS cluster. When your instance is up and running, SSH into your EC2 instance and install the following CLIs:

The latest version of the AWS Command Line Interface (AWS CLI)
kubectl
eksctl
helm

These steps may change if you are on a non-Linux platform; consult the documentation for each CLI to install it on other platforms. You also need a capacity reservation for p4de.24xlarge instances and its capacityReservationID.
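For reference, on a Linux x86_64 instance the installation typically looks like the following. This is a minimal sketch based on each tool's publicly documented install steps; download URLs and versions change over time, so check the respective documentation if a step fails.

# AWS CLI v2
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip && sudo ./aws/install

# kubectl (latest stable release)
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -m 0755 kubectl /usr/local/bin/kubectl

# eksctl
curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

# Helm
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash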
Launch an EKS cluster
Amazon EC2 p4de.24xlarge instances are equipped with NVIDIA A100 80 GB GPUs, which are highly popular for distributed training of generative AI workloads. For more information, refer to Amazon EC2 Instance Types. In this section, we show how to create an EKS cluster with an On-Demand Capacity Reservation for p4de.24xlarge instances.

We provide the cluster creation config in p4de-cluster-config.yaml. See the following code:

git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/3.test_cases/2.nemo-launcher/EKS

eksctl create cluster -f p4de-cluster-config.yaml

The following are key points to note when creating this cluster:

Make sure the kubectl version and the specified Region are correct.
Update the capacityReservationID field and make sure to specify the availabilityZones within the managedNodeGroups section, using the same Availability Zone ID in which your capacity lives. See the lookup command after this list if you need to confirm these values.
This configuration will create two managed node groups: one for the system nodes using c5.2xlarge instances and another for running distributed training on p4de.24xlarge instances. Managed node groups will use Amazon EKS optimized AMIs. If you want to provide a custom AMI, you can create a self-managed node group and specify a custom AMI. To find the AMI ID, refer to Retrieving Amazon EKS optimized Amazon Linux AMI IDs. For more details about the Amazon EKS optimized AMI, see the GitHub repo.
Make sure efaEnabled is set to true. You can use the same config for creating a cluster with other node groups. For a list of EFA supported instance types, see Supported instance types.
Another popular instance for generative AI distributed training workloads is the p5.48xlarge instance with the NVIDIA H100 80 GB GPU. To add a P5 node group to an existing EKS cluster, refer to AWS CLI scripts for EKS management.
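If you need to confirm your reservation ID and its Availability Zone ID, the following command is a quick way to look them up (a sketch that assumes the AWS CLI is configured for the Region that holds the reservation):

aws ec2 describe-capacity-reservations --query "CapacityReservations[?InstanceType=='p4de.24xlarge'].[CapacityReservationId,AvailabilityZone,AvailabilityZoneId,TotalInstanceCount,State]" --output table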

After the cluster is created, you can enable kubectl to communicate with your cluster by adding a new context to the kubectl config file:

aws eks update-kubeconfig --region region-code --name my-cluster

You can confirm communication with your cluster by running the following command:

kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.100.0.1   <none>        443/TCP   28h

Next, you can install the AWS EFA Kubernetes Device Plugin. EFA is a network interface for EC2 instances that enhances the performance of inter-node communications, which is critical for distributed training workloads that involve GPUs. This plugin allows Kubernetes to recognize and utilize the EFA device, facilitating high-throughput, low-latency networking necessary for efficient distributed training and deep learning applications.

Install the plugin with the following code:

helm repo add eks https://aws.github.io/eks-charts

helm install efa eks/aws-efa-k8s-device-plugin -n kube-system

The NVIDIA device plugin for Kubernetes enables GPU support within your EKS cluster by exposing the GPUs to the Kubernetes API server through the kubelet. It advertises the available GPU resources, allowing Kubernetes to schedule and manage GPU-accelerated workloads.

Install the plugin with the following code:

wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml

kubectl apply -f nvidia-device-plugin.yml

Run the following command to verify all the pods:

kubectl get pods --all-namespaces

You can run kubectl get nodes to verify the nodes.

Alternatively, you can use the EKS node viewer tool to view nodes, their costs, and their status in your cluster. After it’s installed, enter eks-node-viewer to get the following view.

The node viewer displays the IP addresses of our two p4de.24xlarge compute nodes.

We can choose one of these private IP DNS names to further examine and describe the node as follows:

kubectl describe node ip-192-168-165-37.us-west-2.compute.internal

The preceding command prints extensive detail about the node. To make sure EFA is installed correctly, confirm that you see details like those shown in the following screenshot.

For p4 nodes, you will see vpc.amazonaws.com/efa:4 and for p5.48xlarge nodes, you should see vpc.amazonaws.com/efa:32.
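As an additional check, you can query the allocatable GPU and EFA resources directly; the following one-liner assumes both device plugins are already running:

kubectl get nodes "-o=custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,EFA:.status.allocatable.vpc\.amazonaws\.com/efa"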
If EFA is enabled in the node group, make sure that the security group attached to the nodes allows all traffic originating from the same security group, both inbound and outbound. This is required for EFA to work. For instructions, see Get started with EFA and MPI. This security group is intended for testing purposes only. For your production environments, we recommend that you create an inbound SSH rule that allows traffic only from the IP address from which you are connecting, such as the IP address of your computer, or a range of IP addresses in your local network.
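If you need to add these rules yourself, the following sketch shows the idea; the security group ID is illustrative and should be replaced with the group attached to your p4de nodes:

# Security group attached to the EFA-enabled nodes (illustrative ID)
export NODE_SG_ID=sg-0123456789abcdef0

# Allow all traffic between members of the same security group (required for EFA)
aws ec2 authorize-security-group-ingress --group-id ${NODE_SG_ID} --protocol -1 --source-group ${NODE_SG_ID}
aws ec2 authorize-security-group-egress --group-id ${NODE_SG_ID} --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=${NODE_SG_ID}}]"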
Create an FSx for Lustre file system
For distributed training applications, typically hundreds of GPU instances are used, with each node containing multiple GPUs. It is crucial that all nodes can access a shared file system to train on the same dataset efficiently. For this purpose, a high-performance file system with high throughput and low latency is essential. We recommend using the FSx for Lustre file system for large-scale distributed training, because it meets these requirements and provides seamless data access for all nodes involved in the training process.
To mount an FSx for Lustre file system on your EKS cluster, complete the following steps:

Use the following scripts to create an AWS Identity and Access Management (IAM) role and attach the FSx policy:

export FSX_POLICY_NAME=fsx-csi

wget https://raw.githubusercontent.com/aws-samples/aws-do-eks/main/Container-Root/eks/deployment/csi/fsx/fsx-policy.json
export FSX_POLICY_DOC=file://fsx-policy.json

# From EC2 Auto Scaling Group
export EKS_INSTANCE_PROFILE_NAME=eks-1ec6fc6b-1a19-d65d-66ac-293ff0a20eb9

POLICY_ARN=$(aws iam create-policy --policy-name ${FSX_POLICY_NAME} --policy-document $FSX_POLICY_DOC --query "Policy.Arn" --output text)

INSTANCE_PROFILE=$(aws iam list-instance-profiles --query "InstanceProfiles[?InstanceProfileName=='${EKS_INSTANCE_PROFILE_NAME}'].{InstanceProfileName:InstanceProfileName}" --output text)

ROLE_NAME=$(aws iam get-instance-profile --instance-profile-name ${INSTANCE_PROFILE} --query "InstanceProfile.Roles[0].RoleName" --output text)

# Attach the FSx policy to role ${ROLE_NAME}
aws iam attach-role-policy --policy-arn ${POLICY_ARN} --role-name ${ROLE_NAME}

Use the following script to create a security group that allows EKS nodes to access the file system:

# From EC2 console
export MY_REGION=us-west-2
# FSX_SUBNET_ID should be the same subnet ID that the compute nodes are in. You can get this from the EKS console
export FSX_SUBNET_ID=subnet-0edecd850cff2cfad
# From EC2 Auto Scaling Group
export FSX_SECURITY_GROUP_NAME=eks-fsx-sg

# Get VPC_ID from EKS console
export VPC_ID=vpc-04411d49af198a6ea

# Create security group
export SECURITY_GROUP_ID=$(aws ec2 create-security-group --vpc-id ${VPC_ID} --region ${MY_REGION} --group-name ${FSX_SECURITY_GROUP_NAME} --description "FSx for Lustre Security Group" --query "GroupId" --output text)

export SUBNET_CIDR=$(aws ec2 describe-subnets --region ${MY_REGION} --query "Subnets[?SubnetId=='${FSX_SUBNET_ID}'].{CIDR:CidrBlock}" --output text)

# Ingress rule
aws ec2 authorize-security-group-ingress --region ${MY_REGION} --group-id ${SECURITY_GROUP_ID} --protocol tcp --port 988 --cidr ${SUBNET_CIDR}

Create a 1.2 TB Persistent_2 FSx for Lustre file system from the FSx for Lustre console in the same Availability Zone as your compute instances (FSX_SUBNET_ID), in the VPC of your EKS cluster (VPC_ID), and with the security group you created (SECURITY_GROUP_ID).
After the file system is created, note the file system ID, DNS name, and mount name from the file system details page.
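You can also retrieve these values with the AWS CLI; the file system ID below is illustrative:

aws fsx describe-file-systems --file-system-ids fs-0123456789abcdef0 --query "FileSystems[0].{FileSystemId:FileSystemId,DNSName:DNSName,MountName:LustreConfiguration.MountName}" --output table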

Before mounting the file system, you need to install the FSx CSI driver that allows EKS clusters to manage the lifecycle of FSx for Lustre file systems.

Install the FSx CSI driver as follows:

echo "Installing FSx CSI driver ..."
kubectl apply -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"

echo "FSx pods in kube-system namespace ..."
kubectl -n kube-system get pods | grep fsx

Next, to mount the file system, apply the provided fsx-storage-class.yaml, fsx-pv.yaml, and fsx-pvc.yaml manifests:

# Storage Class
kubectl apply -f fsx-storage-class.yaml
kubectl get sc

# Persistent Volume
kubectl apply -f fsx-pv.yaml

# Persistent Volume Claim
kubectl apply -f fsx-pvc.yaml

You can check to make sure that the volumes are in Bound state.
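A quick way to verify (the fsx-pv and fsx-pvc names come from the manifests you just applied):

kubectl get pv
kubectl get pvc
# Both should report STATUS as Bound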

Set up the environment for NVIDIA NeMo
For this post, we use the NVIDIA device plugin for Kubernetes, but if you need to install the GPU Operator, you can do so as follows:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

To enable distributed training, we use the KubeFlow Training Operator, which is essential for managing and scheduling ML training jobs in a Kubernetes environment. This operator simplifies the process of running distributed training jobs by automating the deployment and scaling of the necessary components. See the following code:

# Deploy Kubeflow training operator

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

# From https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/kubeflow/training-operator/deploy.sh

# Configure RBAC resources

kubectl apply -f ./clusterrole-hpa-access.yaml

kubectl apply -f ./clusterrolebinding-training-operator-hpa-access.yaml

Additionally, we use the KubeFlow MPI Operator for preprocessing training data in parallel. The MPI Operator facilitates running Message Passing Interface (MPI) jobs, which are crucial for parallelizing the preprocessing tasks across multiple nodes, thereby speeding up the training process. See the following code:

kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml

# From https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/kubeflow/mpi-operator/clusterrole-mpi-operator.yaml
# Add lease permissions for the mpi-operator cluster role
kubectl apply -f ./clusterrole-mpi-operator.yaml

The NVIDIA NeMo Framework is available publicly in the image nvcr.io/nvidia/nemo:24.01.framework. We provide an AWS optimized Dockerfile for use with P4 and P5 instances. We recommend the following library versions for optimal performance:

ENV EFA_INSTALLER_VERSION=1.30.0
ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
ENV NCCL_VERSION=2.19.4-1

You can build and push the image to Amazon Elastic Container Registry (Amazon ECR) as follows:

## AWS
export AWS_REGION=us-west-2
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

## Docker Image
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=nemo-aws
export TAG=":24.01.framework"

docker build -t ${REGISTRY}${IMAGE}${TAG} -f 0.Dockerfile .

echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

# Create the repository if it does not exist
REGISTRY_COUNT=$(aws ecr describe-repositories | grep ${IMAGE} | wc -l)
if [ "$REGISTRY_COUNT" == "0" ]; then
  echo ""
  echo "Creating repository ${IMAGE} ..."
  aws ecr create-repository --repository-name ${IMAGE}
fi

# Push image
docker image push ${REGISTRY}${IMAGE}${TAG}

The NVIDIA NeMo Framework requires users to provide config files with job and model information. You can copy the launcher scripts from the container as follows:

# Run the container
docker run -it ${REGISTRY}${IMAGE}${TAG} bash

# Copy the launcher scripts out of the running container
docker cp -a <container-id>:/opt/NeMo-Megatron-Launcher/ <Path-to-save-launcher-scripts>

In a Slurm cluster implementation, the launcher scripts, data, and results folders could reside in a file system that both the head node (the node from which jobs are submitted) and the compute nodes can access. In this Amazon EKS implementation, however, the node that you used to create the EKS cluster doesn’t have access to the FSx for Lustre file system mounted on the cluster. To work around this, you can keep the launcher scripts on the head node and place the results and data folders in the file system that the compute nodes can access.
Run NVIDIA NeMo on an EKS cluster
We’re now ready to set up NVIDIA NeMo Kubernetes manifests for data preparation and model training. For more information about running it on premises, see Running NeMo Framework on Kubernetes. There are some modifications to be done for it to run on Amazon EKS, as shown in the following steps. We provide the launcher scripts in the GitHub repo.

Modify the launcher_scripts/conf/cluster/k8s.yaml file as follows. The subPath field is the path where FSx for Lustre is mounted, which is /fsx-shared in this case.

shm_size: 512Gi # Amount of system memory to allocate in Pods. Should end in "Gi" for gigabytes.
volumes:
  persistentVolumeClaim:
    # This claim should be created before running
    claimName: fsx-pvc
    subPath: fsx-shared # path is mirrored into pod (no leading slash b/c relative to root)

# NOTE: These args will soon be deprecated
nfs_server: null # Hostname or IP address for the NFS server where data is stored.
nfs_path: null # Path to store data in the NFS server.
ib_resource_name: null # Specify the resource name for IB devices according to kubernetes, such as "nvidia.com/hostdev" for Mellanox IB adapters. Can also be a list, but must be same length as ib_count
ib_count: null # Specify the number of IB devices to include per node in each pod. Can also be a list, but must be same length as ib_resource_name
ib_network_annotation: "" # Specify the networks as comma separated values
dns_policy: null # Specify a dnsPolicy to use in all pods, if necessary

Install the required Python packages; this is required so that NeMo Launcher can submit jobs to the Kubernetes cluster:

sudo apt install python3-pip

pip install -r <Path-to-NeMo-Megatron-Launcher>/requirements.txt

Next, we copy the following folders from the container to the /fsx-shared/data folder:

NeMo-Megatron-Launcher/launcher_scripts/data/bpe
NeMo-Megatron-Launcher/launcher_scripts/data/nsfw

To copy files to and from the FSx for Lustre volume, you can start a pod just for this purpose. Create a file fsx-share-test.yaml as follows:

apiVersion: v1
kind: Pod
metadata:
  name: fsx-share-test
spec:
  containers:
  - name: fsx-share-test
    image: ubuntu
    command: ["/bin/bash"]
    args: ["-c", "while true; do echo \"hello from FSx\" - $(date -u) >> /fsx-shared/test.txt; sleep 120; done"]
    volumeMounts:
    - name: fsx-pv
      mountPath: /fsx-shared
  volumes:
  - name: fsx-pv
    persistentVolumeClaim:
      claimName: fsx-pvc

Run this pod and copy the files:

kubectl apply -f fsx-share-test.yaml

kubectl cp <Path-to-NeMo-Megatron-Launcher>/launcher_scripts/data/bpe fsx-share-test:/fsx-shared/data/

kubectl cp <Path-to-NeMo-Megatron-Launcher>/launcher_scripts/data/nsfw fsx-share-test:/fsx-shared/data/

A few files need to be updated for data preparation to work with the EKS cluster.

Modify the launcher_scripts/conf/config.yaml file:

For cluster, use k8s.
For training, use gpt3/126m.
For stages, this should be just data_preparation and no other stages.
For launcher_scripts_path, use the path to the NeMo Megatron launch scripts, which should end with /launcher_scripts.
For data_dir, use /fsx-shared/data (the location to store and read the data).
For base_results_dir, use /fsx-shared/results (the location to store the results, checkpoints, and logs).
For container, use ${REGISTRY}${IMAGE}${TAG} (the image you pushed to Amazon ECR).

Modify the conf/data_preparation/gpt3/download_gpt3_pile.yaml file:

Set node_array_size to 2.
Set file_numbers to "0-5". These files add up to around 350 GB of data.

Modify the nemo_launcher/core/k8s_templates/data_preparation/data-prep.yaml file:

If you get the error that mpirun is not found, add the full path to the executable /opt/amazon/openmpi/bin/mpirun.
Add /fsx-shared in the container volume mount path.
Add the volume:

volumes:
- name: fsx-pv
  persistentVolumeClaim:
    claimName: fsx-pvc

Launch the data preparation job: python3 main.py

This script creates a Helm chart for the selected stage (in this case, data_preparation) and runs the Helm chart automatically. Refer to Run NeMo Framework on Kubernetes for an explanation of the data preparation process. Make sure python3 is installed.

You can monitor your job status and logs using three commands: helm list, kubectl get pods, and kubectl logs --follow.
When the job is finished, you can remove the Helm chart: helm uninstall download-gpt3-pile

You can see the downloaded data in the /fsx-shared folder by opening a shell in one of the pods with kubectl exec -it nlp-worker-0 -- bash.

Training
Now that our data preparation is complete, we’re ready to train our model with the created dataset. Complete the following steps:

Modify a parameter in the conf/config.yaml file:

Set stages to training and no other stages.

Modify parameters in conf/training/gpt3/126m.yaml:

Set num_nodes to 2.
Set devices to 1.
On line 18, change use_distributed_sampler: False to replace_sampler_ddp: False.

Optionally, if you want to use a mock dataset instead of the real dataset for testing purposes, you can modify the data section as follows. You are essentially changing data_impl: mmap to data_impl: mock and assigning an empty list to data_prefix.

data:
  data_impl: mock
  splits_string: "99990,8,2"
  seq_length: 2048
  skip_warmup: True
  num_workers: 2
  dataloader_type: single # cyclic
  reset_position_ids: False # Reset position ids after end-of-document token
  reset_attention_mask: False # Reset attention mask after end-of-document token
  eod_mask_loss: False # Mask loss for the end of document tokens
  index_mapping_dir: null
  data_prefix: [] # Should be weight path weight path... for a blended dataset

  # You can just comment out the default "data_prefix" values like below.
  # - ${data_dir}/my-gpt3_00_text_document
  # - .0333

Modify the nemo_launcher/core/k8s_templates/training/training.yaml file in the same way as the data preparation template (add /fsx-shared to the container volume mount path and add the fsx-pv volume backed by the fsx-pvc claim).
Run python3 main.py to start training. You should see the training pods by running kubectl get pods as follows:

NAME                    READY   STATUS    RESTARTS   AGE
nlp-training-worker-0   1/1     Running   0          168m
nlp-training-worker-1   1/1     Running   0          168m

In addition to monitoring your job using helm list, kubectl get pods, and kubectl logs --follow, you can also open a shell in a training pod with kubectl exec and use nvidia-smi to check GPU status.
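For example, to check GPU utilization on the first training pod (pod name taken from the kubectl get pods output above):

kubectl exec -it nlp-training-worker-0 -- nvidia-smi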

When the job is finished, you can delete the Helm chart: helm uninstall gpt3-126m

Model checkpoints are saved at /fsx-shared/results/checkpoints along with other training logs and TensorBoard events. By default, checkpoints are saved every 2,000 steps. You can modify the conf/training/gpt3/126m.yaml file to make changes in the training setup.
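You can also confirm that checkpoints are being written by listing the results directory from one of the training pods; the path matches the base_results_dir configured earlier:

kubectl exec -it nlp-training-worker-0 -- ls -lh /fsx-shared/results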
Troubleshooting deployment failures
If deployment fails due to incorrect setup or configuration, complete the following debug steps:

Find the error message by running kubectl logs --follow PODNAME and kubectl describe pod PODNAME.
Stop any running jobs by removing the Helm chart. This can be done by running helm uninstall CHARTNAME.

Pods should be spun down after removing the Helm chart.

You can double-check by running kubectl get pods.
If pods are not spun down, you can manually stop them by running kubectl delete pod PODNAME.

Based on the error message, you may find errors from:

Unready nodes.
Missing Operators or CRDs. In this case, make sure your kubectl get pods -A output looks like that shown earlier. If errors exist, try reinstalling Operators and CRDs.
NeMo Framework scripts or Kubernetes manifests. This is more likely a bug or wrong setup on the NeMo side. Errors can vary.

Clean up
It’s important to spin down resources after model training in order to avoid costs associated with running idle instances. To clean up our setup, we must delete the FSx for Lustre file system before deleting the cluster because it’s associated with a subnet in the cluster’s VPC.

To delete the file system integration with the EKS cluster, run the following command:

kubectl delete -f ./fsx-storage-class.yaml

Not only will this delete the persistent volume, it will also delete the FSx for Lustre file system, and all the data on the file system will be lost.

After the file system is deleted, delete the cluster by using the following command:

eksctl delete cluster -f p4de-cluster-config.yaml

This will delete all the existing pods, remove the cluster, and delete the VPC you created in the beginning.
Conclusion
In this post, we demonstrated how to train generative AI models at scale using the NeMo Framework within an EKS cluster. We covered the challenges of training LLMs and how NeMo’s comprehensive tools and optimizations address these challenges, making the process more efficient and cost-effective. With NeMo, you can manage and scale distributed training workloads effectively. This post works with P4de instances. Another popular instance for generative AI distributed training workloads is the p5.48xlarge instance with the NVIDIA H100 80 GB GPU. To add a P5 node group to an existing EKS cluster, refer to AWS CLI scripts for EKS management.
To help you get started, we have published a GitHub repository that provides step-by-step instructions for creating an EKS cluster with P4de instances, mounting an FSx for Lustre file system, and running distributed training workloads with NeMo. This guide empowers you to harness the full potential of NeMo and Amazon EKS for your AI model training needs.

About the authors
Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, and probabilistic design optimization. He completed his doctoral studies in mechanical engineering at Rice University and postdoctoral research at the Massachusetts Institute of Technology.
Akshit Arora is a senior data scientist at NVIDIA, where he works on deploying conversational AI models on GPUs at scale. He’s a graduate of University of Colorado at Boulder, where he applied deep learning to improve knowledge tracking on a K-12 online tutoring platform. His work spans multilingual text-to-speech, time series classification, ed-tech, and practical applications of deep learning.
Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA empowering Amazon’s AI MLOps, DevOps, Scientists and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing Generative AI Foundation models spanning from data curation, GPU training, model inference and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.
Wenhan Tan is a Solutions Architect at NVIDIA, helping customers adopt NVIDIA AI solutions at large scale. His work focuses on accelerating deep learning applications and addressing inference and training challenges.

Governing the ML lifecycle at scale, Part 2: Multi-account foundations

Your multi-account strategy is the core of your foundational environment on AWS. Design decisions around your multi-account environment are critical for operating securely at scale. Grouping your workloads strategically into multiple AWS accounts enables you to apply different controls across workloads, track cost and usage, reduce the impact of account limits, and mitigate the complexity of managing multiple virtual private clouds (VPCs) and identities by allowing different teams to access different accounts that are tailored to their purpose.
In Part 1 of this series, Governing the ML lifecycle at scale, Part 1: A framework for architecting ML workloads using Amazon SageMaker, you learned about best practices for operating and governing machine learning (ML) and analytics workloads at scale on AWS. In this post, we provide guidance for implementing a multi-account foundation architecture that can help you organize, build, and govern the following modules: data lake foundations, ML platform services, ML use case development, ML operations, centralized feature stores, logging and observability, and cost and reporting.
We cover the following key areas of the multi-account strategy for governing the ML lifecycle at scale:

 Implementing the recommended account and organizational unit structure to provide isolation of AWS resources (compute, network, data) and cost visibility for ML and analytics teams
Using AWS Control Tower to implement a baseline landing zone to support scaling and governing data and ML workloads
Securing your data and ML workloads across your multi-account environment at scale using the AWS Security Reference Architecture
Using AWS Service Catalog to scale, share, and reuse ML resources across your multi-account environment and to implement baseline configurations for networking
Creating a network architecture to support your multi-account environment and facilitate network isolation and communication across your multi-tenant environment

Your multi-account foundation is the first step towards creating an environment that enables innovation and governance for data and ML workloads on AWS. By integrating automated controls and configurations into your account deployments, your teams will be able to move quickly and access the resources they need, knowing that they are secure and comply with your organization’s best practices and governance policies. In addition, this foundational environment will enable your cloud operations team to centrally manage and distribute shared resources such as networking components, AWS Identity and Access Management (IAM) roles, Amazon SageMaker project templates, and more.
In the following sections, we present the multi-account foundation reference architectures, discuss the motivation behind the architectural decisions made, and provide guidance for implementing these architectures in your own environment.
Organizational units and account design
You can use AWS Organizations to centrally manage accounts across your AWS environment. When you create an organization, you can create hierarchical groupings of accounts within organizational units (OUs). Each OU is typically designed to hold a set of accounts that have common operational needs or require a similar set of controls.
The recommended OU structure and account structure you should consider for your data and ML foundational environment is based on the AWS whitepaper Organizing Your AWS Environment Using Multiple Accounts. The following diagram illustrates the solution architecture.
Only those OUs that are relevant to the ML and data platform have been shown. You can also add other OUs along with the recommended ones. The next sections discuss how these recommended OUs serve your ML and data workloads and the specific accounts you should consider creating within these OUs.
The following images illustrate, respectively, the architecture of the account structure for setting up a multi-account foundation and how it looks in AWS Organizations once implemented.

Recommended OUs
The recommended OUs include Security, Infrastructure, Workloads, Deployments, and Sandbox. If you deploy AWS Control Tower, which is strongly recommended, it creates two default OUs: Security and Sandbox. You should use these default OUs and create the other three. For instructions, refer to Create a new OU.
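If you prefer the AWS CLI over the console for the three additional OUs, a minimal sketch looks like the following; OUs created this way still need to be registered with AWS Control Tower afterward:

# Look up the organization root, then create the additional OUs
ROOT_ID=$(aws organizations list-roots --query "Roots[0].Id" --output text)
for OU in Infrastructure Workloads Deployments; do
  aws organizations create-organizational-unit --parent-id ${ROOT_ID} --name ${OU}
done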
Security OU
The Security OU stores the various accounts related to securing your AWS environment. This OU and the accounts therein are typically owned by your security team.
You should consider the following initial accounts for this OU:

 Security Tooling account – This account houses general security tools as well as those security tools related to your data and ML workloads. For instance, you can use Amazon Macie within this account to help protect your data across all of your organization’s member accounts.
Log Archive account – If you deploy AWS Control Tower, this account is created by default and placed within your Security OU. This account is designed to centrally ingest and archive logs across your organization.

Infrastructure OU
Similar to other types of workloads that you can run on AWS, your data and ML workloads require infrastructure to operate correctly. The Infrastructure OU houses the accounts that maintain and distribute shared infrastructure services across your AWS environment. The accounts within this OU will be owned by the infrastructure, networking, or Cloud Center of Excellence (CCOE) teams.
The following are the initial accounts to consider for this OU:

Network account – To facilitate a scalable network architecture for data and ML workloads, it’s recommended to create a transit gateway within this account and share this transit gateway across your organization. This will allow for a hub and spoke network architecture that privately connects your VPCs in your multi-account environment and facilitates communication with on-premises resources if needed.
Shared Services account – This account hosts enterprise-level shared services such as AWS Managed Microsoft AD and AWS Service Catalog that you can use to facilitate the distribution of these shared services.

Workloads OU
The Workloads OU is intended to house the accounts that different teams within your platform use to create ML and data applications. In the case of an ML and data platform, you’ll use the following accounts:

ML team dev/test/prod accounts – Each ML team may have their own set of three accounts for the development, testing, and production stages of the MLOps lifecycle.
(Optional) ML central deployments – It’s also possible to have ML model deployments fully managed by an MLOps central team or ML CCOE. This team can handle the deployments for the entire organization or just for certain teams; either way, they get their own account for deployments.
Data lake account – This account is managed by data engineering or platform teams. There can be several data lake accounts organized by business domains. This is hosted in the Workloads OU.
Data governance account – This account is managed by data engineering or platform teams. This acts as the central governance layer for data access. This is hosted in the Workloads OU.

Deployments OU
The Deployments OU contains resources and workloads that support how you build, validate, promote, and release changes to your workloads. In the case of ML and data applications, this will be the OU where the accounts that host the pipelines and deployment mechanisms for your products will reside. These will include accounts like the following:

DevOps account – This hosts the pipelines to deploy extract, transform, and load (ETL) jobs and other applications for your enterprise cloud platform
ML shared services account – This is the main account for your platform ML engineers and the place where the portfolio of products related to model development and deployment are housed and maintained

If the same team managing the ML engineering resources is the one taking care of pipelines and deployments, then these two accounts may be combined into one. However, one team should be responsible for the resources in one account; the moment you have different independent teams taking care of these processes, the accounts should be different. This makes sure that a single team is accountable for the resources in its account, making it possible to have the right levels of billing, security, and compliance for each team.
Sandbox OU
The Sandbox OU typically contains accounts that map to an individual or teams within your organization and are used for proofs of concept. In the case of our ML platform, this can be cases of the platform and data scientist teams wanting to create proofs of concept with ML or data services. We recommend using synthetic data for proofs of concept and avoid using production data in Sandbox environments.
AWS Control Tower
AWS Control Tower enables you to quickly get started with the best practices for your ML platform. When you deploy AWS Control Tower, your multi-account AWS environment is initialized according to prescriptive best practices. AWS Control Tower configures and orchestrates additional AWS services, including Organizations, AWS Service Catalog, and AWS IAM Identity Center. AWS Control Tower helps you create a baseline landing zone, which is a well-architected multi-account environment based on security and compliance best practices. As a first step towards initializing your multi-account foundation, you should set up AWS Control Tower.
In the case of our ML platform, AWS Control Tower helps us with four basic tasks and configurations:

Organization structure – From the accounts and OUs that we discussed in the previous section, AWS Control Tower provides you with the Security and Sandbox OUs and the Security Tooling and Log Archive accounts.
Account vending – This enables you to effortlessly create new accounts that comply with your organization’s best practices at scale. It allows you to provide your own bootstrapping templates with AWS Service Catalog (as we discuss in the next sections).
Access management – AWS Control Tower integrates with IAM Identity Center, providing initial permissions sets and groups for the basic actions in your landing zone.
Controls – AWS Control Tower implements preventive, detective, and proactive controls that help you govern your resources and monitor compliance across groups of AWS accounts.

Access and identity with IAM Identity Center
After you establish your landing zone with AWS Control Tower and create the necessary additional accounts and OUs, the next step is to grant access to various users of your ML and data platform. It's recommended to proactively determine which users will require access to specific accounts and to outline the reasons behind these decisions. Within IAM Identity Center, the concepts of groups, roles, and permission sets allow you to create fine-grained access for different personas within the platform.
Users can be organized into two primary groups: platform-wide and team-specific user groups. Platform-wide user groups encompass central teams such as ML engineering and landing zone security, and they are allocated access to the platform’s foundational accounts. Team-specific groups operate at the team level, denoted by roles such as team admins and data scientists. These groups are dynamic, and are established for new teams and subsequently assigned to their respective accounts upon provisioning.
The following table presents some example platform-wide groups.

User Group | Description | Permission Set | Accounts
AWSControlTowerAdmins | Responsible for managing AWS Control Tower in the landing zone | AWSControlTowerAdmins and AWSSecurityAuditors | Management account
AWSNetworkAdmins | Manages the networking resources of the landing zone | NetworkAdministrator | Network account
AWSMLEngineers | Responsible for managing the ML central resources | PowerUserAccess | ML shared services account
AWSDataEngineers | Responsible for managing the data lake, ETLs, and data processes of the platform | PowerUserAccess | Data lake account

The following table presents examples of team-specific groups.

User Group | Description | Permission Set | Accounts
TeamLead | Group for the administrators of the team | AdministratorAccess | Team account
DataScientists | Group for data scientists; this group is added as an access group for the team's SageMaker domain | DataScientist | Team account
MLEngineers | Team-level role dedicated to specific tasks that map to the matching platform-wide ML engineering team | MLEngineering | Team account
DataEngineers | Team-level role dedicated to specific tasks that map to the matching platform-wide data engineering team | DataEngineering | Team account

AWS Control Tower automatically generates IAM Identity Center groups with permission set relationships for the various landing zone accounts it creates. You can use these preconfigured groups for your platform’s central teams or create new custom ones. For further insights into these groups, refer to IAM Identity Center Groups for AWS Control Tower. The following screenshot shows an example of the AWS Control Tower console, where you can view the accounts and determine which groups have permission on each account.

IAM Identity Center also provides a login page where landing zone users can get access to the different resources, such as accounts or SageMaker domains, with the different levels of permissions that you have granted them.
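As an illustration of how groups, permission sets, and accounts fit together, the following sketch creates a custom permission set and assigns a group to a team account using the IAM Identity Center admin APIs. All ARNs, IDs, and names are placeholders.

# IAM Identity Center instance, target account, and group (placeholder values)
export SSO_INSTANCE_ARN=arn:aws:sso:::instance/ssoins-0123456789abcdef
export TEAM_ACCOUNT_ID=111122223333
export GROUP_ID=01234567-89ab-cdef-0123-456789abcdef   # e.g., the DataScientists group

# Create a permission set for data scientists
PS_ARN=$(aws sso-admin create-permission-set --instance-arn ${SSO_INSTANCE_ARN} --name DataScientist --session-duration PT8H --query "PermissionSet.PermissionSetArn" --output text)

# Attach a managed policy to the permission set
aws sso-admin attach-managed-policy-to-permission-set --instance-arn ${SSO_INSTANCE_ARN} --permission-set-arn ${PS_ARN} --managed-policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

# Assign the group to the team account with this permission set
aws sso-admin create-account-assignment --instance-arn ${SSO_INSTANCE_ARN} --target-id ${TEAM_ACCOUNT_ID} --target-type AWS_ACCOUNT --permission-set-arn ${PS_ARN} --principal-type GROUP --principal-id ${GROUP_ID}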

AWS Security Reference Architecture
The AWS Security Reference Architecture (AWS SRA) is a holistic set of guidelines for deploying the full complement of AWS security services in a multi-account environment. It can help you design, implement, and manage AWS security services so they align with AWS recommended practices.
To help scale security operations and apply security tools holistically across the organization, it’s recommended to use the AWS SRA to configure your desired security services and tools. You can use the AWS SRA to set up key security tooling services, such as Amazon GuardDuty, Macie, and AWS Security Hub. The AWS SRA allows you to apply these services across your entire multi-account environment and centralize the visibility these tools provide. In addition, when accounts get created in the future, you can use the AWS SRA to configure the automation required to scope your security tools to these new accounts.
The following diagram depicts the centralized deployment of the AWS SRA.

Scale your ML workloads with AWS Service Catalog
Within your organization, there will likely be different teams corresponding to different business units. These teams will have similar infrastructure and service needs, which may change over time. With AWS Service Catalog, you can scale your ML workloads by allowing IT administrators to create, manage, and distribute portfolios of approved products to end-users, who then have access to the products they need in a personalized portal. AWS Service Catalog has direct integrations with AWS Control Tower and SageMaker.
It’s recommended that you use AWS Service Catalog portfolios and products to enhance and scale the following capabilities within your AWS environment:

Account vending – The cloud infrastructure team should maintain a portfolio of account bootstrapping products within the shared infrastructure account. These products are templates that contain the basic infrastructure that should be deployed when an account is created, such as VPC configuration, standard IAM roles, and controls. This portfolio can be natively shared with AWS Control Tower and the management account, so that the products are directly used when creating a new account. For more details, refer to Provision accounts through AWS Service Catalog.
Analytics infrastructure self-service – This portfolio should be created and maintained by a central analytics team or the ML shared services team. This portfolio is intended to host templates to deploy different sets of analytics products to be used by the platform ML and analytics teams. It is shared with the entire Workloads OU (for more information, see Sharing a Portfolio; a CLI sketch of this sharing step follows this list). Examples of the products include a SageMaker domain configured according to the organization’s best practices or an Amazon Redshift cluster for the team to perform advanced analytics.
ML model building and deploying – This capability maps to two different portfolios, which are maintained by the platform ML shared services team:

Model building portfolio – This contains the products to build, train, evaluate, and register your ML models across all ML teams. This portfolio is shared with the Workloads OU and is integrated with SageMaker project templates.
Model deployment portfolio – This contains the products to deploy your ML models at scale in a reliable and consistent way. It will have products for different deployment types such as real-time inference, batch inference, and multi-model endpoints. This portfolio can be isolated within the ML shared services account by the central ML engineering team for a more centralized ML strategy, or shared with the Workloads OU accounts and integrated with SageMaker project templates to federate responsibility to the individual ML teams.
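For reference, the underlying create-and-share step looks like the following when done directly with the AWS CLI; the portfolio name and OU ID are placeholders, and in the architecture described next this is handled by the CDK pipelines instead:

# In the ML shared services account: create a portfolio (requires organizations access to be enabled for AWS Service Catalog)
PORTFOLIO_ID=$(aws servicecatalog create-portfolio --display-name "analytics-self-service" --provider-name "ml-shared-services" --query "PortfolioDetail.Id" --output text)

# Share the portfolio with the entire Workloads OU
aws servicecatalog create-portfolio-share --portfolio-id ${PORTFOLIO_ID} --organization-node Type=ORGANIZATIONAL_UNIT,Value=ou-abcd-11111111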

Let’s explore how we deal with AWS Service Catalog products and portfolios in our platform. Both of the following architectures show an implementation to govern the AWS Service Catalog products using the AWS Cloud Development Kit (AWS CDK) and AWS CodePipeline. Each of the aforementioned portfolios will have its own independent pipeline and code repository. The pipeline synthesizes the AWS CDK Service Catalog product constructs into actual AWS Service Catalog products and deploys them to the portfolios, which are then made available for consumption and use. For more details about the implementation, refer to Govern CI/CD best practices via AWS Service Catalog.
The following diagram illustrates the architecture for the account vending portfolio.

The workflow includes the following steps:

The shared infrastructure account is set up with the pipeline to create the AWS Service Catalog portfolio.
The CCOE or central infrastructure team can work on these products and customize them so that company networking and security requirements are met.
You can use the AWS Control Tower Account Factory Customization (AFC) to integrate the portfolio within the account vending process. For more details, see Customize accounts with Account Factory Customization (AFC).
To create a new account from the AFC, we use a blueprint. A blueprint is an AWS CloudFormation template that will be deployed in the newly created AWS account. For more information, see Create a customized account from a blueprint.

The following screenshot shows an example of what account creation with a blueprint looks like.

For the analytics and ML portfolios, the architecture changes the way these portfolios are used downstream, as shown in the following diagram.

The following are the key steps involved in building this architecture:

The ML shared services account is set up and bootstrapped with the pipelines to create the two AWS Service Catalog portfolios.
The ML CCOE or ML engineering team can work on these products and customize them so they’re up to date and cover the main use cases from the different business units.
These portfolios are shared with the OU where the ML dev accounts will be located. For more information about the different options to share AWS Service Catalog portfolios, see Sharing a Portfolio.
Sharing these portfolios with the entire Workloads OU will result in these two portfolios being available for use by the account team as soon as the account is provisioned.

After the architecture has been set up, account admins will see the AWS Service Catalog portfolios in their ML workload account after they log in. The portfolios are ready to use and can get the team up to speed quickly.

Network architecture
In our ML platform, we are considering two different major logical environments for our workloads: production and pre-production environments with corporate connectivity, and sandbox or development iteration accounts without corporate connectivity. These two environments will have different permissions and requirements when it comes to connectivity.
As your environment in AWS scales up, inter-VPC connectivity and on-premises VPC connectivity will need to scale in parallel. By using services such as Amazon Virtual Private Cloud (Amazon VPC) and AWS Transit Gateway, you can create a scalable network architecture that is highly available, secure, and compliant with your company’s best practices. You can attach each account to its corresponding network segment.
For simplicity, we create a transit gateway within the central network account for our production workloads; this will resemble a production network segment. This will create a hub and spoke VPC architecture that will allow our production accounts to do the following:

Enable inter-VPC communication between the different accounts.
Inspect traffic with centralized egress or ingress to the network segment.
Provide the environments with connectivity to on-premises data stores.
Create a centralized VPC endpoints architecture to reduce networking costs while maintaining private network compliance. For more details, see Centralized access to VPC private endpoints.
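A minimal sketch of creating and sharing the production transit gateway from the network account follows; the Region, account ID, and organization ID are illustrative, and sharing with the organization requires AWS RAM sharing with AWS Organizations to be enabled:

# In the network account: create the production transit gateway
TGW_ID=$(aws ec2 create-transit-gateway --description "production-segment" --query "TransitGateway.TransitGatewayId" --output text)

# Share it through AWS RAM so workload accounts can attach their VPCs
aws ram create-resource-share --name production-tgw-share --resource-arns arn:aws:ec2:us-west-2:111122223333:transit-gateway/${TGW_ID} --principals arn:aws:organizations::111122223333:organization/o-exampleorgid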

For more information about these type of architectures, refer to Building a Scalable and Secure Multi-VPC AWS Network Infrastructure.
The following diagram illustrates the recommended architecture for deploying your transit gateways and creating attachments to the VPCs within your accounts. Anything considered a production environment, whether it’s a workload or shared services account, is connected to the corporate network, while dev accounts have direct internet connectivity to speed up development and exploration of new features.

At a high level, this architecture allows you to create different transit gateways within your network account for your desired AWS Regions or environments. Scalability is provided through the account vending functionality of AWS Control Tower, which deploys a CloudFormation stack to the accounts containing a VPC and the required infrastructure to connect to the environment’s corresponding network segment. For more information about this approach, see the AWS Control Tower Guide for Extending Your Landing Zone.
With this approach, whenever a team needs a new account, the platform team just needs to know whether this will be an account with corporate network connectivity or not. Then the corresponding blueprint is selected to bootstrap the account with, and the account is created. If it’s a corporate network account, the VPC will come with an attachment to the production transit gateway.
Conclusion
In this post, we discussed best practices for creating a multi-account foundation to support your analytics and ML workloads and configuring controls to help you implement governance early in your ML lifecycle. We provided a baseline recommendation for OUs and accounts you should consider creating using AWS Control Tower and blueprints. In addition, we showed how you can deploy security tools at scale using the AWS SRA, how to configure IAM Identity Center for centralized and federated access management, how to use AWS Service Catalog to package and scale your analytics and ML resources, and a best practice approach for creating a hub and spoke network architecture.
Use this guidance to get started in the creation of your own multi-account environment for governing your analytics and ML workloads at scale, and make sure you subscribe to the AWS Machine Learning Blog to receive updates regarding additional blog posts within this series.

About the authors
Alberto Menendez is a DevOps Consultant in Professional Services at AWS. He helps accelerate customers’ journeys to the cloud and achieve their digital transformation goals. In his free time, he enjoys playing sports, especially basketball and padel, spending time with family and friends, and learning about technology.
Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his three-year-old sheep-a-doodle!
Liam Izar is a Solutions Architect at AWS, where he helps customers work backward from business outcomes to develop innovative solutions on AWS. Liam has led multiple projects with customers migrating, transforming, and integrating data to solve business challenges. His core areas of expertise include technology strategy, data migrations, and machine learning. In his spare time, he enjoys boxing, hiking, and vacations with the family.

Exploring Robustness: Large Kernel ConvNets in Comparison to Convolutional Neural Network CNNs and Vision Transformers ViTs

Robustness is crucial for deploying deep learning models in real-world applications. Vision Transformers (ViTs) have shown strong robustness and state-of-the-art performance in various vision tasks since their introduction in the 2020s, outperforming traditional CNNs. Recent advancements in large kernel convolutions have revived interest in CNNs, showing they can match or exceed ViT performance. However, the robustness of large kernel networks remains to be determined. This study investigates whether large kernel networks are inherently robust, how their robustness compares to CNNs and ViTs, and what factors contribute to their robustness.

Researchers from Shanghai Jiao Tong University, Meituan, and several other Chinese universities comprehensively evaluated the robustness of large kernel convolutional networks (ConvNets) compared to traditional CNNs and ViTs across six benchmark datasets. Their experiments demonstrated that large kernel ConvNets exhibit remarkable robustness, sometimes even outperforming ViTs. Through a series of nine experiments, they identified unique properties such as occlusion invariance, kernel attention patterns, and frequency characteristics that contribute to this robustness. This study challenges the prevailing belief that self-attention is necessary for achieving strong robustness, suggesting that traditional CNNs can achieve comparable levels of robustness and advocating for further advancements in large kernel network development.

Large kernel convolutional networks date back to early deep learning models but were overshadowed by small kernel networks like VGG-Net and ResNet. Recently, models like ConvNeXt and RepLKNet have revived interest in large kernels, improving performance, especially on downstream tasks. However, their robustness has yet to be explored in depth. The study addresses this gap by evaluating the robustness of large kernel networks through various experiments. ViTs are known for their strong robustness across vision tasks. Previous studies have shown that ViTs outperform CNNs in robustness, inspiring further research. This study compares the robustness of large kernel networks to ViTs and CNNs, providing new insights.

The study investigates whether large kernel ConvNets are robust and how their robustness compares to traditional CNNs and ViTs. Using RepLKNet as the primary model, experiments were conducted across six robustness benchmarks. Models like ResNet-50, BiT, and ViT were used for comparison. Results show that RepLKNet outperforms traditional CNNs and ViTs in various robustness tests, including natural adversarial challenges, common corruptions, and domain adaptation. RepLKNet demonstrates superior robustness, particularly in occlusion scenarios and background dependency, highlighting the potential of large kernel ConvNets in robust learning tasks.

Large Kernel ConvNets exhibit robust performance due to their occlusion invariance and kernel attention patterns. Experiments show that these networks better handle high occlusion, adversarial attacks, model perturbations, and noise frequency than traditional models like ResNet and ViT. They maintain performance even when layers are removed or subjected to frequency-based noise. The robustness is largely attributed to the large kernel size, as replacing large kernels with smaller ones significantly degrades performance. This robustness improves with increasing kernel size, showing consistent enhancements in various benchmarks and confirming the importance of large kernels in ConvNet design.

While our empirical analysis strongly supports the robustness of large kernel ConvNets, the study acknowledges the need for more direct theoretical proofs given the intricate nature of deep learning. Moreover, computational constraints limited our ability to conduct kernel size ablations on ImageNet-21K, focusing instead on ImageNet-1K. Nevertheless, our research confirms the significant robustness of large kernel ConvNets across six standard benchmark datasets, accompanied by a thorough quantitative and qualitative examination. These insights shed light on the factors underlying their resilience, suggesting promising avenues for advancing the application and development of large kernel ConvNets in future research and practical use.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

The post Exploring Robustness: Large Kernel ConvNets in Comparison to Convolutional Neural Network CNNs and Vision Transformers ViTs appeared first on MarkTechPost.

Planetarium: A New Benchmark to Evaluate LLMs on Translating Natural L …

Large language models (LLMs) have gained significant attention for solving planning problems, but current methodologies fall short. Direct plan generation using LLMs has shown limited success, with GPT-4 achieving only 35% accuracy on simple planning tasks. This low accuracy highlights the need for more effective approaches. Another significant challenge lies in the lack of rigorous techniques and benchmarks for evaluating the translation of natural language planning descriptions into structured planning languages, such as the Planning Domain Definition Language (PDDL).

Researchers have explored various approaches to overcome the challenges of using LLMs for planning tasks. One method involves using LLMs to generate plans directly, but this has shown limited success due to poor performance even on simple planning tasks. Another approach, “Planner-Augmented LLMs,” combines LLMs with classical planning techniques. This method frames the problem as a machine translation task, converting natural language descriptions of planning problems into structured formats like PDDL, finite state automata, or logic programming.

The hybrid approach of translating natural language to PDDL utilizes the strengths of both LLMs and traditional symbolic planners. LLMs interpret natural language, while efficient traditional planners ensure solution correctness. However, evaluating code generation tasks, including PDDL translation, remains challenging. Existing evaluation methods, such as match-based metrics and plan validators, fall short in assessing the accuracy and relevance of generated PDDL relative to the original instructions.

Researchers from the Department of Computer Science at Brown University present Planetarium, a rigorous benchmark for evaluating LLMs’ ability to translate natural language descriptions of planning problems into PDDL, addressing the challenges in assessing PDDL generation accuracy. The benchmark formally defines planning problem equivalence and provides an algorithm to check whether two PDDL problems satisfy this definition. Planetarium includes a comprehensive dataset featuring 132,037 ground truth PDDL problems with corresponding text descriptions, varying in abstraction and size. The benchmark also provides a broad evaluation of current LLMs in both zero-shot and fine-tuned settings, revealing the task’s difficulty. With GPT-4o achieving only 35.1% accuracy in a zero-shot setting, Planetarium serves as a valuable tool for measuring progress in LLM-based PDDL generation and is publicly available for future development and evaluation.

The Planetarium benchmark introduces a rigorous algorithm for evaluating PDDL equivalence, addressing the challenge of comparing different representations of the same planning problem. This algorithm transforms PDDL code into scene graphs, representing both initial and goal states. It then fully specifies the goal scenes by adding all trivially true edges and creates problem graphs by joining initial and goal scene graphs.

The equivalence check involves several steps: First, it performs quick checks for obvious non-equivalence or equivalence cases. If these fail, it proceeds to fully specify the goal scenes, identifying all propositions true in all reachable goal states. The algorithm then operates in two modes: one for problems where object identity matters, and another where objects in goal states are treated as placeholders. For problems with object identity, it checks isomorphism between combined problem graphs. For placeholder problems, it checks isomorphism between initial and goal scenes separately. This approach ensures a comprehensive and accurate evaluation of PDDL equivalence, capable of handling various representation nuances in planning problems.
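As a rough illustration of this graph-based check (not the authors' released implementation), the following sketch encodes propositions as labeled edges with networkx and tests isomorphism. It is deliberately simplified: it skips the goal-scene completion step, and the predicate encoding and example propositions are placeholders.

## Illustrative sketch of a graph-isomorphism equivalence check (simplified)
import networkx as nx
from networkx.algorithms import isomorphism as iso

def problem_graph(init_props, goal_props):
    # Encode each proposition (predicate, arg1, arg2) as a labeled edge,
    # tagged with the scene it belongs to (initial or goal).
    g = nx.MultiDiGraph()
    for scene, props in (('init', init_props), ('goal', goal_props)):
        for pred, a, b in props:
            g.add_node(a, name=a)
            g.add_node(b, name=b)
            g.add_edge(a, b, predicate=pred, scene=scene)
    return g

def equivalent(pa, pb, object_identity=True):
    # Edge labels (predicate and scene) must match under the isomorphism
    edge_match = iso.categorical_multiedge_match(['predicate', 'scene'], [None, None])
    # When object identity matters, node names must also line up;
    # otherwise objects are treated as interchangeable placeholders.
    node_match = iso.categorical_node_match('name', None) if object_identity else None
    return nx.is_isomorphic(pa, pb, node_match=node_match, edge_match=edge_match)

# Two encodings of the same toy blocksworld problem (placeholder propositions)
p1 = problem_graph([('on', 'a', 'table')], [('on', 'a', 'b')])
p2 = problem_graph([('on', 'a', 'table')], [('on', 'a', 'b')])
print(equivalent(p1, p2))  # True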

The Planetarium benchmark evaluates the performance of various large language models (LLMs) in translating natural language descriptions into PDDL. Results show that GPT-4o, Mistral v0.3 7B Instruct, and Gemma 1.1 IT 2B & 7B all performed poorly in zero-shot settings, with GPT-4o achieving the highest accuracy at 35.12%. GPT-4o’s performance breakdown reveals that abstract task descriptions are more challenging to translate than explicit ones, while fully explicit task descriptions facilitate the generation of parseable PDDL code. Fine-tuning significantly improved performance across all open-weight models, with Mistral v0.3 7B Instruct achieving the highest accuracy after fine-tuning.

This study introduces the Planetarium benchmark, which marks a significant advance in evaluating LLMs’ ability to translate natural language into PDDL for planning tasks. It addresses crucial technical and societal challenges, emphasizing the importance of accurate translations to prevent potential harm from misaligned results. Current performance levels, even for advanced models like GPT-4, highlight the complexity of this task and the need for further innovation. As LLM-based planning systems evolve, Planetarium provides a vital framework for measuring progress and ensuring reliability. This research pushes the boundaries of AI capabilities and underscores the importance of responsible development in creating trustworthy AI planning systems.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Planetarium: A New Benchmark to Evaluate LLMs on Translating Natural Language Descriptions of Planning Problems into Planning Domain Definition Language PDDL appeared first on MarkTechPost.

RTMW: A Series of High-Performance AI Models for 2D/3D Whole-Body Pose …

Whole-body pose estimation is a key component for improving the capabilities of human-centric AI systems. It is useful in human-computer interaction, virtual avatar animation, and the film industry. Early research in this field was challenging due to the task’s complexity and limited computational power and data, so researchers focused on estimating the pose of separate body parts. Systems like OpenPose combined these separate estimations to achieve whole-body pose estimation. However, this method was computationally expensive and had performance limitations. Although lightweight tools like MediaPipe provide good real-time performance and are easy to use, their accuracy still needs improvement.

Current research on these problems includes Top-down Approaches, Coordinate Classification, and 3D Pose Estimation. Top-down algorithms use standard detectors to create bounding boxes and scale the human figure uniformly for pose estimation, and they have performed well on public benchmarks. The two-stage inference method allows the human detector and the pose estimator to use smaller input resolutions. In Coordinate Classification, SimCC introduces an approach that treats keypoint prediction as a classification task over horizontal and vertical coordinates. Lastly, 3D pose estimation is a growing field with many industry applications. It mainly involves two approaches: lifting methods that use 2D keypoints and regression methods based on image analysis.

Researchers from Shanghai AI Laboratory have proposed RTMW (Real-Time Multi-person Whole-body pose estimation models), a series of high-performance models for estimating 2D/3D whole-body pose. To better capture pose information from body parts at different scales, the RTMPose model architecture is combined with FPN and HEM (Hierarchical Encoding Module). The model is trained on a large collection of open-source human datasets whose annotations are manually aligned, and it is further improved using a two-stage distillation technique. RTMW performs strongly on various whole-body pose estimation benchmarks while keeping high inference efficiency and consistent deployment friendliness.

RTMPose uses various training techniques and adopts the two-stage distillation technology from DWPose during training. Since open-source whole-body pose estimation datasets are limited, 14 datasets were utilized; their keypoint definitions were aligned manually and uniformly mapped to the 133-point definition of COCO-WholeBody. Because open-source 3D datasets for monocular whole-body pose estimation are also scarce, the 14 existing 2D datasets were combined with three open-source 3D datasets, giving 17 datasets for joint training. These datasets include 3 whole-body datasets, 6 human body datasets, 4 face datasets, 1 hand dataset, and 3 3D whole-body datasets.
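As a generic illustration of this kind of dataset merging (not the authors' conversion code), each source dataset's keypoints can be remapped into the 133-point COCO-WholeBody layout through a lookup table; the index values below are made-up placeholders.

## Illustrative sketch: remapping dataset-specific keypoints to a 133-point layout
import numpy as np

NUM_WHOLEBODY_KPTS = 133

# Hypothetical mapping from a source dataset's keypoint indices
# to COCO-WholeBody indices (placeholder values, not the real mapping)
SOURCE_TO_WHOLEBODY = {0: 0, 1: 5, 2: 6, 3: 11, 4: 12}

def to_wholebody(src_kpts: np.ndarray) -> np.ndarray:
    # src_kpts: (num_source_keypoints, 3) array of (x, y, visibility)
    out = np.zeros((NUM_WHOLEBODY_KPTS, 3), dtype=np.float32)  # unmapped points stay invisible
    for src_idx, dst_idx in SOURCE_TO_WHOLEBODY.items():
        out[dst_idx] = src_kpts[src_idx]
    return out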

The proposed RTMW model is tested on the whole-body pose estimation task using the COCO-WholeBody dataset. The results show that RTMW performs very well, balancing accuracy and complexity. RTMW3D also demonstrates good performance on COCO-WholeBody and achieved strong results when tested on H3WB. The inference speed of the RTMW models was also evaluated: although RTMW includes an extra module compared to RTMPose, which makes it slightly slower, it significantly improves accuracy.

Researchers from Shanghai AI Laboratory have introduced RTMW, a series of high-performance models for 2D/3D whole-body pose estimation. In this paper, they have expanded on previous work by examining the complexities and challenges in whole-body pose estimation. The new method, RTMW/RTMW3D, builds on the established RTMPose model for real-time whole-body pose estimation. It has shown outstanding performance among open-source alternatives and features unique monocular 3D pose estimation capabilities. In the future, the proposed algorithm and its open-source availability are expected to meet several practical needs in the industry for robust pose estimation solutions.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

The post RTMW: A Series of High-Performance AI Models for 2D/3D Whole-Body Pose Estimation appeared first on MarkTechPost.

Video auto-dubbing using Amazon Translate, Amazon Bedrock, and Amazon …

This post is co-written with MagellanTV and Mission Cloud. 
Video dubbing, or content localization, is the process of replacing the original spoken language in a video with another language while synchronizing audio and video. Video dubbing has emerged as a key tool in breaking down linguistic barriers, enhancing viewer engagement, and expanding market reach. However, traditional dubbing methods are costly (about $20 per minute with human review effort) and time consuming, making them a common challenge for companies in the Media & Entertainment (M&E) industry. Video auto-dubbing that uses the power of generative artificial intelligence (generative AI) offers creators an affordable and efficient solution.
This post shows you a cost-saving solution for video auto-dubbing. We use Amazon Translate for initial translation of video captions and use Amazon Bedrock for post-editing to further improve the translation quality. Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to help you build generative AI applications with security, privacy, and responsible AI.
MagellanTV, a leading streaming platform for documentaries, wants to broaden its global presence through content internationalization. Faced with manual dubbing challenges and prohibitive costs, MagellanTV sought out AWS Premier Tier Partner Mission Cloud for an innovative solution.
Mission Cloud’s solution distinguishes itself with idiomatic detection and automatic replacement, seamless automatic time scaling, and flexible batch processing capabilities with increased efficiency and scalability.
Solution overview
The following diagram illustrates the solution architecture. The inputs of the solution are specified by the user, including the folder path containing the original video and caption file, target language, and toggles for idiom detector and formality tone. You can specify these inputs in an Excel template and upload the Excel file to a designated Amazon Simple Storage Service (Amazon S3) bucket. This will launch the whole pipeline. The final outputs are a dubbed video file and a translated caption file.

We use Amazon Translate to translate the video caption, and Amazon Bedrock to enhance the translation quality and enable automatic time scaling to synchronize audio and video. We use Amazon Augmented AI for editors to review the content, which is then sent to Amazon Polly to generate synthetic voices for the video. To assign a voice with a gender expression that matches the speaker, we developed a model to predict the speaker’s gender expression.
In the backend, AWS Step Functions orchestrates the preceding steps as a pipeline. Each step is run on AWS Lambda or AWS Batch. By using the infrastructure as code (IaC) tool, AWS CloudFormation, the pipeline becomes reusable for dubbing new foreign languages.
In the following sections, you will learn how to use the unique features of Amazon Translate for setting formality tone and for custom terminology. You will also learn how to use Amazon Bedrock to further improve the quality of video dubbing.
Why choose Amazon Translate?
We chose Amazon Translate to translate video captions based on three factors.

Amazon Translate supports over 75 languages. While the landscape of large language models (LLMs) has continuously evolved in the past year and continues to change, many of the trending LLMs support a smaller set of languages.
Our translation professional rigorously evaluated Amazon Translate in our review process and affirmed its commendable translation accuracy. Welocalize benchmarks the performance of using LLMs and machine translations and recommends using LLMs as a post-editing tool.
Amazon Translate has various unique benefits. For example, you can add custom terminology glossaries, while for LLMs, you might need fine-tuning that can be labor-intensive and costly.

Use Amazon Translate for custom terminology
Amazon Translate allows you to input a custom terminology dictionary, ensuring translations reflect the organization’s vocabulary or specialized terminology. We use the custom terminology dictionary to compile frequently used terms within video transcription scripts.
Here’s an example. In a documentary video, the caption file would typically display “(speaking in foreign language)” on the screen as the caption when the interviewee speaks in a foreign language. The sentence “(speaking in foreign language)” itself doesn’t have proper English grammar: it lacks a subject, yet it’s commonly accepted as an English caption display. When translating the caption into German, the translation also lacks the subject, which can be confusing to German audiences, as shown in the code block that follows.

## Translate – without custom terminology (default)
import boto3

# Initialize a session of Amazon Translate
translate = boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)

def translate_text(text, source_lang, target_lang):
    result = translate.translate_text(
        Text=text,
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target_lang)
    return result.get('TranslatedText')

text = "(speaking in a foreign language)"
output = translate_text(text, "en", "de")
print(output)
# Output: (in einer Fremdsprache sprechen)

Because this phrase “(speaking in foreign language)” is commonly seen in video transcripts, we added this term to the custom terminology CSV file translation_custom_terminology_de.csv with the vetted translation and provided it in the Amazon Translate job. The translation output is as intended as shown in the following code.
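For reference, an Amazon Translate custom terminology CSV lists language codes in the header row and one source/target pair per line. The following two-line example illustrates the format; the German phrase is taken from the translation output shown in the code that follows, and the file name matches the one used there.

## Example content of translation_custom_terminology_de.csv (illustrative)
en,de
(speaking in foreign language),(Person spricht in einer Fremdsprache)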

## Translate – with custom terminology
import boto3

# Initialize a session of Amazon Translate
translate = boto3.client('translate')

# Import the vetted custom terminology CSV file
with open('translation_custom_terminology_de.csv', 'rb') as ct_file:
    translate.import_terminology(
        Name='CustomTerminology_boto3',
        MergeStrategy='OVERWRITE',
        Description='Terminology for Demo through boto3',
        TerminologyData={
            'File': ct_file.read(),
            'Format': 'CSV',
            'Directionality': 'MULTI'
        }
    )

text = "(speaking in foreign language)"
result = translate.translate_text(
    Text=text,
    TerminologyNames=['CustomTerminology_boto3'],  # must match the imported terminology name
    SourceLanguageCode="en",
    TargetLanguageCode="de"
)
print(result['TranslatedText'])
# Output: (Person spricht in einer Fremdsprache)

Set formality tone in Amazon Translate
Some documentary genres tend to be more formal than others. Amazon Translate allows you to define the desired level of formality for translations to supported target languages. By using the default setting (Informal) of Amazon Translate, the translation output in German for the phrase, “[Speaker 1] Let me show you something,” is informal, according to a professional translator.

## Translate – with informal tone (default)
import boto3

# Initialize a session of Amazon Translate
translate = boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)

def translate_text(text, source_lang, target_lang):
    result = translate.translate_text(
        Text=text,
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target_lang)
    return result.get('TranslatedText')

text = "[Speaker 1] Let me show you something."
output = translate_text(text, "en", "de")
print(output)
# Output: [Sprecher 1] Lass mich dir etwas zeigen.

By adding the Formal setting, the output translation has a formal tone, which fits the documentary’s genre as intended.

## Translate – with formal tone
import boto3

# Initialize a session of Amazon Translate
translate = boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)

def translate_text(text, source_lang, target_lang):
    result = translate.translate_text(
        Text=text,
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target_lang,
        Settings={'Formality': 'FORMAL'})
    return result.get('TranslatedText')

text = "[Speaker 1] Let me show you something."
output = translate_text(text, "en", "de")
print(output)
# Output: [Sprecher 1] Lassen Sie mich Ihnen etwas zeigen.

Use Amazon Bedrock for post-editing
In this section, we use Amazon Bedrock to improve the quality of video captions after we obtain the initial translation from Amazon Translate.
Idiom detection and replacement
Idiom detection and replacement is vital in dubbing English videos to accurately convey cultural nuances. Adapting idioms prevents misunderstandings, enhances engagement, preserves humor and emotion, and ultimately improves the global viewing experience. Hence, we developed an idiom detection function using Amazon Bedrock to resolve this issue.
You can turn the idiom detector on or off by specifying the inputs to the pipeline. For example, for science genres that have fewer idioms, you can turn the idiom detector off, whereas for genres with more casual conversations, you can turn it on. For a 25-minute video, the total processing time is about 1.5 hours, of which about 1 hour is spent on video preprocessing and video composing. Turning the idiom detector on only adds about 5 minutes to the total processing time.
We have developed a function bedrock_api_idiom to detect and replace idioms using Amazon Bedrock. The function first uses Amazon Bedrock LLMs to detect idioms in the text and then replaces them. In the example that follows, Amazon Bedrock successfully detects the idiom in the input text “well, I hustle” and replaces it with “I work hard,” which can then be translated correctly into Spanish by using Amazon Translate.

## A rare idiom is well-detected and rephrased by Amazon Bedrock
text_rephrased = bedrock_api_idiom(text)
print(text_rephrased)
# Output: I work hard

response = translate_text(text_rephrased, "en", "es-MX")
print(response)
# Output: yo trabajo duro

response = translate_text(response, "es-MX", "en")
print(response)
# Output: I work hard
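
The implementation of bedrock_api_idiom isn’t shown in this post. The following is a minimal sketch of how such a helper could call Amazon Bedrock; the model ID, prompt wording, and response parsing are illustrative assumptions, not the production code.

## Hypothetical sketch of an idiom-rephrasing helper using Amazon Bedrock
import json
import boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

def bedrock_api_idiom(text, model_id='anthropic.claude-3-sonnet-20240229-v1:0'):
    # Ask the model to detect idioms and rewrite the text literally (assumed prompt)
    prompt = ("Detect any idioms in the following caption and rewrite it in plain, "
              "literal English, keeping the meaning. Return only the rewritten text.\n\n" + text)
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 200,
        "messages": [{"role": "user", "content": prompt}],
    })
    response = bedrock.invoke_model(modelId=model_id, body=body)
    payload = json.loads(response['body'].read())
    return payload['content'][0]['text'].strip()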

Sentence shortening
Third-party video dubbing tools can be used for time-scaling during video dubbing, which can be costly if done manually. In our pipeline, we used Amazon Bedrock to develop a sentence shortening algorithm for automatic time scaling.
For example, a typical caption file consists of a section number, timestamp, and the sentence. The following is an example of an English sentence before shortening.
Original sentence:
A large portion of the solar energy that reaches our planet is reflected back into space or absorbed by dust and clouds.

Here’s the shortened sentence produced by the sentence shortening algorithm. Using Amazon Bedrock, we can significantly improve the video-dubbing performance and reduce the human review effort, resulting in cost savings.
Shortened sentence:
A large part of solar energy is reflected into space or absorbed by dust and clouds.
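
For illustration only, a similar Amazon Bedrock call can be prompted to compress a caption toward a target word count derived from its time slot; the model ID, target-length heuristic, and prompt below are assumptions, not the pipeline’s actual implementation.

## Hypothetical sketch: shortening a caption with Amazon Bedrock for time scaling
import json
import boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

def bedrock_shorten_sentence(sentence, target_ratio=0.8):
    # Ask the model to compress the caption while preserving its meaning (assumed prompt)
    target_words = max(1, int(len(sentence.split()) * target_ratio))
    prompt = (f"Shorten the following caption to about {target_words} words while keeping "
              f"its meaning. Return only the shortened caption.\n\n{sentence}")
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 200,
        "messages": [{"role": "user", "content": prompt}],
    })
    response = bedrock.invoke_model(
        modelId='anthropic.claude-3-sonnet-20240229-v1:0', body=body)
    return json.loads(response['body'].read())['content'][0]['text'].strip()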

Conclusion
This new and constantly developing pipeline has been a revolutionary step for MagellanTV because it efficiently resolved some challenges they were facing that are common within Media & Entertainment companies in general. The unique localization pipeline developed by Mission Cloud creates a new frontier of opportunities to distribute content across the world while saving on costs. Using generative AI in tandem with brilliant solutions for idiom detection and resolution, sentence length shortening, and custom terminology and tone results in a truly special pipeline bespoke to MagellanTV’s growing needs and ambitions.
If you want to learn more about this use case or have a consultative session with the Mission team to review your specific generative AI use case, feel free to request one through AWS Marketplace.

About the Authors
Na Yu is a Lead GenAI Solutions Architect at Mission Cloud, specializing in developing ML, MLOps, and GenAI solutions in AWS Cloud and working closely with customers. She received her Ph.D. in Mechanical Engineering from the University of Notre Dame.
Max Goff is a data scientist/data engineer with over 30 years of software development experience. A published author, blogger, and music producer, he sometimes dreams in A.I.
Marco Mercado is a Sr. Cloud Engineer specializing in developing cloud native solutions and automation. He holds multiple AWS Certifications and has extensive experience working with high-tier AWS partners. Marco excels at leveraging cloud technologies to drive innovation and efficiency in various projects.
Yaoqi Zhang is a Senior Big Data Engineer at Mission Cloud. She specializes in leveraging AI and ML to drive innovation and develop solutions on AWS. Before Mission Cloud, she worked as an ML and software engineer at Amazon for six years, specializing in recommender systems for Amazon fashion shopping and NLP for Alexa. She received her Master of Science Degree in Electrical Engineering from Boston University.
Adrian Martin is a Big Data/Machine Learning Lead Engineer at Mission Cloud. He has extensive experience in English/Spanish interpretation and translation.
Ryan Ries holds over 15 years of leadership experience in data and engineering, over 20 years of experience working with AI and 5+ years helping customers build their AWS data infrastructure and AI models. After earning his Ph.D. in Biophysical Chemistry at UCLA and Caltech, Dr. Ries has helped develop cutting-edge data solutions for the U.S. Department of Defense and a myriad of Fortune 500 companies.
Andrew Federowicz is the IT and Product Lead Director for Magellan VoiceWorks at MagellanTV. With a decade of experience working in cloud systems and IT in addition to a degree in mechanical engineering, Andrew designs, builds, deploys, and scales inventive solutions to unique problems. Before Magellan VoiceWorks, Andrew architected and built the AWS infrastructure for MagellanTV’s 24/7 globally available streaming app. In his free time, Andrew enjoys sim racing and horology.
Qiong Zhang, PhD, is a Sr. Partner Solutions Architect at AWS, specializing in AI/ML. Her current areas of interest include federated learning, distributed training, and generative AI. She holds 30+ patents and has co-authored 100+ journal/conference papers. She is also the recipient of the Best Paper Award at IEEE NetSoft 2016, IEEE ICC 2011, ONDM 2010, and IEEE GLOBECOM 2005.
Cristian Torres is a Sr. Partner Solutions Architect at AWS. He has 10 years of experience working in technology performing several roles such as: Support Engineer, Presales Engineer, Sales Specialist and Solutions Architect. He works as a generalist with AWS services focusing on Migrations to help strategic AWS Partners develop successfully from a technical and business perspective.

How Mixbook used generative AI to offer personalized photo book experi …

This post is co-written with Vlad Lebedev and DJ Charles from Mixbook.
Mixbook is an award-winning design platform that gives users unrivaled creative freedom to design and share one-of-a-kind stories, transforming the lives of more than six million people. Today, Mixbook is the #1 rated photo book service in the US with 26 thousand five-star reviews.
Mixbook is empowering users to share their stories with creativity and confidence. Their mission is to assist users in celebrating the beautiful moments of their lives. Mixbook aims to foster the profound connections between users and their loved ones through sharing of their stories in both physical and digital mediums.
Years ago, Mixbook undertook a strategic initiative to transition their operational workloads to Amazon Web Services (AWS), a move that has continually yielded significant advantages. This pivotal decision has been instrumental in propelling them towards fulfilling their mission, ensuring their system operations are characterized by reliability, superior performance, and operational efficiency.
In this post we show you how Mixbook used generative artificial intelligence (AI) capabilities in AWS to personalize their photo book experiences—a step towards their mission.
Business Challenge
In today’s digital world, we have a lot of pictures that we take and share with our friends and family. Let’s consider a scenario where we have hundreds of photos from a recent family vacation, and we want to create a coffee-table photo-book to make it memorable. However, choosing the best pictures from the lot and describing them with captions can take a lot of time and effort. As we all know, a picture’s worth a thousand words, which is why trying to sum up a moment with a caption of just six to ten words can be so challenging. Mixbook really gets the problem, and they’re here to fix it.
Solution
Mixbook Smart Captions is the magical solution to the caption conundrum. It doesn’t only interpret user photos; it also adds a sprinkle of creativity, making the stories pop.
Most importantly, Smart Captions doesn’t fully automate the creative process. Instead, it provides a creative partner to enable the user’s own storytelling to imbue a book with personal flourishes. Whether it’s a selfie or a scenic shot, the goal is to make sure users’ photos speak volumes, effortlessly.
Architecture overview
The implementation of the system involves three primary components:

Data intake
Information inference
Creative synthesis

Caption generation is heavily reliant on the inference process, because the quality and meaningfulness of the comprehension process output directly influence the specificity and personalization of the caption generation. The following is the data flow diagram of the caption generation process, which is described in the text that follows.

Data intake
A user uploads photos into Mixbook. The raw photos are stored in Amazon Simple Storage Service (Amazon S3).
The data intake process involves three macro components: Amazon Aurora MySQL-Compatible Edition, Amazon S3, and AWS Fargate for Amazon ECS. Aurora MySQL serves as the primary relational data storage solution for tracking and recording media file upload sessions and their accompanying metadata. It offers flexible capacity options, ranging from serverless on one end to reserved provisioned instances for predictable long-term use on the other. Amazon S3, in turn, provides efficient, scalable, and secure storage for the media file objects themselves. Its storage classes enable the maintenance of recent uploads in a warm state for low-latency access, while older objects can be transitioned to Amazon S3 Glacier tiers, thus minimizing storage expenses over time. Amazon Elastic Container Service (Amazon ECS), when used in conjunction with the low-maintenance compute environment of AWS Fargate, forms a convenient orchestrator for containerized workloads, bringing all components together seamlessly.
Inference
The comprehension phase extracts essential contextual and semantic elements from the input, including image descriptions, temporal and spatial data, facial recognition, emotional sentiment, and labels. Among these, the image descriptions generated by a computer vision model offer the most fundamental understanding of the captured moments. Amazon Rekognition delivers precise detection of face bounding boxes and emotional expressions. The detected bounding boxes are primarily used for optimal automatic photo placement and cropping, while the detected emotions help select a better tone for the story, for example making it funnier or more nostalgic. Furthermore, Amazon Rekognition enhances safety by identifying potentially objectionable content.
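As a rough sketch of this step (not Mixbook’s code), a call to the Amazon Rekognition detect_faces API with all attributes enabled returns both bounding boxes and emotions; the bucket and key below are placeholders.

## Illustrative sketch: face and emotion detection with Amazon Rekognition
import boto3

rekognition = boto3.client('rekognition')

def analyze_photo(bucket, key):
    # Request all facial attributes so the response includes emotions
    response = rekognition.detect_faces(
        Image={'S3Object': {'Bucket': bucket, 'Name': key}},
        Attributes=['ALL']
    )
    faces = []
    for face in response['FaceDetails']:
        top_emotion = max(face['Emotions'], key=lambda e: e['Confidence'])
        faces.append({
            'bounding_box': face['BoundingBox'],  # used for placement and cropping
            'emotion': top_emotion['Type'],       # used to adjust the caption tone
        })
    return faces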
The inference pipeline is powered by an AWS Lambda-based multi-step architecture, which maximizes cost-efficiency and elasticity by running independent image analysis steps in parallel. AWS Step Functions enables the synchronization and ordering of interdependent steps.
The image captions are generated by an Amazon SageMaker inference endpoint, which is enhanced by an Amazon ElastiCache for Redis-powered buffer. The buffer was implemented after benchmarking the captioning model’s performance. The benchmarking revealed that the model performed optimally when processing batches of images, but underperformed when analyzing individual images.
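A minimal sketch of such a buffer is shown below, assuming placeholder ElastiCache and SageMaker endpoint names and a made-up batch size; it only illustrates the batching idea, not the production implementation.

## Illustrative sketch: buffering images in Redis and invoking SageMaker in batches
import json
import boto3
import redis

BATCH_SIZE = 8  # placeholder batch size
cache = redis.Redis(host='my-elasticache-endpoint', port=6379)  # placeholder endpoint
sagemaker = boto3.client('sagemaker-runtime')

def enqueue_image(s3_uri):
    # Buffer image references until a full batch is available
    cache.rpush('caption_queue', s3_uri)
    if cache.llen('caption_queue') >= BATCH_SIZE:
        batch = [cache.lpop('caption_queue').decode() for _ in range(BATCH_SIZE)]
        response = sagemaker.invoke_endpoint(
            EndpointName='image-captioning-endpoint',  # placeholder endpoint name
            ContentType='application/json',
            Body=json.dumps({'images': batch})
        )
        return json.loads(response['Body'].read())
    return None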
Generation
The caption-generating mechanism behind the writing assistant feature is what turns Mixbook Studio into a natural language story-crafting tool. Powered by a Llama language model, the assistant initially used carefully engineered prompts created by AI experts. However, the Mixbook Storyarts team sought more granular control over the style and tone of the captions, leading to a diverse team that included an Emmy-nominated scriptwriter reviewing, adjusting, and adding unique handcrafted examples. This resulted in a process of fine-tuning the model, moderating modified responses, and deploying approved models for experimental and public releases. After inference, three captions are created and stored in Amazon Relational Database Service (Amazon RDS).
The following image shows the Mixbook Smart Captions feature in Mixbook Studio.

Benefits
Mixbook implemented this solution to provide new features to their customers. It provided an improved user experience with operational efficiency.
User experience

Enhanced storytelling: Captures the users’ emotions and experiences, now beautifully expressed through captions that are heartfelt.
User delight: Adds an element of surprise with captions that aren’t just accurate, but also delightful and imaginative. A delighted user, Hanie U., says, “I hope there are more captions experiences released in the future.” Another user, Megan P., says, “It worked great!” Users can also edit the generated captions.
Time efficiency: Nobody has the time to struggle with captions. The feature saves precious time while making user stories shine bright.
Safety and correctness: The captions were generated responsibly, leveraging guardrails to ensure content moderation and relevance.

System

Elasticity and scalability of Lambda
Comprehensible workflow orchestration with Step Functions
Variety of base models from SageMaker and tuning capabilities for maximum control

As a result of their improved user delight, Mixbook has been named as an official honoree of the Webby Awards in 2024 for Apps & Software Best Use of AI & Machine Learning.

“AWS enables us to scale the innovations our customers love most. And now, with the new AWS generative AI capabilities, we are able to blow our customers’ minds with creative power they never thought possible. Innovations like this are why we’ve been partnered with AWS since the beta in 2006.”
– Andrew Laffoon, CEO, Mixbook

Conclusion
Mixbook started experimenting with AWS generative AI solutions to augment their existing application in early 2023. They started with a quick proof-of-concept to yield results to show the art of the possible. Continuous development, testing, and integration using AWS breadth of services in compute, storage, analytics, and machine learning allowed them to iterate quickly. After they released the Smart Caption features in beta, they were able to quickly adjust according to real-world usage patterns, and protect the product’s value.
Try out Mixbook Studio to experience the storytelling. To learn more about AWS generative AI solutions, start with Transform your business with generative AI. To hear more from Mixbook leaders, listen to the AWS re:Think Podcast available from Art19, Apple Podcasts, and Spotify.

About the authors
Vlad Lebedev is a Senior Technology Leader at Mixbook. He leads a product-engineering team responsible for transforming Mixbook into a place for heartfelt storytelling. He draws on over a decade of hands-on experience in web development, system design, and data engineering to drive elegant solutions for complex problems. Vlad enjoys learning about both contemporary and ancient cultures, their histories, and languages.
DJ Charles is the CTO at Mixbook. He has enjoyed a 30-year career architecting interactive and e-commerce designs for top brands. Innovating broadband tech for the cable industry in the ’90s, revolutionizing supply-chain processes in the 2000s, and advancing environmental tech at Perillon led to global real-time bidding platforms for brands like Sotheby’s & eBay. Beyond tech, DJ loves learning new musical instruments, the art of songwriting, and deeply engages in music production & engineering in his spare time.
Malini Chatterjee is a Senior Solutions Architect at AWS. She provides guidance to AWS customers on their workloads across a variety of AWS technologies. She brings a breadth of expertise in Data Analytics and Machine Learning. Prior to joining AWS, she was architecting data solutions in financial industries. She is very passionate about semi-classical dancing and performs in community events. She loves traveling and spending time with her family.
Jessica Oliveira is an Account Manager at AWS who provides guidance and support to Commercial Sales in Northern California. She is passionate about building strategic collaborations to help ensure her customers’ success. Outside of work, she enjoys traveling, learning about different languages and cultures, and spending time with her family.

ETH Zurich Researchers Introduced EventChat: A CRS Using ChatGPT as It …

Conversational Recommender Systems (CRS) are revolutionizing how users make decisions by offering personalized suggestions through interactive dialogue interfaces. Unlike traditional systems that present predetermined options, CRS allows users to dynamically input and refine their preferences, significantly reducing information overload. By incorporating feedback loops and advanced machine learning techniques, CRS provides an engaging and intuitive user experience. These systems are particularly valuable for small and medium-sized enterprises (SMEs) looking to enhance customer satisfaction and engagement without the extensive resources required for traditional recommendation systems.

Due to limited resources and high operational costs, SMEs struggle to implement efficient recommendation systems. Traditional systems often lack flexibility and user control, limiting users to reacting to predefined recommendations. SMEs require affordable and effective solutions that dynamically adapt to user preferences in real time, providing a more interactive and satisfying experience. More advanced conversational models that can cater to these requirements are critical for SMEs to stay competitive and meet customer expectations.

Existing frameworks for CRS have primarily focused on managing dialogues and extracting user information. Traditional approaches, which rely heavily on script-based interactions, often fail to provide the depth and flexibility required for a truly personalized user experience. Recent advancements have incorporated large language models (LLMs) like ChatGPT, which can generate and understand natural language to facilitate more adaptive conversations. These LLM-driven systems, such as fine-tuned versions of LaMDA, offer significant improvements in interaction quality but come with high development and operational costs, posing challenges for resource-constrained SMEs.

Researchers from ETH Zurich have introduced EventChat, a CRS tailored for SMEs in the leisure industry that aims to balance cost-effectiveness with high-quality user interactions. EventChat utilizes ChatGPT as its core language model, integrating prompt-based learning techniques to minimize the need for extensive training data. This approach makes it accessible for smaller businesses by reducing the implementation complexity and associated costs. EventChat’s key features include handling complex queries, providing tailored event recommendations, and addressing SMEs’ specific needs in delivering enhanced user experiences.

EventChat operates through a turn-based dialogue system where user inputs trigger specific actions such as search, recommendation, or targeted inquiries. The backend architecture combines relational and vector databases to curate relevant event information. Combining button-based interactions with conversational prompts, this hybrid approach ensures efficient resource use while maintaining high recommendation accuracy. Developed using the Flutter framework, EventChat’s frontend allows for customizable time intervals and user preferences, enhancing overall user experience and control. By including user-specific parameters directly in the chat, EventChat optimizes interaction efficiency and satisfaction.

The performance evaluation of EventChat demonstrated promising results, with an 85.5% recommendation accuracy rate. The system showed effective user engagement and satisfaction, although it faced challenges with latency and cost. Specifically, a median cost of $0.04 per interaction and a latency of 5.7 seconds highlighted areas needing improvement. The study emphasized the importance of balancing high-quality responses with economic viability for SMEs, suggesting that further optimization could enhance system performance. The research team also noted the significant impact of using advanced LLMs like ChatGPT, which, while improving interaction quality, increased operational costs and response times.

The research indicates that LLM-driven CRS, such as EventChat, can significantly benefit SMEs by improving user engagement and recommendation accuracy. Despite challenges related to cost and latency, the strategic implementation of these systems shows promise in democratizing advanced recommendation technologies for smaller businesses. The findings underscore the need for ongoing refinement and strategic planning to maximize the potential of CRS in resource-constrained environments. By reducing costs and improving response times, SMEs can leverage LLM-driven CRS to enhance customer satisfaction and stay competitive in their respective markets.

In conclusion, integrating LLM-driven CRS like EventChat presents a viable solution for SMEs aiming to enhance customer engagement and satisfaction. EventChat’s implementation demonstrates that balancing cost, latency, and interaction quality is crucial for an effective system. With an 85.5% recommendation accuracy and a median cost of $0.04 per interaction, EventChat highlights the potential benefits and challenges of adopting advanced conversational models in SME settings. As SMEs seek affordable and efficient recommendation solutions, ongoing research and refinement of LLM-driven CRS will be vital in achieving sustainable and competitive business practices.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post ETH Zurich Researchers Introduced EventChat: A CRS Using ChatGPT as Its Core Language Model Enhancing Small and Medium Enterprises with Advanced Conversational Recommender Systems appeared first on MarkTechPost.

RoboMorph: Evolving Robot Design with Large Language Models and Evolut …

The field of robotics is seeing transformative changes with the integration of generative methods like large language models (LLMs). These advancements enable the development of sophisticated systems that autonomously navigate and adapt to various environments. The application of LLMs in robot design and control processes represents a significant leap forward, offering the potential to create robots that are more efficient and capable of performing complex tasks with greater autonomy.

Designing effective robot morphologies presents substantial challenges due to the expansive design space and the traditional reliance on human expertise for prototyping and testing. Creating, testing, and iterating on robot designs takes time and effort. Engineers must navigate a vast array of potential configurations, which requires significant computational resources and time. This bottleneck in the design process highlights the need for innovative approaches to streamline and optimize robot design, reducing the dependency on manual intervention and speeding up the development cycle.

Current methods for robot design typically combine manual prototyping, iterative testing, and evolutionary algorithms to explore different configurations. While proven effective, these approaches are limited by the extensive computational resources and time required to navigate the design space. Evolutionary algorithms, for example, simulate numerous iterations of robot designs to find optimal configurations, but this process can be slow and resource-intensive. This traditional approach underscores the need for more efficient methods to accelerate the design process while maintaining or enhancing the quality of the resulting robots.

Researchers from the University of Warsaw, IDEAS NCBR, and Nomagic introduced RoboMorph, a groundbreaking framework that integrates LLMs, evolutionary algorithms, and reinforcement learning (RL) to automate the design of modular robots. This innovative method leverages the capabilities of LLMs to efficiently navigate the extensive robot design space by representing each robot design as a grammar. RoboMorph’s framework includes automatic prompt design and an RL-based control algorithm, which iteratively improves robot designs through feedback loops. Integrating these advanced techniques allows RoboMorph to generate diverse and optimized robot designs more efficiently than traditional methods.

RoboMorph operates by representing robot designs as grammars, which LLMs use to explore the design space. Each iteration begins with a binary tournament selection algorithm that selects half of the population for mutation. The selected prompts are mutated, and the new prompts are used to generate a new batch of robot designs. These designs are compiled into XML files and evaluated using the MuJoCo physics simulator to obtain fitness scores. This iterative process enables RoboMorph to improve robot designs over successive generations, showcasing significant morphological advancements. Evolutionary algorithms ensure a diverse and balanced selection of designs, preventing premature convergence and promoting the exploration of novel configurations.
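A highly simplified sketch of this loop is shown below; the prompt mutation, design generation, and MuJoCo-based fitness evaluation are passed in as stub functions because the paper’s actual components are not reproduced here, and the survivor strategy is an assumption.

## Simplified sketch of a RoboMorph-style evolutionary loop (components stubbed)
import random

def binary_tournament(population, fitness):
    # Return the fitter of two randomly chosen prompts
    a, b = random.sample(population, 2)
    return a if fitness[a] >= fitness[b] else b

def evolve(population, generations, mutate_prompt, generate_design, evaluate_fitness):
    for _ in range(generations):
        # Generate a robot design from each prompt and score it (e.g., RL reward in simulation)
        fitness = {p: evaluate_fitness(generate_design(p)) for p in population}
        # Binary tournament selection picks half of the population for mutation
        parents = [binary_tournament(population, fitness) for _ in range(len(population) // 2)]
        offspring = [mutate_prompt(p) for p in parents]
        # Keep the best prompts and add the mutated offspring (survivor strategy assumed)
        survivors = sorted(population, key=fitness.get, reverse=True)[:len(population) - len(offspring)]
        population = survivors + offspring
    return population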

The performance of RoboMorph was evaluated through experiments involving ten seeds, ten evolutions, and a population size of four. Each iteration involved the mutation of prompts and the application of the RL-based control algorithm to compute fitness scores. The fitness score, the average reward over 15 random rollouts, indicated a positive trend with each iteration. RoboMorph significantly improved robot morphology, generating optimized designs that outperformed traditional methods. The top-ranked robot designs, tailored for flat terrains, showed that longer body lengths and consistent limb dimensions contributed to improved locomotion and stability.

RoboMorph presents a promising approach to addressing the complexities of robot design. By integrating generative methods, evolutionary algorithms, and RL-based control, the researchers have developed a framework that streamlines the design process and enhances the adaptability and functionality of robots. The framework’s ability to efficiently generate and optimize robot designs demonstrates its potential for real-world applications. Future research will focus on scaling experiments, refining mutation operators, expanding the design space, and exploring diverse environments. The ultimate goal is to further integrate the generative capabilities of LLMs with low-cost manufacturing techniques to design robots suitable for a wide range of applications.

In conclusion, RoboMorph leverages the power of LLMs and evolutionary algorithms to create a framework that streamlines the design process and produces optimized robot morphologies. This approach addresses the limitations of earlier methods and offers a promising pathway for developing more efficient and capable robots. The results of RoboMorph’s experiments highlight its potential to revolutionize robot designs.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post RoboMorph: Evolving Robot Design with Large Language Models and Evolutionary Machine Learning Algorithms for Enhanced Efficiency and Performance appeared first on MarkTechPost.

Samsung Researchers Introduce LoRA-Guard: A Parameter-Efficient Guardr …

Large Language Models (LLMs) have demonstrated remarkable proficiency in language generation tasks. However, their training process, which involves unsupervised learning from extensive datasets followed by supervised fine-tuning, presents significant challenges. The primary concern stems from the nature of pre-training datasets, such as Common Crawl, which often contain undesirable content. Consequently, LLMs inadvertently acquire the ability to generate offensive language and potentially harmful advice. This unintended capability poses a serious safety risk, as these models can produce coherent responses to user inputs without proper content filtering. The challenge for researchers lies in developing methods to maintain the LLMs’ language generation capabilities while effectively mitigating the production of unsafe or unethical content.

Existing attempts to overcome the safety concerns in LLMs have primarily focused on two approaches: safety tuning and the implementation of guardrails. Safety tuning aims to optimize models to respond in a manner aligned with human values and safety considerations. However, these chat models remain vulnerable to jailbreak attacks, which employ various strategies to circumvent safety measures. These strategies include using low-resource languages, refusal suppression, privilege escalation, and distractions.

To counter these vulnerabilities, researchers have developed guardrails to monitor exchanges between chat models and users. One notable approach involves the use of model-based guardrails, which are separate from the chat models themselves. These guard models are designed to flag harmful content and serve as a critical component of AI safety stacks in deployed systems.

However, the current methods face significant challenges. The use of separate guard models introduces substantial computational overhead, making them impractical in low-resource settings. Also, the learning process is inefficient due to the considerable overlap in language understanding abilities between chat models and guard models, as both need to perform their respective tasks of response generation and content moderation effectively.

Samsung R&D Institute researchers present LoRA-Guard, an innovative system that integrates chat and guard models, addressing efficiency issues in LLM safety. It uses a low-rank adapter on a chat model’s transformer backbone to detect harmful content. The system operates in dual modes: activating LoRA parameters for guardrailing with a classification head, and deactivating them for normal chat functions. This approach significantly reduces parameter overhead by 100-1000x compared to previous methods, making deployment feasible in resource-constrained settings. LoRA-Guard has been evaluated on various datasets, including zero-shot scenarios, and its model weights have been published to support further research.

LoRA-Guard’s architecture is designed to efficiently integrate guarding capabilities into a chat model. It uses the same embedding and tokenizer for both the chat model C and the guard model G. The key innovation lies in the feature map: while C uses the original feature map f, G employs f’ with LoRA adapters attached to f. G also utilizes a separate output head hguard for classification into harmfulness categories.

This dual-path design allows for seamless switching between chat and guard functions. By activating or deactivating LoRA adapters and switching between output heads, the system can perform either task without performance degradation. The parameter sharing between paths significantly reduces the computational overhead, with the guard model typically adding only a fraction (often 1/1000th) of the original model’s parameters.

LoRA-Guard is trained through supervised fine-tuning of f’ and hguard on labeled datasets, keeping the chat model’s parameters frozen. This approach utilizes the chat model’s existing knowledge while learning to detect harmful content efficiently.
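The dual-path idea can be sketched in plain PyTorch as follows; this is an illustrative toy module, not the released LoRA-Guard code, and the hidden size, rank, and class names are placeholders.

## Toy sketch of a dual-path LoRA layer with a separate guard head (illustrative)
import torch.nn as nn

class DualPathLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base  # frozen weight shared with the chat model
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        self.guard_mode = False  # False: chat path f, True: guard path f'

    def forward(self, x):
        out = self.base(x)
        if self.guard_mode:
            out = out + self.lora_b(self.lora_a(x))  # add the low-rank update only when guarding
        return out

# Separate output head used only on the guard path (sizes are placeholders)
hidden_size, num_harm_categories = 4096, 2
hguard = nn.Linear(hidden_size, num_harm_categories)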

LoRA-Guard demonstrates exceptional performance on multiple datasets. On ToxicChat, it outperforms baselines in AUPRC while using significantly fewer parameters (up to 1,500 times fewer than fully fine-tuned models). For OpenAIModEval, it matches alternative methods with 100 times fewer parameters. Cross-domain evaluations reveal interesting asymmetries: models trained on ToxicChat generalize well to OpenAIModEval, but the reverse shows considerable performance drops. This asymmetry might be due to differences in dataset characteristics or the presence of jailbreak samples in ToxicChat. Overall, LoRA-Guard proves to be an efficient and effective solution for content moderation in language models.

LoRA-Guard represents a significant leap in moderated conversational systems, reducing guardrailing parameter overhead by 100-1000 times while maintaining or improving performance. This efficiency is achieved through knowledge sharing and parameter-efficient learning mechanisms. Its dual-path design prevents catastrophic forgetting during fine-tuning, a common issue in other approaches. By dramatically reducing training time, inference time, and memory requirements, LoRA-Guard emerges as a crucial development for implementing robust content moderation in resource-constrained environments. As on-device LLMs become more prevalent, LoRA-Guard paves the way for safer AI interactions across a broader range of applications and devices.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Samsung Researchers Introduce LoRA-Guard: A Parameter-Efficient Guardrail Adaptation Method that Relies on Knowledge Sharing between LLMs and Guardrail Models appeared first on MarkTechPost.

A Decade of Transformation: How Deep Learning Redefined Stereo Matchin …

A fundamental topic in computer vision for nearly half a century, stereo matching involves calculating dense disparity maps from two rectified images. It plays a critical role in many applications, including autonomous driving, robotics, and augmented reality, among many others.

According to their cost-volume computation and optimization methodologies, existing surveys categorize end-to-end architectures into 2D and 3D classes. These surveys also highlight the still unanswered problems, offering significant insights into this rapid change. Since then, new approaches and paradigms have emerged, spurred by innovations in other branches of deep learning, and the domain has seen tremendous growth. Examples of the field’s evolution, such as iterative refinement and transformer-based architectures, show the potential for additional gains in accuracy and efficiency. As deep stereo matching has progressed, numerous problems have surfaced, notwithstanding the outstanding accomplishments. The inability to generalize, especially when dealing with domain transitions between synthetic and real data, is a major problem mentioned in earlier surveys.

Prior surveys conducted in the late 2010s addressed the initial phase of this revolution, but the area has witnessed even more revolutionary progress in the subsequent five years of study. A new study by the University of Bologna team, a leading group in the field, presents:

A detailed analysis of recent advancements in deep stereo matching, specifically looking at innovative paradigm shifts such as the use of transformer-based architectures and ground-breaking architectural designs like RAFT-Stereo that have changed the game in the 2020s

An analysis of the key problems that remain despite these advancements, a categorization of those problems, and a look at the best methods for addressing them.

The key findings from their paper are highlighted as follows:

Architecture Design: The benchmark findings demonstrate that RAFT-Stereo’s design approach is revolutionary, significantly increasing resilience to domain changes. The team anticipates that more frameworks will follow this new paradigm, since it was adopted by most of the recent frameworks launched in the months before this study. However, the search for innovative and efficient designs continues, as shown by the most recent proposals yielding ever-improving results.

Beyond RGB: Utilizing thermal, multispectral, or event camera images as input to stereo-matching networks is an emerging concept that has grown in popularity over the last five years, injecting new ideas into an established but dynamic field. While this trend is encouraging, these emerging tasks still need to be further developed and improved.

Some of the problems predicted by earlier studies still exist despite the numerous triumphs in dealing with them. The Booster dataset demonstrated that high-resolution images are still challenging to process and that non-Lambertian objects remain a crucial open issue, mostly because of a shortage of training data and of methods capable of dealing with them. Likewise, difficult weather conditions can still be a problem.

The team concludes that, although visual foundation models have been developed for other computer vision tasks, no such effort has yet been made for stereo matching, whereas single-image depth estimation has already seen some.

By revealing the most effective methods currently in use, this work not only clarifies the existing obstacles but also suggests promising avenues for further study. Newcomers and seasoned pros alike can find useful information and inspiring ideas in this survey, which the team hopes will ignite their passion for pushing the boundaries of deep stereo matching.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post A Decade of Transformation: How Deep Learning Redefined Stereo Matching in the Twenties appeared first on MarkTechPost.

5 Levels in AI by OpenAI: A Roadmap to Human-Level Problem Solving Capabilities

In an effort to track its progress towards creating Artificial Intelligence (AI) that can surpass human performance, OpenAI has launched a new classification system. According to a Bloomberg article, OpenAI has recently discussed a five-level framework to clarify its goals for AI safety and future improvements.

Level 1: Conversational AI

AI programs such as ChatGPT can converse intelligibly with people at a basic level. At this point in the development of AI, chatbots can comprehend and react to human language, which makes them helpful for various activities like basic information retrieval, customer support, and informal conversation.

Level 2: Reasoners

OpenAI refers to the second tier as “Reasoners.” These AI systems can solve basic problems as well as a human with doctorate-level education, without relying on external tools. At the discussion, the team showcased the GPT-4 model’s enhanced capabilities, demonstrating human-like reasoning abilities and suggesting a possible move toward this second tier.

Level 3: Agents

The AI systems in the third tier, referred to as “Agents,” are able to act on behalf of people for extended periods of time. These agents greatly reduce the requirement for human involvement by being able to undertake tasks that call for sustained effort and decision-making on their own.

Level 4: Innovators

AI systems that reach Level 4 are called “Innovators.” These models can help generate fresh concepts and discoveries, collaborating with people to drive creative and technical breakthroughs. At this stage, AI’s role has shifted from simply obeying commands to actively contributing to creativity.

Level 5: Organizations

The highest level in OpenAI’s classification is Level 5, or “Organizations.” At this stage, AI is capable of overseeing all organizational functions, including strategic decision-making and department-wide process optimization. Here, AI is viewed as more than just a tool; it is a crucial component of company strategy and execution, able to manage intricate organizational duties.

OpenAI’s new categorization system is still under development; the framework is intended to help both internal and external stakeholders understand the company’s trajectory towards increasingly sophisticated AI systems.

OpenAI’s goal has always been to develop artificial general intelligence (AGI): AI capable of outperforming humans at the majority of tasks. Although such systems do not yet exist, OpenAI CEO Sam Altman has voiced optimism that AGI might be achieved this decade. The new five-level approach offers a clear path for monitoring progress towards this challenging objective.

OpenAI’s five-level classification scheme echoes a comparable methodology from Google DeepMind researchers, who in a previous study described five levels of AI capability, including “Expert” and “Superhuman” categories. That grading scheme, in turn, is similar to the one the auto industry uses to rate the degree of automation in self-driving cars.

The frameworks developed by OpenAI and Google DeepMind reflect how rapidly AI capabilities are evolving and how actively AGI is being pursued. These structured methodologies offer useful benchmarks for tracking progress and directing future improvements as AI technology develops.

In conclusion, OpenAI’s five-level classification system marks a significant step in the company’s effort to create AI that can outperform humans. By clearly defining the route from conversational AI to organizational management, OpenAI offers a roadmap for AI development. With each step the company takes towards greater AI capability, the potential for AI to transform sectors and boost human productivity grows more real.
The post 5 Levels in AI by OpenAI: A Roadmap to Human-Level Problem Solving Capabilities appeared first on MarkTechPost.

NVIDIA Researchers Introduce MambaVision: A Novel Hybrid Mamba-Transformer Backbone Specifically Tailored for Vision Applications

Computer vision enables machines to interpret and understand visual information from the world. This encompasses a variety of tasks, such as image classification, object detection, and semantic segmentation. Innovations in this area have been propelled by the development of advanced neural network architectures, particularly Convolutional Neural Networks (CNNs) and, more recently, Transformers. These models have demonstrated significant potential in processing visual data, but there remains a continuous need to improve their ability to balance computational efficiency with capturing both local and global visual contexts.

A central challenge in computer vision is the efficient modeling and processing of visual data, which requires understanding both local details and broader contextual information within images. Traditional models often struggle to strike this balance. CNNs, while efficient at handling local spatial relationships, may overlook broader contextual information. Transformers, on the other hand, leverage self-attention mechanisms to capture global context but can be computationally intensive due to their quadratic complexity relative to sequence length. This trade-off between efficiency and the ability to capture context has been a significant obstacle to advancing the performance of vision models.

Existing approaches primarily utilize CNNs for their effectiveness in handling local spatial relationships. However, these models may only partially capture the broader contextual information necessary for more complex vision tasks. Transformers have been applied to vision tasks to address this issue, utilizing self-attention mechanisms to enhance the understanding of the global context. Despite these advancements, both CNNs and Transformers have inherent limitations. CNNs can miss the broader context, while Transformers are computationally expensive and challenging to train and deploy efficiently.

Researchers at NVIDIA have introduced MambaVision, a novel hybrid model that combines the strengths of Mamba and Transformer architectures. This new approach integrates CNN-based layers with Transformer blocks to enhance the modeling capacity for vision applications. The MambaVision family includes various model configurations to meet different design criteria and application needs, providing a flexible and powerful tool for various vision tasks. The introduction of MambaVision represents a significant step forward in the development of hybrid models for computer vision.

MambaVision employs a hierarchical architecture divided into four stages. The initial stages use CNN layers for rapid feature extraction, capitalizing on their efficiency in processing high-resolution features. The later stages incorporate MambaVision and Transformer blocks to effectively capture both short- and long-range dependencies. This innovative design allows the model to handle global context more efficiently than traditional approaches. The redesigned Mamba blocks, which now include self-attention mechanisms, are central to this improvement, enabling the model to process visual data with greater accuracy and throughput.
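
The following is a minimal PyTorch sketch of this kind of four-stage hierarchy: two convolutional stages followed by two token-mixing stages that end in self-attention. It is not the official MambaVision code; the module names, channel sizes, and the gated-MLP stand-in for the Mamba mixer are illustrative assumptions only.

```python
import torch
import torch.nn as nn


class ConvStage(nn.Module):
    """Early stage: fast CNN feature extraction at high resolution."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)


class MixerStage(nn.Module):
    """Later stage: token mixing plus self-attention over flattened features.

    The gated MLP below is a simplified stand-in for a Mamba-style mixer.
    """

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        tokens = tokens + self.mixer(self.norm1(tokens))
        q = self.norm2(tokens)
        attn_out, _ = self.attn(q, q, q)               # global self-attention
        tokens = tokens + attn_out
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class HybridBackbone(nn.Module):
    """Four-stage hierarchy: two CNN stages, then two mixer/attention stages."""

    def __init__(self, num_classes=1000):
        super().__init__()
        self.stage1 = ConvStage(3, 64)
        self.stage2 = ConvStage(64, 128)
        self.down = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)
        self.stage3 = MixerStage(256)
        self.stage4 = MixerStage(256)
        self.head = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.stage2(self.stage1(x))
        x = self.down(x)
        x = self.stage4(self.stage3(x))
        return self.head(x.mean(dim=(2, 3)))           # global average pool -> logits


model = HybridBackbone()
logits = model(torch.randn(1, 3, 224, 224))            # shape: (1, 1000)
```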

The performance of MambaVision is notable, achieving state-of-the-art results on the ImageNet-1K dataset. For example, the MambaVision-B model achieves a Top-1 accuracy of 84.2%, surpassing other leading models such as ConvNeXt-B and Swin-B, which reached 83.8% and 83.5%, respectively. In addition to its high accuracy, MambaVision demonstrates superior image throughput, with the MambaVision-B model processing images significantly faster than its competitors. In downstream tasks like object detection and semantic segmentation on the MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones, showcasing its versatility and efficiency. For instance, MambaVision models achieve 46.4 box AP and 41.8 mask AP, higher than models such as ConvNeXt-T and Swin-T.

A comprehensive ablation study supports these findings, demonstrating the effectiveness of MambaVision’s design choices. The researchers improved accuracy and image throughput by redesigning the Mamba block to be more suitable for vision tasks. The study explored various integration patterns of Mamba and Transformer blocks, revealing that incorporating self-attention blocks in the final layers significantly enhances the model’s ability to capture global context and long-range spatial dependencies. This design produces a richer feature representation and better performance across various vision tasks.

In conclusion, MambaVision represents a significant advancement in vision modeling by combining the strengths of CNNs and Transformers into a single, hybrid architecture. This approach effectively addresses the limitations of existing models by enhancing understanding of local and global contexts, leading to superior performance in various vision tasks. The results of this study indicate a promising direction for future developments in computer vision, potentially setting a new standard for hybrid vision models.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
The post NVIDIA Researchers Introduce MambaVision: A Novel Hybrid Mamba-Transformer Backbone Specifically Tailored for Vision Applications appeared first on MarkTechPost.

Patronus AI Introduces Lynx: A SOTA Hallucination Detection LLM that Outperforms GPT-4o and All State-of-the-Art LLMs on RAG Hallucination Tasks

Patronus AI has announced the release of Lynx. This cutting-edge hallucination detection model promises to outperform existing solutions such as GPT-4, Claude-3-Sonnet, and other models used as judges in closed and open-source settings. This groundbreaking model, which marks a significant advancement in artificial intelligence, was introduced with the support of key integration partners, including Nvidia, MongoDB, and Nomic.

Hallucination in large language models (LLMs) refers to generating information that is either unsupported by or contradictory to the provided context. This poses serious risks in applications where accuracy is paramount, such as medical diagnosis or financial advising. Traditional techniques like Retrieval Augmented Generation (RAG) aim to mitigate these hallucinations, but they are not always successful. Lynx addresses these shortcomings with unprecedented accuracy.

One of Lynx’s key differentiators is its performance on the HaluBench, a comprehensive hallucination evaluation benchmark consisting of 15,000 samples from various real-world domains. Lynx has superior performance in detecting hallucinations across diverse fields, including medicine and finance. For instance, in the PubMedQA dataset, Lynx’s 70 billion parameter version was 8.3% more accurate than GPT-4 at identifying medical inaccuracies. This level of precision is critical in ensuring the reliability of AI-driven solutions in sensitive areas.


The robustness of Lynx is further evidenced by its performance compared to other leading models. The 8 billion parameter version of Lynx outperformed GPT-3.5 by 24.5% on HaluBench and showed significant gains over Claude-3-Sonnet and Claude-3-Haiku by 8.6% and 18.4%, respectively. These results highlight Lynx’s ability to handle complex hallucination detection tasks with a smaller model, making it more accessible and efficient for various applications.

The development of Lynx involved several innovative approaches, including Chain-of-Thought reasoning, which enables the model to perform advanced task reasoning. This approach has significantly enhanced Lynx’s capability to catch hard-to-detect hallucinations, making its outputs more explainable and interpretable, akin to human reasoning. This feature is particularly important as it allows users to understand the model’s decision-making process, increasing trust in its outputs.
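
A rough Python sketch of what such a chain-of-thought judge looks like in practice is shown below. It uses a generic chat-completion API as the judge; the prompt wording, output schema, and model name are illustrative assumptions, not Patronus AI’s actual prompt or the Lynx weights.

```python
import json

from openai import OpenAI  # any chat-completion client would work similarly

JUDGE_PROMPT = """You are checking a RAG answer for hallucinations.

QUESTION: {question}
CONTEXT: {context}
ANSWER: {answer}

Think step by step: list each claim made in the answer and check whether the
context supports it. Then reply with a JSON object containing two keys:
"reasoning" (your step-by-step analysis) and "faithful" (true or false)."""


def judge_answer(question: str, context: str, answer: str, model: str = "gpt-4o") -> dict:
    """Ask an LLM judge whether `answer` is grounded in `context`, with CoT reasoning."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force parseable JSON output
    )
    return json.loads(response.choices[0].message.content)


verdict = judge_answer(
    question="What daily dose was tested in the trial?",
    context="The trial evaluated a 10 mg daily dose of the drug.",
    answer="The trial evaluated a 50 mg daily dose.",
)
print(verdict["faithful"])  # expected: False, since the answer contradicts the context
```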


Lynx is fine-tuned from the Llama-3-70B-Instruct model; it produces a score and can also explain the reasoning behind it, providing a level of interpretability crucial for real-world applications. The model’s integration with Nvidia’s NeMo-Guardrails means it can be deployed as a hallucination detector in chatbot applications, enhancing the reliability of AI interactions.

Patronus AI has released the HaluBench dataset and evaluation code for public access, enabling researchers and developers to explore and contribute to this field. The dataset is available on Nomic Atlas, a visualization tool that helps identify patterns and insights from large-scale datasets, making it a valuable resource for further research and development.

In conclusion, with Lynx, Patronus AI has taken a major step toward AI models capable of detecting and mitigating hallucinations. With its superior performance, innovative reasoning capabilities, and strong support from leading technology partners, Lynx is positioned to become a cornerstone of the next generation of AI applications. This release underscores Patronus AI’s commitment to advancing AI technology and deploying it effectively in critical domains.

Check out the Paper and Blog. All credit for this research goes to the researchers of this project.
The post Patronus AI Introduces Lynx: A SOTA Hallucination Detection LLM that Outperforms GPT-4o and All State-of-the-Art LLMs on RAG Hallucination Tasks appeared first on MarkTechPost.

KAIST Researchers Introduce CHOP: Enhancing EFL Students’ Oral Presentation Skills with Real-Time, Personalized Feedback Using ChatGPT and Whisper Technologies

The field of English as a Foreign Language (EFL) focuses on equipping non-native speakers with the skills to communicate effectively in English. One critical aspect of this education is developing students’ oral presentation abilities. These skills are important for academic and professional success, enabling students to convey their ideas clearly and confidently. Effective oral presentation practice helps EFL learners enhance their communication skills, preparing them for real-world scenarios.

EFL students often face significant challenges in delivering oral presentations. Issues such as speech anxiety, limited vocabulary, and inadequate feedback from traditional teaching methods impede their ability to present effectively. These obstacles highlight the need for innovative tools that provide more interactive and personalized assistance, helping students overcome these barriers and improve their presentation skills.

Existing methods to support EFL students include workshops and digital tools that offer increased practice opportunities and feedback. Digital video recordings and video-based blogs have been used to enhance students’ presentation skills. However, these approaches often lack the personalized, real-time feedback needed to meet the dynamic needs of EFL learners, leaving a gap in effective support mechanisms.

Researchers from KAIST in South Korea introduced a novel tool called CHOP (ChatGPT-based interactive platform for oral presentation practice). The platform leverages ChatGPT to provide EFL students with personalized feedback on their oral presentations. By integrating advanced AI technologies, CHOP aims to enhance students’ practice sessions and help them improve their presentation skills in a structured and interactive manner.

CHOP utilizes technologies like ChatGPT and Whisper for real-time feedback generation. The platform evaluates presentations based on grammar, vocabulary, content, organization, and delivery criteria. Students receive detailed feedback on their rehearsals, enabling them to identify and correct specific errors. This method enhances their presentation skills comprehensively, addressing individual areas of improvement with precision.
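
A minimal sketch of that kind of pipeline, using the OpenAI API for both transcription and rubric-based feedback, is shown below; the model names, rubric wording, and file path are illustrative assumptions, not CHOP’s actual implementation.

```python
from openai import OpenAI

RUBRIC = "grammar, vocabulary, content, organization, and delivery"


def feedback_on_rehearsal(audio_path: str) -> str:
    """Transcribe a rehearsal recording, then ask an LLM for rubric-based feedback."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # 1) Speech-to-text with a Whisper model.
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        ).text

    # 2) Rubric-based feedback on the transcript.
    prompt = (
        "You are coaching an EFL student on an oral presentation. "
        f"Give specific, constructive feedback on {RUBRIC}, quoting the exact "
        "phrases that need correction.\n\nTranscript:\n" + transcript
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


print(feedback_on_rehearsal("rehearsal_segment.wav"))  # file path is a placeholder
```

In a classroom setting, the same rubric prompt could be applied to individual segments (partial rehearsals) or to the full recording, mirroring the two practice modes the study describes.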


The effectiveness of CHOP was tested with 13 EFL students over two weeks. During this period, the platform collected interaction data, including rehearsal audio, feedback generated by ChatGPT, user ratings, chat logs, and platform logs. Expert evaluations of the feedback quality revealed that CHOP excelled in accuracy and relevance, with scores of 5.66/7 and 5.8/7, respectively. The platform was particularly effective in improving vocabulary, receiving the highest accuracy score of 6.17/7 from experts and a helpfulness score of 6.02/7 from students.

Despite its strengths, CHOP showed limitations in feedback on delivery, scoring lower in detail and helpfulness (4.81/7). Both experts and students noted ambiguities, particularly in understanding the rationale behind differing delivery scores. This limitation stems from the limited interpretability of the SuperSpeech API used for assessing delivery, indicating a need for more transparent and detailed feedback mechanisms in this area.


Learners’ perceptions of CHOP were positive overall. The platform significantly boosted students’ confidence and reduced nervousness, with an average rating of 5.26/7. Students also reported improved self-assessment skills and greater awareness of their strengths and weaknesses, with ratings of 5.27/7. These insights highlight CHOP’s potential to enhance learners’ confidence and self-evaluation capabilities, addressing key challenges EFL students face in oral presentations.

Usage patterns revealed that students preferred partial rehearsals for detailed, segment-by-segment refinement, followed by full rehearsals for a comprehensive review. This approach allowed them to hone minor details while gaining broader insight into their presentation skills. However, challenges emerged in interpreting feedback for long rehearsals, pointing to a need to improve the context and relevance of the feedback.

In conclusion, the CHOP platform offers a promising solution to the challenges faced by EFL students in delivering oral presentations. By providing personalized, real-time feedback, CHOP helps students overcome speech anxiety and improve their presentation skills comprehensively. The platform’s integration of advanced AI technologies marks a significant advancement in EFL education, offering students a more interactive and effective learning experience.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post KAIST Researchers Introduce CHOP: Enhancing EFL Students’ Oral Presentation Skills with Real-Time, Personalized Feedback Using ChatGPT and Whisper Technologies appeared first on MarkTechPost.