This AI Research Introduces a Deep Learning Model that can Steal Data …

As deep learning continues to advance and microphones become more ubiquitous, along with the growing popularity of online services through personal devices, the potential for acoustic side-channel attacks to impact keyboards is on the rise.

A team of researchers from the UK have trained an AI model that steals data from the system. The model has shown a significant accuracy of 95%. Further, when they demonstrated this deep learning model on a Zoom call, they noted an accuracy of 93%.

The researchers discovered that wireless keyboards emit detectable and interpretable electromagnetic (EM) signals through their studies. However, a more widespread emission, which is abundant and simpler to identify, comes in keystroke sounds. Therefore, they used keystroke acoustics for their research. Further, the researchers studied the keystroke acoustics on laptops since laptops are more transportable than desktop computers and, therefore, more available in public areas where keyboard acoustics may be overheard. Also, Laptops are non-modular, which implies that identical laptop models will come equipped with the same type of keyboard, leading to similar keyboard signals being emitted.

This study introduced self-attention transformer layers in the context of attacking keyboards for the first time. The effectiveness of their newly developed attack was then assessed in real-world scenarios. Specifically, they tested the attack on laptop keyboards in the same room as the attacker’s microphone (using a mobile device). Also, they evaluated the attack on laptop keystrokes during a Zoom call.

In the setup process, the team employed an iPhone microphone and trained the AI using keystrokes. This surprisingly straightforward approach highlights the potential ease with which passwords and classified data could be compromised, even without specialized equipment.

A MacBook Pro and an iPhone 13 mini were used for the experimentation. The iPhone was positioned 17cm away from the laptop on a folded micro-fiber cloth to minimize desk vibrations. To capture keystrokes, the researchers leveraged the built-in recording function of the Zoom call software. On the second laptop dataset, which they referred to as the ‘Zoom-recorded data,’ they captured keystrokes by using the built-in feature of the Zoom video-conferencing application. 

The results that the researchers got were impressive. They found out that when trained on keystrokes recorded by a nearby phone, the model achieves an accuracy of 95%. Further, the model showed an accuracy of 93% when trained on keystrokes recorded using the video-conferencing software Zoom. The researchers emphasize that their results prove the practicality of side-channel attacks via off-the-shelf equipment and algorithms.

In the future, the researchers are looking to develop more robust techniques to extract individual keystrokes from a single recording. This is crucial because all ASCA methods rely on accurately isolating keystrokes for proper classification. Also, using smart speakers to record keystrokes for classification can be used, as these devices remain always-on and are present in many homes.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 27k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

The post This AI Research Introduces a Deep Learning Model that can Steal Data by Listening to Keystrokes Recorded by a nearby Phone with 95% Accuracy appeared first on MarkTechPost.

Airbnb Researchers Develop Chronon: A Framework for Developing Product …

In the ever-evolving landscape of machine learning, feature management has emerged as a key pain point for ML Engineers at Airbnb. While they strive to create innovative models for various products, they often find themselves spending a significant amount of time dealing with infrastructure complexities instead of focusing solely on their models. Airbnb recognized the need for a solution that could streamline feature data management, provide real-time updates, and ensure consistency between training and production environments.

Enter Chronon, a powerful API designed by the Airbnb team to address these challenges head-on. Chronon empowers ML practitioners to define features and centralize data computation for model training and production inference, guaranteeing accuracy and consistency throughout the process.

Ingesting Data from Diverse Sources

Chronon can ingest data from various sources, including event streams, fact/dimension tables in the data warehouse, table snapshots, Change Data Streams, and more. Whether real-time event data or historical snapshots, Chronon handles it all seamlessly.

Transforming Data with Flexibility

With Chronon’s SQL-like transformations and time-based aggregations, ML practitioners have the freedom to process data with ease. Whether standard aggregation or sophisticated windowing techniques, Chronon’s Python API empowers users to perform complex computations while ensuring full flexibility and composability.

Online and Offline Results Generation

Chronon caters to both online and offline data generation requirements. Chronon has you covered for low-latency end-points serving feature data or Hive tables for training data. The “Accuracy” parameter allows users to decide the update frequency, making it suitable for a range of use cases, from real-time updates to daily refreshes.

Understanding Accuracy and Data Sources

Chronon’s unique approach to accuracy enables users to express the desired update frequency for derived data. Whether near real-time or daily intervals, Chronon’s “Temporal” or “Snapshot” accuracy models ensure that computations align with each use-case’s specific requirements.

Data sources are essential components in the Chronon ecosystem. It supports three primary data ingestion patterns:

Event data sources for timestamped activity

Entity data sources for attribute metadata related to business entities

Cumulative Event Sources for tracking historical changes in slowly changing dimensions

Computation Contexts and Types

Chronon operates in two distinct contexts: online and offline. Online computations serve applications with low latency, while offline computations are performed on warehouse datasets using batch jobs. All Chronon definitions fall into three categories: GroupBy for aggregation, Join for combining data from various GroupBy computations, and StagingQuery for custom Spark SQL computations.

Understanding Aggregations for Powerful Insights

Chronon’s GroupBy aggregations provide various extensions to traditional SQL group-by functionalities. Users can leverage Windows for time-bound aggregations, bucketing for additional granularity, and auto-unpack to handle nested data within an array. Additionally, time-based aggregations offer even more flexibility to create insightful features for ML models.

A Seamless Integration for Airbnb’s ML Practitioners

Chronon has proven to be a game-changer for Airbnb’s ML practitioners. Chronon enables users to generate thousands of features to power ML models effortlessly by simplifying feature engineering. This revolutionary solution has freed ML Engineers from the burden of manual pipeline implementation, allowing them to focus on building innovative models that cater to ever-changing user behaviors and product demands.

In conclusion, Chronon has become an indispensable tool in Airbnb’s machine-learning arsenal. Providing a comprehensive feature management solution has elevated the productivity and scalability of feature engineering, empowering ML practitioners to deliver cutting-edge models and enhance the Airbnb experience for millions of users.

Check out the Reference Article. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 27k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

The post Airbnb Researchers Develop Chronon: A Framework for Developing Production-Grade Features for Machine Learning Models appeared first on MarkTechPost.

Host the Spark UI on Amazon SageMaker Studio

Amazon SageMaker offers several ways to run distributed data processing jobs with Apache Spark, a popular distributed computing framework for big data processing.
You can run Spark applications interactively from Amazon SageMaker Studio by connecting SageMaker Studio notebooks and AWS Glue Interactive Sessions to run Spark jobs with a serverless cluster. With interactive sessions, you can choose Apache Spark or Ray to easily process large datasets, without worrying about cluster management.
Alternately, if you need more control over the environment, you can use a pre-built SageMaker Spark container to run Spark applications as batch jobs on a fully managed distributed cluster with Amazon SageMaker Processing. This option allows you to select several types of instances (compute optimized, memory optimized, and more), the number of nodes in the cluster, and the cluster configuration, thereby enabling greater flexibility for data processing and model training.
Finally, you can run Spark applications by connecting Studio notebooks with Amazon EMR clusters, or by running your Spark cluster on Amazon Elastic Compute Cloud (Amazon EC2).
All these options allow you to generate and store Spark event logs to analyze them through the web-based user interface commonly named the Spark UI, which runs a Spark History Server to monitor the progress of Spark applications, track resource usage, and debug errors.
In this post, we share a solution for installing and running Spark History Server on SageMaker Studio and accessing the Spark UI directly from the SageMaker Studio IDE, for analyzing Spark logs produced by different AWS services (AWS Glue Interactive Sessions, SageMaker Processing jobs, and Amazon EMR) and stored in an Amazon Simple Storage Service (Amazon S3) bucket.
Solution overview
The solution integrates Spark History Server into the Jupyter Server app in SageMaker Studio. This allows users to access Spark logs directly from the SageMaker Studio IDE. The integrated Spark History Server supports the following:

Accessing logs generated by SageMaker Processing Spark jobs
Accessing logs generated by AWS Glue Spark applications
Accessing logs generated by self-managed Spark clusters and Amazon EMR

A utility command line interface (CLI) called sm-spark-cli is also provided for interacting with the Spark UI from the SageMaker Studio system terminal. The sm-spark-cli enables managing Spark History Server without leaving SageMaker Studio.

The solution consists of shell scripts that perform the following actions:

Install Spark on the Jupyter Server for SageMaker Studio user profiles or for a SageMaker Studio shared space
Install the sm-spark-cli for a user profile or shared space

Install the Spark UI manually in a SageMaker Studio domain
To host Spark UI on SageMaker Studio, complete the following steps:

Choose System terminal from the SageMaker Studio launcher.

Run the following commands in the system terminal:

curl -LO https://github.com/aws-samples/amazon-sagemaker-spark-ui/releases/download/v0.1.0/amazon-sagemaker-spark-ui-0.1.0.tar.gz
tar -xvzf amazon-sagemaker-spark-ui-0.1.0.tar.gz

cd amazon-sagemaker-spark-ui-0.1.0/install-scripts
chmod +x install-history-server.sh
./install-history-server.sh

The commands will take a few seconds to complete.

When the installation is complete, you can start the Spark UI by using the provided sm-spark-cli and access it from a web browser by running the following code:

sm-spark-cli start s3://DOC-EXAMPLE-BUCKET/<SPARK_EVENT_LOGS_LOCATION>
The S3 location where the event logs produced by SageMaker Processing, AWS Glue, or Amazon EMR are stored can be configured when running Spark applications.
For SageMaker Studio notebooks and AWS Glue Interactive Sessions, you can set up the Spark event log location directly from the notebook by using the sparkmagic kernel.
The sparkmagic kernel contains a set of tools for interacting with remote Spark clusters through notebooks. It offers magic (%spark, %sql) commands to run Spark code, perform SQL queries, and configure Spark settings like executor memory and cores.

For the SageMaker Processing job, you can configure the Spark event log location directly from the SageMaker Python SDK.

Refer to the AWS documentation for additional information:

For SageMaker Processing, refer to PySparkProcessor
For AWS Glue Interactive Sessions, refer to Configuring the Spark UI (console)
For Amazon EMR, refer to Configure an output location

You can choose the generated URL to access the Spark UI.

The following screenshot shows an example of the Spark UI.

You can check the status of the Spark History Server by using the sm-spark-cli status command in the Studio System terminal.

You can also stop the Spark History Server when needed.

Automate the Spark UI installation for users in a SageMaker Studio domain
As an IT admin, you can automate the installation for SageMaker Studio users by using a lifecycle configuration. This can be done for all user profiles under a SageMaker Studio domain or for specific ones. See Customize Amazon SageMaker Studio using Lifecycle Configurations for more details.
You can create a lifecycle configuration from the install-history-server.sh script and attach it to an existing SageMaker Studio domain. The installation is run for all the user profiles in the domain.
From a terminal configured with the AWS Command Line Interface (AWS CLI) and appropriate permissions, run the following commands:

curl -LO https://github.com/aws-samples/amazon-sagemaker-spark-ui/releases/download/v0.1.0/amazon-sagemaker-spark-ui-0.1.0.tar.gz
tar -xvzf amazon-sagemaker-spark-ui-0.1.0.tar.gz

cd amazon-sagemaker-spark-ui-0.1.0/install-scripts

LCC_CONTENT=`openssl base64 -A -in install-history-server.sh`

aws sagemaker create-studio-lifecycle-config
–studio-lifecycle-config-name install-spark-ui-on-jupyterserver
–studio-lifecycle-config-content $LCC_CONTENT
–studio-lifecycle-config-app-type JupyterServer
–query ‘StudioLifecycleConfigArn’

aws sagemaker update-domain
–region {YOUR_AWS_REGION}
–domain-id {YOUR_STUDIO_DOMAIN_ID}
–default-user-settings
‘{
“JupyterServerAppSettings”: {
“DefaultResourceSpec”: {
“LifecycleConfigArn”: “arn:aws:sagemaker:{YOUR_AWS_REGION}:{YOUR_STUDIO_DOMAIN_ID}:studio-lifecycle-config/install-spark-ui-on-jupyterserver”,
“InstanceType”: “system”
},
“LifecycleConfigArns”: [
“arn:aws:sagemaker:{YOUR_AWS_REGION}:{YOUR_STUDIO_DOMAIN_ID}:studio-lifecycle-config/install-spark-ui-on-jupyterserver”
]
}}’

After Jupyter Server restarts, the Spark UI and the sm-spark-cli will be available in your SageMaker Studio environment.
Clean up
In this section, we show you how to clean up the Spark UI in a SageMaker Studio domain, either manually or automatically.
Manually uninstall the Spark UI
To manually uninstall the Spark UI in SageMaker Studio, complete the following steps:

Choose System terminal in the SageMaker Studio launcher.

Run the following commands in the system terminal:

cd amazon-sagemaker-spark-ui-0.1.0/install-scripts

chmod +x uninstall-history-server.sh
./uninstall-history-server.sh

Uninstall the Spark UI automatically for all SageMaker Studio user profiles
To automatically uninstall the Spark UI in SageMaker Studio for all user profiles, complete the following steps:

On the SageMaker console, choose Domains in the navigation pane, then choose the SageMaker Studio domain.

On the domain details page, navigate to the Environment tab.
Select the lifecycle configuration for the Spark UI on SageMaker Studio.
Choose Detach.

Delete and restart the Jupyter Server apps for the SageMaker Studio user profiles.

Conclusion
In this post, we shared a solution you can use to quickly install the Spark UI on SageMaker Studio. With the Spark UI hosted on SageMaker, machine learning (ML) and data engineering teams can use scalable cloud compute to access and analyze Spark logs from anywhere and speed up their project delivery. IT admins can standardize and expedite the provisioning of the solution in the cloud and avoid proliferation of custom development environments for ML projects.
All the code shown as part of this post is available in the GitHub repository.

About the Authors
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.
Bruno Pistone is an AI/ML Specialist Solutions Architect for AWS based in Milan. He works with customers of any size, helping them understand their technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His field of expertice includes machine learning end to end, machine learning endustrialization, and generative AI. He enjoys spending time with his friends and exploring new places, as well as traveling to new destinations.

Deploy thousands of model ensembles with Amazon SageMaker multi-model …

Artificial intelligence (AI) adoption is accelerating across industries and use cases. Recent scientific breakthroughs in deep learning (DL), large language models (LLMs), and generative AI is allowing customers to use advanced state-of-the-art solutions with almost human-like performance. These complex models often require hardware acceleration because it enables not only faster training but also faster inference when using deep neural networks in real-time applications. GPUs’ large number of parallel processing cores makes them well-suited for these DL tasks.
However, in addition to model invocation, those DL application often entail preprocessing or postprocessing in an inference pipeline. For example, input images for an object detection use case might need to be resized or cropped before being served to a computer vision model, or tokenization of text inputs before being used in an LLM. NVIDIA Triton is an open-source inference server that enables users to define such inference pipelines as an ensemble of models in the form of a Directed Acyclic Graph (DAG). It is designed to run models at scale on both CPU and GPU. Amazon SageMaker supports deploying Triton seamlessly, allowing you to use Triton’s features while also benefiting from SageMaker capabilities: a managed, secured environment with MLOps tools integration, automatic scaling of hosted models, and more.
AWS, in its dedication to help customers achieve the highest saving, has continuously innovated not only in pricing options and cost-optimization proactive services, but also in launching cost savings features like multi-model endpoints (MMEs). MMEs are a cost-effective solution for deploying a large number of models using the same fleet of resources and a shared serving container to host all of your models. Instead of using multiple single-model endpoints, you can reduce your hosting costs by deploying multiple models while paying only for a single inference environment. Additionally, MMEs reduce deployment overhead because SageMaker manages loading models in memory and scaling them based on the traffic patterns to your endpoint.
In this post, we show how to run multiple deep learning ensemble models on a GPU instance with a SageMaker MME. To follow along with this example, you can find the code on the public SageMaker examples repository.
How SageMaker MMEs with GPU work
With MMEs, a single container hosts multiple models. SageMaker controls the lifecycle of models hosted on the MME by loading and unloading them into the container’s memory. Instead of downloading all the models to the endpoint instance, SageMaker dynamically loads and caches the models as they are invoked.
When an invocation request for a particular model is made, SageMaker does the following:

It first routes the request to the endpoint instance.
If the model has not been loaded, it downloads the model artifact from Amazon Simple Storage Service (Amazon S3) to that instance’s Amazon Elastic Block Storage volume (Amazon EBS).
It loads the model to the container’s memory on the GPU-accelerated compute instance. If the model is already loaded in the container’s memory, invocation is faster because no further steps are needed.

When an additional model needs to be loaded, and the instance’s memory utilization is high, SageMaker will unload unused models from that instance’s container to ensure that there is enough memory. These unloaded models will remain on the instance’s EBS volume so that they can be loaded into the container’s memory later, thereby removing the need to download them again from the S3 bucket. However, If the instance’s storage volume reaches its capacity, SageMaker will delete the unused models from the storage volume. In cases where the MME receives many invocation requests, and additional instances (or an auto-scaling policy) are in place, SageMaker routes some requests to other instances in the inference cluster to accommodate for the high traffic.
This not only provides a cost saving mechanism, but also enables you to dynamically deploy new models and deprecate old ones. To add a new model, you upload it to the S3 bucket the MME is configured to use and invoke it. To delete a model, stop sending requests and delete it from the S3 bucket. Adding models or deleting them from an MME doesn’t require updating the endpoint itself!
Triton ensembles
The Triton model ensemble represents a pipeline that consists of one model, preprocessing and postprocessing logic, and the connection of input and output tensors between them. A single inference request to an ensemble triggers the run of the entire pipeline as a series of steps using the ensemble scheduler. The scheduler collects the output tensors in each step and provides them as input tensors for other steps according to the specification. To clarify: the ensemble model is still viewed as a single model from an external view.
Triton server architecture includes a model repository: a file system-based repository of the models that Triton will make available for inferencing. Triton can access models from one or more locally accessible file paths or from remote locations like Amazon S3.
Each model in a model repository must include a model configuration that provides required and optional information about the model. Typically, this configuration is provided in a config.pbtxt file specified as ModelConfig protobuf. A minimal model configuration must specify the platform or backend (like PyTorch or TensorFlow), the max_batch_size property, and the input and output tensors of the model.
Triton on SageMaker
SageMaker enables model deployment using Triton server with custom code. This functionality is available through the SageMaker managed Triton Inference Server Containers. These containers support common machine leaning (ML) frameworks (like TensorFlow, ONNX, and PyTorch, as well as custom model formats) and useful environment variables that let you optimize performance on SageMaker. Using SageMaker Deep Learning Containers (DLC) images is recommended because they’re maintained and regularly updated with security patches.
Solution walkthrough
For this post, we deploy two different types of ensembles on a GPU instance, using Triton and a single SageMaker endpoint.
The first ensemble consists of two models: a DALI model for image preprocessing and a TensorFlow Inception v3 model for actual inference. The pipeline ensemble takes encoded images as an input, which will have to be decoded, resized to 299×299 resolution, and normalized. This preprocessing will be handled by the DALI model. DALI is an open-source library for common image and speech preprocessing tasks such as decoding and data augmentation. Inception v3 is an image recognition model that consists of symmetric and asymmetric convolutions, and average and max pooling fully connected layers (and therefore is perfect for GPU usage).
The second ensemble transforms raw natural language sentences into embeddings and consists of three models. First, a preprocessing model is applied to the input text tokenization (implemented in Python). Then we use a pre-trained BERT (uncased) model from the Hugging Face Model Hub to extract token embeddings. BERT is an English language model that was trained using a masked language modeling (MLM) objective. Finally, we apply a postprocessing model where the raw token embeddings from the previous step are combined into sentence embeddings.
After we configure Triton to use these ensembles, we show how to configure and run the SageMaker MME.
Finally, we provide an example of each ensemble invocation, as can be seen in the following diagram:

Ensemble 1 – Invoke the endpoint with an image, specifying DALI-Inception as the target ensemble
Ensemble 2 – Invoke the same endpoint, this time with text input and requesting the preprocess-BERT-postprocess ensemble

Set up the environment
First, we set up the needed environment. This includes updating AWS libraries (like Boto3 and the SageMaker SDK) and installing the dependencies required to package our ensembles and run inferences using Triton. We also use the SageMaker SDK default execution role. We use this role to enable SageMaker to access Amazon S3 (where our model artifacts are stored) and the container registry (where the NVIDIA Triton image will be used from). See the following code:

import boto3, json, sagemaker, time
from sagemaker import get_execution_role
import nvidia.dali as dali
import nvidia.dali.types as types

# SageMaker varaibles
sm_client = boto3.client(service_name=”sagemaker”)
runtime_sm_client = boto3.client(“sagemaker-runtime”)
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())
role = get_execution_role()

# Other Variables
instance_type = “ml.g4dn.4xlarge”
sm_model_name = “triton-tf-dali-ensemble-” + time.strftime(“%Y-%m-%d-%H-%M-%S”, time.gmtime())
endpoint_config_name = “triton-tf-dali-ensemble-” + time.strftime(“%Y-%m-%d-%H-%M-%S”, time.gmtime())
endpoint_name = “triton-tf-dali-ensemble-” + time.strftime(“%Y-%m-%d-%H-%M-%S”, time.gmtime())

Prepare ensembles
In this next step, we prepare the two ensembles: the TensorFlow (TF) Inception with DALI preprocessing and BERT with Python preprocessing and postprocessing.
This entails downloading the pre-trained models, providing the Triton configuration files, and packaging the artifacts to be stored in Amazon S3 before deploying.
Prepare the TF and DALI ensemble
First, we prepare the directories for storing our models and configurations: for the TF Inception (inception_graphdef), for DALI preprocessing (dali), and for the ensemble (ensemble_dali_inception). Because Triton supports model versioning, we also add the model version to the directory path (denoted as 1 because we only have one version). To learn more about the Triton version policy, refer to Version Policy. Next, we download the Inception v3 model, extract it, and copy to the inception_graphdef model directory. See the following code:

!mkdir -p model_repository/inception_graphdef/1
!mkdir -p model_repository/dali/1
!mkdir -p model_repository/ensemble_dali_inception/1

!wget -O /tmp/inception_v3_2016_08_28_frozen.pb.tar.gz
https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz

!(cd /tmp && tar xzf inception_v3_2016_08_28_frozen.pb.tar.gz)
!mv /tmp/inception_v3_2016_08_28_frozen.pb model_repository/inception_graphdef/1/model.graphdef

Now, we configure Triton to use our ensemble pipeline. In a config.pbtxt file, we specify the input and output tensor shapes and types, and the steps the Triton scheduler needs to take (DALI preprocessing and the Inception model for image classification):

%%writefile model_repository/ensemble_dali_inception/config.pbtxt
name: “ensemble_dali_inception”
platform: “ensemble”
max_batch_size: 256
input [
{
name: “INPUT”
data_type: TYPE_UINT8
dims: [ -1 ]
}
]
output [
{
name: “OUTPUT”
data_type: TYPE_FP32
dims: [ 1001 ]
}
]
ensemble_scheduling {
step [
{
model_name: “dali”
model_version: -1
input_map {
key: “DALI_INPUT_0”
value: “INPUT”
}
output_map {
key: “DALI_OUTPUT_0”
value: “preprocessed_image”
}
},
{
model_name: “inception_graphdef”
model_version: -1
input_map {
key: “input”
value: “preprocessed_image”
}
output_map {
key: “InceptionV3/Predictions/Softmax”
value: “OUTPUT”
}
}
]
}

Next, we configure each of the models. First, the model config for DALI backend:

%%writefile model_repository/dali/config.pbtxt
name: “dali”
backend: “dali”
max_batch_size: 256
input [
{
name: “DALI_INPUT_0”
data_type: TYPE_UINT8
dims: [ -1 ]
}
]
output [
{
name: “DALI_OUTPUT_0”
data_type: TYPE_FP32
dims: [ 299, 299, 3 ]
}
]
parameters: [
{
key: “num_threads”
value: { string_value: “12” }
}
]

Next, the model configuration for TensorFlow Inception v3 we downloaded earlier:

%%writefile model_repository/inception_graphdef/config.pbtxt
name: “inception_graphdef”
platform: “tensorflow_graphdef”
max_batch_size: 256
input [
{
name: “input”
data_type: TYPE_FP32
format: FORMAT_NHWC
dims: [ 299, 299, 3 ]
}
]
output [
{
name: “InceptionV3/Predictions/Softmax”
data_type: TYPE_FP32
dims: [ 1001 ]
label_filename: “inception_labels.txt”
}
]
instance_group [
{
kind: KIND_GPU
}
]

Because this is a classification model, we also need to copy the Inception model labels to the inception_graphdef directory in the model repository. These labels include 1,000 class labels from the ImageNet dataset.

!aws s3 cp s3://sagemaker-sample-files/datasets/labels/inception_labels.txt model_repository/inception_graphdef/inception_labels.txt

Next, we configure and serialize the DALI pipeline that will handle our preprocessing to file. The preprocessing includes reading the image (using CPU), decoding (accelerated using GPU), and resizing and normalizing the image.

@dali.pipeline_def(batch_size=3, num_threads=1, device_id=0)
def pipe():
“””Create a pipeline which reads images and masks, decodes the images and returns them.”””
images = dali.fn.external_source(device=”cpu”, name=”DALI_INPUT_0″)
images = dali.fn.decoders.image(images, device=”mixed”, output_type=types.RGB)
images = dali.fn.resize(images, resize_x=299, resize_y=299) #resize image to the default 299×299 size
images = dali.fn.crop_mirror_normalize(
images,
dtype=types.FLOAT,
output_layout=”HWC”,
crop=(299, 299), #crop image to the default 299×299 size
mean=[0.485 * 255, 0.456 * 255, 0.406 * 255], #crop a central region of the image
std=[0.229 * 255, 0.224 * 255, 0.225 * 255], #crop a central region of the image
)
return images

pipe().serialize(filename=”model_repository/dali/1/model.dali”)

Finally, we package the artifacts together and upload them as a single object to Amazon S3:

!tar -cvzf model_tf_dali.tar.gz -C model_repository .
model_uri = sagemaker_session.upload_data(
path=”model_tf_dali.tar.gz”, key_prefix=”triton-mme-gpu-ensemble”
)
print(“S3 model uri: {}”.format(model_uri))

Prepare the TensorRT and Python ensemble
For this example, we use a pre-trained model from the transformers library.
You can find all models (preprocess and postprocess, along with config.pbtxt files) in the folder ensemble_hf. Our file system structure will include four directories (three for the individual model steps and one for the ensemble) as well as their respective versions:

ensemble_hf
├── bert-trt
|   |── model.pt
|   |──config.pbtxt
├── ensemble
│   └── 1
|   └── config.pbtxt
├── postprocess
│   └── 1
|       └── model.py
|   └── config.pbtxt
├── preprocess
│   └── 1
|       └── model.py
|   └── config.pbtxt

In the workspace folder, we provide with two scripts: the first to convert the model into ONNX format (onnx_exporter.py) and the TensorRT compilation script (generate_model_trt.sh).
Triton natively supports the TensorRT runtime, which enables you to easily deploy a TensorRT engine, thereby optimizing for a selected GPU architecture.
To make sure we use the TensorRT version and dependencies that are compatible with the ones in our Triton container, we compile the model using the corresponding version of NVIDIA’s PyTorch container image:

model_id = “sentence-transformers/all-MiniLM-L6-v2″
! docker run –gpus=all –rm -it -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:22.10-py3 /bin/bash generate_model_trt.sh $model_id

We then copy the model artifacts to the directory we created earlier and add a version to the path:

! mkdir -p ensemble_hf/bert-trt/1 && mv workspace/model.plan ensemble_hf/bert-trt/1/model.plan && rm -rf workspace/model.onnx workspace/core*

We use a Conda pack to generate a Conda environment that the Triton Python backend will use in preprocessing and postprocessing:

!bash conda_dependencies.sh
!cp processing_env.tar.gz ensemble_hf/postprocess/ && cp processing_env.tar.gz ensemble_hf/preprocess/
!rm processing_env.tar.gz

Finally, we upload the model artifacts to Amazon S3:

!tar -C ensemble_hf/ -czf model_trt_python.tar.gz .
model_uri = sagemaker_session.upload_data(
path=”model_trt_python.tar.gz”, key_prefix=”triton-mme-gpu-ensemble”
)

print(“S3 model uri: {}”.format(model_uri))

Run ensembles on a SageMaker MME GPU instance
Now that our ensemble artifacts are stored in Amazon S3, we can configure and launch the SageMaker MME.
We start by retrieving the container image URI for the Triton DLC image that matches the one in our Region’s container registry (and is used for TensorRT model compilation):

account_id_map = {
“us-east-1”: “785573368785”,
“us-east-2”: “007439368137”,
“us-west-1”: “710691900526”,
“us-west-2”: “301217895009”,
“eu-west-1”: “802834080501”,
“eu-west-2”: “205493899709”,
“eu-west-3”: “254080097072”,
“eu-north-1”: “601324751636”,
“eu-south-1”: “966458181534”,
“eu-central-1”: “746233611703”,
“ap-east-1”: “110948597952”,
“ap-south-1”: “763008648453”,
“ap-northeast-1”: “941853720454”,
“ap-northeast-2”: “151534178276”,
“ap-southeast-1”: “324986816169”,
“ap-southeast-2”: “355873309152”,
“cn-northwest-1”: “474822919863”,
“cn-north-1”: “472730292857”,
“sa-east-1”: “756306329178”,
“ca-central-1”: “464438896020”,
“me-south-1”: “836785723513”,
“af-south-1”: “774647643957”,
}
region = boto3.Session().region_name
if region not in account_id_map.keys():
raise (“UNSUPPORTED REGION”)
base = “amazonaws.com.cn” if region.startswith(“cn-“) else “amazonaws.com”
triton_image_uri = “{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.03-py3”.format(
account_id=account_id_map[region], region=region, base=base
)

Next, we create the model in SageMaker. In the create_model request, we describe the container to use and the location of model artifacts, and we specify using the Mode parameter that this is a multi-model.

container = {
“Image”: triton_image_uri,
“ModelDataUrl”: models_s3_location,
“Mode”: “MultiModel”,
}

create_model_response = sm_client.create_model(
ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

To host our ensembles, we create an endpoint configuration with the create_endpoint_config API call, and then create an endpoint with the create_endpoint API. SageMaker then deploys all the containers that you defined for the model in the hosting environment.

create_endpoint_config_response = sm_client.create_endpoint_config(
EndpointConfigName=endpoint_config_name,
ProductionVariants=[
{
“InstanceType”: instance_type,
“InitialVariantWeight”: 1,
“InitialInstanceCount”: 1,
“ModelName”: sm_model_name,
“VariantName”: “AllTraffic”,
}
],
)

create_endpoint_response = sm_client.create_endpoint(
EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

Although in this example we are setting a single instance to host our model, SageMaker MMEs fully support setting an auto scaling policy. For more information on this feature, see Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints.
Create request payloads and invoke the MME for each model
After our real-time MME is deployed, it’s time to invoke our endpoint with each of the model ensembles we used.
First, we create a payload for the DALI-Inception ensemble. We use the shiba_inu_dog.jpg image from the SageMaker public dataset of pet images. We load the image as an encoded array of bytes to use in the DALI backend (to learn more, see Image Decoder examples).

sample_img_fname = “shiba_inu_dog.jpg”

import numpy as np

s3_client = boto3.client(“s3”)
s3_client.download_file(
“sagemaker-sample-files”, “datasets/image/pets/shiba_inu_dog.jpg”, sample_img_fname
)

def load_image(img_path):
“””
Loads image as an encoded array of bytes.
This is a typical approach you want to use in DALI backend
“””
with open(img_path, “rb”) as f:
img = f.read()
return np.array(list(img)).astype(np.uint8)

rv = load_image(sample_img_fname)
print(f”Shape of image {rv.shape}”)

rv2 = np.expand_dims(rv, 0)
print(f”Shape of expanded image array {rv2.shape}”)

payload = {
“inputs”: [
{
“name”: “INPUT”,
“shape”: rv2.shape,
“datatype”: “UINT8”,
“data”: rv2.tolist(),
}
]
}

With our encoded image and payload ready, we invoke the endpoint.
Note that we specify our target ensemble to be the model_tf_dali.tar.gz artifact. The TargetModel parameter is what differentiates MMEs from single-model endpoints and enables us to direct the request to the right model.

response = runtime_sm_client.invoke_endpoint(
EndpointName=endpoint_name, ContentType=”application/octet-stream”, Body=json.dumps(payload), TargetModel=”model_tf_dali.tar.gz”
)

The response includes metadata about the invocation (such as model name and version) and the actual inference response in the data part of the output object. In this example, we get an array of 1,001 values, where each value is the probability of the class the image belongs to (1,000 classes and 1 extra for others). Next, we invoke our MME again, but this time target the second ensemble. Here the data is just two simple text sentences:

text_inputs = [“Sentence 1”, “Sentence 2”]

To simplify communication with Triton, the Triton project provides several client libraries. We use that library to prepare the payload in our request:

import tritonclient.http as http_client

text_inputs = [“Sentence 1”, “Sentence 2”]
inputs = []
inputs.append(http_client.InferInput(“INPUT0”, [len(text_inputs), 1], “BYTES”))
batch_request = [[text_inputs[i]] for i in range(len(text_inputs))]
input0_real = np.array(batch_request, dtype=np.object_)
inputs[0].set_data_from_numpy(input0_real, binary_data=True)
outputs = []
outputs.append(http_client.InferRequestedOutput(“finaloutput”))
request_body, header_length = http_client.InferenceServerClient.generate_request_body(
inputs, outputs=outputs
)

Now we are ready to invoke the endpoint—this time, the target model is the model_trt_python.tar.gz ensemble:

response = runtime_sm_client.invoke_endpoint(
EndpointName=endpoint_name,
ContentType=”application/vnd.sagemaker-triton.binary+json;json-header-size={}”.format(
header_length
),
Body=request_body,
TargetModel=”model_trt_python.tar.gz”
)

The response is the sentence embeddings that can be used in a variety of natural language processing (NLP) applications.
Clean up
Lastly, we clean up and delete the endpoint, endpoint configuration, and model:

sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=sm_model_name)

Conclusion
In this post, we showed how to configure, deploy, and invoke a SageMaker MME with Triton ensembles on a GPU-accelerated instance. We hosted two ensembles on a single real-time inference environment, which reduced our cost by 50% (for a g4dn.4xlarge instance, which represents over $13,000 in yearly savings). Although this example used only two pipelines, SageMaker MMEs can support thousands of model ensembles, making it an extraordinary cost savings mechanism. Furthermore, you can use SageMaker MMEs’ dynamic ability to load (and unload) models to minimize the operational overhead of managing model deployments in production.

About the authors
Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Nikhil Kulkarni is a software developer with AWS Machine Learning, focusing on making machine learning workloads more performant on the cloud, and is a co-creator of AWS Deep Learning Containers for training and inference. He’s passionate about distributed Deep Learning Systems. Outside of work, he enjoys reading books, fiddling with the guitar, and making pizza.
Uri Rosenberg is the AI & ML Specialist Technical Manager for Europe, Middle East, and Africa. Based out of Israel, Uri works to empower enterprise customers to design, build, and operate ML workloads at scale. In his spare time, he enjoys cycling, backpacking, and backpropagating.
 Eliuth Triana Isaza is a Developer Relations Manager on the NVIDIA-AWS team. He connects Amazon and AWS product leaders, developers, and scientists with NVIDIA technologists and product leaders to accelerate Amazon ML/DL workloads, EC2 products, and AWS AI services. In addition, Eliuth is a passionate mountain biker, skier, and poker player.

Artificial Intelligence (AI) and Web3: How are they Connected?

What is AI?

Simply put, Artificial Intelligence (AI) is the ability of machines to do functions that we usually associate with a human mind – for example, doing a reasoning task, solving a mathematics problem, stock market trading, and much more. At its core, AI combines computer science with robust datasets to enable problem-solving.

Since its inception, AI has had many cycles of hype, but the release of OpenAI’s ChatGPT last year has taken the excitement and anticipation of AI to new heights, marking a significant milestone, particularly in the field of Natural Language Processing. Furthermore, generative models have proven their adaptability beyond language, showcasing the capacity to learn the grammar of diverse data types such as software code, molecules, natural images, and more.

The applications of AI are evolving day by day, and in this article, we will discuss its role and use cases in Web3.

What is Web3?

Web3 is defined as a series of inter-connected and open-source applications. Powered by the blockchain computing architecture, these applications are decentralized, enabling trust and transparency where data and transactions are secured and distributed across a network of nodes, eliminating the need for central authorities or intermediaries.

Web1 (1990-2004) was static and read-only (e.g. Yahoo News), with users consuming information. Web2 introduced interactivity (e.g., Facebook, YouTube) but led to privacy concerns (exploiting user data for targeted advertising) and centralized control (making decisions unilaterally, like blocking any particular account). Web3 aims to be decentralized, giving users ownership and governance through blockchain, addressing Web2’s limitations, and fostering a more user-controlled internet.

An example of a Web3 application is the Brave Browser, a Chromium-based browser renowned for its Web3 integration and cryptocurrency integration. Operating on public blockchains and leveraging IPFS (a transfer protocol that sends and delivers data faster than HTTPS), Brave enables users to earn crypto, link wallets, explore NFTs, and access DApps seamlessly. With robust security features and a focus on online safety, Brave offers an ideal browsing experience for Web3 enthusiasts.

Source: https://www.businessinsider.com/personal-finance/what-is-web3?IR=T

Role of AI in Web3

Autonomous Agents

Autonomous Agents could be provided with real-time data and a set of predefined rules to enhance the capabilities of smart contracts in Web3 platforms. Additionally, these agents can negotiate, execute transactions, and provide personalized services. The use of such agents automates complex processes, minimizes intermediaries, and enhances the overall Web3 ecosystem.

Personalization

In the Web3 context, artificial intelligence (AI) is pivotal in crafting tailored user experiences through data analysis, interaction patterns, and preferences. AI employs collaborative and content-based filtering techniques, generating personalized recommendations and enhancing various facets of Web3 platforms. 

Such personalization fosters deeper user engagement by aligning content and interactions with individual needs and preferences, ultimately enriching the decentralized Web3 experience and facilitating efficient content discovery and curation.

Decentralized Data Marketplaces

By harnessing AI, decentralized data marketplaces can be created that empower individuals with enhanced data control. AI algorithms enable selective data sharing and monetization while upholding privacy. They facilitate data analysis, categorization, and efficient, secure transactions, optimizing matching between data buyers and sellers in these decentralized markets.

Analytics & Insights

By leveraging AI methods like machine learning and natural language processing, Web3 networks can efficiently process and analyze extensive data. This empowers users with predictive analytics, sentiment analysis, and personalized recommendations, enhancing their understanding of decentralized dynamics and improving navigation within the landscape.

Security & Privacy

Web3 ecosystems can improve cybersecurity and safeguard user data privacy by utilizing advanced AI techniques. AI models can analyze extensive data to identify vulnerabilities, malicious behavior, and anomalies. Machine learning algorithms can prevent cyber threats like phishing and DDoS attacks. By proactively ensuring security, AI enhances user trust and confidence in Web3 platforms and applications.

Decentralized Autonomous Organizations (DAOs)

AI is pivotal in the development of Decentralized Autonomous Organizations (DAOs). These blockchain-based organizations automate voting, fund management, and operations and achieve greater transparency and adaptability by integrating AI and optimizing Web3 governance models.

Real-World Examples of Use of AI in Web3

Medibloc

Medibloc, a decentralized healthcare platform, utilizes the Ethereum blockchain and smart contracts to facilitate secure data sharing and healthcare access. Its native cryptocurrency, MED, supports transactions and rewards for data contributions. 

AI is harnessed for personalized treatment advice, data analysis, and automation. The platform fosters patient connections and experience sharing while AI processes medical data, spotting trends, generating tailored recommendations, and automating tasks like medication reminders.

Chainalysis

Chainalysis, established in 2014, is a blockchain analytics platform utilized by various entities, including cryptocurrency exchanges, law enforcement, and financial institutions, to uncover and prevent unlawful activities within the blockchain.

The platform employs a proprietary database of known illicit addresses and transactions to centrally store the data from diverse blockchain networks like Bitcoin and Ethereum. The system detects irregularities like unusually large transactions or links to fraudulent addresses through AI-driven processing, subsequently issuing alerts for potentially suspicious transactions. 

Augur

Augur is a decentralized prediction market on the Ethereum blockchain. Users create and trade event outcome predictions, aided by AI that analyzes diverse data sources like news, social media, and historical data to enhance accuracy. This analysis guides users in making more informed predictions and trades, with accurate predictions earning rewards fostering AI-driven prediction improvement.

Ocean Protocol

Ocean Protocol focuses on constructing a decentralized data exchange for secure data sharing and monetization, safeguarding privacy. The company’s scope extends to developing decentralized autonomous organizations (DAOs) and other Web 3.0 innovations. 

Enterprises contribute supply chain data to the platform, with Ocean Protocol’s AI processing it to extract valuable insights, aiding in supply chain enhancement such as identifying bottlenecks and predicting demand. The system automates tasks like issue resolution and logistics optimization. Data services are facilitated by the native cryptocurrency, Ocean Token (OCEAN).

MyCryptoHeroes

MyCryptoHeroes is a blockchain-based gaming platform where players collect, train, and battle with distinctive digital heroes using non-fungible tokens (NFTs) on the Ethereum blockchain. The platform’s native cryptocurrency, GUM, acquires in-game assets and facilitates trades, while AI offers personalized suggestions and automates tasks. Alongside this, a social dimension enables hero trading and player interaction within the game.

Don’t forget to join our 27k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

The post Artificial Intelligence (AI) and Web3: How are they Connected? appeared first on MarkTechPost.

The Consistent AI Video Editor Has Arrived: TokenFlow is an AI Model T …

Diffusion models are something you should be familiar with at this point. They have been the key topic in the AI domain for the last year. These models showed remarkable success in image generation, and they opened an entirely new page. 

We are in the text-to-image generation era, and they improve daily. Diffusion-based generative models, such as MidJourney, have demonstrated incredible capabilities in synthesizing high-quality images from text descriptions. These models use large-scale image-text datasets, enabling them to generate diverse and realistic visual content based on textual prompts.

The rapid advancement of text-to-image models has led to remarkable advancements in image editing and content generation. Nowadays, users can control various aspects of both generated and real images. This enables them to express their ideas better and demonstrate the outcome in a relatively rapid way instead of spending days in manual drawing.

However, the story is different when it comes to applying these exciting breakthroughs to the realm of videos. We have relatively slower progress here. Although large-scale text-to-video generative models have emerged, showcasing impressive results in generating video clips from textual descriptions, they still face limitations regarding resolution, video length, and the complexity of video dynamics they can represent.

One of the key challenges in using an image diffusion model for video editing is to ensure that the edited content remains consistent across all video frames. While existing video editing methods based on image diffusion models have achieved global appearance coherency by extending the self-attention module to include multiple frames, they often fall short of achieving the desired level of temporal consistency. This leaves professionals and semi-professionals to resort to elaborate video editing pipelines involving additional manual work.

Let us meet with TokenFlow, an AI model that utilizes the power of a pre-trained text-to-image model to enable text-driven editing of natural videos.

The main goal of TokenFlow is to generate high-quality videos that adhere to the target edit expressed by an input text prompt while preserving the spatial layout and motion of the original video.

TokenFlow can edit natural videos using text prompts. Source: https://arxiv.org/pdf/2307.10373.pdf

TokenFlow is introduced to tackle the temporal inconsistency. It explicitly enforces the original inter-frame video correspondences on the edit. By recognizing that natural videos contain redundant information across frames, TokenFlow builds upon the observation that the internal representation of the video in the diffusion model exhibits similar properties. 

Overview of TokenFlow. Source: https://arxiv.org/pdf/2307.10373.pdf

This insight serves as the pillar of TokenFlow, enabling the enforcement of consistent edits by ensuring that the features of the edited video are consistent across frames. This is achieved by propagating the edited diffusion features based on the original video dynamics, leveraging the generative prior to the state-of-the-art image diffusion model without the need for additional training or fine-tuning. TokenFlow also works seamlessly in conjunction with an off-the-shelf diffusion-based image editing method.

Check out the Paper, GitHub Page, and Project Page. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 27k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

The post The Consistent AI Video Editor Has Arrived: TokenFlow is an AI Model That Uses Diffusion Features for Consistent Video Editing appeared first on MarkTechPost.

UC Berkeley Researchers Introduce Dynalang: An AI Agent that Learns a …

Creating bots that can communicate organically with people in the real world using language has long been an aim of artificial intelligence. Present-day embodied agents can execute straightforward, low-level commands like “get the blue block” or “go past the lift and turn right.” However, interactive agents need to be able to comprehend the full range of ways people use the language outside of the “here and now,” including knowledge transmission (for example, “the top left button turns off the TV”), situational information (for example, “we’re out of milk”), and coordination (for example, “I already vacuumed the living room”). 

Most of what kids read in texts or hear from others conveys information about the world, either how it functions or as it is right now. How might they make it possible for agents to speak in other languages? Reinforcement learning (RL) is a technique for teaching language-conditioned agents to solve problems. However, most language-conditioned RL techniques now in use are trained to produce actions from task-specific instructions, for example, by taking a goal description like “pick up the blue block” as input and making a series of motor commands. Directly mapping language to the best course of action offers a difficult learning challenge when considering the variety of roles natural language fulfills in the actual world. 

If the work at hand is cleaning up, the agent should answer by going on to the next cleaning step, but if it is serving supper, the agent should collect the bowls. Take the case of “I put the bowls away” as an example. Language only has a weak correlation with the best course of action for the agent when it does not discuss the job. As a result, task-reward-only mapping of language to activities could be a better learning signal for learning to employ a variety of language inputs to complete tasks. Instead, they suggest that a unifying function of language for agents is to aid in future prediction. The phrase “I put the bowls away” enables agents to predict future observations more accurately (i.e., if it opens the cabinet, it will see the bowls within). 

In this sense, much of the language kids come across might be rooted in visual experience. Agents can predict environmental changes using prior information, such as “wrenches can be used to tighten nuts.” Agents might anticipate observations by saying, “the package is outside.” This paradigm also combines common instruction-following practices under predictive terms: instructions aid agents in expecting their rewards. They contend that forecasting future representations offers agents a rich learning signal that will help them comprehend language and how it interacts with the outside world, much to how next-token prediction enables language models to construct internal representations of world knowledge. 

Researchers from UC Berkeley introduce Dynalang, an agent that acquires a language and visual model of the world through online experience and uses that model to decide how to act. Dynalang separates learning to model the world with language (supervised learning with prediction targets) from learning to act using that model (reinforcement learning with task rewards). The world model receives visual and textual inputs as observation modalities and compresses them into a latent space. Using data gathered online as the agent interacts with its surroundings, it trains the world model to anticipate future latent representations. Taking the world model’s latent representation as input, the policy is trained to take actions that maximize task reward. 
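
The separation between the two objectives can be illustrated with a deliberately tiny sketch. The module sizes, the GRU dynamics model, and the REINFORCE-style policy update below are illustrative assumptions, not the actual Dynalang implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWorldModel(nn.Module):
    def __init__(self, img_dim=64, vocab=1000, latent_dim=128):
        super().__init__()
        self.img_enc = nn.Linear(img_dim, latent_dim)    # stand-in for an image encoder
        self.txt_enc = nn.Embedding(vocab, latent_dim)   # one language token per time step
        self.dynamics = nn.GRUCell(2 * latent_dim, latent_dim)
        self.pred_img = nn.Linear(latent_dim, img_dim)   # predict next image features
        self.pred_txt = nn.Linear(latent_dim, vocab)     # predict next language token

    def forward(self, img, token, h):
        z = torch.cat([self.img_enc(img), self.txt_enc(token)], dim=-1)
        h = self.dynamics(z, h)
        return h, self.pred_img(h), self.pred_txt(h)

class TinyPolicy(nn.Module):
    def __init__(self, latent_dim=128, n_actions=8):
        super().__init__()
        self.head = nn.Linear(latent_dim, n_actions)

    def forward(self, h):
        return torch.distributions.Categorical(logits=self.head(h))

wm, policy = TinyWorldModel(), TinyPolicy()
wm_opt = torch.optim.Adam(wm.parameters(), lr=3e-4)
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

# One illustrative step on dummy data.
h = torch.zeros(1, 128)
img, token = torch.randn(1, 64), torch.randint(0, 1000, (1,))
next_img, next_token = torch.randn(1, 64), torch.randint(0, 1000, (1,))

# Supervised world-model objective: predict the next observation (image features and token).
h, img_hat, txt_logits = wm(img, token, h)
wm_loss = F.mse_loss(img_hat, next_img) + F.cross_entropy(txt_logits, next_token)
wm_opt.zero_grad(); wm_loss.backward(); wm_opt.step()

# Separate policy objective on the (detached) latent, here a REINFORCE-like update.
dist = policy(h.detach())
action = dist.sample()
reward = torch.tensor(1.0)  # placeholder task reward
pi_loss = -(dist.log_prob(action) * reward).mean()
pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()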

Since world modeling is separate from acting, Dynalang can be pretrained on single modalities (text-only or video-only data) without actions or task rewards. Additionally, language generation can be unified within the same framework: an agent’s perception can influence its language model (i.e., its predictions about future tokens), allowing it to talk about the environment by producing language in the action space. They test Dynalang on a wide range of domains with diverse linguistic contexts. In a multitask house-cleaning setting, Dynalang learns to use linguistic hints about future observations, environment dynamics, and corrections to carry out chores more quickly. On the Messenger benchmark, Dynalang reads game manuals to outperform task-specific architectures on the game’s most difficult stage. They also show that Dynalang can follow instructions in visually and linguistically complex vision-language navigation environments. These contributions demonstrate that Dynalang learns to comprehend many forms of language to accomplish various tasks, frequently beating state-of-the-art RL algorithms and task-specific architectures.

These are the contributions they made:

• They suggest Dynalang, an agent that uses future prediction to connect language to visual experience.

• They show that Dynalang outperforms state-of-the-art RL algorithms and task-specific designs by learning to comprehend various types of language to tackle a wide variety of tasks.

• They demonstrate that the Dynalang formulation opens up new capabilities, including language generation and text-only pretraining without actions or task rewards within a single model.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 27k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

The post UC Berkeley Researchers Introduce Dynalang: An AI Agent that Learns a Multimodal World Model to Predict Future Text and Image Representations and Learns to Act from Imagined Model Rollouts appeared first on MarkTechPost.

AWS performs fine-tuning on a Large Language Model (LLM) to classify t …

The video gaming industry has an estimated user base of over 3 billion worldwide1. It consists of massive numbers of players virtually interacting with each other every single day. Unfortunately, as in the real world, not all players communicate appropriately and respectfully. In an effort to create and maintain a socially responsible gaming environment, AWS Professional Services was asked to build a mechanism that detects inappropriate language (toxic speech) within online gaming player interactions. The overall business outcome was to improve the organization’s operations by automating an existing manual process and to improve user experience by increasing the speed and quality of detecting inappropriate interactions between players, ultimately promoting a cleaner and healthier gaming environment.
The customer ask was to create an English language detector that classifies voice and text excerpts into their own custom defined toxic language categories. They wanted to first determine if the given language excerpt is toxic, and then classify the excerpt in a specific customer-defined category of toxicity such as profanity or abusive language.
AWS ProServe solved this use case through a joint effort between the Generative AI Innovation Center (GAIIC) and the ProServe ML Delivery Team (MLDT). The AWS GAIIC is a group within AWS ProServe that pairs customers with experts to develop generative AI solutions for a wide range of business use cases using proof of concept (PoC) builds. AWS ProServe MLDT then takes the PoC through production by scaling, hardening, and integrating the solution for the customer.
This customer use case will be showcased in two separate posts. This post (Part 1) serves as a deep dive into the scientific methodology. It will explain the thought process and experimentation behind the solution, including the model training and development process. Part 2 will delve into the productionized solution, explaining the design decisions, data flow, and illustration of the model training and deployment architecture.
This post covers the following topics:

The challenges AWS ProServe had to solve for this use case
Historical context about large language models (LLMs) and why this technology is a perfect fit for this use case
AWS GAIIC’s PoC and AWS ProServe MLDT’s solution from a data science and machine learning (ML) perspective

Data challenge
The main challenge AWS ProServe faced with training a toxic language classifier was obtaining enough labeled data from the customer to train an accurate model from scratch. AWS received about 100 samples of labeled data from the customer, which is a lot less than the 1,000 samples recommended for fine-tuning an LLM in the data science community.
As an added inherent challenge, natural language processing (NLP) classifiers are historically known to be very costly to train and require a large set of vocabulary, known as a corpus, to produce accurate predictions. A rigorous and effective NLP solution, if provided sufficient amounts of labeled data, would be to train a custom language model using the customer’s labeled data. The model would be trained solely with the players’ game vocabulary, making it tailored to the language observed in the games. The customer had both cost and time constraints that made this solution unviable. AWS ProServe was forced to find a solution to train an accurate language toxicity classifier with a relatively small labeled dataset. The solution lay in what’s known as transfer learning.
The idea behind transfer learning is to use the knowledge of a pre-trained model and apply it to a different but relatively similar problem. For example, if an image classifier was trained to predict if an image contains a cat, you could use the knowledge that the model gained during its training to recognize other animals like tigers. For this language use case, AWS ProServe needed to find a previously trained language classifier that was trained to detect toxic language and fine-tune it using the customer’s labeled data.
The solution was to find and fine-tune an LLM to classify toxic language. LLMs are neural networks with a massive number of parameters, typically on the order of billions, that have been trained on large amounts of unlabeled data. Before going into the AWS solution, the following section provides an overview of the history of LLMs and their historical use cases.
Tapping into the power of LLMs
LLMs have recently become the focal point for businesses looking for new applications of ML, ever since ChatGPT captured the public mindshare by becoming the fastest-growing consumer application in history2, reaching 100 million active users by January 2023, just 2 months after its release. However, LLMs are not a new technology in the ML space. They have been used extensively to perform NLP tasks such as analyzing sentiment, summarizing corpora, extracting keywords, translating speech, and classifying text.
Due to the sequential nature of text, recurrent neural networks (RNNs) had been the state of the art for NLP modeling. Specifically, the encoder-decoder network architecture was formulated because it created an RNN structure capable of taking an input of arbitrary length and generating an output of arbitrary length. This was ideal for NLP tasks like translation, where an output phrase in one language is predicted from an input phrase in another language, typically with differing numbers of words between input and output. The Transformer architecture3 (Vaswani, 2017) was a breakthrough improvement on the encoder-decoder; it introduced the concept of self-attention, which allowed the model to focus its attention on different words in the input and output phrases. In a typical encoder-decoder, each word is interpreted by the model in an identical fashion, and as the model sequentially processes each word in an input phrase, the semantic information at the beginning may be lost by the end of the phrase. The self-attention mechanism changed this by adding an attention layer to both the encoder and decoder blocks, so that the model could put different weightings on certain words from the input phrase when generating a certain word in the output phrase. Thus the basis of the transformer model was born.
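
As a quick illustration of the mechanism described above, the following is a minimal single-head scaled dot-product self-attention function; the shapes and random projection matrices are purely illustrative.

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (sequence_length, d_model); w_q / w_k / w_v: (d_model, d_k) projection matrices.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # pairwise word-to-word affinities
    weights = F.softmax(scores, dim=-1)       # attention distribution for each position
    return weights @ v                        # weighted mix of value vectors

x = torch.randn(5, 16)                        # 5 tokens, 16-dim embeddings (dummy)
w = [torch.randn(16, 8) for _ in range(3)]
print(self_attention(x, *w).shape)            # torch.Size([5, 8])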
The transformer architecture was the foundation for two of the most well-known and popular LLMs in use today: Bidirectional Encoder Representations from Transformers (BERT)5 (Devlin 2018) and the Generative Pretrained Transformer (GPT)4 (Radford 2018). Later versions of the GPT model, namely GPT-3 and GPT-4, are the engines that power the ChatGPT application. The final piece of the recipe that makes LLMs so powerful is the ability to distill information from vast text corpora without extensive labeling or preprocessing via a process called ULMFiT. This method has a pre-training phase where general text is gathered and the model is trained on the task of predicting the next word based on previous words; the benefit is that any input text used for training comes inherently prelabeled by the order of the text. LLMs are truly capable of learning from internet-scale data. For example, the original BERT model was pre-trained on the BookCorpus and the entire English Wikipedia text datasets.
This new modeling paradigm has given rise to two new concepts: foundation models (FMs) and Generative AI. As opposed to training a model from scratch with task-specific data, which is the usual case for classical supervised learning, LLMs are pre-trained to extract general knowledge from a broad text dataset before being adapted to specific tasks or domains with a much smaller dataset (typically on the order of hundreds of samples). The new ML workflow now starts with a pre-trained model dubbed a foundation model. It’s important to build on the right foundation, and there are an increasing number of options, such as the new Amazon Titan FMs, to be released by AWS as part of Amazon Bedrock. These new models are also considered generative because their outputs are human interpretable and in the same data type as the input data. While past ML models were descriptive, such as classifying images of cats vs. dogs, LLMs are generative because their output is the next set of words based on input words. That allows them to power interactive applications such as ChatGPT that can be expressive in the content they generate.
Hugging Face has partnered with AWS to democratize FMs and make them easy to access and build with. Hugging Face has created a Transformers API that unifies more than 50 different transformer architectures on different ML frameworks, including access to pre-trained model weights in their Model Hub, which has grown to over 200,000 models as of writing this post. In the next sections, we explore the proof of concept, the solution, and the FMs that were tested and chosen as the basis for solving this toxic speech classification use case for the customer.
AWS GAIIC proof of concept
AWS GAIIC chose to experiment with LLM foundation models with the BERT architecture to fine-tune a toxic language classifier. A total of three models from Hugging Face’s model hub were tested:

vinai/bertweet-base
cardiffnlp/bertweet-base-offensive
cardiffnlp/bertweet-base-hate

All three model architectures are based on the BERTweet architecture. BERTweet is trained based on the RoBERTa pre-training procedure. The RoBERTa pre-training procedure is an outcome of a replication study of BERT pre-training that evaluated the effects of hyperparameter tuning and training set size to improve the recipe for training BERT models6 (Liu 2019). The experiment sought to find a pre-training method that improved the performance results of BERT without changing the underlying architecture. The conclusion of the study found that the following pre-training modifications substantially improved the performance of BERT:

Training the model with bigger batches over more data
Removing the next sentence prediction objective
Training on longer sequences
Dynamically changing the masking pattern applied to the training data

The bertweet-base model uses the preceding pre-training procedure from the RoBERTa study to pre-train the original BERT architecture using 850 million English tweets. It is the first public large-scale language model pre-trained for English tweets.
Pre-trained FMs using tweets were thought to fit the use case for two main theoretical reasons:

The length of a tweet is very similar to the length of an inappropriate or toxic phrase found in online game chats
Tweets come from a population with a large variety of different users, similar to that of the population found in gaming platforms

AWS decided to first fine-tune BERTweet with the customer’s labeled data to get a baseline. They then chose to fine-tune two other FMs, bertweet-base-offensive and bertweet-base-hate, which were further pre-trained specifically on more relevant toxic tweets to achieve potentially higher accuracy. The bertweet-base-offensive model uses the base BERTweet FM and is further pre-trained on 14,100 annotated tweets that were deemed offensive7 (Zampieri 2019). The bertweet-base-hate model also uses the base BERTweet FM but is further pre-trained on 19,600 tweets that were deemed hate speech8 (Basile 2019).
To further enhance the performance of the PoC model, AWS GAIIC made two design decisions:

Created a two-stage prediction flow in which the first model acts as a binary classifier that determines whether a piece of text is toxic or not toxic. The second model is a fine-grained model that classifies toxic text into the customer’s defined toxicity categories. Only text that the first model predicts as toxic is passed to the second model (see the sketch after this list).
Augmented the training data and added a subset of a third-party-labeled toxic text dataset from a public Kaggle competition (Jigsaw Toxicity) to the original 100 samples received from the customer. They mapped the Jigsaw labels to the associated customer-defined toxicity labels and did an 80% split as training data and 20% split as test data to validate the model.
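
The following is a minimal, hypothetical sketch of that two-stage flow using the Hugging Face pipeline API; the model paths and label names are placeholders standing in for the customer’s fine-tuned binary and fine-grained classifiers, not real artifacts.

from transformers import pipeline

# Placeholders / assumptions for the customer's fine-tuned checkpoints and their label names.
binary_clf = pipeline("text-classification", model="path/to/binary-toxicity-model")
fine_grained_clf = pipeline("text-classification", model="path/to/fine-grained-toxicity-model")

def classify_toxicity(text: str) -> str:
    # Stage 1: binary toxic vs. non-toxic decision.
    if binary_clf(text)[0]["label"] != "toxic":
        return "non-toxic"
    # Stage 2: only toxic text is passed on for fine-grained categorization.
    return fine_grained_clf(text)[0]["label"]

print(classify_toxicity("example player chat message"))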

AWS GAIIC used Amazon SageMaker notebooks to run their fine-tuning experiments and found that the bertweet-base-offensive model achieved the best scores on the validation set. The following table summarizes the observed metric scores.

Model | Precision | Recall | F1 | AUC
Binary | .92 | .90 | .91 | .92
Fine-grained | .81 | .80 | .81 | .89

From this point, GAIIC handed off the PoC to the AWS ProServe ML Delivery Team to productionize the PoC.
AWS ProServe ML Delivery Team solution
To productionize the model architecture, the AWS ProServe ML Delivery Team (MLDT) was asked by the customer to create a solution that is scalable and easy to maintain. There were a few maintenance challenges of a two-stage model approach:

The models would require double the amount of model monitoring, which makes retraining timing inconsistent. There may be times that one model will have to be retrained more often than the other.
Increased costs of running two models as opposed to one.
The speed of inference slows because inference goes through two models.

To address these challenges, AWS ProServe MLDT had to figure out how to turn the two-stage model architecture into a single model architecture while still being able to maintain the accuracy of the two-stage architecture.
The solution was to first ask the customer for more training data, then to fine-tune the bertweet-base-offensive model on all the labels, including non-toxic samples, into one model. The idea was that fine-tuning one model with more data would result in similar results as fine-tuning a two-stage model architecture on less data. To fine-tune the two-stage model architecture, AWS ProServe MLDT updated the pre-trained model multi-label classification head to include one extra node to represent the non-toxic class.
The following is a code sample of how you would fine-tune a pre-trained model from the Hugging Face model hub using their Transformers library and alter the model’s multi-label classification head to predict the desired number of classes. AWS ProServe MLDT used this blueprint as its basis for fine-tuning. It assumes that you have your train data and validation data ready and in the correct input format.
First, Python modules are imported as well as the desired pre-trained model from the Hugging Face model hub:

# Imports.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    PreTrainedTokenizer,
    Trainer,
    TrainingArguments,
)

# Load the tokenizer of the pretrained model from the model hub.
model_checkpoint = "cardiffnlp/bertweet-base-offensive"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The pre-trained model then gets loaded and prepped for fine-tuning. This is the step where the number of toxic categories and all model parameters get defined:

# Load the pretrained model into a sequence classifier to be fine-tuned and define
# the number of classes you want to classify in the num_labels parameter.
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=[number of classes],
)

# Set your training parameter arguments. The below are some key parameters that AWS ProServe MLDT tuned:
training_args = TrainingArguments(
    output_dir=[enter input],  # directory where checkpoints and logs are written (required by TrainingArguments)
    num_train_epochs=[enter input],
    per_device_train_batch_size=[enter input],
    per_device_eval_batch_size=[enter input],
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    learning_rate=[enter input],
    load_best_model_at_end=True,
    metric_for_best_model=[enter input],
    optim=[enter input],
)

Model fine-tuning starts with inputting paths to the training and validation datasets:

# Fine-tune the model from the model_checkpoint, tokenizer, and training_args defined,
# assuming the train and validation datasets are correctly preprocessed.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # dynamically pads each batch

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=[enter input],
    eval_dataset=[enter input],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Fine-tune the model.
trainer.train()

AWS ProServe MLDT received approximately 5,000 more labeled data samples, 3,000 being non-toxic and 2,000 being toxic, and fine-tuned all three bertweet-base models, combining all labels into one model. They used this data in addition to the 5,000 samples from the PoC to fine-tune new one-stage models using the same 80% train set, 20% test set method. The following table shows that the performance scores were comparable to that of the two-stage model.

Model | Precision | Recall | F1 | AUC
bertweet-base (1-Stage) | .76 | .72 | .74 | .83
bertweet-base-hate (1-Stage) | .85 | .82 | .84 | .87
bertweet-base-offensive (1-Stage) | .88 | .83 | .86 | .89
bertweet-base-offensive (2-Stage) | .91 | .90 | .90 | .92

The one-stage model approach delivered the cost and maintenance improvements while only decreasing the precision by 3%. After weighing the trade-offs, the customer opted for AWS ProServe MLDT to productionize the one-stage model.
By fine-tuning one model with more labeled data, AWS ProServe MLDT was able to deliver a solution that met the customer’s threshold for model accuracy, as well as deliver on their ask for ease of maintenance, while lowering cost and increasing robustness.
Conclusion
A large gaming customer was looking for a way to detect toxic language within their communication channels to promote a socially responsible gaming environment. AWS GAIIC created a PoC of a toxic language detector by fine-tuning an LLM to detect toxic language. AWS ProServe MLDT then updated the model training flow from a two-stage approach to a one-stage approach and productionized the LLM for the customer to be used at scale.
In this post, AWS demonstrates the effectiveness and practicality of fine-tuning an LLM to solve this customer use case, shares context on the history of foundation models and LLMs, and introduces the workflow between the AWS Generative AI Innovation Center and the AWS ProServe ML Delivery Team. In the next post in this series, we will dive deeper into how AWS ProServe MLDT productionized the resulting one-stage model using SageMaker.
If you are interested in working with AWS to build a Generative AI solution, please reach out to the GAIIC. They will assess your use case, build out a Generative-AI-based proof of concept, and have options to extend collaboration with AWS to implement the resulting PoC into production.
References

Gamer Demographics: Facts and Stats About the Most Popular Hobby in the World
ChatGPT sets record for fastest-growing user base – analyst note
Vaswani et al., “Attention is All You Need”
Radford et al., “Improving Language Understanding by Generative Pre-Training”
Devlin et al., “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding”
Yinhan Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach”
Marcos Zampieri et al., “SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)”
Valerio Basile et al., “SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter”

About the authors
James Poquiz is a Data Scientist with AWS Professional Services based in Orange County, California. He has a BS in Computer Science from the University of California, Irvine and has several years of experience working in the data domain having played many different roles. Today he works on implementing and deploying scalable ML solutions to achieve business outcomes for AWS clients.
Han Man is a Senior Data Science & Machine Learning Manager with AWS Professional Services based in San Diego, CA. He has a PhD in Engineering from Northwestern University and has several years of experience as a management consultant advising clients in manufacturing, financial services, and energy. Today, he is passionately working with key customers from a variety of industry verticals to develop and implement ML and GenAI solutions on AWS.
Safa Tinaztepe is a full-stack data scientist with AWS Professional Services. He has a BS in computer science from Emory University and has interests in MLOps, distributed systems, and web3.

11 Business AI Tools for Startups in 2023

AdCreative AI

Boost your advertising and social media game with AdCreative.ai – the ultimate Artificial Intelligence solution. Say goodbye to hours of creative work and hello to high-converting ads and social media posts generated in mere seconds. Maximize your success and minimize your effort with AdCreative.ai today.

Canva 

Canva is a web-based graphic design application that helps non-designers produce high-quality output quickly and easily. The platform provides several useful tools, such as an image editor, a library of premade templates, and a drag-and-drop editor. Canva is a free online design tool used by people of all skill levels and businesses to make presentations, social media graphics, marketing materials, and more. Numerous premade formats are available. Canva’s library of templates includes more than 100,000 examples you may use as a jumping-off point for your creations. You can discover the appropriate template for your needs from this collection, which covers various topics. It’s quite cheap. There is a free version of Canva that includes all the fundamental tools. More advanced functions are available with a paid subscription.

Webflow 

Webflow is a cloud-based web development platform that lets customers build and launch websites without knowing how to code. It has a content management system and visual editor, making creating and modifying web pages simple. Webflow also has several tools for marketing, SEO, and online sales. Webflow is an excellent option for companies and people who want a sleek, modern website but lack the time or skills to code. It’s also a good option for companies needing an e-commerce platform. Creating and editing websites has never been simpler than with the visual editor’s drag-and-drop page builder. The CMS facilitates simple content administration for its users. Webflow has many e-commerce tools, such as shopping carts, payment gateways, and shipment tracking.

Beehiiv 

Beehiiv AI is an AI platform developed with the needs of newsletter publishers in mind. Its array of AI tools is designed to improve and expedite the content development process. With the help of the AI Writing Assistant, users may generate content by describing their ideas, choosing the desired tone and length, and writing interesting pieces. Existing text can be improved using the AI Text Tools’ capabilities for auto-correction, auto-completion, tone alterations, and text regeneration. With the help of the AI Image Tools, anyone may create striking images just by describing what they want to see. The in-editor AI Translator found in beehiiv makes it easy to localize your material into different languages. The software aims to help newsletter administrators work smarter, not harder, by increasing productivity and efficiency. Operators can use the AI-powered Writing Assistant to generate content automatically, while the Text Tools support editing and upgrading existing text.

Senja 

Senja is a program that shortens the time to gather, organize, and disseminate user reviews from days to minutes. It’s a straightforward method of collecting consumer feedback and highlighting their positive experiences with your company. Senja allows you to design and distribute your own online testimonial collection forms. You can encourage consumers to post testimonials by offering them benefits and following up with a thank-you letter after they do so. Senja simplifies the management of testimonials once they have been obtained. Tag them with keywords, add your notes, and arrange them in projects. With Senja, including customer reviews in online and offline marketing materials is simple. Using Senja can help you gain the trust and respect of your target audience. People are more likely to buy from you if they read about the success of your business from the mouths of others.

Grammarly 

The artificial intelligence (AI) behind Grammarly makes it an excellent online writing coach. It instantly fixes any mistakes you may have made with grammar, spelling, punctuation, clarity, style, or tone. Over 500,000 apps and websites are compatible with Grammarly across Windows, Mac, iOS, and Android. It can check for grammar, spelling, punctuation, and plagiarism, among other things, and generate citations for your essays. Grammarly is a flexible resource that may be used in various settings, from individuals to large corporations to educational institutions. It provides several plans to accommodate multiple needs. In addition to its editing software, Grammarly also offers many other resources, such as developer and education and business and technology-focused blogs.

Copy.ai

Copy.ai is an artificial intelligence–powered copywriting platform that helps businesses produce engaging content. There is no entry charge or required initial purchase to become a member. This application uses cookies to tailor your experience and provide relevant advertisements. For GDPR purposes and to identify bots, this site uses functional cookies. The app keeps track of everything its users do on the site and uses that data to create analytics and heat maps. The user’s preferred language and server cluster can also be stored in cookies. This is good for the user’s experience and the ads they view.

ChatGPT 

Simply put, ChatGPT is an artificial intelligence-powered conversational UI. It takes user input, processes it, and outputs a response. Because of OpenAI, the machine can comprehend both written and spoken language. It can provide set answers or prompt the user to supply their own. The technology can have meaningful consumer interactions using machine learning and natural language processing. Because of its adaptability, the system can be used in various contexts, such as virtual assistants, chatbots, and customer support. To give users a conversational AI system that can understand and fulfill their demands, ChatGPT makes use of OpenAI technologies.

Notion AI 

Quickly create blog posts, meeting agendas, and sales letters with the help of the advanced AI-driven software Notion AI. Notion AI writes the first draft, giving the user a head start on paragraphs or whole pages. By utilizing AI’s vast potential, writers can boost their productivity, scope of thought, and creative output. Besides these more obvious applications, you can use them to compose poetry, catch mistakes, translate text inline, and summarize preliminary drafts of longer works. With Notion AI, people may experience the wonder of AI-driven content creation via brainstorming, getting ideas, and more.

Tweetlify 

Tweetlify is a one-stop shop for increasing your Twitter following, engagement, and the likelihood of your tweets becoming viral. Using this app, you can craft virally successful tweets with ease. Automatically interact with your followers by liking, retweeting, and replying to their tweets with this handy application. Users can be followed and unfollowed mechanically, and individual users can be targeted based on their interests with this technology. You can learn from the practices of other popular accounts and implement them into your own Twitter strategy with the help of this tool. Tweetlify is a fantastic choice if you’re trying to expand your Twitter following. The platform has many tools that can facilitate the production of popular tweets, interaction with an audience, and the automatic acquisition of new followers.

Otter AI

Using artificial intelligence, Otter.AI empowers users with real-time transcriptions of meeting notes that are shareable, searchable, accessible, and secure. Get a meeting assistant that records audio, writes notes, automatically captures slides, and generates summaries.

Don’t forget to join our 26k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com

The post 11 Business AI Tools for Startups in 2023 appeared first on MarkTechPost.

Meet Project Rumi: Multimodal Paralinguistic Prompting for Large Langu …

In the digital era of emerging technologies, LLMs have emerged as a powerful tool revolutionizing many aspects of human society and culture and reshaping how we interact with computers. Yet, a pivotal challenge remains. The limitations of LLMs are evident: they struggle to grasp the context and nuances of a conversation and depend heavily on the quality and specificity of the prompt. One major limitation is that they lack the depth of real communication, missing the paralinguistic information that accompanies it.

Project Rumi from Microsoft aims to enhance the capabilities of LLMs by addressing their limitations in understanding nonverbal cues and contextual nuances. It incorporates paralinguistic input into prompt-based interactions with LLMs to improve the quality of communication. The researchers use audio and video models to detect real-time non-verbal cues from data streams. Two separate models extract paralinguistic information from the user’s audio: one from the prosody, tone, and inflection of the speech, and the other from its semantics. They use vision transformers to encode video frames and identify facial expressions. A downstream service then incorporates the paralinguistic information into the text-based prompt. This multimodal approach aims to improve the understanding of user sentiment and intent, elevating human-AI interaction to a new level.
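
As a rough illustration of what such a downstream service might do, the sketch below prepends detected paralinguistic signals to the user’s text before it reaches the LLM; the function name, labels, and prompt template are assumptions for illustration, not Project Rumi’s actual design.

def augment_prompt(user_text: str, audio_sentiment: str, facial_expression: str) -> str:
    # Prepend the non-verbal context so the LLM can condition its reply on it.
    return (
        f"[paralinguistic context: voice sounds {audio_sentiment}, "
        f"face appears {facial_expression}]\n"
        f"User: {user_text}"
    )

print(augment_prompt("Sure, that plan works for me.", "hesitant", "frowning"))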

In this research, the team has only begun to explore the role that paralinguistic signals play in communicating critical information about users’ intentions. In the future, they plan to make the model more accurate and efficient. They also want to add more signals, such as HRV (heart rate variability) derived from standard video, and cognitive and ambient sensing. This is all part of a broader effort to add unspoken meaning and intention to the next wave of interactions with AI.

Check out the Project Page. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 27k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

The post Meet Project Rumi: Multimodal Paralinguistic Prompting for Large Language Models appeared first on MarkTechPost.

Can Large Language Models Help Long-term Action Anticipation from Vide …

This research focuses on the LTA task: long-term action anticipation from video observations. Its desired outputs are sequences of verb and noun predictions for an actor of interest over a generally extended time horizon. LTA is essential for human-machine interaction. A machine agent could use LTA to help people in settings such as self-driving cars and routine domestic chores. Additionally, because human behavior is inherently ambiguous and unpredictable, anticipating actions from video is quite difficult, even with perfect perception. 

Bottom-up modeling, a popular LTA strategy, directly models the temporal dynamics of human behavior using latent visual representations or discrete action labels. Most current bottom-up LTA approaches are implemented as end-to-end trained neural networks over visual inputs. Knowing an actor’s goal may aid action prediction because human behavior, especially in everyday domestic situations, is frequently “purposive.” As a result, they consider a top-down framework in addition to the widely used bottom-up strategy. The top-down framework first infers the longer-term goal of the human actor and then plans the procedure needed to achieve it. 

However, it is typically difficult to use goal-conditioned procedure planning for action anticipation, since the goal information is frequently unlabeled and latent in existing LTA benchmarks. Their study addresses these issues for both top-down and bottom-up LTA. They examine whether large language models (LLMs), given their success in robotic planning and program-based visual question answering, can also benefit video-based anticipation. They propose that, by pretraining on procedural text such as recipes, LLMs encode useful prior knowledge for the long-term action anticipation task. 

In an ideal scenario, the prior knowledge encoded in LLMs can assist both bottom-up and top-down LTA approaches, because they can respond to queries like “What are the most likely actions following this current action?” as well as “What is the actor trying to achieve, and what are the remaining steps to achieve the goal?” Their research specifically aims to answer four questions about using LLMs for long-term action anticipation. First, what is an appropriate interface between videos and LLMs for the LTA task? Second, are LLMs useful for top-down LTA, and can they infer goals? Third, can LLMs’ prior knowledge of temporal dynamics aid action anticipation? Lastly, can LLMs’ in-context learning capability enable few-shot LTA? 

Researchers from Brown University and the Honda Research Institute provide a two-stage system called AntGPT to carry out the quantitative and qualitative evaluations required to answer these questions. AntGPT first identifies human actions using supervised action recognition algorithms. The recognized actions are then fed to OpenAI GPT models as discretized video representations to infer the goal of the actions or the actions to come, which may then optionally be post-processed into the final predictions. For bottom-up LTA, they explicitly ask the GPT model to predict future action sequences autoregressively, via fine-tuning or in-context learning. For top-down LTA, they first ask GPT to infer the actor’s goal before producing the actions needed to accomplish it. 

They then use the goal information to make goal-conditioned predictions. Additionally, they examine AntGPT’s capacity for top-down LTA using chain-of-thought reasoning and for few-shot bottom-up LTA. They run experiments on several LTA benchmarks, including EGTEA GAZE+, EPIC-Kitchens-55, and Ego4D. The quantitative experiments demonstrate the viability of the proposed AntGPT. Additional quantitative and qualitative studies show that LLMs can infer actors’ high-level goals given discretized action labels from the video observations. They also note that the LLMs can perform counterfactual action anticipation when given different input goals. 
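
A minimal sketch of the bottom-up prompting idea is shown below: recognized actions are flattened into verb-noun labels and an LLM is asked to continue the sequence. The prompt wording, model choice, and OpenAI client call are illustrative assumptions rather than AntGPT’s exact implementation.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

observed_actions = [("open", "fridge"), ("take", "milk"), ("pour", "milk")]  # dummy recognized actions
prompt = (
    "The following verb-noun actions were observed in a kitchen video: "
    + ", ".join(f"{v} {n}" for v, n in observed_actions)
    + ". Predict the next 3 most likely actions as a comma-separated list of verb-noun pairs."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)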

Their study contributes the following: 

1. They propose using large language models to infer goals and model temporal dynamics, and they formulate long-term action anticipation in both bottom-up and top-down ways. 

2. They suggest the AntGPT framework, which naturally connects LLMs with computer vision algorithms for comprehending videos and achieves state-of-the-art long-term action prediction performance on the EPIC-Kitchens-55, EGTEA GAZE+, and Ego4D LTA v1 and v2 benchmarks. 

3. They carry out comprehensive quantitative and qualitative assessments to understand the crucial design decisions, benefits, and drawbacks of using LLMs for the LTA task. They also plan to release the code soon.

Check out the Paper and Project Page. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 27k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

The post Can Large Language Models Help Long-term Action Anticipation from Videos? Meet AntGPT: An AI Framework to Incorporate Large Language Models for the Video-based Long-Term Action Anticipation Task appeared first on MarkTechPost.

This AI Research Introduces a Novel Two-Stage Pose Distillation for W …

Numerous human-centric perception, comprehension, and generation tasks depend on whole-body pose estimation, including 3D whole-body mesh recovery, human-object interaction, and pose-conditioned human image and motion generation. Furthermore, with user-friendly tools like OpenPose and MediaPipe, capturing human poses for virtual content creation and VR/AR has become significantly more popular. Although these tools are convenient, their performance still needs to improve, which limits their potential. Therefore, further advances in human pose estimation technologies are essential to realizing the promise of user-driven content production. 

Comparatively speaking, whole-body pose estimation presents more difficulties than human pose estimation with body-only key points detection due to the following factors:

The hierarchical structures of the human body for fine-grained key points localization.

The small resolutions of the hand and face.

The complex matching of body parts across multiple people in an image, especially under occlusion and for difficult hand poses.

Data limitations, particularly for the diverse hand and head poses in whole-body images.

Additionally, a model must be compressed into a lightweight network before deployment. Distillation, pruning, and quantization make up the fundamental compression techniques. 

Knowledge distillation (KD) can boost a compact model’s effectiveness without adding extra cost to the inference process. This method, which is widely used across tasks like classification, detection, and segmentation, allows a student model to learn from a more capable teacher. In this work, the investigation of KD for whole-body pose estimation yields a set of real-time pose estimators with strong performance and efficiency. Researchers from Tsinghua Shenzhen International Graduate School and the International Digital Economy Academy propose a novel two-stage pose distillation architecture called DWPose, which, as demonstrated in Fig. 1, delivers state-of-the-art performance. They use the recent pose estimator RTMPose, trained on COCO-WholeBody, as their base model. 

Figure 1 shows a comparison between their model and comparable models on COCO-WholeBody whole-body pose estimation.

In the first-stage distillation, they use the teacher’s (e.g., RTMPose-x) intermediate layers and final logits to guide the student model (e.g., RTMPose-l). In previous pose training, keypoints are distinguished by their visibility, and only visible keypoints are used for supervision. Instead, they use the teacher’s complete outputs, covering both visible and invisible keypoints, as final logits, which convey accurate and thorough information to aid the student’s learning. They also adopt a weight-decay strategy to increase effectiveness, gradually reducing the distillation weight over the course of training. The second-stage distillation proposes a head-aware self-KD to increase the capability of the head, since a better head yields more accurate localization. 
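
A simplified sketch of such a first-stage, logit-based distillation step with a decaying distillation weight is shown below; the loss forms, the linear decay schedule, and the dummy tensors are illustrative assumptions, not the exact DWPose objectives.

import torch
import torch.nn.functional as F

def distillation_step(student_logits, teacher_logits, target_heatmaps, epoch, total_epochs):
    # Task loss against the ground-truth keypoint targets.
    task_loss = F.mse_loss(student_logits, target_heatmaps)
    # Logit-based distillation against the teacher's full outputs,
    # covering visible and invisible keypoints alike.
    kd_loss = F.mse_loss(student_logits, teacher_logits.detach())
    # Weight-decay idea: the distillation term gradually fades out during training.
    kd_weight = 1.0 - epoch / total_epochs
    return task_loss + kd_weight * kd_loss

s = torch.randn(2, 133, 64, 48, requires_grad=True)   # student outputs (COCO-WholeBody has 133 keypoints)
t, y = torch.randn_like(s), torch.randn_like(s)       # teacher outputs and ground truth (dummy tensors)
loss = distillation_step(s, t, y, epoch=10, total_epochs=60)
loss.backward()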

They build two identical models, choosing one as the student to be updated and the other as the teacher. Only the head of the student is updated by the logit-based distillation, while the rest of the body stays frozen. Notably, this plug-and-play strategy works with dense prediction heads and enables the student to get better results with only 20% of the training time, whether trained from scratch with distillation or without it. The volume and variety of data covering different scales of human body parts affect the model’s performance. Because existing datasets lack comprehensively annotated keypoints, existing estimators struggle to accurately localize fine-grained finger and facial landmarks. 

Therefore, they incorporate an additional dataset, UBody, which contains numerous face and hand keypoints captured in diverse real-life scenes, to examine the effect of data. Their contributions can thus be summarized as follows: 

• To overcome the whole-body data limitation, they explore more comprehensive training data, especially on diverse and expressive hand gestures and facial expressions, making it applicable to real-life applications. 

• They introduce a two-stage pose knowledge distillation method, pursuing efficient and precise whole-body pose estimation. 

• Using the recent RTMPose as their base model, their proposed distillation and data strategies significantly enhance RTMPose-l from 64.8% to 66.5% AP, even exceeding the RTMPose-x teacher at 65.3% AP. Additionally, they confirm DWPose’s strong effectiveness and efficiency on generation tasks.

Check out the Paper and GitHub. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 27k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

The post This AI Research Introduces a Novel Two-Stage Pose Distillation for Whole-Body Pose Estimation appeared first on MarkTechPost.

This AI Research Evaluates the Correctness and Faithfulness of Instruc …

Recently introduced Large Language Models (LLMs) have taken the Artificial Intelligence (AI) community by storm. These models have been able to successfully imitate human beings through strong Natural Language Processing (NLP), Natural Language Generation (NLG), and Natural Language Understanding (NLU). LLMs have become known for holding realistic conversations and are capable of answering simple and complex questions, content generation, code completion, machine translation, and text summarization. The goal of NLP is to make it possible for computer systems to comprehend and react to commands given in natural language, enabling people to engage with them in a more natural and flexible way; instruction-following models are the best example of this.

These models are trained using LLMs, supervised examples, or other forms of supervision, along with exposure to thousands of tasks written as natural language instructions. In recent research, a team from Mila Quebec AI Institute, McGill University, and Facebook CIFAR AI Chair evaluated the performance of instruction-following models on question answering (QA) over a given set of text passages. These models can answer questions when provided with a prompt describing the task, the question, and relevant text passages retrieved by a retriever, and the responses they produce tend to be natural and informative, which helps build users’ trust and engagement. 

These models can respond to user queries naturally and fluently simply by adding retrieved documents and instructions to their input. However, this extra verbosity makes it difficult for conventional QA evaluation metrics like exact match (EM) and F1 score to effectively quantify model performance. This is because the model’s response may include additional details that the reference answer omits while still being accurate. To overcome this problem, the team provides two criteria for measuring instruction-following models in retrieval-augmented question answering (QA).

Correctness with respect to information need: This dimension evaluates how well the model satisfies the user’s informational requirements. It is concerned with whether the generated response includes relevant information, even if it goes beyond what is mentioned directly in the reference answer.

Faithfulness with respect to the information provided: This dimension assesses how well the model grounds its answers in the provided knowledge. A faithful model should refrain from responding when only irrelevant information is presented, in addition to giving precise answers when relevant information is available.

The authors have evaluated several recent instruction-following models on three diverse QA datasets: Natural Questions for open-domain QA, HotpotQA for multi-hop QA, and TopiOCQA for conversational QA. They analyzed 900 model responses manually and compared the results with different automatic metrics for accuracy and faithfulness. Their research has suggested that recall, which measures the percentage of tokens from the reference answer that are also present in the model response, correlates more strongly with correctness than lexical overlap metrics like EM or F1 score. Compared to other token-overlap metrics for faithfulness, K-Precision, which is the percentage of model answer tokens that exist in the knowledge snippet, has a stronger correlation with human judgments.
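
For illustration, a minimal sketch of these two token-overlap metrics is shown below, using simple whitespace tokenization; the paper’s exact tokenization and normalization may differ.

def recall(reference: str, response: str) -> float:
    # Fraction of reference-answer tokens that also appear in the model response.
    ref_tokens = reference.lower().split()
    resp_tokens = set(response.lower().split())
    return sum(t in resp_tokens for t in ref_tokens) / len(ref_tokens) if ref_tokens else 0.0

def k_precision(response: str, knowledge: str) -> float:
    # Fraction of model-response tokens that appear in the provided knowledge snippet.
    resp_tokens = response.lower().split()
    know_tokens = set(knowledge.lower().split())
    return sum(t in know_tokens for t in resp_tokens) / len(resp_tokens) if resp_tokens else 0.0

print(recall("paris", "the capital of france is paris"))                      # 1.0
print(k_precision("the capital is paris", "paris is the capital of france"))  # 1.0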

In conclusion, this study seeks to advance a more thorough assessment of instruction-following models for QA tasks, taking into account both their advantages and disadvantages. The team has encouraged further progress in this area by making their code and data accessible in their GitHub repository.

Check out the Paper, GitHub, and Tweet. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 27k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

The post This AI Research Evaluates the Correctness and Faithfulness of Instruction-Following Models For Their Ability To Perform Question-Answering appeared first on MarkTechPost.

Sorbonne University Researchers Introduce UnIVAL: A Unified AI Model f …

One big leap forward in creating generalist models is the appearance of Large Language Models (LLMs). Their astounding text understanding and generation performances are often based on the Transformer architecture and a single next-token prediction aim. However, they are currently hampered by their inability to access information outside the text. This emphasizes the requirement for reliable multimodal models capable of performing various tasks using various modalities. 

Recent efforts have sought to improve on task- and modality-specific techniques by constructing more powerful multimodal models. A few of these methods seek to include more than two modalities, such as image/video-text, although most of these efforts are devoted to image-text tasks. 

To address this problem, the researchers at Sorbonne University set out to build general-purpose models that can tackle any task. They introduce UnIVAL, an approach that does not rely on any single modality: rather than integrating only two modalities, UnIVAL unifies all four (text, images, video, and audio).

UnIVAL is the first model to tackle image, video, and audio-language tasks with a unified architecture, vocabulary, input/output format, and training objective, without requiring massive amounts of training data or enormous model size. The 0.25-billion-parameter model delivers performance on par with prior art tailored to a specific modality, and the researchers obtain new SoTA results on several tasks with similarly sized models. 

Their research into the interplay and transfer of knowledge between pretrained tasks and modalities demonstrates the value of multitask pretraining compared to traditional single-task pretraining. They also discover that pretraining the model on additional modalities improves its generalization to untrained modalities. In particular, when fine-tuned on audio-text problems, UnIVAL can achieve competitive performance to SoTA without audio pretraining. 

Building on previous studies, the team also presents a new investigation into merging multimodal models by weight interpolation. They demonstrate that, when using the unified pretrained model for various multimodal tasks, interpolation in weight space can successfully combine the skills of multiple fine-tuned weights, creating more robust multitask models without any inference overhead. The diversity of multimodal tasks can thus be exploited and recycled by averaging various fine-tuned weights alongside multitask pretraining. To their knowledge, this is the first work to successfully apply weight interpolation to multimodal foundation models.
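
A minimal sketch of weight-space interpolation is shown below, using toy modules as stand-ins for two fine-tuned checkpoints of the same unified architecture; the helper function and coefficients are illustrative, not UnIVAL’s released code.

import torch
import torch.nn as nn

def interpolate_weights(state_dicts, coeffs):
    # Linearly interpolate a list of state dicts with the given coefficients.
    assert abs(sum(coeffs) - 1.0) < 1e-6, "interpolation coefficients should sum to 1"
    return {k: sum(c * sd[k] for c, sd in zip(coeffs, state_dicts)) for k in state_dicts[0]}

# Toy stand-ins for, e.g., an image-text and a video-text fine-tune of the same model.
model_a, model_b = nn.Linear(8, 4), nn.Linear(8, 4)
merged = interpolate_weights([model_a.state_dict(), model_b.state_dict()], coeffs=[0.5, 0.5])

# Load the averaged weights into a fresh copy of the shared architecture.
merged_model = nn.Linear(8, 4)
merged_model.load_state_dict(merged)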

The researchers also mention two significant drawbacks of UnIVAL:

UnIVAL is susceptible to hallucinations. In particular, it may invent new objects in visual descriptions (object bias), giving more weight to consistency than accuracy. 

It has trouble following elaborate directions. They found that the model underperformed when given complex instructions, such as picking out one object from a group of similar ones, finding things that are far away or extremely close, or recognizing numbers.

The researchers hope their findings will motivate other scientists and speed up the process of building new modality-agnostic generalist assistant agents. 

Check out the Project, Paper, and GitHub. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 27k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

The post Sorbonne University Researchers Introduce UnIVAL: A Unified AI Model for Image, Video, Audio, and Language Tasks appeared first on MarkTechPost.

Google DeepMind Researchers Introduce RT-2: A Novel Vision-Language-Ac …

Large language models enable fluent text generation, emergent problem-solving, and creative generation of prose and code. In contrast, vision-language models enable open-vocabulary visual recognition and can even make complex inferences about object-agent interactions in images. How robots should best learn new skills, however, remains unclear. The amount of data collected from robots is unlikely to be comparable to the billions of tokens and images used to train the most advanced language and vision-language models on the web. It is also challenging to adapt such models directly to robotic tasks, since they reason about semantics, labels, and textual prompts, whereas robots must be instructed in low-level actions, such as Cartesian end-effector commands.

Google DeepMind’s research aims to improve generalization and enable emergent semantic reasoning by directly incorporating vision-language models trained on Internet-scale data into end-to-end robotic control. With the help of web-based language and vision-language data, they aim to build a single, comprehensively trained model that learns to link robot observations to actions. They propose jointly fine-tuning state-of-the-art vision-language models on data from robot trajectories and on large-scale visual question-answering tasks sourced from the Internet. In contrast to other methods, they propose a straightforward, general-purpose recipe: express robotic actions as text tokens and incorporate them directly into the model’s training set as if they were natural language tokens. The researchers study vision-language-action (VLA) models, and RT-2 instantiates one such model. Through rigorous testing (6,000 evaluation trials), they show that RT-2 acquires various emergent skills through Internet-scale training and that the technique leads to performant robotic policies.

As a follow-up to its Robotics Transformer 1 (RT-1) model, Google DeepMind unveiled RT-2, a Transformer-based model trained on web-sourced text and images that can directly output robotic actions. Robot actions are treated as a second language: they are converted into text tokens and trained alongside large-scale vision-language datasets available online. At inference time, text tokens are de-tokenized back into robot behaviors, which are then executed through a feedback loop. This allows some of the generalization, semantic comprehension, and reasoning of vision-language models to transfer to robotic policy learning. On the project website, accessible at https://robotics-transformer2.github.io/, the team behind RT-2 provides live demonstrations of its use. 
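
The action-as-text idea can be sketched as follows: each continuous action dimension is discretized into bins whose indices are written out as tokens and later de-tokenized back into commands. The bin count, value range, and token format below are illustrative assumptions, not RT-2’s actual vocabulary.

import numpy as np

N_BINS = 256  # discretize each continuous action dimension into 256 bins (an assumption here)

def action_to_tokens(action, low=-1.0, high=1.0):
    # action: continuous end-effector command, e.g. [dx, dy, dz, droll, dpitch, dyaw, gripper]
    bins = np.clip(((np.asarray(action) - low) / (high - low) * (N_BINS - 1)).astype(int), 0, N_BINS - 1)
    return " ".join(str(b) for b in bins)  # e.g. "140 102 127 25 159 76 255"

def tokens_to_action(token_str, low=-1.0, high=1.0):
    # De-tokenize: map bin indices back to continuous values for the controller.
    bins = np.array([int(t) for t in token_str.split()])
    return low + bins / (N_BINS - 1) * (high - low)

tokens = action_to_tokens([0.1, -0.2, 0.0, -0.8, 0.25, -0.4, 1.0])
print(tokens)
print(tokens_to_action(tokens))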

The model retains the ability to deploy its physical skills in ways consistent with the distribution found in the robot data. Still, it also learns to use those skills in novel contexts by reading visuals and linguistic commands using knowledge gathered from the web. Even though semantic cues like precise numbers or icons aren’t included in the robot data, the model can repurpose its learned pick-and-place skills. No such relations were supplied in the robot demos, yet the model could pick the correct object and position it in the correct location. In addition, the model can make even more complex semantic inferences if the command is supplemented with a chain of thought prompting, such as knowing that a rock is the best choice for an improvised hammer or an energy drink is the best choice for someone tired.

Google DeepMind’s key contribution is RT-2, a family of models created by fine-tuning huge vision-language models trained on web-scale data to serve as generalizable and semantically aware robotic rules. Experiments probe models with as much as 55B parameters, learned from publicly available data and annotated with robotic motion commands. Across 6,000 robotic evaluations, they demonstrate that RT-2 enables considerable advances in generalization over objects, scenes, and instructions and displays a range of emergent abilities that are a byproduct of web-scale vision-language pretraining. 

Key Features

The reasoning, symbol interpretation, and human identification capabilities of RT-2 can be used in a wide range of practical scenarios. 

The results of RT-2 demonstrate that pretraining VLMs using robotic data can turn them into powerful vision-language-action (VLA) models that can directly control a robot.

A promising direction to pursue is to construct a general-purpose physical robot that can think, problem-solve, and interpret information for completing various activities in the actual world, like RT-2.

Its adaptability and efficiency in handling various tasks are displayed in RT-2’s capacity to transfer information from language and visual training data to robot movements.

Limitations

Despite its encouraging generalization properties, RT-2 has several drawbacks. Although the studies suggest that incorporating web-scale pretraining through VLMs improves generalization across semantic and visual concepts, this does not give the robot any new abilities in terms of the motions it can perform. The model learns to make better use of the physical skills found in the robot data and to apply them in novel ways, but it cannot go beyond them; they attribute this to the limited diversity of skills in the training sample. New data-gathering paradigms, such as videos of humans, present an intriguing opportunity for future research into acquiring new skills.

To sum up, Google DeepMind researchers demonstrated that large VLA models can be run in real time, but at considerable computational expense. As these methods are applied to settings requiring high-frequency control, real-time inference risks becoming a significant bottleneck. Quantization and distillation approaches that could let such models run faster or on cheaper hardware are attractive areas for future study. A related limitation is that relatively few VLM models are currently available to build RT-2 on.

Researchers from Google DeepMind summarized the process of training vision-language-action (VLA) models by combining vision-language model (VLM) pretraining with robotics data. They introduced two VLA variants, RT-2-PaLM-E and RT-2-PaLI-X, inspired by PaLM-E and PaLI-X, respectively. These models are fine-tuned on robotic trajectory data to produce robot actions, which are tokenized as text. More importantly, they demonstrated that the approach improves generalization performance and yields emergent capabilities inherited from web-scale vision-language pretraining, leading to very effective robotic policies. According to Google DeepMind, this straightforward and universal methodology positions the discipline of robot learning to profit strategically from improvements in other fields. 

Check out the Paper and Reference Article. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 27k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

The post Google DeepMind Researchers Introduce RT-2: A Novel Vision-Language-Action (VLA) Model that Learns from both Web and Robotics Data and Turns it into Action appeared first on MarkTechPost.