Meet EasyEdit: An Easy-to-Use Knowledge Editing AI Framework for LLMs

We constantly need to keep up with an ever-changing world, and so do machine learning models if they are to produce accurate output. Large language models often suffer from factual errors: they are unaware of events after training or generate text with incorrect facts owing to outdated or noisy data. For example, LLMs such as ChatGPT and LLaMA possess information only up to their training cutoff, so we need to update the parametric knowledge within them to modify specific behaviors. Numerous knowledge editing (or model editing) methods have been introduced to make targeted edits to machine learning models while minimizing the impact on unrelated inputs.

To tackle persistent challenges based on knowledge cut-off/biased outputs, researchers have applied two major methods:

Fine-tuning: traditional fine-tuning and delta tuning use domain-specific datasets, but they consume enormous resources and risk catastrophic forgetting.

Prompt augmentation: when provided with ample demonstrations or retrieved contexts, large language models (LLMs) can improve their reasoning and generation by integrating external knowledge. The downside is that this technique can be sensitive to factors such as the prompt template and the selection of in-context examples.

Owing to significant differences among knowledge editing methods and variations in task setups, no standard implementation framework has been available. To address these issues and provide a unified framework, researchers have introduced EASYEDIT, an easy-to-use knowledge editing framework for LLMs. It supports cutting-edge knowledge editing approaches and can be readily applied to many well-known LLMs such as T5, GPT-J, and LLaMA.

https://arxiv.org/abs/2308.07269

The EASYEDIT platform introduces a user-friendly “edit” interface that enables easy model modification. Comprising key components such as Hparams, Method, and Evaluate, the interface seamlessly incorporates various knowledge editing strategies. The core mechanism for applying a strategy is the APPLY_TO_MODEL function, which each method exposes. The paper demonstrates an example of applying MEND to LLaMA, changing the model’s answer for the current U.S. President to Joe Biden.
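
To make the interface concrete, below is a minimal sketch of an edit call in the style EasyEdit describes; the class names, hyperparameter file path, and prompts are assumptions drawn from the project's public repository and may differ between releases.

# A minimal sketch of EasyEdit's "edit" interface. Class names and the
# hyperparameter path are assumptions; verify against the installed version.
from easyeditor import BaseEditor, MENDHyperParams

# Load MEND hyperparameters targeting a LLaMA model (hypothetical YAML path).
hparams = MENDHyperParams.from_hparams('./hparams/MEND/llama-7b.yaml')
editor = BaseEditor.from_hparams(hparams)

# Apply a single edit: change the model's answer about the U.S. President.
metrics, edited_model, _ = editor.edit(
    prompts=['Who is the president of the United States?'],
    ground_truth=['Donald Trump'],
    target_new=['Joe Biden'],
)
print(metrics)  # per-edit scores such as reliability, generalization, locality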

EASYEDIT employs a modular approach to organizing editing methods and evaluating their efficacy while also accounting for their interplay and combination. The platform accommodates a range of editing scenarios, encompassing single-instance, batch-instance, and sequential editing. Furthermore, it conducts evaluations of critical metrics such as Reliability, Generalization, Locality, and Portability, which assist users in identifying the most suitable method tailored to their distinct requirements. 

The knowledge editing results on LLaMA-2 with EASYEDIT demonstrate that knowledge editing surpasses traditional fine-tuning in terms of reliability and generalization. In conclusion, the EasyEdit framework emerges as a pivotal advancement in the realm of large language models (LLMs), addressing the critical need for accessible and intuitive knowledge editing.

Check out the Paper and GitHub.



Meet SQLCoder: A New Open-Sourced and State of the Art Model for Converting Natural Language Questions to SQL Queries

Defog.ai has released SQLCoder, a cutting-edge model for translating natural language questions into database queries. For generic SQL schemas in Postgres, SQLCoder greatly beats all major open-source models, and when optimized for a specific database schema, it performs better than GPT-4.

The model is small enough to run in 16-bit floats on a single A100-40GB or, with 8-bit quantization, on a high-end consumer GPU such as an RTX 3090/4090. The evaluation mechanism for LLM-generated SQL is also being open-sourced. Evaluating SQL code can be difficult, and the researchers want to conduct extensive, public, and reproducible testing to push the limits of open-source text-to-SQL systems.

The model weights are licensed under CC BY-SA 4.0, and the model is free for both personal and commercial use. If you modify the weights (by fine-tuning, for instance), you must release those changes as open source under the same license.

SQLCoder is a fine-tuned version of StarCoder with 15B parameters. It has been fine-tuned on progressively challenging, hand-crafted SQL queries, and database schema-specific tuning allows it to match or exceed the performance of GPT-4.

Researchers have used SQLCoder with enterprise customers in the healthcare, financial services, and government sectors in the past three months. Self-hosted models are the sole option for customers who do not want sensitive data to leave their servers when employing LLMs.

The research team refined the model in two phases. They first fine-tuned StarCoder’s base model using only mild to moderate difficulty queries; the resulting defog-easy model was then fine-tuned on difficult and extremely difficult questions to produce SQLCoder. In Defog’s benchmarking, SQLCoder outperforms nearly every popular model except GPT-4. In particular, it outperforms models more than ten times its size, such as gpt-3.5-turbo and text-davinci-003. These outcomes only represent the performance of SQLCoder on general SQL databases and not on specific database schemas. When SQLCoder is optimized for a particular database schema, it can outperform OpenAI’s GPT-4 while incurring less latency.

An open-source version of SQLCoder can be found at https://github.com/defog-ai/sqlcoder. It has many potential applications, such as:

Running it locally on your own hardware

Deploying it in the cloud

Integrating it with other applications

SQLCoder is a robust tool that can streamline and automate data processing operations. It makes querying a database easy by translating natural language questions into SQL queries.
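
As an illustration, here is a minimal sketch of prompting SQLCoder locally with Hugging Face Transformers; the model ID, prompt template, and generation settings are assumptions and should be checked against the defog-ai/sqlcoder repository.

# Minimal sketch of running SQLCoder with Hugging Face Transformers.
# The model ID and prompt format are assumptions; see the official repository
# for the recommended prompt template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "defog/sqlcoder"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

schema = "CREATE TABLE sales (id INT, region TEXT, amount NUMERIC, sold_at DATE);"
question = "What was the total sales amount per region in 2022?"
prompt = f"### Database schema\n{schema}\n\n### Question\n{question}\n\n### SQL\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))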

Using SQLCoder can help you in a variety of ways:

Accuracy: SQLCoder constructs correct and efficient SQL queries.

Efficiency: SQLCoder produces SQL queries quickly and with little effort.

Idiomatic output: SQLCoder produces queries written according to SQL conventions.

Adaptability: SQLCoder can be modified to suit the requirements of your application.

Check out the Portal and GitHub.


Tweet from Defog.ai (YC W23) (@defogdata), August 21, 2023: “We are open-sourcing SQLCoder, state of the art model for converting natural language questions to SQL queries! SQLCoder outperforms OpenAI’s gpt-3.5-turbo, and significantly outperforms all major open-source models for generic SQL schemas in Postgres.”


Revolutionizing Real-Time 1080p Novel-View Synthesis: A Breakthrough with 3D Gaussians and Visibility-Aware Rendering

Meshes and points are the most common 3D scene representations because they are explicit and are a good fit for fast GPU/CUDA-based rasterization. In contrast, recent Neural Radiance Field (NeRF) methods build on continuous scene representations, typically optimizing a Multi-Layer Perceptron (MLP) using volumetric ray-marching for the novel-view synthesis of captured scenes. Similarly, the most efficient radiance field solutions build on continuous representations by interpolating values stored in, e.g., voxel, hash grids, or points. While the continuous nature of these methods helps optimization, the stochastic sampling required for rendering is costly and can result in noise.

Researchers from Université Côte d’Azur and Max-Planck-Institut für Informatik introduce a new approach that combines the best of both worlds: their 3D Gaussian representation allows optimization with state-of-the-art (SOTA) visual quality and competitive training times. At the same time, their tile-based splatting solution ensures real-time rendering at SOTA quality for 1080p resolution on several previously published datasets (see Fig. 1). Their goal is to allow real-time rendering for scenes captured with multiple photos and create the representations with optimization times as fast as the most efficient previous methods for typical real scenes. Recent methods achieve fast training but struggle to achieve the visual quality obtained by the current SOTA NeRF methods, i.e., Mip-NeRF360, which requires up to 48 hours of training.

Figure 1: The approach renders radiance fields in real time with quality on par with the best prior methods while only needing optimization times commensurate with the quickest previous methods. A unique 3D Gaussian scene representation and a real-time differentiable renderer, which significantly accelerates scene optimization and novel-view synthesis, are essential to this performance. While InstantNGP reaches its highest quality after a comparable training time, the new approach obtains state-of-the-art quality within 51 minutes, which is even slightly superior to Mip-NeRF360.

The fast – but lower-quality – radiance field methods can achieve interactive rendering times depending on the scene (10-15 frames per second) but fall short of high-resolution real-time rendering. Their solution builds on three main components. They first introduce 3D Gaussians as a flexible and expressive scene representation. They start with the same input as previous NeRF-like methods, i.e., cameras calibrated with Structure-from-Motion (SfM) and initialize the set of 3D Gaussians with the sparse point cloud produced for free as part of the SfM process. In contrast to most point-based solutions that require Multi-View Stereo (MVS) data, they achieve high-quality results with only SfM points as input. Note that for the NeRF-synthetic dataset, their method achieves high quality even with random initialization. 

They show that 3D Gaussians are an excellent choice since they are a differentiable volumetric representation, yet they can be rasterized very efficiently by projecting them to 2D and applying standard 𝛼-blending, using an image formation model equivalent to NeRF’s. The second component of their method is the optimization of the properties of the 3D Gaussians – 3D position, opacity 𝛼, anisotropic covariance, and spherical harmonic (SH) coefficients – interleaved with adaptive density control steps, where they add and occasionally remove 3D Gaussians during optimization. The optimization procedure produces a reasonably compact, unstructured, and precise representation of the scene (1-5 million Gaussians for all scenes tested). Their method’s third and final element is their real-time rendering solution, which uses fast GPU sorting algorithms inspired by tile-based rasterization, following recent work.
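
To give a feel for the rendering step, the following toy sketch shows front-to-back alpha compositing of depth-sorted splats for a single pixel, which is the core blending rule the tile-based rasterizer evaluates; the real implementation is a CUDA kernel operating on projected anisotropic 2D Gaussians, which this NumPy version does not attempt to reproduce.

# Toy sketch of front-to-back alpha blending for one pixel. It assumes the
# splats covering the pixel are already depth-sorted and that each splat's
# per-pixel opacity and color have been evaluated from its projected 2D
# Gaussian. Illustrative only; not the paper's CUDA rasterizer.
import numpy as np

def composite_pixel(colors, alphas):
    """colors: (N, 3) RGB per sorted splat; alphas: (N,) opacity per splat."""
    pixel = np.zeros(3)
    transmittance = 1.0  # fraction of light not yet absorbed
    for color, alpha in zip(colors, alphas):
        pixel += transmittance * alpha * color
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # early termination, as in tile-based rasterizers
            break
    return pixel

# Two splats in front-to-back order: a semi-transparent red one, then green.
print(composite_pixel(np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
                      np.array([0.6, 0.8])))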

However, thanks to their 3D Gaussian representation, they can perform anisotropic splatting that respects visibility ordering – thanks to sorting and 𝛼-blending – and enable a fast and accurate backward pass by tracking the traversal of as many sorted splats as required. To summarize, they provide the following contributions:

• The introduction of anisotropic 3D Gaussians as a high-quality, unstructured representation of radiance fields. 

• An optimization method of 3D Gaussian properties, interleaved with adaptive density control, creates high-quality representations for captured scenes. 

• A fast, differentiable rendering approach for the GPU, which is visibility-aware, allows anisotropic splatting and fast backpropagation to achieve high-quality novel view synthesis. 

Their results on previously published datasets show that they can optimize their 3D Gaussians from multi-view captures and achieve equal or better quality than the best of previous implicit radiance field approaches. They also can achieve training speeds and quality similar to the fastest methods and, importantly, provide the first real-time rendering with high quality for novel-view synthesis.

Check out the Paper and GitHub.



Announcing Amazon S3 access point support for Amazon SageMaker Data Wrangler

We’re excited to announce Amazon SageMaker Data Wrangler support for Amazon S3 Access Points. With its visual point-and-click interface, SageMaker Data Wrangler simplifies data preparation and feature engineering, including data selection, cleansing, exploration, and visualization, while S3 Access Points simplify data access by providing unique hostnames with specific access policies.
Starting today, SageMaker Data Wrangler makes it easier for users to prepare data from shared datasets stored in Amazon Simple Storage Service (Amazon S3) while enabling organizations to securely control data access. With S3 Access Points, data administrators can now create application- and team-specific access points to facilitate data sharing, rather than managing complex bucket policies with many different permission rules.
In this post, we walk you through importing data from, and exporting data to, an S3 access point in SageMaker Data Wrangler.
Solution overview
Imagine you, as an administrator, have to manage data for multiple data science teams running their own data preparation workflows in SageMaker Data Wrangler. Administrators often face three challenges:

Data science teams need to access their datasets without compromising the security of others
Data science teams need access to some datasets with sensitive data, which further complicates managing permissions
Security policy only permits data access through specific endpoints to prevent unauthorized access and to reduce the exposure of data

With traditional bucket policies, you would struggle to set up granular access because bucket policies apply the same permissions to all objects within the bucket. Traditional bucket policies also can’t secure access at the endpoint level.
S3 Access Points solves these problems by providing fine-grained access control, making it easier to manage permissions for different teams without impacting other parts of the bucket. Instead of modifying a single bucket policy, you can create multiple access points with individual policies tailored to specific use cases, reducing the risk of misconfiguration or unintended access to sensitive data. Lastly, you can enforce endpoint policies on access points to define rules that control which VPCs or IP addresses can access the data through a specific access point.
We demonstrate how to use S3 Access Points with SageMaker Data Wrangler with the following steps:

Upload data to an S3 bucket.
Create an S3 access point.
Configure your AWS Identity and Access Management (IAM) role with the necessary policies.
Create a SageMaker Data Wrangler flow.
Export data from SageMaker Data Wrangler to the access point.

For this post, we use the Bank Marketing dataset for our sample data. However, you can use any other dataset you prefer.
Prerequisites
For this walkthrough, you should have the following prerequisites:

An AWS account.
An Amazon SageMaker Studio domain and user. For details on setting these up, refer to Onboard to Amazon SageMaker Domain Using Quick setup.
An S3 bucket.

Upload data to an S3 bucket
Upload your data to an S3 bucket. For instructions, refer to Uploading objects. For this post, we use the Bank Marketing dataset.

Create an S3 access point
To create an S3 access point, complete the following steps. For more information, refer to Creating access points.

On the Amazon S3 console, choose Access Points in the navigation pane.
Choose Create access point.
For Access point name, enter a name for your access point.
For Bucket, select Choose a bucket in this account.
For Bucket name, enter the name of the bucket you created.
Leave the remaining settings as default and choose Create access point.

On the access point details page, note the Amazon Resource Name (ARN) and access point alias. You use these later when you interact with the access point in SageMaker Data Wrangler.
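
If you prefer to script this step instead of using the console, the following is a minimal sketch using the boto3 S3 Control API; the account ID, bucket, and access point names are placeholders.

# Minimal sketch: create an S3 access point programmatically with boto3.
# The account ID, bucket name, and access point name are placeholders.
import boto3

s3control = boto3.client("s3control", region_name="us-east-1")

response = s3control.create_access_point(
    AccountId="111122223333",
    Name="datawrangler-ap",           # access point name
    Bucket="my-datawrangler-bucket",  # existing bucket in this account
)
print(response["AccessPointArn"])  # note this ARN for use in Data Wrangler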

Configure your IAM role
If you have a SageMaker Studio domain up and ready, complete the following steps to edit the execution role:

On the SageMaker console, choose Domains in the navigation pane.
Choose your domain.
On the Domain settings tab, choose Edit.

By default, the IAM role that you use to access Data Wrangler is SageMakerExecutionRole. We need to add the following two policies to use S3 access points:

Policy 1 – This IAM policy grants SageMaker Data Wrangler access to perform PutObject, GetObject, and DeleteObject:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3AccessPointAccess",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:us-east-1:<<accountID>>:accesspoint/<<s3-dw-accesspoint>>"
        }
    ]
}

Policy 2 – This IAM policy grants SageMaker Data Wrangler access to get the S3 access point:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GetAccessPoint",
            "Effect": "Allow",
            "Action": "s3:GetAccessPoint",
            "Resource": "arn:aws:s3:us-east-1:<<accountID>>:accesspoint/<<s3-dw-accesspoint>>"
        }
    ]
}

Create these two policies and attach them to the role.
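
For reference, the same result can be scripted; the following minimal boto3 sketch creates the first policy and attaches it to the execution role, with the role name, policy name, account ID, and access point name as placeholders.

# Minimal sketch: create a customer managed policy and attach it to the
# SageMaker execution role with boto3. All names are placeholders.
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "S3AccessPointAccess",
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
        "Resource": "arn:aws:s3:us-east-1:111122223333:accesspoint/datawrangler-ap",
    }],
}

policy = iam.create_policy(
    PolicyName="DataWranglerAccessPointPolicy",
    PolicyDocument=json.dumps(policy_document),
)
iam.attach_role_policy(
    RoleName="SageMakerExecutionRole",  # the Data Wrangler execution role
    PolicyArn=policy["Policy"]["Arn"],
)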

Using S3 Access Points in SageMaker Data Wrangler
To create a new SageMaker Data Wrangler flow, complete the following steps:

Launch SageMaker Studio.
On the File menu, choose New and Data Wrangler Flow.

Choose Amazon S3 as the data source.

For S3 source, enter the S3 access point using the ARN or alias that you noted down earlier.

For this post, we use the ARN to import data using the S3 access point. However, the ARN only works for S3 access points and SageMaker Studio domains within the same Region.

Alternatively, you can use the alias, as shown in the following screenshot. Unlike ARNs, aliases can be referenced across Regions.

Export data from SageMaker Data Wrangler to S3 access points
After we complete the necessary transformations, we can export the results to the S3 access point. In our case, we simply dropped a column. When you complete whatever transformations you need for your use case, complete the following steps:

In the data flow, choose the plus sign.
Choose Add destination and Amazon S3.

Enter the dataset name and the S3 location, referencing the ARN.

Now you have used S3 access points to import and export data securely and efficiently without having to manage complex bucket policies and navigate multiple folder structures.

Clean up
If you created a new SageMaker domain to follow along, be sure to stop any running apps and delete your domain to stop incurring charges. Also, delete any S3 access points and delete any S3 buckets.
Conclusion
In this post, we introduced the availability of S3 Access Points for SageMaker Data Wrangler and showed you how you can use this feature to simplify data control within SageMaker Studio. We accessed the dataset from, and saved the resulting transformations to, an S3 access point alias across AWS accounts. We hope that you take advantage of this feature to remove any bottlenecks with data access for your SageMaker Studio users, and encourage you to give it a try!

About the authors
Peter Chung is a Solutions Architect serving enterprise customers at AWS. He loves to help customers use technology to solve business problems on various topics like cutting costs and leveraging artificial intelligence. He wrote a book on AWS FinOps, and enjoys reading and building solutions.
Neelam Koshiya is an Enterprise Solution Architect at AWS. Her current focus is to help enterprise customers with their cloud adoption journey for strategic business outcomes. In her spare time, she enjoys reading and being outdoors.

Machine learning with decentralized training data using federated learning on Amazon SageMaker

Machine learning (ML) is revolutionizing solutions across industries and driving new forms of insights and intelligence from data. Many ML algorithms train over large datasets, generalizing patterns they find in the data and inferring results from those patterns as new unseen records are processed. Usually, if the dataset or model is too large to be trained on a single instance, distributed training allows multiple instances within a cluster to be used, distributing either data or model partitions across those instances during the training process. Native support for distributed training is offered through the Amazon SageMaker SDK, along with example notebooks in popular frameworks.
However, sometimes due to security and privacy regulations within or across organizations, the data is decentralized across multiple accounts or in different Regions and it can’t be centralized into one account or across Regions. In this case, federated learning (FL) should be considered to get a generalized model on the whole data.
In this post, we discuss how to implement federated learning on Amazon SageMaker to run ML with decentralized training data.
What is federated learning?
Federated learning is an ML approach that allows for multiple separate training sessions running in parallel to run across large boundaries, for example geographically, and aggregate the results to build a generalized model (global model) in the process. More specifically, each training session uses its own dataset and gets its own local model. Local models in different training sessions will be aggregated (for example, model weight aggregation) into a global model during the training process. This approach stands in contrast to centralized ML techniques where datasets are merged for one training session.
Federated learning vs. distributed training on the cloud
When these two approaches run on the cloud, distributed training happens in one Region in one account, and the training data starts in a centralized training session or job. During the distributed training process, the dataset gets split into smaller subsets and, depending on the strategy (data parallelism or model parallelism), subsets are sent to different training nodes or pass through nodes in a training cluster, which means individual data doesn’t necessarily stay in one node of the cluster.
In contrast, with federated learning, training usually occurs in multiple separate accounts or across Regions. Each account or Region has its own training instances. The training data is decentralized across accounts or Regions from the beginning to the end, and individual data is only read by its respective training session or job between different accounts or Regions during the federated learning process.
Flower federated learning framework
Several open-source frameworks are available for federated learning, such as FATE, Flower, PySyft, OpenFL, FedML, NVFlare, and TensorFlow Federated. When choosing an FL framework, we usually consider its support for model category, ML framework, and device or operating system. We also need to consider the FL framework’s extensibility and package size so as to run it on the cloud efficiently. In this post, we choose an easily extensible, customizable, and lightweight framework, Flower, to do the FL implementation using SageMaker.
Flower is a comprehensive FL framework that distinguishes itself from existing frameworks by offering new facilities to run large-scale FL experiments, and enables richly heterogeneous FL device scenarios. FL solves challenges related to data privacy and scalability in scenarios where sharing data is not possible.
Design principles and implementation of Flower FL
Flower FL is language-agnostic and ML framework-agnostic by design, is fully extensible, and can incorporate emerging algorithms, training strategies, and communication protocols. Flower is open-sourced under Apache 2.0 License.
The conceptual architecture of the FL implementation is described in the paper Flower: A friendly Federated Learning Framework and is highlighted in the following figure.

In this architecture, edge clients live on real edge devices and communicate with the server over RPC. Virtual clients, on the other hand, consume close to zero resources when inactive and only load model and data into memory when the client is being selected for training or evaluation.
The Flower server builds the strategy and configurations to be sent to the Flower clients. It serializes these configuration dictionaries (or config dict for short) to their ProtoBuf representation, transports them to the client using gRPC, and then deserializes them back to Python dictionaries.
Flower FL strategies
Flower allows customization of the learning process through the strategy abstraction. The strategy defines the entire federation process specifying parameter initialization (whether it’s server or client initialized), the minimum number of clients available required to initialize a run, the weight of the client’s contributions, and training and evaluation details.
Flower has an extensive implementation of FL averaging algorithms and a robust communication stack. For a list of averaging algorithms implemented and associated research papers, refer to the following table, from Flower: A friendly Federated Learning Framework.

Federated learning with SageMaker: Solution architecture
A federated learning architecture using SageMaker with the Flower framework is implemented on top of bi-directional gRPC streams. gRPC defines the types of messages exchanged and uses compilers to generate efficient implementations for Python, but it can also generate implementations for other languages, such as Java or C++.
The Flower clients receive instructions (messages) as raw byte arrays via the network. Then the clients deserialize and run the instruction (training on local data). The results (model parameters and weights) are then serialized and communicated back to the server.
The server/client architecture for Flower FL is defined in SageMaker using notebook instances in different accounts in the same Region, one acting as the Flower server and the other as the Flower client. The training and evaluation strategies are defined on the server along with the global parameters, then the configuration is serialized and sent to the client over VPC peering.
The notebook instance client starts a SageMaker training job that runs a custom script to trigger the instantiation of the Flower client, which deserializes and reads the server configuration, triggers the training job, and sends the parameters response.
The last step occurs on the server when the evaluation of the newly aggregated parameters is triggered upon completion of the number of runs and clients stipulated on the server strategy. The evaluation takes place on a testing dataset existing only on the server, and the new improved accuracy metrics are produced.
The following diagram illustrates the architecture of the FL setup on SageMaker with the Flower package.

Implement federated learning using SageMaker
SageMaker is a fully managed ML service. With SageMaker, data scientists and developers can quickly build and train ML models, and then deploy them into a production-ready hosted environment.
In this post, we demonstrate how to use the managed ML platform to provide a notebook experience environment and perform federated learning across AWS accounts, using SageMaker training jobs. The raw training data never leaves the account that owns the data and only the derived weights are sent across the peered connection.
We highlight the following core components in this post:

Networking – SageMaker allows for quick setup of default networking configuration while also allowing you to fully customize the networking depending on your organization’s requirements. We use a VPC peering configuration within the Region in this example.
Cross-account access settings – In order to allow a user in the server account to start a model training job in the client account, we delegate access across accounts using AWS Identity and Access Management (IAM) roles. This way, a user in the server account doesn’t have to sign out of the account and sign in to the client account to perform actions on SageMaker. This setting is only for purposes of starting SageMaker training jobs, and it doesn’t have any cross-account data access permission or sharing.
Implementing federated learning client code in the client account and server code in the server account – We implement federated learning client code in the client account by using the Flower package and SageMaker managed training. Meanwhile, we implement server code in the server account by using the Flower package.

Set up VPC peering
A VPC peering connection is a networking connection between two VPCs that enables you to route traffic between them using private IPv4 addresses or IPv6 addresses. Instances in either VPC can communicate with each other as if they are within the same network.
To set up a VPC peering connection, first create a request to peer with another VPC. You can request a VPC peering connection with another VPC in the same account, or in our use case, connect with a VPC in a different AWS account. To activate the request, the owner of the VPC must accept the request. For more details about VPC peering, refer to Create a VPC peering connection.
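
A minimal boto3 sketch of requesting and accepting a cross-account peering connection follows; the VPC IDs and account ID are placeholders, and the acceptance call must run with credentials for the accepter (client) account.

# Minimal sketch: request a VPC peering connection from the server account
# to a VPC in the client account, then accept it from the client account.
# All IDs are placeholders.
import boto3

# In the server (requester) account
ec2_server = boto3.client("ec2", region_name="us-east-1")
peering = ec2_server.create_vpc_peering_connection(
    VpcId="vpc-0serveraaaabbbbccc",      # server account VPC
    PeerVpcId="vpc-0clientaaaabbbbccc",  # client account VPC
    PeerOwnerId="222233334444",          # client account ID
)
peering_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]

# In the client (accepter) account, using that account's credentials
ec2_client = boto3.client("ec2", region_name="us-east-1")
ec2_client.accept_vpc_peering_connection(VpcPeeringConnectionId=peering_id)

# Remember to add routes for the peer CIDR range to each VPC's route tables.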
Launch SageMaker notebook instances in VPCs
A SageMaker notebook instance provides a Jupyter notebook app through a fully managed ML Amazon Elastic Compute Cloud (Amazon EC2) instance. SageMaker Jupyter notebooks are used to perform advanced data exploration, create training jobs, deploy models to SageMaker hosting, and test or validate your models.
The notebook instance has a variety of networking configurations available to it. In this setup, we have the notebook instance run within a private subnet of the VPC and don’t have direct internet access.

Configure cross-account access settings
Cross-account access settings include two steps to delegate access from the server account to client account by using IAM roles:

Create an IAM role in the client account.
Grant access to the role in the server account.

For detailed steps to set up a similar scenario, refer to Delegate access across AWS accounts using IAM roles.
In the client account, we create an IAM role called FL-kickoff-client-job with the policy FL-sagemaker-actions attached to the role. The FL-sagemaker-actions policy has JSON content as follows:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateTrainingJob",
                "sagemaker:DescribeTrainingJob",
                "sagemaker:StopTrainingJob",
                "sagemaker:UpdateTrainingJob"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcs",
                "ec2:DescribeNetworkInterfaces"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:PassRole"
            ],
            "Resource": "arn:aws:iam::<client-account-number>:role/service-role/AmazonSageMaker-ExecutionRole-<xxxxxxxxxxxxxxx>"
        }
    ]
}

We then modify the trust policy in the trust relationships of the FL-kickoff-client-job role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<server-account-number>:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {}
        }
    ]
}

In the server account, permissions are added to an existing user (for example, developer) to allow switching to the FL-kickoff-client-job role in client account. To do this, we create an inline policy called FL-allow-kickoff-client-job and attach it to the user. The following is the policy JSON content:

{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": "arn:aws:iam::<client-account-number>:role/FL-kickoff-client-job"
    }
}

Sample dataset and data preparation
In this post, we use a curated dataset for fraud detection in Medicare providers’ data released by the Centers for Medicare & Medicaid Services (CMS). Data is split into a training dataset and a testing dataset. Because the majority of the data is non-fraud, we apply SMOTE to balance the training dataset, and further split the training dataset into training and validation parts. Both the training and validation data are uploaded to an Amazon Simple Storage Service (Amazon S3) bucket for model training in the client account, and the testing dataset is used in the server account for testing purposes only. Details of the data preparation code are in the following notebook.
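
The following is a minimal sketch of the preparation described here, using imbalanced-learn's SMOTE and scikit-learn's train_test_split; the file names and label column are placeholders, and the actual notebook may differ.

# Minimal sketch of the data preparation described above: balance the training
# data with SMOTE, then split it into training and validation sets.
# File names and the label column are placeholders.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

df = pd.read_csv("medicare_fraud_train.csv")
X, y = df.drop(columns=["fraud_label"]), df["fraud_label"]

# Oversample the minority (fraud) class.
X_balanced, y_balanced = SMOTE(random_state=42).fit_resample(X, y)

X_train, X_val, y_train, y_val = train_test_split(
    X_balanced, y_balanced, test_size=0.2, random_state=42, stratify=y_balanced
)

# Write out CSVs to upload to the client account's S3 bucket for training.
pd.concat([y_train, X_train], axis=1).to_csv("train.csv", index=False)
pd.concat([y_val, X_val], axis=1).to_csv("validation.csv", index=False)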
With the SageMaker pre-built Docker images for the scikit-learn framework and SageMaker managed training process, we train a logistic regression model on this dataset using federated learning.
Implement a federated learning client in the client account
In the client account’s SageMaker notebook instance, we prepare a client.py script and a utils.py script. The client.py file contains code for the client, and the utils.py file contains code for some of the utility functions that will be needed for our training. We use the scikit-learn package to build the logistic regression model.
In client.py, we define a Flower client. The client is derived from the class fl.client.NumPyClient. It needs to define the following three methods:

get_parameters – It returns the current local model parameters. The utility function get_model_parameters will do this.
fit – It defines the steps to train the model on the training data in client’s account. It also receives global model parameters and other configuration information from the server. We update the local model’s parameters using the received global parameters and continue training it on the dataset in the client account. This method also sends the local model’s parameters after training, the size of the training set, and a dictionary communicating arbitrary values back to the server.
evaluate – It evaluates the provided parameters using the validation data in the client account. It returns the loss together with other details such as the size of the validation set and accuracy back to the server.

The following is a code snippet for the Flower client definition:

"""Client interface"""
class FlowerClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        return utils.get_model_parameters(model)

    def fit(self, parameters, config):
        utils.set_model_params(model, parameters)
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            model.fit(X_train, y_train)
        return utils.get_model_parameters(model), len(X_train), {}

    def evaluate(self, parameters, config):
        utils.set_model_params(model, parameters)
        loss = log_loss(y_test, model.predict_proba(X_test))
        accuracy = model.score(X_test, y_test)
        return loss, len(X_test), {"accuracy": accuracy}

We then use SageMaker script mode to prepare the rest of the client.py file. This includes defining parameters that will be passed to SageMaker training, loading training and validation data, initializing and training the model on the client, setting up the Flower client to communicate with the server, and finally saving the trained model.
utils.py includes a few utility functions that are called in client.py:

get_model_parameters – It returns the scikit-learn LogisticRegression model parameters.
set_model_params – It sets the model’s parameters.
set_initial_params – It initializes the parameters of the model as zeros. This is required because the server asks for initial model parameters from the client at launch. However, in the scikit-learn framework, LogisticRegression model parameters are not initialized until model.fit() is called.
load_data – It loads the training and testing data.
save_model – It saves model as a .joblib file.

Because Flower is not a package installed in the SageMaker pre-built scikit-learn Docker container, we list flwr==1.3.0 in a requirements.txt file.
We put all three files (client.py, utils.py, and requirements.txt) under a folder and tar zip it. The .tar.gz file (named source.tar.gz in this post) is then uploaded to an S3 bucket in the client account.
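
A minimal sketch of packaging and uploading the code with Python's tarfile module and boto3 is shown below; the bucket name is a placeholder.

# Minimal sketch: package client.py, utils.py, and requirements.txt into
# source.tar.gz and upload it to an S3 bucket in the client account.
# The bucket name is a placeholder.
import tarfile
import boto3

with tarfile.open("source.tar.gz", "w:gz") as tar:
    for name in ["client.py", "utils.py", "requirements.txt"]:
        tar.add(name)

boto3.client("s3").upload_file(
    "source.tar.gz",
    "client-account-s3-code-bucket",  # placeholder bucket name
    "client_code/source.tar.gz",
)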
Implement a federated learning server in the server account
In the server account, we prepare code on a Jupyter notebook. This includes two parts: the server first assumes a role to start a training job in the client account, then the server federates the model using Flower.
Assume a role to run the training job in the client account
We use the Boto3 Python SDK to set up an AWS Security Token Service (AWS STS) client to assume the FL-kickoff-client-job role and set up a SageMaker client so as to run a training job in the client account by using the SageMaker managed training process:

sts_client = boto3.client('sts')
assumed_role_object = sts_client.assume_role(
    RoleArn="arn:aws:iam::<client-account-number>:role/FL-kickoff-client-job",
    RoleSessionName="AssumeRoleSession1"
)

credentials = assumed_role_object['Credentials']

sagemaker_client = boto3.client(
    'sagemaker',
    aws_access_key_id=credentials['AccessKeyId'],
    aws_secret_access_key=credentials['SecretAccessKey'],
    aws_session_token=credentials['SessionToken'],
)

Using the assumed role, we create a SageMaker training job in client account. The training job uses the SageMaker built-in scikit-learn framework. Note that all S3 buckets and the SageMaker IAM role in the following code snippet are related to the client account:

sagemaker_client.create_training_job(
    TrainingJobName=training_job_name,
    HyperParameters={
        "penalty": "l2",
        "max-iter": "10",
        "server-address": "<server-ip-address>:8080",
        "sagemaker_program": "client.py",
        "sagemaker_submit_directory": "s3://<client-account-s3-code-bucket>/client_code/source.tar.gz",
    },
    AlgorithmSpecification={
        "TrainingImage": training_image,
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::<client-account-number>:role/service-role/AmazonSageMaker-ExecutionRole-<xxxxxxxxxxxxxxx>",
    InputDataConfig=[
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://<client-account-s3-data-bucket>/data_prep/",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        },
    ],
    OutputDataConfig={
        "S3OutputPath": "s3://<client-account-s3-bucket-for-model-artifact>/client_artifact/"
    },
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 10,
    },
    VpcConfig={
        "SecurityGroupIds": [
            "<client-account-notebook-instance-security-group>",
        ],
        "Subnets": [
            "<client-account-notebook-instance-subnet>",
        ]
    },
    StoppingCondition={
        "MaxRuntimeInSeconds": 86400
    },
)

Aggregate local models into a global model using Flower
We prepare code to federate the model on the server. This includes defining the strategy for federation and its initialization parameters. We use utility functions in the utils.py script described earlier to initialize and set model parameters. Flower allows you to define your own callback functions to customize an existing strategy. We use the FedAvg strategy with custom callbacks for evaluation and fit configuration. See the following code:

"""Initialize the model and federation strategy, then start the server"""
model = LogisticRegression()
utils.set_initial_params(model)

strategy = fl.server.strategy.FedAvg(
    min_available_clients=1,  # Minimum number of clients that need to be connected to the server before a training round can start
    min_fit_clients=1,        # Minimum number of clients to be sampled for the next round
    min_evaluate_clients=1,
    evaluate_fn=get_evaluate_fn(model, X_test, y_test),
    on_fit_config_fn=fit_round,
)

fl.server.start_server(
    server_address=args.server_address,
    strategy=strategy,
    config=fl.server.ServerConfig(num_rounds=3)  # run for 3 rounds
)

utils.save_model(args.model_dir, model)

The following two functions are mentioned in the preceding code snippet:

fit_round – It’s used to send the round number to the client. We pass this callback as the on_fit_config_fn parameter of the strategy. We do this simply to demonstrate the use of the on_fit_config_fn parameter.
get_evaluate_fn – It’s used for model evaluation on the server.

For demo purposes, we use the testing dataset that we set aside in data preparation to evaluate the model federated from the client’s account and communicate the result back to the client. However, it’s worth noting that in almost all real use cases, the data used in the server account is not split from the dataset used in the client account.
After the federated learning process is finished, a model.tar.gz file is saved by SageMaker as a model artifact in an S3 bucket in the client account. Meanwhile, a model.joblib file is saved on the SageMaker notebook instance in the server account. Lastly, we use the testing dataset to evaluate the final model (model.joblib) on the server.
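
A minimal sketch of that final evaluation step, loading the saved model.joblib and scoring it on the server-side test set, might look like the following; the file names and label column are placeholders.

# Minimal sketch: load the federated model saved on the server and evaluate it
# on the server-side test set. File names and the label column are placeholders.
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score, log_loss

model = joblib.load("model.joblib")

test_df = pd.read_csv("medicare_fraud_test.csv")
X_test, y_test = test_df.drop(columns=["fraud_label"]), test_df["fraud_label"]

predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
print("log loss:", log_loss(y_test, model.predict_proba(X_test)))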

Clean up
After you are done, clean up the resources in both the server account and client account to avoid additional charges:

Stop the SageMaker notebook instances.
Delete VPC peering connections and corresponding VPCs.
Empty and delete the S3 bucket you created for data storage.

Conclusion
In this post, we walked through how to implement federated learning on SageMaker by using the Flower package. We showed how to configure VPC peering, set up cross-account access, and implement the FL client and server. This post is useful for those who need to train ML models on SageMaker using decentralized data across accounts with restricted data sharing. Because the FL in this post is implemented using SageMaker, it’s worth noting that a lot more features in SageMaker can be brought into the process.
Implementing federated learning on SageMaker can take advantage of all the advanced features that SageMaker provides through the ML lifecycle. There are other ways to achieve or apply federated learning on the AWS Cloud, such as using EC2 instances or on the edge. For details about these alternative approaches, refer to Federated Learning on AWS with FedML and Applying Federated Learning for ML at the Edge.

About the authors
Sherry Ding is a senior AI/ML specialist solutions architect at Amazon Web Services (AWS). She has extensive experience in machine learning with a PhD degree in computer science. She mainly works with public sector customers on various AI/ML-related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.
Lorea Arrizabalaga is a Solutions Architect aligned to the UK Public Sector, where she helps customers design ML solutions with Amazon SageMaker. She is also part of the Technical Field Community dedicated to hardware acceleration and helps with testing and benchmarking AWS Inferentia and AWS Trainium workloads.
Ben Snively is an AWS Public Sector Senior Principal Specialist Solutions Architect. He works with government, non-profit, and education customers on big data, analytical, and AI/ML projects, helping them build solutions using AWS.

Together AI Unveils Llama-2-7B-32K-Instruct: A Breakthrough in Extended-Context Language Processing

A persistent challenge in natural language processing is the ability to comprehend and respond to intricate, lengthy instructions. As communication becomes more nuanced, the shortcomings of prevailing models in handling extensive context have become clear. Together AI has introduced a solution that addresses this gap, with particular implications for tasks that require a firm grasp of extended context.

Contemporary natural language processing techniques often struggle with the complexities of protracted instructions. The research team’s model, Llama-2-7B-32K-Instruct, ventures into promising territory: by harnessing the capabilities of the Together Inference API, the team built a model that thrives on longer instructions without compromising its performance in briefer contexts. This strategy echoes the approaches embraced by models like Alpaca, Vicuna, WizardLM, and Orca, where tapping into potent language models yields valuable training data.

The success of Llama-2-7B-32K-Instruct rests on a carefully directed four-step process carried out by the research team. It begins with distillation of the training data: a unified amalgamation of diverse datasets encompassing conversations, human directives, and outputs derived from Llama-2-70B-Chat. This broad-ranging mix allows the model to comprehend intricate instructions with finesse. The team uses the Together Inference API to query Llama-2-70B-Chat, a robust language model, and then fine-tunes Llama-2-7B-32K-Instruct on the results.

Following fine-tuning, the model undergoes rigorous evaluations. Its performance is benchmarked across a spectrum of tasks, from summarization to multi-document question answering. Llama-2-7B-32K-Instruct consistently outperforms existing baseline models, including GPT-3.5-Turbo-16K, Llama-2-7b-chat, Longchat-7b-16k, and Longchat-7b-v1.5-32k. This consistent performance affirms the model’s adeptness at managing lengthy instructions while excelling across diverse benchmarks.
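
For readers who want to try the released model, here is a minimal sketch of loading it with Hugging Face Transformers; the model ID, prompt format, and generation settings are assumptions based on the Together AI release and should be verified against the model card, and a full 32K-token context requires substantial GPU memory.

# Minimal sketch: run the instruct model with Transformers. The model ID and
# the [INST] prompt format are assumptions; check the official model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "togethercomputer/Llama-2-7B-32K-Instruct"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "[INST] Summarize the following document in three bullet points:\n...\n[/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))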

https://together.ai/blog/llama-2-7b-32k-instruct


In conclusion, the release of Llama-2-7B-32K-Instruct marks a notable stride in grappling with the complexities of extended-context language processing. The research team’s straightforward methodology, combined with the innovative use of the Together Inference API, has produced a model that meets the demands of complex instructions and establishes a new performance benchmark. Llama-2-7B-32K-Instruct provides a compelling preview of forthcoming advances in natural language processing by bridging the gap between understanding complex contexts and generating relevant responses. This advancement stands poised to empower applications that demand exhaustive comprehension of, and adept response generation from, intricate instructions.

Check out the Reference Article.


Tweet from Together AI (@togethercompute), August 18, 2023: “Introducing our newest long-context model: Llama-2-7B-32K-Instruct. Fine-tuned using Together API, the model is now available to use with our APIs & Playground. Try it out and send us feedback!”


The Future of Language Models: Embracing Multi-Modality for Enhanced User Experiences

Artificial intelligence is advancing thanks to the introduction of highly capable and efficient large language models. Built on the concepts of natural language processing, natural language generation, and natural language understanding, these models have made many tasks easier. From text generation and question answering to code completion, language translation, and text summarization, LLMs have come a long way. With the development of OpenAI’s latest LLM, GPT-4, this advancement has opened the way for multi-modal models: unlike previous versions, GPT-4 can take image inputs as well as text.

The future is becoming more multi-modal, which means that these models can now understand and process various types of data in a manner akin to that of people. This change reflects how we communicate in real life, which involves combining text, visuals, music, and diagrams to express meaning effectively. This invention is viewed as a crucial improvement in the user experience, comparable to the revolutionary effects that chat functionality had earlier.

In a recent tweet, the author emphasized the significance of multi-modality in terms of user experience and technical difficulties in the context of language models. ByteDance has taken the lead in realizing the promise of multi-modal models thanks to its well-known platform, TikTok. They use a combination of text and image data as part of their technique, and a variety of applications, such as object detection and text-based image retrieval, are powered by this combination. Their method’s main component is offline batch inference, which produces embeddings for 200 terabytes of image and text data, which makes it possible to process various data kinds in an integrated vector space without any issues.

Implementing multi-modal systems brings challenges such as inference optimization, resource scheduling, and elasticity, and the amount of data and the size of the models involved are enormous. To address these problems, ByteDance has used Ray, a flexible computing framework that provides a number of tools for handling the complexities of multi-modal processing. Ray’s capabilities, especially Ray Data, provide the flexibility and scalability needed for large-scale model-parallel inference. The technology supports effective model sharding, which permits the spread of computing jobs over multiple GPUs or even different regions of the same GPU, guaranteeing efficient processing even of models that are too huge to fit on a single GPU.
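
As a rough illustration of the offline batch inference pattern described above, here is a minimal Ray Data sketch; the embedding function is a placeholder, since ByteDance’s actual models and pipeline are not public.

# Minimal sketch of offline batch inference with Ray Data: map an embedding
# function over a dataset in batches and write the results to Parquet.
# The embedding function is a placeholder for a real text/image encoder.
import numpy as np
import ray

def embed_batch(df):
    # Placeholder encoder producing random 512-d vectors; swap in a real model.
    df["embedding"] = [np.random.rand(512).astype(np.float32).tolist() for _ in range(len(df))]
    return df

ds = ray.data.from_items([{"text": f"caption {i}"} for i in range(1_000)])
embedded = ds.map_batches(embed_batch, batch_format="pandas", batch_size=256)
embedded.write_parquet("/tmp/embeddings")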

The move towards multi-modal language models heralds a new era in AI-driven interactions. ByteDance uses Ray to provide effective and scalable multi-modal inference, showcasing the enormous potential of this method. The capacity of AI systems to comprehend, interpret, and react to multi-modal input will surely influence how people interact with technology as the digital world grows more complex and varied. Innovative businesses working with cutting-edge frameworks like Ray are paving the way for a time when AI systems can comprehend not just our speech but also our visual cues, enabling richer and more human-like interactions.

Check out Reference 1 and Reference 2.

Tweet from Robert Nishihara (@robertnishihara), August 15, 2023: “Nearly all LLMs will be multi-modal. Multi-modality is another 10x UX improvement in the same way that chat was. But multi-modality is hard to do, and it’s expensive. This article by @BytedanceTalk gives a taste of where things are headed (and how they’re used in TikTok).”


Best AI Spreadsheet Tools 2023

When combined with data from other sources, including marketing data platforms, Excel may provide invaluable insights quickly. While most people think of it as a spreadsheet program, it is a powerful computing tool capable of solving intricate issues.

However, mastering numerous intricate formulas is required before the program can be fully used. The sheer volume of information needed to become proficient in Excel prevents most users from tapping into its full potential.

This no longer needs to be the case. With the advent of AI (artificial intelligence), Excel users no longer need to memorize hundreds of long, winding Excel formulas to enter complex calculations and get thorough insights.

Let’s check out some AI tools for Excel and Google Sheets.

Botsheets 

Botsheets is an AI program that can automatically convert discussions into spreadsheets. By linking a Google Sheet to a user’s customer message channels, users can instruct the AI what information to capture using column headings in the connected Google Sheet. The AI will sift through client discussions and automatically record relevant information in a Google Sheet. This eliminates the need for time-consuming and error-prone manual sorting through chat transcripts. A single message can yield useful information that can be collected and kept, making this tool perfect for lead generation. Botsheets also streamlines support by transforming unstructured support inquiries into organized data. Smart prompts like “Sentiment” capture clients’ positive and negative emotions, allowing users to easily spot trends and patterns in the customer service they provide.

SheetAI

SheetAI harnesses the intelligence of computers to bring you a set of AI tools that may be used to streamline repetitive processes and get new insights. Using SheetAI’s built-in functions, you may express your desired outcome in one cell and receive it in another. SHEETAI, SHEETAI_LIST, SHEETAI_TABLE, SHEETAI_IMAGE, SHEETAI_EDIT, SHEETAI_FILL, and SHEETAI_EXTRACT are just a few of the available functions. For instance, you may like to compile a list of the top 10 healthiest veggies in terms of various nutrients such as iron, magnesium, calcium, and vitamin D. Then, the =SHEETAI_TABLE function of the tool can be used to construct the table based on the requirements you give. The SHEETAI_BRAIN feature streamlines the copywriting process by providing instant access to relevant data, allowing you to develop content and catchy slogans rapidly.

Ajelix 

Ajelix provides several tools, such as an Excel formula generator, VBA script creation, and virtual AI aid, all of which contribute to increased efficiency while working with Google Sheets or Excel. The Excel formula generator shortens the time it takes to convert plain text into an Excel formula. Users may quickly and easily grasp Excel formulas with the help of the formula explanation. The Google Apps Script Generator can produce code automatically based on specifications. Excel VBA Script Explainer uses AI to explain Excel VBA code, while the Excel VBA Script Generator creates VBA scripts. The Google Sheets Formula Generator and the Google Apps Script Explainer use artificial intelligence to analyze and clarify scripts. The Excel File Translator, Excel Formula and Script Library, Excel Template Generator, and Excel Add-in are also included in the tools. Ajelix also provides services, including web development, business process analytics, and optimization, report development (static and dynamic), data analytics and forecasting (historical and projected), and WordPress development.

Arcwise AI

Arcwise AI is a GPT-based assistant app for Google Sheets. Simple text instructions expedite the analyzing, cleaning, and processing of data in Sheets. Sheets now feature an artificial intelligence tool that lets users ask queries like “What does this sheet do?” and “What are the calculation interdependencies in A10:D20?” in a conversational format. The software suggests appropriate formulas based on the current situation and gives access to related discussions on StackOverflow. Arcwise AI can help with tasks such as normalizing addresses, summarizing answers, scraping text from browser tabs into tables, and reformatting date columns. As a result, users can save time and effort in the data analysis process by eliminating the need for manual data preparation. The software is available as a free Chrome extension.

Sheet+

Sheet+ is an artificial intelligence-based program that will forever change how spreadsheets are used. It has several functions that can be used to build correct formulas for Google Sheets and Excel from text, simplify formulas, debug formulas, and more. Users may improve their spreadsheet abilities, save time and energy, and give their team a technological edge with the help of AI with Sheet+. Text may be converted into precise formulae in Excel and Google Sheets with the help of the tool’s Text to Formula feature. In contrast, the Formula to Explanation function provides detailed explanations of the formula’s inner workings. Sheet+ is an open-source spreadsheet app that eliminates the headaches of complicated formulas and calculations for good.

PromptLoop 

Using a straightforward formula, customers of PromptLoop can include AI tools like GPT-3 in their Google Sheets and Excel spreadsheets. Users can organize, summarize, and reformat textual information stored in spreadsheets. The software can be used for many different purposes, such as examining sales records, generating search engine optimization (SEO) keywords, creating persuasive customer messages for online storefronts, deconstructing survey results, and transforming jumbled language. Anyone with access to a spreadsheet can use PromptLoop to create AI-enabled spreadsheet models with little to no prior experience in AI or programming. PromptLoop provides a formula, similar to SUM or VLOOKUP, that uses advanced AI models to generate appropriate responses automatically. The underlying AI models can draw conclusions from a minimal amount of training data. Workflows that incorporate text generation, summarization, web search, and user-defined endpoints can be rapidly developed with the help of this application.

Luminal 

Luminal is an artificial intelligence-driven spreadsheet importer that aims to reduce the time and effort required to import spreadsheets for further processing. Customers can get their spreadsheets processed 10 times faster with Luminal than without it. Luminal is an adaptive solution that gets better at importing spreadsheets the more you use it since it is built to learn from your every activity. The solution is designed to work with large spreadsheets (over 5 million rows) that contain a variety of complicated data types, formats, and validations. Luminal provides a comprehensive answer, covering everything from setup to transformations powered by artificial intelligence to semantic column mapping, cutting-edge formatting, and customizable validation criteria. With Luminal, users can easily apply AI-powered changes like summarization, auto-tagging, and auto-categorization on columns containing data of varying types, from language codes to company names.

SheetGod 

With artificial intelligence, SheetGod can generate sophisticated Excel formulas, macros, regular expressions, and simple automation tasks in seconds, all from plain English. It also supports the creation of Google Apps Script code snippets, which can be used to automate repetitive tasks. The dashboard’s minimalist design makes it easy to navigate and use. In addition, it includes tutorials that walk users through the fundamentals of Excel and Google Sheets. SheetGod lets users extend the functionality of their spreadsheets and automate even more chores, such as sending marketing emails, generating mass PDFs, and creating Google Workspace add-ons and Microsoft Excel add-ins. It also works with regular expressions, letting users perform complicated transformations and extract specific data. With a 30-day money-back guarantee and personalized onboarding through Zoom and WhatsApp, SheetGod has earned the trust of more than 50,000 organizations.

Excel Formula Bot

The Excel Formula Bot is an online tool that uses NLP algorithms to create custom formulas for Excel or Google Sheets from user-supplied text. The aim is to speed up formula creation with AI. One of Excel Formula Bot’s most useful capabilities is automatically building formulas in Excel or Google Sheets from simple textual prompts, which lets users handle complex calculations quickly and removes much of the ambiguity usually associated with them. The Excel Formula Bot can also produce VBA code automatically by guiding the user through simple text prompts. A free plan with 5 credits is available on the website so that users can try out the features and quality of the service before committing to a paid membership. Although the Excel Formula Bot strives to offer accurate formulas, users are advised to evaluate and verify the generated formulas to ensure they match their specific needs. The tool can handle complex formulas, including those with nested functions, multiple conditions, and intricate computations.

GPTExcel 

GPTExcel is an AI program that can quickly create and explain formulas in Google Sheets and Microsoft Excel. Over 44,000 formulas have been produced, allowing users to easily build complex equations without learning much about Excel operations. The app has been featured on popular tech blogs like ProductHunt and Aitoolsclub. To help users understand how the tool arrived at a given answer, GPTExcel explains the logic underlying the created formulae. In addition, the application may be used to create Excel formulas compatible with Mac and Windows. Financial forecasting, data analysis, and statistical modeling are just some of the many possible applications of the user-generated formulas. The most recent update to GPTExcel boasts a 10x improvement in speed and quality, making the tool even easier to work with. In sum, GPTExcel is a helpful tool for those who spend a lot of time in Excel or Google Sheets.

Don’t forget to join our 29k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com

If you like our work, please follow us on Twitter

The post Best AI Spreadsheet Tools 2023 appeared first on MarkTechPost.

Explain medical decisions in clinical settings using Amazon SageMaker …

Explainability of machine learning (ML) models used in the medical domain is becoming increasingly important because models need to be explained from a number of perspectives in order to gain adoption. These perspectives range from medical, technological, legal, and the most important perspective—the patient’s. Models developed on text in the medical domain have become accurate statistically, yet clinicians are ethically required to evaluate areas of weakness related to these predictions in order to provide the best care for individual patients. Explainability of these predictions is required in order for clinicians to make the correct choices on a patient-by-patient basis.
In this post, we show how to improve model explainability in clinical settings using Amazon SageMaker Clarify.
Background
One specific application of ML algorithms in the medical domain, which uses large volumes of text, is clinical decision support systems (CDSSs) for triage. On a daily basis, patients are admitted to hospitals and admission notes are taken. After these notes are taken, the triage process is initiated, and ML models can assist clinicians with estimating clinical outcomes. This can help reduce operational overhead costs and provide optimal care for patients. Understanding why these decisions are suggested by the ML models is extremely important for decision-making related to individual patients.
The purpose of this post is to outline how you can deploy predictive models with Amazon SageMaker for the purposes of triage within hospital settings and use SageMaker Clarify to explain these predictions. The intent is to offer an accelerated path to adoption of predictive techniques within CDSSs for many healthcare organizations.
The notebook and code from this post are available on GitHub. To run it yourself, clone the GitHub repository and open the Jupyter notebook file.
Technical background
A large asset for any acute healthcare organization is its clinical notes. At the time of intake within a hospital, admission notes are taken. A number of recent studies have shown the predictability of key indicators such as diagnoses, procedures, length of stay, and in-hospital mortality. Predictions of these are now highly achievable from admission notes alone, through the use of natural language processing (NLP) algorithms [1].
Advances in NLP models, such as Bi-directional Encoder Representations from Transformers (BERT), have allowed for highly accurate predictions on a corpus of text, such as admission notes, that were previously difficult to get value from. Their prediction of the clinical indicators is highly applicable for use in a CDSS.
Yet, in order to use the new predictions effectively, how these accurate BERT models are achieving their predictions still needs to be explained. There are several techniques to explain the predictions of such models. One such technique is SHAP (SHapley Additive exPlanations), which is a model-agnostic technique for explaining the output of ML models.
What is SHAP
SHAP is a technique for explaining the output of ML models. It provides a way to break down the prediction of an ML model and understand how much each input feature contributes to the final prediction.
SHAP values are based on game theory, specifically the concept of Shapley values, which were originally proposed to allocate the payout of a cooperative game among its players [2]. In the context of ML, each feature in the input space is considered a player in a cooperative game, and the prediction of the model is the payout. SHAP values are calculated by examining the contribution of each feature to the model prediction for each possible combination of features. The average contribution of each feature across all possible feature combinations is then calculated, and this becomes the SHAP value for that feature.
SHAP allows predictions to be explained without requiring access to the model’s inner workings. In addition, there are techniques to display these SHAP explanations over text, so that both the medical and patient perspectives have intuitive visibility into how algorithms arrive at their predictions.
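To make the idea concrete, here is a minimal, self-contained sketch that computes SHAP values for a small tabular model using the open-source shap library. It is not part of the solution in this post; the public dataset and gradient-boosted model are illustrative stand-ins.

import shap
import xgboost

# Illustrative stand-in: a public tabular dataset and a gradient-boosted model.
X, y = shap.datasets.adult()
model = xgboost.XGBClassifier(n_estimators=50).fit(X, y)

# Explain individual predictions: each row gets one SHAP value per feature,
# quantifying how much that feature pushed the prediction up or down.
explainer = shap.Explainer(model)
shap_values = explainer(X.iloc[:100])

# Inspect the attribution for the first example.
print(dict(zip(X.columns, shap_values.values[0].round(3))))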
With new additions to SageMaker Clarify, and the use of pre-trained models from Hugging Face that are easily deployed in SageMaker, model training and explainability can all be done in AWS.
For the purpose of an end-to-end example, we take the clinical outcome of in-hospital mortality and show how this process can be implemented easily in AWS using a pre-trained Hugging Face BERT model, and the predictions will be explained using SageMaker Clarify.
Choices of Hugging Face model
Hugging Face offers a variety of pre-trained BERT models that have been specialized for use on clinical notes. For this post, we use the bigbird-base-mimic-mortality model. This model is a fine-tuned version of Google’s BigBird model, specifically adapted for predicting mortality using MIMIC ICU admission notes. The model’s task is to determine the likelihood of a patient not surviving a particular ICU stay based on the admission notes. One of the significant advantages of using this BigBird model is its capability to process larger context lengths, which means we can input the complete admission notes without the need for truncation.
Our steps involve deploying this fine-tuned model on SageMaker. We then incorporate this model into a setup that allows for real-time explanation of its predictions. To achieve this level of explainability, we use SageMaker Clarify.
Solution overview
SageMaker Clarify provides ML developers with purpose-built tools to gain greater insights into their ML training data and models. SageMaker Clarify explains both global and local predictions and explains decisions made by computer vision (CV) and NLP models.
The following diagram shows the SageMaker architecture for hosting an endpoint that serves explainability requests. It includes interactions between an endpoint, the model container, and the SageMaker Clarify explainer.

In the sample code, we use a Jupyter notebook to showcase the functionality. However, in a real-world use case, electronic health records (EHRs) or other hospital care applications would directly invoke the SageMaker endpoint to get the same response. In the Jupyter notebook, we deploy a Hugging Face model container to a SageMaker endpoint. Then we use SageMaker Clarify to explain the results that we obtain from the deployed model.
Prerequisites
You need the following prerequisites:

An AWS account
A SageMaker Jupyter notebook instance

Access the code from the GitHub repository and upload it to your notebook instance. You can also run the notebook in an Amazon SageMaker Studio environment, which is an integrated development environment (IDE) for ML development. We recommend using a Python 3 (Data Science) kernel on SageMaker Studio or a conda_python3 kernel on a SageMaker notebook instance.
Deploy the model with SageMaker Clarify enabled
As the first step, download the model from Hugging Face and upload it to an Amazon Simple Storage Service (Amazon S3) bucket. Then create a model object using the HuggingFaceModel class. This uses a prebuilt container to simplify the process of deploying Hugging Face models to SageMaker. You also use a custom inference script to do the predictions within the container. The following code shows how the script is passed as an argument to the HuggingFaceModel class:

from sagemaker.huggingface import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=model_path_s3,
    transformers_version="4.6.1",
    pytorch_version="1.7.1",
    py_version="py36",
    role=role,
    source_dir="./{}/code".format(model_id),
    entry_point="inference.py",
)
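
The inference.py script itself is not reproduced in this post. The following is a hedged sketch of what a minimal Hugging Face inference script for the prebuilt container might look like, assuming the standard model_fn/predict_fn hooks; the preprocessing details and return format are illustrative, not the exact script from the repository.

# inference.py -- illustrative sketch only; see the GitHub repository for the actual script.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def model_fn(model_dir):
    """Load the fine-tuned classification model and tokenizer from the model artifact."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    """Run a forward pass on the admission note and return the predicted label."""
    model, tokenizer = model_and_tokenizer
    note = data if isinstance(data, str) else data.get("inputs", "")
    inputs = tokenizer(note, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return {"predicted_label": int(logits.argmax(dim=-1))}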

Then you can define the instance type that you deploy this model on:

instance_type = "ml.g4dn.xlarge"
container_def = huggingface_model.prepare_container_def(instance_type=instance_type)
container_def

We then populate ExecutionRoleArn, ModelName and PrimaryContainer fields to create a Model.

model_name = "hospital-triage-model"

sagemaker_client.create_model(
    ExecutionRoleArn=role,
    ModelName=model_name,
    PrimaryContainer=container_def,
)
print(f"Model created: {model_name}")

Next, create an endpoint configuration by calling the create_endpoint_config API. Here, you supply the same model_name used in the create_model API call. The create_endpoint_config API now supports the additional parameter ClarifyExplainerConfig to enable the SageMaker Clarify explainer. The SHAP baseline is mandatory; you can provide it either as inline baseline data (the ShapBaseline parameter) or as an S3 baseline file (the ShapBaselineUri parameter). For optional parameters, see the developer guide.
In the following code, we use a special token as the baseline:

baseline = [["<UNK>"]]
print(f"SHAP baseline: {baseline}")

The TextConfig is configured with sentence-level granularity (each sentence is a feature, and we need a few sentences per note for good visualization) and the language as English:

endpoint_config_name = "hospital-triage-model-ep-config"
csv_serializer = sagemaker.serializers.CSVSerializer()
json_deserializer = sagemaker.deserializers.JSONDeserializer()

sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "MainVariant",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        }
    ],
    ExplainerConfig={
        "ClarifyExplainerConfig": {
            "InferenceConfig": {"FeatureTypes": ["text"]},
            "ShapConfig": {
                "ShapBaselineConfig": {"ShapBaseline": csv_serializer.serialize(baseline)},
                "TextConfig": {"Granularity": "sentence", "Language": "en"},
            },
        }
    },
)

Finally, after you have the model and endpoint configuration ready, use the create_endpoint API to create your endpoint. The endpoint_name must be unique within a Region in your AWS account. The create_endpoint call returns immediately, with the endpoint status in the Creating state while the endpoint is provisioned in the background.

endpoint_name = "hospital-triage-prediction-endpoint"
sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
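
Because the endpoint is still being created when the call returns, you may want to block until it is in service before invoking it. A small sketch using the standard boto3 waiter (not shown in the original walkthrough):

# Wait until the endpoint transitions from Creating to InService before sending requests.
waiter = sagemaker_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)

status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
print(f"Endpoint status: {status}")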

Explain the prediction
Now that you have deployed the endpoint with online explainability enabled, you can try some examples. You can invoke the real-time endpoint using the invoke_endpoint method by providing the serialized payload, which in this case is some sample admission notes:

response = sagemaker_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="text/csv",
    Accept="text/csv",
    Body=csv_serializer.serialize(sample_admission_note.iloc[:1, :].to_numpy()),
)

result = json_deserializer.deserialize(response["Body"], content_type=response["ContentType"])
pprint.pprint(result)

In the first scenario, let’s assume that the following medical admission note was taken by a healthcare worker:

“Patient is a 25-year-old male with a chief complaint of acute chest pain. Patient reports the pain began suddenly while at work and has been constant since. Patient rates the pain as 8/10 in severity. Patient denies any radiation of pain, shortness of breath, nausea, or vomiting. Patient reports no previous history of chest pain. Vital signs are as follows: blood pressure 140/90 mmHg. Heart rate 92 beats per minute. Respiratory rate 18 breaths per minute. Oxygen saturation 96% on room air. Physical examination reveals mild tenderness to palpation over the precordium and clear lung fields. EKG shows sinus tachycardia with no ST-elevations or depressions.”

The following screenshot shows the model results.

After this is forwarded to the SageMaker endpoint, the label was predicted as 0, which indicates that the risk of mortality is low. In other words, 0 implies that the admitted patient is in non-acute condition according to the model. However, we need the reasoning behind that prediction. For that, you can use the SHAP values as the response. The response includes the SHAP values corresponding to the phrases of the input note, which can be further color-coded as green or red based on how the SHAP values contribute to the prediction. In this case, we see more phrases in green, such as “Patient reports no previous history of chest pain” and “EKG shows sinus tachycardia with no ST-elevations or depressions,” as opposed to red, aligning with the mortality prediction of 0.
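
As a rough illustration of how such a color-coded view could be produced from phrase-level attributions, the snippet below assumes the (phrase, SHAP value) pairs have already been parsed out of the Clarify response; the values are made up, and the sign convention (positive values pushing toward the high-risk label) is an assumption.

# Hypothetical phrase-level attributions parsed from the Clarify explanation;
# values are illustrative, with positive values assumed to push toward label 1 (high risk).
phrase_attributions = [
    ("Patient reports no previous history of chest pain.", -0.42),
    ("EKG shows sinus tachycardia with no ST-elevations or depressions.", -0.31),
    ("Patient rates the pain as 8/10 in severity.", 0.12),
]


def to_html(pairs):
    """Wrap each phrase in a span colored by the sign of its SHAP value."""
    spans = []
    for phrase, value in pairs:
        color = "red" if value > 0 else "green"
        spans.append(f'<span style="color:{color}" title="SHAP={value:+.2f}">{phrase}</span>')
    return " ".join(spans)


print(to_html(phrase_attributions))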
In the second scenario, let’s assume that the following medical admission note was taken by a healthcare worker:

“Patient is a 72-year-old female with a chief complaint of severe sepsis and septic shock. Patient reports a fever, chills, and weakness for the past 3 days, as well as decreased urine output and confusion. Patient has a history of chronic obstructive pulmonary disease (COPD) and a recent hospitalization for pneumonia. Vital signs are as follows: blood pressure 80/40 mmHg. Heart rate 130 beats per minute. Respiratory rate 30 breaths per minute. Oxygen saturation 82% on 4L of oxygen via nasal cannula. Physical examination reveals diffuse erythema and warmth over the lower extremities and positive findings for sepsis such as altered mental status, tachycardia, and tachypnea. Blood cultures were taken and antibiotic therapy was started with appropriate coverage.”

The following screenshot shows our results.

After this is forwarded to the SageMaker endpoint, the label was predicted as 1, which indicates that the risk of mortality is high. This implies that the admitted patient is in acute condition according to the model. However, we need the reasoning behind that prediction. Again, you can use the SHAP values as the response. The response includes the SHAP values corresponding to the phrases of the input note, which can be further color-coded. In this case, we see more phrases in red, such as “Patient reports a fever, chills, and weakness for the past 3 days, as well as decreased urine output and confusion” and “Patient is a 72-year-old female with a chief complaint of severe sepsis and septic shock,” as opposed to green, aligning with the mortality prediction of 1.
The clinical care team can use these explanations to assist in their decisions on the care process for each individual patient.
Clean up
To clean up the resources that have been created as part of this solution, run the following statements:

huggingface_model.delete_model()

predictor = sagemaker.Predictor(endpoint_name="hospital-triage-prediction-endpoint")

predictor.delete_endpoint()

Conclusion
This post showed you how to use SageMaker Clarify to explain decisions in a healthcare use case based on the medical notes captured during various stages of the triage process. This solution can be integrated into existing decision support systems to provide another data point to clinicians as they evaluate patients for admission into the ICU. To learn more about using AWS services in the healthcare industry, check out the following blog posts:

Introducing the Healthcare Industry Lens for the AWS Well-Architected Framework
How Telescope Health Streamlines Virtual Care in the Cloud
The Pathway to better Surgical Care with Operating Room Analytics on AWS
Predicting diabetic patient readmission using multi-model training on Amazon SageMaker Pipelines
How Pieces Technologies leverages AWS services to predict patient outcomes

References
[1] https://aclanthology.org/2021.eacl-main.75/
[2] https://arxiv.org/pdf/1705.07874.pdf

About the authors
Shamika Ariyawansa, serving as a Senior AI/ML Solutions Architect in the Global Healthcare and Life Sciences division at Amazon Web Services (AWS), has a keen focus on Generative AI. He assists customers in integrating Generative AI into their projects, emphasizing the importance of explainability within their AI-driven initiatives. Beyond his professional commitments, Shamika passionately pursues skiing and off-roading adventures.
Ted Spencer is an experienced Solutions Architect with extensive acute healthcare experience. He is passionate about applying machine learning to solve new use cases, and rounds out solutions with both the end consumer and their business/clinical context in mind. He lives in Toronto Ontario, Canada, and enjoys traveling with his family and training for triathlons as time permits.
Ram Pathangi is a Solutions Architect at AWS supporting healthcare and life sciences customers in the San Francisco Bay Area. He has helped customers in finance, healthcare, life sciences, and hi-tech verticals run their business successfully on the AWS Cloud. He specializes in Databases, Analytics, and Machine Learning.

Apply fine-grained data access controls with AWS Lake Formation in Ama …

Amazon SageMaker Data Wrangler reduces the time it takes to collect and prepare data for machine learning (ML) from weeks to minutes. You can streamline the process of feature engineering and data preparation with SageMaker Data Wrangler and finish each stage of the data preparation workflow (including data selection, purification, exploration, visualization, and processing at scale) within a single visual interface. Data is frequently kept in data lakes that can be managed by AWS Lake Formation, giving you the ability to implement fine-grained access control using a straightforward grant or revoke procedure. SageMaker Data Wrangler supports fine-grained data access control with Lake Formation and Amazon Athena connections.
We are happy to announce that SageMaker Data Wrangler now supports using Lake Formation with Amazon EMR to provide this fine-grained data access restriction.
Data professionals such as data scientists want to use the power of Apache Spark, Hive, and Presto running on Amazon EMR for fast data preparation; however, the learning curve is steep. Our customers wanted the ability to connect to Amazon EMR to run ad hoc SQL queries on Hive or Presto to query data in the internal metastore or external metastore (such as the AWS Glue Data Catalog), and prepare data within a few clicks.
In this post, we show how to use Lake Formation as a central data governance capability and Amazon EMR as a big data query engine to enable access for SageMaker Data Wrangler. The capabilities of Lake Formation simplify securing and managing distributed data lakes across multiple accounts through a centralized approach, providing fine-grained access control.
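
To give a feel for what a fine-grained grant looks like, the following is a minimal sketch of a column-level Lake Formation permission expressed with boto3. It is not part of the CloudFormation template in this post; the database, table, column names, and account ID are placeholders, while the role name mirrors the one created later by the template.

import boto3

lakeformation = boto3.client("lakeformation")

# Placeholder names -- grant the marketing role SELECT on non-sensitive customer columns only.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/Marketing-data-access-role"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "tpc",
            "Name": "dl_tpc_customer",
            "ColumnNames": ["c_customer_id", "c_birth_country", "c_preferred_cust_flag"],
        }
    },
    Permissions=["SELECT"],
)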
Solution overview
We demonstrate this solution with an end-to-end use case using a sample dataset, the TPC data model. This data represents transaction data for products and includes information such as customer demographics, inventory, web sales, and promotions. To demonstrate fine-grained data access permissions, we consider the following two users:

David, a data scientist on the marketing team. He is tasked with building a model on customer segmentation, and is only permitted to access non-sensitive customer data.
Tina, a data scientist on the sales team. She is tasked with building the sales forecast model, and needs access to sales data for the particular region. She is also helping the product team with innovation, and therefore needs access to product data as well.

The architecture is implemented as follows:

Lake Formation manages the data lake, and the raw data is available in Amazon Simple Storage Service (Amazon S3) buckets
Amazon EMR is used to query the data from the data lake and perform data preparation using Spark
AWS Identity and Access Management (IAM) roles are used to manage data access using Lake Formation
SageMaker Data Wrangler is used as the single visual interface to interactively query and prepare the data

The following diagram illustrates this architecture. Account A is the data lake account that houses all the ML-ready data obtained through extract, transform, and load (ETL) processes. Account B is the data science account where a group of data scientists compile and run data transformations using SageMaker Data Wrangler. In order for SageMaker Data Wrangler in Account B to have access to the data tables in Account A’s data lake via Lake Formation permissions, we must activate the necessary rights.

You can use the provided AWS CloudFormation stack to set up the architectural components for this solution.
Prerequisites
Before you get started, make sure you have the following prerequisites:

An AWS account
An IAM user with administrator access
An S3 bucket

Provision resources with AWS CloudFormation
We provide a CloudFormation template that deploys the services in the architecture for end-to-end testing and to facilitate repeated deployments. The outputs of this template are as follows:

An S3 bucket for the data lake.
An EMR cluster with EMR runtime roles enabled. For more details on using runtime roles with Amazon EMR, see Configure runtime roles for Amazon EMR steps. Associating runtime roles with EMR clusters is supported in Amazon EMR 6.9. Make sure the following configuration is in place:

Create a security configuration in Amazon EMR.
The EMR runtime role’s trust policy should allow the EMR EC2 instance profile to assume the role.
The EMR EC2 instance profile role should be able to assume the EMR runtime roles.
The EMR cluster should be created with encryption in transit.

IAM roles for accessing the data in data lake, with fine-grained permissions:

Marketing-data-access-role
Sales-data-access-role

An Amazon SageMaker Studio domain and two user profiles. The SageMaker Studio execution roles for the users allow the users to assume their corresponding EMR runtime roles.
A lifecycle configuration to enable the selection of the role to use for the EMR connection.
A Lake Formation database populated with the TPC data.
Networking resources required for the setup, such as VPC, subnets, and security groups.

Create Amazon EMR encryption certificates for the data in transit
With Amazon EMR release version 4.8.0 or later, you have the option of specifying artifacts for encrypting data in transit using a security configuration. We manually create PEM certificates, include them in a .zip file, upload the file to an S3 bucket, and then reference the .zip file in Amazon S3. You likely want to configure the private key PEM file to be a wildcard certificate that enables access to the VPC domain in which your cluster instances reside. For example, if your cluster resides in the us-east-1 Region, you could specify a common name in the certificate configuration that allows access to the cluster by specifying CN=*.ec2.internal in the certificate subject definition. If your cluster resides in us-west-2, you could specify CN=*.us-west-2.compute.internal.
Run the following commands using your system terminal. This will generate PEM certificates and collate them into a .zip file:

openssl req -x509 -newkey rsa:1024 -keyout privateKey.pem -out certificateChain.pem -days 365 -nodes -subj '/C=US/ST=Washington/L=Seattle/O=MyOrg/OU=MyDept/CN=*.us-east-2.compute.internal'

cp certificateChain.pem trustedCertificates.pem

zip -r -X my-certs.zip certificateChain.pem privateKey.pem trustedCertificates.pem

Upload my-certs.zip to an S3 bucket in the same Region where you intend to run this exercise. Copy the S3 URI for the uploaded file. You’ll need this while launching the CloudFormation template.

This example is a proof of concept demonstration only. Using self-signed certificates is not recommended and presents a potential security risk. For production systems, use a trusted certification authority (CA) to issue certificates.
Deploying the CloudFormation template
To deploy the solution, complete the following steps:

Sign in to the AWS Management Console as an IAM user, preferably an admin user.
Choose Launch Stack to launch the CloudFormation template:

Choose Next.

For Stack name, enter a name for the stack.
For IdleTimeout, enter a value for the idle timeout for the EMR cluster (to avoid paying for the cluster when it’s not being used).
For S3CertsZip, enter an S3 URI with the EMR encryption key.

For instructions to generate a key and .zip file specific to your Region, refer to Providing certificates for encrypting data in transit with Amazon EMR encryption. If you are deploying in US East (N. Virginia), remember to use CN=*.ec2.internal. For more information, refer to Create keys and certificates for data encryption. Make sure to upload the .zip file to an S3 bucket in the same Region as your CloudFormation stack deployment.

On the review page, select the check box to confirm that AWS CloudFormation might create resources.
Choose Create stack.

Wait until the status of the stack changes from CREATE_IN_PROGRESS to CREATE_COMPLETE. The process usually takes 10–15 minutes.

After the stack is created, allow Amazon EMR to query Lake Formation by updating the External Data Filtering settings on Lake Formation. For instructions, refer to Getting started with Lake Formation. Specify Amazon EMR for Session tag values and enter your AWS account ID under AWS account IDs.
Test data access permissions
Now that the necessary infrastructure is in place, you can verify that the two SageMaker Studio users have access to granular data. To review, David shouldn’t have access to any private information about your customers. Tina has access to information about sales. Let’s put each user type to the test.
Test David’s user profile
To test your data access with David’s user profile, complete the following steps:

On the SageMaker console, choose Domains in the navigation pane.
From the SageMaker Studio domain, launch SageMaker Studio from the user profile david-non-sensitive-customer.

In your SageMaker Studio environment, create an Amazon SageMaker Data Wrangler flow, and choose Import & prepare data visually.

Alternatively, on the File menu, choose New, then choose Data Wrangler flow.
We discuss these steps to create a data flow in detail later in this post.
Test Tina’s user profile
Tina’s SageMaker Studio execution role allows her to access the Lake Formation database using two EMR execution roles. This is achieved by listing the role ARNs in a configuration file in Tina’s file directory. These roles can be set using SageMaker Studio lifecycle configurations to persist the roles across app restarts. To test Tina’s access, complete the following steps:

On the SageMaker console, navigate to the SageMaker Studio domain.
Launch SageMaker Studio from the user profile tina-sales-electronics.

It’s a good practice to close any previous SageMaker Studio sessions on your browser when switching user profiles. There can only be one active SageMaker Studio user session at a time.

Create a Data Wrangler data flow.

In the following sections, we showcase creating a data flow within SageMaker Data Wrangler and connecting to Amazon EMR as the data source. David and Tina will have similar experiences with data preparation, except for access permissions, so they will see different tables.
Create a SageMaker Data Wrangler data flow
In this section, we cover connecting to the existing EMR cluster created through the CloudFormation template as a data source in SageMaker Data Wrangler. For demonstration purposes, we use David’s user profile.
To create your data flow, complete the following steps:

On the SageMaker console, choose Domains in the navigation pane.
Choose StudioDomain, which was created by running the CloudFormation template.
Select a user profile (for this example, David’s) and launch SageMaker Studio.

Choose Open Studio.
In SageMaker Studio, create a new data flow and choose Import & prepare data visually.

Alternatively, on the File menu, choose New, then choose Data Wrangler flow.
Creating a new flow can take a few minutes. After the flow has been created, you see the Import data page.

To add Amazon EMR as a data source in SageMaker Data Wrangler, on the Add data source menu, choose Amazon EMR.

You can browse all the EMR clusters that your SageMaker Studio execution role has permissions to see. You have two options to connect to a cluster: one is through the interactive UI, and the other is to first create a secret using AWS Secrets Manager with a JDBC URL, including EMR cluster information, and then provide the stored AWS secret ARN in the UI to connect to Presto or Hive. In this post, we use the first method.
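
If you choose the second (Secrets Manager) option, the secret can be created programmatically. The following is a rough sketch assuming a Hive JDBC-style URL; the exact secret format that Data Wrangler expects may differ, and the secret name and cluster DNS name are placeholders.

import boto3

secrets = boto3.client("secretsmanager")

# Placeholder EMR primary-node DNS name and the default HiveServer2 port.
jdbc_url = "jdbc:hive2://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:10000/default"

response = secrets.create_secret(
    Name="dw-emr-hive-connection",  # hypothetical secret name
    SecretString=jdbc_url,
)
print("Secret ARN to provide in the Data Wrangler UI:", response["ARN"])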

Select any of the clusters that you want to use, then choose Next.

Select which endpoint you want to use.
Enter a name to identify your connection, such as emr-iam-connection, then choose Next.

Select IAM as your authentication type and choose Connect.

When you’re connected, you can interactively view a database tree and table preview or schema. You can also query, explore, and visualize data from Amazon EMR. For a preview, you see a limit of 100 records by default. After you provide a SQL statement in the query editor and choose Run, the query is run on the Amazon EMR Hive engine to preview the data. Choose Cancel query to cancel ongoing queries if they are taking an unusually long time.

Let’s access data from the table that David doesn’t have permissions to.

The query will result in the error message “Unable to fetch table dl_tpc_web_sales. Insufficient Lake Formation permission(s) on dl_tpc_web_sales.”

The last step is to import the data. When you are ready with the queried data, you have the option to update the sampling settings for the data selection according to the sampling type (FirstK, Random, or Stratified) and the sampling size for importing data into Data Wrangler.

Choose Import to import the data.

On the next page, you can add various transformations and essential analysis to the dataset.

Navigate to the data flow and add more steps to the flow as needed for transformations and analysis.

You can run a data insight report to identify data quality issues and get recommendations to fix those issues. Let’s look at some example transforms.

In the Data flow view, you should see that we are using Amazon EMR as a data source using the Hive connector.

Choose the plus sign next to Data types and choose Add transform.

Let’s explore the data and apply a transformation. For example, the c_login column is empty and it will not add value as a feature. Let’s delete the column.

In the All steps pane, choose Add step.
Choose Manage columns.

For Transform, choose Drop column.
For Columns to drop, choose the c_login column.
Choose Preview, then choose Add.

Verify the step by expanding the Drop column section.

You can continue adding steps based on the different transformations required for your dataset. Let’s go back to our data flow. You can now see the Drop column block showing the transform we performed.

ML practitioners spend a lot of time crafting feature engineering code, applying it to their initial datasets, training models on the engineered datasets, and evaluating model accuracy. Given the experimental nature of this work, even the smallest project will lead to multiple iterations. The same feature engineering code is often run again and again, wasting time and compute resources on repeating the same operations. In large organizations, this can cause an even greater loss of productivity because different teams often run identical jobs or even write duplicate feature engineering code because they have no knowledge of prior work. To avoid the reprocessing of features, we can export our transformed features to Amazon SageMaker Feature Store. For more information, refer to New – Store, Discover, and Share Machine Learning Features with Amazon SageMaker Feature Store.

Choose the plus sign next to Drop column.
Choose Export to and SageMaker Feature Store (via Jupyter notebook).

You can easily export your generated features to SageMaker Feature Store by specifying it as the destination. You can save the features into an existing feature group or create a new one. For more information, refer to Easily create and store features in Amazon SageMaker without code.
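
The export notebook generated by Data Wrangler handles this for you; for orientation only, a heavily simplified sketch of creating a feature group and ingesting a pandas DataFrame with the SageMaker Python SDK might look roughly like the following. The feature group name, column names, and toy DataFrame are assumptions, not the output of this post's flow.

import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Toy stand-in for the transformed DataFrame exported from Data Wrangler.
df = pd.DataFrame({"customer_id": [1, 2], "spend_90d": [120.5, 80.0]})
df["event_time"] = float(round(time.time()))

feature_group = FeatureGroup(name="customer-segmentation-features", sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=df)
feature_group.create(
    s3_uri=f"s3://{session.default_bucket()}/feature-store",
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,
)

# Wait for the feature group to finish creating before ingesting records.
while feature_group.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(10)

feature_group.ingest(data_frame=df, max_workers=3, wait=True)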
We have now created features with SageMaker Data Wrangler and stored those features in SageMaker Feature Store. We showed an example workflow for feature engineering in the SageMaker Data Wrangler UI.
Clean up
If your work with SageMaker Data Wrangler is complete, delete the resources you created to avoid incurring additional fees.

In SageMaker Studio, close all the tabs, then on the File menu, choose Shut Down.

When prompted, choose Shutdown All.

Shutdown might take a few minutes based on the instance type. Make sure all the apps associated with each user profile were deleted. If they were not, manually delete the apps under each user profile created by the CloudFormation template.

On the Amazon S3 console, empty any S3 buckets that were created from the CloudFormation template when provisioning clusters.

The buckets should have the same prefix as the CloudFormation launch stack name and cf-templates-.

On the Amazon EFS console, delete the SageMaker Studio file system.

You can confirm that you have the correct file system by choosing the file system ID and confirming the tag ManagedByAmazonSageMakerResource on the Tags tab.

On the AWS CloudFormation console, select the stack you created and choose Delete.

You’ll receive an error message, which is expected. We’ll come back to this and clean it up in the subsequent steps.

Identify the VPC that was created by the CloudFormation stack, named dw-emr-, and follow the prompts to delete the VPC.

Return to the AWS CloudFormation console and retry the stack deletion for dw-emr-.

All the resources provisioned by the CloudFormation template described in this post have now been removed from your account.
Conclusion
In this post, we went over how to apply fine-grained access control with Lake Formation and access the data using Amazon EMR as a data source in SageMaker Data Wrangler, how to transform and analyze a dataset, and how to export the transformed features from the data flow to SageMaker Feature Store using a Jupyter notebook. After visualizing our dataset using SageMaker Data Wrangler’s built-in analytical features, we further enhanced our data flow. The fact that we created a data preparation pipeline without writing a single line of code is significant.
To get started with SageMaker Data Wrangler, refer to Prepare ML Data with Amazon SageMaker Data Wrangler.

About the Authors
Ajjay Govindaram is a Senior Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while resting, experiencing the outdoors, and spending time with his loved ones.
Isha Dua is a Senior Solutions Architect based in the San Francisco Bay Area. She helps AWS enterprise customers grow by understanding their goals and challenges, and guides them on how they can architect their applications in a cloud-native manner while ensuring resilience and scalability. She’s passionate about machine learning technologies and environmental sustainability.
Parth Patel is a Senior Solutions Architect at AWS in the San Francisco Bay Area. Parth guides enterprise customers to accelerate their journey to the cloud and help them adopt and grow on the AWS Cloud successfully. He is passionate about machine learning technologies, environmental sustainability, and application modernization.

Meer Pyrus Base: A New Open-Source Python-Based Platform for the Two-D …

Robotics, a field rooted in electronics and computer science engineering, is increasingly being combined with artificial intelligence. One prominent example is robot soccer, played at RoboCup, an annual competition in which research teams from around the world present their robots and agents.

Researchers from Dalhousie University and Memorial University in Canada have introduced Pyrus Base, a Python-based platform for the 2D simulation of RoboCup soccer, and describe in their paper how models can be trained and tested on it with little friction. The frameworks commonly used for RoboCup, such as HeliosBase and Cyrus2DBase, are written in C++, which brings performance advantages but raises the barrier to entry compared to Python. A fully Python-based framework can serve a wider range of users with different levels of technical experience, and it integrates naturally with machine learning libraries such as TensorFlow, Keras, and PyTorch, which are widely used alongside agent base code; a framework such as Pyrus can also build on the existing C++ base code with little effort. The main advantage of Pyrus over other frameworks is its simplicity and accessibility, so even beginners can test their models for the RoboCup league. A remaining challenge is that the RoboCup environment is noisy; to cope with this, the researchers trained reinforcement learning and machine learning models for skills such as dribbling and passing, which reduced the effect of the noise to some extent.

RoboCup has drawn many data enthusiasts into tackling a broad range of data analytics problems, and Pyrus is intended to make the basic machine learning challenges around RoboCup more approachable. The researchers continue to work on the Pyrus base code to improve its performance, and they also plan to add a Python monitor and log-analysis tool to make the platform even more practical to use.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 29k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, please follow us on Twitter

The post Meer Pyrus Base: A New Open-Source Python-Based Platform for the Two-Dimensional (2D) Simulation of RoboCup Soccer appeared first on MarkTechPost.

MIT and Harvard Researchers Propose (FAn): A Comprehensive AI System t …

In a new AI research, a team of MIT and Harvard University researchers has introduced a groundbreaking framework called “Follow Anything” (FAn). The system addresses the limitations of current object-following robotic systems and presents an innovative solution for real-time, open-set object tracking and following.

The primary shortcomings of existing robotic object-following systems are a constrained ability to accommodate new objects, owing to a fixed set of recognized categories, and a lack of user-friendliness in specifying target objects. The new FAn system tackles these issues with an open-set approach that can seamlessly detect, segment, track, and follow a wide range of objects while adapting to novel objects through text, image, or click queries.

The core features of the proposed FAn system can be summarized as follows:

Open-Set Multimodal Approach: FAn introduces a novel methodology that facilitates real-time detection, segmentation, tracking, and following of any object within a given environment, regardless of its category.

Unified Deployment: The system is designed for easy deployment on robotic platforms, focusing on micro aerial vehicles, enabling efficient integration into practical applications.

Robustness: The system incorporates re-detection mechanisms to handle scenarios where tracked objects are occluded or temporarily lost during the tracking process.

The fundamental objective of the FAn system is to empower robotic systems equipped with onboard cameras to identify and track objects of interest. This involves ensuring the object remains within the camera’s field of view as the robot moves.

FAn leverages state-of-the-art Vision Transformer (ViT) models to achieve this objective. These models are optimized for real-time processing and merged into a cohesive system. The researchers exploit the strengths of various models, such as the Segment Anything Model (SAM) for segmentation, DINO and CLIP for learning visual concepts from natural language, and a lightweight detection and semantic segmentation scheme. Additionally, real-time tracking is facilitated using the (Seg)AOT and SiamMask models. A lightweight visual servoing controller is also introduced to govern the object-following process.
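
To give a feel for the last ingredient, the snippet below sketches a minimal proportional visual-servoing loop that keeps a tracked bounding box centered in the frame. It is a conceptual illustration only, not FAn's actual controller or models; get_tracked_box and send_velocity_command are hypothetical stubs.

# Conceptual sketch of a proportional visual-servoing loop, not FAn's actual controller.
FRAME_W, FRAME_H = 640, 480
K_YAW, K_FORWARD = 0.002, 0.5
TARGET_BOX_FRACTION = 0.2  # desired fraction of the frame the object should occupy


def get_tracked_box():
    """Hypothetical stub: return (x, y, w, h) of the tracked object from the tracker."""
    return 300, 220, 80, 100


def send_velocity_command(yaw_rate, forward_speed):
    """Hypothetical stub: forward the command to the robot / micro aerial vehicle."""
    print(f"yaw_rate={yaw_rate:+.3f} rad/s, forward_speed={forward_speed:+.2f} m/s")


def control_step():
    x, y, w, h = get_tracked_box()
    # Horizontal error between the box center and the image center drives yaw.
    err_x = (x + w / 2) - FRAME_W / 2
    # Apparent-size error drives forward/backward motion to keep a set distance.
    err_size = TARGET_BOX_FRACTION - (w * h) / (FRAME_W * FRAME_H)
    send_velocity_command(yaw_rate=-K_YAW * err_x, forward_speed=K_FORWARD * err_size)


if __name__ == "__main__":
    control_step()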

The researchers conducted comprehensive experiments to evaluate FAn’s performance across diverse objects in zero-shot detection, tracking, and following scenarios. The results demonstrated the system’s seamless and efficient capability to follow objects of interest in real-time.

In conclusion, the FAn framework represents an encompassing solution for real-time object tracking and following, eliminating the limitations of closed-set systems. Its open-set nature, multimodal compatibility, real-time processing, and adaptability to new environments make it a significant advancement in robotics. Moreover, the team’s commitment to open-sourcing the system underscores its potential to benefit a wide array of real-world applications.

Check out the Paper and Github. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 29k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, please follow us on Twitter

The post MIT and Harvard Researchers Propose (FAn): A Comprehensive AI System that Bridges the Gap between SOTA Computer Vision and Robotic Systems- Providing an End-to-End Solution for Segmenting, Detecting, Tracking, and Following any Object appeared first on MarkTechPost.

Researchers from Cornell Introduce Quantization with Incoherence Proce …

Improvements in areas such as text creation, few-shot learning, reasoning, and protein sequence modelling have been made possible by large language models (LLMs). Due to their enormous scale, these models might have hundreds of billions of parameters, necessitating complex deployment strategies and inspiring study into efficient inference techniques.

New research by Cornell University quantizes LLM parameters after training to boost performance in real-world scenarios. Their key insight is that it is easier to adaptively round the weights to a finite set of compressed values when the weight and proxy Hessian matrices are incoherent. Intuitively, this is because both the weights themselves and the directions in which it is important to have good rounding accuracy are not too large in any one coordinate.

Using this insight, the researchers create two-bit quantization techniques that are both theoretically sound and scalable to LLM-sized models. Based on this realization, they provide a novel technique called quantization with incoherence processing (QuIP). 

There are two phases to QuIP: 

An efficient pre- and post-processing that ensures the Hessian matrices are incoherent by multiplying them by a Kronecker product of random orthogonal matrices.

An adaptive rounding procedure that minimizes a quadratic proxy objective of the error between the original weights and the quantized weights using an estimate of the Hessian. “Incoherence processing” refers to both the initial processing phase and the final processing phase of the proposed method.
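
Below is a toy numerical sketch of the incoherence-processing idea. It is not the authors' implementation: it uses plain nearest rounding rather than the adaptive procedure described above, and it conjugates only the weights by random orthogonal matrices.

import numpy as np

rng = np.random.default_rng(0)


def random_orthogonal(n):
    """Random orthogonal matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q


def quantize_nearest(w, n_levels=4):
    """Nearest-rounding to n_levels uniform levels (2 bits when n_levels=4)."""
    lo, hi = w.min(), w.max()
    grid = np.linspace(lo, hi, n_levels)
    return grid[np.abs(w[..., None] - grid).argmin(axis=-1)]


W = rng.standard_normal((64, 64))  # toy weight matrix
U, V = random_orthogonal(64), random_orthogonal(64)

# Incoherence processing (simplified): rotate the weights so no single coordinate
# dominates, quantize in the rotated basis, then rotate back.
W_rot = U @ W @ V.T
W_hat = U.T @ quantize_nearest(W_rot) @ V

print("quantization error with rotation:   ", np.linalg.norm(W - W_hat))
print("quantization error without rotation:", np.linalg.norm(W - quantize_nearest(W)))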

In addition to their practical implementation, they provide a theoretical study, the first of its kind for a quantization algorithm that scales to LLM-sized models, which investigates the impact of incoherence and demonstrates the superiority of the quantization procedure relative to a broad class of rounding techniques. The study also presents the first theoretical analysis of OPTQ, an earlier technique, showing that QuIP without incoherence processing yields a more efficient implementation of that method.

The empirical results show that incoherence processing significantly enhances large-model quantization, particularly at higher compression rates, and results in the first LLM quantization approach to achieve usable results with only two bits per weight. Small gaps between 2-bit and 4-bit compression are observed for large LLM sizes (>2B parameters), and these gaps shrink further with model size, suggesting the possibility of accurate 2-bit inference in LLMs.

Interactions between transformer blocks, or even between layers within a block, are not taken into account by the proxy objective. The team states that the benefits of including such interactions at this scale, and whether they are worth the computational effort, are currently unknown.

Check out the Paper and Github. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 29k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, please follow us on Twitter

Two-bit and three-bit LLMs are almost here! QuiP yields the first usable two-bit LLMs and further reduces the costs of running LLMs on just one GPU. [1/4] Paper: https://t.co/8IlI11DYt1 Code: https://t.co/uWa0jryGKh pic.twitter.com/Hny3R6ueuH — Volodymyr Kuleshov (@volokuleshov) August 15, 2023

The post Researchers from Cornell Introduce Quantization with Incoherence Processing (QuIP): A New AI Method based on the Insight that Quantization Benefits from Incoherent Weight and Hessian Matrices appeared first on MarkTechPost.

Unlocking the Power of Context with Google AI: A Showdown Between pref …

The War of Troy is famous, where Achilles etched his name in history forever by defeating Prince Hector once and for all, but today, in the rapidly evolving landscape of artificial intelligence, the quest to harness context for improved learning and comprehension has taken center stage. Two contenders, prefixLM and causalLM, have entered the ring to compete on in-context learning. As the battle between these language model giants rages on, it’s clear that the way they handle context will make all the difference in learning outcomes.

The Challenger and the Conqueror

Both prefixLM and causalLM have entered the ring equipped with their unique theoretical frameworks. PrefixLM dons the armor of unrestricted attention, allowing all in-context samples to communicate freely. In this battle, it treats the in-context samples as a prefix and applies full attention over the first n positions.

In the other corner of the ring stands causalLM, armed with autoregressive attention – a mechanism that curbs interactions between in-context samples and their future counterparts. This strategy preserves a linear learning trajectory, preventing futuristic spoilers from influencing the learning process. It is a focused approach, but does it truly capture the essence of context? Can it defeat PrefixLM’s robust approach to ICL?
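
The structural difference between the two can be seen directly in their attention masks. Below is a small illustrative sketch (not from the paper) that builds both masks for a sequence whose first n_prefix positions hold the in-context samples.

import numpy as np


def causal_mask(seq_len):
    """Each position may attend only to itself and earlier positions."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def prefix_lm_mask(seq_len, n_prefix):
    """Prefix positions attend to each other bidirectionally; the rest stay causal."""
    mask = causal_mask(seq_len)
    mask[:n_prefix, :n_prefix] = True
    return mask


seq_len, n_prefix = 6, 3
print("causalLM mask:\n", causal_mask(seq_len).astype(int))
print("prefixLM mask:\n", prefix_lm_mask(seq_len, n_prefix).astype(int))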

The Battle is Afoot

To separate theory from practice, a battlefield of synthetic numerical tasks becomes the proving ground relying on softmax transformers. Linear regression, nonlinear regression, and multiclass classification form the battleground where prefixLM and causalLM have locked horns. As the dust settles, the outcomes echo the voices of empirical evidence.

Amidst linear regression tasks, the training errors of both models exhibit linear decay rates, a testament to their learning prowess. However, the tide turns when the test errors emerge from the shadows. CausalLM stumbles with significantly larger test errors, raising eyebrows from the crowd. The culprit? The autoregressive nature of causalLM restricts mutual attention between the in-context examples, which leads to suboptimal results.

The Champion rises from the ashes

With the empirical outcomes illuminating the path, it’s prefixLM that emerges as the champion of in-context learning. Its open-armed approach, enabling diverse in-context samples to communicate, appears to be the key. Whether it’s linear regression, nonlinear regression, or multiclass classification, prefixLM consistently showcases its superiority, proving that its power of context can’t be denied.

As the curtain falls on this clash of the titans, prefixLM stands tall, waving the banner of comprehensive context understanding. CausalLM, while valiant, might need to revisit its strategy in the in-context arena. The battle highlights that prefixLM is the champion today indeed, awaiting yet another challenger in the future in the battle of AI. 

For a more mathematical treatment of this battle and a deeper analysis of prefixLM’s triumph, please refer to the research paper.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 28k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, please follow us on Twitter

The post Unlocking the Power of Context with Google AI: A Showdown Between prefixLM and causalLM in In-Context Learning appeared first on MarkTechPost.

This AI Paper Introduces A Comprehensive RDF Dataset With Over 26 Bill …

Keeping up with recent research is becoming increasingly difficult due to the rise of scientific publications. For instance, more than 8 million scientific articles were recorded in 2022 alone. Researchers use various techniques, from search interfaces to recommendation systems, to investigate connected intellectual entities, such as authors and institutions. Modeling the underlying academic data as an RDF knowledge graph (KG) is one efficient method. This makes standardization, visualization, and interlinking with Linked Data resources easier. As a result, scholarly KGs are essential for converting document-centric academic material into linked and automatable knowledge structures. 

However, existing academic KGs are limited by one or more of the following:

They seldom include a comprehensive list of works from every subject.

They frequently solely cover particular fields, like computer science.

They get updated infrequently, making a lot of studies and business models outdated.

They often have use limitations.

They do not comply with W3C standards like RDF, even if they meet these criteria.

These problems prevent the widespread deployment of scientific KGs, such as in thorough search and recommender systems or for quantifying scientific impact. For instance, the Microsoft Academic Knowledge Graph (MAKG), the RDF descendant of the Microsoft Academic Graph, can no longer be updated because the Microsoft Academic Graph was discontinued in 2021.

The innovative OpenAlex dataset seeks to close this gap. OpenAlex’s data, however, does not adhere to the Linked Data Principles and is not accessible in RDF. As a result, OpenAlex cannot be regarded as a KG, which makes semantic queries, application integration, and linking to other resources difficult. At first glance, it might seem straightforward to include this academic information about scientific articles in Wikidata and thereby support the WikiCite movement. Apart from the differing schema, however, the amount of data is already so vast that the Wikidata Query Service’s Blazegraph triplestore would approach its capacity limit, blocking any such integration.

SemOpenAlex, a very sizable RDF dataset of the academic landscape with its publications, authors, sources, institutions, ideas, and publishers, is introduced by researchers from Karlsruhe Institute of Technology and Metaphacts GmbH in this work. SemOpenAlex has about 249 million papers from all academic areas and more than 26 billion semantic triples. It is built on their comprehensive ontology and references additional LOD sources, including Wikidata, Wikipedia, and the MAKG. They offer a public SPARQL interface to facilitate quick and effective usage of SemOpenAlex’s integration with the LOD cloud. Additionally, they provide a sophisticated semantic search interface that enables users to retrieve information in real-time about entities contained in the database and their semantic relationships (for example, by displaying co-authors or an author’s most important concepts, which are inferred through semantic reasoning rather than being directly contained in the database). 
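
For readers who want to experiment with the endpoint, a minimal sketch of querying it from Python with SPARQLWrapper might look like the following. The endpoint URL path and the generic class-count query are assumptions, not taken from the paper; check https://semopenalex.org for the actual service address and ontology terms.

from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed endpoint URL; verify the actual SPARQL service address at https://semopenalex.org.
sparql = SPARQLWrapper("https://semopenalex.org/sparql")
sparql.setReturnFormat(JSON)

# Generic query: count instances per class, without assuming any SemOpenAlex-specific vocabulary.
sparql.setQuery("""
    SELECT ?class (COUNT(?s) AS ?n)
    WHERE { ?s a ?class }
    GROUP BY ?class
    ORDER BY DESC(?n)
    LIMIT 10
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["class"]["value"], row["n"]["value"])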

They also offer full RDF data snapshots to facilitate big data analysis. Because of SemOpenAlex’s scale and the growing number of scientific articles being integrated into it, they created a pipeline on AWS for routinely rebuilding SemOpenAlex without any service disruption. Additionally, they trained state-of-the-art knowledge graph entity embeddings for use with SemOpenAlex in downstream applications. They ensure interoperability in line with FAIR principles by employing pre-existing ontologies whenever possible, and they open the door to integrating SemOpenAlex into the Linked Open Data Cloud. By offering monthly updates that enable continuous monitoring of an author’s scientific impact, tracking of award-winning research, and other use cases built on their data, they fill the void left by the discontinuation of MAKG. By making SemOpenAlex free and unrestricted, they enable research groups from many disciplinary backgrounds to access the data and incorporate it into their studies. Initial SemOpenAlex applications and production systems already exist.

Overall, they contribute the following: 

1. They use popular vocabulary to develop an ontology for SemOpenAlex. 

2. At https://semopenalex.org, they produce the SemOpenAlex knowledge graph in RDF, which covers 26 billion triples, and make all SemOpenAlex data, code, and services available to the public.

3. They enable SemOpenAlex to participate in the Linked Open Data cloud by making all its URIs resolvable. Using a SPARQL endpoint, they index all the data in a triple store and make it accessible to the general public. 

4. They offer a semantic search interface with entity disambiguation so that users may access, search, and instantly view the knowledge graph and its essential statistical data. 

5. Using high-performance computation, they offer cutting-edge knowledge graph embeddings for the entities represented in SemOpenAlex. 

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 28k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, please follow us on Twitter

The post This AI Paper Introduces A Comprehensive RDF Dataset With Over 26 Billion Triples Covering Scholarly Data Across All Scientific Disciplines appeared first on MarkTechPost.