A Research Group From CMU, AI2 and University of Washington Introduces …

Researchers’ positionality—their perspectives formed by their own experience, identity, culture, and background—influences their design decisions while developing NLP datasets and models. 

Latent design choices and the researcher’s positionality are two sources of design bias in datasets and models. These biases lead to discrepancies in how well datasets and models work for different populations, and by imposing one group’s standards on the rest of the world, they can help maintain systemic inequities. Characterizing them is difficult because building datasets and models involves a wide variety of design decisions, only a subset of which is ever recorded. Furthermore, many widely used production models are exposed only through APIs, making it difficult to characterize their design biases directly.

Recent research by the University of Washington, Carnegie Mellon University, and Allen Institute for AI presents NLPositionality, a framework for characterizing the positionality and design biases of natural language processing (NLP) datasets and models. The researchers recruit a global community of volunteers from various cultural and linguistic backgrounds to annotate a dataset sample. They then quantify design biases by comparing how closely annotations from different identities and backgrounds align with the original dataset labels or model predictions.

NLPositionality has three benefits over other methods (such as paid crowdsourcing or in-lab experiments):

Compared to other crowdsourcing platforms and conventional laboratory studies, LabintheWild has a more diverse participant population. 

Instead of relying on monetary compensation, this method relies on participants’ intrinsic motivation to learn about themselves. This increases learning opportunities for participants and improves data quality compared to paid crowdsourcing platforms. As a result, unlike one-time paid studies found in other research, the platform can keep collecting new annotations and reflect more recent observations of design biases over extended periods.

This method does not require any pre-existing labels or predictions to be applied post hoc to any dataset or model. 

The researchers apply NLPositionality to two NLP tasks known to carry design biases: social acceptability and hate speech detection. They examine task-specific supervised models and datasets as well as task-general large language models (e.g., GPT-4). As of May 25, 2023, 1,096 annotators from 87 countries had contributed 16,299 annotations, an average of 38 per day. The team found that the datasets and models they examine align best with White, college-educated millennials from English-speaking countries—a subset of “WEIRD” (Western, Educated, Industrialized, Rich, Democratic) populations. They also observe that datasets align strongly with their original annotators, which highlights the importance of collecting data and annotations from a wide range of sources. Their findings point to the need for more diverse models and datasets in NLP research.
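To make the alignment comparison concrete, the toy sketch below shows one way to measure how well a model’s labels agree with annotations from a particular demographic group using a Pearson correlation; the table schema, column names, and correlation choice are illustrative assumptions rather than NLPositionality’s exact implementation:

import pandas as pd
from scipy.stats import pearsonr

# Hypothetical annotation table: one row per (instance, annotator) pair.
# The schema is an assumption for illustration, not NLPositionality's actual format.
df = pd.DataFrame({
    "instance_id": [1, 1, 2, 2, 3, 3],
    "country": ["US", "IN", "US", "IN", "US", "IN"],
    "human_label": [1.0, 0.0, 0.0, 0.0, 1.0, 1.0],
    "model_label": [1.0, 1.0, 0.0, 0.0, 1.0, 1.0],
})

# For each demographic group, average that group's labels per instance and
# correlate them with the model's labels on the same instances.
for group, rows in df.groupby("country"):
    per_instance = rows.groupby("instance_id").agg(
        human=("human_label", "mean"), model=("model_label", "mean")
    )
    r, p = pearsonr(per_instance["human"], per_instance["model"])
    print(f"{group}: alignment r={r:.2f} (p={p:.2f})")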

Check out the Paper and GitHub link.


The post A Research Group From CMU, AI2 and University of Washington Introduces NLPositionality: An AI Framework for Characterizing Design Biases and Quantifying the Positionality of NLP Datasets and Models appeared first on MarkTechPost.

The University Of Pennsylvania Researchers Introduced An Alternative …

The human brain is one of the most complex systems nature has ever created. Its neurons interact by forming recurrent connections and transmitting information through impulses. Because of the brain’s remarkable capacity for logical reasoning and numerical analysis, researchers try to carry these biological principles over to artificial neural systems. Neural approaches to computation include RNNs treated as dynamical systems and neural replicas of computer architectures in machine learning.

The research group is asserting that advancements in current neural network technology could enable the complete distributed neural execution of software virtualization and logical circuits. This would be achieved without the need for any example data or sampling of the state space, which are typically required for training and refining these neural networks. Essentially, this suggests the potential for a more efficient and robust application of artificial intelligence in areas like virtualization and digital circuit design.

Access to neural computation is currently limited by the need to understand the relationship between neural computers and modern silicon computers. This calls for a neural network governed by a simple set of equations that nonetheless supports many computer-like capabilities. Thanks to such a simple set of governing equations, networks such as reservoir computers (RCs), a class of recurrent neural networks (RNNs), are well understood theoretically: upon receiving inputs, they evolve a set of internal states, and the output is a weighted sum of those states.
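As a rough illustration of the reservoir computing idea described above (internal states that evolve with the input and an output read out as a weighted sum of those states), here is a minimal echo-state-network-style toy in NumPy; it is a generic sketch, not the programming framework proposed in the paper:

import numpy as np

rng = np.random.default_rng(0)

n_reservoir, n_steps = 200, 500
W_in = rng.normal(scale=0.5, size=(n_reservoir, 1))         # input weights (fixed, random)
W = rng.normal(scale=1.0, size=(n_reservoir, n_reservoir))   # recurrent weights (fixed, random)
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))              # keep the spectral radius below 1

u = np.sin(np.linspace(0, 8 * np.pi, n_steps))               # toy input signal
target = np.roll(u, -1)                                      # task: predict the next input value

# Drive the reservoir: internal states evolve as a function of input and previous state.
states = np.zeros((n_steps, n_reservoir))
x = np.zeros(n_reservoir)
for t in range(n_steps):
    x = np.tanh(W @ x + W_in[:, 0] * u[t])
    states[t] = x

# The output is a weighted sum of reservoir states; only these weights are learned (ridge regression).
ridge = 1e-6
W_out = np.linalg.solve(states.T @ states + ridge * np.eye(n_reservoir), states.T @ target)
prediction = states @ W_out
print("train MSE:", np.mean((prediction - target) ** 2))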

The research team from the University of Pennsylvania developed two frameworks named state neural programming (SNP) and dynamic neural programming (DNP). SNP is used to program RCs to solve analytic equations and perform operations, while DNP is used to program RCs to store chaotic dynamical systems as random-access memories and to implement the neural logic gates AND, NAND, OR, NOR, XOR, and XNOR.

Through an open-loop architecture with SNP, the researchers obtained a programming matrix of polynomial powers of time-lagged inputs, which can be used for operations such as a high-pass filter. To run algorithms, a closed-loop architecture with SNP is used, in which an RNN is programmed to store a substantial time history of a stochastic, non-differentiable time series and a short-time Fourier transform is performed.

Simulating and virtualizing require programming the time history of a continuous-time RNN, so a closed-loop RNN with the DNP method is used. The researchers emulated the feedback dynamics of a 15-state guest RNN running inside a 2,000-state host RNN and found that it simulates a chaotic Lorenz attractor without any samples. This leads to the following conclusion:

Researchers have discovered that an alternative computing framework can be fully programmable, which challenges current approaches that mimic silicon hardware. Instead, they propose focusing on creating specific programming systems that maximize the full computational abilities of each unique system.

Check out the Paper and Blog.


The post The University Of Pennsylvania Researchers Introduced An Alternative AI Approach To Design And Program RNN-Based Reservoir Computers appeared first on MarkTechPost.

CarperAI Introduces OpenELM: An Open-Source Library Designed to Enable …

Natural Language Processing, one of the primary subfields of Artificial Intelligence, is advancing at an extraordinary pace. With its ability to enable a computer to understand human language the way it is spoken and written, NLP has a number of use cases. One such development is the introduction of Large Language Models, which are the trained deep learning models based on Natural Language Processing, Natural Language Understanding, and Natural Language Generation. These models imitate humans by answering questions, generating precise textual content, completing codes, summarizing long paragraphs of texts, translating languages, and so on. 

Recently, CarperAI, a leading AI research organization, has introduced OpenELM, an open-source library that promises to transform the field of evolutionary search. OpenELM, in which ELM stands for Evolution through Large Models, combines the power of large language models with evolutionary algorithms to enable the generation of diverse and high-quality text and code. OpenELM version 0.9 has been proposed with the aim of providing developers and researchers with an exceptional tool for solving complex problems across various domains. Along with OpenELM, the team has also released its paper at GPTP 2023.

Evolution Through Large Models (ELM) demonstrates how LLMs can iteratively enhance, critique, and improve their output. This skill can be used to improve language models’ capacity for problem-solving and demonstrates their potential as intelligent search operators for both language and code. The core idea behind ELM is that LLMs can act as intelligent operators of variation in evolutionary algorithms. OpenELM takes advantage of this potential to improve language models’ problem-solving skills, enabling the creation of varied and high-quality content in areas that the model might not have seen during training. The team has introduced OpenELM with four major goals, which are as follows.

Open Source: OpenELM provides an open-source release of ELM and the differential models that go along with it, making it possible for developers to freely use the library and contribute to it.

Model Integration: OpenELM is built to work seamlessly with both closed models, accessible only through commercial APIs like the OpenAI API, and open-source language models that run locally or on platforms like Colab.

User-Friendly Interface and Sample Environments: OpenELM aims to provide a straightforward user interface along with a variety of sample environments for evolutionary search.

Evolutionary Potential: OpenELM intends to demonstrate what language models can do in combination with evolution, showing how intelligent variation operators can help evolutionary algorithms, especially in fields like plain-text code generation and creative writing, by harnessing the capabilities of large language models.

With support for quality-diversity (QD) methods like MAP-Elites, CVT-MAP-Elites, and Deep Grid MAP-Elites, OpenELM interacts smoothly with well-known evolutionary techniques. By encouraging diversity and preserving the best individual within each niche, it makes it possible to produce high-quality and diverse solutions. In conclusion, OpenELM marks a significant milestone in evolutionary search by harnessing the potential of large language models to generate diverse, high-quality text and code.
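To illustrate the quality-diversity idea behind MAP-Elites (keep only the best individual found so far in each niche while a variation operator proposes new candidates), here is a small generic sketch in Python; the string-mutation operator below is just a stand-in for the LLM-driven variation that ELM uses, and none of this reflects OpenELM’s actual API:

import random

# Toy MAP-Elites: evolve short strings; "fitness" rewards vowels, the niche is string length.
# mutate() stands in for an LLM-based variation operator such as ELM uses.

def mutate(genome: str) -> str:
    letters = "abcdefghijklmnopqrstuvwxyz"
    i = random.randrange(len(genome))
    choice = random.random()
    if choice < 0.3 and len(genome) > 2:
        return genome[:i] + genome[i + 1:]                         # deletion
    if choice < 0.6:
        return genome[:i] + random.choice(letters) + genome[i:]    # insertion
    return genome[:i] + random.choice(letters) + genome[i + 1:]    # substitution

def fitness(genome: str) -> float:
    return sum(c in "aeiou" for c in genome) / len(genome)

def niche(genome: str) -> int:
    return min(len(genome) // 3, 9)                                # 10 niches keyed on length

archive = {}                                                       # niche -> (fitness, genome)
population = ["hello", "evolution", "map", "elites"]

for _ in range(5000):
    parent = random.choice([g for _, g in archive.values()] or population)
    child = mutate(parent)
    key, score = niche(child), fitness(child)
    if key not in archive or score > archive[key][0]:
        archive[key] = (score, child)                              # keep the best individual per niche

for key in sorted(archive):
    print(key, archive[key])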

Check out the Paper, Blog, and GitHub link.


The post CarperAI Introduces OpenELM: An Open-Source Library Designed to Enable Evolutionary Search With Language Models In Both Code and Natural Language appeared first on MarkTechPost.

LAION AI Introduces Video2Dataset: An Open-Source Tool Designed To Cur …

Big foundational models like CLIP, Stable Diffusion, and Flamingo have radically improved multimodal deep learning over the past few years. Joint text-image modeling has gone from being a niche application to one of the (if not the) most relevant issues in today’s artificial intelligence landscape due to the outstanding capabilities of such models to generate spectacular, high-resolution imagery or perform hard downstream problems. Surprisingly, despite tackling vastly different tasks and having vastly different designs, all these models have three fundamental properties in common that contribute to their strong performance: a simple and stable objective function during (pre-)training, a well-investigated scalable model architecture, and – perhaps most importantly – a large, diverse dataset.

As of 2023, multimodal deep learning is still primarily concerned with text-image modeling, with only limited attention paid to additional modalities like video (and audio). Considering that the techniques used to train these models are typically modality agnostic, one might wonder why there are no comparably strong foundation models for these other modalities. The simple explanation is the scarcity of high-quality, large-scale annotated datasets. This lack of clean data impedes research and development of large multimodal models, especially in the video domain, in contrast to image modeling, where established datasets for scaling, such as LAION-5B, DataComp, and COYO-700M, and scalable tools like img2dataset already exist.

Because it can pave the way for groundbreaking initiatives like high-quality video and audio generation, improved pre-trained models for robotics, movie audio description (AD) for the blind community, and more, the researchers argue that resolving this data problem is a central aim of (open-source) multimodal research.

The researchers present video2dataset, an open-source tool for fast, large-scale video and audio dataset curation. It has been successfully tested on several large video datasets, is adaptable and extensible, and provides a large number of transformations. Case studies and detailed instructions for reproducing these results can be found in the repository.

By downloading individual video datasets, merging them, and reshaping them into more manageable forms with new features and significantly more samples, the researchers have used video2dataset to build upon existing video datasets. The examples section gives a more in-depth description of this chained processing. The results obtained by training different models on datasets produced by video2dataset demonstrate the tool’s efficacy, and a forthcoming study will discuss the new dataset and associated findings in depth.

To begin, let’s define video2dataset.

Since webdataset is an accepted input_format, video2dataset can be chained to reprocess previously downloaded data. For example, the WebVid data downloaded in the repository’s earlier example can be fed to a script that computes the optical flow for each video and stores it in metadata shards (shards that contain only the optical-flow metadata).

Architecture

Based on img2dataset, video2dataset takes a list of URLs and associated metadata and converts it into a WebDataset that can be loaded with a single command. The resulting WebDataset can also be reprocessed with further transformations while keeping the same shard contents. Here is how video2dataset works.
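As a rough sketch of what that single command looks like through the Python entry point, the call below mirrors the project’s README at the time of writing; treat the parameter names as assumptions and check the repository for the current API:

# Minimal sketch of a video2dataset run; verify parameter names against
# https://github.com/iejMac/video2dataset before relying on them.
from video2dataset import video2dataset

video2dataset(
    url_list="videos.csv",          # table with one video URL (plus optional metadata) per row
    url_col="url",                  # column holding the video links
    caption_col="caption",          # optional text column carried along as metadata
    input_format="csv",
    output_format="webdataset",     # tar shards loadable as a WebDataset
    output_folder="dataset",
)

A second run can then point input_format at the resulting WebDataset shards to apply further transformations, as described under Reprocessing below.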

Sharding

The first step is to partition the input data so that it can be distributed evenly among the workers. These input shards are cached temporarily, and the one-to-one mapping between input shards and their corresponding output shards guarantees fault-tolerant recovery: if a dataset processing run terminates unexpectedly, one can save time by skipping the input shards for which the corresponding output shard already exists.

Distribution and Reading

Workers then take turns reading and processing the samples contained in the shards. video2dataset offers three distribution modes: multiprocessing, pyspark, and slurm. Multiprocessing is ideal for single-machine runs, while pyspark and slurm are useful for scaling across several machines. The format of the incoming dataset determines the reading strategy. If the data is a table of URLs, video2dataset fetches the videos from the internet and adds them to the dataset; because it uses yt-dlp to request videos, it works with many different video platforms. If the video samples instead come from an existing WebDataset, that dataset’s data loader reads the bytes or frames in tensor format.

Subsampling

After the video has been read and the worker has the video bytes, the bytes are sent through a pipeline of subsamplers according to the job configuration. In this stage, the video may be downsampled in frame rate and resolution, clipped, or split via scene detection, among other options. Other subsamplers exist solely to extract and add metadata from the input modalities, such as resolution and compression information, synthetic captions, or optical flow. If a transformation is not already available in video2dataset, adding it is a matter of defining a new subsampler or modifying an existing one, which requires only a few changes elsewhere in the repository.
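To give a feel for what defining a new subsampler means, here is a purely illustrative sketch of the pattern (a callable that receives video bytes plus metadata and returns transformed bytes, enriched metadata, and an optional error); it deliberately does not reproduce video2dataset’s actual subsampler interface:

from typing import Optional, Tuple

# Illustrative only: the real subsampler interface in video2dataset differs.
class DurationMetadataSubsampler:
    """Adds a crude size-based field to each sample's metadata without touching the video."""

    def __call__(self, video_bytes: bytes, metadata: dict) -> Tuple[bytes, dict, Optional[str]]:
        try:
            # A real implementation would probe the container (for example, with ffprobe).
            metadata["video_size_bytes"] = len(video_bytes)
            return video_bytes, metadata, None    # None means "no error for this sample"
        except Exception as err:
            return video_bytes, metadata, str(err)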

Logging

video2dataset keeps meticulous logs at multiple points in the process. Each completed shard writes an associated _stats.json file recording information such as the total number of samples handled, the percentage handled successfully, and the occurrence and nature of any errors. video2dataset also integrates with Weights & Biases (wandb): with a single argument, you can turn on this integration and access detailed performance reporting plus metrics on successes and failures. Such capabilities are helpful for benchmarking and for estimating the cost of whole jobs.

Writing

Finally, video2dataset stores the processed data in output shards at user-specified locations for use in subsequent training or reprocessing runs. The dataset can be written in several formats, all consisting of shards with N samples each: directories, tar files, records, and parquet files. The most important are the directories format, for smaller datasets and debugging, and tar files, which the WebDataset format uses for loading.

Reprocessing

video2dataset can reprocess earlier output datasets by reading the output shards and passing the samples through new transformations. This functionality is particularly advantageous for video datasets, given their often hefty size and awkward handling; it allows careful downsampling of the data without repeatedly re-downloading large datasets. The repository includes a practical example of this.

Code and details can be found on GitHub: https://github.com/iejMac/video2dataset

Future Plans

Study of a massive dataset built with the software described in this blog article, followed by public dissemination of the results of that study.

Improved synthetic captioning: there is a lot of room for innovation in synthetic captioning for videos, and video2dataset will soon include more interesting methods for producing video captions that use image captioning models and LLMs.

Whisper’s ability to extract numerous text tokens from video has been the subject of much discussion since its release. Using video2dataset, the team is currently transcribing a sizable collection of podcasts to make the resulting text dataset (targeting 50B tokens) publicly available.

Many exciting modeling ideas: hopefully, with improved dataset curation tooling, more people will attempt to push the SOTA in the video and audio modalities.

video2dataset is a fully open-source project, and researchers are committed to developing it in the open. This means all the relevant TODOs and future directions can be found in the issues tab of the repository. Contributions are welcomed; the best way to do that is to pick out a problem, address it, and submit a pull request.

Check out the Blog and GitHub link.


The post LAION AI Introduces Video2Dataset: An Open-Source Tool Designed To Curate Video And Audio Datasets Efficiently And At Scale appeared first on MarkTechPost.

Effectively solve distributed training convergence issues with Amazon …

Recent years have shown amazing growth in deep neural networks (DNNs). This growth can be seen in more accurate models and even in new possibilities opened up by generative AI: large language models (LLMs) that synthesize natural language, text-to-image generators, and more. These increased capabilities of DNNs come with the cost of having massive models that require significant computational resources in order to be trained. Distributed training addresses this problem with two techniques: data parallelism and model parallelism. Data parallelism is used to scale the training process over multiple nodes and workers, and model parallelism splits a model and fits it across the designated infrastructure. Amazon SageMaker distributed training jobs enable you with one click (or one API call) to set up a distributed compute cluster, train a model, save the result to Amazon Simple Storage Service (Amazon S3), and shut down the cluster when complete. Furthermore, SageMaker has continuously innovated in the distributed training space by launching features like heterogeneous clusters and distributed training libraries for data parallelism and model parallelism.
Efficient training on a distributed environment requires adjusting hyperparameters. A common example of good practice when training on multiple GPUs is to multiply batch (or mini-batch) size by the GPU number in order to keep the same batch size per GPU. However, adjusting hyperparameters often impacts model convergence. Therefore, distributed training needs to balance three factors: distribution, hyperparameters, and model accuracy.
In this post, we explore the effect of distributed training on convergence and how to use Amazon SageMaker Automatic Model Tuning to fine-tune model hyperparameters for distributed training using data parallelism.
The source code mentioned in this post can be found on the GitHub repository (an m5.xlarge instance is recommended).
Scale out training from a single to distributed environment
Data parallelism is a way to scale the training process to multiple compute resources and achieve faster training time. With data parallelism, data is partitioned among the compute nodes, and each node computes the gradients based on their partition and updates the model. These updates can be done using one or multiple parameter servers in an asynchronous, one-to-many, or all-to-all fashion. Another way can be to use an AllReduce algorithm. For example, in the ring-allreduce algorithm, each node communicates with only two of its neighboring nodes, thereby reducing the overall data transfers. To learn more about parameter servers and ring-allreduce, see Launching TensorFlow distributed training easily with Horovod or Parameter Servers in Amazon SageMaker. With regards to data partitioning, if there are n compute nodes, then each node should get a subset of the data, approximately 1/n in size.
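The toy sketch below (plain NumPy, not SageMaker code) shows the core of this scheme: each of n workers computes gradients on its own 1/n partition of the data, and an all-reduce-style average produces one identical update for every worker:

import numpy as np

# Toy illustration of data-parallel gradient averaging (conceptual, not SageMaker code).

def local_gradient(weights, x_part, y_part):
    # Gradient of mean squared error for a linear model on one data partition.
    preds = x_part @ weights
    return 2 * x_part.T @ (preds - y_part) / len(y_part)

rng = np.random.default_rng(0)
x, y = rng.normal(size=(1200, 5)), rng.normal(size=1200)
weights = np.zeros(5)
n_workers = 4

x_parts = np.array_split(x, n_workers)
y_parts = np.array_split(y, n_workers)

grads = [local_gradient(weights, xp, yp) for xp, yp in zip(x_parts, y_parts)]
avg_grad = np.mean(grads, axis=0)   # the "all-reduce" step
weights -= 0.01 * avg_grad          # identical update applied on every worker
print("updated weights:", weights)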
To demonstrate the effect of scaling out training on model convergence, we run two simple experiments:

Train an image classification model using a fully connected DNN with ReLU activation functions, implemented with the MXNet and Gluon frameworks. For training data, we used the MNIST dataset of handwritten digits and the source code provided in the SageMaker example repository.
Train a binary classification model using the SageMaker built-in XGBoost algorithm. We used the direct marketing dataset to predict bank customers who are likely to respond with a specific offer. The source code and steps to reproduce the experiment can be found on the GitHub repo.

Each model training ran twice: on a single instance and distributed over multiple instances. For the DNN distributed training, in order to fully utilize the distributed processors, we multiplied the mini-batch size by the number of instances (four). The following table summarizes the setup and results.

| | Image classification (DNN) | Image classification (DNN) | Binary classification (XGBoost) | Binary classification (XGBoost) |
| --- | --- | --- | --- | --- |
| Instance | ml.c4.xlarge | ml.c4.xlarge | ml.m5.2xlarge | ml.m5.2xlarge |
| Data set | MNIST (labeled images) | MNIST (labeled images) | Direct Marketing (tabular, numeric and vectorized categories) | Direct Marketing (tabular, numeric and vectorized categories) |
| Validation metric | Accuracy | Accuracy | AUC | AUC |
| Epochs/Rounds | 20 | 20 | 150 | 150 |
| Number of instances | 1 | 4 | 1 | 3 |
| Distribution type | N/A | Parameter server | N/A | AllReduce |
| Training time (minutes) | 8 | 3 | 3 | 1 |
| Final validation score | 0.97 | 0.11 | 0.78 | 0.63 |

For both models, the training time was reduced almost linearly by the distribution factor. However, model convergence suffered a significant drop. This behavior is consistent for the two different models, the different compute instances, the different distribution methods, and different data types. So, why did distributing the training process affect model accuracy?
There are a number of theories that try to explain this effect:

When tensor updates are big in size, traffic between workers and the parameter server can get congested. Therefore, asynchronous parameter servers will suffer significantly worse convergence due to delays in weights updates [1].
Increasing batch size can lead to over-fitting and poor generalization, thereby reducing the validation accuracy [2].
When asynchronously updating model parameters, some DNNs might not be using the most recent updated model weights; therefore, they will be calculating gradients based on weights that are a few iterations behind. This leads to weight staleness [3] and can be caused by a number of reasons.
Some hyperparameters are model or optimizer specific. For example, the XGBoost official documentation notes that the exact value of the tree_method hyperparameter doesn’t support distributed training, because XGBoost employs row-splitting data distribution whereas the exact tree method works on a sorted column format.
Some researchers have proposed that configuring a larger mini-batch may lead to gradients with less stochasticity. This can happen when the loss function contains local minima and saddle points and no change is made to the step size, leading the optimization to get stuck in such local minima or saddle points [4].

Optimize for distributed training
Hyperparameter optimization (HPO) is the process of searching and selecting a set of hyperparameters that are optimal for a learning algorithm. SageMaker Automatic Model Tuning (AMT) provides HPO as a managed service by running multiple training jobs on the provided dataset. SageMaker AMT searches the ranges of hyperparameters that you specify and returns the best values, as measured by a metric that you choose. You can use SageMaker AMT with the built-in algorithms or use your custom algorithms and containers.
However, optimizing for distributed training differs from common HPO because instead of launching a single instance per training job, each job actually launches a cluster of instances. This means a greater impact on cost (especially if you consider costly GPU-accelerated instances, which are typical for DNN). In addition to AMT limits, you could possibly hit SageMaker account limits for concurrent number of training instances. Finally, launching clusters can introduce operational overhead due to longer starting time. SageMaker AMT has specific features to address these issues. Hyperband with early stopping ensures that well-performing hyperparameters configurations are fine-tuned and those that underperform are automatically stopped. This enables efficient use of training time and reduces unnecessary costs. Also, SageMaker AMT fully supports the use of Amazon EC2 Spot Instances, which can optimize the cost of training up to 90% over on-demand instances. With regards to long start times, SageMaker AMT automatically reuses training instances within each tuning job, thereby reducing the average startup time of each training job by 20 times. Additionally, you should follow AMT best practices, such as choosing the relevant hyperparameters, their appropriate ranges and scales, and the best number of concurrent training jobs, and setting a random seed to reproduce results.
In the next section, we see these features in action as we configure, run, and analyze an AMT job using the XGBoost example we discussed earlier.
Configure, run, and analyze a tuning job
As mentioned earlier, the source code can be found on the GitHub repo. In Steps 1–5, we download and prepare the data, create the xgb3 estimator (the distributed XGBoost estimator is set to use three instances), run the training jobs, and observe the results. In this section, we describe how to set up the tuning job for that estimator, assuming you already went through Steps 1–5.
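For orientation, a distributed built-in XGBoost estimator along the lines of xgb3 can be created roughly as follows; the container version, instance settings, and hyperparameters shown here are assumptions, and the exact setup is in the notebook on the GitHub repo:

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Built-in XGBoost container; the version tag is an assumption, check the notebook.
xgboost_image = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

xgb3 = Estimator(
    image_uri=xgboost_image,
    role=role,
    instance_count=3,                 # three instances -> distributed training
    instance_type="ml.m5.2xlarge",
    output_path=f"s3://{session.default_bucket()}/xgb3-output",
    sagemaker_session=session,
)
xgb3.set_hyperparameters(objective="binary:logistic", num_round=100)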
A tuning job computes optimal hyperparameters for the training jobs it launches by using a metric to evaluate performance. You can configure your own metric, which SageMaker will parse based on regex you configure and emit to stdout, or use the metrics of SageMaker built-in algorithms. In this example, we use the built-in XGBoost objective metric, so we don’t need to configure a regex. To optimize for model convergence, we optimize based on the validation AUC metric:

objective_metric_name="validation:auc"

We tune seven hyperparameters:

num_round – Number of rounds for boosting during the training.
eta – Step size shrinkage used in updates to prevent overfitting.
alpha – L1 regularization term on weights.
min_child_weight – Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning.
max_depth – Maximum depth of a tree.
colsample_bylevel – Subsample ratio of columns for each split, in each level. This subsampling takes place once for every new depth level reached in a tree.
colsample_bytree – Subsample ratio of columns when constructing each tree. For every tree constructed, the subsampling occurs once.

To learn more about XGBoost hyperparameters, see XGBoost Hyperparameters. The following code shows the seven hyperparameters and their ranges:

hyperparameter_ranges = {
    "num_round": IntegerParameter(100, 200),
    "eta": ContinuousParameter(0, 1),
    "min_child_weight": ContinuousParameter(1, 10),
    "alpha": ContinuousParameter(0, 2),
    "max_depth": IntegerParameter(1, 10),
    "colsample_bylevel": ContinuousParameter(0, 1),
    "colsample_bytree": ContinuousParameter(0, 1),
}

Next, we provide the configuration for the Hyperband strategy and the tuner object configuration using the SageMaker SDK. HyperbandStrategyConfig can use two parameters: max_resource (optional) for the maximum number of iterations to be used for a training job to achieve the objective, and min_resource – the minimum number of iterations to be used by a training job before stopping the training. We use HyperbandStrategyConfig to configure StrategyConfig, which is later used by the tuning job definition. See the following code:

hsc = HyperbandStrategyConfig(max_resource=30, min_resource=1)
sc = StrategyConfig(hyperband_strategy_config=hsc)

Now we create a HyperparameterTuner object, to which we pass the following information:

The XGBoost estimator, set to run with three instances
The objective metric name and definition
Our hyperparameter ranges
Tuning resource configurations such as number of training jobs to run in total and how many training jobs can be run in parallel
Hyperband settings (the strategy and configuration we configured in the last step)
Early stopping (early_stopping_type) set to Off

Why do we set early stopping to Off? Training jobs can be stopped early when they are unlikely to improve the objective metric of the hyperparameter tuning job. This can help reduce compute time and avoid overfitting your model. However, Hyperband uses an advanced built-in mechanism to apply early stopping. Therefore, the parameter early_stopping_type must be set to Off when using the Hyperband internal early stopping feature. See the following code:

tuner = HyperparameterTuner(
    xgb3,
    objective_metric_name,
    hyperparameter_ranges,
    max_jobs=30,
    max_parallel_jobs=4,
    strategy="Hyperband",
    early_stopping_type="Off",
    strategy_config=sc,
)

Finally, we start the automatic model tuning job by calling the fit method. If you want to launch the job in an asynchronous fashion, set wait to False. See the following code:

tuner.fit(
    {"train": s3_input_train, "validation": s3_input_validation},
    include_cls_metadata=False,
    wait=True,
)

You can follow the job progress and summary on the SageMaker console. In the navigation pane, under Training, choose Hyperparameter tuning jobs, then choose the relevant tuning job. The tuning job page shows details on the training jobs’ status and performance.

When the tuning job is complete, we can review the results. In the notebook example, we show how to extract results using the SageMaker SDK. First, we examine how the tuning job increased model convergence. You can attach the HyperparameterTuner object using the job name and call the describe method. The method returns a dictionary containing tuning job metadata and results.
In the following code, we retrieve the value of the best-performing training job, as measured by our objective metric (validation AUC):

tuner = HyperparameterTuner.attach(tuning_job_name=tuning_job_name)
tuner.describe()["BestTrainingJob"]["FinalHyperParameterTuningJobObjectiveMetric"]["Value"]

The result is 0.78 in AUC on the validation set. That’s a significant improvement over the initial 0.63!
Next, let’s see how fast our training job ran. For that, we use the HyperparameterTuningJobAnalytics method in the SDK to fetch results about the tuning job, and read into a Pandas data frame for analysis and visualization:

tuner_analytics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)
full_df = tuner_analytics.dataframe()
full_df.sort_values(by=["FinalObjectiveValue"], ascending=False).head()

Let’s see the average time a training job took with Hyperband strategy:

full_df["TrainingElapsedTimeSeconds"].mean()

The average training job took approximately 1 minute. This is consistent with the Hyperband strategy’s mechanism of stopping underperforming training jobs early. In terms of cost, the tuning job charged us for a total of 30 minutes of training time. Without Hyperband early stopping, the total billable training duration was expected to be 90 minutes (30 jobs * 1 minute per job * 3 instances per job). That is a threefold cost saving. Finally, we see that the tuning job ran 30 training jobs and took a total of 12 minutes, almost 50% less than the expected time (30 jobs / 4 jobs in parallel * 3 minutes per job).
Conclusion
In this post, we described some observed convergence issues when training models with distributed environments. We saw that SageMaker AMT using Hyperband addressed the main concerns that optimizing data parallel distributed training introduced: convergence (which improved by more than 10%), operational efficiency (the tuning job took 50% less time than a sequential, non-optimized job would have taken) and cost-efficiency (30 vs. the 90 billable minutes of training job time). The following table summarizes our results:

| Improvement Metric | No Tuning/Naive Model Tuning Implementation | SageMaker Hyperband Automatic Model Tuning | Measured Improvement |
| --- | --- | --- | --- |
| Model quality (measured by validation AUC) | 0.63 | 0.78 | 15% |
| Cost (measured by billable training minutes) | 90 | 30 | 66% |
| Operational efficiency (measured by total running time, minutes) | 24 | 12 | 50% |

In order to fine-tune with regards to scaling (cluster size), you can repeat the tuning job with multiple cluster configurations and compare the results to find the optimal hyperparameters that satisfy speed and model accuracy.
We included the steps to achieve this in the last section of the notebook.
References
[1] Lian, Xiangru, et al. “Asynchronous decentralized parallel stochastic gradient descent.” International Conference on Machine Learning. PMLR, 2018.
[2] Keskar, Nitish Shirish, et al. “On large-batch training for deep learning: Generalization gap and sharp minima.” arXiv preprint arXiv:1609.04836 (2016).
[3] Dai, Wei, et al. “Toward understanding the impact of staleness in distributed machine learning.” arXiv preprint arXiv:1810.03264 (2018).
[4] Dauphin, Yann N., et al. “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.” Advances in neural information processing systems 27 (2014).

About the Author
Uri Rosenberg is the AI & ML Specialist Technical Manager for Europe, Middle East, and Africa. Based out of Israel, Uri works to empower enterprise customers to design, build, and operate ML workloads at scale. In his spare time, he enjoys cycling, hiking, and complaining about data preparation.

Deep Language Models are getting increasingly better by learning to pr …

Deep learning has made significant strides in text generation, translation, and completion in recent years. Algorithms trained to predict words from their surrounding context have been instrumental in these advances. However, despite access to vast amounts of training data, deep language models still struggle with tasks like long story generation, summarization, coherent dialogue, and information retrieval. They have been shown to struggle to capture syntactic and semantic properties, and their linguistic understanding remains rather superficial. Predictive coding theory suggests that the human brain makes predictions over multiple timescales and levels of representation across the cortical hierarchy. Although studies have previously shown evidence of speech predictions in the brain, the nature of the predicted representations and their temporal scope remain largely unknown. Recently, researchers analyzed the brain signals of 304 individuals listening to short stories and found that enhancing deep language models with long-range and multi-level predictions improved brain mapping.

The results of this study revealed a hierarchical organization of language predictions in the cortex. These findings align with predictive coding theory, which suggests that the brain makes predictions over multiple levels and timescales of expression. Researchers can bridge the gap between human language processing and deep learning algorithms by incorporating these ideas into deep language models.

The current study evaluated specific hypotheses of predictive coding theory by examining whether cortical hierarchy predicts several levels of representations, spanning multiple timescales, beyond the neighborhood and word-level predictions usually learned in deep language algorithms. Modern deep language models and the brain activity of 304 people listening to spoken tales were compared. It was discovered that the activations of deep language algorithms supplemented with long-range and high-level predictions best describe brain activity.

The study made three main contributions. First, it found that the supramarginal gyrus and the lateral, dorsolateral, and inferior frontal cortices have the largest prediction distances and actively anticipate future language representations. The superior temporal sulcus and gyrus are best modeled by low-level predictions, while high-level predictions best model the middle temporal, parietal, and frontal regions. Second, the depth of predictive representations varies along a similar anatomical architecture. Finally, the study demonstrated that semantic features, rather than syntactic ones, drive long-range forecasts.

According to the data, the lateral, dorsolateral, inferior, and supramarginal gyri were shown to have the longest predicted distances. These cortical areas are linked to high-level executive activities like abstract thought, long-term planning, attentional regulation, and high-level semantics. According to the research, these regions, which are at the top of the language hierarchy, may actively anticipate future language representations in addition to passively processing past stimuli.

The study also demonstrated variations in the depth of predictive representations along the same anatomical organization: the superior temporal sulcus and gyrus are best modeled by low-level predictions, while high-level predictions best model the middle temporal, parietal, and frontal regions. These results are consistent with the hypothesis that, in contrast to present-day language algorithms, the brain predicts representations at several levels rather than only at the word level.

Finally, the researchers separated the brain activations into syntactic and semantic representations, discovering that semantic features, rather than syntactic ones, drive long-range forecasts. This finding supports the hypothesis that high-level semantic prediction lies at the heart of long-form language processing.

The study’s overall conclusion is that benchmarks for natural language processing might be improved, and models could become more like the brain by consistently training algorithms to predict numerous timelines and levels of representation.

Check out the Paper, Dataset and Code. All credit for this research goes to the researchers on this project.


The post Deep Language Models are getting increasingly better by learning to predict the next word from its context: Is this really what the human brain does? appeared first on MarkTechPost.

Meet DeepOnto: A Python Package for Ontology Engineering with Deep Lea …

Advances in Deep Learning Methodologies are greatly impacting the Artificial Intelligence community. With some great innovations and developments, a number of tasks are getting easier. Deep Learning techniques are being widely used in almost every industry, be it healthcare, social media, engineering, finance, or education. One of the best deep learning inventions is Large Language Models (LLMs), which recently got popular and are mostly in the headlines for their incredible use cases. These models imitate humans and, by utilizing the power of Natural Language Processing or Computer Vision, demonstrate some amazing solutions. 

The application of Large Language Models in the field of Ontology Engineering has been the topic of discussion ever since. Ontology engineering is a branch of knowledge engineering that is concerned with the creation, building, curation, assessment, and upkeep of ontologies. An ontology is basically a formal and precise specification of knowledge within a particular area that offers a systematic vocabulary of concepts and attributes, along with the relationships between them, in order to enable a shared understanding of semantics between humans and machines. 

Well-known ontology APIs like the OWL API and Jena are mostly Java-based, while deep learning frameworks like PyTorch and TensorFlow are developed primarily for Python. To address this gap, a team of researchers has introduced DeepOnto, a Python package developed specifically for ontology engineering that enables seamless integration of these frameworks and APIs.

The DeepOnto package provides comprehensive, general, and Python-friendly support for deep learning-based ontology engineering. Its foundation is an ontology processing module that supports basic operations such as loading, saving, querying entities, and modifying entities and axioms, along with advanced functions like reasoning and verbalization. It also includes tools and resources for ontology alignment, ontology completion, and ontology-based language model probing.

The team chose the OWL API as the backend dependency for DeepOnto because of its stability, reliability, and widespread adoption in notable projects and tools such as ROBOT and HermiT. PyTorch is the foundation for DeepOnto’s deep learning dependencies due to its dynamic computation graph, which permits runtime adjustment of the model’s architecture and offers flexibility and usability. Hugging Face’s Transformers library is used for language model applications, and the OpenPrompt library supports the prompt learning paradigm, a crucial underpinning for large language models like ChatGPT.

DeepOnto’s basic ontology processing module is made up of several parts, each of which performs a particular task. Ontology is DeepOnto’s base class and offers the fundamental methods for viewing and changing an ontology. Ontology reasoning is used for conducting reasoning activities. Ontology pruning takes an ontology and extracts a scalable subset of it according to particular criteria, such as semantic types. Lastly, ontology verbalization turns ontology elements into natural language text, improving the ontology’s accessibility and aiding a variety of ontology engineering activities.
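As a quick orientation, loading an ontology with DeepOnto’s base class looks roughly like the snippet below; the import path and attribute name are taken from the project’s documentation as best understood here and should be verified against the current release:

# Rough sketch of DeepOnto's basic ontology loading; names are assumptions to verify
# against the DeepOnto documentation.
from deeponto.onto import Ontology

onto = Ontology("path/to/your_ontology.owl")   # loads the ontology via the OWL API backend
print(len(onto.owl_classes))                   # assumed mapping from IRIs to OWL class objects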

The team demonstrates the practical utility of DeepOnto with two use cases. In the first, DeepOnto supports ontology engineering tasks within the framework of Digital Health Coaching at Samsung Research UK. The second is the Bio-ML track of the Ontology Alignment Evaluation Initiative (OAEI), where DeepOnto is used to align and complete biomedical ontologies with deep learning techniques.

In conclusion, DeepOnto is a strong package for ontology engineering and is a great addition to the developments in the field of Artificial Intelligence. For future implementations and projects like logic embeddings and the discovery and introduction of new concepts, DeepOnto provides a flexible and expandable interface.

Check out the Paper and GitHub link.


The post Meet DeepOnto: A Python Package for Ontology Engineering with Deep Learning appeared first on MarkTechPost.

Researchers From ETH Zurich and Microsoft Introduce LightGlue: A Deep …

Matching corresponding points between images is crucial to many computer vision applications, such as camera tracking and 3D mapping. The conventional approach uses sparse interest points and high-dimensional representations that are matched based on their visual appearance. However, accurately describing each point becomes challenging in scenarios with symmetries, weak texture, or variations in viewpoint and lighting. Additionally, these representations should be able to distinguish outliers caused by occlusion and missing points. Balancing robustness and uniqueness therefore proves complicated.

To address these limitations, a research team from ETH Zurich and Microsoft introduced a novel paradigm called LightGlue. LightGlue utilizes a deep network that simultaneously considers both images to match sparse points and reject outliers jointly. The network incorporates the Transformer model, which learns to match challenging image pairs by leveraging large datasets. This approach has demonstrated robust image-matching capabilities in indoor and outdoor environments. LightGlue has proven to be highly effective for visual localization in challenging conditions and has shown promising performance in other tasks, including aerial matching, object pose estimation, and fish re-identification.

Despite its effectiveness, the original approach, known as “SuperGlue,” is computationally expensive, making it unsuitable for tasks requiring low latency or high processing volumes. Additionally, training SuperGlue models is notoriously challenging and demands significant computing resources. As a result, subsequent attempts to improve the SuperGlue model have failed to improve its performance. However, since the publication of SuperGlue, there have been significant advancements and applications of Transformer models in language and vision tasks. In response, the researchers designed LightGlue as a more accurate, efficient, and easier-to-train alternative to SuperGlue. They reexamined the design choices and introduced numerous simple yet effective architectural modifications. By distilling a recipe for training high-performance deep matches with limited resources, the team achieved state-of-the-art accuracy within a few GPU days.

LightGlue offers a Pareto-optimal solution, striking a balance between efficiency and accuracy. Unlike previous approaches, LightGlue adapts to the difficulty of each image pair. It predicts correspondences after each computational block and dynamically determines whether further computation is necessary based on confidence. By discarding unmatchable points early on, LightGlue focuses on the area of interest, improving efficiency.
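The adaptive-compute idea can be pictured with a small, generic early-exit loop like the one below (plain PyTorch, not LightGlue’s actual code): after each block the network scores its own confidence and stops as soon as the prediction looks reliable:

import torch
import torch.nn as nn

class EarlyExitMatcherSketch(nn.Module):
    """Generic illustration of confidence-based early exit, not LightGlue's implementation."""

    def __init__(self, dim: int = 64, num_blocks: int = 9, exit_threshold: float = 0.95):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(num_blocks)
        )
        self.confidence_head = nn.Linear(dim, 1)   # per-point "is this prediction final?" score
        self.exit_threshold = exit_threshold

    def forward(self, desc):
        for depth, block in enumerate(self.blocks, start=1):
            desc = block(desc)
            confidence = torch.sigmoid(self.confidence_head(desc)).mean()
            if confidence > self.exit_threshold:   # easy pair: stop early and save compute
                return desc, depth
        return desc, len(self.blocks)              # hard pair: use the full depth

# Descriptors for 512 keypoints of one image pair, shape (batch, num_points, dim)
features, used_depth = EarlyExitMatcherSketch()(torch.randn(1, 512, 64))
print("exited after", used_depth, "blocks")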

Experimental results demonstrate that LightGlue outperforms existing sparse and dense matchers. It is a seamless replacement for SuperGlue, delivering strong matches from local features while significantly reducing runtime. This advancement opens up exciting opportunities for deploying deep matchers in latency-sensitive applications, such as simultaneous localization and mapping (SLAM) and reconstructing larger scenes from crowd-sourced data.

The LightGlue model and training code will be publicly available under a permissive license. This release empowers researchers and practitioners to utilize LightGlue’s capabilities and contribute to advancing computer vision applications that require efficient and accurate image matching.

Check out the Paper and Code.


The post Researchers From ETH Zurich and Microsoft Introduce LightGlue: A Deep Neural Network That Learns To Match Local Features Across Images appeared first on MarkTechPost.

University of Michigan Researchers Open-Source ‘FedScale’: a Feder …

Federated learning (FL) is a machine learning (ML) setting in which a logically centralized coordinator orchestrates numerous dispersed clients (e.g., cellphones or laptops) to train or evaluate a model collectively. It enables model training and evaluation on end-user data while avoiding the significant costs and privacy hazards of acquiring raw data from clients, with applications spanning a wide range of ML tasks. Existing work has focused on improving critical aspects of FL in the context of varied execution speeds of client devices and non-IID data distributions.

A thorough benchmark for evaluating an FL solution must study its behavior in a practical FL scenario with (1) data heterogeneity and (2) device heterogeneity under (3) heterogeneous connectivity and (4) availability conditions at (5) many scales on a (6) wide range of ML tasks. While the first two elements are frequently cited in the literature, real network connections and client device availability might impact both forms of heterogeneity, impeding model convergence. Similarly, large-scale assessment can reveal an algorithm’s resilience since actual FL deployment frequently involves thousands of concurrent participants out of millions of customers.

Overlooking just one component can skew the FL assessment. Regrettably, established FL benchmarks frequently fall short across numerous dimensions. For starters, they offer limited data realism for many real-world FL applications. Even though they provide many datasets and FL training objectives (e.g., LEAF), their datasets frequently comprise synthetically created partitions derived from conventional datasets (e.g., CIFAR) and do not represent realistic characteristics. This is because these benchmarks are primarily based on classic ML benchmarks (e.g., MLPerf) or are built for simulated FL environments such as TensorFlow Federated or PySyft.

Second, existing benchmarks frequently ignore system performance, connection, and client availability (e.g., FedML and Flower). This inhibits FL attempts from considering system efficiency, resulting in unduly optimistic statistical performance. Third, their datasets are predominantly small-scale because their experimental setups cannot simulate large-scale FL deployments. While real FL frequently involves thousands of participants in each training cycle, most available benchmarking platforms can only train tens of participants each round.

Finally, most of them lack user-friendly APIs for automated integration, necessitating significant engineering effort for large-scale benchmarking. To facilitate comprehensive and standardized FL evaluation, the researchers present FedScale, an FL benchmark and accompanying runtime. To the best of their knowledge, FedScale has the most comprehensive collection of FL datasets for examining various aspects of practical FL deployments. It currently has 20 real FL datasets of small, medium, and large scale covering a wide range of task categories, including image classification, object detection, word prediction, speech recognition, and reinforcement learning.

The FedScale Runtime standardizes and simplifies FL evaluation under more realistic conditions. It includes a mobile backend for on-device FL evaluation and a cluster backend for benchmarking practical FL metrics (for example, real client round durations) on GPUs/CPUs using realistic FL statistical and system information. The cluster backend can efficiently train thousands of clients in each round on a small number of GPUs. FedScale Runtime is also extensible, allowing quick implementation of new algorithms and ideas through flexible APIs. The researchers conducted systematic experiments to show how FedScale enables thorough FL benchmarking and to highlight the critical need to co-optimize system and statistical efficiency, particularly when dealing with system stragglers, accuracy bias, and device energy trade-offs.

FedScale (fedscale.ai) provides high-level APIs for implementing FL algorithms and for deploying and evaluating them at scale across various hardware and software backends. FedScale also features the most comprehensive FL benchmark, spanning FL tasks from image classification and object detection to language modeling and speech recognition. Furthermore, it delivers datasets that faithfully reflect the FL training scenarios where FL will be applied in practice. FedScale is open source, and the code is freely available on GitHub.

This Article is written as a summary article by Marktechpost Staff based on the research paper ‘FedScale: Benchmarking Model and System Performance of Federated Learning at Scale’. All Credit For This Research Goes To Researchers on This Project. Checkout the paper, github link and reference article.


The post University of Michigan Researchers Open-Source ‘FedScale’: a Federated Learning (FL) Benchmarking Suite with Realistic Datasets and a Scalable Runtime to Enable Reproducible FL Research on Privacy-Preserving Machine Learning appeared first on MarkTechPost.

Researchers at Stanford have developed an Artificial Intelligence (AI) …

Stock market behavior forecasting is a crucial undertaking that requires careful attention since, with the right choices, a successful prediction of stock prices might result in attractive gains. Because the data is non-stationary, noisy, interdependent, and chaotic, stock market forecasting is a significant challenge, making it difficult for investors to invest their money in a way that yields profits. Given the importance of this area, machine learning experts have proposed several models that aim to predict the future value of stock market groups.

Earlier works used traditional machine learning techniques such as support vector regression, random forests, and Bayesian models. More recently, researchers have turned to deep learning. Deep neural networks such as LSTMs and encoder-decoder architectures are increasingly used for stock market prediction since they handle the time-series nature of the data more effectively.

StockBot, a new approach from Stanford University researchers, is designed to help investors make a daily buy-or-sell decision. It is a generalizable price-prediction model based on stacked LSTMs that can forecast prices even for new stocks without sufficient historical data.
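The summary does not give the exact network configuration, but a minimal PyTorch sketch of a stacked, many-to-one LSTM price predictor might look as follows; the layer widths and the 30-day input window are illustrative assumptions rather than the paper's settings.

import torch
import torch.nn as nn

class StackedLSTMPredictor(nn.Module):
    """Many-to-one stacked LSTM: a window of past closing prices -> next price."""
    def __init__(self, input_size=1, hidden_size=64, num_layers=2):
        super().__init__()
        # num_layers=2 gives the "double-stacked" LSTM variant.
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):              # x: (batch, window, 1)
        out, _ = self.lstm(x)          # out: (batch, window, hidden)
        return self.head(out[:, -1])   # use the last time step -> (batch, 1)

# Illustrative usage: a batch of 8 windows, each holding 30 past prices.
model = StackedLSTMPredictor()
windows = torch.randn(8, 30, 1)
next_prices = model(windows)           # shape (8, 1)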

Generally, LSTM-based prediction models are trained on the price history of a single stock and can only run inference on that same stock. The authors instead train the network for a whole industry sector, such as “energy” or “finance.” Concretely, past and future prices from multiple tickers in the same industry are combined into a mixed training and/or test set. The model can then operate in two modes: although training uses the combined set, prediction can be run either over all tickers or for a single one, which yields more robust predictions for stocks with little historical data. In addition, a bot performs the buy or sell operations at market close every day in order to maximize gains. The decision is computed analytically from the predicted stock prices, without any additional training. The bot follows this rule (a short code sketch appears after the rule):

1) Compute the daily change signs δ_i = sign(c_{i+1} − c_i), where c_i is the stock price on day i.

2) Track how these signs evolve by computing Δ_i = δ_{i+1} − δ_i.

The decision depends on the value of Δ_i: when Δ_i = −2, the bot buys, since this indicates the end of a trough; when Δ_i = 2, it indicates the beginning of a dip, so the bot sells.
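As a minimal sketch, the rule above can be implemented in a few lines of NumPy; predicted_close is a hypothetical array of (predicted) closing prices, and the buy/sell mapping follows the description given above.

import numpy as np

def bot_decisions(predicted_close):
    """Turn a series of closing prices into buy/sell/hold signals using the delta rule above."""
    c = np.asarray(predicted_close, dtype=float)
    delta = np.sign(c[1:] - c[:-1])       # delta_i = sign(c_{i+1} - c_i)
    big_delta = delta[1:] - delta[:-1]    # Delta_i = delta_{i+1} - delta_i
    # As stated in the summary: Delta == -2 -> buy, Delta == +2 -> sell, otherwise hold.
    return ["buy" if d == -2 else "sell" if d == 2 else "hold" for d in big_delta]

# Hypothetical example with six predicted closing prices.
print(bot_decisions([10.0, 9.5, 9.2, 9.8, 10.4, 10.1]))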

In the experimental study, the authors compared several prediction models, including single and stacked many-to-one LSTM architectures and an encoder-decoder model. The results show that the single- and double-stacked LSTMs are the best architectures. In addition, for simplicity and speed, predicting multiple future days at once proved more practical than forecasting one day at a time, which requires repeated inference. Finally, the bot's decisions outperform even the most aggressive ETFs and the main investment products offered by investment firms.

In summary, this article presented a stock market prediction model with two major advantages: first, it can predict stocks with limited historical data by training the network on several other firms in the same industry sector; second, it provides decision support through a bot that decides when to buy or sell based on daily changes in the predicted closing prices.

This Article is written as a research summary article by Marktechpost Staff based on the research article ‘StockBot: Using LSTMs to Predict Stock Prices’. All Credit For This Research Goes To Researchers on This Project. Check out the paper and GitLab link.

Please Don’t Forget To Join Our ML Subreddit

The post Researchers at Stanford have developed an Artificial Intelligence (AI) model,’ StockBot’, which uses LSTMs to predict stock prices with gains higher than the most aggressive ETFs appeared first on MarkTechPost.

Access private repos using the @remote decorator for Amazon SageMaker …

As more and more customers are looking to put machine learning (ML) workloads in production, there is a large push in organizations to shorten the development lifecycle of ML code. Many organizations prefer writing their ML code in a production-ready style in the form of Python methods and classes as opposed to an exploratory style (writing code without using methods or classes) because this helps them ship production-ready code faster.
With Amazon SageMaker, you can run your Python code as a SageMaker training job simply by annotating it with the @remote decorator. The SageMaker Python SDK automatically translates your existing workspace environment, along with any associated data processing code and datasets, into a SageMaker training job that runs on the SageMaker training platform.
Running a Python function locally often requires several dependencies, which may not come with the local Python runtime environment. You can install them via package and dependency management tools like pip or conda.
However, organizations in regulated industries like banking, insurance, and healthcare operate in environments with strict data privacy and networking controls. These controls often mandate that none of their environments have internet access, giving the organization full control over egress and ingress traffic and reducing the chances of malicious actors sending or receiving unverified information through the network. Such network isolation is also often required by audit and industry compliance rules. For ML, this prevents data scientists from downloading packages from public repositories like PyPI, Anaconda, or Conda-Forge.
To provide data scientists access to the tools of their choice while also respecting the restrictions of the environment, organizations often set up their own private package repository hosted in their own environment. You can set up private package repositories on AWS in multiple ways:

Using AWS CodeArtifact
Using Amazon Simple Storage Service (Amazon S3)
Hosting a repository on Amazon Elastic Compute Cloud (Amazon EC2)

In this post, we focus on the first option: using CodeArtifact.
Solution overview
The following architecture diagram shows the solution architecture.

The high-level steps to implement the solution are as follows:

Set up a virtual private cloud (VPC) with no internet access using an AWS CloudFormation template.
Use a second CloudFormation template to set up CodeArtifact as a private PyPI repository and provide connectivity to the VPC, and set up an Amazon SageMaker Studio environment to use the private PyPI repository.
Train a classification model based on the MNIST dataset using an @remote decorator from the open-source SageMaker Python SDK. All the dependencies will be downloaded from the private PyPI repository.

Note that using SageMaker Studio in this post is optional. You can choose to work in any integrated development environment (IDE) of your choice. You just need to set up your AWS Command Line Interface (AWS CLI) credentials correctly. For more information, refer to Configure the AWS CLI.
Prerequisites
You need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created as part of the solution. For details, refer to Creating an AWS account.
Set up a VPC with no internet connection
Create a new CloudFormation stack using the vpc.yaml template. This template creates the following resources:

A VPC with two private subnets across two Availability Zones with no internet connectivity
A Gateway VPC endpoint for accessing Amazon S3
Interface VPC endpoints for SageMaker, CodeArtifact, and a few other services to allow the resources in the VPC to connect to AWS services via AWS PrivateLink

Provide a stack name, such as No-Internet, and complete the stack creation process.

Wait for the stack creation process to complete.
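Stack creation can be done from the CloudFormation console or scripted. As a hedged illustration (not part of the original post), the same step could be performed with boto3 roughly as follows, where vpc.yaml is the template referenced above.

import boto3

cfn = boto3.client("cloudformation")

# Create the no-internet VPC stack from the template used in this post.
with open("vpc.yaml") as f:
    template_body = f.read()

# If the template creates IAM resources, a Capabilities argument would also be required.
cfn.create_stack(StackName="No-Internet", TemplateBody=template_body)

# Block until the stack is fully created before moving to the next step.
cfn.get_waiter("stack_create_complete").wait(StackName="No-Internet")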
Set up a private repository and SageMaker Studio using the VPC
The next step is to deploy another CloudFormation stack using the sagemaker_studio_codeartifact.yaml template. This template creates the following resources:

A SageMaker domain connected to the VPC created in the previous step
A CodeArtifact domain
A CodeArtifact private repository connected to an upstream public PyPI repository

Provide a stack name and keep the default values, or adjust the parameters for the CodeArtifact domain name, private repository name, user profile name for SageMaker Studio, and name of the upstream public PyPI repository. You also need to provide the name of the VPC stack created in the previous step.

When the stack creation is complete, the SageMaker domain should be visible on the SageMaker console.

To verify there is no internet connection available in SageMaker Studio, launch SageMaker Studio. Choose File, New, and Terminal to launch a terminal and try to curl any internet resource. It should fail to connect, as shown in the following screenshot.

Train an image classifier using an @remote decorator with the private PyPI repository
In this section, we use the @remote decorator to run a PyTorch training job that produces a MNIST image classification model. To achieve this, we set up a configuration file, develop the training script, and run the training code.
Set up a configuration file
We set up a config.yaml file and provide the configurations needed to do the following:

Run a SageMaker training job in the no-internet VPC created earlier
Download the required packages by connecting to the private PyPI repository created earlier

The file looks like the following code:

SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        Dependencies: '../config/requirements.txt'
        InstanceType: 'ml.m5.xlarge'
        PreExecutionCommands:
          - 'aws codeartifact login --tool pip --domain <domain-name> --domain-owner <AWS account number> --repository <private repository name> --endpoint-url <VPC-endpoint-url-prefixed with https://>'
        RoleArn: '<execution role ARN for running training job>'
        S3RootUri: '<s3 bucket to store the job output>'
        VpcConfig:
          SecurityGroupIds:
            - '<security group id used by SageMaker Studio>'
          Subnets:
            - '<VPC subnet id 1>'
            - '<VPC subnet id 2>'

The Dependencies field contains the path to requirements.txt, which contains all the dependencies needed. Note that all the dependencies will be downloaded from the private repository. The requirements.txt file contains the following code:

torch
torchvision
sagemaker>=2.156.0,<3

The PreExecutionCommands section contains the command to connect to the private PyPI repository. To get the CodeArtifact VPC endpoint URL, use the following code:

import boto3

boto3_session = boto3.session.Session()
ec2 = boto3_session.client('ec2')

# Look up the interface VPC endpoint created for the CodeArtifact API.
response = ec2.describe_vpc_endpoints(
    Filters=[
        {
            'Name': 'service-name',
            'Values': [
                f'com.amazonaws.{boto3_session.region_name}.codeartifact.api'
            ]
        },
    ]
)

code_artifact_api_vpc_endpoint = response['VpcEndpoints'][0]['DnsEntries'][0]['DnsName']

endpoint_url = f'https://{code_artifact_api_vpc_endpoint}'
endpoint_url

Generally, we get two VPC endpoints for CodeArtifact, and we can use any of them in the connection commands. For more details, refer to Use CodeArtifact from a VPC.
Additionally, configurations like execution role, output location, and VPC configurations are provided in the config file. These configurations are needed to run the SageMaker training job. To know more about all the configurations supported, refer to Configuration file.
Using the config.yaml file is not mandatory for working with the @remote decorator; it is simply a cleaner way to supply all configurations. The settings could also be passed directly as decorator arguments (a minimal sketch follows), but that reduces readability and makes changes harder to maintain over time. The config file can also be created by an administrator and shared with all users in an environment.
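For illustration only, a minimal sketch of passing settings directly as decorator arguments is shown below; the parameter names mirror the config keys above, but their availability and exact signatures should be verified against the SageMaker Python SDK documentation for your version.

from sagemaker.remote_function import remote

@remote(
    instance_type="ml.m5.xlarge",               # mirrors InstanceType in config.yaml
    dependencies="../config/requirements.txt",  # mirrors Dependencies
    include_local_workdir=True,
)
def perform_train(train_data, test_data):
    # PyTorch native training code ...
    ...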
Develop the training script
Next, we prepare the training code in simple Python files. We have divided the code into three files:

load_data.py – Contains the code to download the MNIST dataset
model.py – Contains the code for the neural network architecture for the model
train.py – Contains the code for training the model by using load_data.py and model.py

In train.py, we need to decorate the main training function as follows:

@remote(include_local_workdir=True)
def perform_train(train_data,
                  test_data,
                  *,
                  batch_size: int = 64,
                  test_batch_size: int = 1000,
                  epochs: int = 3,
                  lr: float = 1.0,
                  gamma: float = 0.7,
                  no_cuda: bool = True,
                  no_mps: bool = True,
                  dry_run: bool = False,
                  seed: int = 1,
                  log_interval: int = 10,
                  ):
    # PyTorch native training code ...

Now we’re ready to run the training code.
Run the training code with an @remote decorator
We can run the code from a terminal or from any executable prompt. In this post, we use a SageMaker Studio notebook cell to demonstrate this:

!python ./train.py

Running the preceding command triggers the training job. In the logs, we can see that it’s downloading the packages from the private PyPI repository.

This concludes the implementation of an @remote decorator working with a private repository in an environment with no internet access.
Clean up
To clean up the resources, follow the instructions in CLEANUP.md.
Conclusion
In this post, we learned how to effectively use the @remote decorator's capabilities while working in restrictive environments without any internet access. We also learned how to integrate a CodeArtifact private repository using the configuration file support in SageMaker. This solution makes iterative development much simpler and faster. Another added advantage is that you can continue to write training code in a more natural, object-oriented way and still use SageMaker capabilities to run training jobs on a remote cluster with minimal changes to your code. All the code shown in this post is available in the GitHub repository.
As a next step, we encourage you to check out the @remote decorator functionality and Python SDK API and use it in your choice of environment and IDE. Additional examples are available in the amazon-sagemaker-examples repository to get you started quickly. You can also check out the post Run your local machine learning code as Amazon SageMaker Training jobs with minimal code changes for more details.

About the author
Vikesh Pandey is a Machine Learning Specialist Solutions Architect at AWS, helping customers from financial industries design and build solutions on generative AI and ML. Outside of work, Vikesh enjoys trying out different cuisines and playing outdoor sports.

Huawei Researchers Develop Pangu-Σ: A Large Language Model With Spars …

Large Language Models (LLMs) have exhibited exceptional skills and potential in natural language processing, generation, and reasoning. Trained on large quantities of textual data, language model performance scales with compute budget and model parameters, displaying significant zero/few-shot learning skills and even emergent abilities. Since GPT-3, several big language models have been developed and published, including Megatron-Turing NLG, PanGu, ERNIE 3.0 Titan, Gopher, PaLM, OPT, BLOOM, and GLM-130B. Researchers have now begun constructing even bigger language models with more than one trillion parameters, generally using sparsely activated models such as Mixture-of-Experts (MoE).

Several notable trillion-parameter models exist, including Switch-C, GLaM, MoE-1.1T, Wu Dao 2.0, and M6-10T. Unfortunately, only a few have achieved the expected performance and published thorough evaluation results across a variety of tasks. According to the authors' observations, scaling efficiency is the main challenge. Current research on the scaling laws of language models shows that for LLMs to perform at their best, there must be a sufficient amount of training data and a reasonable computing budget. Designing a scalable model architecture and an efficient distributed training system that can ingest data with high training throughput is therefore one of the key motivations for this work.

• Scaling the model: LLM performance is expected to improve as model size grows. Compared to the high computational cost of training dense Transformer models, sparse architectures such as Mixture-of-Experts (MoE) are an attractive option for scaling model size without a linear rise in computational cost. Yet MoE models suffer from issues such as imbalanced workloads and global communication latency. There are also unresolved questions about how to extend an existing dense model with MoE and how many experts to place in each layer. Thus, developing a trillion-parameter sparse model with good performance and training efficiency is a critical but difficult challenge.

• Scaling the system: Frameworks such as DeepSpeed have been proposed to enable training models with a trillion parameters. The primary constraint is often a limited compute budget, or more precisely, the number of accelerator devices (such as GPUs, NPUs, and TPUs) that can be employed. Practitioners can train trillion-parameter models with workable batch sizes using tensor parallelism, pipeline parallelism, zero-redundancy optimizers, and rematerialization over thousands of accelerator devices. By using heterogeneous computing strategies, such as offloading part of the computation to host machines, practitioners can reduce the number of computing resources required.

However, the poor bandwidth between host and device, and the limited computational power of CPUs compared to accelerators, make it impossible to feed large language models with enough data and achieve optimal performance using present methodologies. Consequently, the effectiveness of large language models depends on scaling system performance within a restricted computing budget. In this paper, researchers from Huawei introduce Pangu-Σ, a large language model with a sparse architecture and 1.085 trillion parameters. They build Pangu-Σ within the MindSpore framework and train it over 100 days on a cluster of 512 Ascend 910 AI accelerators, using 329 billion tokens.

Pangu-Σ extends the built-in parameters of PanGu with a Transformer decoder architecture based on Random Routed Experts (RRE). Unlike traditional MoE, RRE uses two levels of routing: experts are grouped by task or domain at the first level, and at the second level tokens are assigned evenly and randomly to the experts within a group, without any learnable gating function as in MoE. With the RRE architecture, it is simple to extract sub-models from Pangu-Σ for various downstream applications, including conversation, translation, code generation, and general natural language understanding.
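The following is a minimal, illustrative sketch of the two-level routing idea described above, not the authors' implementation: experts are grouped by task/domain, and each token is then sent to a uniformly random expert within that group, with no learnable gate.

import torch

def random_route(tokens, experts_by_domain, domain):
    """Level 1: pick the expert group for this task/domain.
    Level 2: assign each token to a uniformly random expert in the group."""
    experts = experts_by_domain[domain]
    assignment = torch.randint(len(experts), (tokens.shape[0],))
    out = torch.empty_like(tokens)
    for idx, expert in enumerate(experts):
        mask = assignment == idx
        if mask.any():
            out[mask] = expert(tokens[mask])
    return out

# Illustrative usage: 16 token vectors of width 32, two tiny "experts" for one domain.
experts_by_domain = {"dialogue": [torch.nn.Linear(32, 32) for _ in range(2)]}
tokens = torch.randn(16, 32)
routed = random_route(tokens, experts_by_domain, "dialogue")   # shape (16, 32)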

They propose the Expert Computation and Storage Separation (ECSS) mechanism to make the training system efficient and scalable. ECSS achieves an observed throughput of 69,905 tokens/s when training the 1.085-trillion-parameter Pangu-Σ on a cluster of 512 Ascend 910 accelerators, and it significantly reduces host-to-device and device-to-host communication as well as the cost of optimizer update computation. Overall, training throughput is 6.3 times higher than that of a model with a standard MoE architecture and the same hyperparameters.

The Chinese-domain sub-model of Pangu-Σ significantly outperforms previous SOTA models, including the 13B-parameter PanGu and the 260B-parameter ERNIE 3.0 Titan, over 16 downstream tasks in six categories in the zero-shot setting, without any multitask fine-tuning or instruction tuning. In the corresponding areas, Pangu-Σ performs better than the SOTA models. It was trained on 329B tokens spanning more than 40 natural and programming languages. Moreover, the researchers evaluate how well the fine-tuned Pangu-Σ performs in several application domains, including conversation, machine translation, and code generation.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 16k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

The post Huawei Researchers Develop Pangu-Σ: A Large Language Model With Sparse Architecture And 1.085 Trillion Parameters appeared first on MarkTechPost.

Meet Text2Room: A New AI Method For Generating Room-Scale Textured 3D …

Mesh representations of 3D scenes are essential to many applications, from developing AR/VR assets to computer graphics. However, creating these 3D assets remains laborious and demands considerable skill. Recent efforts have used generative models, such as diffusion models, to produce high-quality images from text in the 2D realm. These techniques greatly lower the barriers to creating images containing a user's desired content and thereby help democratize content production. A new line of research attempts to apply comparable techniques to generate 3D models from text. However, current methods have drawbacks and lack the generality of 2D text-to-image models.

Dealing with the scarcity of 3D training data is one of the main difficulties in generating 3D models, since 3D datasets are much smaller than those used in many other applications, such as 2D image synthesis. For instance, methods that employ 3D supervision directly are frequently restricted to datasets of basic shapes, like ShapeNet. Recent techniques overcome these data constraints by formulating 3D generation as an iterative optimization problem in the image domain, lifting the expressive power of 2D text-to-image models into 3D. They demonstrate the ability to produce arbitrary (neural) shapes from text by constructing 3D objects stored in a radiance field representation. Unfortunately, extending these techniques to produce 3D structure and texture at room scale is challenging.

When creating large scenes, it is difficult to ensure that the output is dense and cohesive across outward-facing viewpoints and that these views include all necessary elements, such as walls, floors, and furniture. Meanwhile, a mesh remains the preferred representation for several end-user applications, including rendering on affordable hardware. To address these drawbacks, researchers from TU Munich and the University of Michigan propose a technique that extracts scene-scale 3D meshes from off-the-shelf 2D text-to-image models. Their technique employs inpainting and monocular depth estimation to create a scene iteratively: they build the first mesh by generating an image from text and back-projecting it into 3D using a depth estimation method, then repeatedly render the scene from new viewpoints.

Figure 1: Text-based prompts are used to create textured 3D meshes. Using 2D text-to-image models, the method creates textured 3D scenes from a given text prompt. (a) The scene is generated iteratively from several views (marked in blue). (b) The resulting model has appealing textures and geometry. The ceiling is removed in the top-down views to better show the scene layout.

For each new viewpoint, they inpaint any gaps in the rendered images before fusing the generated content into the mesh (Fig. 1a). Two key design factors of this iterative approach are how the views are selected and how the generated scene content is merged with the existing geometry. They first choose viewpoints from predefined trajectories that cover a large portion of the scene, and then select viewpoints adaptively to fill in any remaining gaps. To produce seamless transitions when merging generated content into the mesh, they align the two depth maps and remove any regions of the model with distorted textures.
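As a hedged illustration of the back-projection step mentioned above (not the authors' code), a predicted depth map can be lifted into camera-space 3D points with a standard pinhole model; the intrinsics below are assumed values.

import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (H, W) to 3D points: X = (u - cx) * d / fx, Y = (v - cy) * d / fy, Z = d."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)   # (H, W, 3) camera-space points

# Illustrative: a 256x256 predicted depth map and assumed intrinsics.
depth = np.random.uniform(1.0, 5.0, size=(256, 256))
points = backproject(depth, fx=200.0, fy=200.0, cx=128.0, cy=128.0)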

Combined, these choices provide sizable, scene-scale 3D models (Fig. 1b) that can depict a variety of rooms and have appealing materials and uniform geometry. So, their contributions are as follows: 

• A technique that uses 2D text-to-image models and monocular depth estimation to lift frames into 3D in iterative scene creation. 

• A method that creates 3D meshes of room-scale interior scenes with beautiful textures and geometry from any text input. They can produce seamless, undistorted geometry and textures using their suggested depth alignment and mesh fusion methods. 

• A two-stage customized perspective selection that samples camera poses from ideal angles to first lay out the furnishings and layout of the area and then fill in any gaps to provide a watertight mesh.

Check out the Paper, Project, and Github. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 16k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

The post Meet Text2Room: A New AI Method For Generating Room-Scale Textured 3D Meshes From A Given Text Prompt As Input appeared first on MarkTechPost.

Meet LongLLaMA: A Large Language Model Capable of Handling Long Contex …

Researchers have made significant advancements in various fields using language models. However, effectively incorporating extensive new knowledge into these models remains a challenge. Fine-tuning, the common practice, is resource-intensive and complex to manage, and it does not always provide a straightforward way to incorporate new knowledge. The researchers propose a promising alternative called the Focused Transformer (FOT) to address this.

The FOT technique aims to overcome the challenge of limited context length in language models. As the number of documents increases, the ratio of relevant to irrelevant tokens diminishes, leading to overlaps between keys related to irrelevant and relevant values. This issue is referred to as the distraction issue. The FOT allows a subset of attention layers to access an external memory of (key, value) pairs using the k-nearest neighbors (kNN) algorithm. This mechanism effectively extends the context length and helps address the distraction issue.
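To make the mechanism concrete, here is a minimal, illustrative sketch (not the authors' implementation) of a single-head memory lookup: each query retrieves its k nearest memory keys by inner product and then attends jointly over the local and retrieved (key, value) pairs.

import torch
import torch.nn.functional as F

def knn_memory_attention(q, local_k, local_v, mem_k, mem_v, k=8):
    """Attend over local context plus the k nearest (key, value) pairs from external memory."""
    scores = q @ mem_k.T                          # (n_q, n_mem) similarities to memory keys
    top = scores.topk(k, dim=-1).indices          # k nearest memory entries per query
    k_sel, v_sel = mem_k[top], mem_v[top]         # (n_q, k, d)
    keys = torch.cat([local_k.unsqueeze(0).expand(q.shape[0], -1, -1), k_sel], dim=1)
    vals = torch.cat([local_v.unsqueeze(0).expand(q.shape[0], -1, -1), v_sel], dim=1)
    attn = F.softmax(torch.einsum("qd,qnd->qn", q, keys) / keys.shape[-1] ** 0.5, dim=-1)
    return torch.einsum("qn,qnd->qd", attn, vals)

# Illustrative shapes: 4 queries, 16 local tokens, a memory of 1000 entries, width 32.
q, lk, lv = torch.randn(4, 32), torch.randn(16, 32), torch.randn(16, 32)
mk, mv = torch.randn(1000, 32), torch.randn(1000, 32)
out = knn_memory_attention(q, lk, lv, mk, mv)     # shape (4, 32)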

The training procedure of the Focused Transformer draws from contrastive learning. During training, the memory attention layers are exposed to both relevant and irrelevant keys, resembling negative samples from unrelated documents. This approach encourages the model to differentiate between keys connected to semantically diverse values, enhancing their structure.

The researchers introduce LONGLLAMAs, OpenLLaMA models fine-tuned with FOT. The method does not require long contexts during training and can be applied to existing models. LONGLLAMAs significantly improve performance on tasks requiring long-context modeling, such as passkey retrieval.

The research contributions include identifying the distraction issue as a significant challenge to scaling up context length in Transformer models, developing the Focused Transformer (FOT) to address this issue, and providing a simple implementation method that allows existing models to be augmented with memory without modifying their architecture. The resulting models, LONGLLAMAs, exhibit enhancements in tasks that benefit from increasing the number of few-shot demonstrations in the extended context. The FOT’s capabilities are further analyzed across various datasets and model sizes, demonstrating improvements in perplexity over baselines in long-context language modeling tasks.

In summary, the Focused Transformer (FOT) technique addresses the distraction issue and allows context length extension in language models. Training the model to differentiate between relevant and irrelevant keys enhances the structure and significantly improves tasks requiring long-context modeling. The FOT method can be applied to existing models without architectural modifications, making it a cost-effective solution for augmenting models with memory.

Check out the Paper and GitHub link. Don’t forget to join our 26k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com

Check Out 100’s AI Tools in AI Tools Club

The post Meet LongLLaMA: A Large Language Model Capable of Handling Long Contexts of 256k Tokens appeared first on MarkTechPost.

UC Berkeley And MIT Researchers Propose A Policy Gradient Algorithm Ca …

Researchers have made notable strides in training diffusion models using reinforcement learning (RL) to enhance prompt-image alignment and optimize various objectives. Introducing denoising diffusion policy optimization (DDPO), which treats denoising diffusion as a multi-step decision-making problem, enables fine-tuning Stable Diffusion on challenging downstream objectives.
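As a schematic, hedged sketch of the underlying idea (placeholder names, not the authors' code): each denoising trajectory is treated as a sequence of actions, and a REINFORCE-style loss weights the log-probabilities of the sampled steps by the reward earned by the final image.

import torch

def ddpo_style_loss(step_log_probs, rewards):
    """Schematic policy-gradient loss over full denoising chains.
    step_log_probs: (batch, T) log-probs of the sampled denoising steps.
    rewards: (batch,) scalar reward of each final image (e.g., from a scorer)."""
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # simple baseline
    # Push up the log-probs of chains whose final images earned high reward.
    return -(advantages.unsqueeze(1) * step_log_probs).sum(dim=1).mean()

# Illustrative: 4 sampled images, 50 denoising steps each.
logp = torch.randn(4, 50, requires_grad=True)
reward = torch.tensor([0.2, 0.9, 0.1, 0.7])
ddpo_style_loss(logp, reward).backward()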

By directly training diffusion models on RL-based objectives, the researchers demonstrate significant improvements in prompt-image alignment and optimizing objectives that are difficult to express through traditional prompting methods. DDPO presents a class of policy gradient algorithms designed for this purpose. To improve prompt-image alignment, the research team incorporates feedback from a large vision-language model known as LLaVA. By leveraging RL training, they achieved remarkable progress in aligning prompts with generated images. Notably, the models shift towards a more cartoon-like style, potentially influenced by the prevalence of such representations in the pretraining data.

The results obtained using DDPO for various reward functions are promising. Evaluations on objectives such as compressibility, incompressibility, and aesthetic quality show notable improvements over the base model. The researchers also highlight the generalization capabilities of the RL-trained models, which extend to unseen animals, everyday objects, and novel combinations of activities and objects. While RL training brings substantial benefits, the researchers note the potential challenge of over-optimization: fine-tuning on learned reward functions can lead models to exploit the reward in unhelpful ways, often destroying meaningful image content.

Additionally, the researchers observe that the LLaVA model is susceptible to typographic attacks: RL-trained models can generate text that loosely resembles the correct number of animals, fooling LLaVA in prompt-based alignment scenarios.

In summary, introducing DDPO and using RL training for diffusion models represent significant progress in improving prompt-image alignment and optimizing diverse objectives. The results showcase advancements in compressibility, incompressibility, and aesthetic quality. However, challenges such as reward over-optimization and vulnerabilities in prompt-based alignment methods warrant further investigation. These findings open up new opportunities for research and development in diffusion models, particularly in image generation and completion tasks.

Check out the Paper, Project, and GitHub Link. Don’t forget to join our 26k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com

Check Out 100’s AI Tools in AI Tools Club

The post UC Berkeley And MIT Researchers Propose A Policy Gradient Algorithm Called Denoising Diffusion Policy Optimization (DDPO) That Can Optimize A Diffusion Model For Downstream Tasks Using Only A Black-Box Reward Function appeared first on MarkTechPost.