Designing resilient cities at Arup using Amazon SageMaker geospatial capabilities

This post is co-authored with Richard Alexander and Mark Hallows from Arup.
Arup is a global collective of designers, consultants, and experts dedicated to sustainable development. Data underpins Arup's consultancy for clients, with world-class collection and analysis providing the insight needed to make an impact.
The solution presented here helps direct decision-making processes for resilient city design. Informing design decisions towards more sustainable choices reduces the overall urban heat island (UHI) effect and improves quality-of-life metrics for air quality, water quality, urban acoustics, biodiversity, and thermal comfort. Identifying key areas within an urban environment for intervention allows Arup to provide the best guidance in the industry and create a better quality of life for citizens around the planet.
Urban heat islands describe the effect urban areas have on temperature compared to surrounding rural environments. Understanding how UHI affects our cities leads to improved designs that reduce the impact of urban heat on residents. The UHI effect impacts human health, greenhouse gas emissions, and water quality, and leads to increased energy usage. For city authorities, asset owners, and developers, understanding the impact on the population is key to improving quality of life and natural ecosystems. Modeling UHI accurately is a complex challenge, which Arup is now solving with earth observation data and Amazon SageMaker.
This post shows how Arup partnered with AWS to perform earth observation analysis with Amazon SageMaker geospatial capabilities to unlock UHI insights from satellite imagery. SageMaker geospatial capabilities make it easy for data scientists and machine learning (ML) engineers to build, train, and deploy models using geospatial data. SageMaker geospatial capabilities allow you to efficiently transform and enrich large-scale geospatial datasets, accelerate product development and time to insight with pre-trained ML models, and explore model predictions and geospatial data on an interactive map using 3D accelerated graphics and built-in visualization tools.
Overview of solution
The initial solution focuses on London, where during a heatwave in the summer of 2022, the UK Health Security Agency estimated that 2,803 excess deaths were caused by heat. Identifying areas within an urban environment where people may be more vulnerable to the UHI effect allows public services to direct assistance where it will have the greatest impact. This can even be forecast prior to high-temperature events, reducing the impact of extreme weather and delivering a positive outcome for city dwellers.
Earth Observation (EO) data was used to perform the analysis at city scale. However, the total data volume poses challenges for traditional ways of storing, organizing, and querying data over large geographical areas. Arup addressed this challenge by partnering with AWS and using SageMaker geospatial capabilities to enable analysis at city scale and beyond. As the geographic area grows to larger metropolitan areas such as Los Angeles or Tokyo, more storage and compute are required for the analysis. The elasticity of AWS infrastructure is ideal for UHI analyses of urban environments of any size.
The solution: UHeat
Arup used SageMaker to develop UHeat, a digital solution that analyzes huge areas of cities to identify particular buildings, structures, and materials that are causing temperatures to rise. UHeat uses a combination of satellite imagery and open-source climate data to perform the analysis.
A small team at Arup undertook the initial analysis, during which additional data scientists needed to be trained on the SageMaker tooling and workflows. Onboarding data scientists to a new project used to take weeks using in-house tools. This now takes a matter of hours with SageMaker.
The first step of any EO analysis is the collection and preparation of the data. With SageMaker, Arup can access data from a catalog of geospatial data providers, including Sentinel-2 data, which was used for the London analysis. Built-in geospatial dataset access saves weeks of effort otherwise lost to collecting and preparing data from various data providers and vendors. EO imagery is frequently made up of small tiles which, to cover an area the size of London, need to be combined. This is known as a geomosaic, which can be created automatically using the managed geospatial operations in a SageMaker Geomosaic Earth Observation job.
After the EO data for the area of interest is compiled, the key influencing parameters for the analysis can be extracted. For UHI, Arup needed to be able to derive data on parameters for building geometry, building materials, anthropogenic heat sources, and coverage of existing and planned green spaces. Using optical imagery such as Sentinel-2, land cover classes including buildings, roads, water, vegetation cover, bare ground, and the albedo (measure of reflectiveness) of each of these surfaces can be calculated.
Calculating the values from the different bands in the satellite imagery allows them to be used as inputs into the SUEWS model, which provides a rigorous way of calculating UHI effect. The results of SUEWS are then visualized, in this case with Arup’s existing geospatial data platform. By adjusting values such as the albedo of a specific location, Arup are able to test the effect of mitigation strategies. Albedo performance can be further refined in simulations by modeling different construction materials, cladding, or roofing. Arup found that in one area of London, increasing albedo from 0.1 to 0.9 could decrease ambient temperature by 1.1°C during peak conditions. Over larger areas of interest, this modeling can also be used to forecast the UHI effect alongside climate projections to quantify the scale of the UHI effect.
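The following sketch illustrates the kind of albedo sensitivity sweep described above. It is a simplification for illustration only: run_suews is a hypothetical wrapper around a SUEWS simulation for a single location, and the parameter names are placeholders rather than Arup's actual tooling.

# Sweep surface albedo for one location and compare peak modeled air temperature.
# run_suews() is a hypothetical SUEWS wrapper returning a modeled temperature series;
# parameter names are illustrative only.
def albedo_sensitivity(run_suews, base_params, albedos=(0.1, 0.3, 0.5, 0.7, 0.9)):
    peak_temps = {}
    for albedo in albedos:
        params = {**base_params, "albedo": albedo}
        temperature_series = run_suews(params)
        peak_temps[albedo] = max(temperature_series)
    baseline = peak_temps[albedos[0]]
    # Report the change in peak temperature relative to the lowest-albedo scenario
    return {albedo: round(temp - baseline, 2) for albedo, temp in peak_temps.items()}
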
With historical data from sources such as Sentinel-2, temporal studies can be completed. This enables Arup to visualize the UHI effect during periods of interest, such as the London summer 2022 heatwave. The Urban Heat Snapshot research Arup has completed reveals how the UHI effect is pushing up temperatures in cities like London, Madrid, Mumbai, and Los Angeles.
Collecting data for an area of interest
SageMaker eliminates the complexities in manually collecting data for Earth Observation jobs (EOJs) by providing a catalog of geospatial data providers. As of this writing, USGS Landsat, Sentinel-1, Copernicus DEM, NAIP: National Agriculture Imagery Program, and Sentinel-2 data is available directly from the catalog. You can also bring your own Planet Labs data when imagery at a higher resolution and frequency is required. Built-in geospatial dataset access saves weeks of effort otherwise lost to collecting data from various data providers and vendors. Coordinates for the polygon area of interest need to be provided as well as the time range for when EO imagery was collected.
Arup’s next step was to combine these images into a larger single raster covering the entire area of interest. This is known as mosaicking and is supported by passing GeoMosaicConfig to the SageMaker StartEarthObservationJob API.
We have provided some code samples representative of the steps Arup took:

input_config = {
    "AreaOfInterest": {
        "AreaOfInterestGeometry": {
            "PolygonGeometry": {
                "Coordinates": [
                    [
                        [-0.10813482652250173, 51.52037502928192],
                        [-0.10813482652250173, 51.50403627237003],
                        [-0.0789364331937179, 51.50403627237003],
                        [-0.0789364331937179, 51.52037502928192],
                        [-0.10813482652250173, 51.52037502928192]
                    ]
                ]
            }
        }
    },
    "TimeRangeFilter": {
        "StartTime": "2020-01-01T00:00:00",
        "EndTime": "2023-01-01T00:00:00"
    },
    "PropertyFilters": {
        "Properties": [
            {
                "Property": {
                    "EoCloudCover": {
                        "LowerBound": 0,
                        "UpperBound": 1
                    }
                }
            }
        ],
        "LogicalOperator": "AND"
    },
    "RasterDataCollectionArn": "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8"
}

eoj_config = {
    "JobConfig": {
        "CloudRemovalConfig": {
            "AlgorithmName": "INTERPOLATION",
            "InterpolationValue": "-9999",
            "TargetBands": ["red", "green", "blue", "nir", "swir16"],
        },
    }
}

# Invoke the EOJ; this runs in the background for several minutes
eoj = sm_geo_client.start_earth_observation_job(
    Name="London-Observation-Job",
    ExecutionRoleArn=sm_exec_role,
    InputConfig={"RasterDataCollectionQuery": input_config},
    **eoj_config
)
print("EOJ started with... \nName: {} \nID: {}".format(eoj["Name"], eoj["Arn"]))

This can take a while to complete. You can check the status of your jobs like so:

eoj_arn = eoj["Arn"]
job_details = sm_geo_client.get_earth_observation_job(Arn=eoj_arn)
{k: v for k, v in job_details.items() if k in ["Arn", "Status", "DurationInSeconds"]}

# List all jobs in the account
sm_geo_client.list_earth_observation_jobs()["EarthObservationJobSummaries"]
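
If you are orchestrating these steps from a notebook, a simple polling loop can wait for the job to finish before starting the next one. This is a sketch; the status values (IN_PROGRESS, COMPLETED) and the polling interval are assumptions to adjust for your environment:

import time

# Poll the EOJ until it leaves the IN_PROGRESS state
status = "IN_PROGRESS"
while status == "IN_PROGRESS":
    time.sleep(60)
    status = sm_geo_client.get_earth_observation_job(Arn=eoj_arn)["Status"]
    print("Current status: {}".format(status))

if status != "COMPLETED":
    raise RuntimeError("Earth Observation job ended with status {}".format(status))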

Resampling
Next, the raster is resampled to normalize the pixel size across the collected images. You can use ResamplingConfig to achieve this by providing the value of the length of a side of the pixel:

eoj_config = {
    "JobConfig": {
        "ResamplingConfig": {
            "OutputResolution": {
                "UserDefined": {
                    "Value": 20,
                    "Unit": "METERS"
                }
            },
            "AlgorithmName": "NEAR",
        },
    }
}

eojParams = {
    "Name": "Resample",
    "InputConfig": {
        "PreviousEarthObservationJobArn": eoj["Arn"]
    },
    **eoj_config,
    "ExecutionRoleArn": sm_exec_role,
}

eoj = sm_geo_client.start_earth_observation_job(**eojParams)
print("EOJ started with... \nName: {} \nID: {}".format(eoj["Name"], eoj["Arn"]))

Determining coverage
Determining land coverage such as vegetation is possible by applying a normalized difference vegetation index (NDVI). In practice, this can be calculated from the intensity of reflected red and near-infrared light. To apply such a calculation to EO data within SageMaker, the BandMathConfig can be supplied to the StartEarthObservationJob API:

job_config = {
    "BandMathConfig": {
        "CustomIndices": {
            "Operations": [
                {
                    "Name": "NDVI",
                    "Equation": "(nir - red) / (nir + red)"
                }
            ]
        }
    }
}

eojParams = {
    "Name": "Bandmath",
    "InputConfig": {
        "PreviousEarthObservationJobArn": eoj["Arn"]
    },
    "JobConfig": job_config,
    "ExecutionRoleArn": sm_exec_role,
}

eoj = sm_geo_client.start_earth_observation_job(**eojParams)
print("EOJ started with... \nName: {} \nID: {}".format(eoj["Name"], eoj["Arn"]))

We can visualize the result of the band math job output within the SageMaker geospatial capabilities visualization tool. SageMaker geospatial capabilities can help you overlay model predictions on a base map and provide layered visualization to make collaboration easier. The GPU-powered interactive visualizer and Python notebooks provide a seamless way to explore millions of data points in a single window as well as collaborate on the insights and results.

Preparing for visualization
As a final step, Arup prepares the various bands and calculated bands for visualization by combining them into a single GeoTIFF. For band stacking, SageMaker EOJs can be passed the StackConfig object, where the output resolution can be set based on the resolutions of the input images:

job_config = {
    "StackConfig": {
        "OutputResolution": {
            "Predefined": "HIGHEST"
        }
    }
}

eojParams = {
    "Name": "Stack",
    "InputConfig": {
        "PreviousEarthObservationJobArn": "arn:aws:sagemaker-geospatial:us-west-2:951737352731:earth-observation-job/8k2rfir84zb7"
    },
    "JobConfig": job_config,
    "ExecutionRoleArn": sm_exec_role,
}

eoj = sm_geo_client.start_earth_observation_job(**eojParams)
print("EOJ started with... \nName: {} \nID: {}".format(eoj["Name"], eoj["Arn"]))

Finally, the output GeoTIFF can be stored for later use in Amazon Simple Storage Service (Amazon S3) or visualized using SageMaker geospatial capabilities. By storing the output in Amazon S3, Arup can use the analysis in new projects and incorporate the data into new inference jobs. In Arup’s case, they used the processed GeoTIFF in their existing geographic information system visualization tooling to produce visualizations consistent with their product design themes.
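
The following sketch shows one way to export the stacked output to Amazon S3 using the ExportEarthObservationJob API; the bucket name and prefix are placeholders, and the exact parameter shape should be checked against the current API reference:

# Export the stacked GeoTIFF to S3 for downstream visualization and reuse
export_job = sm_geo_client.export_earth_observation_job(
    Arn=eoj["Arn"],
    ExecutionRoleArn=sm_exec_role,
    OutputConfig={
        "S3Data": {"S3Uri": "s3://my-bucket/uhi-analysis/london/"}  # placeholder bucket
    },
)
print("Export started with ARN: {}".format(export_job["Arn"]))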

Conclusion
By utilizing the native functionality of SageMaker, Arup was able to conduct an analysis of UHI effect at city scale, which previously took weeks, in a few hours. This helps Arup enable their own clients to meet their sustainability targets faster and narrows the areas of focus where UHI effect mitigation strategies should be applied, saving precious resources and optimizing mitigation tactics. The analysis can also be integrated into future earth observation tooling as part of larger risk analysis projects, and helps Arup’s customers forecast the effect of UHI in different scenarios.
Companies such as Arup are unlocking sustainability through the cloud with earth observation data. Unlock the possibilities of earth observation data in your sustainability projects by exploring the SageMaker geospatial capabilities on the SageMaker console today. To find out more, refer to Amazon SageMaker geospatial capabilities, or get hands on with a guidance solution.

About the Authors
Richard Alexander is an Associate Geospatial Data Scientist at Arup, based in Bristol. He has a proven track record of building successful teams and leading and delivering earth observation and data science-related projects across multiple environmental sectors.
Mark Hallows is a Remote Sensing Specialist at Arup, based in London. Mark provides expertise in earth observation and geospatial data analysis to a broad range of clients and delivers insights and thought leadership using both traditional machine learning and deep learning techniques.
Thomas Attree is a Senior Solutions Architect at Amazon Web Services based in London. Thomas currently helps customers in the power and utilities industry and applies his passion for sustainability to help customers architect applications for energy efficiency, as well as advise on using cloud technology to empower sustainability projects.
Tamara Herbert is a Senior Application Developer with AWS Professional Services in the UK. She specializes in building modern and scalable applications for a wide variety of customers, currently focusing on those within the public sector. She is actively involved in building solutions and driving conversations that enable organizations to meet their sustainability goals both in and through the cloud.
Anirudh Viswanathan is a Sr. Product Manager, Technical – External Services with the SageMaker geospatial ML team. He holds a Masters in Robotics from Carnegie Mellon University and an MBA from the Wharton School of Business, and is named inventor on over 50 patents. He enjoys long-distance running, visiting art galleries, and Broadway shows.

Can Large Language Models Self-Evaluate for Safety? Meet RAIN: A Novel Inference Method Transforming AI Alignment and Defense Without Finetuning

Pre-trained Large Language Models (LLMs), like GPT-3, have proven to have extraordinary aptitudes for comprehending and replying to questions from humans, helping with coding chores, and more. However, they frequently generate outputs that differ from what people prefer. In the past, researchers have attempted to resolve this problem by gathering information on human preferences and then aligning previously trained models through reinforcement learning or instruction tuning, both of which entail a fine-tuning stage. It is more appealing to align frozen LLMs, ones that have not undergone additional training, without the requirement for additional data.

Recently, a team of researchers has discovered that unaligned LLMs can directly produce replies that match human preferences through a self-improvement process by including self-evaluation and rewind mechanisms. In the interest of AI safety, they have introduced Rewindable Auto-regressive INference (RAIN), a unique inference technique that enables pre-trained LLMs to assess their own generated text and use the evaluation results to direct backward rewinding and forward generation.

RAIN is notable for its ability to run without requiring any further data for model alignment. It does away with the requirement for parameter updates, gradient computation, or training. The model obtains direction on which human preferences to align during the self-evaluation phase through a fixed-template prompt, obviating the requirement to adjust the initial query repeatedly.
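
Conceptually, inference alternates forward generation with self-evaluation and backward rewinding. The sketch below is a drastic simplification of that loop (the actual method performs a search over sets of candidate tokens with aggregated scores); all function names are placeholders:

def rain_style_generate(generate_step, self_evaluate, prompt,
                        max_tokens=256, threshold=0.5, max_rewinds=8):
    """generate_step(prompt, tokens) -> a candidate continuation (list of tokens);
    self_evaluate(prompt, tokens) -> harmlessness score in [0, 1] obtained from a
    fixed-template self-evaluation prompt."""
    tokens, rewinds = [], 0
    while len(tokens) < max_tokens and rewinds < max_rewinds:
        candidate = generate_step(prompt, tokens)                  # forward generation
        if self_evaluate(prompt, tokens + candidate) >= threshold:
            tokens += candidate                                    # accept and continue
            rewinds = 0
        else:
            # Rewind: discard recent tokens and let the next iteration regenerate them
            tokens = tokens[:max(0, len(tokens) - len(candidate))]
            rewinds += 1
    return tokens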

The experimental outcomes, assessed by the GPT-4 model and human assessors, showed how successful RAIN is. For instance, on the HH dataset, RAIN keeps the helpfulness rate constant while dramatically boosting the harmlessness rate of LLaMA 30B compared to vanilla inference, from 82% to 97%. The team has shared that RAIN even established a new baseline for defense by lowering the attack success rate from 94% to 19% when Vicuna 33B is the target of a notable adversarial attack (LLM-ATTACKS).

RAIN offers a number of benefits over currently used methods for aligning Large Language Models (LLMs) – 

Universality: The RAIN approach is adaptable and can be used for a variety of language-generating jobs. It fits in perfectly with the auto-regressive inference paradigm, which is the norm for many LLMs. This means that RAIN is highly customizable and user-friendly and can be quickly integrated into most current LLMs.

Alignment with Frozen Weights: RAIN does not necessitate the upkeep of extra models or the storing of gradient data and computational networks, in contrast to some other alignment strategies like RLHF. The minimum memory overhead produced by this is comparable to that of simple auto-regressive inference. RAIN is a realistic option for aligning LLMs with frozen weights because of its simple implementation and memory-efficient design, eliminating resource-intensive fine-tuning procedures.

Learning-free: RAIN does not rely on any type of labeled or unlabeled data or on human annotations. It doesn't require a lot of information or training because it operates in a learning-free manner. RAIN considerably enhances alignment performance across a range of tasks and makes LLMs more resistant to adversarial prompt attacks. It significantly lowers the attack success rate when evaluated against a well-known adversarial attack method, demonstrating its potency as a defense against such attacks.

In conclusion, this study has introduced RAIN as a technique for adjusting LLMs to human preferences without the need for additional information or laborious fine-tuning. This is accomplished by allowing LLMs to assess and enhance their own outputs, ultimately resulting in more coordinated and secure AI-generated responses.

Check out the Paper. All credit for this research goes to the researchers on this project.

Researchers from China Introduce ImageBind-LLM: A Multi-Modality Instruction Tuning Method of Large Language Models (LLMs) via ImageBind

Researchers have recently seen significant improvements in large language models' (LLMs) instruction tuning. ChatGPT and GPT-4 are general-purpose conversational systems that obey human commands in language and visuals. However, they remain irreproducible because they are closed source. Alpaca, LLaMA-Adapter, and related efforts propose to modify the publicly accessible LLaMA into language instruction models using self-generated data in response to this. LLaVA, LLaMA-Adapter, and others integrate visual understanding capabilities into LLMs for image-conditioned generation to accomplish image instruction tuning.

Despite the success of current instruction tuning techniques, more is needed to create an LLM for broad multimodality instructions, such as text, picture, audio, 3D point clouds, and video. The authors of this study from Shanghai Artificial Intelligence Laboratory, CUHK MMLab and vivo AI Lab introduce the ImageBind-LLM multimodality instruction-following model, which effectively fine-tunes LLaMA under the direction of the joint embedding space in the pre-trained ImageBind. As shown in Figure 1, their ImageBind-LLM (b) can respond to input instructions of numerous modalities in addition to pictures, distinct from earlier visual instruction models (a), demonstrating promising extensibility and generalization capacity.

Specifically, they propose using only vision-language data for multimodality instruction tuning, taking advantage of ImageBind's image-aligned multimodality embedding space. For an image-caption pair, they first extract the global image feature using ImageBind's frozen image encoder and then transform the embedding using a learnable bind network. The converted image feature is subsequently applied to all transformer-layer word tokens in LLaMA, creating the visual context for generating the appropriate textual caption. In contrast to the zero-initialized attention in the LLaMA-Adapter series, their visual injection mechanism is simple and weighted by a trainable zero-initialized gating factor.
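
The zero-initialized gated injection can be sketched as follows. This is a schematic simplification under assumed dimensions, not the released implementation; in the actual model the injection is applied within LLaMA's transformer layers via the learnable bind network:

import torch
import torch.nn as nn

class GatedVisualInjection(nn.Module):
    """Add a transformed image feature to every word token, scaled by a gating
    factor that starts at zero so training begins from unmodified LLaMA behavior."""
    def __init__(self, image_dim=1024, llama_dim=4096):
        super().__init__()
        self.bind_net = nn.Sequential(              # simplified learnable bind network
            nn.Linear(image_dim, llama_dim),
            nn.SiLU(),
            nn.Linear(llama_dim, llama_dim),
        )
        self.gate = nn.Parameter(torch.zeros(1))    # trainable zero-initialized gate

    def forward(self, word_tokens, image_feat):
        # word_tokens: (batch, seq_len, llama_dim); image_feat: (batch, image_dim)
        visual = self.bind_net(image_feat).unsqueeze(1)   # (batch, 1, llama_dim)
        return word_tokens + self.gate * visual           # broadcast over all tokens

tokens = torch.randn(2, 16, 4096)
image_feature = torch.randn(2, 1024)
print(GatedVisualInjection()(tokens, image_feature).shape)  # torch.Size([2, 16, 4096])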

In this effective way, as the training progresses, the instruction cues of ImageBind’s multimodality embeddings may be gradually introduced into LLaMA without interfering with the original language understanding. Using ImageBind for modality-specific encodings, such as text, picture, audio, and video, their ImageBind-LLM acquires the competence to obey instructions of diverse modalities after the basic vision-language training. They use the pre-trained 3D encoder in Point-Bind to encode the input 3D point clouds for instructions in 3D domains. They also provide a training-free visual cache approach for embedding augmentation during inference to address the modality gap between image training and text, audio, 3D, or video-conditioned production. 

Figure 1: Comparison of ImageBind-LLM with earlier visual instruction models. ImageBind-LLM performs universal multi-modality instruction tuning for image, text, audio, video, and 3D, in contrast to earlier efforts [1-3] that are conditioned exclusively on the image modality.

The cache model comprises millions of image features from the training datasets extracted by ImageBind; it enhances text/audio/3D/video embeddings by retrieving comparable visual features (as in Tip-Adapter). As a result, verbal replies to multimodal instructions are of greater quality. They test ImageBind-LLM's multimodality instruction-following capabilities in various circumstances and consistently find it to perform better.

Overall, their ImageBind-LLM demonstrates the four qualities listed below.

• Instructions with many modes. ImageBind-LLM is optimized to respond to general multimodality inputs, such as image, text, audio, 3D point clouds, and video, and their embedding-space arithmetic represented by ImageBind and Point-Bind. This is different from earlier language and image instruction models. 

• Efficiency Tuning. During training, they freeze ImageBind’s image encoder and adjust partial weights in LLaMA using parameter-efficient approaches like LoRA and bias-norm tuning. They also train the zero-initialized gating factors and the extra bind network. 

• Zero-initialized Injection without Attention. They employ a learnable gating method for progressive knowledge injection, which is more straightforward and efficient, and incorporate the multimodality requirements with all word tokens of LLaMA directly instead of introducing additional instruction signals through attention layers. 

• Retrieval from a cross-modal cache. They offer a visual cache model from image features extracted by ImageBind, which performs cross-modality retrieval for embedding augmentation to address the modality disparity between training (single picture) and inference (many modalities).

Check out the Paper and GitHub. All credit for this research goes to the researchers on this project.

This AI Paper Introduces Agents: An Open-Source Python Framework for Autonomous Language Agents

In tasks like customer service, consulting, programming, writing, teaching, etc., language agents can reduce human effort and are a potential first step toward artificial general intelligence (AGI). Recent demonstrations of language agents’ potential, including AutoGPT and BabyAGI, have sparked much attention from researchers, developers, and general audiences. 

Even for seasoned developers or researchers, most of these demos or repositories are not conducive to customizing, configuring, and deploying new agents. This restriction results from the fact that these demonstrations are frequently proof-of-concepts that highlight the potential of language agents rather than being more substantial frameworks that can be used to gradually develop and customize language agents. 

Furthermore, studies show that the majority of these open-source projects cover only a tiny percentage of the basic language agent abilities, such as task decomposition, long-term memory, web navigation, tool usage, and multi-agent communication. Additionally, most (if not all) of the language agent frameworks currently in use rely exclusively on a brief task description and entirely on the ability of LLMs to plan and act. Due to high randomness and a lack of consistency across different runs, language agents are difficult to modify and tune, and the user experience is poor.

Researchers from AIWaves Inc., Zhejiang University, and ETH Zürich present AGENTS, an open-source language agent library and framework to support LLM-powered language agents. The goal of AGENTS is to make language agent customization, tuning, and deployment as straightforward as possible—even for non-specialists—while still being easily extensible for programmers and researchers. The library also offers the core capabilities listed below, which combine to make it a flexible platform for language agents:

Long-short-term memory: AGENTS incorporate the memory components, allowing language agents to routinely update a short-term working memory with a scratchpad and store and retrieve long-term memory using VectorDB and semantic search. Users can decide whether to give an agent long-term memory, short-term memory, or both by simply filling up a field in the configuration file. 

Web navigation and the use of tools: The capability of autonomous agents to use external tools and browse the internet is another crucial characteristic. AGENTS supports a few widely used external APIs and offers an abstract class that makes it simple for programmers to incorporate other tools. By classifying web search and navigation as specialized APIs, we also make it possible for agents to browse the internet and gather information. 

Multiple-agent interaction: AGENTS permit customizable multi-agent systems and single-agent capabilities, which might be useful for specific applications like games, social experiments, software development, etc. The “dynamic scheduling” function in AGENTS is one new addition for multi-agent communication. Dynamic scheduling allows establishing a controller agent that serves as a “moderator” and chooses which agent to conduct the next action based on their roles and recent history instead of scheduling the order for the agents to act with hard-coded rules. The possibility exists for more flexible and natural communication between several agents when using dynamic scheduling. By defining the controller’s rule in the configuration file using plain language, developers can quickly alter the controller’s behavior. 

Human-agent interaction is supported by AGENTS in both single-agent and multi-agent scenarios, enabling interaction and communication between one or more humans and language agents.

Controllability: Using a symbolic plan, often known as standard operating procedures (SOPs), AGENTS offer a revolutionary paradigm for developing controllable agents. An SOP is a graph with several states that describes the various circumstances an agent might face while carrying out a task and the rules for transitioning between the states. An SOP in AGENTS is a painstakingly recorded collection of detailed instructions that specify how an agent or group of agents should carry out a specific activity or procedure. This is similar to SOPs in the real world. An LLM can produce SOPs that the user can alter while personalizing and fine-tuning the agent. After deployment, an agent will function by the instructions and standards set forth for each state and dynamically change its present state in response to interactions with the outside world, people, or other agents. With the advent of the symbolic plan, it is now possible to provide fine-grained control over an agent’s behavior, improving its stability and predictability while facilitating tuning and agent optimization.
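
To make the SOP idea concrete, the sketch below represents a small two-stage procedure as a plain Python structure. The field names are illustrative only and do not reproduce the AGENTS configuration schema:

# Illustrative-only SOP: states with instructions and plain-language transition rules
customer_service_sop = {
    "initial_state": "collect_issue",
    "states": {
        "collect_issue": {
            "instruction": "Greet the customer and ask clarifying questions until "
                           "the issue is fully described.",
            "transitions": {"issue_described": "propose_solution"},
        },
        "propose_solution": {
            "instruction": "Suggest a fix based on the described issue.",
            "transitions": {
                "customer_satisfied": "END",
                "customer_unsatisfied": "escalate",
            },
        },
        "escalate": {
            "instruction": "Summarize the conversation and hand off to a human agent.",
            "transitions": {"handoff_complete": "END"},
        },
    },
}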

The team hopes that AGENTS make it easier for researchers to study language agents, developers to create applications utilizing language agents, and non-technical audiences to create and modify unique language agents. 

Check out the Paper and GitHub. All credit for this research goes to the researchers on this project.

Researchers at Stanford Introduce Spellburst: A Large Language Model (LLM) Powered Creative-Coding Environment

While creating stunning digital artworks, generative artists often find themselves grappling with the complexities of coding. Using languages like Processing or AI text-to-image tools, they translate their imaginative visions into intricate lines of code, resulting in mesmerizing visual compositions. However, this process can be time-consuming and frustrating due to the iterative nature of trial and error. While traditional artists can easily adjust with a pencil or a brush, generative artists must navigate through opaque interfaces, leading to creative roadblocks.

Existing solutions attempt to mitigate these challenges, but they often fall short of providing the level of control and flexibility that artists require. Large language models, while helpful for generating initial concepts, struggle to offer fine-grained control over details like textures, colors, and patterns. This is where Spellburst steps in as a groundbreaking tool developed by scholars from Stanford University.

Spellburst leverages the power of the cutting-edge GPT-4 language model to streamline the process of translating artistic ideas into code. It begins with artists inputting an initial prompt, such as “a stained glass image of a beautiful, bright bouquet of roses.” The model then generates the corresponding code to bring that concept to life. However, what sets Spellburst apart is its ability to go beyond the initial generation. If the artist wishes to tweak the flowers’ shades or adjust the stained glass’s appearance, they can utilize dynamic sliders or add specific modification notes like “make the flowers a dark red.” This level of control empowers artists to make nuanced adjustments, ensuring their vision is faithfully realized.

Additionally, Spellburst facilitates the merging of different versions, allowing artists to combine elements from various iterations. For instance, they can instruct the tool to “combine the color of the flowers in version 4 with the shape of the vase in version 9.” This feature opens up a new realm of creative possibilities, enabling artists to experiment with different visual elements seamlessly.

One of the key strengths of Spellburst lies in its ability to transition between prompt-based exploration and code editing. Artists can simply click on the generated image to reveal the underlying code, granting them granular control for fine-tuning. This bridging of the semantic space and the code provides artists with a powerful tool to refine their creations iteratively.

In testing Spellburst, the research team at Stanford University sought feedback from 10 expert creative coders. The response was overwhelmingly positive, with artists reporting that the tool not only expedites the transition from semantic space to code but also encourages exploration and facilitates larger creative leaps. This newfound efficiency could revolutionize the way generative artists approach their craft, potentially leading to a surge in innovative and captivating digital artworks.

While Spellburst showcases immense promise, it is important to acknowledge its limitations. Some prompts may lead to unexpected results or errors, particularly in version mergers. Additionally, the tool’s effectiveness may vary for different artists, and the feedback received from a small sample size may not capture the full spectrum of experiences within the generative artist community.

In conclusion, Spellburst represents a significant leap forward in the realm of generative art. By offering a seamless interface between artistic vision and code execution, it empowers artists to unleash their creativity with unprecedented precision. As the tool prepares for an open-source release later this year, it holds the potential to not only revolutionize the workflows of seasoned creative coders but also serve as an invaluable learning tool for novices venturing into the world of code-driven art. With Spellburst, the future of generative art looks brighter and more accessible than ever before.

Check out the Paper and Reference Article. All credit for this research goes to the researchers on this project.

Meet Würstchen: A Super Fast and Efficient Diffusion Model Whose Text-Conditional Component Works in a Highly Compressed Latent Space of Image

Text-to-image generation is a challenging task in artificial intelligence that involves creating images from textual descriptions. This problem is computationally intensive and comes with substantial training costs. The need for high-quality images further exacerbates these challenges. Researchers have been trying to balance computational efficiency and image fidelity in this domain.

To solve the text-to-image generation problem efficiently, researchers have introduced an innovative solution known as Würstchen. This model stands out in the field by adopting a unique two-stage compression approach. Stage A employs a VQGAN, while Stage B uses a Diffusion Autoencoder. Together, these two stages are referred to as the Decoder. Their primary function is to decode highly compressed images into the pixel space.

What sets Würstchen apart is its exceptional spatial compression capability. While previous models typically achieved compression ratios of 4x to 8x, Würstchen pushes the boundaries by performing a remarkable 42x spatial compression. This groundbreaking achievement is a testament to its novel design, which surpasses the limitations of common methods that often struggle to faithfully reconstruct detailed images after 16x spatial compression.

Würstchen's success can be attributed to its two-stage compression process. In Stage A, the VQGAN plays a crucial role in quantizing the image data into a highly compressed latent space. This initial compression significantly reduces the computational resources required for subsequent stages. In Stage B, the Diffusion Autoencoder further refines this compressed representation and reconstructs the image with remarkable fidelity.

Combining these two stages results in a model that can efficiently generate images from text prompts. This reduces the computational cost of training and enables faster inference. Importantly, Würstchen doesn’t compromise on image quality, making it a compelling choice for various applications.

Additionally, Würstchen introduces Stage C, the Prior, which is trained in the highly compressed latent space. This adds an extra layer of adaptability and efficiency to the model. It allows Würstchen to adapt to new image resolutions quickly, minimizing the computational overhead of fine-tuning for different scenarios. This adaptability makes it a versatile tool for researchers and organizations working with images of varying resolutions.

The reduced training cost of Würstchen is exemplified by the fact that Würstchen v1, trained at 512×512 resolution, required only 9,000 GPU hours, a fraction of the 150,000 GPU hours needed for Stable Diffusion 1.4 at the same resolution. This substantial cost reduction benefits researchers in their experimentation and makes it more accessible for organizations to harness the power of such models.

In conclusion, Würstchen offers a groundbreaking solution to the longstanding challenges of text-to-image generation. Its innovative two-stage compression approach and its remarkable spatial compression ratio set a new standard for efficiency in this domain. With reduced training costs and rapid adaptability to varying image resolutions, Würstchen emerges as a valuable tool that accelerates research and application development in text-to-image generation.

Check out the Paper, Demo, Documentation, and Blog. All credit for this research goes to the researchers on this project.

What's the Connection Between Transformers and Support Vector Machines? Unveiling the Implicit Bias and Optimization Geometry in Transformer Architectures

Natural language processing (NLP) has been revolutionized by self-attention, the transformer design's key element, which allows the model to recognize intricate connections within input sequences. Self-attention gives various parts of the input sequence different amounts of priority by evaluating each token's relevance to the others. The mechanism has proven to be very good at capturing long-range relationships, which is important for reinforcement learning, computer vision, and NLP applications. Self-attention mechanisms and transformers have achieved remarkable success, clearing the path for creating complex language models like GPT-4, Bard, LLaMA, and ChatGPT.

Can the implicit bias of transformers and their optimization landscape be described? How does the attention layer choose and combine tokens when trained with gradient descent? Researchers from the University of Pennsylvania, the University of California, the University of British Columbia, and the University of Michigan address these questions by carefully tying the attention layer's optimization geometry to the (Att-SVM) hard max-margin SVM problem, which separates and chooses the best tokens from each input sequence. Experiments show that this formalism, which builds on previous work, is practically significant and illuminates the nuances of self-attention.

Theorem 1

Throughout, they investigate the fundamental cross-attention and self-attention models using input sequences X, Z ∈ ℝ^(T×d) with length T and embedding dimension d. Here, the trainable key, query, and value matrices are K, Q ∈ ℝ^(d×m) and V ∈ ℝ^(d×v), respectively. S(·) stands for the softmax nonlinearity, which is applied row-wise to XQK^⊤X^⊤. By setting Z ← X, it can be seen that self-attention (1b) is a special case of cross-attention (1a). To reveal their major findings, consider using the initial token of Z, represented by z, for prediction.

Specifically, they address empirical risk minimization with a decreasing loss function ℓ(·): ℝ → ℝ. Given a training dataset (Y_i, X_i, z_i), i = 1, …, n, with labels Y_i ∈ {−1, 1} and inputs X_i ∈ ℝ^(T×d) and z_i ∈ ℝ^d, they evaluate the objective in (2). The prediction head, denoted by h(·), includes the value weights V. In this formulation, an MLP follows the attention layer in the model f(·), which accurately depicts a one-layer transformer. Self-attention is recovered in (2) by setting z_i ← x_{i1}, where x_{i1} denotes the first token of the sequence X_i. Due to its nonlinear character, the softmax operation presents a considerable hurdle for optimizing (2).
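
The referenced equations can be reconstructed from this description; a hedged LaTeX rendering (the paper's exact normalization and argument ordering may differ) is:

% Cross-attention (1a) and self-attention (1b); S is the row-wise softmax
f_{\mathrm{cross}}(X, Z) = S\!\left(Z Q K^\top X^\top\right) X V, \qquad
f_{\mathrm{self}}(X) = S\!\left(X Q K^\top X^\top\right) X V.

% Empirical risk minimization (2) over the attention weights, with the value
% weights V absorbed into the prediction head h(\cdot)
\min_{K,\,Q} \; \frac{1}{n} \sum_{i=1}^{n}
\ell\!\Big( Y_i \cdot h\big( S\!\left(z_i^\top Q K^\top X_i^\top\right) X_i \big) \Big).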

Theorem 2

The issue is nonconvex and nonlinear, even when the prediction head is fixed and linear. This work optimizes the attention weights (K, Q, or W) to overcome these difficulties and establish a basic SVM equivalence. 

The following are the paper’s key contributions: 

• The attention layer's implicit bias. Optimizing the attention parameters (K, Q) with diminishing regularisation converges in the direction of a max-margin solution of (Att-SVM) with the nuclear norm objective of the combined parameter W := KQ⊤ (Thm 2). When cross-attention is explicitly parameterized by the combined parameter W, the regularisation path (RP) directionally converges to the (Att-SVM) solution with the Frobenius norm objective. To their knowledge, this is the first study that formally compares the optimization dynamics of (K, Q) parameterizations to those of (W) parameterizations, highlighting the latter's low-rank bias. Theorem 11 and SAtt-SVM in the appendix describe how their theory easily extends to sequence-to-sequence or causal classification contexts and clearly defines the optimality of chosen tokens.

• Gradient descent convergence. With the proper initialization and a linear head h(), the gradient descent iterations for the combined key-query variable W converge in the direction of an Att-SVM solution that is locally optimum. Selected tokens must perform better than their surrounding tokens for local optimality. Locally optimum rules are defined in the following problem geometry, although they are not always unique. They significantly contribute by identifying the geometric parameters that ensure convergence to the globally optimal direction. These include (i) the ability to differentiate ideal tokens based on their scores or (ii) the alignment of the initial gradient direction with optimal tokens. Beyond these, they demonstrate how over-parameterization (i.e., dimension d being large and equivalent conditions) promotes global convergence by guaranteeing (Att-SVM) feasibility and (benign) optimization landscape, which means there are no stationary points and no fictitious locally optimal directions.

• The SVM equivalence's generality. When optimized with a linear head h(·), the attention layer is intrinsically biased towards selecting one token from each sequence (often known as hard attention). As a result of the output tokens being convex combinations of the input tokens, this is mirrored in (Att-SVM).

They demonstrate, however, that nonlinear heads necessitate composing several tokens, underscoring the significance of these components to the dynamics of the transformer. They conclude their theory by suggesting a more general SVM equivalence. Surprisingly, they show that their hypothesis correctly predicts the implicit bias of attention trained by gradient descent under broad conditions not covered by their theory (for example, h(·) being an MLP). Their general equations specifically dissociate attention weights into two components: a finite component determining the precise composition of the selected tokens by modifying the softmax probabilities, and a directional component governed by SVM that selects the tokens by applying a 0-1 mask.

The fact that these results can be mathematically verified and applied to any dataset (whenever SVM is practical) is a key aspect of them. Through insightful experiments, they comprehensively confirm the max-margin equivalence and implicit bias of transformers. They believe that these results contribute to our knowledge of transformers as hierarchical max-margin token selection processes, and they anticipate that their findings will provide a solid basis for future research on the optimization and generalization dynamics of transformers. 

Check out the Paper. All credit for this research goes to the researchers on this project.

This AI Research Introduces AstroLLaMA: A 7B Parameter Model Fine-Tuned from LLaMA-2 Using Over 300K Astronomy Abstracts From ArXiv

The arrival of Large Language Models (LLMs) has attracted attention from many fields because of several important factors coming together. These factors include the availability of huge amounts of data, improvements in computer power, and breakthroughs in the design of neural networks. Prominent models like GPT-4, PaLM, and LLaMA have shown that they can do many different tasks really well. These tasks often use methods like giving them prompts, fine-tuning their abilities, and getting feedback from humans to help them learn and improve. The astronomy discipline presents both a unique challenge and a fertile ground for the application of LLMs.

In the figure above, each model is prompted with the same short text snippet, highlighted in its respective box. GPT-4 tends to produce more generic statements, lacking domain-specific nuance. AstroLLaMA demonstrates the most robust completion, offering more relevant concepts and deeper insights specific to the field of astronomy, thus significantly outperforming LLaMA-2 and GPT-4.

However, AstroLLaMA does have some limitations that need to be acknowledged. One significant limitation is the model’s lack of knowledge in specific areas of astronomy, where AstroLLaMA’s ability to estimate potential star candidates from Gaia-ESO data is notably inaccurate. To address these issues, researchers are currently working on enhancing AstroLLaMA’s training dataset. Instead of just using abstracts, researchers plan to incorporate the complete LaTeX sources of existing astronomy articles. This expansion will substantially increase the number of tokens the model can learn from.

AstroLLaMA serves as an impressive prototype for specialized Large Language Models (LLMs) designed for astronomy. It exhibits remarkable context-aware abilities, outperforming GPT-4 even though it has significantly fewer parameters. This advancement not only opens doors for enhanced performance in various tasks like answering questions, summarising scientific content, and generating hypotheses but also has implications for multi-modal models.

Check out the Paper. All credit for this research goes to the researchers on this project.

Researchers from MIT and Microsoft Introduce DoLa: A Novel AI Decoding Strategy Aimed at Reducing Hallucinations in LLMs

Numerous natural language processing (NLP) applications have benefited greatly from using large language models (LLMs). While LLMs have improved in performance and gained additional capabilities due to being scaled, they still have a problem with “hallucinating” or producing information inconsistent with the real-world facts detected during pre-training. This represents a significant barrier to adoption for high-stakes applications (such as those found in clinical and legal settings), where the generation of trustworthy text is essential.

The maximum likelihood language modeling target, which seeks to minimize the forward KL divergence between the data and model distributions, may be to blame for LMs’ hallucinations. However, this is far from certain. The LM may assign a non-zero probability to phrases that are not fully consistent with the knowledge encoded in the training data if this goal is pursued.

From the perspective of the interpretability of the model, studies have shown that the earlier layers of transformer LMs encode “lower level” information (such as part-of-speech tags). In contrast, the later levels encode more “semantic” information. 

A group of researchers at MIT and Microsoft suggest using this modular encoding of knowledge to increase the LM's factuality via a contrastive decoding strategy, where the likelihood of the next word is calculated using the difference in logits between a higher (mature) layer and a lower (premature) layer. By prioritizing information from deeper layers and downplaying that from intermediate or shallower ones, it is possible to make LMs more grounded in reality and cut down on hallucinations.

Their recent work introduces Decoding by Contrasting Layers (DoLa), a novel decoding approach. The proposed method is based on improving the exposure of factual knowledge encoded in an LLM without retrieving external knowledge or doing further fine-tuning. 
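
A minimal sketch of the layer-contrast idea follows. It is not the paper's full method (which adds dynamic premature-layer selection and an adaptive plausibility constraint), and it assumes a Hugging Face LLaMA-style model whose language-model head can be applied to intermediate hidden states; the model ID and layer index are placeholders:

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # placeholder model ID
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

def contrasted_next_token_logits(input_ids, premature_layer=16):
    with torch.no_grad():
        hidden = model(input_ids).hidden_states          # embeddings + one entry per layer
    # Final entry is the (already normalized) last-layer state in HF's LLaMA implementation
    mature = model.lm_head(hidden[-1][:, -1])
    premature = model.lm_head(model.model.norm(hidden[premature_layer][:, -1]))
    # Contrast the two next-token distributions in log space
    return F.log_softmax(mature, dim=-1) - F.log_softmax(premature, dim=-1)

ids = tok("The capital of France is", return_tensors="pt").input_ids
print(tok.decode(contrasted_next_token_logits(ids).argmax(dim=-1)))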

DoLa has been shown experimentally to improve the truthfulness of LLaMA family models on both TruthfulQA and FACTOR. Additional experiments on chain-of-thought reasoning for StrategyQA and GSM8K demonstrate its potential to improve factual reasoning. Finally, experimental results on open-ended text generation (evaluated with GPT-4) reveal that DoLa can generate informative and significantly more factual responses that lead to superior ratings compared to the original decoding approach. DoLa is a decoding approach that can be used to increase the honesty of LLMs, and findings show that it adds only a small amount of time to the decoding process.

The researchers did not investigate the model’s performance in other domains, such as following instructions or picking up on human feedback. In addition, rather than leveraging human labels or factual information sources for fine-tuning, the team relies on preexisting architecture and parameters, restricting the scope of possible enhancements. Unlike certain retrieval-augmented LMs, this technique depends entirely on the model’s preexisting knowledge rather than adding new information through external retrieval modules. The team hopes future work incorporates the components above with their decoding technique to help overcome the restrictions.

Check out the Paper and GitHub. All credit for this research goes to the researchers on this project.

How is AI Revolutionizing Audiobook Production? Creating Thousands of High-Quality Audiobooks from E-books with Neural Text-to-Speech Technology

Nowadays, many people listen to audiobooks instead of reading books or other media. Audiobooks not only let current readers enjoy content while on the road, but they may also help make content accessible to groups including children, the visually impaired, and anyone learning a new language. Traditional audiobook production techniques, such as professional human narration or volunteer-driven initiatives like LibriVox, take time and money and can result in varying recording quality. Due to these issues, keeping up with the rising number of published books takes time and effort.

However, automatic audiobook creation has historically suffered due to the robotic nature of text-to-speech systems and the difficulty of deciding what text should not be read aloud (such as tables of contents, page numbers, figures, and footnotes). The researchers provide a method for overcoming these difficulties by creating high-quality audiobooks from various online e-book collections. Their approach incorporates recent developments in neural text-to-speech, expressive reading, scalable computation, and automated recognition of pertinent content to produce thousands of natural-sounding audiobooks.

They contribute over 5,000 audiobooks worth of speech, totaling over 35,000 hours, as open source. They also provide demonstration software that enables conference participants to make their own audiobooks by reading any book from the library aloud in their own voices, using only a brief sample of sound. This work introduces a scalable method for converting HTML-based e-books to excellent audiobooks. SynapseML, a scalable machine learning platform that enables distributed orchestration of the whole audiobook generation process, is the foundation of their pipeline. Their pipeline starts with thousands of free e-books provided by Project Gutenberg. They deal mostly with the HTML format of these e-books since, of all the available formats for these publications, it lends itself best to automated parsing.

As a result, the researchers could organize and visualize the complete collection of Project Gutenberg HTML pages and identify many sizable groups of similarly structured files. The major classes of e-books were transformed into a standard format that could be automatically processed using a rule-based HTML normalizer created from these collections of HTML files. Thanks to this approach, they developed a system that could swiftly and deterministically parse a huge number of books. Most significantly, it allowed them to focus on the files that would result in high-quality recordings when read.

Figure 1: t-SNE Clustered ebook representation. Clusters of books with the same format are shown by colored regions.

The results of this approach for clustering are shown in Figure 1, which illustrates how various groups of similarly organized electronic books spontaneously arise in the Project Gutenberg collection. After processing, a plain text stream may be extracted and fed into text-to-speech algorithms. There are many reading techniques required for various audiobooks. A clear, objective voice is best for nonfiction, whereas an expressive reading and a little “acting” are better for fiction with dialogue. However, in their live demonstration, they will provide customers the option to alter the text’s voice, pace, pitch, and intonation. For the bulk of the books, they utilize a clear and neutral neural text-to-speech voice. 

They use zero-shot text-to-speech techniques to effectively transfer voice features from a small number of enrolled recordings and duplicate a user's voice. By doing this, a user may rapidly produce an audiobook in their own voice, utilizing just a small amount of captured audio. They employ an automated speaker and emotion inference system to dynamically alter the reading voice and tone based on context, producing an emotional reading of the text. This enhances the lifelikeness and interest of sequences with several characters and dynamic interaction.

To do this, they first divide the text into narrative and conversation, designating a different speaker for each line of dialogue. Then, self-supervised, they predict each dialogue’s emotional tone. Finally, they use the multi-style and contextual-based neural text-to-speech model introduced to assign distinct voices and emotions to the narrator and the character conversations. They think this approach might significantly increase the availability and accessibility of audiobooks.
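
A highly simplified sketch of the narration/dialogue split described above is shown below; the production system uses trained models for speaker attribution and emotion prediction, whereas this illustration relies only on quotation marks:

import re

def split_narration_and_dialogue(text):
    """Return (kind, segment) pairs, where kind is 'dialogue' for quoted spans
    and 'narration' for everything else. A toy heuristic for illustration."""
    segments = []
    for i, chunk in enumerate(re.split(r'"([^"]*)"', text)):
        if chunk.strip():
            kind = "dialogue" if i % 2 == 1 else "narration"
            segments.append((kind, chunk.strip()))
    return segments

sample = 'The door creaked open. "Who is there?" asked Maria. Nobody answered.'
for kind, segment in split_narration_and_dialogue(sample):
    # In the real pipeline, dialogue would be routed to a character voice with a
    # predicted emotion, and narration to a neutral narrator voice.
    print("{:9s} -> {}".format(kind, segment))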

Check out the Paper. All credit for this research goes to the researchers on this project.

Learn how to build and deploy tool-using LLM agents using AWS SageMake …

Large language model (LLM) agents are programs that extend the capabilities of standalone LLMs with 1) access to external tools (APIs, functions, webhooks, plugins, and so on), and 2) the ability to plan and execute tasks in a self-directed fashion. Often, LLMs need to interact with other software, databases, or APIs to accomplish complex tasks. For example, an administrative chatbot that schedules meetings would require access to employees’ calendars and email. With access to tools, LLM agents can become more powerful—at the cost of additional complexity.
In this post, we introduce LLM agents and demonstrate how to build and deploy an e-commerce LLM agent using Amazon SageMaker JumpStart and AWS Lambda. The agent will use tools to provide new capabilities, such as answering questions about returns (“Is my return rtn001 processed?”) and providing updates about orders (“Could you tell me if order 123456 has shipped?”). These new capabilities require LLMs to fetch data from multiple data sources (orders, returns) and perform retrieval augmented generation (RAG).
To power the LLM agent, we use a Flan-UL2 model deployed as a SageMaker endpoint and use data retrieval tools built with AWS Lambda. The agent can subsequently be integrated with Amazon Lex and used as a chatbot inside websites or Amazon Connect. We conclude the post with items to consider before deploying LLM agents to production. For a fully managed experience for building LLM agents, AWS also provides the Agents for Amazon Bedrock feature (in preview).
A brief overview of LLM agent architectures
LLM agents are programs that use LLMs to decide when and how to use tools as necessary to complete complex tasks. With tools and task planning abilities, LLM agents can interact with outside systems and overcome traditional limitations of LLMs, such as knowledge cutoffs, hallucinations, and imprecise calculations. Tools can take a variety of forms, such as API calls, Python functions, or webhook-based plugins. For example, an LLM can use a “retrieval plugin” to fetch relevant context and perform RAG.
So what does it mean for an LLM to pick tools and plan tasks? There are numerous approaches (such as ReAct, MRKL, Toolformer, HuggingGPT, and Transformer Agents) to using LLMs with tools, and advancements are happening rapidly. But one simple way is to prompt an LLM with a list of tools and ask it to determine 1) if a tool is needed to satisfy the user query, and if so, 2) select the appropriate tool. Such a prompt typically looks like the following example and may include few-shot examples to improve the LLM’s reliability in picking the right tool.

'''
Your task is to select a tool to answer a user question. You have access to the following tools.

search: search for an answer in FAQs
order: order items
noop: no tool is needed

{few shot examples}

Question: {input}
Tool:
'''

More complex approaches involve using a specialized LLM that can directly decode “API calls” or “tool use,” such as GorillaLLM. Such fine-tuned LLMs are trained on API specification datasets to recognize and predict API calls based on instructions. Often, these LLMs require some metadata about available tools (descriptions, YAML, or JSON schema for their input parameters) in order to output tool invocations. This approach is taken by Agents for Amazon Bedrock and OpenAI function calling. Note that LLMs generally need to be sufficiently large and complex in order to show tool selection ability.

Assuming task planning and tool selection mechanisms are chosen, a typical LLM agent program works in the following sequence:

User request – The program takes a user input such as “Where is my order 123456?” from some client application.
Plan next action(s) and select tool(s) to use – Next, the program uses a prompt to have the LLM generate the next action, for example, “Look up the orders table using OrdersAPI.” The LLM is prompted to suggest a tool name such as OrdersAPI from a predefined list of available tools and their descriptions. Alternatively, the LLM could be instructed to directly generate an API call with input parameters such as OrdersAPI(12345).

Note that the next action may or may not involve using a tool or API. If not, the LLM would respond to user input without incorporating additional context from tools or simply return a canned response such as, “I cannot answer this question.”

Parse tool request – Next, we need to parse out and validate the tool/action prediction suggested by the LLM. Validation is needed to ensure tool names, APIs, and request parameters aren’t hallucinated and that the tools are properly invoked according to specification. This parsing may require a separate LLM call.
Invoke tool – Once valid tool name(s) and parameter(s) are ensured, we invoke the tool. This could be an HTTP request, function call, and so on.
Parse output – The response from the tool may need additional processing. For example, an API call may result in a long JSON response, where only a subset of fields are of interest to the LLM. Extracting information in a clean, standardized format can help the LLM interpret the result more reliably.
Interpret output – Given the output from the tool, the LLM is prompted again to make sense of it and decide whether it can generate the final answer back to the user or whether additional actions are required.
Terminate or continue to step 2 – Either return a final answer or a default answer in the case of errors or timeouts.

Different agent frameworks execute the previous program flow differently. For example, ReAct combines tool selection and final answer generation into a single prompt, as opposed to using separate prompts for tool selection and answer generation. Also, this logic can be run in a single pass or run in a while statement (the “agent loop”), which terminates when the final answer is generated, an exception is thrown, or timeout occurs. What remains constant is that agents use the LLM as the centerpiece to orchestrate planning and tool invocations until the task terminates. Next, we show how to implement a simple agent loop using AWS services.
Solution overview
For this blog post, we implement an e-commerce support LLM agent that provides two functionalities powered by tools:

Return status retrieval tool – Answer questions about the status of returns such as, “What is happening to my return rtn001?”
Order status retrieval tool – Track the status of orders such as, “What’s the status of my order 123456?”

The agent effectively uses the LLM as a query router. Given a query (“What is the status of order 123456?”), the agent selects the appropriate retrieval tool to query across multiple data sources (that is, returns and orders). We accomplish query routing by having the LLM pick among multiple retrieval tools, which are responsible for interacting with a data source and fetching context. This extends the simple RAG pattern, which assumes a single data source.
Both retrieval tools are Lambda functions that take an id (orderId or returnId) as input, fetch a JSON object from the data source, and convert the JSON into a human-friendly representation string that is suitable for use by the LLM. The data source in a real-world scenario could be a highly scalable NoSQL database such as DynamoDB, but this solution employs a simple Python dict with sample data for demo purposes.
Additional functionalities can be added to the agent by adding retrieval tools and modifying prompts accordingly. The agent can be tested as a standalone service that integrates with any UI over HTTP, which can be done easily with Amazon Lex. A minimal sketch of what such a retrieval tool could look like follows.
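The post doesn't reproduce the tool Lambda code itself; the following is a minimal sketch of what the orders retrieval tool could look like, assuming a handler that receives an {"orderId": ...} payload (as the tool dispatcher shown later sends) and returns a status code plus a human-friendly summary string. The in-memory ORDERS dict, field names, and response shape are illustrative assumptions, not the exact function deployed by the stack.

# Sample in-memory "data source" standing in for a real database such as DynamoDB
ORDERS = {
    "123456": {"item": "Herbal Handsoap", "status": "shipped"},
}

def lambda_handler(event, context):
    order_id = str(event.get("orderId", "")).strip()
    order = ORDERS.get(order_id)
    if order is None:
        # Fail gracefully for unknown ids
        return {"statusCode": 404, "body": "Order not found. Please check your Order ID."}
    # Convert the JSON record into a human-friendly string the LLM can reason over
    summary = f"Order {order_id}: item '{order['item']}', status '{order['status']}'."
    return {"statusCode": 200, "body": summary}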

Here are some additional details about the key components:

LLM inference endpoint – The core of an agent program is an LLM. We will use the SageMaker JumpStart foundation model hub to easily deploy the Flan-UL2 model. SageMaker JumpStart makes it easy to deploy LLM inference endpoints to dedicated SageMaker instances.
Agent orchestrator – The agent orchestrator coordinates the interactions among the LLM, tools, and the client app. For our solution, we use an AWS Lambda function to drive this flow and employ the following helper functions.

Task (tool) planner – The task planner uses the LLM to suggest one of 1) returns inquiry, 2) order inquiry, or 3) no tool. We use prompt engineering only and the Flan-UL2 model as-is, without fine-tuning.
Tool parser – The tool parser ensures that the tool suggestion from the task planner is valid. Notably, we ensure that a single orderId or returnId can be parsed; otherwise, we respond with a default message. (A minimal sketch of this helper follows the list.)
Tool dispatcher – The tool dispatcher invokes tools (Lambda functions) using the valid parameters.
Output parser – The output parser cleans and extracts relevant items from JSON into a human-readable string. This task is done both by each retrieval tool and within the orchestrator.
Output interpreter – The output interpreter's responsibility is to 1) interpret the output from tool invocation and 2) determine whether the user request can be satisfied or additional steps are needed. If the latter, a final response is generated separately and returned to the user.
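Here is a minimal, hypothetical sketch of the tool parser helper referenced above. It assumes the task planner returns a plain tool name (matching the tool-selection prompt shown later) and that a single order or return id can be pulled from the user input with a simple regular expression; the actual helper deployed by the stack may differ.

import re

VALID_TOOLS = {"returns_inquiry", "order_inquiry"}

def tool_parser(tool_prediction, user_input):
    """Validate the LLM's tool suggestion and extract a single id from the user input."""
    tool_name = tool_prediction.strip().lower()
    if tool_name not in VALID_TOOLS:
        # "no_tool" or an unrecognized name: nothing to dispatch, so the
        # orchestrator falls back to its default response
        return None, None

    # Expect exactly one id-like token (for example, "123456" or "rtn001") in the input
    matches = re.findall(r"\w*\d+\w*", user_input)
    if len(matches) != 1:
        raise ValueError("Please provide exactly one order or return id.")
    return tool_name, matches[0]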

Now, let’s dive a bit deeper into the key components: agent orchestrator, task planner, and tool dispatcher.
Agent orchestrator
Below is an abbreviated version of the agent loop inside the agent orchestrator Lambda function. The loop uses helper functions such as task_planner and tool_parser to modularize the tasks. The loop is designed to run at most two times to prevent the LLM from getting stuck in a loop for unnecessarily long.

#.. imports ..
MAX_LOOP_COUNT = 2  # stop the agent loop after up to 2 iterations
# ... helper function definitions ...

def agent_handler(event):
    user_input = event["query"]
    print(f"user input: {user_input}")

    final_generation = ""
    is_task_complete = False
    loop_count = 0

    # start of agent loop
    while not is_task_complete and loop_count < MAX_LOOP_COUNT:
        tool_prediction = task_planner(user_input)
        print(f"tool_prediction: {tool_prediction}")

        tool_name, tool_input, tool_output, error_msg = None, None, "", ""

        try:
            tool_name, tool_input = tool_parser(tool_prediction, user_input)
            print(f"tool name: {tool_name}")
            print(f"tool input: {tool_input}")
        except Exception as e:
            error_msg = str(e)
            print(f"tool parse error: {error_msg}")

        if tool_name is not None:  # if a valid tool is selected and parsed
            raw_tool_output = tool_dispatch(tool_name, tool_input)
            tool_status, tool_output = output_parser(raw_tool_output)
            print(f"tool status: {tool_status}")

            if tool_status == 200:
                is_task_complete, final_generation = output_interpreter(user_input, tool_output)
            else:
                final_generation = tool_output
        else:  # if no valid tool was selected and parsed, either return the default msg or error msg
            final_generation = DEFAULT_RESPONSES.NO_TOOL_FEEDBACK if error_msg == "" else error_msg

        loop_count += 1

    return {
        'statusCode': 200,
        'body': final_generation
    }

Task planner (tool prediction)
The agent orchestrator uses the task planner to predict a retrieval tool based on user input. For our LLM agent, we simply use prompt engineering and few-shot prompting to teach the LLM this task in context. More sophisticated agents could use a fine-tuned LLM for tool prediction, which is beyond the scope of this post. The prompt is as follows:

tool_selection_prompt_template = """
Your task is to select appropriate tools to satisfy the user input. If no tool is required, then pick "no_tool"

Tools available are:

returns_inquiry: Database of information about a specific return's status, whether it's pending, processed, etc.
order_inquiry: Information about a specific order's status, such as shipping status, product, amount, etc.
no_tool: No tool is needed to answer the user input.

You can suggest multiple tools, separated by a comma.

Examples:
user: "What are your business hours?"
tool: no_tool

user: "Has order 12345 shipped?"
tool: order_inquiry

user: "Has return ret812 processed?"
tool: returns_inquiry

user: "How many days do I have until returning orders?"
tool: returns_inquiry

user: "What was the order total for order 38745?"
tool: order_inquiry

user: "Can I return my order 38756 based on store policy?"
tool: order_inquiry

user: "Hi"
tool: no_tool

user: "Are you an AI?"
tool: no_tool

user: "How's the weather?"
tool: no_tool

user: "What is the refund status of order 12347?"
tool: order_inquiry

user: "What is the refund status of return ret172?"
tool: returns_inquiry

user input: {}
tool:
"""
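The task_planner helper itself isn't listed in the post; the following is a minimal sketch of how it could fill this template and query the Flan-UL2 endpoint. The endpoint name, the payload shape ({"text_inputs": ...} returning "generated_texts", a common format for SageMaker JumpStart text2text models), and the max_length value are assumptions; the code deployed by the stack may differ.

import json
import boto3

sm_runtime = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = "sm-jumpstart-flan-bot-endpoint"  # assumed name of the endpoint created by the stack

def task_planner(user_input):
    """Fill the tool-selection prompt and ask the LLM to name a tool."""
    prompt = tool_selection_prompt_template.format(user_input)
    payload = {"text_inputs": prompt, "max_length": 20}  # a tool name is short
    response = sm_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    result = json.loads(response["Body"].read())
    # JumpStart text2text models typically return {"generated_texts": ["..."]}
    return result["generated_texts"][0].strip()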

Tool dispatcher
The tool dispatch mechanism works via if/else logic to call the appropriate Lambda function depending on the tool's name. The following is the tool_dispatch helper function's implementation. It's used inside the agent loop and returns the raw response from the tool Lambda function, which is then cleaned by the output_parser function.

def tool_dispatch(tool_name, tool_input):
    # ...
    tool_response = None

    if tool_name == "returns_inquiry":
        tool_response = lambda_client.invoke(
            FunctionName=RETURNS_DB_TOOL_LAMBDA,
            InvocationType="RequestResponse",
            Payload=json.dumps({
                "returnId": tool_input
            })
        )
    elif tool_name == "order_inquiry":
        tool_response = lambda_client.invoke(
            FunctionName=ORDERS_DB_TOOL_LAMBDA,
            InvocationType="RequestResponse",
            Payload=json.dumps({
                "orderId": tool_input
            })
        )
    else:
        raise ValueError("Invalid tool invocation")

    return tool_response
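The raw response returned by lambda_client.invoke contains a StatusCode field and a streaming Payload; the output_parser referenced in the agent loop must turn that into the (status, text) pair the orchestrator checks. Here is a minimal sketch, assuming the tool Lambdas return {"statusCode": ..., "body": "..."}; the actual helper in the stack may differ.

import json

def output_parser(raw_tool_output):
    """Extract a status code and a human-readable string from a Lambda invoke response."""
    invoke_status = raw_tool_output.get("StatusCode", 500)
    payload = json.loads(raw_tool_output["Payload"].read())
    # Assumes the tool Lambda returns {"statusCode": ..., "body": "..."}
    tool_status = payload.get("statusCode", invoke_status)
    tool_output = payload.get("body", "")
    return tool_status, tool_output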

Deploy the solution
Important prerequisites – To get started with the deployment, you need to fulfill the following prerequisites:

Access to the AWS Management Console via a user who can launch AWS CloudFormation stacks
Familiarity with navigating the AWS Lambda and Amazon Lex consoles
Flan-UL2 requires a single ml.g5.12xlarge for deployment, which may necessitate increasing resource limits via a support ticket. In our example, we use us-east-1 as the Region, so please make sure to increase the service quota (if needed) in us-east-1.

Deploy using CloudFormation – You can deploy the solution to us-east-1 by launching the provided AWS CloudFormation stack.
Deploying the solution takes about 20 minutes and creates an LLMAgentStack stack, which:

deploys the SageMaker endpoint using the Flan-UL2 model from SageMaker JumpStart;
deploys three Lambda functions: LLMAgentOrchestrator, LLMAgentReturnsTool, and LLMAgentOrdersTool; and
deploys an Amazon Lex bot that can be used to test the agent: Sagemaker-Jumpstart-Flan-LLM-Agent-Fallback-Bot.

Test the solution
The stack deploys an Amazon Lex bot with the name Sagemaker-Jumpstart-Flan-LLM-Agent-Fallback-Bot. The bot can be used to test the agent end to end. There is an additional comprehensive guide for testing Amazon Lex bots with a Lambda integration and how the integration works at a high level. In short, the Amazon Lex bot is a resource that provides a quick UI to chat with the LLM agent running inside the Lambda function that we built (LLMAgentOrchestrator).
The sample test cases to consider are as follows:

Valid order inquiry (for example, “Which item was ordered for 123456?”)

Order “123456” is a valid order, so we should expect a reasonable answer (e.g. “Herbal Handsoap”)

Valid return inquiry for a return (for example, “When is my return rtn003 processed?”)

We should expect a reasonable answer about the return’s status.

Irrelevant to both returns or orders (for example, “How is the weather in Scotland right now?”)

An irrelevant question to returns or orders, thus a default answer should be returned (“Sorry, I cannot answer that question.”)

Invalid order inquiry (for example, “Which item was ordered for 383833?”)

The id 383833 does not exist in the orders dataset, and hence we should fail gracefully (for example, “Order not found. Please check your Order ID.”)

Invalid return inquiry (for example, “When is my return rtn123 processed?”)

Similarly, id rtn123 does not exist in the returns dataset, and hence should fail gracefully.

Irrelevant return inquiry (for example, “What is the impact of return rtn001 on world peace?”)

This question, while it references a valid return id, is irrelevant. The LLM is used to filter out questions with irrelevant context.

To run these tests yourself, here are the instructions.

On the Amazon Lex console (AWS Console > Amazon Lex), navigate to the bot entitled Sagemaker-Jumpstart-Flan-LLM-Agent-Fallback-Bot. This bot has already been configured to call the LLMAgentOrchestrator Lambda function whenever the FallbackIntent is triggered.
In the navigation pane, choose Intents.
Choose Build at the top right corner
Wait for the build process to complete. When it's done, you get a success message.
Test the bot by entering the test cases.
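If you prefer to exercise the agent without going through the Lex UI, you can also invoke the orchestrator Lambda directly. The following is a minimal sketch, assuming the LLMAgentOrchestrator function created by the stack and the {"query": ...} event shape used by agent_handler above.

import json
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

response = lambda_client.invoke(
    FunctionName="LLMAgentOrchestrator",
    InvocationType="RequestResponse",
    Payload=json.dumps({"query": "Which item was ordered for 123456?"}),
)
# agent_handler returns {"statusCode": 200, "body": "<final answer>"}
print(json.loads(response["Payload"].read())["body"])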

Cleanup
To avoid additional charges, delete the resources created by our solution by following these steps:

On the AWS CloudFormation console, select the stack named LLMAgentStack (or the custom name you picked).
Choose Delete
Check that the stack is deleted from the CloudFormation console.

Important: double-check that the stack is successfully deleted by ensuring that the Flan-UL2 inference endpoint is removed.

To check, go to the AWS console > Amazon SageMaker > Inference > Endpoints.
The page lists all active endpoints.
Make sure the sm-jumpstart-flan-bot-endpoint endpoint is no longer listed.

Considerations for production
Deploying LLM agents to production requires taking extra steps to ensure reliability, performance, and maintainability. Here are some considerations prior to deploying agents in production:

Selecting the LLM model to power the agent loop: For the solution discussed in this post, we used a Flan-UL2 model without fine-tuning to perform task planning or tool selection. In practice, using an LLM that is fine-tuned to directly output tool or API requests can increase reliability and performance, as well as simplify development. We could fine-tune an LLM on tool selection tasks or use a model that directly decodes tool tokens like Toolformer.

Using fine-tuned models can also simplify adding, removing, and updating the tools available to an agent. With prompt-only approaches, updating tools requires modifying every prompt inside the agent orchestrator, such as those for task planning, tool parsing, and tool dispatch. This can be cumbersome, and performance may degrade if too many tools are provided in context to the LLM.

Reliability and performance: LLM agents can be unreliable, especially for complex tasks that cannot be completed within a few loops. Adding output validations, retries, structuring outputs from LLMs into JSON or yaml, and enforcing timeouts to provide escape hatches for LLMs stuck in loops can enhance reliability.

Conclusion
In this post, we explored how to build an LLM agent that can utilize multiple tools from the ground up, using low-level prompt engineering, AWS Lambda functions, and SageMaker JumpStart as building blocks. We discussed the architecture of LLM agents and the agent loop in detail. The concepts and solution architecture introduced in this blog post may be appropriate for agents that use a small, predefined set of tools. We also discussed several strategies for using agents in production. Agents for Amazon Bedrock, which is in preview, also provides a managed experience for building agents, with native support for agentic tool invocations.

About the Author
John Hwang is a Generative AI Architect at AWS with a special focus on Large Language Model (LLM) applications, vector databases, and generative AI product strategy. He is passionate about helping companies with AI/ML product development, and the future of LLM agents and co-pilots. Prior to joining AWS, he was a Product Manager at Alexa, where he helped bring conversational AI to mobile devices, and a derivatives trader at Morgan Stanley. He holds a B.S. in computer science from Stanford University.

Researchers at Heriot-Watt University and Alana AI Propose FurChat: A …

Large language models (LLMs) have taken center stage in a world where technology is making leaps and bounds. These LLMs are sophisticated computer programs that can understand, generate, and interact with human language in a remarkably natural way. In recent research, an innovative embodied conversational agent known as FurChat has been unveiled. LLMs like GPT-3.5 have pushed the boundaries of what's possible in natural language processing: they can understand context, answer questions, and even generate text that feels like it was written by a human. This powerful capability has opened doors to countless opportunities in domains like robotics.

Researchers at Heriot-Watt University and Alana AI propose FurChat, a system that can function as a receptionist, engage in dynamic conversations, and convey emotions through facial expressions. FurChat's deployment at the National Robotarium exemplifies its potential, facilitating natural conversations with visitors and offering information on facilities, news, research, and upcoming events.

The Furhat robot, a humanoid robotic bust, has a three-dimensional mask that closely resembles a human face and employs a micro projector to project an animated facial expression onto this mask. The robot is mounted on a motorized platform that allows its head to move and nod, enhancing its lifelike interactions. To facilitate communication, Furhat is equipped with a microphone array and speakers, enabling it to recognize and respond to human speech.

The system is designed for seamless operation. Dialogue management involves three main components: natural language understanding (NLU), a dialogue manager (DM), and a custom database. The NLU analyzes incoming text, classifies intents, and assesses confidence. The DM maintains conversational flow, sends prompts to the LLM, and processes responses. The custom database is created by web-scraping the National Robotarium's website, which provides data relevant to user intents. Prompt engineering ensures natural responses from the LLM; it combines few-shot learning and prompting techniques to generate context-aware replies. Gesture parsing leverages the Furhat SDK's facial gestures and the LLM's sentiment recognition from text to synchronize facial expressions with speech, creating an immersive interaction. Amazon Polly, available in FurhatOS, is used for text-to-speech conversion.

In the future, the researchers are gearing up to expand its capabilities. They have their sights set on enabling multiuser interactions, an area of active research in the field of receptionist robots. Furthermore, to tackle the issue posed by hallucinations in language models, they plan to explore strategies such as fine-tuning the language model and experimenting with direct conversation generation, reducing reliance on NLU components. A significant milestone for the researchers is the demonstration of FurChat at the SIGDIAL conference, which will serve as a platform to demonstrate the system's capabilities to a broader audience of peers and experts.

Check out the Paper.

The post Researchers at Heriot-Watt University and Alana AI Propose FurChat: A New Embodied Conversational Agent Based on Large Language Models appeared first on MarkTechPost.

Meet NExT-GPT: An End-to-End General-Purpose Any-to-Any Multimodal Lar …

Multimodal LLMs can enhance human-computer interaction by enabling more natural and intuitive communication between users and AI systems through voice, text, and visual inputs. This can lead to more contextually relevant and comprehensive responses in applications like chatbots, virtual assistants, and content recommendation systems. They are built upon the foundations of traditional unimodal language models, like GPT-3, while incorporating additional capabilities to handle different data types.

However, multimodal LLMs may require a large amount of data to perform well, making them less sample-efficient than other AI models. Aligning data from different modalities during training can also be challenging. In pipeline-style systems that lack overall end-to-end training, errors propagate between modules, so content understanding and multimodal generation capabilities can be very limited. Because the information transfer between different modules is based entirely on discrete texts produced by the LLM, noise and errors are inevitable. Ensuring that the information from each modality is properly synchronized is essential for practical training.

To tackle these issues, researchers at NExT++, the School of Computing at the National University of Singapore (NUS), built NExT-GPT. It is an any-to-any multimodal LLM designed to handle input and output in any combination of text, image, video, and audio modalities. Encoders encode the inputs in the various modalities, which are then projected onto the representation space of the LLM.

Their method uses an existing open-source LLM as the core to process input information. After projection, the produced multimodal signals, together with specific instructions, are routed to different decoders, and content is finally generated in the corresponding modalities. Training such a model from scratch would be costly, so they instead use existing pre-trained high-performance encoders and decoders such as Q-Former, ImageBind, and state-of-the-art latent diffusion models.

They introduced a lightweight alignment learning technique in which LLM-centric alignment on the encoding side and instruction-following alignment on the decoding side require only minimal parameter adjustments for effective semantic alignment. They also introduce modality-switching instruction tuning to give their any-to-any MM-LLM human-level capabilities. This bridges the gap between the feature spaces of different modalities and ensures fluent semantic understanding of diverse inputs during alignment learning for NExT-GPT.

Modality-switching instruction tuning (MosIT) supports complex cross-modal understanding and reasoning and enables sophisticated multimodal content generation. They also constructed a high-quality dataset comprising a wide range of multimodal inputs and outputs, offering the complexity and variability needed to train MM-LLMs to handle diverse user interactions and accurately deliver the desired responses.

Finally, their research showcases the potential of any-to-any MM-LLMs in bridging the gap between various modalities and paving the way for more human-like AI systems in the future.

Check out the Paper and Project Page.

The post Meet NExT-GPT: An End-to-End General-Purpose Any-to-Any Multimodal Large Language Models (MM-LLMs) appeared first on MarkTechPost.

UCI and Harvard Researchers Introduce TalkToModel that Explains Machin …

Machine learning models have become indispensable tools in various professional fields, driving applications in smartphones, software packages, and online services. However, the complexity of these models has rendered their underlying processes and predictions increasingly opaque, even to seasoned computer scientists.

To address this challenge and bolster trust in these advanced computational tools, researchers at the University of California-Irvine and Harvard University have unveiled an innovative solution: TalkToModel, an interactive dialog system aimed at elucidating machine learning models and their predictions for both experts and non-technical users.

Existing attempts at Explainable Artificial Intelligence (XAI) have faced limitations, often leaving room for interpretation in their explanations. TalkToModel bridges this gap by providing users with straightforward and relevant answers to their queries about AI models and their operations. The system comprises three essential components: an adaptive dialog engine, an execution unit, and a conversational interface. The dialog engine interprets natural language input and generates coherent responses. The execution component crafts AI explanations, which are then translated into accessible language for users. The conversational interface serves as the platform through which users interact with the system.

In testing the effectiveness of TalkToModel, professionals and students were invited to provide feedback. The results were encouraging, with the majority of participants finding the system both useful and engaging. Notably, 73% of healthcare workers expressed willingness to use TalkToModel to gain insights into the predictions of AI-based diagnostic tools. Additionally, 85% of machine learning developers found it more user-friendly than other XAI tools.

This promising feedback suggests that TalkToModel could enhance understanding and trust in AI predictions. As this platform continues to evolve, there is potential for it to be released to the wider public, further contributing to the ongoing efforts to demystify AI and bolster confidence in its capabilities. By enabling open-ended conversations with machine learning models, TalkToModel exemplifies a significant step towards making advanced AI systems more accessible and understandable to a broader audience.

Check out the Paper and Reference Article.

The post UCI and Harvard Researchers Introduce TalkToModel that Explains Machine Learning Models to its Users appeared first on MarkTechPost.

Build a classification pipeline with Amazon Comprehend custom classifi …

“Data locked away in text, audio, social media, and other unstructured sources can be a competitive advantage for firms that figure out how to use it”
Only 18% of organizations in a 2019 survey by Deloitte reported being able to take advantage of unstructured data. The majority of data, between 80% and 90%, is unstructured data. That is a big untapped resource that has the potential to give businesses a competitive edge if they can find out how to use it. It can be difficult to find insights from this data, particularly if efforts are needed to classify, tag, or label it. Amazon Comprehend custom classification can be useful in this situation. Amazon Comprehend is a natural-language processing (NLP) service that uses machine learning to uncover valuable insights and connections in text.
Document categorization or classification has significant benefits across business domains –

Improved search and retrieval – By categorizing documents into relevant topics or categories, it makes it much easier for users to search and retrieve the documents they need. They can search within specific categories to narrow down results.
Knowledge management – Categorizing documents in a systematic way helps to organize an organization’s knowledge base. It makes it easier to locate relevant information and see connections between related content.
Streamlined workflows – Automatic document sorting can help streamline many business processes like processing invoices, customer support, or regulatory compliance. Documents can be automatically routed to the right people or workflows.
Cost and time savings – Manual document categorization is tedious, time-consuming, and expensive. AI techniques can take over this mundane task and categorize thousands of documents in a short time at a much lower cost.
Insight generation – Analyzing trends in document categories can provide useful business insights. For example, an increase in customer complaints in a product category could signify some issues that need to be addressed.
Governance and policy enforcement – Setting up document categorization rules helps to ensure that documents are classified correctly according to an organization’s policies and governance standards. This allows for better monitoring and auditing.
Personalized experiences – In contexts like website content, document categorization allows for tailored content to be shown to users based on their interests and preferences as determined from their browsing behavior. This can increase user engagement.

The complexity of developing a bespoke classification machine learning model varies depending on a variety of aspects such as data quality, algorithm, scalability, and domain knowledge, to mention a few. It's essential to start with a clear problem definition and clean, relevant data, and to gradually work through the different stages of model development. However, businesses can create their own machine learning models using Amazon Comprehend custom classification to automatically classify text documents into categories or tags that meet business-specific requirements and map to their business technology and document categories. Because human tagging or categorization is no longer necessary, this can save businesses a lot of time, money, and labor. We have made this process simple by automating the whole training pipeline.
In the first part of this multi-part blog series, you will learn how to create a scalable training pipeline and prepare training data for Amazon Comprehend custom classification models. We will introduce a custom classifier training pipeline that can be deployed in your AWS account with a few clicks. We are using the BBC news dataset and will train a classifier to identify the class (for example, politics or sports) that a document belongs to. The pipeline enables your organization to rapidly respond to changes and train new models without having to start from scratch each time. You can easily scale up and train multiple models based on your demand.
Prerequisites

An active AWS account
Access to Amazon Comprehend, Amazon S3, AWS Lambda, AWS Step Functions, Amazon SNS, and AWS CloudFormation
Training data (semi-structured or text) prepared as described in the following section
Basic knowledge of Python and machine learning in general

Prepare training data
This solution can take input in either text format (for example, CSV) or semi-structured format (for example, PDF).
Text input
Amazon Comprehend custom classification supports two modes: multi-class and multi-label.
In multi-class mode, each document can have one and only one class assigned to it. The training data should be prepared as a two-column CSV file, with each line of the file containing a single class and the text of a document that demonstrates the class.

CLASS, Text of document 1
CLASS, Text of document 2

Example for BBC news dataset:

Business, Europe blames US over weak dollar…
Tech, Cabs collect mountain of mobiles…

In multi-label mode, each document has at least one class assigned to it, but can have more. Training data should be prepared as a two-column CSV file, in which each line of the file contains one or more classes and the text of the training document. More than one class should be indicated by using a delimiter between each class.

CLASS, Text of document 1
CLASS|CLASS|CLASS, Text of document 2

No header should be included in the CSV file for either training mode.
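As a quick illustration, the following Python sketch writes a headerless two-column CSV from a handful of labeled examples using the standard csv module. The example rows and the "|" multi-label delimiter are assumptions for illustration (the delimiter is configurable via the Q06LabelDelimiter parameter described later); a real dataset such as the BBC news dataset would be written the same way.

import csv

# (label(s), document text) pairs; for multi-label rows, join classes with the chosen delimiter
examples = [
    ("Business", "Europe blames US over weak dollar..."),
    ("Tech", "Cabs collect mountain of mobiles..."),
    ("Tech|Business", "Chip maker posts record quarterly revenue..."),  # hypothetical multi-label row
]

with open("comprehend_train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for labels, text in examples:
        writer.writerow([labels, text])  # no header row, as required by Amazon Comprehend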
Semi-structured input
Starting in 2023, Amazon Comprehend supports training models using semi-structured documents. The training data for semi-structured input comprises a set of labeled documents, which can be pre-identified documents from a document repository that you already have access to. The following is an example of the annotations CSV data required for training (sample data):

CLASS, document1.pdf, 1
CLASS, document1.pdf, 2

The annotations CSV file contains three columns: the first column contains the label for the document, the second column is the document name (that is, the file name), and the last column is the page number of the document that you want to include in the training dataset. In most cases, if the annotations CSV file is located in the same folder as all the other documents, you just need to specify the document name in the second column. However, if the CSV file is located in a different location, you need to specify the path to that location in the second column, such as path/to/prefix/document1.pdf.
For details on how to prepare your training data, refer to the Amazon Comprehend documentation.
Solution overview

The Amazon Comprehend training pipeline starts when training data (a .csv file for text input, or an annotations .csv file for semi-structured input) is uploaded to a dedicated Amazon Simple Storage Service (Amazon S3) bucket.
An AWS Lambda function is invoked by an Amazon S3 trigger, so that every time an object is uploaded to the specified Amazon S3 location, the function retrieves the source bucket name and the key name of the uploaded object and passes them to the training Step Functions workflow.
In the training Step Functions workflow, after receiving the training data bucket name and object key name as input parameters, a custom model training workflow kicks off as a series of Lambda functions, as described below:

StartComprehendTraining: This AWS Lambda function defines a ComprehendClassifier object depending on the type of input files (that is, text or semi-structured) and then kicks off an Amazon Comprehend custom classification training task by calling the create_document_classifier Application Programming Interface (API), which returns a training job Amazon Resource Name (ARN). Subsequently, this function checks the status of the training job by invoking the describe_document_classifier API. Finally, it returns the training job ARN and job status as output to the next stage of the training workflow. (A minimal boto3 sketch of these API calls follows this list.)
GetTrainingJobStatus: This AWS Lambda function checks the status of the training job every 15 minutes by calling the describe_document_classifier API, until the training job status changes to Complete or Failed.
GenerateMultiClass or GenerateMultiLabel: If you select yes for the performance report when launching the stack, one of these two AWS Lambda functions runs an analysis of your Amazon Comprehend model outputs, generates a per-class performance analysis, and saves it to Amazon S3.
GenerateMultiClass: This AWS Lambda function is called if your input is MultiClass and you select yes for the performance report.
GenerateMultiLabel: This AWS Lambda function is called if your input is MultiLabel and you select yes for the performance report.
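For illustration, here is a minimal boto3 sketch of the two Amazon Comprehend API calls named above. The classifier name, IAM role ARN, and S3 paths are placeholders, and the deployed Lambda functions wrap these calls with the pipeline's own parameter handling and error checking.

import boto3

comprehend = boto3.client("comprehend")

# Kick off a custom classifier training job (multi-class, plain-text CSV input)
response = comprehend.create_document_classifier(
    DocumentClassifierName="my-custom-classifier",  # placeholder name
    DataAccessRoleArn="arn:aws:iam::111122223333:role/ComprehendDataAccessRole",  # placeholder role
    InputDataConfig={"S3Uri": "s3://my-input-bucket/training-data/train.csv"},  # placeholder path
    LanguageCode="en",
    Mode="MULTI_CLASS",
)
classifier_arn = response["DocumentClassifierArn"]

# Check the training job status (the pipeline polls this every 15 minutes)
status = comprehend.describe_document_classifier(
    DocumentClassifierArn=classifier_arn
)["DocumentClassifierProperties"]["Status"]
print(classifier_arn, status)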

Once the training is done successfully, the solution generates the following outputs:

Custom Classification Model: A trained model ARN will be available in your account for future inference work.
Confusion Matrix [Optional]: A confusion matrix (confusion_matrix.json) will be available in user defined output Amazon S3 path, depending on the user selection.
Amazon Simple Notification Service notification [Optional]: A notification email will be sent about training job status to the subscribers, depending on the initial user selection.

Walkthrough
Launching the solution
To deploy your pipeline, complete the following steps:

Choose the Launch Stack button.

Choose Next

Specify the pipeline details with the options fitting your use case:

Information for each stack detail:

Stack name (Required) – the name you specified for this AWS CloudFormation stack. The name must be unique in the Region in which you’re creating it.
Q01ClassifierInputBucketName (Required) – The Amazon S3 bucket name to store your input data. It should be a globally unique name and AWS CloudFormation stack helps you create the bucket while it’s being launched.
Q02ClassifierOutputBucketName (Required) – The Amazon S3 bucket name to store outputs from Amazon Comprehend and the pipeline. It should also be a globally unique name.
Q03InputFormat – A dropdown selection; choose text (if your training data is in CSV files) or semi-structure (if your training data is semi-structured, for example, PDF files) based on your data input format.
Q04Language – A dropdown selection for choosing the language of the documents from the supported list. Please note that currently only English is supported if your input format is semi-structure.
Q05MultiClass – A dropdown selection, select yes if your input is MultiClass mode. Otherwise, select no.
Q06LabelDelimiter – Only required if your Q05MultiClass answer is no. This delimiter is used in your training data to separate each class.
Q07ValidationDataset – A dropdown selection, change the answer to yes if you want to test the performance of trained classifier with your own test data.
Q08S3ValidationPath – Only required if your Q07ValidationDataset answer is yes.
Q09PerformanceReport – A dropdown selection; select yes if you want to generate the class-level performance report after model training. The report will be saved in the output bucket you specified in Q02ClassifierOutputBucketName.
Q10EmailNotification – A dropdown selection. Select yes if you want to receive a notification after the model is trained.
Q11EmailID – Enter a valid email address for receiving the performance report notification. Please note that you have to confirm the subscription from your email after the AWS CloudFormation stack is launched before you can receive a notification when training is completed.

In the Configure stack options section, add optional tags, permissions, and other advanced settings.

Choose Next
Review the stack details and select I acknowledge that AWS CloudFormation might create AWS IAM resources.

Choose Submit. This initiates pipeline deployment in your AWS account.
After the stack is deployed successfully, you can start using the pipeline. Create a /training-data folder under your specified Amazon S3 input location. Note: Amazon S3 automatically applies server-side encryption (SSE-S3) to each new object unless you specify a different encryption option. Refer to Data protection in Amazon S3 for more details on data protection and encryption in Amazon S3.

Upload your training data to the folder. (If the training data is semi-structured, upload all the PDF files before uploading the .csv label information.)

You're done! You've successfully deployed your pipeline, and you can check the pipeline status in the deployed Step Functions state machine. (You will have a trained model in your Amazon Comprehend custom classification panel.)

If you choose the model and its version in the Amazon Comprehend console, you can see more details about the model you just trained. These include the Mode you selected, which corresponds to the option Q05MultiClass, the number of labels, and the number of training and test documents in your training data. You can also check the overall performance there; however, if you want to check the detailed performance for each class, refer to the performance report generated by the deployed pipeline.
Service quotas
Your AWS account has default quotas for Amazon Comprehend and, if inputs are in semi-structured format, Amazon Textract. To view the service quotas, refer to the quota pages for Amazon Comprehend and Amazon Textract.
Clean up
To avoid incurring ongoing charges, delete the resources you created as part of this solution when you’re done.

On the Amazon S3 console, manually delete the contents inside buckets you created for input and output data.
On the AWS CloudFormation console, choose Stacks in the navigation pane.
Select the main stack and choose Delete.

This automatically deletes the deployed stack.

Your trained Amazon Comprehend custom classification model will remain in your account. If you don't need it anymore, delete the created model in the Amazon Comprehend console.

Conclusion
In this post, we showed you the concept of a scalable training pipeline for Amazon Comprehend custom classification models and provided an automated solution for efficiently training new models. The AWS CloudFormation template provided makes it possible for you to create your own text classification models effortlessly, catering to demand at scale. The solution adopts the recently announced Euclid feature and accepts inputs in text or semi-structured format.
Now, we encourage you, our readers, to test these tools. You can find more details about training data preparation and about the custom classifier metrics. Try it out and see firsthand how it can streamline your model training process and enhance efficiency. Please share your feedback with us!

About the Authors
Sandeep Singh is a Senior Data Scientist with AWS Professional Services. He is passionate about helping customers innovate and achieve their business objectives by developing state-of-the-art AI/ML powered solutions. He is currently focused on Generative AI, LLMs, prompt engineering, and scaling Machine Learning across enterprises. He brings recent AI advancements to create value for customers.
Yanyan Zhang is a Senior Data Scientist in the Energy Delivery team with AWS Professional Services. She is passionate about helping customers solve real problems with AI/ML knowledge. Recently, her focus has been on exploring the potential of Generative AI and LLM. Outside of work, she loves traveling, working out and exploring new things.
Wrick Talukdar is a Senior Architect with the Amazon Comprehend Service team. He works with AWS customers to help them adopt machine learning on a large scale. Outside of work, he enjoys reading and photography.