Intelligently search Drupal content using Amazon Kendra

Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra helps you aggregate content from a variety of content repositories into a centralized index that lets you quickly search all your enterprise data and find the most accurate answer. Drupal is a content management system used to build many of the websites and applications we use every day, with features such as straightforward content authoring, reliable performance, and strong security. Many organizations use Drupal to store their content, and a key requirement for many of these customers is the ability to easily and securely find accurate information across all the documents in the data source.
With the Amazon Kendra Drupal connector, you can index Drupal content, filter the types of custom content you want to index, and easily search through Drupal content using Amazon Kendra intelligent search.
This post shows you how to use the Amazon Kendra Drupal connector to configure the connector as a data source for your Amazon Kendra index and search your Drupal documents. Based on the configuration of the Drupal connector, you can synchronize the connector to crawl and index different types of Drupal content such as blogs and wikis. The connector also ingests the access control list (ACL) information for each file. The ACL information is used for user context filtering, where search results for a query are filtered by what a user has authorized access to.
Prerequisites
To try out the Amazon Kendra connector for Drupal using this post as a reference, you need the following:

An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies and IAM roles for Drupal data sources.
Basic knowledge of AWS and working knowledge of Drupal administration.
Drupal set up with a user with the Administrator role. We will store the administrator user name and password in AWS Secrets Manager.

Configure the data source using the Amazon Kendra connector for Drupal
To add a data source to your Amazon Kendra index using the Drupal connector, you can use an existing index or create a new index. Then complete the following steps. For more information on this topic, refer to the Amazon Kendra Developer Guide.

On the Amazon Kendra console, open your index and choose Data sources in the navigation pane.
Choose Add data source.
Under Drupal, choose Add connector.
In the Specify data source details section, enter a name and description and choose Next.
In the Define access and security section, for Drupal Host URL, enter the Drupal site URL.
To configure the SSL certificates, you can create a self-signed certificate for this setup using the openssl x509 -in mydrupalsite.pem -out drupal.crt command and store the certificate in an Amazon Simple Storage Service (Amazon S3) bucket. For more details on generating a private key and the certificate, refer to Generating Certificates.
Choose Browse S3 and choose the S3 bucket with the SSL certificate.
Under Authentication, you have two options:

Use Secrets Manager to create new Drupal authentication credentials. You need a Drupal admin user name and password (additionally, a client ID and client secret for OAuth 2.0 authentication).
Use an existing Secrets Manager secret that has the Drupal authentication credentials you want the connector to access (additionally, a client ID and client secret for OAuth 2.0 authentication).

Choose Save and add secret.
For IAM role, choose Create a new role or choose an existing IAM role configured with appropriate IAM policies to access the Secrets Manager secret, Amazon Kendra index, and data source.

Refer to IAM roles for data sources for the required permissions for the IAM role.

Choose Next.
In the Configure sync settings section, select Articles, Basic pages, Basic blocks, Custom content types, and Custom Blocks along with options to crawl comments and attachments as needed.
Optionally, enter the include/exclude patterns for the entity titles.
Provide information about your sync scope (full or delta only) and specify the run schedule.
Choose Next.
In the Set field mappings section, add custom Drupal fields you want to sync and their respective Amazon Kendra field mappings. The required fields are pre-mapped by Amazon Kendra.
Choose Next.
Review the configuration settings and save the data source.
Choose Sync now on the created data source to start data synchronization with the Amazon Kendra Index.

The time required to crawl and sync the contents into Amazon Kendra varies based on the volume of content and the throughput.
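
You can also trigger and monitor the sync programmatically. The following is a minimal sketch using the AWS SDK for Python (Boto3); the index ID and data source ID are placeholders that you would replace with your own values.

import boto3

kendra = boto3.client('kendra')

# placeholder IDs: copy these from the Amazon Kendra console or API responses
index_id = '<your-index-id>'
data_source_id = '<your-drupal-data-source-id>'

# start a sync job, then check its status
kendra.start_data_source_sync_job(Id=data_source_id, IndexId=index_id)
jobs = kendra.list_data_source_sync_jobs(Id=data_source_id, IndexId=index_id)
print(jobs['History'][0]['Status'])  # for example, SYNCING or SUCCEEDED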

You can now search the indexed Drupal content using the search console or a search application. Optionally, you can search with ACLs by completing the following additional steps.

Go to the index page that you created and on the User access control tab, choose Edit settings.
Under Access control settings, select Yes, keep the default values for Username and Groups, choose JSON for Token type, and keep the user-group expansion as None.
On the next page, retain the default values (or change them based on your capacity requirements) and choose Update.

Perform intelligent search with Amazon Kendra
Before you try searching on the Amazon Kendra console or using the API, make sure that the data source sync is complete. To check, view the data sources and verify if the last sync was successful.

To start your search, on the Amazon Kendra console, choose Search indexed content in the navigation pane.

You’re redirected to the Amazon Kendra search console. Now you can search information from the Drupal documents you indexed using Amazon Kendra.

For this post, we search for a document stored in the Drupal data source.
Expand Test query with an access token and choose Apply token.
For Username, enter the email address associated with your Drupal account.
Choose Apply.

Now the user can only see the content they have access to, based on the user name or groups specified. In our example, the Drupal user with the test@amazon.com email doesn’t have access to any documents on Drupal, so none are displayed.
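
If you are building your own search application instead of using the search console, you can pass the same user context through the Amazon Kendra Query API. The following is a minimal sketch using Boto3; the index ID and token are placeholders, and the token must match the token configuration (JSON/JWT) of your index.

import boto3

kendra = boto3.client('kendra')

response = kendra.query(
    IndexId='<your-index-id>',  # placeholder
    QueryText='leave policy',
    UserContext={'Token': '<token-for-the-user>'},  # placeholder; must match the index token configuration
)

for item in response['ResultItems']:
    print(item['Type'], item.get('DocumentTitle', {}).get('Text'))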

Limitations
Note the following limitations when using this solution:

Content types (such as article or basic page) that aren’t associated with any view can’t be crawled.
If an administrator doesn’t have access to a block, then you can’t crawl the data from the block.
The document body for articles, basic pages, basic blocks, user-defined content types, and user-defined block types is displayed in HTML format. If the HTML content is not well-formed, the HTML tags appear in the document body and are therefore visible in the Amazon Kendra search results. The same applies to comments on articles, basic pages, basic blocks, user-defined content types, and user-defined block types.
A content type or block type without a description or body will not be ingested into the Amazon Kendra index because of validation on the Amazon Kendra SDK side, even though Drupal allows you to create content types without a description or body. Only the comments and attachments of the respective content types or block types (if they exist) will be ingested into the Amazon Kendra index.

Clean up
To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra connector for Drupal, delete that data source. Delete any IAM users created.
Conclusion
With the Amazon Kendra Drupal connector, your organization can securely search content stored in a Drupal site using intelligent search powered by Amazon Kendra. In this post, we introduced you to the integration, but there are many additional features that we didn’t cover, such as the following:

You can map additional fields to Amazon Kendra index attributes and enable them for faceting, search, and display in the search results
You can integrate the Drupal data source with the Custom Document Enrichment (CDE) capability in Amazon Kendra to perform additional attribute mapping logic and even custom content transformation during ingestion

To learn more about the possibilities with Drupal, refer to the Amazon Kendra Developer Guide.
For more information on other Amazon Kendra built-in connectors for popular data sources, refer to the Amazon Kendra Connectors page.

About the authors
Channa Basavaraja is a Senior Solutions Architect at AWS with over 2 decades of experience building distributed business solutions. His areas of depth span Machine Learning, app/mobile dev, event-driven architecture, and IoT/edge computing.
Yuanhua Wang is a software engineer at AWS with more than 15 years of experience in the technology industry. His interests are software architecture and build tools on cloud computing.

Intuitivo achieves higher throughput while saving on AI/ML costs using …

This is a guest post by Jose Benitez, Founder and Director of AI, and Matias Ponchon, Head of Infrastructure, at Intuitivo.
Intuitivo, a pioneer in retail innovation, is revolutionizing shopping with its cloud-based AI and machine learning (AI/ML) transactional processing system. This technology enables us to operate millions of autonomous points of purchase (A-POPs) concurrently, transforming the way customers shop. Our solution outpaces traditional vending machines and alternatives, offering easy setup, maintenance-free operation, and roughly ten times lower cost, thanks in large part to the performance and cost advantages that AWS Inferentia delivers. Inferentia has enabled us to run our You Only Look Once (YOLO) computer vision models five times faster than our previous solution, supporting seamless, real-time shopping experiences for our customers, and has helped us reduce inference costs by 95 percent compared to our previous solution. In this post, we cover our use case, challenges, and a brief overview of our solution using Inferentia.
The changing retail landscape and need for A-POP
The retail landscape is evolving rapidly, and consumers expect the same easy-to-use and frictionless experiences they are used to when shopping digitally. To effectively bridge the gap between the digital and physical world, and to meet the changing needs and expectations of customers, a transformative approach is required. At Intuitivo, we believe that the future of retail lies in creating highly personalized, AI-powered, and computer vision-driven autonomous points of purchase (A-POP). This technological innovation brings products within arm’s reach of customers. Not only does it put customers’ favorite items at their fingertips, but it also offers them a seamless shopping experience, devoid of long lines or complex transaction processing systems. We’re excited to lead this exciting new era in retail.
With our cutting-edge technology, retailers can quickly and efficiently deploy thousands of A-POPs. Scaling has always been a daunting challenge for retailers, mainly due to the logistic and maintenance complexities associated with expanding traditional vending machines or other solutions. However, our camera-based solution, which eliminates the need for weight sensors, RFID, or other high-cost sensors, requires no maintenance and is significantly cheaper. This enables retailers to efficiently establish thousands of A-POPs, providing customers with an unmatched shopping experience while offering retailers a cost-effective and scalable solution.

Using cloud inference for real-time product identification
While designing a camera-based product recognition and payment system, we had to decide whether processing should happen at the edge or in the cloud. After considering several architectures, we designed a system that uploads videos of the transactions to the cloud for processing.
Our end users start a transaction by scanning the A-POP’s QR code, which triggers the A-POP to unlock and then customers grab what they want and go. Preprocessed videos of these transactions are uploaded to the cloud. Our AI-powered transaction pipeline automatically processes these videos and charges the customer’s account accordingly.
The following diagram shows the architecture of our solution.

Unlocking high-performance and cost-effective inference using AWS Inferentia
As retailers look to scale operations, cost of A-POPs becomes a consideration. At the same time, providing a seamless real-time shopping experience for end-users is paramount. Our AI/ML research team focuses on identifying the best computer vision (CV) models for our system. We were now presented with the challenge of how to simultaneously optimize the AI/ML operations for performance and cost.
We deploy our models on Amazon EC2 Inf1 instances powered by Inferentia, Amazon’s first ML silicon designed to accelerate deep learning inference workloads. Inferentia has been shown to reduce inference costs significantly. We used the AWS Neuron SDK—a set of software tools used with Inferentia—to compile and optimize our models for deployment on EC2 Inf1 instances.
The code snippet that follows shows how to compile a YOLO model with Neuron. The code works seamlessly with PyTorch, and functions such as torch.jit.trace() and torch_neuronx.trace() record the model’s operations on an example input during the forward pass to build a static IR graph.

from ultralytics import YOLO
import torch_neuronx
import torch

batch_size = 1
imgsz = (640, 640)
im = torch.zeros(batch_size, 3, *imgsz).to('cpu')  # mock input

# Compiler options
half = True  # fp16
fp8 = False
dynamic = False  # dynamic batch

f = 'yolov8n.neuronx'  # output model name
neuronx_cc_args = ['--auto-cast', 'none']

if half:
    neuronx_cc_args = ['--auto-cast', 'all', '--auto-cast-type', 'fp16']
elif fp8:
    neuronx_cc_args = ['--auto-cast', 'all', '--auto-cast-type', 'fp8_e4m3']

model = torch.load('yolov8n.pt')['model']
model.eval()
model.float()
model = model.fuse()
neuronx_model = torch_neuronx.trace(
    model,
    example_inputs=im,
    compiler_args=neuronx_cc_args,
)

if dynamic:
    neuronx_model = torch_neuronx.dynamic_batch(neuronx_model)

neuronx_model.save(f)

We migrated our compute-heavy models to Inf1. By using AWS Inferentia, we achieved the throughput and performance to match our business needs. Adopting Inferentia-based Inf1 instances in the MLOps lifecycle was a key to achieving remarkable results:

Performance improvement: Our large computer vision models now run five times faster, achieving over 120 frames per second (FPS), allowing for seamless, real-time shopping experiences for our customers. Furthermore, the ability to process at this frame rate not only enhances transaction speed, but also enables us to feed more information into our models. This increase in data input significantly improves the accuracy of product detection within our models, further boosting the overall efficacy of our shopping systems.
Cost savings: We slashed inference costs by 95 percent compared to our previous solution. This significantly enhanced the architecture design supporting our A-POPs.

Data parallel inference was easy with AWS Neuron SDK
To improve the performance of our inference workloads and extract maximum performance from Inferentia, we wanted to use all available NeuronCores in the Inferentia accelerator. Achieving this was easy with the built-in tools and APIs from the Neuron SDK, specifically the torch.neuron.DataParallel() API. We currently use inf1.2xlarge instances, which have one Inferentia accelerator with four NeuronCores, so we use torch.neuron.DataParallel() to fully utilize the Inferentia hardware and all available NeuronCores. This Python function implements data parallelism at the module level on models created by the PyTorch Neuron API. Data parallelism is a form of parallelization across multiple devices or cores (NeuronCores for Inferentia), referred to as nodes. Each node contains the same model and parameters, but data is distributed across the different nodes. By distributing the data across multiple nodes, data parallelism reduces the total processing time of large batch size inputs compared to sequential processing. Data parallelism works best for latency-sensitive applications with large batch size requirements.
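The following is a minimal sketch of the torch.neuron.DataParallel() usage described above, assuming a model that has already been compiled and saved with the Neuron compiler; the file name, batch size, and input shape are placeholders.

import torch
import torch.neuron  # part of the AWS Neuron SDK for PyTorch on Inf1

# load a model that was previously compiled and saved with the Neuron compiler
# (the file name is a placeholder)
model_neuron = torch.jit.load('yolo_neuron_compiled.pt')

# wrap the compiled model so that inference is distributed across all four
# NeuronCores of the inf1.2xlarge instance's Inferentia accelerator
model_parallel = torch.neuron.DataParallel(model_neuron)

# run a batched inference; the batch is split across NeuronCores and the
# results are gathered back into a single output (shape is a placeholder)
batch = torch.zeros(8, 3, 640, 640)
with torch.no_grad():
    outputs = model_parallel(batch)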
Looking ahead: Accelerating retail transformation with foundation models and scalable deployment
As we venture into the future, the impact of foundation models on the retail industry cannot be overstated. Foundation models can make a significant difference in product labeling. The ability to quickly and accurately identify and categorize different products is crucial in a fast-paced retail environment. With modern transformer-based models, we can deploy a greater diversity of models to serve more of our AI/ML needs with higher accuracy, improving the experience for users and without having to waste time and money training models from scratch. By harnessing the power of foundation models, we can accelerate the process of labeling, enabling retailers to scale their A-POP solutions more rapidly and efficiently.
We have begun implementing Segment Anything Model (SAM), a vision transformer foundation model that can segment any object in any image (we will discuss this further in another blog post). SAM allows us to accelerate our labeling process with unparalleled speed. SAM is very efficient, able to process approximately 62 times more images than a human can manually create bounding boxes for in the same timeframe. SAM’s output is used to train a model that detects segmentation masks in transactions, opening up a window of opportunity for processing millions of images exponentially faster. This significantly reduces training time and cost for product planogram models.

Our product and AI/ML research teams are excited to be at the forefront of this transformation. The ongoing partnership with AWS and our use of Inferentia in our infrastructure will ensure that we can deploy these foundation models cost-effectively. As early adopters, we’re working with the new AWS Inferentia2-based instances. Inf2 instances are built for today’s generative AI and large language model (LLM) inference acceleration, delivering higher performance and lower costs. Inf2 will enable us to empower retailers to harness the benefits of AI-driven technologies without breaking the bank, ultimately making the retail landscape more innovative, efficient, and customer-centric.
As we continue to migrate more models to Inferentia and Inferentia2, including transformers-based foundational models, we are confident that our alliance with AWS will enable us to grow and innovate alongside our trusted cloud provider. Together, we will reshape the future of retail, making it smarter, faster, and more attuned to the ever-evolving needs of consumers.
Conclusion
In this post, we’ve highlighted our transformational journey using AWS Inferentia for our innovative AI/ML transactional processing system. This partnership has led to a five times increase in processing speed and a 95 percent reduction in inference costs compared to our previous solution. It has changed the retail industry’s current approach by facilitating a real-time and seamless shopping experience.
If you’re interested in learning more about how Inferentia can help you save costs while optimizing performance for your inference applications, visit the Amazon EC2 Inf1 instances and Amazon EC2 Inf2 instances product pages. AWS provides various sample codes and getting started resources for Neuron SDK that you can find on the Neuron samples repository.

About the Authors
Matias Ponchon is the Head of Infrastructure at Intuitivo. He specializes in architecting secure and robust applications. His extensive experience at FinTech and blockchain companies, coupled with his strategic mindset, helps him design innovative solutions. Deeply committed to excellence, he consistently delivers resilient solutions that push the boundaries of what’s possible.
Jose Benitez is the Founder and Director of AI at Intuitivo, specializing in the development and implementation of computer vision applications. He leads a talented Machine Learning team, nurturing an environment of innovation, creativity, and cutting-edge technology. In 2022, Jose was recognized as an ‘Innovator Under 35’ by MIT Technology Review, a testament to his groundbreaking contributions to the field. This dedication extends beyond accolades and into every project he undertakes, showcasing a relentless commitment to excellence and innovation.
Diwakar Bansal is an AWS Senior Specialist focused on business development and go-to-market for Gen AI and Machine Learning accelerated computing services. Previously, Diwakar has led product definition, global business development, and marketing of technology products for IoT, Edge Computing, and Autonomous Driving focusing on bringing AI and Machine Learning to these domains.

Empower your business users to extract insights from company documents …

Enterprises seek to harness the potential of Machine Learning (ML) to solve complex problems and improve outcomes. Until recently, building and deploying ML models required deep levels of technical and coding skills, including tuning ML models and maintaining operational pipelines. Since its introduction in 2021, Amazon SageMaker Canvas has enabled business analysts to build, deploy, and use a variety of ML models – including tabular, computer vision, and natural language processing – without writing a line of code. This has accelerated the ability of enterprises to apply ML to use cases such as time-series forecasting, customer churn prediction, sentiment analysis, industrial defect detection, and many others.
As announced on October 5, 2023, SageMaker Canvas expanded its support of models to foundation models (FMs), which are large language models used to generate and summarize content. With the October 12, 2023 release, SageMaker Canvas lets users ask questions and get responses that are grounded in their enterprise data. This ensures that results are context-specific, opening up additional use cases where no-code ML can be applied to solve business problems. For example, business teams can now formulate responses consistent with an organization’s specific vocabulary and tenets, and can more quickly query lengthy documents to get responses that are specific and grounded in the contents of those documents. All of this is performed in a private and secure manner, ensuring that all sensitive data is accessed with proper governance and safeguards.
To get started, a cloud administrator configures and populates Amazon Kendra indexes with enterprise data as data sources for SageMaker Canvas. Canvas users select the index where their documents are, and can ideate, research, and explore knowing that the output will always be backed by their sources of truth. SageMaker Canvas uses state-of-the-art FMs from Amazon Bedrock and Amazon SageMaker JumpStart. Conversations can be started with multiple FMs side by side, comparing the outputs and truly making generative AI accessible to everyone.
In this post, we will review the recently released feature, discuss the architecture, and present a step-by-step guide to enable SageMaker Canvas to query documents from your knowledge base, as shown in the following screen capture.

Solution overview
Foundation models can produce hallucinations – responses that are generic, vague, unrelated, or factually incorrect. Retrieval Augmented Generation (RAG) is a frequently used approach to reduce hallucinations. RAG architectures are used to retrieve data from outside of an FM, which is then used to perform in-context learning to answer the user’s query. This ensures that the FM can use data from a trusted knowledge base and use that knowledge to answer users’ questions, reducing the risk of hallucination.
With RAG, the data external to the FM and used to augment user prompts can come from multiple disparate data sources, such as document repositories, databases, or APIs. The first step is to convert your documents and any user queries into a compatible format to perform relevancy semantic search. To make the formats compatible, a document collection, or knowledge library, and user-submitted queries are converted into numerical representations using embedding models.
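To make this concrete, the following is a small, conceptual sketch of relevancy semantic search with made-up three-dimensional vectors (real embedding models produce hundreds or thousands of dimensions); it is meant only to illustrate how embeddings and a similarity measure rank passages against a query.

import numpy as np

# hypothetical embeddings for three document passages and one user query
# (made-up 3-dimensional vectors purely for illustration)
doc_embeddings = np.array([
    [0.12, 0.85, 0.03],
    [0.91, 0.10, 0.22],
    [0.15, 0.80, 0.05],
])
query_embedding = np.array([0.10, 0.88, 0.02])

def cosine_similarity(a, b):
    # higher values indicate closer semantic meaning
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine_similarity(query_embedding, d) for d in doc_embeddings]
most_relevant_passage = int(np.argmax(scores))  # index of the best-matching passage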
With this release, RAG functionality is provided in a no-code and seamless manner. Enterprises can enrich the chat experience in Canvas with Amazon Kendra as the underlying knowledge management system. The following diagram illustrates the solution architecture.

Connecting SageMaker Canvas to Amazon Kendra requires a one-time setup. We describe the setup process in detail in Setting up Canvas to query documents. If you haven’t already set up your SageMaker Domain, refer to Onboard to Amazon SageMaker Domain.
As part of the domain configuration, a cloud administrator can choose one or more Kendra indices that the business analyst can query when interacting with the FM through SageMaker Canvas.
After the Kendra indices are hydrated and configured, business analysts use them with SageMaker Canvas by starting a new chat and selecting the Query Documents toggle. SageMaker Canvas then manages the underlying communication between Amazon Kendra and the FM of choice to perform the following operations:

Query the Kendra indices with the question coming from the user.
Retrieve the snippets (and the sources) from Kendra indices.
Engineer a prompt that combines the snippets with the original query so that the foundation model can generate an answer from the retrieved documents.
Provide the generated answer to the user, along with references to the pages/documents that were used to formulate the response.
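
SageMaker Canvas performs these steps for you. To illustrate the underlying retrieval-augmented pattern outside of Canvas, here is a minimal sketch using the AWS SDK directly; the index ID, model ID, prompt format, and snippet count are illustrative assumptions, not the Canvas implementation.

import json
import boto3

kendra = boto3.client('kendra')
bedrock = boto3.client('bedrock-runtime')

question = "What is our travel reimbursement policy?"

# 1) retrieve relevant passages from an Amazon Kendra index (placeholder index ID)
retrieved = kendra.retrieve(IndexId='<kendra-index-id>', QueryText=question)
snippets = [item['Content'] for item in retrieved['ResultItems'][:3]]

# 2) engineer a prompt that grounds the model in the retrieved snippets
prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n---\n".join(snippets) +
    f"\n\nQuestion: {question}\nAnswer:"
)

# 3) ask a foundation model on Amazon Bedrock (the model ID and request/response
#    format vary by model; Claude v2 shown as an example)
response = bedrock.invoke_model(
    modelId='anthropic.claude-v2',
    body=json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": 300,
    }),
)
answer = json.loads(response['body'].read())['completion']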

Setting up Canvas to query documents
In this section, we will walk you through the steps to set up Canvas to query documents served through Kendra indexes. You should have the following prerequisites:

SageMaker Domain setup – Onboard to Amazon SageMaker Domain
Create a Kendra index (or more than one)
Set up the Kendra Amazon S3 connector (follow the Amazon S3 connector documentation) and upload PDF files and other documents to the Amazon S3 bucket associated with the Kendra index
Set up IAM so that Canvas has the appropriate permissions, including those required for calling Amazon Bedrock and/or SageMaker endpoints (follow the Set up Canvas Chat documentation)

Now you can update the Domain so that it can access the desired indices. On the SageMaker console, for the given Domain, select Edit under the Domain settings tab. At the Canvas settings step, enable the toggle Enable query documents with Amazon Kendra. Once activated, choose one or more Kendra indices that you want to use with Canvas.

That’s all that’s needed to configure the Canvas Query Documents feature. Users can now open a chat within Canvas and start using the knowledge bases that have been attached to the Domain through the Kendra indexes. The maintainers of the knowledge base can continue to update the source of truth, and thanks to the syncing capability in Kendra, chat users will automatically be able to use the up-to-date information in a seamless manner.
Using the Query Documents feature for chat
As a SageMaker Canvas user, you can access the Query Documents feature from within a chat. To start a chat session, choose (or search for) the Generate, extract and summarize content button on the Ready-to-use models tab in SageMaker Canvas.

Once there, you can turn on and off Query Documents with the toggle at the top of the screen. Check out the information prompt to learn more about the feature.

When Query Documents is enabled, you can choose among a list of Kendra indices enabled by the cloud administrator.

You can select an index when starting a new chat. You can then ask a question in the UI, with knowledge being automatically sourced from the selected index. Note that after a conversation has started against a specific index, it is not possible to switch to another index.

For the questions asked, the chat will show the answer generated by the FM along with the source documents that contributed to generating the answer. When clicking any of the source documents, Canvas opens a preview of the document, highlighting the excerpt used by the FM.

Conclusion
Conversational AI has immense potential to transform customer and employee experience by providing a human-like assistant with natural and intuitive interactions such as:

Performing research on a topic, or searching and browsing the organization’s knowledge base
Summarizing volumes of content to rapidly gather insights
Searching for entities, sentiment, PII, and other useful data, increasing the business value of unstructured content
Generating drafts for documents and business correspondence
Creating knowledge articles from disparate internal sources (incidents, chat logs, wikis)

The innovative integration of chat interfaces, knowledge retrieval, and FMs enables enterprises to provide accurate, relevant responses to user questions by using their domain knowledge and sources-of-truth.
By connecting SageMaker Canvas to knowledge bases in Amazon Kendra, organizations can keep their proprietary data within their own environment while still benefiting from the state-of-the-art natural language capabilities of FMs. With the launch of SageMaker Canvas’s Query Documents feature, we are making it easy for any enterprise to use LLMs and their enterprise knowledge as the source of truth to power a secure chat experience. All this functionality is available in a no-code format, allowing businesses to avoid handling repetitive and non-specialized tasks.
To learn more about SageMaker Canvas and how it helps make it easier for everyone to start with Machine Learning, check out the SageMaker Canvas announcement. Learn more about how SageMaker Canvas helps foster collaboration between data scientists and business analysts by reading the Build, Share & Deploy post. Finally, to learn how to create your own Retrieval Augmented Generation workflow, refer to SageMaker JumpStart RAG.
References
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33, 9459-9474.

About the Authors
Davide Gallitelli is a Senior Specialist Solutions Architect for AI/ML. He is based in Brussels and works closely with customers all around the globe that are looking to adopt Low-Code/No-Code Machine Learning technologies, and Generative AI. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.
Bilal Alam is an Enterprise Solutions Architect at AWS with a focus on the Financial Services industry. On most days Bilal is helping customers with building, uplifting and securing their AWS environment to deploy their most critical workloads. He has extensive experience in Telco, networking, and software development. More recently, he has been looking into using AI/ML to solve business problems.
Pashmeen Mistry is a Senior Product Manager at AWS. Outside of work, Pashmeen enjoys adventurous hikes, photography, and spending time with his family.
Dan Sinnreich is a Senior Product Manager at AWS, helping to democratize low-code/no-code machine learning. Previous to AWS, Dan built and commercialized enterprise SaaS platforms and time-series models used by institutional investors to manage risk and construct optimal portfolios. Outside of work, he can be found playing hockey, scuba diving, and reading science fiction.

Researchers from Google and the University of Toronto Introduce Groundbreaking Zero-Shot Agent for Autonomous Learning and Task Execution in Live Computer Environments

Earlier efforts have shown promise in using large language models (LLMs) to produce actions in live environments such as ALFWORLD and ALPHACODE; examples include SAYCAN, REACT, TOOLFORMER, and SWIFTSAGE. In these approaches, LLMs are used to follow expert traces, understand environmental changes, plan and carry out future activities, and compose API requests. Several studies, including REFLEXION and SELF-REFINE, have demonstrated that repeatedly attempting a task with multiple rounds of self-reflection can significantly enhance task completion: the LLM is asked to modify a previous execution plan in light of environmental feedback, and such adjustments are incorporated into the action generator’s prompt for the subsequent round.
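
The following is a minimal, illustrative sketch of such a self-reflection loop; call_llm and run_in_environment are hypothetical stand-ins for an LLM API and a task environment, not part of any specific library.

# call_llm and run_in_environment are hypothetical stand-ins, not a specific library
def call_llm(prompt: str) -> str:
    return "..."  # placeholder: returns a plan or a reflection

def run_in_environment(plan: str):
    return False, "target element not found"  # placeholder: (success, feedback)

reflections = []
for attempt in range(3):
    # the action generator's prompt includes all reflections gathered so far
    prompt = "Task: complete the web form.\n" + "\n".join(reflections) + "\nPlan:"
    plan = call_llm(prompt)                       # generate or revise the execution plan
    success, feedback = run_in_environment(plan)  # execute and observe the outcome
    if success:
        break
    # ask the LLM to reflect on the failure; the reflection shapes the next round
    reflections.append(call_llm(f"The plan failed with feedback: {feedback}. What should change?"))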

MINIWOB++ has recently been used as a testbed to evaluate LLMs’ performance on modular computer tasks. Using comprehensive trace examples of the task for direct supervision (WebGUM), self-supervision, or few/many-shot prompting (SYNAPSE) are standard methods for learning a task. These approaches have completed dozens of computer tasks with a completion rate greater than 90%, seemingly solving the computer control problem. Nonetheless, the need for expert traces constrains the agent’s capacity to learn new tasks. Can an agent independently learn and improve its control over a computer without using well-chosen traces as guidance? Researchers from Google Research and the University of Toronto propose a zero-shot agent to answer this question.

Their agent is built on top of PaLM 2, a recent LLM, and it uses a single set of instruction prompts for all activities rather than task-specific prompts. Additionally, contemporary efforts like RCI, ADAPLANNER, and SYNAPSE use screen representations that may include far more data than what is actually displayed to the user on the screen. For instance, Figure 1 illustrates items that are contained in the HTML provided to the LLM but are not displayed on the screen. Using this extra information makes it easier for the agent to complete the task, but in typical usage scenarios such information might not be readily available, and depending on it could limit how widely the agent can be applied.

Figure 1 – Screen displays can differ from the HTML the LLM receives. Figures 1a–1c show the social media task before and after choosing the “more” button (seed=2); the HTML already exposes the material before clicking. Figures 1d–1e show a similar issue for the click-tab-2 task (seed=0).

The researchers carefully evaluated 13 relatively difficult MINIWOB++ tasks that are meant to span multiple screens and found that 5 of them included HTML containing such multi-screen information in a single observation. Their contributions are as follows. First, in comparison to earlier studies, they adopt a condensed screen representation, which makes the test environment more general and realistic. Second, they provide a simple but effective action planner that, in a single pass, precisely plans out executable operations on a state. They demonstrate that such a “naive” approach can complete nearly all the simple tasks on the MINIWOB++ benchmark using the most recent LLM capacity.

To help the agent learn from exploratory failures and advance on more difficult tasks, they propose a structured thought management technique that draws inspiration from REFLEXION. After a few rounds of attempts, their agent achieves performance comparable to the previous few/many-shot state of the art. To their knowledge, their agent is the first zero-shot design for computer control tasks.


Meet LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models

The introduction of pre-trained language models (PLMs) has signified a transformative shift in the field of natural language processing. They have demonstrated exceptional proficiency in performing a wide range of language tasks, including natural language understanding (NLU) and natural language generation (NLG). These models typically incorporate millions or even billions of parameters, and their substantial computational and memory requirements present significant challenges, as acknowledged by the research community.

In this paper, the authors introduce a novel quantization framework known as LoRA-Fine-Tuning-aware Quantization (LoftQ). This framework is specifically tailored for pre-trained models that require quantization and LoRA fine-tuning. It combines low-rank approximation with quantization to jointly approximate the original high-precision pre-trained weights.
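
To make the idea concrete, the following is a minimal PyTorch sketch of this alternating initialization (quantize the residual, then refit a low-rank term via truncated SVD). It illustrates the concept rather than reproducing the authors' implementation; quantize_fn stands in for any quantization function, such as the uniform quantizer sketched after the list of quantization methods below.

import torch

def loftq_init(weight, quantize_fn, rank=16, num_iters=5):
    # sketch of the alternating scheme: find quantized Q and low-rank A, B such that
    # Q + A @ B.T approximates the original high-precision weight matrix
    A = torch.zeros(weight.shape[0], rank)
    B = torch.zeros(weight.shape[1], rank)
    for _ in range(num_iters):
        # quantize the part of the weight not yet explained by the low-rank term
        Q = quantize_fn(weight - A @ B.T)
        # refit the low-rank term to the quantization residual via truncated SVD
        U, S, Vh = torch.linalg.svd(weight - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank]
        B = Vh[:rank, :].T
    return Q, A, B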

Figure: QLoRA performance with different bit widths. Left: QLoRA initialization of LLAMA-2-13b on WikiText-2. Right: QLoRA applied to LLAMA-2-13b on the WikiText-2 language modeling task. Smaller perplexity indicates better performance.

Quantization methods. The authors apply two quantization methods to demonstrate that LoftQ is compatible with different quantization functions:

• Uniform quantization is a classic quantization method. It uniformly divides a continuous interval into 2^N categories and stores the local maximum absolute value for dequantization.

• NF4 and its 2-bit variant NF2 are quantization methods used in QLoRA. They assume that the high-precision values are drawn from a Gaussian distribution and map these values to discrete slots that have equal probability.
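
As an illustration of the first method, here is a minimal sketch of symmetric uniform quantization that keeps the local maximum absolute value as the dequantization scale; the bit width and tensor size are arbitrary examples.

import torch

def uniform_quantize(x, num_bits=4):
    # map values to integer levels in [-(2^(num_bits-1)-1), 2^(num_bits-1)-1],
    # keeping the maximum absolute value as the scale for dequantization
    max_abs = x.abs().max()
    levels = 2 ** (num_bits - 1) - 1
    q = torch.round(x / max_abs * levels).clamp(-levels, levels)
    return q / levels * max_abs  # return the dequantized ("fake-quantized") values

w = torch.randn(256, 256)
w_q = uniform_quantize(w, num_bits=4)  # usable as quantize_fn in the sketch above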

The authors perform 2-bit and 4-bit quantization on all models, achieving compression ratios of 25-30% and 15-20% at the 4-bit and 2-bit levels, respectively. All experiments are conducted on NVIDIA A100 GPUs.

The evaluation of their quantization framework is carried out through extensive experiments on various downstream tasks, including NLU, question answering, summarization, and NLG. The results of these experiments demonstrate that LoftQ consistently surpasses QLoRA across all precision levels. For example, with 4-bit quantization, they attain a 1.1 and 0.8 improvement in Rouge-1 for XSum and CNN/DailyMail, respectively. As the field of NLP continues to advance, it is expected that further innovations and optimizations will help bridge the gap between the immense potential of PLMs and their practical deployment, benefiting a wide range of applications and users.


Google DeepMind Proposes An Artificial Intelligence Framework for Social and Ethical AI Risk Assessment

Generative AI systems, which create content in different formats, are becoming more widespread. These systems are used in various fields, including medicine, news, politics, and social interaction, where they can even provide companionship. Using natural language output, these systems have traditionally produced information in a single format, such as text or graphics. To make generative AI systems more adaptable, there is an increasing trend toward extending them to additional formats, such as audio (including voice and music) and video.

The increasing use of generative AI systems highlights the need to assess potential risks associated with their deployment. As these technologies become more prevalent and integrated into various applications, concerns arise regarding public safety. Consequently, evaluating the potential risks posed by generative AI systems is becoming a priority for AI developers, policymakers, regulators, and civil society.

The development of AI that might spread false information also raises moral questions about how such technologies will affect society.

Consequently, a recent study by Google DeepMind researchers offers a thorough approach to assessing AI systems’ social and ethical hazards across several contextual layers. The DeepMind framework systematically assesses risks at three distinct levels: the system’s capabilities, human interactions with the technology, and the broader systemic impacts it may have. 

They emphasized that it is crucial to recognize that even highly capable systems will not necessarily cause harm unless they are used problematically within a specific context. The framework also examines real-world human interactions with the AI system, considering factors such as who uses the technology and whether it operates as intended.

Finally, the framework delves into the risks that may arise when AI is extensively adopted, considering how the technology influences larger social systems and institutions. The researchers emphasize how important context is in determining how risky AI is. Each layer of the framework is permeated by contextual concerns, emphasizing the importance of knowing who will use the AI and why. For instance, even if an AI system produces factually accurate outputs, users’ interpretation and subsequent dissemination of these outputs may have unintended consequences that only become apparent within certain contexts.

The researchers provided a case study concentrating on misinformation to demonstrate this strategy. The evaluation includes assessing an AI’s tendency for factual errors, observing how users interact with the system, and measuring any subsequent repercussions, such as the spread of incorrect information. This interconnection of model behavior with actual harm occurring in a given context leads to actionable insights.

DeepMind’s context-based approach underscores the importance of moving beyond isolated model metrics. It emphasizes the critical need to evaluate how AI systems operate within the complex reality of social contexts. This holistic assessment is crucial for harnessing the benefits of AI while minimizing associated risks.


Detection and high-frequency monitoring of methane emission point sour …

Methane (CH4) is a major anthropogenic greenhouse gas that’s a by-product of oil and gas extraction, coal mining, large-scale animal farming, and waste disposal, among other sources. The global warming potential of CH4 is 86 times that of CO2 and the Intergovernmental Panel on Climate Change (IPCC) estimates that methane is responsible for 30 percent of observed global warming to date. Rapidly reducing leakage of CH4 into the atmosphere represents a critical component in the fight against climate change. In 2021, the U.N. introduced The Global Methane Pledge at the Climate Change Conference (COP26), with a goal to take “fast action on methane to keep a 1.5C future within reach.” The Pledge has 150 signatories including the U.S. and EU.
Early detection and ongoing monitoring of methane sources is a key component of meaningful action on methane and is therefore becoming a concern for policy makers and organizations alike. Implementing affordable, effective methane detection solutions at scale – such as on-site methane detectors or aircraft-mounted spectrometers – is challenging, as they are often impractical or prohibitively expensive. Remote sensing using satellites, on the other hand, can provide the global-scale, high-frequency, and cost-effective detection functionality that stakeholders desire.
In this blog post, we show you how you can use Sentinel-2 satellite imagery hosted on the AWS Registry of Open Data in combination with Amazon SageMaker geospatial capabilities to detect point sources of CH4 emissions and monitor them over time. Drawing on recent findings from the earth observation literature, you will learn how you can implement a custom methane detection algorithm and use it to detect and monitor methane leakage from a variety of sites across the globe. This post includes accompanying code on GitHub that provides additional technical detail and helps you to get started with your own methane monitoring solution.
Traditionally, running complex geospatial analyses was a difficult, time-consuming, and resource-intensive undertaking. Amazon SageMaker geospatial capabilities make it easier for data scientists and machine learning engineers to build, train, and deploy models using geospatial data. Using SageMaker geospatial capabilities, you can efficiently transform or enrich large-scale geospatial datasets, accelerate model building with pre-trained machine learning (ML) models, and explore model predictions and geospatial data on an interactive map using 3D accelerated graphics and built-in visualization tools.
Remote sensing of methane point sources using multispectral satellite imagery
Satellite-based methane sensing approaches typically rely on the unique transmittance characteristics of CH4. In the visible spectrum, CH4 has transmittance values equal or close to 1, meaning it’s undetectable by the naked eye. Across certain wavelengths, however, methane does absorb light (transmittance <1), a property which can be exploited for detection purposes. For this, the short wavelength infrared (SWIR) spectrum (1500–2500 nm spectral range) is typically chosen, which is where CH4 is most detectable. Hyper- and multispectral satellite missions (that is, those with optical instruments that capture image data within multiple wavelength ranges (bands) across the electromagnetic spectrum) cover these SWIR ranges and therefore represent potential detection instruments. Figure 1 plots the transmittance characteristics of methane in the SWIR spectrum and the SWIR coverage of various candidate multispectral satellite instruments (adapted from this study).

Figure 1 – Transmittance characteristics of methane in the SWIR spectrum and coverage of Sentinel-2 multi-spectral missions
Many multispectral satellite missions are limited either by a low revisit frequency (for example, PRISMA Hyperspectral at approximately 16 days) or by low spatial resolution (for example, Sentinel 5 at 7.5 km x 7.5 km). The cost of accessing data is an additional challenge: some dedicated constellations operate as commercial missions, potentially making CH4 emission insights less readily available to researchers, decision makers, and other concerned parties due to financial constraints. ESA’s Sentinel-2 multispectral mission, which this solution is based on, strikes an appropriate balance between revisit rate (approximately 5 days), spatial resolution (approximately 20 m) and open access (hosted on the AWS Registry of Open Data).
Sentinel-2 has two bands that cover the SWIR spectrum (at a 20 m resolution): band-11 (1610 nm central wavelength) and band-12 (2190 nm central wavelength). Both bands are suitable for methane detection, while band-12 has significantly higher sensitivity to CH4 absorption (see Figure 1). Intuitively, there are two possible approaches to using this SWIR reflectance data for methane detection. First, you could focus on just a single SWIR band (ideally the one that is most sensitive to CH4 absorption) and compute the pixel-by-pixel difference in reflectance across two different satellite passes. Alternatively, you can use data from a single satellite pass for detection by comparing the two adjacent spectral SWIR bands, which have similar surface and aerosol reflectance properties but different methane absorption characteristics.
The detection method we implement in this blog post combines both approaches. We draw on recent findings from the earth observation literature and compute the fractional change in top-of-the-atmosphere (TOA) reflectance Δρ (that is, reflectance measured by Sentinel-2 including contributions from atmospheric aerosols and gases) between two satellite passes and the two SWIR bands; one baseline pass where no methane is present (base) and one monitoring pass where an active methane point source is suspected (monitor). Mathematically, this can be expressed as follows:
$$\Delta\rho = \frac{c_{\mathrm{monitor}}\,\rho_{b12}^{\mathrm{monitor}} - \rho_{b11}^{\mathrm{monitor}}}{\rho_{b11}^{\mathrm{monitor}}} - \frac{c_{\mathrm{base}}\,\rho_{b12}^{\mathrm{base}} - \rho_{b11}^{\mathrm{base}}}{\rho_{b11}^{\mathrm{base}}} \qquad (1)$$
where ρ is the TOA reflectance as measured by Sentinel-2, and c_monitor and c_base are computed by regressing the TOA reflectance values of band-12 against those of band-11 across the entire scene (that is, ρ_b11 = c * ρ_b12). For more details, refer to this study on high-frequency monitoring of anomalous methane point sources with multispectral Sentinel-2 satellite observations.
Implement a methane detection algorithm with SageMaker geospatial capabilities
To implement the methane detection algorithm, we use the SageMaker geospatial notebook within Amazon SageMaker Studio. The geospatial notebook kernel is pre-equipped with essential geospatial libraries such as GDAL, GeoPandas, Shapely, xarray, and Rasterio, enabling direct visualization and processing of geospatial data within the Python notebook environment. See the getting started guide to learn how to start using SageMaker geospatial capabilities.
SageMaker provides a purpose-built API designed to facilitate the retrieval of satellite imagery through a consolidated interface using the SearchRasterDataCollection API call. SearchRasterDataCollection relies on the following input parameters:

Arn: The Amazon resource name (ARN) of the queried raster data collection
AreaOfInterest: A polygon object (in GeoJSON format) representing the region of interest for the search query
TimeRangeFilter: Defines the time range of interest, denoted as {StartTime: <string>, EndTime: <string>}
PropertyFilters: Supplementary property filters, such as specifications for maximum acceptable cloud cover, can also be incorporated

This method supports querying from various raster data sources which can be explored by calling ListRasterDataCollections. Our methane detection implementation uses Sentinel-2 satellite imagery, which can be globally referenced using the following ARN: arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8.
This ARN represents Sentinel-2 imagery, which has been processed to Level 2A (surface reflectance, atmospherically corrected). For methane detection purposes, we will use top-of-atmosphere (TOA) reflectance data (Level 1C), which doesn’t include the surface level atmospheric corrections that would make changes in aerosol composition and density (that is, methane leaks) undetectable.
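As noted above, the available collections and their ARNs can be explored by calling ListRasterDataCollections. The following is a minimal sketch using the Boto3 sagemaker-geospatial client (the Region is an assumption taken from the ARN above); the same client object is used for the SearchRasterDataCollection calls later in this post.

import boto3

# create the SageMaker geospatial client (Region assumed from the ARN above)
geospatial_client = boto3.client("sagemaker-geospatial", region_name="us-west-2")

# discover the available raster data collections and their ARNs
collections = geospatial_client.list_raster_data_collections()
for summary in collections["RasterDataCollectionSummaries"]:
    print(summary["Name"], summary["Arn"])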
To identify potential emissions from a specific point source, we need two input parameters: the coordinates of the suspected point source and a designated timestamp for methane emission monitoring. Given that the SearchRasterDataCollection API uses polygons or multi-polygons to define an area of interest (AOI), our approach involves expanding the point coordinates into a bounding box first and then using that polygon to query for Sentinel-2 imagery using SearchRasterDataCollection.
In this example, we monitor a known methane leak originating from an oil field in Northern Africa. This is a standard validation case in the remote sensing literature and is referenced, for example, in this study. A fully executable code base is provided on the amazon-sagemaker-examples GitHub repository. Here, we highlight only selected code sections that represent the key building blocks for implementing a methane detection solution with SageMaker geospatial capabilities. See the repository for additional details.
We start by initializing the coordinates and target monitoring date for the example case.
#coordinates and date for North Africa oil field
#see here for reference: https://doi.org/10.5194/amt-14-2771-2021
point_longitude = 5.9053
point_latitude = 31.6585
target_date = '2019-11-20'
#size of bounding box in each direction around point
distance_offset_meters = 1500

The following code snippet generates a bounding box for the given point coordinates and then performs a search for the available Sentinel-2 imagery based on the bounding box and the specified monitoring date:
import math
from shapely import geometry

def bbox_around_point(lon, lat, distance_offset_meters):
    #Equatorial radius (m) taken from https://nssdc.gsfc.nasa.gov/planetary/factsheet/earthfact.html
    earth_radius_meters = 6378137
    lat_offset = math.degrees(distance_offset_meters / earth_radius_meters)
    lon_offset = math.degrees(distance_offset_meters / (earth_radius_meters * math.cos(math.radians(lat))))
    return geometry.Polygon([
        [lon - lon_offset, lat - lat_offset],
        [lon - lon_offset, lat + lat_offset],
        [lon + lon_offset, lat + lat_offset],
        [lon + lon_offset, lat - lat_offset],
        [lon - lon_offset, lat - lat_offset],
    ])

#generate bounding box and extract polygon coordinates
aoi_geometry = bbox_around_point(point_longitude, point_latitude, distance_offset_meters)
aoi_polygon_coordinates = geometry.mapping(aoi_geometry)['coordinates']

#set search parameters
search_params = {
    "Arn": "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8",  # Sentinel-2 L2 data
    "RasterDataCollectionQuery": {
        "AreaOfInterest": {
            "AreaOfInterestGeometry": {
                "PolygonGeometry": {
                    "Coordinates": aoi_polygon_coordinates
                }
            }
        },
        "TimeRangeFilter": {
            "StartTime": "{}T00:00:00Z".format(as_iso_date(target_date)),
            "EndTime": "{}T23:59:59Z".format(as_iso_date(target_date))
        }
    },
}

#query raster data using SageMaker geospatial capabilities
#(geospatial_client is the boto3 "sagemaker-geospatial" client created earlier;
# as_iso_date is a helper function from the accompanying GitHub repository)
sentinel2_items = geospatial_client.search_raster_data_collection(**search_params)
The response contains a list of matching Sentinel-2 items and their corresponding metadata. These include Cloud-Optimized GeoTIFFs (COG) for all Sentinel-2 bands, as well as thumbnail images for a quick preview of the visual bands of the image. Naturally, it’s also possible to access the full-resolution satellite image (RGB plot), shown in Figure 2 that follows.
Figure 2 – Satellite image (RGB plot) of AOI
As previously detailed, our detection approach relies on fractional changes in top-of-the-atmosphere (TOA) SWIR reflectance. For this to work, the identification of a good baseline is crucial. Finding a good baseline can quickly become a tedious process that involves plenty of trial and error. However, good heuristics can go a long way in automating this search process. A search heuristic that has worked well for cases investigated in the past is as follows: for the past day_offset=n days, retrieve all satellite imagery, remove any clouds and clip the image to the AOI in scope. Then compute the average band-12 reflectance across the AOI. Return the Sentinel tile ID of the image with the highest average reflectance in band-12.
This logic is implemented in the following code excerpt. Its rationale relies on the fact that band-12 is highly sensitive to CH4 absorption (see Figure 1). A greater average reflectance value corresponds to a lower absorption from sources such as methane emissions and therefore provides a strong indication for an emission free baseline scene.
from datetime import datetime, timedelta

#get_sentinel2_meta_data and get_s2l1c_band_data_xarray are helper functions
#from the accompanying GitHub repository
def approximate_best_reference_date(lon, lat, date_to_monitor, distance_offset=1500, cloud_mask=True, day_offset=30):

    #initialize AOI and other parameters
    aoi_geometry = bbox_around_point(lon, lat, distance_offset)
    BAND_12_SWIR22 = "B12"
    max_mean_swir = None
    ref_s2_tile_id = None
    ref_target_date = date_to_monitor

    #loop over n=day_offset previous days
    for day_delta in range(-1 * day_offset, 0):
        date_time_obj = datetime.strptime(date_to_monitor, '%Y-%m-%d')
        target_date = (date_time_obj + timedelta(days=day_delta)).strftime('%Y-%m-%d')

        #get Sentinel-2 tiles for current date
        s2_tiles_for_target_date = get_sentinel2_meta_data(target_date, aoi_geometry)

        #loop over available tiles for current date
        for s2_tile_meta in s2_tiles_for_target_date:
            s2_tile_id_to_test = s2_tile_meta['Id']
            #retrieve cloud-masked (optional) L1C band 12
            target_band_data = get_s2l1c_band_data_xarray(s2_tile_id_to_test, BAND_12_SWIR22, clip_geometry=aoi_geometry, cloud_mask=cloud_mask)
            #compute mean reflectance of SWIR band
            mean_swir = target_band_data.sum() / target_band_data.count()

            #ensure the visible/non-clouded area is adequately large
            visible_area_ratio = target_band_data.count() / (target_band_data.shape[1] * target_band_data.shape[2])
            if visible_area_ratio <= 0.7:  #<-- ensure acceptable cloud cover
                continue

            #update maximum ref_s2_tile_id and ref_target_date if applicable
            if max_mean_swir is None or mean_swir > max_mean_swir:
                max_mean_swir = mean_swir
                ref_s2_tile_id = s2_tile_id_to_test
                ref_target_date = target_date

    return (ref_s2_tile_id, ref_target_date)

Using this method allows us to approximate a suitable baseline date and corresponding Sentinel-2 tile ID. Sentinel-2 tile IDs carry information on the mission ID (Sentinel-2A/Sentinel-2B), the unique tile number (such as 32SKA), and the date the image was taken, among other information, and uniquely identify an observation (that is, a scene). In our example, the approximation process suggests October 6, 2019 (Sentinel-2 tile: S2B_32SKA_20191006_0_L2A) as the most suitable baseline candidate.
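For illustration, the components of such a tile ID can be separated with a simple string split. This is a sketch based on the ID shown above; the exact naming convention is an assumption and may vary between processing baselines:

#split a Sentinel-2 tile ID into its components (illustrative only)
tile_id = "S2B_32SKA_20191006_0_L2A"
mission, grid_tile, acquisition_date, _, product_level = tile_id.split("_")
print(mission, grid_tile, acquisition_date, product_level)
#prints: S2B 32SKA 20191006 L2A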
Next, we can compute the corrected fractional change in reflectance between the baseline date and the date we’d like to monitor. The correction factors c (see Equation 1 preceding) can be calculated with the following code:
def compute_correction_factor(tif_y, tif_x):

    #get flattened arrays for regression
    y = np.array(tif_y.values.flatten())
    x = np.array(tif_x.values.flatten())
    np.nan_to_num(y, copy=False)
    np.nan_to_num(x, copy=False)

    #fit linear model using least squares regression
    x = x[:, np.newaxis]  #reshape
    c, _, _, _ = np.linalg.lstsq(x, y, rcond=None)

    return c[0]
The full implementation of Equation 1 is given in the following code snippet:
def compute_corrected_fractional_reflectance_change(l1_b11_base, l1_b12_base, l1_b11_monitor, l1_b12_monitor):

    #get correction factors
    c_monitor = compute_correction_factor(tif_y=l1_b11_monitor, tif_x=l1_b12_monitor)
    c_base = compute_correction_factor(tif_y=l1_b11_base, tif_x=l1_b12_base)

    #get corrected fractional reflectance change
    frac_change = ((c_monitor * l1_b12_monitor - l1_b11_monitor) / l1_b11_monitor) - ((c_base * l1_b12_base - l1_b11_base) / l1_b11_base)
    return frac_change
Finally, we can wrap the preceding methods into an end-to-end routine that identifies the AOI for a given longitude and latitude, monitoring date, and baseline tile ID, acquires the required satellite imagery, and performs the fractional reflectance change computation.
def run_full_fractional_reflectance_change_routine(lon, lat, date_monitor, baseline_s2_tile_id, distance_offset=1500, cloud_mask=True):

    #get bounding box
    aoi_geometry = bbox_around_point(lon, lat, distance_offset)

    #get S2 metadata
    s2_meta_monitor = get_sentinel2_meta_data(date_monitor, aoi_geometry)

    #get tile id
    grid_id = baseline_s2_tile_id.split("_")[1]
    s2_tile_id_monitor = list(filter(lambda x: f"_{grid_id}_" in x["Id"], s2_meta_monitor))[0]["Id"]

    #retrieve band 11 and 12 of the Sentinel L1C product for the given S2 tiles
    l1_swir16_b11_base = get_s2l1c_band_data_xarray(baseline_s2_tile_id, BAND_11_SWIR16, clip_geometry=aoi_geometry, cloud_mask=cloud_mask)
    l1_swir22_b12_base = get_s2l1c_band_data_xarray(baseline_s2_tile_id, BAND_12_SWIR22, clip_geometry=aoi_geometry, cloud_mask=cloud_mask)
    l1_swir16_b11_monitor = get_s2l1c_band_data_xarray(s2_tile_id_monitor, BAND_11_SWIR16, clip_geometry=aoi_geometry, cloud_mask=cloud_mask)
    l1_swir22_b12_monitor = get_s2l1c_band_data_xarray(s2_tile_id_monitor, BAND_12_SWIR22, clip_geometry=aoi_geometry, cloud_mask=cloud_mask)

    #compute corrected fractional reflectance change
    frac_change = compute_corrected_fractional_reflectance_change(
        l1_swir16_b11_base,
        l1_swir22_b12_base,
        l1_swir16_b11_monitor,
        l1_swir22_b12_monitor
    )

    return frac_change
Running this method with the parameters we determined earlier yields the fractional change in SWIR TOA reflectance as an xarray.DataArray. We can perform a first visual inspection of the result by running a simple plot() invocation on this data array. Our method reveals the presence of a methane plume at the center of the AOI that was undetectable in the RGB plot seen previously.
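The following is a minimal sketch of that inspection, reusing the parameters determined earlier and assuming matplotlib is available; the colormap choice is our own and not prescribed by the method:

import matplotlib.pyplot as plt

#compute the fractional reflectance change for the monitoring date against the baseline tile
frac_change = run_full_fractional_reflectance_change_routine(
    point_longitude, point_latitude, target_date, "S2B_32SKA_20191006_0_L2A"
)

#quick visual inspection; xarray delegates plotting to matplotlib
frac_change.plot(cmap="RdBu_r")
plt.title("Fractional change in SWIR TOA reflectance")
plt.show()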
Figure 3 – Fractional reflectance change in TOA reflectance (SWIR spectrum)
As a final step, we extract the identified methane plume and overlay it on a raw RGB satellite image to provide the important geographic context. This is achieved by thresholding, which can be implemented as shown in the following:
def get_plume_mask(change_in_reflectance_tif, threshold_value):
    cr_masked = change_in_reflectance_tif.copy()
    #set values above the threshold to nan
    cr_masked[cr_masked > threshold_value] = np.nan
    #mask out the nan values
    plume_tif = np.ma.array(cr_masked, mask=np.isnan(cr_masked))

    return plume_tif
For our case, a threshold of -0.02 fractional change in reflectance yields good results, but the threshold can vary from scene to scene, so you will have to calibrate it for your specific use case. Figure 4 that follows illustrates how the plume overlay is generated by combining the raw satellite image of the AOI with the masked plume into a single composite image that shows the methane plume in its geographic context.
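One way to produce such a composite is sketched below. It assumes the RGB image of the AOI is already available as an array clipped to the same extent; the name rgb_image is a placeholder, and the colormap and transparency settings are our own choices:

import matplotlib.pyplot as plt

#threshold the fractional change and overlay the plume on the RGB scene
plume = get_plume_mask(frac_change, threshold_value=-0.02)

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(rgb_image)                          #raw RGB satellite image of the AOI (placeholder array)
ax.imshow(plume, cmap="inferno", alpha=0.6)   #masked plume rendered semi-transparently on top
ax.set_title("Methane plume overlay for AOI")
plt.show()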

Figure 4 – RGB image, fractional reflectance change in TOA reflectance (SWIR spectrum), and methane plume overlay for AOI
Solution validation with real-world methane emission events
As a final step, we evaluate our method for its ability to correctly detect and pinpoint methane leakages from a range of sources and geographies. First, we use a controlled methane release experiment specifically designed for the validation of space-based point-source detection and quantification of onshore methane emissions. In this 2021 experiment, researchers performed several methane releases in Ehrenberg, Arizona over a 19-day period. Running our detection method for one of the Sentinel-2 passes during the time of that experiment produces the following result showing a methane plume:
Figure 5 – Methane plume intensities for Arizona Controlled Release Experiment
The plume generated during the controlled release is clearly identified by our detection method. The same is true for other known real-world leakages (in Figure 6 that follows) from sources such as a landfill in East Asia (left) or an oil and gas facility in North America (right).
Figure 6 – Methane plume intensities for an East Asian landfill (left) and an oil and gas field in North America (right)
In sum, our method can help identify methane emissions both from controlled releases and from various real-world point sources across the globe. This works best for on-shore point sources with limited surrounding vegetation. It does not work for off-shore scenes due to the high absorption (that is, low transmittance) of the SWIR spectrum by water. Given that the proposed detection algorithm relies on variations in methane intensity, our method also requires pre-leakage observations. This can make monitoring of leakages with constant emission rates challenging.
Clean up
To avoid incurring unwanted charges after a methane monitoring job has completed, ensure that you terminate the SageMaker instance and delete any unwanted local files.
Conclusion
By combining SageMaker geospatial capabilities with open geospatial data sources, you can implement your own highly customized remote monitoring solutions at scale. This blog post focused on methane detection, a focal area for governments, NGOs, and other organizations seeking to detect and ultimately avoid harmful methane emissions. You can get started on your own journey into geospatial analytics today by spinning up a notebook with the SageMaker geospatial kernel and implementing your own detection solution. See the GitHub repository to get started building your own satellite-based methane detection solution. Also check out the sagemaker-examples repository for further examples and tutorials on how to use SageMaker geospatial capabilities in other real-world remote sensing applications.

About the authors
Dr. Karsten Schroer is a Solutions Architect at AWS. He supports customers in leveraging data and technology to drive sustainability of their IT infrastructure and build cloud-native data-driven solutions that enable sustainable operations in their respective verticals. Karsten joined AWS following his PhD studies in applied machine learning & operations management. He is truly passionate about technology-enabled solutions to societal challenges and loves to dive deep into the methods and application architectures that underlie these solutions.
Janosch Woschitz is a Senior Solutions Architect at AWS, specializing in geospatial AI/ML. With over 15 years of experience, he supports customers globally in leveraging AI and ML for innovative solutions that capitalize on geospatial data. His expertise spans machine learning, data engineering, and scalable distributed systems, augmented by a strong background in software engineering and industry expertise in complex domains such as autonomous driving.

How can Pre-Trained Visual Representations Help Solve Long-Horizon Manipulation? Meet Universal Visual Decomposer (UVD): An off-the-Shelf Method for Identifying Subgoals from Videos

In the research paper “Universal Visual Decomposer: Long-Horizon Manipulation Made Easy”, the authors address the challenge of teaching robots to perform long-horizon manipulation tasks from visual observations. These tasks involve multiple stages and are often encountered in real-world scenarios like cooking and tidying. Learning such complex skills is challenging due to compounding errors, vast action and observation spaces, and the absence of meaningful learning signals for each step.

The authors introduce an innovative solution called the Universal Visual Decomposer (UVD). UVD is an off-the-shelf task decomposition method that leverages pre-trained visual representations designed for robotic control. It does not require task-specific knowledge and can be applied to various tasks without additional training. UVD works by discovering subgoals within visual demonstrations, which aids in policy learning and generalization to unseen tasks.

The core idea behind UVD is that pre-trained visual representations are capable of capturing temporal progress in short videos of goal-directed behavior. By applying these representations to long, unsegmented task videos, UVD identifies phase shifts in the embedding space, signifying subtask transitions. This approach is entirely unsupervised and imposes zero additional training costs on standard visuomotor policy training.

UVD’s effectiveness is demonstrated through extensive evaluations in both simulation and real-world tasks. It outperforms baseline methods in imitation and reinforcement learning settings, showcasing the advantage of automated visual task decomposition using the UVD framework.

In conclusion, the researchers have introduced the Universal Visual Decomposer (UVD) as an off-the-shelf solution for decomposing long-horizon manipulation tasks using pre-trained visual representations. UVD offers a promising approach to improving robotic policy learning and generalization, with successful applications in both simulated and real-world scenarios.

Check out the Paper and Project. All credit for this research goes to the researchers on this project.

This AI Research Introduces ‘RAFA’: A Principled Artificial Intelligence Framework for Autonomous LLM Agents with Provable Sample Efficiency

While LLMs’ reasoning capabilities are excellent, they still need to be improved to apply those capabilities in practical settings. In particular, how to provably accomplish a task with minimal interactions with the outside world (e.g., via an internal method of reasoning) is still a matter of conjecture.

To choreograph reasoning and action, a new study by Northwestern University, Tsinghua University, and the Chinese University of Hong Kong presents a principled framework called “reason for future, act for now” (RAFA), which provides verifiable regret guarantees. More precisely, they create a long-term trajectory planner (“reason for future”) that learns from the reasoning prompts stored in the memory buffer.

Within a Bayesian adaptive MDP paradigm, they formally describe how to reason and act with LLMs. At each stage, the LLM agent does the first action of the planned trajectory (“act for now”), saves the gathered feedback in the memory buffer, and then re-invokes the reasoning routine to replan the future trajectory based on the current state.

Learning and planning in Bayesian adaptive Markov decision processes (MDPs) is the central principle, which is then used to represent reasoning in LLMs as MDPs. Similarly, they instruct LLMs to learn a more accurate posterior distribution over the unknown environment by consulting the memory buffer and designing a series of actions that will maximize some value function. When the external environment’s state changes, the LLM agent again calls on the reasoning routine to plot a new course of action. To maintain consistency in learning and planning, the researchers use a switching condition to determine if the more recent historical data should be used.

Several text-based benchmarks assess RAFA’s performance, including Game of 24, ALFWorld, BlocksWorld, and Tic-Tac-Toe. RAFA is an AI system that uses a language model to carry out reinforcement learning and planning tasks. The main points are summed up here.

In Game of 24, RAFA determines how to reach 24 by combining four natural numbers with basic arithmetic operations. The algorithm keeps track of the most recent formula and produces the next step toward this objective. In terms of sample efficiency, RAFA performs exceptionally well.

ALFWorld is a virtual world where users may run simulations of household chores using embodied agents. RAFA achieves better results than competing frameworks like AdaPlanner, ReAct, and Reflexion.

In BlocksWorld, players are tasked with building structures out of blocks. Compared to other models such as Vicuna, RAP, and CoT, RAFA’s success rates are significantly higher.

RAFA acts as “O” in a game of Tic-Tac-Toe against a language model acting as “X.” Despite the disadvantage of playing “O,” RAFA competes with and even outperforms the language model in some settings. The researchers believe that selecting a different planning depth (B = 3 or B = 4) might increase or decrease sample efficiency.

In conclusion, RAFA is a flexible algorithm that excels in various settings and tasks, demonstrating amazing sample efficiency and often exceeding other existing frameworks.

Check out the Paper, Github, and Project Page. All credit for this research goes to the researchers on this project.

Revolutionizing Document Parsing: Meet DSG – The First End-to-End Trainable System for Hierarchical Structure Extraction

The Document Structure Generator (DSG) is a powerful system for parsing and generating structured documents. DSG surpasses commercial OCR tools’ capabilities and sets new performance standards, positioning itself as a powerful and adaptable solution for diverse real-world applications. Researchers delve into the innovative features and impressive outcomes of DSG, highlighting its potential to revolutionize document processing.

Traditional document-to-structure systems rely on heuristics and lack end-to-end trainability. The DSG offers a solution, the first end-to-end trainable system for hierarchical document parsing. It employs deep neural networks to parse entities, capturing sequences and nested structures. DSG introduces an extended syntax for queries and proves valuable for practical use by allowing seamless adaptation to new documents without manual re-engineering.

Document structure parsing is essential for extracting hierarchical information from documents, particularly PDFs and scans, which pose challenges for storage and downstream tasks. Existing solutions, like OCR, focus on text retrieval but struggle with hierarchical structure inference. The DSG is introduced as an innovative system that employs a deep neural network to parse entities, preserve their relationships, and facilitate the creation of structured hierarchical formats. It addresses the need for end-to-end trainable systems in this domain.

The DSG is a system for hierarchical document parsing, utilizing a deep neural network to parse entities and capture their sequences and nested structure. It’s end-to-end trainable, demonstrating effectiveness and flexibility. The authors contribute the E-Periodica dataset, enabling DSG evaluation. DSG surpasses commercial OCR tools and achieves state-of-the-art performance. Performance assessment includes separate evaluations for entity detection and structure generation, using benchmarking adapted from related tasks such as scene graph generation.

Evaluation primarily relies on the E-Periodica dataset, leaving the system’s generalizability to other document types unexamined. A detailed analysis of the computational resources required for training and inference is not included. While DSG outperforms commercial OCR tools, the paper lacks an in-depth comparison or analysis of OCR tool limitations. Training challenges and potential biases in the data are not discussed, and the paper lacks a comprehensive analysis of system error cases and failure modes. Understanding these aspects is crucial for future enhancements.

In conclusion, the DSG presents a fully trainable system for document parsing, effectively capturing entity sequences and nested structures. It surpasses commercial OCR tools, achieving state-of-the-art hierarchical document parsing. The authors introduce the challenging E-Periodica dataset for evaluation, featuring diverse semantic categories and intricate nested structures. DSG’s end-to-end training flexibility marks a significant advancement in document structure processing, representing a pioneering solution in the field.

Future research should assess DSG’s applicability to diverse documents and datasets, examine its computational demands and efficiency, and comprehensively analyze its limitations and potential failure modes. Investigating training data availability and biases and comparing DSG to commercial OCR tools are essential. Continuous refinement based on user feedback and real-world use is vital for enhancing the system’s practicality and effectiveness.

Check out the Paper. All credit for this research goes to the researchers on this project.

Intelligent document processing with Amazon Textract, Amazon Bedrock, …

In today’s information age, the vast volumes of data housed in countless documents present both a challenge and an opportunity for businesses. Traditional document processing methods often fall short in efficiency and accuracy, leaving room for innovation, cost-efficiency, and optimizations. Document processing has witnessed significant advancements with the advent of Intelligent Document Processing (IDP). With IDP, businesses can transform unstructured data from various document types into structured, actionable insights, dramatically enhancing efficiency and reducing manual efforts. However, the potential doesn’t end there. By integrating generative artificial intelligence (AI) into the process, we can further enhance IDP capabilities. Generative AI not only introduces enhanced capabilities in document processing, it also introduces a dynamic adaptability to changing data patterns. This post takes you through the synergy of IDP and generative AI, unveiling how they represent the next frontier in document processing.
We discuss IDP in detail in our series Intelligent document processing with AWS AI services (Part 1 and Part 2). In this post, we discuss how to extend a new or existing IDP architecture with large language models (LLMs). More specifically, we discuss how we can integrate Amazon Textract with LangChain as a document loader and Amazon Bedrock to extract data from documents and use generative AI capabilities within the various IDP phases.
Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) through easy-to-use APIs.
Solution overview
The following diagram is a high-level reference architecture that explains how you can further enhance an IDP workflow with foundation models. You can use LLMs in one or all phases of IDP depending on the use case and desired outcome.

In this architecture, LLMs are used to perform specific tasks within the IDP workflow.

Document classification – In addition to using Amazon Comprehend, you can use an LLM to classify documents using few-shot prompting. Few-shot prompting involves prompting the language model with a few examples of different classes and a list of all possible classes, and then asking the model to classify a given piece of text from a document using one of the classes.
Summarization – You can also use LLMs to summarize larger documents to provide precise summaries within the extraction phase of IDP. For example, a financial analysis system may involve analyzing hundreds of pages of earnings documents of a company. You can use a language model to summarize the key aspects of the earnings, enabling analysts to make business decisions.
Standardization and in-context Q&A – In addition to extracting exact information out of documents using the Amazon Textract Analyze Document functionality, you can use LLMs to extract information that may otherwise not be explicitly inferred from a document. For example, a patient discharge summary may have the patient’s hospital admit date and discharge date but may not explicitly specify the total number of days the patient was in the hospital. You can use an LLM to deduce the total number of days the patient was admitted in the hospital, given the two dates extracted by Amazon Textract. This value can then be assigned with a well-known alias in a key-value format, also known as a normalized key, which makes consumption and post-processing even more straightforward.
Templating and normalizations – An IDP pipeline often generates output that must conform to a specific deterministic schema. This is so that the output generated using the IDP workflow can be consumed by a downstream system, for example a relational database. The benefit of defining a deterministic schema is also to achieve key normalization so that we have a known set of keys to process in our postprocessing logic. For example, we may want to define “DOB” as a normalized key for “date of birth,” “birth date,” “birthday date,” “date born,” and so on, because documents may come with any variation of these. We use LLMs to perform such templating and normalized key-value extractions on any document.
Spellcheck and corrections – Although Amazon Textract can extract the exact values from scanned documents (printed or handwritten), you can use a language model to identify whether misspellings and grammatical errors exist in the extracted data. This is important in situations where the data may be extracted from poor-quality or handwritten documents and used for generating marketing materials, flash reports, and so on. In addition to having a human manually review low-score extractions from Amazon Textract, you can use an LLM to augment the review process by providing correction recommendations to the human reviewer, thereby speeding up the review process.

In the following sections, we dive deep into how Amazon Textract is integrated into generative AI workflows using LangChain to process documents for each of these specific tasks. The code blocks provided here have been trimmed down for brevity. Refer to our GitHub repository for detailed Python notebooks and a step-by-step walkthrough.
Amazon Textract LangChain document loader
Text extraction from documents is a crucial aspect when it comes to processing documents with LLMs. You can use Amazon Textract to extract unstructured raw text from documents and preserve the original semi-structured or structured objects like key-value pairs and tables present in the document. Document packages like healthcare and insurance claims or mortgages consist of complex forms that contain a lot of information across structured, semi-structured, and unstructured formats. Document extraction is an important step here because LLMs benefit from the rich content to generate more accurate and relevant responses, which otherwise could impact the quality of the LLMs’ output.
LangChain is a powerful open-source framework for integrating with LLMs. LLMs in general are versatile but may struggle with domain-specific tasks where deeper context and nuanced responses are needed. LangChain empowers developers in such scenarios to build agents that can break down complex tasks into smaller sub-tasks. The sub-tasks can then introduce context and memory into LLMs by connecting and chaining LLM prompts.
LangChain offers document loaders that can load and transform data from documents. You can use them to structure documents into preferred formats that can be processed by LLMs. The AmazonTextractPDFLoader is a service loader type of document loader that provides a quick way to automate document processing by using Amazon Textract in combination with LangChain. For more details on AmazonTextractPDFLoader, refer to the LangChain documentation. To use the Amazon Textract document loader, you start by importing it from the LangChain library:

from langchain.document_loaders import AmazonTextractPDFLoader

You can load a document from an HTTPS URL endpoint as well as from documents hosted in Amazon Simple Storage Service (Amazon S3) buckets via Amazon S3 object URLs (also called path style access):

https_loader = AmazonTextractPDFLoader("https://sample-website.com/sample-doc.pdf")
https_document = https_loader.load()

s3_loader = AmazonTextractPDFLoader("s3://sample-bucket/sample-doc.pdf")
s3_document = s3_loader.load()

You can also store documents in Amazon S3 and refer to them using the s3:// URL pattern, as explained in Accessing a bucket using S3://, and pass this S3 path to the Amazon Textract PDF loader:

import boto3
textract_client = boto3.client('textract', region_name='us-east-2')

file_path = "s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf"
loader = AmazonTextractPDFLoader(file_path, client=textract_client)
documents = loader.load()

A multi-page document will contain multiple pages of text, which can then be accessed via the documents object, which is a list of pages. The following code loops through the pages in the documents object and prints the document text, which is available via the page_content attribute:

print(len(documents))

for document in documents:
    print(document.page_content)

Document classification
Amazon Comprehend and LLMs can be effectively utilized for document classification. Amazon Comprehend is a natural language processing (NLP) service that uses ML to extract insights from text. Amazon Comprehend also supports custom classification model training with layout awareness on documents like PDFs, Word, and image formats. For more information about using the Amazon Comprehend document classifier, refer to Amazon Comprehend document classifier adds layout support for higher accuracy.
When paired with LLMs, document classification becomes a powerful approach for managing large volumes of documents. LLMs are helpful in document classification because they can analyze the text, patterns, and contextual elements in the document using natural language understanding. You can also fine-tune them for specific document classes. When a new document type introduced in the IDP pipeline needs classification, the LLM can process the text and categorize the document given a set of classes. The following is sample code that uses the LangChain document loader powered by Amazon Textract to extract text from a document and use it to classify the document. We use the Anthropic Claude v2 model via Amazon Bedrock to perform the classification.
In the following example, we first extract text from a patient discharge report and use an LLM to classify it given a list of three different document types—DISCHARGE_SUMMARY, RECEIPT, and PRESCRIPTION. The following screenshot shows our report.

We use the following code:

from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/document.png")
document = loader.load()

template = """

Given a list of classes, classify the document into one of these classes. Skip any preamble text and just give the class name.

<classes>DISCHARGE_SUMMARY, RECEIPT, PRESCRIPTION</classes>
<document>{doc_text}</document>
<classification>"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
class_name = llm_chain.run(document[0].page_content)

print(f"The provided document is = {class_name}")

The code produces the following output:
The provided document is a DISCHARGE_SUMMARY
Document summarization
Summarization involves condensing a given text or document into a shorter version while retaining its key information. This technique is beneficial for efficient information retrieval, which enables users to quickly grasp the key points of a document without reading the entire content. Although Amazon Textract doesn’t directly perform text summarization, it provides the foundational capabilities of extracting the entire text from documents. This extracted text serves as an input to our LLM model for performing text summarization tasks.
Using the same sample discharge report, we use AmazonTextractPDFLoader to extract text from this document. As before, we use the Claude v2 model via Amazon Bedrock and initialize it with a prompt that contains the instructions on what to do with the text (in this case, summarization). Finally, we run the LLM chain by passing in the extracted text from the document loader. This runs an inference action on the LLM with the prompt that consists of the instructions to summarize, and the document’s text marked by Document. See the following code:

from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/discharge-summary.png")
document = loader.load()

template = """

Given a full document, give me a concise summary. Skip any preamble text and just give the summary.

<document>{doc_text}</document>
<summary>"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")

num_tokens = bedrock_llm.get_num_tokens(document[0].page_content)
print(f"Our prompt has {num_tokens} tokens\n\n=========================\n")

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
summary = llm_chain.run(document[0].page_content)

print(summary.replace("</summary>", "").strip())

The code generates the summary of a patient discharge summary report:

Our prompt has 797 tokens
=========================
35 yo M admitted for epigastric abdominal pain, nausea, fatigue. Found to likely have ulcer. Discharged with activity restrictions, antibiotics, diet changes, and follow up.

The preceding example used a single-page document to perform summarization. However, you will likely deal with documents containing multiple pages that need summarization. A common way to perform summarization on multiple pages is to first generate summaries on smaller chunks of text and then combine the smaller summaries to get a final summary of the document. Note that this method requires multiple calls to the LLM. The logic for this can be crafted easily; however, LangChain provides a built-in summarize chain that can summarize large texts (from multi-page documents). The summarization can happen either via map_reduce or with stuff options, which are available as options to manage the multiple calls to the LLM. In the following example, we use map_reduce to summarize a multi-page document. The following figure illustrates our workflow.

Let’s first start by extracting the document and see the total token count per page and the total number of pages:

from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock

bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")

loader = AmazonTextractPDFLoader(f"s3://{data_bucket}/bedrock-sample/health_plan.pdf")
document = loader.load()
num_docs = len(document)
print(f"There are {num_docs} pages in the document")
for index, doc in enumerate(document):
    num_tokens_first_doc = bedrock_llm.get_num_tokens(doc.page_content)
    print(f"Page {index+1} has approx. {num_tokens_first_doc} tokens")

There are 5 pages in the document
Page 1 has approx. 533 tokens
Page 2 has approx. 1323 tokens
Page 3 has approx. 997 tokens
Page 4 has approx. 1643 tokens
Page 5 has approx. 867 tokens

Next, we use LangChain’s built-in load_summarize_chain to summarize the entire document:
from langchain.chains.summarize import load_summarize_chain

summary_chain = load_summarize_chain(llm=bedrock_llm,
                                     chain_type='map_reduce')
output = summary_chain.run(document)
print(output.strip())
Standardization and Q&A
In this section, we discuss standardization and Q&A tasks.
Standardization
Output standardization is a text generation task where LLMs are used to provide consistent formatting of the output text. This task is particularly useful for automating key entity extraction that requires the output to be aligned with desired formats. For example, we can follow prompt engineering best practices to instruct an LLM to format dates into MM/DD/YYYY format, which may be compatible with a database DATE column. The following code block shows an example of how this is done using an LLM and prompt engineering. Not only do we standardize the output format for the date values, we also prompt the model to generate the final output in a JSON format so that it is easily consumable in our downstream applications. We use LangChain Expression Language (LCEL) to chain together two actions. The first action prompts the LLM to generate a JSON format output of just the dates from the document. The second action takes the JSON output and standardizes the date format. Note that this two-step action may also be performed in a single step with proper prompt engineering, as we’ll see in normalization and templating.

from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/discharge-summary.png")
document = loader.load()

bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")

template1 = """

Given a full document, answer the question and format the output in the format specified. Skip any preamble text and just generate the JSON.

<format>
{{
  "key_name":"key_value"
}}
</format>
<document>{doc_text}</document>
<question>{question}</question>"""

template2 = """

Given a JSON document, format the dates in the value fields precisely in the provided format. Skip any preamble text and just generate the JSON.

<format>DD/MM/YYYY</format>
<json_document>{json_doc}</json_document>
"""

prompt1 = PromptTemplate(template=template1, input_variables=["doc_text", "question"])
llm_chain = LLMChain(prompt=prompt1, llm=bedrock_llm, verbose=True)

prompt2 = PromptTemplate(template=template2, input_variables=["json_doc"])
llm_chain2 = LLMChain(prompt=prompt2, llm=bedrock_llm, verbose=True)

chain = (
    llm_chain
    | {'json_doc': lambda x: x['text']}
    | llm_chain2
)

std_op = chain.invoke({"doc_text": document[0].page_content,
                       "question": "Can you give me the patient admitted and discharge dates?"})

print(std_op['text'])

{
  "admit_date": "07/09/2020",
  "discharge_date": "08/09/2020"
}

The output of the preceding code sample is a JSON structure with dates 07/09/2020 and 08/09/2020, which are in the format DD/MM/YYYY and are the patient’s admit and discharge date from the hospital, respectively, according to the discharge summary report.
Q&A with Retrieval Augmented Generation
LLMs are known to retain factual information, often referred to as their world knowledge or world view. When fine-tuned, they can produce state-of-the-art results. However, there are constraints to how effectively an LLM can access and manipulate this knowledge. As a result, in tasks that heavily rely on specific knowledge, their performance might not be optimal for certain use cases. For instance, in Q&A scenarios, it’s essential for the model to adhere strictly to the context provided in the document without relying solely on its world knowledge. Deviating from this can lead to misrepresentations, inaccuracies, or even incorrect responses. The most commonly used method to address this problem is known as Retrieval Augmented Generation (RAG). This approach synergizes the strengths of both retrieval models and language models, enhancing the precision and quality of the responses generated.
LLMs can also impose token limitations due to their memory constraints and the limitations of the hardware they run on. To handle this problem, techniques like chunking are used to divide large documents into smaller portions that fit within the token limits of LLMs. On the other hand, embeddings are employed in NLP primarily to capture the semantic meaning of words and their relationships with other words in a high-dimensional space. These embeddings transform words into vectors, allowing models to efficiently process and understand textual data. By understanding the semantic nuances between words and phrases, embeddings enable LLMs to generate coherent and contextually relevant outputs. Note the following key terms:

Chunking – This process breaks down large amounts of text from documents into smaller, meaningful chunks of text.
Embeddings – These are fixed-dimensional vector transformations of each chunk that retain the semantic information from the chunks. These embeddings are subsequently loaded into a vector database.
Vector database – This is a database of word embeddings or vectors that represent the context of words. It acts as a knowledge source that aids NLP tasks in document processing pipelines. The benefit of the vector database here is that it allows only the necessary context to be provided to the LLMs during text generation, as we explain in the following section.

RAG uses the power of embeddings to understand and fetch relevant document segments during the retrieval phase. By doing so, RAG can work within the token limitations of LLMs, ensuring the most pertinent information is selected for generation, resulting in more accurate and contextually relevant outputs.
The following diagram illustrates the integration of these techniques to craft the input to LLMs, enhancing their contextual understanding and enabling more relevant in-context responses. One approach involves similarity search, utilizing both a vector database and chunking. The vector database stores embeddings representing semantic information, and chunking divides text into manageable sections. Using this context from similarity search, LLMs can run tasks such as question answering and domain-specific operations like classification and enrichment.

For this post, we use a RAG-based approach to perform in-context Q&A with documents. In the following code sample, we extract text from a document and then split the document into smaller chunks of text. Chunking is required because we may have large multi-page documents and our LLMs may have token limits. These chunks are then loaded into the vector database for performing similarity search in the subsequent steps. In the following example, we use the Amazon Titan Embed Text v1 model, which performs the vector embeddings of the document chunks:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import BedrockEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.llms import Bedrock

loader = AmazonTextractPDFLoader("amazon_10k.pdf")
document = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=400,
                                               separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
                                               chunk_overlap=0)
texts = text_splitter.split_documents(document)
embeddings = BedrockEmbeddings(client=bedrock,
                               model_id="amazon.titan-embed-text-v1")
db = FAISS.from_documents(documents=texts,
                          embedding=embeddings)

retriever = db.as_retriever(search_type='mmr', search_kwargs={"k": 3})

template = """

Answer the question as truthfully as possible strictly using only the provided text, and if the answer is not contained within the text, say "I don't know". Skip any preamble text and reasoning and give just the answer.

<text>{context}</text>
<question>{question}</question>
<answer>"""

# define the prompt template
qa_prompt = PromptTemplate(template=template, input_variables=["context", "question"])

chain_type_kwargs = {"prompt": qa_prompt, "verbose": False}  # change verbose to True if you need to see what's happening

bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")
qa = RetrievalQA.from_chain_type(
    llm=bedrock_llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs=chain_type_kwargs,
    verbose=False  # change verbose to True if you need to see what's happening
)

question = "Who is the administrator for this plan?"
result = qa.run(question)
print(result.strip())

The code creates a relevant context for the LLM using the chunks of text that are returned by the similarity search action from the vector database. For this example, we use an open-source FAISS vector store as a sample vector database to store vector embeddings of each chunk of text. We then define the vector database as a LangChain retriever, which is passed into the RetrievalQA chain. This internally runs a similarity search query on the vector database that returns the top n (where n=3 in our example) chunks of text that are relevant to the question. Finally, the LLM chain is run with the relevant context (a group of relevant chunks of text) and the question for the LLM to answer. For a step-by-step code walkthrough of Q&A with RAG, refer to the Python notebook on GitHub.
As an alternative to FAISS, you can also use Amazon OpenSearch Service vector database capabilities, Amazon Relational Database Service (Amazon RDS) for PostgreSQL with the pgvector extension as vector databases, or open-source Chroma Database.
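For example, swapping FAISS for the open-source Chroma vector store is typically a small change to the LangChain code shown earlier. The following is a sketch, assuming the chromadb package is installed and reusing the texts and embeddings objects defined above:

from langchain.vectorstores import Chroma

#drop-in replacement for the FAISS store used earlier
db = Chroma.from_documents(documents=texts, embedding=embeddings)
retriever = db.as_retriever(search_type="mmr", search_kwargs={"k": 3})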
Q&A with tabular data
Tabular data within documents can be challenging for LLMs to process because of its structural complexity. Amazon Textract complements LLMs here because it can extract tables from documents as a nested structure of elements such as pages, tables, and cells. Performing Q&A with tabular data is a multi-step process, and can be achieved via self-querying. The following is an overview of the steps:

 Extract tables from documents using Amazon Textract. With Amazon Textract, the tabular structure (rows, columns, headers) can be extracted from a document.
 Store the tabular data into a vector database along with metadata information, such as the header names and the description of each header.
Use the prompt to construct a structured query, using an LLM, to derive the data from the table.
Use the query to extract the relevant table data from the vector database.

For example, in a bank statement, given the prompt “What are the transactions with more than $1000 in deposits,” the LLM would complete the following steps:

Craft a query, such as “Query: transactions”, “filter: greater than (Deposit$)”.
Convert the query into a structured query.
Apply the structured query to the vector database where our table data is stored.

For a step-by-step sample code walkthrough of Q&A with tabular data, refer to the Python notebook on GitHub.
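As an illustration of the first step, the following sketch extracts table cells with the Amazon Textract AnalyzeDocument API via boto3. The bucket and document names are placeholders, and production code would also handle merged cells, selection elements, and multi-page tables:

import boto3

textract = boto3.client("textract")

#run table analysis on a single-page bank statement stored in Amazon S3 (placeholder names)
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-sample-bucket", "Name": "bank-statement.png"}},
    FeatureTypes=["TABLES"],
)

blocks = {b["Id"]: b for b in response["Blocks"]}

def cell_text(cell):
    #concatenate the WORD blocks referenced by a CELL block
    words = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] == "CHILD":
            words += [blocks[i]["Text"] for i in rel["Ids"] if blocks[i]["BlockType"] == "WORD"]
    return " ".join(words)

#print each cell with its row and column position
for block in response["Blocks"]:
    if block["BlockType"] == "CELL":
        print(block["RowIndex"], block["ColumnIndex"], cell_text(block))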
Templating and normalizations
In this section, we look at how to use prompt engineering techniques and LangChain’s built-in mechanism to generate an output with extractions from a document in a specified schema. We also perform some standardization on the extracted data, using the techniques discussed previously. We start by defining a template for our desired output. This will serve as a schema and encapsulate the details about each entity we want to extract from the document’s text.

output_template = {
    "doctor_name": {"type": "string", "description": "The doctor or provider's full name"},
    "provider_id": {"type": "string", "description": "The doctor or provider's ID"},
    "patient_name": {"type": "string", "description": "The patient's full name"},

}

Note that for each of the entities, we use the description to explain what that entity is, which helps the LLM extract the value from the document’s text. In the following sample code, we use this template to craft our prompt for the LLM, along with the text extracted from the document using AmazonTextractPDFLoader, and subsequently perform inference with the model:

from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

template = """

You are a helpful assistant. Please extract the following details from the document and format the output as JSON using the keys. Skip any preamble text and generate the final answer.

<details>
{details}
</details>

<keys>
{keys}
</keys>

<document>
{doc_text}
</document>

<final_answer>"""

details = "\n".join([f"{key}: {value['description']}" for key, value in output_template.items()])
keys = "\n".join([f"{key}" for key, value in output_template.items()])

prompt = PromptTemplate(template=template, input_variables=["details", "keys", "doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
output = llm_chain.run({"doc_text": full_text, "details": details, "keys": keys})

print(output)

{
  "doctor_name": "Mateo Jackson, Phd",
  "provider_id": "XA/7B/00338763",
  "patient_name": "John Doe",

}

As you can see, the {keys} part of the prompt is the keys from our template, and the {details} are the keys along with their description. In this case, we don’t prompt the model explicitly with the format of the output other than specifying in the instruction to generate the output in JSON format. This works for the most part; however, because the output from LLMs is non-deterministic text generation, we want to specify the format explicitly as part of the instruction in the prompt. To solve this, we can use LangChain’s structured output parser module to take advantage of the automated prompt engineering that helps convert our template to a format instruction prompt. We use the template defined earlier to generate the format instruction prompt as follows:

from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

response_schemas = list()

for key, value in output_template.items():
    schema = ResponseSchema(name=key, description=value['description'], type=value['type'])
    response_schemas.append(schema)
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
format_instructions = output_parser.get_format_instructions()
print(format_instructions)

The format_instructions variable now holds the format instruction prompt:

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
  "doctor_name": string  // The doctor or provider's full name
  "provider_id": string  // The doctor or provider's ID
  "patient_name": string  // The patient's full name

}
```

We then use this variable within our original prompt as an instruction to the LLM so that it extracts and formats the output in the desired schema by making a small modification to our prompt:

template = """

You are a helpful assistant. Please extract the following details from the document and strictly follow the instructions described in the format instructions to format the output. Skip any preamble text and generate the final answer. Do not generate an incomplete answer.

<details>
{details}
</details>

<format_instructions>
{format_instructions}
</format_instructions>

<document>
{doc_text}
</document>

<final_answer>"""

So far, we have only extracted the data out of the document in a desired schema. However, we still need to perform some standardization. For example, we want the patient’s admitted date and discharge date to be extracted in DD/MM/YYYY format. In this case, we augment the description of the key with the formatting instruction:

new_output_template = {

    "admitted_date": {"type": "string", "description": "Date the patient was admitted to the hospital, this should be formatted in DD/MM/YYYY format."},
    "discharge_date": {"type": "string", "description": "Date the patient was discharged from the hospital, this should be formatted in DD/MM/YYYY format."},

}

Refer to the Python notebook in GitHub for a full step-by-step walkthrough and explanation.
Spellchecks and corrections
LLMs have demonstrated remarkable abilities in understanding and generating human-like text. One of the lesser-discussed but immensely useful applications of LLMs is their potential in grammatical checks and sentence correction in documents. Unlike traditional grammar checkers that rely on a set of predefined rules, LLMs use patterns that they have identified from vast amounts of text data to determine what constitutes correct or fluent language. This means they can detect nuances, context, and subtleties that rule-based systems might miss.
Imagine the text extracted from a patient discharge summary that reads “Patient Jon Doe, who was admittd with sever pnemonia, has shown significant improvemnt and can be safely discharged. Followups are scheduled for nex week.” A traditional spellchecker might recognize “admittd,” “pnemonia,” “improvemnt,” and “nex” as errors. However, the context of these errors could lead to further mistakes or generic suggestions. An LLM, equipped with its extensive training, might suggest: “Patient John Doe, who was admitted with severe pneumonia, has shown significant improvement and can be safely discharged. Follow-ups are scheduled for next week.”
The following is a poorly handwritten sample document with the same text as explained previously.

We extract the document with the Amazon Textract document loader and then instruct the LLM, via prompt engineering, to rectify the extracted text and correct any spelling or grammatical mistakes:

from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/hand_written_note.pdf")
document = loader.load()

template = """

Given a detailed 'Document', perform spelling and grammatical corrections. Ensure the output is coherent, polished, and free from errors. Skip any preamble text and give the answer.

<document>{doc_text}</document>
<answer>
"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")
llm_chain = LLMChain(prompt=prompt, llm=llm)

try:
    txt = document[0].page_content
    std_op = llm_chain.run({"doc_text": txt})

    print("Extracted text")
    print("==============")
    print(txt)

    print("\nCorrected text")
    print("==============")
    print(std_op.strip())
    print("\n")
except Exception as e:
    print(str(e))

The output of the preceding code shows the original text extracted by the document loader followed by the corrected text generated by the LLM:

Extracted text
==============
Patient John Doe, who was ad mitta with sever pnequonia, has shown Signif i art improumet & can be safely discharged. Follow w/s are scheduled for nen week. Patient John Doe, who was ad mitta with sever pnequonia, has shown Signif i art improumet & can be safely discharged. Follow w/s are scheduled for nen week.

Corrected text
==============
Patient John Doe, who was admitted with severe pneumonia, has shown significant improvement and can be safely discharged. Follow-up appointments are scheduled for next week.

Keep in mind that as powerful as LLMs are, it’s essential to view their suggestions as just that—suggestions. Although they capture the intricacies of language impressively well, they aren’t infallible. Some suggestions might change the intended meaning or tone of the original text. Therefore, it’s crucial for human reviewers to use LLM-generated corrections as a guide, not an absolute. The collaboration of human intuition with LLM capabilities promises a future where our written communication is not just error-free, but also richer and more nuanced.
Conclusion
Generative AI is changing how you can process documents with IDP to derive insights. In the post Enhancing AWS intelligent document processing with generative AI, we discussed the various stages of the pipeline and how AWS customer Ricoh is enhancing their IDP pipeline with LLMs. In this post, we discussed various mechanisms of augmenting the IDP workflow with LLMs via Amazon Bedrock, Amazon Textract, and the popular LangChain framework. You can get started with the new Amazon Textract document loader with LangChain today using the sample notebooks available in our GitHub repository. For more information on working with generative AI on AWS, refer to Announcing New Tools for Building with Generative AI on AWS.

About the Authors
Sonali Sahu is leading intelligent document processing with the AI/ML services team in AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.
Anjan Biswas is a Senior AI Services Solutions Architect with a focus on AI/ML and Data Analytics. Anjan is part of the world-wide AI services team and works with customers to help them understand and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations, and is actively helping customers get started and scale on AWS AI services.
Chinmayee Rane is an AI/ML Specialist Solutions Architect at Amazon Web Services. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing and generative AI solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.

T-Mobile US, Inc. uses artificial intelligence through Amazon Transcri …

This post is co-authored by Dhurjati Brahma, Senior Systems Architect at T-Mobile US, Inc.; Jim Chao, Principal Engineer/Architect at T-Mobile US, Inc.; and Nicholas Zellerhoff, Associate Systems Architect at T-Mobile US, Inc.
T-Mobile US, Inc. provides a Voicemail to Text service to its customers, which allows customers to quickly read through their voicemails and respond to and manage messages in any order without having to dial into their voicemail mailbox. This service is delivered by the T-Mobile Voicemail system and uses Amazon Transcribe to convert voicemail messages to text. In 2023, T-Mobile launched the Voicemail to Text Translate feature. Powered by Amazon Translate, this feature lets customers request voicemail transcriptions in their language of choice from the native Visual Voicemail application available on major Android manufacturers’ devices, beginning with flagship devices and extending to all future devices from major device partners.

Native Visual Voicemail dialer featuring the Voicemail to Text Translate feature (all models)

The history
Two years ago, in 2021, T-Mobile engineering teams partnered with AWS to launch a new AI-powered feature called Voicemail to Text with automatic language detection and to improve quality and performance for customers. Voicemail to Text provides customers with the additional benefit of receiving voicemail transcriptions at no extra cost. Voicemail to Text is available in 39 different spoken languages and dialects and uses the automatic language detection provided by Amazon Transcribe. These languages include English and Spanish, as well as many European, Middle Eastern, Asian, and African languages; the full list of supported languages can be found in the Amazon Transcribe documentation. Since its introduction, usage of the Voicemail to Text service has grown, and as of July 2023 it transcribes 126 million voicemail messages per month. T-Mobile has partnered with AWS to analyze the key application metrics of this service, such as language distribution, the number of messages per language, and the daily total and unique active customers. This data helped in scaling the service and making key business decisions to improve the Voicemail to Text customer experience.
The challenge
Upon analysis of the weekly language distribution of voicemail transcriptions, T-Mobile observed that approximately 10 percent of voicemail transcriptions were received in a language other than US English, with Spanish being the predominant language.

In addition, market research based on U.S. Census Bureau data showed that approximately 22 percent of the US population spoke a language other than English. This showed a need for a Voicemail to Text translation feature that could bridge this language gap and drove T-Mobile and AWS teams to brainstorm a solution. The idea was to give T-Mobile’s Voicemail to Text customers an option to receive voicemail transcriptions in the language of their choice delivered through SMS, email, or through the Visual Voicemail (VVM) application.
The solution
T-Mobile decided to use Amazon Translate, a neural machine translation (MT) service, to augment its existing Voicemail to Text service with real-time text translation between supported languages, because Amazon Translate provides the high-quality translations the service requires. T-Mobile already had its voicemail system connected to AWS through a private AWS Direct Connect link and was using the Amazon Transcribe API to get transcriptions. Following the same design pattern, T-Mobile added an integration with the Amazon Translate API to translate voicemail transcripts from the source language detected by Amazon Transcribe to the customer’s preferred language.
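T-Mobile’s production integration runs inside its voicemail platform, but the underlying SDK call is straightforward. The following is a minimal, illustrative sketch using the AWS SDK for Python (Boto3); the transcript text and language codes are placeholder values, not taken from T-Mobile’s implementation:

import boto3

translate = boto3.client("translate")

# Translate a sample voicemail transcript from the language detected by
# Amazon Transcribe into the customer's preferred language.
response = translate.translate_text(
    Text="Your package has arrived at the front desk.",  # placeholder transcript
    SourceLanguageCode="en",  # e.g., the language code reported by Amazon Transcribe
    TargetLanguageCode="es",  # the customer's chosen language
)
print(response["TranslatedText"])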
Here is a high-level architecture diagram that illustrates the T-Mobile Voicemail to Text Translate solution.

From a customer perspective, to enable the Visual Voicemail Translate feature, a customer needs the Voicemail to Text feature service operator code (SOC) enabled on their mobile plan and one of the supported major Android manufacturer devices with the Translate feature API enabled. The customer can then visit the Visual Voicemail settings page to select a language from a list of 75 different languages and dialects supported by Amazon Translate. This enables customers to receive voicemail transcriptions in the supported language of their choice.
The results
With Amazon Translate, T-Mobile was able to deliver a delightful new customer experience that accommodates the language preference of its customers and makes voicemail more accessible to people who speak various languages. This new capability helps to break language barriers by making it easier for speakers of different languages to communicate.
Conclusion
By using Amazon Transcribe and Amazon Translate language AI services, T-Mobile was able to enhance its voicemail service by delivering message transcriptions in a language that customers can understand. By choosing AWS managed AI services, T-Mobile was able to expedite the delivery of this new customer experience and avoid the operational burden of maintaining additional servers and software in its data centers. With Amazon Transcribe and Amazon Translate, the Voicemail to Text and Voicemail to Text Translate services are delivered with low latency and high accuracy.
For more information, check out Getting started with Amazon Translate and Getting started with Amazon Transcribe to explore how you can use these services with your applications. Follow the Artificial Intelligence category on AWS Machine Learning Blog to stay up to date with new capabilities and use cases for various AWS AI services.

About the Authors
Dhurjati Brahma is a Senior Systems Architect at T-Mobile US, Inc. He has more than 15 years of experience in designing, building, and managing robust and scalable virtualized messaging solutions within T-Mobile’s network. He is passionate about collaborating with various cross-functional teams and vendors to securely integrate T-Mobile’s messaging systems with the public cloud to launch exciting new products and services for T-Mobile’s customers. He holds a Master’s degree in Electrical Engineering from the University of Alabama at Birmingham. Outside of work, he enjoys going on hikes, listening to classical music, practicing meditation, and spending time with his family and friends.
Jim Chao is a Principal Engineer/Architect in Core Messaging Service Design at T-Mobile US, Inc. He has more than two decades of experience in the design and architecture of mobile messaging service systems and platforms. Lately, he has been dedicating his time to the next generation of messaging services using machine learning and generative AI. He holds a Master’s degree in Computer Information Systems. Outside of work, he spends time with family, devotes himself to religious study and practice, and travels to religious sites above 5,000 meters in the mountains of Tibet.
Nicholas Zellerhoff is an Associate Systems Architect for T-Mobile as part of the Service Technology Development team functioning as lead development engineer for Native Visual Voicemail services. When not in office, Nick enjoys everything outdoors, from backyard BBQs with friends to backcountry hiking in the North Cascades.
Alex Bulatkin is a solutions architect at AWS. He enjoys helping communication service providers build innovative solutions in AWS that are redefining the telecom industry. He is passionate about working with customers on bringing the power of AWS AI services into their applications. Alex is based in the Denver metropolitan area and likes to hike, ski, and snowboard.
Prabhakaran Balasubramaniam is a Principal Customer Solutions Manager at AWS. He loves to help telecom customers leverage new technologies to solve their problems. Prabhakaran is based in the Dallas-Fort Worth area and likes sports.

PyTorchEdge Unveils ExecuTorch: Empowering On-Device Inference for Mob …

In a groundbreaking move, PyTorch Edge introduced its new component, ExecuTorch, a cutting-edge solution poised to revolutionize on-device inference capabilities across mobile and edge devices. This ambitious endeavor has garnered support from industry stalwarts, including Arm, Apple, and Qualcomm Innovation Center, cementing ExecuTorch’s position as a trailblazing force in the field of on-device AI.

ExecuTorch is a pivotal step towards addressing the fragmentation prevailing within the on-device AI ecosystem. With a meticulously crafted design offering extension points for seamless third-party integration, this innovation accelerates the execution of machine learning (ML) models on specialized hardware. Notably, esteemed partners have contributed custom delegate implementations to optimize model inference execution on their respective hardware platforms, further enhancing ExecuTorch’s efficacy.

The creators of ExecuTorch have thoughtfully provided extensive documentation, offering in-depth insights into its architecture, its high-level components, and exemplar ML models running on the platform.

Additionally, comprehensive end-to-end tutorials are available, guiding users through the process of exporting and executing models on a diverse range of hardware devices. The PyTorch Edge community eagerly anticipates witnessing the inventive applications of ExecuTorch that will undoubtedly emerge.

At the heart of ExecuTorch lies a compact runtime featuring a lightweight operator registry capable of catering to the expansive PyTorch ecosystem of models. This runtime provides a streamlined pathway to execute PyTorch programs on an array of edge devices, spanning from mobile phones to embedded hardware. ExecuTorch ships with a Software Developer Kit (SDK) and toolchain that deliver an intuitive user experience for ML developers. This workflow empowers developers to move from model authoring to training and, finally, to device delegation within a single PyTorch environment. The suite of tools also enables on-device model profiling and offers improved methods for debugging the original PyTorch model.
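For orientation, here is a minimal sketch of the export flow described in the ExecuTorch documentation at launch: capture the model with torch.export, lower it to the Edge dialect, and serialize it to a .pte file for the on-device runtime. The tiny model and file name are placeholders, and exact APIs may vary between releases, so treat this as illustrative rather than definitive.

import torch
from executorch.exir import to_edge

# A tiny placeholder model; any eager-mode PyTorch module follows the same path.
class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

model = TinyModel().eval()
example_inputs = (torch.randn(1, 8),)

# 1. Capture the program with torch.export.
exported_program = torch.export.export(model, example_inputs)

# 2. Lower to the Edge dialect; hardware-specific delegates can be applied at this stage.
edge_program = to_edge(exported_program)

# 3. Convert to the ExecuTorch format and save a .pte file for the on-device runtime.
executorch_program = edge_program.to_executorch()
with open("tiny_model.pte", "wb") as f:
    f.write(executorch_program.buffer)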

Built from the ground up with a composable architecture, ExecuTorch empowers ML developers to make informed decisions regarding the components they leverage and offers entry points for extension if required. This design confers several benefits to the ML community, including enhanced portability, productivity gains, and superior performance. The platform demonstrates compatibility across diverse computing platforms, from high-end mobile phones to resource-constrained embedded systems and microcontrollers.

PyTorch Edge’s visionary approach extends beyond ExecuTorch, aiming to bridge the gap between research and production environments. By leveraging the capabilities of PyTorch, ML engineers can now seamlessly author and deploy models across dynamic and evolving environments, encompassing servers, mobile devices, and embedded hardware. This inclusive approach caters to the increasing demand for on-device solutions in domains such as Augmented Reality (AR), Virtual Reality (VR), Mixed Reality (MR), Mobile, IoT, and beyond.

PyTorch Edge envisions a future where research seamlessly transitions to production, offering a comprehensive framework for deploying a wide range of ML models to edge devices. The platform’s core components exhibit portability, ensuring compatibility across devices with varying hardware configurations and performance capabilities. PyTorch Edge paves the way for a thriving ecosystem in the realm of on-device AI by empowering developers with well-defined entry points and representations.

In conclusion, ExecuTorch stands as a testament to PyTorch Edge’s commitment to advancing on-device AI. With the backing of industry leaders and a forward-thinking approach, the platform heralds a new era of on-device inference capabilities across mobile and edge devices, promising innovative breakthroughs in the field of AI.

Check out the Reference Article. All Credit For This Research Goes To the Researchers on This Project.

The post PyTorchEdge Unveils ExecuTorch: Empowering On-Device Inference for Mobile and Edge Devices appeared first on MarkTechPost.

Deciphering Memorization in Neural Networks: A Deep Dive into Model Si …

Statistical learning theory suggests that a learner must balance memorization of the training data against transfer to test samples. However, the success of overparameterized neural models casts doubt on this theory: these models can memorize yet still generalize well, as seen, for example, in their ability to fit random labels perfectly. In practice, such models are commonly trained to attain perfect accuracy on the training set, that is, to interpolate it. This has sparked a slew of studies investigating the generalizability of these models.

Feldman recently showed that memorization may be required for generalization in certain contexts. Here, “memorization” is defined by a stability-based term with theoretical underpinnings; high-memorization instances are those that the model can only correctly classify if they are included in the training set. For practical neural networks, this term permits estimation of the degree of memorization of a training sample. Feldman and Zhang examined a ResNet’s memorization profile while using it to classify images on industry-standard benchmarks.
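For reference, Feldman’s stability-based memorization score for the i-th training example (x_i, y_i) in a training set S, under a (randomized) learning algorithm A, is commonly written as follows; the notation is a standard rendering rather than a quotation from the paper:

\mathrm{mem}(\mathcal{A}, S, i) \;=\; \Pr_{h \sim \mathcal{A}(S)}\big[h(x_i) = y_i\big] \;-\; \Pr_{h \sim \mathcal{A}(S \setminus \{i\})}\big[h(x_i) = y_i\big]

Intuitively, an example has a high memorization score when models trained with it label it correctly but models trained without it do not.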

While this is an intriguing initial look at what real-world models remember, a fundamental question remains: do larger neural models memorize more? New York-based Google researchers answer this question empirically, providing a comprehensive look at image classification benchmarks. They discover that training examples display a surprising variety of memorization trajectories across model sizes, with some samples showing cap-shaped or growing memorization and others showing decreasing memorization under larger models.

To produce high-quality models of varied sizes, practitioners use a systematic process called knowledge distillation, which entails training high-quality small (student) models under the guidance of high-performing large (teacher) models.
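The paper builds on this setup rather than defining it, but for readers unfamiliar with distillation, the standard objective combines a soft-target term against the teacher with the usual hard-label loss, as in this illustrative PyTorch sketch (the temperature and weighting values are arbitrary placeholders):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the one-hot labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard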

Feldman’s concept of memorization has been used to theoretically examine the relationship between memorization and generalization across a range of model sizes. The following are their contributions based on the results of controlled experiments: 

A quantitative investigation of the relationship between model complexity (such as the depth or width of a ResNet) and memorization for image classifiers is presented. The primary findings show that as the complexity of the model increases, the distribution of memorization across examples becomes increasingly bimodal. They also note that other computationally tractable proxies for memorization and example difficulty fail to capture this essential trend.

They give instances displaying different memorization score trajectories across model sizes, and they identify the four most frequent trajectory types, including those where memorization increases with model complexity, to investigate the bimodal memorization trend further. Specifically, ambiguous and mislabeled cases are found to follow this pattern.

Regarding samples that the one-hot (i.e., non-distilled) student memorizes, the researchers conclude with a quantitative study showing that distillation tends to impede memorization. Interestingly, they find that memorization is hampered primarily for the cases in which memorization increases with model size. This finding suggests that distillation aids generalization by reducing the need to memorize such challenging examples.

The researchers begin by quantitatively analyzing the relationship between model complexity (the depth and width of a ResNet used for image classification) and memorization. They provide a graphic representation of the relationship between ResNet depth and memorization score on two well-known datasets (CIFAR-100 and ImageNet). Their investigation reveals that, contrary to their initial expectations, the memorization score decreases once the depth exceeds 20.

The researchers conclude that the distribution of memorization across diverse examples becomes increasingly bimodal as model complexity increases. They also point out a problem with current computationally feasible approaches for evaluating memorization and example difficulty by showing that these methods fail to capture this crucial pattern.

The study group gives examples with varied memorization score trajectories across different model sizes to dig deeper into the bimodal memorization pattern. They single out four main classes of trajectories, one of which involves memorization increasing with model complexity. In particular, they discover that both ambiguous and mislabeled samples tend to follow this pattern.

The study concludes with a quantitative analysis showing that distillation, by which knowledge is transferred from a large teacher model to a smaller student model, is associated with a decrease in memorization. The effect is most noticeable for samples memorized by the one-hot, non-distilled student model. It is interesting to note that distillation predominantly reduces memorization for examples whose memorization rises with increased model size. Based on this evidence, the authors conclude that distillation improves generalization by preventing the student from memorizing too many difficult examples.

In Conclusion:

The discovery by Google researchers has substantial practical implications and suggests future directions for research. First, it is important to use caution when measuring the memorization of specific data using only proxies. Various metrics defined in terms of model training or model inference have been proposed as effective surrogates for the memorization score in prior publications. These proxies show a high agreement rate with memorization, yet the researchers have found that they differ greatly in distribution and fail to represent essential features of the memorization behavior of real-world models. This suggests a path forward for locating efficiently computable proxies for memorization scores. Second, example difficulty has previously been characterized with respect to a single, predetermined model size. The investigation’s results highlight the value of considering several model sizes when characterizing examples. For instance, Feldman defines the long-tail examples of a dataset as the ones with the highest memorization score for a certain architecture. The results show that memorization information for one model size may not apply to another.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project.

The post Deciphering Memorization in Neural Networks: A Deep Dive into Model Size, Memorization, and Generalization on Image Classification Benchmarks appeared first on MarkTechPost.

Meet BOSS: A Reinforcement Learning (RL) Framework that Trains Agents …

Introducing BOSS (Bootstrapping your own SkillS): a groundbreaking approach that leverages large language models to autonomously build a versatile skill library for tackling intricate tasks with minimal guidance. Compared to conventional unsupervised skill acquisition techniques and simplistic bootstrapping methods, BOSS performs better in executing unfamiliar tasks within novel environments. This innovation marks a significant leap in autonomous skill acquisition and application.

Reinforcement learning seeks to optimize policies in Markov decision processes to maximize expected returns. Past RL research pre-trained reusable skills for complex tasks, and unsupervised RL, focusing on curiosity, controllability, and diversity, learned skills without human input. Language has also been used for skill parameterization and open-loop planning. BOSS extends skill repertoires with large language models, guiding exploration and rewarding skill-chain completion, yielding higher success rates in long-horizon task execution.

Traditional robot learning relies heavily on supervision, while humans excel at learning complex tasks independently. Researchers introduced BOSS as a framework to autonomously acquire diverse long-horizon skills with minimal human intervention. Through skill bootstrapping guided by large language models (LLMs), BOSS progressively builds and combines skills to handle complex tasks. Unsupervised environment interactions enhance its policy robustness for solving challenging tasks in new environments.

BOSS introduces a two-phase framework. In the first phase, it acquires a foundational skill set using unsupervised RL objectives. The second phase, skill bootstrapping, employs LLMs to guide skill chaining and rewards based on skill completion. This approach allows agents to construct complex behaviors from basic skills. Experiments in household environments show that LLM-guided bootstrapping outperforms naïve bootstrapping and prior unsupervised methods in executing unfamiliar long-horizon tasks in new settings.
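To make the two-phase description above concrete, here is a deliberately simplified sketch of the skill-bootstrapping loop in Python. It is not the authors’ implementation: the objects passed in (llm, policy, skill_library, env) and all of their methods are hypothetical stand-ins for the components the paper describes.

def skill_bootstrapping(llm, policy, skill_library, env, num_rounds=100):
    """Phase 2 of BOSS, sketched: chain existing skills under LLM guidance."""
    for _ in range(num_rounds):
        # The LLM proposes a plausible chain of already-learned skills.
        chain = llm.propose_skill_chain(list(skill_library), env.describe())

        # Roll out the chain; the agent is rewarded as each skill completes.
        env.reset()
        completed = []
        for skill in chain:
            if not policy.execute(env, skill):
                break
            completed.append(skill)

        # Successfully chained skills become a new, longer-horizon skill, and
        # the policy is fine-tuned on the composed behavior.
        if len(completed) > 1:
            skill_library.add(tuple(completed))
            policy.update(env, completed)

    return policy, skill_library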

Experimental findings confirm that BOSS, guided by LLMs, excels in solving extended household tasks in novel settings, surpassing prior LLM-based planning and unsupervised exploration methods. Results report inter-quartile means and standard deviations of oracle-normalized returns and oracle-normalized success rates for tasks of varying lengths in ALFRED evaluations. Agents trained with LLM-guided bootstrapping outperform those trained with naïve bootstrapping and prior unsupervised methods. BOSS can autonomously acquire diverse, complex behaviors from basic skills, showcasing its potential for expert-free robotic skill acquisition.

The BOSS framework, guided by LLMs, excels in autonomously solving intricate tasks without expert guidance. Agents trained with LLM-guided bootstrapping outperform naïve bootstrapping and prior unsupervised methods when executing unfamiliar tasks in new environments. Realistic household experiments confirm BOSS’s effectiveness in acquiring diverse, complex behaviors from basic skills, emphasizing its potential for autonomous robotic skill acquisition. BOSS also demonstrates promise in connecting reinforcement learning with natural language understanding, utilizing pre-trained language models for guided learning.

Future research directions may include:

Investigating reset-free RL for autonomous skill learning.

Proposing long-horizon task breakdown with BOSS’s skill-chaining approach.

Expanding unsupervised RL for low-level skill acquisition.

Enhancing the integration of reinforcement learning with natural language understanding in the BOSS framework.

Applying BOSS to diverse domains and evaluating its performance in various environments and task contexts also offers potential for further exploration.

Check out the Paper and Project. All Credit For This Research Goes To the Researchers on This Project.

The post Meet BOSS: A Reinforcement Learning (RL) Framework that Trains Agents to Solve New Tasks in New Environments with LLM Guidance appeared first on MarkTechPost.