i-genie, Author at i-genie.co.uk

The Legal Accountability of AI-Generated Deepfakes in Election Misinfo …

Posted on June 2, 2025 by i-genie

How Deepfakes Are Created

Generative AI models enable the creation of highly realistic fake media. Most deepfakes today are produced by training deep neural networks on real images, video or audio of a target person. The two predominant AI architectures are generative adversarial networks (GANs) and autoencoders. A GAN consists of a generator network that produces synthetic images and a discriminator network that tries to distinguish fakes from real data. Through iterative training, the generator learns to produce outputs that increasingly fool the discriminator¹. Autoencoder-based tools similarly learn to encode a target face and then decode it onto a source video. In practice, deepfake creators use accessible software: open-source tools like DeepFaceLab and FaceSwap dominate video face-swapping (one estimate suggests DeepFaceLab was used for over 95% of known deepfake videos)². Voice-cloning tools (often built on similar AI principles) can mimic a person’s speech from minutes of audio. Commercial platforms like Synthesia allow text-to-video avatars (turning typed scripts into lifelike “spokespeople”), which have already been misused in disinformation campaigns³. Even mobile apps (e.g. FaceApp, Zao) let users do basic face swaps in minutes⁴. In short, advances in GANs and related models make deepfakes cheaper and easier to generate than ever.

Diagram of a generative adversarial network (GAN): A generator network creates fake images from random input and a discriminator network distinguishes fakes from real examples. Over time the generator improves until its outputs “fool” the discriminator⁵
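
For reference, the adversarial training described above is commonly written as a two-player minimax objective. This is the standard textbook formulation rather than anything specific to the tools named in this article, and the symbols are introduced here for illustration: G is the generator, D the discriminator, x a real sample, and z the random input fed to the generator.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to raise V by scoring real samples near 1 and generated samples near 0, while the generator tries to lower it by making D(G(z)) approach 1.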

During creation, a deepfake algorithm is typically trained on a large dataset of real images or audio from the target. The more varied and high-quality the training data, the more realistic the deepfake. The output often then undergoes post-processing (color adjustments, lip-syncing refinements) to enhance believability¹. Technical defenses focus on two fronts: detection and authentication. Detection uses AI models to spot inconsistencies (blinking irregularities, audio artifacts or metadata mismatches) that betray a synthetic origin⁵. Authentication embeds markers before dissemination – for example, invisible watermarks or cryptographically signed metadata indicating authenticity⁶. The EU AI Act will soon mandate that major AI content providers embed machine-readable “watermark” signals in synthetic media⁷. However, as GAO notes, detection is an arms race – even a marked deepfake can sometimes evade notice – and labels alone don’t stop false narratives from spreading⁸⁹.
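
As a rough illustration of the authentication idea (signed metadata bound to a piece of media), the following Python sketch hashes the media bytes and signs the resulting manifest. It is a simplification under stated assumptions: real provenance schemes typically rely on public-key signatures and certificates, whereas this sketch uses a shared-secret HMAC as a stand-in, and the function names `sign_media` and `verify_media` are hypothetical.

```python
import hashlib
import hmac
import json


def sign_media(media_bytes: bytes, metadata: dict, key: bytes) -> dict:
    """Return a manifest binding the metadata to a hash of the media."""
    manifest = dict(metadata, sha256=hashlib.sha256(media_bytes).hexdigest())
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    # Assumption: HMAC stands in for the public-key signature a real scheme would use.
    manifest["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return manifest


def verify_media(media_bytes: bytes, manifest: dict, key: bytes) -> bool:
    """Check that neither the media nor its metadata changed after signing."""
    claimed = {k: v for k, v in manifest.items() if k != "signature"}
    if claimed.get("sha256") != hashlib.sha256(media_bytes).hexdigest():
        return False
    payload = json.dumps(claimed, sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, manifest.get("signature", ""))
```

Verification fails if either the media bytes or any metadata field is altered after signing, which is the property an authenticity marker needs. It says nothing about whether the signed content was truthful in the first place.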

Deepfakes in Recent Elections: Examples

Deepfakes and AI-generated imagery have already made headlines in election cycles around the world. In the 2024 U.S. primary season, a digitally altered audio robocall mimicked President Biden’s voice urging Democrats not to vote in the New Hampshire primary. The political consultant behind the call, Steve Kramer, was later fined $6 million by the FCC and indicted under existing laws¹⁰¹¹. (Importantly, FCC rules on robocalls applied regardless of AI: the perpetrator could have used a voice actor or recording instead.) Also in 2024, former President Trump posted on social media a collage implying that pop singer Taylor Swift endorsed his campaign, using AI-generated images of Swift in “Swifties for Trump” shirts¹². The posts sparked media uproar, though analysts noted the same effect could have been achieved without AI (e.g., by photoshopping text on real images)¹². Similarly, Elon Musk’s X platform carried AI-generated clips, including a parody “ad” featuring an AI clone of Vice President Harris’s voice¹³.

Beyond the U.S., deepfake-like content has appeared globally. In Indonesia’s 2024 presidential election, a video surfaced on social media in which a convincingly generated image of the late President Suharto appeared to endorse the candidate of the Golkar Party. Days later, the endorsed candidate (who is Suharto’s son-in-law) won the presidency¹⁴. In Bangladesh, a viral deepfake video superimposed the face of opposition leader Rumeen Farhana onto a bikini-clad body – an incendiary fabrication designed to discredit her in the conservative Muslim-majority society¹⁵. Moldova’s pro-Western President Maia Sandu has been repeatedly targeted by AI-driven disinformation; one deepfake video falsely showed her resigning and endorsing a Russian-friendly party, apparently to sow distrust in the electoral process¹⁶. Even in Taiwan (amidst tensions with China), a TikTok clip circulated that synthetically portrayed a U.S. politician making foreign-policy statements – stoking confusion ahead of Taiwanese elections¹⁷. In Slovakia’s recent campaign, AI-generated audio mimicking the liberal party leader suggested he plotted vote-rigging and beer-price hikes – instantly spreading on social media just days before the election¹⁸. These examples show that deepfakes have touched diverse polities (from Bangladesh and Indonesia to Moldova, Slovakia, India and beyond), often aiming to undermine candidates or confuse voters¹⁵¹⁸.

Notably, many of the most viral “deepfakes” in 2024 were actually circulated as obvious memes or claims, rather than subtle deceptions. Experts observed that outright undetectable AI deepfakes were relatively rare; more common were AI-generated memes plainly shared by partisans, or cheaply doctored “cheapfakes” made with basic editing tools¹³¹⁹. For instance, social media was awash with memes of Kamala Harris in Soviet garb or of Black Americans holding Trump signs¹³, but these were typically used satirically, not meant to be secretly believed. Nonetheless, even unsophisticated fakes can sway opinion: a U.S. study found that false presidential ads (not necessarily AI-made) did change voter attitudes in swing states. In sum, deepfakes are a real and growing phenomenon in election campaigns²⁰²¹ worldwide – a trend taken seriously by voters and regulators alike.

U.S. Legal Framework and Accountability

In the U.S., deepfake creators and distributors of election misinformation face a patchwork of legal tools, but no single comprehensive federal “deepfake law.” Existing laws relevant to disinformation include statutes against impersonating government officials, electioneering (such as the Bipartisan Campaign Reform Act, which requires disclaimers on political ads), and targeted statutes like criminal electioneering communications. In some cases ordinary laws have been stretched: the New Hampshire robocall case relied on the Telephone Consumer Protection Act and mail/telemarketing fraud provisions, resulting in the $6M fine and a criminal charge. Similarly, voice impostors can potentially violate laws against “false advertising” or “unlawful corporate communications.” However, these laws were enacted before AI, and litigators have warned they often do not fit neatly. For example, deceptive deepfake claims not tied to a specific victim do not easily fit into defamation or privacy torts. Voter intimidation laws (prohibiting threats or coercion) also leave a gap for non-threatening falsehoods about voting logistics or endorsements.

Recognizing these gaps, some courts and agencies are invoking other theories. The U.S. Department of Justice has recently charged individuals under broad fraud statutes (e.g. for a plot to impersonate an aide to swing votes in 2020), and state attorneys general have considered deepfake misinformation as interference with voting rights. Notably, the Federal Election Commission (FEC) is preparing to enforce new rules: in April 2024 it issued an advisory opinion limiting “non-candidate electioneering communications” that use falsified media, effectively requiring that political ads use only real images of the candidate. If finalized, that would make it unlawful for campaigns to pay for ads depicting a candidate saying things they never did. Similarly, the Federal Trade Commission (FTC) and Department of Justice (DOJ) have signaled that purely commercial deepfakes could violate consumer protection or election laws (for example, liability for mass false impersonation or for foreign-funded electioneering).

U.S. Legislation and Proposals

Federal lawmakers have proposed new statutes. The DEEPFAKES Accountability Act (H.R.5586 in the 118th Congress) would, among other things, impose a disclosure requirement: political ads featuring a manipulated media likeness would need clear disclaimers identifying the content as synthetic. It also increases penalties for producing false election videos or audio intended to influence the vote. While not yet enacted, supporters argue it would provide a uniform rule for all federal and state campaigns. The Brennan Center supports transparency requirements over outright bans, suggesting laws should narrowly target deceptive deepfakes in paid ads or certain categories (e.g. false claims about time/place/manner of voting) while carving out parody and news coverage.

At the state level, over 20 states have passed deepfake laws specifically for elections. For example, Florida and California forbid distributing falsified audio/visual media of candidates with intent to deceive voters (though Florida’s law exempts parody). Some states (like Texas) define “deepfake” in statutes and allow candidates to sue or revoke candidacies of violators. These measures have had mixed success: courts have struck down overly broad provisions that acted as prior restraints (e.g. Minnesota’s 2023 law was challenged for threatening injunctions against anyone “reasonably believed” to violate it). Critically, these state laws raise First Amendment issues: political speech is highly protected, so any restriction must be tightly tailored. Already, Texas and Virginia statutes are under legal review, and Elon Musk’s company has sued under California’s law (which requires platforms to label or block deepfakes) as unconstitutional. In practice, most lawsuits have so far centered on defamation or intellectual property (for instance, a celebrity suing over a botched celebrity-deepfake video), rather than election-focused statutes.

Policy Recommendations: Balancing Integrity and Speech

Given the rapidly evolving technology, experts recommend a multi-pronged approach. Most stress transparency and disclosure as core principles. For example, the Brennan Center urges requiring any political communication that uses AI-synthesized images or voice to include a clear label. This could be a digital watermark or a visible disclaimer. Transparency has two advantages: it forces campaigns and platforms to “own” the use of AI, and it alerts audiences to treat the content with skepticism.

Outright bans on all deepfakes would likely violate free speech, but targeted bans on specific harms (e.g. automated phone calls impersonating voters, or videos claiming false polling information) may be defensible. Indeed, Florida already penalizes misuse of recordings in voter suppression. Another recommendation is limited liability: tying penalties to demonstrable intent to mislead, not to the mere act of content creation. Both U.S. federal proposals and EU law generally condition fines on the “appearance of fraud” or deception.

Technical solutions can complement laws. Watermarking original media (as encouraged by the EU AI Act) could deter the reuse of authentic images in doctored fakes. Open tools for deepfake detection – some supported by government research grants – should be deployed by fact-checkers and social platforms. Making detection datasets publicly available (e.g. the MIT OpenDATATEST) helps improve AI models to spot fakes. International cooperation is also urged: cross-border agreements on information-sharing could help trace and halt disinformation campaigns. The G7 and APEC have both recently committed to fighting election interference via AI, which may lead to joint norms or rapid response teams.

Ultimately, many analysts believe the strongest “cure” is a well-informed public: education campaigns to teach voters to question sensational media, and a robust independent press to debunk falsehoods swiftly. While the law can penalize the worst offenders, awareness and resilience in the electorate are crucial buffers against influence operations. As Georgia Tech’s Sean Parker quipped in 2019, “the real question is not if deepfakes will influence elections, but who will be empowered by the first effective one.” Thus policies should aim to deter malicious use without unduly chilling innovation or satire.

References:

2024 Deepfakes Guide and Statistics

https://www.wired.com/story/synthesia-ai-deepfakes-it-control-riparbelli/

https://www.gao.gov/products/gao-24-107292

https://technologyquotient.freshfields.com/post/102jb19/eu-ai-act-unpacked-8-new-rules-on-deepfakes

https://knightcolumbia.org/blog/we-looked-at-78-election-deepfakes-political-misinformation-is-not-an-ai-problem

https://www.npr.org/2024/12/21/nx-s1-5220301/deepfakes-memes-artificial-intelligence-elections

https://apnews.com/article/artificial-intelligence-elections-disinformation-chatgpt-bc283e7426402f0b4baa7df280a4c3fd

https://www.lawfaremedia.org/article/new-and-old-tools-to-tackle-deepfakes-and-election-lies-in-2024

https://www.brennancenter.org/our-work/research-reports/regulating-ai-deepfakes-and-synthetic-media-political-arena

https://firstamendment.mtsu.edu/article/political-deepfakes-and-elections/

https://www.ncsl.org/technology-and-communication/deceptive-audio-or-visual-media-deepfakes-2024-legislation

https://law.unh.edu/sites/default/files/media/2022/06/nagumotu_pp113-157.pdf

https://dfrlab.org/2024/10/02/brazil-election-ai-research/

https://dfrlab.org/2024/11/26/brazil-election-ai-deepfakes/

https://freedomhouse.org/article/eu-digital-services-act-win-transparency

The post The Legal Accountability of AI-Generated Deepfakes in Election Misinformation appeared first on MarkTechPost.

Deploy Amazon SageMaker Projects with Terraform Cloud

Posted on June 2, 2025 by i-genie

Amazon SageMaker Projects empower data scientists to self-serve Amazon Web Services (AWS) tooling and infrastructure to organize all entities of the machine learning (ML) lifecycle, and further enable organizations to standardize and constrain the resources available to their data science teams in pre-packaged templates.
For AWS customers using Terraform to define and manage their infrastructure-as-code (IaC), the current best practice for enabling Amazon SageMaker Projects carries a dependency on AWS CloudFormation to facilitate integration between AWS Service Catalog and Terraform. This blocks enterprise customers whose IT governance prohibits the use of vendor-specific IaC such as CloudFormation from using SageMaker Projects with Terraform Cloud.
This post outlines how you can enable SageMaker Projects with Terraform Cloud, removing the CloudFormation dependency.
AWS Service Catalog engine for Terraform Cloud
SageMaker Projects are directly mapped to AWS Service Catalog products. To obviate the use of CloudFormation, these products must be designated as Terraform products that use the AWS Service Catalog Engine (SCE) for Terraform Cloud. This module, actively maintained by HashiCorp, contains AWS-native infrastructure for integrating Service Catalog with Terraform Cloud so that your Service Catalog products are deployed using the Terraform Cloud platform.
By following the steps in this post, you can use the Service Catalog engine to deploy SageMaker Projects directly from Terraform Cloud.
Prerequisites
To successfully deploy the example, you must have the following:

An AWS account with the necessary permissions to create and manage SageMaker Projects and Service Catalog products. See the Service Catalog documentation for more information on Service Catalog permissions.
An existing Amazon SageMaker Studio domain with an associated Amazon SageMaker user profile. The SageMaker Studio domain must have SageMaker Projects enabled. See Use quick setup for Amazon SageMaker AI.
A Unix terminal with the AWS Command Line Interface (AWS CLI) and Terraform installed. See Installing or updating to the latest version of the AWS CLI and Install Terraform for more information about installation.
An existing Terraform Cloud account with the necessary permissions to create and manage workspaces. See the following tutorials to quickly create your own account:

HCP Terraform – intro and sign Up
Log In to HCP Terraform from the CLI

See Terraform teams and organizations documentation for more information about Terraform Cloud permissions.
Deployment steps

Clone the sagemaker-custom-project-templates repository from the AWS Samples GitHub to your local machine, update the submodules, and navigate to the mlops-terraform-cloud directory.

$ git clone https://github.com/aws-samples/sagemaker-custom-project-templates.git
$ cd sagemaker-custom-project-templates
$ git submodule update --init --recursive
$ cd mlops-terraform-cloud

The preceding code base creates a Service Catalog portfolio, adds the SageMaker Project template as a Service Catalog product to the portfolio, allows the SageMaker Studio role to access the Service Catalog product, and adds the necessary tags to make the product visible in SageMaker Studio. See Create Custom Project Templates in the SageMaker Projects Documentation for more information about this process.

Log in to your Terraform Cloud account from the terminal

$ terraform login

This prompts your browser to sign into your HCP account and generates a security token. Copy this security token and paste it back into your terminal.

Navigate to your AWS account and retrieve the SageMaker user role Amazon Resource Name (ARN) for the SageMaker user profile associated with your SageMaker Studio domain. This role is used to grant SageMaker Studio users permissions to create and manage SageMaker Projects.

In the AWS Management Console for Amazon SageMaker, choose Domains from the navigation pane
Select your studio domain
Under User Profiles, select your user profile
In the User Details, copy the ARN
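
If you prefer to look up the role programmatically instead of through the console, a minimal Python sketch follows. It assumes the ARN shown in User Details is the execution role of the user profile, and that `sm_client` behaves like `boto3.client("sagemaker")`; the helper name `get_execution_role_arn` is hypothetical and not part of the sample repository.

```python
def get_execution_role_arn(sm_client, domain_id: str, user_profile_name: str) -> str:
    """Return the execution role ARN for a SageMaker Studio user profile.

    `sm_client` is expected to behave like boto3.client("sagemaker").
    """
    profile = sm_client.describe_user_profile(
        DomainId=domain_id, UserProfileName=user_profile_name
    )
    role = profile.get("UserSettings", {}).get("ExecutionRole")
    if role:
        return role
    # Assumption: a profile without its own role uses the domain default.
    domain = sm_client.describe_domain(DomainId=domain_id)
    return domain["DefaultUserSettings"]["ExecutionRole"]
```

With boto3 installed and credentials configured, you would call it as `get_execution_role_arn(boto3.client("sagemaker"), "<your-domain-id>", "<your-user-profile>")`, where both arguments are placeholders for your own values.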

Create a tfvars file with the necessary variables for the Terraform Cloud workspace

$ cp terraform.tfvars.example terraform.tfvars

Set the appropriate values in the newly created tfvars file. The following variables are required:

tfc_organization = "my-tfc-organization"
tfc_team = "aws-service-catalog"
token_rotation_interval_in_days = 30
sagemaker_user_role_arns = ["arn:aws:iam::XXXXXXXXXXX:role/service-role/AmazonSageMaker-ExecutionRole"]

Make sure that your desired Terraform Cloud (TFC) organization has the proper entitlements and that your tfc_team is unique for this deployment. See the Terraform Organizations Overview for more information on creating organizations.

Initialize the Terraform Cloud workspace

$ terraform init

Apply the Terraform Cloud workspace

$ terraform apply

Go back to the SageMaker console using the user profile associated with the SageMaker user role ARN that you copied previously and choose Open Studio application
In the navigation pane, choose Deployments and then choose Projects
Choose Create project, select the mlops-tf-cloud-example product and then choose Next
In Project details, enter a unique name for the template and (optionally) enter a project description. Choose Create
In a separate tab or window, go back to your Terraform Cloud account’s Workspaces and you’ll see a workspace being provisioned directly from your SageMaker Project deployment. The naming convention of the Workspace will be <ACCOUNT_ID>-<SAGEMAKER_PROJECT_ID>

Further customization
This example can be modified to include custom Terraform in your SageMaker Project template. To do so, define your Terraform in the mlops-product/product directory. When ready to deploy, be sure to archive and compress this Terraform using the following command:

$ cd mlops-product
$ tar -czf product.tar.gz product
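
If you want to script the packaging step rather than run tar by hand, the following Python sketch produces an equivalent gzip-compressed archive with the standard library. The helper name `archive_product` is hypothetical, and the sketch assumes you pass the path to the product directory and the desired output path yourself.

```python
import tarfile
from pathlib import Path


def archive_product(product_dir: str, output_path: str) -> list[str]:
    """Write a gzip-compressed tar of product_dir and return the member names."""
    source = Path(product_dir)
    with tarfile.open(output_path, "w:gz") as archive:
        # arcname keeps member paths relative ("product/..."), as `tar -czf`
        # does when it is run from the parent directory.
        archive.add(str(source), arcname=source.name)
    with tarfile.open(output_path, "r:gz") as archive:
        return archive.getnames()
```

Returning the member names makes it easy to confirm that your Terraform files landed under the expected top-level directory before you deploy.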

Cleanup
To remove the resources deployed by this example, run the following from the project directory:

$ terraform destroy

Conclusion
In this post you defined, deployed, and provisioned a SageMaker Project custom template purely in Terraform. With no dependencies on other IaC tools, you can now enable SageMaker Projects strictly within your Terraform Enterprise infrastructure.

About the author
Max Copeland is a Machine Learning Engineer for AWS, leading customer engagements spanning ML-Ops, data science, data engineering, and generative AI.

How ZURU improved the accuracy of floor plan generation by 109% using …

Posted on June 1, 2025 by i-genie

ZURU Tech is on a mission to change the way we build, from town houses and hospitals to office towers, schools, apartment blocks, and more. Dreamcatcher is a user-friendly platform developed by ZURU that allows users with any level of experience to collaborate in the building design and construction process. With the simple click of a button, an entire building can be ordered, manufactured and delivered to the construction site for assembly.
ZURU collaborated with AWS Generative AI Innovation Center and AWS Professional Services to implement a more accurate text-to-floor plan generator using generative AI. With it, users can specify a description of the building they want to design using natural language. For example, instead of designing the foundation, walls, and key aspects of a building from scratch, a user could enter, “Create a house with three bedrooms, two bathrooms, and an outdoor space for entertainment.” The solution would generate a unique floor plan within the 3D design space, allowing users with a non-technical understanding of architecture and construction to create a well-designed house.
In this post, we show you why a solution using a large language model (LLM) was chosen. We explore how model selection, prompt engineering, and fine-tuning can be used to improve results. And we explain how the team made sure they could iterate quickly through an evaluation framework using key services such as Amazon Bedrock and Amazon SageMaker.
Understanding the challenge
The foundation for generating a house within Dreamcatcher’s 3D building system is to first confirm we can generate a 2D floor plan based on the user’s prompt. The ZURU team found that generating 2D floor plans, such as the one in the following image, using different machine learning (ML) techniques requires success across two key criteria.

First, the model must understand rooms, the purpose of each room, and their orientation to one another within a two-dimensional vector system. This can also be described as how well the model adheres to the features described in a user’s prompt. Second, there is a mathematical component: rooms must adhere to criteria such as specific dimensions and floor space. To be certain that they were on the right track and to allow for fast R&D iteration cycles, the ZURU team created a novel evaluation framework that measures the accuracy of each model’s output against these two key metrics.
The ZURU team initially looked at using generative adversarial networks (GAN) for floor plan generation, but experimentation with a GPT2 LLM had positive results based on the test framework. This reinforced the idea that an LLM-based approach could provide the required accuracy for a text-to–floor plan generator.
Improving the results
To improve on the results of the GPT2 model, we worked together and defined two further experiments. The first was a prompt engineering approach. Using Anthropic’s Claude 3.5 Sonnet in Amazon Bedrock, the team was able to evaluate the impact of a leading proprietary model with contextual examples included in the prompts. The second approach focused on fine-tuning Llama 3B variants to evaluate the improvement in accuracy when the model weights are directly influenced using high-quality examples.
Dataset preparation and analysis
To create the initial dataset, floor plans from thousands of houses were gathered from publicly available sources and reviewed by a team of in-house architects. To streamline the review process, the ZURU team built a custom application with a simple yes/no decision mechanism similar to those found in popular social matching applications, allowing architects to quickly approve plans compatible with the ZURU building system or reject those with disqualifying features. This intuitive approach significantly accelerated ZURU’s evaluation process while maintaining clear decision criteria for each floor plan.
To further enhance this dataset, we began with careful dataset preparation, including filtering out low-quality data (30%) by evaluating the metric score of the ground truth dataset. Under this filtering mechanism, data points not achieving 100% accuracy on instruction adherence were removed from the training dataset. This data preparation technique helped to improve the efficiency and quality of the fine-tuning and prompt engineering by more than 20%.
During our exploratory data analysis we found that the dataset contained prompts that can match multiple floor plans as well as floor plans that could match multiple prompts. By moving all related prompt and floor plan combinations to the same data split (either training, validation, or testing) we were able to prevent data leakage and promote robust evaluation.
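
One way to implement this kind of leakage-free split is to treat prompts and floor plans as nodes in a graph, find the connected components, and assign each component to a single split. The Python sketch below illustrates the idea with a small union-find; it is not the ZURU team’s code, and the function name `group_split`, the `(prompt, plan_id)` pair format, and the default 80/10/10 fractions are assumptions made for illustration.

```python
import random
from collections import defaultdict


def group_split(pairs, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split (prompt, plan_id) pairs so linked prompts and plans share a split."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Union each prompt with every floor plan it is paired with.
    for prompt, plan_id in pairs:
        parent[find(("prompt", prompt))] = find(("plan", plan_id))

    # Collect the pairs belonging to each connected component.
    groups = defaultdict(list)
    for pair in pairs:
        groups[find(("prompt", pair[0]))].append(pair)

    components = list(groups.values())
    random.Random(seed).shuffle(components)

    names = ("train", "validation", "test")
    splits = {name: [] for name in names}
    targets = [fraction * len(pairs) for fraction in fractions]
    for component in components:
        # Place the whole component in the split furthest below its target size.
        idx = max(range(3), key=lambda i: targets[i] - len(splits[names[i]]))
        splits[names[idx]].extend(component)
    return splits
```

Because whole components move together, the resulting split sizes only approximate the requested fractions, which is the price of guaranteeing that no prompt or floor plan appears in more than one split.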
Prompt engineering approach
As part of our approach, we implemented dynamic matching for few-shot prompting that is different from traditional static sampling methods. Combining this with the implementation of prompt decomposition, we could increase the overall accuracy of the generated floor plan content.
With a dynamic few-shot prompting methodology, we retrieve the most relevant examples at run time based on the details of the input prompt from a high-quality dataset and provide this as part of the prompt to the generative AI model.
The dynamic few-shot prompting approach is further enhanced by prompt decomposition, where we break down complex tasks into smaller, more manageable components to achieve better results from language models. By decomposing queries, each component can be optimized for its specific purpose. We found that combining these methods resulted in improved relevancy in example selection and lower latency in retrieving the example data, leading to better performance and higher quality results.
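
To make the idea of dynamic few-shot prompting concrete, the following Python sketch selects examples at run time based on the features of the request and assembles them into a prompt. It is a simplified stand-in: the record layout (`features`, `prompt`, `plan`), the function names, and the use of Jaccard overlap as the relevance score are all assumptions for illustration, not the retrieval method used in the solution.

```python
def select_examples(query_features: set, dataset: list, k: int = 2) -> list:
    """Pick the k dataset records whose features overlap most with the query."""
    def jaccard(features: set) -> float:
        union = query_features | features
        return len(query_features & features) / len(union) if union else 0.0

    ranked = sorted(dataset, key=lambda rec: jaccard(rec["features"]), reverse=True)
    return ranked[:k]


def build_prompt(user_request: str, examples: list) -> str:
    """Assemble a few-shot prompt from the retrieved examples."""
    shots = "\n\n".join(
        f"Request: {ex['prompt']}\nFloor plan: {ex['plan']}" for ex in examples
    )
    return f"{shots}\n\nRequest: {user_request}\nFloor plan:"
```

The contrast with static sampling is that the examples change with every request, so a three-bedroom query is paired with three-bedroom plans rather than a fixed set chosen in advance.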
Prompt engineering architecture
The workflow and architecture implemented for prototyping, shown in the following figure, demonstrate a systematic approach to AI model optimization. When a user query such as “Build me a house with three bedrooms and two bathrooms” is entered, the workflow follows these steps:

We use prompt decomposition to execute three smaller tasks that retrieve highly relevant examples that match the same features for a house that the user has requested
We take the relevant examples and inject them into the prompt to perform dynamic few-shot prompting to generate a floor plan
We use the reflection technique to ask the generative AI model to self-reflect and assess whether the generated content adheres to our requirements

Deep dive on workflow and architecture
The first step in our workflow is to understand the unique features of the house, which we can use as search criteria to find the most relevant examples in the subsequent steps. For this step, we use Amazon Bedrock, which provides a serverless API-driven endpoint for inference. From the wide range of generative AI models offered by Amazon Bedrock, we choose Mistral 7B, which provides the right balance between cost, latency, and accuracy required for this small decomposed step.
The second step is to search for the most relevant examples using the unique features we found. We use Amazon Bedrock Knowledge Bases backed by Amazon OpenSearch Serverless as a vector database to implement metadata filtering and hybrid search to retrieve the most relevant record identifiers. Amazon Simple Storage Service (Amazon S3) is used for storage of the data set, and Amazon Bedrock Knowledge Bases provides a managed solution for vectorizing and indexing the metadata into the vector database.
In the third step, we retrieve the actual floor plan data by record identifier using Amazon DynamoDB. By splitting the search and retrieval of floor plan examples into two steps, we were able to use purpose-built services with Amazon OpenSearch, allowing for low-latency search, and DynamoDB for low-latency data retrieval by key value leading to optimized performance.
After retrieving the most relevant examples for the user’s prompt, in step four we use Amazon Bedrock and Anthropic’s Claude 3.5 Sonnet as a model with leading benchmarks in deep reasoning and mathematics to generate our new floor plan.
Finally, in step five, we implement reflection. We use Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock again and pass the original prompt, instructions, examples, and newly generated floor plan back with a final instruction for the model to reflect on and double-check its generated floor plan and correct mistakes.
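
The five steps above can be summarized as a small orchestration function. The Python sketch below is only a structural outline: every callable passed in is a placeholder for a service call described in the text (a small model for feature extraction, a vector search, a key-value lookup, and two calls to a larger model), and none of the names come from the actual implementation.

```python
from typing import Callable, Dict, List


def generate_floor_plan(
    user_prompt: str,
    extract_features: Callable[[str], List[str]],
    search_ids: Callable[[List[str]], List[str]],
    fetch_plan: Callable[[str], Dict],
    generate: Callable[[str, List[Dict]], str],
    reflect: Callable[[str, List[Dict], str], str],
) -> str:
    """Run the five steps in order; every callable is a placeholder."""
    # Step 1: a small model extracts the distinguishing features of the request.
    features = extract_features(user_prompt)
    # Step 2: search returns identifiers of the most relevant records.
    record_ids = search_ids(features)
    # Step 3: a key-value lookup fetches the full floor plan for each identifier.
    examples = [fetch_plan(record_id) for record_id in record_ids]
    # Step 4: a larger model drafts a floor plan from the request and examples.
    draft = generate(user_prompt, examples)
    # Step 5: the model reviews its own draft and returns a corrected version.
    return reflect(user_prompt, examples, draft)
```

Keeping each step behind its own callable mirrors the design choice described above: search and retrieval can be served by different purpose-built stores, and the small and large models can be swapped independently.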
Fine-tuning approach
We explored two methods for optimizing LLMs for automated floorplan generation: full parameter fine-tuning and Low-Rank Adaptation (LoRA)–based fine-tuning. Full fine-tuning adjusts all LLM parameters, which requires significant memory and training time. In contrast, LoRA tunes only a small subset of parameters, reducing memory requirements and training time.
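
A quick calculation shows why LoRA is so much lighter. As LoRA is usually described, the original weight matrix is frozen and a low-rank update made of two small factors is trained instead. The Python sketch below counts trainable parameters for a single weight matrix under that description; the 4096 x 4096 shape and rank 16 are illustrative numbers, not settings reported in this post.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def full_trainable_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_out * d_in


# Illustrative numbers only: one 4096 x 4096 projection with rank 16.
full = full_trainable_params(4096, 4096)
lora = lora_trainable_params(4096, 4096, 16)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")
```

For this hypothetical matrix, the low-rank factors hold 131,072 parameters against 16,777,216 for the full matrix, under one percent of the total.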
Workflow and architecture
We implemented our workflow, comprising data processing, fine-tuning, and inference and testing steps (shown in the following figure), within a SageMaker JupyterLab notebook provisioned with an ml.p4d.24xlarge instance, giving us access to NVIDIA A100 GPUs. Because we used a Jupyter notebook and ran all parts of our workflow interactively, we were able to iterate quickly and debug our experiments while maturing the training and testing scripts.

Deep dive on fine tuning workflow
One key insight from our experiments was the critical importance of dataset quality and diversity. Further to our initial dataset preparation, when fine-tuning a model, we found that carefully selecting training samples with larger diversity helped the model learn more robust representations. Additionally, although larger batch sizes generally improved performance (within memory constraints), we had to carefully balance this against computational resources (320 GB GPU memory in an ml.p4d.24xlarge instance) and training time (ideally within 1–2 days).
We conducted several iterations to optimize performance, experimenting with various approaches including initial few-sample quick instruction fine-tuning, larger dataset fine-tuning, fine-tuning with early stopping, comparing Llama 3.1 8B and Llama 3 8B models, and varying instruction length in fine-tuning samples. Through these iterations, we found that full fine-tuning of the Llama 3.1 8B model using a curated dataset of 200,000 samples produced the best results.
The training process for full fine-tuning Llama 3.1 8B with BF16 and a microbatch size of three involved eight epochs with 30,000 steps, taking 25 hours to complete. In contrast, the LoRA approach showed significant computational efficiency, requiring only 2 hours of training time and producing an 89 MB checkpoint.
Evaluation framework
The testing framework implements an efficient evaluation methodology that optimizes resource utilization and time while maintaining statistical validity. Key components include:

A prompt deduplication system that identifies and consolidates duplicate instructions in the test dataset, reducing computational overhead and enabling faster iteration cycles for model improvement
A distribution-based performance assessment that filters unique test cases, promotes representative sampling through statistical analysis, and projects results across the full dataset
A metric-based evaluation that implements scoring across key criteria enabling comparative analysis against both the baseline GPT2 model and other approaches.
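
The deduplication and projection ideas in the list above can be sketched in a few lines of Python: score each distinct prompt once, then weight that score by how often the prompt occurs in the full test set. This is an illustrative simplification rather than the team’s framework; the function name `evaluate_with_dedup` and the `score` callable (standing in for an expensive model run plus metric computation) are assumptions.

```python
from collections import Counter
from typing import Callable, Iterable


def evaluate_with_dedup(prompts: Iterable[str], score: Callable[[str], float]) -> float:
    """Score each distinct prompt once, then weight by how often it occurs."""
    counts = Counter(prompts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no prompts to evaluate")
    # Each unique prompt is evaluated a single time, however often it repeats.
    weighted = sum(score(prompt) * n for prompt, n in counts.items())
    return weighted / total
```

The projected score equals what you would get by evaluating every row only when the scoring function is deterministic for a given prompt; with sampled model outputs, it is an estimate.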

Results and business impact
To understand how well each approach in our experiment performed, we used the evaluation framework and compared several key metrics. For the purposes of this post, we focus on two of these key metrics. The first reflects how well the model was able to follow users’ instructions to reflect the features required in the house. The second metric looks at how well the features of the house adhered to instructions in terms of mathematical correctness, positioning, and orientation. The following image shows these results in a graph.

We found that the prompt engineering approach with Anthropic’s Claude 3.5 Sonnet as well as the full fine-tuning approach with Llama 3.1 8B increased the instruction adherence quality over the baseline GPT2 model by 109%, showing that, depending on a team’s skillsets, both approaches could be used to improve how well an LLM understands instructions when generating content such as floor plans.
When looking at mathematical correctness, our prompt engineering approach wasn’t able to create significant improvements over the baseline, but full fine-tuning was a clear winner with a 54% increase over the baseline GPT2 results.
The LoRA-based tuning approach scored lower than full fine-tuning: 20% lower on instruction adherence and 50% lower on mathematical correctness. This demonstrates the tradeoffs that can be made between time, cost, and hardware on one side and model accuracy on the other.
Conclusion
ZURU Tech has set its vision on fundamentally transforming the way we design and construct buildings. In this post, we highlighted the approach to building and improving a text-to-floor-plan generator based on LLMs to create a highly usable and streamlined workflow within a 3D-modeling system. We dived into advanced concepts of prompt engineering using Amazon Bedrock and detailed approaches to fine-tuning LLMs using Amazon SageMaker, showing the different tradeoffs you can make to significantly improve the accuracy of the generated content.
To learn more about the Generative AI Innovation Center program, get in touch with your account team.

About the Authors
Federico Di Mattia is the team leader and Product Owner of ZURU AI at ZURU Tech in Modena, Italy. With a focus on AI-driven innovation, he leads the development of Generative AI solutions that enhance business processes and drive ZURU’s growth.
Niro Amerasinghe is a Senior Solutions Architect based out of Auckland, New Zealand. With experience in architecture, product development, and engineering, he helps customers in using Amazon Web Services (AWS) to grow their businesses.
Haofei Feng is a Senior Cloud Architect at AWS with over 18 years of expertise in DevOps, IT Infrastructure, Data Analytics, and AI. He specializes in guiding organizations through cloud transformation and generative AI initiatives, designing scalable and secure GenAI solutions on AWS. Based in Sydney, Australia, when not architecting solutions for clients, he cherishes time with his family and Border Collies.
Sheldon Liu is an applied scientist, ANZ Tech Lead at the AWS Generative AI Innovation Center. He partners with enterprise customers across diverse industries to develop and implement innovative generative AI solutions, accelerating their AI adoption journey while driving significant business outcomes.
Xuefeng Liu leads a science team at the AWS Generative AI Innovation Center in the Asia Pacific regions. His team partners with AWS customers on generative AI projects, with the goal of accelerating customers’ adoption of generative AI.
Simone Bartoli is a Machine Learning Software Engineer at ZURU Tech, in Modena, Italy. With a background in computer vision, machine learning, and full-stack web development, Simone specializes in creating innovative solutions that leverage cutting-edge technologies to enhance business processes and drive growth.
Marco Venturelli is a Senior Machine Learning Engineer at ZURU Tech in Modena, Italy. With a background in computer vision and AI, he leverages his experience to innovate with generative AI, enriching the Dreamcatcher software with smart features.
Stefano Pellegrini is a Generative AI Software Engineer at ZURU Tech in Italy. Specializing in GAN and diffusion-based image generation, he creates tailored image-generation solutions for various departments across ZURU.
Enrico Petrucci is a Machine Learning Software Engineer at ZURU Tech, based in Modena, Italy. With a strong background in machine learning and NLP tasks, he currently focuses on leveraging Generative AI and Large Language Models to develop innovative agentic systems that provide tailored solutions for specific business cases.

Going beyond AI assistants: Examples from Amazon.com reinventing indus …

Posted on May 31, 2025 by i-genie

Generative AI revolutionizes business operations through various applications, including conversational assistants such as Amazon’s Rufus and Amazon Seller Assistant. Additionally, some of the most impactful generative AI applications operate autonomously behind the scenes, an essential capability that empowers enterprises to transform their operations, data processing, and content creation at scale. These non-conversational implementations, often in the form of agentic workflows powered by large language models (LLMs), execute specific business objectives across industries without direct user interaction.
Non-conversational applications offer unique advantages such as higher latency tolerance, batch processing, and caching, but their autonomous nature requires stronger guardrails and exhaustive quality assurance compared to conversational applications, which benefit from real-time user feedback and supervision.
This post examines four diverse Amazon.com examples of such generative AI applications:

Amazon.com listing creation and catalog data quality improvements – Demonstrating how LLMs are helping selling partners and Amazon.com create higher-quality listings at scale
Prescription processing in Amazon Pharmacy – Showcasing implementation in a highly regulated environment and task decomposition for agentic workflows
Review highlights – Illustrating massive scale batch processing, traditional machine learning (ML) integration, use of smaller LLMs, and cost-effective solution at scale
Amazon Ads creative image and video generation – Highlighting multimodal generative AI and responsible AI practices in creative endeavors

Each case study reveals different aspects of implementing non-conversational generative AI applications, from technical architecture to operational considerations. Throughout these examples, you will learn how the comprehensive suite of AWS services, including Amazon Bedrock and Amazon SageMaker, is key to success. Finally, we list key learnings commonly shared across these use cases.
Creating high-quality product listings on Amazon.com
Creating high-quality product listings with comprehensive details helps customers make informed purchase decisions. Traditionally, selling partners manually entered dozens of attributes per product. The new generative AI solution, launched in 2024, transforms this process by proactively acquiring product information from brand websites and other sources to improve the customer experience across numerous product categories.
Generative AI simplifies the selling partner experience by enabling information input in various formats such as URLs, product images, or spreadsheets and automatically translating this into the required structure and format. Over 900,000 selling partners have used it, with nearly 80% of generated listing drafts accepted with minimal edits. AI-generated content provides comprehensive product details that help with clarity and accuracy, which can contribute to product discoverability in customer searches.
For new listings, the workflow begins with selling partners providing initial information. The system then generates comprehensive listings using multiple information sources, including titles, descriptions, and detailed attributes. Generated listings are shared with selling partners for approval or editing.
For existing listings, the system identifies products that can be enriched with additional data.
Data integration and processing for a large variety of outputs
The Amazon team built robust connectors for internal and external sources with LLM-friendly APIs using Amazon Bedrock and other AWS services to seamlessly integrate into Amazon.com backend systems.
A key challenge is synthesizing diverse data into cohesive listings across more than 50 attributes, both textual and numerical. LLMs require specific control mechanisms and instructions to accurately interpret ecommerce concepts because they might not perform optimally with such complex, varied data. For example, LLMs might misinterpret “capacity” in a knife block as dimensions rather than number of slots, or mistake “Fit Wear” as a style description instead of a brand name. Prompt engineering and fine-tuning were extensively used to address these cases.
Generation and validation with LLMs
The generated product listings should be complete and correct. To help achieve this, the solution implements a multistep workflow using LLMs for both generation and validation of attributes. This dual-LLM approach helps prevent hallucinations, which is critical when dealing with safety hazards or technical specifications. The team developed advanced self-reflection techniques to make sure the generation and validation processes complement each other effectively.
The following figure illustrates the generation process with validation both performed by LLMs.

Figure 1. Product Listing creation workflow
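
The generate-then-validate loop described in this section can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the `generate` and `validate` callables are hypothetical stand-ins for the two LLM calls, and the retry limit and feedback format are illustrative choices, not details from the Amazon.com implementation.

```python
from typing import Callable, Dict, List, Tuple

Listing = Dict[str, str]
Generator = Callable[[str, List[str]], Listing]  # (source text, feedback) -> listing
Validator = Callable[[str, Listing], List[str]]  # (source text, listing) -> issues


def generate_with_validation(
    source: str, generate: Generator, validate: Validator, max_rounds: int = 3
) -> Tuple[Listing, List[str]]:
    """Regenerate until the validator reports no issues or the rounds run out."""
    feedback: List[str] = []
    listing: Listing = {}
    for _ in range(max_rounds):
        # The generator sees the validator's issues from the previous round.
        listing = generate(source, feedback)
        feedback = validate(source, listing)
        if not feedback:
            break
    # Non-empty feedback means the listing still has open issues.
    return listing, feedback
```

Passing the validator's issues back into the next generation call is one simple way to make the two roles complement each other. A listing that still carries issues after the final round would be a natural candidate for human review.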

Multi-layer quality assurance with human feedback
Human feedback is central to the solution’s quality assurance. The process includes Amazon.com experts for initial evaluation and selling partner input for acceptance or edits. This provides high-quality output and enables ongoing enhancement of AI models.
The quality assurance process includes automated testing methods combining ML-, algorithm-, or LLM-based evaluations. Failed listings undergo regeneration, and successful listings proceed to further testing. Using causal inference models, we identify underlying features affecting listing performance and opportunities for enrichment. Ultimately, listings that pass quality checks and receive selling partner acceptance are published, making sure customers receive accurate and comprehensive product information.
The following figure illustrates the workflow of going to production with testing, evaluation, and monitoring of product listing generation.

Figure 2. Product Listing testing and human in the loop workflow

Application-level system optimization for accuracy and cost
Given the high standards for accuracy and completeness, the team adopted a comprehensive experimentation approach with an automated optimization system. This system explores various combinations of LLMs, prompts, playbooks, workflows, and AI tools to iterate for higher business metrics, including cost. Through continuous evaluation and automated testing, the product listing generator effectively balances performance, cost, and efficiency while staying adaptable to new AI developments. This approach means customers benefit from high-quality product information, and selling partners have access to cutting-edge tools for creating listings efficiently.
Generative AI-powered prescription processing in Amazon Pharmacy
Building upon the human-AI hybrid workflows previously discussed in the seller listing example, Amazon Pharmacy demonstrates how these principles can be applied in a Health Insurance Portability and Accountability Act (HIPAA)-regulated industry. Having shared a conversational assistant for patient care specialists in the post Learn how Amazon Pharmacy created their LLM-based chat-bot using Amazon SageMaker, we now focus on automated prescription processing, which you can read about in The life of a prescription at Amazon Pharmacy and the following research paper in Nature Magazine.
At Amazon Pharmacy, we developed an AI system built on Amazon Bedrock and SageMaker to help pharmacy technicians process medication directions more accurately and efficiently. This solution integrates human experts with LLMs in creation and validation roles to enhance precision in medication instructions for our patients.
Agentic workflow design for healthcare accuracy
The prescription processing system combines human expertise (data entry technicians and pharmacists) with AI support for direction suggestions and feedback. The workflow, shown in the following diagram, begins with a pharmacy knowledge-based preprocessor standardizing raw prescription text in Amazon DynamoDB, followed by fine-tuned small language models (SLMs) on SageMaker identifying critical components (dosage, frequency).

Figure 3. (a) Data entry technician and pharmacist workflow with two GenAI modules, (b) Suggestion module workflow and (c) Flagging module workflow

The system seamlessly integrates experts such as data entry technicians and pharmacists, where generative AI complements the overall workflow towards agility and accuracy to better serve our patients. A direction assembly system with safety guardrails then generates instructions for data entry technicians to create their typed directions through the suggestion module. The flagging module flags or corrects errors and enforces further safety measures as feedback provided to the data entry technician. The technician finalizes highly accurate, safe-typed directions for pharmacists who can either provide feedback or execute the directions to the downstream service.
One highlight from the solution is the use of task decomposition, which empowers engineers and scientists to break the overall process into a multitude of steps with individual modules made of substeps. The team extensively used fine-tuned SLMs. In addition, the process employs traditional ML procedures such as named entity recognition (NER) or estimation of final confidence with regression models. Using SLMs and traditional ML in such contained, well-defined procedures significantly improved processing speed while maintaining rigorous safety standards due to incorporation of appropriate guardrails on specific steps.
The system comprises multiple well-defined substeps, with each subprocess operating as a specialized component working semi-autonomously yet collaboratively within the workflow toward the overall objective. This decomposed approach, with specific validations at each stage, proved more effective than end-to-end solutions while enabling the use of fine-tuned SLMs. The team used AWS Fargate to orchestrate the workflow given its current integration into existing backend systems.
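To make the idea of task decomposition concrete, here is a minimal Python sketch of a pipeline split into small, individually checkable steps. Everything in it is hypothetical: the abbreviation table, the regular expressions (standing in for fine-tuned SLMs or NER models), and the function names are assumptions for illustration and do not reflect the Amazon Pharmacy system or real clinical rules.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical abbreviation table; a real preprocessor would use pharmacy knowledge.
ABBREVIATIONS = {"bid": "twice daily", "qd": "once daily"}


@dataclass
class Components:
    dose: Optional[str]
    frequency: Optional[str]


def preprocess(raw: str) -> str:
    """Step 1: standardize raw text by expanding known abbreviations."""
    words = raw.lower().replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)


def extract(text: str) -> Components:
    """Step 2: pull out components (regexes stand in for a fine-tuned SLM or NER model)."""
    dose = re.search(r"\b\d+ (?:tablets?|capsules?)\b", text)
    freq = re.search(r"\b(?:once|twice) daily\b", text)
    return Components(dose.group(0) if dose else None, freq.group(0) if freq else None)


def assemble(c: Components) -> Optional[str]:
    """Step 3: assemble a suggestion only when every component was found."""
    if c.dose is None or c.frequency is None:
        return None  # guardrail: hand off to a human instead of guessing
    return f"Take {c.dose} {c.frequency}"


def suggest(raw: str) -> Optional[str]:
    """Run the three steps in order; None means no suggestion could be made."""
    return assemble(extract(preprocess(raw)))
```

Because each step has a narrow contract, it can be validated, replaced, or given its own guardrail without touching the others, which is the property the decomposed design relies on.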
In their product development journey, the team turned to Amazon Bedrock, which provided high-performing LLMs with ease-of-use features tailored to generative AI applications. SageMaker enabled further LLM selections, deeper customizability, and traditional ML methods. To learn more about this technique, see How task decomposition and smaller LLMs can make AI more affordable and read about the Amazon Pharmacy business case study.
Building a reliable application with guardrails and HITL
To comply with HIPAA standards and provide patient privacy, we implemented strict data governance practices alongside a hybrid approach that combines fine-tuned LLMs using Amazon Bedrock APIs with Retrieval Augmented Generation (RAG) using Amazon OpenSearch Service. This combination enables efficient knowledge retrieval while maintaining high accuracy for specific subtasks.
Managing LLM hallucinations—which is critical in healthcare—required more than just fine-tuning on large datasets. Our solution implements domain-specific guardrails built on Amazon Bedrock Guardrails, complemented by human-in-the-loop (HITL) oversight to promote system reliability.
The Amazon Pharmacy team continues to enhance this system through real-time pharmacist feedback and expanded prescription format capabilities. This balanced approach of innovation, domain expertise, advanced AI services, and human oversight not only improves operational efficiency, but also means that the AI system properly augments healthcare professionals in delivering optimal patient care.
Generative AI-powered customer review highlights
Whereas our previous example showcased how Amazon Pharmacy integrates LLMs into real-time workflows for prescription processing, this next use case demonstrates how similar techniques—SLMs, traditional ML, and thoughtful workflow design—can be applied to offline batch inferencing at massive scale.
Amazon has introduced AI-generated customer review highlights to process over 200 million annual product reviews and ratings. This feature distills shared customer opinions into concise paragraphs highlighting positive, neutral, and negative feedback about products and their features. Shoppers can quickly grasp the consensus, and the feature maintains transparency by providing access to related customer reviews and keeping original reviews available.
The system enhances shopping decisions through an interface where customers can explore review highlights by selecting specific features (such as picture quality, remote functionality, or ease of installation for a Fire TV). Features are visually coded with green check marks for positive sentiment, orange minus signs for negative, and gray for neutral—which means shoppers can quickly identify product strengths and weaknesses based on verified purchase reviews. The following screenshot shows review highlights regarding noise level for a product.

Figure 4. Example review highlights for a product.

A recipe for cost-effective use of LLMs for offline use cases
The team developed a cost-effective hybrid architecture combining traditional ML methods with specialized SLMs. This approach assigns sentiment analysis and keyword extraction to traditional ML while using optimized SLMs for complex text generation tasks, improving both accuracy and processing efficiency. The following diagram shows traditional ML and LLMs working together in the overall workflow.

Figure 5. Use of traditional ML and LLMs in a workflow.

The feature employs SageMaker batch transform for asynchronous processing, significantly reducing costs compared to real-time endpoints. To deliver a near zero-latency experience, the solution caches extracted insights alongside existing reviews, reducing wait times and enabling simultaneous access by multiple customers without additional computation. The system processes new reviews incrementally, updating insights without reprocessing the complete dataset. For optimal performance and cost-effectiveness, the feature uses Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances for batch transform jobs, providing up to 40% better price-performance compared to alternatives.
By following this comprehensive approach, the team effectively managed costs while handling the massive scale of reviews and products so that the solution remained both efficient and scalable.
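The caching and incremental-update pattern described above can be sketched as follows. This is an illustrative in-memory Python sketch, not the production design: the `HighlightCache` class, its method names, and the `summarize` callback (standing in for the offline batch model job) are all assumptions. It refreshes only the products touched by a new batch and serves every read from the cache.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

Review = Tuple[str, str]  # (product_id, review_text)


class HighlightCache:
    """Keeps per-product highlights so reads never trigger model inference."""

    def __init__(self, summarize: Callable[[List[str]], str]) -> None:
        self._summarize = summarize  # stand-in for the offline batch model job
        self._reviews: Dict[str, List[str]] = defaultdict(list)
        self._highlights: Dict[str, str] = {}

    def ingest_batch(self, new_reviews: Iterable[Review]) -> Set[str]:
        """Add new reviews and refresh only the products they touch."""
        touched: Set[str] = set()
        for product_id, text in new_reviews:
            self._reviews[product_id].append(text)
            touched.add(product_id)
        for product_id in touched:
            self._highlights[product_id] = self._summarize(self._reviews[product_id])
        return touched

    def get(self, product_id: str) -> str:
        """Serve the cached highlight; no computation happens at read time."""
        return self._highlights.get(product_id, "")
```

In this sketch, incremental means per product: a product with no new reviews is never re-summarized, and any number of shoppers can read the same cached highlight at no extra compute cost.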
Amazon Ads AI-powered creative image and video generation
Having explored mostly text-centric generative AI applications in previous examples, we now turn to multimodal generative AI with Amazon Ads creative content generation for sponsored ads. The solution has capabilities for image and video generation, the details of which we share in this section. Both capabilities use Amazon Nova creative content generation models at their core.
Working backward from customer need, a March 2023 Amazon survey revealed that nearly 75% of advertisers struggling with campaign success cited creative content generation as their primary challenge. Many advertisers—particularly those without in-house capabilities or agency support—face significant barriers due to the expertise and costs of producing quality visuals. The Amazon Ads solution democratizes visual content creation, making it accessible and efficient for advertisers of different sizes. The impact has been substantial: advertisers using AI-generated images in Sponsored Brands campaigns saw nearly 8% click-through rates (CTR) and submitted 88% more campaigns than non-users.
Last year, the AWS Machine Learning Blog published a post detailing the image generation solution. Since then, Amazon has adopted Amazon Nova Canvas as its foundation for creative image generation, creating professional-grade images from text or image prompts with features for text-based editing and controls for color scheme and layout adjustments.
In September 2024, the Amazon Ads team added the creation of short-form video ads from product images. This feature uses foundation models available on Amazon Bedrock to give customers control over visual style, pacing, camera motion, rotation, and zooming through natural language, using an agentic workflow to first describe video storyboards and then generate the content for the story. The following screenshot shows an example of creative image generation for product backgrounds on Amazon Ads.

Figure 6. Ads image generation example for a product.

As discussed in the original post, responsible AI is at the center of the solution, and Amazon Nova creative models come with built-in controls to support safety and responsible AI use, including watermarking and content moderation.
The solution uses AWS Step Functions with AWS Lambda functions to provide serverless orchestration of both the image and video generation processes. Generated content is stored in Amazon Simple Storage Service (Amazon S3) with metadata in DynamoDB, and Amazon API Gateway provides customer access to the generation capabilities. The solution now employs Amazon Bedrock Guardrails in addition to maintaining Amazon Rekognition and Amazon Comprehend integration at various steps for additional safety checks. The following screenshot shows creative AI-generated videos on Amazon Ads campaign builder.

Figure 7. Ads video generation for a product

Creating high-quality ad creatives at scale presented complex challenges. The generative AI model needed to produce appealing, brand-appropriate images across diverse product categories and advertising contexts while remaining accessible to advertisers regardless of technical expertise. Quality assurance and improvement are fundamental to both image and video generation capabilities. The system undergoes continual enhancement through extensive HITL processes enabled by Amazon SageMaker Ground Truth. This implementation delivers a powerful tool that transforms advertisers’ creative process, making high-quality visual content creation more accessible across diverse product categories and contexts.
This is just the beginning of Amazon Ads using generative AI to empower advertising customers to create the content they need to drive their advertising objectives. The solution demonstrates how reducing creative barriers directly increases advertising activity while maintaining high standards for responsible AI use.
Key technical learnings and discussions
Non-conversational applications benefit from higher latency tolerance, enabling batch processing and caching, but require robust validation mechanisms and stronger guardrails due to their autonomous nature. These insights apply to both non-conversational and conversational AI implementations:

Task decomposition and agentic workflows – Breaking complex problems into smaller components has proven valuable across implementations. This deliberate decomposition by domain experts enables specialized models for specific subtasks, as demonstrated in Amazon Pharmacy prescription processing, where fine-tuned SLMs handle discrete tasks such as dosage identification. This strategy allows for specialized agents with clear validation steps, improving reliability and simplifying maintenance. The Amazon seller listing use case exemplifies this through its multistep workflow with separate generation and validation processes. Additionally, the review highlights use case showcased cost-effective and controlled use of LLMs by using traditional ML for preprocessing and for parts of the work that might otherwise be handled by an LLM.
Hybrid architectures and model selection – Combining traditional ML with LLMs provides better control and cost-effectiveness than pure LLM approaches. Traditional ML excels at well-defined tasks, as shown in the review highlights system for sentiment analysis and information extraction. Amazon teams have strategically deployed both large and small language models based on requirements, integrating RAG with fine-tuning for effective domain-specific applications like the Amazon Pharmacy implementation.
Cost optimization strategies – Amazon teams achieved efficiency through batch processing, caching mechanisms for high-volume operations, specialized instance types such as AWS Inferentia and AWS Trainium, and optimized model selection. Review highlights demonstrates how incremental processing reduces computational needs, and Amazon Ads used Amazon Nova foundation models (FMs) to cost-effectively create creative content.
Quality assurance and control mechanisms – Quality control relies on domain-specific guardrails through Amazon Bedrock Guardrails and multilayered validation combining automated testing with human evaluation. Dual-LLM approaches for generation and validation help prevent hallucinations in Amazon seller listings, and self-reflection techniques improve accuracy. Amazon Nova creative FMs provide inherent responsible AI controls, complemented by continual A/B testing and performance measurement.
HITL implementation – The HITL approach spans multiple layers, from expert evaluation by pharmacists to end-user feedback from selling partners. Amazon teams established structured improvement workflows, balancing automation and human oversight based on specific domain requirements and risk profiles.
Responsible AI and compliance – Responsible AI practices include content ingestion guardrails for regulated environments and adherence to regulations such as HIPAA. Amazon teams integrated content moderation for user-facing applications, maintained transparency in review highlights by providing access to source information, and implemented data governance with monitoring to promote quality and compliance.

These patterns enable scalable, reliable, and cost-effective generative AI solutions while maintaining quality and responsibility standards. The implementations demonstrate that effective solutions require not just sophisticated models, but careful attention to architecture, operations, and governance, supported by AWS services and established practices.
Next steps
The examples from Amazon.com shared in this post illustrate how generative AI can create value beyond traditional conversational assistants. We invite you to follow these examples or create your own solution to discover how generative AI can reinvent your business or even your industry. You can visit the AWS generative AI use cases page to start the ideation process.
These examples showed that effective generative AI implementations often benefit from combining different types of models and workflows. To learn what FMs are supported by AWS services, refer to Supported foundation models in Amazon Bedrock and Amazon SageMaker JumpStart Foundation Models. We also suggest you explore Amazon Bedrock Flows, which can ease the path towards building workflows. Additionally, we remind you that Trainium and Inferentia accelerators provide important cost savings in these applications.
Agentic workflows, as illustrated in our examples, have proven particularly valuable. We recommend exploring Amazon Bedrock Agents for quickly building agentic workflows.
Successful generative AI implementation extends beyond model selection—it represents a comprehensive software development process from experimentation to application monitoring. To begin building your foundation across these essential services, we invite you to explore Amazon QuickStart.
Conclusion
These examples demonstrate how generative AI extends beyond conversational assistants to drive innovation and efficiency across industries. Success comes from combining AWS services with strong engineering practices and business understanding. Ultimately, effective generative AI solutions focus on solving real business problems while maintaining high standards of quality and responsibility.
To learn more about how Amazon uses AI, refer to Artificial Intelligence in Amazon News.

About the Authors
Burak Gozluklu is a Principal AI/ML Specialist Solutions Architect and lead GenAI Scientist Architect for Amazon.com on AWS, based in Boston, MA. He helps strategic customers adopt AWS technologies and specifically Generative AI solutions to achieve their business objectives. Burak has a PhD in Aerospace Engineering from METU, an MS in Systems Engineering, and a post-doc in system dynamics from MIT in Cambridge, MA. He maintains his connection to academia as a research affiliate at MIT. Outside of work, Burak is an enthusiast of yoga.
Emilio Maldonado is a Senior leader at Amazon responsible for Product Knowledge, focused on building systems to scale the e-commerce Catalog metadata, organize all product attributes, and leverage GenAI to infer precise information that guides Sellers and Shoppers to interact with products. He’s passionate about developing dynamic teams and forming partnerships. He holds a Bachelor of Science in C.S. from Tecnologico de Monterrey (ITESM) and an MBA from Wharton, University of Pennsylvania.
Wenchao Tong is a Sr. Principal Technologist at Amazon Ads in Palo Alto, CA, where he spearheads the development of GenAI applications for creative building and performance optimization. His work empowers customers to enhance product and brand awareness and drive sales by leveraging innovative AI technologies to improve creative performance and quality. Wenchao holds a Master’s degree in Computer Science from Tongji University. Outside of work, he enjoys hiking, board games, and spending time with his family.
Alexandre Alves is a Sr. Principal Engineer at Amazon Health Services, specializing in ML, optimization, and distributed systems. He helps deliver wellness-forward health experiences.
Puneet Sahni is a Sr. Principal Engineer at Amazon. He works on improving the data quality of all products available in the Amazon catalog. He is passionate about leveraging product data to improve our customer experiences. He has a Master’s degree in Electrical Engineering from the Indian Institute of Technology (IIT) Bombay. Outside of work, he enjoys spending time with his young kids and travelling.
Vaughn Schermerhorn is a Director at Amazon, where he leads Shopping Discovery and Evaluation—spanning Customer Reviews, content moderation, and site navigation across Amazon’s global marketplaces. He manages a multidisciplinary organization of applied scientists, engineers, and product leaders focused on surfacing trustworthy customer insights through scalable ML models, multimodal information retrieval, and real-time system architecture. His team develops and operates large-scale distributed systems that power billions of shopping decisions daily. Vaughn holds degrees from Georgetown University and San Diego State University and has lived and worked in the U.S., Germany, and Argentina. Outside of work, he enjoys reading, travel, and time with his family.
Tarik Arici is a Principal Applied Scientist at Amazon Selection and Catalog Systems (ASCS), working on Catalog Quality Enhancement using GenAI workflows. He has a PhD in Electrical and Computer Engineering from Georgia Tech. Outside of work, Tarik enjoys swimming and biking.

Apple and Duke Present a Reinforcement Learning Approach That Enables …

Posted on May 30, 2025 by i-genie

Long CoT reasoning improves large language models’ performance on complex tasks but comes with drawbacks. The typical “think-then-answer” method slows down response times, disrupting real-time interactions like those in chatbots. It also risks inaccuracies, as errors in earlier reasoning steps can lead to a misleading final answer. Unlike humans, who often share partial thoughts or conclusions during conversations, LLMs delay responses until all reasoning is complete. While RL is commonly used to train reasoning models, it mainly rewards final answers, overlooking useful intermediate insights. There is growing interest in teaching models that alternate between thinking and answering, but this remains a challenge.

RL has become a popular method to enhance reasoning in LLMs, building on its success in aligning models with human preferences. Two common reward types guide RL: outcome-based rewards (ORM), which focus on the final answer, and process-based rewards (PRM), which provide feedback on intermediate reasoning steps. While PRMs offer more detailed supervision, they often rely on human annotation and additional models, making them complex and prone to issues like reward hacking. Separately, efforts to improve LLM reasoning have explored prompting strategies, structured reasoning, tool integration, and methods to reduce latency and improve efficiency.

Researchers from Apple and Duke University introduce Interleaved Reasoning, a new RL approach that enables language models to alternate between thinking and answering when solving complex, multi-step questions. Instead of waiting until the end to respond, models provide informative intermediate answers, which give users earlier feedback and help guide the model’s subsequent reasoning. Using a straightforward rule-based reward, the model is trained to produce helpful reasoning steps, leading to over 80% faster responses and up to 19.3% better accuracy. Trained only on QA and logic datasets, the method demonstrates strong generalization to more challenging benchmarks, such as MATH, GPQA, and MMLU.

The study proposes a reinforcement learning framework to train LLMs for Interleaved Reasoning, where models alternate between internal thinking and user-facing intermediate answers. Each intermediate step, or “sub-answer,” is shared once the model reaches a meaningful milestone in reasoning. A specialized training template with <think> and <answer> tags is used. The approach utilizes rule-based rewards—specifically, format, final accuracy, and conditional intermediate accuracy—to guide learning. Notably, intermediate rewards are applied only when specific criteria are met, ensuring the model prioritizes overall correctness. They also test different reward schemes, such as all-or-none, partial credit, and time-discounted rewards, to optimize the quality of reasoning.
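The reward structure described above can be illustrated with a small sketch. This is not the authors’ implementation: the function name, the weights, the exact-match comparison, and the condition used here (intermediate credit only when the final answer is correct) are all assumptions chosen to show how format, final-accuracy, and conditional, time-discounted intermediate rewards might combine.

```python
import re
from typing import List

# Matches one <think>...</think> block followed by one <answer>...</answer> block.
PAIR_PATTERN = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def interleaved_reward(
    output: str,
    gold_final: str,
    gold_intermediate: List[str],
    discount: float = 0.9,             # hypothetical time-discount factor
    intermediate_weight: float = 0.5,  # hypothetical weight for sub-answers
) -> float:
    """Toy rule-based reward: format + final accuracy + conditional intermediate credit."""
    answers = [a.strip() for a in PAIR_PATTERN.findall(output)]
    if not answers:
        return 0.0  # output does not follow the think/answer template

    format_reward = 1.0
    final_reward = 1.0 if answers[-1] == gold_final else 0.0

    intermediate_reward = 0.0
    # Assumed condition: sub-answers only earn credit when the final answer is right,
    # so the model cannot trade overall correctness for partial credit.
    if final_reward == 1.0 and gold_intermediate:
        sub_answers = answers[:-1]
        credit = sum(
            discount ** step  # earlier correct sub-answers earn more
            for step, (pred, gold) in enumerate(zip(sub_answers, gold_intermediate))
            if pred == gold
        )
        intermediate_reward = intermediate_weight * credit / len(gold_intermediate)

    return format_reward + final_reward + intermediate_reward
```

Setting discount to 1.0 turns this into a plain partial-credit scheme, which makes it easy to compare the reward variants mentioned above on the same outputs.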

The interleaved reasoning approach was evaluated on both familiar and unfamiliar datasets using Qwen2.5 models (1.5B and 7B). Unlike traditional methods that separate thinking and answering, the interleaved method provides answers incrementally, improving both speed and usefulness. When combined with intermediate rewards, it significantly enhances model performance while reducing response delays by over 80%. Even without exposure to new domains during training, the model adapts well, showing strong generalization. These results highlight the value of interleaved reasoning in making AI systems more responsive and effective in real-world, multi-step reasoning tasks.

In conclusion, the study explores how interleaved reasoning—where models alternate between reasoning and generating intermediate answers—can significantly improve performance and responsiveness. Using the Qwen2.5-1.5B model, the authors show that providing timely intermediate feedback during training boosts accuracy and accelerates response generation. Different RL strategies were tested, with PPO showing stable results, and conditional, time-discounted rewards proving to be the most effective. The method scales well to complex tasks and outperforms traditional think-then-answer baselines. Unlike token-level reward models, this approach employs simple rule-based rewards after completing full reasoning steps, thereby avoiding reward hacking. Ultimately, interleaved reasoning enhances reasoning quality and efficiency without relying on external tools.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
The post Apple and Duke Present a Reinforcement Learning Approach That Enables LLMs to Provide Intermediate Answers, Enhancing Speed and Accuracy appeared first on MarkTechPost.

DeepSeek Releases R1-0528: An Open-Source Reasoning AI Model Deliverin …

Posted on May 30, 2025 by i-genie

DeepSeek, the Chinese AI unicorn, has released an updated version of its R1 reasoning model, named DeepSeek-R1-0528. This release enhances the model’s capabilities in mathematics, programming, and general logical reasoning, positioning it as a formidable open-source alternative to leading models like OpenAI’s o3 and Google’s Gemini 2.5 Pro.

Technical Enhancements

The R1-0528 update introduces significant improvements in reasoning depth and inference accuracy. Notably, the model’s performance on the AIME 2025 math benchmark has increased from 70% to 87.5%, reflecting a more profound reasoning process that averages 23,000 tokens per question, up from 12,000 in the previous version. This enhancement is attributed to increased computational resources and algorithmic optimizations applied during post-training.

In addition to mathematical reasoning, the model has shown improved performance in code generation tasks. According to LiveCodeBench benchmarks, R1-0528 ranks just below OpenAI’s o4 mini and o3 models, outperforming xAI’s Grok 3 mini and Alibaba’s Qwen 3 in code generation tasks.

Open-Source Model Weights

DeepSeek continues its commitment to open-source and open-weights AI by releasing R1-0528 under the MIT license, allowing developers to modify and deploy the model freely. The model’s weights are available on Hugging Face, and detailed documentation is provided for local deployment and API integration. This approach contrasts with the proprietary nature of many leading AI models, promoting transparency and accessibility in AI development.

Distilled Model for Lightweight Deployment

Recognizing the need for more accessible AI solutions, DeepSeek has also released a distilled version of R1-0528, named DeepSeek-R1-0528-Qwen3-8B. This model, fine-tuned from Alibaba’s Qwen3-8B using text generated by R1-0528, achieves state-of-the-art performance among open-source models on the AIME 2024 benchmark. It is designed to run efficiently on a single GPU, making advanced AI capabilities more accessible to developers with limited computational resources.

Censorship Considerations

While DeepSeek’s advancements in AI are noteworthy, the R1-0528 model has been observed to exhibit stricter content moderation compared to its predecessors. Independent testing revealed that the model avoids or provides limited responses to politically sensitive topics, such as the Tiananmen Square protests and the status of Taiwan, aligning with Chinese regulations that require AI models to adhere to content restrictions.

Here are the reasoning traces on the internment camps question–again mentioning Xianjiang, and reasoning quite clearly about why it’s not complying. pic.twitter.com/ooEwmF23TY— xlr8harder (@xlr8harder) May 29, 2025

Global Implications

The release of R1-0528 underscores China’s growing influence in the AI sector, challenging the dominance of U.S.-based companies. DeepSeek’s ability to develop competitive AI models at a fraction of the cost of their Western counterparts has prompted responses from companies like OpenAI, which have expressed concerns about the potential for these models to be manipulated by the Chinese government. This development highlights the shifting dynamics in global AI development and the increasing importance of open-source models in fostering innovation and competition.

Conclusion

DeepSeek’s R1-0528 model represents a significant advancement in open-source AI, offering enhanced reasoning capabilities and accessibility for developers. By providing both a full-scale model and a distilled version suitable for single-GPU deployment, DeepSeek is making strides in democratizing AI technology. However, the model’s adherence to content moderation policies reflects the complex interplay between technological advancement and regulatory compliance. As the AI landscape continues to evolve, DeepSeek’s developments will likely play a pivotal role in shaping the future of open-source AI.

Check out the Open-Source Weights and Try it now. All credit for this research goes to the researchers of this project.
The post DeepSeek Releases R1-0528: An Open-Source Reasoning AI Model Delivering Enhanced Math and Code Performance with Single-GPU Efficiency appeared first on MarkTechPost.

A Coding Guide for Building a Self-Improving AI Agent Using Google’s …

Posted on May 30, 2025 by i-genie

In this tutorial, we will explore how to create a sophisticated Self-Improving AI Agent using Google’s cutting-edge Gemini API. This self-improving agent demonstrates autonomous problem-solving, dynamically evaluates performance, learns from successes and failures, and iteratively enhances its capabilities through reflective analysis and self-modification. The tutorial walks through structured code implementation, detailing mechanisms for memory management, capability tracking, iterative task analysis, solution generation, and performance evaluation, all integrated within a powerful self-learning feedback loop.

import google.generativeai as genai
import json
import time
import re
from typing import Dict, List, Any
from datetime import datetime
import traceback

We set up the foundational components to build an AI-powered self-improving agent utilizing Google’s Generative AI API. Libraries such as json, time, re, and datetime facilitate structured data management, performance tracking, and text processing, while type hints (Dict, List, Any) help ensure robust and maintainable code.

class SelfImprovingAgent:
    def __init__(self, api_key: str):
        """Initialize the self-improving agent with Gemini API"""
        genai.configure(api_key=api_key)
        self.model = genai.GenerativeModel('gemini-1.5-flash')

        # Long-term memory: what worked, what failed, and what was learned
        self.memory = {
            'successful_strategies': [],
            'failed_attempts': [],
            'learned_patterns': [],
            'performance_metrics': [],
            'code_improvements': []
        }

        # Self-assessed capability scores in the range 0.0-1.0
        self.capabilities = {
            'problem_solving': 0.5,
            'code_generation': 0.5,
            'learning_efficiency': 0.5,
            'error_handling': 0.5
        }

        self.iteration_count = 0
        self.improvement_history = []

    def analyze_task(self, task: str) -> Dict[str, Any]:
        """Analyze a given task and determine approach"""
        analysis_prompt = f"""
        Analyze this task and provide a structured approach:
        Task: {task}

        Please provide:
        1. Task complexity (1-10)
        2. Required skills
        3. Potential challenges
        4. Recommended approach
        5. Success criteria

        Format as JSON.
        """

        try:
            response = self.model.generate_content(analysis_prompt)
            # Grab the outermost {...} span from the model's reply
            json_match = re.search(r'\{.*\}', response.text, re.DOTALL)
            if json_match:
                return json.loads(json_match.group())
            else:
                return {
                    "complexity": 5,
                    "skills": ["general problem solving"],
                    "challenges": ["undefined requirements"],
                    "approach": "iterative improvement",
                    "success_criteria": ["task completion"]
                }
        except Exception as e:
            print(f"Task analysis error: {e}")
            return {"complexity": 5, "skills": [], "challenges": [], "approach": "basic", "success_criteria": []}

    def solve_problem(self, problem: str) -> Dict[str, Any]:
        """Attempt to solve a problem using current capabilities"""
        self.iteration_count += 1
        print(f"\n=== Iteration {self.iteration_count} ===")
        print(f"Problem: {problem}")

        task_analysis = self.analyze_task(problem)
        print(f"Task Analysis: {task_analysis}")

        solution_prompt = f"""
        Based on my previous learning and capabilities, solve this problem:
        Problem: {problem}

        My current capabilities: {self.capabilities}
        Previous successful strategies: {self.memory['successful_strategies'][-3:]} # Last 3
        Known patterns: {self.memory['learned_patterns'][-3:]} # Last 3

        Provide a detailed solution with:
        1. Step-by-step approach
        2. Code implementation (if applicable)
        3. Expected outcome
        4. Potential improvements
        """

        try:
            start_time = time.time()
            response = self.model.generate_content(solution_prompt)
            solve_time = time.time() - start_time

            solution = {
                'problem': problem,
                'solution': response.text,
                'solve_time': solve_time,
                'iteration': self.iteration_count,
                'task_analysis': task_analysis
            }

            quality_score = self.evaluate_solution(solution)
            solution['quality_score'] = quality_score

            self.memory['performance_metrics'].append({
                'iteration': self.iteration_count,
                'quality': quality_score,
                'time': solve_time,
                'complexity': task_analysis.get('complexity', 5)
            })

            # Solutions scoring above 0.7 are stored as successful strategies
            if quality_score > 0.7:
                self.memory['successful_strategies'].append(solution)
                print(f"Solution Quality: {quality_score:.2f} (Success)")
            else:
                self.memory['failed_attempts'].append(solution)
                print(f"Solution Quality: {quality_score:.2f} (Needs Improvement)")

            return solution

        except Exception as e:
            print(f"Problem solving error: {e}")
            error_solution = {
                'problem': problem,
                'solution': f"Error occurred: {str(e)}",
                'solve_time': 0,
                'iteration': self.iteration_count,
                'quality_score': 0.0,
                'error': str(e)
            }
            self.memory['failed_attempts'].append(error_solution)
            return error_solution

    def evaluate_solution(self, solution: Dict[str, Any]) -> float:
        """Evaluate the quality of a solution"""
        evaluation_prompt = f"""
        Evaluate this solution on a scale of 0.0 to 1.0:

        Problem: {solution['problem']}
        Solution: {solution['solution'][:500]}... # Truncated for evaluation

        Rate based on:
        1. Completeness (addresses all aspects)
        2. Correctness (logically sound)
        3. Clarity (well explained)
        4. Practicality (implementable)
        5. Innovation (creative approach)

        Respond with just a decimal number between 0.0 and 1.0.
        """

        try:
            response = self.model.generate_content(evaluation_prompt)
            # Take the first number in the reply and clamp it to [0.0, 1.0]
            score_match = re.search(r'(\d+\.?\d*)', response.text)
            if score_match:
                score = float(score_match.group(1))
                return min(max(score, 0.0), 1.0)
            return 0.5
        except Exception:
            # Fall back to a neutral score if the API call or parsing fails
            return 0.5

    def learn_from_experience(self):
        """Analyze past performance and improve capabilities"""
        print("\nLearning from experience...")

        if len(self.memory['performance_metrics']) < 2:
            return

        learning_prompt = f"""
        Analyze my performance and suggest improvements:

        Recent Performance Metrics: {self.memory['performance_metrics'][-5:]}
        Successful Strategies: {len(self.memory['successful_strategies'])}
        Failed Attempts: {len(self.memory['failed_attempts'])}

        Current Capabilities: {self.capabilities}

        Provide:
        1. Performance trends analysis
        2. Identified weaknesses
        3. Specific improvement suggestions
        4. New capability scores (0.0-1.0 for each capability)
        5. New patterns learned

        Format as JSON with keys: analysis, weaknesses, improvements, new_capabilities, patterns
        """

        try:
            response = self.model.generate_content(learning_prompt)

            json_match = re.search(r'\{.*\}', response.text, re.DOTALL)
            if json_match:
                learning_results = json.loads(json_match.group())

                # Snapshot first, so the history entry below always has a "before" state
                old_capabilities = self.capabilities.copy()

                if 'new_capabilities' in learning_results:
                    for capability, score in learning_results['new_capabilities'].items():
                        if capability in self.capabilities:
                            self.capabilities[capability] = min(max(float(score), 0.0), 1.0)

                    print("Capability Updates:")
                    for cap, old in old_capabilities.items():
                        new = self.capabilities[cap]
                        change = new - old
                        print(f"  {cap}: {old:.2f} → {new:.2f} ({change:+.2f})")

                if 'patterns' in learning_results:
                    self.memory['learned_patterns'].extend(learning_results['patterns'])

                self.improvement_history.append({
                    'iteration': self.iteration_count,
                    'timestamp': datetime.now().isoformat(),
                    'learning_results': learning_results,
                    'capabilities_before': old_capabilities,
                    'capabilities_after': self.capabilities.copy()
                })

                print(f"Learned {len(learning_results.get('patterns', []))} new patterns")

        except Exception as e:
            print(f"Learning error: {e}")

    def generate_improved_code(self, current_code: str, improvement_goal: str) -> str:
        """Generate improved version of code"""
        improvement_prompt = f"""
        Improve this code based on the goal:

        Current Code:
        {current_code}

        Improvement Goal: {improvement_goal}
        My current capabilities: {self.capabilities}
        Learned patterns: {self.memory['learned_patterns'][-3:]}

        Provide improved code with:
        1. Enhanced functionality
        2. Better error handling
        3. Improved efficiency
        4. Clear comments explaining improvements
        """

        try:
            response = self.model.generate_content(improvement_prompt)

            improved_code = {
                'original': current_code,
                'improved': response.text,
                'goal': improvement_goal,
                'iteration': self.iteration_count
            }

            self.memory['code_improvements'].append(improved_code)
            return response.text

        except Exception as e:
            print(f"Code improvement error: {e}")
            return current_code

    def self_modify(self):
        """Attempt to improve the agent's own code"""
        print("\nAttempting self-modification...")

        # Placeholder source: only a stub of solve_problem is sent to the model
        current_method = """
        def solve_problem(self, problem: str) -> Dict[str, Any]:
            # Current implementation
            pass
        """

        improved_method = self.generate_improved_code(
            current_method,
            "Make problem solving more efficient and accurate"
        )

        print("Generated improved method structure")
        print("Note: Actual self-modification requires careful implementation in production")

    def run_improvement_cycle(self, problems: List[str], cycles: int = 3):
        """Run a complete improvement cycle"""
        print(f"Starting {cycles} improvement cycles with {len(problems)} problems")

        for cycle in range(cycles):
            print(f"\n{'='*50}")
            print(f"IMPROVEMENT CYCLE {cycle + 1}/{cycles}")
            print(f"{'='*50}")

            cycle_results = []
            for problem in problems:
                result = self.solve_problem(problem)
                cycle_results.append(result)
                time.sleep(1)  # Pause between API calls

            self.learn_from_experience()

            # Skip self-modification after the final cycle
            if cycle < cycles - 1:
                self.self_modify()

            avg_quality = sum(r.get('quality_score', 0) for r in cycle_results) / len(cycle_results)
            print(f"\nCycle {cycle + 1} Summary:")
            print(f"  Average Solution Quality: {avg_quality:.2f}")
            print(f"  Current Capabilities: {self.capabilities}")
            print(f"  Total Patterns Learned: {len(self.memory['learned_patterns'])}")

            time.sleep(2)

    def get_performance_report(self) -> str:
        """Generate a comprehensive performance report"""
        if not self.memory['performance_metrics']:
            return "No performance data available yet."

        metrics = self.memory['performance_metrics']
        avg_quality = sum(m['quality'] for m in metrics) / len(metrics)
        avg_time = sum(m['time'] for m in metrics) / len(metrics)

        report = f"""
        AGENT PERFORMANCE REPORT
        {'='*40}

        Total Iterations: {self.iteration_count}
        Average Solution Quality: {avg_quality:.3f}
        Average Solve Time: {avg_time:.2f}s

        Successful Solutions: {len(self.memory['successful_strategies'])}
        Failed Attempts: {len(self.memory['failed_attempts'])}
        Success Rate: {len(self.memory['successful_strategies']) / max(1, self.iteration_count) * 100:.1f}%

        Current Capabilities:
        {json.dumps(self.capabilities, indent=2)}

        Patterns Learned: {len(self.memory['learned_patterns'])}
        Code Improvements: {len(self.memory['code_improvements'])}
        """

        return report

The SelfImprovingAgent class above implements a framework that uses Google’s Gemini API for autonomous task-solving, self-assessment, and adaptive learning. It combines structured memory, capability tracking, and iterative problem-solving across repeated improvement cycles. Over successive cycles, the agent feeds its stored strategies and learned patterns back into its prompts, which is the mechanism intended to improve its solutions over time. Two caveats are worth noting. First, self_modify() is a placeholder: it asks the model to improve a stub of solve_problem() and records the suggestion, but it does not replace any running code. Second, the quality ratings and capability scores are produced by the model itself, so they are self-assessments rather than independent measurements.
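Both analyze_task() and learn_from_experience() pull a JSON object out of free-form model text with the same regular expression. As a minimal sketch, that step could be factored into one helper that also handles malformed JSON locally instead of relying on the surrounding try/except. The helper name extract_json is hypothetical and is not part of the original tutorial.

```python
import json
import re
from typing import Any, Dict, Optional


def extract_json(text: str) -> Optional[Dict[str, Any]]:
    """Return the first-to-last-brace span of `text` parsed as a JSON object, or None."""
    # Greedy match from the first "{" to the last "}", across newlines.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        parsed = json.loads(match.group())
    except json.JSONDecodeError:
        return None  # braces were present but the content was not valid JSON
    return parsed if isinstance(parsed, dict) else None
```

A caller can then write `result = extract_json(response.text) or default_analysis`, where default_analysis stands for whatever fallback dictionary the method already uses.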

def main():
    """Main function to demonstrate the self-improving agent"""

    API_KEY = "Use Your GEMINI KEY Here"

    # Stop early if the placeholder key has not been replaced
    if API_KEY == "Use Your GEMINI KEY Here":
        print("Please set your Gemini API key in the API_KEY variable")
        print("Get your API key from: https://makersuite.google.com/app/apikey")
        return

    agent = SelfImprovingAgent(API_KEY)

    test_problems = [
        "Write a function to calculate the factorial of a number",
        "Create a simple text-based calculator that handles basic operations",
        "Design a system to find the shortest path between two points in a graph",
        "Implement a basic recommendation system for movies based on user preferences",
        "Create a machine learning model to predict house prices based on features"
    ]

    print("Self-Improving Agent Demo")
    print("This agent will attempt to solve problems and improve over time")

    agent.run_improvement_cycle(test_problems, cycles=3)

    print("\n" + agent.get_performance_report())

    print("\n" + "="*50)
    print("TESTING IMPROVED AGENT")
    print("="*50)

    final_problem = "Create an efficient algorithm to sort a large dataset"
    final_result = agent.solve_problem(final_problem)

    print(f"\nFinal Problem Solution Quality: {final_result.get('quality_score', 0):.2f}")

The main() function serves as the entry point for demonstrating the SelfImprovingAgent class. It initializes the agent with the user’s Gemini API key and defines practical programming and system design tasks. The agent then iteratively tackles these tasks, analyzing its performance to refine its problem-solving abilities over multiple improvement cycles. Finally, it tests the agent’s enhanced capabilities with a new complex task, showcasing measurable progress and providing a detailed performance report.

def setup_instructions():
    """Print setup instructions for Google Colab"""
    instructions = """
    SETUP INSTRUCTIONS FOR GOOGLE COLAB:

    1. Install the Gemini API client:
    !pip install google-generativeai

    2. Get your Gemini API key:
    - Go to https://makersuite.google.com/app/apikey
    - Create a new API key
    - Copy the key

    3. Replace 'Use Your GEMINI KEY Here' in main() with your actual API key

    4. Run the code!

    CUSTOMIZATION OPTIONS:
    - Modify test_problems list to add your own challenges
    - Adjust improvement cycles count
    - Add new capabilities to track
    - Extend the learning mechanisms

    IMPROVEMENT IDEAS:
    - Add persistent memory (save/load agent state)
    - Implement more sophisticated evaluation metrics
    - Add domain-specific problem types
    - Create visualization of improvement over time
    """
    print(instructions)

if __name__ == "__main__":
    setup_instructions()
    print("\n" + "="*60)
    main()

Finally, we define the setup_instructions() function, which guides users through preparing their Google Colab environment to run the self-improving agent. It explains step by step how to install dependencies and configure the Gemini API key, and it highlights options for customizing and extending the agent. This simplifies onboarding and makes it easier to experiment with the agent’s capabilities.
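One of the improvement ideas listed in the setup text is persistent memory. A minimal sketch of that idea follows, assuming the agent exposes the four attributes set in __init__ (memory, capabilities, iteration_count, improvement_history). The function names save_agent_state and load_agent_state and the JSON file layout are hypothetical; they are not part of the original tutorial.

```python
import json
from pathlib import Path
from typing import Any

# Attributes of SelfImprovingAgent that make up its learned state.
STATE_FIELDS = ("memory", "capabilities", "iteration_count", "improvement_history")


def save_agent_state(agent: Any, path: str) -> None:
    """Write the agent's learned state to a JSON file."""
    state = {field: getattr(agent, field) for field in STATE_FIELDS}
    # default=str keeps the dump from failing on values that are not JSON-native.
    Path(path).write_text(json.dumps(state, indent=2, default=str), encoding="utf-8")


def load_agent_state(agent: Any, path: str) -> bool:
    """Restore state saved by save_agent_state; return False if no file exists."""
    state_file = Path(path)
    if not state_file.exists():
        return False
    state = json.loads(state_file.read_text(encoding="utf-8"))
    for field in STATE_FIELDS:
        if field in state:
            setattr(agent, field, state[field])
    return True
```

In main(), one could call load_agent_state(agent, "agent_state.json") right after constructing the agent and save_agent_state(agent, "agent_state.json") after the improvement cycles, so that learned patterns carry over between Colab sessions.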

In conclusion, the implementation demonstrated in this tutorial offers a comprehensive framework for creating AI agents that perform tasks and actively enhance their capabilities over time. By harnessing the Gemini API’s advanced generative power and integrating a structured self-improvement loop, developers can build agents capable of sophisticated reasoning, iterative learning, and self-modification.

Check out the Notebook on GitHub. All credit for this research goes to the researchers of this project.
The post A Coding Guide for Building a Self-Improving AI Agent Using Google’s Gemini API with Intelligent Adaptation Features appeared first on MarkTechPost.

Revolutionizing earth observation with geospatial foundation models on …

Posted on May 30, 2025 by i-genie

Emerging transformer-based vision models for geospatial data—also called geospatial foundation models (GeoFMs)—offer a new and powerful technology for mapping the earth’s surface at a continental scale, providing stakeholders with the tooling to detect and monitor surface-level ecosystem conditions such as forest degradation, natural disaster impact, crop yield, and many others.
GeoFMs represent an emerging research field and are a type of pre-trained vision transformer (ViT) specifically adapted to geospatial data sources. GeoFMs offer immediate value without training. The models excel as embedding models for geospatial similarity search and ecosystem change detection. With minimal labeled data, GeoFMs can be fine-tuned for custom tasks such as land surface classification, semantic segmentation, or pixel-level regression. Many leading models are available under very permissive licenses making them accessible for a wide audience. Examples include SatVision-Base, Prithvi-100M, SatMAE, and Clay (used in this solution).
In this post, we explore how Clay Foundation’s Clay foundation model, available on Hugging Face, can be deployed for large-scale inference and fine-tuning on Amazon SageMaker. For illustrative purposes, we focus on a deforestation use case from the Amazon rainforest, one of the most biodiverse ecosystems in the world. Given the strong evidence that the Amazon forest system could soon be reaching a tipping point, it presents an important domain of study and a high-impact application area for GeoFMs, for example, through early detection of forest degradation. However, the solution presented here generalizes to a wide range of geospatial use cases. It also comes with ready-to-deploy code samples to help you get started quickly with deploying GeoFMs in your own applications on AWS.
Let’s dive in!
Solution overview
At the core of our solution is a GeoFM. Architecturally, GeoFMs build on the ViT architecture first introduced in the seminal 2020 research paper An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. To account for the specific properties of geospatial data (multiple channels ranging from ultraviolet to infrared, varying electromagnetic spectrum coverage, and spatio-temporal nature of data), GeoFMs incorporate several architectural innovations such as variable input size (to capture multiple channels) or the addition of positional embeddings that capture spatio-temporal aspects such as seasonality and location on earth. The pre-training of these models is conducted on unlabeled geospatial data sampled from across the globe using masked autoencoders (MAE) as self-supervised learners. Sampling from global-scale data helps ensure that diverse ecosystems and surface types are represented appropriately in the training set. The result is a general-purpose model that can be used for three core use cases:

Geospatial similarity search: Quickly map diverse surface types with semantic geospatial search using the embeddings to find similar items (such as deforested areas).
Embedding-based change detection: Analyze a time series of geospatial embeddings to identify surface disruptions over time for a specific region.
Custom geospatial machine learning: Fine-tune a specialized regression, classification, or segmentation model for geospatial machine learning (ML) tasks. While this requires a certain amount of labeled data, overall data requirements are typically much lower compared to training a dedicated model from the ground up.

The general solution flow is shown in the following diagram. Note that this flow diagram is highly abstracted and omits certain architectural details for reasons of clarity. For a full architecture diagram demonstrating how the flow can be implemented on AWS, see the accompanying GitHub repository. This repository also contains detailed deployment instructions to get you started quickly with applying GeoFMs to your own use cases.

Retrieve and process satellite imagery for GeoFM inference or training: The first step is to get the raw geospatial data into a format that’s consumable by the GeoFM. This entails breaking down the large raw satellite imagery into equally sized 256×256 pixel chips (the size that the model expects) and normalizing pixel values, among other data preparation steps required by the GeoFM that you choose. This routine can be conducted at scale using an Amazon SageMaker AI processing job.
Retrieve model weights and deploy the GeoFM: Next, retrieve the open weights of the GeoFM from a model registry of your choice (Hugging Face in this example) and deploy the model for inference. The best deployment option ultimately depends on how the model is consumed. If you need to generate embeddings asynchronously, use a SageMaker AI processing or transform step. For real-time inference, consider deploying to a SageMaker AI real-time endpoint, which can be configured to auto-scale with demand, allowing for large-scale inference. In this example, we use a SageMaker AI processing job with a custom Docker image for generating embeddings in batch.
Generate geospatial embeddings: The GeoFM is an encoder-only model, meaning that it outputs an embedding vector. During inference, you perform a forward pass of the pre-processed satellite image chip through the GeoFM. This produces the corresponding embedding vector, which can be thought of as a compressed representation of the information contained in the image. This process is equivalent to using text embedding models for RAG use cases or similar.

The generated geospatial embeddings can be used largely as-is for two key use cases: geospatial similarity search and ecosystem change detection.

Run similarity search on the embeddings to identify semantically similar images: The GeoFM embeddings reside in the same vector space. This allows us to identify similar items by identifying vectors that are very close to a given query point. A common high-performance search algorithm for this is approximate nearest neighbor (ANN). For scalability and search performance, we index the embedding vectors in a vector database.
Analyze time-series of embeddings for break points that indicate change: Instead of looking for similarity between embedding vectors, you can also look for distance. Doing this for a specific region and across time lets you pinpoint specific times where change occurs. This allows you to use embeddings for surface change detection over time, a very common use case in geospatial analytics.
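The two embedding-based use cases above can be sketched in a few lines. This is an illustration only: it uses exact brute-force cosine similarity rather than the approximate nearest neighbor search and vector database described in the post, and the function names and the 0.2 distance threshold are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(
    query: Sequence[float], embeddings: List[Sequence[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Exact similarity search: (index, similarity) of the k chips closest to the query."""
    scored = [(i, cosine_similarity(query, e)) for i, e in enumerate(embeddings)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def change_points(series: List[Sequence[float]], threshold: float = 0.2) -> List[int]:
    """Time steps where the cosine distance to the previous embedding exceeds the threshold."""
    return [
        t
        for t in range(1, len(series))
        if 1.0 - cosine_similarity(series[t - 1], series[t]) > threshold
    ]
```

Here `series` would hold the embeddings of one chip location ordered by acquisition date, so a returned index marks the first observation after a surface change. At the scale of a full region, the brute-force search would be replaced by an indexed ANN query.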

Optionally, you can also fine-tune a model on top of the GeoFM.

Train a custom head and run inference: To fine-tune a model, you add a custom (and typically lightweight) head on top of the GeoFM and fine-tune it on an (often small) labeled dataset. The GeoFM weights remain frozen and are not retrained. The custom head takes the GeoFM-generated embedding vectors as input and produces classification masks, pixel-level regression results, or simply a class per image, depending on the use case.
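To make the frozen-backbone idea concrete, the following sketch trains the simplest possible head, a logistic-regression classifier, directly on precomputed embedding vectors. It is a toy in pure Python under stated assumptions: the embeddings are treated as fixed inputs (the GeoFM is never touched), the labels are binary (for example, 1 for deforested and 0 for intact), and the function names and hyperparameters are hypothetical. A real head for segmentation or regression would typically be a small neural network trained in a deep learning framework.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_head(
    embeddings: List[Sequence[float]],
    labels: List[int],
    epochs: int = 200,
    lr: float = 0.5,
) -> Tuple[List[float], float]:
    """Fit weights and bias by stochastic gradient descent on the logistic loss."""
    weights = [0.0] * len(embeddings[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            prob = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = prob - y  # gradient of the loss with respect to the logit
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Classify one embedding: 1 if the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5 else 0
```

Because only the head’s handful of parameters are learned, a small labeled set can be enough, which is the data-efficiency argument made above.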

We explore the key steps of this workflow in the next sections. For additional implementation details, including how to build a high-quality user interface with Solara, see the accompanying GitHub repository.
Geospatial data processing and embedding generation
Our comprehensive, four-stage data processing pipeline transforms raw satellite imagery into analysis-ready vector embeddings that power advanced geospatial analytics. This orchestrated workflow uses Amazon SageMaker AI Pipelines to create a robust, reproducible, and scalable processing architecture. The end-to-end solution can process Earth observation data for a selected region of interest, with built-in flexibility to adapt to different use cases. In this example, we use Sentinel-2 imagery from the Registry of Open Data on AWS for monitoring deforestation in the Brazilian rainforest. However, our pipeline architecture is designed to work seamlessly with other satellite image providers and resolutions (such as NAIP with 1m/pixel resolution, or Maxar and Planet Labs with sub-meter resolution).
Pipeline architecture overview

The SageMaker pipeline consists of four processing steps, shown in the preceding figure. Each step builds on the outputs of the previous steps, with intermediate results stored in Amazon Simple Storage Service (Amazon S3).

Pre-process satellite tiles: Divides the satellite imagery into chips. We chose a chip size of 256×256 pixels, as expected by Clay v1. For Sentinel-2 images at 10 m resolution, this corresponds to an area of 2.56 × 2.56 km (about 6.55 km²).
Generate embeddings: Creates 768-dimensional vector representations for the chips using the Clay v1 model.
Process embeddings: Performs dimensionality reduction and computes similarity metrics (for downstream analyses).
Consolidate and index: Consolidates outputs and loads embedding vectors into a vector store.
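The chip arithmetic in the first step can be checked with a few lines of code. This is a small sketch under stated assumptions: the scene size of 10,980 × 10,980 pixels (a Sentinel-2 tile at 10 m/pixel) and the helper name `chip_grid` are not taken from the pipeline code.

```python
import math

def chip_grid(scene_px: int, chip_px: int, resolution_m: float):
    """Return (chips per side, total chips, chip footprint in km)."""
    per_side = math.ceil(scene_px / chip_px)  # edge chips may be partial
    footprint_km = chip_px * resolution_m / 1000
    return per_side, per_side ** 2, footprint_km

# Assumed scene size: a 10,980 x 10,980 pixel Sentinel-2 tile at 10 m/pixel
per_side, total, footprint_km = chip_grid(10980, 256, 10.0)
print(per_side, total, footprint_km)
```

Under these assumptions, one scene yields 43 chip positions per side, some of them partial at the edges, which is why a pipeline typically has to pad or discard edge chips.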

# Pipeline definition (simplified)
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name=f"EmbeddingPipeline-{aoi_name}",
    steps=[preprocess_step, embeddings_step, postprocessing_step, consolidation_step],
    parameters=[bucket_name_param, aoi_name_param, s2_grid_id_param,
                chip_size_param, batch_size_param, s2_bands_param],
    sagemaker_session=pipeline_session,
)

Step 1: Satellite data acquisition and chipping
The pipeline starts by accessing Sentinel-2 multispectral satellite imagery through the AWS Open Data program from S3 buckets. This imagery provides 10-meter resolution across multiple spectral bands including RGB (visible light) and NIR (near-infrared), which are critical for environmental monitoring.
This step divides large satellite scenes into manageable 256×256 pixel chips and filters out chips that have excessive cloud cover, which enables efficient parallel processing and creates uniform inputs for the foundation model. It runs as a SageMaker AI Processing job with a custom Docker image optimized for geospatial operations.
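The cloud filter can be pictured as a simple per-chip check. The following sketch assumes a boolean cloud mask per chip and a hypothetical threshold of 20%; the actual criterion used in the pipeline is not shown in this post.

```python
import numpy as np

def keep_chip(cloud_mask, max_cloud_fraction=0.2):
    """Return True if the share of cloudy pixels is at or below the threshold."""
    return float(np.mean(cloud_mask)) <= max_cloud_fraction

clear_chip = np.zeros((256, 256), dtype=bool)
cloudy_chip = np.zeros((256, 256), dtype=bool)
cloudy_chip[:128, :] = True  # top half flagged as cloud (50% cloud cover)

kept = [keep_chip(clear_chip), keep_chip(cloudy_chip)]
```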

# Chip generation (simplified version)
import logging
import os

from joblib import Parallel, delayed, parallel_backend

logger = logging.getLogger(__name__)

def generate_chips_from_xarray(ds, chip_size, aoi_name, s2_scene_id):
    """Generates image chips of specified size from xarray dataset
    using Area of Interest (AOI) name and Sentinel-2 scene id"""
    try:
        # Use the available CPU count for parallel processing, capped at 16 jobs
        n_jobs = min(os.cpu_count() or 1, 16)
        logger.info(f"Using {n_jobs} parallel jobs for chip generation")

        # Generate all x,y coordinates for processing
        coordinates = [(x, y)
                       for x in range(0, ds.sizes['x'], chip_size)
                       for y in range(0, ds.sizes['y'], chip_size)]

        logger.info(f"Generating {len(coordinates)} chips from scene")

        # Process chips in parallel (process_single_chip is defined elsewhere and not shown here)
        with parallel_backend('threading', n_jobs=n_jobs):
            results = Parallel(verbose=1)(
                delayed(process_single_chip)(
                    x, y, ds, chip_size, s2_scene_id
                ) for x, y in coordinates
            )

        # Filter out None results (e.g., chips with too many cloud pixels)
        valid_results = [r for r in results if r is not None]

        logger.info(f"Successfully generated {len(valid_results)} valid chips")

        return valid_results
    except Exception as e:
        logger.error(f"Error in chip generation for scene {s2_scene_id}: {str(e)}")
        raise

For each chip, this step generates:

NetCDF datacubes (.netcdf) containing the full multispectral information
RGB thumbnails (.png) for visualization
Rich metadata (.parquet) with geolocation, timestamps, and other metadata

Step 2: Embedding generation using a Clay foundation model
The second step transforms the preprocessed image chips into vector embeddings using the Clay v1 foundation model. This is the most computationally intensive part of the pipeline, using multiple GPU instances (ml.g5.xlarge) to efficiently process the satellite imagery.
For each chip, this step:

Accesses the NetCDF datacube from Amazon S3
Normalizes the spectral bands according to the Clay v1 model’s input requirements
Generates both patch-level and class token (CLS) embeddings
Stores the embeddings as NumPy arrays (.npy) alongside the original data on S3 as intermediate store
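The normalization step above can be sketched as a per-band standardization. The band statistics below are placeholders, not the values that Clay v1 expects, and `normalize_bands` is a hypothetical helper.

```python
import numpy as np

def normalize_bands(chip, band_means, band_stds):
    """Standardize a (bands, height, width) array band by band."""
    means = np.asarray(band_means, dtype=np.float32)[:, None, None]
    stds = np.asarray(band_stds, dtype=np.float32)[:, None, None]
    return (chip.astype(np.float32) - means) / stds

# Placeholder statistics for four bands (R, G, B, NIR); not Clay's real values
means = [1000.0, 1100.0, 1200.0, 2500.0]
stds = [500.0, 500.0, 600.0, 900.0]

# A toy chip in which every pixel equals its band mean
chip = np.stack([np.full((256, 256), m) for m in means])
normalized = normalize_bands(chip, means, stds)
```

Because each pixel in the toy chip equals its band mean, the normalized output is zero everywhere, which is a quick sanity check for the broadcasting.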

While Clay can use all Sentinel-2 spectral bands, our implementation uses RGB and NIR as input bands to generate a 768-dimensional embedding, which provides excellent results in our examples. Customers can easily adapt the input bands based on their specific use cases. These embeddings encapsulate high-level features such as vegetation patterns, urban structures, water bodies, and land use characteristics, without requiring explicit feature engineering.
Step 3: Embedding processing and analysis
The third step analyzes the embeddings to extract meaningful insights, particularly for time-series analysis. Running on high-memory instances, this step:

Performs dimensionality reduction on the embeddings using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) (to be used later for change detection)
Computes cosine similarity between embeddings over time (an alternative for change detection)
Identifies significant changes in the embeddings that might indicate surface changes
Saves processed embeddings in Parquet format for efficient querying
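As a rough illustration of the first two items, the sketch below reduces a toy embedding time series to its first principal component and computes the cosine similarity of each observation to the first one. It uses a plain SVD instead of a library PCA class, and the data and function names are assumptions made for this example.

```python
import numpy as np

def pca_reduce(embeddings, n_components=1):
    """Project row vectors onto their leading principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def cosine_to_baseline(embeddings, baseline):
    """Cosine similarity between each row and a baseline vector."""
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(baseline)
    return embeddings @ baseline / norms

# Toy time series: five observations of one chip, with a shift after the third
rng = np.random.default_rng(42)
before = rng.normal(0.0, 0.05, size=(3, 768)) + 1.0
after = rng.normal(0.0, 0.05, size=(2, 768)) - 1.0
series = np.vstack([before, after])

first_pc = pca_reduce(series)[:, 0]
similarity = cosine_to_baseline(series, series[0])
```

In this toy series, the similarity to the baseline stays close to 1 for the first three observations and drops sharply afterward, which is the kind of signal the change detection step looks for.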

The output includes processed embedding files that contain both the original high-dimensional vectors and their reduced representations, along with computed similarity metrics.
For change detection applications, this step establishes a baseline for each geographic location and calculates deviations from this baseline over time. These deviations, captured as vector distances, provide a powerful indicator of surface changes like deforestation, urban development, or natural disasters.
Step 4: Consolidation and vector database integration
The final pipeline step consolidates the processed embeddings into a unified dataset and loads them into vector databases optimized for similarity search. The outputs include consolidated embedding files, GeoJSON grid files for visualization, and configuration files for frontend applications.
The solution supports two vector database options:

Amazon OpenSearch Serverless for production deployments with very low latency and high scalability requirements
LanceDB for lightweight deployments for smaller regions

Both options provide efficient approximate nearest neighbor (ANN) search capabilities, enabling sub-second query performance. The choice between them depends on the scale of deployment, integration requirements, and operational preferences.
With this robust data processing and embedding generation foundation in place, let’s explore the real-world applications enabled by the pipeline, beginning with geospatial similarity search.
Geospatial similarity search
Organizations working with Earth observation data have traditionally struggled with efficiently identifying specific landscape patterns across large geographic regions. Traditional Earth observation analysis requires specialized models trained on labeled datasets for each target feature. This approach forces organizations into a lengthy process of data collection, annotation, and model training before obtaining results.
In contrast, the GeoFM-powered similarity search converts satellite imagery into 768-dimensional vector embeddings that capture the semantic essence of landscape features, eliminating the need for manual feature engineering and computation of specialized indices like NDVI or NDWI.
This capability uses the Clay foundation model’s pre-training on diverse global landscapes to understand complex relationships between features without explicit programming. The result is an intuitive image-to-image search capability where users can select a reference area (such as early-stage deforestation or wildfire damage) and find similar patterns across vast territories in seconds rather than weeks.
Similarity search implementation
Our implementation provides a streamlined workflow for finding similar geographic areas using the embeddings generated by the data processing pipeline. The search process involves:

Reference area selection: Users select a reference chip representing a search term (for example, a deforested patch, urban development, or agricultural field)
Search parameters: Users specify the number of results and a similarity threshold
Vector search execution: The system retrieves similar chips using cosine similarity between embeddings
Result visualization: Matching chips are highlighted on the map
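The search steps above can be sketched with a brute-force cosine search. In the actual solution the vector store performs this lookup; the function name, parameters, and toy vectors here are assumptions made for illustration.

```python
import numpy as np

def similarity_search(reference, embeddings, top_k=3, min_similarity=0.8):
    """Return (index, score) pairs for the chips most similar to the reference."""
    unit_ref = reference / np.linalg.norm(reference)
    unit_emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = unit_emb @ unit_ref
    ranked = np.argsort(-scores)[:top_k]  # indices of the top_k highest scores
    return [(int(i), float(scores[i])) for i in ranked if scores[i] >= min_similarity]

# Toy 3-dimensional "embeddings" for four chips
chips = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
])
results = similarity_search(np.array([1.0, 0.0, 0.0]), chips)
```

Here the first two chips pass the threshold, while the orthogonal and opposite vectors are dropped even though `top_k` would allow a third result.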

Let’s take a closer look at a real-world application, using our running example of detecting deforestation in the Mato Grosso region of the Brazilian Amazon. Traditional monitoring approaches often detect forest loss too late, after significant damage has already occurred. The Clay-powered similarity search capability offers a new approach by enabling early detection of emerging deforestation patterns before they expand into large-scale clearing operations.
Using a single reference chip showing the initial signs of forest degradation (such as selective logging, small clearings, or new access roads), analysts can quickly identify similar patterns across vast areas of the Amazon rainforest. As demonstrated in the following example images, the system effectively recognizes the subtle signatures of early-stage deforestation based on a single reference image. This capability enables environmental protection agencies and conservation organizations to deploy resources precisely and to address threats before they lead to major forest loss. While a single reference chip led to good results in our examples, alternative approaches exist, such as an average vector strategy, which uses embeddings from multiple reference images to enhance the similarity search results.
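An average vector strategy can be sketched as follows. This is one plausible reading of the idea (normalize each reference embedding, average, and renormalize), not necessarily how the repository implements it; the toy vectors are made up.

```python
import numpy as np

def average_reference(reference_embeddings):
    """Average several reference embeddings into one unit-length query vector."""
    norms = np.linalg.norm(reference_embeddings, axis=1, keepdims=True)
    unit = reference_embeddings / norms
    mean = unit.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Three toy reference embeddings that point in similar directions
references = np.array([
    [1.0, 0.2, 0.0],
    [0.8, 0.0, 0.2],
    [0.9, 0.1, 0.1],
])
query = average_reference(references)
```

The resulting `query` vector can then be used in place of a single reference embedding in a cosine similarity search.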

Ecosystem change detection
Unlike vector-based similarity search, change detection focuses on measuring the distance between embedding vectors over time, the core assumption being that the more distant embedding vectors are from each other, the more dissimilar the underlying satellite imagery is. If applied to a single region over time, this lets you pinpoint so-called change points: periods in which significant and long-lasting change in surface conditions occurred.
Our solution implements a timeline view of Sentinel-2 satellite observations from 2018 to present. Each observation point corresponds to a unique satellite image, allowing for detailed temporal analysis. While embedding vectors are high-dimensional, we use the previously computed PCA (and optionally t-SNE) to reduce dimensionality to a single dimension for visualization purposes.
Let’s review a compelling example from our analysis of deforestation in the Amazon. The following image is a time series plot of geospatial embeddings (first principal component) for a single 256×256 pixel chip. Cloudy images and major outliers have been removed.

Points clustered closely on the y-axis indicate similar ground conditions; sudden and persistent discontinuities in the embedding values signal significant change. Here’s what the analysis shows:

Stable forest conditions from 2018 through 2020
A significant discontinuity in embedding values during 2021. Closer review of the underlying satellite imagery shows clear evidence of forest clearing and conversion to agricultural fields
Further transformation visible in 2024 imagery

Naturally, we need a way to automate the process of change detection so that it can be applied at scale. Given that we do not typically have extensive changepoint training datasets, we need an unsupervised approach that works without labeled data. The intuition behind unsupervised change detection is the following: identify what normal looks like, then highlight large enough deviations from normal and flag them as change points; after a change point has occurred, characterize the new normal and repeat the process.
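The intuition described above (characterize normal, flag persistent deviations, then re-baseline) can be sketched with a simple z-score band. This is a deliberately simplified stand-in and not the harmonic regression used in this post; the parameter values and the function name are assumptions.

```python
from statistics import mean, stdev

def detect_change_points(values, baseline_len=5, z_threshold=4.0, persistence=3):
    """Flag indices where values leave the baseline band and stay out."""
    change_points = []
    start = 0
    while start + baseline_len < len(values):
        baseline = values[start:start + baseline_len]
        mu, sigma = mean(baseline), max(stdev(baseline), 1e-6)
        i = start + baseline_len
        found = None
        while i + persistence <= len(values):
            window = values[i:i + persistence]
            # Require several consecutive outliers so that one-off noise is ignored
            if all(abs(v - mu) / sigma > z_threshold for v in window):
                found = i
                break
            i += 1
        if found is None:
            break
        change_points.append(found)
        start = found  # characterize the new normal from the change point onward
    return change_points

series = [1.0, 1.1, 0.9, 1.0, 1.05, 1.0, 0.95, 5.0, 5.1, 4.9, 5.0, 5.05, 5.0, 4.95]
print(detect_change_points(series))  # [7]
```

The persistence requirement is the design choice that separates long-lasting change from single noisy observations, such as a residual cloud.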
The following function performs harmonic regression analysis on the embeddings timeseries data, specifically designed to model yearly seasonality patterns. The function fits a harmonic regression with a specified frequency (default 365 days for annual patterns) to the embedding data of a baseline period (the year 2018 in this example). It then generates predictions and calculates error metrics (absolute and percentage deviations). Large deviations from the normal seasonal pattern indicate change and can be automatically flagged using thresholding.
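Written out, the model fitted by this function has the following form. This is a sketch derived from the function's arguments (`avg_days_yr`, `deg`, `reg`) and the Lasso objective as documented by scikit-learn; the symbols $T$, $d$, $\alpha$, $\beta$, $a_k$, and $p_k$ are notation introduced here, with $T$ the period, $d$ the polynomial degree, and $\alpha$ the regularization strength.

```latex
\begin{align*}
\omega &= \frac{2\pi}{T}, \qquad
\hat{y}(t) = \beta_0 + \sum_{\substack{i, j \ge 0 \\ 1 \le i + j \le d}}
  \beta_{ij}\, \sin^{i}(\omega t)\, \cos^{j}(\omega t) \\
\hat{\beta} &= \arg\min_{\beta}\; \frac{1}{2n} \sum_{k=1}^{n}
  \bigl(y_k - \hat{y}(t_k)\bigr)^2 + \alpha \sum_{i,j} \lvert \beta_{ij} \rvert \\
a_k &= \lvert y_k - \hat{y}(t_k) \rvert, \qquad
p_k = 100 \cdot \frac{a_k}{\lvert \hat{y}(t_k) \rvert}
\end{align*}
```

Products of sines and cosines up to degree $d$ span the harmonics up to order $d$, so the default degree of 3 lets the fit capture annual seasonality plus its second and third harmonics. The sum over $k$ in the objective runs over the baseline observations only, whereas $a_k$ and $p_k$ are computed for all observations.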

import datetime

import matplotlib.dates as mdates
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

def fit_harmonic_regression(data, outliers, date_col="date", y_col="patch_emb_pca_1",
                            baseline_start_date=datetime.datetime(2018,1,1),
                            monitoring_start_date=datetime.datetime(2019,1,1),
                            deg=3, reg=0.001, avg_days_yr=365):
    """
    Fits a harmonic regression model to time series data using Lasso regularization.

    Parameters:
    - data: DataFrame containing the time series data
    - outliers: List of dates to exclude from the fitting process
    - date_col: Name of the column containing dates (default: 'date')
    - y_col: Name of the column containing target values (default: 'patch_emb_pca_1')
    - baseline_start_date: Start date for the baseline period (default: Jan 1, 2018)
    - monitoring_start_date: Start date for the monitoring period (default: Jan 1, 2019)
    - deg: Degree of polynomial features (default: 3)
    - reg: Regularization strength for Lasso (default: 0.001)
    - avg_days_yr: Number of days in a year (default: 365)
    """

    # Convert date column to datetime format and create numerical representation
    data[date_col] = pd.to_datetime(data[date_col])
    data["date_numerical"] = data[date_col].apply(lambda x: mdates.date2num(x))
    t_full = data["date_numerical"]
    y_full = data[y_col]

    # Filter data for model fitting:
    # - Remove outliers
    # - Keep only data between baseline_start_date and monitoring_start_date
    fit_mask = (
        (~data[date_col].isin(outliers)) &
        (data[date_col] >= baseline_start_date) &
        (data[date_col] < monitoring_start_date)
    )
    t_fitting = data[fit_mask]["date_numerical"]
    y_fitting = data[fit_mask][y_col]

    # Create design matrix for harmonic regression
    # Convert the period to an angular frequency (2π/period)
    w = 2 * np.pi / avg_days_yr

    # Generate polynomial features from sine and cosine transformations
    poly = PolynomialFeatures(deg)
    # Create feature matrix for fitting period
    X_fitting = poly.fit_transform(np.column_stack((
        np.sin(w*t_fitting),
        np.cos(w*t_fitting)
    )))
    # Create feature matrix for all data points (reuse the fitted transformer)
    X_full = poly.transform(np.column_stack((
        np.sin(w*t_full),
        np.cos(w*t_full)
    )))

    # Initialize and fit Lasso regression model
    lasso_model = Lasso(alpha=reg)
    lasso_model.fit(X_fitting, y_fitting)

    # Generate predictions for all time points
    y_fit = lasso_model.predict(X_full)

    # Calculate error metrics
    absolute_deviation = np.abs(y_full - y_fit)
    percentage_deviation = (absolute_deviation / np.abs(y_fit)) * 100

    # Compile results into a DataFrame
    df = pd.DataFrame()
    df["date"] = data[date_col]
    df["date_numerical"] = data["date_numerical"]
    df[f"{y_col}_true"] = data[y_col]  # Original values
    df[f"{y_col}_pred"] = y_fit  # Predicted values
    df[f"{y_col}_abs_error"] = absolute_deviation  # Absolute error
    df[f"{y_col}_perc_error"] = percentage_deviation  # Percentage error
    # Add time-based features
    df["year"] = df["date"].apply(lambda x: x.year)
    df["month"] = df["date"].apply(lambda x: x.month)
    df["year_month"] = df.apply(
        lambda x: "{}_{}".format(str(x.year), str(x.month)),
        axis=1
    )
    # Return the fitted model, polynomial transformer, and results DataFrame
    return lasso_model, poly, df

By applying this function to the chips across an area of observation and defining a threshold on the maximum deviation from the fitted harmonic regression, we can automatically map change intensity, which lets analysts quickly zoom in on problematic areas.

While this method performs well in our analyses, it is also quite rigid in that it requires careful tuning of error thresholds and the definition of a baseline period. More sophisticated approaches are available, ranging from general-purpose time series analyses that automate the baseline definition and change point detection using recursive methods (for example, Gaussian Processes) to specialized algorithms for geospatial change detection (for example, LandTrendr and Continuous Change Detection and Classification (CCDC)).
In sum, our approach to change detection demonstrates the power of geospatial embedding vectors in tracking environmental changes over time, providing valuable insights for land use monitoring, environmental protection, and urban planning applications.
GeoFM fine-tuning for your custom use case
Fine-tuning is a specific implementation of transfer learning, in which a pre-trained foundation model is adapted to specific tasks through targeted additional training on specialized labeled datasets. For GeoFMs, these specific tasks can target agriculture, disaster monitoring, or urban analysis. The model retains its broad spatial understanding while developing expertise for particular regions, ecosystems, or analytical tasks. This approach significantly reduces computational and data requirements compared to building specialized models from scratch, without sacrificing accuracy. Fine-tuning typically involves preserving Clay’s pre-trained encoder (which has already learned rich representations of spectral patterns, spatial relationships, and temporal dynamics from massive amounts of satellite imagery) while attaching and training a specialized task-specific head.
For pixel-wise prediction tasks (such as land use segmentation), the specialized head is typically a decoder architecture, whereas for class-level outputs (classification tasks) the head can be as basic as a multilayer perceptron network. Training focuses exclusively on the new decoder, which takes the feature representations from the model’s frozen encoder and gradually transforms them back to full-resolution images in which each pixel is classified according to its land use type.
The segmentation framework combines the powerful pre-trained Clay encoder with an efficient convolutional decoder, taking Clay’s rich understanding of satellite imagery and converting it into detailed land use maps. The lightweight decoder uses convolutional layers and pixel shuffle upsampling to turn the encoder’s feature representations into per-pixel class predictions at full resolution. By freezing the encoder (which contains 24 transformer heads and 16 attention heads) and training only the compact decoder, the model achieves a good balance between computational efficiency and segmentation accuracy.
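Pixel shuffle upsampling trades channels for spatial resolution. The following NumPy sketch shows the rearrangement on its own, using the same channel ordering as the common deep learning convention; it is an illustration and not the decoder code from the repository.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r)."""
    c_in, h, w = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    x = x.reshape(c_out, r, r, h, w)       # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)         # reorder to (c, h, i, w, j)
    return x.reshape(c_out, h * r, w * r)  # interleave i with rows, j with columns

# Four 1x1 feature maps become one 2x2 output
features = np.arange(4, dtype=np.float32).reshape(4, 1, 1)
upsampled = pixel_shuffle(features, r=2)
print(upsampled)
```

Because the operation only rearranges values, it has no trainable parameters; the convolutional layers before it learn what to place in each channel.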
We applied this segmentation architecture to a labeled land use land cover (LULC) dataset from Impact Observatory, hosted on the Amazon Registry of Open Data. For illustrative purposes, we again focused on our running example from Brazil’s Mato Grosso region. We trained the decoder head for 10 epochs, which took 17 minutes in total, and tracked intersection over union (IOU) and F1 score as segmentation accuracy metrics. After just one training epoch, the model already achieved 85.7% validation IOU. With the full 10 epochs completed, performance increased to 92.4% IOU and 95.6% F1 score. In the following image, we show ground truth satellite imagery (upper) and the model’s predictions (lower). The visual comparison highlights how accurately this approach can classify different land use categories.
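For reference, both metrics can be computed from pixel counts. The sketch below handles a single class with boolean masks and assumes that at least one pixel is positive in either mask; how the post's multi-class scores were aggregated is not stated, so this is only an illustration of the definitions.

```python
import numpy as np

def iou_and_f1(pred, target):
    """Compute IoU and F1 for two boolean masks of the same shape."""
    tp = np.logical_and(pred, target).sum()   # predicted and labeled positive
    fp = np.logical_and(pred, ~target).sum()  # predicted positive only
    fn = np.logical_and(~pred, target).sum()  # labeled positive only
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return float(iou), float(f1)

target = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0]], dtype=bool)
pred = np.array([[1, 1, 1, 0],
                 [1, 0, 0, 0]], dtype=bool)
iou, f1 = iou_and_f1(pred, target)
print(iou, f1)  # 0.6 0.75
```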

Conclusion
Novel GeoFMs provide an encouraging new approach to geospatial analytics. Through their extensive pre-training, these models have incorporated a deep implicit understanding of geospatial data and can be used out-of-the-box for high-impact use cases such as similarity search or change detection. They can also serve as the basis for specialized models using a fine-tuning process that is significantly less data-hungry (fewer labeled data needed) and has lower compute requirements.
In this post, we have shown how you can deploy a state-of-the-art GeoFM (Clay) on AWS and have explored one specific use case, monitoring deforestation in the Amazon rainforest, in greater detail. The same approach is applicable to a large variety of industry use cases. For example, insurance companies can use a similar approach to assess damage after natural disasters, including hurricanes, floods, or fires, and keep track of their insured assets. Agricultural organizations can use GeoFMs for crop type identification, crop yield predictions, or other use cases. We also envision high-impact use cases in industries like urban planning, emergency and disaster response, supply chain and global trade, sustainability and environmental modeling, and many others. To get started applying GeoFMs to your own Earth observation use case, check out the accompanying GitHub repository, which has the prerequisites and a step-by-step walkthrough to run it on your own area of interest.

About the Authors
Dr. Karsten Schroer is a Senior Machine Learning (ML) Prototyping Architect at AWS, focused on helping customers leverage artificial intelligence (AI), ML, and generative AI technologies. With deep ML expertise, he collaborates with companies across industries to design and implement data- and AI-driven solutions that generate business value. Karsten holds a PhD in applied ML.
Bishesh Adhikari is a Senior ML Prototyping Architect at AWS with over a decade of experience in software engineering and AI/ML. Specializing in GenAI, LLMs, NLP, CV, and GeoSpatial ML, he collaborates with AWS customers to build solutions for challenging problems through co-development. His expertise accelerates customers’ journey from concept to production, tackling complex use cases across various industries. In his free time, he enjoys hiking, traveling, and spending time with family and friends.
Dr. Iza Moise is a Senior Machine Learning (ML) Prototyping Architect at AWS, with expertise in both traditional ML and advanced techniques like foundation models and vision transformers. She focuses on applied ML across diverse scientific fields, publishing and reviewing at Amazon’s internal ML conferences. Her strength lies in translating theoretical advances into practical solutions that deliver measurable impact through thoughtful implementation.

Create an agentic RAG application for advanced knowledge discovery wit …

Posted on May 30, 2025 by i-genie

Agentic Retrieval Augmented Generation (RAG) applications represent an advanced approach in AI that integrates foundation models (FMs) with external knowledge retrieval and autonomous agent capabilities. These systems dynamically access and process information, break down complex tasks, use external tools, apply reasoning, and adapt to various contexts. They go beyond simple question answering by performing multi-step processes, making decisions, and generating complex outputs.
In this post, we demonstrate an example of building an agentic RAG application using the LlamaIndex framework. LlamaIndex is a framework that connects FMs with external data sources. It helps ingest, structure, and retrieve information from databases, APIs, PDFs, and more, enabling agent and RAG capabilities in AI applications.
This application serves as a research tool, using the Mistral Large 2 FM on Amazon Bedrock to generate responses for the agent flow. The example application interacts with well-known websites, such as arXiv, GitHub, TechCrunch, and DuckDuckGo, and can access knowledge bases containing documentation and internal knowledge.
This application can be further expanded to accommodate broader use cases requiring dynamic interaction with internal and external APIs, as well as the integration of internal knowledge bases to provide more context-aware responses to user queries.
Solution overview
This solution uses the LlamaIndex framework to build an agent flow with two main components: AgentRunner and AgentWorker. The AgentRunner serves as an orchestrator that manages conversation history, creates and maintains tasks, executes task steps, and provides a user-friendly interface for interactions. The AgentWorker handles the step-by-step reasoning and task execution.
For reasoning and task planning, we use Mistral Large 2 on Amazon Bedrock. You can use other text generation FMs available from Amazon Bedrock. For the full list of supported models, see Supported foundation models in Amazon Bedrock. The agent integrates with GitHub, arXiv, TechCrunch, and DuckDuckGo APIs, while also accessing internal knowledge through a RAG framework to provide context-aware answers.
In this solution, we present two options for building the RAG framework:

Document integration with Amazon OpenSearch Serverless – The first option involves using LlamaIndex to programmatically load and process documents. It splits the documents into chunks using various chunking strategies and then stores these chunks in an Amazon OpenSearch Serverless vector store for future retrieval.
Document integration with Amazon Bedrock Knowledge Bases – The second option uses Amazon Bedrock Knowledge Bases, a fully managed service that handles the loading, processing, and chunking of documents. This service can quickly create a new vector store on your behalf with a few configurations and clicks. You can choose from Amazon OpenSearch Serverless, Amazon Aurora PostgreSQL-Compatible Edition Serverless, and Amazon Neptune Analytics. Additionally, the solution includes a document retrieval rerank feature to enhance the relevance of the responses.

You can select the RAG implementation option that best suits your preference and developer skill level.
The following diagram illustrates the solution architecture.

In the following sections, we present the steps to implement the agentic RAG application. You can also find the sample code in the GitHub repository.
Prerequisites
The solution has been tested in the AWS Region us-west-2. Complete the following steps before proceeding:

Set up the following resources:

Create an Amazon SageMaker domain.
Create a SageMaker domain user profile.
Launch Amazon SageMaker Studio, select JupyterLab, and create a space.
Select the instance t3.medium and the image SageMaker Distribution 2.3.1, then run the space.

Request model access:

On the Amazon Bedrock console, choose Model access in the navigation pane.
Choose Modify model access.
Select the models Mistral Large 2 (24.07), Amazon Titan Text Embeddings V2, and Rerank 1.0 from the list, and request access to these models.

Configure AWS Identity and Access Management (IAM) permissions:

In the SageMaker console, go to the SageMaker user profile details and find the execution role that the SageMaker notebook uses. It should look like AmazonSageMaker-ExecutionRole-20250213T123456.

In the IAM console, create an inline policy for this execution role so that your role can perform the following actions:

Access to Amazon Bedrock services including:

Reranking capabilities
Retrieving information
Invoking models
Listing available foundation models

IAM permissions to:

Create policies
Attach policies to roles within your account

Full access to Amazon OpenSearch Serverless service
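As a starting point, the permissions listed above could be expressed as an inline policy like the one sketched below. The mapping from the list to IAM action names is an assumption made for this example, and the wildcard resources are for illustration only; verify both against the sample repository and scope them down before use.

```python
import json

# Hypothetical inline policy; the action names are assumed mappings of the
# permissions listed above, and "Resource": "*" should be tightened in practice
inline_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BedrockAccess",
            "Effect": "Allow",
            "Action": [
                "bedrock:Rerank",
                "bedrock:Retrieve",
                "bedrock:InvokeModel",
                "bedrock:ListFoundationModels",
            ],
            "Resource": "*",
        },
        {
            "Sid": "IamPolicyManagement",
            "Effect": "Allow",
            "Action": ["iam:CreatePolicy", "iam:AttachRolePolicy"],
            "Resource": "*",
        },
        {
            "Sid": "OpenSearchServerlessFullAccess",
            "Effect": "Allow",
            "Action": "aoss:*",
            "Resource": "*",
        },
    ],
}

# Serialize to JSON so it can be pasted into the IAM console policy editor
policy_json = json.dumps(inline_policy, indent=2)
```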

Run the following command in the JupyterLab notebook terminal to download the sample code from GitHub:

git init
git remote add origin https://github.com/aws-samples/mistral-on-aws.git
git sparse-checkout init
git sparse-checkout set "notebooks/mistral-llamaindex-agentic-rag"
git pull origin main

Finally, install the required Python packages by running the following command in the terminal:

cd notebooks/mistral-llamaindex-agentic-rag
pip install -r requirements.txt

Initialize the models
Initialize the FM used for orchestrating the agentic flow with Amazon Bedrock Converse API. This API provides a unified interface for interacting with various FMs available on Amazon Bedrock. This standardization simplifies the development process, allowing developers to write code one time and seamlessly switch between different models without adjusting for model-specific differences. In this example, we use the Mistral Large 2 model on Amazon Bedrock.
Next, initialize the embedding model from Amazon Bedrock, which is used for converting document chunks into embedding vectors. For this example, we use Amazon Titan Text Embeddings V2. See the following code:

# Initialize and configure the BedrockConverse LLM with the Mistral Large 2 model and set it as the default in Settings

from llama_index.llms.bedrock_converse import BedrockConverse
from llama_index.core import Settings
llm = BedrockConverse(model="mistral.mistral-large-2407-v1:0", max_tokens=2048)
Settings.llm = llm

# Initialize and configure the embedding model with Amazon Titan Text Embeddings V2, and set it as the default in Settings

from llama_index.embeddings.bedrock import BedrockEmbedding
embed_model = BedrockEmbedding(model_name="amazon.titan-embed-text-v2:0")
Settings.embed_model = embed_model

Integrate API tools
Implement two functions to interact with the GitHub and TechCrunch APIs. The APIs shown in this post don’t require credentials. To provide clear communication between the agent and the foundation model, follow Python function best practices, including:

Type hints for parameter and return value validation
Detailed docstrings explaining function purpose, parameters, and expected returns
Clear function descriptions
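Type hints and docstrings matter because tool wrappers generally build the description shown to the model from the function's signature and docstring. The sketch below mimics that idea with the standard library only; `describe_tool` and `word_count` are hypothetical helpers and not part of LlamaIndex.

```python
import inspect

def describe_tool(fn):
    """Build a plain-text tool description from a function's signature and docstring."""
    signature = inspect.signature(fn)
    doc = inspect.getdoc(fn) or ""
    return f"{fn.__name__}{signature}\n{doc}"

def word_count(text: str, min_length: int = 1) -> int:
    """Count the words in `text` that have at least `min_length` characters."""
    return sum(1 for word in text.split() if len(word) >= min_length)

description = describe_tool(word_count)
print(description)
```

A function without annotations or a docstring would produce a description with almost no guidance, which makes it harder for the FM to decide when and how to call the tool.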

The following code sample shows the function that integrates with the GitHub API. After the function is created, use the FunctionTool.from_defaults() method to wrap the function as a tool and integrate it seamlessly into the LlamaIndex workflow.
See the code repository for the full code samples of the function that integrates with the TechCrunch API.

# Define a function to search GitHub repositories by topic, sorting by stars or update date, and return top results
import requests
from llama_index.core.tools import FunctionTool

def github_search(topic: str, num_results: int = 3, sort_by: str = "stars") -> list:
    """
    Retrieve a specified number of GitHub repositories based on a given topic,
    ranked by the specified criteria.

    This function uses the GitHub API to search for repositories related to a
    specific topic or keyword. The results can be sorted by the number of stars
    (popularity) or the most recent update, with the most relevant repositories
    appearing first according to the chosen sorting method.

    Parameters:
    -----------
    topic : str
        The topic or keyword to search for in GitHub repositories.
        The topic cannot contain blank spaces.
    num_results : int, optional
        The number of repository results to retrieve. Defaults to 3.
    sort_by : str, optional
        The criterion for sorting the results. Options include:
        - 'stars': Sort by the number of stars (popularity).
        - 'updated': Sort by the date of the last update (most recent first).
        Defaults to 'stars'.

    Returns:
    --------
    list
        A list of dictionaries, where each dictionary contains information
        about a repository. Each dictionary includes:
        - 'html_url': The URL of the repository.
        - 'description': A brief description of the repository.
        - 'stargazers_count': The number of stars (popularity) the repository has.
    """

    url = f"https://api.github.com/search/repositories?q=topic:{topic}&sort={sort_by}&order=desc"
    response = requests.get(url).json()
    # Keep only the fields the agent needs from the top results
    code_repos = [
        {
            'html_url': item['html_url'],
            'description': item['description'],
            'stargazers_count': item['stargazers_count'],
        }
        for item in response['items'][:num_results]
    ]
    return code_repos

# Wrap the function as a tool that the agent can call
github_tool = FunctionTool.from_defaults(fn=github_search)

For arXiv and DuckDuckGo integration, we use LlamaIndex’s pre-built tools instead of creating custom functions. You can explore other available pre-built tools in the LlamaIndex documentation to avoid duplicating existing solutions.

# Import and configure the ArxivToolSpec and DuckDuckGoSearchToolSpec from LlamaIndex prebuilt tools

from llama_index.tools.arxiv import ArxivToolSpec
from llama_index.tools.duckduckgo import DuckDuckGoSearchToolSpec

arxiv_tool = ArxivToolSpec()
search_tool = DuckDuckGoSearchToolSpec()

api_tools = arxiv_tool.to_tool_list() + search_tool.to_tool_list()

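# Consolidate all tools into one list.
# news_tool is assumed to be the TechCrunch tool, created the same way as github_tool (see the code repository).
api_tools.extend([news_tool, github_tool])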

RAG option 1: Document integration with Amazon OpenSearch Serverless
Next, programmatically build the RAG component using LlamaIndex to load, process, and chunk documents, and store the embedding vectors in Amazon OpenSearch Serverless. This approach offers greater flexibility for advanced scenarios, such as loading various file types (including .epub and .ppt) and selecting advanced chunking strategies based on file types (such as HTML, JSON, and code).
Before moving forward, you can download some PDF documents for testing from the AWS website using the following command, or you can use your own documents. The following documents are AWS guides that help in choosing the right generative AI service (such as Amazon Bedrock or Amazon Q) based on use case, customization needs, and automation potential. They also assist in selecting AWS machine learning (ML) services (such as SageMaker) for building models, using pre-trained AI, and using cloud infrastructure.

# Download test documents from the links below.
# The URLs are quoted so the shell does not interpret the & and # characters.
!mkdir -p docs
!wget -O docs/genai_on_aws.pdf "https://docs.aws.amazon.com/pdfs/decision-guides/latest/generative-ai-on-aws-how-to-choose/generative-ai-on-aws-how-to-choose.pdf?did=wp_card&trk=wp_card#guide"
!wget -O docs/ml_on_aws.pdf "https://docs.aws.amazon.com/pdfs/decision-guides/latest/machine-learning-on-aws-how-to-choose/machine-learning-on-aws-how-to-choose.pdf?did=wp_card&trk=wp_card#guide"

Load the PDF documents using SimpleDirectoryReader() in the following code. For a full list of supported file types, see the LlamaIndex documentation.

# Use LlamaIndex to load documents
from llama_index.core import SimpleDirectoryReader
loader = SimpleDirectoryReader("docs/")
documents = loader.load_data()

Next, create an Amazon OpenSearch Serverless collection as the vector database. Check the utils.py file for details on the create_collection() function.

# Create Amazon OpenSearch Serverless collection
from utils import *
import sagemaker
import random

region_name = "us-west-2"
suffix = random.randrange(1, 500)
collection_name = "llamaindex-blog-" + str(suffix)
notebook_execution_role = sagemaker.get_execution_role()
endpoint = create_collection(collection_name, notebook_execution_role)

After you create the collection, create an index to store embedding vectors:

## create an index in the collection
index_name = "pdf-rag"
create_index(index_name, endpoint, emb_dim=1024)

Next, use the following code to implement a document search system using LlamaIndex integrated with Amazon OpenSearch Serverless. It first sets up AWS authentication to securely access OpenSearch Service, then configures a vector client that can handle 1024-dimensional embeddings (specifically designed for the Amazon Titan Embedding V2 model). The code processes input documents by breaking them into manageable chunks of 1,024 tokens with a 20-token overlap, converts these chunks into vector embeddings, and stores them in the OpenSearch Serverless vector index. You can select a different or more advanced chunking strategy by modifying the transformations parameter in the VectorStoreIndex.from_documents() method. For more information and examples, see the LlamaIndex documentation.

import boto3
from llama_index.vector_stores.opensearch import OpensearchVectorStore, OpensearchVectorClient
from opensearchpy import RequestsHttpConnection, AWSV4SignerAuth
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core.node_parser import SentenceSplitter

## integrate Amazon OpenSearch Serverless collection and index to llamaindex

dim = 1024  # Amazon Titan Embedding V2 model dimension
service = "aoss"
credentials = boto3.Session().get_credentials()
awsauth = AWSV4SignerAuth(credentials, region_name, service)

client = OpensearchVectorClient(
    endpoint,
    index_name,
    dim,
    embedding_field="vector",
    text_field="chunk",
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# initialise vector store and save document chunks to the vector store
vector_store = OpensearchVectorStore(client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    transformations=[SentenceSplitter(chunk_size=1024, chunk_overlap=20)],
)
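To make the chunk_size and chunk_overlap settings more concrete, the following is a minimal sketch of a sliding window with overlap. It is a simplification under stated assumptions: it splits on a plain list of words, whereas SentenceSplitter counts tokens and tries to respect sentence boundaries. The function name chunk_with_overlap is hypothetical and not part of LlamaIndex.

```python
def chunk_with_overlap(words, chunk_size, chunk_overlap):
    """Split a list of words into windows that share `chunk_overlap` words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = chunk_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last word of the previous one, which is the role the 20-token overlap plays in the pipeline: context that straddles a chunk boundary is still present in at least one chunk.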

You can add a reranking step in the RAG pipeline, which improves the quality of information retrieved by making sure that the most relevant documents are presented to the language model, resulting in more accurate and on-topic responses:

from llama_index.postprocessor.bedrock_rerank import AWSBedrockRerank
reranker = AWSBedrockRerank(
    top_n=3,
    model_id="amazon.rerank-v1:0",  # another rerank model option is: cohere.rerank-v3-5:0
    region_name="us-west-2",
)
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker],
)

Use the following code to test the RAG framework. You can compare results by enabling or disabling the reranker model.

response = query_engine.query(
    "In which situation should I use Amazon Bedrock over Amazon SageMaker?",
)
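The two-stage idea behind similarity_top_k=10 and top_n=3 can be sketched without any AWS calls: a cheap score selects a wide candidate set, and a stronger score picks the few results that reach the language model. The scoring functions below are toy stand-ins invented for illustration; they are not how vector search or the Amazon Bedrock rerank model score documents.

```python
def retrieve_then_rerank(query, docs, similarity, rerank_score, top_k=10, top_n=3):
    """Keep the top_k docs by a cheap score, then the top_n by a stronger score."""
    candidates = sorted(docs, key=lambda d: similarity(query, d), reverse=True)[:top_k]
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:top_n]


# Toy scorers (assumptions for illustration only).
def word_overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))


def phrase_bonus(query, doc):
    # Reward documents that contain the query as an exact phrase.
    return word_overlap(query, doc) + (5 if query.lower() in doc.lower() else 0)


docs = [
    "Amazon Bedrock offers managed foundation models",
    "SageMaker is used to build and train models",
    "Use Amazon Bedrock for managed generative AI",
    "OpenSearch stores embedding vectors",
]
top = retrieve_then_rerank("amazon bedrock", docs, word_overlap, phrase_bonus, top_k=3, top_n=2)
print(top)
```

Disabling the reranker corresponds to returning the candidates list directly, which is a simple way to compare both behaviors on your own questions.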

Next, convert the vector store into a LlamaIndex QueryEngineTool, which requires a tool name and a comprehensive description. This tool is then combined with other API tools to create an agent worker that executes tasks in a step-by-step manner. The code initializes an AgentRunner to orchestrate the entire workflow, analyzing text inputs and generating responses. The system can be configured to support parallel tool execution for improved efficiency.

# create QueryEngineTool based on the OpenSearch vector store

from llama_index.core.tools import QueryEngineTool, ToolMetadata
oss_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="oss_guide_tool",
        description="""
These decision guides help users select appropriate AWS machine learning and generative AI services based on specific needs.
They cover pre-built solutions, customizable platforms, and infrastructure options for ML workflows,
while outlining how generative AI can automate processes, personalize content, augment data, reduce costs,
and enable faster experimentation in various business contexts.""",
    ),
)

all_tools = api_tools + [oss_tool]

agent_worker = FunctionCallingAgentWorker.from_tools(
    all_tools,
    llm=llm,
    verbose=True,  # Set verbose=True to display the full trace of steps.
    system_prompt=system_prompt,
    # allow_parallel_tool_calls=True  # Uncomment this line to allow multiple tool invocations
)
agent = AgentRunner(agent_worker)
response = agent.chat(text_input)

You have now completed building the agentic RAG application using LlamaIndex and Amazon OpenSearch Serverless. You can test the chatbot application with your own questions. For example, ask about the latest news and features regarding Amazon Bedrock, or inquire about the latest papers and most popular GitHub repositories related to generative AI.
RAG option 2: Document integration with Amazon Bedrock Knowledge Bases
In this section, you use Amazon Bedrock Knowledge Bases to build the RAG framework. You can create an Amazon Bedrock knowledge base on the Amazon Bedrock console or follow the provided notebook example to create it programmatically. Create a new Amazon Simple Storage Service (Amazon S3) bucket for the knowledge base, then upload the previously downloaded files to this S3 bucket. You can select different embedding models and chunking strategies that work better for your data. After you create the knowledge base, remember to sync the data. Data synchronization might take a few minutes.
To enable your newly created knowledge base to invoke the rerank model, you need to modify its permissions. First, open the Amazon Bedrock console and locate the service role that matches the one shown in the following screenshot.

Choose the role and add the following provided IAM permission policy as an inline policy. This additional authorization grants your knowledge base the necessary permissions to successfully invoke the rerank model on Amazon Bedrock.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            "Resource": "arn:aws:bedrock:us-west-2::foundation-model/amazon.rerank-v1:0"
        },
        {
            "Effect": "Allow",
            "Action": "bedrock:Rerank",
            "Resource": "*"
        }
    ]
}
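If you deploy in a Region other than us-west-2, the model ARN in the policy has to change with it. The following is a small sketch that generates the same policy document for a given Region. The helper name rerank_policy is hypothetical, and the sketch only builds the JSON; attaching it to the service role is still done as described above.

```python
import json


def rerank_policy(region, model_id="amazon.rerank-v1:0"):
    """Build the inline policy shown above for a given Region and rerank model."""
    model_arn = f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "bedrock:InvokeModel", "Resource": model_arn},
            {"Effect": "Allow", "Action": "bedrock:Rerank", "Resource": "*"},
        ],
    }


policy_json = json.dumps(rerank_policy("us-west-2"), indent=4)
print(policy_json)
```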

Use the following code to integrate the knowledge base into the LlamaIndex framework. Specific configurations can be provided in the retrieval_config parameter, where numberOfResults is the maximum number of chunks retrieved from the vector store, and overrideSearchType accepts two valid values: HYBRID and SEMANTIC. In rerankingConfiguration, you can optionally provide a rerank modelConfiguration and numberOfRerankedResults to sort the retrieved chunks by relevance score and keep only the specified number of results. For the full list of available configurations for retrieval_config, refer to the Retrieve API documentation.

# Configure a knowledge base retriever using AmazonKnowledgeBasesRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.retrievers.bedrock import AmazonKnowledgeBasesRetriever

# maximum number of relevant text chunks that will be retrieved
# If you need quick, focused answers: lower numbers (1-3)
# If you need detailed, comprehensive answers: higher numbers (5-10)
top_k = 10

# search mode options: HYBRID, SEMANTIC
# HYBRID search combines the strengths of semantic search and keyword search
# Balances semantic understanding with exact matching
# https://docs.llamaindex.ai/en/stable/examples/retrievers/bedrock_retriever/
search_mode = “HYBRID”

kb_retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id=knowledge_base_id,
    retrieval_config={
        "vectorSearchConfiguration": {
            "numberOfResults": top_k,
            "overrideSearchType": search_mode,
            "rerankingConfiguration": {
                "bedrockRerankingConfiguration": {
                    "modelConfiguration": {
                        "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/amazon.rerank-v1:0"
                    },
                    "numberOfRerankedResults": 3
                },
                "type": "BEDROCK_RERANKING_MODEL"
            }
        },
    }
)
kb_engine = RetrieverQueryEngine(retriever=kb_retriever)
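Because retrieval_config is a nested dictionary, a small builder can make it easier to switch the search mode or turn reranking on and off while you experiment. The following is a sketch under stated assumptions: build_retrieval_config is a hypothetical helper, and it only assembles the keys already used in the preceding code.

```python
VALID_SEARCH_TYPES = {"HYBRID", "SEMANTIC"}


def build_retrieval_config(top_k, search_mode, rerank_model_arn=None, reranked_results=3):
    """Assemble a retrieval_config dict; reranking is included only if an ARN is given."""
    if search_mode not in VALID_SEARCH_TYPES:
        raise ValueError(f"search_mode must be one of {sorted(VALID_SEARCH_TYPES)}")
    vector_config = {"numberOfResults": top_k, "overrideSearchType": search_mode}
    if rerank_model_arn is not None:
        vector_config["rerankingConfiguration"] = {
            "type": "BEDROCK_RERANKING_MODEL",
            "bedrockRerankingConfiguration": {
                "modelConfiguration": {"modelArn": rerank_model_arn},
                "numberOfRerankedResults": reranked_results,
            },
        }
    return {"vectorSearchConfiguration": vector_config}


with_rerank = build_retrieval_config(
    10, "HYBRID",
    rerank_model_arn="arn:aws:bedrock:us-west-2::foundation-model/amazon.rerank-v1:0",
)
without_rerank = build_retrieval_config(10, "SEMANTIC")
```

The resulting dictionary can be passed as retrieval_config when constructing the retriever, which makes it straightforward to compare answers with and without reranking.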

Like the first option, you can create the knowledge base as a QueryEngineTool in LlamaIndex and combine it with other API tools. Then, you can create a FunctionCallingAgentWorker using these combined tools and initialize an AgentRunner to interact with them. By using this approach, you can chat with and take advantage of the capabilities of the integrated tools.

# Create a query tool for Bedrock Knowledge Base
kb_tool = QueryEngineTool(
    query_engine=kb_engine,
    metadata=ToolMetadata(
        name="kb_tool",
        description="""
These decision guides help users select appropriate AWS machine learning and generative AI services based on specific needs.
They cover pre-built solutions, customizable platforms, and infrastructure options for ML workflows,
while outlining how generative AI can automate processes, personalize content, augment data, reduce costs,
and enable faster experimentation in various business contexts.""",
    ),
)

# Update the agent to include all API tools and the Knowledge Base tool.
all_tools = api_tools + [kb_tool]

Now you have built the agentic RAG solution using LlamaIndex and Amazon Bedrock Knowledge Bases.
Clean up
When you finish experimenting with this solution, use the following steps to clean up the AWS resources to avoid unnecessary costs:

In the Amazon S3 console, delete the S3 bucket and data created for this solution.
In the OpenSearch Service console, delete the collection that was created for storing the embedding vectors.
In the Amazon Bedrock Knowledge Bases console, delete the knowledge base you created.
In the SageMaker console, navigate to your domain and user profile, and launch SageMaker Studio to stop or delete the JupyterLab instance.

Conclusion
This post demonstrated how to build a powerful agentic RAG application using LlamaIndex and Amazon Bedrock that goes beyond traditional question answering systems. By integrating Mistral Large 2 as the orchestrating model with external APIs (GitHub, arXiv, TechCrunch, and DuckDuckGo) and internal knowledge bases, you’ve created a versatile technology discovery and research tool.
We showed you two complementary approaches to implement the RAG framework: a programmatic implementation using LlamaIndex with Amazon OpenSearch Serverless, providing maximum flexibility for advanced use cases, and a managed solution using Amazon Bedrock Knowledge Bases that simplifies document processing and storage with minimal configuration. You can try out the solution using the following code sample.
For more relevant information, see Amazon Bedrock, Amazon Bedrock Knowledge Bases, Amazon OpenSearch Serverless, and Use a reranker model in Amazon Bedrock. Refer to Mistral AI in Amazon Bedrock to see the latest Mistral models that are available on both Amazon Bedrock and AWS Marketplace.

About the Authors
Ying Hou, PhD, is a Sr. Specialist Solution Architect for Gen AI at AWS, where she collaborates with model providers to onboard the latest and most intelligent AI models onto AWS platforms. With deep expertise in Gen AI, ASR, computer vision, NLP, and time-series forecasting models, she works closely with customers to design and build cutting-edge ML and GenAI applications. Outside of architecting innovative AI solutions, she enjoys spending quality time with her family, getting lost in novels, and exploring the UK’s national parks.
Preston Tuggle is a Sr. Specialist Solutions Architect with the Third-Party Model Provider team at AWS. He focuses on working with model providers across Amazon Bedrock and Amazon SageMaker, helping them accelerate their go-to-market strategies through technical scaling initiatives and customer engagement.

Text-to-image basics with Amazon Nova Canvas

Posted on May 30, 2025 by i-genie

AI image generation has emerged as one of the most transformative technologies in recent years, revolutionizing how you create and interact with visual content. Amazon Nova Canvas is a generative model in the suite of Amazon Nova creative models that enables you to generate realistic and creative images from plain text descriptions.
This post serves as a beginner’s guide to using Amazon Nova Canvas. We begin with the steps to get set up on Amazon Bedrock. Amazon Bedrock is a fully managed service that hosts leading foundation models (FMs) for various use cases such as text, code, and image generation; summarization; question answering; and custom use cases that involve fine-tuning and Retrieval Augmented Generation (RAG). In this post, we focus on the Amazon Nova image generation models available in AWS Regions in the US, in particular, the Amazon Nova Canvas model. We then provide an overview of the image generation process (diffusion) and dive deep into the input parameters for text-to-image generation with Amazon Nova Canvas.
Get started with image generation on Amazon Bedrock
Complete the following steps to get set up with access to Amazon Nova Canvas and the image playground:

Create an AWS account if you don’t have one already.
Open the Amazon Bedrock console as an AWS Identity and Access Management (IAM) administrator or appropriate IAM user.
Confirm and choose one of the Regions where the Amazon Nova Canvas model is available (for example, US East (N. Virginia)).
In the navigation pane, choose Model access under Bedrock configurations.

Under What is Model access, choose Modify model access or Enable specific models (if not yet activated).

Select Nova Canvas, then choose Next.

On the Review and submit page, choose Submit.

Refresh Base models. If the Amazon Nova Canvas model shows the Access Granted status, you are ready to proceed with the next steps.

In the navigation pane, choose Image / Video under Playgrounds.

Choose Select model, then choose Amazon and Nova Canvas. Then choose Apply.

You are all set up to start generating images with Amazon Nova Canvas on Amazon Bedrock. The following screenshot shows an example of our playground.

Understanding the generation process
Amazon Nova Canvas uses diffusion-based approaches to generate images:

Starting point – The process begins with random noise (a pure static image).
Iterative denoising – The model gradually removes noise in steps, guided by your prompts. The amount of noise to remove at each step is learned during training. For instance, to learn to generate an image of a cat, the model is trained on many cat images, each of which has noise added iteratively until it is pure noise. By learning how much noise was added at each step, the model effectively learns the reverse process: starting with a noisy image and iteratively subtracting noise to arrive at the image of a cat.
Text conditioning – The text prompt serves as the conditioning that guides the image generation process. The prompt is encoded as a numerical vector and matched against similar vectors in a shared text-image embedding space. These vectors then steer the denoising so that the noisy image is transformed into one that captures the input prompt.
Image conditioning – In addition to text prompts, Amazon Nova Canvas also accepts images as inputs.
Safety and fairness – To comply with safety and fairness goals, both the prompt and the generated output image go through filters. If no filter is triggered, the final image is returned.
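The relationship between adding noise and removing it can be illustrated with a toy example. The following sketch uses the common DDPM-style mixing rule, x_t = sqrt(a) * x_0 + sqrt(1 - a) * noise, as an assumption; this post does not state which exact formulation or noise schedule Amazon Nova Canvas uses. The "model" here is replaced by the true noise, so the example only shows why predicting the noise is enough to recover the image.

```python
import math
import random


def forward_noise(x0, alpha_bar, rng):
    """Mix a clean signal with Gaussian noise: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e for x, e in zip(x0, eps)]
    return xt, eps


def remove_noise(xt, eps_pred, alpha_bar):
    """Invert the mixing step given an estimate of the noise that was added."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar) for x, e in zip(xt, eps_pred)]


rng = random.Random(0)
x0 = [0.2, -0.5, 1.0, 0.0]  # stand-in for pixel values of a clean image
xt, eps = forward_noise(x0, alpha_bar=0.3, rng=rng)

# A trained model would predict eps from xt and the prompt; here we pass the true noise.
x0_hat = remove_noise(xt, eps, alpha_bar=0.3)
```

In a real diffusion model the noise estimate is imperfect, which is why denoising is spread over many small steps rather than done in one.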

Prompting fundamentals
Image generation begins with effective prompting—the art of crafting text descriptions that guide the model toward your desired output. Well-constructed prompts include specific details about subject, style, lighting, perspective, mood, and composition, and work better when structured as image captions rather than a command or conversation. For example, rather than saying “generate an image of a mountain,” a more effective prompt might be “a majestic snow-capped mountain peak at sunset with dramatic lighting and wispy clouds, photorealistic style.” Refer to Amazon Nova Canvas prompting best practices for more information about prompting.
Let’s address the following prompt elements and observe their impact on the final output image:

Subject descriptions (what or who is in the image) – In the following example, we use the prompt “a cat sitting on a chair.”

Style references (photography, oil painting, 3D render) – In the following examples, we use the prompts “A cat sitting on a chair, oil painting style” and then “A cat sitting on a chair, anime style.”

Compositional elements and technical specifications (foreground, background, perspective, lighting) – In the following examples, we use the prompts “A cat sitting on a chair, mountains in the background,” and “A cat sitting on a chair, sunlight from the right low angle shot.”

Positive and negative prompts
Positive prompts tell the model what to include. These are the elements, styles, and characteristics you want to observe in the final image. Avoid the use of negation words like “no,” “not,” or “without” in your prompt. Amazon Nova Canvas has been trained on image-caption pairs, and captions rarely describe what isn’t in an image. Therefore, the model has never learned the concept of negation. Instead, use negative prompts to specify elements to exclude from the output.
Negative prompts specify what to avoid. Common negative prompts include “blurry,” “distorted,” “low quality,” “poor anatomy,” “bad proportions,” “disfigured hands,” or “extra limbs,” which help models avoid typical generation artifacts.
In the following examples, we first use the prompt “An aerial view of an archipelago,” then we refine the prompt as “An aerial view of an archipelago. Negative Prompt: Beaches.”

The balance between positive and negative prompting creates a defined creative space for the model to work within, often resulting in more predictable and desirable outputs.
Image dimensions and aspect ratios
Amazon Nova Canvas is trained on 1:1, portrait, and landscape resolutions, with generation tasks having a maximum output resolution of 4.19 million pixels (that is, 2048×2048, 2816×1536). For editing tasks, the image should be at most 4,096 pixels on its longest side, have an aspect ratio between 1:4 and 4:1, and have a total pixel count of 4.19 million or smaller. Understanding these dimensional limits helps you avoid stretched or distorted results, particularly for specialized composition needs.
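A quick pre-flight check can catch dimension problems before you send an editing request. The following sketch encodes only the three editing limits listed above. Reading "4.19 million" as 2048 × 2048 = 4,194,304 pixels is an assumption, and check_edit_dimensions is a hypothetical helper, not part of any AWS SDK.

```python
MAX_PIXELS = 2048 * 2048  # 4,194,304; assumed reading of the "4.19 million" limit


def check_edit_dimensions(width, height, max_side=4096, max_pixels=MAX_PIXELS):
    """Return a list of constraint violations (empty if the size looks acceptable)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    problems = []
    if max(width, height) > max_side:
        problems.append("longest side exceeds 4,096 pixels")
    if not 0.25 <= width / height <= 4.0:
        problems.append("aspect ratio outside 1:4 to 4:1")
    if width * height > max_pixels:
        problems.append("total pixel count exceeds the limit")
    return problems


print(check_edit_dimensions(1280, 720))
print(check_edit_dimensions(4096, 512))
```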
Classifier-free guidance scale
The classifier-free guidance (CFG) scale controls how strictly the model follows your prompt:

Low values (1.1–3) – More creative freedom for the AI, potentially more aesthetic, but low contrast and less prompt-adherent results
Medium values (4–7) – Balanced approach, typically recommended for most generations
High values (8–10) – Strict prompt adherence, which can produce more precise results but sometimes at the cost of natural aesthetics and increased color saturation

In the following examples, we use the prompt “Cherry blossoms, bonsai, Japanese style landscape, high resolution, 8k, lush greens in the background.”
The first image with CFG 2 captures some elements of cherry blossoms and bonsai. The second image with CFG 8 adheres more to the prompt with a potted bonsai, more pronounced cherry blossom flowers, and lush greens in the background.

Think of the CFG scale as adjusting how literally the model follows your instructions versus how much artistic interpretation it applies.
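For intuition, classifier-free guidance is commonly described as blending two noise predictions: one made without the prompt and one made with it. The following sketch shows that standard formulation on toy numbers. How Amazon Nova Canvas applies its cfgScale internally is not described in this post, so treat the code as an illustration of the general technique only.

```python
def apply_cfg(noise_uncond, noise_cond, cfg_scale):
    """Standard CFG blend: start from the unconditional prediction and move
    toward (and past) the prompt-conditioned one by a factor of cfg_scale."""
    return [u + cfg_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


uncond = [0.0, 1.0]  # toy prediction without the prompt
cond = [1.0, 3.0]    # toy prediction with the prompt

same_as_cond = apply_cfg(uncond, cond, 1.0)  # scale 1 reproduces the conditioned prediction
pushed = apply_cfg(uncond, cond, 8.0)        # higher scale exaggerates the prompt's influence
print(same_as_cond, pushed)
```

The larger the scale, the further the result moves away from the unconditional prediction, which matches the stricter prompt adherence and stronger saturation described for high values.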
Seed values and reproducibility
Every image generation begins with a randomization seed—essentially a starting number that determines initial conditions:

Seeds are typically represented as long integers (for example, 1234567890)
Using the same seed, prompt, and parameters reproduces identical images every time
Saving seeds allows you to revisit successful generations or create variations on promising results
Seed values have no inherent quality; they are simply different starting points

Reproducibility through seed values is essential for professional workflows, allowing refined iterations on the prompt or other input parameters to clearly see their effect, rather than completely random generations. The following images are generated using two slightly different prompts (“A portrait of a girl smiling” vs. “A portrait of a girl laughing”), while holding the seed value and all other parameters constant.
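The effect of a seed can be demonstrated with any pseudorandom generator. In the following sketch, Python's random module stands in for the model's noise sampler (an assumption for illustration): the same seed always produces the same starting noise, and a different seed produces a different one.

```python
import random


def initial_noise(seed, size=4):
    """Draw a reproducible vector of Gaussian noise from a seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


a = initial_noise(1234567890)
b = initial_noise(1234567890)  # same seed, same starting point
c = initial_noise(42)          # different seed, different starting point
print(a == b, a == c)
```

This is why holding the seed constant isolates the effect of a prompt or parameter change: the starting noise is identical, so any difference in the output comes from what you changed.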

All preceding images in this post have been generated using the text-to-image (TEXT_IMAGE) task type of Amazon Nova Canvas, available through the Amazon Bedrock InvokeModel API. The following is the API request and response structure for image generation:

#Request Structure
{
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {
        "text": string, #Positive Prompt
        "negativeText": string #Negative Prompt
    },
    "imageGenerationConfig": {
        "width": int, #Image Resolution Width
        "height": int, #Image Resolution Height
        "quality": "standard" | "premium", #Image Quality
        "cfgScale": float, #Classifier Free Guidance Scale
        "seed": int, #Seed value
        "numberOfImages": int #Number of images to be generated (max 5)
    }
}
#Response Structure
{
    "images": string[], #list of Base64 encoded images
    "error": string
}
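The request structure above maps naturally onto a small builder function. The following sketch assembles the JSON body, including an optional negative prompt, and checks the two constraints stated in the structure (the quality values and the maximum of five images). The function name and the default values for width, height, cfgScale, and seed are assumptions chosen for illustration.

```python
import json


def build_text_image_request(prompt, negative_prompt=None, width=1024, height=1024,
                             quality="standard", cfg_scale=7.0, seed=0, number_of_images=1):
    """Return a JSON string for a TEXT_IMAGE request body."""
    if quality not in ("standard", "premium"):
        raise ValueError("quality must be 'standard' or 'premium'")
    if not 1 <= number_of_images <= 5:
        raise ValueError("number_of_images must be between 1 and 5")
    params = {"text": prompt}
    if negative_prompt:
        params["negativeText"] = negative_prompt  # omit the key when there is nothing to exclude
    return json.dumps({
        "taskType": "TEXT_IMAGE",
        "textToImageParams": params,
        "imageGenerationConfig": {
            "width": width,
            "height": height,
            "quality": quality,
            "cfgScale": cfg_scale,
            "seed": seed,
            "numberOfImages": number_of_images,
        },
    })


body = build_text_image_request("An aerial view of an archipelago", negative_prompt="beaches")
print(body)
```

The returned string can be passed as the body argument of an InvokeModel call.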

Code example
This solution can also be tested locally with a Python script or a Jupyter notebook. For this post, we use an Amazon SageMaker AI notebook using Python (v3.12). For more information, see Run example Amazon Bedrock API requests using an Amazon SageMaker AI notebook. For instructions to set up your SageMaker notebook instance, refer to Create an Amazon SageMaker notebook instance. Make sure the instance is set up in the same Region where Amazon Nova Canvas access is enabled. For this post, we create a Region variable to match the Region where Amazon Nova Canvas is enabled (us-east-1). You must modify this variable if you’ve enabled the model in a different Region. The following code demonstrates text-to-image generation by invoking the Amazon Nova Canvas v1.0 model using Amazon Bedrock. To understand the API request and response structure for different types of generations, parameters, and more code examples, refer to Generating images with Amazon Nova.

import base64 #For encoding/decoding base64 data
import io #For handling byte streams
import json #For JSON processing
import boto3 #AWS SDK for Python
from PIL import Image #Python Imaging Library for image processing
from botocore.config import Config #For AWS client configuration

#Create a variable to fix the region to where Nova Canvas is enabled
region = "us-east-1"

#Setup an Amazon Bedrock runtime client
client = boto3.client(service_name="bedrock-runtime", region_name=region, config=Config(read_timeout=300))

#Set the content type and accept headers for the API call
accept = "application/json"
content_type = "application/json"

#Define the prompt for image generation
prompt = """A cat sitting on a chair, mountains in the background, low angle shot."""

#Create the request body with generation parameters
api_request = json.dumps({
    "taskType": "TEXT_IMAGE", #Specify text-to-image generation
    "textToImageParams": {
        "text": prompt
    },
    "imageGenerationConfig": {
        "numberOfImages": 1, #Generate one image
        "height": 720, #Image height in pixels
        "width": 1280, #Image width in pixels
        "cfgScale": 7.0, #CFG Scale
        "seed": 0 #Seed number for generation
    }
})
#Call the Bedrock model to generate the image
response = client.invoke_model(body=api_request, modelId="amazon.nova-canvas-v1:0", accept=accept,
                               contentType=content_type)

#Parse the JSON response
response_json = json.loads(response.get("body").read())

#Extract the base64-encoded image from the response
base64_image = response_json.get("images")[0]
#Convert the base64 string to ASCII bytes
base64_bytes = base64_image.encode("ascii")
#Decode the base64 bytes to get the actual image bytes
image_data = base64.b64decode(base64_bytes)

#Convert bytes to an image object
output_image = Image.open(io.BytesIO(image_data))
#Display the image
output_image.show()
#Save the image to current working directory
output_image.save("output_image.png")
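The preceding code reads only the first image and does not look at the error field of the response. When you request several images, a small helper such as the following sketch can decode all of them and surface an error message if one is present. Treating any non-empty error value as a failure is a design choice of this sketch, not documented API behavior, and decode_images is a hypothetical name.

```python
import base64


def decode_images(response_json):
    """Return the raw bytes of every image in a parsed response body."""
    error = response_json.get("error")
    if error:
        raise RuntimeError(f"Image generation reported an error: {error}")
    return [base64.b64decode(image) for image in response_json.get("images", [])]


# Stand-in for a parsed response body; the payload is not a real PNG.
fake_response = {"images": [base64.b64encode(b"not-a-real-png").decode("ascii")]}
decoded = decode_images(fake_response)
print(len(decoded))
```

Each element of the returned list can be opened with Image.open(io.BytesIO(...)) in the same way as the single image above.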

Clean up
When you have finished testing this solution, clean up your resources to prevent AWS charges from being incurred:

Back up the Jupyter notebooks in the SageMaker notebook instance.
Shut down and delete the SageMaker notebook instance.

Cost considerations
Consider the following costs from the solution deployed on AWS:

You will incur charges for generative AI inference on Amazon Bedrock. For more details, refer to Amazon Bedrock pricing.
You will incur charges for your SageMaker notebook instance. For more details, refer to Amazon SageMaker pricing.

Conclusion
This post introduced you to AI image generation, and then provided an overview of accessing image models available on Amazon Bedrock. We then walked through the diffusion process and key parameters with examples using Amazon Nova Canvas. The code template and examples demonstrated in this post aim to get you familiar with the basics of Amazon Nova Canvas and get started with your AI image generation use cases on Amazon Bedrock.
For more details on text-to-image generation and other capabilities of Amazon Nova Canvas, see Generating images with Amazon Nova. Give it a try and let us know your feedback in the comments.

About the Author
Arjun Singh is a Sr. Data Scientist at Amazon, experienced in artificial intelligence, machine learning, and business intelligence. He is a visual person and deeply curious about generative AI technologies in content creation. He collaborates with customers to build ML and AI solutions to achieve their desired outcomes. He graduated with a Master’s in Information Systems from the University of Cincinnati. Outside of work, he enjoys playing tennis, working out, and learning new skills.

This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web …

Posted on May 29, 2025 by i-genie

Web navigation focuses on teaching machines how to interact with websites to perform tasks such as searching for information, shopping, or booking services. Building a capable web navigation agent is a complex task because it requires understanding the structure of websites, interpreting user goals, and making a series of decisions across multiple steps. These tasks are further complicated by the need for agents to adapt in dynamic web environments, where content can change frequently and where multimodal information, such as text and images, must be understood together.

A key problem in web navigation is the absence of reliable and detailed reward models that can guide agents in real-time. Existing methods primarily rely on multimodal large language models (MLLMs) like GPT-4o and GPT-4o-mini as evaluators, which are expensive, slow, and often inaccurate, especially when handling long sequences of actions in multi-step tasks. These models use prompting-based evaluation or binary success/failure feedback but fail to provide step-level guidance, often leading to errors such as repeated actions or missing critical steps like clicking specific buttons or filling form fields. This limitation reduces the practicality of deploying web agents in real-world scenarios, where efficiency, accuracy, and cost-effectiveness are crucial.

The research team from Yonsei University and Carnegie Mellon University introduced WEB-SHEPHERD, a process reward model specifically designed for web navigation tasks. WEB-SHEPHERD is the first model to evaluate web navigation agents at the step level, using structured checklists to guide assessments. The researchers also developed the WEBPRM COLLECTION, a dataset of 40,000 step-level annotated web navigation tasks, and the WEBREWARDBENCH benchmark for evaluating PRMs. These resources were designed to enable WEB-SHEPHERD to provide detailed feedback by breaking down complex tasks into smaller, measurable subgoals.

WEB-SHEPHERD works by generating a checklist for each task based on the user’s instruction, such as “Search for product” or “Click on product page,” and evaluates the agent’s progress against these subgoals. The model uses next-token prediction to generate feedback and assigns rewards based on checklist completion. This process enables WEB-SHEPHERD to assess the correctness of each step with fine-grained judgment. The model estimates the reward for each step by combining the probabilities of “Yes,” “No,” and “In Progress” tokens and averages these across the checklist. This detailed scoring system enables agents to receive targeted feedback on their progress, enhancing their ability to navigate complex websites.
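The scoring step described above can be sketched in a few lines. The article does not give the exact way the three token probabilities are combined, so the weighting below (full credit for "Yes", half credit for "In Progress", none for "No") is an assumption made for illustration, as are the function name and the sample probabilities.

```python
def checklist_reward(item_probs, in_progress_weight=0.5):
    """Average a per-item score over the checklist.

    Each item is a dict of token probabilities for "Yes", "No", and "In Progress".
    The 0.5 weight for "In Progress" is an assumed value, not taken from the article.
    """
    if not item_probs:
        raise ValueError("checklist must contain at least one item")
    scores = [p["Yes"] + in_progress_weight * p["In Progress"] for p in item_probs]
    return sum(scores) / len(scores)


# Hypothetical probabilities for a two-item checklist at one step.
probs = [
    {"Yes": 0.9, "No": 0.05, "In Progress": 0.05},  # "Search for product": done
    {"Yes": 0.1, "No": 0.5, "In Progress": 0.4},    # "Click on product page": not yet
]
reward = checklist_reward(probs)
print(round(reward, 4))
```

Using probabilities rather than hard labels gives a continuous reward, so two candidate actions that both leave an item "In Progress" can still be ranked against each other.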

The researchers demonstrated that WEB-SHEPHERD significantly outperforms existing models. On the WEBREWARDBENCH benchmark, WEB-SHEPHERD achieved a Mean Reciprocal Rank (MRR) score of 87.6% and a trajectory accuracy of 55% in the text-only setting, compared to GPT-4o-mini’s 47.5% MRR and 0% trajectory accuracy without checklists. When tested in WebArena-lite using GPT-4o-mini as the policy model, WEB-SHEPHERD achieved a 34.55% success rate, which is 10.9 points higher than using GPT-4o-mini as the evaluator, while also being ten times more cost-efficient. In ablation studies, the researchers observed that WEB-SHEPHERD’s performance dropped significantly when checklists or feedback were removed, indicating that both are important for accurate reward assignment. They also showed that multimodal input, surprisingly, did not always improve performance and sometimes introduced noise.
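For readers unfamiliar with the metric, Mean Reciprocal Rank averages 1/rank of the correct item over all test cases, so a score near 100% means the reward model almost always ranks the correct action first. The following is a minimal sketch of the standard definition; the sample rankings are invented and are not data from the paper.

```python
def mean_reciprocal_rank(ranked_lists, correct_items):
    """Average of 1/rank of the correct item; a missing item contributes 0."""
    total = 0.0
    for ranking, correct in zip(ranked_lists, correct_items):
        if correct in ranking:
            total += 1.0 / (ranking.index(correct) + 1)
    return total / len(ranked_lists)


# Three hypothetical steps, each with candidate actions ranked by a reward model.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
correct = ["a", "a", "a"]
mrr = mean_reciprocal_rank(rankings, correct)  # (1 + 1/2 + 1/3) / 3
print(round(mrr, 4))
```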

This research highlights the critical role of detailed process-level rewards in building reliable web agents. The team’s work addresses the core challenge of web navigation—evaluating complex, multi-step actions—and offers a solution that is both scalable and cost-effective. With WEB-SHEPHERD, agents can now receive accurate feedback during navigation, enabling them to make better decisions and complete tasks more effectively.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
The post This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost Efficiency appeared first on MarkTechPost.

National University of Singapore Researchers Introduce Dimple: A Discr …

Posted on May 29, 2025 by i-genie

In recent months, there has been growing interest in applying diffusion models—originally designed for continuous data, such as images—to natural language processing tasks. This has led to the development of Discrete Diffusion Language Models (DLMs), which treat text generation as a denoising process. Unlike traditional autoregressive models, DLMs enable parallel decoding and provide better control over structure, offering advantages such as flexible initialization of entire sequences, explicit control over output format, and improved infilling through bidirectional attention. Furthermore, their non-sequential nature opens the door to faster generation. Despite these benefits, most current multimodal large language models (MLLMs)—such as LLaMA, Qwen-VL, and InternVL—still rely solely on autoregressive methods.

Work in diffusion-based language models has explored both continuous and discrete diffusion spaces. Continuous approaches, such as DiffuSeq and SED, use embedding or relaxed categorical spaces for smoother generation. In contrast, discrete models like SDDM and RDM tailor the diffusion process to linguistic structures. Training techniques vary, but commonly use masked language modeling losses or entropy-based score matching. Some hybrid models, such as AR-Diffusion and SSD-LM, combine autoregressive and diffusion strategies to leverage the strengths of both approaches. Meanwhile, open-source MLLMs such as LLaVA and InternVL have advanced through visual instruction tuning and joint pretraining, yet still follow an autoregressive generation scheme.

Researchers at the National University of Singapore present Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM), which integrates a vision encoder with a discrete diffusion-based language model. To overcome the instability and performance issues of purely diffusion-based training, they introduce a two-phase training method, Autoregressive-then-Diffusion, which combines initial autoregressive alignment with subsequent diffusion-based masked language modeling. Dimple-7B surpasses LLaVA-NEXT by 3.9% on benchmarks. The team also introduces Confident Decoding for dynamic token generation and explores Structure Priors for precise control over output. These innovations significantly improve inference efficiency, generation flexibility, and structural controllability without sacrificing performance.

Dimple is a Discrete Diffusion Multimodal LLM that integrates a vision encoder with a diffusion-based language model. To address inefficiencies in diffusion training, such as sparse supervision and limited generation coverage, the model is trained in two phases: first with autoregressive training using a causal attention mask for vision-language alignment, then with diffusion training to restore generation capabilities. During inference, a dynamic “Confident Decoding” strategy adapts token updates based on prediction confidence. Despite using significantly fewer training samples, Dimple exhibits competitive performance on multiple benchmarks, outperforming similar-scale autoregressive models, although it trails behind larger-scale state-of-the-art systems.
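
To make the idea of confidence-based decoding concrete, here is a minimal sketch of a single decoding step. It is an illustration under stated assumptions, not Dimple's implementation: the threshold value, the `MASK` marker, and the fallback of unmasking the single most confident position when nothing clears the threshold are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

MASK = None  # hypothetical marker for a position that is still masked


def confident_decoding_step(
    tokens: List[Optional[str]],
    predictions: List[Tuple[str, float]],  # (predicted token, confidence) per position
    threshold: float = 0.9,
) -> List[Optional[str]]:
    """Unmask every masked position whose confidence meets the threshold.

    If no masked position qualifies, unmask only the most confident one so
    that each step still makes progress.
    """
    masked = [i for i, t in enumerate(tokens) if t is MASK]
    if not masked:
        return list(tokens)
    chosen = [i for i in masked if predictions[i][1] >= threshold]
    if not chosen:
        chosen = [max(masked, key=lambda i: predictions[i][1])]
    updated = list(tokens)
    for i in chosen:
        updated[i] = predictions[i][0]
    return updated
```

Because the number of tokens committed per step varies with the model's confidence, easy spans can be filled in a few steps while harder spans get more iterations.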

The experiments evaluate Dimple, a DMLLM, against autoregressive models on instruction-following tasks. Dimple, trained with a hybrid strategy that combines autoregressive and diffusion tuning, exhibits strong performance, surpassing models with similar training data on most benchmarks. Although it lags behind models trained on much larger datasets, Dimple benefits from a stronger base language model. Ablation studies reveal that combining autoregressive and diffusion tuning mitigates issues like length bias and improves consistency. Prefilling further boosts inference speed significantly, with only minor performance drops, making the model both efficient and competitive in multimodal understanding tasks.

In conclusion, Dimple, the first DMLLM, is designed to overcome the limitations of purely discrete diffusion training, such as instability and length bias. Dimple employs a hybrid training approach that starts with autoregressive learning, followed by diffusion tuning, yielding the Dimple-7B model, which outperforms LLaVA-NEXT by 3.9%. A decoding strategy, confident decoding, significantly reduces inference steps, while prefilling improves speed with minimal performance trade-offs. Dimple also enables structured and controllable outputs through structure priors, offering fine-grained control over format and length, capabilities that autoregressive models struggle to provide.
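
One way to picture a structure prior: because a diffusion model starts from a full-length sequence, fixed tokens can be placed up front and only the remaining slots left masked. The sketch below builds such a starting sequence for a JSON-like skeleton. The mask token string, the function name, and the fixed slot count per field are hypothetical choices for illustration, not Dimple's actual interface.

```python
from typing import List

MASK = "<mask>"  # hypothetical mask token string


def build_structure_prior(fields: List[str], slots_per_field: int) -> List[str]:
    """Lay out a JSON-like skeleton with fixed keys and masked value slots.

    Keys and punctuation are fixed in advance; only the value slots are left
    for the model to fill in, which pins down both format and length.
    """
    sequence = ["{"]
    for n, field in enumerate(fields):
        sequence += [f'"{field}"', ":"] + [MASK] * slots_per_field
        if n < len(fields) - 1:
            sequence.append(",")
    sequence.append("}")
    return sequence
```

An autoregressive model generating left to right has no comparable way to commit to the closing brace or the later keys before the earlier values are written.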

Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project.
The post National University of Singapore Researchers Introduce Dimple: A Discrete Diffusion Multimodal Language Model for Efficient and Controllable Text Generation appeared first on MarkTechPost.

Incorrect Answers Improve Math Reasoning? Reinforcement Learning with …

Posted on May 29, 2025 by i-genie

In natural language processing (NLP), RL methods, such as reinforcement learning from human feedback (RLHF), have been utilized to enhance model outputs by optimizing responses based on feedback signals. A specific variant, reinforcement learning with verifiable rewards (RLVR), extends this approach by utilizing automatic signals, such as mathematical correctness or syntactic features, as feedback, enabling the large-scale tuning of language models. RLVR is especially interesting because it promises to enhance models’ reasoning abilities without needing extensive human supervision. This intersection of automated feedback and reasoning tasks forms an exciting area of research, where developers aim to uncover how models can learn to reason mathematically, logically, or structurally using limited supervision.

A persistent challenge in machine learning is building models that can reason effectively under minimal or imperfect supervision. In tasks like mathematical problem-solving, where the correct answer might not be immediately available, researchers grapple with how to guide a model’s learning. Models often learn from ground-truth data, but it’s impractical to label vast datasets with perfect accuracy, particularly in reasoning tasks that require understanding complex structures like proofs or programmatic steps. Consequently, there’s an open question about whether models can learn to reason if they are exposed to noisy, misleading, or even incorrect signals during training. This issue is significant because models that overly rely on perfect feedback may not generalize well when such supervision is unavailable, thereby limiting their utility in real-world scenarios.

Several existing techniques aim to enhance models’ reasoning abilities through reinforcement learning (RL), with RLVR being a key focus. Traditionally, RLVR has used “ground truth” labels, correct answers verified by humans or automated tools, to provide rewards during training. Some approaches have relaxed this requirement by using majority vote labels or simple format-based heuristics, such as rewarding answers that follow a specific output style. Other methods have experimented with random rewards, offering positive signals without considering the correctness of the answer. These methods aim to explore whether models can learn even with minimal guidance, but they mostly concentrate on specific models, such as Qwen, raising concerns about generalizability across different architectures.
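
As a rough illustration of a majority-vote label, the sketch below picks the most frequent answer among several sampled responses to the same question and treats it as a pseudo-label. The function name is hypothetical, and how ties or unparseable answers are handled in practice is an assumption here.

```python
from collections import Counter
from typing import List, Optional


def majority_vote_label(sampled_answers: List[Optional[str]]) -> Optional[str]:
    """Return the most common non-empty answer among sampled responses.

    Returns None when no sample produced a usable answer. On ties, the
    answer that appeared first among the tied ones is returned.
    """
    valid = [a for a in sampled_answers if a]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]
```

The pseudo-label can then stand in for a ground-truth answer when computing rewards, even though it may itself be wrong.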

Researchers from the University of Washington, the Allen Institute for AI, and UC Berkeley investigate this question by testing various reward signals on Qwen2.5-Math, a family of large language models fine-tuned for mathematical reasoning. They tested ground-truth rewards, majority-vote rewards, format rewards based on boxed expressions, random rewards, and incorrect rewards. Remarkably, they observed that even completely spurious signals, like random rewards and rewards for wrong answers, could lead to substantial performance gains in Qwen models. For example, training Qwen2.5-Math-7B on MATH-500 with ground-truth rewards yielded a 28.8% improvement, while using incorrect labels resulted in a 24.6% gain. Random rewards still produced a 21.4% boost, and format rewards led to a 16.4% improvement. Majority-vote rewards provided a 26.5% accuracy gain. These improvements were not limited to a single model; Qwen2.5-Math-1.5B also showed strong gains: format rewards boosted accuracy by 17.6%, and incorrect labels by 24.4%. However, the same reward strategies failed to deliver similar benefits on other model families, such as Llama3 and OLMo2, which showed minimal or negative changes when trained with spurious rewards. For instance, Llama3.1-8B saw performance drops of up to 8.5% under certain spurious signals, highlighting the model-specific nature of the observed improvements.
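
The reward variants compared above can be sketched as simple scoring functions. This is an illustration only, not the authors' code: the regular expression handles only un-nested `\boxed{...}` expressions, exact string matching stands in for real answer checking, and the 0.5 probability for the random reward is an assumption.

```python
import random
import re
from typing import Optional

# Matches \boxed{...} with no nested braces (a simplification).
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(response: str) -> Optional[str]:
    """Return the contents of the last boxed expression, if any."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None


def ground_truth_reward(response: str, answer: str) -> float:
    # 1.0 only when the boxed answer equals the verified correct answer.
    return 1.0 if extract_boxed(response) == answer else 0.0


def incorrect_reward(response: str, wrong_answer: str) -> float:
    # Rewards agreement with a label that is known to be wrong.
    return 1.0 if extract_boxed(response) == wrong_answer else 0.0


def format_reward(response: str) -> float:
    # Rewards any boxed expression, regardless of correctness.
    return 1.0 if extract_boxed(response) is not None else 0.0


def random_reward(rng: random.Random, probability: float = 0.5) -> float:
    # Ignores the response entirely.
    return 1.0 if rng.random() < probability else 0.0
```

Only the first function carries information about correctness; the other three are the spurious signals whose effect the study measures.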

The research team’s approach involved using RLVR training to fine-tune models with these varied reward signals, replacing the need for ground-truth supervision with heuristic or randomized feedback. They found that Qwen models, even without access to correct answers, could still learn to produce high-quality reasoning outputs. A key insight was that Qwen models tended to exhibit a distinct behavior called “code reasoning”, generating math solutions structured like code, particularly in Python-like formats, regardless of whether the reward signal was meaningful. This code reasoning tendency became more frequent over training, rising from 66.7% to over 90% in Qwen2.5-Math-7B when trained with spurious rewards. Answers that included code reasoning showed higher accuracy rates, often around 64%, compared to just 29% for answers without such reasoning patterns. These patterns emerged consistently, suggesting that spurious rewards may unlock latent capabilities learned during pretraining rather than introducing new reasoning skills.
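
The two statistics quoted above (how often code reasoning appears, and accuracy with and without it) can be computed along the following lines. The detection rule used here, checking whether a response contains the string "python", is an assumed heuristic for illustration, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def code_reasoning_stats(samples: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Summarize (response_text, is_correct) pairs.

    A response counts as code reasoning if it contains the string 'python'
    (an assumed heuristic for this sketch).
    """
    with_code = [ok for text, ok in samples if "python" in text.lower()]
    without_code = [ok for text, ok in samples if "python" not in text.lower()]

    def rate(flags: List[bool]) -> float:
        # Fraction of correct answers; 0.0 for an empty group.
        return sum(flags) / len(flags) if flags else 0.0

    return {
        "code_frequency": len(with_code) / len(samples) if samples else 0.0,
        "accuracy_with_code": rate(with_code),
        "accuracy_without_code": rate(without_code),
    }
```

Tracking `code_frequency` across training checkpoints is what would reveal a rise like the one reported, while the two accuracy figures show whether the pattern correlates with correct answers.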

Performance data underscored the surprising robustness of Qwen models. Gains from random rewards (21.4% on MATH-500) and incorrect labels (24.6%) nearly matched the ground-truth reward gain of 28.8%. Similar trends appeared across tasks, such as AMC, where format, wrong, and random rewards produced around an 18% improvement, only slightly lower than the 25% improvement from ground-truth or majority-vote rewards. Even on AIME2024, spurious rewards like format (+13.0%), incorrect (+8.7%), and random (+6.3%) led to meaningful gains, though the advantage of ground-truth labels (+12.8%) remained evident, particularly for AIME2025 questions created after model pretraining cutoffs.

Key takeaways from the research include:

Qwen2.5-Math-7B gained 28.8% accuracy on MATH-500 with ground-truth rewards, but also 24.6% with incorrect rewards, 21.4% with random rewards, 16.4% with format rewards, and 26.5% with majority-vote rewards.

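Code reasoning patterns emerged in Qwen models, increasing in frequency from 66.7% to 90%+ under RLVR with spurious rewards; answers that included code reasoning reached about 64% accuracy, compared with 29% for answers without it.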

Non-Qwen models, such as Llama3 and OLMo2, did not show similar improvements, with Llama3.1-8B experiencing up to 8.5% performance drops on spurious rewards.

Gains from spurious signals appeared within 50 training steps in many cases, suggesting rapid elicitation of reasoning abilities.

The research warns that RLVR studies should avoid generalizing results based on Qwen models alone, as spurious reward effectiveness is not universal.

In conclusion, these findings suggest that while Qwen models can leverage spurious signals to improve performance, the same is not true for other model families. Non-Qwen models, such as Llama3 and OLMo2, showed flat or negative performance changes when trained with spurious signals. The research emphasizes the importance of validating RLVR methods on diverse models rather than relying solely on Qwen-centric results, as many recent papers have done.

Check out the Paper, Official Release and GitHub Page. All credit for this research goes to the researchers of this project.
The post Incorrect Answers Improve Math Reasoning? Reinforcement Learning with Verifiable Rewards (RLVR) Surprises with Qwen2.5-Math appeared first on MarkTechPost.

Part 3: Building an AI-powered assistant for investment research with …

Posted on May 29, 2025 by i-genie

In the financial services industry, analysts need to switch between structured data (such as time-series pricing information), unstructured text (such as SEC filings and analyst reports), and audio/visual content (earnings calls and presentations). Each format requires different analytical approaches and specialized tools, creating workflow inefficiencies. Add on top of this the intense time pressure resulting from rapidly evolving industry conditions and narrow decision windows. Delayed analysis can mean missed opportunities or failure to identify emerging risks, with potentially significant financial consequences.
AI assistants boost productivity by automating routine data collection and processing tasks, surfacing relevant insights, and allowing analysts to focus on higher-value activities. However, a single AI agent struggles to handle the full range of specialized tasks in complex, multistep investment research workflows. This is where multi-agent collaboration helps. By creating specialized AI subagents that each excel at specific tasks and coordinating them through an AI supervisor agent, we can address the full complexity of investment research workflows. A supervisor agent can intelligently decompose complex queries, delegate specialized tasks to subagents, and synthesize their outputs into comprehensive answers, similar to how a research team functions in the real world. The benefits of this approach are substantial: distributed problem-solving and specialization, improved accuracy through specialized expertise, enhanced scalability by adding new agent capabilities without rebuilding the entire system, and greater transparency because the reasoning process of each specialized agent can be tracked and verified.
Amazon Bedrock Agents uses the reasoning of foundation models (FMs), APIs, and data to break down user requests, gather relevant information, and efficiently complete tasks. With the multi-agent collaboration capability of Amazon Bedrock, you can build, deploy, and manage multiple AI agents working together on complex multistep tasks that require specialized skills. Amazon Bedrock Data Automation (BDA) enables the generation of useful insights from unstructured multimodal content such as documents, images, audio, and video for your AI-powered applications, and it can be used as a parser when setting up a knowledge base for Retrieval Augmented Generation (RAG) workflows.
In this post, we walk through how to build a multi-agent investment research assistant using the multi-agent collaboration capability of Amazon Bedrock. Our solution demonstrates how a team of specialized AI agents can work together to analyze financial news, evaluate stock performance, optimize portfolio allocations, and deliver comprehensive investment insights—all orchestrated through a unified, natural language interface. We use BDA as a parser when setting up a knowledge base for RAG workflows with multi-modal data. Amazon Nova understanding models are used as the large language models (LLMs) associated with the supervisor agents and subagents.
Check the video that demonstrates the solution.

Results from these agents are demonstrative, not to be taken as financial advice.
Solution overview
The multi-agent investment research assistant is composed of a supervisor agent and three subagents (quantitative analysis agent, news agent, and smart summarizer agent) working together in a coordinated framework, as shown in the following diagram.

The supervisor agent uses the instructions you provide to understand the structure and role of each collaborator agent. The supervisor agent orchestrates the overall investment research process by breaking down user prompts, delegating subtasks to specialized subagents, and consolidating their outputs to generate the final response. Each subagent is configured to carry out specialized tasks as follows:

Quantitative analysis agent

Functions – Powered by stock-query and portfolio-optimization action groups.
Responsibilities – Query and analyze historical stock data and build optimized portfolio allocations based on user inputs such as stock tickers or investment amount.

News agent

Functions – Powered by the web_search action group and the financial_analysis_KB knowledge base.
Responsibilities – Search and retrieve relevant financial data such as earnings reports and filings from the knowledge base for context. If information isn’t present in the knowledge base, construct a web query. Handles queries related to analyzing company financials, performance drivers, and management commentary.

Smart summarizer agent

Functions – Uses LLM capabilities with specific prompt instructions, no custom action group.
Responsibilities – Takes in output from other subagents, such as recent news and financial data, and synthesizes the information into structured investment insights. Summarizes long texts or documents such as earnings reports into concise insights.
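
The news agent's retrieval order described above (knowledge base first, web search only as a fallback) can be sketched as follows. In the actual solution this behavior is driven by the agent's instructions and the FM's reasoning, not by hand-written control flow; the function and parameter names here are hypothetical.

```python
from typing import Callable, Dict, List, Union


def retrieve_context(
    query: str,
    search_knowledge_base: Callable[[str], List[str]],
    search_web: Callable[[str], List[str]],
) -> Dict[str, Union[str, List[str]]]:
    """Check the knowledge base first; fall back to a web query if it is empty."""
    passages = search_knowledge_base(query)
    if passages:
        return {"source": "knowledge_base", "passages": passages}
    return {"source": "web_search", "passages": search_web(query)}
```

Recording the source alongside the passages makes it easy to tell later whether an answer was grounded in the curated documents or in web results.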

Multi-agent collaboration works as follows:

User prompts the multi-agent investment research assistant with a high-level research query.
Supervisor agent breaks the query into subtasks and orchestrates and invokes the relevant subagents in parallel or sequentially as needed:

News agent to retrieve financial reports or data
Quantitative analysis agent for stock data analysis
Smart summarizer agent to summarize lengthy texts

Supervisor agent consolidates the subagent outputs along with the original query context.
Supervisor agent passes the consolidated information to the LLM to generate the final research insights.
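
The steps above can be sketched in plain Python to show the data flow. This is a conceptual illustration only: in Amazon Bedrock the supervisor's FM decides the decomposition and routing, whereas here the plan is passed in explicitly, subagents run sequentially, and every name is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Subagent = Callable[[str], str]


def run_supervisor(
    query: str,
    plan: List[Tuple[str, str]],  # (subagent name, subtask) pairs
    subagents: Dict[str, Subagent],
    synthesize: Callable[[str, Dict[str, str]], str],
) -> str:
    """Delegate each subtask, collect the outputs, and synthesize a final answer."""
    outputs: Dict[str, str] = {}
    for name, subtask in plan:
        # Each subagent only sees its own subtask, not the full query.
        outputs[name] = subagents[name](subtask)
    # The original query is passed along so the final answer stays on topic.
    return synthesize(query, outputs)
```

Keeping the per-agent outputs in a dictionary keyed by agent name mirrors the transparency benefit mentioned earlier: each contribution can be inspected separately before synthesis.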

This architecture uses the strengths of the different specialized agents to comprehensively address the various aspects of investment research, and the supervisor agent provides the intelligent orchestration.
Technical architecture
The following diagram is the technical architecture.

Some of the key components of the technical architecture, with the associated code, are:

Amazon Bedrock Data Automation (BDA) for processing data – The following invokes a BDA job asynchronously using the invoke_data_automation_async() method from the bda-runtime client in boto3:

# bda_runtime_client is a boto3 client for the BDA runtime, created earlier.
bda_response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        "s3Uri": s3_uri_input
    },
    outputConfiguration={
        "s3Uri": s3_uri_output
    },
)
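
Because the job runs asynchronously, the caller typically has to poll for completion before reading the output from Amazon S3. The helper below is a minimal sketch of such a loop. It assumes the runtime client exposes `get_data_automation_status(invocationArn=...)`, that the invoke response contains an `invocationArn`, and that `Created` and `InProgress` are the non-terminal states; verify all three against the current boto3 documentation. The helper name and its defaults are hypothetical.

```python
import time


def wait_for_bda_job(
    client,
    invocation_arn: str,
    poll_seconds: float = 5.0,
    max_polls: int = 60,
) -> dict:
    """Poll until the job leaves the assumed non-terminal states."""
    for _ in range(max_polls):
        status = client.get_data_automation_status(invocationArn=invocation_arn)
        if status.get("status") not in ("Created", "InProgress"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(
        f"BDA job {invocation_arn} did not finish after {max_polls} polls"
    )
```

A call might look like `wait_for_bda_job(bda_runtime_client, bda_response["invocationArn"])`, with the returned status checked for success before the output location is read.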

Defining and building subagents with tools – The following creates a subagent using Agent.create. It defines two Lambda tools—one for fetching historical stock data and another for running portfolio optimization—then registers them with the subagent along with its role, goal, instructions, and the underlying FM.

quantitative_analysis_agent = Agent.create(
    name="quantitative_analysis_agent",
    role="Financial Data Collector",
    goal="Retrieve real-time and historic stock prices as well as optimizing a portfolio given tickers.",
    instructions="""Specialist in real-time financial data extraction and portfolio optimization.
…""",
    tools=[
        # Stock Data Lookup Tool
        {
            "code": stock_data_lookup_arn,
            "definition": {
                "name": "stock_data_lookup",
                "description": "Gets the 1-month stock price history for a given stock ticker, formatted as JSON.",
                "parameters": {
                    "ticker": {"description": "The ticker to retrieve price history for", "type": "string", "required": True}
                },
            },
        },
        # Portfolio optimization tool not shown in this excerpt
    ],
    llm=LLM,
)
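
On the other side of a tool definition like `stock_data_lookup` sits a Lambda function. The handler below is a minimal sketch of what such a function could look like, not the code from the repository. It assumes the function-style action group event shape (a `parameters` list of name/type/value entries) and the matching `functionResponse` reply format, and it returns placeholder data instead of real prices.

```python
import json

# Placeholder data standing in for a real market-data lookup (not real prices).
SAMPLE_PRICES = {"AMZN": [{"date": "2025-01-02", "close": 100.0}]}


def lambda_handler(event, context):
    """Handle a function-style action group call from an Amazon Bedrock agent."""
    # Parameters arrive as a list of {"name", "type", "value"} dictionaries.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    ticker = params.get("ticker", "").upper()
    history = SAMPLE_PRICES.get(ticker, [])
    body = json.dumps({"ticker": ticker, "history": history})
    # Echo the action group and function names so the agent can match the reply.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup"),
            "function": event.get("function"),
            "functionResponse": {"responseBody": {"TEXT": {"body": body}}},
        },
    }
```

The `ticker` parameter name matches the tool definition above, which is what lets the agent populate it from the user's request.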

Defining and building the supervisor agent (multi-agent investment research assistant) – The following creates a supervisor agent using SupervisorAgent.create. It defines the collaborator agents (subagents), and registers them with the supervisor agent along with its role, goal, instructions, and the underlying FM.

investment_research_assistant = SupervisorAgent.create(
    "investment_research_assistant",
    role="Investment Research Assistant",
    goal="A seasoned investment research expert responsible for orchestrating subagents …",
    collaboration_type="SUPERVISOR",
    instructions=f"""You are an Investment Research Assistant, a financial research supervisor overseeing multiple specialized agents. Your goal is to coordinate and synthesize their outputs to create a structured stock investment analysis.
…""",
    collaborator_agents=[
        {
            "agent": "news_agent",
            "instructions": f"Always check the knowledge base (ID: {kb_id}) first. Use this collaborator for finding news and analyzing specific documents."
        },
        {
            "agent": "quantitative_analysis_agent",
            "instructions": "Use this collaborator for retrieving stock price history and performing portfolio optimization."
        },
        {
            "agent": "smart_summarizer_agent",
            "instructions": "Use this collaborator for synthesizing stock trends, financial data, and generating structured investment insights."
        }
    ],
    collaborator_objects=[news_agent, quantitative_analysis_agent, smart_summarizer_agent],
    llm=LLM,
)

Invoking the supervisor agent (multi-agent investment research assistant) – The following uses the invoke method to pass in the defined request as a prompt and get a response from the supervisor agent, using the multi-agent collaboration architecture.

request = """What's AMZN stock price doing over the last week
and relate that to recent news"""
print(f"Request:\n{request}\n")
result = investment_research_assistant.invoke(
    request,
    enable_trace=True,
    trace_level="core",
)
print(f"Final answer:\n{result}")

Prerequisites
To deploy this solution, attach the appropriate permissions for Amazon Bedrock, AWS Lambda, Amazon OpenSearch Serverless, Amazon Simple Storage Service (Amazon S3), and AWS Identity and Access Management (IAM) to your execution role. You also need to deploy two AWS CloudFormation stacks: web_search and stock_data.
For more information on the permission policy, CloudFormation stacks, and setting up the environment, refer to the README.md in the investment_research_agent repository on GitHub.
Dive deeper into the solution
To dive deeper into the solution and the code shown in this post, refer to the investment research assistant agent GitHub repo. The repo contains instructions for the end-to-end solution, including setting up the supervisor agent and subagents, associated action groups, unstructured data (earnings reports PDF files and earnings call audio files), and structured data (stocks time series).
In the appendix at the end of this post, we share different questions asked by a financial analyst to the supervisor agent, the subagent invoked, and the answers from the supervisor agent.
Cleanup
To clean up this solution, make sure to delete all agents, knowledge bases, and associated Amazon OpenSearch Serverless instances. Delete both the web_search and stock_data stacks. This will make sure that no charges are incurred after you’re finished testing this solution. Read the associated notebook in the GitHub repository for more instructions on cleaning up the agents.
Conclusion
The multi-agent investment research assistant built using Amazon Bedrock Agents with its multi-agent collaboration feature represents how financial professionals can use AI to enhance their analytical capabilities. By using specialized agents—each focusing on distinct aspects of financial analysis—orchestrated by a supervisor agent, we’ve created a system that can process diverse data types, apply appropriate analytical techniques, and synthesize comprehensive insights in response to complex investment queries. The result is an AI assistant that can handle multistep financial analysis workflows that would otherwise overwhelm a single-agent system.
The application of multi-agent collaboration extends broadly across the financial sector, from equity research and portfolio management to risk assessment and compliance monitoring. A complex financial workflow that requires multiple specialized skills and data sources can benefit from this architectural approach.
The implementation uses key AWS services, including Amazon Bedrock for foundation models, the multi-agent collaboration capability of Amazon Bedrock Agents, Amazon S3 for document storage, Amazon OpenSearch Serverless for vector search capabilities, and Lambda functions for specialized actions. This serverless architecture provides scalability, and the knowledge base integration provides context from financial documents to ground the AI’s responses in factual data.
For more information:

Get started building your own multi-agent investment research assistant by accessing our complete solution code and implementation guide in the GitHub repo.
Explore how to customize the solution for your specific investment research needs by modifying the subagents’ specialized tools and action groups and by bringing your data securely to Amazon Bedrock Knowledge Bases.
You can also explore and run the Amazon Bedrock multi-agent collaboration workshop with AWS specialists or on your own.

Read more about single agent solutions for investment research with multi-modal data:

Part 1 – Generative AI and multi-modal agents in AWS: The key to unlocking new value in financial markets, which demonstrates how multi-modal agents use AWS services to efficiently gather financial insights from an organization’s text, speech, and tabular data sources for a financial analyst’s query, using open source LangChain agents as an alternative implementation.
Part 2 – AI-powered assistants for investment research with multi-modal data: An application of Amazon Bedrock Agents, which introduces a solution using Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases that can help financial analysts use various multi-modal financial data to gather financial insights.

Read the blog post on Unlocking complex problem-solving with multi-agent collaboration on Amazon Bedrock for more details on benefits, design, and evaluation of multi-agent collaboration for enterprise applications.

About the Authors
Sovik Kumar Nath is an AI/ML and Generative AI senior solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has two master’s degrees, from the University of South Florida and the University of Fribourg, Switzerland, and a bachelor’s degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.
Lucas Banerji is an AI/ML and GenAI specialist Solutions Architect at AWS. He is passionate about building agentic AI systems and exploring the frontier of what’s possible with intelligent automation. Lucas holds a degree in Computer Science from the University of Virginia. Outside of work, he enjoys running, practicing Muay Thai, and traveling the world.
Mohan Musti is a Principal Technical Account Manager based out of Dallas. Mohan helps customers architect and optimize applications on AWS. Mohan has a degree in Computer Science and Engineering from JNT University, India. In his spare time, he enjoys spending time with his family and camping.

Appendix
The following table contains example questions asked and responses from the AI-powered assistant for investment research based on structured and unstructured data.

Each entry below lists, in order: the question purpose; the question asked by the financial analyst to the AI-powered assistant; the subagent, action group, and/or knowledge base invoked; the intermediate steps from the supervisor agent’s chain-of-thought reasoning; and the multi-agent investment research assistant’s response.

Prompt the agent to analyze the company’s financial health and performance.
Analyze Amazon’s financial health based on the 2024 10k report. Calculate important financial ratios. Limit to 5 sentences.
news_agent (financial_analysis_kb), smart_summarizer_agent
1) I need to specifically request the 2024 10-K report information from the knowledge base. Let me ask the news agent for this specific report. 2) Since the knowledge base contains recent financial information and forward-looking guidance, I’ll provide a comprehensive summary of Amazon’s financial metrics and projections. 3) Now I’ll ask the analyst agent to analyze this information and calculate key financial ratios.
Based on Amazon’s recent financial data, the company demonstrates strong financial health with total assets of $624.9B and a robust cash position of $78.8B, while generating $143.3B in net sales and $59.2B in net income. Key profitability metrics show improvement with a 46% gross margin, 4.4% operating margin, and 14.8% return on equity, while efficiency ratios remain healthy with inventory turnover at 13.2x and asset turnover at 1.4x. Liquidity ratios (current ratio: 1.06, quick ratio: 0.88) are adequate though tight, but are supported by strong operating cash flow of $54.3B, representing a 38% year-over-year increase. The company maintains a balanced leverage profile with a debt-to-equity ratio of 0.98 and solid interest coverage ratio of 8.5x, indicating good debt management. Despite ongoing high capital expenditures, Amazon’s improving margins, strong cash generation, and efficient operations suggest a healthy financial position with sustainable growth potential.

Prompt the agent to analyze the company’s financial performance from the company’s earnings reports
What were the main drivers of AWS’s revenue growth in the Q1 2023 quarter? Please limit your answers to 5 sentences.
news_agent (financial_analysis_kb), smart_summarizer_agent
1.) I need to get information about AWS’s Q1 2023 earnings report. I’ll ask the news agent to check the knowledge base for this specific information. 2.) Based on the search results, I can provide a comprehensive answer about AWS’s Q1 2023 performance and growth drivers. The information comes directly from their earnings reports. 3.) I’ll synthesize this information into a concise response focusing on the main drivers and send it to the user.
AWS’s Q1 2023 revenue growth was primarily driven by continued cloud adoption, achieving net sales of $21.4 billion and a 16% year-over-year growth rate. However, the growth was tempered by customers across all segments optimizing their cloud spending in response to economic uncertainty. The business maintained strong fundamentals with an annualized sales run rate exceeding $85 billion, though momentum showed signs of slowing with April growth rates declining by approximately 500 basis points compared to Q1.

Prompt the agent to analyze the company’s management team and their plans for the future from the company’s earnings reports
What are Amazon’s top priorities for 2024? What are the biggest risks facing the company? Please limit your answers to 5 sentences.
news_agent (financial_analysis_kb & web_search), smart_summarizer_agent
1.) I’ll ask the news agent to check the knowledge base for recent information about Amazon’s strategic priorities and risk factors for 2024. 2.) Let me get additional recent news and risk factors by asking the news agent to search for supplementary information. 3.) Based on the recent news searches, I can now provide a comprehensive overview of the significant risks currently facing Amazon
Amazon’s top priorities for 2024 include expanding international stores, growing grocery and B2B retail segments, scaling the “Buy with Prime” program, advancing healthcare initiatives, and developing low Earth orbit satellite capabilities for improved connectivity. The company faces several significant risks, including substantial AI-related capital expenditure pressures (expected $75 billion in 2024), increasing regulatory scrutiny from the FTC and competition authorities, potential impacts from new Chinese import tariffs, and intensifying competition in both cloud computing and retail sectors. Amazon is balancing these ambitious growth initiatives with cost optimization efforts, while viewing their strategic investments as potentially transformative for their long-term future.

Prompt the agent to build an optimal portfolio using the collected data
Optimize my portfolio with AAA, BBB, and CCC [anonymized tickers]
quantitative_analysis_agent (stock_data_lookup & portfolio tool)
1.) I need to get historical stock data for these three stocks before running portfolio optimization. I’ll ask the stock data agent for this information. 2.) Now that I have retrieved the historical price data for all three stocks, I’ll perform portfolio optimization using this data. 3.) I’ll synthesize this information into a clear, comprehensive response for the user.
Based on portfolio optimization analysis, here is the recommended allocation: Company A (AAA): 61.04% Company B (BBB): 32.54% Company C (CCC): 6.42% This allocation optimizes risk-adjusted returns based on recent market data. Please note that these allocations are based on mathematical optimization and historical data. Consider your personal investment goals, risk tolerance, and market conditions when making investment decisions.

Prompt the agent to look up news and relate it to stock data
What was happening to AMZN stock in November 2024, relate that to news in the same timeframe
news_agent (web_search), quantitative_analysis_agent (stock_data_lookup), & smart_summarizer_agent
1.) I’ll need to get both recent stock price data and news, then synthesize them together. I’ll request both simultaneously to be efficient. 2.) I’ll now synthesize the stock price movement with the recent news developments to provide a comprehensive analysis.
November 2024: Stock Price Movement: – Trading Range: $161.43 (low) to $206.21 (high) – Overall Trend: Significant upward momentum with notable volatility – Key Technical Levels: Resistance at $206, Support at $190. Key News Impact: 1. Q4 2024 Guidance (Released Oct 31): – Projected net sales: $181.5-188.5 billion – Expected growth: 7-11% year-over-year – This guidance served as a major catalyst for the stock’s movement Price-News Correlation: – The strong upward movement (~27% range) appears to be primarily driven by positive Q4 guidance – Investor confidence was bolstered by the projected sales growth – The wide guidance range reflected optimism around the holiday shopping season – Higher trading volumes were observed during price movements, indicating strong market participation The stock’s performance suggests that investors responded positively to Amazon’s forward-looking guidance, particularly regarding holiday season expectations. The volatility in price movements indicates active trading around these news events, with the market generally maintaining an optimistic outlook on Amazon’s near-term prospects.

A generative AI prototype with Amazon Bedrock transforms life sciences …

Posted on May 29, 2025 by i-genie

It takes biopharma companies over 10 years, at a cost of over $2 billion and with a failure rate of over 90%, to deliver a new drug to patients. The Market to Molecule (M2M) value stream process, which biopharma companies must apply to bring new drugs to patients, is resource-intensive, lengthy, and highly risky. Nine out of ten biopharma companies are AWS customers, and helping them streamline and transform the M2M processes can help deliver drugs to patients faster, reduce risk, and bring value to our customers.
Pharmaceutical companies are taking a new approach to drug discovery, looking for variants in the human genome and linking them to diseases. This genetic validation approach can improve the success ratio in the M2M value stream process by focusing on the root cause of disease and the gene variants.
As depicted in the following M2M value stream diagram, the Research process (and the Basic Research sub-process) is critical to downstream processes where linking the gene variant to a disease occurs, and is instrumental in defining the target molecule. This can be a critical step in expediting and reducing the cost of delivering a new drug to patients.

To transform the M2M value stream process, our customer has been working on associating genes with diseases by using their large dataset of over 2 million sequenced exomes (genes that are expressed into proteins). To accomplish this, the customer’s clinical scientists have to develop methods to navigate through the enormous dataset by using online genome browsers, a mechanical data-first experience that doesn’t fully meet the needs of users. Starting with a search query to get results, the typical interactions of navigating levels, filtering, waiting, and repeating the search can be time-consuming and tedious. Simplifying the UI from the traditional human browser to a conversational AI assistant can enhance the user experience in the clinical research process.
Generative AI is a promising next step in leading this change. As generative AI started to make a significant impact in healthcare and life sciences, this use case was primed for experimentation. In collaboration with the customer, AWS built a custom approach in which scientists post a question, or a series of questions, and so explore the genome with more flexibility and agility. Our customer aimed to save researchers countless hours of work with a new generative AI-enabled gene assistant. Identifying variants and their potential correlation with diseases can be done more efficiently using words, rather than filters, settings, and buttons. A more streamlined research process can increase the likelihood of new breakthroughs.
This post explores deploying a text-to-SQL pipeline using generative AI models and Amazon Bedrock to ask natural language questions to a genomics database. We demonstrate how to implement an AI assistant web interface with AWS Amplify and explain the prompt engineering strategies adopted to generate the SQL queries. Finally, we present instructions to deploy the service in your own AWS account. Amazon Bedrock is a fully managed service that provides access to large language models (LLMs) and other foundation models (FMs) from leading AI companies through a single API, allowing you to use it instantly without much effort, saving developers valuable time. We used the AWS HealthOmics variant stores to store the Variant Call Format (VCF) files with omics data. A VCF file is typically the output of a bioinformatics pipeline. VCFs encode Single Nucleotide Polymorphisms (SNPs) and other structural genetic variants. The format is further described on the 1000 Genomes project website. We used the AWS HealthOmics – End to End workshop to deploy the variants and annotation stores.
Although this post focuses on a text-to-SQL approach to an omics database, the generative AI approaches discussed here can be applied to a variety of complex schemas of relational databases.
Text-to-SQL for genomics data
Text-to-SQL is a task in natural language processing (NLP) to automatically convert natural language text into SQL queries. This involves translating the written text into a structured format and using it to generate an accurate SQL query that can run on a database. The task is difficult because there are big differences between human language, which is flexible, ambiguous, and dependent on context, and SQL, which is structured.
Before LLMs for text-to-SQL, user queries had to be preprocessed to match specific templates, which were then used to rephrase the queries. This approach was use case-specific and required data preparation and manual work. Now, with LLMs, the text-to-SQL task has undergone a major transformation. LLMs continue to showcase key performance improvements in generating valid SQL queries from natural language queries. Relying on pre-trained models trained on massive datasets, LLMs can identify the relationships between words in language and accurately predict the next ones to be used.
However, although LLMs have remarkable performance in many text-to-SQL problems, they have limitations that lead to hallucinations. This post describes the main approaches used to overcome these limitations.
There are two key strategies to achieve high accuracy in text-to-SQL services:

Prompt engineering – The prompt is structured to annotate different components, such as pointing to columns and schemas, and then instructing the model on which type of SQL to create. These annotations act as instructions that guide the model in formatting the SQL output correctly. For example, a prompt might contain annotations showing specific table columns and guiding the model to generate a SQL query. This approach allows for more control over the model’s output by explicitly specifying the desired structure and format of the SQL query.
Fine-tuning – You can start with a model pre-trained on a large general text corpus and then apply instruction-based fine-tuning with labeled examples to improve the model’s performance on text-to-SQL tasks. This process adapts the model to the target task by training it directly on that task, but it requires a substantial number of text-SQL examples.

This post focuses on the prompt engineering strategy for SQL generation. AWS customers deploy prompt engineering strategies first because they’re efficient in returning high-quality results and require a less complex infrastructure and process. For more details and best practices on when to follow each approach, refer to Best practices to build generative AI applications on AWS.
We experimented with prompt engineering using chain-of-thought and tree-of-thought approaches to improve the reasoning and SQL generation capabilities. The chain-of-thought prompting technique guides the LLMs to break down a problem into a series of intermediate steps or reasoning steps, explicitly expressing their thought process before arriving at a definitive answer or output.
Using prompts, we compelled the LLM to generate a series of statements about its own reasoning, allowing the LLM to articulate its reasoning process to produce accurate and understandable outputs. The tree-of-thought approach introduces a structured branching approach to the reasoning process. Instead of a linear chain, we prompt the LLM to generate a tree-like structure, where each node represents a sub-task, sub-question, or intermediate step in the overall problem-solving process.
Solution Overview
The following architecture depicts the solution and AWS services we used to accomplish the prototype.

The workflow consists of the following steps:

A scientist submits a natural language question or request to a chat web application connected through Amplify and integrated with an AWS AppSync GraphQL API.
The request is submitted to Amazon API Gateway, which transfers the request to an AWS Lambda function that contains the text-to-SQL implementation. We recommend the implementation of a second helper Lambda function to fetch variants data, or gene names, or ClinVar listed diseases, to simplify the user experience and facilitate the SQL generation process.
The text-to-SQL Lambda function receives the natural language request, merges the input question with the prompt template, and submits to Amazon Bedrock to generate the SQL.

Our implementation also adds a step to simplify the incoming history into a single request. We submit a request to Amazon Bedrock to transform the historical inputs from that user session into a simplified natural language request. This step is optional.

With the generated SQL, the Lambda function submits the query to Amazon Athena to retrieve the genomic data from the Amazon Simple Storage Service (Amazon S3) bucket.

If successful, the Lambda function updates the user session stored in Amazon DynamoDB through an AWS AppSync request. That change will automatically appear on the UI that is subscribed to changes to the session table.
If an error occurs, the code attempts to re-generate the SQL query, passing the returned error as input and requesting it to fix the error. The Lambda function then reruns the re-generated SQL against Athena and returns the result.
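
The generate, run, and retry-on-error steps above can be sketched in a few lines. This is a minimal illustration, not the authors' Lambda code: `answer_question`, `generate_sql`, and `run_query` are hypothetical names, and the two callables stand in for the Amazon Bedrock and Amazon Athena calls so that the control flow can be exercised without AWS access.

```python
from typing import Any, Callable, Dict, List, Tuple

Rows = List[Dict[str, Any]]


def answer_question(
    question: str,
    generate_sql: Callable[[str], str],   # hypothetical wrapper around the LLM call
    run_query: Callable[[str], Rows],     # hypothetical wrapper around the query engine
    max_attempts: int = 2,
) -> Tuple[str, Rows]:
    """Generate SQL for a question, run it, and retry once with the error text."""
    prompt = question
    last_error = None
    for _ in range(max_attempts):
        sql = generate_sql(prompt)
        try:
            return sql, run_query(sql)
        except Exception as err:  # any failure is fed back to the model
            last_error = err
            prompt = (
                f"{question}\n\nThe previous SQL was:\n{sql}\n"
                f"It failed with this error:\n{err}\nPlease fix the query."
            )
    raise RuntimeError(f"No valid SQL after {max_attempts} attempts") from last_error


# Stubs used only to demonstrate the flow (assumptions, not real services).
def fake_generate(prompt: str) -> str:
    return "SELECT 1" if "failed" in prompt else "SELEC 1"


def fake_run(sql: str) -> Rows:
    if not sql.startswith("SELECT"):
        raise ValueError("syntax error near 'SELEC'")
    return [{"_col0": 1}]
```

Passing the callables in keeps the retry logic independent of any particular SDK, so the same function can be tested locally with stubs and wired to real services inside the Lambda function.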

Generative AI approaches to text-to-SQL
We tested the following prompt-engineering strategies:

LLM SQL agents
LLM with Retrieval Augmented Generation (RAG) to detect tables and columns of interest
Prompt engineering with full description of tables and columns of interest
Prompt engineering with chain-of-thought and tree-of-thought approaches
Prompt engineering with a dynamic few-shot approach

We didn’t achieve good results with SQL agents. We experimented with LangChain SQL agents. It was difficult for the agent to use contextual information from the dataset to generate accurate and syntactically correct SQL. A big challenge in omics data is that certain columns are arrays of structs or maps. At the time of building this project, the agents were incapable of detecting these nuances and failed to generate relevant SQL.
We experimented with a RAG approach to retrieve relevant tables and columns, given a user question. Then we informed the LLM by prompting it to generate a SQL query using only those tables and columns. A motivation behind this experiment is that a RAG approach can deal well with hundreds or thousands of columns or tables. However, this approach also didn’t return good results. This RAG approach returned too many irrelevant variables to be used in each SQL generation.
The next three approaches were successful, and we used them in combination to get the highest accuracy on syntactically correct SQL generation.
A first prompt idea we tested was to provide a full description of the main tables and columns to be used in the SQL generation given a user question. In the following example, we show a snapshot of the prompts used to describe the 1000 Genome variants tables. The goal of the prompt with database tables and column descriptions is to teach the LLM how to use the schema to generate queries. We approached it as if teaching a new developer that will write queries to that database, with examples of SQL queries to extract the correct dataset, how to filter the data, and only using the most relevant columns.

<table>
       <table_name>
       variants
       </table_name>
       <table_description>
       This table contains information about genetic variants.
       </table_description>
       <column>
              <column_name>contigname</column_name>
              <column_description>
This column specifies the name of the contig (a contiguous sequence of DNA) or chromosome where the variant is located. It is typically prefixed with “chr”. If the user asks for variants on chromosome 22, use `chr22` to access variants in this table.
              </column_description>
              <example_use>
                      select *
                      from variants
                      where contigname = 'chr22'
                      and start between 45509414 and 45509418;
              </example_use>
       </column>
       <column>
              <column_name>start</column_name>
              <column_description>
                      The start position of the variant on the chromosome. This should
                      be used to compose the primary key of the variant, along with the
                      following columns: `contigname`, `end`, `referenceallele`, `alternatealleles`.
              </column_description>
              <example_use>
                      SELECT * FROM variants WHERE start > 100000 and end < 200000;
              </example_use>
       </column>
</table>
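
When a schema has many tables, descriptions like the one above can be kept as structured data and rendered into the prompt. The following is a minimal sketch under that assumption; `describe_table` and the dictionary keys are hypothetical, and the output is treated as prompt text for the LLM rather than strict XML, so SQL operators such as `>` are deliberately left unescaped.

```python
def describe_table(name, description, columns):
    """Render a table description in the tagged layout used in the prompt.

    `columns` is a list of dicts with the keys `name`, `description`, and
    `example` (illustrative key names, not taken from the original project).
    """
    parts = [
        "<table>",
        f"  <table_name>{name}</table_name>",
        f"  <table_description>{description}</table_description>",
    ]
    for col in columns:
        parts += [
            "  <column>",
            f"    <column_name>{col['name']}</column_name>",
            f"    <column_description>{col['description']}</column_description>",
            f"    <example_use>{col['example']}</example_use>",
            "  </column>",
        ]
    parts.append("</table>")
    return "\n".join(parts)


# Example input based on the `variants` description shown above.
VARIANTS_PROMPT = describe_table(
    "variants",
    "This table contains information about genetic variants.",
    [
        {
            "name": "contigname",
            "description": "Name of the contig or chromosome, prefixed with chr.",
            "example": "select * from variants where contigname = 'chr22';",
        }
    ],
)
```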

The team also created a prompt that used chain-of-thought, and its evolution tree-of-thought, to improve the reasoning and SQL generation capabilities.
The chain-of-thought prompting technique encourages LLMs to break down a problem into a series of intermediate steps, explicitly expressing their thought process before arriving at a definitive answer or output. This approach takes inspiration from the way humans often break down problems into smaller, manageable parts.
Through the use of prompts, we compelled the LLM to generate a chain-of-thought, letting the LLM articulate its reasoning process and produce more accurate and understandable outputs. This technique has the potential to improve performance on tasks that require multi-step reasoning, such as SQL generation from open-ended natural language questions. This approach presented excellent results with the FM that we tested.
As a next step in our experimentation, we used the tree-of-thought technique to generate even better results than the chain-of-thought approach. The tree-of-thought approach introduces a more structured and branching approach to the reasoning process. Instead of a linear chain, we prompt the LLM to generate a tree-like structure, where each node represents a sub-task, sub-question, or intermediate step in the overall problem-solving process. The following example presents how we used these two approaches in the prompt template:

Imagine three different experts are answering this question. All experts will write down 1 step
of their thinking, then share it with the group. Then all experts will go on to the next step, etc.
If any expert realises they’re wrong at any point then they leave. Each of the three experts should
explain their thinking along with the generated SQL statement. Your final step is to review the
generated SQL code for syntax errors. Pay close attention to any use of the UNNEST function – it
MUST be immediately followed by ‘AS t(unpacked)’ rather than ‘AS t’. If you find a syntax error
with the generated SQL, produce a corrected version within <SQL_FIXED> tags. Only produce
the <SQL_FIXED> code if you find a syntax problem in the <SQL_QUERY> tags.
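
Because the template asks for the query inside `<SQL_QUERY>` tags and for a correction inside `<SQL_FIXED>` tags, the calling code needs to pick the right one out of the response. A minimal sketch of that step follows; `extract_sql` is a hypothetical helper, and taking the last match when several experts each emit a query is an assumption, not something the post specifies.

```python
import re
from typing import Optional


def extract_sql(response: str) -> Optional[str]:
    """Return the corrected query if present, else the original query, else None."""
    for tag in ("SQL_FIXED", "SQL_QUERY"):  # prefer the reviewed version
        matches = re.findall(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        if matches:
            # Assumption: with several experts answering, the last query wins.
            return matches[-1].strip()
    return None
```

For example, a response containing both `<SQL_QUERY>... UNNEST(x) AS t</SQL_QUERY>` and `<SQL_FIXED>... UNNEST(x) AS t(unpacked)</SQL_FIXED>` yields the corrected statement, and a response without either tag yields `None` so the caller can decide whether to retry.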

Finally, we tested a few-shot and a dynamic few-shot approach. The few-shot approach is a prompting technique used in prompt engineering for LLMs. It involves providing the LLM with a few examples or demonstrations, along with the input prompt, to guide the model’s generation or output. In the few-shot setting, the prompt comprises the following:

An instruction or task description
A few examples or demonstrations of the desired output, given a specific input
The new input for which the LLM will generate an output

Exposure to these examples helps the model recognize patterns and infer the underlying rules or mappings between the input and the desired output.
The dynamic few-shot approach extends the few-shot prompting technique. It introduces the concept of dynamically generating or selecting the examples or demonstrations used in the prompt, based on the specific input or context. In this approach, instead of providing a fixed set of examples, the prompt generation process involves:

Analyzing the input or context
Creating embeddings of the examples and of the input, and retrieving or generating relevant examples or demonstrations tailored to the specific input by applying a semantic search
Constructing the prompt with the selected examples and the input
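
The three steps above can be sketched as follows. This is an illustration only: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the function names, example questions, and SQL strings are hypothetical rather than taken from the project.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_examples(question, examples, k=2):
    """Semantic search: return the k examples most similar to the question."""
    q = embed(question)
    ranked = sorted(
        examples, key=lambda ex: cosine(q, embed(ex["question"])), reverse=True
    )
    return ranked[:k]


def build_prompt(instruction, question, examples, k=2):
    """Assemble instruction, selected demonstrations, and the new input."""
    blocks = [instruction]
    for ex in select_examples(question, examples, k):
        blocks.append(f"Question: {ex['question']}\nSQL: {ex['sql']}")
    blocks.append(f"Question: {question}\nSQL:")
    return "\n\n".join(blocks)


# Hypothetical example pool.
EXAMPLES = [
    {"question": "Show variants on chromosome 22",
     "sql": "SELECT * FROM variants WHERE contigname = 'chr22';"},
    {"question": "Count variants per chromosome",
     "sql": "SELECT contigname, count(*) FROM variants GROUP BY contigname;"},
    {"question": "List variants that start after position 100000",
     "sql": "SELECT * FROM variants WHERE start > 100000;"},
]
```

Calling `build_prompt("Write SQL.", "Which variants are on chromosome 1?", EXAMPLES)` places the two chromosome-related examples ahead of the new question and leaves out the position-based one, which is the behavior that distinguishes the dynamic approach from a fixed set of demonstrations.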

Conclusion
This post demonstrated how to implement a text-to-SQL solution that democratizes access to omics data for users who aren’t data analytics specialists. The approach used HealthOmics and Amazon Bedrock to generate SQL from natural language queries, and it has the potential to open omics data to a larger audience than is served today.
The code is available in the accompanying GitHub repo. The deployment instructions for the HealthOmics variants and annotation store can be found in the AWS HealthOmics – End to End workshop. The deployment instructions for the text-to-SQL project are available in the README file.
We would like to acknowledge Thomaz Silva and Saeed Elnaj for their contributions to this blog. It couldn’t have been done without them.

About the Authors
Ganesh Raam Ramadurai is a Senior Technical Program Manager at Amazon Web Services (AWS), where he leads the PACE (Prototyping and Cloud Engineering) team. He specializes in delivering innovative, AI/ML and Generative AI-driven prototypes that help AWS customers explore emerging technologies and unlock real-world business value. With a strong focus on experimentation, scalability, and impact, Ganesh works at the intersection of strategy and engineering—accelerating customer innovation and enabling transformative outcomes across industries.
Jeff Harman is a Senior Prototyping Architect on the Amazon Web Services (AWS) Prototyping and Cloud Engineering team, where he specializes in developing innovative solutions that leverage AWS’s cloud infrastructure to meet complex business needs. He is a seasoned technology professional with over three decades of experience in software engineering, enterprise architecture, and cloud computing. Prior to his tenure at AWS, Jeff held various leadership roles at Webster Bank, including Vice President of Platform Architecture for Core Banking, Vice President of Enterprise Architecture, and Vice President of Application Architecture. During his time at Webster Bank, he was instrumental in driving digital transformation initiatives and enhancing the bank’s technological capabilities. He holds a Master of Science degree from the Rochester Institute of Technology, where he conducted research on creating a Java-based, location-independent desktop environment, a forward-thinking project that anticipated the growing need for remote computing solutions. Based in Unionville, Connecticut, Jeff continues to be a driving force in the field of cloud computing, applying his extensive experience to help organizations harness the full potential of AWS technologies.
Kosal Sen is a Design Technologist on the Amazon Web Services (AWS) Prototyping and Cloud Engineering team. Kosal specializes in creating solutions that bridge the gap between technology and actual human needs. As an AWS Design Technologist, that means building prototypes on AWS cloud technologies, and ensuring they bring empathy and value into the real world. Kosal has extensive experience spanning design, consulting, software development, and user experience. Prior to AWS, Kosal held various roles where he combined technical skillsets with human-centered design principles across enterprise-scale projects.