InterVision accelerates AI development using AWS LLM League and Amazon …

Cities and local governments are continuously seeking ways to enhance their non-emergency services, recognizing that intelligent, scalable contact center solutions play a crucial role in improving citizen experiences. InterVision Systems, LLC (InterVision), an AWS Premier Tier Services Partner and Amazon Connect Service Delivery Partner, has been at the forefront of this transformation, with their contact center solution designed specifically for city and county services called ConnectIV CX for Community Engagement. Though their solution already streamlines municipal service delivery through AI-powered automation and omnichannel engagement, InterVision recognized an opportunity for further enhancement with advanced generative AI capabilities.
InterVision used the AWS LLM League program to accelerate their generative AI development for non-emergency (311) contact centers. As AWS LLM League events began rolling out in North America, this initiative represented a strategic milestone in democratizing machine learning (ML) and enabling partners to build practical generative AI solutions for their customers.
Through this initiative, InterVision’s solutions architects, engineers, and sales teams participated in fine-tuning large language models (LLMs) using Amazon SageMaker AI specifically for municipal service scenarios. InterVision used this experience to enhance their ConnectIV CX solution and demonstrated how AWS Partners can rapidly develop and deploy domain-specific AI solutions.
This post demonstrates how AWS LLM League’s gamified enablement accelerates partners’ practical AI development capabilities, while showcasing how fine-tuning smaller language models can deliver cost-effective, specialized solutions for specific industry needs.
Understanding the AWS LLM League
The AWS LLM League represents an innovative approach to democratizing ML through gamified enablement. The program proves that with the right tools and guidance, almost any role—from solutions architects and developers to sales teams and business analysts—can successfully fine-tune and deploy generative AI models without requiring deep data science expertise. Though initially run as larger multi-organization events such as at AWS re:Invent, the program has evolved to offer focused single-partner engagements that align directly with specific business objectives. This targeted approach allows for customization of the entire experience around real-world use cases that matter most to the participating organization.
The program follows a three-stage format designed to build practical generative AI capabilities. It begins with an immersive hands-on workshop where participants learn the fundamentals of fine-tuning LLMs using Amazon SageMaker JumpStart, an ML hub that provides pre-trained, publicly available foundation models you can deploy and fine-tune with minimal setup.
The competition then moves into an intensive model development phase. During this phase, participants iterate through multiple fine-tuning approaches, which can include dataset preparation, data augmentation, and other techniques. Participants submit their models to a dynamic leaderboard, where each submission is evaluated by an AI system that measures the model’s performance against specific benchmarks. This creates a competitive environment that drives rapid experimentation and learning, because participants can observe how their fine-tuned models perform against larger foundation models (FMs), encouraging optimization and innovation.
The program culminates in an interactive finale structured like a live game show as seen in the following figure, where top-performing participants showcase their models’ capabilities through real-time challenges. Model responses are evaluated through a triple-judging system: an expert panel assessing technical merit, an AI benchmark measuring performance metrics, and audience participation providing real-world perspective. This multi-faceted evaluation verifies that models are assessed not just on technical performance, but also on practical applicability.

The power of fine-tuning for business solutions
Fine-tuning an LLM is a type of transfer learning, a process that trains a pre-trained model on a new dataset without training from scratch. This process can produce accurate models with smaller datasets and less training time. Although FMs offer impressive general capabilities, fine-tuning smaller models for specific domains often delivers exceptional results at lower cost. For example, a fine-tuned 3B parameter model can outperform larger 70B parameter models in specialized tasks, while requiring significantly fewer computational resources. A 3B parameter model can run on an ml.g5.4xlarge instance, whereas a 70B parameter model would require the much more powerful and costly ml.g5.48xlarge instance. This approach aligns with recent industry developments, such as DeepSeek’s success in creating more efficient models through knowledge distillation techniques. Distillation is often implemented through a form of fine-tuning, where a smaller student model learns by mimicking the outputs of a larger, more complex teacher model.
In InterVision’s case, the AWS LLM League program was specifically tailored around their ConnectIV CX solution for community engagement services. For this use case, fine-tuning enables precise handling of municipality-specific procedures and responses aligned with local government protocols. Furthermore, the customized model provides reduced operational cost compared to using larger FMs, and faster inference times for better customer experience.
Fine-tuning with SageMaker Studio and SageMaker JumpStart
The solution centers on SageMaker JumpStart in Amazon SageMaker Studio, which is a web-based integrated development environment (IDE) for ML that lets you build, train, debug, deploy, and monitor your ML models. With SageMaker JumpStart in SageMaker Studio, ML practitioners use a low-code/no-code (LCNC) environment to streamline the fine-tuning process and deploy their customized models into production.
Fine-tuning FMs with SageMaker JumpStart involves a few steps in SageMaker Studio (a programmatic sketch using the SageMaker Python SDK follows the list):

Select a model – SageMaker JumpStart provides pre-trained, publicly available FMs for a wide range of problem types. You can browse and access FMs from popular model providers for text and image generation models that are fully customizable.
Provide a training dataset – You select your training dataset saved in Amazon Simple Storage Service (Amazon S3), allowing you to take advantage of its virtually unlimited storage capacity.
Perform fine-tuning – You can customize hyperparameters such as epochs, learning rate, and batch size before launching the fine-tuning job. After you choose Start, SageMaker JumpStart handles the entire fine-tuning process.
Deploy the model – When the fine-tuning job is complete, you can access the model in SageMaker Studio and choose Deploy to start running inference against it. In addition, you can import the customized model into Amazon Bedrock, a managed service that enables you to deploy and scale models for production.
Evaluate the model and iterate – You can evaluate a model in SageMaker Studio using Amazon SageMaker Clarify, an LCNC solution to assess the model’s accuracy, explain model predictions, and review other relevant metrics. This allows you to identify areas where the model can be improved and iterate on the process.
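Although the preceding steps use the Studio UI, the same workflow can be scripted with the SageMaker Python SDK. The following is a minimal sketch only: the model ID, S3 location, and hyperparameter names are illustrative placeholders that vary by JumpStart model, so consult the model card for the values that apply to your choice.

from sagemaker.jumpstart.estimator import JumpStartEstimator

# Hypothetical model ID and S3 path -- substitute the JumpStart model and
# training data location you actually use.
model_id = "meta-textgeneration-llama-3-2-3b"
training_data = "s3://your-bucket/fine-tuning-data/"

estimator = JumpStartEstimator(
    model_id=model_id,
    environment={"accept_eula": "true"},  # required by some gated models
)

# Hyperparameter names and valid values differ per model; these are examples only.
estimator.set_hyperparameters(epoch="3", learning_rate="0.0001")

# Launch the fine-tuning job on the dataset stored in Amazon S3.
estimator.fit({"training": training_data})

# Deploy the fine-tuned model to a real-time endpoint and run a test inference.
predictor = estimator.deploy()
print(predictor.predict({"inputs": "How do I report a pothole to the city?"}))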

This streamlined approach significantly reduces the complexity of developing and deploying specialized AI models while maintaining high performance standards and cost-efficiency. For the AWS LLM League model development phase, the workflow is depicted in the following figure.

During the model development phase, you start with a default base model and an initial dataset uploaded to an S3 bucket. You use SageMaker JumpStart to fine-tune your model, then submit the customized model to the AWS LLM League leaderboard, where it is evaluated against a larger pre-trained model. This allows you to benchmark your model’s performance and identify areas for further improvement.
The leaderboard, as shown in the following figure, provides a ranking of how you stack up against your peers. This will motivate you to refine your dataset, adjust the training hyperparameters, and resubmit an updated version of your model. This gamified experience fosters a spirit of friendly competition and continuous learning. The top-ranked models from the leaderboard will ultimately be selected to compete in the AWS LLM League’s finale game show event.

Empowering InterVision’s AI capabilities
The AWS LLM League engagement provided InterVision with a practical pathway to enhance their AI capabilities while addressing specific customer needs. InterVision participants could immediately apply their learning to solve real business challenges by aligning the competition with their ConnectIV CX solution use cases.
The program’s intensive format proved highly effective, enabling InterVision to compress their AI development cycle significantly. The team successfully integrated fine-tuned models into their environment, enhancing the intelligence and context-awareness of customer interactions. This hands-on experience with SageMaker JumpStart and model fine-tuning created immediate practical value.

“This experience was a true acceleration point for us. We didn’t just experiment with AI—we compressed months of R&D into real-world impact. Now, our customers aren’t asking ‘what if?’ anymore, they’re asking ‘what’s next?’”
– Brent Lazarenko, Head of Technology and Innovation at InterVision.

Using the knowledge gained through the program, InterVision has been able to enhance their technical discussions with customers about generative AI implementation. Their ability to demonstrate practical applications of fine-tuned models has helped facilitate more detailed conversations about AI adoption in customer service scenarios. Building on this foundation, InterVision developed an internal virtual assistant using Amazon Bedrock, incorporating custom models, multi-agent collaboration, and retrieval architectures connected to their knowledge systems. This implementation serves as a proof of concept for similar customer solutions while demonstrating practical applications of the skills gained through the AWS LLM League.
As InterVision progresses toward AWS Generative AI Competency, these achievements showcase how partners can use AWS services to develop and implement sophisticated AI solutions that address specific business needs.
Conclusion
The AWS LLM League program demonstrates how gamified enablement can accelerate partners’ AI capabilities while driving tangible business outcomes. Through this focused engagement, InterVision not only enhanced their technical capabilities in fine-tuning language models, but also accelerated the development of practical AI solutions for their ConnectIV CX environment. The success of this partner-specific approach highlights the value of combining hands-on learning with real-world business objectives.
As organizations continue to explore generative AI implementations, the ability to efficiently develop and deploy specialized models becomes increasingly critical. The AWS LLM League provides a structured pathway for partners and customers to build these capabilities, whether they’re enhancing existing solutions or developing new AI-powered services.
Learn more about implementing generative AI solutions:

Explore how you can participate in the AWS LLM League
Explore how you can get started to build, train, and deploy ML models at scale with SageMaker AI
Use model customization with your own data in Amazon Bedrock
Connect with AWS Partners like InterVision who are building innovative AI solutions

You can also visit the AWS Machine Learning blog for more stories about partners and customers implementing generative AI solutions across various industries.

About the Authors
Vu Le is a Senior Solutions Architect at AWS with more than 20 years of experience. He works closely with AWS Partners to expand their cloud business and increase adoption of AWS services. Vu has deep expertise in storage, data modernization, and building resilient architectures on AWS, and has helped numerous organizations migrate mission-critical systems to the cloud. Vu enjoys photography, his family, and his beloved corgi.
Jaya Padma Mutta is a Manager of Solutions Architects at AWS, based in Seattle. She is focused on helping AWS Partners build their cloud strategy. She enables and mentors a team of technical Solutions Architects aligned to multiple global strategic partners. Prior to joining this team, Jaya spent over 5 years in AWS Premium Support Engineering leading global teams and building processes and tools to improve customer experience. Outside of work, she loves traveling, enjoys nature, and is an ardent dog-lover.
Mohan CV is a Principal Solutions Architect at AWS, based in Northern Virginia. He has an extensive background in large-scale enterprise migrations and modernization, with a specialty in data analytics. Mohan is passionate about working with new technologies and enjoys assisting customers in adapting them to meet their business needs.
Rajesh Babu Nuvvula is a Solutions Architect in the Worldwide Public Sector team at AWS. He collaborates with public sector partners and customers to design and scale well-architected solutions. Additionally, he supports their cloud migrations and application modernization initiatives. His areas of expertise include designing distributed enterprise applications and databases.
Brent Lazarenko is the Head of Technology & AI at InterVision Systems, where he’s shaping the future of AI, cloud, and data modernization for over 1,700 clients. A founder, builder, and innovator, he scaled Virtuosity into a global powerhouse before a successful private equity exit. Armed with an MBA, MIT AI & leadership creds, and PMP/PfMP certifications, he thrives at the intersection of tech and business. When he’s not driving digital transformation, he’s pushing the limits of what’s next in AI, Web3, and the cloud.

Improve Amazon Nova migration performance with data-aware prompt optim …

In the era of generative AI, new large language models (LLMs) are continually emerging, each with unique capabilities, architectures, and optimizations. Among these, Amazon Nova foundation models (FMs) deliver frontier intelligence and industry-leading cost-performance, available exclusively on Amazon Bedrock. Since its launch in 2024, generative AI practitioners, including the teams in Amazon, have started transitioning their workloads from existing FMs and adopting Amazon Nova models.
However, when transitioning between different foundation models, the prompts created for your original model might not perform as well on Amazon Nova models without prompt engineering and optimization. Amazon Bedrock prompt optimization offers a tool to automatically optimize prompts for your specified target models (in this case, Amazon Nova models), converting your original prompts into Amazon Nova-style prompts. Additionally, a key challenge during the migration to Amazon Nova is making sure that post-migration performance is at least as good as, or better than, performance before the migration. Meeting that bar requires thorough model evaluation, benchmarking, and data-aware optimization: comparing the Amazon Nova model’s performance against the model used before the migration, and optimizing the prompts on Amazon Nova until performance matches or exceeds that of the previous workload.
In this post, we present an LLM migration paradigm and architecture, including a continuous process of model evaluation, prompt generation using Amazon Bedrock, and data-aware optimization. The solution evaluates the model performance before migration and iteratively optimizes the Amazon Nova model prompts using a user-provided dataset and objective metrics. We demonstrate successful migration to Amazon Nova for three LLM tasks: text summarization, multi-class text classification, and question answering implemented with Retrieval Augmented Generation (RAG). We also discuss the lessons learned and best practices for implementing the solution in your real-world use cases.
Migrating your generative AI workloads to Amazon Nova
Migrating the model in your generative AI workload to Amazon Nova requires a structured approach to achieve performance consistency and improvement. It includes evaluating and benchmarking the old and new models, optimizing prompts on the new model, and testing and deploying the new models in your production environment. In this section, we present a four-step workflow and a solution architecture, as shown in the following architecture diagram.

The workflow includes the following steps:

Evaluate the source model and collect key performance metrics based on your business use case, such as response accuracy, response format correctness, latency, and cost, to set a performance baseline as the model migration target.
Automatically update the structure, instruction, and language of your prompts to adapt to the Amazon Nova model for accurate, relevant, and faithful outputs. We will discuss this more in the next section.
Evaluate the optimized prompts on the migrated Amazon Nova model to meet the performance target defined in Step 1. You can conduct the optimization in Step 2 as an iterative process until the optimized prompts meet your business criteria.
Conduct A/B testing to validate the Amazon Nova model performance in your testing and production environment. When you’re satisfied, you can deploy the Amazon Nova model, settings, and prompts in production. 

This four-step workflow needs to run continuously, to adapt to variations in both the model and the data, driven by the changes in business use cases. The continuous adaptation provides ongoing optimization and helps maximize overall model performance.
Data-aware prompt optimization on Amazon Nova
In this section, we present a comprehensive two-step optimization methodology. The first step is to use Amazon Bedrock prompt optimization to refine your prompt structure; the second is an innovative data-aware prompt optimization approach that further refines the prompt to improve the Amazon Nova model performance.
Amazon Bedrock prompt optimization
Amazon Bedrock provides a prompt optimization feature that rewrites prompts to improve performance for your use cases. Prompt optimization streamlines the way that AWS developers interact with FMs on Amazon Bedrock, automatically adapting prompts to the selected models and rewriting them for better performance.
As the first step, you can use prompt optimization to adapt your prompt to Amazon Nova. By analyzing the prompt you provide, the feature interprets the task, system prompt, and instruction within the prompt, and automatically crafts the prompt with Amazon Nova specific format and appropriate words, phrases, and sentences. The following example shows how prompt optimization converts a typical prompt for a summarization task on Anthropic’s Claude Haiku into a well-structured prompt for an Amazon Nova model, with sections that begin with special markdown tags such as ## Task, ### Summarization Instructions, and ### Document to Summarize.

Model: Anthropic’s Claude 3 Haiku
Prompt: Human: Act like you are an intelligent AI assistant. You are required to provide a summarization based on given document. Please use below instructions when generating the response. The document is provided in <doc></doc> XML tags. Please be brief and concise in your answer. Do not add any information that is not mentioned in the document. Do not provide any preamble and directly start with the summarization. Do not make up the answer, If you don’t know the answer, just say that I don’t know.<doc>{document}</doc>Assistant:

Model: Amazon Nova Lite with Amazon Bedrock prompt optimization
Prompt: ### Task Your task is to summarize the given document enclosed in <doc></doc> tags in a brief and concise manner. Follow these guidelines:### Summarization Instructions – Read the document carefully to understand its main points and key information. – Identify the core ideas, arguments, and supporting details presented in the document. – Synthesize the essential information into a clear and succinct summary. – Use your own words to paraphrase the key points – do not copy verbatim from the original text. – Omit any extraneous or redundant information not central to the main ideas. – Do not introduce new information or make up content not present in the original document. – If you cannot summarize the document due to lack of understanding, simply respond “I don’t know.”### Document to Summarize <doc>{document}</doc>Please provide your concise summary immediately without any preamble.
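Prompt optimization can also be invoked programmatically. The following sketch assumes the OptimizePrompt operation exposed through the boto3 bedrock-agent-runtime client and an Amazon Nova Lite target model ID; the operation returns an event stream, and you should verify the request and response field names against the current boto3 documentation for your SDK version.

import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

original_prompt = (
    "Act like you are an intelligent AI assistant. Provide a brief, concise "
    "summarization of the document in <doc>{document}</doc>."
)

# Ask Amazon Bedrock to rewrite the prompt for the target Amazon Nova model.
response = client.optimize_prompt(
    input={"textPrompt": {"text": original_prompt}},
    targetModelId="amazon.nova-lite-v1:0",
)

# The optimized prompt arrives as a stream of events; print them to inspect
# the analysis and the rewritten prompt text.
for event in response.get("optimizedPrompt", []):
    print(event)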

We applied the preceding prompts to the Anthropic Claude 3 Haiku and Amazon Nova Lite models, respectively, using the public xsum dataset. To evaluate the model performance, because the summarization task doesn’t have a predefined ground truth, we designed an LLM judge as shown in the following prompt to validate the summarization quality:

You are an AI assistant, your task is to compare the following LLM-generated summary with the original document, rate how well it captures the key points and conveys the most critical information, on a scale of 1-5.
    
    The score should be based on the following performance criteria:
    – Consistency: characterizes the summary’s factual and logical correctness. It should stay true to the original text, not introduce additional information, and use the same terminology.
    – Relevance: captures whether the summary is limited to the most pertinent information in the original text. A relevant summary focuses on the essential facts and key messages, omitting unnecessary details or trivial information.
    – Fluency: describes the readability of the summary. A fluent summary is well-written and uses proper syntax, vocabulary, and grammar.
    – Coherence: measures the logical flow and connectivity of ideas. A coherent summary presents the information in a structured, logical, and easily understandable manner.
    
    Score 5 means the LLM-generated summary is the best summary fully aligned with the original document,
    Score 1 means the LLM-generated summary is the worst summary completely irrelevant to the original document.  

    Please also provide an explanation on why you provide the score. Keep the explanation as concise as possible.

    The LLM-generated summary is provided within the <summary> XML tag,
    The original document is provided within the <document> XML tag,

    In your response, present the score within the <score> XML tag, and the explanation within the <thinking> XML tag.

    DO NOT nest <score> and <thinking> element.
    DO NOT put any extra attribute in the <score> and <thinking> tag.
    
    <document>
    {document}
    </document>

    LLM generated summary:
    <summary>
    {summary}
    </summary>
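To turn this judge prompt into a numeric score, we can call the judge model through the Amazon Bedrock Converse API and parse the <score> tag from the response. The judge model ID and the regex-based parsing below are illustrative choices, not part of the original pipeline.

import re
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def judge_summary(judge_prompt_template: str, document: str, summary: str,
                  judge_model_id: str = "amazon.nova-pro-v1:0") -> int:
    """Fill the judge prompt, invoke the judge model, and extract the 1-5 score."""
    prompt = judge_prompt_template.format(document=document, summary=summary)
    response = bedrock.converse(
        modelId=judge_model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0},
    )
    text = response["output"]["message"]["content"][0]["text"]
    match = re.search(r"<score>\s*(\d)\s*</score>", text)
    return int(match.group(1)) if match else 0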

The experiment, using 80 data samples, shows that the accuracy is improved on the Amazon Nova Lite model from 77.75% to 83.25% using prompt optimization.
Data-aware optimization
Although Amazon Bedrock prompt optimization supports the basic needs of prompt engineering, other prompt optimization techniques are available to maximize LLM performance, such as Multi-Aspect Critique, Self-Reflection, Gradient Descent and Beam Search, and Meta Prompting. Specifically, we observed that users need to tune their prompts against optimization objective metrics they define, such as ROUGE, BERT-F1, or an LLM judge score, using a dataset they provide. To meet these needs, we designed a data-aware optimization architecture as shown in the following diagram.

The data-aware optimization takes two inputs. The first input is the user-defined optimization objective metrics; for the summarization task discussed in the previous section, you can use the BERT-F1 score or create your own LLM judge. The second input is a training dataset (DevSet) provided by the user to validate the response quality, for example, a summarization data sample with the following format.

Source Document: Officers searched properties in the Waterfront Park and Colonsay View areas of the city on Wednesday. Detectives said three firearms, ammunition and a five-figure sum of money were recovered. A 26-year-old man who was arrested and charged appeared at Edinburgh Sheriff Court on Thursday.
Summarization: A man has appeared in court after firearms, ammunition and cash were seized by police in Edinburgh.

Source Document: <another document …>
Summarization: <another summarization …>

The data-aware optimization uses these two inputs to improve the prompt for better Amazon Nova response quality. In this work, we use the DSPy (Declarative Self-improving Python) optimizer for the data-aware optimization. DSPy is a widely used framework for programming language models. It offers algorithms for optimizing the prompts for multiple LLM tasks, from simple classifiers and summarizers to sophisticated RAG pipelines. The dspy.MIPROv2 optimizer intelligently explores better natural language instructions for every prompt using the DevSet, to maximize the metrics you define.
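The snippets that follow refer to a metric, a program, and a trainset. The sketch below shows one way to wire those pieces together in DSPy against an Amazon Nova model on Amazon Bedrock (addressed through a LiteLLM-style model identifier); the signature, metric, and helper names here are illustrative assumptions rather than the exact code used in our experiments.

import dspy
from dspy.teleprompt import MIPROv2

# Point DSPy at the migrated Amazon Nova model on Amazon Bedrock.
lm = dspy.LM("bedrock/amazon.nova-lite-v1:0", temperature=0.0)
dspy.configure(lm=lm)

class Summarize(dspy.Signature):
    """Summarize the given document briefly and concisely."""
    document = dspy.InputField(desc="The document to summarize")
    answer = dspy.OutputField(desc="A brief, faithful summary of the document")

program = dspy.Predict(Summarize)

# DevSet: pairs of source documents and reference summaries (user-provided lists).
trainset = [
    dspy.Example(document=doc, answer=ref).with_inputs("document")
    for doc, ref in zip(documents, reference_summaries)
]

def metric(example, pred, trace=None):
    # Replace with your objective metric, such as BERT-F1 or an LLM judge score;
    # llm_judge_score is a hypothetical helper standing in for the judge above.
    return llm_judge_score(example.document, pred.answer)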
We applied the MIPROv2 optimizer on top of the results optimized by Amazon Bedrock in the previous section for better Amazon Nova performance. In the optimizer, we specify the number of the instruction candidates in the generation space, use Bayesian optimization to effectively search over the space, and run it iteratively to generate instructions and few-shot examples for the prompt in each step:

# Initialize optimizer
teleprompter = MIPROv2(
    metric=metric,
    num_candidates=5,
    auto="light",
    verbose=False,
)

With the setting of num_candidates=5, the optimizer generates five candidate instructions:

0: Given the fields `question`, produce the fields `answer`.

1: Given a complex question that requires a detailed reasoning process, produce a structured response that includes a step-by-step reasoning and a final answer. Ensure the reasoning clearly outlines each logical step taken to arrive at the answer, maintaining clarity and neutrality throughout.

2: Given the fields `question` and `document`, produce the fields `answer`. Read the document carefully to understand its main points and key information. Identify the core ideas, arguments, and supporting details presented in the document. Synthesize the essential information into a clear and succinct summary. Use your own words to paraphrase the key points without copying verbatim from the original text. Omit any extraneous or redundant information not central to the main ideas. Do not introduce new information or make up content not present in the original document. If you cannot summarize the document due to lack of understanding, simply respond “I don’t know.

3: In a high-stakes scenario where you must summarize critical documents for an international legal case, use the Chain of Thought approach to process the question. Carefully read and understand the document enclosed in <doc></doc> tags, identify the core ideas and key information, and synthesize this into a clear and concise summary. Ensure that the summary is neutral, precise, and omits any extraneous details. If the document is too complex or unclear, respond with “I don’t know.

4: Given the fields `question` and `document`, produce the fields `answer`. The `document` field contains the text to be summarized. The `answer` field should include a concise summary of the document, following the guidelines provided. Ensure the summary is clear, accurate, and captures the core ideas without introducing new information.

We set other parameters for the optimization iteration, including the number of trials, the number of few-shot examples, and the batch size for the optimization process:

# Optimize program
optimized_program = teleprompter.compile(
        program.deepcopy(),
        trainset=trainset,
        num_trials=7,
        minibatch_size=20,
        minibatch_full_eval_steps=7,
        max_bootstrapped_demos=2,
        max_labeled_demos=2,
        requires_permission_to_run=False,
)

When the optimization starts, MIPROv2 uses each instruction candidate, along with the mini-batch of the testing dataset we provided, to invoke the LLM and calculate the metrics we defined. After the loop is complete, the optimizer evaluates the best instruction using the full testing dataset and calculates the full evaluation score. Based on these iterations, the optimizer provides the improved instruction for the prompt:

Given the fields `question` and `document`, produce the fields `answer`.
The `document` field contains the text to be summarized.
The `answer` field should include a concise summary of the document, following the guidelines provided.
Ensure the summary is clear, accurate, and captures the core ideas without introducing new information.

Applying the optimized prompt, the summarization accuracy generated by the LLM judge on Amazon Nova Lite model is further improved from 83.25% to 87.75%.
We also applied the optimization process to other LLM tasks, including a multi-class text classification task and a question-answering task using RAG. In all the tasks, our approach optimized the migrated Amazon Nova model to outperform the Anthropic Claude Haiku and Meta Llama models used before migration. The following table and chart illustrate the optimization results.

Task | DevSet | Evaluation | Before Migration | After Migration (Amazon Bedrock Prompt Optimization) | After Migration (DSPy with Amazon Bedrock Prompt Optimization)
Summarization (Anthropic Claude 3 Haiku to Amazon Nova Lite) | 80 samples | LLM Judge | 77.75 | 83.25 | 87.75
Classification (Meta Llama 3.2 3B to Amazon Nova Micro) | 80 samples | Accuracy | 81.25 | 81.25 | 87.5
QA-RAG (Anthropic Claude 3 Haiku to Amazon Nova Lite) | 50 samples | Semantic Similarity | 52.71 | 51.6 | 57.15

For the text classification use case, we optimized the Amazon Nova Micro model using 80 samples, with accuracy as the metric to evaluate optimization performance at each step. After seven iterations, the optimized prompt provides 87.5% accuracy, up from 81.25% on the Meta Llama 3.2 3B model.
For the question-answering use case, we used 50 samples to optimize the prompt for an Amazon Nova Lite model in the RAG pipeline, and evaluated the performance using a semantic similarity score based on the cosine distance between the model’s answer and the ground truth answer. Compared with the testing data running on Anthropic’s Claude 3 Haiku, the optimizer improved the score from 52.71 to 57.15 after migrating to the Amazon Nova Lite model and applying prompt optimization.
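For reference, a semantic similarity score of this kind can be computed by embedding the model answer and the ground truth answer and measuring their cosine similarity. The sketch below uses sentence-transformers with an arbitrary embedding model; the actual experiments may rely on a different embedding model or scoring library.

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def semantic_similarity(answer: str, ground_truth: str) -> float:
    """Cosine similarity between the embeddings of the answer and the reference."""
    vectors = embedder.encode([answer, ground_truth], convert_to_tensor=True)
    return util.cos_sim(vectors[0], vectors[1]).item()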
You can find more details of these examples in the GitHub repository.
Lessons learned and best practices
Through the solution design, we have identified best practices that can help you properly configure your prompt optimization to maximize the metrics you specify for your use case:

Your dataset for the optimizer should be high quality, relevant, and well balanced, covering the data patterns, edge cases, and nuances of your use case to minimize bias.
The metrics you define as the target of optimization should be use case specific. For example, if your dataset has ground truth, you can use statistical and programmatic machine learning (ML) metrics such as accuracy and semantic similarity. If your dataset doesn’t include ground truth, a well-designed and human-aligned LLM judge can provide a reliable evaluation score for the optimizer.
The optimizer runs with a number of prompt candidates (parameter dspy.num_candidates) and uses the evaluation metric you defined to select the optimal prompt as the output. Avoid setting too few candidates, which might miss opportunities for improvement. In the previous summarization example, we set five prompt candidates for optimizing over 80 training samples and received good optimization performance.
The prompt candidates include a combination of prompt instructions and few-shot examples. You can specify the number of examples (parameter dspy.max_labeled_demos for examples from labeled samples, and parameter dspy.max_bootstrapped_demos for examples from unlabeled samples); we recommend using no fewer than two examples.
The optimization runs in iterations (parameter dspy.num_trials); set enough iterations to refine prompts across different scenarios and performance metrics and gradually enhance clarity, relevance, and adaptability. If you optimize both the instructions and the few-shot examples in the prompt, we recommend setting the number of iterations to no less than 2, preferably between 5–10.

In your use case, if your prompt structure is complex, with chain-of-thought or tree-of-thought reasoning, long instructions in the system prompt, and multiple inputs in the user prompt, you can use a task-specific class to abstract the DSPy optimizer. The class helps encapsulate the optimization logic, standardize the prompt structure and optimization parameters, and allow straightforward implementation of different optimization strategies. The following is an example of the class created for a text classification task:

class Classification(dspy.Signature):
    """You are a product search expert evaluating the quality of specific search results and deciding whether they will lead to a buying decision or not. You will be given a search query and the resulting product information and will classify the result against a provided classification class. Follow the given instructions to classify the search query using the classification scheme.

    Class Categories:

    Category Label: Positive Search
    The class is chosen when the search query and the product are a full match and hence the customer experience is positive.

    Category Label: Negative Search
    The class is chosen when the search query and the product are fully misaligned, meaning you searched for something but the output is completely different.

    Category Label: Moderate Search
    The class is chosen when the search query and the product may not be fully the same, but still complement each other and may be of a similar category.

    Instructions:
    - Begin by creating a scratchpad where you can jot down your initial thoughts, observations, and any pertinent information related to the search query and product. This section is for your personal use and doesn't require a formal structure.
    - Proceed to examine and dissect the search query. Pinpoint essential terms, brand names, model numbers, and specifications. Assess the user's probable objective based on the query.
    - Subsequently, juxtapose the query with the product. Seek out precise correspondences in brand, model, and specifications. Recognize commonalities in functionality, purpose, or features. Reflect on how the product connects to or augments the item being queried.
    - Afterwards, employ a methodical classification approach, contemplating each step carefully.
    - Conclude by verifying the classification. Scrutinize the selected category in relation to its description to confirm its precision. Take into account any exceptional circumstances or possible uncertainties.
    """

    search_query = dspy.InputField(desc="Search Query consisting of keywords")
    result_product_title = dspy.InputField(desc="This is part of Product Description and indicates the Title of the product")
    result_product_description = dspy.InputField(desc="This is part of Product Description and indicates the description of the product")
    thinking = dspy.OutputField(desc="justification in the scratchpad, explaining the reasoning behind the classification choice and highlighting key factors that led to the decision")
    answer = dspy.OutputField(desc="final classification label for the product result: positive_search/negative_search/moderate_search.")

Conclusion
In this post, we introduced a workflow and architecture for migrating your current generative AI workload to Amazon Nova models, and presented a comprehensive prompt optimization approach that combines Amazon Bedrock prompt optimization with a data-aware prompt optimization methodology based on DSPy. The results on three LLM tasks demonstrate the optimized performance of Amazon Nova within its intelligence classes: model performance improved after migration with Amazon Bedrock prompt optimization, and improved further with the data-aware prompt optimization methodology presented in this post.
The Python library and code examples are publicly available on GitHub. You can use this LLM migration method and the prompt optimization solution to migrate your workloads into Amazon Nova, or in other model migration processes.

About the Authors
Yunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.
Anupam Dewan is a Senior Solutions Architect with a passion for generative AI and its applications in real life. He and his team enable Amazon builders who build customer-facing applications using generative AI. He lives in the Seattle area, and outside of work he loves to go hiking and enjoy nature.
Shuai Wang is a Senior Applied Scientist and Manager at Amazon Bedrock, specializing in natural language processing, machine learning, large language modeling, and other related AI areas. Outside work, he enjoys sports, particularly basketball, and family activities.
Kashif Imran is a seasoned engineering and product leader with deep expertise in AI/ML, cloud architecture, and large-scale distributed systems. Currently a Senior Manager at AWS, Kashif leads teams driving innovation in generative AI and Cloud, partnering with strategic cloud customers to transform their businesses. Kashif holds dual master’s degrees in Computer Science and Telecommunications, and specializes in translating complex technical capabilities into measurable business value for enterprises.

Alibaba Qwen Team Just Released Qwen3: The Latest Generation of Large …

Despite the remarkable progress in large language models (LLMs), critical challenges remain. Many models exhibit limitations in nuanced reasoning, multilingual proficiency, and computational efficiency. Often, models are either highly capable in complex tasks but slow and resource-intensive, or fast but prone to superficial outputs. Furthermore, scalability across diverse languages and long-context tasks continues to be a bottleneck, particularly for applications requiring flexible reasoning styles or long-horizon memory. These issues limit the practical deployment of LLMs in dynamic real-world environments.

Qwen3 Just Released: A Targeted Response to Existing Gaps

Qwen3, the latest release in the Qwen family of models developed by Alibaba Group, aims to systematically address these limitations. Qwen3 introduces a new generation of models specifically optimized for hybrid reasoning, multilingual understanding, and efficient scaling across parameter sizes.

The Qwen3 series expands upon the foundation laid by earlier Qwen models, offering a broader portfolio of dense and Mixture of Experts (MoE) architectures. Designed for both research and production use cases, Qwen3 models target applications that require adaptable problem-solving across natural language, coding, mathematics, and broader multimodal domains.

Technical Innovations and Architectural Enhancements

Qwen3 distinguishes itself with several key technical innovations:

Hybrid Reasoning Capability: A core innovation is the model’s ability to dynamically switch between “thinking” and “non-thinking” modes. In “thinking” mode, Qwen3 engages in step-by-step logical reasoning—crucial for tasks like mathematical proofs, complex coding, or scientific analysis. In contrast, “non-thinking” mode provides direct and efficient answers for simpler queries, optimizing latency without sacrificing correctness. (A brief usage sketch of this switch appears at the end of this section.)

Extended Multilingual Coverage: Qwen3 significantly broadens its multilingual capabilities, supporting over 100 languages and dialects, improving accessibility and accuracy across diverse linguistic contexts.

Flexible Model Sizes and Architectures: The Qwen3 lineup includes models ranging from 0.6 billion parameters (dense) to 235 billion parameters (MoE). The flagship model, Qwen3-235B-A22B, activates only 22 billion parameters per inference, enabling high performance while maintaining manageable computational costs.

Long Context Support: Certain Qwen3 models support context windows up to 128,000 tokens, enhancing their ability to process lengthy documents, codebases, and multi-turn conversations without degradation in performance.

Advanced Training Dataset: Qwen3 leverages a refreshed, diversified corpus with improved data quality control, aiming to minimize hallucinations and enhance generalization across domains.

Additionally, the Qwen3 base models are released under an open license (subject to specified use cases), enabling the research and open-source community to experiment and build upon them.
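To illustrate the hybrid reasoning switch mentioned above, the Qwen3 model cards on Hugging Face describe an enable_thinking flag that is passed through the chat template. The following is a minimal sketch under that assumption; confirm the model ID and flag behavior against the official model card before use.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # example checkpoint from the Qwen3 lineup
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Prove that the sum of two even numbers is even."}]

# enable_thinking=True lets the model emit step-by-step reasoning;
# set it to False for fast, direct answers on simple queries.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))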

Empirical Results and Benchmark Insights

Benchmarking results illustrate that Qwen3 models perform competitively against leading contemporaries:

The Qwen3-235B-A22B model achieves strong results across coding (HumanEval, MBPP), mathematical reasoning (GSM8K, MATH), and general knowledge benchmarks, rivaling DeepSeek-R1 and Gemini 2.5 Pro series models.

The Qwen3-72B and Qwen3-72B-Chat models demonstrate solid instruction-following and chat capabilities, showing significant improvements over the earlier Qwen1.5 and Qwen2 series.

Notably, the Qwen3-30B-A3B, a smaller MoE variant with 3 billion active parameters, outperforms Qwen2-32B on multiple standard benchmarks, demonstrating improved efficiency without a trade-off in accuracy.

Early evaluations also indicate that Qwen3 models exhibit lower hallucination rates and more consistent multi-turn dialogue performance compared to previous Qwen generations.

Conclusion

Qwen3 represents a thoughtful evolution in large language model development. By integrating hybrid reasoning, scalable architecture, multilingual robustness, and efficient computation strategies, Qwen3 addresses many of the core challenges that continue to affect LLM deployment today. Its design emphasizes adaptability—making it equally suitable for academic research, enterprise solutions, and future multimodal applications.

Rather than offering incremental improvements, Qwen3 redefines several important dimensions in LLM design, setting a new reference point for balancing performance, efficiency, and flexibility in increasingly complex AI systems.

Check out the Blog, Models on Hugging Face and GitHub Page.


ViSMaP: Unsupervised Summarization of Hour-Long Videos Using Meta-Prom …

Video captioning models are typically trained on datasets consisting of short videos, usually under three minutes in length, paired with corresponding captions. While this enables them to describe basic actions like walking or talking, these models struggle with the complexity of long-form videos, such as vlogs, sports events, and movies that can last over an hour. When applied to such videos, they often generate fragmented descriptions focused on isolated actions rather than capturing the broader storyline. Efforts like MA-LMM and LaViLa have extended video captioning to 10-minute clips using LLMs, but hour-long videos remain a challenge due to a shortage of suitable datasets. Although Ego4D introduced a large dataset of hour-long videos, its first-person perspective limits its broader applicability. Video ReCap addressed this gap by training on hour-long videos with multi-granularity annotations, yet this approach is expensive and prone to annotation inconsistencies. In contrast, annotated short-form video datasets are widely available and more user-friendly.

Advancements in visual-language models have significantly enhanced the integration of vision and language tasks, with early works such as CLIP and ALIGN laying the foundation. Subsequent models, such as LLaVA and MiniGPT-4, extended these capabilities to images, while others adapted them for video understanding by focusing on temporal sequence modeling and constructing more robust datasets. Despite these developments, the scarcity of large, annotated long-form video datasets remains a significant hindrance to progress. Traditional short-form video tasks, like video question answering, captioning, and grounding, primarily require spatial or temporal understanding, whereas summarizing hour-long videos demands identifying key frames amidst substantial redundancy. While some models, such as LongVA and LLaVA-Video, can perform VQA on long videos, they struggle with summarization tasks due to data limitations.

Researchers from Queen Mary University and Spotify introduce ViSMaP, an unsupervised method for summarising hour-long videos without requiring costly annotations. Traditional models perform well on short, pre-segmented videos but struggle with longer content where important events are scattered. ViSMaP bridges this gap by using LLMs and a meta-prompting strategy to iteratively generate and refine pseudo-summaries from clip descriptions created by short-form video models. The process involves three LLMs working in sequence for generation, evaluation, and prompt optimisation. ViSMaP achieves performance comparable to fully supervised models across multiple datasets while maintaining domain adaptability and eliminating the need for extensive manual labelling.

The study addresses cross-domain video summarization by training on a labelled short-form video dataset and adapting to unlabelled, hour-long videos from a different domain. Initially, a model is trained to summarize 3-minute videos using TimeSFormer features, a visual-language alignment module, and a text decoder, optimized by cross-entropy and contrastive losses. To handle longer videos, they are segmented into 3-minute clips, and pseudo-captions are generated. An iterative meta-prompting approach with multiple LLMs (generator, evaluator, optimizer) refines summaries. Finally, the model is fine-tuned on these pseudo-summaries using a symmetric cross-entropy loss to manage noisy labels and improve adaptation.
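For readers unfamiliar with symmetric cross-entropy (SCE), which the authors use to cope with noisy pseudo-labels, a common generic formulation combines standard cross-entropy with a reverse term that down-weights confident predictions on possibly wrong labels. The sketch below is that generic formulation in PyTorch, not the exact loss used in the paper.

import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, targets, alpha=1.0, beta=1.0, eps=1e-4):
    """Generic SCE loss: alpha * CE(pred, label) + beta * reverse CE(label, pred)."""
    num_classes = logits.size(-1)
    ce = F.cross_entropy(logits, targets)

    pred = F.softmax(logits, dim=-1).clamp(min=1e-7)
    one_hot = F.one_hot(targets, num_classes).float().clamp(min=eps)  # avoid log(0)
    rce = (-pred * one_hot.log()).sum(dim=-1).mean()

    return alpha * ce + beta * rce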

The study evaluates VisMaP across three scenarios: summarization of long videos using Ego4D-HCap, cross-domain generalization on MSRVTT, MSVD, and YouCook2 datasets, and adaptation to short videos using EgoSchema. VisMaP, trained on hour-long videos, is compared against supervised and zero-shot methods, such as Video ReCap and LaViLa+GPT3.5, demonstrating competitive or superior performance without supervision. Evaluations use CIDEr, ROUGE-L, METEOR scores, and QA accuracy. Ablation studies highlight the benefits of meta-prompting and component modules, such as contrastive learning and SCE loss. Implementation details include the use of TimeSformer, DistilBERT, and GPT-2, with training conducted on an NVIDIA A100 GPU.

In conclusion, ViSMaP is an unsupervised approach for summarizing long videos by utilizing annotated short-video datasets and a meta-prompting strategy. It first creates high-quality summaries through meta-prompting and then trains a summarization model, reducing the need for extensive annotations. Experimental results demonstrate that ViSMaP performs on par with fully supervised methods and adapts effectively across various video datasets. However, its reliance on pseudo labels from a source-domain model may impact performance under significant domain shifts. Additionally, ViSMaP currently relies solely on visual information. Future work could integrate multimodal data, introduce hierarchical summarization, and develop more generalizable meta-prompting techniques.

Check out the Paper.


A Coding Tutorial of Model Context Protocol Focusing on Semantic Chunk …

Managing context effectively is a critical challenge when working with large language models, especially in environments like Google Colab, where resource constraints and long documents can quickly exceed available token windows. In this tutorial, we guide you through a practical implementation of the Model Context Protocol (MCP) by building a ModelContextManager that automatically chunks incoming text, generates semantic embeddings using Sentence-Transformers, and scores each chunk based on recency, importance, and relevance. You’ll learn how to integrate this manager with a Hugging Face sequence-to-sequence model, demonstrated here with FLAN-T5, to add, optimize, and retrieve only the most pertinent pieces of context. Along the way, we’ll cover token counting with a GPT-2 tokenizer, context-window optimization strategies, and interactive sessions that let you query and visualize your dynamic context in real time.

import torch
import numpy as np
from typing import List, Dict, Any, Optional, Union, Tuple
from dataclasses import dataclass
import time
import gc
from tqdm.notebook import tqdm

We import essential libraries for building a dynamic context manager: torch and numpy handle tensor and numerical operations, while typing and dataclasses provide structured type annotations and data containers. Utility modules such as time and gc support timestamping and memory cleanup, while tqdm.notebook offers interactive progress bars for chunk processing in Colab.

@dataclass
class ContextChunk:
    """A chunk of text with metadata for the Model Context Protocol."""
    text: str
    embedding: Optional[torch.Tensor] = None
    importance: float = 1.0
    timestamp: float = 0.0
    metadata: Dict[str, Any] = None

    def __post_init__(self):
        if self.metadata is None:
            self.metadata = {}
        if self.timestamp == 0.0:
            self.timestamp = time.time()

The ContextChunk dataclass encapsulates a single segment of text along with its embedding, a user-assigned importance score, a timestamp, and arbitrary metadata. Its __post_init__ method ensures that each chunk is stamped with the current time upon creation and that metadata defaults to an empty dictionary if none is provided.

class ModelContextManager:
    """
    Manager for implementing Model Context Protocol in LLMs on Google Colab.
    Handles context window optimization, token management, and relevance scoring.
    """

    def __init__(
        self,
        max_context_length: int = 8192,
        embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2",
        relevance_threshold: float = 0.7,
        recency_weight: float = 0.3,
        importance_weight: float = 0.3,
        semantic_weight: float = 0.4,
        device: str = "cuda" if torch.cuda.is_available() else "cpu"
    ):
        """
        Initialize the Model Context Manager.

        Args:
            max_context_length: Maximum number of tokens in context window
            embedding_model: Model to use for text embeddings
            relevance_threshold: Threshold for chunk relevance to be included
            recency_weight: Weight for recency in relevance calculation
            importance_weight: Weight for importance in relevance calculation
            semantic_weight: Weight for semantic similarity in relevance calculation
            device: Device to run computations on
        """
        self.max_context_length = max_context_length
        self.device = device
        self.chunks = []
        self.current_token_count = 0
        self.relevance_threshold = relevance_threshold

        self.recency_weight = recency_weight
        self.importance_weight = importance_weight
        self.semantic_weight = semantic_weight

        try:
            from sentence_transformers import SentenceTransformer
            print(f"Loading embedding model {embedding_model}...")
            self.embedding_model = SentenceTransformer(embedding_model).to(self.device)
            print(f"Embedding model loaded successfully on {self.device}")
        except ImportError:
            print("Installing sentence-transformers...")
            import subprocess
            subprocess.run(["pip", "install", "sentence-transformers"])
            from sentence_transformers import SentenceTransformer
            self.embedding_model = SentenceTransformer(embedding_model).to(self.device)
            print(f"Embedding model loaded successfully on {self.device}")

        try:
            from transformers import GPT2Tokenizer
            self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        except ImportError:
            print("Installing transformers...")
            import subprocess
            subprocess.run(["pip", "install", "transformers"])
            from transformers import GPT2Tokenizer
            self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    def add_chunk(self, text: str, importance: float = 1.0, metadata: Dict[str, Any] = None) -> None:
        """
        Add a new chunk of text to the context manager.

        Args:
            text: The text content to add
            importance: Importance score (0-1)
            metadata: Additional metadata for the chunk
        """
        with torch.no_grad():
            embedding = self.embedding_model.encode(text, convert_to_tensor=True)

        chunk = ContextChunk(
            text=text,
            embedding=embedding,
            importance=importance,
            timestamp=time.time(),
            metadata=metadata or {}
        )

        self.chunks.append(chunk)
        self.current_token_count += len(self.tokenizer.encode(text))

        if self.current_token_count > self.max_context_length:
            self.optimize_context()

    def optimize_context(self) -> None:
        """Optimize context by removing less relevant chunks to fit within token limit."""
        if not self.chunks:
            return

        print("Optimizing context window...")

        scores = self.score_chunks()

        sorted_indices = np.argsort(scores)[::-1]

        new_chunks = []
        new_token_count = 0

        for idx in sorted_indices:
            chunk = self.chunks[idx]
            chunk_tokens = len(self.tokenizer.encode(chunk.text))

            if new_token_count + chunk_tokens <= self.max_context_length:
                new_chunks.append(chunk)
                new_token_count += chunk_tokens
            else:
                if scores[idx] > self.relevance_threshold * 1.5:
                    for i, included_chunk in enumerate(new_chunks):
                        included_idx = sorted_indices[i]
                        if scores[included_idx] < self.relevance_threshold:
                            included_tokens = len(self.tokenizer.encode(included_chunk.text))
                            if new_token_count - included_tokens + chunk_tokens <= self.max_context_length:
                                new_chunks.remove(included_chunk)
                                new_token_count -= included_tokens
                                new_chunks.append(chunk)
                                new_token_count += chunk_tokens
                                break

        removed_count = len(self.chunks) - len(new_chunks)
        self.chunks = new_chunks
        self.current_token_count = new_token_count

        print(f"Context optimized: Removed {removed_count} chunks, {len(new_chunks)} remaining, using {new_token_count}/{self.max_context_length} tokens")

        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    def score_chunks(self, query: str = None) -> np.ndarray:
        """
        Score chunks based on recency, importance, and semantic relevance.

        Args:
            query: Optional query to calculate semantic relevance against

        Returns:
            Array of scores for each chunk
        """
        if not self.chunks:
            return np.array([])

        current_time = time.time()
        max_age = max(current_time - chunk.timestamp for chunk in self.chunks) or 1.0
        recency_scores = np.array([
            1.0 - ((current_time - chunk.timestamp) / max_age)
            for chunk in self.chunks
        ])

        importance_scores = np.array([chunk.importance for chunk in self.chunks])

        if query is not None:
            query_embedding = self.embedding_model.encode(query, convert_to_tensor=True)
            similarity_scores = np.array([
                torch.cosine_similarity(chunk.embedding, query_embedding, dim=0).item()
                for chunk in self.chunks
            ])

            similarity_scores = (similarity_scores - similarity_scores.min()) / (similarity_scores.max() - similarity_scores.min() + 1e-8)
        else:
            similarity_scores = np.ones(len(self.chunks))

        final_scores = (
            self.recency_weight * recency_scores +
            self.importance_weight * importance_scores +
            self.semantic_weight * similarity_scores
        )

        return final_scores

    def retrieve_context(self, query: str = None, k: int = None) -> str:
        """
        Retrieve the most relevant context for a given query.

        Args:
            query: The query to retrieve context for
            k: The maximum number of chunks to return (None = all relevant chunks)

        Returns:
            String containing the combined relevant context
        """
        if not self.chunks:
            return ""

        scores = self.score_chunks(query)

        relevant_indices = np.where(scores >= self.relevance_threshold)[0]

        relevant_indices = relevant_indices[np.argsort(scores[relevant_indices])[::-1]]

        if k is not None:
            relevant_indices = relevant_indices[:k]

        relevant_texts = [self.chunks[i].text for i in relevant_indices]
        return "\n\n".join(relevant_texts)

    def get_stats(self) -> Dict[str, Any]:
        """Get statistics about the current context state."""
        return {
            "chunk_count": len(self.chunks),
            "token_count": self.current_token_count,
            "max_tokens": self.max_context_length,
            "usage_percentage": self.current_token_count / self.max_context_length * 100 if self.max_context_length else 0,
            "avg_chunk_size": self.current_token_count / len(self.chunks) if self.chunks else 0,
            "oldest_chunk_age": time.time() - min(chunk.timestamp for chunk in self.chunks) if self.chunks else 0,
        }

    def visualize_context(self):
        """Visualize the current context window distribution."""
        try:
            import matplotlib.pyplot as plt
            import pandas as pd

            if not self.chunks:
                print("No chunks to visualize")
                return

            scores = self.score_chunks()
            chunk_sizes = [len(self.tokenizer.encode(chunk.text)) for chunk in self.chunks]
            timestamps = [chunk.timestamp for chunk in self.chunks]
            relative_times = [time.time() - ts for ts in timestamps]
            importance = [chunk.importance for chunk in self.chunks]

            df = pd.DataFrame({
                'Size (tokens)': chunk_sizes,
                'Age (seconds)': relative_times,
                'Importance': importance,
                'Score': scores
            })

            fig, axs = plt.subplots(2, 2, figsize=(14, 10))

            axs[0, 0].bar(range(len(chunk_sizes)), chunk_sizes)
            axs[0, 0].set_title('Token Distribution by Chunk')
            axs[0, 0].set_ylabel('Tokens')
            axs[0, 0].set_xlabel('Chunk Index')

            axs[0, 1].scatter(chunk_sizes, scores)
            axs[0, 1].set_title('Score vs Chunk Size')
            axs[0, 1].set_xlabel('Tokens')
            axs[0, 1].set_ylabel('Score')

            axs[1, 0].scatter(relative_times, scores)
            axs[1, 0].set_title('Score vs Chunk Age')
            axs[1, 0].set_xlabel('Age (seconds)')
            axs[1, 0].set_ylabel('Score')

            axs[1, 1].scatter(importance, scores)
            axs[1, 1].set_title('Score vs Importance')
            axs[1, 1].set_xlabel('Importance')
            axs[1, 1].set_ylabel('Score')

            plt.tight_layout()
            plt.show()

        except ImportError:
            print("Please install matplotlib and pandas for visualization")
            print('!pip install matplotlib pandas')

The ModelContextManager class orchestrates the end-to-end handling of context for LLMs by chunking input text, generating embeddings, and tracking token usage against a configurable limit. It implements relevance scoring (combining recency, importance, and semantic similarity), automatic context pruning, retrieval of the most pertinent chunks, and convenient utilities for monitoring and visualizing context statistics.

class MCPColabDemo:
    """Demonstration of Model Context Protocol in Google Colab with a Language Model."""

    def __init__(
        self,
        model_name: str = "google/flan-t5-base",
        max_context_length: int = 2048,
        device: str = "cuda" if torch.cuda.is_available() else "cpu"
    ):
        """
        Initialize the MCP Colab demo with a specified model.

        Args:
            model_name: Hugging Face model name
            max_context_length: Maximum context length for the MCP manager
            device: Device to run the model on
        """
        self.device = device
        self.context_manager = ModelContextManager(
            max_context_length=max_context_length,
            device=device
        )

        try:
            from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
            print(f"Loading model {model_name}...")
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            print(f"Model loaded successfully on {device}")
        except ImportError:
            print("Installing transformers...")
            import subprocess
            subprocess.run(["pip", "install", "transformers"])
            from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            print(f"Model loaded successfully on {device}")

    def add_document(self, text: str, chunk_size: int = 512, overlap: int = 50) -> None:
        """
        Add a document to the context by chunking it appropriately.

        Args:
            text: Document text
            chunk_size: Size of each chunk in characters
            overlap: Overlap between chunks in characters
        """
        # Split the document into overlapping character windows
        chunks = []
        for i in range(0, len(text), chunk_size - overlap):
            chunk = text[i:i + chunk_size]
            if len(chunk) > 20:
                chunks.append(chunk)

        print(f"Adding {len(chunks)} chunks to context...")
        for i, chunk in enumerate(tqdm(chunks)):
            # Chunks near the start and end of the document receive slightly higher importance
            pos = i / len(chunks)
            importance = 1.0 - 0.5 * min(pos, 1 - pos)

            self.context_manager.add_chunk(
                text=chunk,
                importance=importance,
                metadata={"source": "document", "position": i, "total_chunks": len(chunks)}
            )

    def process_query(self, query: str, max_new_tokens: int = 256) -> str:
        """
        Process a query using the context manager and model.

        Args:
            query: The query to process
            max_new_tokens: Maximum number of tokens in response

        Returns:
            Model response
        """
        self.context_manager.add_chunk(query, importance=1.0, metadata={"type": "query"})

        relevant_context = self.context_manager.retrieve_context(query=query)

        prompt = f"Context: {relevant_context}\n\nQuestion: {query}\n\nAnswer:"

        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)

        print("Generating response...")
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
            )

        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Store the response so it can serve as context for follow-up questions
        self.context_manager.add_chunk(
            response,
            importance=0.9,
            metadata={"type": "response", "query": query}
        )

        return response

    def interactive_session(self):
        """Run an interactive session in the notebook."""
        from IPython.display import clear_output

        print("Starting interactive MCP session. Type 'exit' to end.")
        conversation_history = []

        while True:
            query = input("\nYour query: ")

            if query.lower() == 'exit':
                break

            if query.lower() == 'stats':
                print("\nContext Statistics:")
                stats = self.context_manager.get_stats()
                for key, value in stats.items():
                    print(f"{key}: {value}")
                self.context_manager.visualize_context()
                continue

            if query.lower() == 'clear':
                self.context_manager.chunks = []
                self.context_manager.current_token_count = 0
                conversation_history = []
                clear_output(wait=True)
                print("Context cleared!")
                continue

            response = self.process_query(query)
            conversation_history.append((query, response))

            print("\nResponse:")
            print(response)
            print("\n" + "-" * 50)

            stats = self.context_manager.get_stats()
            print(f"Context usage: {stats['token_count']}/{stats['max_tokens']} tokens ({stats['usage_percentage']:.1f}%)")

The MCPColabDemo class ties the context manager to a seq2seq LLM, loading FLAN-T5 (or any specified Hugging Face model) on the chosen device, and provides utility methods for chunking and ingesting entire documents, processing user queries by prepending only the most relevant context, and running an interactive Colab session complete with real-time stats, visualizations, and commands for clearing or inspecting the evolving context window.
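For reference, here is a minimal usage sketch of the MCPColabDemo class defined above; the sample document text and the query are placeholders you would replace with your own content.

# Minimal usage sketch (sample text and query are placeholders)
demo = MCPColabDemo(model_name="google/flan-t5-base", max_context_length=2048)

# Ingest a document; it is split into overlapping chunks and scored by the context manager
sample_document = ("The Model Context Protocol manages context windows for language models. "
                   "It scores chunks by recency, importance, and semantic relevance. ") * 20
demo.add_document(sample_document, chunk_size=512, overlap=50)

# Ask a question; only the most relevant chunks are prepended to the prompt
print(demo.process_query("What does the Model Context Protocol manage?"))

# Optionally start the notebook REPL, which supports 'stats', 'clear', and 'exit' commands
# demo.interactive_session()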

def run_mcp_demo():
    """Run a simple demo of the Model Context Protocol."""
    print("Running Model Context Protocol Demo...")

    context_manager = ModelContextManager(max_context_length=4096)

    print("Adding sample chunks...")

    context_manager.add_chunk(
        "The Model Context Protocol (MCP) is a framework for managing context "
        "windows in large language models. It helps optimize token usage and improve relevance.",
        importance=1.0
    )

    context_manager.add_chunk(
        "Context management involves techniques like sliding windows, chunking, "
        "and relevance filtering to handle large documents efficiently.",
        importance=0.8
    )

    # Add filler chunks with gradually decreasing importance to exercise context optimization
    for i in range(10):
        context_manager.add_chunk(
            f"This is test chunk {i} with some filler content to simulate a larger context "
            f"window that needs optimization. This helps demonstrate the MCP functionality "
            f"for context window management in language models on Google Colab.",
            importance=0.5 - (i * 0.02)
        )

    stats = context_manager.get_stats()
    print("\nInitial Statistics:")
    for key, value in stats.items():
        print(f"{key}: {value}")

    query = "How does the Model Context Protocol work?"
    print(f"\nRetrieving context for: '{query}'")
    context = context_manager.retrieve_context(query)
    print(f"\nRelevant context:\n{context}")

    print("\nVisualizing context:")
    context_manager.visualize_context()

    print("\nDemo complete!")

The run_mcp_demo function ties everything together in a single script: it instantiates the ModelContextManager, adds a series of sample chunks with varying importance, prints out initial statistics, retrieves and displays the most relevant context for a test query, and finally visualizes the context window, providing a complete, end-to-end demonstration of the Model Context Protocol in action.

if __name__ == "__main__":
    run_mcp_demo()

Finally, this standard Python entry-point guard ensures that the run_mcp_demo() function executes only when the script is run directly (rather than imported as a module), triggering the end-to-end demonstration of the Model Context Protocol workflow.

In conclusion, we now have a fully functional MCP system that not only curbs runaway token usage but also prioritizes context fragments that truly matter for your queries. The ModelContextManager equips you with tools to balance semantic relevance, temporal freshness, and user-assigned importance. At the same time, the accompanying MCPColabDemo class provides an accessible framework for real-time experimentation and visualization. Armed with these patterns, you can extend the core principles by adjusting relevance thresholds, experimenting with various embedding models, or integrating with alternative LLM backends to tailor your domain-specific workflows. Ultimately, this approach enables you to create concise yet highly relevant prompts, resulting in more accurate and efficient responses from your language models.

Here is the Colab Notebook.


Customize Amazon Nova models to improve tool usage

Modern large language models (LLMs) excel in language processing but are limited by their static training data. However, as industries require more adaptive, decision-making AI, integrating tools and external APIs has become essential. This has led to the evolution and rapid rise of agentic workflows, where AI systems autonomously plan, execute, and refine tasks. Accurate tool use is foundational for enhancing the decision-making and operational efficiency of these autonomous agents and building successful and complex agentic workflows.
In this post, we dissect the technical mechanisms of tool calling using Amazon Nova models through Amazon Bedrock, alongside methods for model customization to refine tool calling precision.
Expanding LLM capabilities with tool use
LLMs excel at natural language tasks but become significantly more powerful with tool integration, such as APIs and computational frameworks. Tools enable LLMs to access real-time data, perform domain-specific computations, and retrieve precise information, enhancing their reliability and versatility. For example, integrating a weather API allows for accurate, real-time forecasts, or a Wikipedia API provides up-to-date information for complex queries. In scientific contexts, tools like calculators or symbolic engines address numerical inaccuracies in LLMs. These integrations transform LLMs into robust, domain-aware systems capable of handling dynamic, specialized tasks with real-world utility.
Amazon Nova models and Amazon Bedrock
Amazon Nova models, unveiled at AWS re:Invent in December 2024, are optimized to deliver exceptional price-performance value, offering state-of-the-art performance on key text-understanding benchmarks at low cost. The series comprises three variants: Micro (text-only, ultra-efficient for edge use), Lite (multimodal, balanced for versatility), and Pro (multimodal, high-performance for complex tasks).
Amazon Nova models can be used for a variety of tasks, from generation to developing agentic workflows. As such, these models have the capability to interface with external tools or services and use them through tool calling. This can be achieved through the Amazon Bedrock console (see Getting started with Amazon Nova in the Amazon Bedrock console) and APIs such as Converse and Invoke.
In addition to using the pre-trained models, developers have the option to fine-tune these models with multimodal data (Pro and Lite) or text data (Pro, Lite, and Micro), providing the flexibility to achieve desired accuracy, latency, and cost. Developers can also run self-service custom fine-tuning and distillation of larger models to smaller ones using the Amazon Bedrock console and APIs.
Solution overview
The following diagram illustrates the solution architecture.

For this post, we first prepared a custom dataset for tool usage. We used the test set to evaluate Amazon Nova models through Amazon Bedrock using the Converse and Invoke APIs. We then fine-tuned Amazon Nova Micro and Amazon Nova Lite models through Amazon Bedrock with our fine-tuning dataset. After the fine-tuning process was complete, we evaluated these customized models through provisioned throughput. In the following sections, we go through these steps in more detail.
Tools
Tool usage in LLMs involves two critical operations: tool selection and argument extraction or generation. For instance, consider a tool designed to retrieve weather information for a specific location. When presented with a query such as “What’s the weather in Alexandria, VA?”, the LLM evaluates its repertoire of tools to determine whether an appropriate tool is available. Upon identifying a suitable tool, the model selects it and extracts the required arguments—here, “Alexandria” and “VA” as structured data types (for example, strings)—to construct the tool call.
Each tool is rigorously defined with a formal specification that outlines its intended functionality, the mandatory or optional arguments, and the associated data types. Such precise definitions, known as tool config, make sure that tool calls are executed correctly and that argument parsing aligns with the tool’s operational requirements. Following this requirement, the dataset used for this example defines eight tools with their arguments and configures them in a structured JSON format. We define the following eight tools (we use seven of them for fine-tuning and hold out the weather_api_call tool during testing in order to evaluate the accuracy on unseen tool use):

weather_api_call – Custom tool for getting weather information
stat_pull – Custom tool for identifying stats
text_to_sql – Custom text-to-SQL tool
terminal – Tool for executing scripts in a terminal
wikipedia – Wikipedia API tool to search through Wikipedia pages
duckduckgo_results_json – Internet search tool that executes a DuckDuckGo search
youtube_search – YouTube API search tool that searches video listings
pubmed_search – PubMed search tool that searches PubMed abstracts

The following code is an example of what a tool configuration for terminal might look like:

{'toolSpec': {'name': 'terminal',
  'description': 'Run shell commands on this MacOS machine ',
  'inputSchema': {'json': {'type': 'object',
    'properties': {'commands': {'type': 'string',
      'description': 'List of shell commands to run. Deserialized using json.loads'}},
    'required': ['commands']}}}},

Dataset
The dataset is a synthetic tool calling dataset created with assistance from a foundation model (FM) from Amazon Bedrock and manually validated and adjusted. This dataset was created for our set of eight tools as discussed in the previous section, with the goal of creating a diverse set of questions and tool invocations that allow another model to learn from these examples and generalize to unseen tool invocations.
Each entry in the dataset is structured as a JSON object with key-value pairs that define the question (a natural language user query for the model), the ground truth tool required to answer the user query, its arguments (dictionary containing the parameters required to execute the tool), and additional constraints like order_matters: boolean, indicating if argument order is critical, and arg_pattern: optional, a regular expression (regex) for argument validation or formatting. Later in this post, we use these ground truth labels to supervise the training of pre-trained Amazon Nova models, adapting them for tool use. This process, known as supervised fine-tuning, will be explored in detail in the following sections.
The size of the training set is 560 questions and the test set is 120 questions. The test set consists of 15 questions per tool category, totaling 120 questions. The following are some examples from the dataset:

{
  "question": "Explain the process of photosynthesis",
  "answer": "wikipedia",
  "args": {'query': 'process of photosynthesis'},
  "order_matters": False,
  "arg_pattern": None
}
{
  "question": "Display system date and time",
  "answer": "terminal",
  "args": {'commands': ['date']},
  "order_matters": True,
  "arg_pattern": None
}
{
  "question": "Upgrade the requests library using pip",
  "answer": "terminal",
  "args": {'commands': ['pip install --upgrade requests']},
  "order_matters": True,
  "arg_pattern": [r'pip(3?) install --upgrade requests']
}

Prepare the dataset for Amazon Nova
To use this dataset with Amazon Nova models, we need to additionally format the data based on a particular chat template. Native tool calling has a translation layer that formats the inputs to the appropriate format before passing the model. Here, we employ a DIY tool use approach with a custom prompt template. Specifically, we need to add the system prompt, the user message embedded with the tool config, and the ground truth labels as the assistant message. The following is a training example formatted for Amazon Nova. Due to space constraints, we only show the toolspec for one tool.

{"system": [{"text": "You are a bot that can handle different requests
with tools."}],
"messages": [{"role": "user",
"content": [{"text": "Given the following functions within <tools>,
please respond with a JSON for a function call with its proper arguments
that best answers the given prompt.

Respond in the format
{"name": function name, "parameters": dictionary of argument name and
its value}.
Do not use variables. Do not give any explanations.

ONLY output the resulting
JSON structure and nothing else.

Do not use the word 'json' anywhere in the
result.
<tools>
    {"tools": [{"toolSpec": {"name": "youtube_search",
    "description": " search for youtube videos associated with a person.
    the input to this tool should be a comma separated list, the first part
    contains a person name and the second a number that is the maximum number
    of video results to return aka num_results. the second part is optional",
    "inputSchema":
    {"json": {"type": "object", "properties": {"query":
    {"type": "string",
     "description": "youtube search query to look up"}},
    "required": ["query"]}}}},]}
</tools>
Generate answer for the following question.
<question>
List any products that have received consistently negative reviews
</question>"}]},
{"role": "assistant", "content": [{"text": "{'name': text_to_sql, 'parameters':
{'table': 'product_reviews', 'condition':
'GROUP BY product_id HAVING AVG(rating) < 2'}}"}]}],
"schemaVersion": "tooluse-dataset-2024"}

Upload dataset to Amazon S3
This step is needed later for the fine-tuning for Amazon Bedrock to access the training data. You can upload your dataset either through the Amazon Simple Storage Service (Amazon S3) console or through code.
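As one option, the upload can be done with a few lines of boto3; the file, bucket, and key names below are placeholders, not values from this post.

import boto3

# Placeholder bucket and key names -- replace with your own S3 locations
s3 = boto3.client("s3")
s3.upload_file("train.jsonl", "my-finetuning-bucket", "tool-use/train/train.jsonl")
s3.upload_file("validation.jsonl", "my-finetuning-bucket", "tool-use/validation/validation.jsonl")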
Tool calling with base models through the Amazon Bedrock API
Now that we have created the tool use dataset and formatted it as required, let’s use it to test out the Amazon Nova models. As mentioned previously, we can use both the Converse and Invoke APIs for tool use in Amazon Bedrock. The Converse API enables dynamic, context-aware conversations, allowing models to engage in multi-turn dialogues, and the Invoke API allows the user to call and interact with the underlying models within Amazon Bedrock.
To use the Converse API, you simply send the messages, system prompt (if any), and the tool config directly in the Converse API. See the following example code:

response = bedrock_runtime.converse(
    modelId=model_id,
    messages=messages,
    system=system_prompt,
    toolConfig=tool_config,
)

To parse the tool and arguments from the LLM response, you can use the following example code:

for content_block in response['output']['message']['content']:
    if "toolUse" in content_block:
        out_tool_name = content_block['toolUse']['name']
        out_tool_inputs_dict = content_block['toolUse']['input']
        print(out_tool_name, out_tool_inputs_dict.keys())

For the question: “Hey, what’s the temperature in Paris right now?”, you get the following output:

weather_api_call dict_keys(['country', 'city'])

To execute tool use through the Invoke API, first you need to prepare the request body with the user question as well as the tool config that was prepared before. The following code snippet shows how to convert the tool config JSON to string format, which can be used in the message body:

# Convert tools configuration to JSON string
formatted_tool_config = json.dumps(tool_config, indent=2)
prompt = prompt_template.replace("{question}", question)
prompt = prompt.replace("{tool_config}", formatted_tool_config)
# Message template
messages = [{"role": "user", "content": [{"text": prompt}]}]
# Prepare request body
model_kwargs = {"system": system_prompt, "messages": messages, "inferenceConfig": inferenceConfig}
body = json.dumps(model_kwargs)
response = bedrock_runtime.invoke_model(
    body=body,
    modelId=model_id,
    accept=accept,
    contentType=contentType
)

Using either of the two APIs, you can test and benchmark the base Amazon Nova models with the tool use dataset. In the next sections, we show how you can customize these base models specifically for the tool use domain.
Supervised fine-tuning using the Amazon Bedrock console
Amazon Bedrock offers three different customization techniques: supervised fine-tuning, model distillation, and continued pre-training. At the time of writing, the first two methods are available for customizing Amazon Nova models. Supervised fine-tuning is a popular method in transfer learning, where a pre-trained model is adapted to a specific task or domain by training it further on a smaller, task-specific dataset. The process uses the representations learned during pre-training on large datasets to improve performance in the new domain. During fine-tuning, the model’s parameters (either all or selected layers) are updated using backpropagation to minimize the loss.
In this post, we use the labeled datasets that we created and formatted previously to run supervised fine-tuning to adapt Amazon Nova models for the tool use domain.
Create a fine-tuning job
Complete the following steps to create a fine-tuning job:

Open the Amazon Bedrock console.
Choose us-east-1 as the AWS Region.
Under Foundation models in the navigation pane, choose Custom models.
Choose Create Fine-tuning job under Customization methods. 

At the time of writing, Amazon Nova model fine-tuning is exclusively available in the us-east-1 Region.

Choose Select model and choose Amazon as the model provider.
Choose your model (for this post, Amazon Nova Micro) and choose Apply.

For Fine-tuned model name, enter a unique name.
For Job name, enter a name for the fine-tuning job.
In the Input data section, enter following details:

For S3 location, enter the source S3 bucket containing the training data.
For Validation dataset location, optionally enter the S3 bucket containing a validation dataset.

In the Hyperparameters section, you can customize the following hyperparameters:

For Epochs, enter a value between 1–5.
For Batch size, the value is fixed at 1.
For Learning rate multiplier, enter a value between 0.000001–0.0001.
For Learning rate warmup steps, enter a value between 0–100.

We recommend starting with the default parameter values and then changing the settings iteratively. It’s a good practice to change only one or a couple of parameters at a time, in order to isolate the parameter effects. Remember, hyperparameter tuning is model and use case specific.

In the Output data section, enter the target S3 bucket for model outputs and training metrics.
Choose Create fine-tuning job.

Run the fine-tuning job
After you start the fine-tuning job, you will be able to see your job under Jobs and the status as Training. When it finishes, the status changes to Complete.

You can now go to the training job and optionally access the training-related artifacts that are saved in the output folder.

You can find both training and validation (we highly recommend using a validation set) artifacts here.

You can use the training and validation artifacts to assess your fine-tuning job through loss curves (as shown in the following figure), which track training loss (orange) and validation loss (blue) over time. A steady decline in both indicates effective learning and good generalization. A small gap between them suggests minimal overfitting, whereas a rising validation loss with decreasing training loss signals overfitting. If both losses remain high, it indicates underfitting. Monitoring these curves helps you quickly diagnose model performance and adjust training strategies for optimal results.

Host the fine-tuned model and run inference
Now that you have completed the fine-tuning, you can host the model and use it for inference. Follow these steps:

On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Custom models.
On the Models tab, choose the model you fine-tuned.

Choose Purchase provisioned throughput.

Specify a commitment term (no commitment, 1 month, 6 months) and review the associated cost for hosting the fine-tuned models.

After the customized model is hosted through provisioned throughput, a model ID will be assigned, which will be used for inference. For inference with models hosted with provisioned throughput, we have to use the Invoke API in the same way we described previously in this post—simply replace the model ID with the customized model ID.
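For illustration, the following is a hedged sketch of that swap using the same Invoke API call shown earlier (the body, accept, and contentType variables are the ones built previously in this post); the provisioned model ARN is a placeholder, not a real identifier.

# Sketch: invoke the fine-tuned model through its provisioned throughput ARN (placeholder value)
provisioned_model_id = "arn:aws:bedrock:us-east-1:123456789012:provisioned-model/your-model-id"
response = bedrock_runtime.invoke_model(
    body=body,                      # same request body as before
    modelId=provisioned_model_id,   # only the model ID changes
    accept=accept,
    contentType=contentType
)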
The aforementioned fine-tuning and inference steps can also be done programmatically. Refer to the following GitHub repo for more detail.
Evaluation framework
Evaluating fine-tuned tool calling LLMs requires a comprehensive approach to assess their performance across various dimensions. The primary metric to evaluate tool calling is accuracy, including both tool selection and argument generation accuracy. This measures how effectively the model selects the correct tool and generates valid arguments. Latency and token usage (input and output tokens) are two other important metrics.
Tool call accuracy evaluates if the tool predicted by the LLM matches the ground truth tool for each question; a score of 1 is given if they match and 0 when they don’t. After processing the questions, we can use the following equation: Tool Call Accuracy=∑(Correct Tool Calls)/(Total number of test questions).
Argument call accuracy assesses whether the arguments provided to the tools are correct, based on either exact matches or regex pattern matching. For each tool call, the model’s predicted arguments are extracted. It uses the following argument matching methods:

Regex matching – If the ground truth includes regex patterns, the predicted arguments are matched against these patterns. A successful match increases the score.
Inclusive string matching – If no regex pattern is provided, the predicted argument is compared to the ground truth argument. Credit is given if the predicted argument contains the ground truth argument. This is to allow for arguments, like search terms, to not be penalized for adding additional specificity.

The score for each argument is normalized based on the number of arguments, allowing partial credit when multiple arguments are required. The cumulative correct argument scores are averaged across all questions: Argument Call Accuracy = ∑Correct Arguments/(Total Number of Questions).
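The following sketch shows one way these two metrics could be computed; the record fields mirror the dataset format shown earlier, and the helper function is hypothetical, not part of the post's evaluation code.

import re

def score_example(pred_tool, pred_args, record):
    """Hypothetical helper: score one test question for tool and argument accuracy."""
    # Tool call accuracy: exact match between the predicted tool and the ground truth tool
    tool_score = 1.0 if pred_tool == record["answer"] else 0.0

    # Argument call accuracy: regex match when a pattern exists, inclusive string match otherwise
    gt_args = list(record["args"].values())
    patterns = record.get("arg_pattern") or [None] * len(gt_args)
    correct = 0
    for pred, gt, pattern in zip(pred_args, gt_args, patterns):
        if pattern is not None:
            correct += bool(re.search(pattern, str(pred)))
        else:
            correct += str(gt).lower() in str(pred).lower()
    arg_score = correct / max(len(gt_args), 1)  # partial credit across multiple arguments
    return tool_score, arg_score

# Averaging the per-question scores over the test set gives the two reported accuracies:
# tool_call_accuracy = sum(t for t, _ in all_scores) / len(all_scores)
# argument_call_accuracy = sum(a for _, a in all_scores) / len(all_scores)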
Below we show some example questions and accuracy scores:
Example 1:

User question: Execute this run.py script with an argparse arg adding two gpus
GT tool: terminal   LLM output tool: terminal
Pred args: ['python run.py --gpus 2']
Ground truth pattern: python(3?) run.py --gpus 2
Arg matching method: regex match
Arg matching score: 1.0

Example 2:

User question: Who had the most rushing touchdowns for the bengals in 2017 season?
GT tool: stat_pull   LLM output tool: stat_pull
Pred args: ['NFL']
Arg matching method: straight match
Arg score: 0.3333333333333333
Pred args: ['2017']
Arg matching method: straight match
Arg score: 0.6666666666666666
Pred args: ['Cincinnati Bengals']
Arg matching method: straight match
Arg score: 1.0

Results
We are now ready to visualize the results and compare the performance of base Amazon Nova models to their fine-tuned counterparts.
Base models
The following figures illustrate the performance comparison of the base Amazon Nova models.

The comparison reveals a clear trade-off between accuracy and latency, shaped by model size. Amazon Nova Pro, the largest model, delivers the highest accuracy in both tool call and argument call tasks, reflecting its advanced computational capabilities. However, this comes with increased latency.
In contrast, Amazon Nova Micro, the smallest model, achieves the lowest latency, which is ideal for fast, resource-constrained environments, though it sacrifices some accuracy compared to its larger counterparts.
Fine-tuned models vs. base models
The following figure visualizes accuracy improvement after fine-tuning.

The comparative analysis of the Amazon Nova model variants reveals substantial performance improvements through fine-tuning, with the most significant gains observed in the smaller Amazon Nova Micro model. The fine-tuned Amazon Nova Micro model showed remarkable growth in tool call accuracy, increasing from 75.8% to 95%, a 25.38% relative improvement. Similarly, its argument call accuracy rose from 77.8% to 87.7%, reflecting a 12.74% relative increase.
In contrast, the fine-tuned Amazon Nova Lite model exhibited more modest gains, with tool call accuracy improving from 90.8% to 96.66% (a 6.46% relative increase) and argument call accuracy rising from 85% to 89.9%, marking a 5.76% relative improvement. Both fine-tuned models surpassed the accuracy achieved by the Amazon Nova Pro base model.
These results highlight that fine-tuning can significantly enhance the performance of lightweight models, making them strong contenders for applications where both accuracy and latency are critical.
Conclusion
In this post, we demonstrated model customization (fine-tuning) for tool use with Amazon Nova. We first introduced a tool usage use case, and gave details about the dataset. We walked through the details of Amazon Nova specific data formatting and showed how to do tool calling through the Converse and Invoke APIs in Amazon Bedrock. After getting the baseline results from Amazon Nova models, we explained in detail the fine-tuning process, hosting fine-tuned models with provisioned throughput, and using the fine-tuned Amazon Nova models for inference. In addition, we touched upon getting insights from training and validation artifacts from a fine-tuning job in Amazon Bedrock.
Check out the detailed notebook for tool usage to learn more. For more information on Amazon Bedrock and the latest Amazon Nova models, refer to the Amazon Bedrock User Guide and Amazon Nova User Guide. The Generative AI Innovation Center has a group of AWS science and strategy experts with comprehensive expertise spanning the generative AI journey, helping customers prioritize use cases, build roadmaps, and move solutions into production. See Generative AI Innovation Center for our latest work and customer success stories.

About the Authors
Baishali Chaudhury is an Applied Scientist at the Generative AI Innovation Center at AWS, where she focuses on advancing Generative AI solutions for real-world applications. She has a strong background in computer vision, machine learning, and AI for healthcare. Baishali holds a PhD in Computer Science from University of South Florida and PostDoc from Moffitt Cancer Centre.
Isaac Privitera is a Principal Data Scientist with the AWS Generative AI Innovation Center, where he develops bespoke generative AI-based solutions to address customers’ business problems. His primary focus lies in building responsible AI systems, using techniques such as RAG, multi-agent systems, and model fine-tuning. When not immersed in the world of AI, Isaac can be found on the golf course, enjoying a football game, or hiking trails with his loyal canine companion, Barry.
Mengdie (Flora) Wang is a Data Scientist at AWS Generative AI Innovation Center, where she works with customers to architect and implement scalableGenerative AI solutions that address their unique business challenges. She specializes in model customization techniques and agent-based AI systems, helping organizations harness the full potential of generative AI technology. Prior to AWS, Flora earned her Master’s degree in Computer Science from the University of Minnesota, where she developed her expertise in machine learning and artificial intelligence.

Evaluate Amazon Bedrock Agents with Ragas and LLM-as-a-judge

AI agents are quickly becoming an integral part of customer workflows across industries by automating complex tasks, enhancing decision-making, and streamlining operations. However, the adoption of AI agents in production systems requires scalable evaluation pipelines. Robust agent evaluation enables you to gauge how well an agent is performing certain actions and gain key insights into them, enhancing AI agent safety, control, trust, transparency, and performance optimization.
Amazon Bedrock Agents uses the reasoning of foundation models (FMs) available on Amazon Bedrock, APIs, and data to break down user requests, gather relevant information, and efficiently complete tasks—freeing teams to focus on high-value work. You can enable generative AI applications to automate multistep tasks by seamlessly connecting with company systems, APIs, and data sources.
Ragas is an open source library for testing and evaluating large language model (LLM) applications across various LLM use cases, including Retrieval Augmented Generation (RAG). The framework enables quantitative measurement of the effectiveness of the RAG implementation. In this post, we use the Ragas library to evaluate the RAG capability of Amazon Bedrock Agents.
LLM-as-a-judge is an evaluation approach that uses LLMs to assess the quality of AI-generated outputs. This method employs an LLM to act as an impartial evaluator, to analyze and score outputs. In this post, we employ the LLM-as-a-judge technique to evaluate the text-to-SQL and chain-of-thought capabilities of Amazon Bedrock Agents.
Langfuse is an open source LLM engineering platform, which provides features such as traces, evals, prompt management, and metrics to debug and improve your LLM application.
In the post Accelerate analysis and discovery of cancer biomarkers with Amazon Bedrock Agents, we showcased research agents for cancer biomarker discovery for pharmaceutical companies. In this post, we extend the prior work and showcase Open Source Bedrock Agent Evaluation with the following capabilities:

Evaluating Amazon Bedrock Agents on its capabilities (RAG, text-to-SQL, custom tool use) and overall chain-of-thought
Comprehensive evaluation results and trace data sent to Langfuse with built-in visual dashboards
Trace parsing and evaluations for various Amazon Bedrock Agents configuration options

First, we conduct evaluations on a variety of different Amazon Bedrock Agents. These include a sample RAG agent, a sample text-to-SQL agent, and pharmaceutical research agents that use multi-agent collaboration for cancer biomarker discovery. Then, for each agent, we showcase navigating the Langfuse dashboard to view traces and evaluation results.
Technical challenges
Today, AI agent developers generally face the following technical challenges:

End-to-end agent evaluation – Although Amazon Bedrock provides built-in evaluation capabilities for LLM models and RAG retrieval, it lacks metrics specifically designed for Amazon Bedrock Agents. There is a need for evaluating the holistic agent goal, as well as individual agent trace steps for specific tasks and tool invocations. Support is also needed for both single and multi-agents, and both single and multi-turn datasets.
Challenging experiment management – Amazon Bedrock Agents offers numerous configuration options, including LLM model selection, agent instructions, tool configurations, and multi-agent setups. However, conducting rapid experimentation with these parameters is technically challenging due to the lack of systematic ways to track, compare, and measure the impact of configuration changes across different agent versions. This makes it difficult to effectively optimize agent performance through iterative testing.

Solution overview
The following figure illustrates how Open Source Bedrock Agent Evaluation works on a high level. The framework runs an evaluation job that will invoke your own agent in Amazon Bedrock and evaluate its response.

The workflow consists of the following steps:

The user specifies the agent ID, alias, evaluation model, and dataset containing question and ground truth pairs.
The user executes the evaluation job, which will invoke the specified Amazon Bedrock agent.
The retrieved agent invocation traces are run through a custom parsing logic in the framework.
The framework conducts an evaluation based on the agent invocation results and the question type:

Chain-of-thought – LLM-as-a-judge with Amazon Bedrock LLM calls (conducted for every evaluation run for different types of questions)
RAG – Ragas evaluation library
Text-to-SQL – LLM-as-a-judge with Amazon Bedrock LLM calls

Evaluation results and parsed traces are gathered and sent to Langfuse for evaluation insights.

Prerequisites
To deploy the sample RAG and text-to-SQL agents and follow along with evaluating them using Open Source Bedrock Agent Evaluation, follow the instructions in Deploying Sample Agents for Evaluation.
To bring your own agent to evaluate with this framework, refer to the following README and follow the detailed instructions to deploy the Open Source Bedrock Agent Evaluation framework.
Overview of evaluation metrics and input data
First, we create sample Amazon Bedrock agents to demonstrate the capabilities of Open Source Bedrock Agent Evaluation. The text-to-SQL agent uses the BirdSQL Mini-Dev dataset, and the RAG agent uses the Hugging Face rag-mini-wikipedia dataset.
Evaluation metrics
The Open Source Bedrock Agent Evaluation framework conducts evaluations on two broad types of metrics:

Agent goal – Chain-of-thought (run on every question)
Task accuracy – RAG, text-to-SQL (run only when the specific tool is used to answer the question)

Agent goal metrics measure how well an agent identifies and achieves the goals of the user. There are two main types: reference-based evaluation and no reference evaluation. Examples can be found in Agent Goal accuracy as defined by Ragas:

Reference-based evaluation – The user provides a reference that will be used as the ideal outcome. The metric is computed by comparing the reference with the goal achieved by the end of the workflow.
Evaluation without reference – The metric evaluates the performance of the LLM in identifying and achieving the goals of the user without reference.

We will showcase evaluation without reference using chain-of-thought evaluation. We conduct evaluations by comparing the agent’s reasoning against the agent’s instructions. For this evaluation, we use some metrics from the evaluator prompts for Amazon Bedrock LLM-as-a-judge. In this framework, the chain-of-thought evaluations are run on every question that the agent is evaluated against.
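As a rough illustration of the LLM-as-a-judge pattern (not the framework's actual evaluator prompts), a judge model on Amazon Bedrock can be asked to grade the agent's reasoning against its instructions; the prompt wording, judge model ID, and input variables below are placeholders.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

agent_instructions = "..."   # the Bedrock agent's instruction text (placeholder)
question = "..."             # user question (placeholder)
agent_trace = "..."          # agent reasoning pulled from the invocation trace (placeholder)

# Hypothetical judge prompt; the real framework uses the Amazon Bedrock LLM-as-a-judge evaluator prompts
judge_prompt = (
    "You are an impartial evaluator. Given the agent instructions, the user question, "
    "and the agent's reasoning trace, rate helpfulness, faithfulness, and instruction "
    "following on a 0-1 scale and return JSON like "
    '{"helpfulness": 0.8, "faithfulness": 0.9, "instruction_following": 0.7}.\n\n'
    f"Instructions: {agent_instructions}\nQuestion: {question}\nReasoning: {agent_trace}"
)

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",   # any Bedrock model acting as the judge
    messages=[{"role": "user", "content": [{"text": judge_prompt}]}],
)
# Illustrative parsing; a production judge would validate the JSON before trusting it
scores = json.loads(response["output"]["message"]["content"][0]["text"])
print(scores)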
Task accuracy metrics measure how well an agent calls the required tools to complete a given task. For the two task accuracy metrics, RAG and text-to-SQL, evaluations are conducted based on comparing the actual agent answer against the ground truth dataset that must be provided in the input dataset. The task accuracy metrics are only evaluated when the corresponding tool is used to answer the question.
The following is a breakdown of the key metrics used in each evaluation type included in the framework:

RAG:

Faithfulness – How factually consistent a response is with the retrieved context
Answer relevancy – How directly and appropriately the original question is addressed
Context recall – How many of the relevant pieces of information were successfully retrieved
Semantic similarity – The assessment of the semantic resemblance between the generated answer and the ground truth

Text-to-SQL:

SQL query semantic equivalence – The equivalence of the response query with the reference query
Answer correctness – How well the generated answer correctly represents the query results and matches ground truth

Chain-of-thought:

Helpfulness – How well the agent satisfies explicit and implicit expectations
Faithfulness – How well the agent sticks to available information and context
Instruction following – How well the agent respects all explicit directions
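To give a sense of how the RAG metrics listed above can be computed, here is a minimal, hedged sketch using the open source Ragas library (assuming a Ragas 0.1-style API); the sample row is made up, and the column names follow Ragas conventions rather than the framework's internal trace format.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, answer_similarity

# One made-up evaluation row; in the framework this data comes from the parsed agent trace
rows = {
    "question": ["Was Abraham Lincoln the sixteenth President of the United States?"],
    "answer": ["Yes, Abraham Lincoln was the sixteenth President."],
    "contexts": [["Abraham Lincoln was the sixteenth President of the United States."]],
    "ground_truth": ["yes"],
}

results = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_relevancy, context_recall, answer_similarity],
)
print(results)  # per-metric scores, e.g. faithfulness, answer_relevancy, context_recall, answer_similarity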

User-agent trajectories
The input dataset is in the form of trajectories, where each trajectory consists of one or more questions to be answered by the agent. The trajectories are meant to simulate how a user might interact with the agent. Each trajectory consists of a unique question_id, question_type, question, and ground_truth information. The following are examples of actual trajectories used to evaluate each type of agent in this post.
For more simple agent setups like the RAG and text-to-SQL sample agent, we created trajectories consisting of a single question, as shown in the following examples.
The following is an example of a RAG sample agent trajectory:

{
  "Trajectory0": [
    {
      "question_id": 0,
      "question_type": "RAG",
      "question": "Was Abraham Lincoln the sixteenth President of the United States?",
      "ground_truth": "yes"
    }
  ]
}

The following is an example of a text-to-SQL sample agent trajectory:

{
  "Trajectory1": [
    {
      "question_id": 1,
      "question": "What is the highest eligible free rate for K-12 students in the schools in Alameda County?",
      "question_type": "TEXT2SQL",
      "ground_truth": {
        "ground_truth_sql_query": "SELECT `Free Meal Count (K-12)` / `Enrollment (K-12)` FROM frpm WHERE `County Name` = 'Alameda' ORDER BY (CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) DESC LIMIT 1",
        "ground_truth_sql_context": "[{'table_name': 'frpm', 'columns': [('cdscode', 'varchar'), ('academic year', 'varchar'), ...",
        "ground_truth_query_result": "1.0",
        "ground_truth_answer": "The highest eligible free rate for K-12 students in schools in Alameda County is 1.0."
      }
    }
  ]
}

Pharmaceutical research agent use case example
In this section, we demonstrate how you can use the Open Source Bedrock Agent Evaluation framework to evaluate pharmaceutical research agents discussed in the post Accelerate analysis and discovery of cancer biomarkers with Amazon Bedrock Agents. It showcases a variety of specialized agents, including a biomarker database analyst, statistician, clinical evidence researcher, and medical imaging expert in collaboration with a supervisor agent.
The pharmaceutical research agent was built using the multi-agent collaboration feature of Amazon Bedrock. The following diagram shows the multi-agent setup that was evaluated using this framework.

As shown in the diagram, the RAG evaluations will be conducted on the clinical evidence researcher sub-agent. Similarly, text-to-SQL evaluations will be run on the biomarker database analyst sub-agent. The chain-of-thought evaluation evaluates the final answer of the supervisor agent to check if it properly orchestrated the sub-agents and answered the user’s question.
Research agent trajectories
For a more complex setup like the pharmaceutical research agents, we used a set of industry relevant pregenerated test questions. By creating groups of questions based on their topic regardless of the sub-agents that might be invoked to answer the question, we created trajectories that include multiple questions spanning multiple types of tool use. With relevant questions already generated, integrating with the evaluation framework simply required properly formatting the ground truth data into trajectories.
We walk through evaluating this agent against a trajectory containing a RAG question and a text-to-SQL question:

{
  "Trajectory1": [
    {
      "question_id": 3,
      "question_type": "RAG",
      "question": "According to the knowledge base, how did the EGF pathway associate with CT imaging features?",
      "ground_truth": "The EGF pathway was significantly correlated with the presence of ground-glass opacity and irregular nodules or nodules with poorly defined margins."
    },
    {
      "question_id": 4,
      "question_type": "TEXT2SQL",
      "question": "According to the database, What percentage of patients have EGFR mutations?",
      "ground_truth": {
        "ground_truth_sql_query": "SELECT (COUNT(CASE WHEN EGFR_mutation_status = 'Mutant' THEN 1 END) * 100.0 / COUNT(*)) AS percentage FROM clinical_genomic;",
        "ground_truth_sql_context": "Table clinical_genomic: - Case_ID: VARCHAR(50) - EGFR_mutation_status: VARCHAR(50)",
        "ground_truth_query_result": "14.285714",
        "ground_truth_answer": "According to the query results, approximately 14.29% of patients in the clinical_genomic table have EGFR mutations."
      }
    }
  ]
}

Chain-of-thought evaluations are conducted for every question, regardless of tool use. This will be illustrated through a set of images of agent trace and evaluations on the Langfuse dashboard.
After running the agent against the trajectory, the results are sent to Langfuse to view the metrics. The following screenshot shows the trace of the RAG question (question ID 3) evaluation on Langfuse.

The screenshot displays the following information:

Trace information (input and output of agent invocation)
Trace steps (agent generation and the corresponding sub-steps)
Trace metadata (input and output tokens, cost, model, agent type)
Evaluation metrics (RAG and chain-of-thought metrics)

The following screenshot shows the trace of the text-to-SQL question (question ID 4) evaluation on Langfuse, which evaluated the biomarker database analyst agent that generates SQL queries to run against an Amazon Redshift database containing biomarker information.

The screenshot shows the following information:

Trace information (input and output of agent invocation)
Trace steps (agent generation and the corresponding sub-steps)
Trace metadata (input and output tokens, cost, model, agent type)
Evaluation metrics (text-to-SQL and chain-of-thought metrics)

The chain-of-thought evaluation is included in part of both questions’ evaluation traces. For both traces, LLM-as-a-judge is used to generate scores and explanation around an Amazon Bedrock agent’s reasoning on a given question.
Overall, we ran 56 questions grouped into 21 trajectories against the agent. The traces, model costs, and scores are shown in the following screenshot.

The following table contains the average evaluation scores across 56 evaluation traces.

Metric Category | Metric Type | Metric Name                      | Number of Traces | Metric Avg. Value
Agent Goal      | COT         | Helpfulness                      | 50               | 0.77
Agent Goal      | COT         | Faithfulness                     | 50               | 0.87
Agent Goal      | COT         | Instruction following            | 50               | 0.69
Agent Goal      | COT         | Overall (average of all metrics) | 50               | 0.77
Task Accuracy   | TEXT2SQL    | Answer correctness               | 26               | 0.83
Task Accuracy   | TEXT2SQL    | SQL semantic equivalence         | 26               | 0.81
Task Accuracy   | RAG         | Semantic similarity              | 20               | 0.66
Task Accuracy   | RAG         | Faithfulness                     | 20               | 0.50
Task Accuracy   | RAG         | Answer relevancy                 | 20               | 0.68
Task Accuracy   | RAG         | Context recall                   | 20               | 0.53

Security considerations
Consider the following security measures:

Enable Amazon Bedrock agent logging – For security best practices of using Amazon Bedrock Agents, enable Amazon Bedrock model invocation logging to capture prompts and responses securely in your account.
Check for compliance requirements – Before implementing Amazon Bedrock Agents in your production environment, make sure that the Amazon Bedrock compliance certifications and standards align with your regulatory requirements. Refer to Compliance validation for Amazon Bedrock for more information and resources on meeting compliance requirements.

Clean up
If you deployed the sample agents, run the following notebooks to delete the resources created.
If you chose the self-hosted Langfuse option, follow these steps to clean up your AWS self-hosted Langfuse setup.
Conclusion
In this post, we introduced the Open Source Bedrock Agent Evaluation framework, a Langfuse-integrated solution that streamlines the agent development process. The framework comes equipped with built-in evaluation logic for RAG, text-to-SQL, and chain-of-thought reasoning, and integration with Langfuse for viewing evaluation metrics. With the Open Source Bedrock Agent Evaluation framework, developers can quickly evaluate their agents and rapidly experiment with different configurations, accelerating the development cycle and improving agent performance.
We demonstrated how this evaluation framework can be integrated with pharmaceutical research agents. We used it to evaluate agent performance against biomarker questions and sent traces to Langfuse to view evaluation metrics across question types.
The Open Source Bedrock Agent Evaluation framework enables you to accelerate your generative AI application building process using Amazon Bedrock Agents. To self-host Langfuse in your AWS account, see Hosting Langfuse on Amazon ECS with Fargate using CDK Python. To explore how you can streamline your Amazon Bedrock Agents evaluation process, get started with Open Source Bedrock Agent Evaluation.
Refer to Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications from the Amazon Bedrock team to learn more about multi-agent collaboration and end-to-end agent evaluation.

About the authors
Hasan Poonawala is a Senior AI/ML Solutions Architect at AWS, working with healthcare and life sciences customers. Hasan helps design, deploy, and scale generative AI and machine learning applications on AWS. He has over 15 years of combined work experience in machine learning, software development, and data science on the cloud. In his spare time, Hasan loves to explore nature and spend time with friends and family.
Blake Shin is an Associate Specialist Solutions Architect at AWS who enjoys learning about and working with new AI/ML technologies. In his free time, Blake enjoys exploring the city and playing music.
Rishiraj Chandra is an Associate Specialist Solutions Architect at AWS, passionate about building innovative artificial intelligence and machine learning solutions. He is committed to continuously learning and implementing emerging AI/ML technologies. Outside of work, Rishiraj enjoys running, reading, and playing tennis.

Researchers from Sea AI Lab, UCAS, NUS, and SJTU Introduce FlowReasoner: a Query-Level Meta-Agent for Personalized System Generation

LLM-based multi-agent systems characterized by planning, reasoning, tool use, and memory capabilities form the foundation of applications like chatbots, code generation, mathematics, and robotics. However, these systems face significant challenges as they are manually designed, leading to high human resource costs and limited scalability. Graph-based methods have attempted to automate workflow designs by formulating workflows as networks, but their structural complexity restricts scalability. State-of-the-art approaches represent multi-agent systems as programming code and use advanced LLMs as meta-agents to optimize workflows, but focus on task-level solutions that generate single task-specific systems. This one-size-fits-all approach lacks the capability for automatic adaptation to individual user queries.

LLM-based multi-agent systems are the foundation for various real-world applications, including code intelligence, computer use, and deep research. These systems feature LLM-based agents equipped with planning capabilities, database access, and tool function invocation that collaborate to achieve promising performance. Early approaches focused on optimizing prompts or hyperparameters through evolution algorithms to automate agent profiling. ADAS introduced code representation for agents and workflows with a meta-agent to generate workflows. Moreover, OpenAI has advanced reasoning in LLMs by developing the o1 model. Models like QwQ, QvQ, DeepSeek, and Kimi have followed suit, developing o1-like reasoning architectures. OpenAI’s o3 model achieves promising results on the ARC-AGI benchmark.

Researchers from the Sea AI Lab, Singapore, the University of Chinese Academy of Sciences, the National University of Singapore, and Shanghai Jiao Tong University have proposed FlowReasoner, a query-level meta-agent designed to automate the creation of query-level multi-agent systems, generating one customized system per user query. The researchers distilled DeepSeek R1 to supply FlowReasoner with the fundamental reasoning capabilities needed to create multi-agent systems, and then enhanced it through reinforcement learning with external execution feedback. A multi-purpose reward mechanism is developed to optimize training across three critical dimensions: performance, complexity, and efficiency. This enables FlowReasoner to generate personalized multi-agent systems through deliberative reasoning for each unique user query.

For detailed evaluation across diverse code generation scenarios, the researchers select three datasets: BigCodeBench for engineering-oriented tasks, and HumanEval and MBPP for algorithmic challenges. FlowReasoner is evaluated against three categories of baselines:

Single-model direct invocation using standalone LLMs

Manually designed workflows including Self-Refine, LLM-Debate, and LLM-Blender with human-crafted reasoning strategies

Automated workflow optimization methods like Aflow, ADAS, and MaAS that construct workflows through search or optimization. 

Both o1-mini and GPT-4o-mini are used as worker models for manually designed workflows. FlowReasoner is implemented with two variants of DeepSeek-R1-Distill-Qwen (7B and 14B parameters) using o1-mini as the worker model.

FlowReasoner-14B outperforms all competing approaches, achieving an overall improvement of 5 percentage points compared to the strongest baseline, MaAS. It exceeds the performance of its underlying worker model, o1-mini, by a substantial margin of 10%. These results show the effectiveness of the workflow-based reasoning framework in enhancing code generation accuracy. To evaluate generalization capabilities, experiments are conducted replacing the o1-mini worker with models like Qwen2.5-Coder, Claude, and GPT-4o-mini, while keeping the meta-agent fixed as either FlowReasoner-7B or FlowReasoner-14B. FlowReasoner exhibits notable transferability, maintaining consistent performance across different worker models on the same tasks.

In this paper, researchers present FlowReasoner, a query-level meta-agent designed to automate the creation of personalized multi-agent systems for individual user queries. FlowReasoner utilizes external execution feedback and reinforcement learning with multi-purpose rewards focusing on performance, complexity, and efficiency to generate optimized workflows without relying on complex search algorithms or carefully designed search sets. This approach reduces human resource costs while enhancing scalability by enabling more adaptive and efficient multi-agent systems that dynamically optimize their structure based on specific user queries rather than relying on fixed workflows for entire task categories.

Check out the Paper and GitHub Page.

[Register Now] miniCON Virtual Conference on AGENTIC AI: FREE REGISTRATION + Certificate of Attendance + 4 Hour Short Event (May 21, 9 am- 1 pm PST) + Hands on Workshop
The post Researchers from Sea AI Lab, UCAS, NUS, and SJTU Introduce FlowReasoner: a Query-Level Meta-Agent for Personalized System Generation appeared first on MarkTechPost.

Microsoft Releases a Comprehensive Guide to Failure Modes in Agentic AI Systems

As agentic AI systems evolve, the complexity of ensuring their reliability, security, and safety grows correspondingly. Recognizing this, Microsoft’s AI Red Team (AIRT) has published a detailed taxonomy addressing the failure modes inherent to agentic architectures. This report provides a critical foundation for practitioners aiming to design and maintain resilient agentic systems.

Characterizing Agentic AI and Emerging Challenges

Agentic AI systems are defined as autonomous entities that observe and act upon their environment to achieve predefined objectives. These systems typically integrate capabilities such as autonomy, environment observation, environment interaction, memory, and collaboration. While these features enhance functionality, they also introduce a broader attack surface and new safety concerns.

To inform their taxonomy, Microsoft’s AI Red Team conducted interviews with external practitioners, collaborated across internal research groups, and leveraged operational experience in testing generative AI systems. The result is a structured analysis that distinguishes between novel failure modes unique to agentic systems and the amplification of risks already observed in generative AI contexts.

A Framework for Failure Modes

Microsoft categorizes failure modes across two dimensions: security and safety, each comprising both novel and existing types.

Novel Security Failures: Including agent compromise, agent injection, agent impersonation, agent flow manipulation, and multi-agent jailbreaks.

Novel Safety Failures: Covering issues such as intra-agent Responsible AI (RAI) concerns, biases in resource allocation among multiple users, organizational knowledge degradation, and prioritization risks impacting user safety.

Existing Security Failures: Encompassing memory poisoning, cross-domain prompt injection (XPIA), human-in-the-loop bypass vulnerabilities, incorrect permissions management, and insufficient isolation.

Existing Safety Failures: Highlighting risks like bias amplification, hallucinations, misinterpretation of instructions, and a lack of sufficient transparency for meaningful user consent.

Each failure mode is detailed with its description, potential impacts, where it is likely to occur, and illustrative examples.

Consequences of Failure in Agentic Systems

The report identifies several systemic effects of these failures:

Agent Misalignment: Deviations from intended user or system goals.

Agent Action Abuse: Malicious exploitation of agent capabilities.

Service Disruption: Denial of intended functionality.

Incorrect Decision-Making: Faulty outputs caused by compromised processes.

Erosion of User Trust: Loss of user confidence due to system unpredictability.

Environmental Spillover: Effects extending beyond intended operational boundaries.

Knowledge Loss: Organizational or societal degradation of critical knowledge due to overreliance on agents.

Mitigation Strategies for Agentic AI Systems

The taxonomy is accompanied by a set of design considerations aimed at mitigating identified risks:

Identity Management: Assigning unique identifiers and granular roles to each agent.

Memory Hardening: Implementing trust boundaries for memory access and rigorous monitoring.

Control Flow Regulation: Deterministically governing the execution paths of agent workflows.

Environment Isolation: Restricting agent interaction to predefined environmental boundaries.

Transparent UX Design: Ensuring users can provide informed consent based on clear system behavior.

Logging and Monitoring: Capturing auditable logs to enable post-incident analysis and real-time threat detection.

XPIA Defense: Minimizing reliance on external untrusted data sources and separating data from executable content.

These practices emphasize architectural foresight and operational discipline to maintain system integrity.

Case Study: Memory Poisoning Attack on an Agentic Email Assistant

Microsoft’s report includes a case study demonstrating a memory poisoning attack against an AI email assistant implemented using LangChain, LangGraph, and GPT-4o. The assistant, tasked with email management, utilized a RAG-based memory system.

An adversary introduced poisoned content via a benign-looking email, exploiting the assistant’s autonomous memory update mechanism. The agent was induced to forward sensitive internal communications to an unauthorized external address. Initial testing showed a 40% success rate, which increased to over 80% after modifying the assistant’s prompt to prioritize memory recall.

This case illustrates the critical need for authenticated memorization, contextual validation of memory content, and consistent memory retrieval protocols.

Conclusion: Toward Secure and Reliable Agentic Systems

Microsoft’s taxonomy provides a rigorous framework for anticipating and mitigating failure in agentic AI systems. As the deployment of autonomous AI agents becomes more widespread, systematic approaches to identifying and addressing security and safety risks will be vital.

Developers and architects must embed security and responsible AI principles deeply within agentic system design. Proactive attention to failure modes, coupled with disciplined operational practices, will be necessary to ensure that agentic AI systems achieve their intended outcomes without introducing unacceptable risks.

Check out the Guide.

Building Fully Autonomous Data Analysis Pipelines with the PraisonAI Agent Framework: A Coding Implementation

In this tutorial, we demonstrate how PraisonAI Agents can elevate your data analysis from manual scripting to a fully autonomous, AI-driven pipeline. With a few natural-language prompts, you'll orchestrate every stage of the workflow: loading CSV or Excel files, filtering rows, summarizing trends, grouping by custom fields, pivoting tables, and exporting results to both CSV and Excel, all without writing traditional Pandas code. Under the hood, PraisonAI leverages Google Gemini to interpret your instructions and invoke the appropriate tools, while features such as self-reflection and verbose logging give you full visibility into each intermediate reasoning step.

!pip install "praisonaiagents[llm]"

We install the core PraisonAI Agents library, along with its LLM integration extras, which bring in all necessary dependencies (such as LiteLLM and Gemini connectors) to drive autonomous workflows with large language models.

import os

os.environ["GEMINI_API_KEY"] = "Use Your API Key"

llm_id = "gemini/gemini-1.5-flash-8b"

We configure your environment for Gemini access by setting your API key, then specify which Gemini model (the “1.5-flash-8b” variant) the PraisonAI Agent should use as its LLM backend.

from google.colab import files

uploaded = files.upload()
csv_path = next(iter(uploaded))
print("Loaded:", csv_path)

We leverage Colab’s file‐upload widget to let you pick a local CSV, capture its filename into csv_path, and print a confirmation, making it easy to bring your data into the notebook interactively.

from praisonaiagents import Agent
from praisonaiagents.tools import (
    read_csv, filter_data, get_summary, group_by, pivot_table, write_csv
)

agent = Agent(
    instructions="You are a Data Analyst Agent using Google Gemini.",
    llm=llm_id,
    tools=[
        read_csv, filter_data, get_summary, group_by, pivot_table, write_csv
    ],
    self_reflect=True,
    verbose=True
)

We instantiate a PraisonAI Agent wired to Google Gemini, equipping it with data‐analysis tools (CSV I/O, filtering, summarization, grouping, pivoting, and export). Enabling self-reflect allows the agent to critique its reasoning, while verbose mode streams detailed tool-invocation logs for transparency.

result = agent.start(f"""
1. read_csv to load data from "{csv_path}"
2. get_summary to outline overall trends
3. filter_data to keep rows where Close > 800
4. group_by Year to average closing price
5. pivot_table to format the output table
""")
print(result)

We send a clear, step-by-step prompt to your PraisonAI Agent, instructing it to load the CSV, summarize overall trends, filter for closing prices over $800, compute yearly averages, and pivot the table. The agent then prints out the combined response (including any generated summary or data output).

(The original post includes screenshots of the PraisonAI Agent's first-step code generation, its analysis after that step, and its second-step code generation.)

In conclusion, we have constructed an end-to-end data pipeline powered by PraisonAI Agents and Gemini that goes from raw data upload to summarized insights and exportable reports in just a few cells. We've seen how PraisonAI's declarative toolset replaces dozens of lines of boilerplate code with concise, human-readable steps, and how built-in mechanisms, such as result caching and dual-mode API invocation, support both efficiency and reliability.

Sources

https://docs.praison.ai/ 

https://github.com/MervinPraison/PraisonAI


Implementing Persistent Memory Using a Local Knowledge Graph in Claude Desktop

A Knowledge Graph Memory Server allows Claude Desktop to remember and organize information about a user across multiple chats. It can store things like user preferences, past conversations, and personal details. Because the information is saved as a knowledge graph, Claude can understand relationships between different pieces of information. This leads to more personalized responses and reduces repetition — you won’t have to explain the same things again and again.

In this tutorial, we will implement a simple persistent memory using a local knowledge graph in Claude Desktop, to help it remember user information across chats and provide more personalized, consistent responses.

Step 1: Installing the dependencies

Node.js Installation

We’ll be using npx to run the Knowledge Graph Memory Server, and for that, Node.js is required.

Download the latest version of Node.js from nodejs.org.

Run the installer.

Leave all settings as default and complete the installation.

Claude Desktop Installation

You can download the latest version of Claude Desktop at https://claude.ai/download. Next, you’ll need to configure Claude to connect with your MCP server. To do this, open the claude_desktop_config.json file located in the Claude directory using any text editor. If the file doesn’t exist, go ahead and create it manually.

Step 2: Configuring the claude_desktop_config.json file

In the claude_desktop_config.json file, enter the following configuration:

{
  "mcpServers": {
    "memory": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-memory"
      ],
      "env": {
        "MEMORY_PATH": "./memory.json"
      }
    }
  }
}

Step 3: Configuring Claude settings

Now, we need to configure Claude so it can use the knowledge graph to create entities, build relationships, and retrieve relevant information.

Go to File > Settings > Claude Settings > Configure.

In the Personal Preferences section, add the following text: 

(This preference will automatically apply to all conversations.)

Follow these steps for each interaction:

1. User Identification:
– You should assume that you are interacting with default_user
– If you have not identified default_user, proactively try to do so.

2. Memory Retrieval:
– Always begin your chat by saying only “Remembering…” and retrieve all relevant information from your knowledge graph
– Always refer to your knowledge graph as your “memory”

3. Memory:
– While conversing with the user, be attentive to any new information that falls into these categories:
a) Basic Identity (age, gender, location, job title, education level, etc.)
b) Behaviors (interests, habits, etc.)
c) Preferences (communication style, preferred language, etc.)
d) Goals (goals, targets, aspirations, etc.)
e) Relationships (personal and professional relationships up to 3 degrees of separation)

4. Memory Update:
– If any new information was gathered during the interaction, update your memory as follows:
a) Create entities for recurring organizations, people, and significant events
b) Connect them to the current entities using relations
c) Store facts about them as observations

Once everything is configured, you will see 9 MCP tools available for the Knowledge Graph Server. These tools allow you to: create entities, create relationships, add observations, delete entities, delete observations, delete relationships, read the graph, search nodes, and open nodes.

Additionally, the text we added in the preferences section enables Claude to automatically use these tools during conversations.

Even if we go to a new chat, Claude will remember the information from the previous chats via the knowledge graph. The integration of this MCP tool enhances Claude’s ability to create, modify, and utilize knowledge in real-time, making it a powerful assistant for tasks like database management and SQL query generation. With this memory system in place, Claude becomes a more intelligent, responsive, and consistent tool for all your future interactions. For more details on the knowledge memory server, you can visit this link, where you’ll find resources to help you build even more advanced applications.


Google AI Unveils 601 Real-World Generative AI Use Cases Across Industries

Google Cloud has just released an extraordinary compendium of 601 real-world generative AI (GenAI) use cases from some of the world’s top organizations — a major leap from the 101 use cases it shared just a year ago at Google Cloud Next 2024. This sixfold expansion showcases the explosive pace at which GenAI technologies are moving from prototypes to production, powering transformations across virtually every sector.

Announced during Google Cloud Next 2025, the comprehensive list covers companies ranging from Uber, Samsung, and Citi to Mercedes-Benz, Deutsche Bank, and Alaska Airlines. The breadth of applications highlights GenAI’s growing importance as an operational, creative, and strategic lever across automotive, finance, healthcare, manufacturing, media, retail, and public sector industries​.

The Structure: Agents, Industries, and Applications

Google structured the showcase across 11 major industry groups and six AI agent types:

Customer Agents: Enhance user experiences via chatbots, predictive services, and personalization

Employee Agents: Boost internal productivity through content generation, summarization, and knowledge discovery

Creative Agents: Accelerate campaign design, media production, and product innovation

Code Agents: Streamline software engineering and IT workflows

Data Agents: Leverage data for analysis, optimization, and decision support

Security Agents: Fortify organizations with AI-driven threat detection and fraud prevention​.

This agent-based taxonomy makes it clear: AI is no longer a separate tool — it’s becoming embedded into the organizational fabric.

Industry Snapshots: Real-World Impact

Automotive & Logistics

The automotive industry is rapidly adopting conversational and predictive AI. Volkswagen of America built a multimodal virtual assistant inside the myVW app using Google’s Gemini models, letting users point their phones at dashboard indicators for instant explanations​. Mercedes-Benz launched an automotive AI agent offering natural language navigation and e-commerce sales capabilities directly within its vehicles.

Even logistics giants are innovating: UPS is constructing a digital twin of its global package network for real-time package tracking and optimization​.

Financial Services

Banks and fintech companies are particularly aggressive in AI adoption. Citi is using Vertex AI to empower developer toolkits and document digitization. Deutsche Bank’s “DB Lumina” research tool, powered by Gemini, slashes research report creation times from hours to minutes​.

Meanwhile, Discover Financial Services deployed AI assistants that aid both customers and contact center representatives, significantly improving service efficiency​.

Healthcare & Life Sciences

In healthcare, the impact of AI extends from diagnostics to operational efficiency. Freenome is building early-detection cancer tests combining AI and blood samples. Mayo Clinic unlocked 50 petabytes of clinical data with Vertex AI Search, accelerating research access​.

Apollo Hospitals in India scaled tuberculosis and breast cancer screening to 3 million people by applying AI to radiology workflows​.

Manufacturing & Electronics

Manufacturers like Samsung are embedding Google’s Gemini AI directly into their devices — the Galaxy S24 now offers AI-driven text summarization and image editing features​. Trimble and Honeywell have incorporated Gemini for Workspace to enhance engineering productivity and document automation​.

Media, Retail, and Hospitality

AI is dramatically altering customer engagement. Papa John’s, Wendy’s, and Uber are using AI-powered predictive ordering systems​. Radisson Hotel Group reported a 50% gain in marketing productivity and over 20% revenue lift by personalizing ads with Vertex AI​.

Even creative industries are leveraging AI: Adobe has integrated Imagen 3 and Veo 2 into Adobe Express, dramatically accelerating campaign creation​.

Technology Highlights: Google’s Evolving Stack

Many of these applications were made possible through core Google Cloud AI technologies, notably:

Vertex AI: Model training, deployment, RAG (retrieval-augmented generation) pipelines

Gemini Models: Multimodal LLMs powering text, code, vision, and conversational capabilities

Imagen & Veo: High-fidelity generative image and video models

BigQuery ML: Data warehousing with embedded machine learning

Security AI: AI-first threat detection with Google SecOps​.

An emerging trend is the heavy use of enterprise-tuned AI agents, such as Gemini Code Assist for developer productivity or Gemini in Security for threat intelligence​.

Emerging Patterns Across Use Cases

Several clear trends emerge from Google’s compilation:

Generative AI is moving from experiments to mission-critical systems: Whether automating underwriting in finance, driving drug discovery, or powering multimodal search in automotive apps, GenAI is now operational at scale.

Hybrid Multimodal Models are increasingly vital: Many solutions integrate text, vision, and structured data — not just plain language models.

Verticalized AI Agents are accelerating: Google’s partners aren’t just fine-tuning LLMs — they’re building domain-specific, industry-tuned AI agents tightly integrated into their workflows.

Democratization of AI: Solutions like Vertex AI’s search and data agents are putting sophisticated AI tools into the hands of business users, scientists, and even drivers — not just engineers.

Final Thoughts

The 601 use cases shared by Google paint an exhilarating picture: AI transformation is no longer theoretical — it is happening today, at massive scale, in nearly every sector.

Google’s strategy of aligning its AI offerings with real-world operational needs — from customer engagement and logistics to employee productivity and cybersecurity — is accelerating this adoption curve.

As Google’s President of Global Revenue, Matt Renner, said in the announcement, “This is just scratching the surface of what’s becoming possible with AI across the enterprise”​.

If these use cases are any indication, the next year promises even more staggering innovation.

Check out the Report.

This AI Paper from China Proposes a Novel Training-Free Approach DEER that Allows Large Reasoning Language Models to Achieve Dynamic Early Exit in Reasoning

Recent progress in large reasoning language models (LRLMs), such as DeepSeek-R1 and OpenAI o1, has greatly improved complex problem-solving abilities by extending the length of chain-of-thought (CoT) generation during inference. These models benefit from test-time scaling laws, allowing richer and more diverse reasoning paths. However, generating overly long CoT sequences leads to computational inefficiency and increased latency, making real-world deployment challenging. Moreover, excessive reasoning often introduces redundant or irrelevant steps, which can cause models to deviate from correct answers, ultimately reducing accuracy. This overthinking problem stems from traditional supervised fine-tuning and reinforcement learning approaches that do not prioritize dynamic control over reasoning length. Research has shown that in many cases, reasoning could be halted earlier, at what the authors call “pearl reasoning” points, without sacrificing correctness. Identifying and stopping at these critical points could significantly improve efficiency while maintaining model performance.

Existing approaches to improve inference efficiency generally fall into three categories: post-training, prompt-based, and output-based methods. Post-training techniques involve retraining models with variable-length CoT examples or length rewards, but they are often computationally intensive and risk overfitting. Prompt-based methods adjust CoT length by modifying the input prompts based on task difficulty, achieving more concise reasoning without sacrificing much accuracy. Output-based methods typically focus on sampling techniques, such as early stopping when multiple outputs converge on the same answer. However, with newer models like R1, reliance on best-of-N sampling has decreased. Recent works have explored early exiting strategies, but they often require separate verification models or are only effective in limited settings. In contrast, the discussed approach aims to empower models to recognize optimal stopping points during their reasoning process, providing a more seamless and generalizable solution.

Researchers from the Institute of Information Engineering, the University of Chinese Academy of Sciences, and Huawei Technologies have proposed DEER, a simple, training-free method to enable LRLMs to dynamically exit early during reasoning. DEER monitors key transition points, such as the generation of “Wait” tokens, and prompts the model to produce trial answers at these moments. If the model shows high confidence, reasoning is halted; otherwise, it continues. This approach integrates seamlessly with existing models, such as DeepSeek, and reduces CoT length by 31–43%, while improving accuracy by 1.7–5.7% across benchmarks including MATH-500, AIME 2024, and GPQA Diamond.

The DEER (Dynamic Early Exit in Reasoning) method enables large reasoning language models to exit reasoning early by evaluating their confidence in trial answers at key transition points. It uses three modules: a reasoning transition monitor to detect “thought switch” signals, an answer inducer to prompt a trial conclusion, and a confidence evaluator to assess if the reasoning is sufficient. If confidence exceeds a threshold, reasoning stops; otherwise, it continues. To reduce latency from trial answer generation, DEER also employs branch-parallel decoding with dynamic cache management, thereby improving efficiency without sacrificing accuracy, particularly for tasks such as code generation.
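
The loop below is a schematic sketch of that idea, assuming a hypothetical model interface with generate and answer_confidence methods; the threshold and prompt strings are illustrative, not the paper's exact values.

# Schematic sketch of DEER-style dynamic early exit (the model interface is hypothetical).
THRESHOLD = 0.95                      # illustrative confidence threshold
ANSWER_PROMPT = "\n**Final Answer**\n"

def deer_generate(model, question, max_thoughts=64):
    reasoning = ""
    for _ in range(max_thoughts):
        # Reason until the next "thought switch" signal (e.g., a "Wait" token).
        reasoning += model.generate(question + reasoning, stop=["Wait"])
        # Answer inducer: prompt a trial conclusion from the reasoning so far.
        trial = model.generate(question + reasoning + ANSWER_PROMPT)
        # Confidence evaluator: e.g., mean token probability of the trial answer.
        if model.answer_confidence(trial) >= THRESHOLD:
            return trial              # confident enough: exit early
        reasoning += "Wait"           # otherwise, keep thinking
    return model.generate(question + reasoning + ANSWER_PROMPT)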

The experiments evaluated models on four major reasoning benchmarks: MATH-500, AMC 2023, AIME 2024, and GPQA Diamond, as well as programming benchmarks HumanEval and BigCodeBench. Tests were conducted using DeepSeek-R1-Distill-Qwen models of varying sizes (1.5B to 32B parameters) under a Zero-shot Chain-of-Thought setup. DEER significantly improved performance by reducing reasoning length by 31–43% while increasing accuracy by 1.7–5.7% compared to standard CoT. A detailed analysis revealed that DEER corrected more responses through early exits, particularly for smaller models and simpler tasks. On programming benchmarks, DEER also reduced reasoning length by over 60% with minimal or no loss in accuracy, demonstrating its robustness across various tasks.

In conclusion, the study validates the idea of using early exits during CoT generation through pilot studies. Based on these findings, it introduces a training-free dynamic early exit method that enables models to stop reasoning once enough information is gathered. Tested across various model sizes and six major reasoning benchmarks, the method achieves better accuracy with fewer tokens, effectively balancing efficiency and performance. Unlike traditional approaches that rely on long CoT for complex tasks, this method dynamically monitors model confidence to determine when to stop reasoning, thereby avoiding unnecessary steps. Experiments show significant reductions in reasoning length while boosting overall accuracy.

Check out the Paper.

Skywork AI Advances Multimodal Reasoning: Introducing Skywork R1V2 with Hybrid Reinforcement Learning

Recent advancements in multimodal AI have highlighted a persistent challenge: achieving strong specialized reasoning capabilities while preserving generalization across diverse tasks. “Slow-thinking” models such as OpenAI-o1 and Gemini-Thinking have made strides in deliberate analytical reasoning but often exhibit compromised performance on general visual understanding tasks, with increased tendencies toward visual hallucinations. As the field progresses toward building general-purpose AI systems, reconciling this tradeoff remains a critical research problem.

Skywork AI Introduces Skywork R1V2

Skywork AI has released Skywork R1V2, a next-generation multimodal reasoning model designed to address the reasoning-generalization tradeoff systematically. Building upon the foundation of Skywork R1V, R1V2 introduces a hybrid reinforcement learning framework, combining reward-model guidance with structured rule-based signals. The model bypasses the conventional reliance on teacher-student distillation by learning directly from multimodal interactions, offering an open and reproducible advancement through its release on Hugging Face.

Technical Approach and Innovations

Skywork R1V2 incorporates Group Relative Policy Optimization (GRPO) alongside a Selective Sample Buffer (SSB) to enhance training stability and efficiency. GRPO enables relative evaluation among candidate responses within the same query group, but as training converges and responses in a group earn similar rewards, the relative advantages shrink toward zero and the effective learning signal diminishes. The SSB mechanism addresses this by maintaining a cache of informative samples, ensuring continuous access to high-value gradients.
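
To make the vanishing-advantages issue concrete, here is a minimal, stdlib-only illustration of GRPO's group-relative scoring; it is a simplified view of the general technique, not Skywork's implementation.

import statistics

def group_relative_advantages(rewards):
    # GRPO scores each candidate response relative to its query group:
    # advantage_i = (r_i - mean(group)) / std(group)
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0        # guard against all-equal rewards
    return [(r - mean_r) / std_r for r in rewards]

print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # informative group: [1.0, -1.0, -1.0, 1.0]
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # converged group: all zeros, no signal

When every response in a group earns the same reward, the advantages collapse to zero, which is exactly the situation the Selective Sample Buffer counteracts by keeping earlier high-signal samples available for training.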

Additionally, the model adopts a Mixed Preference Optimization (MPO) strategy, integrating reward-model-based preferences with rule-based constraints. This hybrid optimization allows Skywork R1V2 to strengthen step-by-step reasoning quality while maintaining consistency in general perception tasks. A modular training approach, utilizing lightweight adapters between a frozen InternViT-6B vision encoder and a pretrained language model, preserves the language model’s reasoning capabilities while optimizing cross-modal alignment efficiently.

Empirical Results and Analysis

Skywork R1V2 demonstrates robust performance across a range of reasoning and multimodal benchmarks. On text reasoning tasks, the model achieves 78.9% on AIME 2024, 63.6% on LiveCodeBench, 73.2% on LiveBench, 82.9% on IFEVAL, and 66.3% on BFCL. These results represent significant improvements over Skywork R1V1 and are competitive with substantially larger models, such as DeepSeek-R1 (671B parameters).

In multimodal evaluation, R1V2 achieves 73.6% on MMMU, 74.0% on MathVista, 62.6% on OlympiadBench, 49.0% on MathVision, and 52.0% on MMMU-Pro. The model consistently outperforms open-source baselines of comparable or larger size, including Qwen2.5-VL-72B and QvQ-Preview-72B, particularly excelling in tasks that require structured problem-solving across visual and textual inputs.

When compared against proprietary models, R1V2 demonstrates narrowing performance gaps. It surpasses Claude 3.5 Sonnet and Gemini 2 Flash on critical multimodal benchmarks such as MMMU and MathVista. Importantly, hallucination rates were substantially reduced to 8.7% through calibrated reinforcement strategies, maintaining factual integrity alongside complex reasoning.

Qualitative assessments further illustrate R1V2’s systematic problem-solving approach, with the model demonstrating methodical decomposition and verification behaviors in complex scientific and mathematical tasks, reinforcing its alignment with reflective cognitive patterns.

Conclusion

Skywork R1V2 advances the state of multimodal reasoning through a carefully designed hybrid reinforcement learning framework. By addressing the vanishing advantages problem with the Selective Sample Buffer and balancing optimization signals through Mixed Preference Optimization, the model achieves notable improvements in both specialized reasoning tasks and general multimodal understanding.

With benchmark-leading performances such as 62.6% on OlympiadBench and 73.6% on MMMU, Skywork R1V2 establishes a strong open-source baseline. Its design principles and training methodology offer a pragmatic approach toward developing robust, efficient multimodal AI systems. Future directions for Skywork AI include enhancing general visual understanding capabilities while preserving the sophisticated reasoning foundations laid by R1V2.

Check out the Paper and Model on Hugging Face.

From GenAI Demos to Production: Why Structured Workflows Are Essential

At technology conferences worldwide and on social media, generative AI applications demonstrate impressive capabilities: composing marketing emails, creating data visualizations, or writing functioning code. Yet behind these polished demonstrations lies a stark reality. What works in controlled environments often fails when confronted with the demands of production systems.

Industry surveys reveal the scale of this challenge: 68% of organizations have moved 30% or fewer of their generative AI experiments into production, while only 53% of AI projects overall progress from prototype to production – with a mere 10% achieving measurable ROI (Wallaroo). Why does this gap persist? The controlled environment of a demonstration bears little resemblance to the unpredictable demands of real-world deployment.

Most current GenAI applications rely on what some have called ‘vibes-based’ assessments rather than rigorous validation. A developer reviews the output, determines it looks reasonable, and the system advances to the next stage of development. While this approach might sometimes identify obvious flaws, it fails to detect subtle inconsistencies that emerge only at scale or with edge-case inputs.

These reliability concerns become critical when AI systems influence business decisions with tangible consequences. 70% of organizations estimate needing at least 12 months to resolve challenges in achieving expected ROI from GenAI, highlighting the high stakes of production failures. Each misstep carries measurable costs: an incorrect product recommendation affects not just immediate sales but customer retention; an inaccurate financial summary might lead to misallocation of resources; a flawed legal interpretation could create significant liability exposure.

The transition from promising demonstrations to dependable production systems requires more than incremental improvements. It demands a fundamental shift in how we architect and evaluate GenAI applications. Structured workflows and systematic evaluation offer a methodical path forward—one that transforms unpredictable prototypes into systems worthy of trust with consequential decisions.

The Limitations of Monolithic GenAI Applications

Most first-generation GenAI applications employ a deceptively simple architecture: user input enters the system, a language model processes it with some contextual information, and the system produces a response. This end-to-end approach, while straightforward to implement, introduces significant limitations when deployed beyond controlled environments.

The most pressing challenge involves identifying the source of errors. When a monolithic system produces incorrect, biased, or nonsensical output, determining the cause becomes an exercise in speculation. Did the retrieval mechanism provide irrelevant context? Was the prompt construction flawed? Does the base model lack necessary capabilities? Without visibility into these components, improvement efforts resemble guesswork rather than engineering. Choco, a food distribution platform, discovered this when their single “catch-all” prompt worked in a hackathon but proved “not scalable or maintainable” in production.

Language models introduce another complication through their probabilistic nature. Even with identical inputs, these models may generate different outputs across successive executions. This variability creates a fundamental tension: creative applications benefit from diverse outputs, but business processes require consistency. The legal field saw an infamous example when an attorney unknowingly submitted hallucinated court cases from ChatGPT, leading to sanctions. The lack of internal measurement points further hampers improvement efforts. Without defined evaluation boundaries, teams struggle to isolate performance issues or quantify improvements.

Many current frameworks exacerbate these problems through premature abstraction. They encapsulate functionality behind interfaces that obscure necessary details, creating convenience at the expense of visibility and control. A team at Prosus found that off-the-shelf agent frameworks were fine for prototyping but too inflexible for production at scale.

These limitations become most apparent as organizations scale from prototype to production. Approaches that function adequately in limited tests falter when confronted with the volume, variety, and velocity of real-world data. Production deployment requires architectures that support not just initial development but ongoing operation, monitoring, and improvement—needs that monolithic systems struggle to satisfy. Successful teams have responded by breaking monolithic designs into modular pipelines, taming randomness with deterministic components, building comprehensive evaluation infrastructure, and favoring transparent architectures over premature abstractions.

Component-Driven GenAI: Breaking Down the Black Box

The transition to component-driven architecture represents more than a technical preference—it applies fundamental software engineering principles to generative AI development. By decomposing monolithic systems into discrete functional units, this approach transforms opaque black boxes into transparent, manageable workflows.

Component-based architecture divides complex systems into units with specific responsibilities, connected through well-defined interfaces. In GenAI applications, these components might include:

Data Retrieval Component: A vector database with embedding search that finds relevant documents or knowledge snippets based on user queries (e.g., Pinecone or Weaviate storing product information).

Prompt Construction Component: A template engine that formats retrieved information and user input into optimized prompts (e.g., a system that assembles query context).

Model Interaction Component: An API wrapper that handles communication with language models, manages retries, and standardizes input/output formats (e.g., a service that routes requests to Azure OpenAI or local Ollama endpoints).

Output Validation Component: A rule-based or LLM-based validator that checks outputs for accuracy, harmful content, or hallucinations (e.g., a fact-checking module that compares generated statements with retrieved knowledge).

Response Processing Component: A formatter that restructures raw model output into application-appropriate formats (e.g., a JSON parser that extracts structured data from text responses).

Each component addresses a specific function, creating natural boundaries for both execution and evaluation.
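
A minimal sketch of such a decomposition might look like the following; the class names and method signatures are illustrative rather than drawn from any specific framework.

from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str
    score: float

class Retriever:                                  # Data Retrieval Component
    def retrieve(self, query: str) -> list[Chunk]:
        raise NotImplementedError                 # e.g., embed the query and search a vector store

class PromptBuilder:                              # Prompt Construction Component
    def build(self, query: str, chunks: list[Chunk]) -> str:
        context = "\n".join(c.text for c in chunks)
        return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

class ModelClient:                                # Model Interaction Component
    def complete(self, prompt: str) -> str:
        raise NotImplementedError                 # wrap provider calls, retries, and formatting here

class Validator:                                  # Output Validation Component
    def check(self, answer: str, chunks: list[Chunk]) -> bool:
        return bool(answer.strip())               # placeholder for rule- or LLM-based checks

def answer_query(query, retriever, builder, model, validator):
    chunks = retriever.retrieve(query)
    prompt = builder.build(query, chunks)
    raw = model.complete(prompt)
    if not validator.check(raw, chunks):
        raise ValueError("output failed validation")
    return raw                                    # a Response Processing step would format this further

Each class maps to one of the components above, and the orchestrating function makes the seams between them explicit, which is what later enables targeted evaluation.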

This decomposition yields several practical advantages that directly address the limitations of monolithic approaches. First, it establishes separation of concerns, allowing developers to focus on specific functionality without addressing the entire system simultaneously. Second, it creates discrete evaluation points where inputs and outputs can be validated against defined criteria. Third, it simplifies reasoning about system behavior by reducing complex interactions to manageable units that can be understood and modified independently.

Leading organizations have demonstrated these benefits in production. Uber’s DragonCrawl, a system for automated mobile app testing, uses LLMs to execute tests with human-like intuition. While not explicitly described as component-driven in Uber’s blog, its architecture effectively separates concerns into functional areas working together:

A representation component that converts app UI screens into text for the model to process

A decision-making component using a fine-tuned MPNet model (110M parameters) that determines what actions to take based on context and goals

An execution component that implements these decisions as interactions with the app

This structured approach achieved “99%+ stability” in November-December 2023 and successfully executed end-to-end trips in 85 out of 89 top cities without any city-specific tweaks. Most importantly, the system required no maintenance—when app changes occurred, DragonCrawl figured out how to navigate new flows on its own, unlike traditional tests that required hundreds of maintenance hours in 2023. The deliberate model selection process (evaluating multiple options against precision metrics) further demonstrates how systematic evaluation leads to reliable production systems.

Well-designed interfaces between components further enhance system maintainability. By establishing explicit contracts for data exchange, these interfaces create natural boundaries for testing and make components interchangeable. For example, a data retrieval component might specify that it accepts natural language queries and returns relevant document chunks with source metadata and relevance scores. This clear contract allows teams to swap between different retrieval implementations (keyword-based, embedding-based, or hybrid) without changing downstream components as long as the interface remains consistent.
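
In code, that contract can be captured as a small protocol that both a keyword-based and an embedding-based retriever satisfy; the snippet below is an illustrative sketch, and the vector-index API it assumes is hypothetical.

from typing import Protocol

class Retrieval(Protocol):
    def retrieve(self, query: str) -> list[dict]:
        """Return chunks shaped like {'text': ..., 'source': ..., 'score': ...}."""
        ...

class KeywordRetriever:
    def __init__(self, docs: list[dict]):
        self.docs = docs
    def retrieve(self, query: str) -> list[dict]:
        terms = query.lower().split()
        hits = [d for d in self.docs if any(t in d["text"].lower() for t in terms)]
        return [{**d, "score": 1.0} for d in hits]

class EmbeddingRetriever:
    def __init__(self, index):
        self.index = index                          # e.g., a vector store client (hypothetical API)
    def retrieve(self, query: str) -> list[dict]:
        return self.index.search(query, top_k=5)

def build_context(retriever: Retrieval, query: str) -> str:
    # Downstream code depends only on the contract, so implementations stay interchangeable.
    return "\n".join(c["text"] for c in retriever.retrieve(query))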

The Component-Evaluation Pair: A Fundamental Pattern

At the heart of reliable GenAI systems lies a simple but powerful pattern: each component should have a corresponding evaluation mechanism that verifies its behavior. This component-evaluation pair creates a foundation for both initial validation and ongoing quality assurance.

This approach parallels unit testing in software engineering but extends beyond simple pass/fail validation. Component evaluations should verify basic functionality, identify performance boundaries, detect drift from expected behavior, and provide diagnostic information when issues arise. These evaluations serve as both quality gates during development and monitoring tools during operation.

Real-world implementations demonstrate this pattern’s effectiveness. Aimpoint Digital built a travel itinerary generator with separate evaluations for its retrieval component (measuring relevance of fetched results) and generation component (using an LLM-as-judge to grade output quality). This allowed them to quickly identify whether issues stemmed from poor information retrieval or flawed generation.

Payment processing company Stripe implemented a component-evaluation pair for their customer support AI by tracking “match rate” – how often the LLM’s suggested responses aligned with human agent final answers. This simple metric served as both quality gate and production monitor for their generation component.
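
A match-rate metric of this kind is straightforward to express as a component-level evaluation; the normalization and exact-match criterion below are deliberately simplistic assumptions, and a production system would use a more robust comparison.

def match_rate(suggestions: list[str], agent_answers: list[str]) -> float:
    # Fraction of model suggestions that match the human agent's final answer.
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    matches = sum(norm(s) == norm(a) for s, a in zip(suggestions, agent_answers))
    return matches / max(len(suggestions), 1)

# Used as a quality gate: block a new prompt or model version if the rate drops below a floor.
assert match_rate(["Refund issued", "Check your card."],
                  ["refund  issued", "Please contact support"]) >= 0.5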

The one-to-one relationship between components and evaluations enables targeted improvement when issues emerge. Rather than making broad changes to address vague performance concerns, teams can identify specific components that require attention. This precision reduces both development effort and the risk of unintended consequences from system-wide modifications.

The metrics from component evaluations form a comprehensive dashboard of system health. Engineers can monitor these indicators to identify performance degradation before it affects end users—a significant advantage over systems where problems become apparent only after they impact customers. This proactive approach supports maintenance activities and helps prevent production incidents.

When implemented systematically, component evaluations build confidence in system composition. If each component demonstrates acceptable performance against defined metrics, engineers can combine them with greater assurance that the resulting system will behave as expected. This compositional reliability becomes particularly important as systems grow in complexity.

Eval-First Development: Starting With Measurement

Conventional development processes often treat evaluation as an afterthought—something to be addressed after implementation is complete. Eval-first development inverts this sequence, establishing evaluation criteria before building components. This approach ensures that success metrics guide development from the outset rather than being retrofitted to match existing behavior.

The eval-first methodology creates a multi-tiered framework that operates at increasing levels of abstraction:

At the component level, evaluations function like unit tests in software development. These assessments verify that individual functional units perform their specific tasks correctly under various conditions. A retrieval component might be evaluated on the relevance of returned information across different query types, while a summarization component could be assessed on factual consistency between source text and generated summaries. These targeted evaluations provide immediate feedback during development and ongoing monitoring in production.

Step-level evaluations examine how components interact in sequence, similar to integration testing in software development. These assessments verify that outputs from one component serve as appropriate inputs for subsequent components and that the combined functionality meets intermediate requirements. For example, step-level evaluation might confirm that a classification component correctly routes queries to appropriate retrieval components, which then provide relevant context to a generation component.

Workflow-level evaluations assess whether the entire pipeline satisfies business requirements. These system-level tests validate end-to-end performance against defined success criteria. For a customer support system, workflow evaluation might measure resolution rate, customer satisfaction, escalation frequency, and handling time. These metrics connect technical implementation to business outcomes, providing a framework for prioritizing improvements.
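
Expressed as test code, the three tiers look much like a conventional unit/integration/system split; the fixtures, components, and thresholds in this sketch are hypothetical.

# Component level: does retrieval surface the expected document for labeled queries?
def test_retriever_relevance(retriever, labeled_queries):
    hits = [q["expected_doc"] in [c["source"] for c in retriever.retrieve(q["text"])]
            for q in labeled_queries]
    assert sum(hits) / len(hits) >= 0.8

# Step level: does the classifier route queries to the right retriever?
def test_router_routes_correctly(router, retrievers):
    assert router.classify("How do I reset my password?") in retrievers

# Workflow level: does the end-to-end pipeline meet the business metric?
def test_resolution_rate(pipeline, golden_set):
    resolved = [pipeline.run(case["query"]).resolves(case) for case in golden_set]
    assert sum(resolved) / len(resolved) >= 0.9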

This layered approach offers significant advantages over end-to-end evaluation alone. First, it provides a comprehensive view of system performance, identifying issues at multiple levels of granularity. Second, it establishes traceability between business metrics and component behavior, connecting technical performance to business outcomes. Third, it supports incremental improvement by highlighting specific areas that require attention.

Organizations that implement eval-first development often discover requirements and constraints earlier in the development process. By defining how components will be evaluated before implementation begins, teams identify potential issues when they’re least expensive to address. This proactive approach reduces both development costs and time-to-market for reliable systems.

Implementing Component-Based GenAI Workflows

Practical implementation of component-based GenAI workflows requires methodical decomposition of applications into steps that can be evaluated. This process begins with identifying core functions, then establishing clear responsibilities and interfaces for each component.

Effective breakdown balances granularity with practicality. Each component should have a single responsibility without creating excessive interaction overhead. Uber’s GenAI Gateway demonstrates this through a unified service layer handling 60+ LLM use cases. By mirroring OpenAI’s API interface, they created standardized endpoints that separate integration logic from application business logic.

Well-designed interfaces specify both data formats and semantic requirements. Microsoft’s Azure Copilot uses RESTful APIs between components like its Knowledge Service (document chunking) and LLM processors. This enables independent development while ensuring components exchange properly structured, semantically valid data.

Components and evaluations should be versioned together for traceable evolution. Uber’s approach allows centralized model upgrades – adding GPT-4V required only gateway adjustments rather than client changes. This containment of version impacts prevents system-wide disruptions.

Agentic components require constrained decision boundaries. Microsoft implements extensible plugins where each Azure service team builds domain-specific “chat handlers.” These predefined operations maintain control while enabling specialized functionality.

Sophisticated fallback mechanisms become possible with component isolation. Uber’s gateway implements automated model fallbacks, switching to internal models when external providers fail. This graceful degradation maintains service continuity without compromising entire workflows.
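
A gateway-level fallback of this kind reduces to a short loop over an ordered list of providers; the provider objects and their complete method are assumptions for illustration.

def complete_with_fallback(prompt, providers, retries_per_provider=2):
    # providers: ordered preference list, e.g. [external_llm_client, internal_model_client],
    # where each object exposes a complete(prompt) method (hypothetical interface).
    last_error = None
    for provider in providers:
        for _ in range(retries_per_provider):
            try:
                return provider.complete(prompt)
            except Exception as exc:          # timeout, rate limit, provider outage, ...
                last_error = exc
    raise RuntimeError("all providers failed") from last_error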

Microsoft’s golden dataset approach provides versioned benchmarking against 500+ validated question/answer pairs. Component updates are tested against this dataset before deployment, creating a closed feedback loop between evaluation and improvement.

Key challenges persist:

Initial Investment – Designing interfaces and evaluation frameworks requires upfront resources

Skill Gaps – Teams need both software engineering and AI expertise

Coordination Overhead – Inter-component communication adds complexity

Organizations must balance these against the benefits of maintainability and incremental improvement. As demonstrated by Uber’s gateway – now handling authentication, PII redaction, and monitoring across all LLM interactions – centralized components with clear contracts enable scalability while maintaining governance.

Practical Considerations

Implementing component-based GenAI workflows involves several practical considerations that influence their effectiveness in production environments.

Parcha discovered users preferred reliable “agent-on-rails” designs over fully autonomous systems after their initial agent approach proved too unpredictable. RealChar implemented a deterministic event-driven pipeline for AI phone calls, achieving low latency through fixed processing cycles rather than free-form agent architectures.

The organizational implications of component-based architecture extend beyond technical considerations. PagerDuty formed a centralized LLM service team that enabled four new AI features in two months by standardizing infrastructure across product teams. This mirrors how companies established dedicated data platform teams during earlier tech waves.

Organizations with established machine learning infrastructure have a significant advantage when implementing component-based GenAI systems. Many foundational MLOps capabilities transfer directly to LLMOps with minimal adaptation. For example, existing model registry systems can be extended to track LLM versions and their performance metrics. Data pipeline orchestrators that manage traditional ML workflows can be repurposed to coordinate GenAI component execution. Monitoring systems already watching for ML model drift can be adapted to detect LLM performance degradation.

Leading organizations have found that reusing these battle-tested MLOps components accelerates GenAI adoption while maintaining consistent governance and operational standards. Rather than building parallel infrastructure, enterprise companies have extended their ML platforms to accommodate the unique needs of LLMs, preserving the investment in tooling while adapting to new requirements.

Resource allocation represents another practical consideration. Component-based architectures require investment in infrastructure for component orchestration, interface management, and comprehensive evaluation. These investments compete with feature development and other organizational priorities. Successful implementation requires executive support based on understanding the long-term benefits of maintainable, evaluatable systems over short-term feature delivery.

Building for the Future

Component-based, evaluated workflows provide a foundation for sustainable GenAI development that extends beyond current capabilities. This approach positions organizations to incorporate emerging technologies without wholesale system replacement.

The field of generative AI continues to evolve rapidly, with new model architectures, specialized models, and improved techniques emerging regularly. Component-based systems can integrate these advances incrementally, replacing individual components as better alternatives become available. This adaptability provides significant advantage in a rapidly evolving field, allowing organizations to benefit from technological progress without disruptive rebuilding.

The reliability advantage of evaluated components becomes increasingly important as GenAI applications address critical business functions. Organizations that implement systematic evaluation establish quantitative evidence of system performance, supporting both internal confidence and external trust. This evidence-based approach helps organizations navigate regulatory requirements, customer expectations, and internal governance. As regulatory scrutiny of AI systems increases, the ability to demonstrate systematic evaluation and quality assurance will become a competitive differentiator.

Component evaluation enables continuous, data-driven improvement by providing detailed performance insights. Rather than relying on broad assessments or anecdotal feedback, teams can analyze component-level metrics to identify specific improvement opportunities. This targeted approach supports efficient resource allocation, directing effort toward areas with measurable impact.

Organizations should assess their current GenAI implementations through the lens of componentization and systematic evaluation. This assessment might examine several questions: Are system responsibilities clearly divided into evaluable components? Do explicit interfaces exist between these components? Are evaluation metrics defined at component, step, and workflow levels? Does the architecture support incremental improvement?

The transition from impressive demonstrations to reliable production systems ultimately requires both technical architecture and organizational commitment. Component-based workflows with systematic evaluation provide the technical foundation, while organizational priorities determine whether this foundation supports sustainable development or merely adds complexity. Organizations that commit to this approach—investing in component design, interface definition, and comprehensive evaluation—position themselves to deliver not just impressive demonstrations but dependable systems worthy of trust with consequential decisions.