Build reliable AI systems with Automated Reasoning on Amazon Bedrock …

Enterprises in regulated industries often need mathematical certainty that every AI response complies with established policies and domain knowledge. Regulated industries can’t use traditional quality assurance methods that test only a statistical sample of AI outputs and make probabilistic assertions about compliance. When we launched Automated Reasoning checks in Amazon Bedrock Guardrails in preview at AWS re:Invent 2024, it offered a novel solution by applying formal verification techniques to systematically validate AI outputs against encoded business rules and domain knowledge. These techniques make the validation output transparent and explainable.
Automated Reasoning checks are being used in workflows across industries. Financial institutions verify AI-generated investment advice meets regulatory requirements with mathematical certainty. Healthcare organizations make sure patient guidance aligns with clinical protocols. Pharmaceutical companies confirm marketing claims are supported by FDA-approved evidence. Utility companies validate emergency response protocols during disasters, while legal departments verify AI tools capture mandatory contract clauses.
With the general availability of Automated Reasoning checks, we have increased document handling capacity and added new features like scenario generation, which automatically creates examples that demonstrate your policy rules in action. With the enhanced test management system, domain experts can build, save, and automatically execute comprehensive test suites to maintain consistent policy enforcement across model and application versions.
In the first part of this two-part technical deep dive, we’ll explore the technical foundations of Automated Reasoning checks in Amazon Bedrock Guardrails and demonstrate how to implement this capability to establish mathematically rigorous guardrails for generative AI applications.
In this post, you will learn how to:

Understand the formal verification techniques that enable mathematical validation of AI outputs
Create and refine an Automated Reasoning policy from natural language documents
Design and implement effective test cases to validate AI responses against business rules
Apply policy refinement through annotations to improve policy accuracy
Integrate Automated Reasoning checks into your AI application workflow using Bedrock Guardrails, following AWS best practices to maintain high confidence in generated content

By following this implementation guide, you can systematically help prevent factual inaccuracies and policy violations before they reach end users, a critical capability for enterprises in regulated industries that require high assurance and mathematical certainty in their AI systems.
Core capabilities of Automated Reasoning checks
In this section, we explore the capabilities of Automated Reasoning checks, including the console experience for policy development, document processing architecture, logical validation mechanisms, test management framework, and integration patterns. Understanding these core components will provide the foundation for implementing effective verification systems for your generative AI applications.
Console experience
The Amazon Bedrock Automated Reasoning checks console organizes policy development into logical sections, guiding you through the creation, refinement, and testing process. The interface includes clear rule identification with unique IDs and direct use of variable names within the rules, making complex policy structures understandable and manageable.
Document processing capacity
Document processing supports up to 120K tokens (approximately 100 pages), so you can encode substantial knowledge bases and complex policy documents into your Automated Reasoning policies. Organizations can incorporate comprehensive policy manuals, detailed procedural documentation, and extensive regulatory guidelines. With this capacity you can work with complete documents within a single policy.
Validation capabilities
The validation API includes ambiguity detection that identifies statements requiring clarification, counterexamples for invalid findings that demonstrate why validation failed, and satisfiable findings with both valid and invalid examples to help understand boundary conditions. These features provide context around validation results, to help you understand why specific responses were flagged and how they can be improved. The system can also express its confidence in translations between natural language and logical structures to set appropriate thresholds for specific use cases.
Iterative feedback and refinement process
Automated Reasoning checks provide detailed, auditable findings that explain why a response failed validation, to support an iterative refinement process instead of simply blocking non-compliant content. This information can be fed back to your foundation model, allowing it to adjust responses based on specific feedback until they comply with policy rules. This approach is particularly valuable in regulated industries where factual accuracy and compliance must be mathematically verified rather than estimated.

Finding types using a policy example
Consider the example of a policy for determining days off. When implementing Automated Reasoning checks, a policy consists of both a schema of variables (defining concepts like employee type, years of service, and available leave days) and a set of logical rules that establish relationships between these variables (such as eligibility conditions for different types of time off). During validation, the system uses this schema and rule structure to evaluate whether foundation model responses comply with your defined policy constraints.
We want to validate the following input that a user submitted to the foundation model (FM) powered application, along with the generated output.

Input:
“Is Thursday a day off if it’s a public holiday?”

Output:
“Yes, Thursday would be a day off if it’s a public holiday, since all public holidays are considered days off regardless of the day of the week.”

Premises are the statements from which a conclusion is drawn, and the claim is the assertion whose truth is being checked. In this example, the inferred premises are that day is equal to “Thursday” and is_public_holiday is true, and the claim is that is_day_off is true.
This reasoning follows the Automated Reasoning policy:

Days Off Policy:
– All weekend days (Saturday and Sunday) are days off
– All public holidays are days off
– A day is considered a day off if it is either a weekend day or a public holiday
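
To make the premises, claim, and policy rules above concrete, the following is a minimal sketch in Python (for illustration only; it is not the representation Automated Reasoning checks uses internally):

# Illustrative encoding of the days-off policy; not the service's internal format.
WEEKEND_DAYS = {"Saturday", "Sunday"}

def is_weekend(day: str) -> bool:
    # All weekend days (Saturday and Sunday) are days off
    return day in WEEKEND_DAYS

def is_day_off(day: str, is_public_holiday: bool) -> bool:
    # A day is a day off if it is either a weekend day or a public holiday
    return is_weekend(day) or is_public_holiday

# Premises from the example: day = "Thursday", is_public_holiday = True
# Claim: is_day_off = True -- consistent with the policy, so the claim holds.
assert is_day_off("Thursday", is_public_holiday=True)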

With general availability, Automated Reasoning checks now produces seven distinct finding types that offer precise insights into the validation process of an FM-generated response:

VALID: Confirms the Input and Output fully align with policy rules, providing confidence that the information in the generated response is correct according to your defined policies. The claims are inferred from the premises and validated by the AR policy to be true, and there are no alternative answers that contradict these claims.

Input:
Today is Sunday, do I have the day off?

Output:
Yes, Sunday is a weekend day, you have the day off.

Premises:
day: Sunday
is_weekend: true

Claim:
is_day_off: true

Explanation:
Given the assigned variables and policy rules that weekends are days off, the only possible answer is that it is a day off.

SATISFIABLE: Recognizes that Input and Output could be true or false depending on specific assumptions. These findings help you understand boundary conditions where a response is valid only under certain conditions, so that you can decide whether those assumptions are reasonable in your context. If the required assumptions are false, then an alternative claim consistent with the premises may exist.

Input:
Today is Tuesday, do I have the day off?

Output:
Yes
Premises:
day: Tuesday

Claim:
is_day_off: true

Explanation:
If we assume is_public_holiday=true, this is correct, but if we assume is_public_holiday=false, the answer would be incorrect since Tuesday is not a weekend.

INVALID: Identifies that the Input and Output contain policy inaccuracies or factual errors, enhanced with counterexamples that explicitly demonstrate why the validation failed. The claims are not implied by the premises and AR policy, and there exist different claims that would be consistent with the premises and AR policy.

Input:
Today is Sunday, do I have the day off?

Output:
No you do not have the day off.

Premises:
day: Sunday

Claim:
is_day_off: false

Explanation:
This is invalid because the policy states weekends are days off. The correct claim would be is_day_off = true, since Sunday is a weekend day.

IMPOSSIBLE: Indicates when no valid Claims can be generated because the premises conflict with the AR policy or the policy contains internal contradictions. This finding occurs when the constraints defined in the policy create a logical impossibility.

Input:
Today is Sunday and not a weekend day, do I have the day off?

Output:
Yes

Premises:
day: Sunday
is_weekend: false

Claim:
is_day_off: true

Explanation:
Sunday is always a weekend day, so the premises contain a contradiction. No valid claim can exist given these contradictory premises.

NO_TRANSLATIONS: Occurs when the Input and Output contain no information that can be translated into relevant data for the AR policy evaluation. This typically happens when the text is entirely unrelated to the policy domain or contains no actionable information.

Input:
How many legs does the average cat have?

Output:
Less than 4

Explanation:
The AR policy is about days off, so there is no relevant translation for content about cats. The input has no connection to the policy domain.

TRANSLATION_AMBIGUOUS: Identifies when ambiguity in the Input and Output prevents definitive translation into logical structures. This finding suggests that additional context or follow-up questions may be needed to proceed with validation.

Input:
I won! Today is Winsday, do I get the day off?

Output:
Yes, you get the day off!

Explanation:
“Winsday” is not a recognized day in the AR policy, creating ambiguity. Automated reasoning cannot proceed without clarification of what day is being referenced.

TOO_COMPLEX: Signals that the Input and Output contains too much information to process within latency limits. This finding occurs with extremely large or complex inputs that exceed the system’s current processing capabilities.

Input:
Can you tell me which days are off for all 50 states plus territories for the next 3 years, accounting for federal, state, and local holidays? Include exceptions for floating holidays and special observances.

Output:
I have analyzed the holiday calendars for all 50 states. In Alabama, days off include…

Explanation:
This use case contains too many variables and conditions for AR checks to process while maintaining accuracy and response time requirements.

Scenario generation
You can now generate scenarios directly from your policy, which creates test samples that conform to your policy rules, helps identify edge cases, and supports verification of your policy’s business logic implementation. With this capability policy authors can see concrete examples of how their rules work in practice before deployment, reducing the need for extensive manual testing. The scenario generation also highlights potential conflicts or gaps in policy coverage that might not be apparent from examining individual rules.
Test management system
A new test management system allows you to save and annotate policy tests, build test libraries for consistent validation, execute tests automatically to verify policy changes, and maintain quality assurance across policy versions. This system includes versioning capabilities that track test results across policy iterations, making it easier to identify when changes might have unintended consequences. You can now also export test results for integration into existing quality assurance workflows and documentation processes.
Expanded options with direct guardrail integration
Automated Reasoning checks now integrates with Amazon Bedrock APIs, enabling validation of AI generated responses against established policies throughout complex interactions. This integration extends to both the Converse and RetrieveAndGenerate actions, allowing policy enforcement across different interaction modalities. Organizations can configure validation confidence thresholds appropriate to their domain requirements, with options for stricter enforcement in regulated industries or more flexible application in exploratory contexts.
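For example, a guardrail with an attached Automated Reasoning policy can be referenced from the Converse API. The following boto3 sketch shows the general pattern; the model ID, guardrail identifier, and version are placeholders you would replace with values from your own account:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder identifiers -- replace with a model available in your Region and
# the guardrail (with an attached Automated Reasoning policy) from your account.
response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-7-sonnet-20250219-v1:0",
    messages=[{"role": "user", "content": [{"text": "Is Thursday a day off if it's a public holiday?"}]}],
    guardrailConfig={
        "guardrailIdentifier": "your-guardrail-id",
        "guardrailVersion": "1",
        "trace": "enabled",  # include trace output so you can inspect guardrail findings
    },
)
print(response["output"]["message"]["content"][0]["text"])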
Solution – AI-powered hospital readmission risk assessment system
Now that we have explained the capabilities of Automated Reasoning checks, let’s work through a solution by considering the use case of an AI-powered hospital readmission risk assessment system. This AI system automates hospital readmission risk assessment by analyzing patient data from electronic health records to classify patients into risk categories (Low, Intermediate, High) and recommends personalized intervention plans based on CDC-style guidelines. The objective of this AI system is to reduce the 30-day hospital readmission rates by supporting early identification of high-risk patients and implementing targeted interventions. This application is an ideal candidate for Automated Reasoning checks because the healthcare provider prioritizes verifiable accuracy and explainable recommendations that can be mathematically proven to comply with medical guidelines, supporting both clinical decision-making and satisfying the strict auditability requirements common in healthcare settings.
Note: The referenced policy document is an example created for demonstration purposes only and should not be used as an actual medical guideline or for clinical decision-making.
Prerequisites
To use Automated Reasoning checks in Amazon Bedrock, verify you have met the following prerequisites:

An active AWS account
Confirmation of AWS Regions where Automated Reasoning checks is available
Appropriate IAM permissions to create, test, and invoke Automated Reasoning policies (Note: The IAM policy should be fine-grained and limited to necessary resources using proper ARN patterns for production usage):

{
    "Sid": "OperateAutomatedReasoningChecks",
    "Effect": "Allow",
    "Action": [
        "bedrock:CancelAutomatedReasoningPolicyBuildWorkflow",
        "bedrock:CreateAutomatedReasoningPolicy",
        "bedrock:CreateAutomatedReasoningPolicyTestCase",
        "bedrock:CreateAutomatedReasoningPolicyVersion",
        "bedrock:CreateGuardrail",
        "bedrock:DeleteAutomatedReasoningPolicy",
        "bedrock:DeleteAutomatedReasoningPolicyBuildWorkflow",
        "bedrock:DeleteAutomatedReasoningPolicyTestCase",
        "bedrock:ExportAutomatedReasoningPolicyVersion",
        "bedrock:GetAutomatedReasoningPolicy",
        "bedrock:GetAutomatedReasoningPolicyAnnotations",
        "bedrock:GetAutomatedReasoningPolicyBuildWorkflow",
        "bedrock:GetAutomatedReasoningPolicyBuildWorkflowResultAssets",
        "bedrock:GetAutomatedReasoningPolicyNextScenario",
        "bedrock:GetAutomatedReasoningPolicyTestCase",
        "bedrock:GetAutomatedReasoningPolicyTestResult",
        "bedrock:InvokeAutomatedReasoningPolicy",
        "bedrock:ListAutomatedReasoningPolicies",
        "bedrock:ListAutomatedReasoningPolicyBuildWorkflows",
        "bedrock:ListAutomatedReasoningPolicyTestCases",
        "bedrock:ListAutomatedReasoningPolicyTestResults",
        "bedrock:StartAutomatedReasoningPolicyBuildWorkflow",
        "bedrock:StartAutomatedReasoningPolicyTestWorkflow",
        "bedrock:UpdateAutomatedReasoningPolicy",
        "bedrock:UpdateAutomatedReasoningPolicyAnnotations",
        "bedrock:UpdateAutomatedReasoningPolicyTestCase",
        "bedrock:UpdateGuardrail"
    ],
    "Resource": [
        "arn:aws:bedrock:${aws:region}:${aws:accountId}:automated-reasoning-policy/*",
        "arn:aws:bedrock:${aws:region}:${aws:accountId}:guardrail/*"
    ]
}

Key service limits: Be aware of the service limits when implementing Automated Reasoning checks.
With Automated Reasoning checks, you pay based on the amount of text processed. For more information, see Amazon Bedrock pricing.

Use case and policy dataset overview
The full policy document used in this example can be accessed from the Automated Reasoning GitHub repository. Familiarity with the policy is helpful for validating the results from Automated Reasoning checks. Moreover, refining the policy created by Automated Reasoning is key to achieving a soundness of over 99%.
Let’s review the main details of the sample medical policy that we are using in this post. As we start validating responses, it is helpful to verify it against the source document.

Risk assessment and stratification: Healthcare facilities must implement a standardized risk scoring system based on demographic, clinical, utilization, laboratory, and social factors, with patients classified into Low (0-3 points), Intermediate (4-7 points), or High Risk (8+ points) categories.
Mandatory interventions: Each risk level requires specific interventions, with higher risk levels incorporating lower-level interventions plus additional measures, while certain conditions trigger automatic High Risk classification regardless of score.
Quality metrics and compliance: Facilities must achieve specific completion rates including 95%+ risk assessment within 24 hours of admission and 100% completion before discharge, with High Risk patients requiring documented discharge plans.
Clinical oversight: While the scoring system is standardized, attending physicians maintain override authority with proper documentation and approval from the discharge planning coordinator.
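
The scoring thresholds described in the first item above can be summarized in a short sketch (for illustration only; the point values and cutoffs come from the sample policy document):

def risk_category(total_points: int, automatic_high_risk: bool = False) -> str:
    # Certain conditions trigger automatic High Risk classification regardless of score.
    if automatic_high_risk:
        return "HIGH_RISK"
    if total_points <= 3:
        return "LOW_RISK"           # 0-3 points
    if total_points <= 7:
        return "INTERMEDIATE_RISK"  # 4-7 points
    return "HIGH_RISK"              # 8+ points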

Create and test an Automated Reasoning policy using the Amazon Bedrock console
The first step is to encode your knowledge—in this case, the sample medical policy—into an Automated Reasoning policy. Complete the following steps to create an Automated Reasoning policy:

On the Amazon Bedrock console, choose Automated Reasoning under Build in the navigation pane.
Choose Create policy.

Provide a policy name and policy description.

Add source content from which Automated Reasoning will generate your policy. You can either upload a document (PDF or TXT) or enter text as the ingestion method.
Include a description of the intent of the Automated Reasoning policy you’re creating. The intent is optional but provides valuable information to the large language models (LLMs) that translate the natural language document into a set of rules that can be used for mathematical verification. For the sample policy, you can use the following intent: This logical policy validates claims about the clinical practice guideline providing evidence-based recommendations for healthcare facilities to systematically assess and mitigate hospital readmission risk through a standardized risk scoring system, risk-stratified interventions, and quality assurance measures, with the goal of reducing 30-day readmissions by 15-23% across participating healthcare systems.

Following is an example patient profile and the corresponding classification.

<Patient Profile>Age: 82 years

Length of stay: 10 days

Has heart failure

One admission within last 30 days

Lives alone without caregiver

<Classification> High Risk
Once the policy has been created, we can inspect the definitions to see which rules, variables, and types have been created from the natural language document to represent the knowledge as logic.

You may see differences in the number of rules, variables, and types generated compared to what is shown in this example. This is due to the non-deterministic processing of the supplied document. To address this, the recommended guidance is to perform a human-in-the-loop review of the generated information in the policy before using it with other systems.
Exploring the Automated Reasoning policy definitions
A Variable in automated reasoning for policy documents is a named container that holds a specific type of information (like Integer, Real Number, or Boolean) and represents a distinct concept or measurement from the policy. Variables act as building blocks for rules and can be used to track, measure, and evaluate policy requirements. From the image below, we can see examples like admissionsWithin30Days (an Integer variable tracking previous hospital admissions), ageRiskPoints (an Integer variable storing age-based risk scores), and conductingMonthlyHighRiskReview (a Boolean variable indicating whether monthly reviews are being performed). Each variable has a clear description of its purpose and the specific policy concept it represents, making it possible to use these variables within rules to enforce policy requirements and measure compliance. Issues are also highlighted, such as variables that are unused; it is particularly important to verify which concepts these variables represent and to identify whether rules are missing.

In the Definitions, we see ‘Rules’, ‘Variables’, and ‘Types’. A rule is an unambiguous logical statement that Automated Reasoning extracts from your source document. Consider this simple rule that has been created: followupAppointmentsScheduledRate is at least 90.0. This rule was created from Section III.A, Process Measures, which states that healthcare facilities should monitor various process indicators, requiring that the rate of follow-up appointments scheduled prior to discharge be 90% or higher.
Let’s look at a more complex rule:

comorbidityRiskPoints is equal to (ite hasDiabetesMellitus 1 0) + (ite hasHeartFailure 2 0) + (ite hasCOPD 1 0) + (ite hasChronicKidneyDisease 1 0)

Where “ite” means “if-then-else”.

This rule calculates a patient’s risk points based on their existing medical conditions (comorbidities) as specified in the policy document. When evaluating a patient, the system checks for four specific conditions: diabetes mellitus of any type (worth 1 point), heart failure of any classification (worth 2 points), chronic obstructive pulmonary disease (worth 1 point), and chronic kidney disease stages 3-5 (worth 1 point). The rule adds these points together by using boolean logic – meaning it multiplies each condition (represented as true=1 or false=0) by its assigned point value, then sums all values to generate a total comorbidity risk score. For instance, if a patient has both heart failure and diabetes, they would receive 3 total points (2 points for heart failure plus 1 point for diabetes). This comorbidity score then becomes part of the larger risk assessment framework used to determine the patient’s overall readmission risk category.
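Written as a short function (for illustration only), the same calculation makes the if-then-else pattern explicit:

def comorbidity_risk_points(has_diabetes_mellitus: bool, has_heart_failure: bool,
                            has_copd: bool, has_chronic_kidney_disease: bool) -> int:
    # Each "ite" term contributes its point value when the condition is true, 0 otherwise.
    return ((1 if has_diabetes_mellitus else 0)
            + (2 if has_heart_failure else 0)
            + (1 if has_copd else 0)
            + (1 if has_chronic_kidney_disease else 0))

# Heart failure plus diabetes, as in the example above: 2 + 1 = 3 points
assert comorbidity_risk_points(True, True, False, False) == 3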

The Definitions also include custom variable types. Custom variable types, also known as enumerations (ENUMs), are specialized data structures that define a fixed set of allowable values for specific policy concepts. These custom types maintain consistency and accuracy in data collection and rule enforcement by limiting values to predefined options that align with the policy requirements. In the sample policy, we can see that four custom variable types have been identified:

AdmissionType: This defines the possible types of hospital admissions (MEDICAL, SURGICAL, MIXED_MEDICAL_SURGICAL, PSYCHIATRIC) that determine whether a patient is eligible for the readmission risk assessment protocol.
HealthcareFacilityType: This specifies the types of healthcare facilities (ACUTE_CARE_HOSPITAL_25PLUS, CRITICAL_ACCESS_HOSPITAL) where the readmission risk assessment protocol may be implemented.
LivingSituation: This categorizes a patient’s living arrangement (LIVES_ALONE_NO_CAREGIVER, LIVES_ALONE_WITH_CAREGIVER) which is a critical factor in determining social support and risk levels.
RiskCategory: This defines the three possible risk stratification levels (LOW_RISK, INTERMEDIATE_RISK, HIGH_RISK) that can be assigned to a patient based on their total risk score.

An important step in improving soundness (the accuracy of Automated Reasoning checks when it returns VALID) is policy refinement: making sure that the captured rules, variables, and types best represent the source of truth. To do this, we will head over to the test suite and explore how to add tests, generate tests, and use the test results to apply annotations that update the rules.
Testing the Automated Reasoning policy and policy refinement
The test suite in Automated Reasoning provides test capabilities for two purposes. First, we want to run different scenarios to exercise the various rules and variables in the Automated Reasoning policy and refine them so that they accurately represent the ground truth; this policy refinement step is important for improving the soundness of Automated Reasoning checks. Second, we want metrics to understand how well Automated Reasoning checks performs for the defined policy and use case. To do so, we can open the Tests tab on the Automated Reasoning console.

Test samples can be added manually by using the Add button. To scale up the testing, we can generate tests from the policy rules. This testing approach helps verify both the semantic correctness of your policy (making sure rules accurately represent intended policy constraints) and the natural language translation capabilities (confirming the system can correctly interpret the language your users will use when interacting with your application). In the image below, we can see a generated test sample; before adding it to the test suite, the SME should indicate whether this test sample is possible (thumbs up) or not possible (thumbs down). The test sample can then be saved to the test suite.

Once the test sample is created, it is possible to run this test sample alone or run all the test samples in the test suite by choosing Validate all tests. Upon execution, we see that this test passed successfully.

You can manually create tests by providing an input (optional) and output. These are translated into logical representations before validation occurs.
How translation works:
Translation converts your natural language tests into logical representations that can be mathematically verified against your policy rules:

Automated Reasoning Checks uses multiple LLMs to translate your input/output into logical findings
Each translation receives a confidence vote indicating translation quality
You can set a confidence threshold to control which findings are validated and returned

Confidence threshold behavior:
The confidence threshold controls which translations are considered reliable enough for validation, balancing strictness with coverage:

Higher threshold: Greater certainty in translation accuracy but also higher chance of no findings being validated.
Lower threshold:  Greater chance of getting validated findings returned, but potentially less certain translations
Threshold = 0: All findings are validated and returned regardless of confidence
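
Conceptually, the threshold behavior above works like the following simplified sketch (an illustration only, not the service’s implementation):

def validated_findings(findings, threshold):
    """findings: list of (finding, confidence) pairs produced by translation.

    A threshold of 0 keeps every finding; higher thresholds keep only
    translations the system is more confident about. If nothing passes,
    the result is reported as translation ambiguous.
    """
    kept = [finding for finding, confidence in findings if confidence >= threshold]
    return kept if kept else "TRANSLATION_AMBIGUOUS"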

Ambiguous results:
When no finding meets your confidence threshold, Automated Reasoning Checks returns “Translation Ambiguous,” indicating uncertainty in the content’s logical interpretation.

The test case we will create and validate is:

Input:
Patient A
Age: 82
Length of stay: 16 days
Diabetes Mellitus: Yes
Heart Failure: Yes
Chronic Kidney Disease: Yes
Hemoglobin: 9.2 g/dL
eGFR: 28 ml/min/1.73m^2
Sodium: 146 mEq/L
Living Situation: Lives alone without caregiver
Has established PCP: No
Insurance Status: Medicaid
Admissions within 30 days: 1

Output:
Final Classification: INTERMEDIATE RISK

We see that this test passes when we run it: the result of INVALID matches our expected result. Additionally, Automated Reasoning checks shows that 12 rules contradicted the premises and claims, which led to the output of the test sample being INVALID.

Let’s examine some of the visible contradicting rules:

Age risk: Patient is 82 years old

Rule triggers: “if patientAge is at least 80, then ageRiskPoints is equal to 3”

Length of stay risk: Patient stayed 16 days

Rule triggers: “if lengthOfStay is greater than 14, then lengthOfStayRiskPoints is equal to 3”

Comorbidity risk: Patient has multiple conditions

Rule calculates: “comorbidityRiskPoints = (hasDiabetesMellitus × 1) + (hasHeartFailure × 2) + (hasCOPD × 1) + (hasChronicKidneyDisease × 1)”

Utilization risk: Patient has 1 admission within 30 days

Rule triggers: “if admissionsWithin30Days is at least 1, then utilizationRiskPoints is at least 3”

Laboratory risk: Patient’s eGFR is 28

Rule triggers: “if eGFR is less than 30.0, then laboratoryRiskPoints is at least 2”

Taken together, these rules produce a total risk score that contradicts the claimed INTERMEDIATE RISK classification. These contradictions show us which rules were used to determine that the output of the test is INVALID.
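
To see why, we can roughly tally the points implied by the rules quoted above (an illustrative calculation only; the full policy includes additional factors, such as laboratory, social, and insurance criteria, that would only add points for this patient):

# Rough tally of Patient A's risk points using only the rules quoted above.
age_points = 3 if 82 >= 80 else 0             # patientAge >= 80 -> 3 points
length_of_stay_points = 3 if 16 > 14 else 0   # lengthOfStay > 14 -> 3 points
comorbidity_points = 1 + 2 + 1                # diabetes + heart failure + chronic kidney disease
utilization_points = 3                        # at least 3 (one admission within 30 days)
laboratory_points = 2                         # at least 2 (eGFR < 30)

total = (age_points + length_of_stay_points + comorbidity_points
         + utilization_points + laboratory_points)
print(total)  # 15 -> 8+ points means HIGH RISK, so the INTERMEDIATE claim is INVALID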

Let’s add another test to the test suite, as shown in the screenshot below:

Input:
Patient profile
Age: 83
Length of stay: 16 days
Diabetes Mellitus: Yes
Heart Failure: Yes
Chronic Kidney Disease: Yes
Hemoglobin: 9.2 g/dL
eGFR: 28 ml/min/1.73m^2
Sodium: 146 mEq/L
Living Situation: Lives alone without caregiver
Has established PCP: No
Insurance Status: Medicaid
Admissions within 30 days: 1
Admissions within 90 days: 2

Output:
Final Classification: HIGH RISK

When this test is executed, we see that each of the patient details is extracted as a premise to validate the claim that the risk of readmission is high. We see that 8 rules have been applied to verify this claim. The key rules and their validations include:

Age risk: Validates that patient age ≥ 80 contributes 3 risk points
Length of stay risk: Confirms that stay >14 days adds 3 risk points
Comorbidity risk: Calculated based on presence of Diabetes Mellitus, Heart Failure, Chronic Kidney Disease
Utilization risk: Evaluates admissions history
Laboratory risk: Evaluates risk based on Hemoglobin level of 9.2 and eGFR of 28

Each premise was evaluated as true, with multiple risk factors present (advanced age, extended stay, multiple comorbidities, concerning lab values, living alone without caregiver, and lack of PCP), supporting the overall VALID finding for this HIGH RISK assessment.

Moreover, the Automated Reasoning engine performed an extensive validation of this test sample using 93 different assignments to increase confidence that the HIGH RISK classification is correct. Various related rules from the Automated Reasoning policy are used to validate the sample against 93 different scenarios and variable combinations. In this manner, Automated Reasoning checks confirms that there is no possible situation under which this patient’s HIGH RISK classification could be invalid. This thorough verification process affirms the reliability of the risk assessment for this elderly patient with multiple chronic conditions and complex care needs.

In the event of a test sample failure, the 93 assignments would serve as an important diagnostic tool, pinpointing specific variables and their interactions that conflict with the expected outcome, thereby enabling subject matter experts (SMEs) to analyze the relevant rules and their relationships to determine whether adjustments are needed in either the clinical logic or the risk assessment criteria. In the next section, we will look at policy refinement and how SMEs can apply annotations to improve and correct the rules, variables, and custom types of the Automated Reasoning policy.
Policy refinement through annotations
Annotations provide a powerful improvement mechanism for Automated Reasoning policies when tests fail to produce expected results. Through annotations, SMEs can systematically refine policies by:

Correcting problematic rules by modifying their logic or conditions
Adding missing variables essential to the policy definition
Updating variable descriptions for greater precision and clarity
Resolving translation issues where original policy language was ambiguous
Deleting redundant or conflicting elements from the policy

This iterative process of testing, annotating, and updating creates increasingly robust policies that accurately encode domain expertise. As shown in the figure below, annotations can be applied to modify various policy elements, after which the refined policy can be exported as a JSON file for deployment.

In the following figure, we can see how annotations are applied and rules are deleted in the policy. Similarly, additions and updates can be made to rules, variables, or the custom types.

After the subject matter expert has refined the Automated Reasoning policy through testing, applying annotations, and validating the rules, the policy can be exported as a JSON file.

Using Automated Reasoning checks at inference
To use Automated Reasoning checks with the created policy, we can now navigate to Amazon Bedrock Guardrails and create a new guardrail by entering the name, description, and the messaging that will be displayed when the guardrail intervenes and blocks a prompt or an output from the AI system.

Now, we can attach Automated Reasoning checks by using the Enable Automated Reasoning policy toggle. We can set a confidence threshold, which determines how strictly the policy should be enforced. This threshold ranges from 0.00 to 1.00, with 1.00 being the default and most stringent setting. Each guardrail can accommodate up to two separate Automated Reasoning policies for enhanced validation flexibility. In the following figure, we are attaching the draft version of the medical policy related to patient hospital readmission risk assessment.

Now we can create the guardrail. Once you’ve established the guardrail and linked your automated reasoning policies, verify your setup by reviewing the guardrail details page to confirm all policies are properly attached.
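
You can also exercise the guardrail programmatically. The following boto3 sketch uses the ApplyGuardrail API with placeholder guardrail identifier, version, and sample content; the assessments in the response are where the guardrail’s evaluation details, including the Automated Reasoning findings, are expected to appear when an Automated Reasoning policy is attached:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder guardrail identifier and version -- replace with your own.
response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="your-guardrail-id",
    guardrailVersion="DRAFT",
    source="OUTPUT",  # validate a model response (use "INPUT" for user prompts)
    content=[{"text": {"text": "Final Classification: INTERMEDIATE RISK"}}],
)

# "action" indicates whether the guardrail intervened; "assessments" carries
# the evaluation details for the validated content.
print(response["action"])
print(response["assessments"])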

Clean up
When you’re finished with your implementation, clean up your resources by deleting the guardrail and automated reasoning policies you created. Before deleting a guardrail, be sure to disassociate it from all resources or applications that use it.
Conclusion
In this first part of our blog, we explored how Automated Reasoning checks in Amazon Bedrock Guardrails help maintain the reliability and accuracy of generative AI applications through mathematical verification. You can use increased document processing capacity, advanced validation mechanisms, and comprehensive test management features to validate AI outputs against business rules and domain knowledge. This approach addresses key challenges facing enterprises deploying generative AI systems, particularly in regulated industries where factual accuracy and policy compliance are essential. Our hospital readmission risk assessment demonstration shows how this technology supports the validation of complex decision-making processes, helping transform generative AI into systems suitable for critical business environments. You can use these capabilities through both the AWS Management Console and APIs to establish quality control processes for your AI applications.
To learn more and build secure and safe AI applications, see the technical documentation and the GitHub code samples, or access the Amazon Bedrock console.

About the authors
Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.
Bharathi Srinivasan is a Generative AI Data Scientist at the AWS Worldwide Specialist Organization. She works on developing solutions for Responsible AI, focusing on algorithmic fairness, veracity of large language models, and explainability. Bharathi guides internal teams and AWS customers on their responsible AI journey. She has presented her work at various learning conferences.
Nafi Diallo  is a Senior Automated Reasoning Architect at Amazon Web Services, where she advances innovations in AI safety and Automated Reasoning systems for generative AI applications. Her expertise is in formal verification methods, AI guardrails implementation, and helping global customers build trustworthy and compliant AI solutions at scale. She holds a PhD in Computer Science with research in automated program repair and formal verification, and an MS in Financial Mathematics from WPI.

Custom Intelligence: Building AI that matches your business DNA

In 2024, we launched the Custom Model Program within the AWS Generative AI Innovation Center to provide comprehensive support throughout every stage of model customization and optimization. Over the past two years, this program has delivered exceptional results by partnering with global enterprises and startups across diverse industries—including legal, financial services, healthcare and life sciences, software development, telecommunications, and manufacturing. These partnerships have produced tailored AI solutions that capture each organization’s unique data expertise, brand voice, and specialized business requirements. They operate more efficiently than off-the-shelf alternatives, delivering increased alignment and relevance with significant cost savings on inference operations.

As organizations mature past proof-of-concept projects and basic chatbots, we’re seeing increased adoption of advanced personalization and optimization strategies beyond prompt engineering and retrieval augmented generation (RAG). Our approach encompasses creating specialized models for specific tasks and brand alignment, distilling larger models into smaller, faster, more cost-effective versions, implementing deeper adaptations through mid-training modifications, and optimizing hardware and accelerators to increase throughput while reducing costs.
Strategic upfront investment pays dividends throughout a model’s production lifecycle, as demonstrated by Cosine AI’s results. Cosine AI is the developer of an AI developer platform and software engineering agent designed to integrate seamlessly into their users’ workflows. They worked with the Innovation Center to fine-tune Nova Pro, an Amazon Nova foundation model, using Amazon SageMaker AI for their AI engineering assistant, Genie, achieving remarkable results including a 5x increase in A/B testing capability, 10x faster developer iterations, and a 4x overall project speed improvement. The return on investment becomes even more compelling as companies transition toward agentic systems and workflows, where latency, task specificity, performance, and depth are critical and compound across complex processes.
In this post, we’ll share key learnings and actionable strategies for leaders looking to use customization for maximum ROI while avoiding common implementation pitfalls.
Five tips for maximizing value from training and tuning generative AI models
The Innovation Center recommends the following top tips to maximize value from training and tuning AI models:
1. Don’t start from a technical approach; work backwards from business goals
This may seem obvious, but after working with over a thousand customers, we’ve found that working backwards from business goals is a critical factor in why projects supported by the Innovation Center achieve a 65% production success rate, with some launching within 45 days. We apply this same strategy to every customization project by first identifying and prioritizing tangible business outcomes that a technical solution will drive. Success must be measurable and deliver real business value, helping avoid flashy experiments that end up sitting on a shelf instead of producing results. In the Custom Model Program, many customers initially approach us seeking specific technical solutions—such as jumping directly into model pre-training or continued pre-training—without having defined downstream use cases, data strategies, or evaluation plans. By starting with clear business objectives first, we make sure that technical decisions align with strategic goals and create meaningful impact for the organization.
2. Pick the right customization approach
Start with a baseline customization approach and exhaust simpler approaches before diving into deep model customization. The first question we ask customers seeking custom model development is “What have you already tried?” We recommend establishing this baseline with prompt engineering and RAG before exploring more complex techniques. While there’s a spectrum of model optimization approaches that can achieve higher performance, sometimes the simplest solution is the most effective. Once you establish this baseline, identify remaining gaps and opportunities to determine whether advancing to the next level makes strategic sense.

Customization options range from lightweight approaches like supervised fine-tuning to ground-up model development. We typically advise starting with lighter-weight solutions that require smaller amounts of data and compute, then progressing to more complex techniques only when specific use cases or remaining gaps justify the investment:

Supervised fine-tuning sharpens the model’s focus for specific use cases, for example delivering consistent customer service responses or adapting to your organization’s preferred phrasing, structure and reasoning patterns. Volkswagen, one of the world’s largest automobile manufacturers, achieved an “improvement in AI-powered brand consistency checks, increasing accuracy in identifying on-brand images from 55% to 70%,” notes Dr. Philip Trempler, Technical Lead AI & Cloud Engineering at Volkswagen Group Services.
Model efficiency and deployment tuning supports organizations like Robin AI, a leader in AI-powered legal contract technology, to create tailored models that speed up human verification. Organizations can also use techniques like quantization, pruning, and system optimizations to improve model performance and reduce infrastructure costs.
Reinforcement learning uses reward functions or preference data to align models to preferred behavior. This approach is often combined with supervised fine-tuning so organizations like Cosine AI can refine their models’ decision making to match organizational preferences.
Continued pre-training allows organizations like Athena RC, a leading research center in Greece, to build Greek-first foundation models that expand language capabilities beyond English. By continually pre-training large language models on extensive Greek data, Athena RC strengthens the models’ core understanding of the Greek language, culture, and usage, not just their domain knowledge. Their Meltemi-7B and Llama-Krikri-8B models demonstrate how continued pre-training and instruction tuning can create open, high-quality Greek models for applications across research, education, industry, and society.
Domain-specific foundation model development enables organizations like TGS, a leading energy data, insights, and technology provider, to build custom AI models from scratch, ideal for those with highly specialized requirements and substantial volume of proprietary data. TGS helps energy companies make smarter exploration and development decisions by solving some of the industry’s toughest challenges in understanding what lies beneath the Earth’s surface. TGS has enhanced its Seismic Foundation Models (SFMs) to more reliably detect underground geological structures—such as faults and reservoirs—that indicate potential oil and gas deposits. The benefit is clear: operators can reduce uncertainty, lower exploration costs, and make faster investment decisions.

Data quality and accessibility will be a major consideration in determining feasibility of each customization technique. Clean, high-quality data is essential both for model improvement and measuring progress. While some Innovation Center customers achieve performance gains with relatively smaller volumes of fine-tuning training pairs on instruction-tuned foundation models, approaches like continued pre-training typically require large volumes of training tokens. This reinforces the importance of starting simple—as you test lighter-weight model tuning, you can collect and process larger data volumes in parallel for future phases.
3. Define measures for what good looks like
Success needs to be measurable, regardless of which technical approach you choose. It’s critical to establish clear methods for measuring both overall business outcomes and the technical solution’s performance. At the model or application level, teams typically optimize across some combination of relevance, latency, and cost. However, the metrics for your production application won’t be general leaderboard metrics—they must be unique to what matters for your business.
Customers developing content generation systems prioritize metrics like relevance, clarity, style, and tone. Consider this example from Volkswagen Group: “We fine-tuned Nova Pro in SageMaker AI using our marketing experts’ knowledge. This improved the model’s ability to identify on-brand images, achieving stronger alignment with Volkswagen’s brand guidelines,” according to Volkswagen’s Dr. Trempler. “We are building on these results to enable Volkswagen Group’s vision to scale high-quality, brand-compliant content creation across our diverse automotive markets worldwide using generative AI.” Developing an automated evaluation process is critical for supporting iterative solution improvements.
For qualitative use cases, it’s essential to align automated evaluations with human experts, particularly in specialized domains. A common solution involves using an LLM as a judge to review another model’s or system’s responses. For instance, when fine-tuning a generation model for a RAG application, you might use an LLM judge to compare the fine-tuned model’s responses to your existing baseline. However, LLM judges come with intrinsic biases and may not align with your internal team’s human preferences or domain expertise. Robin AI partnered with the Innovation Center to develop Legal LLM-as-Judge, an AI model for legal contract review. Emulating expert methodology and creating “a panel of trained judges” using fine-tuning techniques, they obtained smaller and faster models that maintain accuracy while reviewing documents ranging from NDAs to merger agreements. The solution achieved an 80% faster contract review process, enabling lawyers to focus on strategic work while AI handles detailed analysis.
4. Consider hardware-level optimizations for training and inference
If you’re using a managed service like Amazon Bedrock, you can take advantage of built-in optimizations out of the box. However, if you have a more bespoke solution or are operating at a lower level of the technology stack, there are several areas to consider for optimization and efficiency gains. For instance, TGS’s SFMs process massive 3D seismic images (essentially giant CAT scans of the Earth) that can cover tens of thousands of square kilometers. Each dataset is measured in petabytes, far beyond what traditional manual or even semi-automated interpretation methods can handle. By rebuilding their AI models on AWS’s high-performance GPU training infrastructure, TGS achieved near-linear scaling, meaning that adding more computing power results in almost proportional speed increases while maintaining >90% GPU efficiency. As a result, TGS can now deliver actionable subsurface insights, such as identifying drilling targets or de-risking exploration zones, to customers in days instead of weeks.
Over the life of a model, resource requirements are generally driven by inference requests, and any efficiency gains you can achieve will pay dividends during the production phase. One approach to reduce inference demands is model distillation to reduce the model size itself, but in some cases, there are additional gains to be had by digging deeper into the infrastructure. A recent example is Synthesia, the creator of a leading video generation platform where users can create professional videos without the need for mics, cameras, or actors. Synthesia is continually looking for ways to elevate their user experience, including by decreasing generation times for content. They worked with the Innovation Center to optimize the Variational Autoencoder decoder of their already efficient video generation pipeline. Strategic optimization of the model’s causal convolution layers unlocked powerful compiler performance gains, while asynchronous video chunk writing eliminated GPU idle time – together delivering a dramatic reduction in end-to-end latency and a 29% increase in decoding throughput.
5. One size doesn’t fit all
The one size doesn’t fit all principle applies to both model size and family. Some models excel out of the box for specific tasks like code generation, tool usage, document processing, or summarization. With the rapid pace of innovation, the best foundation model for a given use case today likely won’t be the best tomorrow. Model size corresponds to the number of parameters and often determines its ability to complete a broad set of general tasks and capabilities. However, larger models require more compute resources at inference time and can be expensive to run at production scale. Many applications don’t need a model that excels at everything but rather one that performs exceptionally well at a more limited set of tasks or domain-specific capabilities.
Even within a single application, optimization may require using multiple model providers depending on the specific task, complexity level, and latency requirements. In agentic applications, you might use a lightweight model for specialized agent tasks while requiring a more powerful generalist model to orchestrate and supervise those agents. Architecting your solution to be modular and resilient to changing model providers or versions helps you adapt quickly and capitalize on improvements. Services like Amazon Bedrock facilitate this approach by providing a unified API experience across a broad range of model families, including custom versions of many models.
How the Innovation Center can help
The Custom Model Program by the Innovation Center provides end-to-end expert support from model selection to customization, delivering performance improvements while reducing time to market and accelerating value realization. Our process works backwards from customer business needs, strategy, and goals, and starts with a use case and generative AI capability review by an experienced generative AI strategist. Specialist hands-on-keyboard applied scientists and engineers embed with customer teams to train and tune models for customers and integrate them into applications without data ever needing to leave customer VPCs. This end-to-end support has helped organizations across industries successfully transform their AI vision into real business outcomes.

Want to learn more? Contact your account manager to learn more about the Innovation Center or come see us at re:Invent at the AWS Village in the Expo.

About the authors
Sri Elaprolu serves as Director of the AWS Generative AI Innovation Center, where he leverages nearly three decades of technology leadership experience to drive artificial intelligence and machine learning innovation. In this role, he leads a global team of machine learning scientists and engineers who develop and deploy advanced generative and agentic AI solutions for enterprise and government organizations facing complex business challenges. Throughout his nearly 13-year tenure at AWS, Sri has held progressively senior positions, including leadership of ML science teams that partnered with high-profile organizations such as the NFL, Cerner, and NASA. These collaborations enabled AWS customers to harness AI and ML technologies for transformative business and operational outcomes. Prior to joining AWS, he spent 14 years at Northrop Grumman, where he successfully managed product development and software engineering teams. Sri holds a Master’s degree in Engineering Science and an MBA with a concentration in general management, providing him with both the technical depth and business acumen essential for his current leadership role.
Hannah Marlowe leads the Model Customization and Optimization program for the AWS Generative AI Innovation Center. Her global team of strategists, specialized scientists, and engineers embeds directly with AWS customers, developing custom model solutions optimized for relevance, latency, and cost to drive business outcomes and capture ROI. Previous roles at Amazon include Senior Practice Manager for Advanced Computing and Principal Lead for Computer Vision and Remote Sensing. Dr. Marlowe completed her PhD in Physics at the University of Iowa in modeling and simulation of astronomical X-ray sources and instrumentation development for satellite-based payloads.
Rohit Thekkanal serves as ML Engineering Manager for Model Customization at the AWS Generative AI Innovation Center, where he leads the development of scalable generative AI applications focused on model optimization. With nearly a decade at Amazon, he has contributed to machine learning initiatives that significantly impact Amazon’s retail catalog. Rohit holds an MBA from The University of Chicago Booth School of Business and a Master’s degree from Carnegie Mellon University.
Alexandra Fedorova leads Growth for the Model Customization and Optimization program for the AWS Generative AI Innovation Center. Previous roles at Amazon include Global GenAI Startups Practice Leader with the AWS Generative AI Innovation Center, and Global Leader, Startups Strategic Initiatives and Growth. Alexandra holds an MBA degree from Southern Methodist University, and BS in Economics and Petroleum Engineering from Gubkin Russian State University of Oil and Gas.

Clario streamlines clinical trial software configurations using Amazon …

This post was co-written with Kim Nguyen and Shyam Banuprakash from Clario.
Clario is a leading provider of endpoint data solutions for systematic collection, management, and analysis of specific, predefined outcomes (endpoints) to evaluate a treatment’s safety and effectiveness in the clinical trials industry, generating high-quality clinical evidence for life sciences companies seeking to bring new therapies to patients. Since Clario’s founding more than 50 years ago, the company’s endpoint data solutions have supported clinical trials more than 30,000 times with over 700 regulatory approvals across more than 100 countries.
This post builds upon our previous post discussing how Clario developed an AI solution powered by Amazon Bedrock to accelerate clinical trials. Since then, Clario has further enhanced their AI capabilities, focusing on innovative solutions that streamline the generation of software configurations and artifacts for clinical trials while delivering high-quality clinical evidence.
Business challenge
In clinical trials, efficiently designing and customizing the configurations of the various software systems that manage and optimize the different stages of a trial is critical. These configurations can range from basic study setup to more advanced features like data collection customization and integration with other systems. Clario uses data from multiple sources to build specific software configurations for clinical trials. The traditional workflow involved manual extraction of necessary data from individual forms. These forms contained vital information about exams, visits, conditions, and interventions. Additionally, the process required incorporating study-related information such as study plans, participation criteria, sponsors, collaborators, and standardized exam protocols from multiple enterprise data providers.
The manual nature of this process created several challenges:

Manual data extraction – Team members manually review PDF documents to extract structured data.
Transcription challenges – The manual transfer of data from source forms into configuration documents presents opportunities for improvement, particularly in reducing transcription inconsistencies and enhancing standardization.
Version control challenges – When studies required iterations or updates, maintaining consistency between documents and systems became increasingly complicated.
Fragmented information flow – Data existed in disconnected silos, including PDFs, study detail database records, and other standalone documents.
Software build timelines – The configuration process directly impacted the timeline for generating the necessary software builds.

For clinical trials where timing is essential and accuracy is non-negotiable, Clario has implemented rigorous quality control measures to minimize the risks associated with manual processes. While these efforts are substantial, they underscore a business challenge of ensuring precision and consistency across complex study configurations.
Solution overview
To address the business challenge, Clario developed a generative AI-powered solution on AWS that it calls the Genie AI Service. The solution uses large language models (LLMs), specifically Anthropic’s Claude 3.7 Sonnet on Amazon Bedrock, and the process is orchestrated with Amazon Elastic Container Service (Amazon ECS) to transform how Clario handles software configuration for clinical trials.
Clario’s approach uses a custom data parser built on Amazon Bedrock to automatically structure information from PDF transmittal forms into validated tables. The Genie AI Service centralizes data from multiple sources, including transmittal forms, study details, standard exam protocols, and additional configuration parameters. An interactive review dashboard helps stakeholders verify AI-extracted information and make necessary corrections before finalizing the validated configuration. After validation, the system automatically generates a Software Configuration Specification (SCS) document as a comprehensive record of the software configuration. The process culminates with generative AI-powered XML generation; the resulting XML is released into Clario’s proprietary medical imaging software for study builds, creating an end-to-end solution that drastically reduces manual effort while improving accuracy in clinical trial software configurations.
The Genie AI Service architecture consists of several interconnected components that work together in a clear workflow sequence, as illustrated in the following diagram.

The workflow consists of the following steps:

Initiate the study and collect data.
Extract the data using Amazon Bedrock.
Review and validate the AI-generated output.
Generate essential documentation and code artifacts.

In the following sections, we discuss the workflow steps in more detail.
Study initiation and data collection
The workflow begins with gathering essential study information through multiple integrated steps:

Study code lookup – Users begin by entering a study code that uniquely identifies the clinical trial.
API integration with study database – The study lookup operation makes an API call to fetch study details such as study plan, participation criteria, sponsors, collaborators, and more from the study database, establishing the foundation for the configuration.
Transmittal form processing – Users upload transmittal forms, which contain study parameters such as exams, visits, conditions, and interventions, to the Genie AI Service through the web UI over a secure AWS Direct Connect network.
Data structuring – The system organizes information into key categories (see the data model sketch after this list):

Visit information (scheduling, procedures)
Exam specifications (protocols, requirements)
Study-specific custom fields (vitals, dosing information, and so on)
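
The following is a minimal sketch of how these categories could be represented once extracted. The class and field names are illustrative assumptions based on the categories above, not Clario's actual data schema.

```python
# Illustrative data model for the categories described above.
# Class and field names are assumptions, not Clario's schema.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Visit:
    name: str                                  # e.g., "Screening" or "Week 4"
    schedule: str                              # visit timing or window
    procedures: list[str] = field(default_factory=list)


@dataclass
class Exam:
    protocol: str                              # standardized exam protocol reference
    requirements: list[str] = field(default_factory=list)


@dataclass
class StudyConfiguration:
    study_code: str
    visits: list[Visit] = field(default_factory=list)
    exams: list[Exam] = field(default_factory=list)
    custom_fields: dict[str, Any] = field(default_factory=dict)  # vitals, dosing, and so on
```

An explicit data model like this makes the later review, SCS document, and XML generation steps easier to validate against.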

Data extraction
The solution uses Anthropic’s Claude 3.7 Sonnet on Amazon Bedrock through API calls to perform the following actions (a minimal invocation sketch follows this list):

Parse and extract structured data from transmittal forms
Identify key fields and tables within the documents
Organize the information into standardized formats
Apply domain-specific rules to properly categorize clinical trial visits
Extract and validate demographic fields while maintaining proper data types and formats
Handle specialized formatting rules for medical imaging parameters
Manage document-specific adaptations (such as different processing for phantom vs. subject scans)
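
As a rough illustration of this extraction step, the sketch below calls Claude through the Amazon Bedrock Converse API and asks for JSON output. The model ID, system prompt, and expected JSON keys are assumptions for illustration; Clario's production prompts encode far more business logic than shown here.

```python
# Minimal sketch of structured extraction with Claude on Amazon Bedrock.
# The model ID, prompt, and expected JSON shape are illustrative assumptions.
import json

import boto3

bedrock = boto3.client("bedrock-runtime")  # assumes AWS credentials and Region are configured

# Placeholder: use the Claude model ID or inference profile enabled in your account
MODEL_ID = "anthropic.claude-3-7-sonnet-20250219-v1:0"

SYSTEM_PROMPT = (
    "You extract clinical trial configuration data from transmittal forms. "
    "Return only JSON with the keys: visits, exams, custom_fields. "
    "Apply the visit categorization rules provided in the user message."
)


def extract_configuration(form_text: str) -> dict:
    """Parse transmittal form text into a standardized structure using the LLM."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[{"text": SYSTEM_PROMPT}],
        messages=[{"role": "user", "content": [{"text": form_text}]}],
        inferenceConfig={"maxTokens": 4096, "temperature": 0},  # deterministic extraction
    )
    raw_text = response["output"]["message"]["content"][0]["text"]
    return json.loads(raw_text)  # malformed output fails here and is surfaced for review
```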

Review and validation
The solution provides a comprehensive review interface for stakeholders to validate and refine the AI-generated configurations through the following steps:

Interactive review process – Reviewers access the Genie AI Service interface to perform the following actions:

Examine the AI-generated output
Make corrections or adjustments to the data as necessary
Add comments and highlight adjustments made as a feedback mechanism
Validate the configuration accuracy

Data storage – Reviewed and approved software configurations are saved to Clario’s Genie Database, creating a central, authoritative, auditable source of configuration data

Document and code generation
After the configuration data is validated, the solution automates the creation of essential documentation and code artifacts through a structured workflow:

SCS document creation – Reviewers access the Genie AI Service interface to finalize the software configurations by generating an SCS document using the validated data.
XML generation workflow – After the SCS document is finalized, the workflow completes the following steps:

The workflow fetches the configuration details from the Genie database.
The SCSXMLConverter, an internal microservice of the Genie AI Service, processes both the SCS document and the study configurations. This microservice invokes Anthropic’s Claude 3.7 Sonnet through API calls to generate a standardized SCS XML file.
Validation checks are performed on the generated XML to make sure it meets the structural and content requirements of Clario’s clinical study software (a simplified validation sketch follows this list).
The final XML output is created for use in the software build process with detailed logs of the conversion process.
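
The validation step can be approximated as shown below: check that the generated XML is well-formed and that required sections are present before the file is released to the build process. The required element names are invented for illustration; the real checks are specific to Clario's clinical study software.

```python
# Sketch of post-generation validation for model-produced SCS XML.
# REQUIRED_ELEMENTS is an invented example, not Clario's real schema.
import logging
import xml.etree.ElementTree as ET

logger = logging.getLogger("scs_xml_converter")

REQUIRED_ELEMENTS = ("StudyCode", "Visits", "Exams")  # hypothetical required sections


def validate_scs_xml(xml_text: str) -> ET.Element:
    """Raise ValueError if the XML is malformed or missing required sections."""
    try:
        root = ET.fromstring(xml_text)  # structural check: well-formed XML
    except ET.ParseError as err:
        raise ValueError(f"Generated XML is not well-formed: {err}") from err

    missing = [tag for tag in REQUIRED_ELEMENTS if root.find(f".//{tag}") is None]
    if missing:
        raise ValueError(f"Generated XML is missing required elements: {missing}")

    logger.info("SCS XML validated with %d top-level elements", len(root))
    return root
```

The same pattern extends naturally to schema (XSD) validation and to content checks against the validated configuration stored in the Genie database.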

Benefits and results
The solution enhanced data extraction quality while providing teams with a streamlined dashboard that accelerates the validation process.
By implementing consistent extraction logic and minimizing manual data entry, the solution has reduced potential transcription errors. Additionally, built-in validation safeguards now help identify potential issues early in the process, preventing problems from propagating downstream.
The solution has also transformed how teams collaborate. By providing centralized review capabilities and giving cross-functional teams access to the same solution, communication has become more transparent and efficient. The standardized workflows have created clearer channels for information sharing and decision-making.
From an operational perspective, the new approach offers greater scalability across studies while supporting iterations as studies evolve. This standardization has laid a strong foundation for expanding these capabilities to other operational areas within the organization.
Importantly, the solution maintains strong compliance and auditability through complete audit trails and reproducible processes. Key outcomes include:

Study configuration execution time has been reduced while overall quality has improved.
Teams can focus more on value-added activities like study design optimization.

Lessons learned
Clario’s journey to transform software configuration through generative AI has yielded valuable lessons that will inform future initiatives.
Generative AI implementation insights
The following key learnings emerged specifically around working with generative AI technology:

Prompt engineering is foundational – Few-shot prompting with domain knowledge is essential. The team discovered that providing detailed examples and explicit business rules in the prompts was necessary for success. Rather than simple instructions, Clario’s prompts include comprehensive business logic, edge case handling, and exact output formatting requirements to guide the AI’s understanding of clinical trial configurations (a simplified prompt construction sketch follows this list).
Prompt engineering requires iteration – The quality of data extraction depends heavily on well-crafted prompts that encode domain expertise. Clario’s team spent significant time refining these prompts through multiple iterations and testing different approaches to capture complex business rules about visit sequencing, demographic requirements, and field formatting.
Human oversight within a validation workflow – Although generative AI dramatically accelerates extraction, human review remains necessary within a structured validation workflow. The Genie AI Service interface was specifically designed to highlight potential inconsistencies and provide convenient editing capabilities for reviewers to apply their expertise efficiently.
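
To make the few-shot pattern concrete, the sketch below embeds explicit business rules and a worked example in the prompt. The rules and the example are invented for illustration and are far simpler than Clario's production prompts.

```python
# Illustrative few-shot prompt construction for transmittal form extraction.
# The business rules and worked example are invented for illustration only.
BUSINESS_RULES = """\
- Classify each visit as Screening, Baseline, Treatment, or Follow-up.
- Keep the label from the form for unscheduled visits.
- Use ISO 8601 dates; leave unknown fields as null, never guess.
"""

WORKED_EXAMPLE = """\
Form excerpt:
  Visit: V1 (Day -14 to Day -1). Procedures: MRI brain, vitals.
Expected JSON:
  {"visits": [{"name": "V1", "category": "Screening",
               "procedures": ["MRI brain", "vitals"]}]}
"""


def build_extraction_prompt(form_text: str) -> str:
    """Combine explicit rules, a worked example, and the new form into one prompt."""
    return (
        "Extract the study configuration from the transmittal form below.\n\n"
        f"Business rules:\n{BUSINESS_RULES}\n"
        f"Worked example:\n{WORKED_EXAMPLE}\n"
        f"Transmittal form:\n{form_text}\n\n"
        "Return only JSON in the format shown in the worked example."
    )
```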

Integration challenges
Some important challenges surfaced during system integration:

Two-system synchronization – One of the biggest challenges has been verifying that changes made in the SCS documents are reflected in the solution. This bidirectional integration is still being refined.
System transition strategy – Moving from the proof-of-concept scripts to fully integrated solution functionality requires careful planning to avoid disruption.

Process adaptation
The team identified the following key factors for successful process change:

Phased implementation – Clario rolled out the solution in stages, beginning with pilot teams who could validate functionality and serve as internal advocates to help teams transition from familiar document-centric workflows to the new solution.
Workflow optimization is iterative – The initial workflow design has evolved based on user feedback and real-world usage patterns.
Training requirements – Even with an intuitive interface, proper training makes sure users can take full advantage of the solution’s capabilities.

Technical considerations
Implementation revealed several important technical aspects to consider:

Data formatting variability – Transmittal forms vary significantly across different therapeutic areas (oncology, neurology, and so on) and even between studies within the same area. This variability creates challenges when the AI model encounters form structures or terminology it hasn’t seen before. Clario’s prompt engineering requires continuous iteration as they discover new patterns and edge cases in transmittal forms, creating a feedback loop where human experts identify missed or misinterpreted data points that inform future prompt refinements.
Performance optimization – Processing times for larger documents required optimization to maintain a smooth user experience.
Error handling robustness – Building resilient error handling into the generative AI processing flow was essential for production reliability (see the retry sketch after this list).
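
As a sketch of the error-handling point above, a resilient call path typically wraps the model invocation with bounded retries for transient failures and validates the output before it enters the workflow. The retry policy and exception codes below are illustrative, not Clario's implementation.

```python
# Illustrative resilient wrapper around a Bedrock call: bounded retries for
# transient errors plus validation of the model output before downstream use.
import json
import time

import boto3
from botocore.exceptions import ClientError

bedrock = boto3.client("bedrock-runtime")

RETRYABLE_ERRORS = {"ThrottlingException", "ModelTimeoutException", "ServiceUnavailableException"}


def invoke_with_retries(model_id: str, prompt: str, max_attempts: int = 3) -> dict:
    """Call the Converse API with exponential backoff and JSON validation."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = bedrock.converse(
                modelId=model_id,
                messages=[{"role": "user", "content": [{"text": prompt}]}],
                inferenceConfig={"maxTokens": 4096, "temperature": 0},
            )
            text = response["output"]["message"]["content"][0]["text"]
            return json.loads(text)  # reject non-JSON output instead of passing it downstream
        except ClientError as err:
            if err.response["Error"]["Code"] in RETRYABLE_ERRORS and attempt < max_attempts:
                time.sleep(2 ** attempt)  # back off: 2s, 4s, ...
                continue
            raise
        except json.JSONDecodeError:
            if attempt < max_attempts:
                continue  # retrying the call often recovers malformed JSON
            raise
```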

Strategic insights
The project yielded valuable strategic lessons that will inform future initiatives:

Start with well-defined use cases – Beginning with the software configuration process gave Clario a concrete, high-value target for demonstrating generative AI benefits.
Build for extensibility – Designing the architecture with future expansion in mind has positioned them well for extending these capabilities to other areas.
Measure concrete outcomes – Tracking specific metrics like processing time and error rates has helped quantify the return on the generative AI investment.

These lessons have been invaluable for refining the current solution and informing the approach to future generative AI implementations across the organization.
Conclusion
The transformation of the software configuration process through generative AI represents more than just a technical achievement for Clario—it reflects a fundamental shift in how the company approaches data processing and knowledge work in clinical trials. By combining the pattern recognition and processing power of LLMs available in Amazon Bedrock with human expertise for validation and decision-making, Clario created a hybrid workflow that delivers the best of both worlds, orchestrated through Amazon ECS for reliable, scalable execution.
The success of this initiative demonstrates how generative AI on AWS is a practical tool that can deliver tangible benefits. By focusing on specific, well-defined processes with clear pain points, Clario has implemented the solution Genie AI Service powered by Amazon Bedrock in a way that creates immediate value while establishing a foundation for broader transformation.
For organizations considering similar transformations, the experience highlights the importance of starting with concrete use cases, building for human-AI collaboration, and maintaining a focus on measurable business outcomes. With these principles in mind, generative AI can become a genuine catalyst for organizational evolution.

About the authors
Kim Nguyen serves as the Sr Director of Data Science at Clario, where he leads a team of data scientists in developing innovative AI/ML solutions for the healthcare and clinical trials industry. With over a decade of experience in clinical data management and analytics, Kim has established himself as an expert in transforming complex life sciences data into actionable insights that drive business outcomes. His career journey includes leadership roles at Clario and Gilead Sciences, where he consistently pioneered data automation and standardization initiatives across multiple functional teams. Kim holds a Master’s degree in Data Science and Engineering from UC San Diego and a Bachelor’s degree from the University of California, Berkeley, providing him with the technical foundation to excel in developing predictive models and data-driven strategies. Based in San Diego, California, he leverages his expertise to drive forward-thinking approaches to data science in the clinical research space.
Shyam Banuprakash serves as the Senior Vice President of Data Science and Delivery at Clario, where he leads complex analytics programs and develops innovative data solutions for the medical imaging sector. With nearly 12 years of progressive experience at Clario, he has demonstrated exceptional leadership in data-driven decision making and business process improvement. His expertise extends beyond his primary role, as he contributes his knowledge as an Advisory Board Member for both Modal and UC Irvine’s Customer Experience Program. Shyam holds a Master of Advanced Study in Data Science and Engineering from UC San Diego, complemented by specialized training from MIT in data science and big data analytics. His career exemplifies the powerful intersection of healthcare, technology, and data science, positioning him as a thought leader in leveraging analytics to transform clinical research and medical imaging.
Praveen Haranahalli is a Senior Solutions Architect at Amazon Web Services (AWS), where he architects secure, scalable cloud solutions and provides strategic guidance to diverse enterprise customers. With nearly two decades of IT experience including over a decade specializing in cloud computing, Praveen has delivered transformative implementations across multiple industries. As a trusted technical advisor, Praveen partners with customers to implement robust DevSecOps pipelines, establish comprehensive security guardrails, and develop innovative AI/ML solutions. He is passionate about solving complex business challenges through cutting-edge cloud architectures and empowering organizations to achieve successful digital transformations powered by artificial intelligence and machine learning.