Microsoft AI Releases Phi-4-multimodal and Phi-4-mini: The Newest Models in Microsoft’s Phi Family of Small Language Models (SLMs)

In today’s rapidly evolving technological landscape, developers and organizations often grapple with a series of practical challenges. One of the most significant hurdles is the efficient processing of diverse data types—text, speech, and vision—within a single system. Traditional approaches have typically required separate pipelines for each modality, leading to increased complexity, higher latency, and greater computational costs. In many applications—from healthcare diagnostics to financial analytics—these limitations can hinder the development of responsive and adaptive AI solutions. The need for models that balance robustness with efficiency is more pressing than ever. In this context, Microsoft’s recent work on small language models (SLMs) provides a promising approach by striving to consolidate capabilities in a compact, versatile package.

Microsoft AI has recently introduced Phi-4-multimodal and Phi-4-mini, the newest additions to its Phi family of SLMs. These models have been developed with a clear focus on streamlining multimodal processing. Phi-4-multimodal is designed to handle text, speech, and visual inputs concurrently, all within a unified architecture. This integrated approach means that a single model can now interpret and generate responses based on varied data types without the need for separate, specialized systems.

In contrast, Phi-4-mini is tailored specifically for text-based tasks. Despite being more compact, it has been engineered to excel in reasoning, coding, and instruction following. Both models are made accessible via platforms like Azure AI Foundry and Hugging Face, ensuring that developers from a range of industries can experiment with and integrate these models into their applications. This balanced release represents a thoughtful step towards making advanced AI more practical and accessible.

Technical Details and Benefits

At the technical level, Phi-4-multimodal is a 5.6-billion-parameter model that incorporates a mixture-of-LoRAs—a method that allows the integration of speech, vision, and text within a single representation space. This design significantly simplifies the architecture by removing the need for separate processing pipelines. As a result, the model not only reduces computational overhead but also achieves lower latency, which is particularly beneficial for real-time applications.
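To make the idea more concrete, the following minimal PyTorch sketch illustrates the general pattern behind a mixture-of-LoRAs: a shared, frozen projection augmented by small low-rank adapters, one per modality. It is an illustration of the technique only, not Microsoft’s implementation, and all class names and dimensions are invented for the example.

# Conceptual sketch only: modality-specific LoRA adapters over a frozen base
# projection, illustrating the general "mixture-of-LoRAs" idea. This is NOT
# Microsoft's implementation; all names and sizes here are illustrative.
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    def __init__(self, dim: int, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # A: project down
        self.up = nn.Linear(rank, dim, bias=False)     # B: project back up
        nn.init.zeros_(self.up.weight)                 # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x)) * self.scale

class MixtureOfLoRAsLayer(nn.Module):
    """Frozen shared projection plus one trainable LoRA adapter per modality."""
    def __init__(self, dim: int, modalities=("text", "vision", "speech")):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)                # shared weights stay frozen
        self.adapters = nn.ModuleDict({m: LoRAAdapter(dim) for m in modalities})

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        return self.base(x) + self.adapters[modality](x)

layer = MixtureOfLoRAsLayer(dim=512)
tokens = torch.randn(2, 8, 512)                        # (batch, seq, dim)
print(layer(tokens, "speech").shape)                   # torch.Size([2, 8, 512])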

Phi-4-mini, with its 3.8 billion parameters, is built as a dense, decoder-only transformer. It features grouped-query attention, a 200,000-token vocabulary, and support for sequences of up to 128,000 tokens. Despite its smaller size, Phi-4-mini performs remarkably well in tasks that require deep reasoning and language understanding. One of its standout features is the capability for function calling—allowing it to interact with external tools and APIs, thus extending its practical utility without requiring a larger, more resource-intensive model.
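As a rough illustration of how such a model can be pulled from Hugging Face, the snippet below uses the transformers library. The repository ID shown is an assumption based on the announcement, so check the Hub for the exact name before running it.

# Minimal sketch of loading Phi-4-mini from the Hugging Face Hub with transformers.
# The repository ID "microsoft/Phi-4-mini-instruct" is an assumption based on the
# announcement; verify the exact name on the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [
    {"role": "user", "content": "Write a Python function that reverses a string."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))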

Both models have been optimized for on-device execution. This optimization is particularly important for applications in environments with limited compute resources or in edge computing scenarios. The models’ reduced computational requirements make them a cost-effective choice, ensuring that advanced AI functionalities can be deployed even on devices that do not have extensive processing capabilities.

Performance Insights and Benchmark Data

Benchmark results provide a clear view of how these models perform in practical scenarios. For instance, Phi-4-multimodal has demonstrated an impressive word error rate (WER) of 6.14% in automatic speech recognition (ASR) tasks. This is a modest improvement over previous models like WhisperV3, which reported a WER of 6.5%. Such improvements are particularly significant in applications where accuracy in speech recognition is critical.

Beyond ASR, Phi-4-multimodal also shows robust performance in tasks such as speech translation and summarization. Its ability to process visual inputs is notable in tasks like document reasoning, chart understanding, and optical character recognition (OCR). In several benchmarks—ranging from synthetic speech interpretation on visual data to document analysis—the model’s performance consistently aligns with or exceeds that of larger, more resource-intensive models.

Similarly, Phi-4-mini has been evaluated on a variety of language benchmarks, where it holds its own despite its more compact design. Its aptitude for reasoning, handling complex mathematical problems, and coding tasks underlines its versatility in text-based applications. The inclusion of a function-calling mechanism further enriches its potential, enabling the model to draw on external data and tools seamlessly. These results underscore a measured and thoughtful improvement in multimodal and language processing capabilities, providing clear benefits without overstating its performance.

Conclusion

The introduction of Phi-4-multimodal and Phi-4-mini by Microsoft marks an important evolution in the field of AI. Rather than relying on bulky, resource-demanding architectures, these models offer a refined balance between efficiency and performance. By integrating multiple modalities in a single, cohesive framework, Phi-4-multimodal simplifies the complexity inherent in multimodal processing. Meanwhile, Phi-4-mini provides a robust solution for text-intensive tasks, proving that smaller models can indeed offer significant capabilities.

Check out the Technical details and Model on Hugging Face. All credit for this research goes to the researchers of this project.


DeepSeek AI Releases DualPipe: A Bidirectional Pipeline Parallelism Algorithm for Computation-Communication Overlap in V3/R1 Training

The task of training deep neural networks, especially those with billions of parameters, is inherently resource-intensive. One persistent issue is the mismatch between computation and communication phases. In conventional settings, forward and backward passes are executed sequentially, resulting in intervals where GPUs remain idle while data is exchanged or synchronized. These idle periods, or pipeline bubbles, not only extend training times but also increase memory demands. Moreover, the management of micro-batches can lead to unnecessary duplication of parameters, further straining the available resources. Finding a method to better align these phases is essential for improving efficiency and reducing training costs.

DeepSeek AI Releases DualPipe, a bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training. Rather than adhering to a strict sequential order, DualPipe orchestrates forward and backward passes to occur in overlapping, bidirectional streams. This scheduling strategy is designed to harmonize the computation and communication phases so that while one set of micro-batches is engaged in forward processing, another is simultaneously undergoing backward computation.

According to the DeepSeek-V3 Technical Report, this bidirectional design helps to reduce the traditional pipeline bubbles while optimizing memory usage. The system employs a symmetrical arrangement of micro-batches in both forward and reverse directions, allowing for a more consistent flow of data between GPUs. This alignment means that the hardware is in use more consistently, potentially leading to smoother and more efficient training cycles.

Technical Insights and Benefits

DualPipe achieves its efficiency by dividing the training process into a series of smaller micro-batches that are scheduled concurrently in both directions. The algorithm’s key innovation lies in its bidirectional scheduling mechanism. Unlike traditional methods—such as the simple one-forward, one-backward (1F1B) sequence or staggered variations like ZB1P—DualPipe minimizes idle time by allowing overlapping operations.

The GitHub documentation details a comparative approach:

1F1B: Executes forward and backward passes sequentially.

ZB1P: Introduces a degree of staggering to mitigate idle time.

DualPipe: Uses a dual-direction scheduling method whose pipeline bubble the documentation lists as (PP/2−1)(F&B+B−3W), compared with (PP−1)(F+B) for 1F1B and (PP−1)(F+B−2W) for ZB1P, where F, B, and W denote the execution times of a forward chunk, a backward chunk, and a weight-gradient chunk, and F&B denotes an overlapped forward-and-backward chunk. Roughly half as many pipeline stages contribute to the bubble, while one additional activation stage is required.

This nuanced method not only reduces idle periods but also offers a more balanced use of memory. Implemented with PyTorch 2.0 and above, DualPipe is compatible with current deep learning frameworks and is designed to integrate smoothly into existing training pipelines.
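As a back-of-the-envelope illustration of why the schedule helps, the short script below plugs example timings into the bubble formulas listed above. The overlapped F&B chunk is crudely approximated as max(F, B), and the timings are invented; this is purely for illustration, not a measurement of DualPipe itself.

# Rough comparison of pipeline-bubble sizes using the formulas reported for
# 1F1B, ZB1P, and DualPipe in the DeepSeek-V3 technical report. F, B, and W are
# per-chunk forward, backward, and weight-gradient times; the overlapped "F&B"
# chunk is approximated here as max(F, B), a simplification for illustration.
def bubble_1f1b(pp, F, B, W):
    return (pp - 1) * (F + B)

def bubble_zb1p(pp, F, B, W):
    return (pp - 1) * (F + B - 2 * W)

def bubble_dualpipe(pp, F, B, W):
    overlapped_fb = max(F, B)           # crude stand-in for the F&B chunk
    return (pp / 2 - 1) * (overlapped_fb + B - 3 * W)

pp, F, B, W = 8, 1.0, 2.0, 1.0          # illustrative timings only
for name, fn in [("1F1B", bubble_1f1b), ("ZB1P", bubble_zb1p), ("DualPipe", bubble_dualpipe)]:
    print(f"{name:9s} bubble ~= {fn(pp, F, B, W):.1f} time units")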

Observations and Comparative Data

The repository provides a clear example of how DualPipe schedules operations for a system with eight pipeline parallel ranks and twenty micro-batches. In this arrangement, micro-batches in the reverse direction mirror those in the forward direction, effectively reducing the usual delays observed in conventional pipelines. The schedule diagram, which highlights overlapping cells with a shared border, serves as a visual representation of how the communication and computation phases are interwoven.

Furthermore, the repository offers a comparative analysis of memory usage. Whereas 1F1B and ZB1P each hold one copy of the parameters and activations for PP stages, DualPipe’s schedule is listed as needing 2× the parameters and activations for PP+1 stages. That modest memory premium is traded for a much smaller pipeline bubble, which can be especially beneficial in large-scale training environments, where even modest efficiency improvements can lead to significant time and cost savings.

Conclusion

DualPipe offers a thoughtful and well-engineered solution to one of the long-standing challenges in deep learning training. By overlapping the forward and backward passes and carefully coordinating communication with computation, the algorithm reduces idle time and optimizes resource utilization. This approach not only has the potential to shorten training times but also to lower the overall cost of deploying large models.

Check out the GitHub Repo. All credit for this research goes to the researchers of this project.


Simplifying Self-Supervised Vision: How Coding Rate Regularization Transforms DINO and DINOv2

Learning useful features from large amounts of unlabeled images is important, and models like DINO and DINOv2 are designed for this. These models work well for tasks like image classification and segmentation, but their training process is difficult. A key challenge is avoiding representation collapse, where the model produces the same output for different images. Many settings must be carefully adjusted to prevent this, making training unstable and hard to manage. DINOv2 tries to solve this by directly using negative samples, but the training setup remains complex. Because of this, improving these models or using them in new areas is difficult, even though their learned features are very effective.

Currently, methods for learning image features rely on complex and unstable training setups. Techniques like SimCLR, SimSiam, VICReg, MoCo, and BYOL attempt to discover useful representations but face various challenges. SimCLR and MoCo require large batch sizes and explicit negative samples, making them computationally expensive. SimSiam and BYOL try to avoid collapse by modifying the gradient structure, which requires careful tuning. VICReg penalizes feature alignment and covariance but does not address feature variance effectively. Techniques like I-JEPA and C-JEPA focus on patch-based learning but add more complexity. These methods struggle to preserve simplicity, stability, and efficiency, complicating training and limiting flexibility.

To address DINO’s complexities, researchers from UC Berkeley, TranscEngram, Microsoft Research, and HKU proposed SimDINO and SimDINOv2. These models simplify training by incorporating a coding rate regularization term into the loss function, which prevents representation collapse and removes the need for heavy post-processing and hyperparameter tuning. By avoiding unnecessary design choices, SimDINO improves training stability and efficiency. SimDINOv2 enhances performance by handling small and large regions of an image without applying high-dimensional transformations and by eliminating the teacher-student paradigm, rendering the method more robust and efficient than existing methods.

This framework maximizes learning by directly controlling feature representations to be useful throughout training without intricate adaptations. The coding rate term gives the model structured and informative features, leading to better generalization and downstream task performance. This simplifies the training pipeline and removes the teacher-student paradigm. SimDINO reduces computational overhead while maintaining high-quality results, making it a more efficient alternative for self-supervised learning in vision tasks.
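The following short PyTorch sketch shows what a coding-rate regularizer of this kind can look like: an alignment term pulls two views of an image together, while the coding rate of the batch features is maximized to keep them spread out and prevent collapse. The weighting and the exact place where the term enters the SimDINO loss are simplifications here, not the paper’s precise recipe.

# Minimal sketch of a coding-rate regularized objective. The expression
# R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z^T Z) follows the coding-rate
# literature; how it is weighted and combined below is a simplification for
# illustration, not SimDINO's exact loss.
import torch

def coding_rate(z: torch.Tensor, eps: float = 0.5) -> torch.Tensor:
    """z: (n, d) batch of (ideally normalized) feature vectors."""
    n, d = z.shape
    cov = z.T @ z * (d / (n * eps ** 2))                    # (d, d) scaled covariance
    return 0.5 * torch.logdet(torch.eye(d, device=z.device) + cov)

def simdino_style_loss(student: torch.Tensor, teacher: torch.Tensor,
                       gamma: float = 1.0) -> torch.Tensor:
    align = (student - teacher).pow(2).sum(dim=-1).mean()   # pull the two views together
    expand = coding_rate(torch.nn.functional.normalize(student, dim=-1))
    return align - gamma * expand                           # maximizing coding rate resists collapse

feats_student = torch.randn(256, 128, requires_grad=True)
feats_teacher = torch.randn(256, 128)
loss = simdino_style_loss(feats_student, feats_teacher)
loss.backward()
print(float(loss))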

Researchers evaluated SimDINO and SimDINOv2 against DINO and DINOv2 on ImageNet-1K, COCO val2017, ADE20K, and DAVIS-2017 using ViT architectures with a patch size of 16. SimDINO achieved higher k-NN and linear-probe accuracy while maintaining stable training, unlike DINO, which showed performance drops. SimDINO outperformed DINO on COCO val2017 object detection and segmentation using MaskCut. For semantic segmentation on ADE20K, SimDINOv2 improved on DINOv2 by 4.4 mIoU with ViT-B. On DAVIS-2017, the SimDINO variants performed better, though DINOv2 and SimDINOv2 underperformed their predecessors due to evaluation sensitivity. Stability tests showed that DINO was more sensitive to hyperparameters and dataset variations, diverging on ViT-L, while SimDINO remained robust, significantly outperforming DINO when trained on COCO train2017.

In conclusion, the proposed SimDINO and SimDINOv2 models simplify the complex design choices of DINO and DINOv2 by introducing a coding-rate regularization term, making training pipelines more stable and robust while improving performance on downstream tasks. By eliminating unnecessary complexities, these models achieve a Pareto improvement over their predecessors, showing the advantages of directly addressing trade-offs in visual self-supervised learning. The efficient framework establishes a foundation for analyzing the geometric structure of self-supervised learning losses and for model optimization without self-distillation. These ideas can also be applied to other self-supervised learning models to make training more stable and efficient, which makes SimDINO a strong starting point for developing better deep-learning models.

Check out the Paper. All credit for this research goes to the researchers of this project.


Evaluate healthcare generative AI applications using LLM-as-a-judge on …

In our previous blog posts, we explored various techniques such as fine-tuning large language models (LLMs), prompt engineering, and Retrieval Augmented Generation (RAG) using Amazon Bedrock to generate impressions from the findings section in radiology reports using generative AI. Part 1 focused on model fine-tuning. Part 2 introduced RAG, which combines LLMs with external knowledge bases to reduce hallucinations and improve accuracy in medical applications. Through real-time retrieval of relevant medical information, RAG systems can provide more reliable and contextually appropriate responses, making them particularly valuable for healthcare applications where precision is crucial. In both previous posts, we used traditional metrics like ROUGE scores for performance evaluation. This metric is suitable for evaluating general summarization tasks, but can’t effectively assess whether a RAG system successfully integrates retrieved medical knowledge or maintains clinical accuracy.
In Part 3, we’re introducing an approach to evaluate healthcare RAG applications using LLM-as-a-judge with Amazon Bedrock. This innovative evaluation framework addresses the unique challenges of medical RAG systems, where both the accuracy of retrieved medical knowledge and the quality of generated medical content must align with stringent standards such as clear and concise communication, clinical accuracy, and grammatical accuracy. By using the latest models from Amazon and the newly released RAG evaluation feature for Amazon Bedrock Knowledge Bases, we can now comprehensively assess how well these systems retrieve and use medical information to generate accurate, contextually appropriate responses.
This advancement in evaluation methodology is particularly crucial as healthcare RAG applications become more prevalent in clinical settings. The LLM-as-a-judge approach provides a more nuanced evaluation framework that considers both the quality of information retrieval and the clinical accuracy of generated content, aligning with the rigorous standards required in healthcare.
In this post, we demonstrate how to implement this evaluation framework using Amazon Bedrock, compare the performance of different generator models, including Anthropic’s Claude and Amazon Nova on Amazon Bedrock, and showcase how to use the new RAG evaluation feature to optimize knowledge base parameters and assess retrieval quality. This approach not only establishes new benchmarks for medical RAG evaluation, but also provides practitioners with practical tools to build more reliable and accurate healthcare AI applications that can be trusted in clinical settings.
Overview of the solution
The solution uses Amazon Bedrock Knowledge Bases evaluation capabilities to assess and optimize RAG applications specifically for radiology findings and impressions. Let’s examine the key components of this architecture in the following figure, following the data flow from left to right.

The workflow consists of the following phases:

Data preparation – Our evaluation process begins with a prompt dataset containing paired radiology findings and impressions. This clinical data undergoes a transformation process where it’s converted into a structured JSONL format, which is essential for compatibility with the knowledge base evaluation system. After it’s prepared, this formatted data is securely uploaded to an Amazon Simple Storage Service (Amazon S3) bucket, providing accessibility and data security throughout the evaluation process.
Evaluation processing – At the heart of our solution lies an Amazon Bedrock Knowledge Bases evaluation job. This component processes the prepared data while seamlessly integrating with Amazon Bedrock Knowledge Bases. This integration is crucial because it enables the system to create specialized medical RAG capabilities specifically tailored for radiology findings and impressions, making sure that the evaluation considers both medical context and accuracy.
Analysis – The final stage empowers healthcare data scientists with detailed analytical capabilities. Through an advanced automated report generation system, professionals can access detailed analysis of performance metrics of the summarization task for impression generation. This comprehensive reporting system enables thorough assessment of both retrieval quality and generation accuracy, providing valuable insights for system optimization and quality assurance.

This architecture provides a systematic and thorough approach to evaluating medical RAG applications, providing both accuracy and reliability in healthcare contexts where precision and dependability are paramount.
Dataset and background
The MIMIC Chest X-ray (MIMIC-CXR) database v2.0.0 is a large, publicly available dataset of chest radiographs in DICOM format with free-text radiology reports. We used the MIMIC CXR dataset consisting of 91,544 reports, which can be accessed through a data use agreement. This requires user registration and the completion of a credentialing process.
During routine clinical care, clinicians trained in interpreting imaging studies (radiologists) will summarize their findings for a particular study in a free-text note. The reports were de-identified using a rule-based approach to remove protected health information. Because we used only the radiology report text data, we downloaded just one compressed report file (mimic-cxr-reports.zip) from the MIMIC-CXR website. For evaluation, 1,000 of the 2,000 reports in a subset of the MIMIC-CXR dataset were used; this is referred to as the dev1 dataset. Another 1,000 of the 2,000 radiology reports (referred to as dev2) from the chest X-ray collection of the Indiana University hospital network were also used.
RAG with Amazon Bedrock Knowledge Bases
Amazon Bedrock Knowledge Bases helps take advantage of RAG, a popular technique that involves drawing information from a data store to augment the responses generated by LLMs. We used Amazon Bedrock Knowledge Bases to generate impressions from the findings section of the radiology reports by enriching the query with context that is received from querying the knowledge base. The knowledge base is set up to contain findings and corresponding impression sections of 91,544 MIMIC-CXR radiology reports as {prompt, completion} pairs.
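For illustration, the following snippet (not part of the original solution code) shows how such a knowledge base can be queried at inference time with the Bedrock RetrieveAndGenerate API, so that retrieved {prompt, completion} examples augment the findings text before impression generation. The IDs and ARNs shown are placeholders.

# Illustrative sketch of querying the knowledge base with RetrieveAndGenerate.
# Placeholders must be replaced with a real knowledge base ID and model ARN.
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "Generate a concise impression for these findings: <FINDINGS_TEXT>"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<KNOWLEDGE_BASE_ID>",
            "modelArn": "<GENERATOR_MODEL_ARN>",
        },
    },
)
print(response["output"]["text"])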
LLM-as-a-judge and quality metrics
LLM-as-a-judge represents an innovative approach to evaluating AI-generated medical content by using LLMs as automated evaluators. This method is particularly valuable in healthcare applications where traditional metrics might fail to capture the nuanced requirements of medical accuracy and clinical relevance. By using specialized prompts and evaluation criteria, LLM-as-a-judge can assess multiple dimensions of generated medical content, providing a more comprehensive evaluation framework that aligns with healthcare professionals’ standards.
Our evaluation framework encompasses five critical metrics, each designed to assess specific aspects of the generated medical content:

Correctness – Evaluated on a 3-point Likert scale, this metric measures the factual accuracy of generated responses by comparing them against ground truth responses. In the medical context, this makes sure that the clinical interpretations and findings align with the source material and accepted medical knowledge.
Completeness – Using a 5-point Likert scale, this metric assesses whether the generated response comprehensively addresses the prompt holistically while considering the ground truth response. It makes sure that critical medical findings or interpretations are not omitted from the response.
Helpfulness – Measured on a 7-point Likert scale, this metric evaluates the practical utility of the response in clinical contexts, considering factors such as clarity, relevance, and actionability of the medical information provided.
Logical coherence – Assessed on a 5-point Likert scale, this metric examines the response for logical gaps, inconsistencies, or contradictions, making sure that medical reasoning flows naturally and maintains clinical validity throughout the response.
Faithfulness – Scored on a 5-point Likert scale, this metric specifically evaluates whether the response contains information that is neither present in nor readily inferred from the prompt, helping identify potential hallucinations or fabricated medical information that could be dangerous in clinical settings.

These metrics are normalized in the final output and job report card, providing standardized scores that enable consistent comparison across different models and evaluation scenarios. This comprehensive evaluation framework not only helps maintain the reliability and accuracy of medical RAG systems, but also provides detailed insights for continuous improvement and optimization. For details about the metric and evaluation prompts, see Evaluator prompts used in a knowledge base evaluation job.
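As a simple illustration of the idea, the snippet below min-max maps a Likert score onto the 0–1 range so that metrics rated on 3-, 5-, and 7-point scales become comparable. The exact normalization Amazon Bedrock applies is defined in its evaluation documentation; this sketch only shows the common mapping.

# Illustrative min-max normalization of Likert scores onto [0, 1] so metrics on
# different scales become comparable; Bedrock's exact normalization is defined
# in its evaluation documentation.
def normalize_likert(score: int, scale_points: int) -> float:
    return (score - 1) / (scale_points - 1)

print(normalize_likert(3, 3))   # correctness: top of a 3-point scale -> 1.0
print(normalize_likert(4, 5))   # completeness: 4 of 5 -> 0.75
print(normalize_likert(6, 7))   # helpfulness: 6 of 7 -> ~0.83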
Prerequisites
Before proceeding with the evaluation setup, make sure you have the following:

An active AWS account with appropriate permissions
Amazon Bedrock model access enabled in your preferred AWS Region
An S3 bucket with CORS enabled for storing evaluation data
An Amazon Bedrock knowledge base
An AWS Identity and Access Management (IAM) role with necessary permissions for Amazon S3 and Amazon Bedrock

The solution code can be found at the following GitHub repo.
Make sure that your knowledge base is fully synced and ready before initiating an evaluation job.
Convert the test dataset into JSONL for RAG evaluation
In preparation for evaluating our RAG system’s performance on radiology reports, we implemented a data transformation pipeline to convert our test dataset into the required JSONL format. The following code shows the format of the original dev1 and dev2 datasets:
{
  "prompt": "value of prompt key",
  "completion": "value of completion key"
}

Output Format

{
  "conversationTurns": [{
    "referenceResponses": [{
      "content": [{
        "text": "value from completion key"
      }]
    }],
    "prompt": {
      "content": [{
        "text": "value from prompt key"
      }]
    }
  }]
}

Drawing from Wilcox’s seminal paper The Written Radiology Report, we carefully structured our prompt to include comprehensive guidelines for generating high-quality impressions:
import json
import random
import boto3

# Initialize the S3 client
s3 = boto3.client('s3')

# S3 bucket name
bucket_name = "<BUCKET_NAME>"

# Function to transform a single record
def transform_record(record):
    return {
        "conversationTurns": [
            {
                "referenceResponses": [
                    {
                        "content": [
                            {
                                "text": record["completion"]
                            }
                        ]
                    }
                ],
                "prompt": {
                    "content": [
                        {
                            "text": """You're given a radiology report findings to generate a concise radiology impression from it.

A Radiology Impression is the radiologist's final concise interpretation and conclusion of medical imaging findings, typically appearing at the end of a radiology report.
\nFollow these guidelines when writing the impression:
\n- Use clear, understandable language avoiding obscure terms.
\n- Number each impression.
\n- Order impressions by importance.
\n- Keep impressions concise and shorter than the findings section.
\n- Write for the intended reader's understanding.\n
Findings: \n""" + record["prompt"]
                        }
                    ]
                }
            }
        ]
    }

The script processes individual records, restructuring them to include conversation turns with both the original radiology findings and their corresponding impressions, making sure each report maintains the professional standards outlined in the literature. To keep the dataset size manageable for this feature, we randomly sampled 1,000 records from the original dev1 and dev2 datasets, using a fixed random seed for reproducibility:
# Read from input file and write to output file
def convert_file(input_file_path, output_file_path, sample_size=1000):
    # First, read all records into a list
    records = []
    with open(input_file_path, 'r', encoding='utf-8') as input_file:
        for line in input_file:
            records.append(json.loads(line.strip()))

    # Randomly sample 1,000 records
    random.seed(42)  # Set the seed first for reproducibility
    sampled_records = random.sample(records, sample_size)

    # Write the sampled and transformed records to the output file
    with open(output_file_path, 'w', encoding='utf-8') as output_file:
        for record in sampled_records:
            transformed_record = transform_record(record)
            output_file.write(json.dumps(transformed_record) + '\n')

# Usage
input_file_path = '<INPUT_FILE_NAME>.jsonl'    # Replace with your input file path
output_file_path = '<OUTPUT_FILE_NAME>.jsonl'  # Replace with your desired output file path
convert_file(input_file_path, output_file_path)

# File paths and S3 keys for the transformed files
transformed_files = [
    {'local_file': '<OUTPUT_FILE_NAME>.jsonl', 'key': '<FOLDER_NAME>/<OUTPUT_FILE_NAME>.jsonl'},
    {'local_file': '<OUTPUT_FILE_NAME>.jsonl', 'key': '<FOLDER_NAME>/<OUTPUT_FILE_NAME>.jsonl'}
]

# Upload files to S3
for file in transformed_files:
    s3.upload_file(file['local_file'], bucket_name, file['key'])
    print(f"Uploaded {file['local_file']} to s3://{bucket_name}/{file['key']}")

Set up a RAG evaluation job
Our RAG evaluation setup begins with establishing core configurations for the Amazon Bedrock evaluation job, including the selection of evaluation and generation models (Anthropic’s Claude 3 Haiku and Amazon Nova Micro, respectively). The implementation incorporates a hybrid search strategy with a retrieval depth of 10 results, providing comprehensive coverage of the knowledge base during evaluation. To maintain organization and traceability, each evaluation job is assigned a unique identifier with timestamp information, and input data and results are systematically managed through designated S3 paths. See the following code:
import boto3
from datetime import datetime

# Generate unique name for the job
job_name = f"rag-eval-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

# Configure knowledge base and model settings
knowledge_base_id = "<KNOWLEDGE_BASE_ID>"
evaluator_model = "anthropic.claude-3-haiku-20240307-v1:0"
generator_model = "amazon.nova-micro-v1:0"
role_arn = "<IAM_ROLE_ARN>"

# Specify S3 locations
input_data = "<INPUT_S3_PATH>"
output_path = "<OUTPUT_S3_PATH>"

# Configure retrieval settings
num_results = 10
search_type = "HYBRID"

# Create Bedrock client
bedrock_client = boto3.client('bedrock')

With the core configurations in place, we initiate the evaluation job using the Amazon Bedrock create_evaluation_job API, which orchestrates a comprehensive assessment of our RAG system’s performance. The evaluation configuration specifies five key metrics—correctness, completeness, helpfulness, logical coherence, and faithfulness—providing a multi-dimensional analysis of the generated radiology impressions. The job is structured to use the knowledge base for retrieval and generation tasks, with the specified models handling their respective roles: Amazon Nova Micro for generation and Anthropic’s Claude 3 Haiku for evaluation, and the results are systematically stored in the designated S3 output location for subsequent analysis. See the following code:
retrieve_generate_job = bedrock_client.create_evaluation_job(
    jobName=job_name,
    jobDescription="Evaluate retrieval and generation",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [{
            "knowledgeBaseConfig": {
                "retrieveAndGenerateConfig": {
                    "type": "KNOWLEDGE_BASE",
                    "knowledgeBaseConfiguration": {
                        "knowledgeBaseId": knowledge_base_id,
                        "modelArn": generator_model,
                        "retrievalConfiguration": {
                            "vectorSearchConfiguration": {
                                "numberOfResults": num_results,
                                "overrideSearchType": search_type
                            }
                        }
                    }
                }
            }
        }]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Custom",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.Correctness",
                    "Builtin.Completeness",
                    "Builtin.Helpfulness",
                    "Builtin.LogicalCoherence",
                    "Builtin.Faithfulness"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)
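
After the job is submitted, its progress can be polled and the results retrieved from the configured S3 output path. The following short sketch is not part of the original listing; the field and status names follow the boto3 Bedrock API as we understand it, so verify them against the current SDK documentation.

# Poll the evaluation job until it finishes; results land in the S3 output path
# configured above. Field and status names should be checked against the
# current boto3 documentation.
import time

job_arn = retrieve_generate_job["jobArn"]
while True:
    status = bedrock_client.get_evaluation_job(jobIdentifier=job_arn)["status"]
    print(f"Evaluation job status: {status}")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)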

Evaluation results and metrics comparisons
The evaluation results for the healthcare RAG applications, using datasets dev1 and dev2, demonstrate strong performance across the specified metrics. For the dev1 dataset, the scores were as follows: correctness at 0.98, completeness at 0.95, helpfulness at 0.83, logical coherence at 0.99, and faithfulness at 0.79. Similarly, the dev2 dataset yielded scores of 0.97 for correctness, 0.95 for completeness, 0.83 for helpfulness, 0.98 for logical coherence, and 0.82 for faithfulness. These results indicate that the RAG system effectively retrieves and uses medical information to generate accurate and contextually appropriate responses, with particularly high scores in correctness and logical coherence, suggesting robust factual accuracy and logical consistency in the generated content.
The following screenshot shows the evaluation summary for the dev1 dataset.

The following screenshot shows the evaluation summary for the dev2 dataset.

Additionally, as shown in the following screenshot, the LLM-as-a-judge framework allows for the comparison of multiple evaluation jobs across different models, datasets, and prompts, enabling detailed analysis and optimization of the RAG system’s performance.

Additionally, you can perform a detailed analysis by drilling down into the outlier cases with the lowest scores on metrics such as correctness, as shown in the following screenshot.

Metrics explainability
The following screenshot showcases the detailed metrics explainability interface of the evaluation system, displaying example conversations with their corresponding metrics assessment. Each conversation entry includes four key columns: Conversation input, Generation output, Retrieved sources, and Ground truth, along with a Score column. The system provides a comprehensive view of 1,000 examples, with navigation controls to browse through the dataset. Of particular note is the retrieval depth indicator showing 10 for each conversation, demonstrating consistent knowledge base utilization across examples.
The evaluation framework enables detailed tracking of generation metrics and provides transparency into how the knowledge base arrives at its outputs. Each example conversation presents the complete chain of information, from the initial prompt through to the final assessment. The system displays the retrieved context that informed the generation, the actual generated response, and the ground truth for comparison. A scoring mechanism evaluates each response, with a detailed explanation of the decision-making process visible through an expandable interface (as shown by the pop-up in the screenshot). This granular level of detail allows for thorough analysis of the RAG system’s performance and helps identify areas for optimization in both retrieval and generation processes.

In this specific example from the Indiana University Medical System dataset (dev2), we see a clear assessment of the system’s performance in generating a radiology impression for chest X-ray findings. The knowledge base successfully retrieved relevant context (shown by 10 retrieved sources) to generate an impression stating “Normal heart size and pulmonary vascularity 2. Unremarkable mediastinal contour 3. No focal consolidation, pleural effusion, or pneumothorax 4. No acute bony findings.” The evaluation system scored this response with a perfect correctness score of 1, noting in the detailed explanation that the candidate response accurately summarized the key findings and correctly concluded there was no acute cardiopulmonary process, aligning precisely with the ground truth response.

In the following screenshot, the evaluation system scored this response with a low score of 0.5, noting in the detailed explanation the ground truth response provided is “Moderate hiatal hernia. No definite pneumonia.” This indicates that the key findings from the radiology report are the presence of a moderate hiatal hernia and the absence of any definite pneumonia. The candidate response covers the key finding of the moderate hiatal hernia, which is correctly identified as one of the impressions. However, the candidate response also includes additional impressions that are not mentioned in the ground truth, such as normal lung fields, normal heart size, unfolded aorta, and degenerative changes in the spine. Although these additional impressions might be accurate based on the provided findings, they are not explicitly stated in the ground truth response. Therefore, the candidate response is partially correct and partially incorrect based on the ground truth.
Clean up
To avoid incurring future charges, delete the S3 bucket, knowledge base, and other resources that were deployed as part of the post.
Conclusion
The implementation of LLM-as-a-judge for evaluating healthcare RAG applications represents a significant advancement in maintaining the reliability and accuracy of AI-generated medical content. Through this comprehensive evaluation framework using Amazon Bedrock Knowledge Bases, we’ve demonstrated how automated assessment can provide detailed insights into the performance of medical RAG systems across multiple critical dimensions. The high-performance scores across both datasets indicate the robustness of this approach, though these metrics are just the beginning.
Looking ahead, this evaluation framework can be expanded to encompass broader healthcare applications while maintaining the rigorous standards essential for medical applications. The dynamic nature of medical knowledge and clinical practices necessitates an ongoing commitment to evaluation, making continuous assessment a cornerstone of successful implementation.
Through this series, we’ve demonstrated how you can use Amazon Bedrock to create and evaluate healthcare generative AI applications with the precision and reliability required in clinical settings. As organizations continue to refine these tools and methodologies, prioritizing accuracy, safety, and clinical utility in healthcare AI applications remains paramount.

About the Authors
Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.
Priya Padate is a Senior Partner Solution Architect supporting healthcare and life sciences worldwide at Amazon Web Services. She has over 20 years of healthcare industry experience leading architectural solutions in areas of medical imaging, healthcare related AI/ML solutions and strategies for cloud migrations. She is passionate about using technology to transform the healthcare industry to drive better patient care outcomes.
Dr. Ekta Walia Bhullar is a principal AI/ML/GenAI consultant with AWS Healthcare and Life Sciences business unit. She has extensive experience in development of AI/ML applications for healthcare especially in Radiology. During her tenure at AWS she has actively contributed to applications of AI/ML/GenAI within lifescience domain such as for clinical, drug development and commercial lines of business.

AWS DeepRacer: Closing time at AWS re:Invent 2024 –How did that phys …

Having spent the last few years studying the art of AWS DeepRacer in the physical world, the author went to AWS re:Invent 2024. How did it go?
In AWS DeepRacer: How to master physical racing?, I wrote in detail about some aspects relevant to racing AWS DeepRacer in the physical world. We looked at the differences between the virtual and the physical world and how we could adapt the simulator and the training approach to overcome the differences. The previous post was left open-ended—with one last Championship Final left, it was too early to share all my secrets.
Now that AWS re:Invent is over, it’s time to share my strategy, how I prepared, and how it went in the end.
Strategy
Going into the 2024 season, I was reflecting on my performance from 2022 and 2023. In 2022, I had unstable models that were unable to do fast laps on the new re:Invent 2022 Championship track, not even making the last 32. In 2023, things went slightly better, but it was clear that there was potential to improve.
Specifically, I wanted a model that:

Goes straight on the straights and corners with precision
Has a survival instinct and avoids going off-track even in a tight spot
Can ignore the visual noise seen around the track

Combine that with the ability to test the models before showing up at the Expo, and success seemed possible!
Implementation
In this section, I will explain my thinking about why physical racing is so different than virtual racing, as well as describe my approach to training a model that overcomes those differences.
How hard can it be to go straight?
If you have watched DeepRacer over the years, you have probably seen that most models struggle to go straight on the straights and end up oscillating left and right. The question has always been: why is it like that? This behavior causes two issues: the distance driven increases (result: slower lap time) and the car potentially enters the next turn in a way it can’t handle (result: off-track).
A few theories emerged:

Sim-to-real issues – The steering response isn’t matching the simulator, both with regards to the steering geometry and latency (time from picture to servo command, as well as the time it takes the servo to actually actuate). Therefore, when the car tries to adjust the direction on the straight, it doesn’t get the response it expects.
Model issues – A combination of the model not actually using the straight action, and not having access to angles needed to dampen oscillations (2.5–5.0 degrees).
Calibration issues – If the car isn’t calibrated to go straight when given a 0-degree action, and the left/right max values are either too high (tendency to oversteer) or too low (tendency to understeer), you are likely to get control issues and unstable behavior.

My approach:

Use the Ackermann steering geometry patch. With it, the car will behave more realistically, and the turning radius will decrease for a given angle. As a result, the action space can be limited to angles up to about 20 degrees. This roughly matches with the real car’s steering angle.
Include stabilizing steering angles (2.5 and 5.0) in the action space, allowing for minor corrections on the straights.
Use relatively slow speeds (0.8–1.3 m/s) to avoid slipping in the simulator. My theory is that the mismatch between the 15 fps simulator and the 30 fps car effectively translates 1.2 m/s in the simulator into roughly 2.4 m/s in the real world.
Use an inverted-chevron action space that assigns higher speeds to the straight actions, nudging the car toward going straight rather than oscillating left and right.
Try out v3, v4, and v5 physical models—test on a real track to see what works best.
Otherwise, the reward function was the same progress-based reward function I also use in virtual racing.
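
For readers who haven’t seen one, the snippet below is a simplified sketch of a progress-based DeepRacer reward function, not my exact function. It rewards average progress per step, which naturally favors fast, efficient laps.

# Simplified progress-based reward function sketch. DeepRacer passes a params
# dict to reward_function(); "progress" is the percent of track completed and
# "steps" counts decisions taken, so progress per step rewards speed.
def reward_function(params):
    if not params["all_wheels_on_track"]:
        return 1e-3                          # heavily penalize going off-track

    progress = params["progress"]            # 0-100, percent of lap completed
    steps = params["steps"]                  # number of steps taken so far
    if steps == 0:
        return 1e-3

    # Reward average progress per step; faster, smoother laps score higher.
    return float(progress / steps)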

The following figure illustrates the view of testing in the garage, going straight for at least one frame.

Be flexible
Virtual racing is (almost) deterministic, and over time, the model will converge and the car will take a narrow path, reducing the variety in the situations it sees. Early in training, it will frequently be in odd positions, almost going off-track, and it remembers how to get out of these situations. As it converges, the frequency at which it must handle these reduces, and the theory is that the memory fades, and at some point, it forgets how to get out of a tight spot.
My approach:

Diversify training to teach the car to handle a variety of corners, in both directions:

Consistently train models going both clockwise and counterclockwise.
Use tracks—primarily the 2022 Championship track—that are significantly more complex than the Forever Raceway.
Do final optimization on the Forever Raceway—again in both directions.

Take several snapshots during training; don’t go below 0.5 in entropy.
Test on tracks the car has never seen. The simulator has many suitable, narrow tracks—the hallmark of a generalized model is one that can handle tracks it has never seen during training.

Stay focused on the track
In my last post, I looked at the visual differences between the virtual and real worlds. The question is what to do about it. The goal is to trick the model into ignoring the noise and focus on what is important: the track.
My approach:

Train in an environment with significantly more visual noise. The tracks in the custom track repository have added noise through additional lights, buildings, and different walls (and some even come with shadows).
Alter the environment during training to avoid overfitting to the added noise. The custom tracks were made in such a way that different objects (buildings, walls, and lines) could be made invisible at runtime. I had a cron job randomizing the environment every 5 minutes.

The following figure illustrates the varied training environment.

What I didn’t consider this year was simulating blurring during training. I attempted this previously by averaging the current camera frame with the previous one before inferencing. It didn’t seem to help.
Lens distortion is a topic I have observed, but not fully investigated. The original camera has a distinct fish-eye distortion, and Gazebo would be able to replicate it, but it would require some work to actually determine the coefficients. Equally, I have never tried to replicate the rolling motions of the real car.
Testing
Testing took place in the garage on the Trapezoid Narrow track. The track is obviously basic, but with two straights and two 180-degree turns with different radii, it had to do the job. The garage track also had enough visual noise to see if the models were robust enough.
The method was straightforward: try all models both clockwise and counterclockwise. Using the logs captured by the custom car stack, I spent the evening looking through the video of each run to determine which model I liked the best—looking at stability, handling (straight on straights plus precision cornering), and speed.
re:Invent 2024
The track for re:Invent 2024 was the Forever Raceway. The shape of the track isn’t new; it shares the centerline with the 2022 Summit Speedway, but being only 76 cm wide (the original was 1.07 m), the turns become more pronounced, making it a significantly more difficult track.
The environment
The environment is classic re:Invent: a smooth track with very little shine combined with smooth, fairly tall walls surrounding the track. The background is what often causes trouble—this year, a large lit display hung under the ceiling at the far end of the track, and as the following figure shows, it was attracting quite some attention from the GradCam.

Similarly, the pit crew cage, where cars are maintained, attracted attention.

The results
So where did I end up, and why? In Round 1, I finished in 14th place, with a best average of 10.072 seconds and a best lap time of 9.335 seconds. Not great, but also not bad—almost 1 second outside the top 8.
Using the overhead camera provided by AWS through the Twitch stream, it’s possible to create a graphical view showing the path the car took, as shown in the following figure.

If we compare this with how the same model liked to drive in training, we see a bit of a difference.

What becomes obvious quite quickly is that although I succeeded in going straight on the (upper) straight, the car didn’t corner as tightly as during training, making the bottom half of the track a bit of a mess. Nevertheless, the car demonstrated the desired survival instinct and stayed on track even when faced with unexpectedly sharp corners.
Why did this happen:

20 degrees of turning using Ackermann steering is too much; the real car isn’t capable of doing it in the real world
The turning radius is increasing as the speed goes up due to slipping, caused both by low friction and lack of grip due to rolling
The reaction time plays more of a role as the speed increases, and my model acted too late, overshooting into the corner

The combined turning radius and reaction time effect also caused issues at the start. If the car goes slowly, it turns much faster—and ends up going off-track on the inside—causing issues for me and others.

My takeaways:

Overall, the training approach seemed to work well. Well-calibrated cars went straight on the straights, and background noise didn’t seem to bother my models much.
I need to get closer to the car’s actual handling characteristics at speed during training by increasing the max speed and reducing the max angle in the action space.
Physical racing is still not well understood—and it’s a lot about model-meets-car. Some models thrive on objectively perfectly calibrated cars, whereas others work great when matched with a particular one.
Track is king—those that had access to the track, either through their employer or having built one at home, had a massive advantage, even if almost everyone said that they were surprised by which model worked in the end.

Now enjoy the inside view of a car at re:Invent, and see if you can detect any of the issues that I have discussed. The video was recorded after I had been knocked out of the competition using a car with the custom car software.

Closing time: Where do we go from here?
This section is best enjoyed with Semisonic’s Closing Time as a soundtrack.
As we all wrapped up at the Expo after an intense week of racing, re:Invent literally being dismantled around us, the question was: what comes next?
This was the last DeepRacer Championship, but the general sentiment was that whereas nobody will really miss virtual racing—it is a problem solved—physical racing is still a whole lot of fun, and the community is not yet ready to move on. Since re:Invent, several initiatives have gained traction with the common goal of making DeepRacer more accessible:

By enrolling cars with the DeepRacer Custom Car software stack into DeepRacer Event Manager, you can capture car logs and generate the analytics videos, as shown in this article, directly during your event.

DeepRacer Pi and DeepRacer Custom Car initiatives allow racers to build cars at home:

Use off-the-shelf components for a 1:18 scale racer, or

Combine off-the-shelf components with a custom circuit board to build the 1:28 scale DeepRacer Pi Mini.

Both options are compatible with already trained models, including integration with DeepRacer Event Manager.

DeepRacer Custom Console will be a drop-in replacement for the current car UI with a beautiful UI designed in Cloudscape, aligning the design with DREM and the AWS Console.

Prototype DeepRacer Pi Mini – 1:28 scale

Closing Words
DeepRacer is a fantastic way to teach AI in a very physical and visual way, and is suitable for older kids, students, and adults in the corporate setting alike. It will be interesting to see how AWS, its corporate partners, and the community will continue the journey in the years ahead.
A big thank you goes to all of those who have been involved in DeepRacer from its inception to today—too many to be named—it has been a wonderful experience. A big congratulations goes out to this year’s winners!
Closing time, every new beginning comes from some other beginning’s end…

About the Author
Lars Lorentz Ludvigsen is a technology enthusiast who was introduced to AWS DeepRacer in late 2019 and was instantly hooked. Lars works as a Managing Director at Accenture, where he helps clients build the next generation of smart connected products. In addition to his role at Accenture, he’s an AWS Community Builder who focuses on developing and maintaining the AWS DeepRacer community’s software solutions.

Hume Introduces Octave TTS: A New Text-to-Speech Model that Creates Custom AI Voices with Tailored Emotions

In the rapidly evolving field of digital communication, traditional text-to-speech (TTS) systems have often struggled to capture the full range of human emotion and nuance. Conventional systems tend to “read” text in a flat, unvarying tone, missing the subtle inflections and emotional cues that make human speech so engaging. This shortfall poses a challenge for developers and content creators alike, who seek to deliver messages in a manner that truly resonates with their audience. The need for a TTS system that can interpret context and emotion—rather than simply converting text into speech—has been clear for some time, paving the way for new approaches to voice synthesis.

Hume’s Octave TTS represents a measured advancement in the realm of text-to-speech. Unlike earlier models that mechanically produce speech, Octave is designed to understand the context behind the text it processes. It is not merely about the literal conversion of words into sound; it is about conveying the subtleties of meaning, emotion, and style. Whether a piece of text requires a hint of sarcasm, a gentle whisper, or a firm declaration, Octave adjusts its output to better reflect the intended tone. This capability allows for the generation of custom AI voices that are tailored to fit a wide range of scenarios, from straightforward narration to more character-driven storytelling.

Technical Details

Octave TTS is built on a state-of-the-art large language model (LLM) that has been specifically trained for speech synthesis. This technical foundation enables the system to predict not only the words that should be spoken but also how they should be delivered—taking into account rhythm, timbre, and cadence. One of the notable features of Octave is its “Voice Design” function. With this tool, users can provide a simple script or even just descriptive prompts to generate a voice that suits a particular role or character. For example, one might request a voice reminiscent of a patient counselor or a more assertive narrator, and Octave adapts accordingly.

In addition to Voice Design, Octave also offers “Acting Instructions,” which allow users to fine-tune the emotional delivery of a speech segment. A single line can be rendered in multiple styles—whispered, calm, or even carrying a hint of disdain—depending on the instruction given. This flexibility extends the practical utility of Octave TTS, making it applicable across various domains such as education, entertainment, and customer service. Looking ahead, the team at Hume is also preparing to introduce a Voice Cloning feature, which will enable the replication of a specific voice using only a brief audio sample.

Data Insights and Comparative Evaluations

The development and evaluation of Octave TTS have been carried out with a focus on both technical merit and practical application. In an internal study involving 180 human raters, Octave was compared with an established competitor in the TTS field. Participants evaluated voice samples based on audio quality, naturalness, and fidelity to the provided voice description across 120 diverse prompts. The findings showed that Octave was preferred for audio quality in approximately 71.6% of the trials, for naturalness in about 51.7% of the cases, and for matching the intended description in roughly 57.7% of the assessments.

These results suggest that Octave not only produces clear and pleasant audio but also better aligns with the stylistic and emotional expectations of the user. In tandem with these internal tests, Hume has launched the Expressive TTS Arena, a public initiative designed to foster a broader evaluation of expressive speech synthesis. This platform invites the community to test and compare various TTS systems using longer, more nuanced text samples, thereby helping to refine the performance of models like Octave over time.

Conclusion

Hume’s Octave TTS offers a thoughtful improvement over conventional text-to-speech systems by focusing on context, emotion, and flexibility in voice generation. Its ability to interpret and deliver subtle emotional cues allows for a more natural and engaging auditory experience, making it a useful tool for a variety of applications. The technical foundation of Octave, built on an advanced large language model, ensures that the generated speech is not only clear but also reflective of the deeper meaning behind the text.

The internal evaluations and public testing initiatives underscore Octave’s potential to set a new standard in expressive TTS without resorting to overly dramatic claims. Instead, the focus is on practical enhancements that benefit both developers and end users. As the system continues to evolve—with upcoming features such as Voice Cloning on the horizon—Hume remains dedicated to refining AI voice technology in a way that is both technically sound and sensitive to the nuances of human communication.

Check out the Technical Details. All credit for this research goes to the researchers of this project.

The post Hume Introduces Octave TTS: A New Text-to-Speech Model that Creates Custom AI Voices with Tailored Emotions appeared first on MarkTechPost.

Allen Institute for AI Released olmOCR: A High-Performance Open Source …

Access to high-quality textual data is crucial for advancing language models in the digital age. Modern AI systems rely on vast datasets containing trillions of tokens to improve their accuracy and efficiency. While much of this data is from the internet, a significant portion exists in formats such as PDFs, which pose unique challenges for content extraction. Unlike web pages, which are structured for easy parsing, PDFs prioritize visual layout over logical text flow, making it difficult to extract coherent textual representations. Traditional optical character recognition (OCR) tools have attempted to address these challenges, but their limitations have hindered large-scale adoption in language model training.

A main issue with PDF processing is that these documents store information optimally for visual presentation rather than logical reading order. Many PDFs encode text at the character level, recording each letter’s position and font attributes without preserving sentence structure. This makes it difficult to reconstruct a coherent narrative in multi-column layouts or documents with embedded tables, images, and equations. Also, scanned PDFs introduce additional challenges, as they contain text in image format rather than machine-readable characters. Extracting structured and meaningful content from such documents requires specialized tools to understand textual and visual elements.

Several approaches have previously been developed to tackle the problem of extracting text from PDFs. Early OCR technologies like Tesseract provided basic character recognition but struggled with complex layouts. More recent methods include pipeline-based systems, which break extraction into multiple machine-learning tasks such as section segmentation and table recognition; tools like Grobid and VILA, designed for scientific papers, fall into this category. On the other hand, end-to-end models like Nougat and GOT-OCR 2.0 attempt to convert entire PDF pages into readable text using deep learning. However, many of these systems are expensive, unreliable, or inefficient for large-scale applications.

Researchers at the Allen Institute for AI introduced olmOCR, an open-source Python toolkit designed to efficiently convert PDFs into structured plain text while preserving logical reading order. This toolkit integrates text-based and visual information, allowing for superior extraction accuracy compared to conventional OCR methods. The system is built upon a 7-billion-parameter vision language model (VLM), which has been fine-tuned on a dataset of 260,000 PDF pages collected from over 100,000 unique documents. Unlike traditional OCR approaches, which treat PDFs as mere images, olmOCR leverages the embedded text and its spatial positioning to generate high-fidelity structured content. The system is optimized for large-scale batch processing, enabling cost-efficient conversion of vast document repositories. One of its most notable advantages is its ability to process one million PDF pages for just $190 USD, 32 times cheaper than GPT-4o, where the same task would cost $6,200 USD.


The core innovation behind olmOCR is document anchoring, a technique that combines textual metadata with image-based analysis. Unlike end-to-end OCR models that rely solely on rasterized images, this method extracts textual elements directly from the PDF’s embedded data and aligns them with their corresponding visual representations. This enhances the model’s ability to recognize complex document structures, reducing errors and improving overall readability. The extracted content is formatted using Markdown, preserving structured elements like headings, lists, tables, and equations. Also, the system employs fine-tuning techniques to improve extraction accuracy, utilizing a dataset curated specifically for diverse document layouts. The model training process involved 10,000 optimization steps, using a batch size of four and an adaptive learning rate of 1e-6. olmOCR has been designed to operate seamlessly with inference frameworks such as vLLM and SGLang.
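To illustrate the document-anchoring idea in isolation (this is a conceptual sketch, not olmOCR’s actual implementation), the snippet below pulls a page’s embedded text with pypdf, renders the same page with pdf2image, and combines the two into one prompt. The vlm_generate call at the end is a placeholder for whatever vision-language inference framework you use.

from pypdf import PdfReader               # pip install pypdf
from pdf2image import convert_from_path   # pip install pdf2image (requires poppler)

def build_anchored_prompt(pdf_path: str, page_index: int = 0):
    """Pair a page's embedded text with its rendered image (the anchoring idea, roughly)."""
    # 1. Pull the machine-readable text the PDF already contains.
    reader = PdfReader(pdf_path)
    embedded_text = reader.pages[page_index].extract_text() or ""

    # 2. Rasterize the same page so the model also sees the visual layout.
    page_image = convert_from_path(pdf_path,
                                   first_page=page_index + 1,
                                   last_page=page_index + 1)[0]

    # 3. Anchor the image with the embedded text in a single prompt.
    prompt = (
        "Below is the raw text extracted from a PDF page, followed by the page image.\n"
        "Reconstruct the page as clean Markdown in natural reading order.\n\n"
        f"RAW TEXT:\n{embedded_text}\n"
    )
    return prompt, page_image

# prompt, image = build_anchored_prompt("paper.pdf")
# markdown = vlm_generate(prompt, image)  # placeholder for a VLM inference call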


The system achieves an alignment score of 0.875 with its teacher model, surpassing smaller-scale models like GPT-4o Mini. In direct comparison with other OCR tools, olmOCR consistently outperforms competitors in accuracy and efficiency. When subjected to human evaluation, the system received the highest ELO rating among leading PDF extraction methods. Also, when olmOCR-extracted text was used for mid-training on the OLMo-2-1124-7B language model, it resulted in an average accuracy improvement of 1.3 percentage points across multiple AI benchmark tasks. Specific performance gains were observed in datasets such as ARC Challenge and DROP, where olmOCR-based training data contributed to notable improvements in language model comprehension.

Several Key Takeaways from the Research on olmOCR include:

olmOCR is built on a 7-billion-parameter vision-language model and fine-tuned on 260,000 pages from 100,000 PDFs, ensuring robust extraction across diverse document types.

Utilizes document anchoring to combine textual metadata with image-based information, significantly improving the extraction accuracy for structured content.

Processes one million PDF pages for just $190, compared to $6,200 using GPT-4o, making it 32 times more cost-efficient for large-scale applications.

Achieves an alignment score of 0.875, surpassing smaller models and demonstrating superior accuracy in reconstructing logical reading order.

It outperforms traditional OCR tools in structured data recognition and large-scale processing and has the highest ELO score in human evaluations.

Improves language model training by increasing accuracy by 1.3 percentage points on AI benchmark datasets like ARC Challenge and DROP.

Compatible with inference engines like vLLM and SGLang, allowing flexible deployment on various hardware setups.

Check out the Training and toolkit code and Hugging Face collection. All credit for this research goes to the researchers of this project.

The post Allen Institute for AI Released olmOCR: A High-Performance Open Source Toolkit Designed to Convert PDFs and Document Images into Clean and Structured Plain Text appeared first on MarkTechPost.

How to Compare Two LLMs in Terms of Performance: A Comprehensive Web G …

Comparing language models effectively requires a systematic approach that combines standardized benchmarks with use-case specific testing. This guide walks you through the process of evaluating LLMs to make informed decisions for your projects.

Table of Contents

Step 1: Define Your Comparison Goals

Step 2: Choose Appropriate Benchmarks (General Language Understanding; Reasoning & Problem-Solving; Coding & Technical Ability; Truthfulness & Factuality; Instruction Following; Safety Evaluation)

Step 3: Review Existing Leaderboards (Recommended Leaderboards)

Step 4: Set Up Testing Environment (Environment Checklist)

Step 5: Use Evaluation Frameworks (Popular Evaluation Frameworks)

Step 6: Implement Custom Evaluation Tests (Custom Test Categories)

Step 7: Analyze Results (Analysis Techniques)

Step 8: Document and Visualize Findings (Documentation Template)

Step 9: Consider Trade-offs (Key Trade-off Factors)

Step 10: Make an Informed Decision (Final Decision Process)

Step 1: Define Your Comparison Goals

Before diving into benchmarks, clearly establish what you’re trying to evaluate:

Key Questions to Answer:

What specific capabilities matter most for your application?

Are you prioritizing accuracy, speed, cost, or specialized knowledge?

Do you need quantitative metrics, qualitative evaluations, or both?

Pro Tip: Create a simple scoring rubric with weighted importance for each capability relevant to your use case.

Step 2: Choose Appropriate Benchmarks

Different benchmarks measure different LLM capabilities:

General Language Understanding

MMLU (Massive Multitask Language Understanding)

HELM (Holistic Evaluation of Language Models)

BIG-Bench (Beyond the Imitation Game Benchmark)

Reasoning & Problem-Solving

GSM8K (Grade School Math 8K)

MATH (Mathematics Aptitude Test of Heuristics)

LogiQA (Logical Reasoning)

Coding & Technical Ability

HumanEval (Python Function Synthesis)

MBPP (Mostly Basic Python Programming)

DS-1000 (Data Science Problems)

Truthfulness & Factuality

TruthfulQA (Truthful Question Answering)

FActScore (Factuality Scoring)

Instruction Following

Alpaca Eval

MT-Bench (Multi-Turn Benchmark)

Safety Evaluation

Anthropic’s Red Teaming dataset

SafetyBench

Pro Tip: Focus on benchmarks that align with your specific use case rather than trying to test everything.

Step 3: Review Existing Leaderboards

Save time by checking published results on established leaderboards:

Recommended Leaderboards

Hugging Face Open LLM Leaderboard

Stanford CRFM HELM Leaderboard

LMSys Chatbot Arena

Papers with Code LLM benchmarks

Step 4: Set Up Testing Environment

Ensure fair comparison with consistent test conditions:

Environment Checklist

Use identical hardware for all tests when possible

Control for temperature, max tokens, and other generation parameters

Document API versions or deployment configurations

Standardize prompt formatting and instructions

Use the same evaluation criteria across models

Pro Tip: Create a configuration file that documents all your testing parameters for reproducibility.
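For example, a minimal sketch of such a configuration file, written from Python; all field names and values here are illustrative rather than a standard schema:

import json

# Shared test configuration so every model is evaluated under identical conditions.
test_config = {
    "models": ["model-a", "model-b"],        # identifiers for the systems under test
    "generation": {
        "temperature": 0.0,                  # deterministic decoding for comparability
        "max_tokens": 512,
        "top_p": 1.0,
    },
    "prompt_template": "You are a helpful assistant.\n\n{question}",
    "benchmarks": ["mmlu_subset", "gsm8k_subset", "custom_domain_set"],
    "api_versions": {"model-a": "2024-06-01", "model-b": "2024-05-15"},
    "seed": 42,
}

with open("eval_config.json", "w") as f:
    json.dump(test_config, f, indent=2)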

Step 5: Use Evaluation Frameworks

Several frameworks can help automate and standardize your evaluation process:

Popular Evaluation Frameworks

LMSYS Chatbot Arena: best for human evaluations; web-based, no installation required

LangChain Evaluation: best for workflow testing; install with pip install langchain-eval

EleutherAI LM Evaluation Harness: best for academic benchmarks; install with pip install lm-eval

DeepEval: best for unit testing; install with pip install deepeval

Promptfoo: best for prompt comparison; install with npm install -g promptfoo

TruLens: best for feedback analysis; install with pip install trulens-eval

See each framework’s documentation for usage details.

Step 6: Implement Custom Evaluation Tests

Go beyond standard benchmarks with tests tailored to your needs:

Custom Test Categories

Domain-specific knowledge tests relevant to your industry

Real-world prompts from your expected use cases

Edge cases that push the boundaries of model capabilities

A/B comparisons with identical inputs across models (a minimal harness sketch follows this list)

User experience testing with representative users

Pro Tip: Include both “expected” scenarios and “stress test” scenarios that challenge the models.
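To make the A/B comparison idea concrete, here is a small, self-contained harness sketch. The query_model_a and query_model_b functions are placeholders you would wire up to your actual APIs, the test cases are invented examples, and the keyword-based grader is deliberately simplistic; treat the structure, not the specifics, as the point.

from typing import Callable, Dict, List

# Placeholder model clients -- replace with real API calls (hosted endpoint, local model, etc.).
def query_model_a(prompt: str) -> str:
    raise NotImplementedError("wire this to model A")

def query_model_b(prompt: str) -> str:
    raise NotImplementedError("wire this to model B")

# Each test case pairs a real-world prompt with keywords a good answer should mention.
TEST_CASES: List[Dict] = [
    {"prompt": "Summarize our refund policy for a frustrated customer.",
     "expected_keywords": ["refund", "30 days"]},
    {"prompt": "Explain what an API rate limit is to a non-technical user.",
     "expected_keywords": ["limit", "requests"]},
]

def keyword_score(answer: str, keywords: List[str]) -> float:
    """Fraction of expected keywords present -- a crude but reproducible grader."""
    answer_lower = answer.lower()
    return sum(k.lower() in answer_lower for k in keywords) / len(keywords)

def run_ab_test(model_fn: Callable[[str], str]) -> float:
    scores = [keyword_score(model_fn(case["prompt"]), case["expected_keywords"])
              for case in TEST_CASES]
    return sum(scores) / len(scores)

# print("Model A:", run_ab_test(query_model_a))
# print("Model B:", run_ab_test(query_model_b))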

Step 7: Analyze Results

Transform raw data into actionable insights:

Analysis Techniques

Compare raw scores across benchmarks

Normalize results to account for different scales

Calculate performance gaps as percentages (a short worked example follows this list)

Identify patterns of strengths and weaknesses

Consider statistical significance of differences

Plot performance across different capability domains
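The normalization and gap calculations above take only a few lines of Python. The benchmark scores below are made-up placeholders used solely to show the arithmetic; substitute your own measured results.

# Placeholder benchmark scores (higher is better); replace with your measurements.
raw_scores = {
    "model-a": {"mmlu": 68.2, "gsm8k": 57.1, "humaneval": 40.9},
    "model-b": {"mmlu": 71.5, "gsm8k": 61.4, "humaneval": 38.2},
}

# Normalize to a common 0-1 scale; here every benchmark is a percentage, so divide by 100.
normalized = {model: {bench: score / 100 for bench, score in scores.items()}
              for model, scores in raw_scores.items()}

for bench in sorted(raw_scores["model-a"]):
    a, b = raw_scores["model-a"][bench], raw_scores["model-b"][bench]
    leader = "model-a" if a >= b else "model-b"
    # Gap expressed as a percentage of the weaker model's score.
    gap_pct = abs(a - b) / min(a, b) * 100
    print(f"{bench}: model-a={a:.1f}, model-b={b:.1f} -> {leader} leads by {gap_pct:.1f}%")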

Step 8: Document and Visualize Findings

Create clear, scannable documentation of your results:

Documentation Template

Step 9: Consider Trade-offs

Look beyond raw performance to make a holistic assessment:

Key Trade-off Factors

Cost vs. performance – is the improvement worth the price?

Speed vs. accuracy – do you need real-time responses?

Context window – can it handle your document lengths?

Specialized knowledge – does it excel in your domain?

API reliability – is the service stable and well-supported?

Data privacy – how is your data handled?

Update frequency – how often is the model improved?

Pro Tip: Create a weighted decision matrix that factors in all relevant considerations.
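One way to build such a weighted decision matrix in a few lines; the weights and 1–5 scores below are arbitrary examples, not recommendations:

# Weights reflect what matters for *your* application and should sum to 1.0.
weights = {"accuracy": 0.4, "cost": 0.2, "latency": 0.2, "context_window": 0.1, "privacy": 0.1}

# Scores on a 1-5 scale taken from your evaluation; these numbers are illustrative only.
scores = {
    "model-a": {"accuracy": 4, "cost": 3, "latency": 5, "context_window": 3, "privacy": 4},
    "model-b": {"accuracy": 5, "cost": 2, "latency": 3, "context_window": 5, "privacy": 4},
}

for model, factor_scores in scores.items():
    weighted_total = sum(weights[f] * factor_scores[f] for f in weights)
    print(f"{model}: weighted score = {weighted_total:.2f}")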

Step 10: Make an Informed Decision

Translate your evaluation into action:

Final Decision Process

Rank models based on performance in priority areas

Calculate total cost of ownership over expected usage period

Consider implementation effort and integration requirements

Pilot test the leading candidate with a subset of users or data

Establish ongoing evaluation processes for monitoring performance

Document your decision rationale for future reference

The post How to Compare Two LLMs in Terms of Performance: A Comprehensive Web Guide for Evaluating and Benchmarking Language Models appeared first on MarkTechPost.

How Pattern PXM’s Content Brief is driving conversion on ecommerce m …

Brands today are juggling a million things, and keeping product content up-to-date is at the top of the list. Between decoding the endless requirements of different marketplaces, wrangling inventory across channels, adjusting product listings to catch a customer’s eye, and trying to outpace shifting trends and fierce competition, it’s a lot. And let’s face it—staying ahead of the ecommerce game can feel like running on a treadmill that just keeps speeding up. For many, it results in missed opportunities and revenue that doesn’t quite hit the mark.

“Managing a diverse range of products and retailers is so challenging due to the varying content requirements, imagery, different languages for different regions, formatting and even the target audiences that they serve.”
– Martin Ruiz, Content Specialist, Kanto

Pattern is a leader in ecommerce acceleration, helping brands navigate the complexities of selling on marketplaces and achieve profitable growth through a combination of proprietary technology and on-demand expertise. Pattern was founded in 2013 and has expanded to over 1,700 team members in 22 global locations, addressing the growing need for specialized ecommerce expertise.
Pattern has over 38 trillion proprietary ecommerce data points, 12 tech patents and patents pending, and deep marketplace expertise. Pattern partners with hundreds of brands, like Nestle and Philips, to drive revenue growth. As the top third-party seller on Amazon, Pattern uses this expertise to optimize product listings, manage inventory, and boost brand presence across multiple services simultaneously.
In this post, we share how Pattern uses AWS services to process trillions of data points to deliver actionable insights, optimizing product listings across multiple services.
Content Brief: Data-backed content optimization for product listings
Pattern’s latest innovation, Content Brief, is a powerful AI-driven tool designed to help brands optimize their product listings and accelerate growth across online marketplaces. Using Pattern’s dataset of over 38 trillion ecommerce data points, Content Brief provides actionable insights and recommendations to create standout product content that drives traffic and conversions.
Content Brief analyzes consumer demographics, discovery behavior, and content performance to give brands a comprehensive understanding of their product’s position in the marketplace. What would normally require months of research and work is now done in minutes. Content Brief takes the guesswork out of product strategy with tools that do the heavy lifting. Its attribute importance ranking shows you which product features deserve the spotlight, and the image archetype analysis makes sure your visuals engage customers.
As shown in the following screenshot, the image archetype feature shows attributes that are driving sales in a given category, allowing brands to highlight the most impactful features in the image block and A+ image content.

Content Brief incorporates review and feedback analysis capabilities. It uses sentiment analysis to process customer reviews, identifying recurring themes in both positive and negative feedback, and highlights areas for potential improvement.

Content Brief’s Search Family analysis groups similar search terms together, helping brands understand distinct customer intent and tailor their content accordingly. This feature combined with detailed persona insights helps marketers create highly targeted content for specific segments. It also offers competitive analysis, providing side-by-side comparisons with competing products, highlighting areas where a brand’s product stands out or needs improvement.

“This is the thing we need the most as a business. We have all of the listening tools, review sentiment, keyword things, but nothing is in a single place like this and able to be optimized to my listing. And the thought of writing all those changes back to my PIM and then syndicating to all of my retailers, this is giving me goosebumps.”
– Marketing executive, Fortune 500 brand

Brands using Content Brief can more quickly identify opportunities for growth, adapt to change, and maintain a competitive edge in the digital marketplace. From search optimization and review analysis to competitive benchmarking and persona targeting, Content Brief empowers brands to create compelling, data-driven content that drives both traffic and conversions.
Select Brands looked to improve their Amazon performance and partnered with Pattern. Content Brief’s insights led to updates that caused a transformation for their Triple Buffet Server listing’s image stack. Their old image stack was created for marketplace requirements, whereas the new image stack was optimized with insights based on product attributes to highlight from category and sales data. The updated image stack featured bold product highlights and captured shoppers with lifestyle imagery. The results were a 21% MoM revenue surge, 14.5% more traffic, and a 21 bps conversion lift.

“Content Brief is a perfect example of why we chose to partner with Pattern. After just one month of testing, we see how impactful it can be for driving incremental growth—even on products that are already performing well. We have a product that, together with Pattern, we were able to grow into a top performer in its category in less than 2 years, and it’s exciting to see how adding this additional layer can grow revenue even for that product, which we already considered to be strong.”
– Eric Endres, President, Select Brands

To discover how Content Brief helped Select Brands boost their Amazon performance, refer to the full case study.
The AWS backbone of Content Brief
At the heart of Pattern’s architecture lies a carefully orchestrated suite of AWS services. Amazon Simple Storage Service (Amazon S3) serves as the cornerstone for storing product images, crucial for comprehensive ecommerce analysis. Amazon Textract is employed to extract and analyze text from these images, providing valuable insights into product presentation and enabling comparisons with competitor listings. Meanwhile, Amazon DynamoDB acts as the powerhouse behind Content Brief’s rapid data retrieval and processing capabilities, storing both structured and unstructured data, including content brief object blobs.
Pattern’s approach to data management is both innovative and efficient. As data is processed and analyzed, they create a shell in DynamoDB for each content brief, progressively injecting data as it’s processed and refined. This method allows for rapid access to partial results and enables further data transformations as needed, making sure that brands have access to the most up-to-date insights.
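A simplified sketch of this shell-then-enrich pattern is shown below, using boto3. The table name, key, and attribute names are invented for illustration and are not Pattern’s actual schema.

import boto3
from datetime import datetime, timezone

# Table, key, and attribute names below are invented for illustration only.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("content-briefs")

def create_brief_shell(brief_id: str, product_id: str) -> None:
    """Create an empty 'shell' record as soon as a content brief is requested."""
    table.put_item(Item={
        "brief_id": brief_id,
        "product_id": product_id,
        "status": "PROCESSING",
        "created_at": datetime.now(timezone.utc).isoformat(),
    })

def inject_section(brief_id: str, section_name: str, payload: dict) -> None:
    """Progressively attach each analyzed section (reviews, search families, ...) as it completes."""
    table.update_item(
        Key={"brief_id": brief_id},
        UpdateExpression="SET #sec = :payload",
        ExpressionAttributeNames={"#sec": section_name},
        ExpressionAttributeValues={":payload": payload},
    )

# create_brief_shell("brief-123", "product-abc")
# inject_section("brief-123", "review_sentiment", {"positive_themes": ["easy cleanup"]})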
The following diagram illustrates the pipeline workflow and architecture.

Scaling to handle 38 trillion data points
Processing over 38 trillion data points is no small feat, but Pattern has risen to the challenge with a sophisticated scaling strategy. At the core of this strategy is Amazon Elastic Container Service (Amazon ECS) with GPU support, which handles the computationally intensive tasks of natural language processing and data science. This setup allows Pattern to dynamically scale resources based on demand, providing optimal performance even during peak processing times.
To manage the complex flow of data between various AWS services, Pattern employs Apache Airflow. This orchestration tool manages the intricate dance of data with a primary DAG, creating and managing numerous sub-DAGs as needed. This innovative use of Airflow allows Pattern to efficiently manage complex, interdependent data processing tasks at scale.
But scaling isn’t just about processing power—it’s also about efficiency. Pattern has implemented batching techniques in their AI model calls, resulting in up to 50% cost reduction for two-batch processing while maintaining high throughput. They’ve also implemented cross-region inference to improve scalability and reliability across different geographical areas.
To keep a watchful eye on their system’s performance, Pattern employs LLM observability techniques. They monitor AI model performance and behavior, enabling continuous system optimization and making sure that Content Brief is operating at peak efficiency.
Using Amazon Bedrock for AI-powered insights
A key component of Pattern’s Content Brief solution is Amazon Bedrock, which plays a pivotal role in their AI and machine learning (ML) capabilities. Pattern uses Amazon Bedrock to implement a flexible and secure large language model (LLM) strategy.
Model flexibility and optimization
Amazon Bedrock offers support for multiple foundation models (FMs), which allows Pattern to dynamically select the most appropriate model for each specific task. This flexibility is crucial for optimizing performance across various aspects of Content Brief:

Natural language processing – For analyzing product descriptions, Pattern uses models optimized for language understanding and generation.
Sentiment analysis – When processing customer reviews, Amazon Bedrock enables the use of models fine-tuned for sentiment classification.
Image analysis – Pattern currently uses Amazon Textract for extracting text from product images. However, Amazon Bedrock also offers advanced vision-language models that could potentially enhance image analysis capabilities in the future, such as detailed object recognition or visual sentiment analysis.

The ability to rapidly prototype on different LLMs is a key component of Pattern’s AI strategy. Amazon Bedrock offers quick access to a variety of cutting-edge models to facilitate this process, allowing Pattern to continuously evolve Content Brief and use the latest advancements in AI technology. Today, this allows the team to build seamless integrations with various state-of-the-art language models tailored to different tasks, including the new, cost-effective Amazon Nova models.
Prompt engineering and efficiency
Pattern’s team has developed a sophisticated prompt engineering process, continually refining their prompts to optimize both quality and efficiency. Amazon Bedrock offers support for custom prompts, which allows Pattern to tailor the model’s behavior precisely to their needs, improving the accuracy and relevance of AI-generated insights.
Moreover, Amazon Bedrock offers efficient inference capabilities that help Pattern optimize token usage, reducing costs while maintaining high-quality outputs. This efficiency is crucial when processing the vast amounts of data required for comprehensive ecommerce analysis.
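As a rough sketch of what a single Bedrock call for one Content Brief sub-task might look like via the Converse API in boto3 (the model ID, prompt, and token limit here are arbitrary choices, not Pattern’s production configuration):

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Model ID and prompt are illustrative; swap in whichever foundation model fits the task.
response = bedrock_runtime.converse(
    modelId="amazon.nova-lite-v1:0",
    messages=[{
        "role": "user",
        "content": [{"text": "Summarize the recurring themes in these product reviews: ..."}],
    }],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

summary = response["output"]["message"]["content"][0]["text"]
print(summary)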
Security and data privacy
Pattern uses the built-in security features of Amazon Bedrock to uphold data protection and compliance. By employing AWS PrivateLink, data transfers between Pattern’s virtual private cloud (VPC) and Amazon Bedrock occur over private IP addresses, never traversing the public internet. This approach significantly enhances security by reducing exposure to potential threats.
Furthermore, the Amazon Bedrock architecture makes sure that Pattern’s data remains within their AWS account throughout the inference process. This data isolation provides an additional layer of security and helps maintain compliance with data protection regulations.

“Amazon Bedrock’s flexibility is crucial in the ever-evolving landscape of AI, enabling Pattern to utilize the most effective and efficient models for their diverse ecommerce analysis needs. The service’s robust security features and data isolation capabilities give us peace of mind, knowing that our data and our clients’ information are protected throughout the AI inference process.”
– Jason Wells, CTO, Pattern

Building on Amazon Bedrock, Pattern has created a secure, flexible, and efficient AI-powered solution that continuously evolves to meet the dynamic needs of ecommerce optimization.
Conclusion
Pattern’s Content Brief demonstrates the power of AWS in revolutionizing data-driven solutions. By using services like Amazon Bedrock, DynamoDB, and Amazon ECS, Pattern processes over 38 trillion data points to deliver actionable insights, optimizing product listings across multiple services.
Inspired to build your own innovative, high-performance solution? Explore AWS’s suite of services at aws.amazon.com and discover how you can harness the cloud to bring your ideas to life. To learn more about how Content Brief could help your brand optimize its ecommerce presence, visit pattern.com.

About the Author
Parker Bradshaw is an Enterprise SA at AWS who focuses on storage and data technologies. He helps retail companies manage large data sets to boost customer experience and product quality. Parker is passionate about innovation and building technical communities. In his free time, he enjoys family activities and playing pickleball.

How to configure cross-account model deployment using Amazon Bedrock C …

In enterprise environments, organizations often divide their AI operations into two specialized teams: an AI research team and a model hosting team. The research team is dedicated to developing and enhancing AI models using model training and fine-tuning techniques. Meanwhile, a separate hosting team is responsible for deploying these models across their own development, staging, and production environments.
With Amazon Bedrock Custom Model Import, the hosting team can import and serve custom models using supported architectures such as Meta Llama 2, Llama 3, and Mistral using On-Demand pricing. Teams can import models with weights in Hugging Face safetensors format from Amazon SageMaker or from Amazon Simple Storage Service (Amazon S3). These imported custom models work alongside existing Amazon Bedrock foundation models (FMs) through a single, unified API in a serverless manner, alleviating the need to manage model deployment and scaling.
However, in such enterprise environments, these teams often work in separate AWS accounts for security and operational reasons. The model development team’s training results, known as model artifacts, for example model weights, are typically stored in S3 buckets within the research team’s AWS account, but the hosting team needs to access these artifacts from another account to deploy models. This creates a challenge: how do you securely share model artifacts between accounts?
This is where cross-account access becomes important. With Amazon Bedrock Custom Model Import cross-account support, we can help you configure direct access between the S3 buckets storing model artifacts and the hosting account. This streamlines your operational workflow while maintaining security boundaries between teams. One of our customers quotes:

Bedrock Custom Model Import cross-account support helped AI Platform team to simplify the configuration, reduce operational overhead and secure models in the original location.
– Scott Chang, Principal Engineer, AI Platform at Salesforce

In this guide, we walk you through step-by-step instructions for configuring cross-account access for Amazon Bedrock Custom Model Import, covering both non-encrypted and AWS Key Management Service (AWS KMS) based encrypted scenarios.
Example scenario
For this walkthrough, consider two AWS accounts:

Model Development account (111122223333):

Stores model artifacts (custom weights and configurations) in an S3 bucket called model-artifacts-111122223333
Optionally encrypts artifacts using AWS KMS customer managed key kms-cmk-111122223333

Model Hosting account (777788889999):

Hosts models using Amazon Bedrock Custom Model Import
Uses a new AWS Identity and Access Management (IAM) execution role BedrockCMIExecutionRole-777788889999
Can optionally encrypt artifacts using AWS KMS key kms-cmk-777788889999

The following figure illustrates this setup, showing how the cross-account access is configured between the S3 bucket, KMS keys, and Amazon Bedrock Custom Model Import.

To successfully implement the described scenario while adhering to the principle of least privilege access, the following steps must be executed:

The Model Development account must provide access to the Model Hosting account’s IAM role BedrockCMIExecutionRole-777788889999, allowing it to utilize their S3 bucket and, if applicable, the encryption key, using resource-based policies.
The Model Hosting account should establish an IAM role, such as BedrockCMIExecutionRole-777788889999. The identity-based policies needed would be for the Model Development S3 bucket and customer managed keys for decrypting model artifacts, like using kms-cmk-111122223333.
The Model Hosting account must enable the Amazon Bedrock service to assume the IAM role BedrockCMIExecutionRole-777788889999, created in step 2, by including the Amazon Bedrock service as a trusted entity. This IAM role will be utilized by the Model Hosting account to initiate the custom model import job.

Prerequisites
Before you can start a custom model import job, you need to fulfill the following prerequisites:

If you’re importing your model from an S3 bucket, prepare your model files in the Hugging Face weights format. For more information refer to Import source.
(Optional) Set up extra security configurations.

You can encrypt input and output data, import jobs, or inference requests made to imported models. For more information refer to Encryption of custom model import.
You can create a virtual private cloud (VPC) to protect your customization jobs. For more information, refer to (Optional) Protect custom model import jobs using a VPC.

Step-by-step execution
The following section provides the step-by-step execution of the previously outlined high-level process, from the perspective of an administrator managing both accounts:
Step 1: Set up the S3 bucket policy (in the Model Development account) to enable access for the Model Hosting account’s IAM role:

Sign in to the AWS Management Console for account 111122223333, then access the Amazon S3 console.
On the General purpose buckets view, locate model-artifacts-111122223333, the bucket used by the model development team to store their model artifacts.
On the Permissions tab, select Edit in the Bucket policy section, and insert the following IAM resource-based policy. Be sure to update the AWS account IDs (shown in red) in the policy with your information.

{
  "Version": "2012-10-17",
  "Id": "AllowCrossAccountS3Access",
  "Statement": [
    {
      "Sid": "cross-account-list-get",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::777788889999:root"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::model-artifacts-111122223333",
        "arn:aws:s3:::model-artifacts-111122223333/*"
      ],
      "Condition": {
        "ArnLike": {
          "aws:PrincipalArn": "arn:aws:iam::777788889999:role/BedrockCMIExecutionRole-777788889999*"
        }
      }
    }
  ]
}

Step 2: Establish an IAM role (in the Model Hosting account) and authorize Amazon Bedrock to assume this role:

Sign in to the AWS console for account 777788889999 and launch the IAM console.
In the left navigation pane, select Policies and then choose Create policy. Within the Policy Editor, switch to the JSON tab and insert the following identity-based policy. This policy is designed for read-only access, enabling users or a role to list and download objects from a specified S3 bucket, but only if the bucket is owned by account 111122223333. Customize the AWS account ID and S3 bucket name/prefix (shown in red) with your information.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "1",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::model-artifacts-111122223333",
        "arn:aws:s3:::model-artifacts-111122223333/*"
      ],
      "Condition": {
        "StringEquals": {
          "aws:ResourceAccount": "111122223333"
        }
      }
    }
  ]
}

Choose Next, assign the policy name as BedrockCMIExecutionPolicy-777788889999, and finalize by choosing Create policy.
In the left navigation pane, choose Roles and select Custom trust policy as the Trusted entity type. Insert the following trusted entity policy, which restricts the role assumption to the Amazon Bedrock service, specifically for model import jobs in account 777788889999 located in the US East (N. Virginia) us-east-1 Region. Modify the AWS account ID and Region (shown in red) with your information.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "1",
      "Effect": "Allow",
      "Principal": {
        "Service": "bedrock.amazonaws.com"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "777788889999"
        },
        "ArnEquals": {
          "aws:SourceArn": "arn:aws:bedrock:us-east-1:777788889999:model-import-job/*"
        }
      }
    }
  ]
}

Choose Next and in the Add permissions section, search for the policy created in the previous step BedrockCMIExecutionPolicy-777788889999, select the checkbox, and proceed by choosing Next.
Assign the Role name as BedrockCMIExecutionRole-777788889999, provide a Description as “IAM execution role to be used by CMI jobs,” and finalize by choosing Create role.

Important: If you’re using an AWS KMS encryption key for model artifacts in the Model Development account or for imported model artifacts with the Amazon Bedrock managed AWS account, proceed with steps 3 through 5. If not, skip to step 6.
Step 3: Adjust the AWS KMS key policy (in the Model Development account) to allow the Amazon Bedrock CMI execution IAM role to decrypt model artifacts:

Transition back to the Model Development account and find the AWS KMS key named kms-cmk-111122223333 in the AWS KMS console. Note the AWS KMS key Amazon Resource Name (ARN).
On the Key policy tab, switch to the Policy view, and incorporate the following resource-based policy statement to enable the Model Hosting account’s IAM role BedrockCMIExecutionRole-777788889999 to decrypt model artifacts. Revise items in red with your information.

{
  "Sid": "Allow use of the key by the destination account",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::777788889999:role/BedrockCMIExecutionRole-777788889999"
  },
  "Action": [
    "kms:Decrypt",
    "kms:DescribeKey"
  ],
  "Resource": "*"
}

Step 4: Set the AWS KMS key policy (in the Model Hosting account) for the CMI execution IAM role to encrypt and decrypt model artifacts to securely store in the Amazon Bedrock AWS account:

Return to the Model Hosting account and locate the AWS KMS key named kms-cmk-777788889999 in the AWS KMS console. Note the AWS KMS key ARN.
Insert the following statement into the AWS KMS key’s resource-based policy to enable the BedrockCMIExecutionRole-777788889999 IAM role to encrypt and decrypt model artifacts at rest in the Amazon Bedrock managed AWS account. Revise items in red with your information.

{
  "Sid": "Allow use of the key",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::777788889999:role/BedrockCMIExecutionRole-777788889999"
  },
  "Action": [
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:ReEncrypt*",
    "kms:GenerateDataKey*",
    "kms:DescribeKey"
  ],
  "Resource": "*"
}

Step 5: Modify the CMI execution role’s permissions (in the Model Hosting account) to provide access to encryption keys:
Access the IAM console and find the IAM policy BedrockCMIExecutionPolicy-777788889999. To the existing identity-based policy, append the following statements (replace the ARNs in red with the ones noted in steps 3 and 4):

{
  "Effect": "Allow",
  "Action": [
    "kms:Decrypt",
    "kms:DescribeKey"
  ],
  "Resource": "arn:aws:kms:us-east-1:111122223333:key/b5b6e052-fb27-4dbb-bf0d-daf3375a9fda"
},
{
  "Effect": "Allow",
  "Action": [
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:ReEncrypt*",
    "kms:GenerateDataKey*",
    "kms:DescribeKey"
  ],
  "Resource": "arn:aws:kms:us-east-1:777788889999:key/6cd5d3bf-3d9b-4d1c-83d5-8df6284435a1"
}

Step 6: Initiate the Model import job (in the Model Hosting account)
In this step, we execute the model import job using the AWS Command Line Interface (AWS CLI) command. You can also use AWS SDKs or APIs for the same purpose. Run the following command from your terminal session with an IAM user or role that has the necessary privileges to create a custom model import job. You don’t need to explicitly provide an ARN or details of the CMK used by the Model Development team.

aws bedrock create-model-import-job \
  --job-name "cmi-job-777788889999-01" \
  --imported-model-name "mistral-777788889999-01" \
  --role-arn "arn:aws:iam::777788889999:role/BedrockCMIExecutionRole-777788889999" \
  --model-data-source "s3DataSource={s3Uri=s3://model-artifacts-111122223333/mistral-model-weights/}"

When encrypting model artifacts with Amazon Bedrock Custom Model Import, use the --imported-model-kms-key-id flag and specify the ARN of the Model Hosting account’s customer managed key (CMK).

aws bedrock create-model-import-job \
  --job-name "cmi-job-777788889999-04" \
  --imported-model-name "mistral-777788889999-01" \
  --role-arn "arn:aws:iam::777788889999:role/BedrockCMIExecutionRole-777788889999" \
  --model-data-source "s3DataSource={s3Uri=s3://model-artifacts-111122223333/mistral-model-weights/}" \
  --imported-model-kms-key-id "arn:aws:kms:us-east-1:777788889999:key/6cd5d3bf-3d9b-4d1c-83d5-8df6284435a1"

Cross-account access to the S3 bucket using the custom model import job is only supported through AWS CLI, AWS SDKs, or APIs. Console support is not yet available.
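For teams starting the job from the AWS SDK for Python (boto3) rather than the CLI, a minimal sketch might look like the following. It mirrors the CLI example above and reuses this walkthrough’s placeholder bucket, role, and key names; parameter names follow the CreateModelImportJob API, but verify them against the current SDK documentation.

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Mirrors the CLI example above; all names are the walkthrough's placeholders.
response = bedrock.create_model_import_job(
    jobName="cmi-job-777788889999-02",
    importedModelName="mistral-777788889999-02",
    roleArn="arn:aws:iam::777788889999:role/BedrockCMIExecutionRole-777788889999",
    modelDataSource={
        "s3DataSource": {
            "s3Uri": "s3://model-artifacts-111122223333/mistral-model-weights/"
        }
    },
    # Optional: encrypt the imported model with the hosting account's CMK.
    # importedModelKmsKeyId="arn:aws:kms:us-east-1:777788889999:key/6cd5d3bf-3d9b-4d1c-83d5-8df6284435a1",
)
print(response["jobArn"])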
Troubleshooting
When IAM policy misconfigurations prevent a custom model import job, you might encounter an error like:

Amazon Bedrock does not have access to the S3 location (s3://model-artifacts-111122223333/mistral-model-weights). Update the permissions and try again.

To resolve this, manually verify access to Model Development’s S3 bucket from the Model Hosting account by assuming the BedrockCMIExecutionRole-777788889999. Follow these steps:
Step 1: Identify the current IAM role or user in the CLI with the following and copy the ARN from the output:

aws sts get-caller-identity

Step 2: Update trust relationships. Append the following statement to the trust policy of BedrockCMIExecutionRole-777788889999 to allow the current user or IAM role to assume it:

{
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::777788889999:role/current-user-role"
  },
  "Action": "sts:AssumeRole"
}

Step 3: List or copy the S3 bucket contents assuming the Amazon Bedrock Custom Model Import execution role

Assume the CMI execution role (replace the ARN with your information):

aws sts assume-role \
  --role-arn "arn:aws:iam::777788889999:role/BedrockCMIExecutionRole-777788889999" \
  --role-session-name "BedrockCMISession"

Export the returned temporary credentials as environment variables:

export AWS_ACCESS_KEY_ID="ASIA…"
export AWS_SECRET_ACCESS_KEY="…"
export AWS_SESSION_TOKEN="…"

Run commands to troubleshoot permission issues:

aws s3 ls s3://model-artifacts-111122223333/mistral-model-weights/
aws s3 cp s3://model-artifacts-111122223333/mistral-model-weights/config.json .

If errors persist, consider using Amazon Q Developer or refer to additional resources outlined in the IAM User Guide.
Cleanup
There is no additional charge to import a custom model to Amazon Bedrock (refer to step 6 in the Step-by-step execution section). However, if your model isn’t in use for inference, and you want to avoid paying storage costs (refer to Amazon Bedrock pricing), delete the imported model using the AWS console or AWS CLI reference or API Reference. For example (replace the text in red with your imported model name):

aws bedrock delete-imported-model \
  --model-identifier "mistral-777788889999-01"

Conclusion
By using cross-account access in Amazon Bedrock Custom Model Import, organizations can significantly streamline their AI model deployment workflows.
Amazon Bedrock Custom Model Import is generally available today in Amazon Bedrock in the US East (N. Virginia) us-east-1 and US West (Oregon) us-west-2 AWS Regions. Refer to the full Region list for future updates. To learn more, refer to the Amazon Bedrock Custom Model Import product page and Amazon Bedrock pricing page. Give Amazon Bedrock Custom Model Import a try in the Amazon Bedrock console today and send feedback to AWS re:Post for Amazon Bedrock or through your usual AWS Support contacts.
Thank you to our contributors Scott Chang (Salesforce), Raghav Tanaji (Salesforce), Rupinder Grewal (AWS), Ishan Singh (AWS), and Dharinee Gupta (AWS)

About the Authors
Hrushikesh Gangur is a Principal Solutions Architect at AWS. Based in San Francisco, California, Hrushikesh is an expert in AWS machine learning. As a thought leader in the field of generative AI, Hrushikesh has contributed to AWS’s efforts in helping startups and ISVs build and deploy AI applications. His expertise extends to various AWS services, including Amazon SageMaker, Amazon Bedrock, and accelerated computing which are crucial for building AI applications.
Sai Darahas Akkineni is a Software Development Engineer at AWS. He holds a master’s degree in Computer Engineering from Cornell University, where he worked in the Autonomous Systems Lab with a specialization in computer vision and robot perception. Currently, he helps deploy large language models to optimize throughput and latency.
Prashant Patel is a Senior Software Development Engineer in AWS. He’s passionate about scaling large language models for enterprise applications. Prior to joining AWS, he worked at IBM on productionizing large-scale AI/ML workloads on Kubernetes. Prashant has a master’s degree from NYU Tandon School of Engineering. While not at work, he enjoys traveling and playing with his dogs.

ByteDance processes billions of daily videos using their multimodal vi …

This is a guest post authored by the team at ByteDance.
ByteDance is a technology company that operates a range of content platforms to inform, educate, entertain, and inspire people across languages, cultures, and geographies. Users trust and enjoy our content platforms because of the rich, intuitive, and safe experiences they provide. These experiences are made possible by our machine learning (ML) backend engine, with ML models built for video understanding, search, recommendation, advertising, and novel visual effects.
In support of its mission to “Inspire Creativity and Enrich Life,” we’ve made it straightforward and fun for people to engage with, create, and consume content. People can also discover and transact with a suite of more than a dozen products and services, such as CapCut, e-Shop, Lark, Pico, and Mobile Legends: Bang Bang.
At ByteDance, we collaborated with Amazon Web Services (AWS) to deploy multimodal large language models (LLMs) for video understanding using AWS Inferentia2 across multiple AWS Regions around the world. By using sophisticated ML algorithms, the platform efficiently scans billions of videos each day. We use this process to identify and flag content that violates community guidelines, enabling a better experience for all users. By using Amazon EC2 Inf2 instances for these video understanding workloads, we were able to cut the inference cost by half.
In this post, we discuss the use of multimodal LLMs for video understanding, the solution architecture, and techniques for performance optimization.
Overcoming video understanding hurdles with multimodal LLMs
Multimodal LLMs enable better understanding of the world, enabling various forms of digital content as inputs to the LLM, greatly increasing the range of useful applications we can now build. The need for AI systems capable of processing various content forms has become increasingly apparent. Multimodal LLMs have risen to meet this challenge by taking multiple data modalities, including text, images, audio, and video (refer to the following diagram), which allows for full understanding of content, mimicking human perception and interaction with the world. The enhanced capabilities of these models are evident in their performance, which far surpasses that of traditional models in tasks ranging from sophisticated virtual assistant to advanced content creation. By expanding the boundaries of AI capabilities and paving the way for more natural and intuitive interactions with technology, these models aren’t just improving existing applications but opening doors to entirely new possibilities in the realm of AI and user experience.

In our operations, the implementation of multimodal LLMs for video understanding represents a significant shift in thinking about AI-driven content analysis. This innovation addresses the daily challenge of processing billions of videos, overcoming the efficiency limits of traditional AI models. We’ve developed our own multimodal LLM architecture, designed to achieve state-of-the-art performance across single-image, multi-image, and video applications. Unlike traditional ML models, this new generative AI–enabled system integrates multiple input streams into a unified representational space. Cross-modal attention mechanisms facilitate information exchange between modalities, and fusion layers combine representations from different modalities. The decoder then generates output based on the fused multimodal representation, enabling a more nuanced and context-aware analysis of content.
Solution overview
We’ve collaborated with AWS since the first generation of Inferentia chips. Our video understanding department has been committed to finding more cost-efficient solutions that deliver higher performance to better meet ever-growing business needs. During this period, we found that AWS has been continually inventing and adding features and capabilities to its AWS Neuron software development kit (SDK), the software enabling high-performance workloads on the Inferentia chips. The popular Meta Llama and Mistral models were well supported with high performance on Inferentia2 shortly after their open source release. Therefore, we began to evaluate the Inferentia2 based solution, illustrated in the following diagram.

We made the strategic decision to deploy a fine-tuned middle-sized LLM on Inferentia2 to provide a performant and cost-effective solution capable of processing billions of videos daily. The process was a comprehensive effort aimed at optimizing end-to-end response time for our video understanding workload. The team explored a wide range of parameters, including tensor parallel sizes, compile configurations, sequence lengths, and batch sizes. We employed various parallelization techniques, such as multi-threading and model replication (for non-LLM models) across multiple NeuronCores. Through these optimizations, which included parallelizing sequence steps, reusing devices, and using auto-benchmark and profiling tools, we achieved a substantial performance boost, maintaining our position at the forefront of industry performance standards.
We used tensor parallelism to effectively distribute and scale the model across multiple accelerators in an Inf2 instance. We used static batching, which improved the latency and throughput of our models by making sure that data is processed in uniform, fixed-size batches during inference. Using repeated n-grams filtering significantly improved the quality of automatically generated text and reduced inference time. Quantizing the weights of the multimodal model from FP16/BF16 to INT8 format allowed it to run more efficiently on Inferentia2 with less device memory usage, without compromising on accuracy. Using these techniques and model serialization, we optimized throughput on inf2.48xlarge instances by maximizing the batch size such that the model could still fit on a single accelerator, allowing us to deploy multiple model replicas on the same instance. This comprehensive optimization strategy helped us meet our latency requirements while providing optimal throughput and cost reduction. Notably, the Inferentia2-based solution cut the cost by half compared to comparable Amazon Elastic Compute Cloud (Amazon EC2) instances, highlighting the significant economic advantages of using Inferentia2 chips for large-scale video understanding tasks.
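To illustrate the static-batching idea in isolation, here is a generic sketch; it is not ByteDance’s serving code, and model_generate stands in for whatever Neuron-compiled inference call is used. Requests are padded into fixed-size batches so every forward pass sees the same shape.

from typing import Callable, List

BATCH_SIZE = 8    # fixed batch size chosen at model compile time
PAD_PROMPT = ""   # dummy prompt used to pad partially filled batches

def run_static_batches(prompts: List[str],
                       model_generate: Callable[[List[str]], List[str]]) -> List[str]:
    """Process prompts in uniform, fixed-size batches (static batching)."""
    outputs: List[str] = []
    for start in range(0, len(prompts), BATCH_SIZE):
        batch = prompts[start:start + BATCH_SIZE]
        n_real = len(batch)
        # Pad the last batch so the compiled graph always sees BATCH_SIZE inputs.
        batch = batch + [PAD_PROMPT] * (BATCH_SIZE - n_real)
        results = model_generate(batch)
        outputs.extend(results[:n_real])  # drop outputs for the padding slots
    return outputs

# outputs = run_static_batches(video_caption_prompts, neuron_model_generate)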
The following diagram shows how we deploy our LLM container on Amazon EC2 Inf2 instances using Neuron.

In summary, our collaboration with AWS has revolutionized video understanding, setting new industry standards for efficiency and accuracy. The multimodal LLM’s ability to adapt to global market demands and its scalable performance on Inferentia2 chips underscore the profound impact of this technology in safeguarding the platform’s global community.
Future plans
Looking further ahead, the development of a unified multimodal LLM represents an important shift in video understanding technology. This ambitious project aims to create a universal content tokenizer capable of processing all content types and aligning them within a common semantic space. After it’s tokenized, the content will be analyzed by advanced large models, generating appropriate content understanding outputs regardless of the original format (as shown in the following diagram). This unified approach can streamline the content understanding process, potentially improving both efficiency and consistency across diverse content types.

For additional learning, refer to the paper The Evolution of Multimodal Model Architectures.
The implementation of this comprehensive strategy sets new benchmarks in video understanding technology, striking a balance between accuracy, speed, and cultural sensitivity in an increasingly complex digital ecosystem. This forward-looking approach not only addresses current challenges in video understanding but also positions the system at the forefront of AI-driven content analysis and management for the foreseeable future.
By using cutting-edge AI techniques and a holistic approach to content understanding, this next-generation content understanding system aims to set new industry standards, providing safer and more inclusive online environments while adapting to the ever-evolving landscape of digital communication. At the same time, AWS is investing in next-generation AI chips such as AWS Trainium2, which will continue to push the performance boundaries while keeping costs under control. At ByteDance, we’re planning to test out this new generation of AWS AI chips and adopt them appropriately as the models and workloads continue to evolve.
Conclusion
The collaboration between ByteDance and AWS has revolutionized video understanding through the deployment of multimodal LLMs on Inferentia2 chips. This partnership has yielded remarkable results, the ability to process billions of videos daily, and significant cost reductions and higher performance over comparable EC2 instances.
As ByteDance continues to innovate with projects such as the unified multimodal large model, we remain committed to pushing the boundaries of AI-driven content analysis. Our goal is to make sure our platforms remain safe, inclusive, and creative spaces for our global community, setting new industry standards for efficient video understanding.
To learn more about Inf2 instances, refer to Amazon EC2 Inf2 Architecture.

About the Authors
Wangpeng An, Principal Algorithm Engineer at TikTok, specializes in multimodal LLMs for video understanding, advertising, and recommendations. He has led key projects in model acceleration, content moderation, and Ads LLM pipelines, enhancing TikTok’s real-time machine learning systems.
Haotian Zhang is a Tech Lead MLE at TikTok, specializing in content understanding, search, and recommendation. He received an ML PhD from University of Waterloo. At TikTok, he leads a group of engineers to improve the efficiency, robustness, and effectiveness of training and inference for LLMs and multimodal LLMs, especially for large distributed ML systems.
Xiaojie Ding is a senior engineer at TikTok, focusing on content moderation system development, model resource and deployment optimization, and algorithm engineering stability construction. In his free time, he likes to play single-player games.
Nachuan Yang is a senior engineer at TikTok, focusing on content security and moderation. He has successively been engaged in the construction of moderation systems, model applications, and deployment and performance optimization.
Kairong Sun is a Senior SRE on the AML Team at ByteDance. His role focuses on maintaining the seamless operation and efficient allocation of resources within the cluster, specializing in cluster machine maintenance and resource optimization.
The authors would like to thank other ByteDance and AWS team members for their contributions: Xi Dai, Kaili Zhao, Zhixin Zhang, Jin Ye, and Yann Xia from ByteDance; Jia Dong, Bingyang Huang, Kamran Khan, Shruti Koparkar, and Diwakar Bansal from AWS.

Convergence Releases Proxy Lite: A Mini, Open-Weights Version of Proxy …

In today’s digital landscape, automating interactions with web content remains a nuanced challenge. Many existing solutions are resource-intensive and tailored for narrowly defined tasks, which limits their broader applicability. Developers often face the dual challenge of balancing computational efficiency with the need for a model that can generalize well across diverse websites. Traditional systems, heavily reliant on prompt-prediction, often lack the reflective reasoning required for the unpredictable nature of web environments. Additionally, proprietary models typically restrict access to detailed inner workings, making it difficult for researchers and practitioners in the open-source community to build on state-of-the-art methods. These persistent issues underline the importance of developing an automation tool that is both efficient and accessible.

Convergence has introduced Proxy Lite: a mini, open-weights version of their well-regarded Proxy assistant. This 3B parameter Vision-Language Model is designed to extend sophisticated web automation capabilities to the open-source community. Rather than promising extraordinary feats, Proxy Lite aims to offer a balanced approach that marries efficiency with reliability. Its architecture builds on a solid foundation, allowing it to perform a variety of web-based tasks without imposing heavy computational demands.

What makes Proxy Lite notable is its transparent design and open-weights approach. This encourages the community to explore, modify, and improve upon its framework. With an integrated system for Vision-Language Model (VLM) and browser interactions, Proxy Lite allows for nuanced control over browser tasks. The model’s configuration supports practical applications ranging from routine data extraction to more complex navigational tasks, all while keeping resource usage in check.

Technical Aspects and Their Benefits

At its core, Proxy Lite leverages a 3B parameter model built on the Qwen2.5-VL-3B-Instruct foundation. This choice reflects a commitment to balancing performance with efficiency. The model employs a three-phase process to generate responses:

Observation: The model first examines the current state of the web page—confirming, for instance, that an overlay or privacy banner has been dismissed.

Thinking: It then methodically determines the next course of action, weighing the various possibilities based on the context.

Tool Call: Finally, it issues a precise command to execute the selected action within the browser.

This structured approach not only improves task reliability but also facilitates the model’s ability to generalize across different types of web interactions. By mirroring human-like reasoning processes, Proxy Lite manages to strike a balance between simplicity and sophistication. Moreover, its design supports a straightforward integration into both command-line interfaces and Streamlit applications, making deployment accessible even for those with modest technical resources.
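A schematic of this observe–think–act loop might look like the following; the vlm_complete function and the Browser class are placeholders standing in for a Proxy Lite inference endpoint and a browser-automation layer, so treat the structure, not the names, as the point.

# Placeholders standing in for a Proxy Lite serving endpoint and a browser layer.
def vlm_complete(screenshot_png: bytes, history: list) -> dict:
    """Return {'observation': str, 'thinking': str, 'tool_call': {'name': str, 'args': dict}}."""
    raise NotImplementedError("call your Proxy Lite serving endpoint here")

class Browser:
    """Thin wrapper over a browser-automation library (method names are illustrative)."""
    def screenshot(self) -> bytes: ...
    def execute(self, name: str, args: dict) -> None: ...
    def done(self) -> bool: ...

def run_task(browser: Browser, max_steps: int = 20) -> None:
    history: list = []
    for _ in range(max_steps):
        frame = browser.screenshot()           # Observation: capture the current page state
        step = vlm_complete(frame, history)    # Thinking: the model reasons over state + history
        history.append({"observation": step["observation"],
                        "thinking": step["thinking"]})
        call = step["tool_call"]               # Tool call: one concrete browser action
        browser.execute(call["name"], call["args"])
        if browser.done():
            break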

Performance Insights and Practical Evaluations

Proxy Lite has been carefully evaluated using the WebVoyager benchmark, a comprehensive set of tasks designed to test web automation capabilities. The model achieved an overall score of 72.4%, a strong performance indicator given its open-weights nature. Detailed performance statistics across various websites reveal its thoughtful design:

Allrecipes: Achieving an 87.8% success rate with an average of 10.3 message exchanges, it demonstrates effectiveness in content-rich environments.

Amazon: A 70.0% success rate here highlights the model’s ability to navigate more complex, dynamic e-commerce platforms.

Notable High-Profile Sites: With success rates in the low 80s on platforms such as Apple and GitHub, Proxy Lite consistently shows reliable behavior on diverse sites.

Google Services: While some areas, such as Google Flights, yield lower success metrics, the overall performance remains competitive considering the model’s scope.

These findings reflect a balanced performance, with Proxy Lite efficiently managing tasks without the overhead typically associated with larger, proprietary models. The comprehensive evaluation not only underscores its current utility but also points to potential enhancements through community-driven refinements.

Conclusion

Proxy Lite emerges as a thoughtfully designed tool in the field of web automation. By addressing key challenges—such as resource constraints, generalization, and transparency—it offers a practical solution for automating routine online tasks. Its open-weights approach and modular design invite collaboration and ongoing development, providing a valuable resource for both academic research and commercial projects.

Check out the Technical Details and Model. All credit for this research goes to the researchers of this project.


FinData Explorer: A Step-by-Step Tutorial Using BeautifulSoup, yfinance, matplotlib, ipywidgets, and fpdf for Financial Data Extraction, Interactive Visualization, and Dynamic PDF Report Generation

In this tutorial, we will guide you through building an advanced financial data reporting tool on Google Colab by combining multiple Python libraries. You’ll learn how to scrape live financial data from web pages, retrieve historical stock data using yfinance, and visualize trends with matplotlib. Also, the tutorial demonstrates how to integrate an interactive UI using ipywidgets, culminating in a dynamic PDF report generated with FPDF.

!pip install fpdf beautifulsoup4 yfinance ipywidgets

First, we install the necessary libraries for our project: fpdf for generating PDF reports, beautifulsoup4 for web scraping, yfinance for retrieving historical financial data, and ipywidgets for creating interactive UI elements in the notebook.

import requests
from bs4 import BeautifulSoup
from fpdf import FPDF
import yfinance as yf
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display, FileLink

Here, we import the libraries used throughout the notebook: requests and BeautifulSoup for fetching and parsing web pages, FPDF for building the PDF report, yfinance for market data, matplotlib for plotting, and ipywidgets together with IPython.display for the interactive interface and download link.

def generate_report(b):
    symbol = symbol_text.value.upper().strip()
    start_date = start_date_picker.value
    end_date = end_date_picker.value

    output_area.clear_output()  # Clear previous outputs

    if not (symbol and start_date and end_date):
        with output_area:
            print("Please provide valid inputs for stock symbol and both dates.")
        return

    with output_area:
        print(f"Generating report for {symbol} from {start_date} to {end_date}...")

    # 1. Retrieve current price using yfinance
    try:
        stock = yf.Ticker(symbol)
        current_price = stock.info.get('regularMarketPrice', 'N/A')
    except Exception as e:
        current_price = "Error retrieving price"
        with output_area:
            print("Error retrieving current price:", e)

    # 2. Fetch historical data using yfinance
    try:
        hist = stock.history(start=start_date, end=end_date)
    except Exception as e:
        hist = None
        with output_area:
            print("Error fetching historical data:", e)

    # 3. Plot historical closing prices
    if hist is not None and not hist.empty:
        plt.figure(figsize=(10, 5))
        plt.plot(hist.index, hist['Close'], marker='o', linestyle='-', label="Close Price")
        plt.title(f"{symbol} Historical Closing Prices")
        plt.xlabel("Date")
        plt.ylabel("Close Price (USD)")
        plt.grid(True)
        plt.xticks(rotation=45)
        plt.tight_layout()
        graph_filename = "graph.png"
        plt.savefig(graph_filename)
        plt.show()
    else:
        graph_filename = None
        with output_area:
            print("No historical data available for the selected date range.")

    # 4. Create a PDF report using FPDF
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Arial", "B", 16)
    pdf.cell(0, 10, f"Financial Report for {symbol}", ln=True, align="C")
    pdf.ln(10)

    pdf.set_font("Arial", size=12)
    pdf.cell(0, 10, f"Current Price: {current_price}", ln=True)
    pdf.cell(0, 10, f"Date Range: {start_date} to {end_date}", ln=True)
    pdf.ln(10)

    if graph_filename:
        pdf.cell(0, 10, "Historical Closing Prices:", ln=True)
        # Adjust the image width to fit the page layout
        pdf.image(graph_filename, w=180)

    pdf_filename = "financial_report.pdf"
    pdf.output(pdf_filename)

    # 5. Display the download link for the PDF report
    with output_area:
        print(f"PDF Report generated: {pdf_filename}")
        display(FileLink(pdf_filename))

With the above function, we retrieve the user inputs for the stock symbol and date range, then fetch the current price and historical data from Yahoo Finance via yfinance. It plots the historical closing prices using matplotlib, generates a PDF report embedding the price details and the graph using FPDF, and finally displays a download link for the PDF report.

# Create UI widgets
symbol_text = widgets.Text(
    value="AAPL",
    description="Stock Symbol:",
    placeholder="e.g., AAPL"
)
start_date_picker = widgets.DatePicker(
    description='Start Date'
)
end_date_picker = widgets.DatePicker(
    description='End Date'
)
generate_button = widgets.Button(
    description="Generate Report",
    button_style='success'
)
output_area = widgets.Output()

generate_button.on_click(generate_report)

display(widgets.VBox([symbol_text, start_date_picker, end_date_picker, generate_button, output_area]))

Finally, this code block sets up an interactive user interface using ipywidgets. It creates input fields for a stock symbol, date pickers for a start and end date, and a button to trigger the report generation. The UI elements are then organized vertically using a VBox layout, and an output area is provided to display feedback and the generated PDF download link.

Output and PDF Sample

In conclusion, by following this tutorial, you have successfully integrated web scraping, data analysis, interactive UI design, and PDF report generation into a single Google Colab notebook. This step-by-step process illustrates how to harness the power of Python’s diverse libraries to create a robust, user-friendly financial data tool.

Here is the Colab Notebook for the above project.


Enhancing Instruction Tuning in LLMs: A Diversity-Aware Data Selection Strategy Using Sparse Autoencoders

Pre-trained LLMs require instruction tuning to align with human preferences. Still, the vast data collection and rapid model iteration often lead to oversaturation, making efficient data selection a crucial yet underexplored area. Existing quality-driven selection methods, such as LIMA and AlpaGasus, tend to overlook the importance of data diversity and complexity, essential for enhancing model performance. While scaling LLMs has proven beneficial, optimizing instruction fine-tuning (IFT) relies on training data’s quality, diversity, and complexity. However, measuring these factors remains challenging, with recent research calling for quantifiable metrics to assess dataset diversity rather than relying on subjective claims. Sparse autoencoders (SAEs) have recently emerged as effective tools for interpreting LLMs by ensuring mono-semantic representations, making them valuable for analyzing data selection mechanisms.

Sparse autoencoders have significantly improved LLM interpretability by enforcing sparsity in representations, thereby enhancing feature independence. Early works in sparse coding and dictionary learning laid the foundation for structured data representations, later applied to transformers to decode contextual embeddings. Recent research has highlighted the challenges of polysemantic neurons encoding multiple concepts, prompting efforts to develop monosemantic neurons for better interpretability. In parallel, data selection methods, such as ChatGPT-based scoring and gradient-based clustering, have been explored to refine instruction tuning. Despite advancements, accurately quantifying data quality, diversity, and complexity remains complex, necessitating further research into effective metrics and selection strategies to optimize instruction tuning in LLMs.

Researchers at Meta GenAI introduce a diversity-aware data selection strategy using SAEs to improve instruction tuning. SAEs help quantify data diversity and enhance model interpretability, explaining methods like selecting the longest response. They develop two selection algorithms: SAE-GreedSelect for limited data and SAE-SimScale for larger datasets. Experiments on Alpaca and WizardLM_evol_instruct_70k datasets demonstrate superior performance over prior techniques. Their approach refines data selection, reduces training costs, and offers deeper insights into model behavior, making instruction tuning more efficient and interpretable.

The study introduces two diversity-driven data selection methods using SAEs. SAE-GreedSelect optimizes feature utilization for selecting limited data, while SAE-SimScale scales data selection using similarity-based sampling. Experiments on Llama-2-13b, Gemma-2-9b, and Llama-2-7b-base validate the approach using Alpaca-52k and WizardLM_evol_instruct_70k datasets. Comparisons with baselines like Longest-response, #InsTag, and Repr Filter demonstrate superior performance. Models are trained using standardized settings and evaluated with IFEval, LLM- and Human-as-a-Judge methods, and benchmarks like MMLU and TruthfulQA. Results highlight improved instruction tuning efficiency and interpretability while maintaining simplicity in parameter tuning.

Selecting the 1,000 longest responses is an effective baseline for supervised fine-tuning (SFT), likely because longer responses contain more learnable information. A strong correlation (r = 0.92) between text length and feature richness in an SAE supports this hypothesis. The proposed data selection methods, SAE-GreedSelect and SAE-SimScale, outperform existing baselines, particularly at larger data scales. SAE-SimScale achieves notable improvements across multiple datasets and evaluation metrics, highlighting its robustness. Further experiments confirm its effectiveness across model sizes and architectures, reinforcing its potential for optimizing scalable data selection strategies.

In conclusion, the study introduces an approach to measuring data diversity using learned monosemanticity in sparse autoencoders. A new data selection algorithm for instruction tuning was developed, improving model performance across various datasets. The method consistently outperforms existing selection techniques and demonstrates that longer instruction-response pairs enhance model capabilities. The approach also improves efficiency by reducing data requirements and training costs. Additionally, it offers insights into model behavior and can be extended to preference data selection or improving model safety. This strategy ensures better alignment with human preferences while maintaining diversity and complexity in training data.

Check out the Paper. All credit for this research goes to the researchers of this project.


How IDIADA optimized its intelligent chatbot with Amazon Bedrock

This post is co-written with Xavier Vizcaino, Diego Martín Montoro, and Jordi Sánchez Ferrer from Applus+ Idiada.
In 2021, Applus+ IDIADA, a global partner to the automotive industry with over 30 years of experience supporting customers in product development activities through design, engineering, testing, and homologation services, established the Digital Solutions department. This strategic move aimed to drive innovation by using digital tools and processes. Since then, we have optimized data strategies, developed customized solutions for customers, and prepared for the technological revolution reshaping the industry.
AI now plays a pivotal role in the development and evolution of the automotive sector, in which Applus+ IDIADA operates. Within this landscape, we developed an intelligent chatbot, AIDA (Applus Idiada Digital Assistant)— an Amazon Bedrock powered virtual assistant serving as a versatile companion to IDIADA’s workforce.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
With Amazon Bedrock, AIDA assists with a multitude of tasks, from addressing inquiries to tackling complex technical challenges spanning code, mathematics, and translation. Its capabilities are truly boundless.
With AIDA, we take another step towards our vision of providing global and integrated digital solutions that add value for our customers. Its internal deployment strengthens our leadership in developing data analysis, homologation, and vehicle engineering solutions. Additionally, in the medium term, IDIADA plans to offer AIDA as an integrable product for customers’ environments and develop “light” versions seamlessly integrable into existing systems.
In this post, we showcase the research process undertaken to develop a classifier for human interactions in this AI-based environment using Amazon Bedrock. The objective was to accurately identify the type of interaction received by the intelligent agent to route the request to the appropriate pipeline, providing a more specialized and efficient service.
The challenge: Optimize intelligent chatbot responses, allocate resources more effectively, and enhance the overall user experience
Built on a flexible and secure architecture, AIDA offers a versatile environment for integrating multiple data sources, including structured data from enterprise databases and unstructured data from internal sources like Amazon Simple Storage Service (Amazon S3). It boasts advanced capabilities like chat with data, advanced Retrieval Augmented Generation (RAG), and agents, enabling complex tasks such as reasoning, code execution, or API calls.
As AIDA’s interactions with humans proliferated, a pressing need emerged to establish a coherent system for categorizing these diverse exchanges.
Initially, users were making simple queries to AIDA, but over time, they started to request more specific and complex tasks. These included document translations, inquiries about IDIADA’s internal services, file uploads, and other specialized requests.
The main reason for this categorization was to develop distinct pipelines that could more effectively address various types of requests. By sorting interactions into categories, AIDA could be optimized to handle specific kinds of tasks more efficiently. This approach allows for tailored responses and processes for different types of user needs, whether it’s a simple question, a document translation, or a complex inquiry about IDIADA’s services.
The primary objective is to offer a more specialized service through the creation of dedicated pipelines for various contexts, such as conversation, document translation, and services to provide more accurate, relevant, and efficient responses to users’ increasingly diverse and specialized requests.
Solution overview
By categorizing the interactions into three main groups—conversation, services, and document translation—the system can better understand the user’s intent and respond accordingly. The Conversation class encompasses general inquiries and exchanges, the Services class covers requests for specific functionalities or support, and the Document_Translation class handles text translation needs.
The specialized pipelines, designed specifically for each use case, allow for a significant increase in efficiency and accuracy of AIDA’s responses. This is achieved in several ways:

Enhanced efficiency – By having dedicated pipelines for specific types of tasks, AIDA can process requests more quickly. Each pipeline is optimized for its particular use case, which reduces the computation time needed to generate an appropriate response.
Increased accuracy – The specialized pipelines are equipped with specific tools and knowledge for each type of task. This allows AIDA to provide more accurate and relevant responses, because it uses the most appropriate resources for each type of request.
Optimized resource allocation – By classifying interactions, AIDA can allocate computational resources more efficiently, directing the appropriate processing power to each type of task.
Improved response time – The combination of greater efficiency and optimized resource allocation results in faster response times for users.
Enhanced adaptability – This system allows AIDA to better adapt to different types of requests, from simple queries to complex tasks such as document translations or specialized inquiries about IDIADA services.

The research and development of this large language model (LLM) based classifier is an important step in the continuous improvement of the intelligent agent’s capabilities within the Applus IDIADA environment.
For this occasion, we use a set of 1,668 examples of pre-classified human interactions. These have been divided into 666 for training and 1,002 for testing. A 40/60 split has been applied, giving significant importance to the test set.
The following table shows some examples.

SAMPLE | CLASS
Can you make a summary of this text? “Legislation for the Australian Government’s …” | Conversation
No, only focus on this sentence : Braking technique to enable maximum brake application speed | Conversation
In a factory give me synonyms of a limiting resource of activities | Conversation
We need a translation of the file “Company_Bylaws.pdf” into English, could you handle it? | Document_Translation
Please translate the file “Product_Manual.xlsx” into English | Document_Translation
Could you convert the document “Data_Privacy_Policy.doc” into English, please? | Document_Translation
Register my username in the IDIADA’s human resources database | Services
Send a mail to random_user@mail.com to schedule a meeting for the next weekend | Services
Book an electric car charger for me at IDIADA | Services

We present three different classification approaches: two based on LLMs and one using a classic machine learning (ML) algorithm. The aim is to understand which approach is most suitable for addressing the presented challenge.
LLM-based classifier: Simple prompt
In this case, we developed an LLM-based classifier to categorize inputs into three classes: Conversation, Services, and Document_Translation. Instead of relying on predefined, rigid definitions, our approach follows the principle of understanding a set. This principle involves analyzing the common characteristics and patterns present in the examples or instances that belong to each class. By studying the shared traits of inputs within a class, we can derive an understanding of the class itself, without being constrained by preconceived notions.
It’s important to note that the learned definitions might differ from common expectations. For instance, the Conversation class encompasses not only typical conversational exchanges but also tasks like text summarization, which share similar linguistic and contextual traits with conversational inputs.
By following this data-driven approach, the classifier can accurately categorize new inputs based on their similarity to the learned characteristics of each class, capturing the nuances and diversity within each category.
The code consists of the following key components: libraries, a prompt, model invocation, and an output parser.
Libraries
The programming language used in this code is Python, complemented by the LangChain module, which is specifically designed to facilitate the integration and use of LLMs. This module provides a comprehensive set of tools and abstractions that streamline the process of incorporating and deploying these advanced AI models.
To take advantage of the power of these language models, we use Amazon Bedrock. The integration with Amazon Bedrock is achieved through the Boto3 Python module, which serves as an interface to AWS services, enabling seamless interaction with Amazon Bedrock and the deployment of the classification model.
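For reference, a minimal sketch of the setup assumed by the classifier functions in this post is shown below: a Bedrock runtime client created with Boto3 plus LangChain's structured output parsing utilities. The Region shown is only an example; the snippets later in this post use both us-west-2 and eu-central-1.

import json

import boto3
from langchain.output_parsers import ResponseSchema, StructuredOutputParser

# Bedrock runtime client reused by the classification functions that follow.
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2'  # example Region; use one where Bedrock model access is enabled
)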
Prompt
The task is to assign one of three classes (Conversation, Services, or Document_Translation) to a given sentence, represented by question:

Conversation class – This class encompasses casual messages, summarization requests, general questions, affirmations, greetings, and similar types of text. It also includes requests for text translation, summarization, or explicit inquiries about the meaning of words or sentences in a specific language.
Services class – Texts belonging to this class consist of explicit requests for services such as room reservations, hotel bookings, dining services, cinema information, tourism-related inquiries, and similar service-oriented requests.
Document_Translation class – This class is characterized by requests for the translation of a document to a specific language. Unlike the Conversation class, these requests don’t involve summarization. Additionally, the name of the document to be translated and the target language are specified.

The prompt suggests a hierarchical approach to the classification process. First, the sentence should be evaluated to determine if it can be classified as a conversation. If the sentence doesn’t fit the Conversation class, one of the other two classes (Services or Document_Translation) should be assigned.
The priority for the Conversation class stems from the fact that 99% of the interactions are actually simple questions regarding various matters.
Model invocation
We use Anthropic’s Claude 3 Sonnet model for the natural language processing task. This LLM model has a context window of 200,000 tokens, enabling it to manage different languages and retrieve highly accurate answers. We use two key parameters:

max_tokens – This parameter limits the maximum number of tokens (words or subwords) that the language model can generate in its output (set to 20 for the simple-prompt classifier shown next, and 50 for the example-augmented version shown later).
temperature – This parameter controls the randomness of the language model’s output. A temperature of 0.0 means that the model will produce the most likely output according to its training, without introducing randomness.

Output parser
Another important component is the output parser, which allows us to gather the desired information in JSON format. To achieve this, we use LangChain's output_parsers module, specifically the StructuredOutputParser and ResponseSchema classes.
The following code illustrates a simple prompt approach:

def classify_interaction(question):
    response_schemas = [
        ResponseSchema(name="class", description="the assigned class")
    ]
    output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
    format_instructions = output_parser.get_format_instructions()
    prompt = f"""
We have 3 classes Conversation (for example asking for assistance), Services and Document_Translation.
Conversation: text consist of casual messages, summarization requests, general questions, afirmations, greetings,
and similar. Requests for text translation, text summarisation or explicit text translation requests,
questions about the meaning of words or sentences in a concrete language.
Services: the text consist of explicit requests for rooms, hotels, eating services, cinema, tourism, and similar.
Document_Translation: A translation of a document to a specific language is requested, and a summary is not requested.
The length of the document is specified.
Assign a class to the following sentence.
{question}
Try to understand the sentence as a Conversation one, if you can't, then asign one of the other classes.
{format_instructions}
"""
    response = bedrock_runtime.invoke_model(
        modelId='anthropic.claude-3-sonnet-20240229-v1:0',
        body=json.dumps(
            {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 20,
                "temperature": 0,
                "messages": [
                    {
                        "role": "user",
                        "content": [{"type": "text", "text": prompt}],
                    }
                ],
            }
        ),
    )
    result_message = json.loads(response.get("body").read())
    texto = result_message['content'][0]['text']
    try:
        # Normalize single quotes to double quotes so the JSON-style output parses cleanly.
        output_dict = output_parser.parse(texto.replace("'", '"'))['class']
    except Exception:
        output_dict = 'Conversation'
    return output_dict

LLM-based classifier: Example augmented inference
We use RAG techniques to enhance the model’s response capabilities. Instead of relying solely on compressed definitions, we provide the model with a quasi-definition by extension. Specifically, we present the model with a diverse set of examples for each class, allowing it to learn the inherent characteristics and patterns that define each category. For instance, in addition to a concise definition of the Conversation class, the model is exposed to various conversational inputs, enabling it to identify common traits such as informal language, open-ended questions, and back-and-forth exchanges. This example-driven approach complements the initial descriptions provided, allowing the model to capture the nuances and diversity within each class. By combining concise definitions with representative examples, the RAG technique helps the model develop a more comprehensive understanding of the classes, enhancing its ability to accurately categorize new inputs based on their inherent nature and characteristics.
The following code provides examples in JSON format for RAG:

{
    "Conversation": [
        "Could you give me examples of how to solve it?",
        "cool but anything short and sweet",
        "..."
    ],
    "Services": [
        "make a review of my investments in the eBull.com platform",
        "I need a room in IDIADA",
        "schedule a meeting with",
        "..."
    ],
    "Document_Translation": [
        "Translate the file into Catalan",
        "Could you translate the document I added earlier into Swedish?",
        "Translate the Guía_Rápida.doc file into Romanian",
        "..."
    ]
}

The total number of examples provided for each class is as follows:

Conversation – 500 examples. This is the most common class, and only 500 samples are given to the model due to the vast amount of information, which could cause infrastructure overflow (very high delays, throttling, connection shutdowns). This is a crucial point to note because it represents a significant bottleneck. Providing more examples to this approach could potentially improve performance, but the question remains: How many examples? Surely, a substantial amount would be required.
Services – 26 examples. This is the least common class, and in this case, all available training data has been used.
Document_Translation – 140 examples. Again, all available training data has been used for this class.

One of the key challenges with this approach is scalability. Although the model’s performance improves with more training examples, the computational demands quickly become overwhelming for our current infrastructure. The sheer volume of data required can lead to quota issues with Amazon Bedrock and unacceptably long response times. Rapid response times are essential for providing a satisfactory user experience, and this approach falls short in that regard.
In this case, we need to modify the code to embed all the examples. The following code shows the changes applied to the first version of the classifier. The prompt is modified to include all the examples in JSON format under the “Here you have some examples” section.

def classify_interaction(question, agent_examples):
    response_schemas = [
        ResponseSchema(name="class", description="the assigned class")
    ]
    output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
    format_instructions = output_parser.get_format_instructions()
    prompt = f"""
We have 3 classes Conversation (for example asking for assistance), Services and Document_Translation.
Conversation: text consist of casual messages, summarization requests, general questions, afirmations, greetings,
and similar. Requests for text translation, text summarisation or explicit text translation requests,
questions about the meaning of words or sentences in a concrete language.
Services: the text consist of explicit requests for rooms, hotels, eating services, cinema, tourism, and similar.
Document_Translation: A translation of a document to a specific language is requested, and a summary is not requested.
The length of the document is specified.

Here you have some examples:
{agent_examples}

Assign a class to the following sentence.
{question}

Try to understand the sentence as a Conversation one, if you can't, then asign one of the other classes.
{format_instructions}
"""

    response = bedrock_runtime.invoke_model(
        modelId='anthropic.claude-3-sonnet-20240229-v1:0',
        body=json.dumps(
            {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 50,
                "messages": [
                    {
                        "role": "user",
                        "content": [{"type": "text", "text": prompt}],
                    }
                ],
            }
        ),
    )

    result_message = json.loads(response.get("body").read())
    texto = result_message['content'][0]['text']
    # Normalize single quotes to double quotes so the JSON-style output parses cleanly.
    output_dict = output_parser.parse(texto.replace("'", '"'))['class']

    return output_dict

K-NN-based classifier: Amazon Titan Embeddings
In this case, we take a different approach by recognizing that despite the multitude of possible interactions, they often share similarities and repetitive patterns. Instead of treating each input as entirely unique, we can use a distance-based approach like k-nearest neighbors (k-NN) to assign a class based on the most similar examples surrounding the input. To make this work, we need to transform the textual interactions into a format that allows algebraic operations. This is where embeddings come into play. Embeddings are vector representations of text that capture semantic and contextual information. By converting text into these vector representations and comparing their proximity in the embedding space, we can calculate the semantic similarity between different interactions.
To accommodate this approach, we need to modify the code accordingly:

import boto3
import pandas as pd
from langchain.embeddings import BedrockEmbeddings
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2'
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="amazon.titan-embed-text-v1",
)

df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

X_train_emb = bedrock_embedding.embed_documents(df_train['sample'].values.tolist())
X_test_emb = bedrock_embedding.embed_documents(df_test['sample'].values.tolist())
y_train = df_train['agent'].values.tolist()
y_test = df_test['agent'].values.tolist()

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train_emb, y_train)
y_pred = neigh.predict(X_test_emb)
print(classification_report(y_test, y_pred, target_names=['Conversation', 'Document_Translation', 'Services']))

We used the Amazon Titan Text Embeddings G1 model, which generates vectors of 1,536 dimensions. This model is trained to accept multiple languages while retaining the semantic meaning of the embedded phrases.
For the classifier, we employed a classic ML algorithm, k-NN, using the scikit-learn Python module. This method takes a single key parameter, k (the number of neighbors), which we set to 3.
The following figure illustrates the F1 scores for each class plotted against the number of neighbors (k) used in the k-NN algorithm. As the graph shows, the optimal value for k is 3, which yields the highest F1 score for the Document_Translation class. Although k=3 is not the absolute best setting for the Services class, Document_Translation is significantly more common than Services, making k=3 the best overall choice to maximize performance across all classes.
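The figure itself is not reproduced here; as an illustrative sketch (not part of the original implementation), per-class F1 scores for a range of k values can be computed directly from the embeddings and labels prepared in the previous snippet:

from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

# Evaluate per-class F1 on the test set for several values of k.
classes = ['Conversation', 'Document_Translation', 'Services']
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_emb, y_train)
    y_pred_k = knn.predict(X_test_emb)
    per_class_f1 = f1_score(y_test, y_pred_k, labels=classes, average=None)
    print(k, dict(zip(classes, per_class_f1.round(2))))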

K-NN-based classifier: Cohere’s multilingual embeddings model
In the previous section, we used the popular Amazon Titan Text Embeddings G1 model to generate text embeddings. However, other models might offer different advantages. In this section, we explore the use of Cohere’s multilingual model on Amazon Bedrock for generating embeddings. We chose the Cohere model due to its excellent capability in handling multiple languages without compromising the vectorization of phrases. As we will demonstrate, this model doesn’t introduce significant differences in the generated vectors compared to other models, making it more suitable for use in a multilingual environment like AIDA.
To use the Cohere model, we need to change the model_id:

import boto3
import pandas as pd
from langchain.embeddings import BedrockEmbeddings
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2'
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="cohere.embed-multilingual-v3",
)

df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

# Truncate samples to 1,500 characters to respect the embeddings model's input limit.
data_train = [s[:1500] for s in df_train['sample']]
data_test = [s[:1500] for s in df_test['sample']]

y_train = df_train['agent'].values.tolist()
y_test = df_test['agent'].values.tolist()
X_test = df_test['sample'].values.tolist()

X_train_emb = bedrock_embedding.embed_documents(data_train)
X_test_emb = bedrock_embedding.embed_documents(data_test)

neigh = KNeighborsClassifier(n_neighbors=11)
neigh.fit(X_train_emb, y_train)
y_pred = neigh.predict(X_test_emb)
print(classification_report(y_test, y_pred, target_names=['Conversation', 'Document_Translation', 'Services']))

We use Cohere’s multilingual embeddings model to generate vectors with 1,024 dimensions. This model is trained to accept multiple languages and retain the semantic meaning of the embedded phrases.
For the classifier, we employ k-NN, using the scikit-learn Python module. This method takes a single key parameter, k (the number of neighbors), which we have set to 11.
The following figure illustrates the F1 scores for each class plotted against the number of neighbors used. As depicted, the optimal point is k=11, achieving the highest value for Document_Translation and the second-highest for Services. In this instance, the trade-off between Document_Translation and Services is favorable.

Amazon Titan Embeddings vs. Cohere’s multilingual embeddings model
In this section, we delve deeper into the embeddings generated by both models, aiming to understand their nature and consequently comprehend the results obtained. To achieve this, we have performed dimensionality reduction to visualize the vectors obtained in both cases in 2D.
Cohere’s multilingual embeddings model has a limitation on the size of the text it can vectorize, posing a significant constraint. Therefore, in the implementation showcased in the previous section, we applied a filter to only include interactions up to 1,500 characters (excluding cases that exceed this limit).
The following figure illustrates the vector spaces generated in each case.

As we can observe, the generated vector spaces are relatively similar, initially appearing to be analogous spaces with a rotation between one another. However, upon closer inspection, it becomes evident that the direction of maximum variance in the case of Cohere’s multilingual embeddings model is distinct (deducible from observing the relative position and shape of the different groups). This type of situation, where high class overlap is observed, presents an ideal case for applying algorithms such as k-NN.
As mentioned in the introduction, most human interactions with AI are very similar to each other within the same class. This would explain why k-NN-based models outperform LLM-based models.
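As an illustrative sketch of how such a 2D view can be produced (the post does not specify the dimensionality-reduction method; PCA is assumed here), the embeddings computed earlier can be projected and plotted as follows:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the high-dimensional embeddings down to 2D for visual inspection.
pca = PCA(n_components=2)
points_2d = pca.fit_transform(X_train_emb)

# Color the points by their class label to see how the groups overlap.
for label in ['Conversation', 'Document_Translation', 'Services']:
    idx = [i for i, y in enumerate(y_train) if y == label]
    plt.scatter(points_2d[idx, 0], points_2d[idx, 1], s=10, label=label)

plt.title("2D projection of interaction embeddings")
plt.legend()
plt.show()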
SVM-based classifier: Amazon Titan Embeddings
In this scenario, it is likely that user interactions belonging to the three main categories (Conversation, Services, and Document_Translation) form distinct clusters or groups within the embedding space. Each category possesses particular linguistic and semantic characteristics that would be reflected in the geometric structure of the embedding vectors. The previous visualization of the embeddings space displayed only a 2D transformation of this space. This doesn't imply that clusters couldn't be highly separable in higher dimensions.
Classification algorithms like support vector machines (SVMs) are especially well-suited to use this implicit geometry of the data. SVMs seek to find the optimal hyperplane that separates the different groups or classes in the embedding space, maximizing the margin between them. This ability of SVMs to use the underlying geometric structure of the data makes them an intriguing option for this user interaction classification problem.
Furthermore, SVMs are a robust and efficient algorithm that can effectively handle high-dimensional datasets, such as text embeddings. This makes them particularly suitable for this scenario, where the embedding vectors of the user interactions are expected to have a high dimensionality.
The following code illustrates the implementation:

import boto3
import pandas as pd
from langchain.embeddings import BedrockEmbeddings
from sklearn import svm
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='eu-central-1'
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="amazon.titan-embed-text-v1",
)

df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

y_train = df_train['agent'].values.tolist()
y_test = df_test['agent'].values.tolist()
X_test = df_test['sample'].values.tolist()

X_train_emb = bedrock_embedding.embed_documents(df_train['sample'].values.tolist())
X_test_emb = bedrock_embedding.embed_documents(df_test['sample'].values.tolist())

# Weighted multi-class F1 is used as the scoring metric for the grid search.
f1 = make_scorer(f1_score, average='weighted')
parameters = {'kernel': ('linear', 'rbf', 'poly', 'sigmoid'),
              'C': [1, 2, 4, 6, 8, 10],
              'class_weight': [None, 'balanced']}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=10, n_jobs=-1, scoring=f1)
clf.fit(X_train_emb, y_train)

y_pred = clf.predict(X_test_emb)

We use Amazon Titan Text Embeddings G1. This model generates vectors of 1,536 dimensions, and is trained to accept several languages and to retain the semantic meaning of the phrases embedded.
To implement the classifier, we employed a classic ML algorithm, SVM, using the scikit-learn Python module. The SVM algorithm requires the tuning of several parameters to achieve optimal performance. To determine the best parameter values, we conducted a grid search with 10-fold cross-validation, using the F1 multi-class score as the evaluation metric. This systematic approach allowed us to identify the following set of parameters that yielded the highest performance for our classifier (a short snippet for reading these values back from the fitted grid search follows the list):

C – We set this parameter to 1. This parameter controls the trade-off between allowing training errors and forcing rigid margins. It acts as a regularization parameter. A higher value of C (for example, 10) indicates a higher penalty for misclassification errors. This results in a more complex model that tries to fit the training data more closely. A higher C value can be beneficial when the classes in the data are well separated, because it allows the algorithm to create a more intricate decision boundary to accurately classify the samples. On the other hand, a C value of 1 indicates a reasonable balance between fitting the training set and the model’s generalization ability. This value might be appropriate when the data has a simple structure, and a more flexible model isn’t necessary to capture the underlying relationships. In our case, the selected C value of 1 suggests that the data has a relatively simple structure, and a balanced model with moderate complexity is sufficient for accurate classification.
class_weight – We set this parameter to None. This parameter adjusts the weights of each class during the training process. Setting class_weight to balanced automatically adjusts the weights inversely proportional to the class frequencies in the input data. This is particularly useful when dealing with imbalanced datasets, where one class is significantly more prevalent than the others. In our case, the value of None for the class_weight parameter suggests that the minor classes don’t have much relevance or impact on the overall classification task. This choice implies that the implicit geometry or decision boundaries learned by the model might not be optimized for separating the different classes effectively.
Kernel – We set this parameter to linear. This parameter specifies the type of kernel function to be used by the SVC algorithm. The linear kernel is a simple and efficient choice because it assumes that the decision boundary between classes can be represented by a linear hyperplane in the feature space. This value suggests that, in a higher dimension vector space, the categories could be linearly separated by an hyperplane.
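As a quick check (an illustrative snippet using the clf object fitted above), the selected combination can be read back from the grid search:

# Inspect the parameter combination chosen by the 10-fold grid search
# and its mean cross-validated weighted F1 score.
print(clf.best_params_)   # expected to match the values discussed above, e.g. {'C': 1, 'class_weight': None, 'kernel': 'linear'}
print(clf.best_score_)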

SVM-based classifier: Cohere’s multilingual embeddings model
The implementation details of the classifier are presented in the following code:

import boto3
import pandas as pd
from langchain.embeddings import BedrockEmbeddings
from sklearn import svm
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2'
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="cohere.embed-multilingual-v3",
)

df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

# Truncate samples to 1,500 characters to respect the embeddings model's input limit.
data_train = [s[:1500] for s in df_train['sample']]
data_test = [s[:1500] for s in df_test['sample']]

y_train = df_train['agent'].values.tolist()

X_train_emb = bedrock_embedding.embed_documents(data_train)
X_test_emb = bedrock_embedding.embed_documents(data_test)

f1 = make_scorer(f1_score, average='weighted')

parameters = {'kernel': ('linear', 'rbf', 'poly', 'sigmoid'),
              'C': [1, 2, 4, 6, 8, 10],
              'class_weight': [None, 'balanced']}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=10, n_jobs=-1, scoring=f1)
clf.fit(X_train_emb, y_train)

y_pred = clf.predict(X_test_emb)

We use Cohere's multilingual embeddings model, which generates vectors of 1,024 dimensions. This model is trained to accept multiple languages and retain the semantic meaning of the embedded phrases.
For the classifier, we employ SVM, using the scikit-learn Python module. To obtain the optimal parameters, we performed a grid search with 10-fold cross-validation based on the multi-class F1 score, resulting in the following selected parameters (as detailed in the previous section):

C – We set this parameter to 1, which indicates a reasonable balance between fitting the training set and the model’s generalization ability. This setting suggests that the data has a simple structure and that a more flexible model might not be necessary to capture the underlying relationships.
class_weight – We set this parameter to None. A value of None suggests that the minor classes don’t have much relevance, which in turn implies that the implicit geometry might not be suitable for separating the different classes.
kernel – We set this parameter to linear. This value suggests that in a higher-dimensional vector space, the categories could be linearly separated by a hyperplane.

ANN-based classifier: Amazon Titan and Cohere’s multilingual embeddings model
Given the promising results obtained with SVMs, we decided to explore another geometry-based method by employing an Artificial Neural Network (ANN) approach.
In this case, we performed normalization of the input vectors to use the advantages of normalization when using neural networks. Normalizing the input data is a crucial step when working with ANNs, because it can help improve the model's convergence and stability during training. We applied min/max scaling for normalization.
The use of an ANN-based approach provides the ability to capture complex non-linear relationships in the data, which might not be easily modeled using traditional linear methods like SVMs. The combination of the geometric insights and the normalization of inputs can potentially lead to improved predictive accuracy compared to the previous SVM results.
This approach consists of the following parameters:

Model definition – We define a sequential deep learning model using the Keras library from TensorFlow.
Model architecture – The model consists of three densely connected layers. The first layer has 16 neurons and uses the ReLU activation function. The second layer has 8 neurons and employs the ReLU activation function. The third layer has 3 neurons and uses the softmax activation function.
Model compilation – We compile the model using the categorical_crossentropy loss function, the Adam optimizer with a learning rate of 0.01, and the categorical_accuracy metric. We incorporate an EarlyStopping callback to stop the training if the categorical_accuracy metric doesn't improve for 25 epochs.
Model training – We train the model for a maximum of 500 epochs using the training set and validate it on the test set. The batch size is set to 64. The performance metric used is the maximum classification accuracy (categorical_accuracy) obtained during the training.

We applied the same methodology, but using the embeddings generated by Cohere’s multilingual embeddings model after being normalized through min/max scaling. In both cases, we employed the same preprocessing steps:

import boto3
import pandas as pd
from langchain.embeddings import BedrockEmbeddings
from sklearn.preprocessing import MinMaxScaler

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2'
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="cohere.embed-multilingual-v3",
)

df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

df_train['sample'] = [s[:1500] for s in df_train['sample']]
df_test['sample'] = [s[:1500] for s in df_test['sample']]

X_train_emb = bedrock_embedding.embed_documents(df_train['sample'].values.tolist())
X_test_emb = bedrock_embedding.embed_documents(df_test['sample'].values.tolist())
y_train = df_train['agent'].values.tolist()

# Min/max scaling of the embeddings, as described above.
scaler = MinMaxScaler()
X_train_emb_norm = scaler.fit_transform(X_train_emb)
X_test_emb_norm = scaler.transform(X_test_emb)

# One-hot encode the training labels and map the test labels to class indices.
y_train_ohe = [[int(y == 'Conversation'), int(y == 'Document_Translation'), int(y == 'Services')] for y in y_train]
y_test = df_test['agent'].values.tolist()
y_test = [['Conversation', 'Document_Translation', 'Services'].index(y) for y in y_test]
X_test = df_test['sample'].values.tolist()

To help avoid ordinal assumptions, we employed a one-hot encoding representation for the output of the network. One-hot encoding doesn’t make any assumptions about the inherent order or hierarchy among the categories. This is particularly useful when the categorical variable doesn’t have a clear ordinal relationship, because the model can learn the relationships without being biased by any assumed ordering.
The following code illustrates the implementation:

import threading

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

def train_model(X, y, n_hebras=10, reps=30, train_size=0.7, tipo_optimizacion="low"):
    # Run several training repetitions in parallel threads and keep the best model.
    reps_por_hebra = int(reps / n_hebras)
    hebras = [0] * n_hebras
    results = [0] * reps
    models = [0] * reps

    for i in range(len(hebras)):
        hebras[i] = threading.Thread(target=eval_model_rep_times,
                                     args=(X, y, train_size, reps_por_hebra, i * reps_por_hebra, models, results))
        hebras[i].start()

    for i in range(len(hebras)):
        hebras[i].join()

    if tipo_optimizacion == "low":
        result = models[np.argmin(results)], min(results)
    else:
        result = models[np.argmax(results)], max(results)
    return result

def eval_model_rep_times(X, y, train_size, reps, index, models, results):
    for rep in range(reps):
        X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size)
        model, metric = create_and_fit_model(X_train, y_train, X_test, y_test)
        models[index + rep] = model
        results[index + rep] = metric

def create_and_fit_model(X_train, y_train, X_test, y_test):
    # Model definition: three dense layers (16 -> 8 -> 3) with a softmax output.
    model = Sequential()
    model.add(Dense(16, input_shape=(len(X_train[0]),), activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.01), metrics=['categorical_accuracy'])
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor="categorical_accuracy", patience=25, mode='max')

    # Training: up to 500 epochs with early stopping, validating on the held-out split.
    history = model.fit(X_train,
                        y_train,
                        epochs=500,
                        validation_data=(X_test, y_test),
                        batch_size=64,
                        callbacks=[early_stopping],
                        verbose=0)

    metrica = max(history.history['categorical_accuracy'])

    # Always return the trained model together with its best training accuracy.
    return model, metrica

model, mse = train_model(X_train_emb_norm, y_train_ohe, 5, 20, tipo_optimizacion="high")
y_pred = [est.argmax() for est in model.predict(X_test_emb_norm)]

Results
We conducted a comparative analysis using the previously presented code and data. The models were assessed based on their F1 scores for the conversation, services, and document translation tasks, as well as their runtimes. The following table summarizes our results.

MODEL | CONVERSATION F1 | SERVICES F1 | DOCUMENT_TRANSLATION F1 | RUNTIME (Seconds)
LLM | 0.81 | 0.22 | 0.46 | 1.2
LLM with examples | 0.86 | 0.13 | 0.68 | 18
KNN – Amazon Titan Embedding | 0.98 | 0.57 | 0.88 | 0.35
KNN – Cohere Embedding | 0.96 | 0.72 | 0.72 | 0.35
SVM Amazon Titan Embedding | 0.98 | 0.69 | 0.82 | 0.3
SVM Cohere Embedding | 0.99 | 0.80 | 0.93 | 0.3
ANN Amazon Titan Embedding | 0.98 | 0.60 | 0.87 | 0.15
ANN Cohere Embedding | 0.99 | 0.77 | 0.96 | 0.15

As illustrated in the table, the SVM and ANN models using Cohere’s multilingual embeddings model demonstrated the strongest overall performance. The SVM with Cohere’s multilingual embeddings model achieved the highest F1 scores in two out of three tasks, reaching 0.99 for Conversation, 0.80 for Services, and 0.93 for Document_Translation. Similarly, the ANN with Cohere’s multilingual embeddings model also performed exceptionally well, with F1 scores of 0.99, 0.77, and 0.96 for the respective tasks.
In contrast, the LLM exhibited relatively lower F1 scores, particularly for the services (0.22) and document translation (0.46) tasks. However, the performance of the LLM improved when provided with examples, with the F1 score for document translation increasing from 0.46 to 0.68.
Regarding runtime, the k-NN, SVM, and ANN models demonstrated significantly faster inference times compared to the LLM. The k-NN and SVM models with both Amazon Titan and Cohere’s multilingual embeddings model had runtimes of approximately 0.3–0.35 seconds. The ANN models were even faster, with runtimes of approximately 0.15 seconds. In contrast, the LLM required approximately 1.2 seconds for inference, and the LLM with examples took around 18 seconds.
These results suggest that the SVM and ANN models using Cohere’s multilingual embeddings model offer the best balance of performance and efficiency for the given tasks. The superior F1 scores of these models, coupled with their faster runtimes, make them promising candidates for application. The potential benefits of providing examples to the LLM model are also noteworthy, because this approach can help improve its performance on specific tasks.
Conclusion
The optimization of AIDA, Applus IDIADA’s intelligent chatbot powered by Amazon Bedrock, has been a resounding success. By developing dedicated pipelines to handle different types of user interactions—from general conversations to specialized service requests and document translations—AIDA has significantly improved its efficiency, accuracy, and overall user experience. The innovative use of LLMs, embeddings, and advanced classification algorithms has allowed AIDA to adapt to the evolving needs of IDIADA’s workforce, providing a versatile and reliable virtual assistant. AIDA now handles over 1,000 interactions per day, with a 95% accuracy rate in routing requests to the appropriate pipeline and driving a 20% increase in their team’s productivity.
Looking ahead, IDIADA plans to offer AIDA as an integrated product for customer environments, further expanding the reach and impact of this transformative technology.
Amazon Bedrock offers a comprehensive approach to security, compliance, and responsible AI development that empowers IDIADA and other customers to harness the full potential of generative AI without compromising on safety and trust. As this advanced technology continues to rapidly evolve, Amazon Bedrock provides the transparent framework needed to build innovative applications that inspire confidence.
Unlock new growth opportunities by creating custom, secure AI models tailored to your organization’s unique needs. Take the first step in your generative AI transformation—connect with an AWS expert today to begin your journey.

About the Authors
Xavier Vizcaino is the head of the DataLab, in the Digital Solutions department of Applus IDIADA. DataLab is the unit focused on the development of solutions for generating value from the exploitation of data through artificial intelligence.
Diego Martín Montoro is an AI Expert and Machine Learning Engineer at Applus+ Idiada Datalab. With a Computer Science degree and a Master’s in Data Science, Diego has built his career in the field of artificial intelligence and machine learning. His experience includes roles as a Machine Learning Engineer at companies like AppliedIT and Applus+ IDIADA, where he has worked on developing advanced AI systems and anomaly detection solutions.
Jordi Sánchez Ferrer is the current Product Owner of the Datalab at Applus+ Idiada. A Computer Engineer with a Master’s degree in Data Science, Jordi’s trajectory includes roles as a Business Intelligence developer, Machine Learning engineer, and lead developer in Datalab. In his current role, Jordi combines his technical expertise with product management skills, leading strategic initiatives that align data science and AI projects with business objectives at Applus+ Idiada.
Daniel Colls is a professional with more than 25 years of experience who has lived through the digital transformation and the transition from the on-premises model to the cloud from different perspectives in the IT sector. For the past 3 years, as a Solutions Architect at AWS, he has made this experience available to his customers, helping them successfully implement or move their workloads to the cloud.