page_type | languages | products | description |
---|---|---|---|
sample | python | | Evaluate. |
This tutorial provides a step-by-step guide on how to evaluate generative AI base models or AI applications with Azure. Each of these samples uses the `azure-ai-evaluation` SDK.
When selecting a base model for building an application—or after building an AI application (such as a Retrieval-Augmented Generation (RAG) system or a multi-agent framework)—evaluation plays a pivotal role. Effective evaluation ensures that the chosen or developed AI model or application meets the intended safety, quality, and performance benchmarks.
In both cases, running evaluations requires specific tools, methods, and datasets. Here’s a breakdown of the key components involved:
- Testing with Evaluation Datasets
  - Bring Your Own Data: Use datasets tailored to your application or domain.
  - Red-teaming Queries: Design adversarial prompts to test robustness.
  - Azure AI Simulators: Leverage Azure AI's context-specific or adversarial dataset generators to create relevant test cases.
- Selecting the Appropriate Evaluators or Building Custom Ones
  - Pre-Built Evaluators: Azure AI provides a range of generation safety and quality/NLP evaluators ready for immediate use.
  - Custom Evaluators: Using the Azure AI Evaluation SDK, you can design and implement evaluators that align with the unique requirements of your application.
- Generating and Visualizing Evaluation Results: The Azure AI Evaluation SDK enables you to evaluate target functions (such as endpoints of your AI application or your model endpoints) on your dataset with either built-in or custom evaluators. You can run evaluations remotely in the cloud or locally on your own machine (see the sketch after this list).
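As a minimal sketch of that last step, assuming a local JSONL file named `eval_data.jsonl` whose rows contain `query` and `response` fields and an Azure OpenAI deployment for the AI-assisted judge (all placeholder names, not part of the samples), a local run with one built-in evaluator and one custom evaluator could look like this:

```python
# Hedged sketch: placeholder endpoint/deployment values and an illustrative
# custom evaluator; adjust everything in <angle brackets> to your environment.
from azure.ai.evaluation import evaluate, RelevanceEvaluator

# Model used by the AI-assisted (LLM-judge) evaluator.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

def answer_length(*, response: str, **kwargs):
    """Custom evaluator: any callable that takes row fields as keyword
    arguments and returns a dict of metric values."""
    return {"answer_length": len(response)}

result = evaluate(
    data="eval_data.jsonl",  # one JSON object per line, e.g. {"query": ..., "response": ...}
    evaluators={
        "relevance": RelevanceEvaluator(model_config),  # built-in AI-assisted quality evaluator
        "answer_length": answer_length,                 # custom evaluator
    },
)
print(result["metrics"])  # aggregated metrics across all rows
```

Passing an `azure_ai_project` to `evaluate` can additionally log the run to your Azure AI project so the results can be visualized there.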
The main objective of this tutorial is to help users understand the process of evaluating an AI model in Azure. By the end of this tutorial, you should be able to:
- Simulate interactions with an AI model
- Evaluate both deployed model endpoints and applications
- Evaluate using quantitative NLP metrics, qualitative metrics, and custom metrics
Our samples cover the following tools and scenarios for evaluating AI models and applications in Azure:
Sample name | adversarial | simulator | conversation starter | index | raw text | against model endpoint | against app | qualitative metrics | custom metrics | quantitative NLP metrics |
---|---|---|---|---|---|---|---|---|---|---|
Simulate_Adversarial.ipynb | X | X | | | | X | | | | |
Simulate_From_Conversation_Starter.ipynb | | X | X | | | X | | | | |
Simulate_From_Azure_Search_Index.ipynb | | X | | X | | X | | | | |
Simulate_From_Input_Text.ipynb | | X | | | X | X | | | | |
Evaluate_Base_Model_Endpoint.ipynb | | | | | | X | | X | | |
Evaluate_App_Endpoint.ipynb | | | | | | | X | X | | |
AI_Judge_Evaluators_Quality.ipynb | | | | | | X | | X | | |
Custom_Evaluators.ipynb | | | | | | X | | | X | |
NLP_Evaluators.ipynb | | | | | | X | | | | X |
AI_Judge_Evaluators_Safety_Risks.ipynb | | | | | | X | | X | | |
Simulate_Evaluate_Groundedness.py | | X | | | X | | X | X | | |
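The Simulate_* samples in the table are built on the simulators that ship with the SDK. As a rough outline only, assuming an existing Azure AI project and substituting a stub echo callback for a real model or application target, adversarial test data can be generated along these lines:

```python
# Hedged sketch: the project values and the echo callback are placeholders you
# would replace with your own Azure AI project and application/model target.
import asyncio
import json

from azure.identity import DefaultAzureCredential
from azure.ai.evaluation.simulator import AdversarialScenario, AdversarialSimulator

azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

async def callback(messages, stream=False, session_state=None, context=None):
    # Stub target: echo the last user message back. Replace with a call to
    # your model endpoint or application.
    query = messages["messages"][-1]["content"]
    messages["messages"].append({"role": "assistant", "content": f"Echo: {query}"})
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": context,
    }

async def main():
    simulator = AdversarialSimulator(
        azure_ai_project=azure_ai_project, credential=DefaultAzureCredential()
    )
    outputs = await simulator(
        scenario=AdversarialScenario.ADVERSARIAL_QA,
        target=callback,
        max_simulation_results=3,  # keep the sketch small
    )
    # Persist the simulated conversations for a later evaluation run.
    with open("adversarial_test_data.jsonl", "w") as f:
        for item in outputs:
            f.write(json.dumps(item) + "\n")

asyncio.run(main())
```

The Simulate_*.ipynb notebooks above show the same flow end to end, including non-adversarial simulation from conversation starters, an Azure AI Search index, or raw input text.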
To use the `azure-ai-evaluation` SDK, install it with `pip install azure-ai-evaluation`. Python 3.8 or later is required to use this package.

- See our Python reference documentation for the `azure-ai-evaluation` SDK here for more granular details on input/output requirements and usage instructions.
- Check out our GitHub repo for the `azure-ai-evaluation` SDK here.
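As a quick sanity check after installing, the purely local NLP evaluators need no Azure resources; the sentences below are made-up sample data:

```python
# Local smoke test: quantitative NLP evaluators run entirely on your machine.
from azure.ai.evaluation import BleuScoreEvaluator, F1ScoreEvaluator

response = "Tokyo is the capital of Japan."
ground_truth = "The capital of Japan is Tokyo."

# Each evaluator call returns a small dict of scores for the pair of strings.
print(BleuScoreEvaluator()(response=response, ground_truth=ground_truth))
print(F1ScoreEvaluator()(response=response, ground_truth=ground_truth))
```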