page_type | languages | products | description |
---|---|---|---|
sample | python | | Evaluate. |
This tutorial provides a step-by-step guide on how to evaluate generative AI base models or AI applications with Azure. Each of these samples uses the `azure-ai-evaluation` SDK.
When selecting a base model for building an application—or after building an AI application (such as a Retrieval-Augmented Generation (RAG) system or a multi-agent framework)—evaluation plays a pivotal role. Effective evaluation ensures that the chosen or developed AI model or application meets the intended safety, quality, and performance benchmarks.
In both cases, running evaluations requires specific tools, methods, and datasets. Here’s a breakdown of the key components involved:
- Testing with Evaluation Datasets
  - Bring Your Own Data: Use datasets tailored to your application or domain.
  - Red-teaming Queries: Design adversarial prompts to test robustness.
  - Azure AI Simulators: Leverage Azure AI's context-specific or adversarial dataset generators to create relevant test cases.
- Selecting the Appropriate Evaluators or Building Custom Ones
  - Pre-Built Evaluators: Azure AI provides a range of generation safety and quality/NLP evaluators ready for immediate use.
  - Custom Evaluators: Using the Azure AI Evaluation SDK, you can design and implement evaluators that align with the unique requirements of your application.
- Generating and Visualizing Evaluation Results: The Azure AI Evaluation SDK enables you to evaluate target functions (such as endpoints of your AI application or your model endpoints) on your dataset with either built-in or custom evaluators. You can run evaluations remotely in the cloud or locally on your own machine (see the sketch after this list).
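As a minimal sketch of that last step, assuming a local JSONL file named `eval_data.jsonl` whose rows contain `query` and `response` fields and an Azure OpenAI deployment for the AI-assisted judge (all placeholder names, not part of the samples), a local run with one built-in evaluator and one custom evaluator could look like this:

```python
# Hedged sketch: placeholder endpoint/deployment values and an illustrative
# custom evaluator; adjust everything in <angle brackets> to your environment.
from azure.ai.evaluation import evaluate, RelevanceEvaluator

# Model used by the AI-assisted (LLM-judge) evaluator.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

def answer_length(*, response: str, **kwargs):
    """Custom evaluator: any callable that takes row fields as keyword
    arguments and returns a dict of metric values."""
    return {"answer_length": len(response)}

result = evaluate(
    data="eval_data.jsonl",  # one JSON object per line, e.g. {"query": ..., "response": ...}
    evaluators={
        "relevance": RelevanceEvaluator(model_config),  # built-in AI-assisted quality evaluator
        "answer_length": answer_length,                 # custom evaluator
    },
)
print(result["metrics"])  # aggregated metrics across all rows
```

Passing an `azure_ai_project` to `evaluate` can additionally log the run to your Azure AI project so the results can be visualized there.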
The main objective of this tutorial is to help users understand the process of evaluating an AI model in Azure. By the end of this tutorial, you should be able to:
- Simulate interactions with an AI model
- Evaluate both deployed model endpoints and applications
- Evaluate using quantitative NLP metrics, qualitative metrics, and custom metrics
Our samples cover the following tools and scenarios for evaluating AI models and applications in Azure:
Sample name | adversarial | simulator | conversation starter | index | raw text | against model endpoint | against app | qualitative metrics | custom metrics | quantitative NLP metrics |
---|---|---|---|---|---|---|---|---|---|---|
Simulate_Adversarial.ipynb | X | X | | | | X | | | | |
Simulate_From_Conversation_Starter.ipynb | | X | X | | | X | | | | |
Simulate_From_Azure_Search_Index.ipynb | | X | | X | | X | | | | |
Simulate_From_Input_Text.ipynb | | X | | | X | X | | | | |
Evaluate_Base_Model_Endpoint.ipynb | | | | | | X | | X | | |
Evaluate_App_Endpoint.ipynb | | | | | | | X | X | | |
AI_Judge_Evaluators_Quality.ipynb | | | | | | X | | X | | |
Custom_Evaluators.ipynb | | | | | | X | | | X | |
NLP_Evaluators.ipynb | | | | | | X | | | | X |
AI_Judge_Evaluators_Safety_Risks.ipynb | | | | | | X | | X | | |
Simulate_Evaluate_Groundedness.py | | X | | | X | | X | X | | |
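The Simulate_* samples in the table are built on the simulators that ship with the SDK. As a rough outline only, assuming an existing Azure AI project and substituting a stub echo callback for a real model or application target, adversarial test data can be generated along these lines:

```python
# Hedged sketch: the project values and the echo callback are placeholders you
# would replace with your own Azure AI project and application/model target.
import asyncio
import json

from azure.identity import DefaultAzureCredential
from azure.ai.evaluation.simulator import AdversarialScenario, AdversarialSimulator

azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

async def callback(messages, stream=False, session_state=None, context=None):
    # Stub target: echo the last user message back. Replace with a call to
    # your model endpoint or application.
    query = messages["messages"][-1]["content"]
    messages["messages"].append({"role": "assistant", "content": f"Echo: {query}"})
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": context,
    }

async def main():
    simulator = AdversarialSimulator(
        azure_ai_project=azure_ai_project, credential=DefaultAzureCredential()
    )
    outputs = await simulator(
        scenario=AdversarialScenario.ADVERSARIAL_QA,
        target=callback,
        max_simulation_results=3,  # keep the sketch small
    )
    # Persist the simulated conversations for a later evaluation run.
    with open("adversarial_test_data.jsonl", "w") as f:
        for item in outputs:
            f.write(json.dumps(item) + "\n")

asyncio.run(main())
```

The Simulate_*.ipynb notebooks above show the same flow end to end, including non-adversarial simulation from conversation starters, an Azure AI Search index, or raw input text.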
To use the `azure-ai-evaluation` SDK, install it with `pip install azure-ai-evaluation`. Python 3.8 or later is required to use this package.

- See our Python reference documentation for the `azure-ai-evaluation` SDK here for more granular details on input/output requirements and usage instructions.
- Check out our GitHub repo for the `azure-ai-evaluation` SDK here.
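As a quick sanity check after installing, the purely local NLP evaluators need no Azure resources; the sentences below are made-up sample data:

```python
# Local smoke test: quantitative NLP evaluators run entirely on your machine.
from azure.ai.evaluation import BleuScoreEvaluator, F1ScoreEvaluator

response = "Tokyo is the capital of Japan."
ground_truth = "The capital of Japan is Tokyo."

# Each evaluator call returns a small dict of scores for the pair of strings.
print(BleuScoreEvaluator()(response=response, ground_truth=ground_truth))
print(F1ScoreEvaluator()(response=response, ground_truth=ground_truth))
```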