A library for data valuation.
pyDVL collects algorithms for Data Valuation and Influence Function computation.
Data Valuation for machine learning is the task of assigning a scalar to each element of a training set that reflects its contribution to the final performance or outcome of some model trained on it. Some notions of value depend on a specific model of interest, while others are model-agnostic. pyDVL focuses on model-dependent methods.
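For intuition, consider the Shapley value, one of the model-dependent notions implemented in pyDVL (see Ghorbani and Zou, cited below). Given a utility $u(S)$, e.g. the test performance of the model trained on a subset $S$ of the training set $D$ with $n$ elements, the value of point $i$ is its average marginal contribution over all subsets:

$$
v_\text{shap}(i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{i\}} \binom{n-1}{|S|}^{-1} \left[ u(S \cup \{i\}) - u(S) \right]
$$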
The Influence Function is an infinitesimal measure of the effect that a single training point has on the parameters of a model, or on any function thereof. In particular, in machine learning it is used to quantify the effect of individual training samples on predictions at individual test points.
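In the classical formulation of Koh and Liang (cited below), and up to sign conventions, the influence of a training point $z$ on the loss at a test point $z_\text{test}$ is

$$
\mathcal{I}(z, z_\text{test}) = - \nabla_\theta \ell(z_\text{test}, \hat{\theta})^\top \, H_{\hat{\theta}}^{-1} \, \nabla_\theta \ell(z, \hat{\theta}),
$$

where $\hat{\theta}$ are the fitted model parameters, $\ell$ is the loss, and $H_{\hat{\theta}}$ is the Hessian of the empirical loss at $\hat{\theta}$. The approximation methods cited below differ mainly in how they compute or approximate the inverse Hessian-vector products.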
To install the latest release use:

```shell
$ pip install pyDVL
```
You can also install the latest development version from TestPyPI:

```shell
$ pip install pyDVL --index-url https://test.pypi.org/simple/
```
pyDVL also has optional extra dependencies for certain functionalities (e.g. influence functions).
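For example, the influence functions walkthrough below needs the corresponding extra. The extra name `influence` used here is an assumption; check the installation documentation for the exact names:

```shell
$ pip install "pyDVL[influence]"
```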
For more instructions and information refer to Installing pyDVL in the documentation.
In the following subsections, we will showcase the usage of pyDVL for Data Valuation and Influence Functions using simple examples.
For more instructions and information refer to Getting Started in the documentation. We provide several examples for data valuation (e.g. Shapley Data Valuation) and for influence functions (e.g. Influence Functions for Neural Networks) with details on the algorithms and their applications.
For influence computation, follow these steps:

1. Import the necessary packages (the exact packages depend on your specific use case):

   ```python
   import torch
   from torch import nn
   from torch.utils.data import DataLoader, TensorDataset

   from pydvl.influence.torch import DirectInfluence
   from pydvl.influence.torch.util import (
       NestedTorchCatAggregator,
       TorchNumpyConverter,
   )
   from pydvl.influence import SequentialInfluenceCalculator
   ```

2. Create PyTorch data loaders for your train and test splits:

   ```python
   input_dim = (5, 5, 5)
   output_dim = 3
   train_x = torch.rand((10, *input_dim))
   train_y = torch.rand((10, output_dim))
   test_x = torch.rand((5, *input_dim))
   test_y = torch.rand((5, output_dim))

   train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
   test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
   ```

3. Instantiate your neural network model:

   ```python
   nn_architecture = nn.Sequential(
       nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
       nn.Flatten(),
       nn.Linear(27, 3),
   )
   ```

4. Define your loss:

   ```python
   loss = nn.MSELoss()
   ```

5. Instantiate an `InfluenceFunctionModel` and fit it to the training data:

   ```python
   infl_model = DirectInfluence(nn_architecture, loss, hessian_regularization=0.01)
   infl_model = infl_model.fit(train_data_loader)
   ```

6. For small input data, call the `influences` method on the fitted instance:

   ```python
   influences = infl_model.influences(test_x, test_y, train_x, train_y)
   ```

   The result is a tensor of shape `(training samples x test samples)` that contains at index `(i, j)` the influence of training sample `i` on test sample `j`.

7. For larger data, wrap the model into a calculator and call methods on the calculator:

   ```python
   infl_calc = SequentialInfluenceCalculator(infl_model)

   # Lazy object providing arrays batch-wise in a sequential manner
   lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)

   # Trigger computation and pull results to memory
   influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())

   # Trigger computation and write results batch-wise to disk
   lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
   ```
The higher the absolute value of the influence of a training sample on a test sample, the more influential it is for the chosen test sample, model and data loaders. The sign of the influence determines whether it is useful (positive) or harmful (negative).
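For example, to rank training samples by their estimated effect on one particular test sample, sort along the training axis. This is a minimal post-processing sketch, not part of the pyDVL API, assuming the `influences` tensor from step 6 with the `(training samples x test samples)` layout described there:

```python
# Minimal sketch (assumes `influences` from step 6, shape: n_train x n_test).
test_idx = 0  # hypothetical test sample of interest

# Sort training samples by their influence on this test sample, descending.
ranking = torch.argsort(influences[:, test_idx], descending=True)

print(f"Most helpful training samples: {ranking[:3].tolist()}")
print(f"Most harmful training samples: {ranking[-3:].tolist()}")
```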
Note: pyDVL currently only supports PyTorch for Influence Functions. We are planning to add support for Jax and perhaps TensorFlow or even Keras.
The steps required to compute data values for your samples are:

1. Import the necessary packages (the exact packages depend on your specific use case):

   ```python
   import matplotlib.pyplot as plt
   from sklearn.datasets import load_breast_cancer
   from sklearn.linear_model import LogisticRegression

   from pydvl.utils import Dataset, Scorer, Utility
   from pydvl.value import (
       compute_shapley_values,
       ShapleyMode,
       MaxUpdates,
   )
   ```

2. Create a `Dataset` object with your train and test splits:

   ```python
   data = Dataset.from_sklearn(
       load_breast_cancer(),
       train_size=10,
       stratify_by_target=True,
       random_state=16,
   )
   ```

3. Create an instance of a `SupervisedModel` (basically any sklearn-compatible predictor):

   ```python
   model = LogisticRegression()
   ```

4. Create a `Utility` object to wrap the Dataset, the model and a scoring function:

   ```python
   u = Utility(
       model,
       data,
       Scorer("accuracy", default=0.0),
   )
   ```

5. Use one of the methods defined in the library to compute the values. In our example, we will use Permutation Monte Carlo Shapley, an approximate method for computing Data Shapley values:

   ```python
   values = compute_shapley_values(
       u,
       mode=ShapleyMode.PermutationMontecarlo,
       done=MaxUpdates(100),
       seed=16,
       progress=True,
   )
   ```

   The result is a variable of type `ValuationResult` that contains the indices and their values as well as other attributes. The higher the value for an index, the more important it is for the chosen model, dataset and scorer.

6. (Optional) Convert the valuation result to a dataframe, then analyze and visualize the values (a plotting sketch follows this list):

   ```python
   df = values.to_dataframe(column="data_value")
   ```
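For instance, the values can then be inspected with pandas and the matplotlib import from step 1. A minimal sketch, assuming the `df` dataframe from the optional step above (the `data_value` column name comes from the `column` argument passed to `to_dataframe`):

```python
# Minimal sketch: bar plot of the computed values, sorted ascending.
df = df.sort_values("data_value")

plt.bar(range(len(df)), df["data_value"])
plt.xlabel("Training sample (sorted by value)")
plt.ylabel("Data value")
plt.title("Permutation Monte Carlo Shapley values")
plt.show()
```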
Please open new issues for bugs, feature requests and extensions. You can read about the structure of the project, the toolchain and workflow in the guide for contributions.
We currently implement the following papers:
- Castro, Javier, Daniel Gómez, and Juan Tejada. Polynomial Calculation of the Shapley Value Based on Sampling. Computers & Operations Research, Selected papers presented at the Tenth International Symposium on Locational Decisions (ISOLDE X), 36, no. 5 (May 1, 2009): 1726–30.
- Ghorbani, Amirata, and James Zou. Data Shapley: Equitable Valuation of Data for Machine Learning. In International Conference on Machine Learning, 2242–51. PMLR, 2019.
- Wang, Tianhao, Yu Yang, and Ruoxi Jia. Improving Cooperative Game Theory-Based Data Valuation via Data Utility Learning. arXiv, 2022.
- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas Spanos, and Dawn Song. Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms. Proceedings of the VLDB Endowment 12, no. 11 (1 July 2019): 1610–23.
- Okhrati, Ramin, and Aldo Lipani. A Multilinear Sampling Algorithm to Estimate Shapley Values. In 25th International Conference on Pattern Recognition (ICPR 2020), 7992–99. IEEE, 2021.
- Yan, Tom, and Ariel D. Procaccia. If You Like Shapley Then You’ll Love the Core. Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 6 (2021): 5751–59.
- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos. Towards Efficient Data Valuation Based on the Shapley Value. In 22nd International Conference on Artificial Intelligence and Statistics, 1167–76. PMLR, 2019.
- Wang, Jiachen T., and Ruoxi Jia. Data Banzhaf: A Robust Data Valuation Framework for Machine Learning. arXiv, October 22, 2022.
- Kwon, Yongchan, and James Zou. Beta Shapley: A Unified and Noise-Reduced Data Valuation Framework for Machine Learning. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022, Vol. 151. Valencia, Spain: PMLR, 2022.
- Kwon, Yongchan, and James Zou. Data-OOB: Out-of-Bag Estimate as a Simple and Efficient Data Value. In Proceedings of the 40th International Conference on Machine Learning, 18135–52. PMLR, 2023.
- Schoch, Stephanie, Haifeng Xu, and Yangfeng Ji. CS-Shapley: Class-Wise Shapley Values for Data Valuation in Classification. In Proc. of the Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS). New Orleans, Louisiana, USA, 2022.
- Koh, Pang Wei, and Percy Liang. Understanding Black-Box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning, 70:1885–94. Sydney, Australia: PMLR, 2017.
- Agarwal, Naman, Brian Bullins, and Elad Hazan. Second-Order Stochastic Optimization for Machine Learning in Linear Time. Journal of Machine Learning Research 18 (2017): 1–40.
- Schioppa, Andrea, Polina Zablotskaia, David Vilar, and Artem Sokolov. Scaling Up Influence Functions. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-22), 2022.
- Martens, James, and Roger Grosse. Optimizing Neural Networks with Kronecker-factored Approximate Curvature. In International Conference on Machine Learning (ICML), 2015.
- George, Thomas, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis. Advances in Neural Information Processing Systems 31, 2018.
- Hataya, Ryuichiro, and Makoto Yamada. Nyström Method for Accurate and Scalable Implicit Differentiation. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS), 2023.
pyDVL is distributed under LGPL-3.0. The complete license text can be found in two files in the repository.
All contributions will be distributed under this license.