SyftBox is a framework for building PETs (privacy-enhancing technologies) applications with minimal barriers, regardless of the programming language or environment.
- Distributed Network (of Datasites): each Datasite contributes to the network with its data and applications (APIs)
- Modular: breaks down complex PETs into smaller, reusable components
- Language & Environment Agnostic
- SyftBox APIs: designed to interact with public/private data from Datasites (data analysis, ML, visualization, etc., while respecting privacy)
Suitable for FL across multiple Datasites, enabling collaborative model training without exposing sensitive data.
SyftBox API is a script designed to interact with your own data and/or data synced from other datasites on your machine. These APIs form the backbone of the SyftBox ecosystem, enabling users to process, analyze, and manipulate data in a privacy-preserving manner.
- run.sh: contains instructions to set up the environment and execute the main script.
- main script (language agnostic): holds the code for the API, interacting with the platform and data.
Differential Privacy
- Ensuring that different kinds of statistical analysis don't compromise privacy.
- Its trail goes back only to 2003; applied to ML/DL even more recently.
"You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available"
Netflix Competition
- Netflix tried to protect users' privacy by masking the data.
- Researchers were still able to de-anonymize the data via statistical analysis against ratings scraped from IMDB. ("Robust De-anonymization of Large Sparse Datasets")
Similar incidents have happened several times.
Privacy in a Simple Database
- Suppose we have a true/false value for 5000 users (a one-column database).
- If we remove a person from the database and the query result does not change, then that person's privacy is fully protected. (That means that person wasn't leaking any statistical information through the query.)
- "Can we construct a query which doesn't change no matter who we remove from the database?"
- Experiment
- Make 5000 copies of the database. In each copy, one person is absent from the dataset (4999 entries).
- Then check if the query returns the same result for all of those copies.
- Sensitivity: Maximum amount that the query changes when removing an individual from the database
- If the sensitivity was 0, we would get the same output of the query regardless of who we removed from the database.
- Sensitivity of the functions explored: avg, sum, threshold.
However, there are more advanced techniques for calculating sensitivity in Differential Privacy. For example, Data-Conditioned Sensitivity.
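A minimal sketch of that experiment, assuming PyTorch (used in the later snippets in these notes as well); the function name sensitivity and the sizes/thresholds are my own illustrative choices:

import torch

def sensitivity(query, n_entries=5000):
    # empirical sensitivity: max change in the query when any one person is removed
    db = torch.rand(n_entries) > 0.5            # random true/false database
    full_result = query(db)
    max_change = 0.0
    for i in range(n_entries):
        parallel_db = torch.cat([db[:i], db[i + 1:]])   # database with person i removed
        max_change = max(max_change, abs(query(parallel_db) - full_result).item())
    return max_change

print(sensitivity(lambda db: db.sum().float()))           # sum: about 1
print(sensitivity(lambda db: db.float().mean()))          # mean: tiny, shrinks as the db grows
print(sensitivity(lambda db: (db.sum() > 2500).float()))  # threshold: 0 or 1, depending on the db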
Differencing Attack
- A basic attack that differentially private mechanisms must defend against.
- For example, take the sum case from the previous experiment:
  - Take the sum of the entire database
  - Take the sum of the database without a specific person
  - The difference between the two gives us that person's value
- Can be applied to other functions too.
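A toy illustration of the sum-based version (hypothetical 0/1 database, assuming PyTorch):

import torch

db = (torch.rand(5000) > 0.5).float()                     # toy database of 0/1 values

query_all = db.sum()                                      # sum over everyone
query_without = torch.cat([db[:10], db[11:]]).sum()       # sum without person 10

print(query_all - query_without)                          # exactly person 10's value
print(db[10])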
Developing Differentially Private Algorithms
- Add random noise to the database and/or to the queries against the database
Types of Differential Privacy
- Local Differential Privacy
- Adds noise to the function input
- Noise added on the user-side (before even putting it into database)
- Most secure (Don't need to trust the database owner)
- Global Differential Privacy
- Adds noise to the function output
- If the database operator is trustworthy, this one yields a more accurate result.
Local Differential Privacy
- Useful when the survey audience is likely to lie about the queries ("Have you jaywalked in the last week?").
- Randomized response: each participant flips a coin twice (coin flips hidden from the experimenter):
  - If the first coin flip is heads, answer honestly.
  - If the first coin flip is tails, answer according to the second coin flip (heads = yes, tails = no).
- So if a participant says yes, it's plausible that they're saying so because of a coin flip, not because it's the truth. (Local noise)
- The experimenter can take the resulting statistic and average out the 50-50 coin flip. (Removing the noise)
- Suppose 70% jaywalked. Half of the participants will say yes/no with 50% probability (based on the second coin flip), and the other half will answer truthfully, i.e. yes with 70% probability (based on the first coin flip).
  (50% * 0.5 = 25%, 50% * 0.7 = 35%, 25% + 35% = 60% of the population will say yes)
- So now we know (approximately) what % of people jaywalked without knowing whether any individual person jaywalked or not. We can denoise this using (result * 2) - 0.5, e.g. (0.6 * 2) - 0.5 = 0.7.
- Gained privacy, lost some accuracy. The more data points we have, the more this noise tends to average out and not affect the output of the query, producing a more accurate result.
- We can also bias the coin flips to vary the amount of noise added:
import torch

def query(db, noise=0.2):
    # randomized-response mean query; `noise` is the probability of answering randomly
    true_result = torch.mean(db.float())
    # 1 = answer honestly (probability 1 - noise), 0 = answer with the second coin flip
    first_coin_flip = (torch.rand(len(db)) > noise).float()
    second_coin_flip = (torch.rand(len(db)) < 0.5).float()
    augmented_database = db.float() * first_coin_flip + (1 - first_coin_flip) * second_coin_flip
    sk_result = augmented_database.float().mean()
    # deskew: E[sk_result] = (1 - noise) * true + noise * 0.5
    private_result = ((sk_result / noise) - 0.5) * noise / (1 - noise)
    return private_result, true_result
- The size of the dataset allows us to add more noise while maintaining decent accuracy.
- The more private data we have access to, the easier it is to protect the privacy of the people involved.
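A quick check of that claim, reusing the query function defined above (a sketch; exact numbers vary from run to run):

import torch

for size in (10, 100, 1000, 10000):
    db = (torch.rand(size) > 0.5).float()
    private_result, true_result = query(db, noise=0.2)
    # the gap between the noisy and true mean shrinks as the database grows
    print(size, round(private_result.item(), 3), round(true_result.item(), 3))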
Global Differential Privacy
- Add noise to the output of the function
  - Gaussian noise
  - Laplacian noise (generally works better)
- Advantage: we can often add less noise after computing the result and still get better accuracy while maintaining privacy, because many functions reduce the sensitivity involved. So adding noise later in the processing chain is more efficient.
- Privacy Budget: how much epsilon/delta leakage do we allow for our analysis?
- How much noise should we add? A function of four things:
  - Type of noise (Gaussian/Laplacian)
  - Sensitivity of the query
  - Desired epsilon
  - Desired delta
- Laplacian noise takes an input parameter b (beta), which tells us how significant the noise is:
  b = sensitivity(query) / epsilon
  delta = 0
import numpy as np
import torch

def laplacian_mechanism(db, query, sensitivity, epsilon):
    # the noise scale grows with the query's sensitivity and shrinks with epsilon
    beta = sensitivity / epsilon
    noise = torch.tensor(np.random.laplace(0, beta, 1))
    return query(db) + noise
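Example usage of the mechanism above (with epsilon passed explicitly). For a 0/1 database, sum has sensitivity 1, while mean has sensitivity 1/n and therefore needs much less noise for the same epsilon:

import torch

db = (torch.rand(5000) > 0.5).float()

print(laplacian_mechanism(db, lambda d: d.sum(), sensitivity=1, epsilon=0.5))
print(laplacian_mechanism(db, lambda d: d.mean(), sensitivity=1 / len(db), epsilon=0.5))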
- Laplacian noise always has zero delta; Gaussian noise has a non-zero delta.
- Epsilon is per query. If we have multiple queries to perform, we can split the budget and adjust epsilon per query accordingly.
- Smaller epsilon means smaller privacy leakage, thus more noise and a wider Laplacian distribution, so the returned values will fluctuate more.
Formal Definition of Differential Privacy
An upper bound on how much the output distribution of a query/mechanism can vary between the original database and a database with any one individual's record removed.
The less the variation, the less the privacy leakage.
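For reference, the standard (epsilon, delta) formulation being described here (the usual textbook definition, not something specific to these notes): a randomized mechanism M is (epsilon, delta)-differentially private if, for all output sets S and all databases D, D' differing in one individual's record,

\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta

Epsilon bounds the multiplicative variation between the two output distributions, and delta allows a small additive slack.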
- Ensuring that when our neural networks learn from sensitive data, they only learn what they're supposed to learn from the data,
- without accidentally learning (memorizing) what they are not supposed to learn from the data.
Perfect Privacy
Training a model on a dataset should return the same model even if we remove any person from the training dataset
Querying a database => Training a model on a dataset
Two Points of Complexity
- Do we always know where "people" are referenced in the dataset?
  In practice we treat each training example as a single, separate person, even though examples often don't map one-to-one to people at all (think about a picture with multiple people in it).
- Neural models rarely ever train to the same location (weights), even when trained on the same dataset twice.
Scenario
Suppose our hospital has unlabeled data that we want to use to train a model.
- Ask n other hospitals to train models on their own (labeled) datasets.
- Use each model to predict on our own local dataset, generating n candidate labels for each data point.
- Perform a DP query to generate the final (DP) label for each data point (see the aggregation sketch after this list).
- Retrain a new model on our local dataset, which now has DP labels.
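A minimal sketch of the "DP query" step (PATE-style noisy-max aggregation of the remote models' votes); the helper name noisy_max_label, the class count, epsilon, and the votes below are made up for illustration:

import numpy as np

def noisy_max_label(teacher_preds, n_classes, epsilon=0.1):
    # aggregate one data point's labels from n teacher models with Laplace noise on the counts
    counts = np.bincount(teacher_preds, minlength=n_classes).astype(float)
    counts += np.random.laplace(0, 1.0 / epsilon, size=n_classes)
    return int(np.argmax(counts))

# e.g. 10 hospitals voted on one of our unlabeled images
votes = np.array([3, 3, 3, 3, 2, 3, 3, 1, 3, 3])
print(noisy_max_label(votes, n_classes=4))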
PATE Analysis
How much do these hospitals really agree/disagree?
- A formal set of mechanisms capable of computing the epsilon level, conditioned on this level of agreement.
- Returns
- Data Dependent Epsilon
- Data Independent Epsilon
- The more agreement between the parties, the tighter the data-dependent epsilon value we get: less privacy leakage, better privacy.
- So, this rewards us for creating good generalized models.
- Horizontal Federated Learning
  Ideal when different parties have similar data types but distinct samples. Think hospitals collaborating on disease prediction without sharing patient records. Each party trains a local model, and their learnings are aggregated into a global model (see the aggregation sketch after this list).
- Vertical Federated Learning
  Perfect for scenarios where parties have different features but overlapping samples. Imagine two companies, a social media platform and an online retailer, wanting to personalise recommendations for their shared users. They can train a global model that leverages both social media activity and purchase history without directly sharing their data.
- Federated Transfer Learning
  Allows leveraging knowledge from a pre-trained model to improve models on new tasks. Think of a football coach who transfers their expertise and strategies to a new team with different players and playing styles.
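A minimal sketch of the aggregation step for the horizontal case (FedAvg-style weighted averaging of client parameters); the function name fedavg, the layer shapes, and the client sizes are all made up for illustration:

import numpy as np

def fedavg(client_weights, client_sizes):
    # weighted average of client model parameters (one list of arrays per client)
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (size / total) for w, size in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

# two toy clients, each holding the same two-layer model
client_a = [np.ones((2, 2)), np.zeros(2)]
client_b = [3 * np.ones((2, 2)), np.ones(2)]
global_model = fedavg([client_a, client_b], client_sizes=[100, 300])
print(global_model[0])   # all 2.5: client B's update counts three times as much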
PySyft is a Python library for secure and private deep learning, built on top of PyTorch (extending its functionality).
- Remote Tensors (Pointers)
- Virtual Worker
- Local Worker
- Garbage Collection
- Pointers of Pointers
- PointerChain
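A tiny illustration of the pointer/VirtualWorker concepts above, written against the old PySyft 0.2.x-style API. This interface is outdated and may not match current PySyft releases; it is shown only to make the concepts concrete:

import torch
import syft as sy

hook = sy.TorchHook(torch)               # extends torch tensors with .send()/.get()
bob = sy.VirtualWorker(hook, id="bob")   # a simulated remote machine (we are the local worker)

x = torch.tensor([1, 2, 3, 4, 5])
x_ptr = x.send(bob)        # remote tensor: we only hold a pointer, bob holds the data
y_ptr = x_ptr + x_ptr      # the operation runs on bob's side and returns another pointer

print(y_ptr)               # a pointer, not the data itself
print(y_ptr.get())         # pulls the result back to the local worker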
Trusted Aggregator
A neutral 3rd party who has a machine that we can trust to not look at the gradients when performing the aggregation.
Additive Secret Sharing
Allows multiple individuals to add numbers together without any person learning anyone else's inputs to the addition.
import random

Q = 23740629843760239486723   # large field size; shares are integers mod Q

def encrypt(x, n_share=3):
    # split x into n_share random-looking shares that sum to x (mod Q)
    shares = list()
    for i in range(n_share - 1):
        shares.append(random.randint(0, Q))
    shares.append(Q - (sum(shares) % Q) + x)
    return tuple(shares)

def decrypt(shares):
    # reconstruct the secret by summing all shares mod Q
    return sum(shares) % Q

def add(a, b):
    # add two secret-shared values share-by-share; no one sees the plaintexts
    c = list()
    for i in range(len(a)):
        c.append((a[i] + b[i]) % Q)
    return tuple(c)

# Function Calls
x = encrypt(5)
y = encrypt(7)
z = add(x, y)
decrypt(z)   # 12
Fixed Precision Encoding
To convert our decimal numbers into an integer format.
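A minimal sketch, reusing Q, encrypt, decrypt, and add from the block above; BASE, PRECISION, and the helper names encode/decode are arbitrary choices for illustration:

BASE = 10
PRECISION = 4   # keep 4 decimal places

def encode(x):
    # e.g. 0.5 -> 5000, so the integer-only secret sharing above can handle it
    return int(round(x * BASE ** PRECISION)) % Q

def decode(n):
    # values above Q/2 are interpreted as negatives
    if n > Q // 2:
        n -= Q
    return n / BASE ** PRECISION

a = encrypt(encode(0.5))
b = encrypt(encode(0.25))
print(decode(decrypt(add(a, b))))   # 0.75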
Structured Transparency: a framework for addressing use/mis-use trade-offs when sharing information
- Provides a great mental model for understanding & dissecting the information flow of a scenario
- Frames a set of tools which enable progress towards an ideal information flow
We all know how important collaboration is, whether for achieving something that can't otherwise be achieved or for making faster progress towards a goal. But successful collaboration often involves sharing information.
And that is the main domain of the aforementioned paper: sharing information.
Sharing information also comes with its own risks. Notable among them:
- Copy Problem
  Once a bit of information is copied/shared, the sender can no longer control how the recipient uses it.
Trade-off
"Benefits of Sharing" vs "Risks of Misuse"
The Copy Problem is often amplified by three related problems.
- Bundling Problem
  It is often difficult to share a bit of information without also needing to reveal additional bits, because either the conventional encoding will not allow individual bits to be shared, or a bit cannot be trusted/verified without the context of other relevant bits.
  (Surveillance video, driver's license, etc.)
- Edit Problem
  An entity that stores a piece of information can edit it before transmitting it to another party.
  (A bank balance is stored by the bank itself rather than by the account holder, who might be inclined to make edits to it.)
- Recursive Oversight Problem
  When one party oversees the use of information, it creates another, even more knowledgeable entity that could potentially misuse the information.
  ("Who watches the watchers?")
Federated learning is a setup for training machine learning algorithms on data without the owner having to share, transfer, or expose their data to the developer and the service provider.
Limitations of Standard Machine Learning
- Centralized Data
- Compromised Privacy
- Scarcity of Sensitive Data
- Computationally Expensive
Types of Federated Learning
Federated Learning vs Federated Analytics
- Federated Learning: private machine learning on remote data
- Federated Analytics: private data science on remote data
Federated Analytics works by running local computations over each device's data, and passes only the aggregated results to the data scientists.
Duet [Outdated]
- After the connection is established, the OpenGrid node isn't needed anymore; the Duet session between the participants is enough for communication.
Self-information
If we report the outcome of a fair coin toss (say, "HEADS"), we are conveying 1 bit of information.
Entropy
Mutual Information
If there's a lot of mutual information between an activation and the raw image, then just from the activation we can make good educated guesses about the raw image.
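For reference, the standard definitions behind these three notions (textbook formulas, not specific to these notes):

I(x) = -\log_2 p(x), \qquad H(X) = -\sum_x p(x)\,\log_2 p(x), \qquad I(X;Y) = H(X) - H(X \mid Y)

A fair coin toss gives I(\text{heads}) = -\log_2 \tfrac{1}{2} = 1 bit, matching the example above.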
Horizontally Distributed Data
Vertically Distributed Data
Resources
- https://youtu.be/SEBdYXxijSo?feature=shared
- https://youtu.be/JM0bJoCRp0I?feature=shared
- https://quantumzero.io/homomorphic-encryption-a-deep-dive/
- https://youtu.be/iQlgeL64vfo?feature=shared
Papers
Homomorphic Encryption
- Allows us to compute over encrypted data
- Quantum secure
- Massively parallelizable
- Microsoft SEAL: homomorphic encryption library
- PolyModulusDegree: a bigger PolyModulusDegree means we have bigger computational capabilities on the encrypted data
Encryption Schemes
- BFV (encrypted modular/integer arithmetic)
- CKKS (encrypted real or complex number arithmetic): more suitable for approximate, ML-style computations
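A small CKKS example, assuming the TenSEAL Python wrapper around SEAL; the parameter values are typical illustrative choices, not recommendations:

import tenseal as ts

# CKKS context: poly_modulus_degree controls how much we can compute on ciphertexts
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40

v1 = ts.ckks_vector(context, [1.0, 2.0, 3.0])
v2 = ts.ckks_vector(context, [4.0, 5.0, 6.0])

result = v1 + v2            # computed entirely on encrypted data
print(result.decrypt())     # approximately [5.0, 7.0, 9.0]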
NoPeek
Demonstrates that minimizing distance correlation between raw data and intermediary representations reduces leakage of sensitive raw-data patterns across client communications while maintaining model accuracy.
Leakage: invertibility/reconstruction of raw data from the intermediary representation.
The solution prevents such reconstruction of raw data while retaining the information required to sustain good classification accuracy. The approach is based on minimizing a statistical dependency measure called distance correlation.
Distance Correlation: a powerful measure of non-linear (and linear) statistical dependence between random variables.
- Pearson's correlation only captures linear relationships; it can't capture non-linear relationships between variables.
- Distance correlation is able to capture both linear and non-linear relationships.
- Its value lies between 0 and 1.
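A NumPy sketch of the (biased) sample distance correlation, contrasting it with Pearson on a non-linear relationship (x vs. x squared); the function name distance_correlation and the sample sizes are my own choices:

import numpy as np

def distance_correlation(x, y):
    # sample distance correlation between two batches of vectors (rows = samples)
    def centered(a):
        d = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)   # pairwise distances
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()
    A, B = centered(x), centered(y)
    dcov2 = lambda P, R: (P * R).mean()
    return np.sqrt(dcov2(A, B) / np.sqrt(dcov2(A, A) * dcov2(B, B)))

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
print(distance_correlation(x, x ** 2))                        # clearly non-zero: dependence detected
print(distance_correlation(x, rng.normal(size=(500, 1))))     # close to 0: independent
print(np.corrcoef(x[:, 0], (x ** 2)[:, 0])[0, 1])             # Pearson is near 0 and misses it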
In the worst-case reconstruction attack setting, the attacker has access to a leaked subset of training samples along with the corresponding transformed activations at a chosen layer, whose outputs are by design always exposed to the other client/server so that distributed learning of the deep network is possible.
(Figures, not shown: reconstructions of raw data before vs. after applying NoPeek)
Two popular distributed learning settings where this attack is highly relevant:
- Split Learning
- Adversarial Reconstruction (Server side insider threat)
Other related attacks include model extraction, model inversion, malicious training, adversarial examples (evasion attacks), membership inference, etc.
Existing Solutions
- Deep learning, adversarial learning and information-theoretic loss based privacy
  The proposed solution is not necessarily tied to a generative adversarial network (GAN) styled architecture where two separate models have to be trained in tandem. The proposed model is based on an easily implementable, differentiable loss function between the intermediate activations and the raw data.
- Homomorphic encryption and secure multi-party computation for computer vision
  HE and MPC techniques, although highly secure, are not computationally scalable or communication efficient for complex tasks like training large deep learning models. The proposed method, on the other hand, is communication efficient and highly scalable with regard to large deep learning architectures.
- Differential privacy for computer vision
  These methods typically take a stronger hit on the accuracy of deep learning models, although with the benefit of attempting to provide worst-case privacy guarantees against membership inference attacks.
Method
The key idea of the proposed method is to reduce information leakage by adding an additional loss term (distance correlation) to the commonly used classification loss term of categorical crossentropy.
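A minimal differentiable sketch of that combined objective in PyTorch (not the paper's exact implementation; dist_corr, nopeek_style_loss, and dcor_weight are my own names, and the distance correlation mirrors the NumPy version shown earlier):

import torch
import torch.nn.functional as F

def dist_corr(a, b, eps=1e-12):
    # differentiable sample distance correlation between two batches (rows = samples)
    def centered(m):
        d = torch.cdist(m, m)
        return d - d.mean(dim=0, keepdim=True) - d.mean(dim=1, keepdim=True) + d.mean()
    A, B = centered(a), centered(b)
    dcov2 = lambda P, R: (P * R).mean()
    return torch.sqrt(dcov2(A, B) / torch.sqrt(dcov2(A, A) * dcov2(B, B) + eps) + eps)

def nopeek_style_loss(x, activations, logits, labels, dcor_weight=0.1):
    # task loss + penalty on statistical dependence between raw input and shared activations
    task = F.cross_entropy(logits, labels)
    leakage = dist_corr(x.flatten(1), activations.flatten(1))
    return task + dcor_weight * leakage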
Reconstruction Attack Testbed
- CIFAR10
- UTKFace
- Diabetic retinopathy severity detection method
Privacy-Utility Tradeoff on UTKFace
We show the L2 error of reconstruction for a baseline strategy of adding uniform noise (in red) to the activations of the layer being protected. This results in a model with no classification utility (it performs at chance accuracy), albeit while preventing reconstruction. The NoPeek approach (in blue) attains a much greater classification accuracy for the downstream task (~0.82) compared to adding uniform noise (~chance accuracy), while still preventing reconstruction of raw data. This is compared to regular training, which does not prevent reconstruction (in green).
Memorization in deep learning: A survey https://arxiv.org/abs/2406.03880
https://github.com/leriomaggio/ppml-tutorial
- Data anonymization: k-anonymity
- Linking attacks: doctor prescriptions vs. pharmacy records, Netflix vs. IMDB
- Unlike k-anonymity, differential privacy is a property of algorithms, not a property of the data.
- FL & homomorphic encryption: Paillier encryption (see the sketch below)
The trained model might leak the training set.
- Membership Inference: was the image in the training set? (privacy, data extraction, upper bound on data leakage)
- Attribute Inference
- Data Extraction
Membership Inference Attacks
- Uniform loss thresholding: a model's loss leaks membership on average.
- Average-case leakage is a very poor metric for privacy.
- We should focus on the low false-positive region of the ROC curve to evaluate an attack's performance.
- Insight: uniform thresholding is a bad way of inferring membership, because not all training examples are equally hard to learn.
LiRA (Likelihood Ratio Attack)
This is the attack that gives the highest true positive rate for a fixed false positive rate (the strongest attack).
The likelihood ratio is a very hard probability to compute, so we approximate it using some techniques.
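A sketch of the usual approximation: per-example Gaussians fitted on shadow-model confidences, then a likelihood ratio. The helper names logit_scale/lira_score and the small constants are just illustrative choices:

import numpy as np
from scipy.stats import norm

def logit_scale(p, eps=1e-6):
    # map a confidence in (0, 1) to a roughly Gaussian statistic
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return np.log(p / (1 - p))

def lira_score(target_conf, shadow_in, shadow_out):
    # shadow_in / shadow_out: confidences of shadow models trained with / without this example
    x = logit_scale(target_conf)
    z_in, z_out = logit_scale(shadow_in), logit_scale(shadow_out)
    p_in = norm.pdf(x, z_in.mean(), z_in.std() + 1e-6)
    p_out = norm.pdf(x, z_out.mean(), z_out.std() + 1e-6)
    return p_in / (p_out + 1e-30)   # large score => likely a training member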
10x better in the worst case
Works extremely well for data points that are outliers, and extremely poorly for data points that aren't outliers.
Shadow dataset and training dataset will have the same distribution
(So, which attack will perform well on inliers too?)
New threat model: privacy poisoning
Adversaries are able to interfere with the training of the model.
https://arxiv.org/abs/2204.00032
Poisoning can transform inliers into outliers.
How to defend?
Find a way to get the loss distribution to overlap (whether we train on the example or not)
Differential Privacy
DP bounds the success of any MI attack
Expensive in terms of training time and in terms of utility loss (as a trade-off to the privacy gained)
Simpler defenses
- Remove vulnerable data (doesn't work: the "privacy onion" effect)
  Train > run attack to identify vulnerable points > remove vulnerable points > retrain > previously safe points become vulnerable
- Wait until data is forgotten
Adversarial Perturbation
Decision-boundary analysis plots show that models are robust in random perturbation directions, but really bad in worst-case (adversarial) directions.
How to find an adversarial example?
Run gradient descent on the classifier, but don't change the model weights to make the data more likely; change the data instead to make it less likely (under its true label).
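A minimal one-step version of that idea (FGSM-style), assuming a PyTorch classifier with inputs in [0, 1]; the function name fgsm_attack and eps are my own illustrative choices:

import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    # perturb the input (not the weights) one step in the direction that increases the loss
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()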
Despite several thousand papers on adversarial ML, there are basically no real attacks. Security work matters when you propose an attack that makes the industry change the way it does something; you have to address a concrete attack and type of scenario.
Attacking LLMs
Affirmative Response Attack (On multi-modal models)
Just run gradient descent on the image-embedding to make the model respond affirmatively. The rest of the answer will be triggered from that affirmative response.
(Doesn't work on ChatGPT, might work on some open source LLMs)
(On text-only models) You can't do SGD, because text is discrete. You can't perturb each word by a little bit.
But every language model embeds every token into a continuous domain embedding space. Can we manipulate these embeddings to run the same attack?
(Gradient descent with some greedy sampling; the modified embeddings will not necessarily map back to anything meaningful.)
With higher-dimensional embeddings you can move in many more directions, but the space is also sparser, so it's hard to find a valid token along the way.
On the other hand, when a model uses relatively lower-dimensional embeddings, the space is denser, so it's easier to find a valid token along the way.
(Trade-offs cancel each other out)
Transferability of Adversarial Examples
Adversarial examples that work on model 1 often also work on model 2, even if it was trained on a different dataset and everything else is different.
Defense against Adversarial Examples
Many papers, but no solution that reliably stops it yet.
Poisoning (what if we control the training dataset?)
Model Stealing (Study input/output behavior to steal model weights)
f(x) = A * h(x)
Query the model on n samples and stack the outputs. The resulting matrix will have a certain number of linearly independent rows, and that number corresponds to the dimension of h(x), since multiplying by A only maps h(x) up into a higher-dimensional output space.
Model stealing in the simplest form (1 layer).
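A toy NumPy version of that observation: querying f(x) = A * h(x) enough times and checking the rank of the stacked outputs reveals the hidden width. All the dimensions below are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 20, 6, 10

W = rng.normal(size=(d_hidden, d_in))    # hidden layer, unknown to the attacker
A = rng.normal(size=(d_out, d_hidden))   # output layer
f = lambda x: A @ np.tanh(W @ x)         # the only thing the attacker can query

# query on n random inputs and stack the outputs
outputs = np.stack([f(rng.normal(size=d_in)) for _ in range(200)])

# the outputs all live in the column space of A, so their rank exposes d_hidden
print(np.linalg.matrix_rank(outputs))    # 6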