30DaysOfFLCode

Banner

SyftBox is a framework for building PETs (privacy-enhancing technologies) applications with minimal barriers, regardless of the programming language or environment.

  • Distributed Network (of Datasites): each datasite contributes to the network with its data and applications (APIs)
  • Modular: complex PETs are broken down into smaller, reusable components
  • Language & Environment Agnostic
  • SyftBox APIs: designed to interact with public and/or private data from Datasites (data analysis, ML, visualization, etc., while respecting privacy)

This makes it suitable for FL across multiple Datasites, enabling collaborative model training without exposing sensitive data.

A SyftBox API is a script designed to interact with your own data and/or data synced from other datasites on your machine. These APIs form the backbone of the SyftBox ecosystem, enabling users to process, analyze, and manipulate data in a privacy-preserving manner.

  • run.sh: contains instructions to set up the environment and execute the main script.
  • Main script (language agnostic): holds the code for the API, interacting with the platform and the data (a hypothetical sketch follows below).
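
A hypothetical sketch of what such a main script might do: read public files synced from a datasite folder and write an aggregate result back. The folder layout, the datasite email, and the field names are assumptions for illustration, not the actual SyftBox API.

import json
from pathlib import Path

# hypothetical local sync folder for a datasite (illustrative, not the real API)
DATASITE = Path.home() / "SyftBox" / "datasites" / "someone@example.com"

def main():
    values = [json.loads(p.read_text())["value"] for p in (DATASITE / "public").glob("*.json")]
    result = {"count": len(values), "sum": sum(values)}
    (DATASITE / "public" / "result.json").write_text(json.dumps(result))

if __name__ == "__main__":
    main()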

Differential Privacy

  • Ensures that different kinds of statistical analysis don't compromise privacy.
  • Its trail goes back only to 2003; it has been applied to ML/DL even more recently.

"You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available"

Netflix Competition

  • Netflix tried to protect users' privacy by masking the data
  • The data was still unmasked via statistical analysis against ratings scraped from IMDB ("Robust De-anonymization of Large Sparse Datasets")

Similar incidents have happened several times.

Privacy in a Simple Database

  • Suppose we have a true/false value for 5,000 users (a one-column database)
  • If we remove a person from the database and the query result does not change, then that person's privacy is fully protected. (That person wasn't leaking any statistical information.)
  • "Can we construct a query which doesn't change no matter who we remove from the database?"
  • Experiment (see the sketch below)
    • Make 5,000 copies of the database. In each copy, one person is absent from the dataset (4,999 entries).
    • Then check whether the query returns the same result for all of those copies.
    • Sensitivity: the maximum amount that the query changes when removing an individual from the database
    • If the sensitivity were 0, we would get the same query output regardless of who we removed from the database.
    • Sensitivity of the functions explored: avg, sum, threshold.

However, there are more advanced techniques for calculating sensitivity in differential privacy, for example data-conditioned sensitivity.
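
A minimal sketch of the experiment above in PyTorch: build one parallel database per person (each with that person removed) and measure how much a query can change. The database size and the queries are illustrative.

import torch

def create_db_and_parallels(num_entries=500):
    db = (torch.rand(num_entries) > 0.5).float()  # one boolean column
    parallel_dbs = [torch.cat((db[:i], db[i + 1:])) for i in range(num_entries)]
    return db, parallel_dbs

def sensitivity(query, num_entries=500):
    db, parallel_dbs = create_db_and_parallels(num_entries)
    full_result = query(db)
    # maximum change in the query result when any single person is removed
    return max(torch.abs(query(pdb) - full_result) for pdb in parallel_dbs)

print(sensitivity(lambda db: db.sum()))   # sum: sensitivity 1
print(sensitivity(lambda db: db.mean()))  # mean: much smaller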

Differencing Attack

  • A way to attack privacy through aggregate queries (the kind of leakage differential privacy is designed to prevent).
  • For example, consider the sum case from the previous experiment:
    • Take the sum of the entire database
    • Take the sum of the database without a specific person
    • Differencing the two gives us that person's value
  • Can be applied to other functions too.
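
A minimal sketch of the sum-based differencing attack described above (the database and target index are illustrative):

import torch

db = (torch.rand(100) > 0.5).float()
target = 10

sum_with = db.sum()
sum_without = torch.cat((db[:target], db[target + 1:])).sum()

print(sum_with - sum_without)  # reveals the target's value (0 or 1)
print(db[target])              # ground truth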

Developing Differentially Private Algorithms

  • Add random noise to the database & to the queries to the database

  • Types of Differential Privacy

    • Local Differential Privacy
      • Adds noise to the function input
      • Noise is added on the user side (before the data even reaches the database)
      • Most secure (you don't need to trust the database owner)
    • Global Differential Privacy
      • Adds noise to the function output
      • If the database operator is trustworthy, this yields a more accurate result.

Local Differential Privacy

  • Useful when the survey audience is likely to lie in response to the query.

    (Have you jaywalked in the last week?)

  • Participant Flips Coin Twice

    • If the first coin flip is heads, answer honestly

    • If the first coin flip is tails, answer according to the second coin flip.

      (heads = yes, tails = no)

    (Coin flips hidden from the experimenter)

  • So if a participant says yes, it's plausible that they said so because of a coin flip, not because it's the truth. (Local noise)

  • The experimenter can then take the statistic and average it against a 50-50 coin flip, removing the noise.

  • Suppose 70% of people jaywalked. Half of the participants (first flip tails) will say yes/no with 50% probability, based on the second coin flip. The other half (first flip heads) answer truthfully, so 70% of them say yes.

    (50% * 0.5 = 25%, 50% * 0.7 = 35%, 25% + 35% = 60% of the population will say yes)

  • So now we know approximately what percentage of people jaywalked, without knowing whether any individual person jaywalked. We can denoise the reported statistic using (result * 2) - 0.5.

  • Gained privacy, lost some accuracy

    The more data points we have, the more this noise tends to average out and not affect the output of the query, thus producing a more accurate result.

  • We can also bias the coin flips to vary the amount of noise added.

    import torch

    def query(db, noise=0.2):
        # `noise` is the probability that a participant answers randomly
        true_result = torch.mean(db.float())

        # 1 -> answer honestly (probability 1 - noise), 0 -> answer per the second flip
        first_coin_flip = (torch.rand(len(db)) > noise).float()
        # the random answer: yes/no with 50% probability each
        second_coin_flip = (torch.rand(len(db)) < 0.5).float()

        augmented_database = db.float() * first_coin_flip + (1 - first_coin_flip) * second_coin_flip

        sk_result = augmented_database.float().mean()
        # de-skew: E[sk_result] = (1 - noise) * truth + noise * 0.5
        private_result = ((sk_result / noise) - 0.5) * noise / (1 - noise)

        return private_result, true_result
  • The size of the dataset allows us to add more noise while maintaining decent accuracy.

    The more private data we have access to, the easier it is to protect the privacy of the people involved.

Global Differential Privacy

  • Add noise to the output of the function

    • Gaussian Noise
    • Laplacian Noise (Generally, this works better)

    Advantage: we can often add less noise after computing the result and still get better accuracy while maintaining privacy, because many functions reduce the sensitivity involved. So adding noise later in the processing chain is more efficient.

  • Privacy Budget
    How much epsilon/delta leakage do we allow for our analysis?

  • How much noise should we add?
    Function of four things:

    • Type of noise (Gaussian/Laplacian)
    • Sensitivity of the query
    • Desired epsilon
    • Desired delta
  • Laplacian noise takes an input parameter beta, which tells us how significant the noise is.

    b = sensitivity(query) / epsilon

    delta = 0

    import numpy as np
    import torch

    def laplacian_mechanism(db, query, sensitivity, epsilon=0.5):
        # scale of the Laplacian noise; the epsilon default is illustrative
        beta = sensitivity / epsilon
        noise = torch.tensor(np.random.laplace(0, beta, 1))
        return query(db) + noise
  • Laplacian noise always has delta = 0; Gaussian noise has a non-zero delta.

  • Epsilon is per query. If we have multiple queries to perform, we can split the epsilon budget across them (see the sketch below).

    A smaller epsilon means less privacy leakage, and thus more noise (a wider Laplacian distribution), so the resulting values fluctuate more.
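
A small usage sketch of splitting a total epsilon budget across several queries (uses the laplacian_mechanism above; the database and budget values are illustrative):

import torch

db = (torch.rand(100) > 0.5).float()
total_epsilon, n_queries = 0.5, 5
per_query_epsilon = total_epsilon / n_queries

for _ in range(n_queries):
    # a sum query over a 0/1 database has sensitivity 1
    print(laplacian_mechanism(db, lambda d: d.sum(), sensitivity=1, epsilon=per_query_epsilon))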

Formal Definition of Differential Privacy

Formal Definition of DP

A randomized mechanism M is (epsilon, delta)-differentially private if, for all neighboring databases D and D' (differing in one entry) and for every set of outputs S: Pr[M(D) ∈ S] <= e^epsilon * Pr[M(D') ∈ S] + delta. In other words, it bounds how much the output distribution can vary compared to the distribution where one of the entries is missing; it provides an upper bound on the privacy leakage.

The less the variation, the less the privacy leakage.

Differential Privacy for Deep Learning

  • Ensuring that when our neural networks learn from sensitive data, they learn only what they're supposed to learn from the data.
  • Without accidentally learning what they are not supposed to learn from the data.

Perfect Privacy
Training a model on a dataset should return the same model even if we remove any person from the training dataset

Querying a database => Training a model on a dataset

Two Points of Complexity

  • Do we always know where "people" are referenced in the dataset?

    We treat each training example as a single, separate person, although training examples often don't correspond to individual people at all.

    Think about a picture with multiple people in it.

  • Neural models rarely ever train to the same location (weights), even when trained twice on the same dataset.

Scenario

Suppose our hospital has unlabeled data that we want to use to train a model.

  • Ask n other hospitals to train models on their own (labeled) datasets
  • Use each model to predict on our own local dataset, generating n candidate labels for each data point
  • Perform a DP query over those n labels to generate the final (DP) label for each data point (see the sketch below)
  • Retrain a new model on our local dataset, which now has DP labels
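
A minimal sketch of the DP label-generation step above: take the n teacher predictions for one local data point, add Laplacian noise to the label counts, and release the noisy arg-max as the DP label (epsilon, the number of labels, and the votes are illustrative).

import numpy as np

def noisy_max_label(teacher_preds, num_labels=10, epsilon=0.1):
    # teacher_preds: the n predicted labels for a single local data point
    counts = np.bincount(teacher_preds, minlength=num_labels).astype(float)
    counts += np.random.laplace(0, 1.0 / epsilon, size=num_labels)
    return int(np.argmax(counts))

teacher_preds = np.array([3, 3, 3, 7, 3, 3, 2, 3, 3, 3])  # votes from 10 hospitals
print(noisy_max_label(teacher_preds))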

PATE Analysis

How much do these hospitals really agree/disagree?

  • A formal set of mechanisms capable of computing the epsilon level conditioned on this level of agreement.
  • Returns
    • Data-Dependent Epsilon
    • Data-Independent Epsilon
  • The more agreement between the parties, the tighter the data-dependent epsilon value we get: less privacy leakage, better privacy.
  • So PATE rewards us for creating well-generalized models.

Federated Learning

Architectures

  • Horizontal Federated Learning

    Ideal when different parties have similar data types but distinct samples. Think of hospitals collaborating on disease prediction without sharing patient records. Each party trains a local model, and their learnings are aggregated into a global model (a simple averaging sketch follows after this list).

  • Vertical Federated Learning

    Perfect for scenarios where parties have different features but overlapping samples. Imagine two companies, a social media platform and an online retailer, wanting to personalise recommendations for their shared users. They can train a global model that leverages both social media activity and purchase history without directly sharing their data.

  • Federated Transfer Learning

    It allows leveraging knowledge from a pre-trained model to improve models on new tasks. Think of a football coach who transfers their expertise and strategies to a new team with different players and playing styles.
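
A minimal sketch of the aggregation step used in horizontal FL (referenced in the first item above): average the parties' model parameters, optionally weighted by their local dataset sizes. The helper name and weighting scheme are illustrative, not a specific framework's API.

import torch

def federated_average(state_dicts, weights=None):
    # state_dicts: list of model.state_dict() objects from the participating parties
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    averaged = {}
    for key in state_dicts[0]:
        averaged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return averaged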

PySyft

PySyft is a Python library for secure and private deep learning, built on top of PyTorch (extending its functionality).

  • Remote Tensors (Pointers)
  • Virtual Worker
  • Local Worker
  • Garbage Collection
  • Pointers of Pointers
  • PointerChain
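
A sketch of the remote-tensor/pointer workflow behind the concepts listed above, using the legacy PySyft (0.2.x-era) API that these notes are based on; current PySyft versions expose a different API, so treat this as illustrative only.

import torch
import syft as sy

hook = sy.TorchHook(torch)              # extend torch tensors with PySyft functionality
bob = sy.VirtualWorker(hook, id="bob")  # a simulated remote (virtual) worker

x = torch.tensor([1, 2, 3, 4, 5])
x_ptr = x.send(bob)    # the tensor now lives on bob; x_ptr is a pointer to it
y_ptr = x_ptr + x_ptr  # executed remotely on bob
y = y_ptr.get()        # retrieve the result; the remote tensor is garbage-collected
print(y)               # tensor([ 2,  4,  6,  8, 10])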

Securing Federated Learning

Trusted Aggregator

A neutral 3rd party who has a machine that we can trust to not look at the gradients when performing the aggregation.

Additive Secret Sharing

Allows multiple individuals to add numbers together without any person learning anyone else's inputs to the addition.

import random

# Large modulus; all shares are combined modulo Q
Q = 23740629843760239486723

def encrypt(x, n_share=3):
    # generate n_share - 1 random shares, then pick the last share so that
    # the sum of all shares is congruent to x modulo Q
    shares = [random.randint(0, Q) for _ in range(n_share - 1)]
    shares.append(Q - (sum(shares) % Q) + x)
    return tuple(shares)

def decrypt(shares):
    return sum(shares) % Q

def add(a, b):
    # add two secret-shared numbers share-by-share; no single share reveals anything
    return tuple((a[i] + b[i]) % Q for i in range(len(a)))

# Function calls
x = encrypt(5)
y = encrypt(7)
z = add(x, y)
print(decrypt(z))  # 12

Fixed Precision Encoding

Converts our decimal numbers into an integer format so that they can be secret-shared.
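
A minimal sketch of fixed-precision encoding on top of the secret-sharing modulus Q above (BASE and PRECISION are illustrative):

BASE = 10
PRECISION = 4

def fixed_point_encode(x):
    return int(round(x * BASE ** PRECISION)) % Q

def fixed_point_decode(n):
    # values above Q // 2 are interpreted as negative
    if n > Q // 2:
        n -= Q
    return n / BASE ** PRECISION

print(fixed_point_decode(fixed_point_encode(0.5)))    # 0.5
print(fixed_point_decode(fixed_point_encode(-1.25)))  # -1.25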

Structured Transparency: a framework for addressing use/mis-use trade-offs when sharing information

- Provides a great mental model for understanding & dissecting the information flow of a scenario
- Frames a set of tools which enable progress towards an ideal information flow

We all know how important collaboration is, whether for achieving something that can't otherwise be achieved or for making faster progress towards a goal. But a successful collaboration often involves sharing information.

And that is the main domain of the aforementioned paper: sharing information.

Sharing information also comes with its own problems. Notable among them:

  • Copy Problem

    Once a bit of information is copied/shared, the sender can no longer control how the recipient uses it.

    Trade-off
    "Benefits of Sharing" vs "Risks of Misuse"

The Copy Problem is often amplified by three related problems.

  • Bundling Problem

    It is often difficult to share one bit of information without also revealing additional bits, either because the conventional encoding does not allow individual bits to be shared, or because a bit cannot be trusted/verified without the context of other relevant bits.

    Examples: surveillance video, a driver's license, etc.

  • Edit Problem

    An entity that stores a piece of information can edit it before transmitting it to another party.

    This is why a bank balance is stored by the bank itself rather than by the account holder, who might be inclined to edit it.

  • Recursive Oversight Problem

    When one party oversees the use of information, it creates another, even more knowledgeable entity that could potentially misuse the information.

    “Who watches the watchers?”

Federated Learning

Federated learning is a setup for training machine learning algorithms on data without the owner having to share, transfer, or expose that data to the developer or the service provider.

Limitations of Standard Machine Learning

  • Centralized Data
  • Compromised Privacy
  • Scarcity of Sensitive Data
  • Computationally Expensive

Types of Federated Learning

  • Cross-device (cross-device FL picture)

  • Cross-silo (cross-silo FL picture)

Federated Learning vs Federated Analytics

  • Federated Learning

    Private machine learning on remote data

  • Federated Analytics

    Private data science on remote data

Federated Analytics works by running local computations over each device's data and passing only the aggregated results to the data scientists.

Duet [Outdated]

Duet Basics in Action

  • After the connection is established, the OpenGrid node isn't needed anymore; the Duet session between the participants is enough for communication.

Split Learning

Federated Learning vs Split Learning

FL-vs-SL

Mathematical Concepts

Self-information

self-information-1

self-information-2

If we say that we tossed a fair coin and saw heads, we are giving 1 bit of information.

Entropy

entropy

Mutual Information

mutual-information

If there is a lot of mutual information between the activation and the raw image, then just having the activation lets us make good educated guesses about the raw image.
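
For reference, the standard definitions that the three figures above illustrate (base-2 logarithms, hence bits):

I(x) = -\log_2 p(x)                                           (self-information)
H(X) = -\sum_x p(x)\,\log_2 p(x)                              (entropy: expected self-information)
I(X;Y) = \sum_{x,y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)}  (mutual information)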

Vertically Distributed Learning

vertically-distributed-learning

Horizontally Distributed Data

horizontally-distributed-data

Vertically Distributed Data

vertically-distributed-data

Resource 5: Homomorphic Encryption

Resources

Papers

Introduction

Homomorphic-Encryption

Allows us to compute over encrypted data.

Quantum Secure

Massively Parallelizable

  • Microsoft SEAL - Homomorphic Encryption Library

  • PolyModulusDegree

    A bigger PolyModulusDegree means we have more computational capability on the encrypted data.

Encryption Schemes

  • BFV (Encrypted Modular Arithmetic)
  • CKKS (Encrypted Real or Complex Number Arithmetic)

CKKS

Floating Point Representation

Floating-Point-Representation

Floating Point Arithmetic

Floating-Point-Arithmetic

Fixed Point Arithmetic

Fixed-Point-Arithmetic

More suitable for HE

Algorithm

CKKS-Algorithm
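
A sketch of CKKS-style arithmetic on encrypted vectors using TenSEAL, a Python wrapper around Microsoft SEAL (assumes TenSEAL is installed; the parameters are illustrative tutorial defaults, not a security recommendation).

import tenseal as ts

context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,             # larger degree -> more computation possible on ciphertexts
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40
context.generate_galois_keys()

enc_a = ts.ckks_vector(context, [1.0, 2.0, 3.0])
enc_b = ts.ckks_vector(context, [0.5, 0.5, 0.5])

enc_sum = enc_a + enc_b    # computed without decrypting
enc_prod = enc_a * enc_b

print(enc_sum.decrypt())   # approximately [1.5, 2.5, 3.5]
print(enc_prod.decrypt())  # approximately [0.5, 1.0, 1.5]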

NoPeek demonstrates that minimizing the distance correlation between raw data and intermediary representations reduces leakage of sensitive raw-data patterns across client communications while maintaining model accuracy.

Leakage: invertibility/reconstruction of raw data from the intermediary representation.

The solution prevents such reconstruction of raw data while maintaining the information required to sustain good classification accuracy. The approach is based on minimizing a statistical dependency measure called distance correlation.

Distance Correlation: a powerful measure of non-linear (and linear) statistical dependence between random variables.

  • Pearson's correlation only captures linear relationships; it can't capture non-linear relationships between variables.
  • Distance correlation, however, captures both linear and non-linear relationships.
  • Its value lies between 0 and 1.
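
A minimal NumPy implementation of (biased sample) distance correlation as described above; X and Y are (n_samples, n_features) arrays, and the demo data is illustrative.

import numpy as np

def distance_correlation(X, Y, eps=1e-12):
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    Y = np.asarray(Y, dtype=float).reshape(len(Y), -1)
    # pairwise Euclidean distance matrices
    a = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    b = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    # double centering
    A = a - a.mean(axis=0) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = (A * B).mean()
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return float(np.sqrt(dcov2 / (np.sqrt(dvar_x * dvar_y) + eps)))

x = np.random.randn(500, 1)
print(distance_correlation(x, x ** 2))           # high: captures the non-linear relationship
print(np.corrcoef(x[:, 0], x[:, 0] ** 2)[0, 1])  # Pearson is near 0 here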

In the worst-case reconstruction attack setting, the attacker has access to a leaked subset of training samples along with the corresponding transformed activations at a chosen layer, whose outputs are by design always exposed to the other client/server so that distributed learning of the deep network is possible.

Before applying NoPeek

successful-reconstruction-attack

After applying NoPeek

failed-reconstruction-attack

Two popular distributed learning settings where this attack is highly relevant:

  • Split Learning
  • Adversarial Reconstruction (Server side insider threat)

Other relevant threats include model extraction, model inversion, malicious training, adversarial examples (evasion attacks), and membership inference.

Existing Solutions

  • Deep learning, adversarial learning and information theoretic loss based privacy

    The proposed solution is not necessarily tied to a generative adversarial network (GAN) styled architecture where two separate models have to be trained in tandem. The proposed model is based on an easily implementable, differentiable loss function between the intermediate activations and the raw data.

  • Homomorphic encryption and secure multi-party computation for computer vision

    HE and MPC techniques, although highly secure, are not computationally scalable or communication efficient for complex tasks like training large deep learning models.

    The proposed method, on the other hand, is communication efficient and highly scalable with regard to large deep learning architectures.

  • Differential privacy for computer vision

    These methods typically take a stronger hit on the accuracy of deep learning models, although with the benefit of attempting to provide worst-case privacy guarantees against membership inference attacks.

Method

solution-method

The key idea of the proposed method is to reduce information leakage by adding an additional loss term (the distance correlation between the raw inputs and the intermediate activations) to the commonly used classification loss term of categorical cross-entropy.
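
A sketch of that combined objective in PyTorch (not the authors' exact code): a differentiable distance-correlation term between the raw inputs and the intermediate activations, added to the usual cross-entropy; the weight alpha is illustrative.

import torch
import torch.nn.functional as F

def torch_distance_correlation(X, Y, eps=1e-9):
    X, Y = X.flatten(1), Y.flatten(1)
    a, b = torch.cdist(X, X), torch.cdist(Y, Y)
    A = a - a.mean(0, keepdim=True) - a.mean(1, keepdim=True) + a.mean()
    B = b - b.mean(0, keepdim=True) - b.mean(1, keepdim=True) + b.mean()
    dcov2 = (A * B).mean()
    return torch.sqrt(dcov2 / torch.sqrt((A * A).mean() * (B * B).mean() + eps) + eps)

def nopeek_style_loss(inputs, activations, logits, labels, alpha=0.1):
    # alpha trades off leakage reduction against classification utility
    return alpha * torch_distance_correlation(inputs, activations) + F.cross_entropy(logits, labels)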

Reconstruction Attack Testbed

reconstruction-attack-testbed

  • CIFAR10
  • UTKFace
  • Diabetic retinopathy severity detection method

Privacy-Utility Tradeoff on UTKFace

tradeoff-utk

We show the l2 error of reconstruction for a baseline strategy of adding uniform noise (in red) to the activations of the layer being protected. This results in a model with no classification utility (it performs at chance accuracy), albeit while preventing reconstruction. The NoPeek approach (in blue) attains much greater classification accuracy for the downstream task (~0.82) compared to adding uniform noise (~chance accuracy), while still preventing reconstruction of the raw data. This is compared to regular training, which does not prevent the reconstruction (in green).

utk-results

diabetic-retinopathy-results

Introduction to PPML - Valerio Maggio

Memorization in deep learning: A survey https://arxiv.org/abs/2406.03880

https://github.com/leriomaggio/ppml-tutorial


Data anonymization: k-anonymity

Linking attacks: doctor prescriptions vs. pharmacy records; Netflix vs. IMDB

Unlike k-anonymity, differential privacy is a property of algorithms, not a property of the data.

https://opacus.ai/
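
A sketch of DP-SGD training with Opacus (the library linked above), assuming Opacus >= 1.0; the model, data, and hyperparameters are all illustrative.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
data = TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=32)

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.1,  # more noise -> smaller epsilon (more privacy)
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

for x, y in loader:
    optimizer.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    optimizer.step()

print(privacy_engine.get_epsilon(delta=1e-5))  # epsilon spent so far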

FL & Homomorphic Encryption: Paillier encryption

The trained model might leak the training set.

  • Membership inference

    Was the image in the training set?

    (privacy, data extraction, upper bound on data leakage)

  • Attribute Inference

  • Data Extraction

Membership Inference Attacks

  • Uniform loss thresholding: a model's loss leaks membership on average.

Average-case leakage is a very poor metric for privacy.

We should focus on the low false-positive region of the ROC curve to evaluate an attack's performance.

Insight: uniform thresholding is a bad way to infer membership, because not all training examples are equally hard to learn.
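
A minimal sketch of the uniform loss-thresholding attack described above (the model, data, and threshold are illustrative):

import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_threshold_attack(model, x, y, threshold=0.5):
    # guess "member of the training set" whenever the per-example loss is below a fixed threshold
    losses = F.cross_entropy(model(x), y, reduction="none")
    return losses < threshold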

LiRA (Likelihood Ratio Attack)

lira

This is the attack that gives us the highest true-positive rate for a fixed false-positive rate (the strongest attack).

This probability is very hard to compute, so we approximate it using some techniques.

lira1

10x better in the worst case

Works extremely well for data points that are outliers, and extremely poorly for data points that aren't.

The shadow datasets and the training dataset are assumed to have the same distribution.

(So, which attack will perform well on inliers too?)

New threat model: privacy poisoning

Adversaries are able to interfere with the training of the model.

privacy-poisioning

https://arxiv.org/abs/2204.00032

Poisoning can transform inliers into outliers.

How to defend?

Find a way to make the loss distributions overlap (whether or not we train on the example).

Differential Privacy

DP bounds the success of any MI attack

Expensive in terms of training time and utility loss (the trade-off for the privacy gained).

Simpler defenses

  • Remove vulnerable data (doesn't work: the privacy onion)

    Train > run the attack to identify vulnerable points > remove those points > retrain

  • Wait until data is forgotten

Adversarial Perturbation

Decision-boundary analysis plots: models are really good along random perturbation directions, but really bad along worst-case directions.

adversarial-perturbation

How to find adversarial example?

Run gradient descent on the classifier, but don't change the model weights to make the data more likely; change the data instead, to make it less likely.

how-to-find-adversarial-example
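
A minimal sketch of that idea as a single FGSM step: keep the weights fixed and move the input in the direction that increases the loss (epsilon is illustrative; model is any differentiable classifier).

import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=0.03):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # step the *data* along the gradient sign to make the true label less likely
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()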

Despite several thousand papers on adversarial ML, there are basically no real attacks. Security work matters when you propose an attack that makes the industry change the way it does something; you have to address a realistic attack scenario.

Attacking LLMs

Affirmative Response Attack (On multi-modal models)

Just run gradient descent on the image embedding to make the model respond affirmatively. The rest of the answer is then triggered by that affirmative response.

(Doesn't work on ChatGPT; might work on some open-source LLMs)

affirmative-response-attack

(On text-only models) You can't do SGD because text is discrete; you can't perturb each word by a little bit.

But every language model embeds every token into a continuous domain embedding space. Can we manipulate these embeddings to run the same attack?

sgd-on-text-embeddings

(Gradient descent in a greedy sampling fashion; the modified embeddings will not necessarily correspond to anything meaningful.)

sgd-on-text-embeddings-1

With higher-dimensional embeddings you can move in many more directions, but the space is also sparser; that makes it harder to find a valid token along the way.

On the other hand, when a model uses relatively lower-dimensional embeddings, the space is denser, so it is easier to find a valid token along the way.

(The trade-offs roughly cancel each other out.)

Transferability of Adversarial Examples

Adversarial examples that work on model 1 also work on model 2, even if the two models are trained on different datasets and everything else about them is different.

Defense against Adversarial Examples

Many papers have been written, but no solution has been found to stop it yet.

Poisoning (what if we control the training dataset?)

Model Stealing (study input/output behavior to steal the model weights)

f(x) = A * h(x)

Now do that for n samples, and the matrix of stacked outputs will have a certain number of linearly independent rows.

That number corresponds to the dimension of h(x), since multiplying by A only up-projects h(x) into a higher-dimensional space.

This is model stealing in its simplest form (one layer).

attacks-on-ml
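
A toy sketch of the rank argument above: for a one-layer model f(x) = A @ h(x), stacking the outputs of enough queries reveals dim(h(x)) as the rank of the output matrix (dimensions and data are illustrative).

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, n_queries = 20, 8, 32, 200

W = rng.normal(size=(d_hidden, d_in))   # hidden layer, unknown to the attacker
A = rng.normal(size=(d_out, d_hidden))  # output projection, unknown to the attacker

def f(x):
    return A @ np.maximum(W @ x, 0.0)   # the attacker only has query access to f

X = rng.normal(size=(d_in, n_queries))
outputs = np.column_stack([f(X[:, i]) for i in range(n_queries)])

print(np.linalg.matrix_rank(outputs))   # 8 == the hidden dimension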

Random

https://www.oblivious.com/games