SLAC-Fairness: Tools to assess fairness and mitigate unfairness in sociolinguistic auto-coding

Dan Villarreal (Department of Linguistics, University of Pittsburgh)

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Introduction

This GitHub repository is a companion to the paper "Sociolinguistic auto-coding has fairness problems too: Measuring and mitigating overlearning bias", published open-access in Linguistics Vanguard in 2024: https://doi.org/10.1515/lingvan-2022-0114. In the paper, I investigate sociolinguistic auto-coding (SLAC) through the lens of machine-learning fairness. Just as some machine-learning algorithms produce biased predictions by overlearning group characteristics, I find that SLAC does too. I therefore test unfairness mitigation strategies (UMSs), techniques for removing gender bias from auto-coding predictions without harming overall auto-coding performance too badly.

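Repository navigation:

  • If you're new to sociolinguistic auto-coding (SLAC)
  • What's the point of this repository?
  • What's in this repository?
  • Running this code on your own machine
  • Adapting this code to your own projects
  • Auditing this code to critique and/or suggest changes
  • Citing this repository
  • Acknowledgements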

If you're new to sociolinguistic auto-coding (SLAC)

Sociolinguistic auto-coding is a machine-learning method for classifying variable linguistic data (often phonological data), such as the alternation between park & "pahk" or working & workin'.

You can learn more about SLAC by reading the following resources.

What's the point of this repository?

First, you can reproduce the analysis I performed for the Linguistics Vanguard paper, using the same data and code that I did. Simply follow the analysis walkthrough tutorial.

Second, you can also adapt this code to your own projects. You might want to use it if you want to (1) assess fairness for a pre-existing auto-coder and/or (2) create a fair auto-coder by testing unfairness mitigation strategies on your data.

Finally, I invite comments, critiques, and questions about this code. I've made this code available for transparency's sake, so please don't hesitate to reach out!

What's in this repository?

The files in this repository fall into a few categories, each described in its own subsection below:

  • Input data
  • Outputs
  • Code
  • Code/info pertaining to the repository itself

You can browse files here.

A quick note on the two-computer setup

This repository's structure reflects the two-computer setup I used to run this analysis. I generated and measured auto-coders on a more powerful system that is not quite as user-friendly (Pitt's CRC), then analyzed the metrics on my less-powerful-but-user-friendlier laptop. (It's perfectly fine to use a one-computer setup if you don't have access to high-performance computing; the code will take longer to run on a less-powerful machine, but it still might be faster/easier than the HPC learning curve!) In the rest of this section, I'll refer to this two-computer split several times.

Input data

Contents:

  • Input-Data/
    • LabPhonClassifier.Rds: Pre-existing auto-coder to analyze for fairness. This auto-coder is the same as in "How to train", but with a Gender column added to the auto-coder's trainingData element.
    • trainingData.Rds: /r/ data for generating auto-coders that use unfairness mitigation strategies, also available here. This is the result of step 1 in "How to train". A dataset with more tokens (but less acoustic information) is available here.
    • meanPitches.csv: Pitch data to use for UMS 3.1 (normalizing speaker pitch). I measured pitch (F0) for word-initial /r/ tokens and calculated each speaker's average minimum and maximum pitch.
    • UMS-List.txt: Tab-separated file matching UMS codes to descriptions. This is also used by the R scripts to define the set of acceptable UMS codes.
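
To illustrate how a tab-separated code list can gate which UMS codes are accepted, here is a minimal Bash sketch. The repository's R scripts do this check themselves, so this is only for intuition: the sample file contents, the assumption that there's no header row, and the helper name is_valid_ums are all made up for the example.

```bash
# Hypothetical stand-in for Input-Data/UMS-List.txt: one "code<TAB>description" per line.
# The real file's exact layout (e.g., whether it has a header row) may differ.
ums_list=$(mktemp)
printf '0.0\tBaseline\n1.1\tDownsample men to equalize token counts by gender\n3.1\tNormalize speaker pitch\n' > "$ums_list"

# Succeed (exit status 0) only if the code exactly matches the first column of some row.
# Appending "" forces a string comparison, so 1.10 is not treated as equal to 1.1.
is_valid_ums() {
  awk -F '\t' -v code="$1" '($1 "") == (code "") { found = 1 } END { exit !found }' "$ums_list"
}

for code in 1.1 9.9; do
  if is_valid_ums "$code"; then
    echo "UMS $code: accepted"
  else
    echo "UMS $code: not in the list"
  fi
done
```

Comparing codes as strings rather than numbers matters here because UMS codes like 1.3.1 aren't numbers at all, and codes like 1.1 and 1.10 would otherwise collide.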

The /r/ and pitch data comes from Southland New Zealand English, historically New Zealand's only regional variety, which is characterized by variable rhoticity. The New Zealand Institute of Language, Brain and Behaviour maintains a corpus of sociolinguistic interviews with Southland English speakers totaling over 83 hours of data. This corpus is hosted in an instance of LaBB-CAT; the data files were downloaded from LaBB-CAT, with subsequent data-wrangling in R (including speaker anonymization).

Skip ahead for info on using your own auto-coder and training data, or modifying the set of UMSs.

Outputs

Contents:

  • Outputs/
    • Autocoders-to-Keep/
      • "Final" auto-coders (saved as .Rds files). Unlike the temporary auto-coders, this folder is version-controlled (see info on .gitignore), so it's useful for selectively saving auto-coders we want to hold onto.
    • Shell-Scripts/
      • Text files (saved with the .out file extension) that record any output of shell scripts, including errors. Useful for diagnosing issues with the code if something goes wrong.
    • Performance/
      • Tabular data (saved as .csv files) with metrics of auto-coders' performance (e.g., overall accuracy) and fairness (e.g., accuracy for women's vs. men's tokens). These files bridge the two-computer split: we extract metrics on a more powerful system (see walkthrough) so we can analyze them on a user-friendlier computer.
    • Diagnostic-Files/: Temporary files that are useful only in the moment (e.g., peeking "under the hood" to diagnose a code issue if something goes wrong) and/or too large to share between computers. Most files are .gitignored, save for empty dummy_files that exist only so the empty folders can be shared to GitHub.
      • Model-Status/
        • Temporary files (with extension .tmp) that are created during optimization for performance to signal which auto-coders are completed or running.
      • Temp-Autocoders/
        • Auto-coders run with different UMSs, whose performance and fairness we want to measure but which we don't need to version-control.
    • Other/: Files mostly meant for passing info between scripts

All these outputs were generated using the data and code in this repository. You may want to create a 'clean' version of the repository without any of these outputs, to see if your system replicates the outputs I got.

Code

Contents:

  • R-Scripts/: Scripts that do the heavy lifting of running auto-coders and facilitating analysis.
  • Shell-Scripts/: Scripts meant for the user to run; these scripts call the R scripts and collate their outputs.

The division of labor has two benefits. First, it makes the code more modular, so a larger process isn't completely lost if just one part fails. Second, many high-performance computing environments don't allow users to run code on demand; instead, users submit job requests, packaged into shell scripts, to a workload management system (aka job queue). These shell scripts are written to be compatible with Slurm, the job queue used by Pitt's CRC clusters. If your computing environment doesn't require submitting job requests, the shell scripts should still run as-is. You also have the option of forgoing the shell scripts and running the R scripts directly.

R scripts

These include 'main scripts' that run auto-coders and 'helper scripts' that define functionality shared among the main scripts.

Main scripts:

  • R-Scripts/
    • Run-UMS.R: Generates a single auto-coder according to an unfairness mitigation strategy.
    • Hyperparam-Tuning.R: Subjects an auto-coder to hyperparameter tuning, one stage of optimizing an auto-coder for performance. (Note: This tunes only what "How to train" calls ranger parameters, because I'm now less sure that the other hyperparameters are appropriate for tuning.)
    • Outlier-Dropping.R: Subjects an auto-coder to outlier dropping, one stage of optimizing an auto-coder for performance.

The main scripts are written to be called from a command-line client like Bash, using the command Rscript. (To use Rscript, R needs to be in your PATH.) For example, if you navigate Bash to R-Scripts/, you can run Rscript Run-UMS.R --ums 0.0 to generate the baseline auto-coder. These scripts take several arguments (like --ums); to see a script's arguments, run Rscript <script-name> --help from the command line. If you prefer working exclusively in R, you can use rscript() from callr to call these scripts from within R (e.g., callr::rscript("Run-UMS.R", c("--ums", "0.0"), wd="R-Scripts/", stdout="../Shell-Scripts/Output/Run-UMS_UMS0.0.out", stderr="2>&1")).

Helper scripts:

  • R-Scripts/
    • UMS-Utils.R: Contains utility functions for generating and analyzing auto-coders that utilize UMSs. The most important functions are:
      • umsData(): Reshape data for auto-coder by applying UMS, only keeping necessary columns, and optionally dropping outliers
      • umsFormula(): Specify model formula based on UMS
      • cls_fairness(): Investigate auto-coder fairness (see walkthrough)
      • cls_summary(): Generate one-row dataframe of fairness/performance metrics (see walkthrough)
    • Rscript-Opts.R: Defines command-line options for how main scripts should run.
    • Session-Info.R: Combines and prints R session info from the outputs of multiple scripts. Meant to be used in shell scripts.

Shell scripts

Contents:

  • Shell-Scripts/
    • Run-UMS.sh: Generates a single auto-coder according to a UMS, and optionally optimizes it for performance. This flexible lower-level script is useful for exploratory analysis.
    • Baseline.sh: Wrapper script that calls Run-UMS.sh for baseline auto-coder (mostly exists to override the default outfile name)
    • UMS-Round1.sh: Generates auto-coders according to UMSs whose codes start with 0, 1, 2, or 3 (save for UMS 0.0, the baseline).
    • UMS-Round2.sh: Generates auto-coders according to UMSs whose codes start with 4.

The shell scripts are written to be called from Bash, using the commands bash (to run directly) or sbatch (to submit to a Slurm job queue; see above). For example, if you navigate Bash to Shell-Scripts/, you can run sbatch Baseline.sh. Run-UMS.sh takes two arguments: a UMS numerical code (e.g., sbatch Run-UMS.sh 4.2.1) and an optional -o flag to optimize the auto-coder for performance. All other shell scripts hard-code these options (as well as other options passed to the R scripts), but these can be adjusted as needed (under the heading ##EDITABLE PARAMETERS).

The shell scripts also collate output from the R scripts; I recommend saving this output to a text file. These scripts include a Slurm command that automatically writes script output to a corresponding .out file in Outputs/Shell-Scripts/. If you're not using Slurm, you can append a command that tells Bash where to send outputs, including errors (e.g., bash Baseline.sh &> ../Outputs/Shell-Scripts/Baseline.out); if you omit this part of the command (e.g., bash Baseline.sh), the output will simply print in Bash.
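
To make the redirection concrete, here is a small sketch that you can run without the repository. It uses a stand-in function (fake_job, hypothetical) in place of a real script like Baseline.sh, and a temporary file in place of ../Outputs/Shell-Scripts/Baseline.out. It also uses the portable spelling > file 2>&1, which does the same thing as Bash's &> shorthand.

```bash
# Stand-in for a shell script: prints one line to stdout and one to stderr
fake_job() {
  echo "auto-coder finished"
  echo "warning: something to diagnose" >&2
}

# Stands in for ../Outputs/Shell-Scripts/Baseline.out
out_file=$(mktemp)

# Send stdout to the file, then point stderr (2) at the same place as stdout (1).
# In Bash, `fake_job &> "$out_file"` is shorthand for the same thing.
fake_job > "$out_file" 2>&1

# Nothing was printed to the terminal above; both lines are in the file
cat "$out_file"
```

If you redirect only stdout (fake_job > "$out_file"), the warning line still prints to the terminal and never reaches the file, which is why the example in the text captures errors too.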

CRC's cluster uses Lmod to make modules (like R) available to shell scripts via the module load command. Your system may not need to load modules explicitly, or may use different commands to load R.

Code/info pertaining to the repository itself

Contents:

  • .gitignore: Tells Git which files/folders to exclude from version control (and thus from being shared to GitHub or between computers). Because the auto-coders are huge files, I exclude Outputs/Diagnostic-Files/Temp-Autocoders/ from version control and just pull out fairness/performance data instead. If there are any auto-coders I want to keep, I save them to the non-ignored folder Outputs/Autocoders-to-Keep/.
  • LICENSE.md: Tells you what you're permitted to do with this code.
  • renv/: Set up by the renv package to ensure our code behaves the same regardless of package updates. See more info below.
  • renv.lock: Set up by renv to store info about package versions.
  • .Rprofile: Contains R code to run at the start of any R session in this repository. In this case, this code was set up by renv to run a script that loads the package versions recorded in renv.lock. If you want to disable renv, simply delete this file.
  • _includes/: Contains code that is inserted into the <head> element of this site's webpages. (The site is rendered with GitHub Pages and the Jekyll theme Primer.) This is not necessary for any of the code's core functionality.

Running this code on your own machine

To run this code on your own machine, you'll need a suitable computing environment and software. All required and recommended software is free and open-source. This code was originally run using high-performance computing resources provided by the University of Pittsburgh's Center for Research Computing (CRC), in particular its shared memory parallel cluster. You can run this code on a normal desktop or laptop; it just might take a while! You'll also need at least 400 MB of free disk space. See the walkthrough for more information about machine specs, running time, and disk space used.

Required software:

  • The statistical computing language R (version >= 4.3.0)
    • Since these scripts call R from the command line, R must be in your PATH (directions for Windows, macOS, Unix). To check, run Rscript -e R.version.string at the command line. If you see your R version, then R is in your PATH; if you get the error Rscript: command not found, R is not.
  • R packages:
    • tidyverse (v. >= 2.0.0)
    • magrittr (v. >= 2.0.3)
    • caret (v. >= 6.0-94)
    • ranger (v. >= 0.15.1)
    • ROCR (v. >= 1.0-11)
    • foreach (v. >= 1.5.2)
    • doParallel (v. >= 1.0.17)
    • optparse (v. >= 1.7.3)
    • this.path (v. >= 2.0.0)
    • benchmarkme (v. >= 1.0.8)
    • rmarkdown (v. >= 2.22)
    • knitr (v. >= 1.43)
    • renv (v. >= 0.17.3)
    • These packages will install dependencies that you don't need to install directly. See full R session info here
  • The command-line client Bash (v. >= 5.0.0)
    • If you install Git (recommended), Bash is included in the install
  • The document converter Pandoc (v. >= 2.19)
    • If you install RStudio (recommended), Pandoc is included in the install
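
The PATH check described above for R can be extended to the other command-line tools in this list. Here is a minimal sketch (the helper name on_path is made up) that reports which tools Bash can find. Keep in mind that a tool can be installed without being on your PATH; the copy of Pandoc bundled with RStudio, for instance, may not be.

```bash
# Succeed only if the named command can be found on the PATH.
# `command -v` prints the command's location and returns non-zero if it isn't found.
on_path() {
  command -v "$1" > /dev/null 2>&1
}

for tool in Rscript bash pandoc git; do
  if on_path "$tool"; then
    echo "$tool: found at $(command -v "$tool")"
  else
    echo "$tool: not found on PATH"
  fi
done
```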

Please note that R and its packages are continually updated, so in the future the code may not work as expected (or at all!). If you hit a brick wall, don't hesitate to reach out!

I also recommend using Git and GitHub to create your own shareable version of the code; doing so will help me effectively troubleshoot any issues you have. In particular:

  1. Download Git onto your computer.
  2. Sign up for a free GitHub account
  3. Fork this repository (keep the same repository name), and clone it onto your computer.
  4. Test out the code on your own system: Edit the code, create commits, push your commits to your remote fork.
    • You may want to create a 'clean' version of the repository without any of the generated outputs, to see if your system replicates my outputs.
  5. Reach out!

Finally, I recommend using the integrated development environment RStudio. While it doesn't change how the code in this repository works, RStudio makes R code easier to understand, write, edit, and debug.

renv

This repository uses the renv package to ensure that updates to R packages don't break the code. In effect, renv freezes your environment in time by preserving the package versions the code was originally run on. This is great from a reproducibility perspective, but it entails some extra machinery before you can run the code. For all the examples below, you need to load this repo in R or RStudio by setting your working directory somewhere inside the repo.

Before you can run any of this code, run renv::restore(). This will download the packages at the correct versions to an renv cache on your system. Then you should be able to run this code on your machine.

Of course, using old versions of these packages means you won't benefit from any package updates since this repo was published. If you want to use new package versions, you have to register them with renv. If you're using R 4.3.x (the version used for this code), run renv::update(); if R >= 4.4, run renv::init() and select option 2. To update renv itself, run renv::upgrade(). Bear in mind that the code may not work as expected after updating, thanks to changes to the packages it relies on. If you're satisfied with how the code runs, you can register the updated versions with renv::snapshot().

If you want to use a package that's not registered with renv, use renv::record().

Finally, if you're finding this all too much of a hassle, you can skip using renv altogether; just delete .Rprofile and restart R/RStudio.

Adapting this code to your own projects

How much you want to adapt this code is really up to you. You might want to 'carbon-copy' this analysis on your own project, but in all likelihood your project will dictate that you make some changes to better fit your project's needs. Just as "training an auto-coder is not a one-size-fits-all process", neither is assessing auto-coding fairness. For example, if you are confident that your predictor data (e.g., acoustic measures) does not suffer from measurement error, you can skip the time-consuming step of accounting for outliers. In some cases, this code might not work for your project; for example, this code only handles fairness across two groups, and it only handles binary classification (two categories).

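Below, you can read about:

  • Assessing fairness for a pre-existing auto-coder
  • Testing unfairness mitigation strategies
    • Using your own training data
    • Adding and/or subtracting UMSs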

In addition, I strongly recommend making the data you use for this task publicly available if possible, since open data helps advance science (see Villarreal & Collister "Open methods in linguistics", in press for Oxford collection Decolonizing linguistics). However, if you do so, make sure what you share conforms to the ethics/IRB agreement(s) in place when the data was collected (if applicable).

Finally, if there's anything in this code that you can't figure out or isn't working for you, please don't hesitate to reach out! Please note that there is no warranty for this code.

Assessing fairness for a pre-existing auto-coder

This is one possible goal of your analysis, mirroring the Linguistics Vanguard paper's RQ2. To analyze your auto-coder, it needs to have been generated by caret::train(). The auto-coder's trainingData element also needs a column with group data (e.g., which tokens belong to female vs. male speakers). If you use the scripts in this repository, that's taken care of for you; umsData() retains the group column in the training dataframe passed to train(), and umsFormula() excludes the group column from the predictor set. However, if you didn't use these scripts to run your auto-coder, you'll need to either manually add the group column to the trainingData element, or just re-run your auto-coder using these scripts.

If your auto-coder conforms to these requirements, you can use functions from R-Scripts/UMS-Utils.R to analyze fairness. See the walkthrough for examples of how to use this code.
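
For intuition about the kind of group comparison involved, here is a toy sketch in Bash and awk that computes accuracy separately for two groups. It is not how cls_fairness() or cls_summary() work internally (those are R functions that operate on a caret auto-coder); the table of predictions and its column names Actual and Predicted are made up for illustration.

```bash
# Hypothetical predictions: group, hand-coded class, auto-coder's predicted class
preds='Gender,Actual,Predicted
Female,Present,Present
Female,Absent,Absent
Female,Present,Absent
Female,Absent,Absent
Male,Present,Present
Male,Absent,Present
Male,Present,Absent
Male,Absent,Absent'

# Per-group accuracy = correct predictions / tokens in that group.
# NR > 1 skips the header row.
acc=$(printf '%s\n' "$preds" | awk -F, '
  NR > 1 { n[$1]++; if ($2 == $3) correct[$1]++ }
  END {
    printf "Female accuracy: %.2f\n", correct["Female"] / n["Female"]
    printf "Male accuracy: %.2f\n", correct["Male"] / n["Male"]
  }')

echo "$acc"
```

With this toy table, the auto-coder is right on 3 of 4 women's tokens but only 2 of 4 men's tokens, so the output is 0.75 vs. 0.50. A gap like that between groups is the kind of pattern a fairness analysis looks for.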

Testing unfairness mitigation strategies

This is the other possible goal of your analysis, mirroring the Linguistics Vanguard paper's RQ3.

Using your own training data

You'll need your own training data (in place of trainingData.Rds), and you may need normalization data depending on which UMSs you want to try.

Formatting requirements for training data:

  • Tabular data (data stored in rows and columns), saved in a .csv or .Rds file
  • Each row represents a single token of some categorical linguistic variable
  • At least some of the tokens have been coded into classes (in trainingData.Rds, these are tokens for which the column HowCoded=="Hand")
  • Columns needed for auto-coder:
    • 1 column with variant labels for already-coded tokens and blanks/NAs for uncoded tokens (Rpresent in trainingData.Rds)
      • Currently, this code only handles binary classification (two categories, not counting NAs)
    • 1 column with the group that you're assessing fairness for (Gender in trainingData.Rds)
      • Currently, this code only handles two-group fairness
    • Multiple columns that contain predictors that the auto-coder will use for coding (in trainingData.Rds, 180 columns from tokenDur to absSlopeF0, inclusive)
      • Can be any data type

If you want to perform any speaker normalization (either as a preprocessing step or as UMS 3.1), you'll also need:

  • In your training data, a Speaker column
  • An additional data file with normalization baselines (like meanPitches.csv):
    • One row per speaker, with every speaker in your training data
    • A Speaker column
    • A column for each baseline measure you want to use for normalization (MinPitch in meanPitches.csv is used as baseline for the F0min measure in the training data, MaxPitch for the F0max measure)
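
Before running anything, it can be worth checking the "every speaker in your training data" requirement above. The sketch below does this in Bash under some loud assumptions: both files are simple .csv files (no quoted commas), Speaker is the first column in each, and the speaker codes and pitch values are invented for the example. The helper name speakers is also made up.

```bash
train=$(mktemp); norm=$(mktemp)

# Hypothetical training data: header row, then one token per row
printf 'Speaker,Rpresent,F0min\nSpkr01,Present,110\nSpkr02,Absent,95\nSpkr03,Present,180\n' > "$train"
# Hypothetical baselines: one row per speaker (Spkr03 is left out on purpose)
printf 'Speaker,MinPitch,MaxPitch\nSpkr01,80,210\nSpkr02,75,190\n' > "$norm"

# Unique, sorted values of the first column, skipping the header row
speakers() {
  tail -n +2 "$1" | cut -d, -f1 | sort -u
}

train_spk=$(mktemp); norm_spk=$(mktemp)
speakers "$train" > "$train_spk"
speakers "$norm" > "$norm_spk"

# comm -23 keeps lines found only in the first (sorted) file:
# speakers who have tokens but no baseline row
missing=$(comm -23 "$train_spk" "$norm_spk")
echo "Speakers with no baseline row: $missing"
```

Here the check flags Spkr03. If your training data is an .Rds file, you'd do the equivalent comparison in R instead.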

In addition to those requirements, here are some data formatting recommendations (you don't have to format your data this way, but if not you'll need to tinker with the code some more). Some of these pertain to the data preprocessing step in "How to train":

  • If you suspect that your predictors have measurement error and you want to take advantage of the outlier-dropping script, then you need to mark measurement outliers. Outliers for predictors X & Y (for example) should be marked as TRUE in columns X_Outlier & Y_Outlier.
  • The code drops rows that have NAs for the dependent variable, group, and predictor columns (and outlier columns, if applicable). Depending on how many missing measurements you have, you might want to consider imputing measurements and/or thinning your predictor set.
  • You might want to add a HowCoded column to easily separate hand-coded and auto-coded tokens.
  • Thanks to pre-processing, trainingData.Rds reflects normalized measurements for formant timepoints but not pitch, so UMS 3.1 involves pitch normalization. You might decide to fold all normalization into pre-processing and skip normalization as a UMS.
  • You may want to anonymize your speakers (as in trainingData.Rds), especially if you choose to make your data open, but this is strictly optional.

Finally, as a general note, you may have to tweak the R scripts a little bit to accommodate your data. For example:

  • These scripts assume the training data file is an .Rds file. If it's a .csv, you'll have to tweak the code
  • If your columns have different names than the ones in trainingData.Rds (e.g., if your dependent variable isn't Rpresent), you'll need to find-and-replace column names in the scripts. Alternatively, you can pass column names as arguments to UMS-Utils.R functions (e.g., umsData(myData, dependent=ING, group=Ethnicity)).
  • If you don't have a HowCoded column, you'll need to modify the lines of code that refer to that column.
  • Depending on the size of your predictor set, you'll want to change the default value of mtry (the number of predictors attempted at each split) in Rscript-Opts.R and Hyperparam-Tuning.R; a typical value is the square root of the number of predictors, rounded down to the nearest integer.
  • If you rename or reorganize folders or files, you'll need to change the code to account for that.
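
As a quick check of the mtry rule of thumb in the list above, here is the arithmetic for the 180 predictors in trainingData.Rds, done from the command line with awk. The variable names are just for illustration; you would still need to edit the default in Rscript-Opts.R and Hyperparam-Tuning.R yourself.

```bash
# Number of predictors; 180 is the count given above for trainingData.Rds
n_predictors=180

# Square root rounded down: awk's int() truncates, which equals floor for positive values
mtry=$(awk -v n="$n_predictors" 'BEGIN { print int(sqrt(n)) }')

echo "mtry = $mtry"
```

The square root of 180 is about 13.4, so this prints mtry = 13.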

Adding and/or subtracting UMSs

Depending on your groups, your predictor set, and/or your dependent variable, you might want to add or subtract UMSs. For example, if you already have equal token counts for women vs. men, UMS 1.1 (downsample men to equalize token counts by gender) wouldn't apply.

To add a new UMS:

  1. Pick a new UMS code
    • Don't use a code that's already been defined (it just creates unnecessary complications)
    • If it's a combination UMS, the code should start with 4
  2. Add the code and description to Input-Data/UMS-List.txt
  3. Modify umsData() in R-Scripts/UMS-Utils.R
    • Single UMSs: Add a new } else if (UMS=="<new-UMS>") { block to the implementUMS() subroutine
    • Combination UMSs: Add code to interpret the second & third digits near the bottom of umsData()
    • Note that umsData() uses tidyselect semantics for several arguments (dependent, group, predictors, & dropCols). If you're using any of these column names in a dplyr function, wrap them in double-braces (e.g., data %>% select({{dependent}}, {{group}})); if you need a column name as a string, use the deparse-substitute trick (e.g., depName <- deparse(substitute(dependent)))
  4. If using a shell script to run multiple UMSs in a single round, edit the script so the UMS code is matched by the pattern regex and not by excl (e.g., to include UMS 5.1, use pattern=^[0-35])

You only need to subtract a UMS explicitly if you're using a shell script to run multiple UMSs in a single round. To subtract a UMS, use the excl regex to exclude it (e.g., to exclude UMSs 1.4 and 2.2, use excl="^(1\.4|2\.2)"; the parentheses make ^ apply to both codes, and \. matches a literal period). No need to modify umsData(), since the code will just skip over that UMS in the chain of else if {} statements.
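
How the shell scripts apply pattern and excl internally isn't shown here, so the following is only a sketch. It assumes both are extended regular expressions tested against each UMS code, and it uses a made-up list of codes and a hypothetical helper, select_ums. It can be handy for trying out a regex before you edit a script.

```bash
# Hypothetical list of UMS codes, one per line (not the repository's full list)
codes='0.0
1.1
1.4
2.2
3.1
4.2.2
5.1'

# Keep codes matched by the first regex (pattern), then drop codes matched by the second (excl)
select_ums() {
  printf '%s\n' "$codes" | grep -E "$1" | grep -Ev "$2"
}

# Codes starting with 0-3 or 5, minus the baseline (0.0) and UMSs 1.4 and 2.2
select_ums '^[0-35]' '^(0\.0|1\.4|2\.2)'
```

With this list, the call selects 1.1, 3.1, and 5.1. One thing the sketch makes easy to see: in an alternation, ^ only anchors the alternative it's attached to, and an unescaped . matches any character. So select_ums '^[0-4]' '^1.4|2.2' also drops 4.2.2 (which contains 2.2), whereas select_ums '^[0-4]' '^(1\.4|2\.2)' keeps it.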

Note that the existing UMS list is actually more general than its descriptions suggest. For example, UMSs 1.3.1 and 1.3.2 both achieve equal /r/ base rates by gender, by downsampling either women's Absent (1.3.1) or men's Present (1.3.2). However, umsData() actually translates this into "downsample one of the classes from the smaller group" vs. "the bigger group", automatically detecting which class to downsample from which group. Try plugging your data into umsData() to see whether the existing code affects your data the way you expect.

Auditing this code to critique and/or suggest changes

Readers are more than welcome to critique this code! While I think much of this code is pretty solid, there are no doubt some bugs here and there, some inefficient code implementations, and/or some tortured data-scientific reasoning. You can raise GitHub issues, start discussions, or send me an email.

Please don't be afraid to suggest changes, report bugs, or ask questions---I want this code to be useful for you, and there are no bad questions!

Citing this repository

If you use this repository, please cite it! Studies show that research software and data are under-cited, which makes it hard for contributors to gauge usage or get credit. Here's a recommended citation:

Villarreal, Dan. 2023. SLAC-Fairness: Tools to assess fairness and mitigate unfairness in sociolinguistic auto-coding. Available at https://djvill.github.io/SLAC-Fairness/.

Acknowledgements

I would like to thank Chris Bartlett, the Southland Oral History Project (Invercargill City Libraries and Archives), and the speakers for sharing their data and their voices. Thanks are also due to Lynn Clark, Jen Hay, Kevin Watson, and the New Zealand Institute of Language, Brain and Behaviour for supporting this research. Valuable feedback was provided by audiences at NWAV 49, the Penn Linguistics Conference, Pitt Computer Science, and the Michigan State SocioLab. Other resources were provided by a Royal Society of New Zealand Marsden Research Grant (16-UOC-058) and the University of Pittsburgh Center for Research Computing (specifically, the H2P cluster supported by NSF award number OAC-2117681). Any errors are mine entirely.
