Dan Villarreal (Department of Linguistics, University of Pittsburgh)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This GitHub repository is a companion to the paper "Sociolinguistic auto-coding has fairness problems too: Measuring and mitigating overlearning bias", published open-access in Linguistics Vanguard in 2024: https://doi.org/10.1515/lingvan-2022-0114. In the paper, I investigate sociolinguistic auto-coding (SLAC) through the lens of machine-learning fairness. Just as some algorithms produce biased predictions by overlearning group characteristics, I find that the same is true for SLAC. As a result, I attempt unfairness mitigation strategies (UMSs) as techniques for removing gender bias in auto-coding predictions (without harming overall auto-coding performance too badly).
Repository navigation:
- Repository homepage
- Repository code
- Analysis walkthrough
- Unfairness mitigation strategy descriptions
Sociolinguistic auto-coding is a machine-learning method for classifying variable linguistic data (often phonological data), such as the alternation between *park* & *"pahk"* or *working* & *workin'*.
You can learn more about SLAC by reading the following resources.
- Dan Villarreal et al.'s 2020 Laboratory Phonology article "From categories to gradience: Auto-coding sociophonetic variation with random forests"
- Tyler Kendall et al.'s 2021 Frontiers in AI article "Considering performance in the automated and manual coding of sociolinguistic variables: Lessons from variable (ING)"
- Dan Villarreal et al.'s 2019 tutorial "How to train your classifier"
  - Explains the R code powering my implementation of SLAC
First, you can reproduce the analysis I performed for the Linguistics Vanguard paper, using the same data and code that I did. Simply follow the analysis walkthrough tutorial.
Second, you can also adapt this code to your own projects. You might want to use it if you want to (1) assess fairness for a pre-existing auto-coder and/or (2) create a fair auto-coder by testing unfairness mitigation strategies on your data.
Finally, I invite comments, critiques, and questions about this code. I've made this code available for transparency's sake, so please don't hesitate to reach out!
The files in this repository fall into a few categories. Click the links below to jump to the relevant subsection:
- Info written for humans
  - `README.md`: What you're reading now
  - `Analysis-Walkthrough.Rmd` & `.html`: Tutorial for replicating the analysis in the paper
  - `UMS-Info.Rmd` & `.html`: Descriptions of unfairness mitigation strategies
- Input data
  - `Input-Data/`
- Outputs
  - `Outputs/`
- Code that does stuff
  - `R-Scripts/`
  - `Shell-Scripts/`
- Code/info pertaining to the repository itself
  - `.gitignore`
  - `LICENSE.md`
  - `renv/`
  - `renv.lock`
  - `.Rprofile`
  - `_includes/`
You can browse files here.
This repository's structure reflects the two-computer setup I used to run this analysis. I generated and measured auto-coders on a more powerful system that is not quite as user-friendly (Pitt's CRC), then analyzed the metrics on my less-powerful-but-user-friendlier laptop. (It's perfectly fine to use a one-computer setup if you don't have access to high-performance computing; the code will take longer to run on a less-powerful machine, but it still might be faster/easier than the HPC learning curve!) In the rest of this section, I'll refer to this two-computer split several times.
Contents of `Input-Data/`:
- `LabPhonClassifier.Rds`: Pre-existing auto-coder to analyze for fairness. This auto-coder is the same as in "How to train", but with a `Gender` column added to the auto-coder's `trainingData` element.
- `trainingData.Rds`: /r/ data for generating auto-coders that use unfairness mitigation strategies, also available here. This is the result of step 1 in "How to train". A dataset with more tokens (but less acoustic information) is available here.
- `meanPitches.csv`: Pitch data to use for UMS 3.1 (normalizing speaker pitch). I measured pitch (F0) for word-initial /r/ tokens and calculated each speaker's average minimum and maximum pitch.
- `UMS-List.txt`: Tab-separated file matching UMS codes to descriptions. The R scripts also use this file to define the set of acceptable UMS codes.
The /r/ and pitch data come from Southland New Zealand English, historically New Zealand's only regional variety, which is characterized by variable rhoticity. The New Zealand Institute of Language, Brain and Behaviour maintains a corpus of sociolinguistic interviews with Southland English speakers totaling over 83 hours of recordings. This corpus is hosted in an instance of LaBB-CAT; the data files were downloaded from LaBB-CAT, with subsequent data-wrangling in R (including speaker anonymization).
Skip ahead for info on using your own auto-coder and training data, or modifying the set of UMSs.
Contents of `Outputs/`:
- `Autocoders-to-Keep/`: "Final" auto-coders (saved as `.Rds` files). Unlike the temporary auto-coders, this folder is version-controlled (see info on `.gitignore`), so it's useful for selectively saving auto-coders we want to hold onto.
- `Shell-Scripts/`: Text files (saved with the `.out` file extension) that record any output of shell scripts, including errors. Useful for diagnosing issues with the code if something goes wrong.
- `Performance/`: Tabular data (saved as `.csv` files) with metrics of auto-coders' performance (e.g., overall accuracy) and fairness (e.g., accuracy for women's vs. men's tokens). These files bridge the two-computer split: we extract metrics on a more powerful system (see walkthrough) so we can analyze them on a user-friendlier computer.
- `Diagnostic-Files/`: Temporary files that are useful only in the moment (e.g., peeking "under the hood" to diagnose a code issue if something goes wrong) and/or too large to share between computers. Most files are `.gitignore`d, save for empty `dummy_file`s that exist only so the empty folders can be shared to GitHub.
  - `Model-Status/`: Temporary files (with extension `.tmp`) that are created during optimization for performance, to signal which auto-coders are completed or running.
  - `Temp-Autocoders/`: Auto-coders run with different UMSs, for which we want to measure performance and fairness but which we don't need to version-control.
- `Other/`: Files mostly meant for passing info between scripts.
  - `Var-Imp*.csv`: Data on variable importance for "precursor" UMSs used for UMSs 2.1.x.
  - `Best-Params*.csv`: Optimal hyperparameters from hyperparameter-tuning runs.
  - `Drop-Log*.csv`: Log files for outlier-dropping runs.
All these outputs were generated using the data and code in this repository. You may want to create a 'clean' version of the repository without any of these outputs, to see if your system replicates the outputs I got.
Contents:
- `R-Scripts/`: Scripts that do the heavy lifting of running auto-coders and facilitating analysis.
- `Shell-Scripts/`: Scripts meant for the user to run; these scripts call the R scripts and collate their outputs.
The division of labor has two benefits. First, it makes the code more modular, so a larger process isn't completely lost if just one part fails. Second, many high-performance computing environments don't allow users to run code on demand; instead, users submit job requests, packaged into shell scripts, to a workload management system (aka job queue). These shell scripts are written to be compatible with Slurm, the job queue used by Pitt's CRC clusters. If your computing environment doesn't require submitting job requests, the shell scripts should still run as-is. You also have the option of forgoing the shell scripts and running the R scripts directly.
These include 'main scripts' that run auto-coders and 'helper scripts' that define functionality shared among the main scripts.
Main scripts in `R-Scripts/`:
- `Run-UMS.R`: Generates a single auto-coder according to an unfairness mitigation strategy.
- `Hyperparam-Tuning.R`: Subjects an auto-coder to hyperparameter tuning, one stage of optimizing an auto-coder for performance. (Note: This tunes only what "How to train" calls `ranger` parameters, because I'm now less sure that the other hyperparameters are appropriate for tuning.)
- `Outlier-Dropping.R`: Subjects an auto-coder to outlier dropping, one stage of optimizing an auto-coder for performance.
The main scripts are written to be called from a command-line client like Bash, using the command `Rscript`. (To use `Rscript`, R needs to be in your PATH.) For example, if you navigate Bash to `R-Scripts/`, you can run `Rscript Run-UMS.R --ums 0.0`. These scripts take several arguments (like `--ums`); to see all arguments, run `Rscript <script-name> --help` from the command line.

If you prefer working exclusively in R, you can use `rscript()` from the `callr` package to call these scripts from within R (e.g., `callr::rscript("Run-UMS.R", c("--ums", "0.0"), wd="R-Scripts/", stdout="../Shell-Scripts/Output/Run-UMS_UMS0.0.out", stderr="2>&1")`).
Helper scripts in `R-Scripts/`:
- `UMS-Utils.R`: Contains utility functions for generating and analyzing auto-coders that utilize UMSs. The most important functions are:
  - `umsData()`: Reshapes data for an auto-coder by applying a UMS, keeping only necessary columns, and optionally dropping outliers
  - `umsFormula()`: Specifies the model formula based on a UMS
  - `cls_fairness()`: Investigates auto-coder fairness (see walkthrough)
  - `cls_summary()`: Generates a one-row dataframe of fairness/performance metrics (see walkthrough)
- `Rscript-Opts.R`: Defines command-line options for how the main scripts should run.
- `Session-Info.R`: Combines and prints R session info from the outputs of multiple scripts. Meant to be used in shell scripts.
Contents of `Shell-Scripts/`:
- `Run-UMS.sh`: Generates a single auto-coder according to a UMS, and optionally optimizes it for performance. This flexible lower-level script is useful for exploratory analysis.
- `Baseline.sh`: Wrapper script that calls `Run-UMS.sh` for the baseline auto-coder (mostly exists to override the default outfile name).
- `UMS-Round1.sh`: Generates auto-coders according to UMSs whose codes start with 0, 1, 2, or 3 (save for UMS 0.0, the baseline).
- `UMS-Round2.sh`: Generates auto-coders according to UMSs whose codes start with 4.
The shell scripts are written to be called from Bash, using the commands `bash` (to run directly) or `sbatch` (to submit to a Slurm job queue; see above). For example, if you navigate Bash to `Shell-Scripts/`, you can run `sbatch Baseline.sh`. `Run-UMS.sh` takes two arguments: a UMS numerical code (e.g., `sbatch Run-UMS.sh 4.2.1`) and an optional `-o` flag to optimize the auto-coder for performance. All other shell scripts hard-code these options (as well as other options passed to the R scripts), but they can be adjusted as needed (under the heading `##EDITABLE PARAMETERS`).

The shell scripts also collate output from the R scripts; I recommend saving this output to a text file. These scripts include a Slurm command that automatically writes script output to a corresponding `.out` file in `Outputs/Shell-Scripts/`. If you're not using Slurm, you can append a command that tells Bash where to send outputs, including errors (e.g., `bash Baseline.sh &> ../Outputs/Shell-Scripts/Baseline.out`); if you omit this part of the command (e.g., `bash Baseline.sh`), the output will simply print in Bash.

CRC's cluster uses Lmod to make modules (like R) available to shell scripts via the `module load` command. Your system may not need to load modules explicitly, or may use different commands to load R.
Contents:
- `.gitignore`: Tells Git which files/folders to exclude from version control (and from being shared to GitHub or between computers). Because the auto-coders are huge files, I exclude `Outputs/Diagnostic-Files/Temp-Autocoders/` from version control and just pull out fairness/performance data instead. If there are any auto-coders I want to keep, I save them to the non-ignored folder `Outputs/Autocoders-to-Keep/`.
- `LICENSE.md`: Tells you what you're permitted to do with this code.
- `renv/`: Set up by the `renv` package to ensure our code behaves the same regardless of package updates. See more info below.
- `renv.lock`: Set up by `renv` to store info about package versions.
- `.Rprofile`: Contains R code to run at the start of any R session in this repository. In this case, this code was set up by `renv` to run a script that loads the package versions recorded in `renv.lock`. If you want to disable `renv`, simply delete this file.
- `_includes/`: Contains code that is inserted into the `<head>` element of this site's webpages (the site uses GitHub Pages and the Jekyll theme Primer). This is not necessary for any of the code's core functionality.
To run this code on your own machine, you'll need a suitable computing environment and software. All required and recommended software is free and open-source. This analysis was originally run using high-performance computing resources provided by the University of Pittsburgh's Center for Research Computing (CRC), in particular its shared memory parallel cluster. You can run this code on a normal desktop or laptop---it just might take a while! You'll also need at least 400 MB of disk space free. See the walkthrough for more information about machine specs, running time, and disk space used.
Required software:
- The statistical computing language R (version >= 4.3.0)
- R packages:
  - `tidyverse` (v. >= 2.0.0)
  - `magrittr` (v. >= 2.0.3)
  - `caret` (v. >= 6.0-94)
  - `ranger` (v. >= 0.15.1)
  - `ROCR` (v. >= 1.0-11)
  - `foreach` (v. >= 1.5.2)
  - `doParallel` (v. >= 1.0.17)
  - `optparse` (v. >= 1.7.3)
  - `this.path` (v. >= 2.0.0)
  - `benchmarkme` (v. >= 1.0.8)
  - `rmarkdown` (v. >= 2.22)
  - `knitr` (v. >= 1.43)
  - `renv` (v. >= 0.17.3)
  - These packages will install dependencies that you don't need to install directly. See full R session info here
- The command-line client Bash (v. >= 5.0.0)
  - If you install Git (recommended), Bash is included in the install
- The document converter Pandoc (v. >= 2.19)
  - If you install RStudio (recommended), Pandoc is included in the install
Please note that R and its packages are continually updated, so in the future the code may not work as expected (or at all!). If you hit a brick wall, don't hesitate to reach out!
I also recommend using Git and GitHub to create your own shareable version of the code; doing so will help me effectively troubleshoot any issues you have. In particular:
- Download Git onto your computer.
- See this Git tutorial if you've never used it before.
- Sign up for a free GitHub account.
- Fork this repository (keep the same repository name), and clone it onto your computer.
- Test out the code on your own system: Edit the code, create commits, push your commits to your remote fork.
- You may want to create a 'clean' version of the repository without any of the generated outputs, to see if your system replicates my outputs.
- Reach out!
Finally, I recommend using the integrated development environment RStudio. While it doesn't change how the code in this repository works, RStudio makes R code easier to understand, write, edit, and debug.
This repository uses the `renv` package to ensure that updates to R packages don't break the code. In effect, `renv` freezes your environment in time by preserving the package versions the code was originally run on. This is great from a reproducibility perspective, but it entails some extra machinery before you can run the code. For all the examples below, you need to load this repo in R or RStudio by setting your working directory somewhere inside the repo.

Before you can run any of this code, run `renv::restore()`. This will download the packages at the correct versions to an `renv` cache on your system. Then you should be able to run this code on your machine.

Of course, using old versions of these packages means you won't be able to benefit from any package updates since this repo was published. If you want to use new package versions, you have to register them with `renv`. If you're using R 4.3.x (the version used for this code), run `renv::update()`; if R >= 4.4, run `renv::init()` and select option 2. To update `renv` itself, run `renv::upgrade()`. Keep in mind that the code may not work as expected thanks to changes to the packages it relies on. If you're satisfied with how the code runs, you can register the updated versions with `renv::snapshot()`. If you want to use a package that's not registered with `renv`, use `renv::record()`.

Finally, if you're finding this all too much of a hassle, you can skip using `renv` altogether; just delete `.Rprofile` and restart R/RStudio.
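Putting those steps together, a typical `renv` workflow looks roughly like this (a sketch to run interactively from within the repo, one step at a time; all functions are from the `renv` package):

```r
## Run these interactively, not as a script
renv::restore()   # first use: install the package versions recorded in renv.lock
## ...run the analysis code and check that it behaves as expected...
renv::update()    # optional (on R 4.3.x): update packages to current versions
renv::snapshot()  # if the code still runs correctly, record the new versions in renv.lock
```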
How much you want to adapt this code is really up to you. You might want to 'carbon-copy' this analysis on your own project, but in all likelihood your project will dictate that you make some changes to better fit its needs. Just as "training an auto-coder is not a one-size-fits-all process", neither is assessing auto-coding fairness. For example, if you are confident that your predictor data (e.g., acoustic measures) does not suffer from measurement error, you can skip the time-consuming step of accounting for outliers. In some cases, this code might not work for your project at all; for example, it only handles fairness across two groups, and it only handles binary classification (two categories).
Below, you can read about:
- Assessing fairness for a pre-existing auto-coder
- Creating a fair auto-coder by testing unfairness mitigation strategies
- Using your own training data
- Adding and/or subtracting UMSs
In addition, I strongly recommend making the data you use for this task publicly available if possible, since open data helps advance science (see Villarreal & Collister "Open methods in linguistics", in press for Oxford collection Decolonizing linguistics). However, if you do so, make sure what you share conforms to the ethics/IRB agreement(s) in place when the data was collected (if applicable).
Finally, if there's anything in this code that you can't figure out or isn't working for you, please don't hesitate to reach out! Please note that there is no warranty for this code.
This is one possible goal of your analysis, mirroring the Linguistics Vanguard paper's RQ2. To analyze your auto-coder, it needs to have been generated by `caret::train()`. The auto-coder's `trainingData` element also needs a column with group data (e.g., which tokens belong to female vs. male speakers). If you use the scripts in this repository, that's taken care of for you; `umsData()` retains the group column in the training dataframe passed to `train()`, and `umsFormula()` excludes the group column from the predictor set. However, if you didn't use these scripts to run your auto-coder, you'll need to either manually add the group column to the `trainingData` element or just re-run your auto-coder using these scripts.

If your auto-coder conforms to these requirements, you can use functions from `R-Scripts/UMS-Utils.R` to analyze fairness. See the walkthrough for examples of how to use this code.
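Under those requirements, a minimal fairness check might look like the sketch below. The file paths and function names come from this repository, but the exact arguments these functions take are documented in the walkthrough, so treat this as illustrative rather than definitive:

```r
## Illustrative sketch -- see the analysis walkthrough for real usage
source("R-Scripts/UMS-Utils.R")                           # loads cls_fairness(), cls_summary(), etc.
autoCoder <- readRDS("Input-Data/LabPhonClassifier.Rds")  # an object generated by caret::train()
"Gender" %in% names(autoCoder$trainingData)               # check that the group column is present
cls_fairness(autoCoder)   # investigate fairness (e.g., accuracy by group)
cls_summary(autoCoder)    # one-row dataframe of fairness/performance metrics
```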
This is the other possible goal of your analysis, mirroring the Linguistics Vanguard paper's RQ3. You'll need your own training data (in place of `trainingData.Rds`), and you may need normalization data depending on which UMSs you want to try.
Formatting requirements for training data:
- Tabular data (data stored in rows and columns), saved in a `.csv` or `.Rds` file
- Each row represents a single token of some categorical linguistic variable
- At least some of the tokens have been coded into classes (in `trainingData.Rds`, these are tokens for which the column `HowCoded=="Hand"`)
- Columns needed for the auto-coder:
  - 1 column with variant labels for already-coded tokens and blanks/`NA`s for uncoded tokens (`Rpresent` in `trainingData.Rds`)
    - Currently, this code only handles binary classification (two categories, not counting `NA`s)
  - 1 column with the group that you're assessing fairness for (`Gender` in `trainingData.Rds`)
    - Currently, this code only handles two-group fairness
  - Multiple columns containing predictors that the auto-coder will use for coding (in `trainingData.Rds`, 180 columns from `tokenDur` to `absSlopeF0`, inclusive)
    - Can be any data type
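As a concrete (and entirely hypothetical) illustration, a tiny dataset meeting these requirements might look like this in R, using the column names from `trainingData.Rds`; real data would have many more tokens and, in this case, 180 predictor columns:

```r
## Toy example -- all values are invented
trainingData <- data.frame(
  Rpresent   = c("Present", "Absent", NA),  # variant labels; NA = not yet coded
  Gender     = c("F", "M", "F"),            # group column for fairness assessment
  tokenDur   = c(0.12, 0.09, 0.15),         # predictor columns (any data type)...
  absSlopeF0 = c(1.2, 0.8, 2.1)             # ...from tokenDur to absSlopeF0 in the real data
)
```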
If you want to perform any speaker normalization (either as a preprocessing step or as UMS 3.1), you'll also need:
- In your training data, a `Speaker` column
- An additional data file with normalization baselines (like `meanPitches.csv`):
  - One row per speaker, with every speaker in your training data
  - A `Speaker` column
  - A column for each baseline measure you want to use for normalization (`MinPitch` in `meanPitches.csv` is used as the baseline for the `F0min` measure in the training data, `MaxPitch` for the `F0max` measure)
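For concreteness, a normalization-baselines file like `meanPitches.csv` might look like this (the speaker codes and pitch values here are invented):

```
Speaker,MinPitch,MaxPitch
SP001,92.4,187.0
SP002,171.3,310.5
```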
In addition to those requirements, here are some data formatting recommendations (you don't have to format your data this way, but if not, you'll need to tinker with the code some more). Some of these pertain to the data preprocessing step in "How to train":
- If you suspect that your predictors have measurement error and you want to take advantage of the outlier-dropping script, you need to mark measurement outliers. Outliers for predictors `X` & `Y` (for example) should be marked as `TRUE` in columns `X_Outlier` & `Y_Outlier`.
- The code drops rows that have `NA`s for the dependent variable, group, and predictor columns (and outlier columns, if applicable). Depending on how many missing measurements you have, you might want to consider imputing measurements and/or thinning your predictor set.
- You might want to add a `HowCoded` column to easily separate hand-coded and auto-coded tokens.
- Thanks to pre-processing, `trainingData.Rds` reflects normalized measurements for formant timepoints but not pitch, so UMS 3.1 involves pitch normalization. You might decide to fold all normalization into pre-processing and skip normalization as a UMS.
- You may want to anonymize your speakers, especially if you choose to make your data open, as in `trainingData.Rds`, but this is strictly optional.
Finally, as a general note, you may have to tweak the R scripts a little bit to accommodate your data. For example:
- These scripts assume the training data file is an `.Rds` file. If it's a `.csv`, you'll have to tweak the code.
- If your columns have different names than the ones in `trainingData.Rds` (e.g., if your dependent variable isn't `Rpresent`), you'll need to find-and-replace column names in the scripts. Alternatively, you can pass column names as arguments to `UMS-Utils.R` functions (e.g., `umsData(myData, dependent=ING, group=Ethnicity)`).
- If you don't have a `HowCoded` column, you'll need to modify the lines of code that refer to that column.
- Depending on the size of your predictor set, you'll want to change the default value of `mtry` (the number of predictors attempted at each split) in `Rscript-Opts.R` and `Hyperparam-Tuning.R`; a typical value is the square root of the number of predictors, rounded down to the nearest integer.
- If you rename or reorganize folders or files, you'll need to change the code to account for that.
Depending on your groups, your predictor set, and/or your dependent variable, you might want to add or subtract UMSs. For example, if you already have equal token counts for women vs. men, UMS 1.1 (downsample men to equalize token counts by gender) wouldn't apply.
To add a new UMS:
- Pick a new UMS code
  - Don't use a code that's already been defined (it just creates unnecessary complications)
  - If it's a combination UMS, the code should start with `4`
- Add the code and description to `Input-Data/UMS-List.txt`
- Modify `umsData()` in `R-Scripts/UMS-Utils.R`
  - Single UMSs: Add a new `} else if (UMS=="<new-UMS>") {` block to the `implementUMS()` subroutine
  - Combination UMSs: Add code to interpret the second & third digits near the bottom of `umsData()`
  - Note that `umsData()` uses `tidyselect` semantics for several arguments (`dependent`, `group`, `predictors`, & `dropCols`). If you're using any of these column names in a `dplyr` function, wrap them in double-braces (e.g., `data %>% select({{dependent}}, {{group}})`); if you need a column name as a string, use the deparse-substitute trick (e.g., `depName <- deparse(substitute(dependent))`)
- If using a shell script to run multiple UMSs in a single round, edit the script so the UMS code is matched by the `pattern` regex and not by `excl` (e.g., to include UMS 5.1, use `pattern=^[0-35]`)
You only need to subtract a UMS explicitly if you're using a shell script to run multiple UMSs in a single round. To subtract a UMS, use the `excl` regex to exclude it (e.g., to exclude UMSs 1.4 and 2.2, use `excl="^1.4|2.2"`). No need to modify `umsData()`, since the code will just skip over that UMS in the chain of `else if {}` statements.
Note that the existing UMS list is actually more general than its descriptions suggest. For example, UMSs 1.3.1 and 1.3.2 both achieve equal /r/ base rates by gender, by downsampling either women's Absent (1.3.1) or men's Present (1.3.2). However, `umsData()` actually translates this into "downsample one of the classes from the smaller group" vs. "the bigger group", automatically detecting which class to downsample from which group. Try plugging your data into `umsData()` to see whether the existing code affects your data the way you expect.
Readers are more than welcome to critique this code! While I think much of this code is pretty solid, there are no doubt some bugs here and there, some inefficient code implementations, and/or some tortured data-scientific reasoning. You can raise GitHub issues, start discussions, or send me an email.
Please don't be afraid to suggest changes, report bugs, or ask questions---I want this code to be useful for you, and there are no bad questions!
If you use this repository, please cite it! Studies show that research software and data are under-cited, which makes it hard for contributors to gauge usage or get credit. Here's a recommended citation:
Villarreal, Dan. 2023. *SLAC-Fairness*: Tools to assess fairness and mitigate unfairness in sociolinguistic auto-coding. Available at https://djvill.github.io/SLAC-Fairness/.
I would like to thank Chris Bartlett, the Southland Oral History Project (Invercargill City Libraries and Archives), and the speakers for sharing their data and their voices. Thanks are also due to Lynn Clark, Jen Hay, Kevin Watson, and the New Zealand Institute of Language, Brain and Behaviour for supporting this research. Valuable feedback was provided by audiences at NWAV 49, the Penn Linguistics Conference, Pitt Computer Science, and the Michigan State SocioLab. Other resources were provided by a Royal Society of New Zealand Marsden Research Grant (16-UOC-058) and the University of Pittsburgh Center for Research Computing (specifically, the H2P cluster supported by NSF award number OAC-2117681). Any errors are mine entirely.