Welcome to the Prospector Heads repo! Here, we share our implementation of prospector heads (aka "prospectors"). This repo is nicknamed "K2" for readability and legacy purposes, and references a key operator in the prospector module called *k2conv* (see our arXiv paper above!). As a work in progress, this repo will continue to get weekly updates (see the bottom of this README).
At a high level, prospectors offer feature attribution capabilities to virtually any encoder while being:
- Computationally efficient (sub-quadratic)
- Data-efficient (through parameter efficiency)
- Performant at feature localization
- Modality-generalizable
- Interpretable... and more!
The core functionality of prospectors follows an API similar to the scikit-learn package. As detailed in our arXiv preprint, prospectors contain two trainable layers, which we train sequentially in this implementation. Prospectors can be used in 3 simple steps (an end-to-end sketch follows the list):
- Convert data to `networkx` graph objects, where each graph node is loaded with a token embedding (e.g. see `Doc-Step1-Embed.ipynb` for text encoders). Note: connectivity and resolution are defined by the user.
- Construct a `K2Processor` object and then fit layer (I)'s quantizer via the `.fit_quantizer()` command (e.g. see `Histo-Step2-VizSetup.ipynb`). Note: this step can use a random sample of token embeddings.
- Construct a `K2Model` object and then fit layer (II)'s convolutional kernel via the `.create_train_array()` and `.fit_kernel()` commands (e.g. see `Doc-Step2-VizSetup.ipynb`).
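Below is a minimal end-to-end sketch of these three steps. The `K2Processor`/`K2Model` classes and the `.fit_quantizer()`, `.create_train_array()`, and `.fit_kernel()` methods come from this repo, but the import path, constructor arguments, node-attribute name, and exact call signatures shown here are illustrative assumptions; consult the notebooks for authoritative usage.

```python
import numpy as np
import networkx as nx
from k2 import K2Processor, K2Model  # import path assumed; see k2.py

# Step 1: represent one datum as a graph whose nodes carry token embeddings.
# Chain connectivity suits a text sequence; connectivity and resolution
# are defined by the user.
embeddings = np.random.rand(12, 768)      # stand-in for real encoder outputs
graph = nx.path_graph(len(embeddings))
for node_id, emb in enumerate(embeddings):
    graph.nodes[node_id]["emb"] = emb     # node-attribute name assumed

# Step 2: fit layer (I)'s quantizer on (a sample of) token embeddings.
processor = K2Processor()                 # constructor arguments omitted/assumed
processor.fit_quantizer(embeddings)

# Step 3: fit layer (II)'s convolutional kernel on the quantized graphs.
model = K2Model(processor)                # constructor signature assumed
train_array = model.create_train_array([graph], [1])  # (graphs, labels) assumed
model.fit_kernel(train_array)
```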
The IPython notebooks herein also give examples of how to visualize:
- Data sprites (false color representations of data colored by concepts)
- Concept monogram and skip-bigram frequencies (per datum) as fully connected graphs
- Prospector convolutional kernels as fully connected graphs
- Prospect maps as outputs for feature attribution
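For the last item above, rendering a prospect map amounts to coloring graph nodes by their attribution scores. The repo's own plotting helpers live in `utils.py`; purely as a generic illustration with off-the-shelf `networkx` drawing (the `scores` input is a hypothetical stand-in for a prospect map):

```python
import matplotlib.pyplot as plt
import networkx as nx

def plot_prospect_map(graph: nx.Graph, scores: dict) -> None:
    """Color each node by its (hypothetical) attribution score."""
    pos = nx.spring_layout(graph, seed=0)  # fixed seed for a repeatable layout
    colors = [scores.get(node, 0.0) for node in graph.nodes]
    nx.draw(graph, pos, node_color=colors, cmap=plt.cm.coolwarm, node_size=60)
    plt.show()
```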
We present IPython notebooks for each modality in our experiments. The following naming convention is used for notebooks:
- Any notebook beginning with `Doc` outlines experiments for sentence retrieval in text documents (sequences)
- Any notebook beginning with `Histo` outlines experiments for tumor localization in histopathology slides (images)
- Any notebook beginning with `Protein` outlines experiments for binding site identification in protein structures (graphs)
Python files contain all architectures and helper functions used in the notebooks. Here we briefly summarize each file (italics denote any modality-specific files):
- *`architectures.py`*: encoder architectures used for pathology/imagery
- *`attention-baselines.py`*: attention-based heads used as baselines for proteins/graphs
- *`eval-baselines.py`*: evaluation of baselines for proteins/graphs
- `evaluation.py`: all evaluation functions: training grid search, model selection, test-set evaluation, target region characteristics (e.g. prevalence, dispersion)
- `job_params.py`: example function for loading hyperparameters outside of a notebook, especially useful for bash scripting
- `k2.py`: the prospector head architecture
- `metrics.py`: metrics used for training grid search / model selection
- `model_selection.py`: helper functions for model selection
- *`process_protein_data.py`*: helper functions for processing protein/graph data
- *`protein-eval.py`*: evaluation script for protein/graph data
- `run_gridsearch.py`: script to run the training grid search, especially useful for bash scripting
- `utils.py`: all highly-used helper functions for prospector heads, including visualizations; interfaces heavily with `k2.py`
- *`xml_parse.py`*: data preprocessing for protein/graph data
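To make the bash-scripting pattern concrete, here is a hypothetical take on the idea behind `job_params.py`: map an integer job index (e.g. a SLURM array ID) to one hyperparameter combination, so a shell script can launch one grid-search job per index. Every function name, field, and value below is illustrative, not the repo's actual API.

```python
import itertools
import sys

def load_job_params(job_id: int) -> dict:
    """Map a job index to one combination from a small hyperparameter grid."""
    grid = {
        "n_concepts": [8, 16, 32],       # quantizer vocabulary sizes (assumed)
        "kernel_reg": [0.01, 0.1, 1.0],  # kernel regularization values (assumed)
    }
    combos = list(itertools.product(*grid.values()))
    return dict(zip(grid.keys(), combos[job_id]))

if __name__ == "__main__":
    params = load_job_params(int(sys.argv[1]))  # index supplied by a bash script
    print(params)
```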
Prospectors' dependencies are very light, only requiring the following popular/maintained packages:
- `os`
- `numpy`
- `pandas`
- `networkx`
- `scikit-learn`
- `pickle`
- `dill`
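Of these, `os` and `pickle` ship with the Python standard library; a minimal requirements file for the remaining packages might look like this (versions left unpinned as an assumption, since the repo does not pin them):

```text
numpy
pandas
networkx
scikit-learn
dill
```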
This work was originally implemented in Python 3.10. Given the light dependencies, we anticipate fairly seamless support for future Python versions.
We plan to make a more official repo release in the coming weeks and months. At this time, our priority is to get this software into the hands of researchers applying feature attribution to large data with large models. Looking forward, we anticipate the following quality of life improvements:
- Updated OOP nomenclature — ideally standardized with our arXiv preprint
- PyTorch support for individual layer fitting (e.g. torch-enabled k-means as a quantizer; see the sketch below)
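As a rough sketch of that last direction (not code from this repo), layer (I) fitting could become torch-enabled with plain Lloyd's-algorithm k-means over token embeddings, runnable on GPU:

```python
import torch

def kmeans(embeddings: torch.Tensor, k: int, n_iters: int = 50) -> torch.Tensor:
    """Cluster (n, d) embeddings into k concepts; returns (k, d) centroids."""
    idx = torch.randperm(embeddings.shape[0])[:k]  # random initialization
    centroids = embeddings[idx].clone()
    for _ in range(n_iters):
        assign = torch.cdist(embeddings, centroids).argmin(dim=1)  # nearest centroid
        for j in range(k):
            members = embeddings[assign == j]
            if len(members) > 0:                   # keep old centroid if cluster empty
                centroids[j] = members.mean(dim=0)
    return centroids

centroids = kmeans(torch.randn(1000, 768), k=16)   # move tensors to GPU as desired
```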