This is an open, shareable, reproducible, computational research project on entity resolution.
It is the joint work of:
Given a collection of records, each of which is "about" one entity, entity resolution is the process of determining which records probably refer to the same entity. It is used in contexts where there is no uniquely identifying entity key on the records, so the process is forced to rely on record attributes that are associated with identity but not uniquely determined by it (e.g. height, weight, and eye colour as attributes of persons).

Inferring that two records refer to the same entity is inherently probabilistic, because it is always possible for multiple entities to have identical values on the available record attributes and therefore be functionally indistinguishable. So, given a pair of records, we are interested in the probability that they refer to the same entity.

Entity resolution is typically conceptualised in terms of the similarity between records, with similarity assumed to be monotonic with the probability of the records referring to the same entity. This project investigates the value of empirically determining the relationship between similarity and probability of co-reference, which can be seen as an example of calibration.

We also investigate whether that calibration varies as a function of other measurable properties of the specific records being compared. For example, we could look at the frequency in the collection of the record attribute values being compared, and see whether that information can be exploited to yield better entity resolution.

Entity resolution typically uses a small number of fixed similarity functions (e.g. edit distance between strings) that are defined without reference to the specific pair of records being compared. Incorporating other predictors, which are functions of the specific records being compared, into the calibration function can be seen as similar in spirit to having a customised similarity function for every pair of records. This parallels the practice of using subpopulation-specific model calibration functions to better combine model estimates across multiple subpopulations.
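To make the idea concrete, here is a minimal sketch in R of what such a calibration might look like. It is illustrative only: the data frame `pairs`, its column names, and the choice of logistic regression are all assumptions, not the project's settled method.

``` r
# Hypothetical labelled record pairs:
#   sim       - similarity score for the pair (e.g. a normalised edit distance)
#   name_freq - frequency in the collection of the name value being compared
#   coref     - TRUE if the pair is known to refer to the same entity

# Plain calibration: map similarity to probability of co-reference
calib_sim <- glm(coref ~ sim, data = pairs, family = binomial)

# Calibration that also exploits a property of the specific pair being
# compared (here, the frequency of the compared name value)
calib_sim_freq <- glm(coref ~ sim * log(name_freq),
                      data = pairs, family = binomial)

# Predicted probability of co-reference for new record pairs
p_coref <- predict(calib_sim_freq, newdata = new_pairs, type = "response")
```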
This is an open, shareable, reproducible, computational research project.

- All the computational work and document preparation is done with the R statistical computing environment and the RStudio integrated development environment.
- The entire research project is contained in a single directory that corresponds to an RStudio project.
- We use the `renv` package to manage the R package versions used by the project.
- We use the `targets` package to structure the project so that the work is computationally reproducible.
- We use the `workflowr` package to structure the project so that all the materials and outputs are available via an openly accessible, automatically generated website.
- The project code and documents are shared publicly on GitHub at https://github.com/rgayler/fa_sim_cal
- The website automatically generated by `workflowr` from the rendered project documents is at https://rgayler.github.io/fa_sim_cal/
The `_targets` directory is managed by the `targets` package. It contains the metadata describing the status of the computational pipelines and the cached results of those computations. You will normally only manipulate these via functions from `targets`.
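For example, the pipeline and its cached results are normally accessed like this (a minimal sketch; the target name `raw_data` is hypothetical):

``` r
library(targets)

tar_make()          # run the pipeline, updating the metadata and cached results
tar_read(raw_data)  # read a cached target back into the R session
```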
`workflowr` creates a set of standard directories. See the package documentation for details on how these directories are used. The brief purposes are:

- `analysis` - R Markdown analysis notebooks
- `R` - R code not in analysis notebooks (changed from the `workflowr` default of `code`)
- `data` - raw data and associated metadata
- `docs` - automatically generated website
- `output` - generated data and other objects
`workflowr` only manages the subset of files that it knows about, so you will need to manually stage and commit any other files that need to be mirrored on GitHub.
- Any files in `data` and `output` that are more than trivially small are not shared via Git and GitHub. `.gitignore` is used to keep them out of Git.
- There will be a separate mechanism (e.g. Zenodo) for sharing those large files.
The `analysis` notebooks are for capturing all the analytical work that was done, including exploratory work and abandoned directions. They contain both the code and enough interpretation/explanation to make sense of the results.

The notebooks will be too verbose and inappropriately structured/formatted for publication. Publishable documents are written separately and kept in the `manuscripts` directory.
`manuscripts` contains a subdirectory for each manuscript/document/presentation. Each manuscript/document/presentation is prepared and formatted using a package like `rticles` or `bookdown`. Each document is prepared in a separate subdirectory of `manuscripts` that contains all the necessary infrastructure files (templates, bibliographies, etc.).
The `renv` package keeps track of the R packages (and their versions) used by the project. It allows anyone to reinstate the same packages and versions in their local copy of the project.

The `renv` directory contains the information needed by `renv` to reinstate the local package environment.
The `.gitignore` in the R project root directory is used for all manual entries so that all the manual rules are in one place. Packages, such as `renv`, may create their own `.gitignore` files in subdirectories that they manage.
The static website automatically generated by `workflowr` is stored in the `docs` directory.

The key document is `docs/index.html`. Open this file with a browser to get access to the website. `docs/index.html` allows you to navigate to all the generated content. This index page is mirrored on the internet at https://rgayler.github.io/fa_sim_cal/index.html
- All detailed setup instructions and notes go in this project-level `README.md` file.
- The `README.md` files in the subdirectories only state the purpose of each subdirectory and the files in that directory.
This assumes that you already have current versions of R and RStudio installed.

1. Clone the project repository https://github.com/rgayler/fa_sim_cal from GitHub.
2. Open the cloned repository as an RStudio project.

You can combine steps 1 and 2 using RStudio by creating a new project from the GitHub repository: `File | New Project... | Version Control | Git | Create Project`.

When you open the project you will get warning messages about packages not being installed. This is because you need to use the `renv` package to reinstate the packages that are used by the project.

3. Install `renv` in that project if it is not already installed (see the sketch after these steps).
4. Use `renv::restore()` to install all the needed packages in the project-specific library:

``` r
renv::restore()
```
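For step 3, a minimal sketch, assuming nothing beyond base R:

``` r
# Install renv into the project library if it is not already available
if (!requireNamespace("renv", quietly = TRUE)) install.packages("renv")
```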
The computational work of this project is separated into core, meta, and publication pipelines.
The core pipeline contains the computational steps that are essential to the subject matter of the project. The core pipeline uses `targets` but not `workflowr`. It is purely computational and only produces data objects as outputs. The core pipeline is managed by editing the definitions in `_targets.R`.
The publication pipelines contain the computational steps required to convert the outputs of the core pipeline into publications. The leaves of the publication pipelines are R Markdown documents that are rendered to publications. The publication pipelines may also contain purely computational steps to perform publication-specific transformations of the outputs from the core pipeline. The publication pipelines are managed by editing the definitions in `_targets.R`.
The meta pipelines contain the analyses used to design and develop the core pipeline. The leaves of the meta pipelines are `workflowr` R Markdown documents that are rendered to web pages. The meta pipelines may also contain purely computational steps to remove computational cost from the rendered leaves. The meta pipelines are managed by a mixture of `targets` and `workflowr`. While the meta pipelines are being developed they are primarily executed manually via `workflowr`. They are also recorded as definitions in `_targets.R` so that they can be automatically re-executed after they are finalised.
See the Workflow Management notebook for a detailed description of the logic behind this organisation.
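For orientation, a sketch of what the pipeline definitions in `_targets.R` might look like. This is purely illustrative: the target names, file paths, and the use of the `tarchetypes` package are assumptions, not the project's actual pipeline.

``` r
# _targets.R (illustrative sketch only)
library(targets)
library(tarchetypes)

list(
  # Core pipeline: purely computational, produces data objects
  tar_target(entity_file, "data/entity_data.csv", format = "file"),
  tar_target(entity_data, read.csv(entity_file)),

  # Publication pipeline leaf: render an R Markdown document
  tar_render(manuscript, "manuscripts/paper/paper.Rmd")
)
```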
Any files in `data`, `output` and `_targets` that are more than trivially small are not shared via Git and GitHub. They will be shared via a separate, yet to be determined, mechanism (e.g. Zenodo).
For the immediate purposes of this project the raw data files should be downloadable from the internet and any processed data can be locally regenerated. The relevant analysis notebooks indicate where to get the data. In the longer term, the raw data should be bundled with the project somehow so that there is no dependency on continued data availability via the internet.
The purpose of the meta analyses is to work out what analyses we really want the project to do and how to implement them in the core pipeline. The meta notebooks document the process and reasoning by which we arrived at the design of the core pipeline.
Most meta notebooks focus on the development of functions that will be used in the core pipeline as the computational edges of the computation graph. Some meta notebooks are more diffuse: they perform general background analyses of data available in the core pipeline so that we understand the data well enough to support later design reasoning.
The analysis notebooks follow the `workflowr` workflow. See the getting started vignette for an introduction.

- Create a new analysis notebook:

  ``` r
  workflowr::wflow_open("analysis/new_notebook_name.Rmd")
  ```

- Build the website locally (either manually or indirectly via `targets`):

  ``` r
  workflowr::wflow_build()
  ```

- Publish the website online (manually). This will only work if you have push authorisation for the GitHub remote repository.

  ``` r
  workflowr::wflow_publish("analysis/*.Rmd", "A commit message")
  ```
- Add `mathjax = "local"` as an argument to `workflowr::wflow_html` in `analysis/_site.yml` so that the MathJax JavaScript library is bundled with the website in `docs/` rather than being loaded from a remote server when the website is viewed. This removes the dependency on the remote server being available. See workflowr/workflowr#211

  ``` yaml
  output:
    workflowr::wflow_html:
      mathjax: "local"
  ```
- Bibliography records for citations in the `analysis/` notebooks are stored in `analysis/references.bib`.
- RStudio provides convenient features for managing citations.
- The reference style sheet for citations in the `analysis/` notebooks is stored in `analysis/some_style_name.csl`.
See the R Markdown citation guide for more details.
The `renv` package is used to keep track of the installed packages and their versions. See the `renv` collaboration guide for the workflow for synchronising package environments between collaborators.
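In outline, the synchronisation relies on two standard `renv` calls (see the collaboration guide for the complete workflow):

``` r
renv::snapshot()  # record the current package versions in renv.lock
                  # (commit renv.lock to Git so collaborators can see it)
renv::restore()   # on a collaborator's machine, reinstate those versions
```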
- Each publishable document is managed in a separate subdirectory of `manuscripts`.
- The `manuscripts` directory is not managed by `workflowr`, so it must be manually managed with respect to Git.
- Each publishable R Markdown document is prepared and formatted using a package like `rticles` or `bookdown`, so the details may vary between documents.
- The rendering of each publishable R Markdown document is managed via `targets`.
- The publishable R Markdown documents should avoid heavy computation. It is generally better if heavy computation is done in analysis notebooks and the results stored in the `output` directory. Those results can then be picked up by the publishable R Markdown document.
- Each rendered publishable document will be created in its subdirectory of `manuscripts`.
  - The rendered document must be stored in the `docs` directory so that the GitHub website can access it. (See workflowr/workflowr#209)
  - The manuscript subdirectory must contain a symlink to the rendered document in the `docs` directory. This allows the manuscript rendering process to update the rendered file in the `docs` directory. (See the sketch after this list.)
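A hypothetical sketch of that arrangement, using plain `rmarkdown` for concreteness (the real documents may be rendered via `rticles` or `bookdown`, and all names here are illustrative):

``` r
# Render the manuscript directly into docs/ so the GitHub website can serve it
rmarkdown::render("manuscripts/paper/paper.Rmd",
                  output_file = "paper.pdf",
                  output_dir  = "docs")

# Keep a symlink in the manuscript subdirectory pointing at the rendered file
# (the symlink target is interpreted relative to the symlink's location)
file.symlink(from = "../../docs/paper.pdf",
             to   = "manuscripts/paper/paper.pdf")
```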