Parse a regulation (plain text) into a well-formated JSON tree (along with associated layers, such as links and definitions) with this tool. It also pulls in notice content from the Federal Register and creates JSON representations for them. The parser works hand-in-hand with regulations-site, a front-end for the data structures generated, and regulations-core, an API for hosting the data.
This repository is part of a larger project. To read about it, please see http://eregs.github.io/eregulations/.
- Split regulation into paragraph-level chunks
- Create a tree which defines the hierarchical relationship between these chunks
- Layer for external citations -- links to Acts, Public Law, etc.
- Layer for graphics -- converting image references into federal register urls
- Layer for internal citations -- links between parts of this regulation
- Layer for interpretations -- connecting regulation text to the interpretations associated with it
- Layer for key terms -- pseudo headers for certain paragraphs
- Layer for meta info -- custom data (some pulled from federal notices)
- Layer for paragraph markers -- specifying where the initial paragraph marker begins and ends for each paragraph
- Layer for section-by-section analysis -- associated analyses (from FR notices) with the text they are analyzing
- Layer for table of contents -- a listing of headers
- Layer for terms -- defined terms, including their scope
- Create diffs between versions of the regulations (if those versions are available from an API)
- lxml (3.2.0) - Used to parse out information XML from the federal register
- pyparsing (1.5.7) - Used to do generic parsing on the plain text
- inflection (0.1.2) - Helps determine pluralization (for terms layer)
- requests (1.2.3) - Client library for writing output to an API
- requests_cache (0.4.4) - Optional - Library for caching request results (speeds up rebuilding regulations)
If running tests:
- nose (1.2.1) - A pluggable test runner
- mock (1.0.1) - Makes constructing mock objects/functions easy
- coverage (3.6) - Reports on test coverage
- cov-core (1.7) - Needed by coverage
- nose-cov (1.6) - Connects nose to coverage
Download the source code from GitHub (e.g. git clone [URL]
)
Make sure the libxml
libraries are present. On Ubuntu/Debian, install
it via
$ sudo apt-get install libxml2-dev libxslt-dev
$ sudo pip install virtualenvwrapper
$ mkvirtualenv parser
$ cd regulations-parser
$ pip install -r requirements.txt
At the moment, we parse from a plain-text version of the regulation. This requires such a plain text version exist. One of the easiest ways to do that is to find your full regulation from e-CFR. For example, CFPB's regulation E.
Once you have your regulation, copy-paste from "Part" to the "Back to Top" link at the bottom of the regulation. Next, we need to get rid of some of the non-helpful info e-CFR puts in. Delete lines of the form
- ^Link to an amendment .*$
- Back to Top
We've also found that tables of contents can cause random issues with the parser, so we recommend removing them. The parser will most likely generate the same content in a layer.
Save that file as a text file (e.g. reg.txt).
The syntax is
$ python build_from.py regulation.txt title notice_doc_# act_title act_section
So, for the regulation we copy-pasted above, we could run
$ python build_from.py reg.txt 12 2013-06861 15 1693
Here 12
is the CFR title number (in our case, for "Banks and
Banking"), 2013-06861
is the last notice used to create this version
(i.e. the last "final rule" which is currently in effect), 15
is the
title of "the Act" and 1693
is the relevant section. Wherever the
phrase "the Act" is used in the regulation, the external link parser will
treat it as "15 U.S.C. 1693". The final rule number is used to pull in
section-by-section analyses and deduce which notices were used to create
this version of the regulation. To find this, use the
Federal Register, finding the last,
effective final rule for your version of the regulation and copying the
document number from the meta data (currently in a table on the right side).
This will generate four folders, regulation
, notice
, layer
and possibly diff
in the OUTPUT_DIR
(current directory by default).
If you'd like to write the data to an api instead (most likely, one running
regulations-core), you can set the API_BASE
setting (described below).
All of the settings listed in settings.py
can be overridden in a
local_settings.py
file. Current settings include:
OUTPUT_DIR
- a string with the path where the output files should be written. Only useful if the JSON files are to be written to disk.API_BASE
- a string defining the url root of an API (if the output files are to be written to an API instead)META
- a dictionary of extra info which will be included in the "meta" layer. This is free-form.CFR_TITLE
- array of CFR Title names (used in the meta layer); not required as those provided are currentDEFAULT_IMAGE_URL
- string format used in the graphics layer; not required as the default should be adequateIMAGE_OVERRIDES
- a dictionary between specific image ids and unique urls for them -- useful if the Federal Register versions aren't pretty
Unlike our other layers (at the moment), the Keyterms layer (which indicates pseudo titles used as headers in regulation paragraphs) is built using XML from the Federal Register rather than plain text. Right now, this is a particularly manual process which involves manually retrieving each notice's XML, generating a layer, and merging the results with the existing layer. This is not a problem if the regulation is completely re-issued.
In any event, to generate the layer based on a particular XML, first
download that XML (found by on federalregister.gov
by selecting 'DEV', then 'XML' on a notice). Then, modify the
build_tree.py
file to point to the correct XML. Running this script
will convert the XML into a JSON tree, maintaining some tags that the plain
text version does not.
Save this JSON to /tmp/xtree.json
, then run generate_layers.py
.
The output should be a complete layer; so combine information from
multiple rules, simply copy-paste the fields of the newly generated layer.
An alternative (or additional option) is to use the
plaintext_keyterms.py
script, which adds best-guesses for the
keyterms. If you do not have the /tmp/xtree.json
from before, create a
file with {}
in its place. Modify plaintext_keyterms.py
so that
the api_stub.get_regulation_as_json
line uses the regulation output of
build_from.py
as described above. Running plaintext_keyterms.py
will generate a keyterm layer.
For obvious reasons, plain text does not include images, but we would still like to represent model forms and the like. We use Markdown style image inclusion in the plaintext:
![Appendix A9](ER27DE11.000)
This will be converted to an img tag by the graphics layer, pointing to the
image as included in the Federal Register. Note that you can override each
image via the IMAGE_OVERRIDES
setting (see above).
For most tweaks, you will simply need to run the Sphinx documentation builder again.
$ pip install Sphinx
$ cd docs
$ make dirhtml
The output will be in docs/_build/dirhtml
.
If you are adding new modules, you may need to re-run the skeleton build script first:
$ pip install Sphinx
$ sphinx-apidoc -F -o docs regparser/
To run the unit tests, make sure you have added all of the testing requirements:
$ pip install -r requirements_test.txt
Then, run nose on all of the available unit tests:
$ nosetests tests/*.py
If you'd like a report of test coverage, use the nose-cov plugin:
$ nosetests --with-cov --cov-report term-missing --cov regparser tests/*.py