This repo contains Python code I used to download Form 10-K filings from the SEC EDGAR database and then heuristically extract the MDA (Management's Discussion and Analysis) section from the downloaded filings.
I used Python 3.6. Install the dependencies with:
#python36
pip install -r requirements.txt
Specify the start year, the end year, and the directory in which to save outputs.
By default, indices, forms, and MDAs are saved to ./data
# Downloads and parses the MDA sections for quarters 1 and 2 of 2016, and saves them to `./data/`
python edgar.py --start_year 2016 --end_year 2016 --quarters 1 2 --data_dir ./data/
usage: edgar.py [-h] -s START_YEAR -e END_YEAR [-q QUARTERS [QUARTERS ...]]
                [-d DATA_DIR] [--overwrite] [--debug]

optional arguments:
  -h, --help            show this help message and exit
  -s START_YEAR, --start_year START_YEAR
                        year to start
  -e END_YEAR, --end_year END_YEAR
                        year to end
  -q QUARTERS [QUARTERS ...], --quarters QUARTERS [QUARTERS ...]
                        quarters to download for start to end years
  -d DATA_DIR, --data_dir DATA_DIR
                        path to save data
  --overwrite           If True, overwrites downloads and processed files.
  --debug               Debug mode
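
For readers who want to script against the same options, here is a minimal sketch of an `argparse` parser that would produce a usage message like the one above. It is not the repo's actual code: the `build_parser` name, the `type=int` conversions, and the default of all four quarters when `-q` is omitted are assumptions. The `./data` default comes from the note about default output locations.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented edgar.py options."""
    parser = argparse.ArgumentParser(prog="edgar.py")
    parser.add_argument("-s", "--start_year", type=int, required=True,
                        help="year to start")
    parser.add_argument("-e", "--end_year", type=int, required=True,
                        help="year to end")
    parser.add_argument("-q", "--quarters", type=int, nargs="+",
                        default=[1, 2, 3, 4],  # assumption: all quarters by default
                        help="quarters to download for start to end years")
    parser.add_argument("-d", "--data_dir", default="./data",
                        help="path to save data")
    parser.add_argument("--overwrite", action="store_true",
                        help="If True, overwrites downloads and processed files.")
    parser.add_argument("--debug", action="store_true", help="Debug mode")
    return parser


# Parse the same arguments as the example command in this README.
args = build_parser().parse_args(
    ["--start_year", "2016", "--end_year", "2016",
     "--quarters", "1", "2", "--data_dir", "./data/"]
)
print(args.start_year, args.end_year, args.quarters, args.data_dir)
```

Because `--overwrite` and `--debug` are modeled as `store_true` flags in this sketch, they take no value on the command line and default to `False`.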
The code runs the extraction in the following steps:
- Downloads the indices for Form 10-K to `./data/index`
- Combines all indices into a single CSV at `./data/index/combined.csv`
- Downloads every Form 10-K listed in the combined CSV from the previous step to `./data/form10k`
- Parses the HTML forms with BeautifulSoup to `./data/form10k.parsed`
- Parses the MDA section to `./data/mda`
- The MDA section is parsed heuristically, so extraction may not work for all forms. You'll probably need to modify the `find_mda_from_text` function to improve coverage.
- You might also need to modify the `normalize_text` function for MDA parsing.
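
To give a feel for what "heuristic" means here, the sketch below shows one possible approach. It is not the implementation in this repo. The function names are reused only for orientation; their signatures, the regular expressions, and the "keep the longest candidate" rule are all assumptions. The idea is to normalize whitespace, look for a line starting with "Item 7 ... Management's Discussion", and cut at the next "Item 7A" or "Item 8" heading. Keeping the longest candidate is a simple way to skip the table of contents, where the same headings appear on adjacent lines.

```python
import re

# Assumed heading patterns; real filings vary a lot, which is why coverage is imperfect.
ITEM7_START = re.compile(
    r"^\s*item\s+7\s*[.:\-]?\s*management['\u2019]?s?\s+discussion",
    re.IGNORECASE | re.MULTILINE,
)
ITEM7_END = re.compile(r"^\s*item\s+(?:7a|8)\b", re.IGNORECASE | re.MULTILINE)


def normalize_text(text):
    """Hypothetical normalizer: unify spaces and drop blank lines."""
    text = text.replace("\u00a0", " ")       # non-breaking spaces from HTML
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r"\n\s*\n+", "\n", text)   # collapse blank lines
    return text.strip()


def find_mda_from_text(text):
    """Hypothetical extractor: return the longest Item 7 span, or None."""
    best = None
    for start in ITEM7_START.finditer(text):
        end = ITEM7_END.search(text, start.end())
        stop = end.start() if end else len(text)
        candidate = text[start.start():stop].strip()
        if best is None or len(candidate) > len(best):
            best = candidate
    return best


sample = """
Table of Contents
Item 7.   Management's Discussion and Analysis
Item 7A.  Quantitative and Qualitative Disclosures
Item 8.   Financial Statements

Item 7.   Management's Discussion and Analysis
Revenue grew because of higher demand.

Costs were flat.
Item 7A.  Quantitative and Qualitative Disclosures About Market Risk
"""

mda = find_mda_from_text(normalize_text(sample))
print(mda)
```

Forms that word the heading differently (for example "Management's Narrative Analysis") or that omit the "Item 7" label would return `None` under these patterns, which is the kind of case where the patterns would need extending.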