A simple project to scrape 10-K forms from the US SEC (Securities and Exchange Commission) using spreadsheets and Python.
A simple scraper that gathers basic statistics from US SEC 10-K forms. The code is rough and the script needs cleanup.
- Download a decent text editor, such as VS Code
- Download Python
- Download the project
- Open a Command Prompt (Windows) or Terminal (Mac) in the project folder
- Install the Requirements
pip install -r requirements.txt
- Copy your input file (Excel Workbook) into the same directory as the script
- Edit sec_scraper.py with:
  - the numbers of spreadsheet columns
  - the names of files
  - the text-search regexes
  - any additional parameters
- Create a secrets.json with the following contents:
  {
    "sec_request_headers": {
      "User-Agent": "YOUR INSTITUTION, YOUR EMAIL",
      "Accept-Encoding": "gzip, deflate",
      "Host": "www.sec.gov"
    }
  }
- Run the script
python sec_scraper.py
- Find your results in the original file
- See the Notes folder for current status.
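The secrets.json step above can be sketched as follows. This is a minimal sketch, assuming the script reads secrets.json from its working directory and attaches sec_request_headers to each HTTP request. The helper names load_sec_headers and build_request are hypothetical and not taken from sec_scraper.py. The standard-library urllib is used only to keep the example dependency-free; the project itself may use a different HTTP library.

```python
import json
from pathlib import Path
from urllib.request import Request

# Placeholder value from the README; it must be replaced before scraping.
PLACEHOLDER_USER_AGENT = "YOUR INSTITUTION, YOUR EMAIL"


def load_sec_headers(path="secrets.json"):
    """Read the "sec_request_headers" object from a secrets file (hypothetical helper)."""
    with Path(path).open(encoding="utf-8") as handle:
        secrets = json.load(handle)
    headers = secrets["sec_request_headers"]
    # Refuse to continue with a missing or unedited User-Agent.
    user_agent = headers.get("User-Agent", "")
    if not user_agent or user_agent == PLACEHOLDER_USER_AGENT:
        raise ValueError("Set a real User-Agent in secrets.json before scraping")
    return headers


def build_request(url, headers):
    """Attach the configured headers to a request object without sending it."""
    return Request(url, headers=headers)
```

Note that the "Host" value in the sample file is www.sec.gov, so these headers only suit URLs on that host.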
This is not intended to be a long-running project.
- Significantly better documentation of the code needed
- Significantly better breakdown of code into smaller functions needed
- Still very buggy/many edge cases not addressed
- Download Git for your Operating System
- General Python Knowledge
- How to Web Scrape the SEC | Part 1
- Python Regexes
- Clone the Repository
git clone git@github.com:peter201943/sec-scraper.git
- Open the Folder
cd sec-scraper
- Create a Virtual Environment
- Install the Requirements
pip install -r requirements.txt
- Open the Project (with VS Code, as example)
code .
sec_scraper.py
Configuration, definition, etcetera. The meat of the project.
tests.py
Small incremental steps to learn how each part works.
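As an illustration of the kind of small, incremental check described for tests.py, here is a minimal sketch of counting text-search regex matches in filing text. Every name in it (SEARCH_PATTERNS, count_matches, test_count_matches) and both sample patterns are hypothetical and not taken from the project's code.

```python
import re

# Hypothetical search configuration: label -> compiled pattern.
SEARCH_PATTERNS = {
    "climate": re.compile(r"\bclimate\s+change\b", re.IGNORECASE),
    "cyber": re.compile(r"\bcyber(?:security|\s+attack)s?\b", re.IGNORECASE),
}


def count_matches(text, patterns=SEARCH_PATTERNS):
    """Return how many non-overlapping matches each pattern has in text."""
    return {label: len(pattern.findall(text)) for label, pattern in patterns.items()}


def test_count_matches():
    """Check the counts on a short, hand-written sample."""
    sample = "Climate change is a risk. Cybersecurity and climate  change matter."
    counts = count_matches(sample)
    assert counts == {"climate": 2, "cyber": 1}
```

Keeping each pattern behind a label makes it easier to map a count back to a spreadsheet column later, though how the real script organizes its regexes may differ.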
This is a low-priority project for peter201943 and as such pull requests are not likely to be accepted. You will be better served by forking it and continuing development of it on your own.
Code distributed under the MIT License. See LICENSE for more information.
Documentation distributed under the Creative Commons Attribution 4.0 License.
This document released under Creative Commons Attribution 4.0 License by Peter J. Mangelsdorf.
See Notes for links to articles, repositories, and programs.