Sept. 29, 2020 – A group of data scientist volunteers has produced a new set of databases designed to make it easier for journalists to track the nearly $4 trillion in federal pandemic stimulus funding approved by Congress since March.
Volunteers from the Washington, D.C., branch of the nonprofit DataKind, in a partnership with the National Press Foundation, have been cleaning and analyzing data from the U.S. Small Business Administration and other sources to help journalists trying to track the COVID stimulus cash.
The U.S. Small Business Administration released Paycheck Protection Program data only on loans of more than $150,000, and news organizations have filed suit to get access to all details of how the taxpayer funds were spent.
In addition, the data that the SBA did release last summer was incomplete and difficult to work with. Recipients were not required to fill in some important fields. The lack of demographic data made it impossible to know whether PPP loans were made disproportionately to wealthy or predominantly white areas, or whether minority-owned businesses or low-income citizens, who have been hardest hit by the pandemic, had received their fair share. And the data contained errors and inconsistencies.
While many large news organizations employ data journalists, smaller regional and local news outlets often lack the resources to do the intensive data analysis needed to power their investigative reporting.
To meet that need, DataKindDC volunteers produced their own databases with cleaner and enhanced data designed to make it easier for journalists to drill down on loans that were made to businesses in their coverage areas, as well as the demographics of who got the loans. The National Press Foundation has also produced a downloadable guide with tips, insights and resources to help journalists track the COVID cash.
The DataKindDC datasets aggregate total loan amounts for multiple geographic areas. Each area is paired with demographic information and recent election results, so the figures can be mapped and compared across communities.
Here are details of the datasets and how journalists might use them:
- Full database of loans: This is the complete SBA dataset from Aug. 8, with some enhancements added, including low/mid/high loan estimates for larger loans, which were given as ranges in the raw data, and definitions of North American Industry Classification System (NAICS) industry codes at multiple levels of granularity. (Note this is a very large file and therefore is zipped.)
- Multi-loan addresses: This spreadsheet helps quickly identify addresses that have received more than one loan.
- Loans by congressional district: This dataset gives a sum of the total loan amounts for each congressional district (in yellow). It also gives demographic information for each congressional district from the U.S. Census Bureau’s 2014-2018 American Community Survey (in blue) and 2018 election result data from the MIT Election Lab (in green).
- Loans by county: This dataset gives a sum of the total loan amounts for each county (in yellow). As above, it also gives demographic information for each county from the 2014-2018 American Community Survey (in blue) and 2016 election result data from the MIT Election Lab (in green).
- The code that produced the above, more documentation about the project and the additional datasets that were brought in as enhancements can be found on the project’s GitHub repository.
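For journalists who want to see the kind of processing behind these datasets, here is a minimal Python sketch of three of the steps described above: turning a loan range into low/mid/high estimates, flagging addresses with more than one loan, and summing loans by geographic area. It is not the DataKindDC code. The field names (`address`, `district`, `loan_range`), the range labels, the address normalization and the use of the midpoint as the "mid" estimate are all assumptions made for illustration; the project's GitHub repository documents the actual method.

```python
from collections import Counter, defaultdict

# Hypothetical range labels mapped to (low, high) dollar bounds.
# Labels and field names are illustrative, not the SBA's actual schema.
LOAN_RANGES = {
    "$150,000-350,000": (150_000, 350_000),
    "$350,000-1 million": (350_000, 1_000_000),
}


def loan_estimates(range_label):
    """Return (low, mid, high) estimates for a loan reported as a range.

    The midpoint is an assumed choice for the "mid" estimate.
    """
    low, high = LOAN_RANGES[range_label]
    return low, (low + high) / 2, high


def normalize_address(address):
    """Uppercase and collapse whitespace so near-identical entries match."""
    return " ".join(address.upper().split())


def multi_loan_addresses(loans):
    """Return {address: count} for addresses on more than one loan record."""
    counts = Counter(normalize_address(loan["address"]) for loan in loans)
    return {address: n for address, n in counts.items() if n > 1}


def totals_by_area(loans, area_key):
    """Sum the mid estimate of every loan, grouped by the given area field."""
    totals = defaultdict(float)
    for loan in loans:
        _, mid, _ = loan_estimates(loan["loan_range"])
        totals[loan[area_key]] += mid
    return dict(totals)


# Made-up records for demonstration only.
sample_loans = [
    {"address": "100 Main St", "district": "VA-08",
     "loan_range": "$150,000-350,000"},
    {"address": "100 main  st ", "district": "VA-08",
     "loan_range": "$350,000-1 million"},
    {"address": "5 Oak Ave", "district": "MD-05",
     "loan_range": "$150,000-350,000"},
]

repeat_addresses = multi_loan_addresses(sample_loans)
district_totals = totals_by_area(sample_loans, "district")
```

With real data, the same grouping could be run on a county field instead of a district field. Address matching in practice is messier than this sketch suggests, since suite numbers and abbreviations vary, so any matches should be checked by hand before publication.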
NPF thanks the data scientists who contributed to these datasets, in particular:
- Rich Carder
- John McCambridge
- Dave Rench McCauley
- Ken Morales
- Zeinab Mousavi
- Kyle Ogilvie
- Mitchell Shuey
- Monique Williams
- Kathy Xiong
This program is funded by the Evelyn Y. Davis Foundation. NPF is solely responsible for the content.