How to Extract Data from Web Pages Even if You Don’t Know Programming
Program Date: Jan. 10, 2022

5 takeaways:

Get comfortable with HTML. You don’t need to know how to code, but a basic understanding of HTML helps. First, right-click on the web page with the data you want and click “Inspect.” This will show you the page’s HTML. Examine it to find the charts and tables you want. “If you hover over it, it will highlight the areas and the HTML to give you a sense and idea of what line of the code you should be working with,” said Mark Walker, FOIA coordinator for The New York Times bureau in Washington.
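For orientation, the kind of element you are hunting for in the inspector usually looks something like this. The table below is entirely made up; the point is the shape: data tables are built from `<table>`, `<tr>` (row) and `<td>`/`<th>` (cell) tags.

```html
<!-- Hypothetical example of a data table as it appears in "Inspect" -->
<table id="salaries">
  <tr><th>Name</th><th>Agency</th><th>Salary</th></tr>
  <tr><td>Jane Doe</td><td>Justice Dept.</td><td>$120,000</td></tr>
</table>
```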

Use Google Sheets, not Excel. IMPORTHTML is a built-in Google Sheets function with no Excel counterpart. It lets you quickly pull a table or list into a sheet in clean, structured form; doing the same job by copying and pasting could take hours of manual cleanup, Walker said. “This saves me a lot of time,” he said.
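As a sketch, the function takes three arguments: the page URL, the word “table” or “list,” and a 1-based index saying which table or list on the page you want. The URL below is a placeholder, not a real data source:

```
=IMPORTHTML("https://example.com/agency-data", "table", 1)
```

Typed into any cell, this fills the sheet with the first HTML table found on that page.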

Rules of the road. Once you’ve imported the data, you can start analyzing it, but Walker says never to build pivot tables or apply other functions until you’ve created a new sheet: keep one sheet for the raw, original data and another for your working data. As soon as you type an equals sign (=), Google Sheets will suggest the function it thinks you want, but Walker always types commands out manually. Unsure whether you are dealing with a table or a list? Walker said it’s OK to “eyeball” it; if you guess wrong, the function will simply return an error message, he said.
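One way to follow the two-sheet rule inside a single spreadsheet (the sheet names here are hypothetical): run the import on a sheet named “Raw,” then pull a read-only copy onto a “Working” sheet by reference, so the original data is never edited directly:

```
=QUERY(Raw!A:F, "select *")
```

Placed in cell A1 of the Working sheet, this mirrors columns A through F of the Raw sheet; all pivot tables and formulas then operate on the copy.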

Know extractors vs. crawlers. Extractors pull data from a single web page, while crawlers follow links and pull data from many pages, as you would need on a site like the Wayback Machine, Walker explained. “One thing about data crawling, and I found this out the hard way, was that it will scrape everything and then it’s … all going to be on your computer.” The sheer volume of downloaded files can start triggering error messages, Walker said. He recommends a program called DownThemAll!, which downloads the scraped files into a zip file.
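To make the extractor idea concrete, here is a minimal single-page extractor using only Python’s standard library. This is an illustrative sketch, not the tool Walker described: it pulls every table cell out of one page’s HTML. The sample HTML is made up, so nothing is fetched from the network.

```python
# Minimal single-page extractor: collects <td>/<th> cell text, row by row.
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of each table cell, grouped into rows."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows
        self._row = None      # row currently being built
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


# Made-up HTML standing in for a page you inspected:
sample = """
<table>
  <tr><th>Agency</th><th>Requests</th></tr>
  <tr><td>EPA</td><td>1,234</td></tr>
</table>
"""
parser = TableExtractor()
parser.feed(sample)
print(parser.rows)  # [['Agency', 'Requests'], ['EPA', '1,234']]
```

A crawler would wrap logic like this in a loop that also collects the page’s links and visits them in turn, which is exactly why it can bury your computer in files.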

Web scraping is not a crime. But if your scraper hits certain sites too frequently, your IP address could be blocked. If that is a concern, Walker recommends working manually. “I try and make it a little slow because I don’t want to get banned or get flagged for hacking,” he said. While web scraping is not illegal, breaking someone’s website (say, the government’s) will land you in hot water. Simple web scraping done with Google Sheets will not break anything, but an inexperienced coder running Python or other scripts could. Walker reminds journalists to be “cognizant.”
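The “make it a little slow” advice can be sketched in code: pause between requests so a site does not mistake the script for an attack. The `fetch` function and URL list below are stand-ins, not a real workflow, and the two-second default is an assumption, not a universal rule.

```python
# Hedged sketch of polite scraping: a fixed pause between requests.
import time


def fetch_all(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, sleeping between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # be polite between hits
        results.append(fetch(url))
    return results


# Example run with a dummy fetch function and no real network traffic:
pages = fetch_all(
    ["page1", "page2"],
    fetch=lambda u: f"<html>{u}</html>",
    delay_seconds=0,
)
print(pages)
```

In real use, `fetch` would be a function that downloads the page, and `delay_seconds` would stay well above zero.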

You may also be interested in: A FOIA Field Guide

Speaker: 

Mark Walker, FOIA coordinator, The New York Times Washington Bureau; President, Investigative Reporters and Editors


This program was funded by the Evelyn Y. Davis Foundation. NPF is solely responsible for the content.

Resources: “Resources for HTML, web scraping and data analysis” and “Gathering the Data: Web Scraping.”