This week Virginia Tech libraries are celebrating the dissemination and free use of data with Open Data Week. As part of Open Data Week, Philip Young, curator of the Open@VT blog, worked with Code for New River Valley, a volunteer organization of software developers, to create technology-focused sessions about web scraping and APIs.
I attended the web scraping session. Web scraping is a quick and dirty way to collect information from web pages. It can be used to gather information from pages that don’t offer an API. As an example, we were shown code that scrapes restaurant health inspection data from a website.
The Code for New River Valley group is very hands-on, so I tried to work on a small project to practice what we were learning in class. I thought that salary information at Virginia Tech would be interesting to people here, and I found a page from the Richmond Times-Dispatch where salary data for over 100,000 employees for the State (Commonwealth?) of Virginia can be queried. It looked like it would be easy to scrape, so that would be my in-class project.
It wasn’t easy to scrape. The HTML on the page was invalid. HTML tables are supposed to have opening and closing tags for each row. The opening tags for the rows were missing.
The default parser didn’t understand the broken HTML, and the Beautiful Soup web scraping library was confusing to use on markup like this. I mentioned that I was having trouble before I left the class, and the Code for NRV group had a solution for me by the next morning. The solution was to ask Beautiful Soup to use the lxml library as its parser. The Python 2 script here, created by Ben Schoenfeld, will gather this salary data and create a CSV file that can be imported into spreadsheet programs. The script requires the Beautiful Soup and lxml libraries on your system.
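For readers who want to see the shape of the problem, here is a minimal sketch in Python 3. It is not Ben's script, and it swaps in the standard library's `html.parser` tokenizer instead of Beautiful Soup and lxml so that it runs with no extra installs. The sample markup, names, and salaries are invented, and `RowCollector` and `table_to_csv` are hypothetical helper names. The idea is to stop relying on the missing opening `<tr>` tags, end a row whenever a closing `</tr>` appears, and then write the rows out with the `csv` module.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample: each row has a closing </tr> but no opening <tr>,
# mimicking the kind of breakage described above. All values are made up.
BROKEN_HTML = """
<table>
<td>Doe, Jane</td><td>Professor</td><td>$98,000</td></tr>
<td>Name Withheld</td><td>Clerk</td><td>$31,000</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect <td> text into rows, ending a row at each </tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = []      # cells of the row being built
        self._cell = None   # text pieces of the open cell, or None

    def handle_starttag(self, tag, attrs):
        # Opening <tr> tags are ignored on purpose: they may be missing.
        if tag == "td":
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = []


def table_to_csv(html):
    """Return (rows, csv_text) for the table cells found in html."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out).writerows(parser.rows)
    return parser.rows, out.getvalue()


rows, csv_text = table_to_csv(BROKEN_HTML)
print(csv_text)
```

With Beautiful Soup itself, the parser is chosen by the second argument to the constructor, as in `BeautifulSoup(html, "lxml")`.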
The data, of course, is imperfect, and how much that matters depends on your purpose.
- This data is from 2014, and it is now 2016.
- It’s not clear, without research, if the full annual salary is indicated for people who only worked part of the year.
- People who made $47,500 or less are marked as “Name Withheld”. This was an editorial decision by the Richmond Times-Dispatch.
- Do you want to compare salaries by gender? You’ll have to guess at which gender each name is…
- People with the same job title may have hugely varying responsibilities.
- Some people that you may think of as Virginia Tech employees are not on the list. For example, the football coach did not appear on the list because he was not paid with funds from state taxes.
- The data shows only base pay. For example, the car allowance and the deferred compensation are not shown for the university president.
In this case, the data is useful if you want to know the 2014 base salary of a specific person who worked at Virginia Tech the entire year, as long as that base salary was above $47,500. The original website already facilitated that usage.
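That narrow use case is easy to sketch against a scraped CSV. The example below is only an illustration: the column names (`name`, `title`, `salary`), the sample rows, and the helper names `parse_salary` and `named_salaries` are all assumptions, not the layout of the real file. It drops the “Name Withheld” rows and looks up one person's base salary.

```python
import csv
import io

# Hypothetical CSV in the shape a scraper might produce; the column names
# and every value here are assumptions for illustration only.
SAMPLE_CSV = """name,title,salary
"Doe, Jane",Professor,"$98,000"
Name Withheld,Clerk,"$31,000"
"Roe, Richard",Engineer,"$72,500"
"""

WITHHELD = "Name Withheld"


def parse_salary(text):
    """Turn a string like '$98,000' into the integer 98000."""
    return int(text.replace("$", "").replace(",", ""))


def named_salaries(csv_text):
    """Return {name: base salary} for rows whose name was not withheld."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["name"]: parse_salary(row["salary"])
        for row in reader
        if row["name"] != WITHHELD
    }


salaries = named_salaries(SAMPLE_CSV)
print(salaries.get("Doe, Jane"))  # 98000
```

Note that a dictionary keyed by name would silently merge two employees who share a name, so a real analysis would need a better key.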
There may be other ways to obtain the data:
- Looking for other online sources
- Filing an FOI request of one’s own (this could be time-consuming and costly)
- Just asking the newspaper for the data in one file (they might share)
But using the web scraper was fun, quick (with help), and applicable to other problems.
Is the salary data truly Open Data, and should it be posted here?
UPDATE: Philip has provided a link to a salary database at the Collegiate Times. It has some nice features, including the ability to look at salaries at select other universities, and the ability to look at a specific department within Virginia Tech. Unlike the data from the Richmond Times-Dispatch, the database lists some salaries below $47,500. The data does seem to be from one year prior to the marked date. For example, the data listed for 2014 appears to be from 2013.