When I visited Cape Town, South Africa, this year for the exchange program between the libraries at Virginia Tech and Cape Peninsula University of Technology, I was inspired by three projects related to statistics and visualization:
- A national library statistics database hosted by Cape Peninsula University of Technology
- An electronic resources usage and visualization tool built on Splunk by the Information Technology department
- A large data visualisation wall and classroom facility in the library at the neighboring University of Cape Town
I surprised myself by finding inspiration in these projects, because this is not one of my normal areas of focus. However, I think that people at Virginia Tech will find some of the concepts from these projects interesting. There are staff participating in data collection for library assessment, and I have heard that we may even hire someone to help analyze and visualize that data.
I wanted to do something to visualize Virginia Tech data, but nothing ambitious; I’m a beginner. I looked at one data set from one project at one library, and created very simple figures from the data. I chose something that my colleagues know well, so that they can quickly judge whether the visualizations match reality.
I decided to look at the availability status of electronic theses and dissertations, or ETDs. Each semester, graduating Masters and PhD students submit their theses and dissertations electronically to the Graduate School at Virginia Tech. When the graduate school approves a paper, the graduate school system uploads a copy of the paper, plus descriptive information – metadata – to the library. One part of the metadata that is sent to the library is the availability status. Upon submission, a student selects whether the ETD will be:
- available to the public immediately OR
- available to the public only after an embargo period (typically 18 months) has passed OR
- available only to people who access the ETD from Virginia Tech’s network OR
- embargoed for 5 years and then made available only on the Virginia Tech network (this status is applied automatically to creative writing ETDs)
I was curious about these statuses, and wondered how many items are submitted with each. I used Python to parse data used by the ETD processing system. Most of these visualizations were created with PyGal, a Python library that creates SVG graphics that can be embedded in web pages.
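A minimal sketch of this kind of tally is shown here. It is not the actual script: it assumes a tab-separated index with the availability status in the third column, and the record IDs, column order, and status labels are all hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample of a tab-separated index. The real file layout is not
# described in this post, so the column order here is an assumption.
SAMPLE_INDEX = (
    "etd-001\t2015-04-30\tunrestricted\n"
    "etd-002\t2015-05-02\twithheld\n"
    "etd-003\t2015-05-03\tunrestricted\n"
    "etd-004\t2015-05-07\trestricted\n"
)


def count_by_status(index_file):
    """Tally ETD records by the availability status in the third column."""
    reader = csv.reader(index_file, delimiter="\t")
    # Skip rows that are too short to contain a status at all.
    return Counter(row[2] for row in reader if len(row) >= 3)


counts = count_by_status(io.StringIO(SAMPLE_INDEX))
for status, n in counts.most_common():
    print(f"{status}: {n}")
```

The resulting counts are the kind of input that a bar chart of statuses needs.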
ETDs by availability on initial submission
This is a simple bar graph of ETDs submitted with each status.
Clicking on a bar (except for the withheld bar, and we’ll get to that in a minute) yields a list of ETDs that were submitted with the specified availability status, which can be sorted by clicking on the column headers. When an ETD is available in VTechWorks, Virginia Tech’s institutional repository, the title is linked to the document.
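One way to get clickable bars is sketched below. PyGal accepts a dictionary in place of a plain number for each value, and an "xlink" entry in that dictionary turns the bar into a link. Whether the real script does it this way is an assumption; the counts and the URL scheme are made up for illustration.

```python
def build_series(counts, table_url_for):
    """Turn {status: count} into PyGal-style value dicts with a link each."""
    return [
        (status, [{"value": n, "xlink": table_url_for(status)}])
        for status, n in sorted(counts.items())
    ]


def render_bar_chart(series, path):
    """Write the chart as SVG. Requires the third-party pygal package."""
    import pygal  # imported lazily so build_series works without it

    chart = pygal.Bar()
    chart.title = "ETDs by availability on initial submission"
    for label, values in series:
        chart.add(label, values)
    chart.render_to_file(path)


# Hypothetical counts and URL scheme, for illustration only.
example_counts = {"unrestricted": 20, "withheld": 7, "restricted": 3}
series = build_series(example_counts, lambda status: f"tables/{status}.html")
# render_bar_chart(series, "availability.svg")  # run this if pygal is installed
```

Keeping the data preparation separate from the rendering makes it possible to check the links without drawing anything.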
Information from this visualization matches what my colleague Anne, who performs audits of the ETDs in the library with information provided by the graduate school, told me before I created the visualizations. Approximately one third of ETDs are restricted in some manner when they are submitted.
Current embargo status of ETDs submitted as withheld
When the ‘Withheld’ bar from the graph of availability upon submission is clicked, this graph is shown. It shows the number of initially “Withheld” ETDs that are still embargoed, compared to the number of ETDs that have become available.
Clicking on either slice of the pie yields another table showing ETDs in the selected category. The “Still embargoed” ETDs in the table are not linked to records in VTechWorks, of course, because they are still unavailable there, but the table is currently the easiest way we have of discovering which previously withheld ETDs are soon to be published. This feature was suggested by Anne.
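The split behind the two pie slices can be sketched as follows. This assumes that each withheld record carries a release date; the field name "release_date" and the sample records are hypothetical.

```python
from datetime import date


def split_withheld(records, today):
    """Partition withheld ETDs by whether their release date has passed."""
    still_embargoed = [r for r in records if r["release_date"] > today]
    now_available = [r for r in records if r["release_date"] <= today]
    # Soonest release first, so upcoming publications appear at the top.
    still_embargoed.sort(key=lambda r: r["release_date"])
    return still_embargoed, now_available


# Hypothetical records, for illustration only.
records = [
    {"id": "etd-002", "release_date": date(2016, 11, 2)},
    {"id": "etd-005", "release_date": date(2015, 6, 1)},
    {"id": "etd-009", "release_date": date(2016, 3, 15)},
]
embargoed, available = split_withheld(records, today=date(2015, 12, 1))
print([r["id"] for r in embargoed])
print([r["id"] for r in available])
```

Sorting the embargoed list by release date is what makes the table useful for spotting ETDs that are about to be published.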
ETDs by submission period
Some of these amateur experiments didn’t work out particularly well. I segmented ETD submissions into time periods that would correspond roughly to ETD processing periods for fall semester and spring semester. The red area is the number of unrestricted ETDs processed; the blue area includes ETDs with other statuses as well. One can click on the blue points to go to the tables of ETDs. I was hoping to look at two things with this stacked line graph. Has the number of ETDs that we process per semester changed? Has the proportion of ETDs that are processed as unrestricted changed over time?
In retrospect, I think it was bad to try to answer two questions with one visualization, and this is probably the wrong type of visualization for those questions anyway. Also, the number of ETDs that are processed presumably corresponds directly with the number of students.
The most notable thing is the lovely saw-like shape. We process more ETDs in the spring than we do in the fall. It’s also interesting that there was a sharper spike in the spring of 2015.
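The two questions can also be computed separately before any plotting, as in this sketch. The period boundaries are an assumption (January to June counted as spring, July to December as fall), and the sample records are invented.

```python
from collections import defaultdict
from datetime import date


def period_for(submitted):
    """Return a sortable (year, half) key: half 1 is spring, half 2 is fall."""
    return (submitted.year, 1 if submitted.month <= 6 else 2)


def summarize(records):
    """Return {period: (total, unrestricted share)} in chronological order."""
    totals = defaultdict(int)
    unrestricted = defaultdict(int)
    for submitted, status in records:
        period = period_for(submitted)
        totals[period] += 1
        if status == "unrestricted":
            unrestricted[period] += 1
    return {p: (totals[p], unrestricted[p] / totals[p]) for p in sorted(totals)}


# Hypothetical (submission date, status) pairs, for illustration only.
sample = [
    (date(2015, 4, 30), "unrestricted"),
    (date(2015, 5, 2), "withheld"),
    (date(2015, 5, 9), "unrestricted"),
    (date(2015, 12, 10), "unrestricted"),
    (date(2015, 12, 11), "restricted"),
]
summary = summarize(sample)
for (year, half), (total, share) in summary.items():
    season = "spring" if half == 1 else "fall"
    print(f"{year} {season}: {total} ETDs, {share:.0%} unrestricted")
```

With the totals and the shares in separate columns, each question could get its own simple chart.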
Mosaic Plot: Unrestricted ETDs by department
This diagram is the visualization that I’m most excited about, although it’s the one that I haven’t fully automated yet. It’s a simplified mosaic plot that’s intended to visualize, at a glance, the likelihood that students from each academic department will publish their ETDs as unrestricted. I’ve purposely selected only four departments for this illustration.
Each colored bar represents an academic department. The bars are split into two pieces by a dividing gap. The area above the gap represents the proportion of ETDs from that department that were published as “Unrestricted”, and the area below the gap represents the proportion of ETDs from that department that were published with restrictions of some kind. Incidentally, the width of each bar symbolizes the total number of ETDs published by each department. The wider the bar, the more ETDs from that department.
Glancing at this chart, one could conclude that mathematics and architecture students published nearly all (35 of 36) of their ETDs as unrestricted, that computer science students published ETDs with a balance of unrestricted and restricted availability, and that educational leadership and policy students published most of their ETDs (22 of 30) with restrictions.
Don’t conclude that just yet, though. This plot is meant only as a demonstration of the visualization technique. It is based on fewer than four months of data, even though this submission system has been in use for more than four years. The data for previous years needs to be cleaned up to standardize department names and to correct for a metadata mapping error before the complete set can be plotted.
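The geometry of such a plot is simple enough to sketch. In the example below, each bar's width is proportional to the department's total, and the bar is divided by the share of unrestricted ETDs. The department names and counts are placeholders rather than real data, and the gaps between and within bars are left out for brevity.

```python
def mosaic_layout(dept_counts, chart_width=100.0, chart_height=100.0):
    """Compute bar geometry for a simplified mosaic plot.

    dept_counts maps department -> (unrestricted, restricted).
    Returns one dict per department with its x offset, its width, and the
    heights of the upper (unrestricted) and lower (restricted) pieces.
    """
    grand_total = sum(u + r for u, r in dept_counts.values())
    layout, x = [], 0.0
    for dept, (unrestricted, restricted) in dept_counts.items():
        total = unrestricted + restricted
        if total == 0:
            continue  # nothing to draw for a department with no ETDs
        width = chart_width * total / grand_total
        upper = chart_height * unrestricted / total
        layout.append({"dept": dept, "x": x, "width": width,
                       "upper": upper, "lower": chart_height - upper})
        x += width
    return layout


# Placeholder departments and counts, for illustration only.
layout = mosaic_layout({"Dept A": (30, 10), "Dept B": (15, 45)})
for bar in layout:
    print(bar)
```

The resulting rectangles could then be drawn with any SVG or plotting tool.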
I’m excited because I think that this is the type of visualization that can help to inform and guide both actions and research. In this case, we could do any of these things if we notice that a department tends to publish ETDs with a restricted status:
- Send library liaisons to the department to discuss the benefits of open access
- Conduct surveys within the department about barriers to open publication of ETDs and publish research about the results
- Work to understand the scholarly publishing needs of members of the discipline and accommodate those needs with a new class of restrictions
I think that my colleagues have done all of these things within recent years. Having this information available at a glance could have given them a quicker start.
I imagine other applications in libraries for this type of visualization – for example, looking at characteristics of particular books that might make them more likely to circulate, and then deciding whether to adjust purchasing habits based on that data. Most libraries already do this to some degree, for example by comparing usage of e-resources to circulation of printed books.
After doing this prototyping experiment, I have ideas about what makes visualizing data difficult.
Consistency of data
While I was creating the visualizations, inconsistencies in the data caused problems. A few records in our index file had missing tabs where a field was blank, causing the visualization script to crash. Some values, like department names, are inconsistent. And recent records store this data in a completely different metadata field. Keeping data consistent will help if the data is going to be analyzed later by other systems.
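A defensive parser can soften both problems. The sketch below pads short records instead of crashing and maps department name variants to one canonical form. The field count and the alias table are assumptions. Padding only helps when the missing tab was at the end of a line; a tab missing in the middle of a record still needs manual cleanup.

```python
EXPECTED_FIELDS = 4  # assumed number of columns in the index file

# Hypothetical mapping from observed variants to one canonical name.
DEPARTMENT_ALIASES = {
    "computer science": "Computer Science",
    "comp. sci.": "Computer Science",
    "dept. of computer science": "Computer Science",
}


def parse_line(line):
    """Split a tab-separated record, padding any missing trailing fields."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (EXPECTED_FIELDS - len(fields))
    return fields[:EXPECTED_FIELDS]


def normalize_department(raw):
    """Collapse whitespace and case, then look up a canonical name."""
    key = " ".join(raw.split()).lower()
    return DEPARTMENT_ALIASES.get(key, raw.strip())


print(parse_line("etd-001\t2015-04-30\tunrestricted\n"))
print(normalize_department("  Comp.  Sci. "))
```

Names that are not in the alias table pass through unchanged, so new variants show up in the output rather than being hidden.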
Designing visualizations for previously built systems
This system was built years ago, so I was limited by the data that was available. I’m fortunate that it saves a lot of data, but the data formats weren’t designed for this sort of work.
The first passes of the visualization script used a single file, and finished running in about one tenth of a second. But the ETD titles aren’t stored in this single file, so when I wanted the titles, I had to look them up in thousands of separate files. It now takes 30 seconds to run this script.
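One possible way to win back some of that time is to cache the titles after the first lookup. This is only a sketch of the idea, not something the script currently does. The function "read_title" stands in for whatever reads a title from a single ETD's file, and the cache file name is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_titles(etd_ids, read_title, cache_path):
    """Return {etd_id: title}, reading per-ETD files only for uncached IDs."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        cache = json.loads(cache_path.read_text(encoding="utf-8"))
    else:
        cache = {}
    missing = [etd_id for etd_id in etd_ids if etd_id not in cache]
    for etd_id in missing:
        cache[etd_id] = read_title(etd_id)  # the slow, one-file-per-ETD step
    if missing:
        cache_path.write_text(json.dumps(cache), encoding="utf-8")
    return {etd_id: cache[etd_id] for etd_id in etd_ids}


# Demonstration with a stand-in reader that records which IDs it was asked for.
calls = []


def fake_read_title(etd_id):
    calls.append(etd_id)
    return f"Title of {etd_id}"


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "titles.json"
    load_titles(["etd-001", "etd-002"], fake_read_title, cache_file)
    titles = load_titles(["etd-001", "etd-002", "etd-003"], fake_read_title, cache_file)
print(calls)
```

On the second call only the new ID is looked up. The trade-off is that a cached title goes stale if the underlying record is later corrected.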
It’s hard to predict exactly what will be needed in the future, but it’s good to think about how the data might be used when designing a system.
First, these visualizations don’t work well on mobile devices.
Second, I do have a general area of concern about visualizations. Starting this year, one area of responsibility I have in the library is assisting with technology accessibility. My assumption is that visualizations, by their nature, are not directly useful to people who are blind or have severe vision impairments. I need to learn more.
What makes a good visualization?
I know how to make charts and graphs, but I don’t understand which questions are appropriate to answer with visualizations, or which visualizations work best to display different types of data. Several people have recommended that I read at least one of the books by Edward R. Tufte.
I’ve sent these visualization demonstrations to a few people in the library for feedback about accuracy and usefulness, and hope that this can stimulate some discussion. If you want to take a quick peek at these simple visualizations for yourself, you can find them here.