The cheese really puts her heart and her soul into her writing

Adding Automatic Speech Recognition (ASR) captions to DSpace repository videos

In August of 2017, I experimented with adding captions for the Deaf and hard of hearing to over 300 videos in our institutional repository using automated speech recognition technologies, and the results were… interesting.

A slightly incorrect caption and a funny look


Over the summer, I visited Cape Peninsula University of Technology in Cape Town, South Africa. CPUT has a national reputation for being attentive to the needs of students with disabilities. I visited accessibility labs on two campuses, and in the library I learned from the excellent Hillary Hartle, who had initiated a program to add subtitles to videos used in online library courses.

Hillary’s program is an excellent idea. Captions can help to provide access for the Deaf, for people in noisy environments where audio is difficult to hear, for people who need very quiet environments, and for people with particular learning disabilities. With the proper implementation, captions can also let the host website support searches based on words spoken in the videos.

In the United States, after legal action against MIT, Harvard, and Berkeley – which in Berkeley’s case resulted in the removal of tens of thousands of online lectures – universities have also recently become more attuned to the obligation to make sure that content which has been made available to the public is also accessible.

VTechWorks, our institutional repository at Virginia Tech, has a collection of over 300 videos, including lectures by Virginia Tech affiliates and guests, oral histories and interviews, and information about library programs and services. We modified the DSpace software to allow video playback, and modified it again to allow for display of closed captions. This wasn’t enough, though. Even though we’d done the work to support captions, before this project only two videos in VTechWorks had been captioned.

The library has been more successful in incorporating captions into other video collections. For example, Odyssey, our new learning objects repository, has excellent quality subtitles on all videos. I learned from Kayla McNabb, who worked on the Odyssey team, that many of the videos in Odyssey are hosted on YouTube. The team had word-for-word scripts of the dialogue spoken in each video, which were uploaded as text files to YouTube along with the videos. YouTube then did the work of synchronizing the text scripts with the audio tracks in the videos.

We don’t have the advantage of having dialogue scripts for the videos that were deposited into VTechWorks; they come from multiple sources and are typically unscripted lectures or interviews. We needed another way to create captions for VTechWorks.

Captions that are viewable in VTechWorks can be created manually with the aid of special tools, but we had a backlog of 300 videos and not enough available staff time (very rough estimate: 1500 hours) to handle this backlog in a timely manner.

In recent years, artificial intelligence techniques have rapidly improved speech recognition. In 2017, news articles and press releases reported that speech recognition technologies from Microsoft, IBM, and Google were approximately as accurate as human transcriptionists, with a 95% word accuracy rate, compared to Google’s 77% accuracy rate in 2013.
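For readers unfamiliar with the metric, word accuracy is commonly computed as one minus the word error rate: the number of word substitutions, insertions, and deletions needed to turn the machine transcript into a trusted reference transcript, divided by the number of words in the reference. The sketch below shows that calculation. I don't know exactly how the figures in the news reports were computed, and the function names and the sample sentence are illustrative only.

```python
def word_errors(reference, hypothesis):
    """Count the word-level edits (substitutions, insertions, deletions)
    separating a machine transcript from a reference transcript."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # previous[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # word missing from hypothesis
                               current[j - 1] + 1,       # extra word in hypothesis
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def word_accuracy(reference, hypothesis):
    """Return 1 - (word errors / reference length)."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference transcript is empty")
    return 1 - word_errors(reference, hypothesis) / ref_len


if __name__ == "__main__":
    spoken = "she really puts her heart and her soul into her writing"
    captioned = "the cheese really puts her heart and her soul into her writing"
    print(word_errors(spoken, captioned), round(word_accuracy(spoken, captioned), 3))
```

Note that a single misheard word can leave the accuracy figure high while still changing the meaning of the sentence.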

YouTube uses Google’s automated speech recognition technologies to generate captions for videos when scripts are not available, and has been doing so – poorly at first, and now much better – since 2009. Kayla suggested that I might upload the videos to YouTube and then download the captions for use in our repository. I decided to attempt this, with the understanding that some Deaf users have found previous generations of YouTube-generated captions to be of inadequate quality, and the hope that the current generation would be much better.

A quick comparison of the resulting captions with human created captions

First, look at human generated captions, created by a non-professional, so that you have a basis for comparison:

Human-generated captions are, in general, helpful, though imperfect. The content of the videos is made understandable, but there are occasional mistakes. In these videos, disfluencies such as “um..” and “uh..” may be removed, and the transcription is not always verbatim. Sometimes repetitions or grammatical accidents are left out.

Then, try these machine generated captions that resulted from the experiment:

These are good enough to give the viewer the gist, and sometimes a close to verbatim transcript when they are working well. But the lack of punctuation can sometimes be an obstacle. Additionally, because of a problem with this implementation, it’s confusing initially that two captions are on the screen at once, rolling down instead of rolling up.

Strong points of automated captioning in this experiment

  • For many videos where the audio quality was acceptable, the word accuracy was indeed very good, and probably more accurate than humans.
  • The automated system was not confused by accents of non-native English speakers.
  • The automated system was generally not stumped by technical words or domain specific language, which occurs commonly in college engineering lectures.
  • As mentioned, for a batch of videos of this size, this method was much faster than using humans to create captions.

Weak points of the automated captioning in this experiment

  • When the system makes mistakes, they can be nonsensical or worse. For example, “the cheese” does not put her heart and soul into her writing – but “she” does.
  • There are no periods, question marks, exclamation points, or other punctuation in the computer generated captioning. Good pandas know that punctuation can be important. Jason Ronallo, who blogs occasionally about topics in digital video at North Carolina State University Libraries, predicted this issue in an email conversation and recommended a different process where voice recognition transcripts are generated first and then parsed into sentences before they are synchronized with videos and the resulting captions are generated.
  • Audio quality problems can completely break the system. On one video with poor audio quality resulting from poor microphones and outdoor environmental noise, the YouTube Data API misinterpreted English at the beginning of the video as Spanish. This resulted in the entire English video being captioned as gibberish Spanish words completely unrelated to the content. Videos where the speakers wore microphones exhibited high word accuracy, but we don’t always have this situation. Even when the speaker has a microphone, the emcee introducing the speaker at the beginning of the event may not, which could cause a Deaf viewer to immediately have low confidence in the accuracy of the captions.

In October of 2017, machine learning researcher Awni Hannun documented a set of similar problems with speech recognition, with some background information.

Workflow and tools

The basic tools used to obtain the captions and add them to the VTechWorks repository are outlined in the figure below.

  • An SQL query to find videos in the DSpace PostgreSQL database, exported as a CSV file
  • A Python script that parses the results of the above SQL query, retrieves each video from its location in the DSpace assetstore, and uploads the video to YouTube using the YouTube Data API
  • A second Python script to download the captions for each video, remove some formatting that caused display problems in the Firefox browser, and create a DSpace Simple Archive Format directory for each video to be used with the DSpace ItemUpdate tool
  • The DSpace ItemUpdate tool, which loads the captions into existing video items on VTechWorks
The bitstream metadata from the SQL query results and the files in the DSpace assetstore served as inputs to a process to create captions and add them to the items. The pipeline begins by uploading videos into the YouTube cloud where ASR captioning happens. The captions are then downloaded and packaged so that the DSpace ItemUpdate tool can add them to the items.
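Since the original scripts were discarded, here is a minimal sketch of what the DSpace-facing parts of such a pipeline might look like. It is not the code that was run. It assumes the stock assetstore layout (three nested directories named after the first three pairs of characters of a bitstream's internal_id), assumes ItemUpdate matches items on dc.identifier.uri, and uses hypothetical CSV column names, file names, and handle URIs.

```python
import csv
from pathlib import Path
from xml.sax.saxutils import escape


def assetstore_path(assetstore_root, internal_id):
    """Locate a bitstream file from its internal_id.

    Assumes the stock DSpace layout: three nested directories named after
    the first three pairs of characters of internal_id.
    """
    return Path(assetstore_root, internal_id[0:2], internal_id[2:4],
                internal_id[4:6], internal_id)


def read_video_rows(csv_path):
    """Read the exported query results as a list of dicts, one per video.

    The column names depend on the SQL query; none are assumed here.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def write_item_dir(parent, item_uri, caption_name, caption_text):
    """Create one Simple Archive Format item directory for ItemUpdate.

    item_uri is the value of the item's dc.identifier.uri field; the
    directory is named after the last path segment of that URI.
    """
    item_dir = Path(parent, item_uri.rstrip("/").split("/")[-1])
    item_dir.mkdir(parents=True, exist_ok=True)
    # dublin_core.xml carries the field used to find the existing item.
    (item_dir / "dublin_core.xml").write_text(
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<dublin_core>\n"
        '  <dcvalue element="identifier" qualifier="uri">'
        f"{escape(item_uri)}</dcvalue>\n"
        "</dublin_core>\n",
        encoding="utf-8")
    # The caption file itself, plus a contents file that lists it.
    (item_dir / caption_name).write_text(caption_text, encoding="utf-8")
    (item_dir / "contents").write_text(caption_name + "\n", encoding="utf-8")
    return item_dir
```

An ItemUpdate run over the parent directory would then attach each caption file to its item. The exact command-line options vary by DSpace version, so check the documentation for your installation.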


Problems particular to this implementation

In addition to the general problems of automated speech recognition discussed above, there were some problems particular to this implementation.

  • Confusing caption display. YouTube implements roll-up captions, additionally highlighting each word (paint-on) as it is spoken in the video. This is a reasonable choice, because it gives the viewer specific timing context, while also giving the viewer a strong hint that the captions were machine generated. However, the video viewer in VTechWorks doesn’t have the same caption display code as YouTube, and confusingly, two captions appear on the screen at the same time, rolling down instead of rolling up. This could be fixed with some additional processing in the second Python script, or perhaps a change to the video viewing technology in VTechWorks, but I haven’t come up with the proper recipe to make it work in all major browser families yet.
  • The YouTube Daily Upload Limit. At the time of this experiment, YouTube limited video uploads using the YouTube Data API to 100 uploads per day. I was not aware of this when I started the project, so I had to split the CSV file from the SQL query into multiple files and process the 310 videos over 4 days instead of 1.
  • Duplicate Videos on YouTube. The workflow broke on a few videos that YouTube detected as already uploaded, because YouTube had not completed processing them after the initial upload.
  • Relying on the YouTube APIs. While I couldn’t find anything in the YouTube Terms of Service that prevented this usage, and Google is generally very generous to educational institutions and understands the importance of accessibility, it’s not explicitly clear that the API was intended to be used in this manner, or that this will work on a continuing basis.
  • Captions are on by default. The user of VTechWorks can switch them off, but it perhaps should be the opposite: off by default, and a user can turn captions on if they are desired.
  • Multiple videos per item. The scripts didn’t work with multiple videos assigned to one DSpace item. Some of this problem lies with the original implementation of video in VTechWorks, which naively expected that items would not have more than one video.
  • Complicated workflow and suboptimal DSpace integration. This was meant to be a one-time experiment (the scripts have even been discarded), but that led to the workflow being overly complicated. This may be better implemented in a DSpace repository using the REST API, or as a DSpace curation task. For now, though, we’ve decided to work on a process for human generated captions.
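Two of the problems above lend themselves to short sketches. The first function splits the list of videos into batches that respect a daily upload limit, which is what I did by hand with the CSV file. The second illustrates one kind of "additional processing" for captions: removing per-word inline tags from cue text. It assumes the downloaded captions are WebVTT files that contain inline timestamp and class tags, which is an assumption rather than a record of what the original script did, and it does not by itself resolve overlapping cues. The function names and the default limit are illustrative.

```python
import re

# Matches WebVTT inline tags such as <00:00:01.500>, <c>, and </c>.
INLINE_TAG = re.compile(r"<[^>]+>")


def batches(rows, daily_limit=100):
    """Split rows into consecutive lists of at most daily_limit items."""
    if daily_limit < 1:
        raise ValueError("daily_limit must be at least 1")
    return [rows[i:i + daily_limit] for i in range(0, len(rows), daily_limit)]


def strip_inline_tags(vtt_text):
    """Remove inline tags from cue text, leaving timing lines untouched."""
    cleaned = []
    for line in vtt_text.splitlines():
        if "-->" in line:
            cleaned.append(line)  # cue timing line: keep exactly as-is
        else:
            cleaned.append(INLINE_TAG.sub("", line))
    return "\n".join(cleaned) + "\n"


if __name__ == "__main__":
    # 310 videos at 100 per day means four days of uploads.
    print([len(batch) for batch in batches(list(range(310)))])
```

Any change of this kind would still need to be checked in each major browser family, since caption rendering differs between them.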

Congratulations from the director, then curse words

Initially, Gail McMillan, Director of Scholarly Communications at Virginia Tech’s University Libraries, congratulated me – while she was on vacation! – for taking this step toward inclusive accessibility. I was proud. She wrote back later, though, to tell me of a serious problem. Captions in a video from an important guest lecturer contained a very serious curse word which had not actually been spoken, and which was extremely inappropriate in the context of the rest of the conversation. (This word appears in another video, but it was actually spoken, and in an artistic and appropriate context.)

I removed the captions for the video with the curse word that wasn’t actually spoken, and I’ll add them back to the video when they’ve been fixed. So the experiment wasn’t as much of a success as I’d hoped. But good news: we have an answer to the quality problems.

I’m excited to report that, with the support of the libraries, I am hiring a student to fix quality problems with the captions using YouTube’s caption adjustment tools, as was demonstrated to me by Hillary Hartle at CPUT. The student will also caption new videos as they are added to the repository. My feeling is that the experiment was not a complete failure, because it will indirectly lead to more and better captions and better accessibility.

In summary, I strongly recommend that humans be involved in the captioning process for videos, but I am hopeful that the technology will continue to improve so that an automated solution which can create high quality captions for large batches of videos will be available in the foreseeable future.

