Improving access and findability by integrating repository curation tasks with cloud based audio and video transcription

Many institutional repositories were designed to store and provide access primarily to text-based file formats, such as PDF and Microsoft Word, which replicate the format of traditional academic print journal articles. However, increased interest in the use of multimedia in education and scholarly communications, in part due to the transliteracy movement, has motivated institutional repositories to accept additional file formats.

Some repositories have started to store audio files, such as primary documents in oral history projects, and video files, such as seminars presented by guest lecturers. Most repositories are inherently capable of storing these files, because they can store any digital file type. The challenge, though, is in providing a level of findability and access for multimedia equivalent to the access provided for text-based formats.


In addition to the typical metadata assigned to most items in digital libraries – title, author, and subject keywords – textual file formats are locatable due to full-text search capabilities. Most of the contents of PDF and Word documents in institutional repositories are indexed and made searchable through the repository's standard search function. Most digital repositories do not have a similar facility for audio and video files, because the underlying technology does not have the ability to “know” the full contents of non-textual files with any level of semantic understanding.


Users of institutional repositories, and the web in general, sometimes possess sensory impairments that can make access to information a challenge. Visual impairments can become barriers to access of textual materials in institutional repositories. Access for those with visual impairments has been addressed, to some degree, by the creation of web usability guidelines and specialized software that can read text out loud.

Similarly, as institutional repositories begin to collect content items that are primarily based on sound, issues in access for those with auditory impairments may surface. There are fewer guidelines and tools for providing access for these impairments. One bright spot is that many media players can now display text captions, or subtitles, in the video viewing area.

Even users without auditory impairments may at times wish to deal with textual versions of information when audio would be inappropriate, or when attempting to save time.

Transcription as a means of improving findability and access

Transcription is one way to improve findability and access for items whose information content is primarily based on the spoken word. A document is created that contains a textual version of the spoken audio in a multimedia file. The textual document serves as a means for the system to provide indexing of the full informational content of the item. Additionally, the textual document provides access to those who prefer or need a version of the information that can be read.
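To make the indexing point concrete, here is a minimal sketch of how a transcript lets a system find an audio or video item by the words spoken in it. This is not how any particular repository implements full-text search; the item IDs, the sample transcripts, and the function names are all made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(transcripts):
    """Map each word to the set of item IDs whose transcript contains it."""
    index = defaultdict(set)
    for item_id, text in transcripts.items():
        for word in tokenize(text):
            index[word].add(item_id)
    return index


def search(index, query):
    """Return sorted IDs of items whose transcripts contain every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical item IDs and transcript text, for illustration only.
transcripts = {
    "oral-history-001": "We moved to the mill town in the spring of 1952.",
    "seminar-014": "Today's seminar covers speech recognition in digital libraries.",
}
index = build_index(transcripts)
print(search(index, "speech recognition"))
```

Without the transcript, the only route to either item would be its title, author, and subject keywords.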

Benefits of automation and integration

Library-hosted repositories face occasional obstacles to recruiting new content. One such obstacle is the lack of staff time to provide metadata values for available digitized and born-digital items. Libraries may soon be faced with a large amount of digital content to place online, but not enough time to describe it. Automating the transcription process is one means of getting items online quickly and less expensively. Items of special import can be transcribed manually by expert staff as they are located by searching for keywords in the automated transcriptions.

There are benefits to integrating this automation with institutional repository software. As items are automatically transcribed, the newly created transcriptions can be placed in the repository items alongside the original content.

DSpace Curation Framework

DSpace offers a framework for curation tasks. Typical tasks include checking items for viruses and enforcing metadata constraints. I am considering using this framework for submission of transcription tasks.

Two cloud based technology alternatives

MAVIS and GreenButton inCus

MAVIS, the Microsoft Research Audio Video Indexing System, is a speech recognition application that is capable of providing transcription and closed captioning of audio and video files. MAVIS has been deployed as a cloud service on Microsoft’s Azure platform. GreenButton inCus is a multimedia search application built on MAVIS, and offers a REST API for the platform.

Benefits: This is a very inexpensive means of providing audio transcripts very quickly. Speech recognition is an active research area and will continue to improve in quality. This method allows for the creation of subtitle and caption files which can be used with media playback. Transcription quality is likely to be consistent between files of consistent audio quality. The Georgia Archives has participated in a successful pilot project with this technology.

Drawbacks: Effectiveness of transcription is dependent on the quality of the audio. This approach may be limited to a small subset of human languages depending on the capabilities of MAVIS. The approach is more appropriate when there is a single speaker, rather than multiple speakers.
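The subtitle and caption files mentioned among the benefits can be quite simple. The sketch below writes the widely used SubRip (.srt) caption layout from a list of timed segments. It assumes the speech recognition service returns each phrase with a start and end time in seconds; I have not confirmed what format MAVIS or inCus actually returns, so the `(start, end, text)` tuples and the function names are assumptions.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render (start, end, text) segments as SubRip caption text.

    Each caption is a sequence number, a time range, and the caption text;
    captions are separated by a blank line.
    """
    blocks = []
    for number, (start, end, text) in enumerate(segments, start=1):
        time_range = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{number}\n{time_range}\n{text.strip()}\n")
    return "\n".join(blocks)


# Hypothetical recognizer output: (start seconds, end seconds, text).
segments = [
    (0.0, 2.5, "Welcome to today's seminar."),
    (2.5, 6.0, "Our guest will speak about oral history."),
]
print(to_srt(segments))
```

The resulting text could be saved next to the original media file so that a player with caption support can load it.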

Amazon Mechanical Turk and CastingWords

Mechanical Turk, branded as “artificial artificial intelligence”, is a cloud service from Amazon that allows the submission of tasks to a pool of web workers (these are real people) who are willing to perform the task for a named price. This pool of web workers is routinely used for transcription tasks. Amazon offers its own API for Mechanical Turk, and CastingWords, an unaffiliated transcription service built on Mechanical Turk, offers its own RESTish API.

Benefits: Transcription quality will typically be very high (though it is variable). Cost model allows for fixed budgets in advance of transcription. Possibly applicable to a wider range of human languages.

Drawbacks: In theory, transcription quality could vary enormously from near-perfect to nonsensical. This method is likely to be more expensive than using speech recognition software. It is unlikely that usable subtitle and caption files can be generated with this approach. This approach may not be appropriate for audio or video that is restricted to a certain audience, for example due to sensitive content or intellectual property concerns. A transcription may never be produced at all if no worker accepts the task at the listed price.

Design and Implementation Challenges

Asynchronous nature

Curation automation frameworks in institutional repositories are typically designed for synchronous tasks; short tasks are performed on an item and a result is expected in limited time so that the system can move on to the next task. This is in contrast to the cloud’s web service architecture, where results are often not returned in interactive time, or in some cases at all. In the Mechanical Turk system, for instance, it’s possible that no transcript will be returned because the task was not accepted by any web worker.
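One way to bridge this gap is to split the work in two: a short task that submits the file and records a pending job, and a later pass that checks on pending jobs. The sketch below shows only the second half. It does not use the DSpace curation API or any real service client; `TranscriptionJob`, `poll`, and the `fetch_result` callable are hypothetical names, and the service is simulated with a dictionary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    PENDING = "pending"
    COMPLETE = "complete"
    EXPIRED = "expired"  # gave up waiting, e.g. no worker accepted the task


@dataclass
class TranscriptionJob:
    item_id: str          # repository item the media file belongs to
    remote_id: str        # identifier returned by the (hypothetical) service
    submitted_at: float   # submission time, in seconds
    status: Status = Status.PENDING
    transcript: Optional[str] = None


def poll(job: TranscriptionJob,
         fetch_result: Callable[[str], Optional[str]],
         now: float,
         max_wait_seconds: float) -> TranscriptionJob:
    """Check a pending job once and update its status in place.

    fetch_result(remote_id) is assumed to return the transcript text,
    or None if the service has nothing to return yet.
    """
    if job.status is not Status.PENDING:
        return job
    text = fetch_result(job.remote_id)
    if text is not None:
        job.status = Status.COMPLETE
        job.transcript = text
    elif now - job.submitted_at > max_wait_seconds:
        job.status = Status.EXPIRED
    return job


# Simulated service: only one of the two jobs ever gets a transcript.
finished = {"job-a": "Transcript text for the first recording."}
jobs = [
    TranscriptionJob("item-1", "job-a", submitted_at=0.0),
    TranscriptionJob("item-2", "job-b", submitted_at=0.0),
]
for job in jobs:
    poll(job, finished.get, now=90_000.0, max_wait_seconds=86_400.0)
    print(job.item_id, job.status.value)
```

A job that expires is an ordinary outcome rather than an error, so it could be resubmitted at a higher price or handed to staff for manual transcription.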


The billing systems are different for each of these services.  I’m not sure if the billing workflow should be part of the repository workflow, or if it should stay outside of the repository altogether.


I am imagining a hardware device to be used by historians, archivists, and others who work with transcriptions.  After the recording is made, a button is pressed on the device which sends the recording to the repository, where it is automatically transcribed.

