Creating universal open access to closed textual data at scale: Use cases from the HathiTrust Research Center



Staff, Students, Academics

Speaker: Prof Stephen Downie from the University of Illinois

Research Centre: Research Centre for Machine Learning


The HathiTrust (HT) Digital Library contains 16 million volumes (over 5.6 billion pages). Unfortunately, roughly 10 million volumes are under copyright restrictions and cannot be shared with users. To overcome this problem, the HathiTrust Research Center (HTRC) is creating a set of “non-consumptive research” services to make these closed materials more open and thus more useful to scholars. Central to this approach has been the creation and publication of the HTRC “Extracted Features” (EF) dataset, which provides unigram counts and Part-of-Speech (POS) information for each of the 5.6 billion pages. This talk will introduce two use cases that leverage the EF dataset: the “Bookworm + HathiTrust” visualization and analysis tool, and the Workset Building environment, developed to give researchers fine-grained access to the entire HT collection (both public domain and in-copyright) via the EF dataset. Each of these HTRC services is designed to open new points of access to otherwise closed data while still respecting all copyright limitations.


J. Stephen Downie is the Associate Dean for Research and a Professor at the School of Information Sciences at the University of Illinois at Urbana-Champaign. Dr. Downie is the Illinois Co-Director of the HathiTrust Research Center (HTRC). He leads the HathiTrust + Bookworm (HT+BW) text analysis project, which is creating tools to visualize the evolution of term usage over time. Professor Downie represents the HTRC on the NOVEL(TM) text mining project and the Single Interface for Music Score Searching and Analysis (SIMSSA) project, both funded by the SSHRC Partnership Grant programme. He is also the Principal Investigator on the Workset Creation for Scholarly Analysis + Data Capsules (WCSA+DC) project, funded by the Andrew W. Mellon Foundation. All of these projects share a common thread: providing large-scale analytic access to copyright-restricted cultural data. Professor Downie has been very active in establishing the Music Information Retrieval (MIR) community through his ongoing work with the International Society for Music Information Retrieval (ISMIR) conferences. He was ISMIR's founding President and now serves on the ISMIR board. In the recent past, he worked with Dunhuang Academy on the "Digital Dunhuang" project to help connect Digital Humanities scholars with the high-resolution digital materials capturing the Mogao Caves. Professor Downie holds a BA (Music Theory and Composition) along with a Master's and a PhD in Library and Information Science, all earned at the University of Western Ontario, London, Canada.

When and where

5.00pm - 6.00pm, Friday 23rd March 2018

AG21, College Building, City, University of London, St John Street, London EC1V 4PB, United Kingdom