Helen Agüera is in Program Development with the US National Endowment for the Humanities (NEH). Mark Sweeney is in Preservation Planning at the US Library of Congress (LC). The full title of their joint presentation is: “National Digital Newspaper Program: Enhancing Access to America’s Newspapers.”
Agüera first. NEH programs: preservation & access; scholarly research; education; and public programs.
US Newspaper Program (USNP): started in 1980s; grants to do inventory & catalog all newspaper holdings within a state, and do selection microfilming. This is a partnership between NEH and LC. Its accomplishments: over 140K newspaper titles, 70 million pages of newsprint in microfilm. Every state is included (to varying degrees?). NEH funding totaling over $54 million was necessary to complete this program.
New program to provide enhanced access to newspapers by digitizing certain titles already preserved in microfilm. There’s no single library that has a complete newspaper collection, so this has to be a distributed effort in order to create a geographically representative collection. This is again a partnership between NEH and LC. LC will develop and maintain American Chronicle to make digitized papers freely accessible. This is a We the People project.
NDNP features: 1836-1922 (public domain only). Complements other dig. resources for earlier historical period. Begins chron. coverage with early 20th century and expands to earlier decades to achieve broad geogr. representation. Repurposes USNP bibliographic information for users to locate newspapers in analog formats (microfilm and print). (Interesting! So this is not an effort to eliminate paper. That’s great.)
Development phase began in May 2005: six projects digitizing a min. of 100K pages (each!) published in CA, FL, KY, NY, UT and VA from 1900 to 1910. LC contributes titles from its collection, aggregates all information, and creates a preservation framework. Prototype launched in September 2006 and the test bed results are [being?] evaluated by all partners.
No one knows what the optimal way to preserve this digital data will be. The optimal way to deal with that is to proceed in phases, and evaluate often. (Hooray for project management. We need more of that in the acad. humanities.)
Future directions: they’re planning to make awards to state projects with partners that have access to negative microfilm and digital infrastructure. (Collaboration is encouraged!) Successful projects will have an advisory board assisting in selecting titles. Titles should reflect political, economic, cultural history of the state, and have a significant chronological span (some continuity). Special consideration given to “orphan” (unavailable in digital form, papers no longer published, no recognized owner) titles. NEH awards will cover the costs of selection, digitization, and delivery of information to LC.
Mark Sweeney now, on preservation planning and [first] user interface.
Preservation is crucial for access. Their guiding principles:
– aggregate, serve, and preserve; do so consistently with missions and philosophies of NEH and LC (open/perpetual access to public; preservation of the assets that NDNP builds; demonstrate good use of taxpayer money);
– phased development (develop incrementally, keep door open for new options)
What does “open” mean to them: freely accessible; available to use and re-use; persistent identification to support citation; open technical formats; interoperability and modular architecture; open-source software. (YAY.)
The only thing that is certain is change. Change in technologies available; change in user expectations; change in preservation models.
How do we plan for the future? Content is more important than today’s system. Design “system” to be expandable and interoperable with other systems; explicitly incorporate a dev’t phase.
Practical concerns: out-of-the-box solutions have preservation challenges, so they’re doing a lot of from-scratch development with an eye to making it easy for future generations to modify it. They’re building on LC’s expertise and experience today with metadata formats. They expect to learn from their awardees.
They’re aiming to interact with different archival and data needs, and have tools to ingest, manage and distribute the data.
They distinguish between information object and data object. Info. object: original newspaper or microfilm. Data object is the digital surrogate (interesting use of the word -vz): TIFF, JP2, PDF, OCR’d text, structural metadata etc. Their archival master format is TIFF; their production master format is JPEG 2000. PDF is the derivative (end-user-oriented?) format.
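The information object / data object split can be sketched as a tiny data model. This is only an illustration of the distinction as described in the talk, not NDNP's actual schema: the class names, field names, and the `FORMAT_ROLES` table are all hypothetical, and the talk gave no role for OCR text or structural metadata, so those come back as `None`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Role(Enum):
    ARCHIVAL_MASTER = "archival master"
    PRODUCTION_MASTER = "production master"
    DERIVATIVE = "derivative"


# Roles as stated in the talk; the dictionary itself is an illustrative assumption.
FORMAT_ROLES = {
    "TIFF": Role.ARCHIVAL_MASTER,
    "JP2": Role.PRODUCTION_MASTER,
    "PDF": Role.DERIVATIVE,
}


@dataclass(frozen=True)
class InformationObject:
    """The analog original: a newspaper issue on paper or on microfilm."""
    title: str
    carrier: str  # e.g. "print" or "microfilm" (hypothetical values)


@dataclass(frozen=True)
class DataObject:
    """A digital surrogate derived from one information object."""
    source: InformationObject
    file_format: str  # e.g. "TIFF", "JP2", "PDF", "OCR text"

    def role(self) -> Optional[Role]:
        # Formats with no role named in the talk map to None.
        return FORMAT_ROLES.get(self.file_format)


issue = InformationObject(title="Example Gazette", carrier="microfilm")
surrogates = [DataObject(issue, fmt) for fmt in ("TIFF", "JP2", "PDF", "OCR text")]
```

Each surrogate keeps a pointer back to the analog original, which fits the earlier point that users should still be able to locate the microfilm or print copy.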
More information about NDNP (including a lot of information on its technical specs) can be found on its website.
The prototype interface beta is available in the LC newspaper reading library. It’s behind a firewall, but from what I understand they expect to release it into the wild by end of January or beginning of February 2007.
Q&A. Mark mentioned that their OCR (which they show you upon request! that’s cool) is uncorrected/unproofread. I wonder if they use it in searching, and if so, how they account for inaccuracies? Mark says: currently it’s not a requirement that the participants correct the OCR, although some correct headlines and important stuff like that. The assumption is that significant words are going to appear multiple times; so if you bomb on recognizing the first couple of occurrences, there’d still be a good chance of it being recognized. They’ve thought of asking the reader community to help with proofreading, but that’s not part of the current development phase. That’s fair enough: the enterprise is huge, this is a beginning phase, and they’re already doing a marvelous job.
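The "significant words appear multiple times" assumption is easy to put rough numbers on. A minimal sketch, not anything presented in the talk: it assumes every occurrence of a word is recognized independently with the same accuracy, which is optimistic, since OCR errors tend to cluster on badly filmed pages. The function name and the 80% accuracy figure are made up for illustration.

```python
def chance_word_is_findable(per_occurrence_accuracy: float, occurrences: int) -> float:
    """Probability that at least one occurrence of a word is OCR'd correctly.

    Assumes independent recognition of each occurrence (an illustrative
    simplification, not a claim about NDNP's OCR).
    """
    if not 0.0 <= per_occurrence_accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if occurrences < 0:
        raise ValueError("occurrences must be non-negative")
    # Complement of "every single occurrence was misread".
    return 1.0 - (1.0 - per_occurrence_accuracy) ** occurrences


# Hypothetical numbers: 80% accuracy per occurrence, word appears 1 or 3 times.
once = chance_word_is_findable(0.8, 1)
thrice = chance_word_is_findable(0.8, 3)
print(round(once, 3), round(thrice, 3))
```

Under those assumptions a word that appears three times on a page is searchable about 99% of the time, versus 80% for a word that appears once, which is why uncorrected OCR can still work reasonably well for full-text search.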
Mary Molinaro from Univ. of KY: with OCR, they’re finding the need to strike the balance between good and good-enough. It comes down to the quality of the microfilm, but the OCR technology is actually really good, and is good enough. From their perspective, it wouldn’t be worth it to go back and correct it all. Newspapers are very challenging digitization subjects.
Mark: newspapers are great because they’re so interesting to so many people. This newspaper repository is the first true digital repository that LC has built. On several different levels, this program is a prototype for other such programs, with different digitized objects.
Other questions are about the interface prototype, so I won’t reproduce them since I can’t show you the prototype itself. :) If you’re at one of the institutions developing the initial phase of this program, you’ll be able to see it around mid-October.