Arms on vast amounts of data.
William Y. Arms is a computer scientist currently working at Cornell. The full title of his keynote is “Humanities and social science research using vast amounts of web data.”
Examples of very large collections:
- Library of Congress: National Digital Information Infrastructure and Preservation Program
- The Internet Archive’s historical collection of the web (600 TB, terabytes)
- Large scale digitization projects: Open Content Alliance, Project Gutenberg, Google, Microsoft, Yahoo, etc.
- USC Shoah Foundation: Survivors of the Shoah (400 TB)
How will humanities and social science scholars do research on collections which are large by supercomputing standards?
“Only the computer reads every word” –Greg Crane
- Researchers interact with the collections through computer programs that act as their agents.
- Users rarely view individual items except after preliminary screening by programs.
- The collection requires a highly technical computer system, yet it is used by researchers who are not computing specialists.
- The collection is a high-performance computing system.
- Use of the collection depends on automated tools, which require state-of-the-art indexes for text and semi-structured data, natural language processing, and machine learning.
Example: the Cornell Web Lab (or is it a library? asks Arms)
The structure of text:
Manual analysis and mark-up
- skilled bibliographers and cataloguers
- manual textual markup
- semantic web tools for representing relationships (e.g., RDF, Fedora)
Semi-automated methods
- automated name recognition under human control (e.g., Perseus)
- expert-guided web crawling (e.g., iVia)
The methods above apply to collections of tens of millions of records. How do we manage billions of records?
Example: The Internet Archive web collection
The data: complete crawls of the web, every two months since 1996, with some gaps:
- range of formats and depth of crawl have increased with time
- no data from sites that are protected by robots.txt or where owners have requested not to be archived
- some missing or lost data
- metadata contains format, links, anchor text
- organized to facilitate historical access to a known URL (Wayback Machine)
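To make the robots.txt exclusion concrete, here is a minimal sketch of how a crawler might decide whether a page may be fetched. It uses Python's standard `urllib.robotparser`; the robots.txt rules, the crawler name `example-archiver`, and the URLs are made up for illustration and are not taken from the Internet Archive's actual crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content published by a site owner.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "example-archiver" is an assumed crawler name, used only for this sketch.
allowed = parser.can_fetch("example-archiver", "http://example.org/index.html")
blocked = parser.can_fetch("example-archiver", "http://example.org/private/page.html")

print(allowed, blocked)
```

Pages under the disallowed path would simply never enter the crawl, which is one reason the archive has gaps.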
The research dialog between a scholar (S) and a computer scientist (CS) goes something like this:
S: Here’s a study we’d like to do…
CS: We don’t know how to do that analysis, but would this be any use to you?
S: Not as you suggest it, but here’s another idea…
CS: That might be possible, with the following modification…
BOTH: Let’s try it and see!
Eventually we get something that is both useful from a research point of view and feasible from a computing POV.
Social Science Research:
- the web as evidence of current social events (spread of urban legends; development of legal concepts across time)
- the web as social phenomenon (political campaigns, online retailing, polarization of opinions)
Research topic example: social and information networks, joining a community. Question: what is the probability an individual will adopt a new behavior, as a function of the number of his/her friends who are adopters? New behavior could be: adopting a new technology, joining a club, etc.
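One way such a question could be turned into a computation is sketched below. This is an illustration under simplifying assumptions, not the method described in the talk: it assumes two snapshots of who has adopted (before and after) and a fixed friendship graph, and it estimates, for each k, the fraction of non-adopters with k adopter friends who adopted by the second snapshot. The function name and the tiny data set are hypothetical.

```python
from collections import defaultdict

def adoption_probability(friends, adopters_before, adopters_after):
    """Return {k: fraction of non-adopters with k adopter friends who later adopted}."""
    exposed = defaultdict(int)    # non-adopters who had k adopter friends
    converted = defaultdict(int)  # ...of whom this many adopted afterwards
    for person, neighbours in friends.items():
        if person in adopters_before:
            continue  # already an adopter, so not at risk of adopting
        k = sum(1 for f in neighbours if f in adopters_before)
        exposed[k] += 1
        if person in adopters_after:
            converted[k] += 1
    return {k: converted[k] / exposed[k] for k in sorted(exposed)}

# Made-up friendship graph and adoption snapshots.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}
adopters_before = {"ann", "bob"}
adopters_after = {"ann", "bob", "cat"}

curve = adoption_probability(friends, adopters_before, adopters_after)
print(curve)  # {0: 0.0, 1: 0.0, 2: 1.0}
```

On web-scale data the same counting would have to run over billions of records, which is where the high-performance computing side comes in.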
So, when everything is in digital form, will the library go from being the largest building on campus to being the largest computing system on campus? WA says there’s a good likelihood of that.
WA goes on to describe some of the projects on the Web Lab’s plate right now. Their descriptions can be found on the Web Lab site.
Policy issues on the use of the lab: custodianship of data; copyright; privacy.
Design guidelines for builders of large digital collections:
- every online collection or service needs an application program interface (API) for computers, not humans, to interact with the library.
- a primary methodology is: select a subset of the collection; download to researcher’s computer; use programs on the researcher’s computer to analyze the data.
- almost all metadata will be computer generated, but human cooperative editing can correct errors.
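The select, download, analyze methodology from the guidelines above can be sketched as follows. Everything here is a stand-in: `COLLECTION` is a small in-memory list playing the role of a remote collection behind an API, and the function names `select_subset`, `download`, and `analyze` are hypothetical, not part of the Web Lab or any real service.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical stand-in for a remote collection: metadata plus page text.
COLLECTION = [
    {"url": "http://example.org/a", "year": 1998, "text": "campaign news and campaign ads"},
    {"url": "http://example.org/b", "year": 2004, "text": "online retailing grows"},
    {"url": "http://example.com/c", "year": 2004, "text": "campaign blogs appear"},
]

def select_subset(collection, year):
    """Step 1: select a subset of the collection by metadata."""
    return [r for r in collection if r["year"] == year]

def download(records, path):
    """Step 2: copy the subset to the researcher's machine as JSON lines."""
    with open(path, "w", encoding="utf-8") as fh:
        for r in records:
            fh.write(json.dumps(r) + "\n")
    return len(records)

def analyze(path):
    """Step 3: run the researcher's own program, here a simple word count."""
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            counts.update(json.loads(line)["text"].split())
    return counts

with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / "subset.jsonl"
    n = download(select_subset(COLLECTION, 2004), local)
    word_counts = analyze(local)

print(n, word_counts["campaign"])
```

In a real setting the selection step would be a call to the collection's API, so that only the subset, not the whole archive, crosses the network.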