Munro on computer science in text analysis

[Oh look: Geoffrey Rockwell is posting some of his thoughts about this CaSTA conference on the TADA wiki. Highly recommended reading.]

Ian Munro is the Canada Research Chair in Algorithm Design at the Univ. of Waterloo. The full title of his keynote is… well, in the schedule it’s “Computer science research for text analysis,” but on the opening slide it’s “Developing text analysis software.”

Will talk about text search, one of his interests. He’s hard-core CS.

The need for computing in the humanities: “Scholarship increasingly depends on electronic document repositories and the growth of digital libraries… Even more apparent in computer readable form are collections of business documents and linguistic corpora. Gray literature, including technical reports, personal communications, and online help information, also constitute a growing text source.” –Frank Tompa

IM’s resaerch: data structures. How to organize information so we can find what we want: quickly; using an acceptable amount of space; proving the necessary inherent time and space bounds. [vz: bless his heart.] He’s on the theoretical side of computer science, a very different side from “user interface” or “understanding natural language” sides.

Where di IM get going on text? The Oxford English Dictionary project; interaction with humanists and lexicographers. New problems to work on; great data.

Another project IM was involved in, in the early 1980s: Videotext. Like the internet, but assumed few information providers, and access would’ve probably been restricted. The software ideas were there, but it was too early to use them.

IM gives some history of the OED project, which is actually covered pretty well in the Wikipedia article about it. The article includes a description of the first SGML encoding(s) of the OED.

The software they developed for the OED project:

- Lector, a general purpose browser. Worked with tagged text, presented in reasonable form, early SGML that, were they doing this a bit later, would’ve been HTML.

- Goedel, a programming language/database system.

- Pat, a search engine.

Pat is short for PATRICIA, “Practivel Algorithm to Retrieve Information Coded in Alphanumeric.” [vz: oy!] It does full-text searching, using an approach now generally known as “suffix tree.” In fact, in the final implementation it was a “suffix array.”

Typical problem: text indexing. Let’s take a large text file, like all the documents/email for a company, or a genome. We need to construct a structure so that given an arbitrary phrase they can quickly find where this phrase occurs in the “document.” Call the “extra stuff” an index.

What’s “suffix array”? It’s a method: an array of pointers referring to text positions in lexicographic order. Allows binary research. More on it here. (By the way, about this and other links: yeah, it’s wikipedia. Don’t even start with me on it being a Bad Resource. It’s not, unless you take it for gods’ word.)

Then IM describes suffix tries. This is all so far over my head that I’m not even going to try to summarize it; besides, the link does it pretty well.

From the OED project, IM and colleagues’ work proceeded to:

- more text search;

- data warehousing for asking complex queries

- enabling people to view relational databases as text (tags substituted for fields)

- enabling people to get things in “sorted” order: online phone books; buildings wired separately [tell me the companies that have offices in buildings I, a phone company, have wired – but who are not yet customers of mine); Sarah Lee = Sara Li; Romeo and Juliet (how many places are there in England where someone named Romeo lives near someone named Juliet? IM says that the answer is three.)

Where do things go next? IM wants to get rid of the tedium of searching in raw form, scanning texts etc, all parts of humanities work; improve the language interface; utilize better OCR (optical character recognition); build an application that can handle archaic linguistic forms.

Comments are closed.


Switch to our mobile site