Archive for the ‘digital humanities’ Category

Drucker on visualizing interpretation.

October 14th, 2006 vika Comments off

Johanna Drucker is a professor of media studies, and a founding member of UVA’s Speculative Computing Lab. The full title of her keynote is “Graphic conventions: visualizing knowledge and subjectivity.”

Can we shift from information to interpretation, and shift [back] to a more humanistic point of view within digital humanities? To do that, we have to re-introduce subjectivity, and maybe substitute the mechanistic with the probabilistic (the latter being humanists’ worldview, according to JD).

Visual information conveys information in a form that makes it very hard to analyze systematically. Two ways to create stable knowledge in a notation system: one is with [natural] language, the other – mathematical notation, said Rene Toms of the Oulipo. He never did talk about visual representation, and with good reason: it’s an unstable notation mode.

Subjectivity comes in two forms: position (structural) or inflection (semantic).

The notion of information comes from a particular set of assumptions about what knowledge is. JD is not interested in getting rid of this model of knowledge, but rather in proposing another model.

Visualization is compact, the problem isn’t having enough space to represent information – it’s pinning down the exact nature of the information and our assumptions about its aspects.

Visual information (VI) can lie or misinform, like natural language can.

The silliness of chunking of processes (how authors write stories: “author thinks about a topic” –> “author sketches an outline” –> “author reviews the sketch”) is apparent, but we have to do this sort of chunking when we’re working in a computational environment, which requires discrete units. Because of schematics’ rhetorical power, we eventually come to believe them.

So what about text visualization (as opposed to data visualization above)? Interest in doing things to and with texts is very active, especially within the creative-writing communities. Like TextArc, that sort of thing. They can be silly, ugly, destructive of the original, and yet they have their uses.

Edward Tufte, the exquisite engineer according to JD: information pre-exists visualization. Visualizations can be transparent enough to get us access to information. JD disagrees: visualizations are interpretive, opaque, distortive. They create information.

Temporal modeling at SpecLab. Basic assumption: timelines as they are conventionally defined and designed come out of the empirical/natural sciences. Assumptions there: time is unilinear; time is homogeneous (metric is stable); time is continuous (no broken intervals in temporality). None of these three things hold. Temporality branches in our lives, in poetry/film/etc. Time is not homogeneous – some moments fly by, others are long (the moment before the kiss and the moment after are very different, JD says). Time is not continuous, either: there are breaks/ruptures, recorded in historical accounts for example.

SpecLab constructed a grammar of inflections, of visual elements they’d use to represent time, types of events and their relations to each other. There’s a lot of information JD is giving about what SpecLab has been doing; I’ll point you to the Lab’s site instead of summarizing.

In the IVANHOE game, every action takes place from within a role. Each role has a set of assumptions that go with it. They employed it as a teaching tool at UVA, with the purpose of showing that, in fact, every action stems from a set of presuppositions. [vz: that can't be right, I've captured too simplistic a description. Go see the site for more.]

Subjective meteorology: JD’s current project. An art project, which JD says – duh, art is in the humanities. [vz: yay!] She charted and graphed and visually represented a bunch of weather patterns – lines of anxiety/anticipation, storms of anger – which look gorgeous on the slides but don’t seem to be on the web. These representations can be chained together and animated to playfully and visually represent one’s subjective perceptions of the world around.

Great discussion follows. I can’t pretend to capture it well enough; I’ll post an update when the keynote webcasts are up, and urge anyone interested to watch this one when it’s available.

Categories: digital humanities Tags:

Hoover on CaSTA and breadth.

October 13th, 2006 vika Comments off

David Hoover is at NYU, and is the Vice-President of the Association for Computers and the Humanities. The full title of his paper is “CaSTAing breadth upon the waters.” (“Cast thy bread upon the waters: for thou shalt find it after many days,” says Ecclesiastes 11 – hence his title.)

DH seeks simple methods for examining word frequencies in corpora of single authors, at different stages of their production. Do authors tend to start disliking (using less) words they used to like (use a lot) earlier in their careers? Or vice versa? How does an author’s vocabulary change over her production’s life? DH does a lot of statistics to try to find out.

He’s talking about Trollope, about whom I know nothing. Apparently, the 100 most variable words in his corpus are all proper names. They also all appear in more than one stage of his writing career (early-middle-late). Henry James, however, has some non-proper nouns.
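
[A minimal sketch of the kind of counting involved, assuming "most variable" means the largest spread in relative frequency across career stages. That definition, the function names, and the three toy "stages" are my own illustration, not DH's actual method or data.]

```python
import re
from collections import Counter
from statistics import pstdev

def relative_freqs(text):
    """Frequency of each word per 1,000 words of one text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {w: 1000 * c / total for w, c in counts.items()}

def most_variable_words(stages, n=5):
    """Rank words by the spread of their relative frequency across stages."""
    freqs = {name: relative_freqs(text) for name, text in stages.items()}
    vocab = set().union(*freqs.values())
    # A word missing from a stage counts as frequency 0 in that stage.
    spread = {w: pstdev([f.get(w, 0.0) for f in freqs.values()]) for w in vocab}
    return sorted(spread, key=lambda w: (-spread[w], w))[:n]

# Toy stand-ins for early, middle and late writing (invented sentences).
stages = {
    "early": "the coquette shook her tresses and the coquette laughed",
    "middle": "the lady laughed and the lady spoke",
    "late": "the lady spoke and the house was quiet",
}
top = most_variable_words(stages, n=3)
print(top)
```

On a real corpus you'd want far more text per stage, and probably some handling of proper names, given what DH found in Trollope.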

DH’s project is very much in progress; he says it’ll be a while until he has something interesting to say about the evolution of writerly language. One interesting question is: if you have a writer who starts writing very young, does their vocabulary change quickly, early on? What about writers who write far into old age?

One interesting consistency in James is that he seems to have used fewer “precious” nouns as the years went on: words like coquette and tresses.

Cunningham on the Arte of Navigation

October 13th, 2006 vika Comments off

Richard Cunningham is at the Acadia University English department, and directs the hypermedia center there. The full title of his paper is “Developing digital navigation from The Arte of Navigation.”

Readers experience something different reading an electronic document as opposed to a paper one. [Glaringly obvious, RC admits.]

RC presents the Acadia Digital Culture Observatory. They have digitized a 1561 edition of The Arte of Navigation and want to observe how readers read and use it. The original text included a navigation instrument made of three concentric paper circles of different sizes (volvelles), which are to be overlaid one on top of another and rotated. Here, you can see it for yourself. (Hm, it doesn’t seem to work in Firefox on Mac in Blackletter mode; I suggest the use of Arial to be safe, or you can download Blackletter from their table of contents.) Check out particularly the navigation instrument Flash files in the “other moving images” section of the TOC, they’re fun to play with.

TAPoR on TAPoR.

October 13th, 2006 vika Comments off

Ray Siemens of Victoria hosts a session of three papers related to the Text Analysis Portal for Research. First we have Geoffrey Rockwell, with “Text empires: text analysis in excess.” Shawn Day will talk about “The use of the recipe as a guiding metaphor for flexible and efficient self-guided computing instruction.” Finally, Stéfan Sinclair will talk “On data & views in text analysis.” All three presenters are from McMaster University in Hamilton, near Toronto.

ROCKWELL.

Information overload: 5 exabytes of information created in 2002. Exabyte = 1,000,000,000,000,000,000 Bytes. It’s a thousand petabytes, or a million terabytes. [Holy wow.] Spam is cheap, but reading has costs. How can text analysis help?

Why this explosion of information?

- growth in population and wealth: more money, more media toys

- multiple-media, from the photograph (1820s) to the iPod

- digitization of information and business practices: cheap creation, storage, reproduction, and transmission

Challenges to the system: what are the effects?

- experience of information overload

- multimedia shock

- narrowing expertise (because nobody can keep up with a broad discipline!)

- archive fever

What can we do?

- understand the problem (literary dimension to it; a problem of scale, a bibliographic problem)

- produce less? [shock! I can hear the internal gasps around the room!]

- file and (not) store smarter

- find smarter (not more) [ooh, I'll quote him in my dissertation work! no, I cannot, in fact, process all the litcrit written to this day]

- learn to read differently

The latter two of the above are opportunities for text analysis.

Problem of scale to text analysis for finding and reading:

- heterogeneous formats and multimedia rich

- closed (“for perfectly reasonable reasons” -GR) information empires (Google) build on existing indexes or build their own

- new questions, research methods (data mining and visualization)

- text analysis tools developed for coherent texts (collaborate with data mining & HPC [high-performance computing] community)

TAPoR.2 model, Beyond Finding and Reading:

- gathering and aggregation function (working with existing empires like Google; create your own study library (myEmpire))

- mining function (clustering and classification; provoking questions, not finding)

- interface and visualization function (effective interactions for research)

DAY

They’re using the recipe metaphor to get people of different backgrounds to use TAPoR.

A recipe for self-guided instruction:

- ingredients

- steps

- glossary

- discussion

- further information

Ingredients:

- ingenuity

- a useful metaphor

- a versatile set of tools

- users desirous or willing to consider using said tools

Steps

- identify objective

- consider users’ needs

- develop case studies that describe how your tools can meet these needs

- apply a familiar metaphorical approach to engage and instruct

- deploy recipes through a wiki

Glossary

- recipe: a useful guiding metaphor that offers optimal flexibility…. [couldn't get it, too fast]

Further Information:

Try the recipes out! (For example.)

Nice, familiar, easy concept. As Shawn is pointing out right now, super easy to engage a beginner user. This could be very useful, as well, when getting folks used to traditional humanities research methods to try, say, text encoding.

SINCLAIR

[Stéfan is the creator of HyperPo, the coolest text analysis tool ever so far.]

Generally, there’s a one-to-one mapping between tools and the data views of their results. SS has been thinking more in terms of this progression:

text -> tool -> data (TAML) -> style -> view

Among other things, he wanted to create a framework to use in teaching the development of text analysis tools in a modular way.

It’d also be nice to be able to chain tools together – you run a tool on a text, get the resultant data and feed it to another tool, and so on. This requires tools that can “talk” to each other, and output data in the same (or similar enough, or easily translatable) formats.
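
[A rough sketch of the text -> tool -> data -> view idea, with tools chained together. Plain Python dictionaries stand in for the shared data format here; I don't know what TAML actually looks like, and every function name below is made up for illustration.]

```python
import re
from collections import Counter

def tokenize(text):
    """Tool 1: text in, data out."""
    return {"tokens": re.findall(r"\w+", text.lower())}

def count_frequencies(data):
    """Tool 2: consumes the data produced by tool 1."""
    return {"frequencies": Counter(data["tokens"])}

def top_n(n):
    """Tool 3 (configurable): keeps the n most frequent words."""
    def tool(data):
        return {"rows": data["frequencies"].most_common(n)}
    return tool

def chain(*tools):
    """Feed each tool's output to the next one."""
    def run(data):
        for tool in tools:
            data = tool(data)
        return data
    return run

def plain_text_view(data):
    """One possible view of the final data; another view could render HTML."""
    return "\n".join(f"{word}\t{count}" for word, count in data["rows"])

pipeline = chain(tokenize, count_frequencies, top_n(2))
result = pipeline("the cat and the hat and the bat")
view = plain_text_view(result)
print(view)
```

The point of separating the last step is that the same data can be handed to several different views without rerunning the tools.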

HyperPo 7.0 is coming soon!

Arms on vast amounts of data.

October 13th, 2006 vika Comments off

William Y. Arms is a computer scientist currently working at Cornell. The full title of his keynote is “Humanities and social science research using vast amounts of web data.”

Examples of very large collections:

- Library of Congress: National Digital Information Infrastructure and Preservation Program

- The Internet Archive‘s historical collection of the web (600 TB, terabytes)

- Large scale digitization projects: Open Content Alliance, Project Gutenberg, Google, Microsoft, Yahoo, etc.

- USC Shoah Foundation: Survivors of the Shoah (400 TB)

How will humanities and social science scholars do research on collections which are large by supercomputing standards?

“Only the computer reads every word” –Greg Crane

- Researchers interact with the collections through computer programs that act as their agents.

- Users rarely view individual items except after preliminary screening by programs.

- Collection requires a highly technical computer system that is used by researchers who are not computing specialists.

- The collection is a high-performance computing system.

- Use of the collection depends on automated tools, which require state-of-the-art indexes for text and semi-structured data, natural language processing, and machine learning.

Example: the Cornell Web Lab (or is it a Library, asks Arms?)

The structure of text:

Manual analysis and mark-up

- skilled bibliographers and cataloguers

- manual textual markup

- semantic web tools for representing relationships (e.g., RDF, Fedora)

Semi-automated methods

- automated name recognition under human control (e.g., Perseus)

- expert-guided web crawling (e.g., iVia)

The above are tens of millions of records. How do we manage billions of records?

Example: The Internet Archive web collection

The data: complete crawls of the web, every two months since 1996, with some gaps:

- range of formats and depth of crawl have increased with time

- no data from sites that are protected by robots.txt or where owners have requested not to be archived

- some missing or lost data

- metadata contains format, links, anchor text

- organized to facilitate historical access to a known URL (Wayback Machine)

The research dialog between a scholar (S) and a computer scientist (CS) goes something like this:

S: Here’s a study we’d like to do…

CS: We don’t know how to do that analysis, but would this be any use to you?

S: Not as you suggest it, but here’s another idea…

CS: That might be possible, with the following modification…

BOTH: Let’s try it and see!

Eventually we get something that is both useful from a research point of view and feasible from a computing POV.

Social Science Research:

- the web as evidence of current social events (spread of urban legends; development of legal concepts across time)

- the web as social phenomenon (political campaigns, online retailing, polarization of opinions)

Research topic example: social and information networks, joining a community. Question: what is the probability an individual will adopt a new behavior, as a function of the number of his/her friends who are adopters? New behavior could be: adopting a new technology, joining a club, etc.
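
[To make the question concrete: a small sketch that estimates the probability of adoption as a function of the number of friends who had already adopted. The friendship network, the names, and the two snapshots are invented; this is only an illustration of the measurement, not the Web Lab's code.]

```python
from collections import defaultdict

def adoption_probability(friends, adopters_before, adopters_after):
    """Estimate P(adopt | k friends had already adopted).

    friends: dict mapping each person to a list of their friends
    adopters_before / adopters_after: sets of adopters at two snapshots
    """
    exposed = defaultdict(int)    # non-adopters who had k adopter friends
    converted = defaultdict(int)  # those among them who adopted later
    for person, their_friends in friends.items():
        if person in adopters_before:
            continue  # already an adopter at the first snapshot
        k = sum(1 for f in their_friends if f in adopters_before)
        exposed[k] += 1
        if person in adopters_after:
            converted[k] += 1
    return {k: converted[k] / exposed[k] for k in sorted(exposed)}

# Invented toy network.
friends = {
    "ann": ["bob", "cy"],
    "bob": ["ann"],
    "cy": ["ann", "dee"],
    "dee": ["cy", "eve"],
    "eve": ["dee"],
    "fay": ["bob", "cy"],
}
before = {"bob", "cy"}
after = {"bob", "cy", "ann", "dee"}
curve = adoption_probability(friends, before, after)
print(curve)
```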

So, when everything is in digital form, will the library go from being the largest building on campus to being the largest computing system on campus? WA says there’s a good likelihood of that.

WA goes on to describe some of the projects on the Web Lab’s plate right now. Their descriptions can be found on the Web Lab site.

Policy issues on the use of the lab: custodianship of data; copyright; privacy.

Design guidelines for builders of large digital collections:

- every online collection or service needs an application program interface (API) for computers, not humans, to interact with the library.

- a primary methodology is: select a subset of the collection; download to researcher’s computer; use programs on the researcher’s computer to analyze the data.

- almost all metadata will be computer generated, but human cooperative editing can correct errors.

Pytlik Zillig on TokenX

October 13th, 2006 vika Comments off

Brian Pytlik Zillig is an all-around digital-library tech wizard at the University of Nebraska-Lincoln (UNL), which hosted the first annual Digital Humanities Workshop a few weeks ago. The full title of his paper is “TokenX: a text visualization, analysis, and play tool designed for the XML document tree.”

Some history:

- CDRH and other digital centers significantly rely upon XML

- XML is a 1998 [whoa, old] recommendation [hunh, not a standard] of the W3C

- XML is a robust and flexible medium for content

- UNL has been using XML/SGML since 1998

- all CDRH projects use XML

Research question, born in 2004:

- can emerging standards assist in text visualization, analysis and play? (for example: XSLT)

BPZ’s goal:

- use XSLT to explore text visualization, analysis, and play (TVAP)

- provide TVAP options useful to facilitate the creative, qualitative, and quantitative exploration of XML text

Why another text analysis tool?

- there are good tools available, written in a variety of languages, but none is written in XSLT, and none takes advantage of the special relationship between XML and XSLT

Say we take a Shakespearean sonnet line: “when to the sessions of sweet silent thought.” You’d be crazy to try to mark up every word in XML by hand, it’d be a huge undertaking. But XSLT 2.0 can add markup to words using tokenization! Way cool! Tokenized, each word will look like this: <w>word</w> – and here’s a punctuation mark: <nonWord>,</nonWord>
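
[TokenX does this in XSLT 2.0; since I can't reproduce its stylesheet, here is a stand-in sketch in Python that produces the same kind of <w> and <nonWord> markup. The regular expression and the function name are my own assumptions about what counts as a word.]

```python
import re
from xml.sax.saxutils import escape

# A word is a run of word characters, optionally with internal apostrophes;
# anything else that is not whitespace is treated as a non-word token.
TOKEN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")

def tokenize_to_xml(line):
    """Wrap every word in <w> and every punctuation mark in <nonWord>."""
    parts = []
    for token in TOKEN.findall(line):
        tag = "w" if token[0].isalnum() else "nonWord"
        parts.append(f"<{tag}>{escape(token)}</{tag}>")
    # Original spacing is not preserved: tokens are joined by single spaces.
    return " ".join(parts)

marked_up = tokenize_to_xml("When to the sessions of sweet silent thought,")
print(marked_up)
```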

With this markup, XSLT can be used to do a variety of TVAP actions on a text. TokenX ingests XML documents, retaining the original markup, and adds tokens like the examples above. Visualizations include word highlighting, keywords in context (looks a bit like a concordance), replacing words with blocks (for example, to find words that are too long?), highlight punctuation and non-words, all kinds of stuff.

Here’s the TokenX site, if you’d like to play with it.

Analyze, in the TokenX context, means:

- count words in context

- decontextualize words and count them (frex, list all the words in the document alphabetically, or by frequency, each word only once with a number of its occurrences next to it)

- word statistics (how many words, how many elements containing words, mean number of words per element)

- punctuation and non-word statistics
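
[The word statistics in the list above can be sketched in a few lines of Python over a tiny XML document. The <poem> and <l> element names are invented for the example, and this is not how TokenX itself computes them (it works in XSLT).]

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

doc = ET.fromstring(
    "<poem>"
    "<l>When to the sessions of sweet silent thought</l>"
    "<l>I summon up remembrance of things past,</l>"
    "<l/>"
    "</poem>"
)

def words_in(element):
    """Words in an element's own leading text (child and tail text ignored)."""
    return re.findall(r"\w+", (element.text or "").lower())

per_element = [words_in(el) for el in doc.iter()]
with_words = [ws for ws in per_element if ws]  # elements that contain words

frequencies = Counter(w for ws in with_words for w in ws)
total_words = sum(len(ws) for ws in with_words)
mean_per_element = total_words / len(with_words)

by_frequency = frequencies.most_common()   # each word once, with its count
alphabetical = sorted(frequencies.items())  # same list, A to Z

print(total_words, len(with_words), mean_per_element)
```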

TokenX exports into spreadsheets, so you can export and save your dataset.

You can play with TokenX:

- substitute words

- replace words with images

Best part: it’s free and open-source. “You can change it!” Brian exclaims. Excellent.

Wulfman on the Modernist Journals Project

October 13th, 2006 vika Comments off

Cliff Wulfman is working at Brown – lucky us! (Major shout-out to Cliff.) The full title of his paper is “The Modernist Journals Project: A new architecture.”

Here’s the MJP site. It’s evolved from a quite small-scale faculty project. Cliff talks about how to take one of those and move it toward technologies that will allow it to grow and expand and move at a healthy pace. Their primary-source set is pretty large: modernism grew up largely in periodicals, and they’re digitizing them and putting them online.

Complete runs of magazines are scarce, CW says. Even when they exist, oftentimes the advertising has been stripped.

The MJP started out with a desktop scanner and an OCR (optical-character-recognition) package. One office, one faculty member, several students. That’s all. They tried to digitize all 30 volumes – nearly 18,000 pages! – of The New Age (“a weekly review of politics, literature, and art”) that way.

The limits of the original implementation: it was labor-intensive, hand-scanned and hand-coded; and it was served through eclectic, hand-made HTML pages. The MJP was outgrowing the pot in which it was seeded, CW says. It needed:

- engagement with the concept of “cyberinfrastructure”

- embrace of new technologies, standards, best practices that weren’t in place when the project was first conceived.

So they stepped back and devised a new architecture:

- complex digital objects based on digital library standards (METS, MODS, MADS)

- XML substrate

- data- [?] and database- driven service

- polymorphous delivery: can deliver in formats other than PDF

We then had a demo. Go look at the site for more. :)

Future directions:

- access to new scanner technologies will enable vast collection growth

- developing an interlinked encyclopedia of modernism

- build on Fedora‘s digital library infrastructure

Hirtle on TRANSLATOR.

October 13th, 2006 vika Comments off

David Hirtle is doing graduate work here at the University of New Brunswick, in computer science. The full title of his paper is “TRANSLATOR: a TRANSlator from LAnguage TO Rules.”

The semantic web is still not widely used.

- Focus of current development: machine-readable (meta)data

- Problem: only experts can contribute. Need to lower barrier to entry.

Provide a user-friendly format!

- why not English [he really means natural language]?

- “controlled English” avoids ambiguity: it’s formal, but also natural

TRANSLATOR will translate “every student gets a discount of 15 percent” to express [in XML, from what I see] that “student” implies “customer,” etc.

ACE (Attempto Controlled English):

- looks like English: “every honest student who does not procrastinate receives a good mark and easily passes the course.”

- but actually a formal language, like RDF: a tractable subset of English – all ACE sentences are English, but not vice versa

- every ACE sentence can be unambiguously translated into logic.

Strategies for handling ambiguity:

- exclude imprecise phrasings (“students hate annoying professors” – do they hate to annoy profs, or do they hate profs who are annoying?)

- interpretation rules (“the student brings a friend who is an alumnus and receives a discount” – who receives the discount? In ACE, by default, it’s the student because of a certain rule. If you want it to be the alumnus, you write “…and who receives a discount.”)

How can rules be expressed?

- in natural language, many different forms (everyone is mortal, all humanity is mortal, for each person the person is mortal)

- all above are valid ACE

- further embellishment (negation, relative clauses, etc) [vz: but doesn't that add ambiguity?]

What can’t yet be easily expressed?

- “infix” implication (“the student is happy if there is no class” – solution: TRANSLATOR swaps the condition(s) and conclusion(s) and voila, ACE-acceptable)

- production and reaction rules (involve actions: “if a student is caught cheating then send a report to the registrar” requires the imperative mood, which is not yet in ACE)
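
[The "infix" swap in the list above is easy to picture with a toy rewrite: move the condition in front of the conclusion. This is only my sketch of the idea using a regular expression; the function name is hypothetical, and the real TRANSLATOR presumably works on parsed sentences rather than raw strings.]

```python
import re

# "<conclusion> if <condition>." with an optional final period.
INFIX = re.compile(
    r"^(?P<conclusion>.+?)\s+if\s+(?P<condition>.+?)\.?$", re.IGNORECASE
)

def to_prefix_implication(sentence):
    """Rewrite 'C if A.' as 'If A then C.'; leave other sentences alone."""
    sentence = sentence.strip()
    if sentence.lower().startswith("if "):
        return sentence  # already in condition-first form
    match = INFIX.match(sentence)
    if not match:
        return sentence  # no infix "if" found
    conclusion = match.group("conclusion")
    condition = match.group("condition")
    # Naive: lowercases the first letter, which would be wrong for a proper noun.
    conclusion = conclusion[0].lower() + conclusion[1:]
    return f"If {condition} then {conclusion}."

rewritten = to_prefix_implication("The student is happy if there is no class.")
print(rewritten)
```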

Discourse representation structures, and more technical info. Sad, I can’t reproduce his diagrams here. The rules are eventually translated into RuleML, in whose development David is participating.

RuleML:

- goal is interoperable rule markup (XSLT translators to other semantic web languages)

- family of “sublanguages” (modular XML schemas; each represents a well-known rule system; TRANSLATOR uses First-Order Logic sublanguage)

Why use RuleML?

- ease of interchange (XML)

- compatibility with RDF and other languages, as well as W3C’s upcoming Rule Interchange Format

- availability of tools

- wide variety of features (negation-as-failure, weightings, data types etc.)

Again, work-in-progress. Truly an attempt at getting closer to the semantic web. Formalizing natural language, what a gargantuan task. One critical benefit of TRANSLATOR is that it “allows non-experts to write facts and rules for the semantic web.” When can we play with it?

Now, it appears. Here’s a site for TRANSLATOR, including a Java Web Start demo.

Lukon and Juola on building an index generator.

October 13th, 2006 vika Comments off

Shelly Lukon and Patrick Juola are both at Duquesne University. The full title of their paper (presented by Lukon) is “Designing a context-sensitive machine-aided index generator.”

Problem definition.

Back-of-the-book indexing provides relevant terms, identifies cross-references and subcategories, and has a static, rigid structure (as opposed to web indexing). Human indexers invest a LOT of time into indexing (1 week per 100 pages of text); use software to automate mundane tasks; and make all the intelligent indexing decisions. SL&PJ’s prototype system aims to bridge the gap between the human indexer and currently available tools, not to replace human indexers.

They’ve interviewed professional indexers, product-tested some of the software packages they tend to use, and looked at some mathematical techniques (particularly LSA, latent semantic analysis) that have had proven success in text processing and capturing semantic content of terms.

Cognitive tasks involved in index construction:

- identifying terms to index;

- locate all informative references;

- identify/locate synonymous terms;

- split index terms into subterms;

- develop cross-references within text;

- compile page numbers.

Their techniques for obtaining semantic information:

- parsing/tagging of terms, frequency analysis

- LSA

- word sense disambiguation (WSD)

- hierarchical cluster analysis (HCA)

This is still a work in progress. So far they’ve been able to locate all informative terms in text, and to allow the user to set thresholds/parameters. LSA, WSD and HCA show first level of clustering nearly 40% accurate upon inspection (not great, but a solid start). Their single-processor PC takes several hours to process small (60K words) corpora. Better than a human’s speed!

They’re categorizing words into parts of speech: identify the part-of-speech of each term; label each term with delimiter and acronym (home becomes home/NN since home is a noun). They’re only dealing with English right now. Their app is written in Java, as is MontyLingua which they’re using for part-of-speech tagging.

LSA:

- use factor analysis to generate numerical representations of terms and their meanings;

- divide corpus into “documents” (paragraphs), then analyze each unique “term” (word) relative to each document;

- create term-by-document matrix;

- create term-by-term covariance matrix (look at how each pair of terms vary together)

- singular value decomposition (SVD) – a way of explaining variability among random variables (dimensions)

- decompose covariance matrix into three submatrices [over my head here]

- rank resulting values

- reconstruct using most significant dimensions (reduce noise, sharpen similarities/contrasts)

- 200 most significant dimensions: pinpoint each term’s location in 200-dimension “semantic space” [why 200?]
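
[The first steps of the list above (term-by-document matrix, then term-by-term covariance) can be sketched in plain Python. The four "paragraphs" are invented. I've left the SVD step out: it factors the matrix into three matrices, usually written U, Σ and Vᵀ, and in practice you'd hand it to a numerical library rather than write it yourself.]

```python
import re

def term_document_matrix(paragraphs):
    """Rows are terms, columns are documents, cells are raw counts."""
    docs = [re.findall(r"[a-z]+", p.lower()) for p in paragraphs]
    terms = sorted(set(w for d in docs for w in d))
    matrix = [[d.count(t) for d in docs] for t in terms]
    return terms, matrix

def covariance_matrix(matrix):
    """Term-by-term population covariance of the rows."""
    n_docs = len(matrix[0])
    centered = [[x - sum(row) / n_docs for x in row] for row in matrix]
    return [
        [sum(a * b for a, b in zip(r1, r2)) / n_docs for r2 in centered]
        for r1 in centered
    ]

# Each paragraph is treated as one "document".
paragraphs = [
    "the bass guitar and the drum",
    "a bass guitar solo",
    "the boat and the fish",
    "a fish near the boat",
]
terms, tdm = term_document_matrix(paragraphs)
cov = covariance_matrix(tdm)
index = {t: k for k, t in enumerate(terms)}

# Terms that occur in the same paragraphs vary together (positive value);
# terms that avoid each other get a negative value.
print(cov[index["guitar"]][index["bass"]], cov[index["guitar"]][index["boat"]])
```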

WSD

- separate out different senses (meanings) of each term token

- numerical encodings generated by LSA give average context for each term token

- look at encodings of the other terms surrounding each occurrence of the token

- Example: the word “bass” occurs throughout text (both as fish and as musical instrument), proximate to other words (guitar, boat, fish) that help disambiguate

- disambiguate “bass” into “bass_fish” and “bass_instrument”
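
[A much-simplified sketch of the “bass” example. Instead of the LSA encodings SL&PJ use, it counts hand-picked cue words near each occurrence, so it shows the look-at-the-neighbours idea and nothing more. The cue lists, window size and function name are all my own inventions.]

```python
import re

# Hypothetical cue words for each sense; a real system would derive
# these from the LSA encodings rather than from a hand-made list.
SENSE_CUES = {
    "bass_fish": {"boat", "fish", "lake", "caught"},
    "bass_instrument": {"guitar", "band", "played", "amp"},
}

def disambiguate(text, target="bass", window=4):
    """Replace each occurrence of target with its best-matching sense label."""
    tokens = re.findall(r"[a-z]+", text.lower())
    labelled = []
    for pos, token in enumerate(tokens):
        if token != target:
            labelled.append(token)
            continue
        # Words within `window` positions on either side of this occurrence.
        context = set(tokens[max(0, pos - window):pos]
                      + tokens[pos + 1:pos + 1 + window])
        best = max(sorted(SENSE_CUES),
                   key=lambda sense: len(SENSE_CUES[sense] & context))
        # Leave the token unlabelled if no cue word is nearby at all.
        labelled.append(best if SENSE_CUES[best] & context else token)
    return labelled

result = disambiguate("He played bass in a band, then caught a bass from the boat")
print(result)
```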

HCA

- partition terms into subsets with similar properties/characteristics

- antonyms as well as synonyms will cluster together (both have strong relationships, but the system doesn’t know whether they’re positive or negative)

- this information can be used to identify cross-refs (see also) and subterms
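
[A bare-bones sketch of hierarchical clustering, using single linkage on made-up two-dimensional vectors (the real system works in the reduced LSA space, and I don't know which linkage method SL&PJ use). Note how the antonyms "hot" and "cold" land in the same cluster because their invented context vectors are close.]

```python
from math import dist

def cluster(vectors, n_clusters):
    """Single-linkage agglomerative clustering down to n_clusters groups."""
    clusters = [[name] for name in sorted(vectors)]

    def linkage(a, b):
        # Distance between two clusters = distance of their closest members.
        return min(dist(vectors[x], vectors[y]) for x in a for y in b)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Invented coordinates standing in for positions in "semantic space".
vectors = {
    "hot": (0.9, 0.1),
    "cold": (0.8, 0.2),
    "guitar": (0.1, 0.9),
    "bass": (0.2, 0.8),
}
groups = cluster(vectors, 2)
print(groups)
```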

This is a machine-aided system. Its purpose is not to replace but to assist the human indexer, whose judgment and experience cannot be fully captured by a sophisticated expert system. Users can edit results at any stage, control indexing parameters, etc.

Metrics for evaluating the “goodness” of the resulting index:

- side-by-side comparison between entirely-human-generated and machine-aided indexes of the same dataset, quantify what percentage of agreement is acceptable, maybe find meaningful information in how they disagree as well.

Future work:

- incremental refinement

- system has modular architecture for ease of swapping out individual components

- need robust, effective user interface

- empirically vary frequency thresholds, weighting methods, number/percentage of dimensions to use in the reduced data matrix

- continue to build in the latest/most efficient indexing/retrieval methods.

What a great project. I’d love to use it for RolandHT, but it probably won’t be done in time. Enabling the software to read/process XML is on their wish list of big enhancements, hooray!

Munro on computer science in text analysis

October 13th, 2006 vika Comments off

[Oh look: Geoffrey Rockwell is posting some of his thoughts about this CaSTA conference on the TADA wiki. Highly recommended reading.]

Ian Munro is the Canada Research Chair in Algorithm Design at the Univ. of Waterloo. The full title of his keynote is… well, in the schedule it’s “Computer science research for text analysis,” but on the opening slide it’s “Developing text analysis software.”

Will talk about text search, one of his interests. He’s hard-core CS.

The need for computing in the humanities: “Scholarship increasingly depends on electronic document repositories and the growth of digital libraries… Even more apparent in computer readable form are collections of business documents and linguistic corpora. Gray literature, including technical reports, personal communications, and online help information, also constitute a growing text source.” –Frank Tompa

IM’s research: data structures. How to organize information so we can find what we want: quickly; using an acceptable amount of space; proving the necessary inherent time and space bounds. [vz: bless his heart.] He’s on the theoretical side of computer science, a very different side from “user interface” or “understanding natural language” sides.

Where did IM get going on text? The Oxford English Dictionary project; interaction with humanists and lexicographers. New problems to work on; great data.

Another project IM was involved in, in the early 1980s: Videotext. Like the internet, but assumed few information providers, and access would’ve probably been restricted. The software ideas were there, but it was too early to use them.

IM gives some history of the OED project, which is actually covered pretty well in the Wikipedia article about it. The article includes a description of the first SGML encoding(s) of the OED.

The software they developed for the OED project:

- Lector, a general purpose browser. Worked with tagged text, presented in reasonable form, early SGML that, were they doing this a bit later, would’ve been HTML.

- Goedel, a programming language/database system.

- Pat, a search engine.

Pat is short for PATRICIA, “Practical Algorithm to Retrieve Information Coded in Alphanumeric.” [vz: oy!] It does full-text searching, using an approach now generally known as “suffix tree.” In fact, in the final implementation it was a “suffix array.”

Typical problem: text indexing. Let’s take a large text file, like all the documents/email for a company, or a genome. We need to construct a structure so that, given an arbitrary phrase, we can quickly find where this phrase occurs in the “document.” Call the “extra stuff” an index.

What’s a “suffix array”? It’s a method: an array of pointers to text positions, sorted in lexicographic order of the suffixes that start at those positions. This allows binary search. More on it here. (By the way, about this and other links: yeah, it’s wikipedia. Don’t even start with me on it being a Bad Resource. It’s not, unless you take it for gods’ word.)
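
[A small sketch of the idea, for the curious: build the array by sorting the start positions of all suffixes, then binary-search it for a phrase. This is the textbook version with a naive construction step, not the Pat implementation; the function names are mine.]

```python
def build_suffix_array(text):
    """Start positions of all suffixes, sorted by the suffix they begin.

    Naive construction: fine for a demo, far too slow for a dictionary.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, suffix_array, phrase):
    """Return every position where phrase occurs, using two binary searches."""
    def prefix(k):
        # The first len(phrase) characters of the k-th suffix in sorted order.
        start = suffix_array[k]
        return text[start:start + len(phrase)]

    # First suffix whose prefix is >= phrase.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < phrase:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose prefix is > phrase.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= phrase:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])

text = "to be or not to be"
sa = build_suffix_array(text)
positions = find_all(text, sa, "to be")
print(positions)
```

Each lookup touches only about log2(n) suffixes, which is why an index like this pays off once the text gets large.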

Then IM describes suffix tries. This is all so far over my head that I’m not even going to try to summarize it; besides, the link does it pretty well.

From the OED project, IM and colleagues’ work proceeded to:

- more text search;

- data warehousing for asking complex queries

- enabling people to view relational databases as text (tags substituted for fields)

- enabling people to get things in “sorted” order: online phone books; buildings wired separately [tell me the companies that have offices in buildings I, a phone company, have wired – but who are not yet customers of mine]; Sarah Lee = Sara Li; Romeo and Juliet (how many places are there in England where someone named Romeo lives near someone named Juliet? IM says that the answer is three.)

Where do things go next? IM wants to get rid of the tedium of searching in raw form, scanning texts etc, all parts of humanities work; improve the language interface; utilize better OCR (optical character recognition); build an application that can handle archaic linguistic forms.
