Pytlik Zillig on TokenX

Brian Pytlik Zillig is an all-around digital-library tech wizard at the University of Nebraska-Lincoln (UNL), which hosted the first annual Digital Humanities Workshop a few weeks ago. The full title of his paper is “TokenX: a text visualization, analysis, and play tool designed for the XML document tree.”

Some history:

- CDRH and other digital centers rely heavily on XML

- XML is a 1998 [whoa, old] recommendation [hunh, not a standard] of the W3C

- XML is a robust and flexible medium for content

- UNL has been using XML/SGML since 1998

- all CDRH projects use XML

Research question, born in 2004:

- can emerging standards assist in text visualization, analysis and play? (for example: XSLT)

BPZ’s goal:

- use XSLT to explore text visualization, analysis, and play (TVAP)

- provide TVAP options that facilitate the creative, qualitative, and quantitative exploration of XML text

Why another text analysis tool?

- there are good tools available, written in a variety of languages, but none is written in XSLT, and none takes advantage of the special relationship between XML and XSLT

Say we take a line from a Shakespearean sonnet: “when to the sessions of sweet silent thought.” You’d be crazy to try to mark up every word in XML by hand; it’d be a huge undertaking. But XSLT 2.0 can add markup to words using tokenization! Way cool! Tokenized, each word will look like this: <w>word</w>, and here’s a punctuation mark: <nonWord>,</nonWord>
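To make the tokenization idea concrete, here's a rough sketch in Python. To be clear, this is not TokenX's code (TokenX does this in XSLT 2.0); the function name `tokenize_to_markup` and the regular expression are my own assumptions. Only the <w> and <nonWord> element names come from the talk.

```python
import re
from xml.sax.saxutils import escape

# Three alternatives: a word (allowing internal apostrophes),
# a run of whitespace, or a run of anything else (punctuation, symbols).
TOKEN_RE = re.compile(r"(\w+(?:['’]\w+)*)|(\s+)|([^\w\s]+)")


def tokenize_to_markup(text):
    """Wrap words in <w> and non-word runs in <nonWord>; keep whitespace as-is."""
    parts = []
    for match in TOKEN_RE.finditer(text):
        word, space, other = match.groups()
        if word is not None:
            parts.append(f"<w>{escape(word)}</w>")
        elif space is not None:
            parts.append(space)
        else:
            parts.append(f"<nonWord>{escape(other)}</nonWord>")
    return "".join(parts)


if __name__ == "__main__":
    print(tokenize_to_markup("when to the sessions of sweet silent thought,"))
```

This only handles a plain string. The real tool also has to keep whatever markup the document already has, which this sketch doesn't attempt.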

With this markup, XSLT can perform a variety of TVAP actions on a text. TokenX ingests XML documents, retains the original markup, and adds tokens like the examples above. Visualizations include word highlighting, keywords in context (looks a bit like a concordance), replacing words with blocks (for example, to find words that are too long?), highlighting punctuation and non-words, all kinds of stuff.
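For anyone who hasn't seen a keywords-in-context view, here's a minimal Python sketch of the idea. It's my own illustration, not how TokenX builds its display: the function name `keyword_in_context` and the `width` parameter (how many neighboring words to show on each side) are assumptions.

```python
import re


def keyword_in_context(text, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    words = re.findall(r"\w+", text)
    rows = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            rows.append((left, word, right))
    return rows


if __name__ == "__main__":
    line = "when to the sessions of sweet silent thought"
    for left, hit, right in keyword_in_context(line, "sessions"):
        print(f"{left:>20} [{hit}] {right}")
```

Lining the hits up in a column is what gives the view its concordance look.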

Here’s the TokenX site, if you’d like to play with it.

Analyze, in the TokenX context, means:

- count words in context

- decontextualize words and count them (for example, list all the words in the document alphabetically or by frequency, each word appearing only once with its number of occurrences next to it)

- word statistics (how many words, how many elements containing words, mean number of words per element)

- punctuation and non-word statistics
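The counts in this list are easy to picture once the text is tokenized. Here's a hedged Python sketch, not TokenX's XSLT, that computes them from already-tokenized XML. The function name `word_statistics`, the returned keys, and the sample <poem>/<l> elements are all made up for illustration; only <w> and <nonWord> come from the talk.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def word_statistics(xml_text):
    """Count <w> tokens overall and per containing element."""
    root = ET.fromstring(xml_text)
    freq = Counter()
    containers = 0
    for element in root.iter():
        # Only direct <w> children, so each word is counted exactly once.
        words = [w.text.lower() for w in element.findall("w") if w.text]
        if words:
            containers += 1
            freq.update(words)
    total = sum(freq.values())
    return {
        "total_words": total,
        "elements_with_words": containers,
        "mean_words_per_element": total / containers if containers else 0.0,
        "alphabetical": sorted(freq.items()),
        "by_frequency": sorted(freq.items(), key=lambda item: (-item[1], item[0])),
    }


if __name__ == "__main__":
    sample = (
        "<poem>"
        "<l><w>sweet</w> <w>silent</w> <w>thought</w><nonWord>,</nonWord></l>"
        "<l><w>sweet</w> <w>love</w></l>"
        "</poem>"
    )
    print(word_statistics(sample))
```

Punctuation and non-word statistics would work the same way, counting <nonWord> elements instead of <w>.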

TokenX exports into spreadsheets, so you can export and save your dataset.

You can play with TokenX:

- substitute words

- replace words with images
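Word substitution is the simplest of these to sketch. Again, this is my own Python illustration rather than anything from TokenX; `substitute_words` and its `replacements` mapping are hypothetical names.

```python
import re


def substitute_words(text, replacements):
    """Swap whole words found in `replacements`; leave everything else alone."""
    def swap(match):
        word = match.group(0)
        # Look the word up case-insensitively; fall back to the original.
        return replacements.get(word.lower(), word)

    return re.sub(r"\w+", swap, text)


if __name__ == "__main__":
    print(substitute_words("sweet silent thought,", {"silent": "noisy"}))
```

Swapping in an image instead of a word would be the same move, with the replacement being image markup instead of text.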

Best part: it’s free and open-source. “You can change it!” Brian exclaims. Excellent.
