Pytlik Zillig on TokenX
Brian Pytlik Zillig is an all-around digital-library tech wizard at the University of Nebraska-Lincoln (UNL), which hosted the first annual Digital Humanities Workshop a few weeks ago. The full title of his paper is “TokenX: a text visualization, analysis, and play tool designed for the XML document tree.”
Some history:
- CDRH and other digital centers rely heavily on XML
- XML is a 1998 [whoa, old] recommendation [hunh, not a standard] of the W3C
- XML is a robust and flexible medium for content
- UNL has been using XML/SGML since 1998
- all CDRH projects use XML
Research question, born in 2004:
- can emerging standards assist in text visualization, analysis and play? (for example: XSLT)
BPZ’s goal:
- use XSLT to explore text visualization, analysis, and play (TVAP)
- provide TVAP options that facilitate the creative, qualitative, and quantitative exploration of XML text
Why another text analysis tool?
- there are good tools available written in a variety of languages, but none is written in XSLT, and none takes advantage of the special relationship between XML and XSLT
Say we take a line from a Shakespearean sonnet: “when to the sessions of sweet silent thought.” You’d be crazy to try to mark up every word in XML by hand; it’d be a huge undertaking. But XSLT 2.0 can add markup to words using tokenization! Way cool! Tokenized, each word will look like this: <w>word</w>, and a punctuation mark like this: <nonWord>,</nonWord>
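To make the tokenization idea concrete, here is a minimal sketch in Python rather than XSLT. It is not TokenX's code: the regular expression, the function name `tokenize_to_markup`, and the rule for what counts as a word are all my own assumptions. Only the <w> and <nonWord> element names come from the talk.

```python
import re
from xml.sax.saxutils import escape

# Assumption: a "word" is a run of word characters, optionally with internal
# apostrophes; any other non-whitespace character is a "non-word".
TOKEN_RE = re.compile(r"(?P<word>\w+(?:['’]\w+)*)|(?P<non>[^\w\s])")

def tokenize_to_markup(text):
    """Wrap each word in <w> and each punctuation mark in <nonWord>.

    Whitespace between tokens is left exactly as it was.
    """
    def wrap(match):
        tag = "w" if match.lastgroup == "word" else "nonWord"
        # Escape &, < and > so the result stays well-formed XML.
        return f"<{tag}>{escape(match.group(0))}</{tag}>"

    return TOKEN_RE.sub(wrap, text)

print(tokenize_to_markup("sweet silent thought,"))
```

The real tool does this inside an existing XML tree and keeps the original markup; this sketch only handles a plain string.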
With this markup, XSLT can be used to do a variety of TVAP actions on a text. TokenX ingests XML documents, retaining the original markup, and adds tokens like the examples above. Visualizations include word highlighting, keywords in context (looks a bit like a concordance), replacing words with blocks (for example, to find words that are too long?), highlighting punctuation and non-words, all kinds of stuff.
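The keywords-in-context view is the easiest one to picture in code. A rough sketch, again in Python and again my own guess at the idea rather than how TokenX does it; the function name `kwic` and the `width` parameter are hypothetical:

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each match.

    Assumption: context is measured in whole words, `width` on each side,
    and matching ignores case.
    """
    words = re.findall(r"\w+(?:['’]\w+)*", text)
    rows = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            rows.append((left, word, right))
    return rows

for left, word, right in kwic("When to the sessions of sweet silent thought", "sweet", width=2):
    print(f"{left:>20} [{word}] {right}")
```

Lining the keyword up in a column is what gives the output its concordance look.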
Here’s the TokenX site, if you’d like to play with it.
Analyze, in the TokenX context, means:
- count words in context
- decontextualize words and count them (for example, list all the words in the document alphabetically, or by frequency, each word only once with a number of its occurrences next to it)
- word statistics (how many words, how many elements containing words, mean number of words per element)
- punctuation and non-word statistics
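The counts in the list above fall out fairly directly once the text is tokenized. Here is a small Python sketch over already-tokenized XML. The function name `word_stats`, the sample markup, and the way I count "elements containing words" (elements with at least one direct <w> child) are assumptions of mine, not something stated in the talk.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def word_stats(xml_text):
    """Compute word and non-word statistics from <w>/<nonWord> tokens."""
    root = ET.fromstring(xml_text)
    words = [w.text.lower() for w in root.iter("w") if w.text]
    # Assumption: an element "contains words" if it has a direct <w> child.
    containers = sum(1 for el in root.iter() if el.find("w") is not None)
    return {
        "frequencies": Counter(words),
        "word_count": len(words),
        "elements_with_words": containers,
        "mean_words_per_element": len(words) / containers if containers else 0.0,
        "non_words": Counter(n.text for n in root.iter("nonWord") if n.text),
    }

# Hypothetical sample: two lines of a poem, already tokenized.
SAMPLE = (
    "<poem><l><w>sweet</w> <w>silent</w> <w>thought</w><nonWord>,</nonWord></l>"
    "<l><w>Sweet</w> <w>love</w></l></poem>"
)
stats = word_stats(SAMPLE)
print(stats["frequencies"].most_common())  # by frequency
print(sorted(stats["frequencies"]))        # alphabetically
```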
TokenX exports into spreadsheets, so you can export and save your dataset.
You can play with TokenX:
- substitute words
- replace words with images
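Word substitution is a nice example of why the <w> markup matters: once every word is its own element, swapping one out is a tree edit rather than a string search. A hedged Python sketch of the idea, where `substitute_words` and the `replacements` mapping are hypothetical names and the case-insensitive lookup is my assumption:

```python
import xml.etree.ElementTree as ET

def substitute_words(xml_text, replacements):
    """Replace the text of each <w> whose lowercased word is in `replacements`.

    All other markup, and the whitespace between tokens, is left untouched.
    """
    root = ET.fromstring(xml_text)
    for w in root.iter("w"):
        key = (w.text or "").lower()
        if key in replacements:
            w.text = replacements[key]
    return ET.tostring(root, encoding="unicode")

line = "<l><w>sweet</w> <w>silent</w> <w>thought</w></l>"
print(substitute_words(line, {"silent": "noisy"}))
```

Replacing a word with an image would presumably work the same way, putting an image element where the word's text was.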
Best part: it’s free and open-source. “You can change it!” Brian exclaims. Excellent.