Lukon and Juola on building an index generator.

Shelly Lukon and Patrick Juola are both at Duquesne University. The full title of their paper (presented by Lukon) is “Designing a context-sensitive machine-aided index generator.”

Problem definition.

Back-of-the-book indexing provides relevant terms, identifies cross-references and subcategories, and has a static, rigid structure (as opposed to web indexing). Human indexers invest a LOT of time in indexing (about 1 week per 100 pages of text); they use software to automate the mundane tasks, and make all the intelligent indexing decisions themselves. SL&PJ’s prototype system is meant to bridge the gap between the human and currently available tools, not to replace the human indexers.

They’ve interviewed professional indexers, product-tested some of the software packages they tend to use, and looked at some mathematical techniques (particularly LSA, latent semantic analysis) that have had proven success in text processing and capturing semantic content of terms.

Cognitive tasks involved in index construction:

- identify terms to index;

- locate all informative references;

- identify/locate synonymous terms;

- split index terms into subterms;

- develop cross-references within text;

- compile page numbers.

Their techniques for obtaining semantic information:

- parsing/tagging of terms, frequency analysis

- LSA

- word sense disambiguation (WSD)

- hierarchical cluster analysis (HCA)

This is still a work in progress. So far they’ve been able to locate all informative terms in text, and to allow the user to set thresholds/parameters. With LSA, WSD and HCA, the first level of clustering is nearly 40% accurate upon inspection (not great, but a solid start). Their single-processor PC takes several hours to process a small (60K-word) corpus. Better than a human’s speed!
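
To make the "thresholds/parameters" idea concrete, here is a minimal sketch of frequency analysis with a user-set cutoff. It is my own illustration, not their code: the function name `candidate_terms`, the `min_count` parameter, and the tiny stopword list are all assumptions.

```python
import re
from collections import Counter

# Hypothetical stopword list, far smaller than anything a real system would use.
STOPWORDS = frozenset({"the", "a", "of", "and", "to", "in", "is"})

def candidate_terms(text, min_count=2, stopwords=STOPWORDS):
    """Return {term: count} for non-stopword terms seen at least min_count times."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    # The threshold is the user-adjustable parameter.
    return {w: c for w, c in counts.items() if c >= min_count}

terms = candidate_terms("The bass swam. The bass guitar is loud. A boat.", min_count=2)
```

Raising `min_count` shrinks the candidate list; lowering it lets rarer terms through for the human indexer to accept or reject.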

They’re categorizing words into parts of speech: identify the part of speech of each term, then label each term with a delimiter and a tag abbreviation (home becomes home/NN, since home is a noun). They’re only dealing with English right now. Their app is written in Java, as is MontyLingua, which they’re using for part-of-speech tagging.
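
The term/TAG labelling can be sketched in a few lines. This is only an illustration of the output format: it uses a hand-made toy lexicon rather than MontyLingua (whose API I'm not reproducing here), and `TOY_LEXICON`, `label`, and the default tag are all hypothetical.

```python
# Hypothetical toy lexicon using Penn Treebank-style tags
# (NN = noun, VB = verb, RB = adverb, DT = determiner).
TOY_LEXICON = {"home": "NN", "run": "VB", "quickly": "RB", "the": "DT"}

def label(tokens, lexicon=TOY_LEXICON, delimiter="/", default="NN"):
    """Attach a part-of-speech tag to each token as token<delimiter>TAG."""
    # Unknown words fall back to `default`; a real tagger would use context instead.
    return [f"{tok}{delimiter}{lexicon.get(tok.lower(), default)}" for tok in tokens]

labelled = label(["the", "home"])
```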

LSA:

- use factor analysis to generate numerical representations of terms and their meanings;

- divide corpus into “documents” (paragraphs), then analyze each unique “term” (word) relative to each document;

- create term-by-document matrix;

- create term-by-term covariance matrix (look at how each pair of terms varies together)

- singular value decomposition (SVD) – a way of explaining variability among random variables (dimensions)

- decompose covariance matrix into three submatrices [over my head here]

- rank resulting values

- reconstruct using most significant dimensions (reduce noise, sharpen similarities/contrasts)

- 200 most significant dimensions: pinpoint each term’s location in 200-dimension “semantic space” [why 200?]
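
The LSA steps above can be sketched with NumPy. This is a rough reconstruction from my notes, not their implementation: the function name `lsa_term_vectors` is made up, the counts are raw (no weighting), and the toy example keeps k=2 dimensions rather than 200.

```python
import numpy as np

def lsa_term_vectors(docs, k=2):
    """Return (vocab, vectors): one k-dimensional vector per term."""
    tokenised = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenised for w in doc})
    row = {w: i for i, w in enumerate(vocab)}

    # Term-by-document matrix of raw counts (rows = terms, columns = documents).
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(tokenised):
        for w in doc:
            counts[row[w], j] += 1

    # Term-by-term covariance: np.cov treats each row as one variable.
    cov = np.cov(counts)

    # SVD yields three factors; singular values come back sorted, largest first.
    u, s, _ = np.linalg.svd(cov)

    # Keep only the k most significant dimensions.
    k = min(k, len(s))
    return vocab, u[:, :k] * np.sqrt(s[:k])

docs = ["bass guitar amp", "guitar amp stage", "bass boat lake", "boat lake fish"]
vocab, vectors = lsa_term_vectors(docs, k=2)
```

Terms that occur in exactly the same paragraphs (here "guitar" and "amp") end up at the same point in the reduced space, which is the behaviour the later clustering steps rely on.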

WSD:

- separate out different senses (meanings) of each term token

- numerical encodings generated by LSA give average context for each term token

- look at encodings of the other terms surrounding each occurrence of the token

- Example: the word “bass” occurs throughout text (both as fish and as musical instrument), proximate to other words (guitar, boat, fish) that help disambiguate

- disambiguate “bass” into “bass_fish” and “bass_instrument”
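
Here is a simplified sketch of the "bass" example. Big assumptions: the 2-D vectors are hand-made stand-ins for LSA encodings, and I pick a sense by comparing the averaged context to two fixed sense anchors, which is a shortcut of mine rather than whatever the presenters actually do. All names (`TOY_VECTORS`, `SENSE_ANCHORS`, `disambiguate`) are hypothetical.

```python
import numpy as np

# Hand-made stand-ins for LSA encodings: axis 0 ~ "music", axis 1 ~ "water".
TOY_VECTORS = {
    "guitar": np.array([1.0, 0.1]),
    "amp": np.array([0.9, 0.0]),
    "boat": np.array([0.1, 1.0]),
    "fish": np.array([0.0, 0.9]),
    "lake": np.array([0.2, 0.8]),
}
# Hypothetical sense anchors; a real system would have to discover the senses.
SENSE_ANCHORS = {
    "bass_instrument": np.array([1.0, 0.0]),
    "bass_fish": np.array([0.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def disambiguate(tokens, target="bass", window=2):
    """Replace each occurrence of target with the sense closest to its context."""
    out = []
    for i, tok in enumerate(tokens):
        if tok != target:
            out.append(tok)
            continue
        # Encodings of the known words within `window` positions of this occurrence.
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        vecs = [TOY_VECTORS[w] for w in neighbours if w in TOY_VECTORS]
        if not vecs:
            out.append(tok)  # no evidence either way: leave the token untouched
            continue
        context = np.mean(vecs, axis=0)
        out.append(max(SENSE_ANCHORS, key=lambda s: cosine(context, SENSE_ANCHORS[s])))
    return out

tagged = disambiguate("the bass guitar was loud".split())
```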

HCA:

- partition terms into subsets with similar properties/characteristics

- antonyms as well as synonyms will cluster together (both have strong relationships, but the system doesn’t know whether they’re positive or negative)

- this information can be used to identify cross-refs (see also) and subterms
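
A bare-bones version of hierarchical (agglomerative) clustering looks like this. It is a sketch under my own assumptions: average linkage, Euclidean distance, and a fixed target number of clusters; the talk didn't say which linkage or distance they use, and the name `agglomerate` is made up.

```python
import numpy as np

def agglomerate(names, vectors, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[i] for i in range(len(names))]  # start with one cluster per term

    def dist(a, b):
        # Average linkage: mean Euclidean distance over all cross-cluster pairs.
        return np.mean([np.linalg.norm(vectors[i] - vectors[j]) for i in a for j in b])

    while len(clusters) > n_clusters:
        pairs = [
            (dist(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        _, a, b = min(pairs)  # closest pair; ties go to the lowest indices
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]       # safe because b > a
    return [sorted(names[i] for i in c) for c in clusters]

names = ["guitar", "amp", "boat", "fish"]
vectors = np.array([[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]])
groups = agglomerate(names, vectors, n_clusters=2)
```

Terms that land in the same cluster are candidates for "see also" links or subterms. Because the distance only measures strength of association, an antonym pair would cluster together just as readily as a synonym pair.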

This is a machine-aided system. Its purpose is not to replace but to assist the human indexer, whose judgment and experience cannot be fully captured by a sophisticated expert system. Users can edit results at any stage, control indexing parameters, etc.

Metrics for evaluating the “goodness” of the resulting index:

- side-by-side comparison between entirely human-generated and machine-aided indexes of the same dataset; quantify what percentage of agreement is acceptable, and maybe find meaningful information in how they disagree as well.
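
One way to put numbers on that comparison, assuming each index is reduced to a set of (term, page) entries. The measures here (precision, recall, Jaccard overlap) are my own choice of standard agreement metrics, not something the presenters committed to, and `index_agreement` is a hypothetical name.

```python
def index_agreement(human, machine):
    """Compare two indexes given as collections of (term, page) entries."""
    human, machine = set(human), set(machine)
    shared = human & machine
    union = human | machine
    return {
        # Share of machine entries the human also chose.
        "precision": len(shared) / len(machine) if machine else 0.0,
        # Share of human entries the machine also found.
        "recall": len(shared) / len(human) if human else 0.0,
        # Overlap relative to everything either index contains.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # The disagreements, kept for manual inspection.
        "only_human": sorted(human - machine),
        "only_machine": sorted(machine - human),
    }

report = index_agreement(
    human={("bass", 12), ("boat", 3)},
    machine={("bass", 12), ("lake", 7)},
)
```

The `only_human` and `only_machine` lists are where the "meaningful information in how they disagree" would come from.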

Future work:

- incremental refinement

- system has modular architecture for ease of swapping out individual components

- need robust, effective user interface

- empirically vary frequency thresholds, weighting methods, number/percentage of dimensions to use in the reduced data matrix

- continue to build in the latest/most efficient indexing/retrieval methods.

What a great project. I’d love to use it for RolandHT, but it probably won’t be done in time. Enabling the software to read/process XML is on their wish list of big enhancements, hooray!
