For this, your first exercise in “text mining”, as it has come to be called, I am asking you to construct a kind of visualization known as a word cloud. You have no doubt seen them in a variety of places, sometimes on blogs as “tag clouds.” For a particularly entertaining example of using word clouds in service of textual analysis, you need look no further than DeadSpin.com’s examination of the hate mail it received for its series of posts on “Why Your Team Sucks.” Be forewarned: enraged NFL fans use strong language, and DS doesn’t flinch from quoting it.
Word clouds sure look cool, but what are they? At their most basic, they are simply visualizations of the number of times a word appears in a given text or collection of texts. That is, an analyst inputs a text or set of texts, the program counts words, and then it creates a visualization that increases the font size of each word in proportion to the number of times it occurs.
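That scaling step can be sketched in a few lines of Python. The 12-to-48-point range and the simple linear scaling here are my own illustrative assumptions; real word-cloud tools choose their own schemes:

```python
from collections import Counter

def font_sizes(text, min_pt=12, max_pt=48):
    """Map each word's frequency to a font size.

    The point range and the linear scaling are illustrative
    assumptions, not a standard word-cloud algorithm.
    """
    counts = Counter(text.lower().split())
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts match
    return {word: min_pt + (count - lo) * (max_pt - min_pt) / span
            for word, count in counts.items()}

sizes = font_sizes("to be or not to be")
# "to" and "be" occur twice and get the largest size;
# "or" and "not" occur once and get the smallest
```

The words that occur most often get the biggest type; everything else falls somewhere in between. That proportionality is really all a word cloud is.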
Let’s look at a fairly well-known passage and count its words. What if we were to take the opening paragraph of Charles Dickens’s A Tale of Two Cities? Here’s the paragraph:
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.
That’s 613 characters making up 119 words. That’s all prose is, after all: a series of characters making up a series of words which, with the help of some curious other characters we call punctuation, make up a series of sentences, which make up a series of paragraphs, which make up … well, you get the idea. For smaller things, like the kinds of essays you write in college, the largest units will be paragraphs, but in larger texts we often group paragraphs into sections or chapters or parts of things like books. Most of the people my age, and your parents’ age, tend to think of books as these really amazing physical objects, more properly known as codices – the singular is codex – but we live in a moment where the “idea” of a book is somewhat in flux.
119 words. But kind of repetitive, isn’t it? There is an awful lot of “it was” in that passage. In fact, “it was” occurs ten times. If we were to break the lines to emphasize the “it was” structure, it would look something like this:
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,
it was the season of Light,
it was the season of Darkness,
it was the spring of hope,
it was the winter of despair,
Formatted that way, some things become rather obvious, don’t they? Take a closer look at the ends of the lines and you see: times/times, wisdom/foolishness, belief/incredulity, Light/Darkness, hope/despair. The pairs are rather striking, and, now turning to the middle of the lines, they are all held in place by periods of time: age, epoch, season, and then particular seasons.
Just as importantly, our attention is now drawn to how the following clauses are constructed:
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way
From “it was, it was” to “we had, we were.” See how counting encourages us to break things apart? The literal meaning of analysis is the process of separating something into its constituent elements. Breaking up any kind of data allows the analyst to sift the data in various ways to see if there are other kinds of patterns, and thus other kinds of information (even meaning), than what the surface organization indicates at first.
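Before counting everything, you can check the “it was” tally from earlier with a one-line count. Here is a quick sketch in Python using just the first ten clauses (lowercasing first so the opening capitalized “It was” is caught too):

```python
clauses = ("It was the best of times, it was the worst of times, "
           "it was the age of wisdom, it was the age of foolishness, "
           "it was the epoch of belief, it was the epoch of incredulity, "
           "it was the season of Light, it was the season of Darkness, "
           "it was the spring of hope, it was the winter of despair,")

# Lowercase first so the opening "It was" is counted along with the rest
print(clauses.lower().count("it was"))  # 10
```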
Let’s do some literal counting now:
Unique: 58   Total: 119

Freq.  Word
  14   THE
  12   OF
  11   WAS
  10   IT
   4   WE
   2   AGE
   2   ALL
   2   BEFORE
   2   DIRECT
   2   EPOCH
   2   FOR
   2   GOING
   2   HAD
   2   IN
   2   ITS
   2   PERIOD
   2   SEASON
   2   TIMES
   2   US
   2   WERE
   1   AUTHORITIES
   1   BEING
   1   BELIEF
   1   BEST
   1   COMPARISON
   1   DARKNESS
   1   DEGREE
   1   DESPAIR
   1   EVERYTHING
   1   EVIL
   1   FAR
   1   FOOLISHNESS
   1   GOOD
   1   HEAVEN
   1   HOPE
   1   INCREDULITY
   1   INSISTED
   1   LIGHT
   1   LIKE
   1   NOISIEST
   1   NOTHING
   1   ON
   1   ONLY
   1   OR
   1   OTHER
   1   PRESENT
   1   RECEIVED
   1   SHORT
   1   SO
   1   SOME
   1   SPRING
   1   SUPERLATIVE
   1   THAT
   1   TO
   1   WAY
   1   WINTER
   1   WISDOM
   1   WORST
I have listed the words here first by their frequency, the number of times they appear in the text, and second by alphabetical order.
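A minimal counting script, in the spirit of what produced the listing above, might look like this in Python. The tokenizing regex is my own simple choice; more careful tools also handle hyphens and apostrophes:

```python
import re
from collections import Counter

PARAGRAPH = (
    "It was the best of times, it was the worst of times, it was the age "
    "of wisdom, it was the age of foolishness, it was the epoch of belief, "
    "it was the epoch of incredulity, it was the season of Light, it was "
    "the season of Darkness, it was the spring of hope, it was the winter "
    "of despair, we had everything before us, we had nothing before us, we "
    "were all going direct to Heaven, we were all going direct the other "
    "way - in short, the period was so far like the present period, that "
    "some of its noisiest authorities insisted on its being received, for "
    "good or for evil, in the superlative degree of comparison only."
)

def word_frequencies(text):
    """Count words, sorted by descending frequency, then alphabetically."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

freqs = word_frequencies(PARAGRAPH)
print(len(freqs), sum(count for _, count in freqs))  # 58 unique, 119 total
for word, count in freqs:
    print(f"{count:>4}  {word.upper()}")
```

The two-part sort key, negative count first and then the word itself, reproduces the ordering of the table: by frequency, with alphabetical order breaking ties.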
Here are some on-line word counters you could try out: WriteWords and Rainbow Arch’s Word Counter and Frequency Tool. There are also a few Java applets available, but all the ones I found had some cost associated with them – trivial at $10 and $15, but still a cost. A third alternative, beyond free on-line tools and ready-made, downloadable packages, is to build your own application or script that will do exactly what you want it to do.
I have a very primitive Python script that you are free to try, but you could certainly write your own in any of the popular scripting languages, all of which are free and well worth your effort to learn. (Indeed, these should be the focus of “computer literacy” efforts, because they teach you how to analyze texts, not how to format them – don’t get me wrong: good design is important, but that is not the focus of most computer literacy courses.)
Python is an interpreted, general-purpose, high-level programming language whose design philosophy emphasizes code readability. It is used by a number of scholars and scientists in the humanities, linguistics, and the sciences. Thus there are already a number of solutions to problems you will encounter, and all you need to do is find them. Python itself is open source, and a lot of the people who create added functionality for it have made their solutions open source as well. If you are in the sciences, you may want to consider R, a programming language and software environment for statistical computing and graphics. R has gotten increasing attention in the digital humanities as scholars realize that they can put its statistical capabilities and its graphical outputs to good use, but I find the natural language capabilities of Python a good reason to stick with the language when working with texts.
Up next: Visualizing Words in Clouds