Visualizing Words in Clouds

Let’s continue with our example from the previous post, using the excerpt from Charles Dickens’ A Tale of Two Cities. If we were to create a word cloud from the opening passage using an on-line solution like Wordle, we would get something like this:

Nota Bene: Wordle requires that you have Java applets enabled in your browser’s security preferences. Please note that Java and JavaScript are two different, if often confused, things. It is Java itself that must be enabled for the on-line app to work.

Visually, all the word cloud does is confirm what we know from reading the first four lines of the list from our previous analysis: there are only four words that break into double digits in our 119-word text:

14  THE
12  OF
11  WAS
10  IT

Of those four words, perhaps only “it” from our previous analysis reveals itself as having any kind of importance, and even that is somewhat debatable. By and large, these words are what linguists call function words. If you select “Remove common English Words” under the Language menu while using the Wordle on-line app, then you get something like this:

Of course, working with a text this small, much of this is simply a representation of the unique words, less a few function words and less their sequential (syntactic) arrangement in the text – which still makes for interesting starting points for analysis. But even in this small example, we can see that the roles of “it” and “we” are worth thinking about more than the simple zapping of them as “common English words” suggests. That is certainly the case when reading a short story like “The Most Dangerous Game.”

You could use Wordle to create a word cloud of “The Most Dangerous Game” by simply copying the contents of the text file and pasting it into the Wordle app. You would get something like this:

Again, it’s not very interesting, but you can certainly begin to see some things when you drop out the common words:

But, as we saw in our tiniest of all possible examples, sometimes the common words are more powerful than their classification as mere “functionaries” would seem to indicate. It would be nice, then, to know exactly which words are on the list of common words that Wordle uses and, even better, to be able to revise that list in light of our own sense of the text.

Computer scientists, in fact, seek out humanists to work with large data sets precisely for this kind of sense of a text, only they call it “subject matter expertise.” Becoming an expert on a particular subject matter is one of the goals of a college education – at the least it exposes you to the process, even if it cannot hand you the expertise itself. To use our own expertise to refine our analysis, we have to have more control over the process. To do that, we need to control the software.

Another term for a list of common words is a drop list, since most applications work by “dropping” frequently used words that are generally considered to be of less interest when looking at the meaning of texts. In other fields, notably information retrieval, the usual term is stop list, but it’s all the same idea. In general, the idea looks something like this in pseudocode:

input(your_text)
remove(dropped_words)
count_each(remaining_words)
size_word(number of times it occurs)
create_cloud()

In other words, what a word cloud app does is take whatever text you feed it and immediately remove the words you have told it are not worth counting. It counts the words that remain and then sizes each word according to the number of times it occurs. (Note that it doesn’t care where a word occurs in a text, only about the number of times it occurs.) Finally, it creates a cloud, using various algorithms to make it look nice, depending upon your layout preferences.
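
In Python, that pipeline fits in a dozen or so lines. Here is a minimal sketch of the idea; the drop list, the file name, and the 12–48 point size range are illustrative assumptions of mine, not anything Wordle actually uses, and the layout step (packing the sized words into an attractive cloud) is left out entirely:

# A minimal sketch of the word cloud pipeline described above.
# Assumptions: "opening.txt" holds your text; DROP_WORDS and the
# point-size range are made up for illustration.
import re
from collections import Counter

DROP_WORDS = {"the", "of", "was", "it", "and", "a", "in", "to"}

def cloud_sizes(text, min_pt=12, max_pt=48):
    words = re.findall(r"[a-z']+", text.lower())        # input (your text)
    kept = [w for w in words if w not in DROP_WORDS]    # remove (dropped words)
    counts = Counter(kept)                              # count each remaining word
    top = counts.most_common(1)[0][1]
    # size each word in proportion to the number of times it occurs
    return {w: min_pt + (max_pt - min_pt) * c / top
            for w, c in counts.items()}

sizes = cloud_sizes(open("opening.txt").read())
for word, pt in sorted(sizes.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{pt:4.0f}pt  {word}")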

Next: Running the Word Cloud Generator program

Counting Words

For this, your first exercise in “text mining”, as it has come to be called, I am asking you to construct a kind of visualization known as a word cloud. You have no doubt seen them in a variety of places, sometimes on blogs as “tag clouds.” For a particularly entertaining example of using word clouds in service of textual analysis, you need look no further than DeadSpin.com’s examination of the hate mail it received for its series of posts on “Why Your Team Sucks.” Be forewarned: enraged NFL fans use strong language, and DS doesn’t flinch from quoting it. Link to DeadSpin article.

Word clouds sure look cool, but what are they? At their most basic, they are simply visualizations of the number of times a word appears in a given text or collection of texts. That is, an analyst inputs a text or set of texts, the program counts the words, and then it creates a visualization that scales each word’s font size in relation to the number of times it occurs.

Let’s look at a fairly well-known passage and then count its words. What if we were to take the opening paragraph of Charles Dickens’ A Tale of Two Cities and count the words? Here’s the paragraph:

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.

That’s 613 characters making up 119 words. That’s all prose is, after all. A series of characters making up a series of words, which, with the help of some curious other characters we call punctuation, make up a series of sentences, which make up a series of paragraphs, which make up … well, you get the idea. For smaller things, like the kinds of essays you write in college, the largest units will be paragraphs, but in larger texts we often group paragraphs into sections or chapters or parts of things like books. Most of the people my age, and your parents’ age, tend to think of books as these really amazing physical objects, more properly known as codices – the singular is codex – but we live in a moment where the “idea” of a book is somewhat in flux.
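
If you want to check those counts yourself, a couple of lines of Python will do it. Here “opening.txt” is an assumed file holding the paragraph, and your exact numbers may wobble by a character or a word depending on how you treat the en dash and trailing whitespace:

# Quick check of the character and word counts quoted above.
# Filtering on isalpha() keeps the free-standing en dash from
# being counted as a "word".
passage = open("opening.txt").read().strip()
words = [w for w in passage.split() if any(c.isalpha() for c in w)]
print(len(passage), "characters,", len(words), "words")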

119 words. But kind of repetitive, isn’t it? There is an awful lot of “it was” in that passage. In fact, “it was” occurs ten times. If we were to break the lines to emphasize the “it was” structure, it would look something like this:

It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,
it was the season of Light,
it was the season of Darkness,
it was the spring of hope,
it was the winter of despair,

Formatted that way, some things become rather obvious, don’t they? Take a closer look at the ends of the lines and you see: times/times, wisdom/foolishness, belief/incredulity, Light/Darkness, hope/despair. The pairs are rather striking, and, turning now to the middle of the lines, they are all held in place by periods of time: age, epoch, season, and then particular seasons.

Just as importantly, our attention is now drawn to how the following clauses are constructed:

we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way

From “it was, it was” to “we had, we were.” See how counting encourages us to break things apart? The literal meaning of analysis is the process of separating something into its constituent elements. Breaking up any kind of data allows the analyst to sift it in various ways to see if there are other kinds of patterns, and thus other kinds of information (even meaning), than what the surface organization indicates at first.

Let’s do some literal counting now:

Unique:58  Total:119
Freq.   Word
14  THE
12  OF
11  WAS
10  IT
4   WE
2   AGE
2   ALL
2   BEFORE
2   DIRECT
2   EPOCH
2   FOR
2   GOING
2   HAD
2   IN
2   ITS
2   PERIOD
2   SEASON
2   TIMES
2   US
2   WERE
1   AUTHORITIES
1   BEING
1   BELIEF
1   BEST
1   COMPARISON
1   DARKNESS
1   DEGREE
1   DESPAIR
1   EVERYTHING
1   EVIL
1   FAR
1   FOOLISHNESS
1   GOOD
1   HEAVEN
1   HOPE
1   INCREDULITY
1   INSISTED
1   LIGHT
1   LIKE
1   NOISIEST
1   NOTHING
1   ON
1   ONLY
1   OR
1   OTHER
1   PRESENT
1   RECEIVED
1   SHORT
1   SO
1   SOME
1   SPRING
1   SUPERLATIVE
1   THAT
1   TO
1   WAY
1   WINTER
1   WISDOM
1   WORST

I have listed the words here first by frequency, that is, the number of times they appear in the text, and then alphabetically within each frequency.
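
That ordering is easy to reproduce in code. Here is a sketch that counts the words of the passage and prints them the same way, sorted by descending frequency and then alphabetically within ties; as before, “opening.txt” is an assumed file name, and different tokenizing choices may shift the counts slightly:

# Count words, then sort by descending frequency,
# breaking ties alphabetically, as in the table above.
import re
from collections import Counter

text = open("opening.txt").read()
words = re.findall(r"[a-z]+", text.lower())
counts = Counter(words)

print(f"Unique:{len(counts)}  Total:{len(words)}")
for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{freq:<3} {word.upper()}")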

Here are some on-line word counters you could try: WriteWords and Rainbow Arch’s Word Counter and Frequency Tool. There are also a few Java applets available, but all the ones I found had some cost associated with them – trivial at $10 or $15, but still a cost. The third alternative to free on-line tools or ready-made, downloadable packages is to build your own application or script that does exactly what you want it to do.

I have a very primitive Python script that you are free to try, but you could certainly write your own in any of the popular scripting languages, all of which are free and well worth your effort to learn. (Indeed, these should be the focus of “computer literacy” efforts, because they teach you how to analyze texts, not how to format texts – don’t get me wrong: good design is important, but it is not the focus of most computer literacy courses.)

Python is an interpreted, general-purpose, high-level programming language whose design philosophy emphasizes code readability. It is used by a number of scholars in the humanities and linguistics as well as by scientists, so there are already solutions to many of the problems you will encounter; all you need to do is find them. Python itself is open source, and many of the people who create added functionality for it have made their solutions open source as well. If you are in the sciences, you may want to consider R, a programming language and software environment for statistical computing and graphics. R has gotten increasing attention in the digital humanities as scholars realize that they can put its statistical capabilities and its graphical outputs to good use, but I find the natural language capabilities of Python a good reason to stick with the language when working with texts.
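
As one small example of those natural language capabilities: the NLTK library comes with tokenizers and a ready-made list of common English words. This is a sketch, not a finished tool; it assumes you have installed NLTK and downloaded its “punkt” and “stopwords” data, and that “opening.txt” holds your text:

# A sketch using NLTK's tokenizer and English stop-word list.
# Requires: pip install nltk, then nltk.download("punkt") and
# nltk.download("stopwords"). "opening.txt" is an assumed file.
import nltk
from nltk.corpus import stopwords

text = open("opening.txt").read()
# Tokenize, keep only alphabetic tokens, drop the common English words,
# and print the ten most frequent words that remain.
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
content = [t for t in tokens if t not in stopwords.words("english")]
print(nltk.FreqDist(content).most_common(10))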

Up next: Visualizing Words in Clouds

Michael Hyatt on Note-taking

Michael Hyatt floats somewhere in that middle world of the technorati: he has actual accomplishments to his name, but he is now a consultant, and so there is sometimes a sales agenda to his writing. Not all the time, just some of the time, I feel like he’s trying to sell me something.

That noted, Hyatt has a number of posts where he documents his processes, and given his success, both as a publisher and now as a blogger, it’s safe to say the proof is already in his pudding. And so I feel fairly comfortable recommending that you read his blog post on note-taking. It’s short and to the point.

First Reading, First Writing

Your first assignment of the semester is to read Richard Connell’s “The Most Dangerous Game” and to come to class prepared to write an in-class essay on a topic of your choosing. You may have the text with you while you write, but you may have no other resources.