Let’s continue with our example from the previous post, using the excerpt from Charles Dickens’ A Tale of Two Cities. If we were to create a word cloud from the opening passage using an on-line solution like Wordle, we would get something like this:
Visually, all the word cloud does is confirm what we know from reading the first four lines of the list from our previous analysis: there are only four words that break into double digits in our 119-word text:
14 THE
12 OF
11 WAS
10 IT
Of those four words, perhaps only “it” from our previous analysis reveals itself as having any kind of importance, and even that is somewhat debatable. By and large these words are what linguists call function words. If you select Remove common English Words under the Language menu while using the Wordle on-line app, then you get something like this:
Of course, with a text this small, much of this is simply a representation of the unique words less a few function words and less their sequential (syntactic) arrangement in the text – which still makes for interesting starting points for analysis. But even in this small example, we can see that the roles of “it” and “we” are worth thinking about more than the simple zapping of them as “common English words” suggests. That is certainly the case when reading a short story like “The Most Dangerous Game.”
You could use Wordle to create a word cloud of “The Most Dangerous Game” by simply copying the contents of the text file and pasting it into the Wordle app. You would get something like this:
Again, it’s not very interesting, but you can certainly begin to see some things when you drop out the common words:
But, as we saw in our tiniest of all possible examples, sometimes the common words are more powerful than their classification as mere “functionaries” would seem to indicate. It would be nice, then, to know what exactly the list of common words is that Wordle uses and, even better, to be able to revise that list in light of our own sense of the text.
Computer scientists, in fact, seek out humanists to work with large data sets for precisely this sense, only they call it “subject matter expertise.” Being able to become an expert on a particular subject matter is one of the goals of a college education – at least to expose you to the process if not to give you the experience for yourself. To use our own expertise to refine our analysis, we have to have more control over the process. To do that, we need to control the software.
Another term for a list of common words is a drop list, since most applications work by “dropping” out frequently used words that are generally considered to be of less interest when looking at the meaning of texts. Some disciplines, like sociology, may call it a stop list, but it’s all the same idea. The process looks something like this in pseudo code:
input (yourtext)
remove (droppedwords)
counteach (remainingwords)
sizeword (number of times it occurs)
create cloud
In other words, what a word cloud app does is take whatever text you feed it and immediately remove the words you have told it are not worth counting. It counts the words that remain and then sizes each word according to the number of times it occurs. (Note that it doesn’t care where a word occurs in a text, only about the number of times it occurs.) Finally, it creates a cloud using various algorithms to make it look nice, depending upon your layout preferences.
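To make those steps concrete, here is a minimal sketch in Python of everything except the final layout. It is not how Wordle itself works: the drop list below is a made-up example (we don’t know Wordle’s actual list), and the names DROP_WORDS, word_counts, and font_sizes are our own inventions for illustration. The linear scaling from count to font size is also just one plausible assumption.

```python
import re
from collections import Counter

# A hypothetical drop list for illustration only; NOT Wordle's actual list.
DROP_WORDS = {"the", "of", "was", "it", "a", "and", "in", "to", "we"}

def word_counts(text, drop_words=DROP_WORDS):
    """Count the words in text, ignoring case and anything on the drop list."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in drop_words)

def font_sizes(counts, min_size=10, max_size=72):
    """Scale each count linearly to a font size between min_size and max_size."""
    if not counts:
        return {}
    low, top = min(counts.values()), max(counts.values())
    span = (top - low) or 1  # avoid dividing by zero when all counts are equal
    return {w: min_size + (n - low) * (max_size - min_size) / span
            for w, n in counts.items()}

sample = "It was the best of times, it was the worst of times"

# Default behavior: "it" is dropped along with the other common words.
print(word_counts(sample).most_common())

# Revising the drop list in light of our own reading: keep "it" and "we".
my_drop_words = DROP_WORDS - {"it", "we"}
counts = word_counts(sample, my_drop_words)
for word, size in font_sizes(counts).items():
    print(word, counts[word], round(size))
```

Because the drop list is just a set we pass in, putting “it” back into the count is a one-line change, which is exactly the kind of control the on-line app doesn’t give us.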