The European Paper Company has a nice, short post on why and how to take notes here.
If the computer is a magic box for you, then the next step in our process may be somewhat scary, because it involves downloading an application from our course’s Moodle site that does not have a GUI (a graphical user interface). The Word Cloud Generator, courtesy of IBM, is a Java application that runs on the command line.
In fairness, there are other options out there: for PC users, at the very least, there is a free application called Wordaizer. Or you can always use the Processing language to create your own word clouds; someone has already developed WordCram, which you can use as a starting point.
First, you need to find the downloaded file, which should be a zip archive. Most modern operating systems should have the necessary applications to unzip the file – if yours doesn’t, then look for a good archive utility that handles zip, tar, and other forms of compression on a site like MacUpdate or some other reliable source for software.
The unzipped file turns out to be a folder. Inside the folder you are going to see the following:
- a directory (folder) labelled examples
- a directory labelled license
- the actual word cloud generator application
- a read-me file
- a Windows batch file
- a Unix shell script
I am writing this from a computer running Mac OS X, which is a Unix machine with a pretty face, and so I am going to use the Unix shell script as my foundation, but the corresponding steps should work similarly for those of you running Windows OS and using the batch file. (Properly, I believe I should describe Mac OS X as POSIX-compliant, but I don’t know how many people, including myself, would understand at all what that meant.)
The readme file is somewhat helpful, but I find two other files to be even more helpful. One is the shell script itself, which gives me an exact idea of what I should type (or, better, paste) into a terminal to begin to get results. And so the first thing you should do is copy the command out of the shell script, starting with java and copying all the way through example.png, paste it into an open terminal window, and hit return.
If you haven’t used the terminal before, or whatever your OS calls getting access to the command line, then you are both in for something of a shock and in for a real treat. The shock will come from something that looks, for those of you raised in the era of GUIs, so, well, textual; the treat will come with realizing that even though its textiness seems so foreign, it is actually fairly easy to use, and you will be surprised how quickly you get results.
And so, perhaps, the first place to begin is finding out where to find this Terminal application: in Mac OS X, it’s in the Utilities directory (folder) within the Applications directory, which sits at the root of the file system. The file hierarchy looks something like this:
You read this as follows:
- / (root)
    - Applications/ (Applications directory)
        - Utilities/ (Utilities directory)
            - Terminal.app (application named Terminal)
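If it helps to see the hierarchy as a single path, here is a quick sketch in Python; the path library just makes the joining of the directory names explicit:

```python
from pathlib import PurePosixPath

# Join the hierarchy above, from the root down to the application.
path = PurePosixPath("/") / "Applications" / "Utilities" / "Terminal.app"
print(path)  # /Applications/Utilities/Terminal.app
```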
Okay, now you have the Terminal application open, which means you have a window on your desktop which contains something like this:
Last login: Wed Mar 30 16:51:05 on console
[~]%
The % is known as the prompt, which is short for “the command line prompt”, and it means you are now working with the command line interface (CLI). Congratulations, you have just earned your first CLI credit.
Your prompt may very well be longer: I have shortened mine so that it places my current working directory between square brackets and then gives me a percent sign to tell me it’s ready to receive instructions. (There’s a lot more to say about the environment in which you now find yourself, but for the sake of getting on with this tutorial we will leave that for another time.)
If you paste the code that you copied out of the shell script above and try to run it from where you are, chances are you will get nothing. That is because the prompt can only run things when it knows where they are; much the same applies in the GUI, but the Windows, Mac, and Linux GUIs do a lot of work behind the scenes to find applications for you. You have two choices: add the full path to your command, or navigate to where the WCG application is and run it from within its directory. (If you were going to use the application a lot, there are some other considerations, but we will leave those for another time – feel free to ask if you like.)
Typically, most Terminal windows will start you in your user home directory, which is indicated by the tilde (~). My best advice for the sake of this current activity is to use Windows Explorer or the Mac Finder and move the unzipped folder containing the WCG, which is named “IBM Word Cloud” in my case, to the Desktop.
Let’s continue with our example from the previous post, using the excerpt from Charles Dickens’ A Tale of Two Cities. If we were to create a word cloud from the opening passage using an on-line solution like Wordle, we would get something like this:
Visually, all the word cloud does is confirm what we know from reading the first four lines of the list from our previous analysis: there are only four words that break into double digits in our 119-word text:
14 THE
12 OF
11 WAS
10 IT
Of those four words, perhaps only “it” from our previous analysis reveals itself as having any kind of importance, and even that is somewhat debatable. By and large these words are what linguists call function words. If you select “Remove common English Words” under the Language menu while using the Wordle on-line app, then you get something like this:
Of course, working with a text this small, much of this is simply a representation of the unique words, less a few function words and less their sequential (syntactic) arrangement in the text – which still makes for interesting starting points for analysis. But even in this small example, we can see that the roles of “it” and “we” are worth thinking about more than the simple zapping of them as “common English words” suggests. That is certainly the case when reading a short story like “The Most Dangerous Game.”
You could use Wordle to create a word cloud of “The Most Dangerous Game” by simply copying the contents of the text file and pasting it into the Wordle app. You would get something like this:
Again, it’s not very interesting, but you can certainly begin to see some things when you drop out the common words:
But, as we saw in our tiniest of all possible examples, sometimes the common words are more powerful than their classification as mere “functionaries” would seem to indicate. It would be nice, then, to know what exactly the list of common words is that Wordle uses and, even better, to be able to revise that list in light of our own sense of the text.
Computer scientists, in fact, seek out humanists to work with large data sets precisely for this kind of sense of a text, only they call it “subject matter expertise.” Being able to become an expert on a particular subject matter is one of the goals of a college education – at least to expose you to the process if not to give you the experience for yourself. To use our own expertise to refine our analysis, we have to have more control over the process. To do that, we need to control the software.
Another term for a list of common words is a drop list, since most applications work by “dropping” frequently used words that are generally considered to be of less interest when looking at the meaning of texts. In some disciplines, like sociology, the term is stop list, but it’s all the same idea. In general, the idea looks something like this in pseudocode:
input (yourtext)
remove (droppedwords)
counteach (remainingwords)
sizeword (number of times it occurs)
create cloud
In other words, what a word cloud app does is take whatever text you feed it and immediately remove the words you have told it are not worth counting. It counts the words that remain and then sizes each word according to the number of times it occurs. (Note that it doesn’t care where a word occurs in a text, only how many times it occurs.) Finally, it creates a cloud, using various algorithms to make it look nice, depending upon your layout preferences.
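The pseudocode above can be made concrete in a few lines of Python. This is a minimal sketch, not any particular app’s implementation, and the drop list here is a tiny stand-in of my own:

```python
from collections import Counter
import re

# A stand-in drop list -- real applications ship much longer ones.
DROPPED_WORDS = {"the", "of", "was", "it", "a", "and", "in", "to"}

def cloud_counts(text, dropped=DROPPED_WORDS):
    """Count the words in text that are not on the drop list.
    The count is what a word cloud app turns into a font size."""
    words = re.findall(r"[a-z']+", text.lower())   # input (yourtext)
    kept = [w for w in words if w not in dropped]  # remove (droppedwords)
    return Counter(kept)                           # counteach (remainingwords)

counts = cloud_counts("It was the best of times, it was the worst of times")
# "times" survives the drop list and occurs twice,
# so it would be drawn larger than "best" or "worst".
```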
For this, your first exercise in “text mining”, as it has come to be called, I am asking you to construct a kind of visualization known as a word cloud. You have no doubt seen them in a variety of places, sometimes on blogs as “tag clouds.” For a particularly entertaining example of using word clouds in service of textual analysis, you need look no further than DeadSpin.com’s examination of the hate mail it received for its series of posts on “Why Your Team Sucks.” Be forewarned: enraged NFL fans use strong language, and DS doesn’t flinch from quoting it.
Word clouds sure look cool, but what are they? At their most basic, they are simply visualizations of the number of times a word appears in a given text or collection of texts. That is, an analyst inputs a text or set of texts, the program counts words, and then it creates a visualization that essentially increases the font size of a word in relationship to the number of times it occurs.
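One simple way to do that sizing is linear scaling between a smallest and a largest type size. A sketch, with the bounds being arbitrary choices of mine (real apps may scale differently):

```python
def font_size(count, max_count, smallest=10, largest=72):
    """Scale a word's frequency into a point size between smallest and largest."""
    return smallest + (largest - smallest) * count / max_count

# The most frequent word gets the largest size;
# rarer words shrink toward the smallest size.
print(font_size(14, 14))  # 72.0
print(font_size(7, 14))   # 41.0
```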
Let’s look at a fairly well-known passage and then count its words. What if we were to take the opening paragraph of Charles Dickens’ A Tale of Two Cities and count the words? Here’s the paragraph:
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.
That’s 613 characters making up 119 words. That’s all prose is, after all: a series of characters making up a series of words which, with the help of some curious other characters we call punctuation, make up a series of sentences, which make up a series of paragraphs, which make up … well, you get the idea. For smaller things, like the kinds of essays you write in college, the largest units will be paragraphs, but in larger texts we often group paragraphs into sections or chapters or parts of things like books. Most of the people my age, and your parents’ age, tend to think of books as these really amazing physical objects, more properly known as codices – the singular is codex – but we live in a moment where the “idea” of a book is somewhat in flux.
119 words. But kind of repetitive, isn’t it? There is an awful lot of “it was” in that passage. In fact, “it was” occurs ten times. If we were to break the lines to emphasize the “it was” structure, it would look something like this:
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,
it was the season of Light,
it was the season of Darkness,
it was the spring of hope,
it was the winter of despair,
Formatted that way, some things become rather obvious, don’t they? Take a closer look at the ends of the lines and you see: times/times, wisdom/foolishness, belief/incredulity, Light/Darkness, hope/despair. The pairs are rather striking, and, turning now to the middle of the lines, they are all held in place by periods of time: age, epoch, season, and then particular seasons.
Just as importantly, our attention is now drawn to how the following clauses are constructed:
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way
From “it was, it was” to “we had, we were.” See how counting encourages us to break things apart? The literal meaning of analysis is the process of separating something into its constituent elements. Breaking up any kind of data allows the analyst to sift the data in various ways to see if there are other kinds of patterns, and thus other kinds of information (even meaning), than what the surface organization indicates at first.
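The counting itself is entirely mechanical. A quick sketch in Python, with the ten clauses re-keyed as a plain string:

```python
opening = (
    "It was the best of times, it was the worst of times, "
    "it was the age of wisdom, it was the age of foolishness, "
    "it was the epoch of belief, it was the epoch of incredulity, "
    "it was the season of Light, it was the season of Darkness, "
    "it was the spring of hope, it was the winter of despair"
)

# Case-insensitive count of the repeated frame.
print(opening.lower().count("it was"))  # 10
```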
Let’s do some literal counting now:
Unique: 58   Total: 119

Freq.  Word(s)
14     THE
12     OF
11     WAS
10     IT
 4     WE
 2     AGE, ALL, BEFORE, DIRECT, EPOCH, FOR, GOING, HAD, IN, ITS, PERIOD, SEASON, TIMES, US, WERE
 1     AUTHORITIES, BEING, BELIEF, BEST, COMPARISON, DARKNESS, DEGREE, DESPAIR, EVERYTHING, EVIL, FAR, FOOLISHNESS, GOOD, HEAVEN, HOPE, INCREDULITY, INSISTED, LIGHT, LIKE, NOISIEST, NOTHING, ON, ONLY, OR, OTHER, PRESENT, RECEIVED, SHORT, SO, SOME, SPRING, SUPERLATIVE, THAT, TO, WAY, WINTER, WISDOM, WORST
I have listed the words here first by their frequency – the number of times they appear in the text – and then alphabetically within each frequency.
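That list can be reproduced in a few lines of Python. A sketch, with the passage re-keyed as a plain string and the sort done by descending frequency, then alphabetically:

```python
from collections import Counter
import re

passage = (
    "It was the best of times, it was the worst of times, it was the age "
    "of wisdom, it was the age of foolishness, it was the epoch of belief, "
    "it was the epoch of incredulity, it was the season of Light, it was "
    "the season of Darkness, it was the spring of hope, it was the winter "
    "of despair, we had everything before us, we had nothing before us, "
    "we were all going direct to Heaven, we were all going direct the "
    "other way - in short, the period was so far like the present period, "
    "that some of its noisiest authorities insisted on its being received, "
    "for good or for evil, in the superlative degree of comparison only."
)

# Pull out the words, ignoring punctuation and case.
words = re.findall(r"[a-z]+", passage.lower())
freqs = Counter(words)

# Sort by descending frequency, then alphabetically within a frequency.
ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))

print(len(words), len(freqs))  # 119 58
for word, count in ranked[:4]:
    print(count, word.upper())  # 14 THE, 12 OF, 11 WAS, 10 IT
```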
Here are some on-line word counters you could try out: WriteWords and Rainbow Arch’s Word Counter and Frequency Tool. There are also a few Java applications available, but all the ones I found had some cost associated with them – trivial at $10 and $15, but still a cost. The third alternative to free on-line tools or ready-made, downloadable packages is to build your own application or script that will do exactly what you want it to do.
I have a very primitive Python script that you are free to try, but you could certainly write your own in any of the popular scripting languages, all of which are free and are well worth your effort to try to learn. (Indeed, these should be the focus of “computer literacy” efforts, because they teach you how to analyze texts not how to format texts – don’t get me wrong: good design is important, but that is not the focus of most computer literacy courses.)
Python is an interpreted, general-purpose, high-level programming language whose design philosophy emphasizes code readability. It is used by a number of scholars and scientists in the humanities, linguistics, and the sciences. Thus there are already a number of solutions to the problems you will encounter, and all you need to do is find them. Python itself is open source, and a lot of the people who create added functionality for it have made their solutions open source as well. If you are in the sciences, you may want to consider R, a programming language and software environment for statistical computing and graphics. R has gotten increasing attention in the digital humanities as scholars realize that they can put its statistical capabilities and its graphical outputs to good use, but I find the natural language capabilities of Python a good reason to stick with the language when working with texts.
Up next: Visualizing Words in Clouds
Michael Hyatt floats somewhere in that middle world of the technorati: he has actual accomplishments to his name, but he is now a consultant, and so sometimes there is a sales agenda to his writing. Not all the time, just some of the time, I feel like he’s trying to sell me something.
That noted, Hyatt has a number of posts where he documents his processes, and given his success, both as a publisher and now as a blogger, it’s safe to say there is proof already in his pudding. And so I feel fairly comfortable recommending that you read his blog post on note-taking. It’s short and to the point.
Your first assignment of the semester is to read Richard Connell’s “The Most Dangerous Game” and to come to class prepared to write an in-class essay on a topic of your choosing. You may have the text with you while you write, but you may have no other resources.