The practice of writing, always in flux, has over the last two decades been especially influenced by the emergence of digital innovations in new text genres – email messages, newsgroup postings, and weblogs. Many of the compositional practices of conventional (that is, print-intended) writing – the sense of a linear structure comprising a beginning, middle, and end, for example – can be said to be in crisis in the new medium. Digital genres have traded many such notions of form for the more convenient parameters offered by a tool or genre. Length is one of the attributes in greatest deviation – the essay or chapter is practically non-existent in a medium that by nature emphasizes instead the assertion of, and reaction to, specific, closely circumscribed points, rather than larger-scale topics. There are, to be sure, such compound structures in digital writing, the canonical example being the case of discussion list threads – sequences of messages that form a conversation around a question-and-answer or declaration-and-response form. There is also the blog, the online equivalent of a diary, with journal entries posted in reverse chronological order. But in both cases, the constituent posts comprising each of these forms function as discrete, condensed, focused statements, and not as the colorful, scenery-creating experiences familiar to us in the notion of chapter from the world of print. These forms exemplify an anatomical truth of digital writing: at its core it is point-based rather than topic-based. Further justifying our view of the basic semantic unit as the point are the extreme forms that promote it: media formats and software exclusively dedicated to creating and organizing and displaying points: the presentation package (e.g., PowerPoint), the idea processor (e.g., Inspiration), and the semantic drawing system (e.g., Microsoft Visio). We might also note, in contrast, that no software exists exclusively for topic-making.
For a culture whose millennia have been given with consistent rigor to the improvement of communicative competence – starting, not least, with the educational requirement that young Classical Greeks master the five-point techné of rhetoric – any new-fangled deviation from this progressive path toward communicative erudition is bound to have some noticeable impact. In particular, the displacement of topic-driven thinking by point-driven media introduces a special economy of language that reduces the archetypal expressive unit, the sentence, to an almost irrelevant and archaic artifact rarely seen in these new forms. Point-driven writing, whether manifest in drawing, diagram, or bullet list, seeks to emphasize process and to communicate some kind of how concerning what is presented, and, through its insistent visuality, renders the written form almost unnecessary. Not surprisingly, this alteration has attracted the attention of critics from diverse perspectives. There are some for whom the historically evolved forms of textual expression fulfill conditions of understanding that are not attainable merely with points, glyphs, or graphs, and others for whom the comparison between topical and point-driven media is not necessary, for each is its own class of communicative tool. Edward Tufte, who interestingly enough comes from a statistical rather than a literary background, finds these non-manuscript forms incapable of natural exposition and rich development of ideas [13]. For Tufte, focusing particularly on the single category of presentation software, the characteristic cascade of bullet points, garish mastheads with oversized, condescendingly obvious graphics, and distracting animations typical of its “texts” all amount to a jarring, disconcerting experience lacking not only in depth but also in all that “comes across” in the fullness of ample explanation.
Conversely, a more exploratory interpretation comes from David Byrne, who, as a visual artist, begins with parody, with “making fun of the iconography of Powerpoint” [3], and redirects the medium into its own kind of expressive genre unrelated to any historically determined textual function. Using the program’s own tools, such as its ability to render arrows, for aesthetic production in its own right, he subsumes the domain of text to that of image [2].
Two differing understandings of the same phenomenon – one pragmatic, one aesthetic – now set the foundation for what has developed between the semi-textual and the orthodox-textual: the space of the neo-textual, where forms at once oral, textual, and semi-improvised have emerged, and where we see a new kind of conversational writing born in the form of blogs, emails, and other aleatory genres. Traditional textual practices, in the essay or novel, for instance, are perhaps six centuries old; the semi-textual is as old as the algorithmic or process diagram, whose heyday is somewhere in the twentieth century. But conversational writing has no direct ancestor, save perhaps for the personal diary, which, however, never took the sprawling, sometimes fragmented, form common today. The term is not entirely new, however: a version of it – “conversational literacy” – appeared in Janzen-Wilde’s [6] meta-analysis of oral and mediated communication. Janzen-Wilde, too, was the first to notice that mediated communication, lying entirely within neither print-based nor orally based genres, synthesizes from both and “has characteristics typically assigned to both ‘oral’ and ‘literate’ ends of the continuum”.
The characteristics of this style, embedded in its digital medium, are the subject of speculation equal to that of the effects of the medium on “traditional” writing. Special “mechanics” distinguish this new style, and these lexical characteristics which reflect the degree of orality embedded in the medium are encoded both for visual prominence and lexical conciseness, raising the prominence of one and attenuating the other, an increasingly evident observation:
The oral conventions are evident in the way people subvert or abandon traditional conventions of grammar and punctuation in electronic writing. Meaning is very often conveyed by cues recognized only by users of computer-mediated communication. Some examples are acronyms like BTW (by the way) and IMO (in my opinion), and specialized use of typography – for example, *word* to signify italics and the use of nonverbal icons or emoticons like a smiley face 🙂 – which differ from traditionally recognized textual cues. [5]
Another relation justifies the popular belief in significant differences between oral, written, and online modes of communication: the production of digital textuality in relation to the principle of commitment. To consider the conditions of speech is to accept evanescent, improvisatory modes of expression projected literally into the air. Everything more or less spontaneous in this sense is captured within the notion of the utterance. To produce writing, on the other hand, is to engage in preparatory organizational work and editing prior to “committing” expression to paper, its natural chosen medium. We could take these practices as two ends of a spectrum, and see in electronic writing a middle ground with sufficient latitude to draw arbitrarily from each pole. Here, any resemblance to print text emerges from the common lexical nature of both: words uniformly arranged on a visual medium. The contrast however, is equally significant, for, as with air, the digital medium is highly unstable and fleeting, and its production is sufficiently simple that textual operations based on structural organization, prewriting, and detailed editing are anathema to its aleatory affordances, regardless of other distinctions within the medium, for instance whether the genre in question is transmitted synchronously (e.g., chatrooms) or not (e.g., e-mail, mailing lists, newsgroups, and discussion groups). In conventional writing, the author generally writes to make a point. But in digital conversational writing online authors may pursue another purpose, since the style of this writing appears structured so as to reflect the social network in which the authors are participating.
Judging from what others have observed in this regard, a special relationship between physicality and orality holds, the lack of the former being compensated by the latter, both because the lack of another means of communicating introduces non-verbal communication into a predominantly textual medium and because networked users, interacting in large numbers, can experience many kinds of interaction almost simultaneously. Here a nexus of language is superimposed on one of populace, a two-dimensional grid of continuous interaction: optimizations to the traditional model of expository text are essential. In defining the notion of virtual community, Rheingold refers indirectly to the displacement of presence by expression, when, pondering both the breadth of collective contact and the demand for using language in the absence of material presence to assist in those relationships, he remarks that people in virtual communities continually
exchange pleasantries and argue, engage in intellectual discourse, conduct commerce, exchange knowledge, share emotional support, make plans, brainstorm, gossip, feud, fall in love, find friends and lose them, play games, flirt, create a little high art and a lot of idle talk. People in virtual communities do just about everything people do in real life, but we leave our bodies behind. [10]
We might conclude altogether that the more one looks into conversational writing, the less it resembles traditional text, in purpose or structure. The speed and quantity of messages (again, emphasizing points rather than topics) almost compel a new definition for its medium-specific functions, and one can understand the rationale for assertions like Ferris’s “computer users often treat electronic writing as an oral medium: communication is often fragmented, computer-mediated communication is used for phatic communion, and formulaic devices have arisen” or Murray’s classification of such writing as comprising a “language of action” [9].
We have enough here to deepen our examination of conversational writing into specific questions posed by the foregoing comparisons. With the generous amount of speculation on the characteristics of digital conversational writing, one would expect a somewhat proportional body of observational data to support or refute theoretical claims, but this has not materialized. Considering that the operational nature and environment of digital text lead transparently to its archival – which is what the innumerable server logs and search engine indexes accomplish – the paucity of systematic studies of data and material produced within and through the digital medium is nothing less than surprising. Nor is it entirely clear what useful inferences can be drawn from much of what does exist; certainly stylistic knowledge – knowledge of what and how authors are creating online, and how the conventions adopted and evolving in their medium compare with those long established in the world of print – does not appear to be the focus of such analyses. One would, for instance, like to observe whether stylistic practices in the new medium conform to conventional modes of print-based writing: is sentence length similar in conventional and conversational writing? If conversational writing derives attributes from orality (Ferris’s observation that “electronic writing is characterized by the use of oral conventions over traditional conventions, of argument over exposition, and of group thinking over individual thinking” is representative of this belief), how significant and present are these in any digital corpus, such as an online discussion group or a library of similar communications documents? These questions suggest a spectrum of communicative modes ranging from most to least “formal” along lexical and semantic criteria defined next.
1. Analytic Criteria
Let us establish the first dimension of an analytic framework that qualitatively incorporates the different “kinds” and forms of writing we wish to compare. We first assume, along with the general academic consensus, that print-oriented or traditional writing stands in structural contrast with oral communication – this point has been navigated throughout an entire literature and, as mentioned earlier, a relatively early informative meta-analysis (Janzen-Wilde, 1993) assembled relevant conclusions for comparative media, between literacy and facilitated communication:
Janzen-Wilde concludes that “characteristics of orality which are common in facilitated communication include its use in regulating social interactions and the opportunity for the listeners/ communication partners to give immediate feedback to the speaker”; in this sense, conversational writing is most unlike its traditional predecessor. In the new millennium, emails and blogs are the everyday examples of conversational writing.
Content and genre present special problems for comparative media work: any conclusions deduced from textual analysis must reflect only structural features of the medium, such as specific conventions and communicative practices, rather than features of its content. The importance of structural inference can be illustrated with a simple example. Let us imagine a (flawed) comparison of print versus orality by means of examining five works of each. Our sample from print media would comprise five novels, and our oral sample five transcripts of legal cases argued in a court of law. This assessment, after lexico-statistical analysis, would lead us to infer almost inescapably that print media are more ‘romantic’ and that orality, on the other hand, is more ‘factual’. This error of inference would reflect the nature of the samples utilized for each medium, not any features inherent in how the medium is used. In seeking to establish objective differences in communicative practices between media, therefore, we must choose criteria that are independent of special content-level features such as “factuality” or “romance”, for factuality is not intrinsic to any medium (would that it were so!). We must, therefore, confine ourselves to comparing media on strictly structural features that may emerge from communicative practices within them. And the structural characteristics must be present in all the media under scrutiny. Three such structural features offer themselves without much bias: sentence length, pronoun usage, and lexical density. Let us now define each.
If, as some of the research summarized in the Characteristics of Literacy table claims [4, 14, 12], written text possesses unique structural characteristics – concise use of syntax and ideas, and cohesion based on linguistic markers – then the first and most important measure by which to compare communicative differences between text, orality, and conversational writing is the word length of sentences in each medium: if the belief is that oral media are more “rambling” and free than print-based ones, we ought to expect longer sentences from the former. Intuitively, it is reasonable to surmise that the length of sentences in one medium or genre might be radically different from that in another; why should they be the same? We will analyze this variable below. Similarly, a second criterion, the relative frequency of pronoun usage, is also worth exploring across media. Measuring the extent of pronoun usage across different media would indicate the degree to which persons are “close to the text” by way of direct reference, and may help answer the question of whether one medium is in general more impersonal than another. Again, the instinctive hypothesis might be that orality is more informal and therefore more “personal” or intimate than text, and that pronoun usage in blogs and emails lies somewhere between both. This is intriguing, but it is worth cautioning ourselves that pronoun usage may belong more to a specific kind of content than to the intrinsic structure of how communication in media takes place. Nevertheless, given this instinctive hypothesis and caveat, comparative statistics on pronoun usage are presented here without firm conclusion, should they prove helpful for future linguistic investigations in new media. Finally, lexical density, the opposite of redundancy in language, is an indicator of the percentage of words in a text that are unique within it; the lower the density, the greater the verbal redundancy and therefore the presumed ease of comprehension.
The formula for calculating the lexical density D for any text is
D = (U / N) × 100
where U is the number of unique words in a text sample, and N is its total word count. Lexical density is more than a statistical number; it reflects a central principle from information theory, namely that redundancy in a message boosts its comprehension. Let us imagine that you want to learn a dialectal kind of Spanish, Cuban street argot, one word at a time. Today’s word is astilla, a noun which translates to splinter, although the slang means something completely different. With a single utterance, you might or might not guess the slang term’s denotation:
Use astilla for dinner.
The lexical density of this utterance is 100% because 4 out of the 4 words are unique. In my next lesson, my new phrase:
Use astilla for dinner. use astilla for payment
has 5 out of 8 unique words. Its lower (63%) lexical density reflects the possibility that the redundancy in this text boosts its potential comprehension, and you now have some feasible ideas as to the slang meaning of astilla. My next phrase:
Use astilla for dinner, use astilla for payment, use astilla for purchases
now has only 6 unique words out of 12, or a 50% lexical density, and this increased redundancy supports your growing conjecture that astilla means money. In this sense, comparative measures of lexical density would corroborate or disprove the claim that orality emphasizes familiar words as well as repetitive syntax and ideas (Westby, 1985; Rubin, 1987), and based on those research claims, we would expect to find lower lexical density in oral data than in print, and the density of online texts would presumably lie between both.
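The three astilla lessons can be checked mechanically. The following is a minimal sketch of the formula D = (U / N) × 100, not the scanning software used for this study. It assumes that a "word" is any run of letters and that words are compared case-insensitively (so "Use" and "use" count once), which is the reading under which the second lesson has 5 unique words out of 8. The function name `lexical_density` is illustrative.

```python
import re


def lexical_density(text: str) -> float:
    """Return D = (U / N) * 100, where U counts unique words and N all words."""
    # Assumption: a word is a run of letters; case and punctuation are ignored.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    return len(set(words)) / len(words) * 100


lessons = [
    "Use astilla for dinner.",
    "Use astilla for dinner. use astilla for payment",
    "Use astilla for dinner, use astilla for payment, use astilla for purchases",
]

for lesson in lessons:
    print(f"{lexical_density(lesson):5.1f}%  {lesson}")
```

Under these assumptions the second lesson computes to 62.5, which rounds to the roughly 63% quoted in the text. A different tokenizer (one that keeps digits, or treats hyphenated forms as single words, for example) would shift the figures slightly on real corpora.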
In having converged on these criteria, we attempt to determine whether, structurally, it is possible to infer higher-level patterns and implications about online conversational writing in contradistinction to oral and written text. The data for this investigation range across each of the four communicative modes in question, including sentence samples from print text, emails, blogs, and transcripts of spoken occasions. Scanning software was designed for a number of purposes, including automatic retrieval of emails from a public database, retrieval of blog postings with archival in text-only form, and the gathering of statistical measures from each corpus. In the area of print, there is a half millennium of source material to choose from, but oral practices of today cannot fairly be compared to texts older than about a century. Consequently, the chosen text corpus (listed in Appendix B), though small, takes a roughly equal number of short stories from classic literature, dating as far back as 1898, and combines them with modern stories from an online magazine. Given the stylistic breadth of printed texts, one might expect to need a massive text corpus; in fact, we do not, because we are choosing a few thousand sentences for analysis from each of four modes: print, oral communication, emails, and blogs. The oral sources, listed in Appendix A, include transcripts of political debates, television talk shows, and one radio documentary interview. The email samples come from the Enron email corpus made public as a result of a U.S. federal criminal investigation. It consists of the text of 619,446 email messages from the Enron Corporation by 158 users who wrote an average of 757 messages each [15], from which a random, representative sample was chosen. Numerous analyses have been made of this corpus, the most systematic of these being that at the University of Massachusetts [1], and illustrative rules have been derived as to the adequate length of an email and its subject line [8].
The blog sample is taken from 30 sequential postings in each of 61 random blogs (1830 unique postings with 8726 sentences). As with emails, attempts have been made to categorize blogs automatically [7], and results on writing style will be presented as well. Overall, a comparable number of sentences from emails (9875), blogs (8780), debates (8748), and texts (8600) was taken and analyzed.
There is much theory on blogging, but few empirical studies exist of semantics or stylistic composition in blogs (or emails), and methodological problems are endemic. One 2004 study [11] analyzed 203 blogs but, judging from the reported number of sentences detected (3260) and words collected (42930), cannot have looked at more than the first page of each blog; in my study of 61 blogs, the scanning program requested 30 blog postings from each, for a total of 8726 sentences and 94433 words – from fewer than one-third as many blogs as in the 2004 study. In all, the statistics are based on 522 individual postings. My analysis found the average number of words per post to be 303, against Herring’s 210. We did not, however, disagree on the average number of words per sentence: I found 15, Herring 16.
Herring et al. count the number of paragraphs in their blog corpus, but I find this measure somewhat problematic in the blog genre. A paragraph, in the realm of conventional print, is a group of one or more sentences set off from its neighbors by one or more empty lines. However, the definition of paragraphs is different in web genres, where, rather than being used to separate groups of ideas in the same text, paragraph breaks instead introduce whole new ideas or micro-texts. Similarly, the paragraph – or a set of empty lines, to be precise – is overloaded in blog style, being the default marker between blog posts, the separator between texts and graphic elements, the break between a text and an inserted quotation, and a mere cosmetic device where inserting white space adds visual balance to existing text blocks. None of these uses is related to the original purpose of the paragraph. A much more difficult problem is that of quoted phrases in blogs. Herring’s count presents the methodological complication that no single definition was given for what constitutes a quoted phrase. They provide two separate counts, quoted sentences/fragments and quoted words per sentence, but do not state how quotes were counted; in blog style, there are at least three ways to quote. One is by inserting the desired text within quotation marks – the conventional way. Another is by inserting a block of quoted text, for which an HTML tag specifically exists. The third is not to include the text at all, but rather to link to it. This makes questionable the statistical measure presented there, the number of “quoted words per sentence”, which they find to be 7.6 – an almost impossible number if we accept their 13.2 “words per sentence” measure, as it would mean that over half of everything written (7.6 / 13.2, or about 58%) is in quotes.
In an initial examination of 500 emails and 500 posts on random blogs, the pattern of sentence length for each genre appears to be very similar (see figure 7).
If, even taking into consideration the wide disparities in style across all possible authors, significant stylistic differences are found with the distribution of sentence length in other genres, these could be attributed to the structure of the genre, and its writing practices, which for the most part lack interventions such as word count limit, editorship, and revision, all of which would influence its average length of sentence.
If we overlay sentence length (grouped in ranges of 5 words) for all communicative modes in a single graphical frequency distribution, we find the first significant difference between text and conversational writing.
From this table we can ask whether significant relationships hold between sentence lengths across media modes. A regression analysis of emails and blogs shows a powerful (98.5%) correlation between them (p < 0.05):
Likewise, blogs and spoken text share a tight 75.4% correlation (p<0.05) in length:
And the correlation between email and spoken data is significantly high (70.7%, p<0.05):
These analyses demonstrate significant similarity in sentence length across oral and conversational writing modes. Conversely, and as expected, there is a low (28.4%) correlation of sentence length between email and written text, from which we reject the null hypothesis that they are from similar populations (or lengths):
In summary, this shows how much closer emails and blogs are, in their conventions of sentence length, to spoken genres than to written texts.
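For readers who wish to repeat this kind of comparison on their own corpora, the procedure can be sketched as follows: split each corpus into sentences, group sentence lengths into ranges of 5 words, and correlate the resulting frequency distributions. This is an illustrative sketch, not the scanning software used here. The naive sentence splitter, the cap of 12 ranges, and the function names are all assumptions, and the sketch computes a plain Pearson coefficient r without a significance test; the text does not specify whether its percentages correspond to r or to r squared.

```python
import math
import re
from collections import Counter

BIN_WIDTH = 5   # ranges of 5 words, as in the text
MAX_BINS = 12   # assumption: sentences of 56 or more words share the last range


def split_sentences(text):
    """Naively split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def length_histogram(text):
    """Return the share of sentences falling in each 5-word length range."""
    sentences = split_sentences(text)
    if not sentences:
        raise ValueError("text contains no sentences")
    counts = Counter()
    for sentence in sentences:
        n_words = len(sentence.split())
        # 1-5 words -> range 0, 6-10 words -> range 1, and so on.
        counts[min((n_words - 1) // BIN_WIDTH, MAX_BINS - 1)] += 1
    return [counts[i] / len(sentences) for i in range(MAX_BINS)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Toy stand-ins for two corpora; real input would be thousands of sentences.
sample_a = "Got it. Will call you later today. Thanks for the quick reply!"
sample_b = "Nice find. We dug there all weekend long. See the pictures below!"
print(pearson(length_histogram(sample_a), length_histogram(sample_b)))
```

Comparing shares rather than raw counts keeps corpora of different sizes on the same scale. On real data the splitter would need to handle abbreviations, ellipses, and the caption-like fragments common in blogs.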
One structural point about blog stylistics bears consideration: the notion of sentence must be somewhat redefined in this genre, which gives equal importance both to the “traditional” declarative sentence and the caption, which is not a sentence but a verbal adjunct to reinforce an associated idea or a graphic. Thus, what appear under normal grammatical conditions to be nonsensical fragments like “Rewards of some hard digging” or “gander mountain credit card” emphasize the dependence of text on other non-textual elements in order to substantiate meaning. This has become standard practice in blog writing. Typically, the fragment-caption will be a sentence missing either the verb, e.g., “Lots of diggers”, “Myself with a very good find”, “picture of beetle bug” or the subject, e.g., “Screening ore”.
The results for sentence length, which show conversational writing to be of similar length to oral utterances, do not carry over to pronoun usage, as the next figure shows:
The frequency slope shows that text employs more pronouns than the blog or email samples, and comes close only to speech in frequency of use. This runs against the generally accepted polarity of orality versus literacy, with conversational writing synthesizing elements of both. The formula for determining pronoun usage is simply the percentage of words in a corpus that are pronouns. No doubt, a larger corpus is necessary to determine this more authoritatively, and we might keep in mind that some text genres are bound to have more pronouns than others. In the present case, the text corpus was composed entirely of works of fiction; had we used scientific monographs, the resulting pronoun usage would differ greatly. Nonetheless, as a starting point for discussion, these results invite certain speculation. In particular, we might infer that blogs are more “impersonal” than email, and that both are less personal than speech, which is as we might expect, since speech is more improvisatory and email is easier to compose than blogs. In the Enron sample, many emails were of a highly personal nature whose appropriateness in a blog format may not be evident. Further research should statistically probe the comparative degree of personal reference in blogs and emails.
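The pronoun measure is simple enough to sketch in a few lines. The version below is illustrative only: the pronoun list is a hand-picked set of English personal, possessive, and reflexive forms (an assumption, since the inventory used for this study is not given), contractions such as "I'm" are counted under their first part, and the name `pronoun_usage` is hypothetical.

```python
import re

# Assumption: a hand-picked inventory of English personal, possessive and
# reflexive pronouns; the study's own list may have differed.
PRONOUNS = frozenset(
    """i me my mine myself you your yours yourself yourselves
    he him his himself she her hers herself it its itself
    we us our ours ourselves they them their theirs themselves""".split()
)


def pronoun_usage(text: str) -> float:
    """Return the percentage of words in the text that are pronouns."""
    # A token is a run of letters, optionally followed by an apostrophe suffix.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        raise ValueError("text contains no words")
    # "I'm" and "it's" are counted under "i" and "it".
    hits = sum(1 for token in tokens if token.split("'")[0] in PRONOUNS)
    return hits / len(tokens) * 100


print(pronoun_usage("I gave it to her before we left."))
```

A list-based count cannot tell "her" the pronoun from "her" the determiner, and it ignores demonstrative and relative forms; a part-of-speech tagger would give a more defensible figure on a full corpus.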
The final measure of potential communicative differences, lexical density, shows the differences divided into three groups – Speech (6%); Blogs (9%) & Emails (10.7%); and Texts (17.2%). It is accepted that text has a higher lexical density than speech, and, in support of our hypothesis, blogs and emails lie between both.
This analysis of four distinct communicative modes – speech, blogs, emails, and printed text (as fiction works) – exposes differences in sentence length, pronoun usage, and lexical density significant enough to support the assertion that blogs and emails, which I am calling instances of conversational writing, conform to stylistic and structural characteristics somewhere between speech and print. This may suggest that usage of different communicative media responds to fundamental differences between them, with the most marked contrast being observed in sentence length and the least in pronoun usage. We might say that sentence length is the most structural of our three metrics, and pronoun usage the most stylistic, with lexical density somewhere between both. In that the observed differences were largest in the structural variables of observation, further research should examine similarly structural variables in corpus samples across these or similar communicative media.
Bekkerman, H. Document Classification on Enron Email Dataset, 2005 [cited 20 May 2005]. Available from http://www.cs.umass.edu/~ronb/enron_dataset.html.
Byrne, David. E.E.E.I (Envisioning Emotional Epistemological Information). Göttingen, Germany: Steidl Publishing, 2003.
———. Personal Communication, 2005.
Chafe, W. L. “Linguistic Differences Produced by Differences between Speaking and Writing.” In Literacy, Language and Learning: The Nature and Consequences of Reading and Writing, edited by D.R. Olson, N. Torrance, & A. Hildyard. Cambridge: Cambridge University Press, 1985.
Ferris, S. P. “Writing Electronically: The Effects of Computers on Traditional Writing.” Journal of Electronic Publishing 8 (2002).
Hildyard, A., & Hidi, S. “Oral-written Differences in the Production and Recall of Narratives.” In Literacy, Language and Learning: The Nature and Consequences of Reading and Writing, edited by D.R. Olson, N. Torrance, & A. Hildyard, 285-306. Cambridge: Cambridge University Press, 1985.
Janzen-Wilde, Lori. “Oral and Literate Characteristics of Facilitated Communication.” Facilitated Communication Digest 1993, no. 2.
Ku, H. Blogs Classification Using NLP Techniques, 2005 [cited 20 May 2005]. Available from http://www.sims.berkeley.edu/~hqu/papers/Blogs_Classification_Using_NLP.pdf.
McDonald, Lauren. How Message Size, # of Links and Subject Length Affects Email Results, 2005 [cited 20 May 2005]. Available from http://www.emaillabs.com/articles/email_articles/message_size_length_links.html.
Murray, D. E. “Literacy at Work: Medium of Communication as Choice.” Paper presented at the American Association of Applied Linguistics, Seattle, WA 1985.
Redeker, G. “On Differences Between Spoken and Written Language.” Discourse Processes 7 (1984): 43-55.
Rheingold, Howard. The Virtual Community: Homesteading on the Electronic Frontier. Reading, Mass.: Addison-Wesley, 1993.
Rubin, D. L. “Divergence and Convergence Between Oral and Written Communication.” Topics in Language Disorders 7 (1987): 1-18.
Herring, S.C., L.A. Scheidt, S. Bonus, and E. Wright. “Bridging the Gap: A Genre Analysis of Weblogs.” Paper presented at the 37th Annual Hawaii International Conference on System Sciences (HICSS’04), 2004.
Tannen, Deborah. “Relative Focus on Involvement in Oral and Written Discourse.” In Literacy, Language and Learning: The Nature and Consequences of Reading and Writing, edited by D.R. Olson, N. Torrance, & A. Hildyard. Cambridge: Cambridge University Press, 1985.
Tufte, Edward. The Cognitive Style of Powerpoint. Cheshire, Connecticut: Graphics Press, 2003.
Wallach, G. “Magic Buries Celtics: Looking for Broader Interpretations of Language Learning and Literacy.” Topics in Language Disorders 10 (1990): 63-80.
Westby, C.E. “Learning to Talk – Talking to Learn: Oral-Literate Language Differences.” In Communication Skills and Classroom Success: Therapy Methodologies for Language-Learning Disabled Students, edited by C.S. Simon. San Diego: College-Hill Press, 1985.
Klimt, B., and Y. Yang. “Introducing the Enron Corpus.” Paper presented at CEAS 2005, The Second Conference on Email and Anti-Spam, Stanford University, Stanford, CA, July 21-22, 2005.
Appendix A – Oral transcripts from radio and television broadcasts
- The Third Bush-Kerry Presidential Debate (broadcast October 13, 2004, available from http://www.debates.org/pages/trans2004d.html)
- The Second Bush-Kerry Presidential Debate (October 8, 2004, http://www.debates.org/pages/trans2004c.html)
- The First Bush-Kerry Presidential Debate (broadcast September 30, 2004, available from http://www.debates.org/pages/trans2004a.html)
- The Cheney-Edwards Vice Presidential Debate (broadcast October 5, 2004, available from http://www.debates.org/pages/trans2004b.html)
- The Abrams Report for July 6, 2005
- The Abrams Report for July 1, 2005
- NPR Weekend Edition Saturday on Reincarnation: Tibetan Buddhism (radio broadcast Saturday, January 10, 1998, available from http://www.npr.org/programs/death/980110.death.html)
- Hardball with Chris Matthews July 6, 2005, http://www.msnbc.msn.com/id/8498025/
- Hardball with Chris Matthews for June 30, 2005 (http://www.msnbc.msn.com/id/8430780/)
- Hardball with Chris Matthews for June 29, 2005 (http://www.msnbc.msn.com/id/8416840/)
- Hardball with Chris Matthews for July 5, 2005
- Hardball with Chris Matthews for July 1, 2005 (http://www.msnbc.msn.com/id/8485041/)
- Countdown with Keith Olbermann for July 6, 2005 (http://www.msnbc.msn.com/id/8498013/)
All moderator comments, tags identifying the speaker, and “stubs” (pre-written introductions and transitions between commercials) were removed to preserve only the actual spoken sentences.
Appendix B – Source Texts
- Dracula by Bram Stoker, electronic version courtesy of The University of Adelaide Library, http://etext.library.adelaide.edu.au/s/s87d/
- Evening Tide by Neal Gordon, Intertext, Issue #57, December 5, 2004, available from http://www.intertext.com/magazine/v13n2/eveningtide.html
- Father Christmas Must Die by Patrick Whittaker, Intertext, Issue #57, December 5, 2004, available from http://www.intertext.com/magazine/v13n2/christmas.html
- Metamorphosis by Franz Kafka, available at http://www.zwyx.org/portal/kafka/kafka_metamorphosis.html
- Star Quality by Melanie Miller, Intertext, Issue #5, January-February 1992, available from http://www.intertext.com/magazine/v2n1/star.html
- The Legion of Lost Gnomes by T.G. Browning, Intertext, Issue #57, December 5, 2004, available from http://www.intertext.com/magazine/v13n2/gnomes.html
- War of the Worlds by H.G. Wells, available from http://www.gutenberg.org/dirs/3/36/36.txt
All chapter and/or section numbers, headings, or titles were removed from the texts prior to analysis.
Appendix C – The Enron Mail Corpus
Emails commonly include extraneous text (e.g., news stories from The Associated Press or Reuters), which had to be removed so that the true style of email writing could be examined. The manual distillation process involved the elimination of all person references as well as titles (which are not part of the body of a text). Incidentally, after controlling for spam or automatically generated titles (e.g., “Breaking News from ABCNEWS.com”), “RE:” and “FWD:” prefixes, and repeated entries, the average email title is 3.56 words in length. Five hundred random messages from the Enron email corpus were cleaned, scanned, and parsed for style according to the criteria indicated below.
- Repeated or extratextual lines were eliminated (those beginning with “>”);
- Reports included in emails were eliminated (e.g., “Energy Executive Daily”);
- Words containing “@” were eliminated as potential email addresses;
- Lines containing email headers (e.g., “From:”, “To:”, “cc:”, “Subject:”, etc.) were eliminated.
The original extraction comprised 99,241 words and 493,144 characters on 17,229 lines, the equivalent of 303 pages of text.
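The cleaning criteria listed above can be approximated in a few lines of code. The following is only a minimal sketch under stated assumptions, not the procedure actually used in the study, which was partly manual: the function names, the header prefix list, and the report marker list are hypothetical and are drawn solely from the examples given in this appendix. Embedded reports are handled here by dropping any line that contains a known marker, which is a simplification of removing a whole report.

```python
# Header prefixes taken from the examples in the appendix; the full set
# used in the study is not specified, so this tuple is an assumption.
HEADER_PREFIXES = ("from:", "to:", "cc:", "subject:")

# Hypothetical list of report markers; only "Energy Executive Daily"
# is named in the text.
REPORT_MARKERS = ("Energy Executive Daily",)


def clean_message(text: str) -> str:
    """Apply the four listed filters to one message body."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Repeated or extratextual (quoted) lines begin with ">".
        if stripped.startswith(">"):
            continue
        # Lines that look like email headers.
        if stripped.lower().startswith(HEADER_PREFIXES):
            continue
        # Lines belonging to an embedded report (line-level approximation).
        if any(marker in stripped for marker in REPORT_MARKERS):
            continue
        # Words containing "@" are treated as potential email addresses.
        words = [w for w in stripped.split() if "@" not in w]
        # Lines left empty after filtering are dropped as well.
        if words:
            kept.append(" ".join(words))
    return "\n".join(kept)


def corpus_stats(text: str) -> dict:
    """Count words, characters, and lines in the cleaned text."""
    return {
        "words": len(text.split()),
        "characters": len(text),
        "lines": len(text.splitlines()),
    }
```

Running `corpus_stats` over the concatenated output of `clean_message` for each sampled message would yield word, character, and line counts of the kind reported above, although the figures would match the study's only if the same messages and the same manual steps were used.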
One could cite outlining software as the clear exception. This class of software offers, after all, the swift and ready capacity for promoting, demoting, and reordering items, from lines to entire paragraphs. It would seem the perfect topic processor were it not for the fact that what is moved is arranged merely graphically, not semantically; the software applies no rules for identifying, relating, or maintaining coherence among the topics in the user’s text. All manipulation is purely visual, none of it topical. It would “work” just as well with pages of gibberish text.