Attributing Authorship: An Introduction
In the aftermath of finishing my dissertation and starting in a new faculty position, I've been thinking about various research problems within natural language processing that I might wish to pursue, and generally trying to read broadly within the field to make up for the tunnel vision associated (for me at least) with the last years of a PhD program. One of the problems that I've become interested in is authorship attribution. I've read a number of papers about statistical approaches to authorship determination, but I knew that there was a huge body of work that I knew nothing about, so when I saw a reference to Love's book, I snapped it up.
Love is not writing for computer scientists in particular, and takes a much broader view of the problem of authorship attribution than an NLP researcher usually would. His book traces the problem of authorship identification over its long history, starting with Biblical and Homeric scholarship before working up to the modern state of affairs. He approaches the topic as a literary scholar. For me, this insight into the factors in authorship identification from the perspective of those who wish to take attributions as a jumping-off point, and not simply as ends in themselves, was particularly illuminating. At the same time, this is an accessible work for those with a more technical interest in the topic. Love confines to an introductory chapter his discussion of what "authorship" really means and of whether it is accurate to say that a text ever has a singular author. While he does not dismiss this debate out of hand, for the purposes of this book he takes as given that "every author has a verifiably unique style" in order to explore the attribution techniques developed under that hypothesis.
The bulk of the text can be divided into two parts, each of several chapters. In the first part, Love surveys the major features used for attributing authorship. In the second part, Love considers specific studies and special subtopics within the general problem of attribution. Love classifies the major features into external evidence, internal evidence, and stylistic evidence. As you would expect, external evidence includes features from outside the text being considered - outsider claims as to the authorship, publisher records, etc. Internal evidence looks at the content of the text, including features such as historical events mentioned, which can locate the text within time, or correlations between opinions expressed or phrases used in texts of known and unknown authorship. Internal evidence starts to blur into stylistic evidence when looking at verbal parallelism or parallelism of thought, but stylistic evidence most commonly focuses on smaller features, such as selection between synonyms, unusual word choices, or idiosyncratic grammatical structures. Love says that historically, and for many people even today, external evidence is considered the most compelling, and internal or stylistic evidence is useful only so far as it can bolster or discredit external evidence. Love's position is that no single one of these features can, alone, create a compelling argument for authorship. Rather, before we can be convinced of an attribution, we should look for agreement between a variety of these features which, together, justify a positive identification.
In the latter half of the book, Love looks at a variety of special topics, including the problem of author-gender identification, the problem of forgery identification, the long-running debate on the authorship of "Shakespeare's works", and the application of bibliographical studies (the study of the physical artifact of the text) and of stylometry (statistical methods) to author identification. The discussion of stylometry, being the area of my own familiarity, was of particular interest. The historical survey covered not only the classic Mosteller and Wallace study of the use of function words in the Federalist Papers, but also interesting older efforts to build statistical fingerprints, such as one by Yule that used the number of nouns occurring once, twice, three times, and so on within a text as descriptive of that author's style. While Love does not delve into the details of these algorithms, he presents principal component analysis and neural network techniques in a manner that I suspect would be accessible to a non-technical audience and yet which acknowledges the risks in their use, such as small sample sizes, overfitting the data, and the difficulty of finding feature sets with the necessary distribution patterns.
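To make the idea of a count-based fingerprint concrete, here is a minimal Python sketch of the kind of tally described above: how many distinct words occur once, twice, three times, and so on. It is an illustration only and should not be read as Yule's actual procedure. In particular, it counts all word tokens rather than only nouns (which would require a part-of-speech tagger), and the function name `frequency_spectrum`, the regex tokenizer, and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter


def frequency_spectrum(text):
    """Map k -> number of distinct words that occur exactly k times.

    Assumption: a "word" is a lowercase run of letters or apostrophes.
    A study restricted to nouns would need a part-of-speech tagger.
    """
    words = re.findall(r"[a-z']+", text.lower())
    word_counts = Counter(words)          # word -> number of occurrences
    return Counter(word_counts.values())  # occurrences -> number of words


# Hypothetical sample text, far too short for any real attribution.
sample = "the cat saw the dog and the dog saw a bird"
spectrum = frequency_spectrum(sample)
for k in sorted(spectrum):
    print(f"{spectrum[k]} distinct word(s) occur {k} time(s)")
```

On a real text the resulting table (so many words used once, so many used twice, and so on) is the raw material for such a fingerprint; how to compare two tables is a separate question that this sketch does not address.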
Most fascinating, in the end, was Love's discussion of the debate over the role of stylometry within the larger field of authorship attribution. As Love presents it, there is little if any acceptance within the literary community of attributions made solely on the basis of stylometry. Love criticizes stylometry as a tool, but not a science - and in fact describes it as a grab bag of tools, from which the researcher may pull statistical techniques until they find one which produces a satisfying result. Furthermore, even having found such a tool, the analyses produced have little explanatory power. As Love says:
An example! 'The result', Thomas Merriam reports, 'of counting the letters in the 43 plays was the implausible discovery that the letter "o" differentiates Marlowe and Shakespeare plays to an extent well in excess of chance. If the cut-off ratio of 0.078 is adopted, then six of the Marlowe plays are grouped together at less than 0.078, 36 of the Shakespeare plays are classified correctly at greater than 0.078, and one Marlowe play, Edward II, is very marginally out of place by reason of its being greater than 0.078. The likelihood of 42 out of 43 correct chance classifications is infinitesimal.' .... This is by any standards a dazzling result achieved with the simplest of means; but it explains nothing and is itself unexplained. Merriam has no interest in the particular reasons why the relative frequencies of the letters 'o' and 'a' should be able to distinguish between Shakespeare and Marlowe or what an investigation of this might contribute to the discipline of linguistics.
This is, as far as it goes, a fair criticism. Most statistical methods do not purport to produce an explanation of the stylistic differences - something that may be of central interest to the individual looking for the attribution. I would note that the particular quote selected here illustrates a troubling use of stylometry, since the frequency of each of the 26 letters of the alphabet was computed, and only the letters which have discriminatory power in this instance have been focused on, without any evidence that the frequency of the letter 'o' can help distinguish Shakespeare from any other authors. Instead, robust stylometry should present a generic technique which can be applied in any case in which two authors must be distinguished, and which will produce a complex enough "statistical fingerprint" to distinguish any such pair.
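For readers who want to see how little machinery such a letter-count test involves, the following Python sketch computes the relative frequency of each of the 26 letters in a text and then applies a single-letter cutoff rule. It is a rough illustration under stated assumptions, not Merriam's method: the quote does not say exactly how the 0.078 ratio was defined, so the sketch assumes it is the letter's share of all letters in the text, and the function names, labels, and sample sentence are hypothetical.

```python
import string
from collections import Counter


def letter_frequencies(text):
    """Relative frequency of each letter a-z among the letters in text."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    if total == 0:
        return {ch: 0.0 for ch in string.ascii_lowercase}
    return {ch: counts[ch] / total for ch in string.ascii_lowercase}


def classify_by_cutoff(text, letter, cutoff, below, above):
    """Return `below` if the letter's relative frequency is under the
    cutoff, otherwise `above`. Assumes the cutoff is a share of all letters."""
    freq = letter_frequencies(text)[letter]
    return below if freq < cutoff else above


# Hypothetical sample; a single line is far too short to mean anything.
sample = "to be or not to be"
freqs = letter_frequencies(sample)
print(round(freqs["o"], 3))
print(classify_by_cutoff(sample, "o", 0.078, "below cutoff", "above cutoff"))
```

Because all 26 frequencies fall out of the same computation, it is easy to try a cutoff on each letter in turn and report only the one that happens to separate the two authors, which is exactly the selection problem raised above.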
In fact, the phrase "statistical fingerprint" seems particularly apropos to me with regard to the central criticism in the above quote. Physical fingerprinting similarly lacks explanatory power; however, we generally accept that correct (or statistically very likely) identifications can be made based on a fingerprint. I would suggest that a statistical fingerprint be treated in the same manner - just as a police investigator does not take a fingerprint as an explanation for one's presence, but simply as strong evidence for the fact of one's presence, a statistical fingerprint - if produced in a robust manner - should be taken as strong evidence for an authorship attribution.
Review written July 2006.