Attributing Authorship Review

Over the past few years I have entirely neglected the book review section of this site, and the truth is that I hardly had time to read at all until a couple of weeks ago. I’m going to make an effort to revive the site, beginning with a lengthy review of Attributing Authorship by Harold Love, reproduced below for your convenience.


Attributing Authorship
by Harold Love
Rating: +
In the aftermath of finishing my dissertation and starting in a new faculty
position, I’ve been thinking about various research problems within natural
language processing that I might wish to pursue, and generally trying to
read broadly within the field to make up for the tunnel vision associated
(for me at least) with the last years of a PhD program. One of the problems
that I’ve become interested in is authorship attribution. I’ve read a
number of papers about statistical approaches to authorship determination,
but I knew that there was a huge body of work that I knew nothing about, so
when I saw a reference to Love’s book, I snapped it up.
Love is not writing for computer scientists in particular, and is taking
a much broader view of the problem of authorship attribution than is
usually considered by an NLP researcher. His book traces the problem of
authorship identification over its long history, starting with Biblical and
Homeric scholarship before working up to the modern state of affairs. He
approaches the topic as a literary scholar. For me, this insight into the
factors in authorship identification from the perspective of those who wish
to take those attributions as a jumping-off point, and not simply as ends
in themselves, was particularly illuminating. At the same time, this is an
accessible work for those with a more technical interest in the topic. Love
confines his discussion of the question of what “authorship” really means
and the accuracy of saying that a text ever has a singular author to an
introductory chapter. While he does not dismiss this debate out of hand,
for the purposes of this book he takes as given that “every author has a
verifiably unique style” in order to explore the attribution
techniques developed under that hypothesis.
The bulk of the text can be divided into two parts, each of several
chapters. In the first part, Love surveys the major features used for
attributing authorship. In the second part, Love considers specific studies and special
subtopics within the general problem of attribution. Love classifies the
major features into external evidence, internal evidence, and stylistic
evidence. As you would expect, external evidence includes features from
outside the text being considered – outsider claims as to the authorship,
publisher records, etc. Internal evidence looks at the content of the text,
including features such as historical events mentioned, which can locate
the text within time, or correlations between opinions expressed or phrases
used in texts of known and unknown authorship. Internal evidence starts to
blur into stylistic evidence when looking at verbal parallelism or
parallelism of thought, but stylistic evidence most commonly focuses on
smaller features, such as selection between synonyms, unusual word choices,
or idiosyncratic grammatical structures.
Love says that historically, and for many people even today, external
evidence is considered the most compelling, and internal or stylistic
evidence is only useful so far as it can bolster or discredit external
evidence. Love’s position is that no single one of these features can,
alone, create a compelling argument for authorship. Rather, before we can
be convinced of an attribution, we should look for agreement between a
variety of these features which, together, justify a positive
identification.
In the latter half of the book, Love looks at a variety of special
topics, including the problem of author-gender identification, the problem
of forgery identification, the long-running debate on the authorship of
“Shakespeare’s works”, and the application of bibliographical studies (the
study of the physical artifact of the text) and of stylometry (statistical
methods) to author identification.
The discussion of stylometry, being the area of my own familiarity, was of
particular interest. The historical survey covered not only the classic
Mosteller and Wallace study of the use of function words in the Federalist
Papers, but also interesting older
efforts to build statistical fingerprints, such as one by Yule using the
number of nouns occurring once, twice, three times, and so on
within a text as descriptive of that author’s style. While Love does not
delve into the details of these algorithms, he presents principal component
analysis and neural network techniques in a manner that I suspect would be
accessible to a non-technical audience and yet which acknowledges the risks
in their use, such as sample size, overfitting the data, and the difficulty
in finding feature sets with the necessary distribution patterns.
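
To make the Yule-style fingerprint concrete, here is a minimal sketch of a frequency spectrum: a count of how many distinct words occur exactly once, twice, three times, and so on. It rests on assumptions that are mine rather than Love's or Yule's: it counts all words instead of only nouns (which would need a part-of-speech tagger), it uses a crude regular-expression tokenizer, and the function name is hypothetical.

```python
import re
from collections import Counter


def frequency_spectrum(text):
    """Map k -> number of distinct words that occur exactly k times in text.

    Assumption: every word is counted, not just nouns as in the study
    described above, because no part-of-speech tagger is used here.
    """
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    word_counts = Counter(words)
    # Count how many words share each occurrence count, sorted by k.
    return dict(sorted(Counter(word_counts.values()).items()))


sample = "the cat saw the dog and the dog saw a bird"
print(frequency_spectrum(sample))
```

Two texts could then be compared by the shape of their spectra, although on short samples those shapes are likely to be dominated by text length rather than by style.
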
Most fascinating, in the end, was Love’s discussion of the debate over
the role of stylometry within the larger field of authorship attribution.
As Love presents it, there is little if any acceptance within the literary
community of attributions made solely on the basis of stylometry. Love
criticizes stylometry as a tool, but not a science – and in fact describes
it as a grab bag of tools, from which the researcher may pull statistical
techniques until they find one which produces a satisfying result. Furthermore,
even having found such a tool, the analyses produced have little
explanatory power. As Love says:

An example! ‘The result’, Thomas Merriam reports, ‘of counting
the letters in the 43 plays was the implausible discovery that the letter
“o” differentiates Marlowe and Shakespeare plays to an extent well in
excess of chance. If the cut-off ratio of 0.078 is adopted, then six of the
Marlowe plays are grouped together at less than 0.078, 36 of the
Shakespeare plays are classified correctly at greater than 0.078, and one
Marlowe play, Edward II, is very marginally out of place by reason
of its being greater than 0.078. The likelihood of 42 out of 43 correct
chance classifications is infinitesimal.’ …. This is by any standards a
dazzling result achieved with the simplest of means; but it explains
nothing and is itself unexplained. Merriam has no interest in the
particular reasons why the relative frequencies of the letters ‘o’ and ‘a’
should be able to distinguish between Shakespeare and Marlowe or what an
investigation of this might contribute to the discipline of
linguistics.

This is, as far as it goes, a fair criticism. Most statistical methods
do not purport to produce an explanation as to the stylistic differences –
something that may be of central interest to the individual looking for the
attribution. I would note that the particular quote selected here
illustrates a troubling use of stylometry, since the
frequency of each of the 26 letters of the alphabet was computed, and only
the letters which have discriminatory power in this instance have been
focused on, without any evidence that frequency of the letter ‘o’ can help
distinguish Shakespeare from any other authors. Instead, robust stylometry
should present a generic technique which can be applied in any case in
which two authors must be distinguished, and which will produce a complex
enough “statistical fingerprint” to distinguish any such pair.
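
The selection problem described above can be sketched in a few lines. The quoted passage does not say how the ratio is defined, so this sketch assumes it is the share of one letter among all alphabetic characters in a text; the function names and the exhaustive search are illustrative assumptions, not a reconstruction of Merriam's procedure.

```python
import string
from collections import Counter


def letter_ratios(text):
    """Share of each letter a-z among the alphabetic characters of text."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    total = len(letters)
    counts = Counter(letters)
    return {c: (counts[c] / total if total else 0.0)
            for c in string.ascii_lowercase}


def best_single_letter(texts_a, texts_b):
    """Try all 26 letters and every observed cutoff; keep the best separator.

    Returns (accuracy, letter, cutoff). Because the letter is chosen after
    looking at the data, the accuracy is optimistic and says little about
    how the rule would fare on other authors.
    """
    ratios_a = [letter_ratios(t) for t in texts_a]
    ratios_b = [letter_ratios(t) for t in texts_b]
    n = len(texts_a) + len(texts_b)
    best = (0.0, None, None)
    for letter in string.ascii_lowercase:
        a = [r[letter] for r in ratios_a]
        b = [r[letter] for r in ratios_b]
        for cutoff in sorted(set(a + b)):
            # Rule: author A if ratio <= cutoff, else author B.
            correct = sum(x <= cutoff for x in a) + sum(x > cutoff for x in b)
            # The mirrored rule gets exactly the other texts right.
            accuracy = max(correct, n - correct) / n
            if accuracy > best[0]:
                best = (accuracy, letter, cutoff)
    return best
```

With only a few dozen texts and 26 letters to choose from, a search like this will often turn up some letter that separates the two groups well, which is why a rule found this way needs checking on texts that played no part in choosing it.
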
In fact, the phrase “statistical fingerprint”
seems particularly apropos to me with regard to the central criticism in
the above quote. Physical fingerprinting similarly lacks
explanatory power; however, we generally accept that correct (or
statistically very likely) identifications can be made based on a
fingerprint. I would suggest that a statistical fingerprint be treated in
the same manner – just as a police investigator does not take a fingerprint
as an explanation for one’s presence, but simply as strong evidence for the
fact of one’s presence, a statistical fingerprint – if produced in a robust
manner – should be taken as strong evidence for an authorship attribution.
