Screenshot: A Weblog: July 2006 Archives


July 27, 2006

Science and Tech Feedback

The ACM's public policy weblog has a nice discussion of Congress's need for scientific and technical advice, prompted by a hearing on Tuesday on the topic. As in most other domains, the need for advice comes not from a lack of information, but rather from information overload, and specifically highly technical information overload:

Congress does not face an information shortage. Each day hundreds of documents are dumped on Congress, many of them dealing with technical issues. One witness said that staffers now receive about 200 e-mails daily from advocacy groups. Numerous groups provide scientific advice to Congress including think tanks, professional societies (such as ACM), the National Academies, governmental agencies, and even Congress’ own research service. None of the witnesses argued Congress needed more scientific and technical advice. They argued it needed independent advice that was more closely aligned with Congress’ needs, and that this need couldn’t be fulfilled by the various outside groups.

Particularly interesting was the analysis of how the lack of interest in reconstituting the old Office of Technology Assessment would impact the effectiveness of organizations such as the ACM providing technical advice. There was a definite note of frustration in the article, though I wonder if anyone is really surprised by the observation that under the current system scientific recommendations often take a back seat to political recommendations. Not having a codified method for collecting such input is probably indicative of a lowered interest in such input, but it doesn't follow that a centralized clearinghouse for technical advice will guarantee that it is listened to.

At its heart, this seems like an education issue to me - so long as it is socially acceptable for even "well-educated" people to say that math and science are "hard" and beyond their grasp or interest, our government representatives are unlikely to have the inclination or abilities to evaluate even well-presented technical arguments.

July 24, 2006

Bronze Today

Most years I forget to make note until too long after the fact, but today is the eighth anniversary of my first post to Screenshot. I asked a couple of friends what I should do to mark the occasion, and one interesting suggestion was to comment on what weblogs I'm currently reading on a regular basis. It's an interesting question, because looking back over my years of weblogging (1) I used to read a lot more weblogs than I do now, and (2) I used to maintain a list of favorites, whereas now I keep my bookmarks private. The first change is almost entirely due to no longer being a grad student. The second is due to bad memories of the explosion of the weblog "community" and some of the drawing up of sides that came out of who linked to whom. But it has been a while since I've made up a list of favorites, and there are some good sites that I've been enjoying recently.

Looking at the weblogs I visit on a regular basis, there are a few categories - I'll offer you one from each set.

The huge ones that everyone reads - they don't need plugs from me, but I will say that Boing Boing probably has the best staying power for me.

The old ones I've been reading since the 90s - a lot of them are gone, or have morphed into something other than what they were, but Bifurcated Rivets has been keeping on with the old-school Robot Wisdom style snippets and Ghost in the Machine continues to be an interesting mix of, well, everything.

Education themed - I found a whole bunch of these about a year ago, and the best are a good combination of useful insights and cathartic venting. Favorites include Learning Curves and New Kid on the Hallway. Not surprisingly, the summer is a bit of a downtime for these weblogs.

Feminist themed - my favorite in this category, Bitch Ph.D., actually overlaps a bit with the preceding category as well.

Craft themed - not martha seems to consistently find the coolest projects out on the web.

Pop culture themed - I've been reading Pop Culture Junk Mail forever and it's a lot of fun. More recently, I've been laughing at (and not with) the comics over at The Comics Curmudgeon.

This isn't everything by any means, but it's a pretty good snapshot of the types of things I look at - this month at least!

July 21, 2006

Friday Geek Humor

A new favorite is the geeky, surreal, math-and-linguistics-infused webcomic xkcd [via Boing Boing]. The two that made me laugh raucously were Stacy's Dad and Computational Linguistics (caution, bad language). However, I highly recommend paging through from the beginning. If you don't have the patience, some of my other favorites are Copyright, Fourier, Self-Reference, Hyphen, Geico, Hobby, Graduation, Pillow Talk, Wright Brothers, and Centrifugal Force.

July 20, 2006

Dangers of Web 2.0

An interesting pair of articles about the privacy implications of Web 2.0 applications came through on Slashdot and Digg respectively last week. The first linked to the Louisiana State University in Shreveport's Career Services reprint of an article about the impact of social network sites on getting hired. It mentions that even people who think they are being careful by restricting access to their on-line content might find it accessed by a potential employer, citing a specific case in which a state agency obtained access to restricted Facebook pages due to a provision of the Patriot Act. It also reiterates the necessary point that these sites need to be treated as public, not private, spaces. Interestingly, the article also suggests that it is ethically questionable for an employer to look at these sites for background on a potential employee. I think that it is a mistake for an employer to take standard goofiness too seriously, but I think that it is totally natural for them to Google applicant names or look in other public sites. That is, at least, content that an applicant has power over, as compared to employers asking colleagues who previously knew the applicant for feedback, which definitely happens.

The second article talks about steps you can take to clean up your on-line presence, particularly prior to a job search. Soberingly, its first recommendation is to Google yourself, but it suggests that if something unflattering appears about you, there is little you can do about it.

July 14, 2006

Back to Firefox

I ran through my planned trial week with Opera, and I've decided to go back to Firefox. I definitely think that Opera has fewer memory leaks, which is a plus, and I really like the session manager. However, it never felt right - there were differences switching from IE to Firefox, but Firefox was never irritating. Opera never seemed intuitive about when it opened things in the same window as compared to in a new tab, as compared to some strange sub-window to a tab. I had trouble getting it to put and keep my bookmarks in the order I wanted. I couldn't right-click on a bookmark or tab to change its properties or open it in a new tab or window. Altogether, it didn't work for the way that I wanted to use it.

So, I set up a Firefox extension to enable a session manager for it, which will also enable me to close Firefox on a regular basis even if I haven't finished with all of my tabs. I'm almost tempted to try out the Firefox 2.0 beta, but I think I've had enough browser fun for the month.

The Look of Your Book

This weblogger describes their job, book interior designer, and the many things that, it makes perfect sense, someone has to do when producing a book, yet which I never really thought about as part of the process. It's not just choosing the font, as they note, but layout and material issues that have to balance attractiveness and readability with the financial considerations of publishing the book. For example, they are told how many pages the book will have (based on non-design considerations), and then have to find the best way to put the book into that many pages. There's also an interesting bit on the picking of a font for a book. If you like the entry, definitely check out the archives, as there are other goodies about designing books in there.

July 10, 2006

Photoshop Tricks

I have a basic familiarity with Photoshop, and use it for the little photo editing that I do, but I know there are lots of capabilities to the software that I am not utilizing. So this description of using Photoshop filters to sharpen focus on a photo element was really useful, though I haven't yet found a photo with which to try it. I like how the article uses terminology like "depth of field", but doesn't assume that I know exactly what that is or how to use it in my photography. It's really a lesson in how to take good photographs, but explains how you can use Photoshop to tweak settings instead of having to get everything perfect at the moment you take the picture. I'll definitely be trying some of this out. [via Digg]

July 7, 2006

Attributing Authorship Review

Over the past few years I have entirely neglected the book review section of this site, and the truth is that I have hardly had time to read in the past year until a couple of weeks ago, but I'm going to make an effort to revive the site, beginning with a lengthy review of Attributing Authorship by Harold Love, reproduced below for your convenience.

Attributing Authorship
by Harold Love
Rating: +

In the aftermath of finishing my dissertation and starting in a new faculty
position, I've been thinking about various research problems within natural
language processing that I might wish to pursue, and generally trying to
read broadly within the field to make up for the tunnel vision associated
(for me at least) with the last years of a PhD program. One of the problems
that I've become interested in is authorship attribution. I've read a
number of papers about statistical approaches to authorship determination,
but I knew that there was a huge body of work that I knew nothing about, so
when I saw a reference to Love's book, I snapped it up.

Love is not writing for computer scientists in particular, and is taking
a much broader view of the problem of authorship attribution than is
usually considered by an NLP researcher. His book traces the problem of
authorship identification over its long history, starting with Biblical and
Homeric scholarship before working up to the modern state of affairs. He
approaches the topic as a literary scholar. For me, this insight into the
factors in authorship identification from the perspective of those who wish
to take those attributions as a jumping-off point, and not simply as ends
in themselves, was particularly illuminating. At the same time, this is an
accessible work for those with a more technical interest in the topic. Love
confines his discussion of the question of what "authorship" really means
and the accuracy of saying that a text ever has a singular author to an
introductory chapter. While he does not dismiss this debate out of hand,
for the purposes of this book he takes as given that "every author has a
verifiably unique style" in order to explore the attribution
techniques developed under that hypothesis.

The bulk of the text can be divided into two parts, each of several
chapters. In the first part, Love surveys the major features used for
attributing authorship. In the second part, Love considers specific studies and special
subtopics within the general problem of attribution. Love classifies the
major features into external evidence, internal evidence, and stylistic
evidence. As you would expect, external evidence includes features from
outside the text being considered - outsider claims as to the authorship,
publisher records, etc. Internal evidence looks at the content of the text,
including features such as historical events mentioned, which can locate
the text within time, or correlations between opinions expressed or phrases
used in texts of known and unknown authorship. Internal evidence starts to
blur into stylistic evidence when looking at verbal parallelism or
parallelism of thought, but stylistic evidence most commonly focuses on
smaller features, such as selection between synonyms, unusual word choices,
or idiosyncratic grammatical structures.
Love says that historically, and for many people even today, external
evidence is considered the most compelling, and internal or stylistic
evidence is only useful so far as it can bolster or discredit external
evidence. Love's position is that no single one of these features can,
alone, create a compelling argument for authorship. Rather, before we can
be convinced of an attribution, we should look for agreement between a
variety of these features which, together, justify a positive
identification.

In the latter half of the book, Love looks at a variety of special
topics, including the problem of author-gender identification, the problem
of forgery identification, the long-running debate on the authorship of
"Shakespeare's works", and the
application of bibliographical studies (the study of the physical artifact
of the text) and of stylometry (statistical methods) to author identification.
The discussion of stylometry, being the area of my own familiarity, was of
particular interest. The historical survey covered not only the classic
Mosteller and Wallace study of the use of function words in the Federalist
Papers, but also interesting older
efforts to build statistical fingerprints, such as one by Yule using the
number of nouns occurring once, twice, three times, and so on
within a text as descriptive of that author's style. While Love does not
delve into the details of these algorithms, he presents principal component
analysis and neural network techniques in a manner that I suspect would be
accessible to a non-technical audience and yet which acknowledges the risks
in their use, such as sample size, overfitting the data, and the difficulty
in finding feature sets with the necessary distribution patterns.
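
To make the Yule-style count concrete, here is a minimal Python sketch of a
"frequency spectrum": how many distinct words occur exactly once, twice,
three times, and so on. This is a simplification for illustration, not
Yule's or Love's procedure. It counts all word tokens rather than only
nouns (picking out nouns would need a part-of-speech tagger), and the name
frequency_spectrum and the sample sentence are made up for the example.

```python
from collections import Counter
import re


def frequency_spectrum(text):
    """Map n -> number of distinct words that occur exactly n times in text.

    Assumption: every word token is counted, not just nouns as in the
    study described above.
    """
    # Crude tokenizer: runs of lowercase letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # First count how often each distinct word occurs...
    word_counts = Counter(words)
    # ...then count how many words share each occurrence count.
    spectrum = Counter(word_counts.values())
    return dict(sorted(spectrum.items()))


sample = "the cat saw the dog and the dog saw a bird"
print(frequency_spectrum(sample))  # {1: 4, 2: 2, 3: 1}
```

In the sample, four words appear once, two appear twice, and one appears
three times. On a real text, the shape of this table is what would be
compared from author to author.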

Most fascinating, in the end, was Love's discussion of the debate over
the role of stylometry within the larger field of authorship attribution.
As Love presents it, there is little if any acceptance within the literary
community of attributions made solely on the basis of stylometry. Love
criticizes stylometry as a tool, but not a science - and in fact describes
it as a grab bag of tools, from which the researcher may pull statistical
techniques until they find one which produces a satisfying result. Furthermore,
even having found such a tool, the analyses produced have little
explanatory power. As Love says:

An example! 'The result', Thomas Merriam reports, 'of counting the letters in the 43 plays was the implausible discovery that the letter "o" differentiates Marlowe and Shakespeare plays to an extent well in excess of chance. If the cut-off ratio of 0.078 is adopted, then six of the Marlowe plays are grouped together at less than 0.078, 36 of the Shakespeare plays are classified correctly at greater than 0.078, and one Marlowe play, Edward II, is very marginally out of place by reason of its being greater than 0.078. The likelihood of 42 out of 43 correct chance classifications is infinitesimal.' .... This is by any standards a dazzling result achieved with the simplest of means; but it explains nothing and is itself unexplained. Merriam has no interest in the particular reasons why the relative frequencies of the letters 'o' and 'a' should be able to distinguish between Shakespeare and Marlowe or what an investigation of this might contribute to the discipline of linguistics.

This is, as far as it goes, a fair criticism. Most statistical methods
do not purport to produce an explanation as to the stylistic differences -
something that may be of central interest to the individual looking for the
attribution. I would note that the particular quote selected here
illustrates a troubling use of stylometry, since the
frequency of each of the 26 letters of the alphabet was computed, and only
the letters which have discriminatory power in this instance have been
focused on, without any evidence that frequency of the letter 'o' can help
distinguish Shakespeare from any other authors. Instead, robust stylometry
should present a generic technique which can be applied in any case in
which two authors must be distinguished, and which will produce a complex
enough "statistical fingerprint" to distinguish any such pair.
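
As a rough sketch of what a generic technique might look like, the Python
below builds the same 26-letter frequency profile for every text and
attributes an unknown text to whichever known author's profile is closest,
instead of hand-picking the one letter that happens to separate a
particular pair. Everything here is an assumption for illustration: the
function names, the choice of Manhattan distance, and the toy texts are not
taken from Love or Merriam, and a real study would need far longer samples
and richer features than single letters.

```python
from collections import Counter
import string


def letter_profile(text):
    """Relative frequency of each letter a-z among the letters in text."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    if not letters:
        raise ValueError("text contains no letters")
    counts = Counter(letters)
    total = len(letters)
    return {ch: counts[ch] / total for ch in string.ascii_lowercase}


def profile_distance(p, q):
    """Manhattan distance between two letter profiles (an arbitrary choice)."""
    return sum(abs(p[ch] - q[ch]) for ch in string.ascii_lowercase)


def attribute(unknown_text, known_texts):
    """Return the candidate author whose profile is closest to unknown_text.

    known_texts maps a (hypothetical) author label to a sample of writing.
    """
    unknown = letter_profile(unknown_text)
    return min(
        known_texts,
        key=lambda author: profile_distance(
            unknown, letter_profile(known_texts[author])
        ),
    )


# Toy samples, deliberately exaggerated so the difference is visible.
known = {
    "author_a": "ooooo look good moon",
    "author_b": "banana apple data alpha",
}
print(attribute("book of wood", known))  # author_a
```

The same three functions apply unchanged to any pair (or larger set) of
candidate authors, which is the property the single-letter cut-off lacks.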

In fact, the phrase "statistical fingerprint"
seems particularly apropos to me with regards to the central criticism in
the above quote. Physical fingerprinting similarly lacks
explanatory power; however, we generally accept that correct (or
statistically very likely) identifications can be made based on a
fingerprint. I would suggest that a statistical fingerprint be treated in
the same manner - just as a police investigator does not take a fingerprint
as an explanation for one's presence, but simply as strong evidence for the
fact of one's presence, a statistical fingerprint - if produced in a robust
manner - should be taken as strong evidence for an authorship attribution.

July 6, 2006

Alternate Browser

After hearing some positive feedback, I've decided to try out the Opera browser for the next week. I've been using Firefox, and I loooooove tabbed browsing, but it leaks memory like a sieve, at least for me. Opera, at first glance, seems to have many of the same nice features, plus it has a built-in setting that lets you close your browser and have it reopen to the same set of tabs - it is possible there is a plug-in for Firefox that does this, but I haven't seen it yet. Opera is acting a little sluggish for me, but it's possible that's just my wireless connection being cranky tonight.

Expect to hear back from me next week about my thoughts after a week of use. I will say that I really like how easy all of the browsers make it to transfer your bookmarks back and forth between them - way nicer than the bad old days....