Notes on O'Reilly Radar's State of the Computer Book Market

Friday, February 27, 2009 by DeWitt Clinton

For the past several years one of my favorite places to track programming language trends has been the "State of the Computer Book Market" series on O'Reilly Radar.

O'Reilly's Mike Hendrickson dives deep again this year into the statistics and details of the computer book market in a 5-part series:

  1. State of the Computer Book Market 2008, Part 1: The Market
  2. State of the Computer Book Market 2008, Part 2: The Technologies
  3. State of the Computer Book Market 2008, Part 3: The Publishers
  4. State of the Computer Book Market 2008, part 4: The Languages
  5. State of the Computer Book Market 2008, part 5: eBooks and Summary

My high level summary:

Languages ranked by 2008 book sales (% market share, change relative to 2007 rank):

  1. C# - 15.58% (↑)
  2. Java - 12.09% (↓)
  3. PHP - 9.93% (⇈)
  4. JavaScript - 9.89% (↓)
  5. C/C++ - 8.36% (↓)
  6. ActionScript - 5.76% (⇈)
  7. .NET Languages - 5.40% (↓)
  8. VisualBasic - 5.04% (↓)
  9. SQL - 4.57% (↓)
  10. Ruby - 3.51% (↓)
  11. Python - 3.41% (↑)
  12. VBA - 3.18% (↓)
  13. Objective-C - 2.56% (⇈)
  14. Perl - 2.14% (↓)

The most telling charts from the series:

2008 Market Share by Language:

[chart: 2008 Market Share by Language]

A Treemap view of the Programming Languages:

[chart: A Treemap view of the Programming Languages]

Percentage of 5 Year Sales Per Quarter by Media Type:

[chart: Percentage of Lifetime Sales Per Quarter by Media Type]

(Images copyright O'Reilly Media, and used without asking permission first -- will definitely take them down if necessary.)

Mike goes into far more detail on each topic over on Radar. Start reading the series here.

A Survey of Rel Values on the Web

Monday, February 16, 2009 by DeWitt Clinton

One of the interesting things about sharing an office with Jyri is that our free-association stream-of-consciousness conversations often lead to places worth exploring further.

On Friday Jyri and I started wondering about the link rel values documented in the XFN 1.1 profile, which include not only the relatively commonplace me and friend values, but also such unconventional values as colleague, muse, and spouse. But how frequently are the lesser-known rel values really used? Rather than speculate blindly, I wrote a simple mapreduce to check the web and find out for sure.

The mapreduce scanned approximately 177 million recently crawled HTML documents, parsing and counting rel values in link and anchor tags along the way. In those 177M documents, I found just over 19 billion <a> and <link> tags in total. And of those 19B tags, 1.8 billion of them contained a non-empty rel attribute.

Following the HTML5 rules for space-separated tokens, I split each rel value on [\s\t\n\r\f] and extracted each individual value. In total, over 1.9B instances of rel values were found, or an average of just over 10 per HTML document (with some tags having more than one rel value).
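The per-document counting step can be sketched roughly as follows. This is not the original mapreduce: RelCounter and count_rel_values are hypothetical names, Python's standard html.parser stands in for whatever parser was actually used, and lowercasing each token is an assumption.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# HTML5 "space characters": space, tab, line feed, form feed, carriage return.
HTML5_SPACE = re.compile(r"[ \t\n\f\r]+")


class RelCounter(HTMLParser):
    """Counts individual rel tokens on <a> and <link> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag not in ("a", "link"):
            return
        for name, value in attrs:
            if name == "rel" and value:
                # Assumption: tokens are lowercased before counting.
                tokens = [t for t in HTML5_SPACE.split(value.lower()) if t]
                self.counts.update(tokens)


def count_rel_values(html):
    """Return a Counter of rel tokens found in one HTML document."""
    parser = RelCounter()
    parser.feed(html)
    return parser.counts
```

Summing these per-document counters across the whole crawl would be the reduce step.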

I found a staggering 1.8M unique rel value strings in use, with many used only once or twice across all the web. The top 6 most-frequently-used rel values accounted for 80% of all usage, and the top 11 alone were responsible for 90% of all usage. In fact, fewer than 1000 of the most frequently used unique rel values are sufficient to cover 99% of all usage. In other words, the tail is long indeed, with the remainder of those 1.8M unique rel values accounting for less than 1% of the total usage.

In passing, I noticed that approximately 3 million rel value strings also contained a comma character; presumably cases where the author may mistakenly have thought that the "," character would be used as a delimiter. However, since these cases account for just 0.18% of all rel value strings, they have little impact on the overall totals.

Here are the top 25 rel values found in <a> and <link> tags in a moderately sized sample of the web today:

Rank  Value             Count
1     nofollow          832980014
2     stylesheet        338648161
3     tag               168764800
4     alternate         109150404
5     icon              69183607
6     chapter           56395793
7     forum             55920646
8     shortcut          53906964
9     bookmark          30683701
10    archives          25381711
11    category          24361195
12    external          19181232
13    search            14227485
14    edituri           8109835
15    apple-touch-icon  6753583
16    help              4842211
17    prev              4537344
18    next              4390373
19    pingback          4302068
20    wlwmanifest       4125573
21    contents          3959350
22    contact           3504587
23    service.post      2678873
24    top               2502015
25    me                2501273

The most frequently used values are not surprising at all. The nofollow value is used as a hint to search engines that the target of an <a> tag should not be used in ranking calculations. The stylesheet value is used on <link> tags to indicate that the target is an external CSS document. The tag value is a microformat used to indicate a category for the page, as popularized by sites such as Technorati and Delicious. And alternate is frequently used to facilitate the autodiscovery of an RSS or Atom feed for a given site.

Further down we learn that as OpenID continues to gain in adoption the openid.server and openid.delegate rel values come in at #35 and #43 respectively -- impressive, since each is only needed once per page. And even the newer OpenID2-style tags are not far behind, with openid2.provider and openid2.local_id reaching #51 and #837 respectively.

Near and dear to my heart, I was pleased to see the search rel value, the OpenSearch discovery mechanism, ranked so high at #13. Again, these discovery links are only needed once per page; a sign of strong adoption. Admittedly, not all rel="search" links are OpenSearch related, but I have another more comprehensive analysis of OpenSearch documents that shows similarly pervasive adoption rates.

Even the newly agreed-upon canonical rel value makes a showing at #271, and will surely rise to the top 25 or so over the next year or two.

And the XFN rel values? The contact rel value is the most common at #22, with me and friend just behind at #25 and #28 respectively. Filling out the list are acquaintance (#58), met (#68), colleague (#84), co-worker (#126), neighbor (#180), muse (#196), co-resident (#232), parent (#255), sibling (#414), sweetheart (#446), spouse (#570), crush (#794), kin (#834), child (#879), with date bringing up the rear at #1086.

This survey indicates that rel values are both widely and meaningfully used, with adoption being driven by a wide array of needs, such as semantic markup, search engine hints, client-side rendering, discovery and identity protocols, blogging, and/or content that can be later edited.

But more importantly, we learned that a full 0.0003% of all the links have declared, for all the world to see, that some URI out there is their source of inspiration, their Calliope, their Erato, their muse.

Unbelievable Mental Lapse

Monday, January 26, 2009 by DeWitt Clinton

I receive a fair bit of misaddressed mail at my gmail.com addresses. Sometimes it is the result of a typo on the part of the sender. But with surprising frequency it is the result of a real person accidentally entering my email address into a web form instead of their own. I've seen shipping confirmations, subscriptions to mailing lists, responses to job applications, new account notices on sites like Facebook and Twitter, etc. How someone could enter the wrong email address into one of these forms is beyond me.

But nothing, nothing will ever top the mental lapses responsible for this email I just received:

Internal Revenue Service <refunds@irs.gov>
Date: Mon, Jan 26, 2009 at 9:46 AM
To: DClinton@gmail.com
Dear Dianne Clinton,

Your Stimulus Payment request has beed submited.

A Stimulus Payment can be delayed for a variety of reasons.
For example submitting invalid records or applying after the deadline.

Stimulus Payment request issuer:

Name: Dianne Clinton
Address: [redacted]
City:  [redacted]
State: [redacted]
Postal Code:  [redacted]
Phone: [redacted]
Date of birth:  [redacted]/ [redacted]  (mmddyyyy)
Social Security Number:  [redacted]
Mother name:  [redacted]
Credit card Number:  [redacted]
Credit card expiration:  [redacted]/ [redacted]  (mm/yyyy)
CVV:  [redacted]

Note: For security reasons, we recorded your ip-address, the date and
time.
Deliberate wrong inputs are criminally pursued.
IP:  [redacted]
Date: Mon Jan 26, 2009 6:46 pm

Regards,
Internal Revenue Service

Yes, every single one of those [redacted] fields was filled out completely. Social security #, credit card #, mother's maiden name. The works. An identity thief could clear out her accounts and bankrupt her by morning.

Want to know the saddest part?

It was an identity thief. (The grammar and spelling errors were a bit of a dead giveaway. Besides, I can't imagine the real IRS would be so stupid as to send your private details back to you over plain-text email.)

A ten-second perusal of the address headers showed that, not surprisingly, this message did not originate from irs.gov.

Rather, the mail originated from this site: (And needless to say, don't you go filling it out!)

http://www.ieaf.es/bbdd/apps/news/stimulus.refund/stimulus.php

This woman was the victim of a phishing scam; she probably thought she was entering her very personal data into a legitimate United States government website, and she may never realize how wrong she was. She didn't notice the lack of https, or that the domain was ieaf.es, a known IRS phishing site, hosted on a Spanish top-level domain.

I will submit the site to the various phish-tracking websites and make the appropriate notifications at work. That said, I'm on the fence about trying to contact her directly. Morally it would be the right thing to do. However, in this litigious era, it might be exactly the wrong thing to do. Needless to say the email itself will be permanently deleted from my inbox.

This whole episode makes me very, very sad.

Sampling Twitter

Friday, January 02, 2009 by DeWitt Clinton

Update: I had made a mistake in gathering the samples and took this post down temporarily to fetch new data. A typo I made in one of my scripts radically under-counted the number of friends each account had. Mea culpa, and the lesson I learned is that if a number seems counter-intuitive, then it likely is. I apologize for the mistake!

...

Most of the surveys that have attempted to study Twitter usage do so by scraping the public stream of tweets. This provides reasonable data about how people are publicly using Twitter but it suffers from sample bias insofar as only active Twitter accounts are counted while private accounts and accounts that go dormant are overlooked.

After Dion posted that I was the first person he subscribed to on Twitter, a conversation started in which we speculated about the distribution of used vs. unused user IDs on Twitter. Specifically, I became curious as to how Dion, who signed up for Twitter less than three months after I did, could have a user id 3.5 million higher than mine. Was Twitter really signing up users at a rate of more than one million a month this time last year? Doubtful, but the only way to find out was to gather some hard data.

Instead of mining data from the public feed, I wrote a short script to query a sample across all of the possible Twitter ids. I first created a test Twitter account, which was assigned a new user ID of 18496098. Using this id as an upper bound on the population of all Twitter IDs, I selected samples at random from the range (0, 18496098) (exclusive), and queried the Twitter API at a metered rate from several machines over the course of a day.

After rerunning the script on queries that returned transient server-side or client-side errors ("502 Bad Gateway", "503 Service Temporarily Unavailable", etc), I arrived at a clean, unbiased sample pool of 4414 ids.

Of the 4414 sampled ids, 1270 ids have been assigned (returned "200 OK") and 3144 are not in use (returned "404 Not Found"). Of the 3144 unassigned ids, 3120 of those were "Not found" (and presumably were never used), and 24 were "User has been suspended".
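The sampling step can be sketched roughly as follows. This is not the original script: sample_ids and classify are hypothetical names, the HTTP request itself is left out, and treating every status other than 200 and 404 as retryable is an assumption.

```python
import random

MAX_ID = 18496098  # id assigned to the freshly created test account


def sample_ids(k, seed=None):
    """Return k distinct ids drawn uniformly from the open range (0, MAX_ID)."""
    rng = random.Random(seed)
    return rng.sample(range(1, MAX_ID), k)


def classify(status_code):
    """Map an HTTP status code to an outcome for one sampled id."""
    if status_code == 200:
        return "assigned"
    if status_code == 404:
        return "unassigned"
    # Assumption: anything else (502, 503, ...) is transient and gets retried.
    return "retry"
```

Drawing without replacement keeps any id from being counted twice, and a fixed seed makes a run repeatable.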

By this ratio, we can infer that approximately 5,000,000-5,500,000 accounts have been created since Twitter's private launch in early 2006.
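The arithmetic behind that estimate, using only the figures quoted above (the variable names are mine):

```python
MAX_ID = 18496098   # upper bound on the id space
SAMPLED = 4414      # ids in the clean sample
ASSIGNED = 1270     # sampled ids that returned "200 OK"

# Share of the sample that is assigned, scaled up to the whole id space.
assigned_ratio = ASSIGNED / SAMPLED
estimated_accounts = assigned_ratio * MAX_ID

print(f"{assigned_ratio:.1%} of sampled ids are assigned")
print(f"roughly {estimated_accounts:,.0f} accounts created")
```

The point estimate lands inside the 5,000,000-5,500,000 range given above.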

Of the 1270 sampled ids that have been assigned to a user, 847 accounts have posted at least one update, 759 are being followed, and 735 are following another user.

Further breaking down the 1270 assigned ids, we find that 635 ids are both being followed and follow someone else, 574 ids have posted a status and are being followed, and another 541 have posted a status and follow someone else. And 501 have posted a status, follow someone else, and are being followed.

Of the 1270 ids that have been assigned, 97 have protected their status updates, and 1173 have left their status updates public (the default).

Of the 1173 public ids, 1048 of the accounts were created more than 30 days ago.

Of the 1048 more-than-30-day-old public ids, 691 have posted a status message at least once.

And of those 691 sampled public ids that are over 30 days old and have posted at least one status message, 305 of those accounts have returned to post an update more than 30 days after their account was first created.

This last metric -- users that have returned to post again more than 30 days after creating their account -- is the best metric I can come up with for a return user on Twitter. Extrapolating this ratio of return vs non-return users back across the segment of users too new to test and the private accounts, we estimate that 29.1% of assigned ids, and thus roughly 8.3% of all possible ids, are assigned to someone that has returned at least once to post something more than 30 days after their account was initially created.

Given a potential max population of 18,496,098, this ratio implies that there are up to 1,500,000-1,600,000 users that have returned to Twitter to post again after first creating their account, which is a respectable number, and is consistent with the estimates made by others observing the pattern of public status updates.

Of the 305 return accounts, 249 are both following at least one other account and have at least one follower.

Again extrapolating for accounts too new to test and private accounts, this suggests that 23% of all assigned ids, and thus 6.8% of all potential user ids, are assigned to someone who is posting regularly, is following other users, and is being followed by at least one other user. This implies that there are up to 1,200,000-1,300,000 active, connected users on Twitter.
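Both extrapolations can be reproduced from the sampled counts. The sketch below assumes the denominator is the 1048 public ids that are more than 30 days old, which is what reproduces the 29.1% figure. The variable names are mine.

```python
MAX_ID = 18496098
SAMPLED, ASSIGNED = 4414, 1270
ELIGIBLE = 1048    # public ids created more than 30 days ago
RETURNED = 305     # of those, posted again after 30+ days
CONNECTED = 249    # of the returned, following and followed by at least one account

assigned_ratio = ASSIGNED / SAMPLED

# Shares of assigned ids, assuming the eligible group is representative
# of private accounts and of accounts too new to test.
return_share = RETURNED / ELIGIBLE
connected_share = CONNECTED / ELIGIBLE

return_users = return_share * assigned_ratio * MAX_ID
connected_users = connected_share * assigned_ratio * MAX_ID

print(f"return users:    {return_share:.1%} of assigned, about {return_users:,.0f}")
print(f"connected users: {connected_share:.1%} of assigned, about {connected_users:,.0f}")
```

Under that assumption the connected share comes out a little under 24% of assigned ids, slightly above the 23% quoted, while the share of all ids and the resulting user counts match the ranges in the text.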

Of the public users sampled, 470 are followed by no one, 524 are followed by between 1 and 10 people, 159 are followed by 11 to 100 people, and 20 are followed by more than 100.

Of the public users sampled, 499 are following no one, 514 are following between 1 and 10 people, 139 are following 11 to 100 people, 20 are following 101 to 1000 people, and 1 is following more than 1000.

Of the public users sampled, 403 have no status updates, 535 have posted between 1 and 10 status updates, 141 have posted between 11-100 status updates, 87 have posted between 101-1000 status updates, 6 have posted between 1001-10000 status updates, and 1 has posted more than 10,000 status updates. (The outlier with 10000+ updates is a bot.)

And to return to the question that started the experiment, there is indeed both an upward trend to the number of users created per month, and a sharp transition in early 2008 when huge blocks of ids no longer went unassigned between the creation of each account.

These numbers, while not perfect, should be a reasonably accurate ballpark estimate -- ±2% within 2σ over the total population, and ±3% over the population of assigned ids -- and the numbers likely wouldn't change significantly with a larger sample. However, there's always the chance of mistakes, so please feel free to download the data set to confirm. Or if you work at Twitter and would like to verify these numbers, even privately, I'd love to hear how close the sampled numbers come to reality.
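One way to sanity-check those error bars is the usual normal approximation for a sampled proportion. This is an assumption about how the margins were derived, not a reconstruction of the original calculation, and it uses the worst case p = 0.5.

```python
import math


def margin_of_error(n, p=0.5, z=2.0):
    """Half-width of a z-sigma interval for a proportion from n samples."""
    return z * math.sqrt(p * (1 - p) / n)


total = margin_of_error(4414)      # claims about all possible ids
assigned = margin_of_error(1270)   # claims about assigned ids only

print(f"all ids:      +/- {total:.1%}")
print(f"assigned ids: +/- {assigned:.1%}")
```

Both values come in just under the ±2% and ±3% quoted above, so the stated margins read as rounded-up, conservative bounds.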

Sunsetting Delancey

Sunday, December 28, 2008 by DeWitt Clinton

Three years ago I launched a little service called Delancey. Delancey was an early del.icio.us mashup that kept track of how many times you clicked on each bookmark. This usage metadata was valuable because with it you could sort your bookmarks in order of how often they were used, making for a simple but powerful default browser homepage that learned from your behavior.

In designing Delancey I made several decisions that I knew might eventually cause a maintenance challenge, but I felt that they were justified at the time because of the benefit that they offered.

The first design decision was that users of Delancey would be able to "claim" their del.icio.us account so that they, and only they, would be able to access or update their personal click data. Given that a standard mechanism for claiming external identities was still in its infancy, I hacked together a technique by which a signed-in del.icio.us user would automatically bookmark a secret claim URL that the Delancey application would verify and then delete. It was a surprisingly effective approach that created a strong verifiable claim over a del.icio.us identity without the Delancey application ever requesting the del.icio.us user's password. However, it took advantage of how the del.icio.us front-end was implemented at that particular moment, and hence it was unlikely to remain stable forever.

The second design decision was that the Delancey application would never store a user's del.icio.us username or the URLs of the bookmarks the user accessed. This wasn't strong security, as both were stored as simple one-way md5 hashes of the plaintext data; however, it prevented casual abuse, as reversing the associations would require a complete dictionary of del.icio.us usernames and the bookmarked URLs. (A truly secure system would require a secret key for each Delancey user and a mechanism to encrypt the data on the client side -- a reasonable exercise, but overkill for this type of application.) This kept your data private (even from Delancey itself), but it meant that exporting it would be more difficult down the road.
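A minimal sketch of the kind of storage described above, assuming the username and the URL are each hashed separately as UTF-8 text and kept as hex digests. The function names are hypothetical, and any normalization Delancey applied before hashing is not covered here.

```python
import hashlib


def md5_hex(text):
    """Unsalted one-way md5 digest of a string, as lowercase hex."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def click_key(username, url):
    """Storage key for one (user, bookmark) click counter.

    Only the two digests are kept, never the plaintext values.
    """
    return (md5_hex(username), md5_hex(url))


# Click counts keyed by digests only.
clicks = {}
key = click_key("example-user", "http://example.com/")
clicks[key] = clicks.get(key, 0) + 1
```

Because the digests are unsalted, anyone holding a list of candidate usernames and URLs could recompute them, which is why this only deters casual abuse.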

Then del.icio.us relaunched as delicious.com this fall. I suspected that Delancey would break during the transition, but I couldn't find a migration guide for del.icio.us developers so I wasn't exactly sure what to be on the lookout for. So I shelved the project for a while longer and hoped that nothing serious was broken.

This weekend I had the opportunity to investigate, and sure enough, a few of the features that made Delancey possible were no longer supported. Most critically, the automatic bookmarking trick that Delancey used for claim verification was rendered ineffective because the new delicious.com front-end signs (with a key or nonce) the form used to post new URLs, thus blocking Delancey's attempt to mimic the POST request via an external page. This is a perfectly reasonable decision on the part of the delicious team, but without it Delancey would need a new mechanism for account verification.

The Delicious API does provide a way of posting new bookmarks, and hence a backdoor mechanism for account verification. However, the official API relies on HTTP Basic auth, which would mean the user would be presented with an unexpected browser-based interstitial login box if Delancey were to use it. Or worse, the Delancey application could itself request and proxy credentials for the del.icio.us user -- exactly what I was trying to avoid. The latter technique is a form of the password anti-pattern, and I couldn't in good faith implement that anti-pattern myself. Fortunately, either (or both) of OpenID and OAuth would be sufficient for verifying the account claim if delicious supported them. Unfortunately, Delicious currently supports neither OpenID nor OAuth as far as I can tell. While I'm reasonably confident that I could come up with some other hacked-together solution on top of the current delicious.com front-end, I don't feel comfortable again investing in a solution that isn't officially supported.

So with that as background, I've decided to sunset Delancey. I don't have a hard date as to when I'll shut it down, but I won't be adding new features, nor will I be fixing any bugs beyond those that I fixed today. I promise not to take it down for another 90 days (through the end of March, 2009), but after that I may turn it off completely to save bandwidth. However, before I take it down I will provide links to export your existing click data. That said, since the bookmark URLs are stored as hashes only, you will need to do some work on your end to associate the underlying URL with the click counts. (This is possible because you know your own username and you can easily retrieve a list of all of your bookmarks from delicious.)

For those that want to start extracting their data now, you can play around with the auto-documented Delancey API. For example, to see a list of your tags, you can use the /delancey/tags/{username} method (e.g., as YAML or as JSON). And you can retrieve your click counts via the /delancey/bookmarks/{username}/{tag}/ method (e.g., as YAML or as JSON). (You'll note that the latter API call automatically does the hash association with the public bookmark data -- if the del.icio.us account is private, or if the bookmarks are deeply buried, then the titles and the URLs are unavailable. The full Delancey export will contain only the (hash, count) tuples.)
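Re-associating an export with your own bookmarks could look something like the sketch below. It assumes the export is a mapping of md5 hex digest to click count and that each digest was computed over the bookmark URL exactly as delicious returns it. Both are assumptions, and reassociate is a hypothetical helper, not part of the Delancey API.

```python
import hashlib


def md5_hex(text):
    """md5 digest of a string as lowercase hex."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def reassociate(export, bookmark_urls):
    """Map exported (hash, count) pairs back to known bookmark URLs.

    export: dict of md5 hex digest -> click count
    bookmark_urls: iterable of URLs retrieved from your own account
    Hashes that match none of the URLs are silently skipped.
    """
    by_hash = {md5_hex(url): url for url in bookmark_urls}
    return {by_hash[h]: count for h, count in export.items() if h in by_hash}
```

Any hash left unmatched most likely belongs to a bookmark that has since been deleted or whose URL was edited.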

Going forward, if Delicious implements OAuth or OpenID in the next month or two I may change my mind and get this running again, perhaps by porting it over to Python and App Engine in the process (a port I started a while back but never completed). If not, then I hope people understand my reasons for shutting down this service. It was a fun app to build, and if you used it I hope you found it valuable for what it did. I still believe that there is great opportunity in the social bookmarking space, and whether it is to be found at the relaunched delicious.com, at a site like digg.com, or at a site yet unseen, I look forward to whoever pushes the technology forward next.