I dedicated a couple of hours during the weekend to (gently) spidering a well-known online jobs site in Argentina for Python-related positions, and then running hierarchical cluster analysis on hand-selected keywords using Pycluster.
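
To give an idea of what this kind of analysis looks like, here is a minimal sketch in plain Python. It is not the script I ran and it does not use Pycluster: the toy postings, the Jaccard distance between keywords, the average linkage rule, and every function name below are assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical toy data: each posting reduced to the set of hand-picked
# keywords it mentions. These are made up, not the scraped postings.
POSTINGS = [
    {"html", "css", "javascript"},
    {"html", "css", "mysql"},
    {"html", "javascript", "css"},
    {"linux", "tcp", "unix"},
    {"linux", "unix", "dns"},
    {"tcp", "dns", "linux"},
]


def keyword_profiles(postings):
    """Map each keyword to the set of posting indices that mention it."""
    profiles = {}
    for i, keywords in enumerate(postings):
        for kw in keywords:
            profiles.setdefault(kw, set()).add(i)
    return profiles


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; 0 means the two keywords always co-occur."""
    return 1.0 - len(a & b) / len(a | b)


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(
        jaccard_distance(profiles[x], profiles[y]) for x in c1 for y in c2
    )
    return total / (len(c1) * len(c2))


def cluster_keywords(postings, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    profiles = keyword_profiles(postings)
    clusters = [frozenset([kw]) for kw in sorted(profiles)]
    while len(clusters) > k:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], profiles),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    for group in cluster_keywords(POSTINGS, 2):
        print(group)
```

Asking for a larger k stops the merging earlier, which is the "finer granularity" view of the same hierarchy. A dedicated library would build the full tree once and let you cut it at any level, instead of recomputing distances at every step as this sketch does.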

According to this analysis (with all the caveats about rushed work, low n, etc.), the people who write posts on job boards see roughly three differentiated “domains of competence”:

  • A “narrow web domain”: ajax, dhtml, hibernate, apache, tomcat, spring, corba, rails, java, ruby, perl, php
  • A “wide web domain”: html, javascript, css, xml, mysql, cms, xhtml, c, cctv, ethernet, django, turbogears, flex, flash, coldfusion, xslt, lamp, mssql, soap, clusters, hpc, jboss, jetty, subversion, snmp, samba, excel, sybase, smarty, postgresql, rpc, plone, openerp, zope
  • A “server domain” (more sharply distinct from the rest): c++, boost, dns, firewalls, jython, unix, oracle, sql, solaris, ip, api, tcp, openssl, linux, svn

The labels I chose are of course largely arbitrary, but the grouping itself is less so, and not obviously derived from technological reasons. The “server domain” is more or less self-explanatory (although not devoid of weirdness), but why are the first two categories grouped as they are? Off the top of my head, I think that this reflects the existence of a more programming-oriented web domain among Python-mentioning jobs, where dynamic languages are prominent and relatively large deployments are expected (hence the “enterprise” Java technologies), and a large and heterogeneous “bag domain” of web-related technologies where anything goes. Needless to say, this description falls apart quite quickly (“clusters” and “hpc” belong to this bag domain, when they should logically go somewhere else), but, still, it seems to be a workable first approximation.

Looking at a finer granularity, things become much clearer. You have, for example, a “dynamic languages” cluster, and a very well defined “classic websites” cluster (html, javascript, css, xml, mysql).

It would be interesting to see how well these clusters (especially at the finer granularity levels) correlate with actual demands during work, but I’m not sure where to get that data from.

Digital Rights Management is the bone-headed idea that you can control information once it has left your systems.

Privacy is the bone-headed idea that you can control information once it has left your systems.

The digital movie that can’t be pirated is as mythical as the business or government record that won’t be warehoused and cross-indexed. In both cases technology makes too easy and untraceable what interests make too desirable.

Eric Schmidt is a hypocrite and a tool. That said, I don’t think we’re going to have any more success controlling our bits than the media distributors have had with theirs, so we need to figure out something else.

Am I happy with this? Nope. But it is what it is.