programming

For you Firefox users, I’ve put a new version of LJ New Comments up.

It supports Russian keyboards, courtesy of some code from mumi_0.

I’ve also made it expand collapsed comments when you move to them by pressing the “n” or “p” keys, as my assumption when you do that is that you want to read the comment. The thread expansion stuff needs the style to have an “Expand” link for the comment (literally, a link to the thread with the text “Expand”: if anyone’s got any ideas on how to identify it in a way which doesn’t assume it’s labelled in English, let me know).

Comments/questions to the entry for the script.

Whenever I post a comment on LiveJournal, I get an email containing the text of it. I’ve written a Python program which turns these into an Atom feed, so that people could stalk me more easily by subscribing to the feed. I get into some interesting discussions on other people’s LJs, so I thought such a feed might be useful.

The program checks that the comment is on a public posting and doesn’t publish it if it isn’t (you can do this by submitting an HTTP HEAD request for the entry in question and seeing whether LJ redirects you to login or sends you a 4xx response, both of which I take to mean “don’t publish”). Edited to add: the program also periodically re-checks for posts changing their privacy settings (there’s a cache with an exponential backoff from a couple of hours to a month to avoid annoying LJ: the backoff is restarted if the entry’s privacy changes).

I’m not sure whether to do further checks before publishing the comment. On the one hand, all I’m doing is publishing my own words as they appear in someone’s public posting. On the other hand, sometimes people are quite surprised to find that people read stuff they’ve made public, and I don’t want to annoy my friends. Since I mostly comment on your journals, what do you think?

[ LJ Poll 1242038 ]

Some of you customise your journal’s appearance using LiveJournal’s S2 system. The whole thing looks a bit baroque, with much mentioning of “layers”, so I thought I’d just ask questions of people who might know how to do what I want.

I have a thingy which generates an Atom feed from the comments on each of public postings (by “thingy” I mean a couple of Python scripts, one of which is a heavily modified version of ljdump and the other of which is a script which generates the feeds using the dumped information. I might publish them if anyone’s interested). Proper blogs have these feeds, so I thought mine should too.

I would like to make people’s browsers aware of the feed when viewing an entry, which means sticking some extra <link> elements into the <head> element of the entry view. I’d also like to link to the feed somewhere on the entry page, probably in the little list of stuff you can do with the entry (you know, permalink, write a comment, add to memories, denounce, and so on). The link would reference http://www.noctua.org.uk/paul/lj/feeds/nnnn.xml, where nnnn is the unique number which LJ puts in the permalink to the entry itself (for example, this entry‘s number is 83644).

At some stage the script is also going to produce a single feed of all comments on my public postings (and maybe all my comments on your public postings, if you don’t object). So I’d also like to know how to stick stuff in the <head> of the entire journal view (which already links to the Atom feed of my postings which LJ generates itself).

Any help you can give in telling me where to start with this stuff would be much appreciated.

I’ve updated the LJ New Comments script so that it’s a lot faster at marking comments as new. While following the recent shitstorms in news, in which hundreds of angry LJ users are laying into the management, Firefox would seize up for a while and eventually warn me that the script was refusing to let go. Hopefully, the new version fixes that.

I’ve also changed the behaviour of the “NEW” link on new comments, so that clicking it now selects that comment. I think this makes more sense than taking you to the next new comment, as previous versions did, as I like to click to select the comment and then use the “n” and “p” keys to navigate.

Comments to the entry for the script, please.

This is an article about the sort of thing I spend my days doing. Usually, I can’t talk about that, for reasons of commercial confidentiality, however, this particular case is completely unrelated to anything my employer sells, so I should be OK. I’ve tried to explain things sufficiently well that someone non-technical can get it. Hopefully it’s not too dull or incomprehensible.

First, we need some technical background…

The Naming of the Parts

A graph is a bunch of things connected by lines. The CDC Snog Graph is an example of what I mean. The things (representing people, in this case) are known as vertices or nodes, and the lines (representing joyous sharing experiences of some sort, in this case) are known as edges.

The lines on the Snog Graph don’t have a direction, so we call it an undirected graph. If you add a direction to each edge, you get a directed graph, which we can represent by putting arrows on the lines. Friendship on Facebook can be represented as an undirected graph (where the nodes are people and the edges mean “is friends with”), because all friendships are mutual, but LiveJournal friendships need a directed graph to represent them, since I can be your friend without you being mine, and vice versa.

A graph is cyclic if there’s a way to walk along edges starting at one node (following the arrows if the edges are directed) and get back to that node again without walking the same edge more than once. The Snog Graph is cyclic, as is the graph of friendships on a Facebook or LJ (trivially so on LJ, where it’s possible to friend yourself. Regular dancers might consider how this applies to their graph, vis-a-vis people who consider avoiding other dancers an optional extra). Graphs where you cannot do this are called acyclic.

Usenet news: you tell kids today that, and they don’t believe you

Usenet is an electronic discussion system which pre-dates all this newfangled World Wide Web nonsense. It distributes messages which are known as articles. It has some desirable features that web-based discussion forums often lack, like comment threading and remembering what you’ve already read. These days, people think Usenet is owned by Google, but in fact it’s a distributed system, with no central server (another advantage over LJ). When you want to talk to Usenet without using Google, you run a client program on your own machine, which talks to your local server. Your local server forwards your article to servers it knows about, which then forward it to servers they know about, and so on (the path of an article through the servers forms a directed acyclic graph, in fact). When you want to read other people’s articles, your client program fetches them from your local server.

Each article is posted to one or more groups, which are like communities on LJ. Note that, unlike LJ, the same article can be posted to more than one group without having to cut and paste it: the same article exists in each group it is cross-posted to.

On each server, articles have a number within each group. The first article to arrive in a group has number 1, the second article number 2, and so on. Cross-posted articles have more than one number, one for each group the article appears in.

Article numbers differ between servers, because the order of arrival depends on the path the article has taken to reach the server, but since your client program only talks to your local Usenet server, it usually refers to articles by their number (there’s also a unique string of letters and numbers which identifies the article, which is how servers know which ones they’ve seen already, but that’s not important right now). Remembering which articles you have read is then just a matter of storing some ranges of numbers for each group (so your client might remember that you have read articles 1-100,243-299 and 342-400, say).

The problem

We wanted to de-commission a Usenet server and move its articles to another server. The servers run different and incompatible software, so the most obvious way to get articles from one server to another is to post them like a client or another server would.

The new server is supposed to be a drop-in replacement for the old one, so we can’t change the numbers or the existing client programs will get confused and think you’ve read articles you haven’t, or vice versa. So you can’t just grab all the articles from the old server any old how and post them to the new one, because they’ll be jumbled up. Unfortunately, the old server has no way of directly telling us the precise order that articles arrived in, though it will tell us article numbers within each group.

“Aha!”, we think. Since the order of arrival matters, we’ll grab the articles in order from one group on the old server, post them in that order on the new server, and move on to the next group, where we’ll do the same, and so on until we’ve done all the groups.

This idea is ruined by cross-posts, because they have more than one article number associated with them. If a cross-posted article is number 10 in one group and number 3 in another, you’d better post the first 9 articles to the first group, the first 2 to the second group, and then you can make the cross-post. But maybe there’s a cross-post to a third group in those 9 articles, so you’ll need to get that group up to date before you can post one of those. How do you work out what order to post the articles in?

Obscure Unix tools to the rescue

You might have guessed that the wibbling about graphs wasn’t entirely tangential to all this. You can draw a graph of the problem. Each article is a node. For each group, connect each article to the one with the next number up by a directed edge (in this case, the direction of the arrow means “must be posted before”). You’ve drawn yourself a directed acyclic (since the article numbers only increase) graph. The cross-posted articles are then nodes with more than one edge coming into them.

One of the wizards at work realised this, and also pointed out that there’s a standard Unix tool for converting such graphs into a list of nodes whose order preserves the order implied by the arrows, a procedure which is known as a topological sort. The tool’s called tsort. From there, it’s just a matter of representing each article in the way tsort understands. When you do that, tsort gives you an order in which you can post the articles from the old server to the new server so they’ll be given the same numbers on the new server as they had on the old one.

In which I’m not as knowledgeable as the wizards

The way I thought of to do it was to write my own program. You pick a group, and start trying to post articles from the old to the new server in article number order. If they’re not cross-posted, you just keep going. When you hit a cross-posted article, you switch to each group it is cross-posted to in turn, and try to post all the articles before the cross-post to each of those groups. If while you’re doing that, you hit another cross-post, you remember where you got up to and do exactly the same thing with the groups that second cross-post is posted to, and when you’ve done that, you switch back to the groups for the first cross-post. I had a working version of this at about the same time as someone pointed out that tsort does the job 🙂

(Aside for the techies: this can be done using recursion. It looks like this is pretty much equivalent to one of the ways of doing a topological sort, namely the depth-first search mentioned on the Wikipedia page).

So I learned about topological sorts by re-inventing one. Most of the problems I deal with aren’t in fact cases of well-known problems (although I would say that if I was merely bad at recognising the fact, wouldn’t I?), but when they are, recognising that can save a lot of time.

And that’s what I do when I’m not on holiday.

Here’s a C language technique which I’d not seen before starting my current job, which I think deserves to be better known.

Let’s say you have a bunch of things, to which you’ve given sequential integer IDs using an enum.

  enum my_things {
  thing_this,
  thing_that,
  thing_other,
  num_things
  };


For each of those things, you want to store some other information in various arrays, and index into the array using the enumerated value to pull that information out. Examples might be a parser (where the enums might identify operators, and an array might contain pointers to functions to implement them) or a message handling system (where the enums might identify types of message, and the array might contain pointers to functions to handle the messages).

void (*thing_fns[num_things])(void) = {
  do_this,
  do_that,
  do_other
  };
  
  void do_thing(enum my_things current_thing)
  {
  thing_fns[current_thing]();
  }
  
  void do_this(void)
  {
  /* wibble */
  }
  /* etc, etc. */
  


The problem comes when you want to add a new thing. You’re sizing the array using the enum, so it’s hard to get that wrong. But if you have a lot of things, sooner or later someone will add a new my_things member to the middle of the list (perhaps you’re listing your things in some order which makes sense in the context), go to update thing_fns, miscount how far down the definition thing_fns they’ve gone and screw up the ordering of the function pointers. Now the wrong function handles some (but not all) of the things, with hilarious results.

The solution is to maintain a single list of your things, and transform it into definitions for the enum and the associated arrays. You don’t need to write fancy code generator scripts for this, you can use the pre-processor. Wikipedia explains how.

(Note that Wikipedia’s example does away with sizing the array using the enum. They need to have a final NULL entry with no comma on the end to keep the compiler happy, so they’d end up writing the clunky num_things + 1. Now their array is sized correctly anyway.)

This chap goes into it in more detail, and also illustrates that your list of things doesn’t have to be in a seperate file.

Re-writing my example to use X-macros is left as an exercise for the reader. 🙂

Edited To Add: gjm11 points out that two of his friends have posted about this recently. gareth_rees has a more detailed example and some discussion.

I recently finished Andrew Hodges’s Alan Turing: the Enigma. The book is a definitive account of Turing’s life and work. In some places I found the level of detail overwhelming, but in others I admired the way Hodges uses his obviously extensive research to evoke the places and people in Turing’s life. The book is well worth reading for the perspective it gives on Turing, something which is absent from other, purely technical, accounts of his work.

Hodges portrays Turing as a man ahead of time, conceiving of the Turing machine as a thought experiment before the invention of the general purpose electronic computer, and inventing the Turing test when computing was in its infancy. Turing’s naivete was reflected in his refusal to accept what other people said could be done, but also in a lack of interest in the politics of his post-war work on computers and of his own homosexuality. A proto-geek, Turing was prickly, odd, and seemed to expect that the facts alone, when shown to people, would lead them to the same conclusions as he found.

Turing’s suicide is placed in the context of a move from regarding homosexuality as criminal to regarding it as a medical problem, and an increasing suspicion of homosexuals in classified government work. Hodges seems to conclude that Turing felt he had nowhere else to go.

You can’t help but wonder what else Turing might have accomplished had he not committed suicide. Greg Egan’s short, Oracle, is an entertaining what-if story, which also features a character very obviously based on C.S. Lewis. What if Turing had received help from a friend? It’s a pity that in reality there was no-one to lead him out of his cage.