This is an article about the sort of thing I spend my days doing. Usually, I can’t talk about that, for reasons of commercial confidentiality, however, this particular case is completely unrelated to anything my employer sells, so I should be OK. I’ve tried to explain things sufficiently well that someone non-technical can get it. Hopefully it’s not too dull or incomprehensible.
First, we need some technical background…
The Naming of the Parts
A graph is a bunch of things connected by lines. The CDC Snog Graph is an example of what I mean. The things (representing people, in this case) are known as vertices or nodes, and the lines (representing joyous sharing experiences of some sort, in this case) are known as edges.
The lines on the Snog Graph don’t have a direction, so we call it an undirected graph. If you add a direction to each edge, you get a directed graph, which we can represent by putting arrows on the lines. Friendship on Facebook can be represented as an undirected graph (where the nodes are people and the edges mean “is friends with”), because all friendships are mutual, but LiveJournal friendships need a directed graph to represent them, since I can be your friend without you being mine, and vice versa.
A graph is cyclic if there’s a way to walk along edges starting at one node (following the arrows if the edges are directed) and get back to that node again without walking the same edge more than once. The Snog Graph is cyclic, as is the graph of friendships on a Facebook or LJ (trivially so on LJ, where it’s possible to friend yourself. Regular dancers might consider how this applies to their graph, vis-a-vis people who consider avoiding other dancers an optional extra). Graphs where you cannot do this are called acyclic.
Usenet news: you tell kids today that, and they don’t believe you
Usenet is an electronic discussion system which pre-dates all this newfangled World Wide Web nonsense. It distributes messages which are known as articles. It has some desirable features that web-based discussion forums often lack, like comment threading and remembering what you’ve already read. These days, people think Usenet is owned by Google, but in fact it’s a distributed system, with no central server (another advantage over LJ). When you want to talk to Usenet without using Google, you run a client program on your own machine, which talks to your local server. Your local server forwards your article to servers it knows about, which then forward it to servers they know about, and so on (the path of an article through the servers forms a directed acyclic graph, in fact). When you want to read other people’s articles, your client program fetches them from your local server.
Each article is posted to one or more groups, which are like communities on LJ. Note that, unlike LJ, the same article can be posted to more than one group without having to cut and paste it: the same article exists in each group it is cross-posted to.
On each server, articles have a number within each group. The first article to arrive in a group has number 1, the second article number 2, and so on. Cross-posted articles have more than one number, one for each group the article appears in.
Article numbers differ between servers, because the order of arrival depends on the path the article has taken to reach the server, but since your client program only talks to your local Usenet server, it usually refers to articles by their number (there’s also a unique string of letters and numbers which identifies the article, which is how servers know which ones they’ve seen already, but that’s not important right now). Remembering which articles you have read is then just a matter of storing some ranges of numbers for each group (so your client might remember that you have read articles 1-100,243-299 and 342-400, say).
We wanted to de-commission a Usenet server and move its articles to another server. The servers run different and incompatible software, so the most obvious way to get articles from one server to another is to post them like a client or another server would.
The new server is supposed to be a drop-in replacement for the old one, so we can’t change the numbers or the existing client programs will get confused and think you’ve read articles you haven’t, or vice versa. So you can’t just grab all the articles from the old server any old how and post them to the new one, because they’ll be jumbled up. Unfortunately, the old server has no way of directly telling us the precise order that articles arrived in, though it will tell us article numbers within each group.
“Aha!”, we think. Since the order of arrival matters, we’ll grab the articles in order from one group on the old server, post them in that order on the new server, and move on to the next group, where we’ll do the same, and so on until we’ve done all the groups.
This idea is ruined by cross-posts, because they have more than one article number associated with them. If a cross-posted article is number 10 in one group and number 3 in another, you’d better post the first 9 articles to the first group, the first 2 to the second group, and then you can make the cross-post. But maybe there’s a cross-post to a third group in those 9 articles, so you’ll need to get that group up to date before you can post one of those. How do you work out what order to post the articles in?
Obscure Unix tools to the rescue
You might have guessed that the wibbling about graphs wasn’t entirely tangential to all this. You can draw a graph of the problem. Each article is a node. For each group, connect each article to the one with the next number up by a directed edge (in this case, the direction of the arrow means “must be posted before”). You’ve drawn yourself a directed acyclic (since the article numbers only increase) graph. The cross-posted articles are then nodes with more than one edge coming into them.
One of the wizards at work realised this, and also pointed out that there’s a standard Unix tool for converting such graphs into a list of nodes whose order preserves the order implied by the arrows, a procedure which is known as a topological sort. The tool’s called tsort. From there, it’s just a matter of representing each article in the way
tsort understands. When you do that,
tsort gives you an order in which you can post the articles from the old server to the new server so they’ll be given the same numbers on the new server as they had on the old one.
In which I’m not as knowledgeable as the wizards
The way I thought of to do it was to write my own program. You pick a group, and start trying to post articles from the old to the new server in article number order. If they’re not cross-posted, you just keep going. When you hit a cross-posted article, you switch to each group it is cross-posted to in turn, and try to post all the articles before the cross-post to each of those groups. If while you’re doing that, you hit another cross-post, you remember where you got up to and do exactly the same thing with the groups that second cross-post is posted to, and when you’ve done that, you switch back to the groups for the first cross-post. I had a working version of this at about the same time as someone pointed out that
tsort does the job 🙂
(Aside for the techies: this can be done using recursion. It looks like this is pretty much equivalent to one of the ways of doing a topological sort, namely the depth-first search mentioned on the Wikipedia page).
So I learned about topological sorts by re-inventing one. Most of the problems I deal with aren’t in fact cases of well-known problems (although I would say that if I was merely bad at recognising the fact, wouldn’t I?), but when they are, recognising that can save a lot of time.
And that’s what I do when I’m not on holiday.