Mysterious comments explained, and ruminations on spam

For a while now, I’ve been getting comments on my LiveJournal which apparently aren’t spam, but rather are questions which are totally out of context. For instance, I got one the other day which said “Hi. I find forum about work and travel. Where can I to see it?”

I recently got some more comment spam advertising something called XRumer, a clever and nasty program for spamming bulletin boards and other forums (like LJ), which is brought to us by some evil Russians (“No Meester Bond, I expect you to die”). One of the things the authors claim it can do is a crude form of astroturfing. They say you can configure it to post a comment asking about something, and then a response, apparently from another user, mentioning the site you actually want to advertise. It looks like this feature doesn’t quite work, and that the questions I’ve been seeing are examples of it misfiring. Mystery solved.

The spammers seem to favour certain entries of mine, so I’m screening anonymous comments on those entries (and on this one too, since I imagine it might attract undesirables). I don’t want to do that for my entire journal, as I get comments from people who aren’t on LJ but who say worthwhile things. In an ideal world, the way round this would be OpenID, but that’s not in widespread use yet, possibly because people who have an OpenID often don’t know they do. [Attention LJ users: you have an OpenID. Congrats. You’ve got a Jabber instant messaging account, too. See how good bradfitz is to you?]

A system which allows easy communication between two people who have no previous connection to each other is susceptible to spam. The trick is to keep this desirable feature while not being buried in junk (you could go the other way and remove this feature, of course, as some IM users have, or make a virtue of it with social networking sites, but that’s not really an option for public blogs). Anything an ordinary user might do to create an identity, a spammer can do too, so cryptographic certificates aren’t a magical solution. Legislation doesn’t help, because the police don’t care, and anyhow the spammers are in Wild West states like China or Russia, or at least run front operations there.

Most spam is still sent via email. Email spammers have been subject to an evolutionary arms race. The remaining effective spammers are bright and totally amoral. They’ll hijack millions of other people’s computers to send their spam or even to host the website they’re advertising, making it hard for blacklists to keep up (and they’ll use these computers to flood centralised blacklist sites with traffic in an attempt to knock them off the net). They’ll vary the text they use, to defeat schemes which detect the same posting lots of times. They’ll use images rather than text, or simply links to those images, to defeat textual analysis. You can bet that blog spammers will learn from this (some of them are probably email spammers too).

What’s working against email spam, and will similar ideas work against blog spam?

  • Banning mail sent directly from consumer ISP connections is the single most effective thing I do (you can do this with the Spamhaus PBL and with a few checks for generic rDNS to catch what the PBL misses). You can’t do that with blog comments, as, spam or not, they almost all come from consumer ISP connections.

  • Banning mail sent from IPs which are known sources of spam is also effective. You can do that with blog comments, but you either need to be big enough to generate your own list (as LJ might be) or have the resources to run a centralised list like Spamhaus (which will itself be attacked by spammers). There are currently no IP blacklists devoted to blog spamming, as far as I know, although some spam comments I’ve seen came from IPs which were in the Spamhaus XBL. (There’s a rough sketch of how these DNS-based lookups work after this list.)

  • Filtering on ways in which spamming programs differ from legitimate SMTP clients (greylisting, greet pause) is currently effective, but only as long as these methods don’t become so widespread that it’s worth the spammers’ while to look more like a legitimate sender. Still, that doesn’t seem very likely: incompetent admins aren’t in short supply, and I don’t have to outrun the bear, only outrun them. This sounds promising against blog spammers, where apparently simple-minded schemes are pretty effective. (A toy greylisting sketch follows the lookup example below.)
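
To make the PBL and XBL items above a bit more concrete, here’s a minimal sketch of the DNS lookup involved, in Python purely for illustration (in practice the check lives in the MTA or the blog software). The zone name and the meaning of the return codes are my reading of Spamhaus’s documentation, and a real deployment would want caching, timeouts and proper error handling.

```python
# Hypothetical sketch of a DNSBL lookup against Spamhaus ZEN, which folds in
# the PBL and XBL mentioned above. Illustrative only.
import socket

ZONE = "zen.spamhaus.org"

# Return codes encoded in the answer address (my reading of the documentation):
#   127.0.0.2-3   -> SBL (verified spam sources)
#   127.0.0.4-7   -> XBL (exploited/compromised hosts)
#   127.0.0.10-11 -> PBL (consumer/dynamic space that shouldn't be sending mail)
MEANINGS = {2: "SBL", 3: "SBL", 4: "XBL", 5: "XBL", 6: "XBL", 7: "XBL",
            10: "PBL", 11: "PBL"}

def dnsbl_lookup(ip: str) -> list[str]:
    """Return the Spamhaus lists an IPv4 address appears on (empty if none)."""
    reversed_ip = ".".join(reversed(ip.split(".")))
    try:
        _, _, answers = socket.gethostbyname_ex(f"{reversed_ip}.{ZONE}")
    except socket.gaierror:
        return []  # NXDOMAIN (or lookup failure): treat as not listed
    return sorted({MEANINGS.get(int(a.rsplit(".", 1)[1]), "unknown") for a in answers})

if __name__ == "__main__":
    # 127.0.0.2 is the conventional "always listed" test address for DNSBLs.
    print(dnsbl_lookup("127.0.0.2"))
```

The same lookup works whether the IP belongs to an SMTP client or to someone submitting a comment form; only the policy you apply to the answer differs.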

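And here, under the same caveats, is a toy version of the greylisting rule from the last item: temporarily reject the first delivery attempt from an unseen (client IP, sender, recipient) triple, and accept a retry that arrives after a decent interval. Legitimate MTAs retry; most spamware doesn’t. The delay and lifetime values below are arbitrary examples, and a real implementation would persist its table rather than keep it in memory.

```python
# Toy greylisting check: tempfail the first attempt from an unseen
# (client IP, sender, recipient) triple, accept patient retries.
import time

GREYLIST_DELAY = 5 * 60        # retries sooner than 5 minutes stay blocked
GREYLIST_LIFETIME = 36 * 3600  # forget triples not completed within 36 hours

_seen: dict[tuple[str, str, str], float] = {}

def greylist_verdict(client_ip: str, sender: str, recipient: str) -> str:
    """Return 'accept' or 'tempfail' for one delivery attempt."""
    now = time.time()
    triple = (client_ip, sender, recipient)
    first_seen = _seen.get(triple)
    if first_seen is None or now - first_seen > GREYLIST_LIFETIME:
        _seen[triple] = now
        return "tempfail"   # SMTP 451: try again later
    if now - first_seen < GREYLIST_DELAY:
        return "tempfail"   # retried too quickly; still suspicious
    return "accept"         # a patient, well-behaved client
```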

What else can we do with a website that we can’t do with email?

  • CAPTCHAs are popular, but a bit of a bugger if you’re blind. The evil Russians claim to have defeated most of the deployed ones which use obscured letters, though that still leaves the “click on the picture of a cat” variant.

  • Proof-of-work or hashcash schemes are currently very effective, suggesting that blog spammers don’t yet have the huge amounts of stolen computing resources available to email spammers, or that they don’t have the knowledge to implement the hashcash algorithm in their spamming software. By using proof-of-work, we can at least drive the weak blog spammers to the wall.

    You can think of proof-of-work as a variant on the tactic of differentiating spam programs from real humans. Spammers can defeat simple-minded checks on how long a user has been reading a page before commenting without slowing their spamming rate by much (how to do this is left as an exercise for the prospective spammer), but if a web browser has to do a computation which takes a fixed time and send the result along with the comment, the spammers have to slow down or do the work in parallel on many computers. If you can work out a way of doing the calculation in the background as the user looks at your page and writes their comment, so much the better. If you can dynamically generate the code you send to the browser to make it prove it’s done some work, you stop the spammers writing something equivalent in a real programming language and force them to run it in Java or Javascript. That’d really show them who’s boss. (There’s a minimal sketch of such a scheme after this list.)

    This hurts people who’ve turned off Javascript or Java, but it’s time for those dinosaurs to join the web 2.0 world, right?
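
As promised above, here’s a minimal hashcash-style sketch of the proof-of-work idea. In real life the solving half would be Javascript running in the browser while the user writes their comment; both halves are written in Python here just to show the shape of it, and the difficulty value is an arbitrary example. A real deployment would also remember each challenge it issues and accept it only once.

```python
# Minimal hashcash-style proof-of-work sketch for a comment form.
# solve() stands in for the Javascript that would run in the browser;
# make_challenge() and verify() are the server's side of the bargain.
import hashlib
import os

DIFFICULTY_BITS = 20  # arbitrary example; roughly a million hash attempts on average

def make_challenge() -> str:
    """Server: hand the browser a random challenge along with the comment form."""
    return os.urandom(16).hex()

def leading_zero_bits(digest: bytes) -> int:
    value = int.from_bytes(digest, "big")
    return 256 - value.bit_length() if value else 256

def solve(challenge: str) -> int:
    """Browser: grind through nonces until the hash is 'hard enough'."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server: one hash to check, however long the answer took to find."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS
```

The asymmetry is the point: the commenter’s browser does around 2^20 hash attempts in the background, while checking the answer costs the server a single hash.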

10 Comments on "Mysterious comments explained, and ruminations on spam"


  1. “It looks like this feature doesn’t quite work, and that the questions I’ve been seeing are examples of it misfiring.”

    At a guess: LJ is better at blocking the “answer” comments that have the URLs in them than the “question” comments that don’t.

    1. Ooh, maybe. That said, a search for similar text on Google didn’t find any astroturf questions with actual replies, either. I prefer to hope the software doesn’t work.

  2. Don’t modern CAPTCHAs also have an audio version for blind people?

    I’m sure I’ve read somewhere a plan for dealing with CAPTCHAs. I think the plan was to link them with porn sites: so many people are trying to get free porn that if you made them answer a CAPTCHA before they got the porn, you’d get lots and lots of CAPTCHAs solved per hour. The trick being to display to the porn surfer the CAPTCHA from somewhere like LJ.

  3. Problem with CAPTCHAs is that I’m really bad at them, and I’m pretty sure I’m human. Quite often takes me two or three goes…

  4. I think the most evil captcha-style defence I’ve seen was recently described on Digg: it displays nine images that received top or bottom scores on http://www.hotornot.com, and asks the viewer to identify the three hot ones. }:-)

    (If anyone reading this is stupid enough to visit a site with the name “Hot or Not” while at work, I’m afraid they’re also too stupid to understand any warning that an ethical surfer would insert at this point…)

    As for the Java/Javascript issue, it may be old hat, but frankly I find the increased security and reduced annoyance dramatically outweigh any benefits of allowing scripting on untrusted sites. Whitelists were invented for a reason!

    1. The worst I’ve seen happen from Java/JS is confused deputy attacks like those against LJ a little while back. I was running something which enabled Javascript only for whitelisted sites while LJ were getting their act together, and found a lot of sites seemed to require JS to work properly. In the end, while I don’t want the world reading my friends-only stuff, it wouldn’t be catastrophic either.

      Pop-up ads are the main thing that’s annoying with JS enabled, but Firefox and AdBlock deal with those.

      1. I have JS turned off by default on my home machine, and turn it back on on a site-by-site basis. After about a month of doing so, I find I don’t generally need to enable more than one site a day.

  5. Just a note … wordpress.com now accommodates OpenID (a recent change).

    WordPress also has a spam filter that seems to be 99.9% effective. Once in a very long while it filters out a legitimate commenter. A little more often, a spam comment will get through. But it’s efficient enough that spam doesn’t constitute much of a problem, until the spammers improve their game.
