- What is Strict Aliasing and Why do we Care? · GitHub
- More type punning worries.
(tags: aliasing c programming punning)
- How I cut GTA Online loading times by 70%
- Good writeup of another possible entry in the Accidentally Quadratic tumblr (though that itself hasn’t been updated for a while).
(tags: games performance programming quadratic)
- Harry and Meghan: The union of two great houses, the Windsors and the Celebrities, is complete
- Brutal and hilarious article.
(tags: ireland britain monarchy)
- A response to: on the safety of women – Dr Reka Solymosi
- Seeing less serious incidents of harassment with a “near miss” mindset as used in safety critical industries.
(tags: sexism safety)
- Using Vim for C++ development
- dreamwidth as vindication of a few cherished theories
- DW’s co-creator on how to make a successful open source project. Via brainwane.
(tags: dreamwidth open-source)
- Sync Any Folder to OneDrive in Windows 10 | Tutorials
- Make a symlink from the OneDrive folder to the thing you want to sync.
(tags: onedrive backup)
- Principles for the Application of Human Intelligence – Behavioral Scientist
- “Before humans become the standard way in which we make decisions, we need to consider the risks and ensure implementation of human decision-making systems does not cause widespread harm.”
(tags: artificial-intelligence ai psychology parody)
- A Decade of Vim
- Some interesting looking Vim screencasts.
(tags: vim editor programming)
- TinyPilot: Build a KVM Over IP for Under $100 · mtlynch.io
- A remote keyboard and monitor with a Raspberry Pi.
(tags: kvm pi server programming)
- danyspin97’s site – Colorize your CLI
- More colours are good.
(tags: shell tutorial colour)
- The Korean Playbook for COVID-19 (Translated) | by Indi Samarajiva | indica | Medium
- Age of Attention – SDr
- “A leverage point in avoiding toxoplasma, is the bridge people: people who are being rewarded for taking offense, and therefore select for the worst possible behavior of the outgroup. These people act as stressors, specifically triggering ideations of worst-case-scenarios. The fix here is removing these people from your feeds/circles of influence.”
(tags: toxoplasma internet rage social-networks)
- Imperial College simulation code for COVID-19 | Clive Best
- In which someone runs the code, and it seems to work reasonably well.
(tags: covid19 simulation model mathematics programming)
- The Imperial College code | …and Then There’s Physics
- Someone else ran it too.
(tags: covid19 simulation model)
- Jared Yates Sexton on Twitter: “PLEASE. Tell people about this. I’m going to provide some history of Neo-Confederate, white-identity, apocalyptic evangelicalism, what I call the Cult of the Shining City. This is who Donald Trump was messaging yesterday wi
- some history of Neo-Confederate, white-identity, apocalyptic evangelicalism, what I call the Cult of the Shining City.
(tags: christianity politics usa evangelicalism)
Occasionally I write about debugging, for the edification of others and to try to explain to muggles what I do all day. I ran into a fun one the other day.
Joel Spolsky’s explanation of Unicode is excellent, but long. In brief: on a computer, we represent letters (“a”, “b” and so on) as numbers. Computers work with zeroes and ones, binary digits (or bits), usually in groups of 8 bits called bytes. Back in the mists of time, someone came up with ASCII, a way to represent decent American letters by giving each letter a number. All those numbers fitted a single byte (a byte can represent 256 different numbers), so one byte was one letter, and all was well… unless you weren’t American and wanted to represent funny foreign letters like “£”, or some non-Latin alphabet, or a frowning pile of poo.
The modern way of handling those foreign letters and poos is Unicode. Each different letter still has a number assigned to it, but there are a lot of them, so the numbers can be bigger than you can fit in a byte. Computers still like to work in bytes, so you need to represent a letter using a sequence of one or more bytes. A way of doing this is called an encoding. One popular encoding, UTF-8, has the handy feature that all those decent American letters have the same single byte representation as they did in ASCII, but other letters get longer sequences of bytes.
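To make that concrete, here is a small sketch of how many bytes UTF-8 spends on each of the three kinds of letter mentioned above. The describe() helper is made up for this illustration; ord(), str.encode() and unicodedata.name() are standard Python.

```python
import unicodedata


def describe(letter):
    """Return the letter's Unicode number and its UTF-8 byte sequence."""
    return ord(letter), letter.encode("utf-8")


for letter in ["a", "£", "\N{PILE OF POO}"]:
    number, encoded = describe(letter)
    # Print the official Unicode name rather than the letter itself, so this
    # works even on a terminal that cannot display the character.
    print(f"{unicodedata.name(letter)}: number {number}, "
          f"{len(encoded)} byte(s): {encoded.hex(' ')}")
```

The “a” comes out as the same single byte it had in ASCII, the pound sign needs two bytes, and the poo needs four.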
The series of tubes we call the Internet is a way of carrying bytes around. As a programmer, you often end up writing code to connect to other computers and read data. Suppose we just want to sit there forever doing something with a continuous stream of bytes the other computer is sending us[1]:

    connection = connect_to_the_thing()
    # loop forever
    while True:
        # receive up to 1024 bytes from the other computer
        bytes = connection.recv(1024)
        do_something_with(bytes)
The data that comes back from the other computer is a series of bytes. What if you know it’s UTF-8 encoded text, and you want to turn those bytes into that text?
    connection = connect_to_the_thing()
    # loop forever
    while True:
        # receive up to 1024 bytes from the other computer
        bytes = connection.recv(1024)
        # turn it into text
        text = bytes.decode("utf-8")
        do_something_with(text)
This seems to work fine, but very occasionally crashes at the decode() call with a mysterious error message: “UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xe2 in position 1023: unexpected end of data”. Whaaat?
Some frantic Googling of “UnicodeDecodeError” turns up a bunch of people getting that error because they weren’t actually reading UTF-8 encoded text at all, but something else[2]. So, you check what the other side is sending, and in this case, you’re pretty sure it is sending UTF-8. Whaaat?
Squint at the error message a bit more, and you find it’s complaining about the last byte it’s read. You have to give recv() a maximum number of bytes to read, so you picked 1024 (a handy power of 2, as is traditional). “Position 1023” is the 1024th byte received (since we start counting from 0, as is traditional). That “0xe2” thing is hexadecimal E2, equivalent to 11100010 in binary. Read the UTF-8 stuff a bit more, and you find that 11100010 means “this letter is made up of this byte and the two more bytes following this one”. It stopped in the middle of the sequence of bytes which represent a single letter, hence the “unexpected end of data” in the error message.
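If you want to check that reading of the error for yourself, here is a rough sketch. The sequence_length() helper is invented for this post and simplified (it doesn’t reject the handful of first bytes that UTF-8 never uses), and the 1023 bytes of “x” padding and the euro sign are just stand-ins for whatever the other computer really sent.

```python
def sequence_length(first_byte):
    """How many bytes long is the UTF-8 sequence that starts with first_byte?"""
    if first_byte & 0b10000000 == 0b00000000:
        return 1  # 0xxxxxxx: a plain ASCII letter
    if first_byte & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx: this byte plus one more
    if first_byte & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx: this byte plus two more
    if first_byte & 0b11111000 == 0b11110000:
        return 4  # 11110xxx: this byte plus three more
    # 10xxxxxx is a continuation byte, which cannot start a letter
    raise ValueError(f"{first_byte:#04x} cannot start a UTF-8 sequence")


print(sequence_length(0xE2))

# Build a 1024 byte chunk that stops after the first byte of a euro sign,
# which UTF-8 encodes as the three bytes e2 82 ac.
euro = "€".encode("utf-8")
chunk = b"x" * 1023 + euro[:1]

caught = None
try:
    chunk.decode("utf-8")
except UnicodeDecodeError as error:
    caught = error  # keep it, since "error" vanishes after the except block
    print(error)
```

The exception carries the same details as the message in separate attributes: caught.start is the position of the offending byte and caught.reason is the “unexpected end of data” part.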
At this point, if you have control over the other computer, you might be thinking up cunning schemes to ensure that what it passes to each send() is always less than 1024 bytes at a time, without breaking up a multi-byte letter. After all, the data goes out in packets, so what you get when you invoke recv() must line up with the other side’s send()s, right? Wrong.
The series of tubes is narrower in some places than others, and your data may be broken up to fit. A single carrier pigeon can only carry so much weight, you see, and the RSPB is pretty strict about that sort of thing. All that’s guaranteed is that you get the bytes out in the order they went in, not how many you get out at a time.
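You can watch the send/recv mismatch happen without a network at all. This sketch uses socket.socketpair() as a stand-in for a real connection, and asks for a silly maximum of 4 bytes per recv() to force the issue; the receive_all() helper and the sample text are invented for the illustration. On a real network the chunk sizes are also at the mercy of everything between the two computers.

```python
import socket


def receive_all(sock, max_bytes):
    """Call recv() until the other side closes, returning each chunk received."""
    chunks = []
    while True:
        chunk = sock.recv(max_bytes)
        if not chunk:  # empty bytes means the other side closed the connection
            break
        chunks.append(chunk)
    return chunks


sender, receiver = socket.socketpair()
message = "naïve café £5".encode("utf-8")

sender.sendall(message)  # one send on this side...
sender.close()

chunks = receive_all(receiver, 4)  # ...several recvs on the other side
receiver.close()

print(len(chunks), "chunks:", chunks)
print("same bytes, same order:", b"".join(chunks) == message)
```

Glue the chunks back together and you always get the original bytes in the original order, but where the joins fall is not up to the sender.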
Fortunately, Guido thought of this and blessed us with IncrementalDecoder, which knows how to remember that it was part way through a letter when it left off, so that the next time around the loop, it’ll hopefully get the rest of the bytes and give you the letter you were hoping for:
    import codecs

    connection = connect_to_the_thing()
    decoder_class = codecs.getincrementaldecoder("utf-8")
    # Make a new instance of the decoder_class
    decoder = decoder_class()
    # loop forever
    while True:
        # receive up to 1024 bytes from the other computer
        bytes = connection.recv(1024)
        text = decoder.decode(bytes)
        do_something_with(text)
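Since connect_to_the_thing() is imaginary, here is a sketch you can actually run, which feeds the decoder some UTF-8 bytes chopped up at deliberately awkward places. The decode_in_chunks() helper and the sample text are made up for the illustration; codecs.getincrementaldecoder() and the final argument to decode() are the real standard library.

```python
import codecs


def decode_in_chunks(data, chunk_size):
    """Decode UTF-8 data fed in chunk_size pieces, as if from successive recv()s."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # A chunk that ends part way through a letter is not an error: the
        # decoder keeps the stray bytes and finishes the letter next time.
        pieces.append(decoder.decode(chunk))
    # final=True says no more bytes are coming, so a letter that is still
    # unfinished at this point really is an error.
    pieces.append(decoder.decode(b"", final=True))
    return pieces


original = "£5 for a cup of tea ☕"
encoded = original.encode("utf-8")

# One byte at a time is the worst case: every multi-byte letter gets split.
pieces = decode_in_chunks(encoded, 1)
print("".join(pieces) == original)
```

Notice that the first piece comes back as an empty string, because the first byte on its own is only half of the pound sign. Whatever you do with the text has to cope with occasionally being handed nothing.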
[1] We’ll not worry about the other side closing the connection or the wifi packing up, for now.
[2] I do wonder whether questions on Stack Overflow about errors from Python’s Unicode handling have more views in the aggregate than the “How do I exit Vim?” question (which is at 2.1 million views as I write this).