Hole in Notepad

Subject: Hole in Notepad Mon Mar 10, 2008 2:43 am

Quote:

Over at WinCustomize, someone thought they'd found an Easter egg in the Windows Notepad application. If you:

1. Open Notepad
2. Type the text "this app can break" (without quotes)
3. Save the file
4. Re-open the file in Notepad

Notepad displays seemingly-random Chinese characters, or boxes if your default Notepad font doesn't support those characters.

It's not an Easter egg (even though it seems like a funny one), and as
it turns out, Notepad writes the file correctly. It's only when Notepad
reads the file back in that it seems to lose its mind.
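
The misreading is easy to reproduce outside Notepad. Here is a minimal Python sketch (Python is used purely for illustration and is not part of the quoted article). It takes the 18 ASCII bytes that get written to disk and decodes them as little-endian UTF-16, on the assumption that this is how Notepad ends up interpreting them:

```python
# The 18 bytes written to disk for the test phrase (plain ASCII).
raw = "this app can break".encode("ascii")

# Decode the same bytes as little-endian UTF-16: each pair of ASCII bytes
# is fused into a single 16-bit code unit.
misread = raw.decode("utf-16-le")

print(len(raw), len(misread))             # 18 bytes become 9 characters
print([hex(ord(ch)) for ch in misread])   # first entry is 0x6874 ('t' = 0x74, 'h' = 0x68)

# Every resulting code point lands in the CJK Unified Ideographs block.
print(all(0x4E00 <= ord(ch) <= 0x9FFF for ch in misread))
```

Because lower-case ASCII letters occupy 0x61 to 0x7A, any pair whose second byte is a lower-case letter lands in the 0x61xx to 0x7Axx range, which sits inside the CJK block. That is why the result looks like Chinese text rather than obvious garbage.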

But we can't even blame Notepad: it's a limitation of Windows itself,
specifically the Windows function that Notepad uses to figure out if a
text file is Unicode or not.

You see, text files containing Unicode (more correctly, UTF-16-encoded
Unicode) are supposed to start with a "Byte-Order Mark" (BOM), a
two-byte flag that tells a reader the byte order (little-endian or
big-endian) of the UTF-16 data that follows. Because these two bytes
are exceedingly unlikely to occur at the beginning of an ASCII text
file, the BOM is commonly used to tell whether a text file is encoded
in UTF-16.
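
As a rough sketch of what a BOM check looks like, the Python snippet below inspects the first two bytes of a buffer. The function name `sniff_utf16_bom` is made up for this example. It only illustrates the idea and is not a claim about how Notepad or Windows implements the check:

```python
import codecs

def sniff_utf16_bom(data: bytes):
    """Return the UTF-16 variant announced by a leading BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    return None

# A file written with a BOM is identified immediately...
print(sniff_utf16_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))
# ...while plain ASCII text (or BOM-less UTF-16) gives no answer at all.
print(sniff_utf16_bom(b"this app can break"))
```

The second case is the interesting one: with no BOM present, the check cannot distinguish ASCII from UTF-16 that simply omitted the marker.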

But plenty of applications don't bother writing this marker at the
beginning of a UTF-16-encoded file. So what's an app like Notepad to
do?

Windows helpfully provides a function called IsTextUnicode()--you pass
it some data, and it tells you whether it's UTF-16-encoded or not.

Sorta.

It actually runs a couple of heuristics over the first 256 bytes of the
data and provides its best guess. As it turns out, these tests aren't
terribly reliable for very short ASCII strings that contain an even
number of lower-case letters, like "this app can break", or more
appropriately, "this api can break".

The documentation for IsTextUnicode says:

These tests are not foolproof. The statistical tests assume certain
amounts of variation between low and high bytes in a string, and some
ASCII strings can slip through. For example, if lpBuffer points to the
ASCII string 0x41, 0x0A, 0x0D, 0x1D (A\n\r^Z), the string passes the
IS_TEXT_UNICODE_STATISTICS test, though failure would be preferable.
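
For anyone who wants to poke at the function directly, here is a hedged Python sketch that calls IsTextUnicode through `ctypes`. The wrapper name `is_text_unicode` is invented for this example, and the flag value 0x0002 for IS_TEXT_UNICODE_STATISTICS is taken from the Windows SDK headers. The wrapper returns None on non-Windows systems, and the results on any given Windows version are not guaranteed to match what the article describes:

```python
import ctypes
import sys

IS_TEXT_UNICODE_STATISTICS = 0x0002  # flag value from the Windows SDK headers

def is_text_unicode(data: bytes, tests: int = IS_TEXT_UNICODE_STATISTICS):
    """Return (looks_unicode, passed_flags), or None when not running on Windows."""
    if sys.platform != "win32":
        return None
    func = ctypes.WinDLL("advapi32").IsTextUnicode
    # BOOL IsTextUnicode(const VOID *lpv, int iSize, LPINT lpiResult)
    func.argtypes = [ctypes.c_char_p, ctypes.c_int, ctypes.POINTER(ctypes.c_int)]
    func.restype = ctypes.c_int
    # On input the int selects which tests to run; on output it holds
    # the flags of the tests that passed.
    result = ctypes.c_int(tests)
    ok = func(data, len(data), ctypes.byref(result))
    return bool(ok), result.value

# Try the phrase from the article and the byte string from the documentation.
for sample in (b"this app can break", bytes([0x41, 0x0A, 0x0D, 0x1D])):
    print(sample, is_text_unicode(sample))
```

Passing only the statistics flag keeps the experiment focused on the test the documentation singles out. A real caller would more likely check for a BOM first and treat this function's answer as a guess.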