4

The latest Notepad.exe offers two save encodings: UTF-8 and UTF-8 with BOM.

Is UTF-8 with BOM the old UTF-8 option? What is the plain UTF-8 option now?

Old Geezer
  • 1
    Different site but same question answered here: https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom – MC10 May 21 '20 at 02:49
  • 1
    [This answer](https://superuser.com/a/1041323/167207) also answers that. No need for the downvote either; good question for this site as well. – Giacomo1968 May 21 '20 at 03:28
  • https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-istextunicode – Mark May 21 '20 at 05:45

2 Answers

8

UTF-8 is UTF-8 regardless of whether a BOM exists.

Saving a file with a BOM (byte order mark) is not really needed for UTF-8.

Notepad offers both “UTF-8” and “UTF-8 with BOM” so that a BOM (byte order mark) can be added in the cases where one is needed. In general, though, saving the file without a BOM (meaning plain UTF-8) is the best way to handle text files with UTF-8 content.

As explained on the Wikipedia page for byte order mark:

“BOM use is optional. Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream.”

The article goes into more detail:

“The UTF-8 representation of the BOM is the (hexadecimal) byte sequence 0xEF,0xBB,0xBF.

The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work. The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."

Not using a BOM allows text to be backwards-compatible with some software that is not Unicode-aware. Examples include programming languages that permit non-ASCII bytes in string literals but not at the start of the file.”
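
To make the byte sequence in the quote concrete, here is a minimal Python sketch. The helper name `strip_utf8_bom` and the sample string are made up for illustration; `codecs.BOM_UTF8` and the `utf-8-sig` codec come from the Python standard library. It shows that the BOM is just three extra bytes at the start, and that the text after it is ordinary UTF-8.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):  # the bytes EF BB BF
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical sample data: the same text with and without a BOM.
without_bom = "héllo".encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

print(with_bom[:3].hex())                       # first three bytes of the BOM variant
print(strip_utf8_bom(with_bom) == without_bom)  # same payload once the BOM is removed

# The "utf-8-sig" codec accepts input with or without a BOM and drops it if present.
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8-sig"))
```

Decoding the BOM variant with the plain `utf-8` codec does not fail, but it leaves a U+FEFF character at the start of the string. That stray character is what trips up software that does not expect it.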

As for why Microsoft offers UTF-8 with a BOM in Notepad, the following passage explains it well. The BOM seems to be a specific requirement of Microsoft programming tools and not of non-Microsoft tools:

“Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download.”

So unless you explicitly need a BOM in a UTF-8 file, just don’t worry about that save option.
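
For a rough picture of what the two save options amount to on disk, the sketch below writes the same text twice with Python's standard `utf-8` and `utf-8-sig` codecs and compares the bytes. The file names and sample text are placeholders, and this only mimics the two options; it is not a description of how Notepad itself is implemented.

```python
import os
import tempfile

text = "naïve café\n"  # hypothetical sample content with non-ASCII characters

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.txt")
    bom_path = os.path.join(tmp, "bom.txt")

    # Plain UTF-8: no signature bytes are written.
    with open(plain_path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    # "utf-8-sig" writes the three BOM bytes before the text.
    with open(bom_path, "w", encoding="utf-8-sig", newline="") as f:
        f.write(text)

    with open(plain_path, "rb") as f:
        plain_bytes = f.read()
    with open(bom_path, "rb") as f:
        bom_bytes = f.read()

print(len(bom_bytes) - len(plain_bytes))  # size difference in bytes
print(bom_bytes[3:] == plain_bytes)       # the rest of the file is identical
```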

Giacomo1968
  • 2
    I wonder why standardizing on file meta data to specify the encoding type is a poorer choice than making everyone adding all the extra logic to infer the actual encoding in use. – Old Geezer May 21 '20 at 03:58
  • @OldGeezer Because metadata can be fudged and “lie.” It is better to create a standard that doesn’t require metadata for file content parsing than to hope that every application in the world, new and old, can understand that newly introduced metadata. – Giacomo1968 May 21 '20 at 04:06
  • 1
    @OldGeezer Metadata doesn't transfer well. Upload your file to a website and all metadata, except for filename, is lost. And BOM isn't perfect either, it's fine unless another encoding happens to interpret it as correct characters and you have to use heuristics anyway. Compatibility with legacy standards is hard. – gronostaj May 21 '20 at 08:45
  • AutoHotkey requires BOM in its configuration file (if you use extended UTF-8 characters). So even though Notepad displays it correct without BOM, it will not work until you save it with "UTF-8 with BOM" encoding. – Axel Bregnsbo Aug 08 '22 at 07:52
-3

The other answer is wrong. It is some political thing. ANSI is the default text format in Windows and has been for 36 years.

In Windows, files are assumed to be ANSI. Therefore you always use a BOM. Unix programs that can't handle BOMs are not Unicode compliant.

I write text editors. If the user doesn't specify an encoding, it is ANSI - ALWAYS.

Assuming you will get BOMless Unicode means you have to call https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-istextunicode to guess the format. Hardly proper programming.
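
For illustration only, the kind of guessing involved can be sketched in a few lines of Python. This is not what `IsTextUnicode` does, and it is not taken from any particular editor. It is a simplified assumption of one common approach: accept the bytes as UTF-8 if they decode strictly, otherwise fall back to a legacy code page. The function name `guess_decode` and the `cp1252` default are hypothetical choices.

```python
import codecs


def guess_decode(data: bytes, ansi_codec: str = "cp1252"):
    """Return (text, label) using a simple UTF-8-first heuristic.

    Assumption: anything that is not valid UTF-8 is treated as the given
    legacy ("ANSI") code page, with undecodable bytes replaced.
    """
    try:
        # "utf-8-sig" decodes strictly and drops a leading BOM if present.
        text = data.decode("utf-8-sig")
    except UnicodeDecodeError:
        return data.decode(ansi_codec, errors="replace"), ansi_codec
    label = "utf-8 with BOM" if data.startswith(codecs.BOM_UTF8) else "utf-8"
    return text, label


print(guess_decode("é".encode("utf-8")))
print(guess_decode(codecs.BOM_UTF8 + "é".encode("utf-8")))
print(guess_decode("é".encode("cp1252")))
```

The heuristic works reasonably well because non-ASCII text in a legacy code page rarely happens to form valid UTF-8 sequences. It can still misjudge short or unusual inputs, and pure ASCII is identical in both encodings.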

Mark
  • 6
    _"In Windows files are assumed to be ANSI [...] If the user doesn't specify it is ANSI - ALWAYS"_ - either you're referring to some subset of Windows software (and this answer should clarify what subset it is) or this is incorrect. All competent text editors can heuristically detect UTF-8 without BOM, regardless of platform. Even Notepad does (tested with Windows 10 v1909 build 18363.836). – gronostaj May 21 '20 at 07:58
  • Read the link on the haphazard nature of it. What do we use text editors for - for writing OEM batch files. Already there is a character problem. The user specifies the encoding. If they don't it must be ANSI. – Mark May 21 '20 at 08:09
  • 5
    That's your opinion, not a fact. I've literally created a UTF-8 file with non-ASCII characters in Sublime Text, confirmed in hex view that there is no BOM and some characters are encoded multibyte, and then opened that file in Notepad. It worked just fine. Whether you like it or not, it's just not true that Windows software assumes ANSI unless indicated otherwise by BOM. – gronostaj May 21 '20 at 08:21
  • Sublime Text has nothing to do with Windows. – Mark May 21 '20 at 08:27
  • 2
    Method of creating the file is irrelevant. – gronostaj May 21 '20 at 08:39
  • @Mark You have lots of decent knowledge as shown in other answers, but when you say “Hardly proper programming.” you show a distinct lack of knowledge of how modern text editors handle UTF-8 and similar encodings: It is all expected to be done heuristically by using methods that you describe as “guess(ing) the format.” That is the expectation of tools that handle UTF-8: Determine the encoding heuristically and not via metadata or other “hints” that can easily “lie” by saying one thing despite the contents saying something else. – Giacomo1968 May 21 '20 at 15:30
  • 3
    Let me repeat: Notepad will correctly open a UTF-8 file without BOM, even on Windows 7 SP1. So it's not assuming ANSI since at least 2011. Your (original) opening sentence is factually incorrect. Notepad on Windows 10 will also by default save as UTF-8 without BOM, so your (new) opening sentence is also incorrect. – gronostaj May 22 '20 at 07:32
  • @gronostaj but it will save new files as ANSI. The other answer is a Unix answer. – Mark May 22 '20 at 08:10
  • 4
    No, it won't. UTF-8 without BOM is the default for saving in Windows 10 v1909. I also don't see how the other answer is a "Unix answer". – gronostaj May 22 '20 at 08:49