date: Thu, 19 Feb 2009 11:15:59 -0000
group: uk.net.web.authoring
Notepad Plus - ANSI or Unicode?
I downloaded Notepad+ and got those 2 versions in the zip. What should I be
using for web languages?
If anybody has a simple explanation for the A/U then do give it, but I have
bigger things on my hands so it's not important.
-dE|_---
date: Thu, 19 Feb 2009 11:15:59 -0000
author: dE|_
|
Re: Notepad Plus - ANSI or Unicode?
On 19 Feb, 11:15, "dE|_" wrote:
> What should I be using for web languages?
Go for Unicode.
> If anybody has a simple explanation for the A/U then do give it,
Unicode is a character set: a great big long list of characters,
sufficient to do every language in the world, except some of the
Klingon dialects and of course Welsh(*). No more having to _choose_ a
character set; just use Unicode and you're sorted. No need to know
which character set was used to create a document (was it ISO-8859-1
or ISO-8859-15?), because in Unicode there's just one. This also means
you can mix languages and their local characters on the same page.
However Unicode is only half the story. It lists the characters and
gives each one a number (its code point), but that alone doesn't say
how they're stored as bytes. That's done by an encoding, one of the
UTFs (Unicode Transformation Formats). UTF comes in three common
flavours (and some rare ones): UTF8, UTF8Y aka UTF8-BOM, and UTF16.
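A minimal Python sketch of that split between characters and bytes,
assuming Python 3.8 or later. The sample string and variable names are
made up for illustration, and "utf-8-sig" is Python's own name for what
is called UTF8Y here:

```python
# Code points are Unicode's business; bytes are the encoding's business.
text = "caf\u00e9"  # "café"; the last character is U+00E9

# The same four characters, whatever the encoding.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Three different byte sequences for those same characters.
utf8 = text.encode("utf-8")
utf8_bom = text.encode("utf-8-sig")  # UTF-8 with a leading BOM
utf16 = text.encode("utf-16-le")     # little-endian UTF-16, no BOM

print(code_points)        # ['U+0063', 'U+0061', 'U+0066', 'U+00E9']
print(utf8.hex(" "))      # 63 61 66 c3 a9
print(utf8_bom.hex(" "))  # ef bb bf 63 61 66 c3 a9
print(utf16.hex(" "))     # 63 00 61 00 66 00 e9 00
```

The character list never changes; only the bytes on disk do.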
Many operating systems and programming languages use UTF16 internally,
but it's not a great storage format: it takes at least two bytes per
character, so mostly-ASCII text such as HTML comes out twice the size.
So it's best avoided. Watch out for Windows programs with a
"Save as Unicode" option - these usually mean UTF16. Look further down
the list and seek out UTF8. UTF16 also breaks some text-handling
utilities, including Subversion.
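To put rough numbers on that, here is a small Python sketch. The sample
line of HTML is invented, and the remark about NUL bytes is a likely
explanation for why some tools choke, not a statement about how any
particular utility works:

```python
line = "<p>Hello, world</p>\n"

as_ascii = line.encode("ascii")
as_utf16 = line.encode("utf-16-le")  # little-endian UTF-16, no BOM

# Every ASCII character costs two bytes in UTF-16.
print(len(as_ascii), len(as_utf16))  # 20 40

# UTF-16 pads ASCII characters with zero bytes. Tools that treat a NUL
# byte as a sign of binary data may refuse to handle the file as text.
print(b"\x00" in as_ascii)  # False
print(b"\x00" in as_utf16)  # True

# Characters outside the Basic Multilingual Plane need four bytes
# (a surrogate pair), so UTF-16 is not strictly fixed-width either.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
print(len(clef.encode("utf-16-le")))  # 4
```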
UTF8 is the one to use. It's complicated, as it's a variable length
encoding, but that's not your problem! "Simple" characters use one
byte, same as ASCII. In fact the ASCII characters keep exactly the
same byte values in UTF8, so all ASCII documents are automatically
valid and readable UTF8 Unicode as well, without needing conversion.
If it's not a simple character, then UTF8 uses more bytes to represent
it (two to four), as needed.
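A short Python sketch of both properties, with sample characters picked
purely for illustration:

```python
# One character from each UTF-8 length class: A, é, €, and a G clef.
samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]

# Bytes needed per character in UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]

# Pure ASCII text produces identical bytes under both encodings,
# which is why an ASCII file is already a valid UTF-8 file.
ascii_text = "plain old ASCII"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True
```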
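UTF8Y puts a three byte magic sequence (the byte order mark, bytes
EF BB BF) onto the start of every document to indicate that it's
UTF8Y. Works fine with software that expects it, but you lose that
ASCII compatibility, because the file no longer starts with plain
ASCII bytes.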
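A hedged Python sketch of what the BOM does to an otherwise ASCII file.
The helper strip_utf8_bom is hypothetical, written only for this
example; codecs.BOM_UTF8 and the "utf-8-sig" codec are standard library:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = "hello".encode("utf-8-sig")  # BOM + the five ASCII bytes

# The content is pure ASCII, but the file is no longer ASCII-identical.
print(with_bom == b"hello")                  # False
print(strip_utf8_bom(with_bom) == b"hello")  # True

# Decoding as plain UTF-8 leaves U+FEFF at the front of the text;
# decoding as utf-8-sig drops it.
print(repr(with_bom.decode("utf-8")))      # '\ufeffhello'
print(repr(with_bom.decode("utf-8-sig")))  # 'hello'
```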
All HTML uses Unicode as its document character set, and has done
since HTML 4. The encoding may vary (and often isn't UTF8), but the
character set is still that of Unicode. So when you use a numeric
entity (strictly, a numeric character reference such as &#233;), those
numbers are _always_ the Unicode numbers for the character, whatever
encoding the page is saved in.
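As an illustration in Python, take a page saved as ISO-8859-1, which
has no euro sign of its own. The markup below is an invented sample,
and html.unescape from the standard library stands in for a browser's
handling of the references:

```python
import html

# A page stored as ISO-8859-1; the references themselves are plain ASCII.
page_bytes = "<p>caf&#233; &#x20AC;5</p>".encode("iso-8859-1")

# Decode the bytes with the page's encoding, then resolve the references.
decoded = page_bytes.decode("iso-8859-1")
text = html.unescape(decoded)

# The numbers are Unicode code points, not positions in ISO-8859-1:
# 233 is U+00E9 (é) and 0x20AC is U+20AC (€).
print(text)  # <p>café €5</p>
print(chr(233) == "\u00e9", chr(0x20AC) == "\u20ac")  # True True
```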
date: Thu, 19 Feb 2009 16:15:36 -0800 (PST)
author: Andy Dingley
|
Re: Notepad Plus - ANSI or Unicode?
Andy Dingley wrote:
> On 19 Feb, 11:15, "dE|_" wrote:
>
> Go for Unicode.
Not often do you see a contribution as clear, complete, and well-written.
Phil
date: Fri, 20 Feb 2009 09:15:12 +0000
author: Philip Herlihy
|
Re: Notepad Plus - ANSI or Unicode?
"Andy Dingley" wrote the likes of:
>
> Go for Unicode.
Thank you, Andy, that was generous.
-dE|_---
date: Fri, 20 Feb 2009 11:28:33 -0000
author: dE|_
|
Re: Notepad Plus - ANSI or Unicode?
On 20 Feb, 09:15, Philip Herlihy wrote:
> Not often do you see a contribution as clear, complete, and well-written.
Thank you - I can type it from memory, as that's something like the
fourth time I've written it this week.
Does anyone listen to it yet? Do they hell 8-( A sizable fraction
of my day job is mucking out other developers' mis-encoded files that
are breaking our continuous integration server.
I also have the handy-dandy guide to interpreting encoding errors by
just _what_ sort of garbage you're seeing. It's often possible to tell
what was done to it, sometimes to reverse it cleanly.
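The guide itself isn't reproduced in this thread, so the following
Python sketch only shows one very common case, as an assumption about
the sort of thing meant: UTF8 bytes that were read as Windows-1252,
which turns each accented letter into two characters starting with
"Ã". The sample word is made up.

```python
original = "caf\u00e9"  # "café"

# What went wrong: UTF-8 bytes were decoded as Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)  # cafÃ©

# If nothing else has touched the text, the damage reverses cleanly:
# re-encode with the wrong codec, then decode with the right one.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

The round trip only works while the wrongly decoded text is intact. If
characters were dropped or replaced along the way, the re-encode or the
final decode can fail, and the text is not recoverable this way.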
Oh, and I still don't know how to do Welsh properly with it.
date: Fri, 20 Feb 2009 06:08:38 -0800 (PST)
author: Andy Dingley
|