Character encodings
This is a guide to how character encodings are used within WeeWX.
If you're a seasoned hand at character encodings, you can skip this section. If not, you should stop and read this classic 20-year-old article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Fortunately, since the article was written, the rainbow of encodings has collapsed down to pretty much one, UTF-8, making life much simpler.
There are four places where encodings come into play within WeeWX:
- In the Python code;
- The encoding of text created by the Cheetah generator and inserted into templates;
- The encoding in the templates themselves; and
- The encoding of the resultant HTML file.
Back in the old days, if you used an encoding other than simple ASCII in your Python source code, you had to tell the Python interpreter what that encoding was. However, with Python 3, that problem has gone away: by default, Python 3 source code uses UTF-8. Nevertheless, it's useful to understand that if your program has a line such as this
label = "°C"
the actual degree symbol is being represented using a UTF-8 encoding, which, in hex, is C2 B0. So, the two characters ° and C actually take three bytes to encode.
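The byte count is easy to check from a Python 3 prompt. The following is a minimal sketch, assuming the source file is saved as UTF-8 and Python 3.8 or later (needed for the separator argument to hex()); the variable names are only illustrative.

```python
# The label holds two characters: the degree sign and the letter C.
label = "°C"

# Encoding to UTF-8 yields the bytes that would be written to disk.
encoded = label.encode("utf-8")

print(len(label))        # 2 characters
print(len(encoded))      # 3 bytes
print(encoded.hex(" "))  # c2 b0 43
```

The degree sign accounts for the first two bytes; the letter C, being plain ASCII, needs only one.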
The Cheetah generator is responsible for taking tags such as $current.outTemp and turning them into a byte sequence that is inserted into a template. What encoding that byte sequence will use is determined by the option encoding in the [CheetahGenerator] section of skin.conf. So, a directive such as
[CheetahGenerator]
encoding = html_entities
tells WeeWX to use HTML entities to encode any non-ASCII characters. As an example, if the current temperature is 20 °C, then the tag $current.outTemp will result in either
20 &#176;C
or, possibly,
20 &deg;C
By contrast, if you were to specify UTF-8:
[CheetahGenerator]
encoding = utf_8
then the emitted byte sequence would use UTF-8 encoding:
20 °C
which, in hex, is the byte sequence
32 30 20 C2 B0 43
2  0     °     C
Note how it takes two bytes, C2 B0, to represent the degree sign.
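The difference between the two settings can be reproduced with Python's built-in codecs. This is only an illustrative sketch: it assumes that the html_entities option behaves like Python's xmlcharrefreplace error handler, which may not be exactly how WeeWX implements it, and the variable names are made up for the example. It needs Python 3.8 or later for the separator argument to hex().

```python
# The text that a tag such as $current.outTemp might expand to.
text = "20 °C"

# ASCII output, with each non-ASCII character replaced by a numeric HTML entity.
as_entities = text.encode("ascii", "xmlcharrefreplace")

# The same text encoded as UTF-8.
as_utf8 = text.encode("utf-8")

print(as_entities)       # b'20 &#176;C'
print(as_utf8.hex(" "))  # 32 30 20 c2 b0 43
```

Either way the browser ends up showing the same thing; the difference is only in which bytes are written to the file.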
The Cheetah generator uses templates, which are source files that include tags that are to be replaced by the engine. The Cheetah directive #encoding tells the generator what encoding a template uses. This is why most WeeWX templates start with:
#errorCatcher Echo
#encoding UTF-8
Almost always you want UTF-8, although there may be instances where all the special characters in the template have been encoded using HTML entities, so the template itself is actually in pure ASCII. If that's the case, a UTF-8 encoding doesn't hurt, but you could alternatively specify #encoding ascii.
Note that this encoding is different from the encoding that will be used by the tag after substitution. It's entirely possible to have the template in ASCII, but the substitutions use UTF-8, so the result is a mix. Which brings us to the final kind of encoding directive.
The job of the final directive, the <meta> tag, is to give the browser a clue as to what encoding the HTML file uses:
<html lang="en">
<head>
<title>weewx: Documentation</title>
<meta charset="UTF-8">
</head>
Again, these days, this is almost always in UTF-8.
To find out what character encoding your browser is using, open up the "Developer Tools" window, navigate to the console, and type document.characterSet.
A common problem is special characters in a text (not HTML) file. These are files with the suffix .txt, instead of .html. When a browser encounters a file ending in .txt it treats the contents as raw text and does not try to interpret any HTML tags within it, including a <meta> tag. Because it ignores any possible <meta> tag, it does not know what encoding the file uses, so it has to guess.
Most browsers guess cp1252, aka Windows-1252. This means that if the file includes special characters that cannot be encoded in cp1252, then they will end up looking "funny". This is a common problem with NOAA reports (which are text files) whose location names include such characters.
There are three solutions:
- Specify an encoding of normalized_ascii. The Cheetah generator will turn something like crêpe into crepe. The results won't be correct, but they will be recognizable.
- Specify an encoding of cp1252. This way, the output of the Cheetah generator will match what the browser is expecting. Of course, this will only work if all the characters in the location name can be represented in cp1252. If not, you will end up with a garbled location name.
- Wrap the text file in an HTML file that specifies UTF-8 encoding. This is how the Seasons skin does it, provided you always navigate to the NOAA files via the title bar drop-down option list.
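To get a feel for the first two options before changing skin.conf, the effect can be approximated in plain Python. This is a rough sketch, not WeeWX's own implementation: the helper names to_plain_ascii and fits_cp1252 are hypothetical, and the accent stripping shown here (Unicode decomposition followed by dropping non-ASCII characters) is just one common way to produce this kind of output.

```python
import unicodedata


def to_plain_ascii(text):
    """Approximate accent stripping: split letters from their accents, then drop the accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def fits_cp1252(text):
    """Return True if every character in text can be encoded in cp1252."""
    try:
        text.encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True


print(to_plain_ascii("crêpe"))   # crepe
print(fits_cp1252("Besançon"))   # True
print(fits_cp1252("Łódź"))       # False
```

A location name for which fits_cp1252 returns False is one that the cp1252 option cannot represent faithfully, so one of the other two options would be the better choice.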
What will not work is specifying an encoding of UTF-8. Remember, the browser thinks the file is in cp1252 so, for example, the degree symbol, which takes two bytes to encode using UTF-8, is instead interpreted by the browser as two characters encoded in cp1252, in this case, the character 'Â' and the character '°'.
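The mismatch is easy to reproduce in Python by encoding with one codec and decoding with the other. This is a small sketch with illustrative variable names; it imitates what a browser that guesses cp1252 would do with UTF-8 bytes.

```python
degree = "°"

# What the generator writes when the encoding is UTF-8: two bytes, C2 B0.
utf8_bytes = degree.encode("utf-8")

# What a browser shows if it assumes those bytes are cp1252.
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes.hex(" "))  # c2 b0
print(misread)              # Â°
print(len(misread))         # 2
```

The single degree sign comes out as two characters, which is the typical signature of UTF-8 text being read as cp1252.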