encodings: Character encodings
This module contains constants and types for dealing with text data in different character encodings. Its main use is converting strings between 16-bit Unicode and 8-bit (narrow) representations. Conversions between 8-bit narrow strings and 16-bit Unicode strings are often needed, since many standard library functions expect strings to be encoded in 16-bit Unicode, whereas many stream classes (including File and Socket) and some functions only support 8-bit strings.
In addition to the Unicode encodings (UTF-8 and UTF-16), this module also defines several locale-specific encodings that only support a subset of Unicode.
Examples:
"\u00c3\u00a4".decode(Utf8)     -- Decode "ä" in UTF-8 to 16-bit Unicode
"\u20ac".encode(Utf8)           -- Encode the Euro sign using UTF-8
"sää".encode(Ascii, Unstrict)   -- "s??" (replace characters that cannot be represented
                                -- in ASCII with question marks)
TextStream("file.txt", Latin1)  -- Open encoded text file for reading
See also: Use the Str encode and Str decode methods for encoding and decoding strings.
See also: The classes io::TextFile and io::TextStream provide a simple way of accessing encoded text streams.
Constants
- Strict as Constant
- Mode option for encoding objects that indicates strict encoding or decoding. Invalid input causes an EncodeError or a DecodeError exception to be raised. This is the default behavior.
- Unstrict as Constant
- Mode option for encoding objects that indicates unstrict encoding and decoding. Invalid characters are replaced with question marks ("?", when encoding) or with replacement characters ("\ufffd", when decoding).
- Bom as Str
- The byte order mark character ("\ufeff"). Some platforms (most notably Windows) often insert this character at the beginning of text files when using a Unicode encoding such as UTF-8.
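The difference between the two modes can be sketched as follows. The sketch uses only the encoder method and the EncodeError exception documented in this module. The Main wrapper, the try/except layout and the variable names (lenient, failed) are illustrative assumptions, and the expected value in the comment follows from the descriptions above.

```alore
import encodings

def Main()
  -- Unstrict: characters that ASCII cannot represent become "?".
  var lenient = Ascii.encoder(Unstrict).encode("s\u00e4\u00e4")   -- "s??"

  -- Strict (the default): the same input raises EncodeError.
  var failed = False
  try
    Ascii.encoder().encode("s\u00e4\u00e4")
  except EncodeError
    failed = True   -- The input contained characters outside ASCII.
  end
end
```

Strict mode suits data that must round-trip without loss, while Unstrict suits best-effort output such as log messages.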
Character encodings
This module defines the following character encoding objects. They all implement the interface Encoding.
- Ascii as Encoding
- The 7-bit ASCII encoding.
- Utf8 as Encoding
- The UTF-8 Unicode encoding.
- Utf16 as Encoding
- Utf16Le as Encoding
- Utf16Be as Encoding
- The UTF-16 Unicode encoding. The different variants stand for native, little endian and big endian byte orders.
- Iso8859_1 as Encoding (Latin1)
- Iso8859_2 as Encoding (Latin2)
- Iso8859_3 as Encoding (Latin3)
- Iso8859_4 as Encoding (Latin4)
- Iso8859_5 as Encoding
- Iso8859_6 as Encoding
- Iso8859_7 as Encoding
- Iso8859_8 as Encoding
- Iso8859_9 as Encoding (Latin5)
- Iso8859_10 as Encoding (Latin6)
- Iso8859_11 as Encoding
- Iso8859_13 as Encoding (Latin7)
- Iso8859_14 as Encoding (Latin8)
- Iso8859_15 as Encoding (Latin9)
- Iso8859_16 as Encoding (Latin10)
- The ISO 8859 8-bit encodings, with alias constants in parentheses.
- Windows1250 as Encoding
- Windows1251 as Encoding
- Windows1252 as Encoding
- These Windows encodings are also known as Windows code pages.
- Cp437 as Encoding
- Cp850 as Encoding
- Legacy encodings that match MS-DOS code pages. Note that encoded characters in the range 0 to 31 and character 127 are decoded as the corresponding ASCII characters instead of the legacy graphical characters.
- Koi8R as Encoding
- The KOI8-R encoding for Russian.
- Koi8U as Encoding
- The KOI8-U encoding for Ukrainian.
Interface Encoding
Character encoding objects support creating encoder and decoder objects using the methods encoder and decoder, respectively. These methods can be called without arguments or with an optional mode argument (Strict or Unstrict). If mode is not specified, it defaults to Strict. Each encoder / decoder instance keeps track of the state of a single encoded / decoded text sequence.
Programs typically do not use Encoder and Decoder objects directly. Instead, they use the Str encode and Str decode methods and text streams.
- encoder([mode as Constant]) as Encoder
- Construct an encoder object for the encoding.
- decoder([mode as Constant]) as Decoder
- Construct a decoder object for the encoding.
- name as Str
- The name of the encoding. Example:
Utf8.name -- "Utf8"
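As a minimal sketch of the Encoding interface, the following creates an encoder and a decoder from the same encoding object and passes a string through both. The Main wrapper and the variable names are illustrative assumptions. The narrow string shown in the comment is the standard three-byte UTF-8 form of the Euro sign.

```alore
import encodings

def Main()
  var enc = Utf8.encoder()             -- No mode argument: defaults to Strict
  var narrow = enc.encode("\u20ac")    -- "\u00e2\u0082\u00ac" (three UTF-8 bytes)

  var dec = Utf8.decoder(Unstrict)     -- Explicit mode argument
  var text = dec.decode(narrow)        -- "\u20ac"

  var n = Utf8.name                    -- "Utf8"
end
```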
Interface Encoder
- encode(str as Str) as Str
- Encode the argument string and return the encoded string. The entire string is always encoded.
Interface Decoder
- decode(str as Str) as Str
- Decode as many characters as possible from the argument string and return them. If any partial characters remain at the end of the string, remember them and prepend them to the next argument to decode. Use unprocessed to have a peek at them.
- unprocessed() as Str
- Return the current buffer of partial characters, or an empty string if there are none.
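The buffering behavior of decode and unprocessed matters when encoded data arrives in chunks that may split a multi-byte character. The sketch below feeds the two UTF-8 bytes of "ä" ("\u00c3\u00a4") to one decoder in separate calls. The Main wrapper and the variable names are illustrative assumptions, and the values in the comments are what the descriptions above imply.

```alore
import encodings

def Main()
  var dec = Utf8.decoder()

  -- The first chunk ends with the lead byte of a two-byte UTF-8 sequence.
  var first = dec.decode("s\u00c3")    -- "s"
  var pending = dec.unprocessed()      -- "\u00c3"

  -- The second chunk supplies the continuation byte and completes the character.
  var second = dec.decode("\u00a4")    -- "\u00e4" ("ä")
  pending = dec.unprocessed()          -- ""
end
```

A single decoder instance should be kept for the lifetime of one input sequence, since the pending bytes are stored in the instance.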
Exceptions
- class EncodeError
- Raised when encoding is not successful due to invalid input. Inherits from std::ValueError.
- class DecodeError
- Raised when decoding is not successful due to invalid input. Inherits from std::ValueError.
Functions
- Decode(string as Str, encoding as Encoding[, mode as Constant]) as Str
- Deprecated (this feature will be removed in a future Alore version).
Decode a string. The mode argument may be Strict (this is the default if the argument is omitted) or Unstrict. Example:
Decode(s, Utf8) -- Decode UTF-8 string to 16-bit Unicode
Note: Use the Str decode method instead.
- Encode(string as Str, encoding as Encoding[, mode as Constant]) as Str
- Deprecated (this feature will be removed in a future Alore version).
Encode a string. Identical to encoding.encoder([mode]).encode(string). The mode argument may be Strict (this is the default if the argument is omitted) or Unstrict. Example:
Encode("\u20ac", Utf8) -- Encode the Euro sign in UTF-8
Note: Use the Str encode method instead.
About supported character encodings
This module supports only a small and somewhat arbitrary set of locale-specific encodings, with a bias towards encodings for European languages. New encodings are likely to be added to this module (or to separate, additional modules) in future Alore releases.