108.6. Character Set Conversion (xm_charconv)
This module provides tools for converting strings between different character sets (codepages). All the encodings available to iconv are supported. See iconv -l for a list of encoding names.
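When the source encoding is known in advance, it can be named directly instead of relying on auto-detection. The fragment below is a minimal sketch, not a definitive configuration: it assumes convert_fields() takes the source encoding first and the target encoding second, as in the auto-detection example in this section, and the file path is a placeholder.

```
<Extension charconv>
    Module  xm_charconv
</Extension>

<Input filein>
    Module  im_file
    # Placeholder path, for illustration only
    File    "tmp/input"
    # Assumption: every line in the file is ISO-8859-2 encoded
    Exec    convert_fields("iso8859-2", "utf-8");
</Input>
```

The encoding names are passed through to iconv, so any name reported by iconv -l should be usable in place of iso8859-2.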
108.6.1. Configuration
The xm_charconv module accepts the following directives in addition to the common module directives.
- AutodetectCharsets
  This optional directive accepts a comma-separated list of character set names. When auto is specified as the source encoding for convert() or convert_fields(), these character sets will be tried for conversion. This directive is not related to the LineReader directive or the corresponding InputType registered by the module.
- BigEndian
  This optional boolean directive specifies the endianness to use during the encoding conversion. If this directive is not specified, it defaults to the host’s endianness. This directive only affects the registered InputType and is only applicable if the LineReader directive is set to a non-Unicode encoding and CharBytes is set to 2 or 4.
- CharBytes
  This optional integer directive specifies the byte-width of the encoding to use during conversion. Acceptable values are 1 (the default), 2, and 4. Most variable width encodings will work with the default value. This directive only affects the registered InputType and is only applicable if the LineReader directive is set to a non-Unicode encoding.
- LineReader
  If this optional directive is specified with an encoding, an InputType will be registered using the name of the xm_charconv module instance. The following Unicode encodings are supported: UTF-8, UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, and UTF-7. For other encodings, it may be necessary to also set BigEndian and/or CharBytes.
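As an illustration of how the LineReader directive and the registered InputType fit together, the following fragment is a minimal sketch under stated assumptions: the instance is named charconv, the input is UTF-16LE (one of the Unicode encodings listed above), and the file path is hypothetical.

```
<Extension charconv>
    Module      xm_charconv
    # UTF-16LE is in the supported Unicode list, so BigEndian and
    # CharBytes are left unset
    LineReader  UTF-16LE
</Extension>

<Input filein>
    Module      im_file
    # Hypothetical path, for illustration only
    File        "tmp/input-utf16"
    # The InputType name is the name of the xm_charconv instance
    InputType   charconv
</Input>
```

If the extension instance were given a different name, the InputType value would need to change to match it.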
108.6.4. Examples
This configuration shows an example of character set auto-detection. The input file can contain differently encoded lines, and the module normalizes output to UTF-8.
<Extension charconv>
    Module              xm_charconv
    AutodetectCharsets  utf-8, euc-jp, utf-16, utf-32, iso8859-2
</Extension>

<Input filein>
    Module  im_file
    File    "tmp/input"
    Exec    convert_fields("auto", "utf-8");
</Input>

<Output fileout>
    Module  om_file
    File    "tmp/output"
</Output>

<Route r>
    Path    filein => fileout
</Route>
This configuration uses the InputType registered via the LineReader directive to read a file with the ISO-8859-2 encoding.