<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html401/loose.dtd">
<html>
<!-- Created on October, 16 2022 by texi2html 1.78a -->
<!--
Written by: Lionel Cons <Lionel.Cons@cern.ch> (original author)
Karl Berry <karl@freefriends.org>
Olaf Bachmann <obachman@mathematik.uni-kl.de>
and many others.
Maintained by: Many creative people.
Send bugs and suggestions to <texi2html-bug@nongnu.org>

-->
<head>
<title>GNU libunistring: 1. Introduction</title>

<meta name="description" content="GNU libunistring: 1. Introduction">
<meta name="keywords" content="GNU libunistring: 1. Introduction">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="texi2html 1.78a">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css">
<!--
a.summary-letter {text-decoration: none}
pre.display {font-family: serif}
pre.format {font-family: serif}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
pre.smalldisplay {font-family: serif; font-size: smaller}
pre.smallexample {font-size: smaller}
pre.smallformat {font-family: serif; font-size: smaller}
pre.smalllisp {font-size: smaller}
span.roman {font-family:serif; font-weight:normal;}
span.sansserif {font-family:sans-serif; font-weight:normal;}
ul.toc {list-style: none}
-->
</style>


</head>
<body lang="en" bgcolor="#FFFFFF" text="#000000" link="#0000FF" vlink="#800080" alink="#FF0000">

<table cellpadding="1" cellspacing="1" border="0">
<tr><td valign="middle" align="left">[ &lt;&lt; ]</td>
<td valign="middle" align="left">[<a href="libunistring_2.html#SEC8" title="Next chapter"> &gt;&gt; </a>]</td>
<td valign="middle" align="left"> </td>
<td valign="middle" align="left"> </td>
<td valign="middle" align="left"> </td>
<td valign="middle" align="left"> </td>
<td valign="middle" align="left"> </td>
<td valign="middle" align="left">[<a href="libunistring_toc.html#SEC_Top" title="Cover (top) of document">Top</a>]</td>
<td valign="middle" align="left">[<a href="libunistring_toc.html#SEC_Contents" title="Table of contents">Contents</a>]</td>
<td valign="middle" align="left">[<a href="libunistring_21.html#SEC92" title="Index">Index</a>]</td>
<td valign="middle" align="left">[<a href="libunistring_abt.html#SEC_About" title="About (help)"> ? </a>]</td>
</tr></table>

<hr size="2">
<a name="Introduction"></a>
<a name="SEC1"></a>
<h1 class="chapter"> <a href="libunistring_toc.html#TOC1">1. Introduction</a> </h1>

<p>This library provides functions for manipulating Unicode strings and
for manipulating C strings according to the Unicode standard.
</p>
<p>It consists of the following parts:
</p>
<dl compact="compact">
<dt> <code>&lt;unistr.h&gt;</code></dt>
<dd><p>elementary string functions
</p></dd>
<dt> <code>&lt;uniconv.h&gt;</code></dt>
<dd><p>conversion from/to legacy encodings
</p></dd>
<dt> <code>&lt;unistdio.h&gt;</code></dt>
<dd><p>formatted output to strings
</p></dd>
<dt> <code>&lt;uniname.h&gt;</code></dt>
<dd><p>character names
</p></dd>
<dt> <code>&lt;unictype.h&gt;</code></dt>
<dd><p>character classification and properties
</p></dd>
<dt> <code>&lt;uniwidth.h&gt;</code></dt>
<dd><p>string width when using nonproportional fonts
</p></dd>
<dt> <code>&lt;unigbrk.h&gt;</code></dt>
<dd><p>grapheme cluster breaks
</p></dd>
<dt> <code>&lt;uniwbrk.h&gt;</code></dt>
<dd><p>word breaks
</p></dd>
<dt> <code>&lt;unilbrk.h&gt;</code></dt>
<dd><p>line breaking algorithm
</p></dd>
<dt> <code>&lt;uninorm.h&gt;</code></dt>
<dd><p>normalization (composition and decomposition)
</p></dd>
<dt> <code>&lt;unicase.h&gt;</code></dt>
<dd><p>case folding
</p></dd>
<dt> <code>&lt;uniregex.h&gt;</code></dt>
<dd><p>regular expressions (not yet implemented)
</p></dd>
</dl>

<a name="IDX1"></a>
<a name="IDX2"></a>
<p>libunistring is for you if your application involves non-trivial text
processing, such as upper/lower case conversions, line breaking, operations
on words, or more advanced analysis of text. Text provided by the user can,
in general, contain characters of all kinds of scripts. The text processing
functions provided by this library handle all scripts and all languages.
</p>
<p>libunistring is for you if your application already uses the ISO C / POSIX
<a href="http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/ctype.h.html"><code>&lt;ctype.h&gt;</code></a>, <a href="http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/wctype.h.html"><code>&lt;wctype.h&gt;</code></a> functions and the text it
operates on is provided by the user and can be in any language.
</p>
<p>libunistring is also for you if your application uses Unicode strings as
internal in-memory representation.
</p>

<hr size="6">
<a name="Unicode"></a>
<a name="SEC2"></a>
<h2 class="section"> <a href="libunistring_toc.html#TOC2">1.1 Unicode</a> </h2>

<p>Unicode is a standardized repertoire of characters that contains characters
from all scripts of the world, from Latin letters to Chinese ideographs
and Babylonian cuneiform glyphs. It also specifies how these characters
are to be rendered on a screen or on paper, and how common text processing
(word selection, line breaking, uppercasing of page titles etc.) is supposed
to behave on Unicode text.
</p>
<p>Unicode also specifies three ways of storing sequences of Unicode
characters in a computer whose basic unit of data is an 8-bit byte:
<a name="IDX3"></a>
<a name="IDX4"></a>
<a name="IDX5"></a>
<a name="IDX6"></a>
</p><dl compact="compact">
<dt> UTF-8</dt>
<dd><p>Every character is represented as 1 to 4 bytes.
</p></dd>
<dt> UTF-16</dt>
<dd><p>Every character is represented as 1 to 2 units of 16 bits.
</p></dd>
<dt> UTF-32, a.k.a. UCS-4</dt>
<dd><p>Every character is represented as 1 unit of 32 bits.
</p></dd>
</dl>
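
<p>The unit counts listed above can be made concrete with a small sketch.
It is an illustration only: the names <code>sketch_utf8_encode</code> and
<code>sketch_utf16_encode</code> are hypothetical and not part of
libunistring, whose own <code>u8_uctomb</code> and <code>u16_uctomb</code>
functions in <code>&lt;unistr.h&gt;</code> serve this purpose in real code.
Both functions return the number of units written, or 0 for a value that
is not a valid Unicode character.
</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Encode the code point uc as UTF-8 into buf.
   Returns the number of bytes written (1 to 4), or 0 if uc is a
   surrogate or lies beyond U+10FFFF.  */
size_t sketch_utf8_encode(uint32_t uc, uint8_t buf[4])
{
    if (uc < 0x80) {
        buf[0] = (uint8_t) uc;
        return 1;
    }
    if (uc < 0x800) {
        buf[0] = (uint8_t) (0xC0 | (uc >> 6));
        buf[1] = (uint8_t) (0x80 | (uc & 0x3F));
        return 2;
    }
    if (uc >= 0xD800 && uc <= 0xDFFF)
        return 0;               /* surrogates are not characters */
    if (uc < 0x10000) {
        buf[0] = (uint8_t) (0xE0 | (uc >> 12));
        buf[1] = (uint8_t) (0x80 | ((uc >> 6) & 0x3F));
        buf[2] = (uint8_t) (0x80 | (uc & 0x3F));
        return 3;
    }
    if (uc <= 0x10FFFF) {
        buf[0] = (uint8_t) (0xF0 | (uc >> 18));
        buf[1] = (uint8_t) (0x80 | ((uc >> 12) & 0x3F));
        buf[2] = (uint8_t) (0x80 | ((uc >> 6) & 0x3F));
        buf[3] = (uint8_t) (0x80 | (uc & 0x3F));
        return 4;
    }
    return 0;
}

/* Encode the code point uc as UTF-16 into buf.
   Returns the number of 16-bit units written (1 or 2), or 0 if uc is
   a surrogate or lies beyond U+10FFFF.  */
size_t sketch_utf16_encode(uint32_t uc, uint16_t buf[2])
{
    if (uc >= 0xD800 && uc <= 0xDFFF)
        return 0;
    if (uc < 0x10000) {
        buf[0] = (uint16_t) uc;
        return 1;
    }
    if (uc <= 0x10FFFF) {
        uc -= 0x10000;          /* 20 bits remain, split 10 + 10 */
        buf[0] = (uint16_t) (0xD800 | (uc >> 10));
        buf[1] = (uint16_t) (0xDC00 | (uc & 0x3FF));
        return 2;
    }
    return 0;
}
```

<p>For example, U+20AC (EURO SIGN) takes three bytes in UTF-8 and one unit
in UTF-16, whereas U+1F600 takes four bytes in UTF-8 and two units (a
surrogate pair) in UTF-16. In UTF-32 each of them is a single unit.
</p>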

<p>For encoding Unicode text in a file, UTF-8 is usually used. For encoding
Unicode strings in memory for a program, any of the three encoding forms
can reasonably be used.
</p>
<p>Unicode is widely used on the web. Prior to the use of Unicode, web pages
were in many different encodings (ISO-8859-1 for English, French, Spanish,
ISO-8859-2 for Polish, ISO-8859-7 for Greek, KOI8-R for Russian, GB2312 or
BIG5 for Chinese, ISO-2022-JP-2 or EUC-JP or Shift_JIS for Japanese, and many
others). It was next to impossible to create a single document that
contained both Chinese and Polish text. Due to the many encodings for
Japanese, even the processing of pure Japanese text was error-prone.
</p>
<p>References:
</p><ul>
<li>
The Unicode standard: <a href="https://www.unicode.org/">https://www.unicode.org/</a>
</li><li>
Definition of UTF-8: <a href="https://www.rfc-editor.org/rfc/rfc3629.txt">https://www.rfc-editor.org/rfc/rfc3629.txt</a>
</li><li>
Definition of UTF-16: <a href="https://www.rfc-editor.org/rfc/rfc2781.txt">https://www.rfc-editor.org/rfc/rfc2781.txt</a>
</li><li>
Markus Kuhn's UTF-8 and Unicode FAQ:
<a href="https://www.cl.cam.ac.uk/~mgk25/unicode.html">https://www.cl.cam.ac.uk/~mgk25/unicode.html</a>
</li></ul>

<hr size="6">
<a name="Unicode-and-i18n"></a>
<a name="SEC3"></a>
<h2 class="section"> <a href="libunistring_toc.html#TOC3">1.2 Unicode and Internationalization</a> </h2>

<p>Internationalization is the process of changing the source code of a program
so that it can meet the expectations of users in any culture, if
culture-specific data (translations, images, etc.) are provided.
</p>
<p>Use of Unicode is not strictly required for internationalization, but it
makes internationalization much easier, because operations that need to
look at specific characters (like hyphenation, spell checking, or the
automatic conversion of double-quotes to opening and closing double-quote
characters) don't need to consider multiple possible encodings of the text.
</p>
<p>Use of Unicode also enables multilingualization: the ability to have text
in multiple languages present in the same document or even in the same line
of text.
</p>
<p>But use of Unicode is not everything. Internationalization usually consists
of four features:
</p><ul>
<li>
Use of Unicode where needed for text processing. This is what this library
is for.
</li><li>
Use of message catalogs for messages shown to the user. This is what
GNU gettext is about.
</li><li>
Use of locale-specific conventions for date and time formats, for numeric
formatting, or for sorting of text. This can be done adequately with the
POSIX APIs and the implementation of locales in the GNU C library.
</li><li>
In graphical user interfaces, adapting the GUI to the default text direction
of the current locale (see
<a href="https://en.wikipedia.org/wiki/Right-to-left">right-to-left languages</a>).
</li></ul>

<hr size="6">
<a name="Locale-encodings"></a>
<a name="SEC4"></a>
<h2 class="section"> <a href="libunistring_toc.html#TOC4">1.3 Locale encodings</a> </h2>

<p>A locale is a set of cultural conventions. According to POSIX, a program
has, at any moment, one locale designated as the “current locale”.
(Actually, POSIX also supports one locale per thread, but this feature is not
yet universally implemented and not widely used.)
<a name="IDX7"></a>
The locale is partitioned into several aspects, called the “categories”
of the locale. The main aspects are:
</p><ul>
<li>
The character encoding and the character properties. This is the
<code>LC_CTYPE</code> category.
</li><li>
The sorting rules for text. This is the <code>LC_COLLATE</code> category.
</li><li>
The language-specific translations of messages. This is the
<code>LC_MESSAGES</code> category.
</li><li>
The formatting rules for numbers, such as the decimal separator. This is
the <code>LC_NUMERIC</code> category.
</li><li>
The formatting rules for amounts of money. This is the <code>LC_MONETARY</code>
category.
</li><li>
The formatting of date and time. This is the <code>LC_TIME</code> category.
</li></ul>

<a name="IDX8"></a>
<p>In particular, the <code>LC_CTYPE</code> category of the current locale determines
the character encoding. This is the encoding of ‘<samp>char *</samp>’ strings.
We also call it the “locale encoding”. GNU libunistring has a function,
<code>locale_charset</code>, that returns a standardized (platform-independent)
name for this encoding.
</p>
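
<p>As a rough illustration of how the <code>LC_CTYPE</code> category
determines the encoding, the following sketch queries the encoding name
using only POSIX facilities (<code>setlocale</code> and
<code>nl_langinfo</code>). The function name
<code>sketch_locale_codeset</code> is hypothetical. Unlike
<code>locale_charset</code>, the name returned here is whatever the
platform chooses to report, so it may differ between systems for the same
encoding.
</p>

```c
#include <langinfo.h>
#include <locale.h>

/* Return the platform's name for the encoding of the LC_CTYPE locale.
   The returned string is owned by the C library and may be overwritten
   by later calls to setlocale or nl_langinfo.  */
const char *sketch_locale_codeset(void)
{
    /* Adopt the user's LC_CTYPE setting from the environment.  Only
       this category is needed to learn the character encoding.  If the
       call fails, the program stays in the "C" locale.  */
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

<p>A program would typically call this once at startup. Because the
spelling of the result is platform-dependent, portable code that needs to
compare encoding names is better served by the standardized names that
<code>locale_charset</code> returns.
</p>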
<p>All locale encodings used on glibc systems are essentially ASCII-compatible:
most graphic ASCII characters have the same representation, as a single byte,
in that encoding as in ASCII.
</p>
<p>Among the possible locale encodings are UTF-8 and GB18030. Both can
represent any Unicode character as a sequence of bytes. UTF-8 is used in
most of the world, whereas GB18030 is used in the People's Republic of China,
because it is backward compatible with the GB2312 encoding that was used
there earlier.
</p>
<p>The legacy locale encodings, ISO-8859-15 (which supplanted ISO-8859-1 in
most of Europe), ISO-8859-2, KOI8-R, EUC-JP, etc., are still in use in
some places, though.
</p>
<p>UTF-16 and UTF-32 are not used as locale encodings, because they are not
ASCII-compatible.
</p>
<hr size="6">
<a name="In_002dmemory-representation"></a>
<a name="SEC5"></a>
<h2 class="section"> <a href="libunistring_toc.html#TOC5">1.4 Choice of in-memory representation of strings</a> </h2>

<p>There are three ways of representing strings in the memory of a running
program.
</p><ul>
<li>
As ‘<samp>char *</samp>’ strings. Such strings are represented in locale encoding.
This approach is employed when not much text processing is done by the
program. When some Unicode-aware processing is to be done, a string is
converted to Unicode on the fly and back to locale encoding afterwards.
</li><li>
As UTF-8 or UTF-16 or UTF-32 strings. This implies that conversion from
locale encoding to Unicode is performed on input, and in the opposite
direction on output. This approach is employed when the program does
a significant amount of text processing, or when the program has multiple
threads operating on the same data but in different locales.
</li><li>
As ‘<samp>wchar_t *</samp>’, a.k.a. “wide strings”. This approach is misguided,
see <a href="libunistring_18.html#SEC81">The <code>wchar_t</code> mess</a>.
</li></ul>

<p>Of course, a ‘<samp>char *</samp>’ string can, in some cases, be encoded in UTF-8.
Choose the data type according to what you can guarantee about how the
string is encoded: If a string is encoded in the locale encoding, or if you
don't know how it's encoded, use ‘<samp>char *</samp>’. If, on the other hand,
you can <em>guarantee</em> that it is UTF-8 encoded, then you can use the
UTF-8 string type, <code>uint8_t *</code>, for it.
</p>
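
<p>One way to obtain such a guarantee for bytes of unknown origin is to
check them for well-formedness. The following is a minimal sketch of such
a check, written without any library dependency; the name
<code>sketch_is_utf8</code> is hypothetical. In a program that uses
libunistring, the <code>u8_check</code> function from
<code>&lt;unistr.h&gt;</code> does this job, and strings in locale
encoding are better converted with the functions from
<code>&lt;uniconv.h&gt;</code> than inspected by hand.
</p>

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if the n bytes starting at s are well-formed UTF-8:
   no stray or missing continuation bytes, no overlong forms, no
   surrogates, nothing beyond U+10FFFF.  */
bool sketch_is_utf8(const char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = (unsigned char) s[i];
        size_t len;             /* length of this character in bytes */
        unsigned long uc;       /* decoded code point */
        unsigned long min;      /* smallest value allowed for this length */

        if (c < 0x80) {
            i++;
            continue;
        } else if (c >= 0xC2 && c <= 0xDF) {
            len = 2; uc = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            len = 3; uc = c & 0x0F; min = 0x800;
        } else if (c >= 0xF0 && c <= 0xF4) {
            len = 4; uc = c & 0x07; min = 0x10000;
        } else {
            return false;       /* continuation byte or invalid lead byte */
        }

        if (n - i < len)
            return false;       /* character is cut off at the end */

        for (size_t k = 1; k < len; k++) {
            unsigned char cc = (unsigned char) s[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;   /* expected a continuation byte */
            uc = (uc << 6) | (cc & 0x3F);
        }

        if (uc < min || uc > 0x10FFFF || (uc >= 0xD800 && uc <= 0xDFFF))
            return false;       /* overlong, out of range, or surrogate */

        i += len;
    }
    return true;
}
```

<p>A check like this only tells you that the bytes <em>could</em> be
UTF-8. Text in a legacy encoding that happens to consist of ASCII bytes
passes as well, so knowledge of where the data came from remains the
stronger guarantee.
</p>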
<p>The five types <code>char *</code>, <code>uint8_t *</code>, <code>uint16_t *</code>,
<code>uint32_t *</code>, and <code>wchar_t *</code> are incompatible types at the C
level. Therefore, ‘<samp>gcc -Wall</samp>’ will produce a warning if, by mistake,
your code contains a mismatch between these types. In the context of
using GNU libunistring, even a warning about a mismatch between
<code>char *</code> and <code>uint8_t *</code> is a sign of a bug in your code
that you should not try to silence through a cast.
</p>
<hr size="6">
<a name="char-_002a-strings"></a>
<a name="SEC6"></a>
<h2 class="section"> <a href="libunistring_toc.html#TOC6">1.5 ‘<samp>char *</samp>’ strings</a> </h2>

<p>The classical C strings, with their C library support standardized by
ISO C and POSIX, can be used in internationalized programs with some
precautions. The problem with this API is that many of the C library
functions for strings don't work correctly on strings in locale
encodings, leading to bugs that only people in some cultures of the
world will experience.
</p>
<a name="IDX9"></a>
<p>The first problem with the C library API is the support of multibyte
locales. Depending on the locale encoding, every character is, in general,
represented by one or more bytes (up to 4 bytes in practice, but
use <code>MB_LEN_MAX</code> instead of the number 4 in the code).
When every character is represented by only 1 byte, we speak of a
“unibyte locale”, otherwise of a “multibyte locale”. It is important
to realize that the majority of Unix installations nowadays use UTF-8
or GB18030 as locale encoding; therefore, the majority of users are
using multibyte locales.
</p>
<a name="IDX10"></a>
<p>The important fact to remember is:
</p><table class="cartouche" border="1"><tr><td>
<p><em>A ‘<samp>char</samp>’ is a byte, not a character.</em>
</p></td></tr></table>

<p>As a consequence:
</p><ul>
<li>
The <a href="http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/ctype.h.html"><code>&lt;ctype.h&gt;</code></a> API is useless in this context; it does not work in
multibyte locales.
</li><li>
The <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strlen.html"><code>strlen</code></a> function does not return the number of characters
in a string. Nor does it return the number of screen columns occupied
by a string after it is output. It merely returns the number of
<em>bytes</em> occupied by a string.
</li><li>
Truncating a string, for example, with <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strncpy.html"><code>strncpy</code></a>, can have the
effect of truncating it in the middle of a multibyte character. Such
a string will, when output, have a garbled character at its end, often
represented by a hollow box.
</li><li>
<a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strchr.html"><code>strchr</code></a> and <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strrchr.html"><code>strrchr</code></a> do not work with multibyte strings
if the locale encoding is GB18030 and the character to be searched is
a digit.
</li><li>
<a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strstr.html"><code>strstr</code></a> does not work with multibyte strings if the locale encoding
is different from UTF-8.
</li><li>
<a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strcspn.html"><code>strcspn</code></a>, <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strpbrk.html"><code>strpbrk</code></a>, <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strspn.html"><code>strspn</code></a> cannot work
correctly in multibyte locales: they assume the second argument is a list of
single-byte characters. Even in this simple case, they do not work with
multibyte strings if the locale encoding is GB18030 and one of the
characters to be searched is a digit.
</li><li>
<a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strsep.html"><code>strsep</code></a> and <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strtok_r.html"><code>strtok_r</code></a> do not work with multibyte strings
unless all of the delimiter characters are ASCII characters &lt; 0x30.
</li><li>
The <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strcasecmp.html"><code>strcasecmp</code></a>, <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strncasecmp.html"><code>strncasecmp</code></a>, and <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strcasestr.html"><code>strcasestr</code></a>
functions do not work with multibyte strings.
</li></ul>
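
<p>The points about <code>strlen</code> and truncation can be illustrated
with a small sketch. It assumes that the string is valid UTF-8, which is
only one of the possible locale encodings, and it counts Unicode
characters, not screen columns. The names
<code>sketch_utf8_char_count</code> and
<code>sketch_utf8_truncate_len</code> are hypothetical.
</p>

```c
#include <stddef.h>
#include <string.h>

/* Count the characters of a NUL-terminated string that is assumed to
   be valid UTF-8.  Every byte that is not a continuation byte
   (bit pattern 10xxxxxx) starts a new character.  */
size_t sketch_utf8_char_count(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            count++;
    return count;
}

/* Return the largest length, not exceeding max_bytes, at which the
   valid UTF-8 string s can be cut without splitting a character.  */
size_t sketch_utf8_truncate_len(const char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    size_t cut;

    if (len <= max_bytes)
        return len;
    cut = max_bytes;
    /* s[cut] is the first byte that would be dropped.  If it is a
       continuation byte, the cut falls inside a character, so back up
       to the start of that character.  */
    while (cut > 0 && ((unsigned char) s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}
```

<p>For the UTF-8 string <code>"na\xC3\xAFve"</code> (“naïve”),
<code>strlen</code> returns 6 while the character count is 5. A limit of
3 bytes would cut the two-byte character in half, so the second function
returns 2 instead.
</p>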

<p>Workarounds can be found in GNU gnulib
<a href="https://www.gnu.org/software/gnulib/">https://www.gnu.org/software/gnulib/</a>.
</p><ul>
<li>
gnulib has modules ‘<samp>mbchar</samp>’, ‘<samp>mbiter</samp>’, ‘<samp>mbuiter</samp>’ that
represent multibyte characters and allow iterating across a multibyte
string with the same ease as through a unibyte string.
</li><li>
gnulib has functions <code>mbslen</code> and <code>mbswidth</code> that can be
used instead of <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strlen.html"><code>strlen</code></a> when the number of characters or the
number of screen columns of a string is requested.
</li><li>
gnulib has functions <code>mbschr</code> and <code>mbsrrchr</code> that are
like <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strchr.html"><code>strchr</code></a> and <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strrchr.html"><code>strrchr</code></a>, but work in multibyte locales.
</li><li>
gnulib has a function <code>mbsstr</code> that is like <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strstr.html"><code>strstr</code></a>, but works
in multibyte locales.
</li><li>
gnulib has functions <code>mbscspn</code>, <code>mbspbrk</code>, <code>mbsspn</code>
that are like <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strcspn.html"><code>strcspn</code></a>, <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strpbrk.html"><code>strpbrk</code></a>, <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strspn.html"><code>strspn</code></a>, but
work in multibyte locales.
</li><li>
gnulib has functions <code>mbssep</code> and <code>mbstok_r</code> that are
like <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strsep.html"><code>strsep</code></a> and <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strtok_r.html"><code>strtok_r</code></a>, but work in multibyte locales.
</li><li>
gnulib has functions <code>mbscasecmp</code>, <code>mbsncasecmp</code>,
<code>mbspcasecmp</code>, and <code>mbscasestr</code> that are like <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strcasecmp.html"><code>strcasecmp</code></a>,
<a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strncasecmp.html"><code>strncasecmp</code></a>, and <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/strcasestr.html"><code>strcasestr</code></a>, but
work in multibyte locales. Still, the function <code>ulc_casecmp</code> is
preferable to these functions; see below.
</li></ul>
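
<p>To see why a multibyte-aware replacement such as <code>mbschr</code>
is needed, consider what a byte-oriented search does with GB18030 text.
In that encoding, the bytes 0x30 to 0x39 (the ASCII digits) also occur as
the second and fourth bytes of four-byte characters. The sketch below
hard-codes one such character, so it does not depend on the locale; the
names <code>sketch_gb18030_text</code> and
<code>sketch_find_digit_zero</code> are hypothetical.
</p>

```c
#include <string.h>

/* The four bytes 0x81 0x30 0x81 0x30 are the GB18030 encoding of the
   single character U+0080.  They are followed by the ASCII letters
   "ab".  The text contains no digit character at all.  */
const char sketch_gb18030_text[] = "\x81\x30\x81\x30" "ab";

/* Byte-oriented search for the digit '0', which is the byte 0x30.  */
const char *sketch_find_digit_zero(const char *s)
{
    return strchr(s, '0');
}
```

<p>Applied to <code>sketch_gb18030_text</code>, the search does not
return <code>NULL</code>. It returns a pointer to the second byte of the
four-byte character, which is a false match in the middle of a
character.
</p>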

<p>The second problem with the C library API is that it has some assumptions built-in that are not valid in some languages:
</p><ul>
<li>
It assumes that there are only two forms of every character: uppercase
and lowercase. This is not true for Croatian, where the character
<small>LETTER DZ WITH CARON</small> comes in three forms:
<small>LATIN CAPITAL LETTER DZ WITH CARON</small> (DZ),
<small>LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON</small> (Dz),
<small>LATIN SMALL LETTER DZ WITH CARON</small> (dz).
</li><li>
It assumes that uppercasing of 1 character leads to 1 character. This
is not true for German, where the <small>LATIN SMALL LETTER SHARP S</small>, when
uppercased, becomes ‘<samp>SS</samp>’.
</li><li>
It assumes that there is a 1:1 mapping between uppercase and lowercase forms.
This is not true for the Greek sigma: <small>GREEK CAPITAL LETTER SIGMA</small> is
the uppercase of both <small>GREEK SMALL LETTER SIGMA</small> and
<small>GREEK SMALL LETTER FINAL SIGMA</small>.
</li><li>
It assumes that the upper/lowercase mappings are position independent.
This is not true for the Greek sigma and the Lithuanian i.
</li></ul>

<p>The correct way to deal with this problem is
</p><ol>
<li>
to provide functions for titlecasing, as well as for upper- and
lowercasing,
</li><li>
to view case transformations as functions that operate on strings,
rather than on characters.
</li></ol>

<p>This is implemented in this library, through the functions declared in <code>&lt;unicase.h&gt;</code>, see <a href="libunistring_14.html#SEC67">Case mappings <code>&lt;unicase.h&gt;</code></a>.
</p>
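
<p>Why case transformations have to operate on strings can be shown with
a deliberately tiny sketch. It handles only ASCII letters and the German
sharp s in UTF-8 and is in no way the library's implementation; the name
<code>sketch_toupper_utf8</code> is hypothetical. Real code would use a
function from <code>&lt;unicase.h&gt;</code>, such as
<code>u8_toupper</code>.
</p>

```c
#include <stddef.h>

/* Toy uppercasing of a NUL-terminated UTF-8 string.  ASCII letters
   a-z are uppercased, LATIN SMALL LETTER SHARP S (bytes 0xC3 0x9F)
   becomes "SS", and every other byte is copied unchanged.
   At most dstsize - 1 bytes plus a terminating NUL are stored in dst.
   Returns the length in bytes that the complete result needs.  */
size_t sketch_toupper_utf8(const char *src, char *dst, size_t dstsize)
{
    size_t out = 0;

    for (size_t i = 0; src[i] != '\0'; i++) {
        unsigned char c = (unsigned char) src[i];
        char buf[2];
        size_t n = 1;

        if (c == 0xC3 && (unsigned char) src[i + 1] == 0x9F) {
            buf[0] = 'S';       /* one character in, two characters out */
            buf[1] = 'S';
            n = 2;
            i++;                /* skip the second byte of the sharp s */
        } else if (c >= 'a' && c <= 'z') {
            buf[0] = (char) (c - 'a' + 'A');
        } else {
            buf[0] = (char) c;
        }

        for (size_t k = 0; k < n; k++, out++)
            if (out + 1 < dstsize)
                dst[out] = buf[k];
    }
    if (dstsize > 0)
        dst[out < dstsize ? out : dstsize - 1] = '\0';
    return out;
}
```

<p>Uppercasing the three-character word “Fuß” this way yields the
four-character string “FUSS”. A function that maps one character to one
character, like <code>toupper</code>, cannot express this result.
</p>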
<hr size="6">
<a name="Unicode-strings"></a>
<a name="SEC7"></a>
<h2 class="section"> <a href="libunistring_toc.html#TOC7">1.6 Unicode strings</a> </h2>

<p>libunistring supports Unicode strings in three representations:
<a name="IDX11"></a>
<a name="IDX12"></a>
<a name="IDX13"></a>
</p><ul>
<li>
UTF-8 strings, through the type ‘<samp>uint8_t *</samp>’. The units are bytes
(<code>uint8_t</code>).
</li><li>
UTF-16 strings, through the type ‘<samp>uint16_t *</samp>’. The units are 16-bit
memory words (<code>uint16_t</code>).
</li><li>
UTF-32 strings, through the type ‘<samp>uint32_t *</samp>’. The units are 32-bit
memory words (<code>uint32_t</code>).
</li></ul>

<p>As with C strings, there are two variants:
</p><ul>
<li>
Unicode strings with a terminating NUL character are represented as
a pointer to the first unit of the string. There is a unit containing
a 0 value at the end. It is considered part of the string for all
memory allocation purposes, but is not considered part of the string
for all other logical purposes.
</li><li>
Unicode strings where embedded NUL characters are allowed. These
are represented by a pointer to the first unit and the number of units
(not bytes!) of the string. In this setting, there is no trailing
zero-valued unit used as “end marker”.
</li></ul>
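
<p>The difference between the two variants, and between units and bytes,
can be sketched for UTF-16 as follows. The names
<code>sketch_u16_units</code> and <code>sketch_u16_count_unit</code> are
hypothetical; for the first task libunistring itself provides
<code>u16_strlen</code> in <code>&lt;unistr.h&gt;</code>.
</p>

```c
#include <stddef.h>
#include <stdint.h>

/* First variant: the string ends at the first zero-valued unit.
   Returns the number of units before that terminator.  */
size_t sketch_u16_units(const uint16_t *s)
{
    size_t n = 0;

    while (s[n] != 0)
        n++;
    return n;
}

/* Second variant: the string is given by a pointer and a length in
   units.  A zero-valued unit is ordinary content here, so it can be
   searched for like any other unit.  */
size_t sketch_u16_count_unit(const uint16_t *s, size_t n_units,
                             uint16_t unit)
{
    size_t count = 0;

    for (size_t i = 0; i < n_units; i++)
        if (s[i] == unit)
            count++;
    return count;
}
```

<p>A NUL-terminated UTF-16 string holding “H” followed by U+1F600 has a
length of 3 units, which is 6 bytes, and needs room for 4 units when it
is allocated, because the terminating unit has to be stored as well.
</p>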

<hr size="6">
<table cellpadding="1" cellspacing="1" border="0">
<tr><td valign="middle" align="left">[<a href="#SEC1" title="Beginning of this chapter or previous chapter"> &lt;&lt; </a>]</td>
<td valign="middle" align="left">[<a href="libunistring_2.html#SEC8" title="Next chapter"> &gt;&gt; </a>]</td>
<td valign="middle" align="left"> </td>
<td valign="middle" align="left"> </td>
<td valign="middle" align="left"> </td>
<td valign="middle" align="left"> </td>
<td valign="middle" align="left"> </td>
<td valign="middle" align="left">[<a href="libunistring_toc.html#SEC_Top" title="Cover (top) of document">Top</a>]</td>
<td valign="middle" align="left">[<a href="libunistring_toc.html#SEC_Contents" title="Table of contents">Contents</a>]</td>
<td valign="middle" align="left">[<a href="libunistring_21.html#SEC92" title="Index">Index</a>]</td>
<td valign="middle" align="left">[<a href="libunistring_abt.html#SEC_About" title="About (help)"> ? </a>]</td>
</tr></table>
<p>
<font size="-1">
This document was generated by <em>Bruno Haible</em> on <em>October, 16 2022</em> using <a href="https://www.nongnu.org/texi2html/"><em>texi2html 1.78a</em></a>.
</font>
<br>

</p>
</body>
</html>