mirror of
https://github.com/Gator96100/ProxSpace.git
synced 2025-03-12 04:36:22 -07:00
305 lines
13 KiB
HTML
305 lines
13 KiB
HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html401/loose.dtd">
|
|
<html>
|
|
<!-- Created on October, 16 2022 by texi2html 1.78a -->
|
|
<!--
|
|
Written by: Lionel Cons <Lionel.Cons@cern.ch> (original author)
|
|
Karl Berry <karl@freefriends.org>
|
|
Olaf Bachmann <obachman@mathematik.uni-kl.de>
|
|
and many others.
|
|
Maintained by: Many creative people.
|
|
Send bugs and suggestions to <texi2html-bug@nongnu.org>
|
|
|
|
-->
|
|
<head>
|
|
<title>GNU libunistring: 10. Grapheme cluster breaks in strings <unigbrk.h></title>
|
|
|
|
<meta name="description" content="GNU libunistring: 10. Grapheme cluster breaks in strings <unigbrk.h>">
|
|
<meta name="keywords" content="GNU libunistring: 10. Grapheme cluster breaks in strings <unigbrk.h>">
|
|
<meta name="resource-type" content="document">
|
|
<meta name="distribution" content="global">
|
|
<meta name="Generator" content="texi2html 1.78a">
|
|
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
|
|
<style type="text/css">
|
|
<!--
|
|
a.summary-letter {text-decoration: none}
|
|
pre.display {font-family: serif}
|
|
pre.format {font-family: serif}
|
|
pre.menu-comment {font-family: serif}
|
|
pre.menu-preformatted {font-family: serif}
|
|
pre.smalldisplay {font-family: serif; font-size: smaller}
|
|
pre.smallexample {font-size: smaller}
|
|
pre.smallformat {font-family: serif; font-size: smaller}
|
|
pre.smalllisp {font-size: smaller}
|
|
span.roman {font-family:serif; font-weight:normal;}
|
|
span.sansserif {font-family:sans-serif; font-weight:normal;}
|
|
ul.toc {list-style: none}
|
|
-->
|
|
</style>
|
|
|
|
|
|
</head>
|
|
|
|
<body lang="en" bgcolor="#FFFFFF" text="#000000" link="#0000FF" vlink="#800080" alink="#FF0000">
|
|
|
|
<table cellpadding="1" cellspacing="1" border="0">
|
|
<tr><td valign="middle" align="left">[<a href="libunistring_9.html#SEC53" title="Beginning of this chapter or previous chapter"> << </a>]</td>
|
|
<td valign="middle" align="left">[<a href="libunistring_11.html#SEC57" title="Next chapter"> >> </a>]</td>
|
|
<td valign="middle" align="left"> </td>
|
|
<td valign="middle" align="left"> </td>
|
|
<td valign="middle" align="left"> </td>
|
|
<td valign="middle" align="left"> </td>
|
|
<td valign="middle" align="left"> </td>
|
|
<td valign="middle" align="left">[<a href="libunistring_toc.html#SEC_Top" title="Cover (top) of document">Top</a>]</td>
|
|
<td valign="middle" align="left">[<a href="libunistring_toc.html#SEC_Contents" title="Table of contents">Contents</a>]</td>
|
|
<td valign="middle" align="left">[<a href="libunistring_21.html#SEC92" title="Index">Index</a>]</td>
|
|
<td valign="middle" align="left">[<a href="libunistring_abt.html#SEC_About" title="About (help)"> ? </a>]</td>
|
|
</tr></table>
|
|
|
|
<hr size="2">
|
|
<a name="unigbrk_002eh"></a>
|
|
<a name="SEC54"></a>
|
|
<h1 class="chapter"> <a href="libunistring_toc.html#TOC54">10. Grapheme cluster breaks in strings <code><unigbrk.h></code></a> </h1>
|
|
|
|
<p>This include file declares functions for determining where in a string
|
|
“grapheme clusters” start and end. A “grapheme cluster” is an
|
|
approximation to a user-perceived character, which sometimes
|
|
corresponds to multiple Unicode characters. Editing operations such as
|
|
mouse selection, cursor movement, and backspacing often operate on
|
|
grapheme clusters as units, not on individual characters.
|
|
</p>
|
|
<p>Some grapheme clusters are built from a base character and a combining
|
|
character. The letter ‘<samp>é</samp>’,
|
|
for example, is most commonly represented in Unicode as a single
|
|
character U+00E8 <small>LATIN SMALL LETTER E WITH ACUTE</small>. It is,
|
|
however, equally valid to use the pair of characters U+0065 <small>LATIN
|
|
SMALL LETTER E</small> followed by U+0301 <small>COMBINING ACUTE ACCENT</small>. Since
|
|
the user would perceive this pair of characters as a single character,
|
|
they would be grouped into a single grapheme cluster.
|
|
</p>
|
|
<p>But there are also grapheme clusters that consist of several base characters.
|
|
For example, a Devanagari letter and a Devanagari vowel sign that follows it
|
|
may form a grapheme cluster. Similarly, some pairs of Thai characters and
|
|
Hangul syllables (formed by two or three Hangul characters) are grapheme
|
|
clusters.
|
|
</p>
|
|
|
|
<hr size="6">
|
|
<a name="Grapheme-cluster-breaks-in-a-string"></a>
|
|
<a name="SEC55"></a>
|
|
<h2 class="section"> <a href="libunistring_toc.html#TOC55">10.1 Grapheme cluster breaks in a string</a> </h2>
|
|
|
|
<p>The following functions find a single boundary between grapheme
|
|
clusters in a string.
|
|
</p>
|
|
<dl>
|
|
<dt><u>Function:</u> void <b>u8_grapheme_next</b><i> (const uint8_t *<var>s</var>, const uint8_t *<var>end</var>)</i>
|
|
<a name="IDX769"></a>
|
|
</dt>
|
|
<dt><u>Function:</u> void <b>u16_grapheme_next</b><i> (const uint16_t *<var>s</var>, const uint16_t *<var>end</var>)</i>
|
|
<a name="IDX770"></a>
|
|
</dt>
|
|
<dt><u>Function:</u> void <b>u32_grapheme_next</b><i> (const uint32_t *<var>s</var>, const uint32_t *<var>end</var>)</i>
|
|
<a name="IDX771"></a>
|
|
</dt>
|
|
<dd><p>Returns the start of the next grapheme cluster following <var>s</var>,
|
|
or <var>end</var> if no grapheme cluster break is encountered before it.
|
|
Returns NULL if and only if <code><var>s</var> == <var>end</var></code>.
|
|
</p>
|
|
<p>Note that these functions do not handle the case when a character
|
|
outside of the range between <var>s</var> and <var>end</var> is needed to
|
|
determine the boundary. Use <code>_grapheme_breaks</code> functions for such
|
|
cases.
|
|
</p></dd></dl>
|
|
|
|
<dl>
|
|
<dt><u>Function:</u> void <b>u8_grapheme_prev</b><i> (const uint8_t *<var>s</var>, const uint8_t *<var>start</var>)</i>
|
|
<a name="IDX772"></a>
|
|
</dt>
|
|
<dt><u>Function:</u> void <b>u16_grapheme_prev</b><i> (const uint16_t *<var>s</var>, const uint16_t *<var>start</var>)</i>
|
|
<a name="IDX773"></a>
|
|
</dt>
|
|
<dt><u>Function:</u> void <b>u32_grapheme_prev</b><i> (const uint32_t *<var>s</var>, const uint32_t *<var>start</var>)</i>
|
|
<a name="IDX774"></a>
|
|
</dt>
|
|
<dd><p>Returns the start of the grapheme cluster preceding <var>s</var>, or
|
|
<var>start</var> if no grapheme cluster break is encountered before it.
|
|
Returns NULL if and only if <code><var>s</var> == <var>start</var></code>.
|
|
</p>
|
|
<p>Note that these functions do not handle the case when a character
|
|
outside of the range between <var>start</var> and <var>s</var> is needed to
|
|
determine the boundary. Use <code>_grapheme_breaks</code> functions for such
|
|
cases.
|
|
</p>
|
|
<p>Note also that these functions work only on well-formed Unicode strings.
|
|
</p></dd></dl>
|
|
|
|
<p>The following functions determine all of the grapheme cluster
|
|
boundaries in a string.
|
|
</p>
|
|
<dl>
|
|
<dt><u>Function:</u> void <b>u8_grapheme_breaks</b><i> (const uint8_t *<var>s</var>, size_t <var>n</var>, char *<var>p</var>)</i>
|
|
<a name="IDX775"></a>
|
|
</dt>
|
|
<dt><u>Function:</u> void <b>u16_grapheme_breaks</b><i> (const uint16_t *<var>s</var>, size_t <var>n</var>, char *<var>p</var>)</i>
|
|
<a name="IDX776"></a>
|
|
</dt>
|
|
<dt><u>Function:</u> void <b>u32_grapheme_breaks</b><i> (const uint32_t *<var>s</var>, size_t <var>n</var>, char *<var>p</var>)</i>
|
|
<a name="IDX777"></a>
|
|
</dt>
|
|
<dt><u>Function:</u> void <b>ulc_grapheme_breaks</b><i> (const char *<var>s</var>, size_t <var>n</var>, char *<var>p</var>)</i>
|
|
<a name="IDX778"></a>
|
|
</dt>
|
|
<dt><u>Function:</u> void <b>uc_grapheme_breaks</b><i> (const ucs_t *<var>s</var>, size_t <var>n</var>, char *<var>p</var>)</i>
|
|
<a name="IDX779"></a>
|
|
</dt>
|
|
<dd><p>Determines the grapheme cluster break points in <var>s</var>, an array of
|
|
<var>n</var> units, and stores the result at <code><var>p</var>[0..<var>nx</var>-1]</code>.
|
|
</p><dl compact="compact">
|
|
<dt> <code><var>p</var>[i] = 1</code></dt>
|
|
<dd><p>means that there is a grapheme cluster boundary between
|
|
<code><var>s</var>[i-1]</code> and <code><var>s</var>[i]</code>.
|
|
</p></dd>
|
|
<dt> <code><var>p</var>[i] = 0</code></dt>
|
|
<dd><p>means that <code><var>s</var>[i-1]</code> and <code><var>s</var>[i]</code> are part of the
|
|
same grapheme cluster.
|
|
</p></dd>
|
|
</dl>
|
|
<p><code><var>p</var>[0]</code> is always set to 1, because there is always a
|
|
grapheme cluster break at start of text.
|
|
</p>
|
|
<p>In addition to the above variants for UTF-8, UTF-16, and UTF-32 strings,
|
|
<code><unigbrk.h></code> provides another variant: <code>uc_grapheme_breaks</code>.
|
|
</p>
|
|
<p>This is similar to <code>u32_grapheme_breaks</code>, but it accepts any
|
|
characters which may not be represented in UTF-32, such as control
|
|
characters.
|
|
</p></dd></dl>
|
|
|
|
<hr size="6">
|
|
<a name="Grapheme-cluster-break-property"></a>
|
|
<a name="SEC56"></a>
|
|
<h2 class="section"> <a href="libunistring_toc.html#TOC56">10.2 Grapheme cluster break property</a> </h2>
|
|
|
|
<p>This is a more low-level API. The grapheme cluster break property is a
|
|
property defined in Unicode Standard Annex #29, section “Grapheme Cluster
|
|
Boundaries”, see
|
|
<a href="https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries">https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries</a>.
|
|
It is used for determining the grapheme cluster breaks in a string.
|
|
</p>
|
|
<p>The following are the possible values of the grapheme cluster break
|
|
property. More values may be added in the future.
|
|
</p>
|
|
<dl>
|
|
<dt><u>Constant:</u> int <b>GBP_OTHER</b>
|
|
<a name="IDX780"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_CR</b>
|
|
<a name="IDX781"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_LF</b>
|
|
<a name="IDX782"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_CONTROL</b>
|
|
<a name="IDX783"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_EXTEND</b>
|
|
<a name="IDX784"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_PREPEND</b>
|
|
<a name="IDX785"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_SPACINGMARK</b>
|
|
<a name="IDX786"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_L</b>
|
|
<a name="IDX787"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_V</b>
|
|
<a name="IDX788"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_T</b>
|
|
<a name="IDX789"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_LV</b>
|
|
<a name="IDX790"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_LVT</b>
|
|
<a name="IDX791"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_RI</b>
|
|
<a name="IDX792"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_ZWJ</b>
|
|
<a name="IDX793"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_EB</b>
|
|
<a name="IDX794"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_EM</b>
|
|
<a name="IDX795"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_GAZ</b>
|
|
<a name="IDX796"></a>
|
|
</dt>
|
|
<dt><u>Constant:</u> int <b>GBP_EBG</b>
|
|
<a name="IDX797"></a>
|
|
</dt>
|
|
</dl>
|
|
|
|
<p>The following function looks up the grapheme cluster break property of a
|
|
character.
|
|
</p>
|
|
<dl>
|
|
<dt><u>Function:</u> int <b>uc_graphemeclusterbreak_property</b><i> (ucs4_t <var>uc</var>)</i>
|
|
<a name="IDX798"></a>
|
|
</dt>
|
|
<dd><p>Returns the Grapheme_Cluster_Break property of a Unicode character.
|
|
</p></dd></dl>
|
|
|
|
<p>The following function determines whether there is a grapheme cluster
|
|
break between two Unicode characters. It is the primitive upon which
|
|
the higher-level functions in the previous section are directly based.
|
|
</p>
|
|
<dl>
|
|
<dt><u>Function:</u> bool <b>uc_is_grapheme_break</b><i> (ucs4_t <var>a</var>, ucs4_t <var>b</var>)</i>
|
|
<a name="IDX799"></a>
|
|
</dt>
|
|
<dd><p>Returns true if there is an grapheme cluster boundary between Unicode
|
|
characters <var>a</var> and <var>b</var>.
|
|
</p>
|
|
<p>There is always a grapheme cluster break at the start or end of text.
|
|
You can specify zero for <var>a</var> or <var>b</var> to indicate start of text or end
|
|
of text, respectively.
|
|
</p>
|
|
<p>This implements the extended (not legacy) grapheme cluster rules
|
|
described in the Unicode standard, because the standard says that they
|
|
are preferred.
|
|
</p>
|
|
<p>Note that this function does not handle the case when three or more
|
|
consecutive characters are needed to determine the boundary. Use
|
|
<code>uc_grapheme_breaks</code> for such cases.
|
|
</p></dd></dl>
|
|
<hr size="6">
|
|
<table cellpadding="1" cellspacing="1" border="0">
|
|
<tr><td valign="middle" align="left">[<a href="#SEC54" title="Beginning of this chapter or previous chapter"> << </a>]</td>
|
|
<td valign="middle" align="left">[<a href="libunistring_11.html#SEC57" title="Next chapter"> >> </a>]</td>
|
|
<td valign="middle" align="left"> </td>
|
|
<td valign="middle" align="left"> </td>
|
|
<td valign="middle" align="left"> </td>
|
|
<td valign="middle" align="left"> </td>
|
|
<td valign="middle" align="left"> </td>
|
|
<td valign="middle" align="left">[<a href="libunistring_toc.html#SEC_Top" title="Cover (top) of document">Top</a>]</td>
|
|
<td valign="middle" align="left">[<a href="libunistring_toc.html#SEC_Contents" title="Table of contents">Contents</a>]</td>
|
|
<td valign="middle" align="left">[<a href="libunistring_21.html#SEC92" title="Index">Index</a>]</td>
|
|
<td valign="middle" align="left">[<a href="libunistring_abt.html#SEC_About" title="About (help)"> ? </a>]</td>
|
|
</tr></table>
|
|
<p>
|
|
<font size="-1">
|
|
This document was generated by <em>Bruno Haible</em> on <em>October, 16 2022</em> using <a href="https://www.nongnu.org/texi2html/"><em>texi2html 1.78a</em></a>.
|
|
</font>
|
|
<br>
|
|
|
|
</p>
|
|
</body>
|
|
</html>
|