libunibreak 6.1
linebreakdef.h File Reference

Definitions of internal data structures, declarations of global variables, and function prototypes for the line breaking algorithm.

#include "unibreakdef.h"

Data Structures

struct  LineBreakProperties
 Struct for entries of line break properties.
 
struct  LineBreakPropertiesLang
 Struct for association of language-specific line breaking properties with language names.
 
struct  LineBreakContext
 Context representing internal state of the line breaking algorithm.
 

Enumerations

enum  LineBreakClass {
  LBP_Undefined , LBP_OP , LBP_CL , LBP_CP ,
  LBP_QU , LBP_GL , LBP_NS , LBP_EX ,
  LBP_SY , LBP_IS , LBP_PR , LBP_PO ,
  LBP_NU , LBP_AL , LBP_HL , LBP_ID ,
  LBP_IN , LBP_HY , LBP_BA , LBP_BB ,
  LBP_B2 , LBP_ZW , LBP_CM , LBP_WJ ,
  LBP_H2 , LBP_H3 , LBP_JL , LBP_JV ,
  LBP_JT , LBP_RI , LBP_EB , LBP_EM ,
  LBP_ZWJ , LBP_CB , LBP_AI , LBP_BK ,
  LBP_CJ , LBP_CR , LBP_LF , LBP_NL ,
  LBP_SA , LBP_SG , LBP_SP , LBP_XX
}
 Line break classes.
 
enum  BreakOutputType { LBOT_PER_CODE_UNIT , LBOT_PER_CODE_POINT }
 

Functions

void lb_init_break_context (struct LineBreakContext *lbpCtx, utf32_t ch, const char *lang)
 Initializes line breaking context for a given language.
 
int lb_process_next_char (struct LineBreakContext *lbpCtx, utf32_t ch)
 Updates LineBreakContext for the next codepoint and returns the detected break.
 
enum LineBreakClass lb_get_char_class (const struct LineBreakContext *lbpCtx, utf32_t ch)
 Gets the line breaking class of a character for a line breaking context.
 
size_t set_linebreaks (const void *s, size_t len, const char *lang, enum BreakOutputType outputType, char *brks, get_next_char_t get_next_char)
 Sets the line breaking information for a generic input string.
 

Variables

const struct LineBreakProperties lb_prop_supplementary []
 Line breaking properties for supplementary planes.
 
const unsigned int lb_prop_supplementary_len
 
const char lb_prop_bmp []
 Line breaking properties for BMP.
 
const struct LineBreakPropertiesLang lb_prop_lang_map []
 Association data of language-specific line breaking properties with language names.
 

Detailed Description

Definitions of internal data structures, declarations of global variables, and function prototypes for the line breaking algorithm.

Author
Wu Yongwei
Petr Filipsky

Enumeration Type Documentation

◆ BreakOutputType

Enumerator
LBOT_PER_CODE_UNIT 
LBOT_PER_CODE_POINT 

◆ LineBreakClass

Line break classes.

This enumeration mirrors Table 1 of Unicode Standard Annex #14 (UAX #14).

Enumerator
LBP_Undefined 

Undefined.

LBP_OP 

Opening punctuation.

LBP_CL 

Closing punctuation.

LBP_CP 

Closing parenthesis.

LBP_QU 

Ambiguous quotation.

LBP_GL 

Glue.

LBP_NS 

Non-starters.

LBP_EX 

Exclamation/Interrogation.

LBP_SY 

Symbols allowing break after.

LBP_IS 

Infix separator.

LBP_PR 

Prefix.

LBP_PO 

Postfix.

LBP_NU 

Numeric.

LBP_AL 

Alphabetic.

LBP_HL 

Hebrew letter.

LBP_ID 

Ideographic.

LBP_IN 

Inseparable characters.

LBP_HY 

Hyphen.

LBP_BA 

Break after.

LBP_BB 

Break before.

LBP_B2 

Break opportunity before and after (but not between a pair).

LBP_ZW 

Zero-width space.

LBP_CM 

Combining marks.

LBP_WJ 

Word joiner.

LBP_H2 

Hangul LV.

LBP_H3 

Hangul LVT.

LBP_JL 

Hangul L Jamo.

LBP_JV 

Hangul V Jamo.

LBP_JT 

Hangul T Jamo.

LBP_RI 

Regional indicator.

LBP_EB 

Emoji base.

LBP_EM 

Emoji modifier.

LBP_ZWJ 

Zero width joiner.

LBP_CB 

Contingent break.

LBP_AI 

Ambiguous (alphabetic or ideographic).

LBP_BK 

Break (mandatory)

LBP_CJ 

Conditional Japanese starter.

LBP_CR 

Carriage return.

LBP_LF 

Line feed.

LBP_NL 

Next line.

LBP_SA 

South-East Asian.

LBP_SG 

Surrogates.

LBP_SP 

Space.

LBP_XX 

Unknown.

Function Documentation

◆ lb_get_char_class()

enum LineBreakClass lb_get_char_class ( const struct LineBreakContext * lbpCtx,
utf32_t ch )

Gets the line breaking class of a character for a line breaking context.

This function will check the language-specific data first, and then the default data if there is no language-specific property available for the character.

Parameters
lbpCtx  pointer to the line breaking context
ch  character to check
Returns
the line breaking class if found; LBP_XX otherwise
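
The lookup order described above (language-specific data first, then the default data) can be sketched roughly as follows. This is only an illustration under stated assumptions: the types `cp_t`, `demo_class`, `demo_range` and the functions `demo_lookup` and `demo_get_class` are hypothetical stand-ins, not the library's own definitions, and the linear scan is chosen for brevity rather than taken from the implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the library's real types. */
typedef uint32_t cp_t;                        /* plays the role of utf32_t */
enum demo_class { DEMO_XX, DEMO_AL, DEMO_ID, DEMO_NS };

struct demo_range {                           /* one property entry */
    cp_t start;                               /* first code point in range */
    cp_t end;                                 /* last code point, inclusive */
    enum demo_class cls;                      /* class assigned to the range */
};

/* Scans one table; returns DEMO_XX when no range contains ch. */
static enum demo_class demo_lookup(const struct demo_range *tbl, size_t n,
                                   cp_t ch)
{
    for (size_t i = 0; i < n; ++i)
        if (ch >= tbl[i].start && ch <= tbl[i].end)
            return tbl[i].cls;
    return DEMO_XX;
}

/* Consults the language-specific table first, then the default table. */
enum demo_class demo_get_class(const struct demo_range *lang_tbl, size_t lang_n,
                               const struct demo_range *def_tbl, size_t def_n,
                               cp_t ch)
{
    assert(def_tbl != NULL);                  /* a default table is required */
    if (lang_tbl != NULL) {
        enum demo_class c = demo_lookup(lang_tbl, lang_n, ch);
        if (c != DEMO_XX)
            return c;                         /* language override wins */
    }
    return demo_lookup(def_tbl, def_n, ch);   /* may still be DEMO_XX */
}
```

With this shape, a language table only needs to list the characters whose class differs from the default; everything else falls through to the default data.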

◆ lb_init_break_context()

void lb_init_break_context ( struct LineBreakContext * lbpCtx,
utf32_t ch,
const char * lang )

Initializes line breaking context for a given language.

Parameters
[in,out]  lbpCtx  pointer to the line breaking context
[in]  ch  the first character to process
[in]  lang  language of the input
Postcondition
the line breaking context is initialized

◆ lb_process_next_char()

int lb_process_next_char ( struct LineBreakContext * lbpCtx,
utf32_t ch )

Updates LineBreakContext for the next codepoint and returns the detected break.

Parameters
[in,out]  lbpCtx  pointer to the line breaking context
[in]  ch  Unicode codepoint
Returns
break result, one of LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, and LINEBREAK_NOBREAK
Postcondition
the line breaking context is updated

◆ set_linebreaks()

size_t set_linebreaks ( const void * s,
size_t len,
const char * lang,
enum BreakOutputType outputType,
char * brks,
get_next_char_t get_next_char )

Sets the line breaking information for a generic input string.

This implementation currently customizes line breaking for the following ISO 639-1 language codes (passed as lang):

  • de (German)
  • en (English)
  • es (Spanish)
  • fr (French)
  • ja (Japanese)
  • ko (Korean)
  • ru (Russian)
  • zh (Chinese)

In addition, the suffix "-strict" may be appended to the language code to select strict (as opposed to normal) line-breaking behaviour. See the Conditional Japanese Starter section of UAX #14 for more details.

Parameters
[in]  s  input string
[in]  len  length of the input
[in]  lang  language of the input
[in]  outputType  output per code-unit or per code-point
[out]  brks  pointer to the output breaking data, containing LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, LINEBREAK_NOBREAK, or LINEBREAK_INSIDEACHAR
[in]  get_next_char  function to get the next UTF-32 character
Returns
The number of entries in brks filled. This is equal to the number of code-points or code-units in the source string, depending on the outputType parameter.
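
To make the difference between the two output types concrete, here is a minimal sketch of a UTF-16 decoding callback and of the entry counts it implies. It rests on assumptions that this page does not state: that the callback takes `(s, len, ip)` and advances `*ip`, and that end of input is signalled by `0xFFFFFFFF`. The names `DEMO_EOS`, `demo_next_utf16` and `demo_count` are hypothetical and do not belong to the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_EOS 0xFFFFFFFFu    /* assumed end-of-string marker */

/* Assumed callback shape: decode the code point at *ip and advance *ip. */
uint32_t demo_next_utf16(const void *s, size_t len, size_t *ip)
{
    const uint16_t *p = (const uint16_t *)s;

    assert(*ip <= len);
    if (*ip == len)
        return DEMO_EOS;

    uint32_t ch = p[(*ip)++];
    if (ch < 0xD800 || ch > 0xDBFF)
        return ch;              /* not a high surrogate: one code unit */
    if (*ip == len || p[*ip] < 0xDC00 || p[*ip] > 0xDFFF)
        return ch;              /* unpaired high surrogate, passed through */

    uint32_t lo = p[(*ip)++];   /* consume the low surrogate too */
    return 0x10000u + ((ch - 0xD800u) << 10) + (lo - 0xDC00u);
}

/* Number of entries an output buffer would need for the whole input. */
size_t demo_count(const uint16_t *s, size_t len, int per_code_point)
{
    size_t i = 0, points = 0;

    while (demo_next_utf16(s, len, &i) != DEMO_EOS)
        ++points;
    return per_code_point ? points : i;   /* i == len after the loop */
}
```

For the three code units `0x0041 0xD83D 0xDE00` (the letter A followed by U+1F600), a per-code-unit buffer needs three entries and a per-code-point buffer needs two. In per-code-unit output, LINEBREAK_INSIDEACHAR is the value listed above for positions that fall inside a multi-unit character.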

Variable Documentation

◆ lb_prop_bmp

const char lb_prop_bmp[]
extern

Line breaking properties for BMP.

◆ lb_prop_lang_map

const struct LineBreakPropertiesLang lb_prop_lang_map[]
extern

Association data of language-specific line breaking properties with language names.

This is the definition for the static data in this file. If you want more flexibility, or do not need the data here, you may want to redefine lb_prop_lang_map in your C source file.
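
A replacement table might be organised along these lines. The sketch uses hypothetical stand-in types (`demo_props`, `demo_lang_entry`) rather than the library's structs, whose fields are not listed on this page, and it assumes a NULL-terminated table searched by language-name prefix; treat both as assumptions, not as a description of the shipped implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; the real structs are defined by the library. */
struct demo_props { unsigned start, end; int cls; };

struct demo_lang_entry {
    const char *lang;                   /* language name, e.g. "fr" */
    size_t namelen;                     /* length of lang */
    const struct demo_props *props;     /* language-specific overrides */
};

/* Made-up override tables, each ending with an all-zero entry. */
static const struct demo_props demo_props_fr[] = {
    { 0x00AB, 0x00AB, 1 }, { 0, 0, 0 }
};
static const struct demo_props demo_props_ru[] = {
    { 0x2014, 0x2014, 2 }, { 0, 0, 0 }
};

/* Custom map: only the languages this application cares about. */
const struct demo_lang_entry demo_lang_map[] = {
    { "fr", 2, demo_props_fr },
    { "ru", 2, demo_props_ru },
    { NULL, 0, NULL }                   /* assumed terminator */
};

/* Returns the overrides whose name is a prefix of lang, or NULL. */
const struct demo_props *demo_find_lang(const char *lang)
{
    if (lang == NULL)
        return NULL;
    for (const struct demo_lang_entry *e = demo_lang_map; e->lang != NULL; ++e) {
        assert(e->namelen == strlen(e->lang));   /* table sanity check */
        if (strncmp(lang, e->lang, e->namelen) == 0)
            return e->props;
    }
    return NULL;
}
```

Prefix matching means a tag such as "fr-CA" resolves to the "fr" entry, while a language without an entry yields NULL and would be handled by the default data alone.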

◆ lb_prop_supplementary

const struct LineBreakProperties lb_prop_supplementary[]
extern

Line breaking properties for supplementary planes.

◆ lb_prop_supplementary_len

const unsigned int lb_prop_supplementary_len
extern