This API provides access to detailed information for all characters, blocks and planes in version 15.1.0 of the Unicode Standard (released Sep 12, 2023). In an attempt to adhere to the tenets of REST, the API is organized around the following principles:
- URLs are predictable and resource-oriented.
- Uses standard HTTP verbs and response codes.
- Returns JSON-encoded responses.
- Interactive API Documents (Swagger UI)
- Created by Aaron Luna
The top-level API resources for Unicode Characters and Unicode Blocks have support for retrieving all character/block objects via "list" API methods. These API methods (/v1/characters and /v1/blocks) share a common structure, taking at least these three parameters: limit, starting_after, and ending_before.
For your initial request, you should only provide a value for limit (if the default value of limit=10 is ok, you do not need to provide values for any parameter in your initial request). The response of a list API method contains a data parameter that represents a single page of results, and a hasMore parameter that indicates whether the list contains more results after this set.
The starting_after parameter acts as a cursor to navigate between paginated responses; however, the value used for this parameter is different for each endpoint. For Unicode Characters, the value of this parameter is the codepoint property, while for Unicode Blocks the id property is used.
For example, if you request 10 items and the response contains hasMore=true, there are more results beyond the first 10. If the 10th result has codepoint=U+0346, you can retrieve the next set of results by sending starting_after=U+0346 in a subsequent request.
The ending_before parameter also acts as a cursor to navigate between pages, but instead of requesting the next set of results it allows you to access previous pages in the list.
For example, if you previously requested 10 items beyond the first page of results, and the first result of the current page has codepoint=U+0357, you can retrieve the previous set of results by sending ending_before=U+0357 in a subsequent request.
Only one of starting_after or ending_before may be used in a request; sending a value for both parameters will produce a response with status 400 Bad Request.
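The cursor flow described above can be sketched as a small generator. This is only an illustration under stated assumptions: fetch_page is a hypothetical callable standing in for an HTTP GET to /v1/characters, and fake_fetch_page is an in-memory stand-in so the sketch runs without network access. Only the parameter and property names (limit, starting_after, data, hasMore, codepoint) come from the text above.

```python
from typing import Callable, Dict, Iterator, List


def iter_characters(
    fetch_page: Callable[[Dict[str, str]], dict],
    limit: int = 10,
) -> Iterator[dict]:
    """Yield every item from a "list" endpoint by following starting_after cursors."""
    params: Dict[str, str] = {"limit": str(limit)}
    while True:
        page = fetch_page(params)
        items: List[dict] = page["data"]
        yield from items
        if not page["hasMore"] or not items:
            break
        # For characters, the cursor is the codepoint of the last item on the page.
        params = {"limit": str(limit), "starting_after": items[-1]["codepoint"]}


# Hypothetical in-memory data standing in for the real API (five characters only).
FAKE_CHARACTERS = [{"codepoint": f"U+{cp:04X}"} for cp in range(0x41, 0x46)]


def fake_fetch_page(params: Dict[str, str]) -> dict:
    """Mimic the response shape of a list endpoint for the fake data above."""
    limit = int(params["limit"])
    start = 0
    if "starting_after" in params:
        codepoints = [c["codepoint"] for c in FAKE_CHARACTERS]
        start = codepoints.index(params["starting_after"]) + 1
    page = FAKE_CHARACTERS[start:start + limit]
    return {"data": page, "hasMore": start + limit < len(FAKE_CHARACTERS)}


all_codepoints = [c["codepoint"] for c in iter_characters(fake_fetch_page, limit=2)]
```

For Unicode Blocks the same loop would apply, with the id property used as the cursor instead of codepoint.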
The top-level API resources for Unicode Characters and Unicode Blocks also have support for retrieval via "search" API methods. These API methods (/v1/characters/search and /v1/blocks/search) share an identical structure, taking the same four parameters: name, min_score, per_page, and page.
The name parameter is the search term and is used to retrieve a character/block using the official name defined in the UCD. Since a fuzzy search algorithm is used for this process, the value of name does not need to be an exact match with a character/block name.
The response will contain a results parameter that represents the characters/blocks that matched your query. Each object in this list has a score property, which is a number ranging from 0-100 that describes how similar the character/block name is to the name value provided by the user (a value of 100 means that the name provided by the user is an exact match with a character/block name). The list contains all results where score >= min_score, sorted by score (the first element in the list being the most similar).
The default value for min_score is 80; however, if your request is returning zero results, you can lower this value to potentially surface lower-quality results. Keep in mind that the lowest permitted value for min_score is 70, since the relevance of results quickly drops off around a score of 72, often producing hundreds of results with no relevance to the search term.
The per_page parameter controls how many results are included in a single response. The response will include a hasMore parameter that indicates whether there are more search results beyond the current page, as well as currentPage and totalResults parameters. If hasMore=true, the response will also contain a nextPage parameter.
For example, if you receive a response to a search request with hasMore=true and nextPage=2, you can update your request to include page=2 to fetch the next page of results. If the next response includes hasMore=true and nextPage=3, update your request to include page=3, etc. Rinse and repeat until you receive a response with hasMore=false, indicating that you have received the final set of search results.
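The page-by-page loop described above might look like the following sketch. The search argument is a hypothetical callable standing in for a GET request to /v1/characters/search, and fake_search with its made-up names and scores exists only so the example runs offline. The response keys (results, hasMore, nextPage, currentPage, totalResults) are the ones named in the text.

```python
from typing import Callable, List


def collect_search_results(
    search: Callable[[dict], dict],
    name: str,
    min_score: int = 80,
    per_page: int = 10,
) -> List[dict]:
    """Fetch every page of a search by following nextPage until hasMore is false."""
    if not 70 <= min_score <= 100:
        raise ValueError("min_score must be between 70 and 100")
    results: List[dict] = []
    page = 1
    while True:
        response = search(
            {"name": name, "min_score": min_score, "per_page": per_page, "page": page}
        )
        results.extend(response["results"])
        if not response["hasMore"]:
            return results
        page = response["nextPage"]


# Invented matches and scores, used only to exercise the loop.
FAKE_MATCHES = [
    {"name": "LATIN SMALL LETTER ALPHA", "score": 78},
    {"name": "GREEK SMALL LETTER ALPHA", "score": 100},
    {"name": "GREEK SMALL LETTER ALPHA WITH TONOS", "score": 85},
    {"name": "GREEK CAPITAL LETTER ALPHA", "score": 92},
]


def fake_search(params: dict) -> dict:
    """Mimic the search response shape: filter by min_score, sort by score, paginate."""
    matches = sorted(
        (m for m in FAKE_MATCHES if m["score"] >= params["min_score"]),
        key=lambda m: m["score"],
        reverse=True,
    )
    start = (params["page"] - 1) * params["per_page"]
    end = start + params["per_page"]
    response = {
        "results": matches[start:end],
        "currentPage": params["page"],
        "totalResults": len(matches),
        "hasMore": end < len(matches),
    }
    if response["hasMore"]:
        response["nextPage"] = params["page"] + 1
    return response


scores = [r["score"] for r in collect_search_results(fake_search, "alpha", per_page=2)]
```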
Unicode specifies a set of rules to be used when comparing symbolic values, such as block names, known as loose matching rule UAX44-LM3. The algorithm for UAX44-LM3 is simple: Ignore case, whitespace, underscore ('_'), hyphens, and any initial prefix string "is".
This rule applies to many of the parameters that are included with API requests, which avoids returning a 400 response when a parameter value, for example, is sent as 'script' but the expected value is 'Script'. Under UAX44-LM3, both values are equivalent.
For another example, under this rule the block name "Supplemental Arrows-A" is equivalent to "supplemental_arrows__a" and "SUPPLEMENTALARROWSA" since all three of these strings would be reduced to "supplementalarrowsa" after applying UAX44-LM3. For any query or path parameter that expects the name of a Unicode block, any of these three values could be provided and would be understood to refer to block U+27F0..U+27FF SUPPLEMENTAL ARROWS-A.
Whenever the loose-matching rule applies to a parameter, it will be called out in the documentation for each individual API endpoint below.
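As a rough illustration of the rule as it is summarized above, the reduction can be written in a few lines. The function name loose_key is hypothetical, and this is a sketch of the stated rule rather than the API's actual implementation.

```python
import re


def loose_key(value: str) -> str:
    """Reduce a symbolic value as described for UAX44-LM3: drop whitespace,
    underscores and hyphens, ignore case, then drop an initial "is" prefix."""
    key = re.sub(r"[\s_\-]", "", value).lower()
    return key[2:] if key.startswith("is") else key


def loosely_equal(a: str, b: str) -> bool:
    """Two values match loosely when their reduced keys are identical."""
    return loose_key(a) == loose_key(b)
```

Because both sides of a comparison are reduced the same way, stripping the "is" prefix does not cause false mismatches even for values that merely happen to begin with those letters.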
- GET /v1/characters/-/{string}: Retrieve one or more character(s)*
- GET /v1/characters: List all characters*
- GET /v1/characters/filter: List characters that match filter settings†
- GET /v1/characters/search: Search characters†
The UnicodeCharacter object represents a single character/codepoint in the Unicode Character Database (UCD). It contains a rich set of properties that document the purpose and intended representation of the character.
If each response contained every character property, it would be massively inefficient. To ensure that the API remains responsive and performant while also allowing clients to access the full set of character properties, each property is assigned to a property group.
Since they are designed to return lists of characters, responses from the /v1/characters or /v1/characters/search endpoints will only include properties from the Minimum property group:
Minimum
- character
- A unit of information used for the organization, control, or representation of textual data.
- name
- A unique string used to identify each character encoded in the Unicode standard.
- description
(CJK Characters ONLY) - An English definition for this character. Definitions are for modern written Chinese and are usually (but not always) the same as the definition in other Chinese dialects or non-Chinese languages.
- codepoint
- In character encoding terminology, a codepoint is a numerical value that maps to a specific character. Code points usually represent a single grapheme (usually a letter, digit, punctuation mark, or whitespace) but sometimes represent symbols, control characters, or formatting. The set of all possible code points within a given encoding/character set makes up that encoding's codespace.
For example, the character encoding scheme ASCII comprises 128 code points in the range 00-7F, Extended ASCII comprises 256 code points in the range 00-FF, and Unicode comprises 1,114,112 code points in the range 0000-10FFFF. The Unicode code space is divided into seventeen planes (the Basic Multilingual Plane and 16 supplementary planes), each with 65,536 (= 2^16) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.
- uriEncoded
- The character as a URI encoded string. A URI is a string that identifies an abstract or physical resource on the internet (the specification for the URI format is defined in RFC 3986).
A URI string must contain only a defined subset of characters from the standard 128 ASCII character set; any other characters must be replaced by an escape sequence representing the UTF-8 encoding of the character.
For example, ∑ (U+2211 N-ARY SUMMATION) in UTF-8 encoding is 0xE2 0x88 0x91. To include this character in a URI, each UTF-8 byte is prefixed with the % character to produce the URI-encoded string: %E2%88%91.
Including show_props=Minimum in any request is redundant since the Minimum property group is included in all responses.
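The codepoint and uriEncoded descriptions above can be checked with a short sketch using only the Python standard library. The helper names are hypothetical, and whether the API's uriEncoded value matches urllib's output for every character (for example, unreserved ASCII characters) is an assumption.

```python
from urllib.parse import quote

PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17       # the BMP plus 16 supplementary planes
CODESPACE_SIZE = PLANE_COUNT * PLANE_SIZE


def format_codepoint(char: str) -> str:
    """Format a single character in the U+XXXX notation used throughout this page."""
    return f"U+{ord(char):04X}"


def uri_encode(char: str) -> str:
    """Percent-encode each UTF-8 byte of the character, as RFC 3986 describes."""
    return quote(char, safe="")
```

For ∑ this reproduces the values given above: the code point U+2211 and the escape sequence %E2%88%91.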
If you wish to explore the properties of one or more specific characters, the /v1/characters/-/{string} and /v1/characters/filter endpoints accept one or more show_props parameters that allow you to specify additional property groups to include in the response.
For example, you could view the properties from groups UTF-8, Numeric, and Script for the character Ⱒ (U+2C22 GLAGOLITIC CAPITAL LETTER SPIDERY HA), which is equal to 0xE2 0xB0 0xA2 in UTF-8 encoding, by submitting the following request: /v1/characters/-/%E2%B0%A2?show_props=UTF8&show_props=Numeric&show_props=Script.
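A request path like the one above, with a repeated show_props parameter, can be assembled with the standard library. The function name character_request_path is hypothetical, and the host name is deliberately left out because it is not given here.

```python
from typing import Iterable
from urllib.parse import quote, urlencode


def character_request_path(string: str, prop_groups: Iterable[str] = ()) -> str:
    """Build the path and query string for /v1/characters/-/{string}."""
    path = "/v1/characters/-/" + quote(string, safe="")
    # A list of pairs lets the same key (show_props) appear more than once.
    query = urlencode([("show_props", group) for group in prop_groups])
    return f"{path}?{query}" if query else path


example_path = character_request_path("Ⱒ", ["UTF8", "Numeric", "Script"])
```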
The values of many of the properties that are defined for each character are only meaningful for specific blocks or a small subset of codepoints (e.g., the hangul_syllable_type property will have an NA (Not Applicable) value for all codepoints except those in the four blocks that contain characters from the Hangul writing system).
By default, the hangul_syllable_type property will NOT be included with the response for any character that has this default value, even if the user has submitted a request containing show_props=hangul or show_props=all. For actual Hangul characters, the property will be included in the response.
These properties are removed to make the size of each response as small as possible. Knowing that the 🇺 (U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U) character has the value hangul_syllable_type=NA provides no real information about this character.
However, if you wish to see every property value, include verbose=true with your request to the /v1/characters/-/{string} or /v1/characters/filter endpoints.
Basic
- block
- Name of the block to which the character belongs. Each block is a uniquely named, continuous, non-overlapping range of code points, containing a multiple of 16 code points, and starting at a location that is a multiple of 16. A block may contain unassigned code points, which are reserved.
- plane
- A range of 65,536 (0x10000) contiguous Unicode code points, where the first code point is an integer multiple of 65,536 (0x10000). Planes are numbered from 0 to 16, with the number being the first code point of the plane divided by 65,536. Thus Plane 0 is U+0000...U+FFFF, Plane 1 is U+10000...U+1FFFF, ..., and Plane 16 (0x10) is U+100000...U+10FFFF.
The vast majority of commonly used characters are located in Plane 0, which is called the Basic Multilingual Plane (BMP). Planes 1-16 are collectively referred to as supplementary planes.
- age
- The version of Unicode in which the character was assigned to a codepoint, such as "1.1" or "4.0".
- generalCategory
- The General Category that this character belongs to (e.g., letters, numbers, punctuation, symbols, etc.). The full list of values which are valid for this property is defined in Unicode Standard Annex #44.
- combiningClass
- Specifies, with a numeric code, which sequences of combining marks are to be considered canonically equivalent and which are not. This is used in the Canonical Ordering Algorithm and in normalization. For more info, please see Unicode Standard Section 4.3.
- htmlEntities
- A string beginning with an ampersand (&) character and ending with a semicolon (;). Entities are used to display reserved characters (e.g., '<' in an HTML document) or invisible characters (e.g., non-breaking spaces). For more info, please see the MDN entry for HTML Entities.
- ideoFrequency
(CJK Characters ONLY) - A rough frequency measurement for the character based on analysis of traditional Chinese USENET postings; characters with a kFrequency of 1 are the most common, those with a kFrequency of 2 are less common, and so on, through a kFrequency of 5.
- ideoGradeLevel
(CJK Characters ONLY) - The primary grade in the Hong Kong school system by which a student is expected to know the character; this data is derived from 朗文初級中文詞典, Hong Kong: Longman, 2001.
- rsCountUnicode
(CJK Characters ONLY) - The standard radical-stroke count for this character in the form “radical.additional strokes”. The radical is indicated by a number in the range (1..214) inclusive. An apostrophe (') after the radical indicates a simplified version of the given radical. The “additional strokes” value is the residual stroke-count, the count of all strokes remaining after eliminating all strokes associated with the radical.
This field is also used for additional radical-stroke indices where either a character may be reasonably classified under more than one radical, or alternate stroke count algorithms may provide different stroke counts.
The residual stroke count may be negative. This is because some characters (for example, U+225A9, U+29C0A) are constructed by removing strokes from a standard radical.
- rsCountKangxi
(CJK Characters ONLY) - The Kangxi radical-stroke count for this character consistent with the value of the character in the《康熙字典》Kangxi Dictionary in the form “radical.additional strokes”.
- totalStrokes
(CJK Characters ONLY) - The total number of strokes in the character (including the radical). When there are two values, then the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). When there is only one value, it is appropriate for both.
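The arithmetic behind the plane and block definitions above is simple enough to sketch directly. The function names below are hypothetical helpers written for illustration and are not part of the API.

```python
from typing import Tuple

PLANE_SIZE = 0x10000  # 65,536 code points per plane


def plane_number(codepoint: int) -> int:
    """Plane number = code point divided by 65,536 (integer division)."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("code point outside the Unicode code space")
    return codepoint // PLANE_SIZE


def plane_range(plane: int) -> Tuple[int, int]:
    """First and last code point of the given plane (0 through 16)."""
    if not 0 <= plane <= 16:
        raise ValueError("plane must be in the range 0..16")
    first = plane * PLANE_SIZE
    return first, first + PLANE_SIZE - 1


def is_well_formed_block(first: int, last: int) -> bool:
    """Check the block constraints: starts on a multiple of 16, size a multiple of 16."""
    return first % 16 == 0 and (last - first + 1) % 16 == 0
```

For instance, U+1F1FA falls in Plane 1, and the block U+27F0..U+27FF mentioned earlier satisfies both block constraints.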
UTF-8
- utf8
- The UTF-8 encoded value for the character as a hex string.
- utf8HexBytes
- The byte sequence for the UTF-8 encoded value for the character. This property returns a list of strings, hex values (base-16) in range 00-FF.
- utf8DecBytes
- The byte sequence for the UTF-8 encoded value for the character. This property returns a list of integers, decimal values (base-10) in range 0-255.
UTF-16
- utf16
- The UTF-16 encoded value for the character as a hex string.
- utf16HexBytes
- The byte sequence for the UTF-16 encoded value for the character. This property returns a list of strings, hex values (base-16) in range 0000-FFFF.
- utf16DecBytes
- The byte sequence for the UTF-16 encoded value for the character. This property returns a list of integers, decimal values (base-10) in range 0-65,535.
UTF-32
- utf32
- The UTF-32 encoded value for the character as a hex string.
- utf32HexBytes
- The byte sequence for the UTF-32 encoded value for the character. This property returns a list of strings, hex values (base-16) in range 00000000-0010FFFF.
- utf32DecBytes
- The byte sequence for the UTF-32 encoded value for the character. This property returns a list of integers, decimal values (base-10) in range 0-1,114,111.
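To get a feel for the three encoding groups, the values can be derived locally with Python's built-in codecs. This sketch assumes the UTF-16 and UTF-32 lists hold one entry per code unit (which is what the stated ranges suggest); the helper names and the exact string formatting are assumptions, not the API's guaranteed output.

```python
from typing import List


def utf8_bytes(char: str) -> List[int]:
    """UTF-8 byte values (each 0-255)."""
    return list(char.encode("utf-8"))


def utf16_code_units(char: str) -> List[int]:
    """UTF-16 code units (each 0-65,535); supplementary characters yield two."""
    data = char.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def utf32_code_units(char: str) -> List[int]:
    """UTF-32 code units; always exactly one per character."""
    data = char.encode("utf-32-be")
    return [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]


def as_hex(values: List[int], width: int) -> List[str]:
    """Render integers as zero-padded uppercase hex strings."""
    return [f"{value:0{width}X}" for value in values]
```

A character outside the BMP, such as U+1F1FA, makes the difference visible: four UTF-8 bytes, a two-unit UTF-16 surrogate pair, and a single UTF-32 unit.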
Bidirectionality
- bidirectionalClass
- A value assigned to each Unicode character based on the appropriate directional formatting style. For the property values, see Bidirectional Class Values.
- bidirectionalIsMirrored
- A normative property of characters such as parentheses, whose images are mirrored horizontally in text that is laid out from right to left. For example, U+0028 LEFT PARENTHESIS is interpreted as an opening parenthesis; in a left-to-right context it will appear as “(”, while in a right-to-left context it will appear as the mirrored glyph “)”. This requirement is necessary to render the character properly in a bidirectional context.
- bidirectionalMirroringGlyph
- A character that can be used to supply a mirrored glyph for the requested character. For example, "(" (U+0028 LEFT PARENTHESIS) mirrors ")" (U+0029 RIGHT PARENTHESIS) and vice versa.
- bidirectionalControl
- Boolean value that indicates whether the character is one of 12 format control characters which have specific functions in the Unicode Bidirectional Algorithm:
U+200E LEFT-TO-RIGHT MARK
U+200F RIGHT-TO-LEFT MARK
U+202A LEFT-TO-RIGHT EMBEDDING
U+202B RIGHT-TO-LEFT EMBEDDING
U+202C POP DIRECTIONAL FORMATTING
U+202D LEFT-TO-RIGHT OVERRIDE
U+202E RIGHT-TO-LEFT OVERRIDE
U+2066 LEFT-TO-RIGHT ISOLATE
U+2067 RIGHT-TO-LEFT ISOLATE
U+2068 FIRST STRONG ISOLATE
U+2069 POP DIRECTIONAL ISOLATE
U+061C ARABIC LETTER MARK
- pairedBracketType
- Type of a paired bracket, either opening, closing or none (the default value). This property is used in the implementation of parenthesis matching.
- pairedBracketProperty
- For an opening bracket, the code point of the matching closing bracket. For a closing bracket, the code point of the matching opening bracket.
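Several of the properties in this group can be approximated locally with Python's unicodedata module, which may be convenient for spot-checking responses. Note that unicodedata reflects whichever Unicode version ships with your Python interpreter, which may differ from 15.1.0; the bidi_summary helper and its dictionary keys are illustrative only.

```python
import unicodedata

# The 12 bidirectional format control characters listed above.
BIDI_CONTROLS = frozenset(
    "\u200E\u200F\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069\u061C"
)


def bidi_summary(char: str) -> dict:
    """Collect a few bidirectional properties for a single character."""
    return {
        "bidirectionalClass": unicodedata.bidirectional(char),
        "bidirectionalIsMirrored": bool(unicodedata.mirrored(char)),
        "bidirectionalControl": char in BIDI_CONTROLS,
    }
```

The standard library does not expose the mirroring glyph or paired-bracket data, so those properties still have to come from the API.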
Decomposition
- decompositionType
- The type of the decomposition (canonical or compatibility). The possible values are listed below:
none: None
can: Canonical
com: Otherwise Unspecified Compatibility Character
enc: Encircled Form
fin: Final Presentation Form (Arabic)
font: Font Variant
fra: Vulgar Fraction Form
init: Initial Presentation Form (Arabic)
iso: Isolated Presentation Form (Arabic)
med: Medial Presentation Form (Arabic)
nar: Narrow (or Hankaku) Compatibility Character
nb: No-break Version Of A Space Or Hyphen
sml: Small Variant Form (CNS Compatibility)
sqr: CJK Squared Font Variant
sub: Subscript Form
sup: Superscript Form
vert: Vertical Layout Presentation Form
wide: Wide (or Zenkaku) Compatibility Character
Quick Check
Unicode, being a unifying character set, contains characters that allow similar results to be expressed in different ways. Because similar text can be written in different ways, two questions arise: how can we determine if two strings are equal, and how can we find a substring in a string?
The answer is to convert the string to a well-known form, a process known as normalization. Unicode normalization is a set of rules based on tables and algorithms. It defines two kinds of normalization equivalence: canonical and compatible.
Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, "Å" (U+212B ANGSTROM SIGN) is canonically equivalent to BOTH "Å" (U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE) and "A" (U+0041 LATIN CAPITAL LETTER A) + "◌̊" (U+030A COMBINING RING ABOVE).
Code point sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. An example of this could be representations of the decimal digit 6: "Ⅵ" (U+2165 ROMAN NUMERAL SIX) and "⑥" (U+2465 CIRCLED DIGIT SIX). In one particular sense they are the same, but there are many other qualities that are different between them.
Compatible equivalence is a superset of canonical equivalence. In other words each canonical mapping is also a compatible one, but not the other way around.
Composition is the process of combining marks with base letters (multiple code points are replaced by single points whenever possible). Decomposition is the process of taking already composed characters apart (single code points are split into multiple ones). Both processes are recursive.
An additional difficulty is that the normalized ordering of multiple consecutive combining marks must be defined. This is done using a concept called the Canonical Combining Class or CCC, a Unicode character property (available as the combiningClass property in the Basic property group).
When you take all of these concepts into consideration, four normalization forms are defined:
NFD: Canonical decomposition and ordering
NFC: Composition after canonical decomposition and ordering
NFKD: Compatible decomposition and ordering
NFKC: Composition after compatible decomposition and ordering
In an effort to make the process of normalizing/determining if a string is already normalized less tedious and complex, four “quick check” properties exist for each character (NFD_QC, NFC_QC, NFKD_QC, and NFKC_QC, one for each normalization form).
These properties allow implementations to quickly determine whether a string is in a particular Normalization Form. This is, in general, many times faster than normalizing and then comparing.
- NFD_QC
- NFD_QC stands for Normalization Form D Quick Check. This property is used to quickly check if a character is already in NFD form, and thus does not need to be further normalized.
- NFC_QC
- NFC_QC stands for Normalization Form C Quick Check. This property is used to quickly check if a character is already in NFC form, and thus does not need to be further normalized.
- NFKD_QC
- NFKD_QC stands for Normalization Form KD Quick Check. This property is used to quickly check if a character is already in NFKD form, and thus does not need to be further normalized.
- NFKC_QC
- NFKC_QC stands for Normalization Form KC Quick Check. This property is used to quickly check if a character is already in NFKC form, and thus does not need to be further normalized.
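The normalization concepts above can be tried out with Python's unicodedata module. This is a local sketch, not the API's implementation: the two helper functions are hypothetical, and unicodedata.is_normalized (available in Python 3.8 and later) is the standard library's own normalization check, which is documented as being faster than normalizing and comparing for most strings.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a: str, b: str) -> bool:
    """Two strings are compatibility equivalent if their NFKD forms are identical."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


ANGSTROM_SIGN = "\u212B"
A_WITH_RING = "\u00C5"
A_PLUS_COMBINING_RING = "A\u030A"

# The precomposed letter is already in NFC; the decomposed sequence is not.
nfc_checks = (
    unicodedata.is_normalized("NFC", A_WITH_RING),
    unicodedata.is_normalized("NFC", A_PLUS_COMBINING_RING),
)
```

Comparing normalized forms, rather than raw strings, is what makes equality and substring search behave as users expect.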
Numeric
- numericType
- If a character is normally used as a number, it will be assigned a value other than None, which is the default value used for all non-number characters:
None: None
De: Decimal
Di: Digit
Nu: Numeric
- numericValue
- If the character has the property value numericType=Decimal, then the numericValue of that digit is represented with an integer value (limited to the range 0..9).
If the character has the property value numericType=Digit, then the numericValue of that digit is represented with an integer value (limited to the range 0..9). This covers digits that need special handling, such as the compatibility superscript digits. Starting with Unicode 6.3.0, no newly encoded numeric characters will be given numericType=Digit, nor will existing characters with numericType=Decimal be changed to numericType=Digit. The distinction between those two types is not considered useful.
If the character has the property value numericType=Numeric, then the numericValue of that character is represented with a positive or negative integer or rational number. This includes fractions such as, for example, "1/5" for ⅕ (U+2155 VULGAR FRACTION ONE FIFTH).
- numericValueParsed
- This is NOT a property from the Unicode Standard. This is a floating point version of the numericValue property (which is a string value). For example, 0.2 for ⅕ (U+2155 VULGAR FRACTION ONE FIFTH).
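The relationship between numericValue (a string such as "1/5") and numericValueParsed (a float) can be sketched as follows. How the API actually performs the conversion is not documented here, so parse_numeric_value is a hypothetical helper built on the standard library's Fraction type.

```python
import unicodedata
from fractions import Fraction


def parse_numeric_value(numeric_value: str) -> float:
    """Convert an integer or rational string such as "7", "-1/2" or "1/5" to a float."""
    return float(Fraction(numeric_value))


one_fifth = parse_numeric_value("1/5")
```

The result agrees with Python's own unicodedata.numeric lookup for ⅕ (U+2155).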
Joining
- joiningType
- Each Arabic letter must be depicted by one of a number of possible contextual glyph forms. The appropriate form is determined on the basis of the cursive joining behavior of that character as it interacts with the cursive joining behavior of adjacent characters. In the Unicode Standard, such cursive joining behavior is formally described in terms of values of a character property called joiningType. Each Arabic character falls into one of the types listed below:
R: Right Joining
L: Left Joining
D: Dual Joining
C: Join Causing
U: Non Joining
T: Transparent
Note that for cursive joining scripts which are typically rendered top-to-bottom, rather than right-to-left, joiningType=L conventionally refers to bottom joining, and joiningType=R conventionally refers to top joining.
- joiningGroup
- The group of characters that the character belongs to in cursive joining behavior. For Arabic and Syriac characters.
- joiningControl
- Boolean value that indicates whether the character has specific functions for control of cursive joining and ligation.
Linebreak
- lineBreak
- Line-breaking class of the character. Affects whether a line break must, may, or must not appear before or after the character. The possible values are listed below:
AL: Ordinary Alphabetic And Symbol
AI: Ambiguous (Alphabetic Or Ideographic)
BA: Break Opportunity After
B2: Break Opportunity Before And After
BK: Mandatory Break
BB: Break Opportunity Before
CL: Closing Punctuation
CB: Contingent Break Opportunity
CR: Carriage Return
CM: Attached Characters And Combining Marks
GL: Non-breaking ("Glue")
EX: Exclamation/Interrogation
H3: Hangul LVT Syllable
H2: Hangul LV Syllable
ID: Ideographic
HY: Hyphen
IS: Infix Separator
IN: Inseparable
JT: Hangul T Jamo
JL: Hangul L Jamo
LF: Line Feed
JV: Hangul V Jamo
NS: Non Starter
NL: Next Line
OP: Opening Punctuation
NU: Numeric
PR: Prefix (Numeric)
PO: Postfix (Numeric)
SA: Complex Context (South East Asian)
QU: Ambiguous Quotation
SP: Space
SG: Surrogates
WJ: Word Joiner
SY: Symbols Allowing Breaks
ZW: Zero Width Space
XX: Unknown
East Asian Width
- eastAsianWidth
- The width of the character, in terms of East Asian writing systems that distinguish between full width, half width, and narrow. The possible values are listed in Unicode Standard Annex #11:
A: East Asian Ambiguous
F: East Asian Fullwidth
H: East Asian Halfwidth
N: Neutral (Not East Asian)
Na: East Asian Narrow
W: East Asian Wide
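A common use of this property is estimating how many terminal columns a string occupies. The sketch below uses Python's unicodedata.east_asian_width, which returns the same short codes as the list above (subject to the Unicode version bundled with your interpreter). Treating Ambiguous characters as one column wide is an assumption that suits non-East-Asian contexts, and display_width is a hypothetical helper.

```python
import unicodedata

WIDE_CODES = {"F", "W"}  # Fullwidth and Wide characters occupy two columns


def display_width(text: str) -> int:
    """Estimate column width: 2 for F/W characters, 1 for everything else.

    Combining marks and control characters are not special-cased here.
    """
    return sum(2 if unicodedata.east_asian_width(ch) in WIDE_CODES else 1 for ch in text)
```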
Case
- uppercase
- Boolean value that indicates whether the character is an uppercase letter.
- lowercase
- Boolean value that indicates whether the character is a lowercase letter.
- simpleUppercaseMapping
- The uppercase form of the character, if expressible as a single character.
- simpleLowercaseMapping
- The lowercase form of the character, if expressible as a single character.
- simpleTitlecaseMapping
- The titlecase form of the character, if expressible as a single character.
- simpleCaseFolding
- The case-folded (lowercase) form of the character when applying simple folding, which does not change the length of a string (and may thus fail to fold some characters correctly).
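The phrase "if expressible as a single character" matters because some case conversions change the length of a string. Python's str methods apply the full (possibly multi-character) mappings rather than the simple ones described above, so the sketch below can only show where the two diverge. The helper name is hypothetical.

```python
def full_mapping_changes_length(char: str) -> bool:
    """True when full uppercasing or case folding maps one character to several,
    which is exactly the situation a simple (one-to-one) mapping cannot express."""
    return len(char.upper()) != 1 or len(char.casefold()) != 1


# "ß" uppercases to "SS" and folds to "ss" under the full mappings.
sharp_s_upper = "ß".upper()
sharp_s_folded = "ß".casefold()
```

For characters where this helper returns True, the simple mapping properties returned by the API would be expected to differ from what str.upper() or str.casefold() produce.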
Script
- script
- The script (writing system) to which the character primarily belongs, such as "Latin," "Greek," or "Common," which indicates a character that is used in different scripts.
- scriptExtensions
- Further refines the script category of a character by providing additional information about the character's usage and context. This property allows for more specific categorization of characters that may have multiple uses or are used in multiple scripts.
The script extensions property can also be used to indicate characters that are used in multiple scripts, such as characters that are used in both Latin and Cyrillic scripts.
Hangul
- hangulSyllableType
- Type of syllable, for characters that are Hangul (Korean) syllabic characters. Possible values:
NA: Not Applicable
L: Leading Jamo
V: Vowel Jamo
T: Trailing Jamo
LV: LV Syllable
LVT: LVT Syllable
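For reference, the syllable type can be derived from the code point alone. The sketch below is written for illustration and is not taken from the API: the jamo ranges are those listed in the Unicode Character Database file HangulSyllableType.txt, and the LV/LVT split relies on the fact that each precomposed syllable block of 28 code points starts with the syllable that has no trailing consonant.

```python
SYLLABLE_FIRST, SYLLABLE_LAST = 0xAC00, 0xD7A3  # precomposed Hangul syllables
TRAILING_COUNT = 28  # 27 trailing consonants plus "no trailing consonant"

# (first, last, type) ranges for conjoining jamo.
JAMO_RANGES = [
    (0x1100, 0x115F, "L"), (0xA960, 0xA97C, "L"),
    (0x1160, 0x11A7, "V"), (0xD7B0, 0xD7C6, "V"),
    (0x11A8, 0x11FF, "T"), (0xD7CB, 0xD7FB, "T"),
]


def hangul_syllable_type(char: str) -> str:
    """Return L, V, T, LV, LVT, or NA for a single character."""
    cp = ord(char)
    if SYLLABLE_FIRST <= cp <= SYLLABLE_LAST:
        return "LV" if (cp - SYLLABLE_FIRST) % TRAILING_COUNT == 0 else "LVT"
    for first, last, syllable_type in JAMO_RANGES:
        if first <= cp <= last:
            return syllable_type
    return "NA"
```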
Indic
- indicSyllabicCategory
- Used to identify the type of syllable that a character belongs to, such as a vowel, consonant, or a combination of both.
- indicMatraCategory
- Used to identify the type of matra (vowel sign) associated with a character, such as a short or long vowel sign.
- indicPositionalCategory
- Used to identify the position of a character in a syllable, such as the initial, medial, or final position.
CJK Variants
Although Unicode encodes characters and not glyphs, the line between the two can sometimes be hard to draw, particularly in East Asia. There, thousands of years' worth of writing have produced thousands of pairs which can be used more-or-less interchangeably.
To deal with this situation, the Unicode Standard has adopted a three-dimensional model for determining the relationship between ideographs, and has formal rules for when two forms may be unified. Both are described in some detail in the Unicode Standard. Briefly, however, the three-dimensional model uses the x-axis to represent meaning, and the y-axis to represent abstract shape. The z-axis is used for stylistic variations.
The traditionalVariant and simplifiedVariant fields are used in character-by-character conversions between simplified and traditional Chinese (SC and TC, respectively).
Two variation fields, semanticVariant and specializedSemanticVariant, are used to mark cases where two characters have identical and overlapping meanings, respectively.
The spoofingVariant field is used to denote a special class of variant, a spoofing variant. Spoofing variants are potentially used in bad faith to direct users to unexpected URLs, evade email filters, or otherwise deceive end-users.
For more information on CJK variants, please see UAX #38, Section 3.7.
- traditionalVariant
- The Unicode value(s) for the traditional Chinese variant(s) for this character.
- simplifiedVariant
- The Unicode value(s) for the simplified Chinese variant(s) for this character.
- zVariant
- The z-variants for the character, if any. Z-variants are instances where the same abstract shape has been encoded multiple times, either in error or because of source separation. Z-variant pairs also have identical semantics.
- compatibilityVariant
- The canonical Decomposition_Mapping value for the ideograph.
- semanticVariant
- The Unicode value for a semantic variant for this character. A semantic variant is an x- or y-variant with similar or identical meaning which can generally be used in place of the indicated character.
- specializedSemanticVariant
- The Unicode value for a specialized semantic variant for this character. The syntax is the same as for the kSemanticVariant field. A specialized semantic variant is an x- or y-variant with similar or identical meaning only in certain contexts.
- spoofingVariant
- The spoofing variants for the character, if any. Spoofing variants include character pairs which look similar, particularly at small point sizes, which are not already z-variants or compatibility variants.
CJK Numeric
There are three fields, accountingNumeric, otherNumeric, and primaryNumeric, to indicate the numerical values an ideograph may have. Traditionally, ideographs were used both for numbers and words, and so many ideographs have (or can have) numeric values. The various kinds of numeric values are specified by these three fields.
The three numeric-value fields should have no overlap; that is, characters with an accountingNumeric value should not have an otherNumeric or primaryNumeric value as well.
- accountingNumeric
- The value of the character when used as an accounting numeral to prevent fraud. A numeral such as 十 (ten) is easily transformed into 千 (thousand) by adding a single stroke, so monetary documents often use an accounting form of the numeral, such as 拾 (ten), instead of the more common (and simpler) form.
Characters with this property will have a single, well-defined value, which a native reader can reasonably be expected to understand.
- primaryNumeric
- The value of the character when used as a numeral. Characters which have this property have numeric values that are common, and always convey the same numeric value.
For example, 千 always means “thousand.” A native reader is expected to understand the numeric value for these characters.
- otherNumeric
- One or more values of the character when used as a numeral. Characters with this property are rarely used for writing numbers, or have non-standard or multiple values depending on the region.
For example, 㠪 is a rare character whose meaning, “five,” would not be recognized by most native readers. An English-language equivalent is “gross,” whose numeric value, “one hundred forty-four,” is not universally understood by native readers.
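For a quick local check of ideograph numeric values, Python's unicodedata.numeric can be used. Unlike the API, it returns a single merged value and does not say which of the three fields the value came from, and it reflects the Unicode version bundled with the interpreter. The ideograph_value helper is hypothetical.

```python
import unicodedata
from typing import Optional


def ideograph_value(char: str) -> Optional[float]:
    """Return the numeric value of a character, or None if it has none."""
    return unicodedata.numeric(char, None)


# 十 (ten) and its accounting form 拾 share a value; 千 is "thousand".
ten, accounting_ten, thousand = (ideograph_value(c) for c in "十拾千")
```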
CJK Readings
The properties in this group include the pronunciations for a given character in Mandarin, Cantonese, Japanese, Sino-Japanese, Korean, and Vietnamese.
Any attempt at providing a reading or set of readings for a character is bound to be fraught with difficulty, because the readings will vary over time and from place to place, even within a language. Mandarin is the official language of both the PRC and Taiwan (with some differences between the two) and is the primary language over much of northern and central China, with vast differences from place to place. Even Cantonese, the modern language covered by the Unihan database with the least geographical range, is spoken throughout Guangdong Province and in much of neighboring Guangxi Zhuang Autonomous Region, and covers four large urban centers (Guangzhou, Shenzhen, Macao, and Hong Kong). There are therefore distinct regional variations in pronunciation and vocabulary.
Indeed, even the same speaker will pronounce the same word differently depending on the speaker or even the social context. This is particularly true for languages such as Cantonese, where there has been comparatively little government effort to standardize the language.
Add to this the fact that in none of these languages—the various forms of Chinese, Japanese, Korean, Vietnamese—is the syllable the fundamental unit of the language. As in the West, it’s the word, and the pronunciation of a character is tied to the word of which it is a part. In Chinese (followed by Vietnamese and Korean), the rule is one ideograph/one syllable, with most words written using multiple ideographs. In most cases, an ideograph has only one reading (or only one important reading), but there are numerous exceptions.
In Japanese, the situation is enormously more complex. Japanese has two pronunciation systems, one derived from Chinese (the on pronunciation, or Sino-Japanese), and the other from Japanese (the kun pronunciation).
The on readings derive from Chinese loan-words. They depend on factors such as when (and from which part of China) the loan-word was borrowed, and changes to Japanese since then. On readings can therefore have little obvious relationship to modern Chinese readings, and the same Chinese reading for a given kanji can be reflected in multiple on readings in Japanese. Contrary to Chinese practice, on readings may be polysyllabic.
Kun readings, on the other hand, derive from native Japanese words for which either existing kanji were adopted or new kanji coined.
The net result is that multiple readings are the rule for Japanese kanji. These multiple readings may bear no relationship to one another and are highly context-sensitive. Even a native Japanese reader may not know the correct pronunciation of a proper noun if it is written only in kanji.
- mandarin
- The most customary pīnyīn reading for this character. When there are two values, then the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). When there is only one value, it is appropriate for both.
- cantonese
- The most customary jyutping (Cantonese) reading for this character.
- japaneseKun
- The Japanese pronunciation(s) of this character in the Hepburn romanization.
- japaneseOn
- The Sino-Japanese pronunciation(s) of this character.
- hangul
- The modern Korean pronunciation(s) for this character in Hangul.
- vietnamese
- The character's pronunciation(s) in Quốc ngữ.
Function and Graphic
- dash
- Boolean value that indicates whether the character is classified as a dash. This includes characters explicitly designated as dashes and their compatibility equivalents.
- hyphen
- Boolean value that indicates whether the character is regarded as a hyphen. This refers to those dashes that are used to mark connections between parts of a word and to the Katakana middle dot.
- quotationMark
- Boolean value that indicates whether the character is used as a quotation mark in some language(s).
- terminalPunctuation
- Boolean value that indicates whether the character is a punctuation mark that generally marks the end of a textual unit.
- sentenceTerminal
- Boolean value that indicates whether the character is used to terminate a sentence.
- diacritic
- Boolean value that indicates whether the character is a diacritic, i.e., whether it linguistically modifies another character to which it applies. A diacritic is usually, but not necessarily, a combining character.
- extender
- Boolean value that indicates whether the principal function of the character is to extend the value or shape of a preceding alphabetic character.
- softDotted
- Boolean value that indicates whether the character contains a dot that disappears when a diacritic is placed above the character (e.g., "i" and "j" are soft dotted).
- alphabetic
- Boolean value that indicates whether the character is alphabetic, i.e., a letter or comparable to a letter in usage. True for characters with generalCategory value of Lu, Ll, Lt, Lm, Lo, or Nl and additionally for characters with the otherAlphabetic property.
- math
- Boolean value that indicates whether the character is mathematical. This includes characters with Sm (Symbol, math) as the General Category value, and some other characters.
- hexDigit
- Boolean value that indicates whether the character is used in hexadecimal numbers. This is true for ASCII hexadecimal digits and their fullwidth versions.
- asciiHexDigit
- Boolean value that indicates whether the character is an ASCII character used to represent hexadecimal numbers (i.e., letters A-F, a-f and digits 0-9).
- defaultIgnorableCodePoint
- Boolean value that indicates whether the code point should be ignored in automatic processing by default.
- logicalOrderException
- Boolean value that indicates whether the character belongs to the small set of characters that do not use logical order and hence require special handling in most processing.
- prependedConcatenationMark
- Boolean value that indicates whether the character belongs to a small class of visible format controls, which precede and then span a sequence of other characters, usually digits. These have also been known as "subtending marks", because most of them take a form which visually extends underneath the sequence of following digits.
- whiteSpace
- Boolean value that indicates whether the character should be treated by programming languages as a whitespace character when parsing elements. This concept does not match the more restricted whitespace concept in many programming languages, but it is a generalization of that concept to the "Unicode world."
- verticalOrientation
- A property used to establish a default for the correct orientation of characters when used in vertical text layout, as described in Unicode Standard Annex #50, "Unicode Vertical Text Layout".
- regionalIndicator
-
The regional indicator symbols are a set of 26 alphabetic Unicode characters (A–Z) intended to be used to encode ISO 3166-1 alpha-2 two-letter country codes in a way that allows optional special treatment.
They are encoded in the range 🇦 (U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A) to 🇿 (U+1F1FF REGIONAL INDICATOR SYMBOL LETTER Z) within the Enclosed Alphanumeric Supplement block in the Supplementary Multilingual Plane. These were defined as an alternative to encoding separate characters for each country flag. Although they can be displayed as Roman letters, it is intended that implementations may choose to display them in other ways, such as by using national flags.
For example, since the ISO 3166-1 alpha-2 country code for Ukraine is UA, when the characters 🇺 (U+1F1FA) and 🇦 (U+1F1E6) are placed next to each other the Ukrainian flag should be rendered: 🇺🇦.
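The mapping from a country code to a pair of regional indicator symbols is plain offset arithmetic from U+1F1E6. The following is a minimal sketch of that mapping; the function name `country_code_to_flag` is hypothetical and is not part of this API.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A


def country_code_to_flag(alpha2: str) -> str:
    """Map a two-letter ISO 3166-1 alpha-2 code to two regional indicator symbols."""
    if len(alpha2) != 2 or not alpha2.isascii() or not alpha2.isalpha():
        raise ValueError(f"expected two ASCII letters, got {alpha2!r}")
    # Each letter A-Z sits at the same offset from U+1F1E6 as it does from "A".
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(ch) - ord("A")) for ch in alpha2.upper()
    )


flag = country_code_to_flag("UA")
print([f"U+{ord(ch):04X}" for ch in flag])  # ['U+1F1FA', 'U+1F1E6']
```

Whether the resulting pair is drawn as a flag or as two letters depends on the rendering implementation, as noted above.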
Emoji
- emoji
- Boolean value that indicates whether the character is recommended for use as emoji.
- emojiPresentation
- Boolean value that indicates whether the character has emoji presentation by default.
- emojiModifier
- Boolean value that indicates whether the character is used as an emoji modifier. Currently this includes only the skin tone modifier characters.
- emojiModifierBase
- Boolean value that indicates whether the character can serve as a base for emoji modifiers.
- emojiComponent
- Boolean value that indicates whether the character is used in emoji sequences but normally does not appear on emoji keyboards as a separate choice (e.g., keycap base characters or Regional_Indicator characters).
- extendedPictographic
- Boolean value that indicates whether the character is a pictographic symbol or otherwise similar in kind to characters with the Emoji property. This enables segmentation rules involving emoji to be specified stably, even in cases where an existing non-emoji pictographic symbol later comes to be treated as an emoji.
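As an illustration of the emojiModifier property, the skin tone modifiers occupy the range U+1F3FB to U+1F3FF, and a modified emoji is formed by placing the modifier directly after the base character. The sketch below assumes the caller already knows that the base character has emojiModifierBase set (that check would need data from the API); the helper names are hypothetical.

```python
SKIN_TONE_FIRST = 0x1F3FB  # EMOJI MODIFIER FITZPATRICK TYPE-1-2
SKIN_TONE_LAST = 0x1F3FF   # EMOJI MODIFIER FITZPATRICK TYPE-6


def is_emoji_modifier(ch: str) -> bool:
    """Return True if ch is one of the five skin tone modifier characters."""
    return len(ch) == 1 and SKIN_TONE_FIRST <= ord(ch) <= SKIN_TONE_LAST


def apply_modifier(base: str, modifier: str) -> str:
    """Append a skin tone modifier to a base emoji.

    Assumption: the caller has verified that `base` is a valid modifier base.
    """
    if not is_emoji_modifier(modifier):
        raise ValueError(f"{modifier!r} is not an emoji modifier")
    return base + modifier


waving_hand = "\U0001F44B"        # U+1F44B WAVING HAND SIGN
medium_skin_tone = "\U0001F3FD"   # U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4
print(apply_modifier(waving_hand, medium_skin_tone))
```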
The UnicodeCodepoint resource is not an object like the other resources; it is simply a hexadecimal value that refers to a single character in the Unicode codespace.
The /v1/codepoints/{hex} endpoint performs nearly the same function as the /v1/characters/-/{string} endpoint. However, sending a request for a character to the /v1/characters/-/{string} endpoint requires you to provide either the character itself or the URI-encoded string representation of the character.
Since there are plenty of scenarios where it may be easier to supply the assigned codepoint for a character rather than the rendered glyph or URI-encoded value, the /v1/codepoints/{hex} endpoint allows you to request the same sets of character property groups as the /v1/characters/-/{string} endpoint.
The only difference between the two endpoints is that requests to the /v1/characters/-/{string} endpoint can retrieve data for one or more characters, while requests to the /v1/codepoints/{hex} endpoint can only be used to retrieve details of a single character.
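To make the difference between the two request styles concrete, here is a minimal sketch that builds the path for each endpoint from the same character. It assumes that `{hex}` accepts bare hexadecimal digits without a `U+` prefix, which this section does not confirm; the helper names are hypothetical, and no host name is included because none is given here.

```python
from urllib.parse import quote


def codepoint_path(char: str) -> str:
    """Path for the single-character codepoint endpoint.

    Assumption: {hex} is the bare hex value of the codepoint (no "U+" prefix).
    """
    if len(char) != 1:
        raise ValueError("the codepoint endpoint describes exactly one character")
    return f"/v1/codepoints/{ord(char):04X}"


def characters_path(text: str) -> str:
    """Path for the characters endpoint, with the string URI-encoded as UTF-8."""
    if not text:
        raise ValueError("at least one character is required")
    return "/v1/characters/-/" + quote(text, safe="")


print(codepoint_path("\u2211"))   # /v1/codepoints/2211
print(characters_path("\u2211"))  # /v1/characters/-/%E2%88%91
```

The codepoint form avoids any URI encoding, which is the convenience described above.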
- GET /v1/blocks/{name}
- Retrieve one or more Block(s)
- GET /v1/blocks
- List Blocks
- GET /v1/blocks/search
- Search Blocks
The UnicodeBlock object represents a grouping of characters within the Unicode encoding space. Each block is generally, but not always, meant to supply glyphs used by one or more specific languages, or in some general application area such as mathematics, surveying, decorative typesetting, social forums, etc.
Each block is a uniquely named, continuous, non-overlapping range of code points, containing a multiple of 16 code points (additionally, the starting codepoint for each block is a multiple of 16). A block may contain unassigned code points, which are reserved.
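The alignment rule described above (a start that is a multiple of 16 and a size that is a multiple of 16) can be checked with two modulo operations. This is a small sketch; `is_block_aligned` is a hypothetical helper, not something the API provides.

```python
def is_block_aligned(start: int, finish: int) -> bool:
    """Check that an inclusive code point range satisfies the block alignment rule."""
    size = finish - start + 1
    return size > 0 and start % 16 == 0 and size % 16 == 0


print(is_block_aligned(0x0000, 0x007F))  # True  (Basic Latin, 128 code points)
print(is_block_aligned(0x0008, 0x0087))  # False (start is not a multiple of 16)
```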
The UnicodeBlock object exposes a small set of properties such as the official name of the block, the range of code points assigned to the block and the total number of defined characters within the block:
UnicodeBlock Properties
- id
- This is NOT a property from the Unicode Standard. This is an integer value used to navigate within a paginated list of UnicodeBlock objects. The first block (U+0000..U+007F BASIC LATIN) has id=1 and each block is numbered sequentially in order of starting codepoint.
- name
- Unicode blocks are identified by unique names, which use only ASCII characters and are usually descriptive of the nature of the symbols (in English), such as "Tibetan" or "Supplemental Arrows-A".
- plane
- A string value equal to the abbreviated name of the Unicode Plane containing the block (e.g., "BMP" for Basic Multilingual Plane).
- start
- A string value equal to the first codepoint allocated to the block, expressed in U+hhhhhh format.
- finish
- A string value equal to the last codepoint allocated to the block, expressed in U+hhhhhh format.
- total_allocated
- An integer value equal to the total number of characters (defined or reserved) contained in the block.
- total_defined
- An integer value equal to the total number of characters with defined names, glyphs, etc in the block.
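The relationship between start, finish, total_allocated and total_defined can be sketched as follows. The dictionary below only mimics the property names listed above; its values are made up for illustration and are not an actual API response, and the helper names are hypothetical.

```python
def parse_codepoint(value: str) -> int:
    """Convert a 'U+hhhh' style string into an integer code point."""
    if not value.upper().startswith("U+"):
        raise ValueError(f"expected a 'U+' prefixed codepoint, got {value!r}")
    return int(value[2:], 16)


def allocation_summary(block: dict) -> dict:
    """Derive the allocated and reserved counts from a block-like mapping."""
    start = parse_codepoint(block["start"])
    finish = parse_codepoint(block["finish"])
    allocated = finish - start + 1  # the range is inclusive at both ends
    return {
        "total_allocated": allocated,
        "reserved": allocated - block["total_defined"],
    }


# Illustrative values only (hypothetical block, not real API data).
example_block = {"start": "U+0F00", "finish": "U+0FFF", "total_defined": 200}
print(allocation_summary(example_block))  # {'total_allocated': 256, 'reserved': 56}
```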
The UnicodePlane object represents a continuous group of 65,536 (2^16) code points. There are 17 planes, identified by the numbers 0 to 16. The first two positions of a character's codepoint value (U+hhhhhh) correspond to the plane number in hex format (possible values 0x00–0x10).
Plane 0 is the Basic Multilingual Plane (BMP), which contains most commonly used characters. The higher planes 1 through 16 are called "supplementary planes". The last code point in plane 16 is the last code point in Unicode, U+10FFFF.
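Because every plane spans exactly 65,536 code points, the plane number of a character is its code point shifted right by 16 bits. A minimal sketch, with a hypothetical helper name:

```python
MAX_CODEPOINT = 0x10FFFF  # last code point in plane 16


def plane_number(codepoint: int) -> int:
    """Return the plane (0 to 16) that contains the given code point."""
    if not 0 <= codepoint <= MAX_CODEPOINT:
        raise ValueError(f"{codepoint:#x} is outside the Unicode codespace")
    return codepoint >> 16  # equivalent to codepoint // 0x10000


print(plane_number(0x0041))    # 0  (BMP)
print(plane_number(0x1F1E6))   # 1
print(plane_number(0x10FFFF))  # 16
```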
UnicodePlane Properties
- number
- The official number that identifies the range of codepoints within a plane. The first two positions of a character's codepoint value (U+hhhhhh) correspond to the plane number in hex format (possible values 0x00...0x10). This is a decimal value, however, with possible values 0...16.
- name
-
The official name of a plane, according to the Unicode Standard. As of version 15.0.0, seven of the total 17 planes have official names (the official abbreviation for each plane is also given in parentheses):
- Basic Multilingual Plane (BMP)
- Supplementary Multilingual Plane (SMP)
- Supplementary Ideographic Plane (SIP)
- Tertiary Ideographic Plane (TIP)
- Supplementary Special-purpose Plane (SSP)
- Supplementary Private Use Area-A (SPUA-A)
- Supplementary Private Use Area-B (SPUA-B)
The codepoints within Planes 4-13 (U+40000...U+DFFFF) are unassigned, and these planes currently have no official name/abbreviation.
- abbreviation
- An acronym that identifies the plane. The list in the previous definition contains the abbreviation for each plane along with the official name.
- start
- A string value equal to the first codepoint allocated to the plane, expressed in U+hhhhhh format.
- finish
- A string value equal to the last codepoint allocated to the plane, expressed in U+hhhhhh format.
- total_allocated
- An integer value equal to the total number of characters (defined or reserved) contained in the plane (always 65,536, i.e., 2^16).
- total_defined
- An integer value equal to the total number of characters with defined names, glyphs, etc in the plane.
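The name, abbreviation, start, finish and total_allocated properties can all be derived from the plane number alone. The sketch below uses the seven named planes listed above. How the API represents the unnamed planes 4-13 is not stated here, so `None` is an assumption, as is the exact zero-padding of the `U+` strings; `describe_plane` is a hypothetical helper.

```python
PLANE_NAMES = {
    0: ("Basic Multilingual Plane", "BMP"),
    1: ("Supplementary Multilingual Plane", "SMP"),
    2: ("Supplementary Ideographic Plane", "SIP"),
    3: ("Tertiary Ideographic Plane", "TIP"),
    14: ("Supplementary Special-purpose Plane", "SSP"),
    15: ("Supplementary Private Use Area-A", "SPUA-A"),
    16: ("Supplementary Private Use Area-B", "SPUA-B"),
}


def describe_plane(number: int) -> dict:
    """Build a plane-like mapping from a plane number (0 to 16)."""
    if not 0 <= number <= 16:
        raise ValueError("plane number must be between 0 and 16")
    start = number * 0x10000
    finish = start + 0xFFFF
    # Assumption: unnamed planes (4-13) are represented with None.
    name, abbreviation = PLANE_NAMES.get(number, (None, None))
    return {
        "number": number,
        "name": name,
        "abbreviation": abbreviation,
        "start": f"U+{start:04X}",
        "finish": f"U+{finish:04X}",
        "total_allocated": finish - start + 1,
    }


print(describe_plane(1)["start"], describe_plane(1)["finish"])  # U+10000 U+1FFFF
```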