Skip to content

Encode Unicode

Peter edited this page Feb 1, 2018 · 3 revisions

The choice of Unicode is preserved, the only change was from UTF-16 encode (of Unicode) to UTF-8.

About tools like "regular templates" and separators, see "" (F002) or "" (F003) of the PUA as neutral separators for "free UTF-8 alphabet" in multilingual corpus. F002 and F003 are only a text separators that are not content, and can be also in split or regular-expression operarations... So it is an internal representation on software-control.

Convention: use (F002) as open-tag and (F003) as close-tag. Example:
<section class="main"><p>Hello</p><p>Bye!</p></section>
substituted by representation of array+encode:
  ["section class=\"main\"","p","p"]
  +
  HelloBye!

So, it is easy to do fast what is not so fast in DOMDocument representation. Fast translate to TXT (tr// /), to replace tags, to analyze neighborhoods (eg. sequence any-open-tag and "Hello") and to analyze structre by complex regular expressions, where XPath is not so good.

See also wikipedia-discussion.

Clone this wiki locally