Encode Unicode
The choice of Unicode is preserved; the only change is the encoding of the Unicode text, from UTF-16 to UTF-8.
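As a small illustration of that point, the sketch below encodes the same string in both encodings and checks that the code points are unchanged. It uses U+F002 and U+F003, the private-use code points this page uses as separators; the variable names are chosen here for illustration only.

```python
# U+F002 and U+F003 are code points in the Unicode Private Use Area (PUA).
OPEN = "\uf002"
CLOSE = "\uf003"

text = OPEN + "Hello" + CLOSE

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# The code points survive a round trip through either encoding;
# only the byte representation differs.
assert utf8.decode("utf-8") == text
assert utf16.decode("utf-16-le") == text

# Each separator takes 3 bytes in UTF-8 and 2 bytes in UTF-16.
print(OPEN.encode("utf-8").hex())      # ef8082
print(OPEN.encode("utf-16-le").hex())  # 02f0
print(len(utf8), len(utf16))           # 11 14
```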
For tools such as "regular templates" and separators, use the Private Use Area (PUA) code points U+F002 and U+F003 as neutral separators for a "free UTF-8 alphabet" in a multilingual corpus. U+F002 and U+F003 are only text separators, not content, so they can also be used in split or regular-expression operations. They are an internal representation under software control.
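A minimal sketch of the split use case, assuming the text already contains the two separators. The helper name `content_pieces` is hypothetical and not part of any existing tool.

```python
import re

OPEN, CLOSE = "\uf002", "\uf003"

# Character class matching either separator.
SEPARATORS = re.compile("[\uf002\uf003]")

def content_pieces(encoded: str) -> list[str]:
    """Return the non-empty text runs found between separators."""
    return [piece for piece in SEPARATORS.split(encoded) if piece]

sample = f"{OPEN}{OPEN}Hello{CLOSE}{OPEN}Bye!{CLOSE}{CLOSE}"
print(content_pieces(sample))  # ['Hello', 'Bye!']
```

Because the separators are private-use code points, they should not collide with ordinary text in any language of the corpus, provided no other tool in the pipeline assigns its own meaning to the same code points.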
Convention: use U+F002 as the open-tag marker and U+F003 as the close-tag marker. Example:
<section class="main"><p>Hello</p><p>Bye!</p></section>
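is replaced by a tag array plus the encoded text:
["section class=\"main\"","p","p"]
+
[F002][F002]Hello[F003][F002]Bye![F003][F003]
(here [F002] and [F003] stand for the otherwise invisible characters U+F002 and U+F003).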
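The substitution can be sketched with Python's standard `html.parser`. This is only an illustration under assumptions: the class name `TagArrayEncoder` and the function `encode` are hypothetical, and the input is assumed to be well formed, with every open tag explicitly closed (no void elements such as a bare `<br>`).

```python
from html.parser import HTMLParser

OPEN, CLOSE = "\uf002", "\uf003"

class TagArrayEncoder(HTMLParser):
    """Hypothetical encoder: open tags go to an array, text keeps only separators."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []   # raw content of each open tag, in document order
        self.parts = []  # text runs and separator characters

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the raw source, e.g. '<section class="main">'.
        raw = self.get_starttag_text()
        self.tags.append(raw[1:-1])  # drop the surrounding '<' and '>'
        self.parts.append(OPEN)

    def handle_endtag(self, tag):
        self.parts.append(CLOSE)

    def handle_data(self, data):
        self.parts.append(data)

def encode(html):
    """Return (tag array, text with U+F002/U+F003 in place of tags)."""
    encoder = TagArrayEncoder()
    encoder.feed(html)
    encoder.close()
    return encoder.tags, "".join(encoder.parts)

tags, text = encode('<section class="main"><p>Hello</p><p>Bye!</p></section>')
print(tags)        # ['section class="main"', 'p', 'p']
print(ascii(text)) # '\uf002\uf002Hello\uf003\uf002Bye!\uf003\uf003'
```

Close tags need no array entry because each U+F003 closes the most recently opened, still unclosed tag.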
This makes it easy to do quickly what is slow with a DOMDocument representation: fast translation to plain text (a tr-style replacement of the separator characters), replacing tags, analyzing neighborhoods (e.g. the sequence of any open-tag followed by "Hello"), and analyzing structure with complex regular expressions, where XPath is not so good.
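The operations listed above can be sketched as follows. This is a hedged illustration, not a reference implementation: the `decode` helper and all variable names are assumptions, and the separators are mapped to spaces for the plain-text step, which is one possible choice.

```python
import re

OPEN, CLOSE = "\uf002", "\uf003"

tags = ['section class="main"', "p", "p"]
text = f"{OPEN}{OPEN}Hello{CLOSE}{OPEN}Bye!{CLOSE}{CLOSE}"

# 1. Plain text: map both separators to a space, then normalize whitespace.
plain = " ".join(text.translate({ord(OPEN): " ", ord(CLOSE): " "}).split())
print(plain)  # Hello Bye!

# 2. Replace tags: only the array changes; the encoded text is untouched.
renamed = ["div" if t == "p" else t for t in tags]

# 3. Neighborhood: any open tag immediately followed by "Hello".
match = re.search(OPEN + "Hello", text)
# The n-th U+F002 in the text corresponds to the n-th array entry.
tag_index = text.count(OPEN, 0, match.start() + 1) - 1
print(tags[tag_index])  # p

def decode(tag_array, encoded):
    """Hypothetical inverse: rebuild markup from the array and the encoded text."""
    out, stack, next_tag = [], [], 0
    for piece in re.split(f"([{OPEN}{CLOSE}])", encoded):
        if piece == OPEN:
            out.append(f"<{tag_array[next_tag]}>")
            stack.append(tag_array[next_tag].split()[0])  # element name only
            next_tag += 1
        elif piece == CLOSE:
            out.append(f"</{stack.pop()}>")
        else:
            out.append(piece)
    return "".join(out)

print(decode(renamed, text))
# <section class="main"><div>Hello</div><div>Bye!</div></section>
```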
See also wikipedia-discussion.