Natural language parser, for Latin-script languages, that produces nlcst.
- What is this?
- When should I use this?
- Install
- Use
- API
- Algorithm
- Types
- Compatibility
- Security
- Related
- Contribute
- License
This package exposes a parser that takes Latin-script natural language and produces a syntax tree.
If you want to handle natural language as syntax trees manually, use this.
Alternatively, you can use the retext plugin retext-latin, which wraps this project to also parse natural language at a higher-level (easier) abstraction.
Whether the text is Old English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les toilettes?”), this project does a good job of tokenizing it.
For English and Dutch, you can instead use parse-english and parse-dutch.
To some extent, you can also use this for Latin-like scripts, such as Cyrillic (“привет”), Georgian (“გამარჯობა”), and Armenian (“Բարեւ”).
This package is ESM only. In Node.js (version 16+), install with npm:
npm install parse-latin
In Deno with esm.sh:
import {ParseLatin} from 'https://esm.sh/parse-latin@7'
In browsers with esm.sh:
<script type="module">
import {ParseLatin} from 'https://esm.sh/parse-latin@7?bundle'
</script>
import {ParseLatin} from 'parse-latin'
import {inspect} from 'unist-util-inspect'
const tree = new ParseLatin().parse('A simple sentence.')
console.log(inspect(tree))
Yields:
RootNode[1] (1:1-1:19, 0-18)
└─0 ParagraphNode[1] (1:1-1:19, 0-18)
    └─0 SentenceNode[6] (1:1-1:19, 0-18)
        ├─0 WordNode[1] (1:1-1:2, 0-1)
        │   └─0 TextNode "A" (1:1-1:2, 0-1)
        ├─1 WhiteSpaceNode " " (1:2-1:3, 1-2)
        ├─2 WordNode[1] (1:3-1:9, 2-8)
        │   └─0 TextNode "simple" (1:3-1:9, 2-8)
        ├─3 WhiteSpaceNode " " (1:9-1:10, 8-9)
        ├─4 WordNode[1] (1:10-1:18, 9-17)
        │   └─0 TextNode "sentence" (1:10-1:18, 9-17)
        └─5 PunctuationNode "." (1:18-1:19, 17-18)
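Because the result is a tree of plain objects, it can be walked without any helper library. The following is a minimal sketch, assuming a hand-written copy of the tree for 'A simple sentence.' with the positional info left out; `toText` and `collectWords` are hypothetical helpers written for this illustration and are not part of parse-latin.

```javascript
// Hand-written copy of the tree for 'A simple sentence.', without positions.
const tree = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            {type: 'WordNode', children: [{type: 'TextNode', value: 'A'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'simple'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'sentence'}]},
            {type: 'PunctuationNode', value: '.'}
          ]
        }
      ]
    }
  ]
}

// Hypothetical helper: concatenate the text of a node and its descendants.
function toText(node) {
  return 'children' in node ? node.children.map(toText).join('') : node.value
}

// Hypothetical helper: collect the text of every WordNode, depth first.
function collectWords(node, words = []) {
  if (node.type === 'WordNode') {
    words.push(toText(node))
  } else if ('children' in node) {
    for (const child of node.children) collectWords(child, words)
  }
  return words
}

console.log(collectWords(tree)) // [ 'A', 'simple', 'sentence' ]
```

In real projects, existing utilities such as unist-util-visit and nlcst-to-string are likely a better fit than hand-rolled recursion.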
This package exports the identifier ParseLatin.
There is no default export.
new ParseLatin(): create a new parser.
ParseLatin#parse(value): turn natural language into a syntax tree.
Parameters: value (string, optional), the value to parse.
Returns: tree (RootNode).
👉 Note: The easiest way to see how parse-latin parses is by using the online parser demo, which shows the syntax tree corresponding to the typed text.
parse-latin splits text into white space, punctuation, symbol, and word tokens:
- “word” is one or more unicode letters or numbers
- “white space” is one or more unicode white space characters
- “punctuation” is one or more unicode punctuation characters
- “symbol” is one or more of anything else
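The four token classes above can be approximated with Unicode property escapes in a regular expression. This is a rough sketch of the idea, not parse-latin's actual tokenizer: the `tokenize` function, the token shape, and the exact character classes are assumptions made for illustration. One known simplification is that combining marks fall into “symbol” here.

```javascript
// Hypothetical sketch: NOT the real parse-latin tokenizer.
// Each alternative is one token class; anything else is a symbol.
const expression =
  /(?<word>[\p{L}\p{N}]+)|(?<whiteSpace>\p{White_Space}+)|(?<punctuation>\p{P}+)|(?<symbol>[^\p{L}\p{N}\p{White_Space}\p{P}]+)/gu

function tokenize(value) {
  const tokens = []
  for (const match of value.matchAll(expression)) {
    // Exactly one named group is defined per match; its name is the type.
    const [type] = Object.entries(match.groups).find(
      ([, text]) => text !== undefined
    )
    tokens.push({type, value: match[0]})
  }
  return tokens
}

console.log(tokenize('Où sont les toilettes?').map((token) => token.type))
```

For example, under these assumptions `tokenize('1 + 1')` classifies `+` as a symbol, because it is neither a letter, a number, white space, nor punctuation.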
Then, it manipulates and merges those tokens into a syntax tree, adding sentences and paragraphs where needed.
- some punctuation marks are part of the word they occur in, such as “non-profit”, “she’s”, “G.I.”, “11:00”, “N/A”, “&c”, “nineteenth- and…”
- some periods do not mark a sentence end, such as “1.”, “e.g.”, “id.”
- although periods, question marks, and exclamation marks (sometimes) end a sentence, that end might not occur directly after the mark, such as “.)”, “."”
- …and many more exceptions
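To make the first kind of exception concrete, here is a minimal sketch of one merging step: folding a punctuation mark that sits between two words into a single word. It is an illustration under stated assumptions, not the parser's real logic. The `mergeInnerWordPunctuation` function, the flat token shape, and the short list of marks are all hypothetical, and cases such as “G.I.” or “&c” need more rules than this.

```javascript
// Hypothetical sketch: NOT parse-latin's implementation.
// Marks assumed to join two words; the real rules are more extensive.
const innerWordMarks = new Set(['-', '’', "'", ':', '/'])

function mergeInnerWordPunctuation(tokens) {
  const result = []
  let index = 0

  while (index < tokens.length) {
    const token = tokens[index]
    const previous = result[result.length - 1]
    const next = tokens[index + 1]

    if (
      token.type === 'punctuation' &&
      innerWordMarks.has(token.value) &&
      previous &&
      previous.type === 'word' &&
      next &&
      next.type === 'word'
    ) {
      // Fold the mark and the following word into the previous word.
      previous.value += token.value + next.value
      index += 2
    } else {
      // Copy the token so the input array is left untouched.
      result.push({...token})
      index++
    }
  }

  return result
}

console.log(
  mergeInnerWordPunctuation([
    {type: 'word', value: 'non'},
    {type: 'punctuation', value: '-'},
    {type: 'word', value: 'profit'}
  ])
)
```

A comma followed by white space is left alone by this sketch, because the token after the mark is not a word.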
This package is fully typed with TypeScript. It exports no additional types.
Projects maintained by me are compatible with maintained versions of Node.js.
When I cut a new major release, I drop support for unmaintained versions of
Node.
This means I try to keep the current release line, parse-latin@^7, compatible with Node.js 16.
This package is safe.
- parse-english: English (natural language) parser
- parse-dutch: Dutch (natural language) parser
Yes please! See How to Contribute to Open Source.