Transforming DELA into SQL or JSON #2

Closed · ppKrauss opened this issue Jan 9, 2018 · 7 comments

@ppKrauss commented Jan 9, 2018

Hi, I am "new" here: I used Unitex ~10 years ago, and we translated everything to SQL, which was a good approach for managing big data... Is there a "UnitexGramLab recipe" to transform DELA datasets into SQL (PostgreSQL) or JSON?

@eric-laporte (Member)

Hi Peter, no, there is no such recipe. In which context does this approach help?

@ppKrauss (Author) commented Jan 10, 2018

Hi @eric-laporte,

Portuguese is not an entirely "natural" language: there are "official rules", such as the Portuguese Language Orthographic Agreement of 1990... So, 90% of the educated Portuguese vocabulary is defined "by law", not "by statistics"... Each corpus-dictionary pair must be organized by date ranges, and SQL is the only "universal tool" for managing all these relationships.

In my opinion, SQL management is good for:

  • express "managing recipes" and "data provenance scripts" in dictionaries, all official-rule/dictionary relationships and all copus/dictionary relationships.
    PS: we need transparency, enough eyeballs and easy-to-check "recipe language" in this context.

  • expressing and analysing each part in its proper "lingua franca", with simple, reliable and universal algorithms: for example, to express validation rules, and to compare, merge, import and export dictionaries, etc.

  • maintaining "everything in one container", the SQL database: one stop for management, report generation, validation and version control.

Nowadays, with the SQL:2016 standard (so with PostgreSQL v9.5+), all the listed features are possible.
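As a minimal sketch of what I mean (the table and column names here are hypothetical, just to illustrate the data model), PostgreSQL 9.5+ can already express the dictionary/provenance/date-range relationships directly:

```sql
-- Hypothetical sketch: a dictionary registry with provenance and
-- date-range validity, plus entries with grammatical codes as JSON.
CREATE TABLE dictionary (
  dict_id    serial PRIMARY KEY,
  name       text NOT NULL,       -- e.g. 'DELAF_PB'
  source_url text,                -- data provenance
  valid_when daterange NOT NULL   -- orthographic-rule validity period
);

CREATE TABLE entry (
  entry_id  bigserial PRIMARY KEY,
  dict_id   int NOT NULL REFERENCES dictionary(dict_id),
  inflected text NOT NULL,        -- surface form
  lemma     text NOT NULL,
  gram_info jsonb                 -- grammatical codes kept as JSON
);

CREATE INDEX ON entry (inflected); -- fast lookup by surface form
```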


Unitex-core is perfect for micro-management and specialist operations; SQL is perfect for managing big data and macro operations... It is not necessary to "plug" the two together, because it is easy to export/import data automatically in both directions (!)... But, with a little investment, it is also possible to plug them together via SQL C++ modules (copying some strategic Unitex-core functions into the database), unifying the two systems, as in the sketch below.
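A sketch of what that "plug" could look like on the SQL side (the library name and function name are invented for illustration): once a strategic Unitex-core routine is compiled into a shared library, PostgreSQL can register it as an SQL-callable C function:

```sql
-- Hypothetical: expose a Unitex-core routine, compiled into a shared
-- library 'unitex_pg', as a function callable from ordinary SQL queries.
CREATE FUNCTION unitex_tokenize(text)
RETURNS text[]
AS 'unitex_pg', 'unitex_tokenize'  -- shared library, C symbol
LANGUAGE C STRICT;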

Ten years ago we used SQL in the São Carlos municipality, and in 2009 we did something similar (for plurals) in the LexML project... It was a proof of concept for DELA translation and SQL use: I think the results were positive.

@eric-laporte (Member)

Peter,
I doubt many Unitex/GramLab users would be interested in storing corpus-dictionary pairs, but if you want to have an idea about that, you can post a message on the users' forum.
As to translating all the Unitex/GramLab code to SQL, it would be a huge project, so it's better to be sure that the result would be as satisfactory as the present state of the system in terms of the criteria that matter most to users, for example:

  • speed in the processing of text with a dictionary or a grammar,
  • readability of dictionaries.

@eric-laporte (Member)

Peter, one more comment: your proposal is not specific to the language resources of Brazilian Portuguese. The present Unitex/GramLab code for corpora and dictionaries is the same for all languages, except for some configuration files that select some code for some languages.

@ppKrauss (Author)

@eric-laporte,

I am starting with open data conversions, before SQL and/or JSON: any database user who wants to "play" with Unitex dictionary data needs to access it without friction (see the Frictionless Data initiative), by a simple copy/paste or by SQL's COPY FROM a CSV file.

I converted everything to UTF-8, which is the Web standard and is easy to manage on GitHub; then I adapted everything to a tabular CSV format, which is the simplest and most widely accepted format (RFC 4180 and W3C's tabular-data-model standards) for data interchange.

I am trying to do this first step here.
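For example (a sketch only: the file path and table layout are assumptions), a DELAF line such as admettent,admettre.V:P3p, in the format documented in the Unitex manual, can be flattened to the CSV columns inflected, lemma, pos, flexion and bulk-loaded with SQL's COPY:

```sql
-- Sketch: bulk-load a DELAF already flattened to RFC 4180 CSV.
CREATE TABLE delaf_raw (
  inflected text,  -- e.g. 'admettent'
  lemma     text,  -- e.g. 'admettre'
  pos       text,  -- e.g. 'V'
  flexion   text   -- e.g. 'P3p'
);

COPY delaf_raw FROM '/data/delaf_pb.csv'
  WITH (FORMAT csv, HEADER true, ENCODING 'UTF8');
```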


Answering your comments:

> I doubt many Unitex/GramLab users would be interested in storing corpus-dictionary pairs,

Hmm... OK, I'm alone in this demand: I am abandoning it for the moment.

> but if you want to have an idea about that, you can post a message on the users' forum.

The idea is to solve, in the far future, "the problem of the many pt-BR dictionaries", one for each historical period, with well-defined date-range constraints (the laws of the orthographic reforms).
Our goal is to use LexML and SciELO texts as corpora. LexML has a collection of millions of official documents (a good, dated HTML text corpus)... SciELO has another million good (XML JATS) documents, to check correlations and the adoption of changes in the "official educated language".

Typical applications are a "dated spell checker" (in OCR and digital preservation contexts) or a "dated dictionary" (for search optimization).
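A sketch of such a "dated" lookup, reusing the hypothetical schema above: the daterange containment operator @> restricts the search to the dictionaries in force at the document's date:

```sql
-- Hypothetical 'dated spell check': is this surface form attested in a
-- dictionary whose validity range contains the document's date?
SELECT e.inflected, e.lemma, d.name
FROM entry e
JOIN dictionary d USING (dict_id)
WHERE e.inflected = 'facto'             -- word to check (illustrative)
  AND d.valid_when @> DATE '1995-06-01';
```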

> As to translating all the Unitex/GramLab code to SQL, it would be a huge project,

... Not so huge; as I commented above, it needs only a "little investment ... copy/paste some strategic Unitex-core functions". The first step is to define a good data model, and that is what I want to do in the next weeks...
The next step, yes, is not a job for a one-man show... Perhaps a six-month job for a little team, testing and enhancing performance on the SQL side.

> • speed in the processing of text with a dictionary or a grammar,
> • readability of dictionaries.

Yes, we need to focus on these goals.

> ... your proposal is not specific to the language resources of Brazilian Portuguese.

Yes, we can extend it to other languages... But the main justification (my rationale above) is that pt-BR is "defined by law". For others, like English (defined by a "cultural ecosystem"), the date-range reference is not attractive: for English it is difficult to fit corpus segments or dictionaries into a date range.

> The present Unitex/GramLab code for corpora and dictionaries is the same for all languages, except for some configuration files that select some code for some languages.

Yes, and I am also supposing that for pt-BR, and for all the other Indo-European languages, we can change the default from UTF-16 to UTF-8. The ideal for an open dataset is to adopt its natural base charset by default.

... I will offer a Perl or shell script that rebuilds the files into UTF-16 and the other formats that Unitex expects.


My next step is to understand what the minimal primary source for a Unitex dictionary is... I am supposing that DELAS + DELACF + inflections are all (the minimum) that we need to generate a complete dictionary (e.g. DELAF_PB).

@eric-laporte (Member)

Hi Peter,
Unitex/GramLab already works with language resources in UTF-8. And it already has a converter to transcode UTF-16 into UTF-8.
As to the source of a Unitex/GramLab dictionary, you can read the user's manual: for an inflectional or agglutinative language, it's a DELAS or a DELAC plus the corresponding inflectional transducers, plus in some cases the morphological dictionary-graphs. For a language which is neither inflectional nor agglutinative, it can be a DELAF or a DELACF, plus in some cases the morphological dictionary-graphs.

@martinec (Member)

I'm closing this issue. Transforming DELA into SQL or JSON is not, at least for now, a part of the project's goals. Nevertheless, feel free to open a PR on the core repository if you want to share such a module.
