Transforming DELA into SQL or JSON #2

Closed · ppKrauss opened this issue Jan 9, 2018 · 7 comments

@ppKrauss commented Jan 9, 2018

Hi, I am "new" here: I used Unitex ~10 years ago, and we translated everything to SQL, which was a good approach for managing big data... Is there a "UnitexGramLab recipe" to transform DELA datasets into SQL (PostgreSQL) or JSON?

@eric-laporte (Member)

Hi Peter, no, there is no such recipe. In which context does this approach help?

@ppKrauss (Author) commented Jan 10, 2018

Hi @eric-laporte,

Portuguese is not an entirely "natural" language: there are "official rules", such as the Portuguese Language Orthographic Agreement of 1990... So, 90% of the educated Portuguese vocabulary is defined "by law", not "by statistics"... Each corpus-dictionary pair must be organized by date ranges, and SQL is the only "universal tool" for managing all these relationships.

In my opinion, SQL management is good for:

  • express "managing recipes" and "data provenance scripts" in dictionaries, all official-rule/dictionary relationships and all copus/dictionary relationships.
    PS: we need transparency, enough eyeballs and easy-to-check "recipe language" in this context.

  • expressing and analysing each part in its proper "lingua franca", with simple, reliable and universal algorithms: for example, to express validation rules, and to compare, merge, import and export dictionaries, etc.

  • maintaining "everything in one container", the SQL database: one stop for management, report generation, validation and version control.

Nowadays, with the SQL:2016 standard (so with PostgreSQL v9.5+), all the listed features are possible.
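As a minimal sketch of what I mean (the table and column names here are hypothetical, just to illustrate the data model), PostgreSQL 9.5+ can already express the dictionary/provenance/date-range relationships directly:

```sql
-- Hypothetical sketch: a dictionary registry with provenance and
-- date-range validity, plus entries with grammatical codes as JSON.
CREATE TABLE dictionary (
  dict_id    serial PRIMARY KEY,
  name       text NOT NULL,       -- e.g. 'DELAF_PB'
  source_url text,                -- data provenance
  valid_when daterange NOT NULL   -- orthographic-rule validity period
);

CREATE TABLE entry (
  entry_id  bigserial PRIMARY KEY,
  dict_id   int NOT NULL REFERENCES dictionary(dict_id),
  inflected text NOT NULL,        -- surface form
  lemma     text NOT NULL,
  gram_info jsonb                 -- grammatical codes kept as JSON
);

CREATE INDEX ON entry (inflected); -- fast lookup by surface form
```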


Unitex-core is perfect for micro-management and specialist operations; SQL is perfect for managing big data and macro operations... It is not necessary to "plug" the two together, because it is easy to export/import data automatically in both directions (!)... But, with a little investment, it is also possible to plug them together via SQL C++ modules (copying some strategic Unitex-core functions into the database), unifying the two systems, as in the sketch below.
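A sketch of what that "plug" could look like on the SQL side (the library name and function name are invented for illustration): once a strategic Unitex-core routine is compiled into a shared library, PostgreSQL can register it as an SQL-callable C function:

```sql
-- Hypothetical: expose a Unitex-core routine, compiled into a shared
-- library 'unitex_pg', as a function callable from ordinary SQL queries.
CREATE FUNCTION unitex_tokenize(text)
RETURNS text[]
AS 'unitex_pg', 'unitex_tokenize'  -- shared library, C symbol
LANGUAGE C STRICT;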

Ten years ago we used SQL in the São Carlos municipality, and in 2009 we did something similar (for plurals) in the LexML project... It was a proof of concept for DELA translation and SQL use: I think the results were positive.

@eric-laporte (Member)

Peter,
I doubt many Unitex/GramLab users would be interested in storing corpus-dictionary pairs, but if you want to have an idea about that, you can post a message on the users' forum.
As to translating all the Unitex/GramLab code to SQL, it would be a huge project, so it's better to be sure that the result would be as satisfactory as the present state of the system in terms of the criteria that matter most to users, for example:

  • speed in the processing of text with a dictionary or a grammar,
  • readability of dictionaries.

@eric-laporte (Member)

Peter, one more comment: your proposal is not specific to the language resources of Brazilian Portuguese. The present Unitex/GramLab code for corpora and dictionaries is the same for all languages, except for some configuration files that select some code for some languages.

@ppKrauss (Author)

@eric-laporte,

I am starting with open data conversions, before SQL and/or JSON: any database user who wants to "play" with Unitex dictionary data needs to access it without friction (see the Frictionless Data initiative), by a simple copy/paste or by SQL's COPY FROM a CSV file.

I converted everything to UTF-8, which is the Web standard and is easy to manage on GitHub; then I adapted everything to a tabular CSV format, which is the simplest and most widely accepted format (RFC 4180 and W3C's tabular-data-model standards) for data interchange.

I am trying to do this first step here.
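For example (a sketch only: the file path and table layout are assumptions), a DELAF line such as admettent,admettre.V:P3p, in the format documented in the Unitex manual, can be flattened to the CSV columns inflected, lemma, pos, flexion and bulk-loaded with SQL's COPY:

```sql
-- Sketch: bulk-load a DELAF already flattened to RFC 4180 CSV.
CREATE TABLE delaf_raw (
  inflected text,  -- e.g. 'admettent'
  lemma     text,  -- e.g. 'admettre'
  pos       text,  -- e.g. 'V'
  flexion   text   -- e.g. 'P3p'
);

COPY delaf_raw FROM '/data/delaf_pb.csv'
  WITH (FORMAT csv, HEADER true, ENCODING 'UTF8');
```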


Answering your comments:

> I doubt many Unitex/GramLab users would be interested in storing corpus-dictionary pairs,

Hmm... OK, I'm alone in this demand: I am abandoning it for the moment.

> but if you want to have an idea about that, you can post a message on the users' forum.

The idea is to solve, in the far future, "the problem of the many pt-BR dictionaries", one for each historical period, with well-defined date-range constraints (the laws of the orthographic reforms).
Our goal is to use LexML and SciELO texts as corpora. LexML has a collection of millions of official documents (a good, dated HTML text corpus)... SciELO has another million good (XML JATS) documents, to check correlations and the adoption of changes in the "official educated language".

Typical applications are a "dated spell checker" (in OCR and digital preservation contexts) or a "dated dictionary" (for search optimization).
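A sketch of such a "dated" lookup, reusing the hypothetical schema above: the daterange containment operator @> restricts the search to the dictionaries in force at the document's date:

```sql
-- Hypothetical 'dated spell check': is this surface form attested in a
-- dictionary whose validity range contains the document's date?
SELECT e.inflected, e.lemma, d.name
FROM entry e
JOIN dictionary d USING (dict_id)
WHERE e.inflected = 'facto'             -- word to check (illustrative)
  AND d.valid_when @> DATE '1995-06-01';
```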

> As to translating all the Unitex/GramLab code to SQL, it would be a huge project,

... Not so huge; as I commented above, it needs only a "little investment ... copy/paste some strategic Unitex-core functions". The first step is to define a good data model, and that is what I want to do in the next weeks...
The next step, yes, is not a job for a one-man show... Perhaps a six-month job for a little team, testing and enhancing performance on the SQL side.

> • speed in the processing of text with a dictionary or a grammar,
> • readability of dictionaries.

Yes, we need to focus on these goals.

> ... your proposal is not specific to the language resources of Brazilian Portuguese.

Yes, we can extend it to other languages... But the main justification (my rationale above) is that pt-BR is "defined by law". For others, like English (defined by a "cultural ecosystem"), the date-range reference is not attractive: for English it is difficult to fit corpus segments or dictionaries into a date range.

> The present Unitex/GramLab code for corpora and dictionaries is the same for all languages, except for some configuration files that select some code for some languages.

Yes, and I am also supposing that for pt-BR, and for all the other Indo-European languages, we can change the default from UTF-16 to UTF-8. The ideal for an open dataset is to adopt its natural base charset by default.

... I will offer a Perl or shell script that rebuilds the files into UTF-16 and the other formats that Unitex expects.


My next step is to understand what the minimal primary source for a Unitex dictionary is... I am supposing that DELAS + DELACF + inflections are all (the minimum) that we need to generate a complete dictionary (e.g. DELAF_PB).

@eric-laporte (Member)

Hi Peter,
Unitex/GramLab already works with language resources in UTF-8. And it already has a converter to transcode UTF-16 into UTF-8.
As to the source of a Unitex/GramLab dictionary, you can read the user's manual: for an inflectional or agglutinative language, it's a DELAS or a DELAC plus the corresponding inflectional transducers, plus in some cases the morphological dictionary-graphs. For a language which is neither inflectional nor agglutinative, it can be a DELAF or a DELACF, plus in some cases the morphological dictionary-graphs.

@martinec (Member)

I'm closing this issue. Transforming DELA into SQL or JSON is not, at least for now, a part of the project's goals. Nevertheless, feel free to open a PR on the core repository if you want to share such a module.
