Transforming DELA into SQL or JSON #2
Comments
Hi Peter, no, there is no such recipe. In which context does this approach help?
Hi @eric-laporte, Portuguese is not such a "natural" language: there are "official rules", such as the Portuguese Language Orthographic Agreement of 1990. So 90% of the educated Portuguese vocabulary is defined "by law", not "by statistics". Each corpus-dictionary pair must be organized by date range. SQL is the only "universal tool" for managing all these relationships. In my opinion, SQL management is good for:
Nowadays, with the SQL:2016 standard (so with PostgreSQL v9.5+), all the listed features are possible. Unitex-core is perfect for micro-management and specialist operations; SQL is perfect for managing big data and macro operations. It is not necessary to "plug" the two together, because it is easy to export and import data automatically in both directions. But, with a little investment, it is also possible to plug them together through SQL C++ modules (copying some strategic Unitex-core functions into the database), unifying the systems. Ten years ago we used SQL in São Carlos municipality, and in 2009 we used something similar (for plurals) in the LexML project. It was a proof of concept for DELA translation and SQL use, and I think the results were positive.
Peter,
Peter, one more comment: your proposal is not specific to the language resources of Brazilian Portuguese. The present Unitex/GramLab code for corpora and dictionaries is the same for all languages, except for some configuration files that select some code for some languages.
I am starting with open data conversions before SQL and/or JSON: any database user who wants to "play" with Unitex dictionary data needs to access it without friction (see the Frictionless Data initiative), by simple copy/paste or SQL's "copy from CSV". I converted everything to UTF-8, which is the Web standard and is easy to manage on GitHub; then I adapted everything to a tabular CSV format, which is the simplest and most widely accepted format for data interchange (RFC 4180 and the W3C tabular data model standards). I am trying to do this first step here. Answering your comments:
Hmm... OK, I'm alone in this demand, so I am abandoning it for now.
The idea is to solve, in the far future, "the problem of the many pt-BR dictionaries": one for each historical period, with well-defined date-range constraints (the laws of orthographic reforms). Typical applications are a "dated spell checker" (in OCR and digital preservation contexts) or a "dated dictionary" (for search optimization).
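As a rough illustration of what a "dated dictionary" lookup could look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `dated_entry` table, its column names, the helper functions, and the years attached to the two spellings are placeholders, not a proposed data model or legal references. It uses SQLite from the standard library only so that it runs anywhere; the same idea would carry over to PostgreSQL.

```python
import sqlite3

# Hypothetical schema: one row per (word form, validity period).
# valid_to is NULL while the spelling is still in force.
SCHEMA = """
CREATE TABLE dated_entry (
    form       TEXT NOT NULL,
    lemma      TEXT NOT NULL,
    valid_from INTEGER NOT NULL,  -- first year the spelling is accepted
    valid_to   INTEGER            -- last year accepted, NULL = still valid
)
"""

# Illustrative rows only; the years are placeholders, not legal references.
ROWS = [
    ("idéia", "idéia", 1943, 2008),
    ("ideia", "ideia", 2009, None),
]


def open_dictionary() -> sqlite3.Connection:
    """Build an in-memory dictionary with the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO dated_entry VALUES (?, ?, ?, ?)", ROWS)
    return conn


def is_valid(conn: sqlite3.Connection, form: str, year: int) -> bool:
    """Return True if `form` is an accepted spelling in `year`."""
    row = conn.execute(
        "SELECT 1 FROM dated_entry "
        "WHERE form = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to >= ?) LIMIT 1",
        (form, year, year),
    ).fetchone()
    return row is not None
```

A "dated spell checker" for an OCR'd document would then call `is_valid(conn, token, document_year)` for each token, so the same spelling can be accepted in one period and flagged in another.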
... Not so huge: as I commented above, it needs only a "little investment ... copy/paste some strategic Unitex-core functions". The first step is to define a good data model, which is what I want to do in the next weeks...
Yes, need to focus on these goals.
Yes, we can extend it to other languages... But the main justification (my rationale above) is that pt-BR is "defined by law". For others, like English (defined by a "cultural ecosystem"), the date-range reference is not attractive: for English it is difficult to assign corpus segments or dictionaries to a date range.
Yes, I am also supposing that pt-BR and all other Indo-European languages can change their defaults from UTF-16 to UTF-8. The ideal for an open dataset is to adopt its local base charset by default. ... I will offer a Perl or shell script that rebuilds the files, all in UTF-16 and the other formats Unitex expects. My next step is to understand what the minimal primary source for a Unitex dictionary is... I am supposing that DELAS + DELACF + inflections are all (the minimum) that we need to generate a complete dictionary (e.g. DELAF_PB).
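The rebuild step mentioned above could look roughly like the following. This is a sketch in Python rather than the Perl or shell script announced in the comment, and it assumes that the expected target is UTF-16 little-endian with a byte-order mark; that should be checked against the Unitex manual. The function name `utf8_to_utf16le` is made up for the example.

```python
import codecs
from pathlib import Path


def utf8_to_utf16le(src: Path, dst: Path) -> None:
    """Re-encode a UTF-8 text file as UTF-16LE with a leading BOM.

    Assumption: the consumer expects little-endian UTF-16 with a BOM.
    """
    # "utf-8-sig" also accepts input that starts with a UTF-8 BOM.
    text = src.read_text(encoding="utf-8-sig")
    with open(dst, "wb") as out:
        out.write(codecs.BOM_UTF16_LE)       # bytes FF FE
        out.write(text.encode("utf-16-le"))  # payload without a second BOM
```

Writing the BOM explicitly and encoding with `utf-16-le` keeps the byte order fixed regardless of the platform, which the plain `utf-16` codec does not guarantee.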
Hi Peter,
I'm closing this issue. Transforming DELA into SQL or JSON is not, at least for now, a part of the project's goals. Nevertheless, feel free to open a PR on the core repository if you want to share such a module.
Hi, I am "new" here: I used Unitex ~10 years ago, and we translated everything to SQL, which was a good approach to managing big data... Is there a "UnitexGramLab recipe" to transform DELA datasets into SQL (PostgreSQL) or JSON?
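For readers wondering what such a recipe might involve, here is a minimal Python sketch that turns one DELAF-style line into a JSON record. It assumes the commonly documented line shape `form,lemma.CODE+trait:inflection/comment`, with backslash-escaped separators and an empty lemma meaning "same as the form". The function names and output field names are invented for the example, the sample lines are made up rather than taken from DELAF_PB, and the full format has more cases than this handles.

```python
import json


def split_unescaped(text: str, sep: str, maxsplit: int = -1) -> list:
    """Split on `sep`, skipping separators preceded by a backslash."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the escape verbatim
            i += 2
            continue
        if ch == sep and maxsplit != 0:
            parts.append("".join(current))
            current = []
            maxsplit -= 1
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


def parse_delaf_line(line: str) -> dict:
    """Parse one simplified DELAF-style line into a dict."""
    body = split_unescaped(line.strip(), "/", 1)[0]  # drop trailing comment
    form, rest = split_unescaped(body, ",", 1)
    lemma, info = split_unescaped(rest, ".", 1)
    codes, *inflections = split_unescaped(info, ":")
    category, *traits = split_unescaped(codes, "+")
    return {
        "form": form,
        "lemma": lemma or form,  # empty lemma means "same as form"
        "category": category,
        "traits": traits,
        "inflections": inflections,
    }


def to_json(line: str) -> str:
    """Serialize the parsed line, keeping accented characters readable."""
    return json.dumps(parse_delaf_line(line), ensure_ascii=False)
```

Calling `to_json("mesas,mesa.N:fp")` gives one JSON object per entry; the same dict could be written as a CSV row or used as parameters for an SQL `INSERT`.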