update bin for bulk generation and improve docs
rodrigopivi committed Aug 25, 2018
1 parent 2cb052e commit 05af67e
Showing 8 changed files with 162 additions and 103 deletions.
2 changes: 1 addition & 1 deletion package-lock.json


2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "chatito",
"version": "2.1.1",
"version": "2.1.2",
"description": "Generate training datasets for NLU chatbots using a simple DSL",
"bin": {
"chatito": "./dist/bin.js"
58 changes: 54 additions & 4 deletions readme.md
@@ -14,8 +14,57 @@ This project contains the:
- [DSL AST parser in pegjs format](https://github.com/rodrigopivi/Chatito/blob/master/parser/chatito.pegjs)
- [Generator implemented in typescript + npm package](https://github.com/rodrigopivi/Chatito/tree/master/src)

### Chatito DSL specification
For the language specification and documentation, please refer to the [DSL spec document](https://github.com/rodrigopivi/Chatito/blob/master/spec.md).
### Chatito language
For the full language specification and documentation, please refer to the [DSL spec document](https://github.com/rodrigopivi/Chatito/blob/master/spec.md).

### Adapters
The language is independent of the generated output format, and because each model can receive different parameters and settings, three data format adapters are provided. This section describes the adapters, their specific behaviors, and use cases:

#### Default format
Use the default format if you plan to train a custom model or if you are writing a custom adapter. This is the most flexible format because you can annotate `Slots` and `Intents` with custom entity arguments, and they will all be present in the generated output, so, for example, you could also include dialog/response generation logic with the DSL. E.g.:

```
%[some intent]('context': 'some annotation')
@[some slot] ~[please?]
@[some slot]('required': 'true', 'type': 'some type')
~[some alias here]
```

Custom entity arguments like 'context', 'required' and 'type' will be available in the generated output, so you can handle these custom arguments however you want.
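
As a reference, with this commit the default adapter saves a single `default_dataset_training.json` (and a `default_dataset_testing.json` when testing examples are defined), keyed by intent name. The token shape below is only an illustrative sketch of that structure, not the exact output format:

```
{
  "some intent": [
    [ /* sentence tokens for one generated utterance,
         carrying any custom entity arguments you annotated */ ]
  ]
}
```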

#### [Rasa NLU](https://rasa.com/docs/nlu/)
[Rasa NLU](https://rasa.com/docs/nlu/) is a great open source framework for training NLU models.
One particular behavior of the Rasa adapter is that when a slot definition sentence contains only one alias, the generated Rasa dataset will map the alias as a synonym. E.g.:

```
%[some intent]('training': '1')
@[some slot]
@[some slot]
~[some slot synonyms]
~[some slot synonyms]
synonym 1
synonym 2
```

In this example, the generated Rasa dataset will contain the `entity_synonyms` of `synonym 1` and `synonym 2` mapping to `some slot synonyms`.
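
The synonym bookkeeping can be sketched as follows (a simplified, standalone version of the logic in `src/adapters/rasa.ts` after this commit; the helper name is illustrative):

```typescript
// Map a synonym key to its set of alternative values. As the rasa
// adapter does after this commit, a value is only registered when it
// differs from the synonym key itself.
const synonyms: { [key: string]: Set<string> } = {};

function addSynonym(synonym: string, value: string): void {
  if (!synonyms[synonym]) {
    synonyms[synonym] = new Set();
  }
  if (synonym !== value) {
    synonyms[synonym].add(value);
  }
}

addSynonym('some slot synonyms', 'synonym 1');
addSynonym('some slot synonyms', 'synonym 2');
// the key itself is skipped, so it is not listed as its own synonym
addSynonym('some slot synonyms', 'some slot synonyms');
```

This guard is exactly what the `next.synonym !== next.value` check in the adapter diff adds: without it, every synonym key would redundantly appear in its own `entity_synonyms` list.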

#### [Snips NLU](https://snips-nlu.readthedocs.io/en/latest/)
[Snips NLU](https://snips-nlu.readthedocs.io/en/latest/) is another great open source framework for NLU. One particular behavior of the Snips adapter is that you can define entity types for the slots. e.g.:

```
%[date search]('training':'1')
for @[date]
@[date]('entity': 'snips/datetime')
~[today]
~[tomorrow]
```

In the previous example, all `@[date]` values will be tagged with the `snips/datetime` entity type.

### NPM package

@@ -42,12 +91,13 @@ The generated dataset should be available next to your definition file.
Here are the full npm generator options:
```
npx chatito <pathToFile> --format=<format> --formatOptions=<formatOptions>
npx chatito <pathToFileOrDirectory> --format=<format> --formatOptions=<formatOptions> --outputPath=<outputPath>
```
- `<pathToFile>` path to the grammar file. e.g.: lightsChange.chatito
- `<pathToFileOrDirectory>` path to a `.chatito` file or a directory that contains chatito files. If it is a directory, it will recursively search for all `*.chatito` files inside and use them to generate the dataset. E.g.: `lightsChange.chatito` or `./chatitoFilesFolder`
- `<format>` Optional. `default`, `rasa` or `snips`
- `<formatOptions>` Optional. Path to a `.json` file that each adapter can optionally use
- `<outputPath>` Optional. The directory where the generated dataset will be saved. Defaults to the current directory.
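
The recursive lookup described for `<pathToFileOrDirectory>` works roughly like this standalone sketch (mirroring `chatitoFilesFromDir` in `src/bin.ts`; the function name here is illustrative):

```typescript
import * as fs from 'fs';
import * as path from 'path';

// Recursively collect every *.chatito file under a directory,
// as the CLI does when given a directory instead of a single file.
function findChatitoFiles(startPath: string): string[] {
  const results: string[] = [];
  for (const file of fs.readdirSync(startPath)) {
    const filename = path.join(startPath, file);
    if (fs.lstatSync(filename).isDirectory()) {
      results.push(...findChatitoFiles(filename));
    } else if (/\.chatito$/.test(filename)) {
      results.push(filename);
    }
  }
  return results;
}
```

Every file found this way is parsed and merged into a single training (and, if applicable, testing) dataset, so you can split large grammars across many files and folders.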
### Donate
Designing and maintaining chatito takes time and effort; if it was useful for you, please consider making a donation and share the abundance! :)
4 changes: 3 additions & 1 deletion src/adapters/rasa.ts
@@ -47,7 +47,9 @@ export async function adapter(dsl: string, formatOptions?: any) {
if (!synonyms[next.synonym]) {
synonyms[next.synonym] = new Set();
}
synonyms[next.synonym].add(next.value);
if (next.synonym !== next.value) {
synonyms[next.synonym].add(next.value);
}
}
acc.entities.push({
end: acc.text.length + next.value.length,
4 changes: 3 additions & 1 deletion src/adapters/snips.ts
@@ -68,7 +68,9 @@ export async function adapter(dsl: string, formatOptions?: any) {
if (!synonyms[u.synonym]) {
synonyms[u.synonym] = new Set();
}
synonyms[u.synonym].add(u.value);
if (u.synonym !== u.value) {
synonyms[u.synonym].add(u.value);
}
}
}
}
120 changes: 63 additions & 57 deletions src/bin.ts
@@ -3,39 +3,67 @@ import * as fs from 'fs';
import * as path from 'path';
import * as rasa from './adapters/rasa';
import * as snips from './adapters/snips';
import * as gen from './main';
import { ISentenceTokens, IUtteranceWriter } from './types';
import * as web from './adapters/web';
import * as utils from './utils';

// tslint:disable-next-line:no-var-requires
const argv = require('minimist')(process.argv.slice(2));

const adapters = { default: web, rasa, snips };

const workingDirectory = process.cwd();
const getExampleFilePath = (filename: string) => path.resolve(workingDirectory, filename);
const getFileWithPath = (filename: string) => path.resolve(workingDirectory, filename);

const writeFileStreams = (dir: string) => {
if (!fs.existsSync(dir)) {
fs.mkdirSync(dir);
const chatitoFilesFromDir = async (startPath: string, cb: (filename: string) => Promise<void>) => {
if (!fs.existsSync(startPath)) {
// tslint:disable-next-line:no-console
console.error(`Invalid directory: ${startPath}`);
process.exit(1);
}
let openWriteStreamsTraining: { [key: string]: fs.WriteStream } = {};
let openWriteStreamsTesting: { [key: string]: fs.WriteStream } = {};
const writeStream: IUtteranceWriter = (u: ISentenceTokens[], intentKey: string, isTrainingExample: boolean) => {
const openWriteStreams = isTrainingExample ? openWriteStreamsTraining : openWriteStreamsTesting;
let writer: fs.WriteStream;
if (openWriteStreams[intentKey]) {
writer = openWriteStreams[intentKey];
} else {
writer = fs.createWriteStream(path.resolve(dir, `${intentKey}_${isTrainingExample ? 'training' : 'testing'}.ndjson`));
openWriteStreams[intentKey] = writer;
const files = fs.readdirSync(startPath);
for (const file of files) {
const filename = path.join(startPath, file);
const stat = fs.lstatSync(filename);
if (stat.isDirectory()) {
await chatitoFilesFromDir(filename, cb);
} else if (/\.chatito$/.test(filename)) {
await cb(filename);
}
}
};

const adapterAccumulator = (format: 'default' | 'rasa' | 'snips', formatOptions?: any) => {
const trainingDataset: snips.ISnipsDataset | rasa.IRasaDataset | {} = {};
const testingDataset: any = {};
const adapterHandler = adapters[format];
if (!adapterHandler) {
throw new Error(`Invalid adapter: ${format}`);
}
return {
write: async (fullFilenamePath: string) => {
// tslint:disable-next-line:no-console
console.log(`Processing file: ${fullFilenamePath}`);
const dsl = fs.readFileSync(fullFilenamePath, 'utf8');
const { training, testing } = await adapterHandler.adapter(dsl, formatOptions);
utils.mergeDeep(trainingDataset, training);
utils.mergeDeep(testingDataset, testing);
},
save: (outputPath: string) => {
if (!fs.existsSync(outputPath)) {
fs.mkdirSync(outputPath);
}
const trainingJsonFilePath = path.resolve(outputPath, `${format}_dataset_training.json`);
fs.writeFileSync(trainingJsonFilePath, JSON.stringify(trainingDataset));
// tslint:disable-next-line:no-console
console.log(`Saved training dataset: ./${format}_dataset_training.json`);
if (Object.keys(testingDataset).length) {
const testingJsonFilePath = path.resolve(outputPath, `${format}_dataset_testing.json`);
fs.writeFileSync(testingJsonFilePath, JSON.stringify(testingDataset));
// tslint:disable-next-line:no-console
console.log(`Saved testing dataset: ./${format}_dataset_testing.json`);
}
}
writer.write(JSON.stringify(u) + '\n');
};
const closeStreams = () => {
Object.keys(openWriteStreamsTraining).forEach(k => openWriteStreamsTraining[k].end());
openWriteStreamsTraining = {};
Object.keys(openWriteStreamsTesting).forEach(k => openWriteStreamsTesting[k].end());
openWriteStreamsTesting = {};
};
return { writeStream, closeStreams };
};

(async () => {
@@ -51,44 +79,22 @@ const writeFileStreams = (dir: string) => {
console.error(`Invalid format argument: ${format}`);
process.exit(1);
}
const outputPath = argv.outputPath || __dirname;
try {
// parse the formatOptions argument
const dslFilePath = getExampleFilePath(configFile);
const file = fs.readFileSync(dslFilePath, 'utf8');
const splittedPath = path.posix.basename(dslFilePath).split('.');
if (!splittedPath.length || 'chatito' !== splittedPath[splittedPath.length - 1].toLowerCase()) {
throw new Error('Invalid filename extension.');
let formatOptions = null;
if (argv.formatOptions) {
formatOptions = JSON.parse(fs.readFileSync(path.resolve(argv.formatOptions), 'utf8'));
}
const keyName = path.basename(dslFilePath, '.chatito');
if (format === 'default') {
const directory = path.resolve(path.dirname(dslFilePath), keyName);
const fileWriterStreams = writeFileStreams(directory);
await gen.datasetFromString(file, fileWriterStreams.writeStream);
// tslint:disable-next-line:no-console
console.log(`DONE! - Examples generated by intent at ${directory} directory`);
fileWriterStreams.closeStreams();
const dslFilePath = getFileWithPath(configFile);
const isDirectory = fs.existsSync(dslFilePath) && fs.lstatSync(dslFilePath).isDirectory();
const accumulator = adapterAccumulator(format, formatOptions);
if (isDirectory) {
await chatitoFilesFromDir(dslFilePath, accumulator.write);
} else {
let formatOptions = null;
if (argv.formatOptions) {
formatOptions = JSON.parse(fs.readFileSync(path.resolve(argv.formatOptions), 'utf8'));
}
const adapter = format === 'rasa' ? rasa : snips;
const { training, testing } = await adapter.adapter(file, formatOptions);
const trainingJsonFileName = splittedPath
.slice(0, splittedPath.length - 1)
.concat([`_${format}_training.json`])
.join('');
const trainingJsonFilePath = path.resolve(path.dirname(dslFilePath), trainingJsonFileName);
fs.writeFileSync(trainingJsonFilePath, JSON.stringify(training, null, 1));
if (Object.keys(testing).length) {
const testingJsonFileName = splittedPath
.slice(0, splittedPath.length - 1)
.concat([`_${format}_testing.json`])
.join('');
const testingJsonFilePath = path.resolve(path.dirname(dslFilePath), testingJsonFileName);
fs.writeFileSync(testingJsonFilePath, JSON.stringify(testing, null, 1));
}
await accumulator.write(dslFilePath);
}
accumulator.save(outputPath);
} catch (e) {
// tslint:disable:no-console
if (e && e.message && e.location) {
52 changes: 26 additions & 26 deletions src/tests/bin.spec.ts
@@ -5,8 +5,8 @@ import * as path from 'path';
test('test npm command line generator for large example', () => {
const d = __dirname;
const generatedDir = path.resolve(`${d}/../../examples/dateBooking_large`);
const generatedTrainingFile = path.resolve(generatedDir, 'bookRestaurantsAtDatetime_training.ndjson');
const generatedTestingFile = path.resolve(generatedDir, 'bookRestaurantsAtDatetime_testing.ndjson');
const generatedTrainingFile = path.resolve(generatedDir, 'default_dataset_training.json');
const generatedTestingFile = path.resolve(generatedDir, 'default_dataset_testing.json');
const npmBin = path.resolve(`${d}/../bin.ts`);
const grammarFile = path.resolve(`${d}/../../examples/dateBooking_large.chatito`);
if (fs.existsSync(generatedTrainingFile)) {
@@ -18,14 +18,14 @@ test('test npm command line generator for large example', () => {
if (fs.existsSync(generatedDir)) {
fs.rmdirSync(generatedDir);
}
const child = cp.execSync(`node -r ts-node/register ${npmBin} ${grammarFile}`);
const child = cp.execSync(`node -r ts-node/register ${npmBin} ${grammarFile} --outputPath=${generatedDir}`);
expect(fs.existsSync(generatedDir)).toBeTruthy();
expect(fs.existsSync(generatedTrainingFile)).toBeTruthy();
expect(fs.existsSync(generatedTestingFile)).toBeFalsy();
const fileBuffer = fs.readFileSync(generatedTrainingFile);
const fileString = fileBuffer.toString();
const lines = fileString.split('\n');
expect(lines.length - 1).toEqual(1000);
const trainingDataset = JSON.parse(fs.readFileSync(generatedTrainingFile, 'utf8'));
expect(trainingDataset).not.toBeNull();
expect(trainingDataset.bookRestaurantsAtDatetime).not.toBeNull();
expect(trainingDataset.bookRestaurantsAtDatetime.length).toEqual(1000);
fs.unlinkSync(generatedTrainingFile);
fs.rmdirSync(generatedDir);
expect(fs.existsSync(generatedTrainingFile)).toBeFalsy();
@@ -35,8 +35,8 @@
test('test npm command line generator for medium example', () => {
const d = __dirname;
const generatedDir = path.resolve(`${d}/../../examples/citySearch_medium`);
const generatedTrainingFile = path.resolve(generatedDir, 'findByCityAndCategory_training.ndjson');
const generatedTestingFile = path.resolve(generatedDir, 'findByCityAndCategory_testing.ndjson');
const generatedTrainingFile = path.resolve(generatedDir, 'default_dataset_training.json');
const generatedTestingFile = path.resolve(generatedDir, 'default_dataset_testing.json');
const npmBin = path.resolve(`${d}/../bin.ts`);
const grammarFile = path.resolve(`${d}/../../examples/citySearch_medium.chatito`);
if (fs.existsSync(generatedTrainingFile)) {
@@ -48,30 +48,30 @@
if (fs.existsSync(generatedDir)) {
fs.rmdirSync(generatedDir);
}
const child = cp.execSync(`node -r ts-node/register ${npmBin} ${grammarFile}`);
const child = cp.execSync(`node -r ts-node/register ${npmBin} ${grammarFile} --outputPath=${generatedDir}`);
expect(fs.existsSync(generatedDir)).toBeTruthy();
expect(fs.existsSync(generatedTrainingFile)).toBeTruthy();
const fileBuffer1 = fs.readFileSync(generatedTrainingFile);
const fileString1 = fileBuffer1.toString();
const lines1 = fileString1.split('\n');
expect(lines1.length - 1).toEqual(1000);
fs.unlinkSync(generatedTrainingFile);
expect(fs.existsSync(generatedTrainingFile)).toBeFalsy();
expect(fs.existsSync(generatedTestingFile)).toBeTruthy();
const fileBuffer2 = fs.readFileSync(generatedTestingFile);
const fileString2 = fileBuffer2.toString();
const lines2 = fileString2.split('\n');
expect(lines2.length - 1).toEqual(100);
const trainingDataset = JSON.parse(fs.readFileSync(generatedTrainingFile, 'utf8'));
expect(trainingDataset).not.toBeNull();
expect(trainingDataset.findByCityAndCategory).not.toBeNull();
expect(trainingDataset.findByCityAndCategory.length).toEqual(1000);
const testingDataset = JSON.parse(fs.readFileSync(generatedTestingFile, 'utf8'));
expect(testingDataset).not.toBeNull();
expect(testingDataset.findByCityAndCategory).not.toBeNull();
expect(testingDataset.findByCityAndCategory.length).toEqual(100);
fs.unlinkSync(generatedTrainingFile);
fs.unlinkSync(generatedTestingFile);
fs.rmdirSync(generatedDir);
expect(fs.existsSync(generatedTrainingFile)).toBeFalsy();
expect(fs.existsSync(generatedTestingFile)).toBeFalsy();
expect(fs.existsSync(generatedDir)).toBeFalsy();
});

test('test npm command line generator for rasa medium example', () => {
const d = __dirname;
const generatedTrainingFile = path.resolve(`${d}/../../examples/citySearch_medium_rasa_training.json`);
const generatedTestingFile = path.resolve(`${d}/../../examples/citySearch_medium_rasa_testing.json`);
const generatedTrainingFile = path.resolve(`${d}/../../examples/rasa_dataset_training.json`);
const generatedTestingFile = path.resolve(`${d}/../../examples/rasa_dataset_testing.json`);
const npmBin = path.resolve(`${d}/../bin.ts`);
const grammarFile = path.resolve(`${d}/../../examples/citySearch_medium.chatito`);
if (fs.existsSync(generatedTrainingFile)) {
@@ -80,7 +80,7 @@
if (fs.existsSync(generatedTestingFile)) {
fs.unlinkSync(generatedTestingFile);
}
const child = cp.execSync(`node -r ts-node/register ${npmBin} ${grammarFile} --format=rasa`);
const child = cp.execSync(`node -r ts-node/register ${npmBin} ${grammarFile} --format=rasa --outputPath=${d}/../../examples`);
expect(fs.existsSync(generatedTrainingFile)).toBeTruthy();
const dataset = JSON.parse(fs.readFileSync(generatedTrainingFile, 'utf8'));
expect(dataset).not.toBeNull();
@@ -100,8 +100,8 @@ test('test npm command line generator for rasa medium example', () => {

test('test npm command line generator for snips medium example', () => {
const d = __dirname;
const generatedTrainingFile = path.resolve(`${d}/../../examples/citySearch_medium_snips_training.json`);
const generatedTestingFile = path.resolve(`${d}/../../examples/citySearch_medium_snips_testing.json`);
const generatedTrainingFile = path.resolve(`${d}/../../examples/snips_dataset_training.json`);
const generatedTestingFile = path.resolve(`${d}/../../examples/snips_dataset_testing.json`);
const npmBin = path.resolve(`${d}/../bin.ts`);
const grammarFile = path.resolve(`${d}/../../examples/citySearch_medium.chatito`);
if (fs.existsSync(generatedTrainingFile)) {
Expand All @@ -110,7 +110,7 @@ test('test npm command line generator for snips medium example', () => {
if (fs.existsSync(generatedTestingFile)) {
fs.unlinkSync(generatedTestingFile);
}
const child = cp.execSync(`node -r ts-node/register ${npmBin} ${grammarFile} --format=snips`);
const child = cp.execSync(`node -r ts-node/register ${npmBin} ${grammarFile} --format=snips --outputPath=${d}/../../examples`);
expect(fs.existsSync(generatedTrainingFile)).toBeTruthy();
const dataset = JSON.parse(fs.readFileSync(generatedTrainingFile, 'utf8'));
expect(dataset).not.toBeNull();
