Merge branch 'devel' into master
teoincontatto authored Sep 11, 2017
2 parents e788b18 + 2831cc9 commit 38e0fdf
Showing 49 changed files with 1,829 additions and 584 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -20,4 +20,5 @@ buildNumber.properties
hs_err*.log
/documentation/static
/documentation/site
.DS_Store
.DS_Store
.factorypath
2 changes: 1 addition & 1 deletion .travis.yml
@@ -32,7 +32,7 @@ script: |
then
bash .travis/build-packages
else
mvn -Psafer -Pintegration -B -e -T 1C -Dcheckstyle.consoleOutput=false verify
mvn -Psafer -Pintegration -B -e -T 1C -Dcheckstyle.consoleOutput=false --update-snapshots verify
fi
after_success:
51 changes: 30 additions & 21 deletions documentation/docs/about.md
@@ -1,42 +1,51 @@
<h1>What is ToroDB Stampede?</h1>

Connected to a MongoDB replica set, ToroDB Stampede is able to replicate the NoSQL data into a relational backend (right now the only available backend is PostgreSQL) using the oplog.
ToroDB Stampede is a replication and mapping technology to maintain a live mirror of a MongoDB database (or [sub-set](configuration/filtered-replication.md)) in a SQL database. ToroDB Stampede uses [MongoDB's replica set oplog](https://docs.mongodb.com/manual/core/replica-set-oplog/) to keep track of the modifications in MongoDB.


![ToroDB Stampede Structure](images/toro_stampede_structure.jpg)

There are other solutions that are able to store the JSON document in a relational table using PostgreSQL JSON support, but it doesn't solve the real problem of 'how to really use that data'.
ToroDB Stampede replicates the document structure in different relational tables and stores the document data in different tuples using those tables.
During replication, ToroDB Stampede transforms MongoDB's JSON documents into a [relational schema](relational-schema) that allows certain queries (such as aggregates) to complete faster than when run against JSON documents.

![Mapping example](images/toro_stampede_mapping.jpg)

With the relational structure, some given problems from NoSQL solutions are easier to solve, such as aggregated query execution in an admissible time.

## ToroDB Stampede limitations
## Current Limitations

### SQL Target

Currently, ToroDB Stampede only supports the free, open-source database [PostgreSQL](https://www.postgresql.org/) as a target.

### MongoDB

Not everything could be perfect and there are some known limitations from ToroDB Stampede.
ToroDB Stampede only supports MongoDB 3.2 and 3.4 at the moment.

* The only current MongoDB version supported is 3.2.
* [Capped collections](https://docs.mongodb.com/manual/core/capped-collections/) usage is not supported.
* If character `\0` is used in a string it will be escaped because PostgreSQL doesn't support it.
* Command `applyOps` reception will stop the replication server.
* Command `collMod` reception will be ignored.
The following MongoDB features are not yet supported:

In addition to the previous limitations, just some kind of indexes are supported:
* [Capped collections](https://docs.mongodb.com/manual/core/capped-collections/)
* The [collMod](https://docs.mongodb.com/manual/reference/command/collMod/) command
* The [applyOps](https://docs.mongodb.com/manual/reference/command/applyOps/) command (will stop the replication server)
* The character `\0` is escaped in strings because PostgreSQL doesn't support it.

* Index of type ascending and descending (those that ends in 1 and -1 when declared in mongo)
* Simple indexes of one key
* All keys path with the exception to the paths resolving in scalar value (eg: `db.test.createIndex({"a": 1})` will not index value of key `a` for the document `{"a": [1,2,3]}`)
The automatic creation of indexes in the target database is currently limited as follows:

* Only simple one-key indexes (ascending and descending - those that end in 1 and -1 when declared in MongoDB)
* Index properties `sparse` and `background` are ignored
* All keys path with the exception to the paths resolving in scalar value (e.g.: `db.test.createIndex({"a": 1})` will not index value of key `a` for the document `{"a": [1,2,3]}`)

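The first restriction in the list above can be pictured with a small Python check. This is only an illustrative sketch, not ToroDB Stampede code: the function name `is_simple_one_key_index` is hypothetical, and it assumes an index is described by the same key document that is passed to `createIndex` in MongoDB.

```python
def is_simple_one_key_index(keys: dict) -> bool:
    """Return True for a single-key index declared with 1 or -1.

    Hypothetical helper for illustration only; it mirrors the documented
    rule, not ToroDB Stampede's actual implementation.
    """
    if len(keys) != 1:
        return False  # empty or compound key documents do not qualify
    (direction,) = keys.values()
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(direction, bool):
        return False
    return direction in (1, -1)


print(is_simple_one_key_index({"a": 1}))           # ascending, one key
print(is_simple_one_key_index({"a": 1, "b": -1}))  # compound index
print(is_simple_one_key_index({"a": "text"}))      # other index type
```

Under these assumptions only the first call reports a supported index.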
## Incompatible Document Designs

The main benefit of ToroDB Stampede is to flatten nested documents into tables. There are some patterns that cause the relational tables to end up with a very high number of columns—possibly up to the limitation of the SQL backend (in PostgreSQL about 1600 columns).

## When ToroDB Stampede might not be the right choice
* **Pattern "key as values"**
Document key-names are turned into columns of tables. If you store values in the key-names, the table will have as many columns as you have distinct key-names.

As good as Stampede is, there are certain use-cases for which it is a bad choice or simply will not work:
* **Too many fields per document**
If several of them are optional and only some of them appear in each document, there might be thousands of columns.

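To see why the "key as values" pattern is problematic, the sketch below counts the distinct top-level key names in a few documents. It is a rough illustration that assumes, as described above, that each distinct key name becomes one column. The helper `distinct_keys` and the sample documents are made up for this example, and `_id` fields are omitted for brevity.

```python
def distinct_keys(documents):
    """Collect the distinct top-level key names across all documents."""
    keys = set()
    for doc in documents:
        keys.update(doc.keys())
    return keys


# "Key as values": the date is stored in the key name.
visits_by_key = [
    {"2017-09-01": 12, "2017-09-02": 7},
    {"2017-09-03": 4},
]

# The same data with the date stored as a value instead.
visits_by_value = [
    {"day": "2017-09-01", "count": 12},
    {"day": "2017-09-02", "count": 7},
    {"day": "2017-09-03", "count": 4},
]

print(len(distinct_keys(visits_by_key)))    # 3, and it grows with every new day
print(len(distinct_keys(visits_by_value)))  # 2, no matter how many days exist
```

In the first design the number of key names grows with the data, while in the second it stays constant.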
* Pattern "key as values". When keys contain values, potentially thousands of different values may appear in keys, leading to an equally high number of columns
(which might break with some RDBMS which have limits to the number of columns per row, see next point) and/or tables, which might be terribly inconvenient and slow.
* Too many fields per document, several of them optional and only some appearing per document, which might lead to thousands of columns.
Some RDBMSs do not support such a high number of columns. For PostgreSQL this limit is around 1600 columns.
[TODO]: <> ('All keys path with the exception to the paths resolving in scalar value' might be wrong (given the example that resolved to an array). Might mean "resolving in non-scalar values"?)

[TODO]: <> (Which PostgreSQL version is required?)

[TODO]: <> (not supported types, we need a list)

50 changes: 48 additions & 2 deletions documentation/docs/appendix.md
@@ -20,6 +20,10 @@ Usage: `torodb-stampede [options]`
| --backend-ssl | Enable SSL for backend connection. |
| --backend-user | The user that will be used to connect. |
| -c, --conf | Configuration file in YAML format. |
| --offHeapBuffer-enabled | If set to `true`, the off heap buffer system is enabled; if `false`, it is disabled. |
| --offHeapBuffer-path | Absolute path to locate the off heap buffer files. |
| --offHeapBuffer-rollcycle | The Rolling cycle determines how often you create a new data file. The values can be: `DAILY`, `HOURLY` or `MINUTELY`. |
| --offHeapBuffer-maxFiles | Max number of files to store for the off heap buffer. |
| --connection-pool-size | Maximum number of connections to establish to the database. It must be greater than or equal to 3. |
| --connection-pool-timeout | The timeout in milliseconds after which retrieving a connection from the pool will fail. |
| --enable-metrics | Enable metrics system. |
@@ -61,12 +65,23 @@ Another way to configure the system is through configuration file or setting con
| /logging/file | Overwrites the default value for the log output file path. |
| /metricsEnabled | With value `true` enables the metrics system, and `false` disables it. |

### Off Heap Buffer configuration

| Parameter |  |
|--------|-|
| /offHeapBuffer/enabled | If set to `true`, the off heap buffer system is enabled; if `false`, it is disabled. |
| /offHeapBuffer/path | Absolute path to locate the off heap buffer files. |
| /offHeapBuffer/rollCycle | The Rolling cycle determines how often you create a new data file. The values can be: `DAILY`, `HOURLY` or `MINUTELY`. | 
| /offHeapBuffer/maxFiles | Max number of files to store for the off heap buffer. |

### Replication configuration

| Parameter |  |
|--------|-|
| /replication/replSetName | Overwrites the default value of the MongoDB Replica Set used for replication. |
| /replication/syncSource | Overwrites the default connection address for the MongoDB Replica Set used for replication (host:port) |
| /replication/include/`<string>` | A map of databases and/or collections and/or indexes to exclusively replicate.<ul><li>Each entry represents a database name under which a list of collection names can be specified.</li><li>Each collection can contain a list of indexes, each formed by one or more of these fields:<ul><li>name=<string> the index name</li><li>unique=<boolean> true when the index is unique, false otherwise</li><li>keys/<string>=<string> the name of the indexed field and the index direction or type</li></ul></li><li>Character '\*' can be used to denote "any-character" and character '\' to escape characters.</li></ul> |
| /replication/exclude/`<string>` | A map of databases and/or collections and/or indexes to exclude from replication.<ul><li>Each entry represents a database name under which a list of collection names can be specified.</li><li>Each collection can contain a list of indexes, each formed by one or more of these fields:<ul><li>name=<string> the index name</li><li>unique=<boolean> true when the index is unique, false otherwise</li><li>keys/<string>=<string> the name of the indexed field and the index direction or type</li></ul></li><li>Character '\*' can be used to denote "any-character" and character '\' to escape characters.</li></ul> |

### Replication SSL configuration

@@ -89,10 +104,41 @@ Another way to configure the system is through configuration file or setting con
| /replication/auth/mode | Specifies the authentication mode, that can take one of the next values.<ul><li>disabled: Disable authentication mechanism.</li><li>negotiate: The client will negotiate best mechanism to authenticate. With server version 3.0 or above, the driver will authenticate using the SCRAM-SHA-1 mechanism. Otherwise, the driver will authenticate using the Challenge Response mechanism.</li><li>cr: Challenge Response authentication</li><li>x509: X.509 authentication</li><li>scram_sha1: SCRAM-SHA-1 SASL authentication</li></ul> |
| /replication/auth/user | User to be authenticated. |
| /replication/auth/source | The source database where the user is present. |
| /replication/include/`<string>` | A map of databases and/or collections and/or indexes to exclusively replicate.<ul><li>Each entry represents a database name under which a list of collection names can be specified.</li><li>Each collection can contain a list of indexes, each formed by one or more of these fields:<ul><li>name=<string> the index name</li><li>unique=<boolean> true when the index is unique, false otherwise</li><li>keys/<string>=<string> the name of the indexed field and the index direction or type</li></ul></li><li>Character '\*' can be used to denote "any-character" and character '\' to escape characters.</li></ul> |
| /replication/exclude/`<string>` | A map of databases and/or collections and/or indexes to exclude from replication.<ul><li>Each entry represents a database name under which a list of collection names can be specified.</li><li>Each collection can contain a list of indexes, each formed by one or more of these fields:<ul><li>name=<string> the index name</li><li>unique=<boolean> true when the index is unique, false otherwise</li><li>keys/<string>=<string> the name of the indexed field and the index direction or type</li></ul></li><li>Character '\*' can be used to denote "any-character" and character '\' to escape characters.</li></ul> |
| /replication/mongopassFile | Path to the file with MongoDB access configuration in `.pgpass` syntax. |

### Replication configuration for shards

| Parameter |  |
|--------|-|
| /replication/shards/<index>/replSetName | Overwrites the default value of the MongoDB Replica Set used for replication. |
| /replication/shards/<index>/syncSource | Overwrites the default connection address for the MongoDB Replica Set used for replication (host:port). If this parameter is specified leave empty `/replication/syncSource`. |

### Replication SSL configuration for shards

Any parameter not specified here will default to the value specified in the configuration under `/replication/ssl`.

| Parameter |  |
|--------|-|
| /replication/shards/<index>/ssl/enabled | If `false`, the SSL/TLS layer is disabled; if `true`, it is enabled. |
| /replication/shards/<index>/ssl/allowInvalidHostnames | If `true` hostname verification is disabled, if `false` it is enabled. | 
| /replication/shards/<index>/ssl/trustStoreFile | The path to the Java Key Store file containing the Certification Authority. If CAFile is specified it will be used instead. | 
| /replication/shards/<index>/ssl/trustStorePassword | The password of the Java Key Store file containing the Certification Authority. |
| /replication/shards/<index>/ssl/keyStoreFile | The path to the Java Key Store file containing the certificate and private key used to authenticate client. |
| /replication/shards/<index>/ssl/keyStorePassword | The password of the Java Key Store file containing the certificate and private key used to authenticate client. |
| /replication/shards/<index>/ssl/keyPassword | The password of the private key used to authenticate client. |
| /replication/shards/<index>/ssl/fipsMode | If `true` enable FIPS 140-2 mode. |
| /replication/shards/<index>/ssl/caFile | The path to the Certification Authority in PEM format. |

### Replication authentication configuration for shards

Any parameter not specified here will default to the value specified in the configuration under `/replication/auth`.

| Parameter |  |
|--------|-|
| /replication/shards/<index>/auth/mode | Specifies the authentication mode, that can take one of the next values.<ul><li>disabled: Disable authentication mechanism.</li><li>negotiate: The client will negotiate best mechanism to authenticate. With server version 3.0 or above, the driver will authenticate using the SCRAM-SHA-1 mechanism. Otherwise, the driver will authenticate using the Challenge Response mechanism.</li><li>cr: Challenge Response authentication</li><li>x509: X.509 authentication</li><li>scram_sha1: SCRAM-SHA-1 SASL authentication</li></ul> |
| /replication/shards/<index>/auth/user | User to be authenticated. |
| /replication/shards/<index>/auth/source | The source database where the user is present. |

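The fallback rule for shards (any parameter not specified for a shard defaults to the value under `/replication/ssl` or `/replication/auth`) can be sketched as a dictionary merge. This is an illustration only: `effective_shard_section` is a hypothetical helper, the configuration is assumed to be already parsed into nested Python dictionaries, and the sample values are invented.

```python
def effective_shard_section(replication: dict, shard_index: int, section: str) -> dict:
    """Merge a shard's `ssl` or `auth` settings over the replication-wide ones."""
    defaults = replication.get(section, {})
    shard = replication["shards"][shard_index]
    overrides = shard.get(section, {})
    # Later keys win, so shard-specific values replace the defaults.
    return {**defaults, **overrides}


# Invented sample configuration, for illustration only.
replication = {
    "ssl": {"enabled": True, "allowInvalidHostnames": False},
    "shards": [
        {"replSetName": "shard1", "syncSource": "localhost:27020"},
        {
            "replSetName": "shard2",
            "syncSource": "localhost:27030",
            "ssl": {"allowInvalidHostnames": True},
        },
    ],
}

print(effective_shard_section(replication, 0, "ssl"))
print(effective_shard_section(replication, 1, "ssl"))
```

In this sketch the first shard inherits both SSL settings unchanged, while the second shard keeps `enabled` from the shared section and overrides only `allowInvalidHostnames`.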
### PostgreSQL configuration

| Parameter |  |
16 changes: 16 additions & 0 deletions documentation/docs/configuration/bufferOffHeap.md
@@ -0,0 +1,16 @@
<h1>Off Heap Buffer</h1>

ToroDB can use an off-heap replication buffer to store the oplog operations fetched from the remote MongoDB; ToroDB then reads the operations from this buffer to process them. This avoids having to go back to recovery mode when MongoDB handles a high volume of operations.
The buffer is disabled by default because it takes up extra disk space, but we recommend enabling it if you have many operations on MongoDB.

The recommended configuration is:

```yaml
offHeapBuffer:
  enabled: true
  path: "/tmp/torodb"
  rollCycle: "DAILY"
  maxFiles: 5
```

The [Options Reference](options-reference.md#off-heap-buffer-configuration) explains these settings in detail.
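
As a rough aid for choosing `rollCycle` and `maxFiles`, the following sketch multiplies the length of one roll cycle by the number of files. It rests on an assumption that is not stated in this documentation: that each data file covers exactly one roll cycle, so that `maxFiles` files span `maxFiles` cycles. The helper name `buffered_time_span` is hypothetical.

```python
from datetime import timedelta

# Length of one roll cycle for each documented `rollCycle` value.
ROLL_CYCLE_LENGTH = {
    "DAILY": timedelta(days=1),
    "HOURLY": timedelta(hours=1),
    "MINUTELY": timedelta(minutes=1),
}


def buffered_time_span(roll_cycle: str, max_files: int) -> timedelta:
    """Time span covered by `max_files` files of one roll cycle each (assumed model)."""
    return ROLL_CYCLE_LENGTH[roll_cycle] * max_files


print(buffered_time_span("DAILY", 5))  # 5 days, 0:00:00
```

Under that assumption, the recommended configuration above would span about five days of buffer files.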
158 changes: 158 additions & 0 deletions documentation/docs/configuration/filtered-replication.md
@@ -0,0 +1,158 @@
<h1>Configuring Filtered Replication</h1>

By default, ToroDB Stampede replicates all databases and collections available in your MongoDB. You can configure ToroDB Stampede to limit the replication by specifying which databases, collections, and indexes to include in or exclude from replication.

Note that exclusions always override inclusions—i.e. if you exclude something it will not be replicated even if you include the same thing.
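
The precedence rule can be modeled in a few lines of Python. This is a simplified sketch, not ToroDB Stampede's implementation: the function names are hypothetical, the filters are assumed to be dictionaries that map a database name to `"*"`, a single collection name, or a collection list, and only the literal `"*"` is treated as a wildcard.

```python
def _matches(flt, database, collection):
    """Return True if the filter names this database/collection pair."""
    if not flt or database not in flt:
        return False
    entry = flt[database]
    if entry == "*":
        return True  # the whole database is selected
    if isinstance(entry, str):
        return entry == collection  # a single collection name
    return collection in entry  # a list (or mapping) of collections


def is_replicated(database, collection, include=None, exclude=None):
    """Exclusions are checked first, so they always override inclusions."""
    if _matches(exclude, database, collection):
        return False
    if not include:
        return True  # without an include filter, everything is replicated
    return _matches(include, database, collection)


include = {"film": "*"}
exclude = {"film": "performer"}

print(is_replicated("film", "title", include, exclude))      # True
print(is_replicated("film", "performer", include, exclude))  # False
print(is_replicated("music", "title", include, exclude))     # False
```

Checking the exclude filter before the include filter is what makes an exclusion win even when the same item is also included.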

!!! danger "Changing Include and Exclude Configuration"
    ToroDB Stampede does not keep track of changes to the **configuration**.

    If you stop ToroDB Stampede, remove a database or collection inclusion, and restart ToroDB Stampede, the replication process will replicate operations on this database/collection without replicating the data that was previously not included, which leads to an inconsistent state. In such cases, it is recommended to delete the ToroDB Stampede database and restart the whole replication process from scratch.

    The same is true for indexes: ToroDB Stampede only creates indexes during the initial recovery process and when a create index command is found in the oplog during replication, not because of configuration changes.

!!! note "Demo Setup"
    The following examples assume two databases (`film` and `music`), each of which has two collections (`title` and `performer`).


## Include: Select a Database, its Collections and Indexes

### Databases

To limit the replication to a single database (`<database name>`), use the `include` setting in the `replication` section of the configuration file:

```yaml
replication:
  replSetName: rs1
  syncSource: localhost:27017
  include:
    <database name>: "*"
```

### Collections

To further limit the replication to selected collections of this database, list them below the database:

```yaml
replication:
  replSetName: rs1
  syncSource: localhost:27017
  include:
    <database name>:
      - <collection name 1>
      - <collection name 2>
```

### Indexes

Likewise, you can limit the indexes that are automatically created in the relational backend by ToroDB Stampede:

```yaml
replication:
  replSetName: rs1
  syncSource: localhost:27017
  include:
    <database name>:
      <collection name>:
        - name: <index name>
```

The following example limits the replication to the `performer` collection in the `film` database and only creates the `city` index in the SQL backend:

```yaml
replication:
  replSetName: rs1
  syncSource: localhost:27017
  include:
    film:
      performer:
        - name: "city"
```

## Exclude: Ignore a Database, Collections, or Indexes

### Databases

To exclude a database (`<database name>`) from replication, use the `exclude` setting in the `replication` section of the configuration file:

```yaml
replication:
  replSetName: rs1
  syncSource: localhost:27017
  exclude:
    <database name>: "*"
```

### Collections

If you want to exclude some collections (but still replicate the others), list them below the database:

```yaml
replication:
  replSetName: rs1
  syncSource: localhost:27017
  exclude:
    <database name>:
      - <collection name 1>
      - <collection name 2>
```

### Indexes

Some indexes created in MongoDB for OLTP operations might be useless for OLAP and analytics operations in the SQL backend. You can easily exclude them by listing the indexes that should not be created in the SQL backend below the respective collection:

```yaml
replication:
  replSetName: rs1
  syncSource: localhost:27017
  exclude:
    <database name>:
      <collection name>:
        - name: <index name>
```

The following example only excludes a single index from the replication: namely, the index `city` on the collection `performer` in the `film` database:

```yaml
replication:
  replSetName: rs1
  syncSource: localhost:27017
  exclude:
    film:
      performer:
        - name: city
```

!!! note "Unsupported index types are always excluded"
    ToroDB Stampede generally ignores MongoDB indexes that are not yet supported (text, 2dsphere, 2d, hashed, ...).

## Mixing Include and Exclude

You can combine the `include` and `exclude` sections to limit the replication to a single database but exclude collections and/or indexes from it.

The following example only replicates the `film` database, but excludes the collection `performer` from it.

```yaml
replication:
  replSetName: rs1
  syncSource: localhost:27017
  include:
    film: "*"
  exclude:
    film: "performer"
```

The next example limits the replication to the `performer` collection from the `film` database but excludes the index `city` from that collection:

```yaml
replication:
  replSetName: rs1
  syncSource: localhost:27017
  include:
    film:
      performer: "*"
  exclude:
    film:
      performer:
        - name: "city"
```