
A new cli version that acts as a .sql dump filter for anonymization #156

Open
mbeynon opened this issue Feb 18, 2022 · 4 comments

Comments

@mbeynon
Contributor

mbeynon commented Feb 18, 2022

Our database uses TimescaleDB in the same PostgreSQL database as the tables I need to anonymize. The main problem is that you have to be very careful how you dump and restore to avoid errors.

If I pg_dump the database and look at the output, there are internal tables such as _timescaledb_config.bgw_job or _timescaledb_catalog.compression_algorithm that pg_dump handles specially to avoid copying static data that cannot later be restored. pg_datanymizer has its own internal logic to decide what to pull, and it makes different choices. Excluding those tables is not enough; I still end up with constraint violations on restore.

The other detail is that the data I'm starting with is -Fc (custom) format pg_dump files, so I currently have to load them into a PostgreSQL instance just for datanymizer to work on them. I think this use case could benefit from pg_datanymizer acting as an anonymization filter, for example:

$ pg_restore db.dump | pg_datanymizer -i - -c config.yaml -f anon-db.sql

This would need a new dumper that can handle .sql dumps on stdin, but it could reuse all the same anonymization code by parsing COPY rows one at a time. The main benefit is that the resulting .sql will load without problems, because it is the pg_dump output with only the COPY sections modified to contain anonymized data. A side benefit is that it eliminates the extra step of loading the dump into a dummy PostgreSQL server just to pull the data back out and anonymize it.
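To illustrate why per-row COPY parsing is tractable: in a COPY ... FROM stdin; block, each data row is one line of tab-separated fields with NULL encoded as \N, so a filter can rewrite individual columns without touching anything else in the dump. A minimal Rust sketch of the idea (anonymize_row and its parameters are hypothetical names for illustration, not part of any existing code):

```rust
// Sketch: rewrite one column of a row from a COPY ... FROM stdin; block.
// COPY text format: fields are tab-separated, NULL is the literal \N.
fn anonymize_row(line: &str, col: usize, replacement: &str) -> String {
    let mut fields: Vec<&str> = line.split('\t').collect();
    // Replace the target column, but leave NULLs (\N) untouched.
    if col < fields.len() && fields[col] != "\\N" {
        fields[col] = replacement;
    }
    fields.join("\t")
}

fn main() {
    // Example row: id, email, name
    let row = "42\talice@example.com\tAlice";
    println!("{}", anonymize_row(row, 1, "user42@anon.test"));
}
```

A real filter would of course look up the table's column list from the COPY header and apply the configured rules per column, and it would also have to respect COPY's escape sequences inside field values.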

Has anyone considered this before?

@evgeniy-r
Member

No. But it might be an interesting idea.

@mbeynon
Contributor Author

mbeynon commented Mar 1, 2022

Here's a prototype for your feedback.

https://github.com/mbeynon/datanymizer/tree/filter-input-stream-mode

I still need to add docs, changelog, tests, etc., but it's working for my test cases on large dbs and doesn't cause problems with TimescaleDB. It works really well when applied to the production db dumps we're already collecting:

$ pg_restore -f - db.dump | pg_datanymizer -i - -c config.yaml -f anon-db.sql

I had to work around the existing trait interfaces to add a new SqlText dumper that works solely by scanning an input .sql stream. I tried to minimize the changes to existing code; much of the Postgres dumper is reused. I also added a spinner indicator for status.

The biggest difference between the SqlText dumper and the Postgres dumper is that it's stream based: I watch all the data go by and choose what to transform. There are no data or schema exclusions/inclusions, and none of the dependency-following logic is used. It's just an inline transformer.
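The inline-transformer pass can be sketched as a small state machine: copy every line through verbatim, switch into data mode on a COPY ... FROM stdin; header, transform rows until the \. terminator, then switch back. This is only an illustration of the idea under those assumptions (filter_dump and its transform callback are hypothetical names, not the prototype's actual API):

```rust
use std::io::{BufRead, Write};

// Sketch: stream a .sql dump through, passing schema statements
// untouched and rewriting only the rows inside COPY blocks.
fn filter_dump<R: BufRead, W: Write>(input: R, mut out: W,
                                     transform: impl Fn(&str) -> String) {
    let mut in_copy = false;
    for line in input.lines() {
        let line = line.expect("read error");
        if in_copy {
            if line == "\\." {
                // End-of-data marker for a COPY block.
                in_copy = false;
                writeln!(out, "{}", line).unwrap();
            } else {
                writeln!(out, "{}", transform(&line)).unwrap();
            }
        } else {
            if line.starts_with("COPY ") && line.ends_with("FROM stdin;") {
                in_copy = true;
            }
            // Schema statements and COPY headers pass through verbatim.
            writeln!(out, "{}", line).unwrap();
        }
    }
}

fn main() {
    let dump = "CREATE TABLE t (x text);\nCOPY public.t (x) FROM stdin;\nsecret\n\\.\n";
    let mut out = Vec::new();
    filter_dump(dump.as_bytes(), &mut out, |_| "REDACTED".to_string());
    print!("{}", String::from_utf8(out).unwrap());
}
```

Because everything outside the COPY blocks is emitted byte-for-byte, the output keeps whatever special handling pg_dump applied to TimescaleDB's internal tables.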

Let me know what you think, and I'll clean it up a bit and create a PR.

@kapilt

kapilt commented Mar 15, 2022

@mbeynon thanks, I'm looking for this as well.

@mbeynon
Contributor Author

mbeynon commented Mar 16, 2022

Feel free to try the prototype fork and let me know what you think. It works as shown above. I'll clean it up, add docs, and open a PR against the main repo when I get a chance.
