
A new cli version that acts as a .sql dump filter for anonymization #156

Open
mbeynon opened this issue Feb 18, 2022 · 4 comments

Comments

@mbeynon
Contributor

mbeynon commented Feb 18, 2022

Our database uses TimescaleDB in the same PostgreSQL database as the tables I need to anonymize. The main problem is that you have to be very careful how you dump and restore to avoid errors.

If I pg_dump the database and look at the output, there are internal tables such as _timescaledb_config.bgw_job or _timescaledb_catalog.compression_algorithm that pg_dump handles specially to avoid copying static data that cannot later be restored. pg_datanymizer has its own internal logic to decide what to pull, and it makes different choices. Excluding those tables is not enough; I still end up with constraint violations on restore.

The other detail is that the data I'm starting with is -Fc (custom) format pg_dump files, so I currently have to load them into a PostgreSQL instance just for datanymizer to work on them. I think this use case could benefit from pg_datanymizer acting as an anonymization filter, for example:

$ pg_restore db.dump | pg_datanymizer -i - -c config.yaml -f anon-db.sql

This would need a new dumper that can handle .sql dumps on stdin, but it could reuse all the same anonymization code by parsing COPY rows one at a time. The main benefit is that the resulting .sql will load without problems, because it is the pg_dump output with only the COPY sections modified to contain anonymized data. A side benefit is that it eliminates the extra step of loading the dump into a dummy PostgreSQL server just to pull the data back out and anonymize it.
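To illustrate why per-row COPY parsing is tractable: in a COPY ... FROM stdin; block, each data row is one line of tab-separated fields with NULL encoded as \N, so a filter can rewrite individual columns without touching anything else in the dump. A minimal Rust sketch of the idea (anonymize_row and its parameters are hypothetical names for illustration, not part of any existing code):

```rust
// Sketch: rewrite one column of a row from a COPY ... FROM stdin; block.
// COPY text format: fields are tab-separated, NULL is the literal \N.
fn anonymize_row(line: &str, col: usize, replacement: &str) -> String {
    let mut fields: Vec<&str> = line.split('\t').collect();
    // Replace the target column, but leave NULLs (\N) untouched.
    if col < fields.len() && fields[col] != "\\N" {
        fields[col] = replacement;
    }
    fields.join("\t")
}

fn main() {
    // Example row: id, email, name
    let row = "42\talice@example.com\tAlice";
    println!("{}", anonymize_row(row, 1, "user42@anon.test"));
}
```

A real filter would of course look up the table's column list from the COPY header and apply the configured rules per column, and it would also have to respect COPY's escape sequences inside field values.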

Has anyone considered this before?

@evgeniy-r
Member

No. But it might be an interesting idea.

@mbeynon
Contributor Author

mbeynon commented Mar 1, 2022

Here's a prototype for your feedback.

https://github.com/mbeynon/datanymizer/tree/filter-input-stream-mode

I still need to add docs, changelog, tests, etc., but it's working for my test cases on large dbs and doesn't cause problems with TimescaleDB. It works really well when applied to the production db dumps we're already collecting:

$ pg_restore -f - db.dump | pg_datanymizer -i - -c config.yaml -f anon-db.sql

I had to work around the existing trait interfaces to add a new SqlText dumper that works solely by scanning an input .sql stream. I tried to minimize the changes to existing code; much of the Postgres dumper is reused. I also added a spinner indicator for status.

The biggest difference between the SqlText dumper and the Postgres dumper is that it's stream based: I watch all the data go by and choose what to transform. There are no data or schema exclusions/inclusions, and none of the dependency-following logic is used. It's just an inline transformer.
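The inline-transformer pass can be sketched as a small state machine: copy every line through verbatim, switch into data mode on a COPY ... FROM stdin; header, transform rows until the \. terminator, then switch back. This is only an illustration of the idea under those assumptions (filter_dump and its transform callback are hypothetical names, not the prototype's actual API):

```rust
use std::io::{BufRead, Write};

// Sketch: stream a .sql dump through, passing schema statements
// untouched and rewriting only the rows inside COPY blocks.
fn filter_dump<R: BufRead, W: Write>(input: R, mut out: W,
                                     transform: impl Fn(&str) -> String) {
    let mut in_copy = false;
    for line in input.lines() {
        let line = line.expect("read error");
        if in_copy {
            if line == "\\." {
                // End-of-data marker for a COPY block.
                in_copy = false;
                writeln!(out, "{}", line).unwrap();
            } else {
                writeln!(out, "{}", transform(&line)).unwrap();
            }
        } else {
            if line.starts_with("COPY ") && line.ends_with("FROM stdin;") {
                in_copy = true;
            }
            // Schema statements and COPY headers pass through verbatim.
            writeln!(out, "{}", line).unwrap();
        }
    }
}

fn main() {
    let dump = "CREATE TABLE t (x text);\nCOPY public.t (x) FROM stdin;\nsecret\n\\.\n";
    let mut out = Vec::new();
    filter_dump(dump.as_bytes(), &mut out, |_| "REDACTED".to_string());
    print!("{}", String::from_utf8(out).unwrap());
}
```

Because everything outside the COPY blocks is emitted byte-for-byte, the output keeps whatever special handling pg_dump applied to TimescaleDB's internal tables.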

Let me know what you think, and I'll clean it up a bit and create a PR.

@kapilt

kapilt commented Mar 15, 2022

@mbeynon thanks, I'm looking for this as well.

@mbeynon
Contributor Author

mbeynon commented Mar 16, 2022

Feel free to try the prototype fork and let me know what you think. It works as shown above. I'll clean it up, add docs, and open a PR against the main repo when I get a chance.
