A new cli version that acts as a .sql dump filter for anonymization #156
No. But it might be an interesting idea.
Here's a prototype for your feedback: https://github.com/mbeynon/datanymizer/tree/filter-input-stream-mode I still need to add docs, changelog, tests, etc., but it's working for my test cases of large dbs and doesn't cause problems with TimescaleDB. It works really well on the production db dumps we're already collecting, by piping them through the new filter mode.
I had to work around the existing trait interfaces to add a new SqlText dumper that works by scanning an input .sql stream. I tried to minimize the changes to existing code; much of the Postgres dumper is reused. I also added a spinner indicator for status. The biggest difference between the SqlText dumper and the Postgres dumper is that it's stream-based: I watch all the data go by and choose what to transform. There are no data or schema exclusions or inclusions, and none of the dependency-following logic is used. It's just an inline transformer. Let me know what you think, and I'll clean it up a bit and create a PR.
@mbeynon thanks, I'm looking for this as well.
Feel free to try the prototype fork and let me know what you think. It works as shown above. I'll clean it up, add docs, and open a PR against the main repo when I get a chance.
Our db uses TimescaleDB in the same PostgreSQL database as the tables I need to anonymize. The main problem is that you have to be very careful how you dump and restore to avoid problems.
If I pg_dump the db and look at the output, there are internal tables such as _timescaledb_config.bgw_job or _timescaledb_catalog.compression_algorithm that are handled specially to avoid copying static data that cannot later be restored. If I use pg_datanymizer, its internal logic decides what to pull, and it makes different choices. Excluding those tables is not enough; I end up with other errors violating constraints on restore. The other detail is that the data I'm starting with is -Fc format pg_dump files, so I have to load them into a pg instance just for datanymizer to work. I think this use case could benefit from pg_datanymizer acting as an anonymization filter, such as:
$ pg_restore db.dump | pg_datanymizer -i - -c config.yaml -f anon-db.sql
This would need a new dumper that can handle .sql dumps on stdin, but it could reuse all the same anonymization code by parsing COPY rows one at a time. The main benefit is that the resulting .sql will load without problems, because it uses the pg_dump output as-is and only modifies the COPY sections with anonymized data. A side benefit is that it eliminates the extra step of loading the dump into a dummy pg server just to pull it back out to anonymize it.
Has anyone considered this before?
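The stream-filter idea above could be sketched roughly like this in Rust. This is a minimal standalone sketch, not datanymizer's actual code: `anonymize_row`, the `public.users` table, and the uppercase "transform" are all illustrative placeholders for the real config-driven transformers.

```rust
use std::io::{self, BufRead, Write};

// Placeholder transformer: the real tool would look the table up in
// config.yaml and apply the configured per-column rules instead.
fn anonymize_row(table: &str, row: &str) -> String {
    // Illustrative only: uppercase every tab-separated field of `public.users`.
    if table == "public.users" {
        row.split('\t')
            .map(|f| f.to_uppercase())
            .collect::<Vec<_>>()
            .join("\t")
    } else {
        row.to_string()
    }
}

// Pass a .sql dump through unchanged, rewriting only the data rows
// inside `COPY ... FROM stdin;` blocks.
fn filter<R: BufRead, W: Write>(input: R, mut out: W) -> io::Result<()> {
    let mut in_copy: Option<String> = None; // table name while inside a COPY block
    for line in input.lines() {
        let line = line?;
        if let Some(table) = in_copy.take() {
            if line == "\\." {
                // End-of-data marker: emit it and leave COPY mode.
                writeln!(out, "{}", line)?;
            } else {
                writeln!(out, "{}", anonymize_row(&table, &line))?;
                in_copy = Some(table);
            }
        } else {
            if line.starts_with("COPY ") && line.ends_with(" FROM stdin;") {
                // e.g. `COPY public.users (id, name) FROM stdin;`
                in_copy = line[5..].split_whitespace().next().map(|t| t.to_string());
            }
            writeln!(out, "{}", line)?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Read the dump on stdin, write the anonymized dump to stdout.
    filter(io::stdin().lock(), io::stdout().lock())
}
```

Because everything outside the COPY data sections is copied through byte-for-byte, the TimescaleDB catalog tables and any other specially handled sections come out exactly as pg_dump wrote them, which is the key restore-safety property of this approach.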