Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read CSV strings in perspective-python #1447

Merged
merged 2 commits into from
Jun 18, 2021
Merged

Read CSV strings in perspective-python #1447

merged 2 commits into from
Jun 18, 2021

Conversation

sc1f
Copy link
Contributor

@sc1f sc1f commented Jun 15, 2021

This PR enables direct loading of CSV strings inside perspective-python using the Apache Arrow-based loader used in perspective.js. The CSV loader uses the version of reader.cc shipped with Apache Arrow, and not the custom single_threaded_reader.cpp which is implemented for the WASM binding. During the loading of CSVs, as is the case with loading regular Arrow binaries, the GIL is released by the C++ binding.

Changelog

  • Enables and tests load & update of CSV strings in Python library
  • Unifies CMake build of Arrow CSV reader files
  • Fixes a bug in Python where parsing a CSV with a null column, such as "x,y\n1," would cause a segfault when trying to read into an invalid null_bitmap_data pointer.

@sc1f sc1f added enhancement Feature requests or improvements Python labels Jun 15, 2021
// https://github.com/emscripten-core/emscripten/issues/8574
#include <perspective/vendor/arrow_single_threaded_reader.h>
#else
#include <arrow/csv/reader.h>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@texodus I can enable the read_options.use_threads = false; to true a few lines down - I am assuming it should be safe given that the API is exposed by Arrow and we don't do any manual concurrency for this on our side.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow seems to use this implementation for populating the default worker pool, which will consume all hardware concurrency without proper env flags. We should probably figure out how to override these flags in the loaded env, and to respect what was asked for in set_num_threads().

Interesting though, we could easily reuse this static threadpool in place of TBB, looks like ParallelFor() is nearly a dropin replacement for PSP_PARALLEL_FOR ..

Copy link
Member

@texodus texodus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks for the PR!

@texodus texodus merged commit bc02d54 into master Jun 18, 2021
@texodus texodus deleted the python-csv branch June 18, 2021 01:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests or improvements Python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants