Description
- I have checked the issue tracker for the same issue and haven't found a similar one
Superset version
0.19.0
Expected results
There are a large number of Issues asking about adding new Datasources / Connectors:
- Allow users to import CSV as datasource #381
- Add a python script as datasource #2790
- [feature] HDFS interface #2468
- Google BigQuery Support #945
- Spark SQL backend (to support Elasticsearch, Cassandra, etc) #241
- Support more NoSQL databases #600
- REST API client #245
Unfortunately, I can't find any examples of a working third-party datasource / connector on GitHub, and I think this is possibly because of the complexity and level of effort required to implement a BaseDatasource subclass with all the required methods. In particular, it needs to be able to report the schema and do filtering, grouping and aggregating.
Pandas has great import code, and I have seen Pandas proposed as a method for implementing a CSV connector (see #381): read the CSV using Pandas, output it to SQLite, and then connect to SQLite using the SQLA Datasource to create the slices.
This approach could be extended to other data formats that Pandas can read, e.g. Excel, HDF5, etc.
However, it is not ideal because the SQLite file will potentially be out of date as soon as it is loaded.
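For reference, a minimal sketch of that approach (the URL, file path and table name are placeholders, not anything from Superset itself):

```python
import sqlite3

import pandas as pd

# Read the source data into a DataFrame, then dump it to a local SQLite
# file that Superset's SQLAlchemy datasource can point at.
df = pd.read_csv("https://example.com/data.csv")  # hypothetical URL

with sqlite3.connect("/tmp/superset_import.db") as conn:
    df.to_sql("my_table", conn, if_exists="replace", index=False)
```

The snapshot in SQLite is only as fresh as the moment `to_sql` ran, which is exactly the staleness problem described above.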
I'd like to propose an alternative: a PandasDatasource that allows the user to specify the import method (read_csv, read_table, read_hdf, etc.) and a URL, and which then queries the URL using that method to create a DataFrame. It reports the columns available and their types based on the dtypes of the DataFrame, and by default it allows grouping, filtering and aggregating using Pandas' built-in functionality.
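To make the idea concrete, here is a rough, hypothetical sketch. The class name, the `get_dataframe` / `get_columns` / `query` methods and the filter tuple format are all illustrative assumptions, not Superset's actual BaseDatasource interface:

```python
import pandas as pd


class PandasDatasource(object):
    """Illustrative sketch: import method + URL -> DataFrame, with
    schema, filtering, grouping and aggregating handled by Pandas."""

    def __init__(self, method, url):
        self.method = method  # e.g. "read_csv", "read_table", "read_hdf"
        self.url = url

    def get_dataframe(self):
        reader = getattr(pd, self.method)  # pd.read_csv, pd.read_hdf, ...
        return reader(self.url)

    def get_columns(self):
        # Report column names and types straight from the DataFrame dtypes.
        df = self.get_dataframe()
        return [(name, str(dtype)) for name, dtype in df.dtypes.items()]

    def query(self, filters=None, groupby=None, metrics=None):
        df = self.get_dataframe()
        # Filtering: filters assumed to be (column, operator, value) tuples.
        for col, op, val in filters or []:
            if op == "==":
                df = df[df[col] == val]
            elif op == ">":
                df = df[df[col] > val]
        # Grouping and aggregating with Pandas built-ins; metrics could be
        # a dict like {"sales": "sum"}.
        if groupby:
            df = df.groupby(groupby).agg(metrics or "count").reset_index()
        return df
```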
I realize that this approach won't work for very large datasets that could overwhelm the memory of the server, but it would work for my use case and probably for many others. The results of the read, filter, group and aggregate would be cached anyway, so the large memory usage is potentially only temporary.
This would also make it very much easier for people working with larger datasets to create a custom connector to suit their purposes. For example, someone wanting to use BigQuery (see #945) could extend the PandasDatasource to use read_gbq and to pass the filter options through to BigQuery, but still rely on Pandas to do grouping and aggregating. Given that starting point, someone else might come along later and add the additional code necessary to pass some group options through to BigQuery.
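Building on the sketch above, the BigQuery case might look something like this. Again this is only an assumption-laden illustration: the subclass, its constructor arguments and the naive WHERE-clause quoting are mine, not anything that exists in Superset:

```python
import pandas as pd


class BigQueryDatasource(PandasDatasource):
    """Illustrative sketch: push filters down to BigQuery via read_gbq,
    keep grouping/aggregating in Pandas."""

    def __init__(self, table, project_id):
        # No URL/import method needed; the SQL is built here instead.
        super(BigQueryDatasource, self).__init__(method="read_gbq", url=None)
        self.table = table
        self.project_id = project_id

    def query(self, filters=None, groupby=None, metrics=None):
        # Naive quoting via repr() is for illustration only.
        where = " AND ".join(
            "{} {} {!r}".format(col, op, val) for col, op, val in filters or []
        )
        sql = "SELECT * FROM {}".format(self.table)
        if where:
            sql += " WHERE " + where
        df = pd.read_gbq(sql, project_id=self.project_id)
        # Grouping and aggregating still happen in Pandas.
        if groupby:
            df = df.groupby(groupby).agg(metrics or "count").reset_index()
        return df
```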
The point is that instead of having to write an entire Datasource and implement all methods, you could extend an existing one to scratch your particular itch, and over time as more itches get scratched we would end up with a much broader selection of datasources for Superset.