Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python][Dev] Dynamically generate the Connection wrapper methods #11202

Merged
merged 32 commits into from
Apr 17, 2024

Conversation

Tishj
Copy link
Contributor

@Tishj Tishj commented Mar 16, 2024

This was becoming increasingly hard to maintain, so I have decided to generate the methods instead.

Every method defined on DuckDBPyConnection now lives in tools/pythonpkg/scripts/connection_methods.json

They look like this:

	{
		"name": "register_filesystem",
		"function": "RegisterFilesystem",
		"docs": "Register a fsspec compliant filesystem",
		"args": [
			{
				"name": "filesystem",
				"type": "str"
			}
		],
		"return": "None"
	}

From this, we generate sections of code inside of:
tools/pythonpkg/duckdb/__init__.py
tools/pythonpkg/duckdb-stubs/__init__.pyi (DuckDBPyConnection)
tools/pythonpkg/duckdb-stubs/__init__.pyi (duckdb)
tools/pythonpkg/src/pyconnection.cpp (InitializeConnectionMethods)

From now on, when adjusting or creating a new connection method, the connection_methods.json file should be updated.

Old Process

  • Create a C++ function in pyconnection.hpp and pyconnection.cpp
  • Add a pybind definition for it inside pyconnection.cpp
  • Create a wrapper C++ function in connection_wrapper.hpp and connection_wrapper.cpp
  • Add a pybind definition for it inside duckdb_python.cpp
  • Add it to duckdb-stubs/__init__.pyi for both the DuckDBPyConnection method and the corresponding wrapper method.
  • Add the wrapper connection method to duckdb/__init__.py (expose the symbol from duckdb.duckdb)

New Process

  • Create a C++ function in pyconnection.hpp and pyconnection.cpp
  • Update the connection_methods.json
  • Run make generate-files

Some issues encountered:

  • values didn't take the optional connection as a keyword argument, now it does
  • arrow and df are overloaded twice, once for from_arrow/df and another for to_arrow/df.
    We now define these in C++ because it's easier (pybind11 handles the overloading on the Python side).

…om None todefault connection handled by pybind, but tidy will not be happy about passing by value if I don't use it, and its suggestion to make it a will break the implicit conversion rules we set up - so the *only* purpose of this change is to make tidy happy
@Tishj Tishj marked this pull request as ready for review March 19, 2024 11:06
@Mytherin Mytherin requested a review from Mause March 20, 2024 16:12
@Mytherin Mytherin added the Draft label Mar 20, 2024
@Mytherin
Copy link
Collaborator

Great idea to generate this - maybe we can instead of generating the function objects during import duckdb, have a separate step that generates the code and puts it into init.py, so that we don't need to execute complex code during the import?

@Tishj
Copy link
Contributor Author

Tishj commented Mar 21, 2024

Great idea to generate this - maybe we can instead of generating the function objects during import duckdb, have a separate step that generates the code and puts it into init.py, so that we don't need to execute complex code during the import?

While I agree, my source of truth for generating this is the definitions created by pybind11 for the connection methods.
Which means that to generate the __init__.py I would need to reference the __init__.py, which seems problematic

So I decided to take a step back and generate the cpp code for the DuckDBPyConnection methods that are shared, for which I am writing this gigantic json file that contains the information for both the stubs and the m.def(...) code, but that seems like it's going to take a bit of grunt work:

	{
		"name": "create_function",
		"function": "RegisterScalarUDF",
		"docs": "Create a DuckDB function out of the passing in Python function so it can be used in queries",
		"args": [
			{
				"name": "name",
				"type": "str"
			},
			{
				"name": "function",
				"type": "Callable"
			},
			{
				"name": "parameters",
				"type": "Optional[List[DuckDBPyType]]",
				"default": "None"
			},
			{
				"name": "return_type",
				"type": "Optional[DuckDBPyType]",
				"default": "None"
			}
		],
		"kwargs": [
			{
				"name": "type",
				"type": "Optional[PythonUDFType]",
				"default": "PythonUDFType.NATIVE"
			},
			{
				"name": "null_handling",
				"type": "Optional[FunctionNullHandling]",
				"default": "FunctionNullHandling.DEFAULT"
			},
			{
				"name": "exception_handling",
				"type": "Optional[PythonExceptionHandling]",
				"default": "PythonExceptionHandling.DEFAULT"
			},
			{
				"name": "side_effects",
				"type": "bool",
				"default": "False"
			}
		],
		"return": "DuckDBPyConnection"
	},

If everything goes according to plan however, I can write a script that generates all of these things from this json file:

  • stubs (connection)
  • stubs (connection wrapper)
  • init.py (connection wrapper)
  • cpp code (only the m.def(...) calls that allows pybind11 to expose these to python)

(connection stubs)

    def create_function(
        self,
        name: str,
        func: Callable,
        parameters: Optional[List[DuckDBPyType]] = None,
        return_type: Optional[DuckDBPyType] = None,
        **kwargs)  -> DuckDBPyConnection: ...

(wrapper stubs)

    def create_function(
        name: str,
        func: Callable,
        parameters: Optional[List[DuckDBPyType]] = None,
        return_type: Optional[DuckDBPyType] = None,
        **kwargs)  -> DuckDBPyConnection: ...

(wrapper init.py code)

def create_function(
    name: str,
    func: Callable,
    parameters: Optional[List[DuckDBPyType]] = None,
    return_type: Optional[DuckDBPyType] = None,
    **kwargs)  -> DuckDBPyConnection: ...
    if 'connection' in kwargs:
        connection = kwargs.pop('connection')
    else:
        connection = duckdb.connect(':default:')
    return connection.create_function(name, func, parameters, return_type, **kwargs)

(cpp definition)

	m.def(
		"create_function",
		&DuckDBPyConnection::RegisterScalarUDF,
		"Create a DuckDB function out of the passing in Python function so it can be used in queries",
		py::arg("name"),
		py::arg("function"),
		py::arg("parameters") = py::none(),
		py::arg("return_type") = py::none(),
		py::kw_only(),
		py::arg("type") = PythonUDFType::NATIVE,
		py::arg("null_handling") = 0,
		py::arg("exception_handling") = 0,
		py::arg("side_effects") = false);

@duckdb-draftbot duckdb-draftbot marked this pull request as draft April 11, 2024 08:09
@Tishj Tishj marked this pull request as ready for review April 12, 2024 11:33
@duckdb-draftbot duckdb-draftbot marked this pull request as draft April 12, 2024 13:18
@Tishj Tishj marked this pull request as ready for review April 12, 2024 14:43
@duckdb-draftbot duckdb-draftbot marked this pull request as draft April 16, 2024 08:16
@Tishj Tishj marked this pull request as ready for review April 16, 2024 09:23
@Tishj Tishj requested a review from Mytherin April 16, 2024 16:25
@Mytherin Mytherin merged commit 12250d3 into duckdb:main Apr 17, 2024
47 checks passed
@Mytherin
Copy link
Collaborator

Thanks! LGTM now

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request May 1, 2024
Merge pull request duckdb/duckdb#11673 from hannes/safesign
Merge pull request duckdb/duckdb#11688 from carlopi/fixes_duckdb_wasm
Merge pull request duckdb/duckdb#11202 from Tishj/python_generated_connection_wrapper
@Tishj Tishj mentioned this pull request May 7, 2024
2 tasks
tm-drtina added a commit to tm-drtina/duckdb that referenced this pull request May 23, 2024
Connection wrapper functions used to have signature `connection: DuckDBPyConnection = ...`
However this was (accidentally) changed in duckdb#11202 to `connection: DuckDBPyConnection`, which causes the connection param to be required now - at least for type checking
tm-drtina added a commit to tm-drtina/duckdb that referenced this pull request May 23, 2024
Connection wrapper functions used to have signature `connection: DuckDBPyConnection = ...`
However this was (accidentally) changed in duckdb#11202 to `connection: DuckDBPyConnection`, which causes the connection param to be required now - at least for type checking
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants