Welcome to the first post in LETSQL's tutorial series!
In this blog post, we take a detour from our usual theme of data pipelines to demonstrate how to create and publish a Python package with Poetry, using DataFusion as an example.
Harlequin is a TUI client for SQL databases known for its light-weight extensive support for SQL databases. It is a versatile tool for data exploration and analysis workflows. Harlequin provides an interactive SQL editor with features like autocomplete, syntax highlighting, and query history. It also has a results viewer that can display large result sets. However, Harlequin did not have a DataFusion adapter before. Thankfully, it was really easy to add one.
In this post, We'll demonstrate these concepts by building a Harlequin adapter for DataFusion. And, by way of doing so, we will also cover Poetry's essential features, project setup, and the steps to publish your package on PyPI.
To get the most out of this guide, you should have a basic understanding of virtual environments, Python packages and modules, and pip.
Our objectives are to:
By the end, you'll have practical experience with Poetry and an understanding of modern Python package management.
The code implemented in this post is available on GitHub and available in PyPI.
Harlequin is a SQL IDE that runs in the terminal. It provides a powerful and feature-rich alternative to traditional command-line database tools, making it versatile for data exploration and analysis workflows.
Some key things to know about Harlequin:
DataFusion is a fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.
DataFusion offers SQL and Dataframe APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.
It ships with it its own CLI, more information can be found here.
Poetry is a modern, feature-rich tool that streamlines dependency management and packaging for Python projects, making development more deterministic and efficient.
From the documentation:
Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on, and it will manage (install/update) them for you.
Poetry offers a lockfile to ensure repeatable installs and can build your project for distribution.
A Harlequin adapter is a Python package that allows Harlequin to work with a database system.
An adapter is a Python package that declares an entry point in the harlequin.adapters group. That entry point should reference a subclass of the HarlequinAdapter abstract base class.
This allows Harlequin to discover installed adapters and instantiate a selected adapter at run-time
In addition to the HarlequinAdapter class, the package must also provide implementations for HarlequinConnection, and HarlequinCursor. A more detailed description can be found on this
guide.
The first step for developing a Harlequin adapter is to generate a new repo from the existing harlequin-adapter-template
GitHub templates are repositories that serve as starting points for new projects. They provide pre-configured files, structures, and settings that are copied to new repositories, allowing for quick project setup without the overhead of forking.
This feature streamlines the process of creating consistent, well-structured projects based on established patterns.
The harlequin-adapter-template comes with a poetry.lock file and a pyproject.toml file, in addition to some boilerplate code for defining the required classes.
Let's explore the essential files needed for package distribution before we get into the specifics of coding.
The pyproject.toml file is now the standard for configuring Python packages for publication and other tools. Introduced in PEP 518 and PEP 621, this TOML-formatted file consolidates multiple configuration files into one. It enhances dependency management by making it more robust and standardized.
Poetry, utilizes pyproject.toml to handle the project's virtual environment, resolve dependencies, and create packages.
The pyproject.toml of the template is as follows:
[tool.poetry] name = "harlequin-myadapter" version = "0.1.0" description = "A Harlequin adapter for <my favorite database>." authors = ["Ted Conbeer <tconbeer@users.noreply.github.com>"] license = "MIT" readme = "README.md" packages = [ { include = "harlequin_myadapter", from = "src" }, ] [tool.poetry.plugins."harlequin.adapter"] my-adapter = "harlequin_myadapter:MyAdapter" [tool.poetry.dependencies] python = ">=3.8.1,<4.0" harlequin = "^1.7" [tool.poetry.group.dev.dependencies] ruff = "^0.1.6" pytest = "^7.4.3" mypy = "^1.7.0" pre-commit = "^3.5.0" importlib_metadata = { version = ">=4.6.0", python = "<3.10.0" } [build-system] requires = ["poetry-core"] build-backend = "poetry.core.masonry.api"
As it can be seen:
The [tool.poetry] section of the pyproject.toml file is where you define the metadata for your Python package, such as the name, version, description, authors, etc.
The [tool.poetry.dependencies] subsection is where you declare the runtime dependencies your project requires. Running poetry add The [tool.poetry.dev-dependencies] subsection is where you declare development-only dependencies, like testing frameworks, linters, etc. The [build-system] section is used to store build-related data. In this case, it specifies the build-backend as "poetry.core.masonry.api". In a narrow sense, the core responsibility of a The repository also includes a poetry.lock file, a Poetry-specific component generated by running poetry install or poetry update. This lock file specifies the exact versions of all dependencies and sub-dependencies for your project, ensuring reproducible installations across different environments. It's crucial to avoid manual edits to the poetry.lock file, as this can cause inconsistencies and installation issues. Instead, make changes to your pyproject.toml file and allow Poetry to automatically update the lock file by running poetry lock. Per Poetry's installation warning ::: {.warning} Here we will presume you have access to Poetry by running pipx install poetry With our file structure clarified, let's begin the development process by setting up our environment. Since our project already includes pyproject.toml and poetry.lock files, we can initiate our environment using the poetry shell command. This command activates the virtual environment linked to the current Poetry project, ensuring all subsequent operations occur within the project's dependency context. If no virtual environment exists, poetry shell automatically creates and activates one. poetry shell detects your current shell and launches a new instance within the virtual environment. As Poetry centralizes virtual environments by default, this command eliminates the need to locate or recall the specific path to the activate script. To verify which Python environment is currently in use with Poetry, you can use the following commands: This will show all the virtual environments associated with your project and indicate which one is currently active. With the environment activated, use poetry install to install the required dependencies. The command works as follows To complete the environment setup, we need to add the datafusion library to our dependencies. Execute the following command: This command updates your pyproject.toml file with the datafusion package and installs it. If you don't specify a version, Poetry will automatically select an appropriate one based on available package versions. To create a Harlequin Adapter, you need to implement three interfaces defined as abstract classes in the harlequin.adapter module. The first one is the HarlequinAdapter. The second one is the HarlequinConnection, particularly the methods execute and get_catalog. For brevity, we've omitted the implementation of the get_catalog function. You can find the full code in the adapter.py file within our GitHub repository. Finally, a HarlequinCursor implementation must be provided as well: Your adapter must register an entry point in the harlequin.adapters group using the packaging software you use to build your project. An entry point is a mechanism for code to advertise components it provides to be discovered and used by other code. Notice that registering a plugin with Poetry is equivalent to the following pyproject.toml specification for entry points: The template provides a set of pre-configured tests, some of which are applicable to DataFusion while others may not be relevant. One test that's pretty cool checks if the plugin can be discovered, which is crucial for ensuring proper integration: To make sure the tests are passing, run: The run command executes the given command inside the project’s virtualenv. With the tests passing, we're nearly ready to publish our project. Let's enhance our pyproject.toml file to make our package more discoverable and appealing on PyPI. We'll add key metadata including: These additions will help potential users find and understand our package more easily. For reference: We're now ready to build our library and verify its functionality by installing it in a clean virtual environment. Let's start with the build process: This command will create distribution packages (both source and wheel) in the dist directory. The wheel file should have a name like harlequin_datafusion-0.1.1-py3-none-any.whl. This follows the standard naming convention: To test the installation, create a new directory called test_install. Then, set up a fresh virtual environment with the following command: To activate the virtual environment on MacOS or Linux: After running this command, you should see the name of your virtual environment (.venv) prepended to your command prompt, indicating that the virtual environment is now active. To install the wheel file we just built, use pip as follows: Replace /path/to/harlequin_datafusion-0.1.1-py3-none-any.whl with the actual path to the wheel file you want to install. If everything works fined, you should see some dependencies installed, and you should be able to do: Congrats! You have built a Python library. Now it is time to share it with the world. The best practice before publishing to PyPI is to actually publish to the Test Python Package Index (TestPyPI) To publish a package to TestPyPI using Poetry, follow these steps: Create an account at TestPyPI if you haven't already. Generate an API token on your TestPyPI account page. Register the TestPyPI repository with Poetry by running: To publish your package, run: Replace This command uses two key arguments: Replace To publish to the actual Python Package Index (PyPI) instead: Create an account at https://pypi.org/ if you haven't already. Generate an API token on your PyPI account page. Run: The default repository is PyPI, so there's no need to specify it. Is worth noting that Poetry only supports the Legacy Upload API when publishing your project. Manually publishing each time is repetitive and error-prone, so to fix this problem, let us create a GitHub Action to Here are the key steps to publish a Python package to PyPI using GitHub Actions and Poetry: Set up PyPI authentication: You must provide your PyPI credentials (the API token) as GitHub secrets so the GitHub Actions workflow can access them. Name these secrets something like PYPI_TOKEN. Create a GitHub Actions workflow file: In your project's .github/workflows directory, create a new file like publish.yml with the following content: The key is to leverage GitHub Actions to automate the publishing process and use Poetry to manage your package's dependencies and metadata. Poetry is a user-friendly Python package management tool that simplifies project setup and publication. Its intuitive command-line interface streamlines environment management and dependency installation. It supports plugin development, integrates with other tools, and emphasizes testing for robust code. With straightforward commands for building and publishing packages, Poetry makes it easier for developers to share their work with the Python community. At LETSQL, we're committed to contributing to the developer community. We hope this blog post serves as a straightforward guide to developing and publishing Python packages, emphasizing best practices and providing valuable resources. As we continue to refine the adapter, we would like to provide better autocompletion and direct reading from files (parquet, csv) as in the DataFusion-cli. This requires a tighter integration with the Rust library without going through the Python bindings. Your thoughts and feedback are invaluable as we navigate this journey. Share your experiences, questions, or suggestions in the comments below or on our community forum. Let's redefine the boundaries of data science and machine learning integration. Thanks to Dan Lovell and Hussain Sultan for the comments and the thorough review.
build-backend is to build wheels and sdist.
Getting Poetry
Poetry should always be installed in a dedicated virtual environment to isolate it from the rest of your system. It should in no case be installed in the environment of the project that is to be managed by Poetry.
:::
Developing in the virtual environment
poetry env list --full-path
As an alternative, you can get the full path of only the current environment:
poetry env info -p
poetry add datafusion
Implementing the Interfaces
#| eval: false
#| code-fold: false
#| code-summary: implementation of HarlequinAdapter
class DataFusionAdapter(HarlequinAdapter):
def __init__(self, conn_str: Sequence[str], **options: Any) -> None:
self.conn_str = conn_str
self.options = options
def connect(self) -> DataFusionConnection:
conn = DataFusionConnection(self.conn_str, self.options)
return conn
#| eval: false
#| code-fold: false
#| code-summary: implementation of execution of HarlequinConnection
def execute(self, query: str) -> HarlequinCursor | None:
try:
cur = self.conn.sql(query) # type: ignore
if str(cur.logical_plan()) == "EmptyRelation":
return None
except Exception as e:
raise HarlequinQueryError(
msg=str(e),
title="Harlequin encountered an error while executing your query.",
) from e
else:
if cur is not None:
return DataFusionCursor(cur)
else:
return None
#| eval: false
#| code-fold: false
#| code-summary: implementation of HarlequinCursor
class DataFusionCursor(HarlequinCursor):
def __init__(self, *args: Any, **kwargs: Any) -> None:
self.cur = args[0]
self._limit: int | None = None
def columns(self) -> list[tuple[str, str]]:
return [
(field.name, _mapping.get(field.type, "?")) for field in self.cur.schema()
]
def set_limit(self, limit: int) -> DataFusionCursor:
self._limit = limit
return self
def fetchall(self) -> AutoBackendType:
try:
if self._limit is None:
return self.cur.to_arrow_table()
else:
return self.cur.limit(self._limit).to_arrow_table()
except Exception as e:
raise HarlequinQueryError(
msg=str(e),
title="Harlequin encountered an error while executing your query.",
) from e
Making the plugin discoverable
If you use Poetry, you can define the entry point in your pyproject.toml file:
[tool.poetry.plugins."harlequin.adapter"]
datafusion = "harlequin_datafusion:DataFusionAdapter"
[project.entry-points."harlequin.adapter"]
datafusion = "harlequin_datafusion:DataFusionAdapter"
Testing
#| eval: false
#| code-fold: false
if sys.version_info < (3, 10):
from importlib_metadata import entry_points
else:
from importlib.metadata import entry_points
def test_plugin_discovery() -> None:
PLUGIN_NAME = "datafusion"
eps = entry_points(group="harlequin.adapter")
assert eps[PLUGIN_NAME]
adapter_cls = eps[PLUGIN_NAME].load()
assert issubclass(adapter_cls, HarlequinAdapter)
assert adapter_cls == DataFusionAdapter
poetry run pytest
Building and Publishing to PyPI
classifiers = [
"Development Status :: 3 - Alpha",
"Intended Audience :: Developers",
"Topic :: Software Development :: User Interfaces",
"Topic :: Database :: Database Engines/Servers",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python :: Implementation :: CPython"
]
readme = "README.md"
repository = "https://github.com/mesejo/datafusion-adapter"
Building
poetry build
python -m venv .venv
source .venv/bin/activate
pip install /path/to/harlequin_datafusion-0.1.1-py3-none-any.whl
harlequin -a datafusion
Publishing to PyPI
poetry config repositories.test-pypi https://test.pypi.org/legacy/
poetry publish -r testpypi --username __token__ --password <token>
python -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple <package-name>
poetry publish --username __token__ --password <token>
Automated Publishing on GitHub release
publish each time we create a release.
name: Build and publish python package
on:
release:
types: [ published ]
jobs:
publish-package:
runs-on: ubuntu-latest
permissions:
contents: write
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: '3.10'
- name: Install Poetry
uses: snok/install-poetry@v1
- run: poetry config pypi-token.pypi "${{ secrets.PYPI_TOKEN }}"
- name: Publish package
run: poetry publish --build --username __token__
Conclusion
To subscribe to our newsletter, visit letsql.com.
Future Work
Acknowledgements
The above is the detailed content of How to build a new Harlequin adapter with Poetry. For more information, please follow other related articles on the PHP Chinese website!