How to build a new Harlequin adapter with Poetry

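Welcome to the first post in the LETSQL tutorial series!

In this blog post, we step away from our usual data pipeline topics to show how to create and publish a Python package with Poetry, using DataFusion as an example.

Introduction

Harlequin is a TUI client for SQL databases, known for its lightweight, extensive support for SQL databases. It is a versatile tool for data exploration and analysis workflows. Harlequin provides an interactive SQL editor with features such as autocomplete, syntax highlighting, and query history. It also has a results viewer that can display large result sets. However, Harlequin previously had no DataFusion adapter. Thankfully, adding one turned out to be easy.

In this post, we demonstrate these concepts by building a Harlequin adapter for DataFusion. Along the way, we also cover Poetry's essential features, project setup, and the steps for publishing a package on PyPI.

To get the most out of this guide, you should have a basic understanding of virtual environments, Python packages and modules, and pip.
Our goals are to:

Introduce Poetry and its advantages
Set up a project using Poetry
Develop a Harlequin adapter for DataFusion
Prepare the package and publish it to PyPI

By the end, you will have hands-on experience with Poetry and an understanding of modern Python package management.

The code implemented in this post is available on GitHub and on PyPI.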

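Harlequin

Harlequin is a SQL IDE that runs in the terminal. It offers a powerful, feature-rich alternative to traditional command-line database tools, which makes it well suited to data exploration and analysis workflows.

Some key things to know about Harlequin:

Harlequin supports multiple database adapters, connecting you to DuckDB, SQLite, PostgreSQL, MySQL, and more.
Harlequin provides an interactive SQL editor with features such as autocomplete, syntax highlighting, and query history. It also has a results viewer that can display large result sets.
Harlequin replaces traditional terminal-based database tools with a more powerful and user-friendly interface.
Harlequin uses adapter plugins as a generic interface to any database.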

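DataFusion

DataFusion is a fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.

DataFusion offers SQL and DataFrame APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.

It ships with its own CLI; more information can be found here.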

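Poetry

Poetry is a modern, feature-rich tool that streamlines dependency management and packaging for Python projects, making development more deterministic and efficient.
From the documentation:

Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on and it will manage (install/update) them for you.
Poetry offers a lockfile to ensure repeatable installs, and can build your project for distribution.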

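Creating a new adapter for Harlequin

A Harlequin adapter is a Python package that allows Harlequin to work with a database system.

The package declares an entry point in the harlequin.adapter group. That entry point should reference a subclass of the HarlequinAdapter abstract base class.
This allows Harlequin to discover installed adapters and instantiate the selected adapter at run time.

In addition to the HarlequinAdapter class, the package must also provide implementations of HarlequinConnection and HarlequinCursor. A more detailed description can be found in this guide.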
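To make the discovery mechanism concrete, here is a minimal sketch of how a host application could list the adapters registered in an entry point group, using only the standard library. The function name discover_adapters is ours and is not part of Harlequin; Harlequin's actual loading code may differ.

```python
from importlib.metadata import entry_points
from typing import Dict


def discover_adapters(group: str = "harlequin.adapter") -> Dict[str, str]:
    """Map each entry point name in `group` to its "module:attr" target."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points are filtered with select().
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


if __name__ == "__main__":
    # Prints nothing when no adapter is installed in this environment.
    for name, target in discover_adapters().items():
        print(f"{name} -> {target}")
```

Run inside an environment where an adapter is installed, this should list one line per registered adapter; in a bare environment it returns an empty mapping.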

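The Harlequin adapter template

The first step in developing a Harlequin adapter is to generate a new repository from the existing harlequin-adapter-template.

GitHub templates are repositories that serve as starting points for new projects. They provide pre-configured files, structure, and settings that are copied into a new repository, allowing quick project setup without the overhead of forking.
This feature streamlines the creation of consistent, well-structured projects that follow established patterns.

The harlequin-adapter-template comes with a poetry.lock file and a pyproject.toml file, along with some boilerplate code that defines the required classes.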

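Coding the adapter

Before getting into the coding details, let's look at the basic files needed for package distribution.

Package configuration

The pyproject.toml file is now the standard for configuring Python packages for publication and for other tools. This TOML file, introduced in PEP 518 and PEP 621, consolidates multiple configuration files into one. It improves dependency management by making it more robust and standardized.

Poetry uses pyproject.toml to handle the project's virtual environment, resolve dependencies, and build packages.

The template's pyproject.toml is as follows: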

[tool.poetry]
name = "harlequin-myadapter"
version = "0.1.0"
description = "A Harlequin adapter for <my favorite database>."
authors = ["Ted Conbeer <tconbeer@users.noreply.github.com>"]
license = "MIT"
readme = "README.md"
packages = [
    { include = "harlequin_myadapter", from = "src" },
]

[tool.poetry.plugins."harlequin.adapter"]
my-adapter = "harlequin_myadapter:MyAdapter"

[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
harlequin = "^1.7"

[tool.poetry.group.dev.dependencies]
ruff = "^0.1.6"
pytest = "^7.4.3"
mypy = "^1.7.0"
pre-commit = "^3.5.0"
importlib_metadata = { version = ">=4.6.0", python = "<3.10.0" }

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

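As you can see:

The [tool.poetry] section of the pyproject.toml file is where you define the metadata of your Python package, such as name, version, description, and authors.
The [tool.poetry.dependencies] subsection is where you declare the runtime dependencies your project requires. Running poetry add updates this section automatically.
The [tool.poetry.group.dev.dependencies] subsection is where you declare development-only dependencies, such as testing frameworks and linters.
The [build-system] section stores build-related data. In this case, it specifies the build backend as "poetry.core.masonry.api". In a narrow sense, the core responsibility of a build backend is to build wheels and sdists.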
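As a quick way to see how these sections nest, the sketch below parses a shortened copy of the template's pyproject.toml with the standard library and reads each section back. It assumes Python 3.11+ for tomllib (the tomli fallback is an assumption for older interpreters); the variable names are ours.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:  # older interpreters need the tomli backport
    import tomli as tomllib

# Shortened copy of the template's pyproject.toml shown above.
PYPROJECT = """
[tool.poetry]
name = "harlequin-myadapter"
version = "0.1.0"

[tool.poetry.plugins."harlequin.adapter"]
my-adapter = "harlequin_myadapter:MyAdapter"

[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
harlequin = "^1.7"

[tool.poetry.group.dev.dependencies]
pytest = "^7.4.3"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
"""

config = tomllib.loads(PYPROJECT)
poetry = config["tool"]["poetry"]

package_name = poetry["name"]                       # package metadata
runtime_deps = poetry["dependencies"]               # runtime dependencies
dev_deps = poetry["group"]["dev"]["dependencies"]   # development-only dependencies
plugins = poetry["plugins"]["harlequin.adapter"]    # entry points in the group
build_backend = config["build-system"]["build-backend"]

print(package_name, build_backend)
print(sorted(runtime_deps), sorted(dev_deps), plugins)
```

Note that the quoted key "harlequin.adapter" is a single TOML key containing a dot, which is why it is looked up as one string rather than as two nested tables.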

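The repository also includes a poetry.lock file, a Poetry-specific component generated by running poetry install or poetry update. This lock file records the exact versions of all of the project's dependencies and sub-dependencies, ensuring repeatable installs across different environments.

It is crucial to avoid editing the poetry.lock file by hand, as this can lead to inconsistencies and installation problems. Instead, make your changes in the pyproject.toml file and let Poetry update the lock file by running poetry lock.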
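Reading the lock file is safe even though editing it is not. The sketch below lists the pinned versions from a lock file excerpt. The excerpt is hand-written for illustration: the package names and versions are placeholders, not the contents of the real poetry.lock, and real entries carry more fields. It assumes Python 3.11+ for tomllib.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:  # older interpreters need the tomli backport
    import tomli as tomllib

# Illustrative excerpt only: names and versions are placeholders.
LOCK_EXCERPT = """
[[package]]
name = "example-dependency"
version = "1.2.3"

[[package]]
name = "example-subdependency"
version = "0.4.0"

[metadata]
lock-version = "2.0"
python-versions = ">=3.8.1,<4.0"
"""


def pinned_versions(lock_text: str) -> dict:
    """Return {package name: exact pinned version} from lock file text."""
    lock = tomllib.loads(lock_text)
    # Each [[package]] table becomes one dict in the "package" list.
    return {pkg["name"]: pkg["version"] for pkg in lock.get("package", [])}


pins = pinned_versions(LOCK_EXCERPT)
for name, version in sorted(pins.items()):
    print(f"{name}=={version}")
```

To inspect a real project, the same function could be fed the text of its poetry.lock file instead of the excerpt.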

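Getting Poetry

Per Poetry's installation warning:

::: {.warning}
Poetry should always be installed in a dedicated virtual environment to isolate it from the rest of your system. Under no circumstances should it be installed in the environment of the project that is to be managed by Poetry.
:::

Here we assume you have access to Poetry, for example after running pipx install poetry.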

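Developing in the virtual environment

With the file structure clear, let's begin development by setting up our environment. Since our project already includes pyproject.toml and poetry.lock files, we can start our environment with the poetry shell command.

This command activates the virtual environment linked to the current Poetry project, so that all subsequent operations happen in the context of the project's dependencies. If no virtual environment exists, poetry shell creates and activates one automatically.

poetry shell detects your current shell and launches a new instance inside the virtual environment. Because Poetry centralizes virtual environments by default, this command removes the need to locate or remember the specific path to an activate script.

To verify which Python environment Poetry is currently using, you can use the following command: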

poetry env list --full-path

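This shows all virtual environments associated with your project and indicates which one is currently active.
Alternatively, you can get just the full path of the current environment: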

poetry env info -p
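
The same check can be made from inside Python. This small sketch reports which interpreter is running and whether it lives in a virtual environment; the helper name in_virtualenv is ours.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("environment:", sys.prefix)
print("inside a virtual environment:", in_virtualenv())
```

Saved to a file and started with poetry run python, the environment path printed here should match the one reported by poetry env info -p.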

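With the environment activated, use poetry install to install the required dependencies. The command works as follows:

If a poetry.lock file is present, poetry install uses the exact versions specified in that file instead of resolving dependencies dynamically. This ensures consistent, repeatable installs across different environments. If you run poetry install and it does not seem to make progress, you may need to run export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring in the shell you are installing from.
Otherwise, it reads the pyproject.toml file of the current project, resolves the dependencies listed there, and installs them.
If no poetry.lock file exists, poetry install creates one after resolving the dependencies.

To complete the environment setup, we need to add the datafusion library to our dependencies. Run the following command: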

poetry add datafusion

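This command adds the datafusion package to your pyproject.toml file and installs it. If you do not specify a version, Poetry automatically picks a suitable one based on the available package versions.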
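
The template's pyproject.toml declares versions with caret constraints such as ^1.7. As a rough illustration of what such a constraint allows, the sketch below expands a simple numeric caret constraint into an explicit range. caret_range is a hypothetical helper written for this post, not part of Poetry, and it ignores pre-releases and other special cases.

```python
def caret_range(spec: str) -> str:
    """Expand a numeric caret constraint such as "^1.7" into ">=...,<..."."""
    parts = [int(p) for p in spec.lstrip("^").split(".")]
    lower = parts + [0] * (3 - len(parts))
    # The leftmost non-zero component must stay fixed; if every given
    # component is zero, the last one given is the one that may not change.
    pivot = len(parts) - 1
    for index, value in enumerate(parts):
        if value != 0:
            pivot = index
            break
    # Upper bound: bump the pivot component and zero everything after it.
    upper = lower[:pivot] + [lower[pivot] + 1] + [0] * (2 - pivot)
    return ">={},<{}".format(
        ".".join(map(str, lower)), ".".join(map(str, upper))
    )


for constraint in ("^1.7", "^0.1.6", "^7.4.3"):
    print(constraint, "allows", caret_range(constraint))
```

The 0.x case is the one that tends to surprise people: for ^0.1.6 the upper bound is the next minor version, not the next major one.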

Implementing the Interfaces

To create a Harlequin Adapter, you need to implement three interfaces defined as abstract classes in the harlequin.adapter module.

The first one is the HarlequinAdapter.

from typing import Any, Sequence

from harlequin.adapter import HarlequinAdapter


class DataFusionAdapter(HarlequinAdapter):
    def __init__(self, conn_str: Sequence[str], **options: Any) -> None:
        # Keep the connection strings and options for use in connect().
        self.conn_str = conn_str
        self.options = options

    def connect(self) -> DataFusionConnection:
        # DataFusionConnection is our HarlequinConnection implementation.
        conn = DataFusionConnection(self.conn_str, self.options)
        return conn

The second one is the HarlequinConnection, particularly the methods execute and get_catalog.

# Method of DataFusionConnection, our HarlequinConnection implementation.
def execute(self, query: str) -> HarlequinCursor | None:
    try:
        cur = self.conn.sql(query)  # type: ignore
        # An empty logical plan means there is no result to display.
        if str(cur.logical_plan()) == "EmptyRelation":
            return None
    except Exception as e:
        # Surface any failure as the error type Harlequin knows how to show.
        raise HarlequinQueryError(
            msg=str(e),
            title="Harlequin encountered an error while executing your query.",
        ) from e
    else:
        if cur is not None:
            return DataFusionCursor(cur)
        else:
            return None
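
The error handling in execute follows a common pattern: catch whatever the engine raises, re-raise it as a single error type the UI understands, and keep the original exception attached. Here is a self-contained sketch of that pattern. QueryError and run_query are stand-ins invented for this example; they are not Harlequin or DataFusion APIs.

```python
class QueryError(Exception):
    """Stand-in for HarlequinQueryError (hypothetical, for illustration only)."""

    def __init__(self, msg: str, title: str = "") -> None:
        super().__init__(msg)
        self.msg = msg
        self.title = title


def run_query(query: str) -> str:
    """Pretend to run a query; anything without a FROM clause fails."""
    try:
        if "FROM" not in query.upper():
            raise ValueError(f"cannot plan query: {query!r}")
        result = f"rows for {query!r}"
    except Exception as e:
        # Re-raise as the one error type the UI knows how to display,
        # keeping the original exception reachable via __cause__.
        raise QueryError(msg=str(e), title="Query failed") from e
    else:
        # Runs only when the try block raised nothing.
        return result


try:
    run_query("SELECT 1")
except QueryError as err:
    print(err.title, "|", err.msg, "|", type(err.__cause__).__name__)
```

Using raise ... from e keeps the original traceback available for debugging while the caller only has to handle one exception type.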

For brevity, we've omitted the implementation of the get_catalog function. You can find the full code in the adapter.py file within our GitHub repository.

Finally, a HarlequinCursor implementation must be provided as well:

from __future__ import annotations  # lets set_limit name DataFusionCursor in its return annotation

class DataFusionCursor(HarlequinCursor):
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        self.cur = args[0]
        self._limit: int | None = None

    def columns(self) -> list[tuple[str, str]]:
        return [
            (field.name, _mapping.get(field.type, "?")) for field in self.cur.schema()
        ]

    def set_limit(self, limit: int) -> DataFusionCursor:
        self._limit = limit
        return self

    def fetchall(self) -> AutoBackendType:
        try:
            if self._limit is None:
                return self.cur.to_arrow_table()
            else:
                return self.cur.limit(self._limit).to_arrow_table()
        except Exception as e:
            raise HarlequinQueryError(
                msg=str(e),
                title="Harlequin encountered an error while executing your query.",
            ) from e
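
To see how the three pieces fit together without installing Harlequin or DataFusion, here is a self-contained sketch of the same adapter, connection, and cursor pattern. Everything in it is a stand-in: the base classes are simplified imitations of Harlequin's interfaces, and sqlite3 from the standard library replaces DataFusion as the engine.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Any, List, Optional, Sequence, Tuple


class Cursor(ABC):  # simplified stand-in for HarlequinCursor
    @abstractmethod
    def columns(self) -> List[Tuple[str, str]]: ...

    @abstractmethod
    def set_limit(self, limit: int) -> "Cursor": ...

    @abstractmethod
    def fetchall(self) -> List[tuple]: ...


class Connection(ABC):  # simplified stand-in for HarlequinConnection
    @abstractmethod
    def execute(self, query: str) -> Optional[Cursor]: ...


class Adapter(ABC):  # simplified stand-in for HarlequinAdapter
    @abstractmethod
    def connect(self) -> Connection: ...


class SqliteCursor(Cursor):
    def __init__(self, cur: sqlite3.Cursor) -> None:
        self.cur = cur
        self._limit: Optional[int] = None

    def columns(self) -> List[Tuple[str, str]]:
        # sqlite3 reports names only, so every type is shown as "?".
        return [(col[0], "?") for col in self.cur.description]

    def set_limit(self, limit: int) -> "SqliteCursor":
        self._limit = limit
        return self

    def fetchall(self) -> List[tuple]:
        if self._limit is None:
            return self.cur.fetchall()
        return self.cur.fetchmany(self._limit)


class SqliteConnection(Connection):
    def __init__(self, conn_str: Sequence[str], options: dict) -> None:
        self.conn = sqlite3.connect(conn_str[0] if conn_str else ":memory:")

    def execute(self, query: str) -> Optional[Cursor]:
        cur = self.conn.execute(query)
        # Statements that return no rows (DDL, INSERT) have no description.
        return SqliteCursor(cur) if cur.description is not None else None


class SqliteAdapter(Adapter):
    def __init__(self, conn_str: Sequence[str], **options: Any) -> None:
        self.conn_str = conn_str
        self.options = options

    def connect(self) -> SqliteConnection:
        return SqliteConnection(self.conn_str, self.options)


# The host application only ever talks to the three interfaces.
conn = SqliteAdapter(conn_str=[]).connect()
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.execute("INSERT INTO t VALUES (1, 'a'), (2, 'b'), (3, 'c')")
cursor = conn.execute("SELECT id, name FROM t ORDER BY id")
columns = cursor.columns()
rows = cursor.set_limit(2).fetchall()
print(columns)
print(rows)
```

The host calls connect() once, execute() per query, and then asks the cursor for columns and rows. Returning self from set_limit is what allows the chained call in the last lines.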

Making the plugin discoverable

Your adapter must register an entry point in the harlequin.adapter group using the packaging software you use to build your project.
If you use Poetry, you can define the entry point in your pyproject.toml file:

[tool.poetry.plugins."harlequin.adapter"]
datafusion = "harlequin_datafusion:DataFusionAdapter"

An entry point is a mechanism for code to advertise components it provides to be discovered and used by other code.

Notice that registering a plugin with Poetry is equivalent to the following pyproject.toml specification for entry points:

[project.entry-points."harlequin.adapter"]
datafusion = "harlequin_datafusion:DataFusionAdapter"
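
The value on the right-hand side of an entry point has the form module:attribute. The sketch below shows how such a reference is split and loaded with the standard library. It assumes Python 3.9+ and points the entry point at a standard-library class so it runs without any adapter installed; the name demo is made up for the example.

```python
from importlib.metadata import EntryPoint

# Hypothetical entry point that targets a standard-library class so the
# example runs without installing any adapter.
ep = EntryPoint(
    name="demo",
    value="collections:OrderedDict",
    group="harlequin.adapter",
)

module_name = ep.module  # the part before the colon: module to import
attr_name = ep.attr      # the part after the colon: attribute to look up
loaded = ep.load()       # imports the module and returns the attribute

print(module_name, attr_name, loaded)
```

For the adapter in this post, load() would import harlequin_datafusion and return the DataFusionAdapter class, which is what the discovery test further down relies on.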

Testing

The template provides a set of pre-configured tests; some apply to DataFusion, while others may not be relevant. One particularly useful test checks that the plugin can be discovered, which is crucial for proper integration:

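import sys

from harlequin.adapter import HarlequinAdapter
from harlequin_datafusion import DataFusionAdapter

# entry_points(group=...) is available in importlib.metadata from Python 3.10;
# older interpreters use the importlib_metadata backport.
if sys.version_info < (3, 10):
    from importlib_metadata import entry_points
else:
    from importlib.metadata import entry_points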


def test_plugin_discovery() -> None:
    PLUGIN_NAME = "datafusion"
    eps = entry_points(group="harlequin.adapter")
    assert eps[PLUGIN_NAME]
    adapter_cls = eps[PLUGIN_NAME].load()
    assert issubclass(adapter_cls, HarlequinAdapter)
    assert adapter_cls == DataFusionAdapter

To make sure the tests are passing, run:

poetry run pytest

The run command executes the given command inside the project’s virtualenv.

Building and Publishing to PyPI

With the tests passing, we're nearly ready to publish our project. Let's enhance our pyproject.toml file to make our package more discoverable and appealing on PyPI. We'll add key metadata including:

A link to the GitHub repository
A path to the README file
A list of relevant classifiers

These additions will help potential users find and understand our package more easily.

classifiers = [
    "Development Status :: 3 - Alpha",
    "Intended Audience :: Developers",
    "Topic :: Software Development :: User Interfaces",
    "Topic :: Database :: Database Engines/Servers",
    "License :: OSI Approved :: MIT License",
    "Programming Language :: Python :: Implementation :: CPython"
]
readme = "README.md"
repository = "https://github.com/mesejo/datafusion-adapter"

For reference:

The complete list of classifiers is available on PyPI's website.
For a detailed guide on writing pyproject.toml, check out this resource.
The formal, technical specification for pyproject.toml can be found on packaging.python.org.

Building

We're now ready to build our library and verify its functionality by installing it in a clean virtual environment. Let's start with the build process:

poetry build

This command will create distribution packages (both source and wheel) in the dist directory.

The wheel file should have a name like harlequin_datafusion-0.1.1-py3-none-any.whl. This follows the standard naming convention:

harlequin_datafusion is the package (or distribution) name
0.1.1 is the version number
py3 indicates it's compatible with Python 3
none means it does not depend on a specific ABI (pure Python)
any means it is compatible with any platform or CPU architecture
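
The naming convention can be checked mechanically. This sketch splits a wheel filename into its tags; parse_wheel_filename is a hypothetical helper written for this post, and it does no validation beyond counting the dash-separated parts.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]  # optional build tag, absent in most wheels
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split name-version[-build]-python-abi-platform.whl into its parts."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, version, py, abi, plat = parts
        return WheelName(dist, version, None, py, abi, plat)
    if len(parts) == 6:
        dist, version, build, py, abi, plat = parts
        return WheelName(dist, version, build, py, abi, plat)
    raise ValueError(f"unexpected wheel filename: {filename!r}")


wheel = parse_wheel_filename("harlequin_datafusion-0.1.1-py3-none-any.whl")
print(wheel)
```

Splitting on dashes works because dashes in the distribution name are replaced with underscores when the wheel is built, which is also why the file is called harlequin_datafusion rather than harlequin-datafusion.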

To test the installation, create a new directory called test_install. Then, set up a fresh virtual environment with the following command:

python -m venv .venv

To activate the virtual environment on macOS or Linux:

source .venv/bin/activate

After running this command, you should see the name of your virtual environment (.venv) prepended to your command prompt, indicating that the virtual environment is now active.
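
If you prefer to script this step, the standard library's venv module does what python -m venv does. The sketch below creates a throwaway environment in a temporary directory and checks that it exists. The directory layout is an assumption for the example, and pip is skipped to keep it fast, so this variant is only useful for seeing what gets created.

```python
import os
import tempfile
import venv
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "test_install" / ".venv"
    # with_pip=False keeps the sketch fast; the command line installs pip.
    venv.create(env_dir, with_pip=False, symlinks=(os.name != "nt"))

    # Every virtual environment has a pyvenv.cfg marker file at its root.
    created = (env_dir / "pyvenv.cfg").is_file()
    # Executables and activate scripts live in bin/ (Scripts\ on Windows).
    bin_dir = env_dir / ("Scripts" if os.name == "nt" else "bin")
    has_bin_dir = bin_dir.is_dir()

print("environment created:", created)
print("scripts directory present:", has_bin_dir)
```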

To install the wheel file we just built, use pip as follows:

pip install /path/to/harlequin_datafusion-0.1.1-py3-none-any.whl

Replace /path/to/harlequin_datafusion-0.1.1-py3-none-any.whl with the actual path to the wheel file you want to install.

If everything works fine, you should see some dependencies being installed, and you should then be able to run:

harlequin -a datafusion

Congrats! You have built a Python library. Now it is time to share it with the world.

Publishing to PyPI

A best practice before publishing to PyPI is to publish to the Test Python Package Index (TestPyPI) first.

To publish a package to TestPyPI using Poetry, follow these steps:

Create an account at TestPyPI if you haven't already.
Generate an API token on your TestPyPI account page.
Register the TestPyPI repository with Poetry by running:
```
poetry config repositories.test-pypi https://test.pypi.org/legacy/
```

To publish your package, run:

poetry publish -r test-pypi --username __token__ --password <token>

Replace <token> with the actual token value you generated in step 2. To verify the publishing process, use the following command:

python -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple <package-name>

This command uses two key arguments:

--index-url: Directs pip to find your package on TestPyPI.
--extra-index-url: Allows pip to fetch any dependencies from the main PyPI repository.

Replace <package-name> with your specific package name (e.g., harlequin-datafusion if following this post). For additional details, consult the information provided in this post.

To publish to the actual Python Package Index (PyPI) instead:

Create an account at https://pypi.org/ if you haven't already.
Generate an API token on your PyPI account page.

Run:

poetry publish --username __token__ --password <token>

The default repository is PyPI, so there's no need to specify it.

It is worth noting that Poetry only supports the Legacy Upload API when publishing your project.

Automated Publishing on GitHub release

Publishing manually each time is repetitive and error-prone, so let us create a GitHub Action that publishes the package each time we create a release.

Here are the key steps to publish a Python package to PyPI using GitHub Actions and Poetry:

Set up PyPI authentication: You must provide your PyPI credentials (the API token) as a GitHub secret so the GitHub Actions workflow can access it. Name the secret something like PYPI_TOKEN.
Create a GitHub Actions workflow file: In your project's .github/workflows directory, create a new file like publish.yml with the following content:

   name: Build and publish python package

   on:
     release:
       types: [ published ]

   jobs:
     publish-package:
       runs-on: ubuntu-latest
       permissions:
         contents: write
       steps:
         - uses: actions/checkout@v3
         - uses: actions/setup-python@v4
           with:
             python-version: '3.10'

         - name: Install Poetry
           uses: snok/install-poetry@v1

         - run: poetry config pypi-token.pypi "${{ secrets.PYPI_TOKEN }}"

         - name: Publish package
           run: poetry publish --build --username __token__

The key is to leverage GitHub Actions to automate the publishing process and use Poetry to manage your package's dependencies and metadata.

Conclusion

Poetry is a user-friendly Python package management tool that simplifies project setup and publication. Its intuitive command-line interface streamlines environment management and dependency installation. It supports plugin development, integrates with other tools, and emphasizes testing for robust code. With straightforward commands for building and publishing packages, Poetry makes it easier for developers to share their work with the Python community.

At LETSQL, we're committed to contributing to the developer community. We hope this blog post serves as a straightforward guide to developing and publishing Python packages, emphasizing best practices and providing valuable resources.
To subscribe to our newsletter, visit letsql.com.

Future Work

As we continue to refine the adapter, we would like to provide better autocompletion and direct reading from files (Parquet, CSV) as in datafusion-cli. This requires a tighter integration with the Rust library without going through the Python bindings.

Your thoughts and feedback are invaluable as we navigate this journey. Share your experiences, questions, or suggestions in the comments below or on our community forum. Let's redefine the boundaries of data science and machine learning integration.