I’m writing this post to share some insights on keeping a project clean, even with lots of contributors. This is especially important for data engineers, given the ever-changing nature of data and the processing demands in Python libraries and applications.
The title might sound a bit clickbaity, but if you follow these steps, your Python code will become much easier to manage. If you’re a senior developer, you probably won’t find anything new here — don’t worry, I’ve got a funny meme for you.
This may seem trivial, but I’ve actually known people who stored their code only on their local computers — and unfortunately lost it because they didn’t back it up anywhere else. There are several version control systems available, like Git, which works with platforms such as GitHub, GitLab, and Bitbucket. While Git is the go-to choice for many, other version control systems like SVN (Subversion) and Mercurial still play significant roles in code management.
For this guide, I’m setting out to create a small demo Python library with a single function to illustrate basic data handling. It’s not meant to be a full toolkit but serves as a simple example for demonstrating best practices like code quality, environment management, and CI/CD workflows.
To get started, I’ve created a repository for our demo Python library, calling it blog_de_toolkit. It’s public, so feel free to fork it and use the existing code as a starting point for your own projects. Since good documentation is essential, I’ve included a suggested empty README file. And to keep the repo tidy, I added a default Python .gitignore template to prevent unnecessary files from being uploaded.
Now that we have the repo, we can clone it.
I’m not going to dive into Git commands here — you can find plenty of tutorials online. If you’re not a fan of using the plain terminal CLI, you can also manage repositories with tools like GitHub Desktop or SourceTree, which provide a more visual, intuitive interface.
Personally, I really enjoy using GitHub Desktop. So, let’s clone our repo to the local computer and open it in your preferred IDE.
Let’s see what we’ve got so far:
Not bad for a start!
For Step 2, we’ll organize our de_toolkit project. A good structure makes it easy to find things and keeps everything tidy. We’ll create folders for our code, tests, and documentation, setting up a simple, clean framework to build on.
blog_de_toolkit/ │ ├── de_toolkit/ │ ├── __init__.py │ └── data_tools.py │ ├── tests/ │ ├── __init__.py │ └── test_data_tools.py │ ├── .gitignore ├── setup.py └── README.md
We’ve got a main folder for all the useful code we’ll be adding, a tests folder for our future unit tests, and a .gitignore to keep unnecessary files out of our repository. There’s also a setup.py file, which is a basic setup to make the project installable. I won’t go into detail about it now since we’ll cover it later in Step 8: Create a Distribution Package.
When setting up your project structure, keeping things consistent makes a huge difference. As your project grows, it’s a good idea to break things into smaller modules — like splitting data_tools.py into csv_tools.py and json_tools.py. This way, it’s easier to manage and find what you need without digging through long files.
Adding a docs/ folder is also a smart move, even if it just starts with a few notes. It’ll help you (and others) keep track of how things work as the project evolves. If you’re working with configurations like YAML or JSON, a configs/ folder can help keep things neat. And if you plan on writing scripts for automation or testing, a scripts/ folder will keep them organized.
At this point, we need to install a few additional libraries to continue building out the project.
Sure, we could just run pip install from the command line to install the dependencies we need. But what if we’re juggling multiple projects, each requiring different versions of Python and libraries? That’s where virtual environments come in—they isolate each project's dependencies, including specific Python versions, so everything stays self-contained and independent.
Luckily, there are quite a few tools to create virtual environments, so you can pick what works best for you:
virtualenv
venv
conda
pyenv
pipenv
poetry
Personally, I’m a big fan of pyenv, so that’s what I’ll be using here. I’ve already got it installed on my laptop since it’s my go-to for work and personal projects.
Let’s start by installing Python:
pyenv install 3.12.2
If pyenv doesn’t recognize this Python version, try updating it first. For example, if you’re on Mac and installed pyenv with Homebrew, run:
brew update && brew upgrade pyenv
If you encounter the error ModuleNotFoundError: No module named '_lzma', try:
brew install readline xz
Next, in our project folder, let’s create a new virtual environment (the virtualenv subcommand comes from the pyenv-virtualenv plugin):
pyenv virtualenv 3.12.2 de_toolkit
Now, set the local Python version to the virtual environment you just created:
pyenv local de_toolkit
If the environment doesn’t switch after running the command on macOS, there’s a helpful thread online with a solution. Once everything is set up correctly, you should see de_toolkit at the beginning of your command line, like this:
Now, let’s install our dependencies:
pip install setuptools wheel twine pandas
Next, we’ll save all the installed packages, along with their versions, into a requirements.txt file. This makes it easy to share the project’s dependencies or recreate the same environment elsewhere:
pip freeze > requirements.txt
Here’s the list of installed packages we got:
Of course, you can edit the requirements.txt file to keep only the main libraries and their versions, if you prefer.
This step is crucial — probably one of the most important ones. You’ve likely heard horror stories about credentials being exposed in GitHub repositories or sensitive tokens accidentally shared in public. To avoid this, it’s essential to keep sensitive information out of your code from the start. Otherwise, it’s easy to forget that you hardcoded a database password, push your changes, and boom — your credentials are now public.
Hard-coding passwords, API keys, or database credentials is a major security risk. If these make it into a public repo, they could compromise your entire system. The safest way to handle secrets is by storing them in environment variables or a .env file. To help load these variables into your Python project, we’ll use the python-dotenv library. It reads key-value pairs from a .env file and makes them available as environment variables in your code.
First, install the library with:
pip install python-dotenv
Create a .env file in your project folder and store each secret in it as a KEY=value pair on its own line; for this project, that means the database URL and the API key.
Now, let’s modify data_tools.py to load these secrets using python-dotenv:
When you call load_dotenv(), it searches for a .env file in the current directory and loads its contents into the environment. Using os.getenv() allows you to access these variables safely in your code, keeping credentials isolated from the source code and reducing the risk of accidental exposure.
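Here is a minimal sketch of what that loading code in data_tools.py might look like. The variable names DATABASE_URL and API_KEY and the helper names get_database_url() and get_api_key() are assumptions made for illustration, and the import is guarded only so the snippet still runs where python-dotenv is not installed.

```python
import os
from typing import Optional

try:
    from dotenv import load_dotenv  # provided by `pip install python-dotenv`
except ImportError:
    load_dotenv = None  # fall back to variables already set in the shell

if load_dotenv is not None:
    # Reads KEY=value pairs from a .env file into os.environ.
    # By default, variables that are already set are left untouched.
    load_dotenv()


def get_database_url() -> Optional[str]:
    """Return the database URL, or None if it is not configured."""
    return os.getenv("DATABASE_URL")  # assumed variable name


def get_api_key() -> Optional[str]:
    """Return the API key, or None if it is not configured."""
    return os.getenv("API_KEY")  # assumed variable name
```

Returning None for a missing variable lets the caller decide whether that is an error, instead of failing at import time.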
A key tip is to avoid committing your .env file to version control. Add it to .gitignore to prevent it from being accidentally pushed:
.env
If you’re using VSCode, there’s a helpful dotenv extension that automatically recognizes your .env files. And if you prefer working from the terminal, you can export the variables from the .env file into your current shell session, for example like this:
set -a; source .env; set +a
When working on your project, try to write small, reusable functions that are easy to understand and manage. A good rule of thumb is: “If you use it more than twice, turn it into a function.”
In our data_tools.py, let’s create a function that demonstrates typical data engineering logic—like loading data from a CSV and cleaning it:
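A minimal sketch of such a function, assuming that "cleaning" means dropping rows with missing values and removing duplicate rows (which is what the tests later in this post check for). The exact body is an illustration, not necessarily the code in the repository.

```python
import pandas as pd


def load_and_clean_data(file_path: str) -> pd.DataFrame:
    """Load a CSV file and return it without missing values or duplicate rows.

    Args:
        file_path: Path to the CSV file (any input accepted by pandas.read_csv).

    Returns:
        A DataFrame with incomplete rows and duplicate rows removed.
    """
    df = pd.read_csv(file_path)
    cleaned = df.dropna().drop_duplicates()
    # Renumber the rows so the index is contiguous after dropping.
    return cleaned.reset_index(drop=True)
```

Because pandas.read_csv also accepts file-like objects, the same function can be exercised with an in-memory buffer instead of a real file.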
Pro tip: Stick to snake_case for function and variable names in Python, which keeps your code consistent and easy to read. Avoid cryptic names like x or df2; clear, descriptive names make your code easier to work with.
We use docstrings here to describe what the function does, its parameters, and return type. This makes it easy for other developers (and your future self) to understand how to use the function. There are several popular docstring conventions, but the most common ones include PEP 257, Google Style, and NumPy Style.
For smaller functions, PEP 257 is often enough, but for more complex projects, Google or NumPy styles offer more clarity and structure.
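To make the difference between the conventions concrete, here is a small sketch with two hypothetical helper functions (they are not part of the demo library): the first uses a Google-style docstring, the second a NumPy-style one.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from Celsius to Fahrenheit.

    Args:
        celsius: Temperature in degrees Celsius.

    Returns:
        The temperature in degrees Fahrenheit.
    """
    return celsius * 9 / 5 + 32


def fahrenheit_to_celsius(fahrenheit: float) -> float:
    """Convert a temperature from Fahrenheit to Celsius.

    Parameters
    ----------
    fahrenheit : float
        Temperature in degrees Fahrenheit.

    Returns
    -------
    float
        The temperature in degrees Celsius.
    """
    return (fahrenheit - 32) * 5 / 9
```

Both start with a one-line summary, as PEP 257 recommends; they differ only in how the parameter and return sections are laid out.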
Type hints in Python, like file_path: str in our example, show the expected data types for function inputs and outputs. They improve readability, help catch bugs, and make collaboration easier by setting clear expectations.
Here’s an example of how type hints improve the function signature:
def load_and_clean_data(file_path: str) -> pd.DataFrame:
    ...
In this example, the type hint file_path: str shows that the argument should be a string, while -> pd.DataFrame indicates that the function returns a Pandas DataFrame. This makes the function’s behavior easy to understand at a glance. Type hints also work well with IDEs and type checkers, like PyCharm, VSCode, or mypy, offering autocompletion and early warnings if incompatible types are passed.
If a function can return multiple types or None, you can use Optional from the typing module. A return annotation of Optional[str], for instance, indicates that the function could return either a string or None. For more complex data structures, you can use List, Dict, or Tuple from the typing module (or, on Python 3.9 and later, the built-in list, dict, and tuple) to specify expected types.
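As a small sketch of these annotations in use, the two hypothetical helpers below (not part of the demo library) combine Optional with Dict and List:

```python
from typing import Dict, List, Optional


def find_email(users: Dict[str, str], name: str) -> Optional[str]:
    """Return the email stored for `name`, or None if the user is unknown."""
    return users.get(name)


def known_emails(users: Dict[str, str], names: List[str]) -> List[str]:
    """Return emails for the names that exist, skipping unknown ones."""
    found = (find_email(users, name) for name in names)
    return [email for email in found if email is not None]
```

The Optional[str] return type is a reminder to callers that they have to handle the None case, which known_emails() does by filtering it out.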
Writing unit tests is a simple way to make sure your code does what it’s supposed to, without unexpected surprises. They help you catch bugs early and make changes with confidence, knowing everything still works as expected. In Python, there are several libraries available for unit testing, each with its strengths and ecosystem:
unittest
pytest
nose2
hypothesis
For this guide, we’ll go with pytest because it’s simple, flexible, and easy to use. You can install it with:
pip install pytest
Next, create a file named test_data_tools.py inside the tests/ folder. Let’s write some tests for the code we implemented earlier. Here’s a sample test for our load_and_clean_data() function and environment variable retrieval logic:
In test_load_and_clean_data(), we use StringIO to simulate a CSV file as input. This allows us to test without needing an actual file. The test verifies that the function correctly removes duplicates and NaN values, checks that the DataFrame has no missing data, and confirms that the unique entries in the "name" column are correct.
In test_get_database_url() and test_get_api_key(), we use monkeypatch, a utility provided by pytest, to temporarily set environment variables during the test. This ensures the functions return the expected values without needing real environment variables.
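The tests described above could look roughly like the sketch below. The three functions under test are re-declared as short stand-ins so the snippet runs on its own; in the project they would be imported from de_toolkit.data_tools. The sample rows and the variable names DATABASE_URL and API_KEY are assumptions.

```python
import os
from io import StringIO

import pandas as pd


# Stand-ins so the sketch runs on its own; in the project these would be
# imported from de_toolkit.data_tools instead.
def load_and_clean_data(file_path) -> pd.DataFrame:
    return pd.read_csv(file_path).dropna().drop_duplicates()


def get_database_url():
    return os.getenv("DATABASE_URL")  # assumed variable name


def get_api_key():
    return os.getenv("API_KEY")  # assumed variable name


def test_load_and_clean_data():
    # StringIO behaves like an open file, so no CSV has to exist on disk.
    csv_data = StringIO("name,age\nAlice,30\nBob,25\nAlice,30\nCharlie,\n")
    df = load_and_clean_data(csv_data)
    assert not df.isnull().values.any()
    assert sorted(df["name"].unique()) == ["Alice", "Bob"]


def test_get_database_url(monkeypatch):
    # monkeypatch undoes the change automatically when the test finishes.
    monkeypatch.setenv("DATABASE_URL", "sqlite:///test.db")
    assert get_database_url() == "sqlite:///test.db"


def test_get_api_key(monkeypatch):
    monkeypatch.setenv("API_KEY", "dummy-key")
    assert get_api_key() == "dummy-key"
```

pytest supplies the monkeypatch argument on its own because the parameter name matches one of its built-in fixtures.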
To run all the tests, simply execute the following command:
pytest
One of the reasons I love pytest is its flexibility. It goes beyond basic unit testing by offering powerful features like fixtures, parameterized tests, and plugins. Fixtures let you set up test data or configurations that multiple tests can reuse, which keeps your code DRY (Don’t Repeat Yourself). Parameterized tests allow you to run the same test with different inputs, saving time and reducing duplication. And if you need to extend pytest’s functionality, there’s a wide ecosystem of plugins for things like testing Django apps, measuring code coverage, or mocking HTTP requests.
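As a quick sketch of fixtures and parameterized tests, here is a hypothetical normalize_name() helper (not part of the demo library) tested both ways:

```python
import pytest


def normalize_name(raw: str) -> str:
    """Hypothetical helper: trim whitespace and title-case a name."""
    return raw.strip().title()


@pytest.fixture
def sample_names():
    # Shared test data that any test can request by naming this fixture.
    return ["  alice ", "BOB", "charlie"]


def test_normalize_sample_names(sample_names):
    assert [normalize_name(n) for n in sample_names] == ["Alice", "Bob", "Charlie"]


@pytest.mark.parametrize(
    "raw, expected",
    [("  alice ", "Alice"), ("BOB", "Bob"), ("mary ann", "Mary Ann")],
)
def test_normalize_name(raw, expected):
    assert normalize_name(raw) == expected
```

pytest reports each parameter set as its own test case, so a failure points directly at the input that caused it.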
Maintaining high code quality ensures your code is easy to read, maintain, and free of common bugs. Several tools can help enforce consistent coding standards, automatically format code, and detect potential issues early. Some popular options include pylint, flake8, black, and detect-secrets.
pylint enforces coding standards and catches common errors.
flake8 combines pyflakes, pycodestyle, and mccabe to detect style violations and logical errors.
black is an opinionated formatter that automatically rewrites your code into a consistent, PEP 8-compliant style.
detect-secrets scans your code to prevent hard-coded secrets from being exposed.
You can install these tools with:
pip install pylint flake8 black detect-secrets
For example, run pylint on a specific file or directory:
pylint de_toolkit/
You’ll get a report with warnings and suggestions to improve your code. To ignore specific warnings, you can use:
pylint de_toolkit/ --disable=C0114,C0116
You can also use flake8 to find style issues and logical errors:
flake8 de_toolkit/
To automatically format your code, run black:
black de_toolkit/
Instead of running these tools manually every time you make changes, you can automate the process with pre-commit hooks. Pre-commit hooks run automatically before each commit, blocking the commit if any tool fails.
First, install the pre-commit package:
pip install pre-commit
Next, create a .pre-commit-config.yaml file in your project directory with the following content (here I used all my favorite basic pre-commits):
Activate the pre-commit hooks in your local repository:
pre-commit install
Now, every time you try to commit, these tools will run automatically. If any tool fails, the commit will be blocked until the issue is resolved. You can also run all hooks manually across your codebase:
pre-commit run --all-files
Now that we’ve built our project, written some code, added tests, and set up pre-commit hooks, the next step is figuring out how others (or even future us) can easily use it. Packaging the project makes that possible. It allows us to bundle everything neatly so it can be installed and used without copying files manually.
To share your project, you need to structure the package properly, write a meaningful README, create a start script, and generate the distribution package. A good README usually includes the project name and a brief description of what it does, installation instructions, usage examples, development instructions for setting up the environment, and guidelines for contributing. You can find a simple README.md example for our blog_de_toolkit project in the repository.
At the core of any Python package is the setup.py file. This file is where we define the metadata and configuration needed to package and install our project. It includes the project’s name, version, and description, which make it identifiable. The long_description reads from the README file to give users more context about the project when they see it on PyPI. We specify dependencies in the install_requires list so they are automatically installed along with the package. The entry_points section defines a command-line interface (CLI) entry, so users can run the tool from their terminal. We use find_packages() to include all submodules in the package, and the classifiers section provides metadata, like which Python version and license the project uses. Finally, the python_requires field ensures the package installs only on compatible Python versions. Here’s the setup.py configuration for our blog_de_toolkit project:
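The sketch below shows how those fields could fit together. Every value in it (version, description, dependency list, CLI name, license, minimum Python version) is an illustrative assumption rather than the repository’s real metadata, and the setup() call is wrapped in a function only so the metadata can be inspected without starting a build.

```python
from pathlib import Path


def read_readme(path: str = "README.md") -> str:
    """Return the README text, or an empty string if the file is missing."""
    readme = Path(path)
    return readme.read_text(encoding="utf-8") if readme.exists() else ""


# All values below are illustrative assumptions, not the repository's real metadata.
SETUP_KWARGS = {
    "name": "blog_de_toolkit",
    "version": "0.1.0",
    "description": "A small demo library for basic data handling",
    "long_description": read_readme(),
    "long_description_content_type": "text/markdown",
    "install_requires": ["pandas", "python-dotenv"],
    "entry_points": {
        # Hypothetical CLI name and target function.
        "console_scripts": ["de-toolkit=de_toolkit.data_tools:main"],
    },
    "classifiers": [
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
    ],
    "python_requires": ">=3.8",
}


def run_setup() -> None:
    """Hand the metadata to setuptools; setup.py would call this at module level."""
    from setuptools import find_packages, setup  # imported lazily for this sketch

    setup(packages=find_packages(), **SETUP_KWARGS)
```

In an actual setup.py you would usually skip the wrapper and call setup(...) directly at the top of the file.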
Once the setup.py is configured, you can build the distribution package. Start by installing the necessary tools with:
pip install setuptools wheel
Then build the package:
python setup.py sdist bdist_wheel
This command creates two distribution files:
sdist: A source archive (e.g., .tar.gz)
bdist_wheel: A built package (e.g., .whl)
These files will be located in the dist/ directory. To test the package, install it locally with:
pip install dist/*.whl
You can also test the CLI by running the command name registered under entry_points in setup.py. This should print the database URL, API key, and the cleaned data from sample_data.csv.
If you want to share the package publicly, you can upload it to PyPI. First, install Twine:
pip install twine
Then upload the package:
twine upload dist/*
You’ll be prompted to enter your PyPI credentials (PyPI now expects an API token for uploads rather than an account password). Once uploaded, others can install your package directly from PyPI with:
pip install blog_de_toolkit
As your project grows, more people will work on the same codebase, often at the same time. Without proper safeguards, it’s easy for mistakes, untested code, or accidental merges to sneak in and mess things up. To keep things running smoothly and maintain high standards, protecting the main branch becomes essential. In this step, we’ll look at how to set up branch protection rules and share some tips for conducting smooth code reviews on pull requests.
Branch protection rules make sure no one can push directly to the main branch without passing tests or getting a code review. This prevents unfinished features, bugs, or bad code from sneaking in and breaking the project. It also promotes teamwork by requiring pull requests, giving others a chance to provide feedback. Plus, automated checks — like tests and linters — ensure the code is solid before merging.
Setting up branch protection rules on GitHub is pretty straightforward. Head over to your repository’s Settings and click Branches under the “Code and automation” section. Look for Branch protection rules and click Add branch protection rule. Enter main in the branch name field, and now it’s time to tweak some settings.
You can set branch protection rules to require pull request reviews, ensuring that someone checks the code before it’s merged. Status checks make sure tests pass and linters run smoothly, and keeping branches up to date with the latest changes helps avoid conflicts. If needed, you can restrict who can push to the branch or even require signed commits for extra security. Once everything is set, click Create, and just like that — no more direct pushes or skipped tests.
When your pull request is up for review, it’s a good idea to make things easy for your reviewers. Start with a clear description of what your changes do and why they’re needed. Use meaningful commit messages that reflect what’s been updated. If the changes are small and focused, the review process becomes smoother and faster. Don’t forget to respond to comments politely and follow up on requested changes — it shows that you value feedback and helps keep the collaboration positive.
If you’re the one reviewing a pull request, your job goes beyond just finding mistakes: it’s about improving the code and supporting your teammate. Start by reading the pull request description to understand what the changes are trying to accomplish. Focus on giving constructive feedback, suggest alternatives if needed, and explain why they might work better. Recognizing good work with a simple “Nice refactor!” also helps create a positive review experience. Keep an eye on tests to make sure they’re present, relevant, and passing. And if something isn’t clear, ask questions instead of making assumptions. At the end of the day, reviews are about teamwork: collaborating to make the project better together.
Using review templates can help make the process smoother by keeping everyone focused on what matters. A pull request review template typically turns the points above into a short checklist: whether the description explains what changed and why, whether tests are present, relevant, and passing, and whether the code is readable and follows the project’s style.
Adding a template like this to your contributing guidelines or linking it in the repository makes it easy for reviewers to stay on track. It also keeps things consistent across reviews, helping the team maintain a clean and organized codebase.
Building on the importance of keeping your main branch protected, it’s also crucial to make sure that every code change is properly tested, reviewed, and validated before merging or deploying it. This is where Continuous Integration (CI) and Continuous Delivery/Deployment (CD) come into play. CI/CD automates the process of running tests, performing code checks, and deploying changes, providing quick feedback to developers and reducing the chances of bugs making their way into production.
GitHub Actions is an automation tool integrated directly into GitHub. It enables you to create workflows that respond to events in your repository, such as pushes or pull requests. In GitHub Actions, we can automate several key tasks to maintain a healthy codebase. For example:
Running tests whenever code is pushed or a pull request is created, making sure new changes don’t break anything.
Checking code style and linting to enforce consistent coding standards.
Applying pre-commit hooks to format the code and catch small issues like trailing spaces.
Generating documentation or even deploying the code when all checks pass.
Let’s set up a GitHub Actions workflow that runs our unit tests and applies pre-commit linters (like black) whenever a push or pull request happens on the main branch.
First, create the workflow file:
mkdir -p .github/workflows && touch .github/workflows/ci.yml
Here’s the content for ci.yml:
This workflow automates testing and linting whenever code is pushed or a pull request is opened on the main branch. It makes sure all quality checks are passed before the code is merged. The actions/checkout action clones the repository into the runner, and we use actions/setup-python to configure Python 3.12 for the workflow. Dependencies are installed from requirements.txt using pip. After that, all tests are run with pytest, and pre-commit hooks ensure the code follows formatting and style guidelines. If any test or check fails, the workflow stops to prevent broken code from being merged.
Let’s test it out. First, create a new branch from main and make some changes.
In my case, I updated the README file. Commit your changes and open a pull request into the main branch.
Now you’ll see that the review is required, and GitHub Actions (GA) is running all its checks. Even though merging is blocked by branch protection rules, I can still “Merge without waiting for requirements to be met” because my permissions allow bypassing the protections.
You can track the results of your GitHub Actions workflow in the Actions tab.
Here’s an example of how the pytest step looks during a run:
Keeping track of your project’s version manually can get messy, especially as it grows. That’s where semantic versioning (SemVer) comes in — it follows the MAJOR.MINOR.PATCH pattern to communicate what has changed in each release. Automating versioning with python-semantic-release makes this even easier. It analyzes your commit messages, bumps the version based on the type of changes, tags releases, and can even publish your package to PyPI if you want. This takes the guesswork out of version management and ensures consistency.
For seamless versioning, you can integrate python-semantic-release directly into GitHub Actions. The official documentation provides workflows that automate version bumps and releases whenever you push to the main branch. With this setup, the release process becomes smooth and hands-off, so you can focus on writing code without worrying about managing versions manually.
Common Workflow Example — python-semantic-release
To make this work, your commit messages need to follow conventional commit standards. Each type of commit determines whether the version will bump the PATCH, MINOR, or MAJOR level:
fix: Triggers a PATCH version bump (e.g., 1.0.0 → 1.0.1).
feat: Triggers a MINOR version bump (e.g., 1.0.0 → 1.1.0).
BREAKING CHANGE: in the commit footer, or ! after the commit type (e.g., feat!:), triggers a MAJOR version bump (e.g., 1.0.0 → 2.0.0).
By following these simple conventions, you’ll always know what to expect with each new version.
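To make the rules above concrete, here is a simplified sketch of how a commit message maps to a version bump. It only illustrates the convention; it is not how python-semantic-release is implemented, and the function names are made up for this example.

```python
import re
from typing import Optional

# Matches a Conventional Commits header such as "feat(parser)!: add option".
HEADER = re.compile(r"^(?P<type>\w+)(\([^)]*\))?(?P<bang>!)?: ")


def bump_type(message: str) -> Optional[str]:
    """Return "major", "minor", "patch", or None for one commit message."""
    match = HEADER.match(message)
    if match is None:
        return None
    if match.group("bang") or "BREAKING CHANGE:" in message:
        return "major"
    return {"feat": "minor", "fix": "patch"}.get(match.group("type"))


def next_version(current: str, message: str) -> str:
    """Apply the bump implied by `message` to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in current.split("."))
    bump = bump_type(message)
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current  # e.g. docs: or chore: commits leave the version unchanged
```

For example, next_version("1.0.0", "feat: add JSON loader") yields "1.1.0", while a docs: commit leaves the version where it is.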
We’ve covered everything from organizing your project and managing secrets to writing tests, automating workflows, and handling releases with semantic versioning. With the right tools and processes, building a reliable and maintainable project becomes much smoother — and even fun.
The key is to stay consistent, automate where you can, and keep improving as you go. Each small step makes a big difference over time. Now it’s your turn — go build, experiment, and enjoy the process! Try applying these steps to your next project — and feel free to share your experience in the comments!