This text is the first in a series of texts about testing in data processing applications that I will be bringing here and on my personal blog.
When I made my career transition from software engineer to data engineer, I started having conversations with people in the data area who didn't have a background in software engineering. In these conversations, a question arose repeatedly: how to write tests?
Writing tests can, in fact, seem like a complex task to those who are not used to it, as it requires a change in the way of writing code. The truth is that there is no mystery, but rather a matter of practice and repetition. My main objective in this article is to guide you, who are just starting out, in a process that shows how we can create tests for applications that process data, ensuring quality and reliability in the code.
This text is part of a series that I will be bringing over the next few weeks where I share how to write automated tests in code aimed at data engineering. In today's article I want to explore a little about mocks. In many code scenarios, a data pipeline will be making connections, calling APIs, integrating with cloud services, and so on, which can cause some confusion about how to test such an application. Today we will explore some interesting libraries for writing tests focused on the use of mocks.
Mocks are simulated objects used in tests to imitate the behavior of external dependencies or components that are not the focus of the test. They allow you to isolate the unit of code being tested, making the test more controllable and predictable. Using mocks is a common practice in both unit and integration testing.
And we should use mocks when:

- the code under test depends on external components, such as APIs, databases, or messaging services, that are not the focus of the test;
- we need deterministic tests, free from intermittent failures caused by real connections or external infrastructure;
- calling the real dependency would make the tests slow, costly, or risky to run.
In data pipelines, mocking allows you to create representations of external components, such as a database, a messaging service, or an API, without depending on their real infrastructure. This is particularly useful in data processing environments, which integrate multiple technologies, such as PySpark for distributed processing, Kafka for messaging, and cloud services like AWS and GCP.

In these data pipeline scenarios, mocking makes it possible to run isolated, fast tests, minimizing costs and execution time. It allows each part of the pipeline to be verified accurately, without intermittent failures caused by real connections or external infrastructure, and with confidence that each integration works as expected.
In each programming language, we can find modules that already provide mock functionality. In Python, the native unittest.mock library is the main tool for creating mocks, allowing you to simulate objects and functions with ease and control. In Go, mocking is commonly supported by external packages such as mockery, since the language does not have a native mock library; mockery is especially useful for generating mocks from interfaces, a native feature of Go. In Java, Mockito stands out as a popular and powerful library for creating mocks, integrating with JUnit to facilitate robust unit testing. These libraries provide an essential foundation for testing isolated components, especially in data pipelines and distributed systems where simulating external data sources and APIs is critical.
Let's start with a basic example of how we can use mocks. Suppose we have a function that makes API calls and we need to write unit tests for it:
```python
import requests

def get_data_from_api(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.json()
    else:
        return None
```
To correctly approach test scenarios, we first need to understand which situations must be covered. As our function makes REST calls, tests must consider at least two main scenarios: one in which the request is successful and another in which the response is not as expected. We could run the code with a real URL to observe the behavior, but this approach has disadvantages, as we would not have control over the different types of responses, in addition to leaving the test vulnerable to changes in the URL response or its eventual unavailability. To avoid these inconsistencies, we will use Mocks.
```python
from unittest import mock

@mock.patch('requests.get')
def test_get_data_from_api_success(mock_get):
    # Configure the mock to return a simulated response
    mock_get.return_value.status_code = 200
    mock_get.return_value.json.return_value = {"key": "value"}

    # Call the function with the mock active
    result = get_data_from_api("http://fakeurl.com")

    # Verify that the mock was called correctly and the result is as expected
    mock_get.assert_called_once_with("http://fakeurl.com")
    assert result == {"key": "value"}
```
With the @mock.patch decorator from Python's unittest.mock library, we can replace the requests.get call with a mock, a "fake object" that simulates the behavior of the get function within the test context, eliminating the external dependency.
By defining values for the mock's return_value, we can specify exactly what we expect the object to return when it is called in the function under test. It is important that the structure of the return_value mirrors that of the real object being replaced. For example, a response object from the requests module has attributes like status_code and methods like json(). Thus, to simulate a response from the requests.get function, we can assign the expected values to these attributes and methods directly on the mock.
In this specific case, the focus is to simulate the request response, that is, to test the behavior of the function with different expected results without depending on an external URL and without impact on our testing environment.
By simulating API error responses in tests, we can go beyond the basics and check application behavior against different HTTP status codes, such as 404, 401, 500, and 503. This provides broader coverage and ensures that the application deals adequately with each type of failure, helping us understand how these variations can impact our application and data processing. In POST calls, we can add an extra layer of validation, checking not only the status_code and the basic functioning of the call, but also the schema of the request and response payloads, ensuring that the data returned follows the expected format. This more detailed testing approach helps prevent future problems by ensuring that the application is prepared to handle a variety of error scenarios and that the data received is always in line with what was designed.
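As a sketch, a failure-scenario test for the function above might look like this (404 is just one example; the same pattern applies to 401, 500, and 503):

```python
from unittest import mock

@mock.patch('requests.get')
def test_get_data_from_api_not_found(mock_get):
    # Simulate an error response from the API
    mock_get.return_value.status_code = 404

    result = get_data_from_api("http://fakeurl.com")

    # get_data_from_api returns None for any non-200 status
    mock_get.assert_called_once_with("http://fakeurl.com")
    assert result is None
```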
Now that we've seen a simple case of using mocks in pure Python code, let's expand our cases to a snippet of code that uses PySpark.
To test PySpark functions, especially DataFrame operations like filter, groupBy, and join, mocks are an effective way to cut the dependency on external Spark infrastructure, reducing testing time and simplifying the development environment. Python's unittest.mock library allows you to simulate the behavior of these methods, making it possible to verify the code flow and logic without a full cluster.
Consider the following function: a transformation that performs filter, groupBy, and join operations on Spark DataFrames.
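A minimal sketch of such a function is shown below; the column names (id, value) and the exact filter and aggregation logic are illustrative assumptions, and what matters here is the filter, groupBy, and join sequence:

```python
from pyspark.sql import functions as F

def transform_data(df, df_other):
    # Keep only positive values (illustrative filter condition)
    filtered = df.filter(F.col("value") > 0)

    # Aggregate per id (illustrative aggregation)
    grouped = filtered.groupBy("id").agg(F.sum("value").alias("total"))

    # Enrich the aggregated result with a second DataFrame
    return grouped.join(df_other, on="id", how="inner")
```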
To run a PySpark test, we need Spark to be configured locally. This configuration is done in the setUpClass method, which creates a Spark session used by all tests in the class. This allows us to run PySpark in isolation, making it possible to perform real transformation operations without relying on a full cluster. After the tests complete, the tearDownClass method terminates the Spark session, ensuring that all resources are released properly and the test environment is left clean.
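A sketch of that test class, reusing the transform_data sketch above (the sample data is an illustrative assumption):

```python
import unittest
from pyspark.sql import SparkSession

class TestTransformData(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Local Spark session shared by all tests in the class
        cls.spark = (
            SparkSession.builder
            .master("local[1]")
            .appName("transform-data-tests")
            .getOrCreate()
        )

    @classmethod
    def tearDownClass(cls):
        # Release Spark resources and leave the environment clean
        cls.spark.stop()

    def test_transform_data(self):
        df = self.spark.createDataFrame(
            [("a", 5), ("a", -1), ("b", 3)], ["id", "value"]
        )
        df_other = self.spark.createDataFrame(
            [("a", "foo"), ("b", "bar")], ["id", "label"]
        )

        result = transform_data(df, df_other)

        expected = [("a", 5, "foo"), ("b", 3, "bar")]
        self.assertEqual(sorted(result.collect()), sorted(expected))
```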
In the test_transform_data test, we start by creating example DataFrames for df and df_other, which contain the data used in the transformations. We then execute the transform_data function without applying mocks, allowing the filter, groupBy, and join operations to actually run and produce a new DataFrame. After execution, we use the collect() method to extract the data from the resulting DataFrame, which lets us compare it with the expected values and thus validate the transformation in a real and accurate way.
But we can also have scenarios where we want to test the result of only some of these PySpark operations, mocking another part of the code that represents a bottleneck at execution time and does not pose a risk to our process. In that case, we can use the technique of mocking a single function or module, as we saw in the previous example using requests.
```python
# The mock mirrors the interface of the real response object:
#   response.status_code  ->  mock_get.return_value.status_code
#   response.json()       ->  mock_get.return_value.json.return_value
```
The mock test for a specific operation was carried out in the test_transform_data_with_mocked_join method, where we applied a mock specifically to the join method. This mock replaces the result of the join operation with a simulated DataFrame, allowing the previous operations, such as filter and groupBy, to be executed for real. The test then compares the resulting DataFrame with the expected value, ensuring that the join mock was used correctly, without interfering with the other transformations performed.
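A sketch of what such a test might look like, written as an extra method on the TestTransformData class sketched earlier (patching DataFrame.join with mock.patch.object is one way to replace only that step):

```python
from unittest import mock
from pyspark.sql import DataFrame

def test_transform_data_with_mocked_join(self):
    df = self.spark.createDataFrame([("a", 5), ("b", 3)], ["id", "value"])
    df_other = self.spark.createDataFrame([("a", "foo")], ["id", "label"])
    fake_joined = self.spark.createDataFrame(
        [("a", 5, "foo")], ["id", "total", "label"]
    )

    # Patch only DataFrame.join; filter and groupBy still run for real
    with mock.patch.object(DataFrame, "join", return_value=fake_joined) as mocked_join:
        result = transform_data(df, df_other)

    mocked_join.assert_called_once()
    self.assertEqual(result.collect(), fake_joined.collect())
```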
This hybrid approach brings several advantages. By ensuring that actual PySpark operations like filter and groupBy are maintained, we can validate the logic of the transformations without losing the flexibility of replacing specific operations like join with mocks. This results in more robust and faster tests, eliminating the need for a full Spark cluster, which makes ongoing code development and validation easier.
It is important to emphasize that this strategy should be used with caution and only in scenarios where it does not bias the results. The purpose of the test is to ensure that processing occurs correctly; we shouldn't simply assign values without actually exercising the function. Although it is valid to mock sections that we can guarantee will not affect the unit under test, it is essential to remember that the function must be executed to validate its real behavior.
Thus, the hybrid approach makes much more sense when we have other types of processing added to this function. This strategy allows for an effective combination of real and simulated operations, ensuring more robust and reliable tests.
Mocks are valuable allies in creating effective tests, especially when it comes to working with PySpark and other cloud services. The implementation we explored using unittest in Python not only helped us simulate operations but also maintain the integrity of our data and processes. With the flexibility that mocks offer, we can test our pipelines without the fear of wreaking havoc in production environments. So, ready for the next challenge? In our next text, we will dive into the world of integrations with AWS and GCP services, showing how to mock these calls and ensure that your pipelines work perfectly. Until next time!