This is the second part of a tutorial on serializing and deserializing Python objects. In the first part, you learned the basics and then delved into the details of Pickle and JSON.
In this part, you'll explore YAML (make sure you have the running example from Part One), consider performance and security, learn about several other serialization formats, and finally see how to choose the right scheme.
YAML is my favorite format. It is a human-friendly data serialization format. Unlike Pickle and JSON, it is not part of the Python standard library, so you need to install it:
pip install pyyaml
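If you don't have Part One handy, here is a plausible reconstruction of its running example, inferred from the YAML output shown below. Treat the exact definitions of A, simple, and complex as stand-ins for the real ones from Part One:

from datetime import datetime

# Reconstructed from the outputs below; the canonical version is in Part One.
class A(object):
    def __init__(self, simple):
        self.simple = simple

    def __eq__(self, other):
        return self.simple == other.simple

    def __ne__(self, other):
        return self.simple != other.simple

simple = dict(boolean=True,
              int_list=[1, 2, 3],
              none=None,
              number=3.44,
              text='string')

complex = dict(a=A(simple), when=datetime(2016, 3, 7))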
The yaml module has only load() and dump() functions. By default they work with strings, like loads() and dumps(), but they can also work with open streams: dump() accepts a stream as a second argument, and load() accepts an open stream in place of the string, so you can dump to and load from files.
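For example, here is a minimal sketch of the stream variant, using the simple dict from the running example (note that recent PyYAML versions also expect an explicit Loader argument to load()):

import yaml

# Write YAML straight to an open file instead of building a string.
with open('simple.yaml', 'w') as f:
    yaml.dump(simple, f)

# Read it back; load() accepts an open stream as well as a string.
with open('simple.yaml') as f:
    restored = yaml.load(f)

assert restored == simple

Back to the running example. Here is the simple object dumped as YAML: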
> import yaml
> print yaml.dump(simple)
boolean: true
int_list: [1, 2, 3]
none: null
number: 3.44
text: string
Please note how readable YAML is compared to Pickle or even JSON. Now comes the cool part about YAML: it understands Python objects! No need for custom encoders and decoders. Here's the complex serialization/deserialization using YAML:
> serialized = yaml.dump(complex)
> print serialized
a: !!python/object:__main__.A
  simple:
    boolean: true
    int_list: [1, 2, 3]
    none: null
    number: 3.44
    text: string
when: 2016-03-07 00:00:00

> deserialized = yaml.load(serialized)
> deserialized == complex
True
As you can see, YAML has its own notation for tagging Python objects, and the output is still very easy to read. The datetime object required no special tagging because YAML supports datetimes natively.
Before you start thinking about performance, you need to consider whether performance is an issue. If you're serializing/deserializing small amounts of data relatively infrequently (such as reading a config file at the beginning of your program), then performance isn't really an issue and you can move on.
However, if you profile your system and discover that serialization and/or deserialization are causing performance problems, here is what to look at.
Performance has two aspects: how fast is the serialization/deserialization, and how big is the serialized representation?
To test the performance of the various serialization formats, I'll create a larger data structure and serialize/deserialize it with Pickle, YAML, and JSON. The big_data list contains 5,000 complex objects:
big_data = [dict(a=simple, when=datetime.now().replace(microsecond=0)) for i in range(5000)]
I'll use IPython here because of its convenient %timeit magic function for measuring execution times.
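If you are not in IPython, the standard library's timeit module gives comparable numbers. A rough sketch (Python 2, matching the article; use pickle instead of cPickle on Python 3):

import timeit
import cPickle as pickle

# Rough equivalent of %timeit: best of 10 single runs, in milliseconds.
best = min(timeit.repeat(lambda: pickle.dumps(big_data), number=1, repeat=10))
print('pickle.dumps: %.1f ms per loop' % (best * 1000))

First up is Pickle: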
import cPickle as pickle

In [190]: %timeit serialized = pickle.dumps(big_data)
10 loops, best of 3: 51 ms per loop

In [191]: %timeit deserialized = pickle.loads(serialized)
10 loops, best of 3: 24.2 ms per loop

In [192]: deserialized == big_data
Out[192]: True

In [193]: len(serialized)
Out[193]: 747328
With the default protocol, pickle takes 51 milliseconds to serialize and 24.2 milliseconds to deserialize, and the serialized size is 747,328 bytes.
Let's try using the highest protocol.
In [195]: %timeit serialized = pickle.dumps(big_data, protocol=pickle.HIGHEST_PROTOCOL)
10 loops, best of 3: 21.2 ms per loop

In [196]: %timeit deserialized = pickle.loads(serialized)
10 loops, best of 3: 25.2 ms per loop

In [197]: len(serialized)
Out[197]: 394350
Interesting results. The serialization time dropped to just 21.2 ms, while the deserialization time increased slightly to 25.2 ms. The serialized size shrank significantly to 394,350 bytes, about 53% of the original.
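If you are curious where the savings come from, you can compare the serialized size under every protocol yourself. A quick sketch:

import cPickle as pickle

# Compare serialized sizes across the available pickle protocols.
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    size = len(pickle.dumps(big_data, protocol))
    print('protocol %d: %d bytes' % (protocol, size))

Now let's see how JSON does with the custom encoder and decoder from Part One: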
In [253]: %timeit serialized = json.dumps(big_data, cls=CustomEncoder)
10 loops, best of 3: 34.7 ms per loop

In [254]: %timeit deserialized = json.loads(serialized, object_hook=decode_object)
10 loops, best of 3: 148 ms per loop

In [255]: len(serialized)
Out[255]: 730000
OK. Encoding performance looks a little worse than Pickle's, but decoding performance is much, much worse: about 6x slower. What's going on? This is an artifact of the object_hook function, which has to run for every decoded dictionary to check whether it needs to be converted to an object. Without the object hook, decoding is much faster:
%timeit deserialized = json.loads(serialized)
10 loops, best of 3: 36.2 ms per loop
The lesson here is to think carefully about any custom encoding and decoding you use with JSON, because it can have a major impact on overall performance.
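For reference, an object_hook in the spirit of Part One's decode_object looks roughly like this. This is a hypothetical sketch, not the actual Part One code; the point is the per-dictionary check:

from datetime import datetime

def decode_object(d):
    # json.loads calls this hook once for EVERY decoded dictionary,
    # so even a cheap membership test adds up over 5,000 objects.
    if '__datetime__' in d:
        return datetime.strptime(d['__datetime__'], '%Y-%m-%dT%H:%M:%S')
    return d

Now for YAML: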
In [293]: %timeit serialized = yaml.dump(big_data)
1 loops, best of 3: 1.22 s per loop

In [294]: %timeit deserialized = yaml.load(serialized)
1 loops, best of 3: 2.03 s per loop

In [295]: len(serialized)
Out[295]: 200091
OK. YAML is really, really slow. However, note something interesting: the serialized size is only 200,091 bytes. Much better than both Pickle and JSON. Let’s take a quick look inside:
In [300]: print serialized[:211]
- a: &id001
    boolean: true
    int_list: [1, 2, 3]
    none: null
    number: 3.44
    text: string
  when: 2016-03-13 00:11:44
- a: *id001
  when: 2016-03-13 00:11:44
- a: *id001
  when: 2016-03-13 00:11:44
YAML is being very clever here. It recognized that all 5,000 dictionaries share the same value for the "a" key, so it stores it only once and references it with *id001 in every other object.
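You can watch the anchor/alias mechanism work on a tiny example. A sketch, assuming PyYAML (the exact layout of the output varies a little between versions):

import yaml

shared = dict(x=1)
data = [dict(a=shared), dict(a=shared)]  # the same dict referenced twice

# The shared dict gets an anchor on first use and an alias afterwards.
print(yaml.dump(data))
# Output looks roughly like:
# - a: &id001 {x: 1}
# - a: *id001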
Security is often a critical concern. Pickle and YAML, because they construct arbitrary Python objects, are vulnerable to code-execution attacks: a cleverly crafted file can contain arbitrary code that Pickle or YAML will execute. There is no need to panic; this is by design and is documented in Pickle's documentation:
Warning: The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
And in the PyYAML documentation:
Warning: It is not safe to call yaml.load with any data received from an untrusted source! yaml.load is as powerful as pickle.load, and so may call any Python function.
Just know that you should not use Pickle or YAML to load serialized data received from untrusted sources. JSON is fine, but if you have a custom encoder/decoder you might be exposed as well.
The yaml module provides the yaml.safe_load() function, which loads only simple objects, but then you lose a lot of YAML's power and may as well just use JSON.
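Here is a quick sketch of the difference, using a harmless function call as a stand-in for a real attack. With the PyYAML of the era, load() executes it; newer releases default to safer loaders and warn if you call load() without an explicit Loader:

import yaml

# A document that calls a (harmless) Python function when loaded.
doc = "!!python/object/apply:os.getcwd []"

print(yaml.load(doc))      # runs os.getcwd(), but any function would do
try:
    yaml.safe_load(doc)    # safe_load refuses python/* tags
except yaml.YAMLError as e:
    print('rejected: %s' % e)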
There are many other serialization formats available. Here are some of them.
Protobuf (i.e. Protocol Buffers) is Google's data interchange format. It is implemented in C++ and has Python bindings. It has a rich schema definition language and packs data very efficiently. Very powerful, but not very easy to use.
MessagePack is another popular serialization format. It is also binary and efficient, but unlike Protobuf it doesn't require a schema. Its type system is similar to JSON's, but a little richer: keys can be any type, not just strings, and non-UTF8 strings are supported.
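A minimal round trip looks like the sketch below, assuming the msgpack Python bindings are installed (pip install msgpack; older releases were published as msgpack-python). As with JSON, datetimes and custom classes need extra handling:

import msgpack

data = dict(int_list=[1, 2, 3], text='string', number=3.44, boolean=True)

packed = msgpack.packb(data)        # compact binary bytes
unpacked = msgpack.unpackb(packed)  # some versions need options to round-trip str keys
assert unpacked == data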
CBOR stands for Concise Binary Object Representation. It too supports the JSON data model. CBOR is not as well known as Protobuf or MessagePack, but it is interesting for two reasons: it is an official Internet standard (RFC 7049), and it was designed for small message and code size, making it a good fit for constrained environments such as the Internet of Things.
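For completeness, here is the same round trip with the cbor2 package, one of several Python CBOR implementations (an assumption on my part; the article doesn't name a specific package):

import cbor2

data = dict(int_list=[1, 2, 3], text='string', number=3.44, none=None)

encoded = cbor2.dumps(data)  # compact binary bytes
decoded = cbor2.loads(encoded)
assert decoded == data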
So many choices. How do you choose? Rather than weigh every factor in the abstract, I'll make it really simple and walk through a few common scenarios, with the format I recommend for each:
If you're saving and loading Python objects from your own programs, use pickle (cPickle) with HIGHEST_PROTOCOL. It's fast and efficient, and it can store and load most Python objects without any special code. It also works well as a local persistent cache.
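A minimal sketch of such a cache, with cache.pkl as a hypothetical file name:

import cPickle as pickle

def save_cache(obj, filename='cache.pkl'):
    # Persist any picklable object with the fastest, most compact protocol.
    with open(filename, 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_cache(filename='cache.pkl'):
    with open(filename, 'rb') as f:
        return pickle.load(f)

save_cache(big_data)
assert load_cache() == big_data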
For configuration files, it's definitely YAML. Nothing beats its simplicity for anything humans need to read or edit; it is used successfully by Ansible and many other projects. In some situations you may prefer to use a plain Python module as the configuration file. That may be the right choice, but then it isn't serialization: the configuration is really part of the program, not a separate file.
For web APIs, JSON is the clear winner. Today, web APIs are consumed most often by JavaScript web applications, which speak JSON natively. Some web APIs may return other formats (e.g. csv for dense tabular result sets), but I'd argue you can pack csv data into JSON with minimal overhead (no need to repeat each row as an object with all the column names).
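For example, here is one way to pack tabular data into JSON without repeating the column names per row. This is a sketch of the idea, not a standard format:

import json

# A list of row objects repeats every column name in every row...
rows = [dict(name='alice', score=10), dict(name='bob', score=7)]

# ...while a columns-plus-rows layout sends each name exactly once.
table = dict(columns=['name', 'score'],
             rows=[['alice', 10], ['bob', 7]])

print(json.dumps(table))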
For high-volume, high-performance communication, use one of the binary protocols: Protobuf (if you need a schema), MessagePack, or CBOR. Run your own tests to verify the performance and representational power of each option.
Serialization and deserialization of Python objects is an important aspect of distributed systems. You cannot send Python objects directly over the network. You often need to interoperate with other systems implemented in other languages, and sometimes you just want to store the state of your program in persistent storage.
Python comes with several serialization schemes in its standard library, and many more are available as third-party modules. Understanding all the options and the pros and cons of each will allow you to choose the method that best suits your situation.