This is the second part of a tutorial on serializing and deserializing Python objects. In the first part, you learned the basics and then delved into the details of Pickle and JSON.
In this part, you'll explore YAML (make sure you have the running example from Part One), consider performance and security, learn about several other serialization formats, and finally see how to choose the right scheme.
YAML is my favorite format. It is a human-friendly data serialization format. Unlike Pickle and JSON, it is not part of the Python standard library, so you need to install it:
pip install pyyaml
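If you don't have Part One handy, here is a plausible reconstruction of its running example, inferred from the YAML output shown below. Treat the exact definitions of A, simple, and complex as stand-ins for the real ones from Part One:

from datetime import datetime

# Reconstructed from the outputs below; the canonical version is in Part One.
class A(object):
    def __init__(self, simple):
        self.simple = simple

    def __eq__(self, other):
        return self.simple == other.simple

    def __ne__(self, other):
        return self.simple != other.simple

simple = dict(boolean=True,
              int_list=[1, 2, 3],
              none=None,
              number=3.44,
              text='string')

complex = dict(a=A(simple), when=datetime(2016, 3, 7))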
The yaml module has only load() and dump() functions. By default they work with strings, like loads() and dumps(), but they can also work with open streams: dump() accepts a stream as a second argument, and load() accepts an open stream in place of the string, so you can dump to and load from files.
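For example, here is a minimal sketch of the stream variant, using the simple dict from the running example (note that recent PyYAML versions also expect an explicit Loader argument to load()):

import yaml

# Write YAML straight to an open file instead of building a string.
with open('simple.yaml', 'w') as f:
    yaml.dump(simple, f)

# Read it back; load() accepts an open stream as well as a string.
with open('simple.yaml') as f:
    restored = yaml.load(f)

assert restored == simple

Back to the running example. Here is the simple object dumped as YAML: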
> import yaml
> print yaml.dump(simple)
boolean: true
int_list: [1, 2, 3]
none: null
number: 3.44
text: string
Please note how readable YAML is compared to Pickle or even JSON. Now comes the cool part about YAML: it understands Python objects! No need for custom encoders and decoders. Here's the complex serialization/deserialization using YAML:
> serialized = yaml.dump(complex)
> print serialized
a: !!python/object:__main__.A
  simple:
    boolean: true
    int_list: [1, 2, 3]
    none: null
    number: 3.44
    text: string
when: 2016-03-07 00:00:00

> deserialized = yaml.load(serialized)
> deserialized == complex
True
As you can see, YAML has its own notation for tagging Python objects, and the output is still very easy to read. The datetime object required no special tagging because YAML supports datetimes natively.
Before you start thinking about performance, you need to consider whether performance is an issue. If you're serializing/deserializing small amounts of data relatively infrequently (such as reading a config file at the beginning of your program), then performance isn't really an issue and you can move on.
However, if you profile your system and discover that serialization and/or deserialization are causing performance problems, here is what to look at.
Performance has two aspects: how fast is the serialization/deserialization, and how big is the serialized representation?
To test the performance of the various serialization formats, I'll create a larger data structure and serialize/deserialize it with Pickle, YAML, and JSON. The big_data list contains 5,000 complex objects:
big_data = [dict(a=simple, when=datetime.now().replace(microsecond=0)) for i in range(5000)]
I'll use IPython here because of its convenient %timeit magic function for measuring execution times.
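If you are not in IPython, the standard library's timeit module gives comparable numbers. A rough sketch (Python 2, matching the article; use pickle instead of cPickle on Python 3):

import timeit
import cPickle as pickle

# Rough equivalent of %timeit: best of 10 single runs, in milliseconds.
best = min(timeit.repeat(lambda: pickle.dumps(big_data), number=1, repeat=10))
print('pickle.dumps: %.1f ms per loop' % (best * 1000))

First up is Pickle: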
import cPickle as pickle

In [190]: %timeit serialized = pickle.dumps(big_data)
10 loops, best of 3: 51 ms per loop

In [191]: %timeit deserialized = pickle.loads(serialized)
10 loops, best of 3: 24.2 ms per loop

In [192]: deserialized == big_data
Out[192]: True

In [193]: len(serialized)
Out[193]: 747328
With the default protocol, pickle takes 51 milliseconds to serialize and 24.2 milliseconds to deserialize, and the serialized size is 747,328 bytes.
Let's try using the highest protocol.
In [195]: %timeit serialized = pickle.dumps(big_data, protocol=pickle.HIGHEST_PROTOCOL)
10 loops, best of 3: 21.2 ms per loop

In [196]: %timeit deserialized = pickle.loads(serialized)
10 loops, best of 3: 25.2 ms per loop

In [197]: len(serialized)
Out[197]: 394350
Interesting results. The serialization time dropped to just 21.2 ms, while the deserialization time increased slightly to 25.2 ms. The serialized size shrank significantly to 394,350 bytes, about 53% of the original.
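If you are curious where the savings come from, you can compare the serialized size under every protocol yourself. A quick sketch:

import cPickle as pickle

# Compare serialized sizes across the available pickle protocols.
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    size = len(pickle.dumps(big_data, protocol))
    print('protocol %d: %d bytes' % (protocol, size))

Now let's see how JSON does with the custom encoder and decoder from Part One: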
In [253]: %timeit serialized = json.dumps(big_data, cls=CustomEncoder)
10 loops, best of 3: 34.7 ms per loop

In [254]: %timeit deserialized = json.loads(serialized, object_hook=decode_object)
10 loops, best of 3: 148 ms per loop

In [255]: len(serialized)
Out[255]: 730000
OK. Encoding performance looks a little worse than Pickle's, but decoding performance is much, much worse: about 6x slower. What's going on? This is an artifact of the object_hook function, which has to run for every decoded dictionary to check whether it needs to be converted to an object. Without the object hook, decoding is much faster:
%timeit deserialized = json.loads(serialized)
10 loops, best of 3: 36.2 ms per loop
The lesson here is to think carefully about any custom encoding and decoding you use with JSON, because it can have a major impact on overall performance.
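For reference, an object_hook in the spirit of Part One's decode_object looks roughly like this. This is a hypothetical sketch, not the actual Part One code; the point is the per-dictionary check:

from datetime import datetime

def decode_object(d):
    # json.loads calls this hook once for EVERY decoded dictionary,
    # so even a cheap membership test adds up over 5,000 objects.
    if '__datetime__' in d:
        return datetime.strptime(d['__datetime__'], '%Y-%m-%dT%H:%M:%S')
    return d

Now for YAML: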
In [293]: %timeit serialized = yaml.dump(big_data)
1 loops, best of 3: 1.22 s per loop

In [294]: %timeit deserialized = yaml.load(serialized)
1 loops, best of 3: 2.03 s per loop

In [295]: len(serialized)
Out[295]: 200091
OK. YAML is really, really slow. However, note something interesting: the serialized size is only 200,091 bytes. Much better than both Pickle and JSON. Let’s take a quick look inside:
In [300]: print serialized[:211]
- a: &id001
    boolean: true
    int_list: [1, 2, 3]
    none: null
    number: 3.44
    text: string
  when: 2016-03-13 00:11:44
- a: *id001
  when: 2016-03-13 00:11:44
- a: *id001
  when: 2016-03-13 00:11:44
YAML is being very clever here. It recognized that all 5,000 dictionaries share the same value for the "a" key, so it stores it only once and references it with *id001 in every other object.
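You can watch the anchor/alias mechanism work on a tiny example. A sketch, assuming PyYAML (the exact layout of the output varies a little between versions):

import yaml

shared = dict(x=1)
data = [dict(a=shared), dict(a=shared)]  # the same dict referenced twice

# The shared dict gets an anchor on first use and an alias afterwards.
print(yaml.dump(data))
# Output looks roughly like:
# - a: &id001 {x: 1}
# - a: *id001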
Security is often a critical concern. Pickle and YAML, because they construct arbitrary Python objects, are vulnerable to code-execution attacks: a cleverly crafted file can contain arbitrary code that Pickle or YAML will execute. There is no need to panic; this is by design and is documented in Pickle's documentation:
Warning: The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
And in the PyYAML documentation:
Warning: It is not safe to call yaml.load with any data received from an untrusted source! yaml.load is as powerful as pickle.load, and so may call any Python function.
Just know that you should not use Pickle or YAML to load serialized data received from untrusted sources. JSON is fine, but if you have a custom encoder/decoder you might be exposed as well.
The yaml module provides the yaml.safe_load() function, which loads only simple objects, but then you lose a lot of YAML's power and may as well just use JSON.
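Here is a quick sketch of the difference, using a harmless function call as a stand-in for a real attack. With the PyYAML of the era, load() executes it; newer releases default to safer loaders and warn if you call load() without an explicit Loader:

import yaml

# A document that calls a (harmless) Python function when loaded.
doc = "!!python/object/apply:os.getcwd []"

print(yaml.load(doc))      # runs os.getcwd(), but any function would do
try:
    yaml.safe_load(doc)    # safe_load refuses python/* tags
except yaml.YAMLError as e:
    print('rejected: %s' % e)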
There are many other serialization formats available. Here are some of them.
Protobuf (i.e. Protocol Buffers) is Google's data interchange format. It is implemented in C++ and has Python bindings. It has a rich schema definition language and packs data very efficiently. Very powerful, but not very easy to use.
MessagePack is another popular serialization format. It is also binary and efficient, but unlike Protobuf it doesn't require a schema. Its type system is similar to JSON's, but a little richer: keys can be any type, not just strings, and non-UTF8 strings are supported.
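A minimal round trip looks like the sketch below, assuming the msgpack Python bindings are installed (pip install msgpack; older releases were published as msgpack-python). As with JSON, datetimes and custom classes need extra handling:

import msgpack

data = dict(int_list=[1, 2, 3], text='string', number=3.44, boolean=True)

packed = msgpack.packb(data)        # compact binary bytes
unpacked = msgpack.unpackb(packed)  # some versions need options to round-trip str keys
assert unpacked == data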
CBOR stands for Concise Binary Object Representation. It too supports the JSON data model. CBOR is not as well known as Protobuf or MessagePack, but it is interesting for two reasons: it is an official Internet standard (RFC 7049), and it was designed for small message and code size, making it a good fit for constrained environments such as the Internet of Things.
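For completeness, here is the same round trip with the cbor2 package, one of several Python CBOR implementations (an assumption on my part; the article doesn't name a specific package):

import cbor2

data = dict(int_list=[1, 2, 3], text='string', number=3.44, none=None)

encoded = cbor2.dumps(data)  # compact binary bytes
decoded = cbor2.loads(encoded)
assert decoded == data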
So many choices. How do you choose? Rather than weigh every factor in the abstract, I'll make it really simple and walk through a few common scenarios, with the format I recommend for each:
If you're saving and loading Python objects from your own programs, use pickle (cPickle) with HIGHEST_PROTOCOL. It's fast and efficient, and it can store and load most Python objects without any special code. It also works well as a local persistent cache.
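A minimal sketch of such a cache, with cache.pkl as a hypothetical file name:

import cPickle as pickle

def save_cache(obj, filename='cache.pkl'):
    # Persist any picklable object with the fastest, most compact protocol.
    with open(filename, 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_cache(filename='cache.pkl'):
    with open(filename, 'rb') as f:
        return pickle.load(f)

save_cache(big_data)
assert load_cache() == big_data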
For configuration files, it's definitely YAML. Nothing beats its simplicity for anything humans need to read or edit; it is used successfully by Ansible and many other projects. In some situations you may prefer to use a plain Python module as the configuration file. That may be the right choice, but then it isn't serialization: the configuration is really part of the program, not a separate file.
For web APIs, JSON is the clear winner. Today, web APIs are consumed most often by JavaScript web applications, which speak JSON natively. Some web APIs may return other formats (e.g. csv for dense tabular result sets), but I'd argue you can pack csv data into JSON with minimal overhead (no need to repeat each row as an object with all the column names).
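For example, here is one way to pack tabular data into JSON without repeating the column names per row. This is a sketch of the idea, not a standard format:

import json

# A list of row objects repeats every column name in every row...
rows = [dict(name='alice', score=10), dict(name='bob', score=7)]

# ...while a columns-plus-rows layout sends each name exactly once.
table = dict(columns=['name', 'score'],
             rows=[['alice', 10], ['bob', 7]])

print(json.dumps(table))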
For high-volume, high-performance communication, use one of the binary protocols: Protobuf (if you need a schema), MessagePack, or CBOR. Run your own tests to verify the performance and representational power of each option.
Serialization and deserialization of Python objects is an important aspect of distributed systems. You cannot send Python objects directly over the network. You often need to interoperate with other systems implemented in other languages, and sometimes you just want to store the state of your program in persistent storage.
Python comes with several serialization schemes in its standard library, and many more are available as third-party modules. Understanding all the options and the pros and cons of each will allow you to choose the method that best suits your situation.