Serializing python data to JSON - some edge cases

JSON seems like a great way to serialize python data objects - it's a subset of YAML, the built-in json library is easy to use, it avoids the security issues of pickle, and there's nearly a one-to-one correspondence between python data and json. Nearly.

Just to be clear, I'm talking about python objects that are basically data (e.g., hierarchies of basic data types). Serializing arbitrary python objects (e.g., classes) is a larger topic.

The point of this post is that (for python 2.7):

data == json.loads(json.dumps(data))

is False a lot of the time, even for basic data types. So, json data serialization isn't as straightforward as you (meaning, I) might think. I'd like this to work with basic python data types such as list, dict, tuple, set, and numpy arrays. Okay, and namedtuple and OrderedDict, too.

Also, my goal is mainly to get back the relevant python types. I still want the exported data to be friendly enough for someone else who wants to read my JSON data, but I'm expecting them to have to do a little work to get data out of my particular JSON file format (as opposed to having everything be vanilla JSON data).

Let's try a few cases out:

>>> import json
>>> data = [1, 6, 8, 9]
>>> data == json.loads(json.dumps(data))
True

>>> data = {"foo": 2938, "bar": 4.22, "baz": "test1"}
>>> data == json.loads(json.dumps(data))
True

>>> data = {"foo": ["nested", 4.5, "mixed", {"wow": False}], 
            "bar": {"more": ["nesting"]}}
>>> data == json.loads(json.dumps(data))
True

Great. That all worked as expected. Now let's get disappointed.

dicts with non-string keys

>>> data = {1: "foo", 2: "baz"}
>>> data == json.loads(json.dumps(data))
False

What? Why didn't that work, when everything was working so nicely before? Because you (meaning, I) didn't read the JSON spec closely enough.

>>> print json.loads(json.dumps(data))
{u"1": u"foo", u"2": u"baz"}

While there is a correspondence between python dicts and JSON objects, JSON object names (the keys) can only be strings, so the encoder silently coerced my integer keys into strings. So... this doesn't work.
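
If you happen to know that all the keys are ints, you can patch things up by hand after loading - a one-off fix, not a general solution:

>>> {int(k): v for k, v in json.loads(json.dumps(data)).items()} == data
True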

tuple

>>> data = (1, 2, 3, 4)
>>> data == json.loads(json.dumps(data))
False

Why is that?

>>> print json.loads(json.dumps(data))
[1, 2, 3, 4]

Right. JSON has no tuple type; the encoder writes tuples out as JSON arrays, and arrays come back as lists.
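
For a flat tuple you can just convert back by hand, but that knowledge lives outside the JSON, and it doesn't help with nested tuples:

>>> data == tuple(json.loads(json.dumps(data)))
True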

namedtuple

>>> from collections import namedtuple
>>> MyTuple = namedtuple("MyTuple", "foo baz")
>>> data = MyTuple(foo=1, baz=2)
>>> data == json.loads(json.dumps(data))
False
>>> print json.loads(json.dumps(data))
[1, 2]

Nuts.
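
As before, you can rebuild it by hand if you already know which namedtuple it was, but nothing in the JSON records that:

>>> data == MyTuple(*json.loads(json.dumps(data)))
True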

simplejson

At this point I should probably mention simplejson, the externally maintained development version of the standard json library. It's a drop-in replacement for json, has a C extension that adds a speed boost, and implements a few features that json does not. Namely, something for namedtuple.

You can do a pip install simplejson, but if you are on Windows the C extension build step may not work, so get the binaries from Christoph Gohlke here.

namedtuple again

How does simplejson help us with namedtuple? The option to export namedtuples as json objects:

>>> import simplejson as json
>>> from collections import namedtuple
>>> MyTuple = namedtuple("MyTuple", "foo baz")
>>> data = MyTuple(foo=1, baz=2)
>>> print json.loads(json.dumps(data, namedtuple_as_object=True))
{"foo": 1, "baz": 2}
>>> data == json.loads(json.dumps(data, namedtuple_as_object=True))
False

So that didn't solve our problem. It retains the key/value information, but not the fact that it's a namedtuple.

numpy arrays

While this isn't a basic python data type, it's often the data that I care about.

>>> import numpy as np
>>> data = np.array([[1,2,3], [4,5,6]])
>>> print json.loads(json.dumps(data))
TypeError: array([[1, 2, 3], [4, 5, 6]]) is not JSON serializable

Okay, the standard advice is to use the numpy array's .tolist() method.

>>> data = np.array([[1,2,3], [4,5,6]])
>>> print json.loads(json.dumps(data.tolist()))
[[1, 2, 3], [4, 5, 6]]
>>> data == json.loads(json.dumps(data.tolist()))
array([[ True,  True,  True],
       [ True,  True,  True]], dtype=bool)

Wait, what happened there? Oh yeah, numpy overloads == to compare element-wise, so we got back an array of bools rather than a single bool, and numpy considers treating that whole array as one truth value to be ambiguous. This is actually a pretty complicated and subtle issue. See here and here for more information.
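Here's what happens if you use that comparison where python needs a single bool:

>>> if data == json.loads(json.dumps(data.tolist())):
...     print "equal"
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

For now, just use numpy's array_equal function (which won't work for nested data structures):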

>>> np.array_equal(data, json.loads(json.dumps(data.tolist())))
True

Okay. Not bad. But the reconstituted array is not a numpy ndarray, but a list. So, not perfect. You'd need to do:

>>> print np.array(json.loads(json.dumps(data.tolist())))
[[1 2 3]
 [4 5 6]]
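
Even then, the dtype isn't preserved: tolist() converts everything to plain python numbers, and np.array has to guess a dtype when rebuilding. For example (the exact dtype you get back is platform dependent):

>>> data = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int16)
>>> np.array(json.loads(json.dumps(data.tolist()))).dtype
dtype('int32')

That's why the code below stores the dtype alongside the values.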

Again, let's not gloss over this equality issue. Storing numpy arrays in nested python structures and then comparing them is non-trivial. We'll need some way of dealing with that.

jsonpickle

Another aside. It looks like jsonpickle is very close to what I want - a clean enough JSON representation of python datatypes. However, the current out-of-the-box version (0.4.0) doesn't handle non-string dict keys, numpy arrays, or namedtuples, and it carries a warning that it doesn't sanitize the JSON input. So one approach to this json data problem would be to add specific handlers to jsonpickle for these types. I'm going to take another approach and write my own version, to make sure I understand all of the issues.

Just Give Me Some Code

Gotcha.

Here's a snippet that will make list, set, tuple, namedtuple, OrderedDict, numpy ndarrays, and dicts with non-string (but still data) keys serialize to JSON, unserialize to the right type, and not abuse JSON too much. We'll use something similar to the jsonpickle format, wrapping each special case in an object with a single attribute naming the type (e.g., {"py/set": [...]}), though no guarantees that it's compatible. Heck, no guarantees that any of this works. This approach can presumably be extended to other types (datetime, bigint, complex, etc.) beyond my arbitrary "data" distinction.

from collections import namedtuple, Iterable, OrderedDict
import numpy as np
import simplejson as json

def isnamedtuple(obj):
    """Heuristic check if an object is a namedtuple."""
    return isinstance(obj, tuple) \
           and hasattr(obj, "_fields") \
           and hasattr(obj, "_asdict") \
           and callable(obj._asdict)

def serialize(data):
    if data is None or isinstance(data, (bool, int, long, float, basestring)):
        return data
    if isinstance(data, list):
        return [serialize(val) for val in data]
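    # Order matters below: OrderedDict is a dict subclass and namedtuples
    # are tuple subclasses, so these special cases have to be checked
    # before the plain dict and tuple cases.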
    if isinstance(data, OrderedDict):
        return {"py/collections.OrderedDict":
                [[serialize(k), serialize(v)] for k, v in data.iteritems()]}
    if isnamedtuple(data):
        return {"py/collections.namedtuple": {
            "type":   type(data).__name__,
            "fields": list(data._fields),
            "values": [serialize(getattr(data, f)) for f in data._fields]}}
    if isinstance(data, dict):
        if all(isinstance(k, basestring) for k in data):
            return {k: serialize(v) for k, v in data.iteritems()}
        return {"py/dict": [[serialize(k), serialize(v)] for k, v in data.iteritems()]}
    if isinstance(data, tuple):
        return {"py/tuple": [serialize(val) for val in data]}
    if isinstance(data, set):
        return {"py/set": [serialize(val) for val in data]}
    if isinstance(data, np.ndarray):
        return {"py/numpy.ndarray": {
            "values": data.tolist(),
            "dtype":  str(data.dtype)}}
    raise TypeError("Type %s not data-serializable" % type(data))

def restore(dct):
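    # Used as the object_hook: json.loads calls this on every decoded JSON
    # object, innermost first, so nested markers are already restored here.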
    if "py/dict" in dct:
        return dict(dct["py/dict"])
    if "py/tuple" in dct:
        return tuple(dct["py/tuple"])
    if "py/set" in dct:
        return set(dct["py/set"])
    if "py/collections.namedtuple" in dct:
        data = dct["py/collections.namedtuple"]
        return namedtuple(data["type"], data["fields"])(*data["values"])
    if "py/numpy.ndarray" in dct:
        data = dct["py/numpy.ndarray"]
        return np.array(data["values"], dtype=data["dtype"])
    if "py/collections.OrderedDict" in dct:
        return OrderedDict(dct["py/collections.OrderedDict"])
    return dct

def data_to_json(data):
    return json.dumps(serialize(data))

def json_to_data(s):
    return json.loads(s, object_hook=restore)

Use the functions data_to_json and json_to_data to do the actual serialization. As a quick sanity check, here's a dict with a tuple key, which plain json couldn't round-trip (note that the string comes back as unicode, which still compares equal under python 2):
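
>>> data_to_json({(1, 2): "tuple key"})
'{"py/dict": [[{"py/tuple": [1, 2]}, "tuple key"]]}'
>>> json_to_data(data_to_json({(1, 2): "tuple key"}))
{(1, 2): u'tuple key'}

Now, testing this out on a bigger pile of data: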

>>> MyTuple = namedtuple("MyTuple", "foo baz")
>>> TEST_DATA = [1, 2, 3,
                23.32987, 478.292222, -0.0002384,
                "testing",
                False,
                [4, 5, 6, [7, 8], 9],
                ("mixed", 5, "tuple"),
                {"str": 1, "str2": 2},
                {1: "str", 2: "str4", (5, 6): "str8"},
                {4, 8, 2, "string", (4, 8, 9)},
                None,
                MyTuple(foo=1, baz=2),
                OrderedDict(
                    [("my", 23), ("order", 55), ("stays", 44), ("fixed", 602)]),
                np.array([[1, 2, 3], [4, 5, 6]]),
                np.array([[1.2398, 2.4848, 3.484884], [4.10, 5.3, 6.999992]]),
               ] 
>>> print data_to_json(TEST_DATA)
[1, 2, 3, 23.32987, 478.292222, -0.0002384, "testing", false, [4, 5, 6, [7, 8], 9], {"py/tuple": ["mixed", 5, "tuple"]}, {"str2": 2, "str": 1}, {"py/dict": [[{"py/tuple": [5, 6]}, "str8"], [1, "str"], [2, "str4"]]}, {"py/set": [{"py/tuple": [4, 8, 9]}, 8, 2, 4, "string"]}, null, {"py/collections.namedtuple": {"values": [1, 2], "fields": ["foo", "baz"], "type": "MyTuple"}}, {"py/collections.OrderedDict": [["my", 23], ["order", 55], ["stays", 44], ["fixed", 602]]}, {"py/numpy.ndarray": {"dtype": "int32", "values": [[1, 2, 3], [4, 5, 6]]}}, {"py/numpy.ndarray": {"dtype": "float64", "values": [[1.2398, 2.4848, 3.484884], [4.1, 5.3, 6.999992]]}}]

That looks like what we'd expect. Now, to verify, we'd like to say

>>> TEST_DATA == json_to_data(data_to_json(TEST_DATA))

but we run into that nested-numpy-array-equality problem. So, here's a function that walks a complex data structure, doing the comparison while handling the numpy array case:

def nested_equal(v1, v2):
    """Compares two complex data structures.

    This handles the case where numpy arrays are leaf nodes.
    """
    if isinstance(v1, basestring) or isinstance(v2, basestring):
        # Strings are Iterable too; handle them before the generic case.
        return v1 == v2
    if isinstance(v1, np.ndarray) or isinstance(v2, np.ndarray):
        return np.array_equal(v1, v2)
    if isinstance(v1, OrderedDict) and isinstance(v2, OrderedDict):
        # Key order matters for OrderedDicts; compare items pairwise.
        return nested_equal(v1.items(), v2.items())
    if isinstance(v1, dict) and isinstance(v2, dict):
        # Compare by key; zipping items() would depend on iteration order.
        return (set(v1) == set(v2)
                and all(nested_equal(v1[k], v2[k]) for k in v1))
    if isinstance(v1, set) and isinstance(v2, set):
        # Set elements are hashable, so no ndarrays can hide in here.
        return v1 == v2
    if isinstance(v1, Iterable) and isinstance(v2, Iterable):
        # zip() alone would silently ignore trailing elements, so check
        # lengths too (assumes sized iterables, not generators).
        return (len(v1) == len(v2)
                and all(nested_equal(s1, s2) for s1, s2 in zip(v1, v2)))
    return v1 == v2

Now we can say:

>>> nested_equal(TEST_DATA, json_to_data(data_to_json(TEST_DATA)))
True

Great! You can get a full listing of the code here: serialize_json.py

Limitations

Several. The numpy handling isn't all that complete: numpy supports a range of dtypes and record arrays, and this code only handles vanilla nd-arrays. It also isn't efficient in storage (it's a text format), especially for binary data, and it isn't efficient in memory either, since we build an entire new structure before passing it to the json converter (as opposed to a generator-style piecemeal solution). Security isn't explicitly dealt with: while we don't allow arbitrary class loading (part of the reason for the "data" distinction), we do nothing about extremely long strings or data, which could bring the system down. And nothing stops a legitimate dict from containing a key like "py/dict", which restore would then mangle.

After going through all of this, cPickle looks pretty good, huh? If you don't care about the security or readability parts, cPickle is very good. If you need efficiency for large numpy arrays, check out numpy's native formats (see savez_compressed, which can even take python dictionaries as keyword arguments, so you can save non-numpy-array data alongside your arrays).
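
A sketch of that (the filename is made up; non-array values get wrapped in 0-d object arrays, so you unwrap them with .item() after loading):

>>> np.savez_compressed("mydata.npz",
...                     arr=np.array([[1, 2], [3, 4]]),
...                     meta={"created": "2012-05-01", "version": 3})
>>> archive = np.load("mydata.npz")
>>> print archive["arr"]
[[1 2]
 [3 4]]
>>> archive["meta"].item() == {"created": "2012-05-01", "version": 3}
True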
