Serialization madness, Unicode edition
Yesterday we were debugging an issue in XML serialization where only a portion of the document was getting deserialized before we encountered an error. It was a strange error, where it looked like when reading the XML the XML document abruptly ended. Chasing this problem down, however, was a bit of a struggle. Here’s the pipeline that got executed:
- Base 64 Decode
- GZip Decompress
- UTF8 Decode
- JSON Deserialize
- XML Serialize
- UTF8 Encode
- (Send message along the wire)
- UTF8 Decode
- XML Deserialize
Somewhere along the line, the Unicode “NUL” character crept in (U+0000) and the XML deserializer interpreted this as “end of document”. But tracing this back was obviously a bit of a challenge. We found the problem in the XML deserialization step, which was originally derived from an UTF8-encoded version of an XML-serialized object, that originated from a JSON-deserialized, UTF8-decoded, gzip-decompressed, Base-64 decoded string (that was also originally embedded in an XML document).
The fix came in something simple – something along the way treated “nulls” as Unicode null for serialization, so it was really just a matter of replacing the text in the JSON string for Unicode null (u+0000) with the text “null” for actual JSON null, and everything worked again.
It all worked, but I wonder if we could have fit a couple more steps in the pipeline, maybe encryption/signing for fun?