Schemas, Encoding, JSON, XML

Schema details

Application developers who understand the internals of the storage engine(s) within the TOM (target operating model) are in a very good position to build models and schemas, and to deploy the proper tooling and automation for testing and integration.  For instance, if it is necessary to change a database’s tuning parameters, this thorough understanding allows the developer to grasp the ‘knock-on’ effect that a lower or higher value or threshold will have on the application and system, including which database tools should be used to tune and test.  Of course, any change to a database schema will necessarily impact the application models and the integration with the data store, including data retrieval and rendering.  A schema change is usually a serious issue that mandates many application-level changes to models, code and documentation (usually the forgotten man of software engineering).

The dependency also flows the other way.  If the application or its underlying models and logic (business, UI, data access) change, this usually requires a change to the data that it stores.  Perhaps a new field or record type needs to be captured, or existing data needs to be presented in a new way.  Relational databases generally assume that all data in the database conforms to one schema: that schema can be changed (through schema migrations, i.e. ALTER statements), but there is exactly one schema in force at any one point in time.
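As a minimal sketch (using Python’s built-in sqlite3 module; the table and column names are hypothetical), a schema migration adds a column that the new application code needs, and from that moment on the single schema in force includes it:

import sqlite3

# One schema is in force at a time; changing it is an explicit migration step.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Alice')")

# The application now needs to store an email address, so the schema is
# migrated before the new code can read or write that field.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
conn.execute("UPDATE users SET email = 'alice@example.com' WHERE name = 'Alice'")

print(conn.execute("SELECT id, name, email FROM users").fetchall())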

Schema-less

By contrast, schema-on-read (“schema-less”) databases, often called NoSQL (not a great name), don’t enforce a schema, so the database can contain a mixture of older and newer data formats written at different times.  NoSQL or document databases are useful in systems that need schema flexibility: when logic, models or application features change frequently, when the application must be ‘de-coupled’ from a tight database schema or data-access-layer logical model, or when old and new models and data are used side by side.
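As a minimal sketch of schema-on-read (the document shapes below are hypothetical), documents written at different times can have different formats, and the application reconciles them when it reads:

# Older and newer document formats coexist in the same collection.
documents = [
    {"name": "Alice", "phone": "555-0100"},                                 # older, flat format
    {"name": "Bob", "contact": {"phone": "555-0101", "email": "b@x.com"}},  # newer, nested format
]

def read_phone(doc):
    # The schema is applied at read time: handle both formats here.
    if "contact" in doc:
        return doc["contact"].get("phone")
    return doc.get("phone")

for doc in documents:
    print(doc["name"], read_phone(doc))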

Coupling

Because application and database are coupled (even when ‘de-coupled’ there is always a dependency), a change to a data format or schema usually requires a corresponding change to application code (for example, you add a new field to a record, and the application code starts reading and writing that field).  In most systems, application code changes cannot be deployed instantaneously.  With server-side applications a staged rollout is usually followed: deploying the new version to a few nodes at a time, checking whether the new version is running smoothly, and gradually updating all the nodes.  This allows new versions to be deployed without service downtime, and thus encourages more frequent releases and better delivery.

This is the premise of node configuration deployments (with Ansible or Chef, for example), in which the tool discovers the existing servers and updates those nodes with the latest versions of the application and database.  Immutable deployments, or blue-green DNS-based version upgrades (driven by a tool such as Jenkins), do not presuppose existing nodes; they can create the underlying virtual servers and resources from patterns and scripts.

Data structures and programs

Programs usually work with data in (at least) two different representations.

  • In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These data structures are optimized for efficient access and manipulation by the CPU (typically using pointers).
  • When you want to write data to a file or send it over the network, you must encode it as a self-contained sequence of bytes (for example, a JSON document). Since a pointer wouldn’t make sense to any other process, this sequence-of-bytes representation looks quite different from the data structures that are normally used in memory.

Thus, we need a translation between the two representations. The translation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshalling), and the reverse is called decoding (parsing, deserialization, unmarshalling).
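As a minimal sketch in Python (the record below is hypothetical), JSON is one such self-contained byte representation:

import json

record = {"user_id": 42, "name": "Alice", "interests": ["databases", "hiking"]}

encoded = json.dumps(record).encode("utf-8")   # encoding (serialization): object -> bytes
print(encoded)

decoded = json.loads(encoded.decode("utf-8"))  # decoding (deserialization): bytes -> object
print(decoded == record)                       # True: the structure round-trips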

Many programming languages come with built-in support for encoding in-memory objects into byte sequences. For example, Java has java.io.Serializable, Ruby has Marshal, and Python has pickle. Many third-party libraries also exist, such as Kryo for Java.
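As a minimal sketch of such a language-native encoding (the class below is hypothetical), Python’s pickle saves and restores a typed object with essentially no serialization code:

import pickle

class User:
    def __init__(self, user_id, name):
        self.user_id = user_id
        self.name = name

encoded = pickle.dumps(User(42, "Alice"))  # a Python-specific byte sequence
restored = pickle.loads(encoded)           # the object comes back as a User instance
print(type(restored).__name__, restored.user_id, restored.name)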

Objects in memory

These encoding libraries are very convenient, because they allow in-memory objects to be saved and restored with minimal additional code.  However, they also have several deep problems.  The encoding is often tied to a particular programming language, and reading the data in another language is very difficult.  This can cause language lock-in and make it hard to integrate with systems written in a different language.

In order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes.  This is frequently a source of security problems.  If an attacker can get your application to decode an arbitrary byte sequence, they can instantiate arbitrary classes, which in turn often allows them to remotely execute arbitrary code.  Versioning data is often an afterthought in these libraries: as they are intended for quick and easy encoding of data, they often neglect the inconvenient problems of forward and backward compatibility.  Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also often an afterthought.
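A minimal sketch of the security point, using Python’s pickle with a deliberately harmless payload (it only calls print, but it could call anything):

import pickle

class Malicious:
    def __reduce__(self):
        # Tells pickle to reconstruct this object by calling print(...);
        # i.e., decoding runs a function chosen by whoever produced the bytes.
        return (print, ("arbitrary code executed during unpickling",))

untrusted_bytes = pickle.dumps(Malicious())
pickle.loads(untrusted_bytes)  # the message is printed during decoding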

For these reasons it’s generally a bad idea to use your language’s built-in encoding for anything other than very transient purposes.

JSON/XML

To overcome the encoding issue, industry standards in the form of JSON and XML have emerged.  XML is quite useful but is sometimes criticized for being too verbose and unnecessarily complicated. JSON’s popularity is mainly due to its built-in support in web browsers (by virtue of being a subset of JavaScript) and its simplicity relative to XML.  CSV is another popular language-independent format, albeit a less powerful one.

At the encoding level, there are complexities.  For example, there is a lot of ambiguity around the encoding of numbers.  In XML and CSV, you cannot distinguish between a number and a string that happens to consist of digits (except by referring to an external schema).  JSON distinguishes strings and numbers, but it doesn’t distinguish integers and floating-point numbers, and it doesn’t specify a precision. JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they don’t support binary strings (sequences of bytes without a character encoding). Binary strings are a useful feature, so people get around this limitation by encoding the binary data as text using Base64.
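A minimal sketch of both gaps in Python (the field names are hypothetical): raw bytes are smuggled through JSON as Base64 text, and a large integer survives here but can lose precision in consumers that parse JSON numbers as IEEE 754 doubles (as JavaScript does):

import base64
import json

raw = bytes([0x00, 0xFF, 0x10, 0x7F])                # not valid text in any character encoding
payload = {
    "blob": base64.b64encode(raw).decode("ascii"),   # binary data encoded as text
    "big_id": 2**53 + 1,                             # beyond the exactly-representable double range
}

encoded = json.dumps(payload)
print(encoded)

decoded = json.loads(encoded)
print(base64.b64decode(decoded["blob"]) == raw)      # True: bytes round-trip via Base64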

Use of XML schemas is fairly widespread, but many JSON-based tools don’t bother using schemas. Since the correct interpretation of data (such as numbers and binary strings) depends on information in the schema, applications that don’t use XML/JSON schemas need to potentially hardcode the appropriate encoding/decoding logic instead.
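As a minimal sketch of applying a schema to JSON, here is a validation with the third-party jsonschema package (one of several validators; the schema and field names are hypothetical):

from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "properties": {
        "user_id": {"type": "integer"},
        "name": {"type": "string"},
    },
    "required": ["user_id", "name"],
}

validate(instance={"user_id": 42, "name": "Alice"}, schema=schema)  # passes silently

try:
    validate(instance={"user_id": "42", "name": "Alice"}, schema=schema)
except ValidationError as err:
    print("rejected:", err.message)  # "42" is a string, not an integer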

CSV does not have any schema, so it is up to the application to define the meaning of each row and column.  If an application change adds a new row or column, you have to handle that change in the data manually. CSV is also quite a vague format (what happens if a value contains a comma or a newline character?).
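As a minimal sketch of the comma/newline problem, Python’s csv module quotes such values on writing and unquotes them on reading, though not every CSV consumer follows the same rules:

import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Alice", "likes databases, hiking", "line one\nline two"])

print(repr(buffer.getvalue()))   # fields with a comma or newline are quoted

buffer.seek(0)
print(next(csv.reader(buffer)))  # round-trips back to the original three fields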

In any event, despite some limitations, JSON and XML help resolve important issues around code and system interoperability.  Over time it is expected that their related schemas (some tools do enforce JSON schemas) and their usage will only become clearer and easier to work with.
