Manage Configurations: File Format

Posted on 2024-02-26 in CS/Math

Indeed, there are a bunch of formats available. Here are a few popular choices:

JSON. Everyone reading this post should already know it :)
INI. Did you know there is an operating system called Windows?
YAML. Examples are Kubernetes, Jekyll, and CircleCI.
TOML. Examples are PEP 621, which introduces pyproject.toml to Python, and Cargo, which uses Cargo.toml to configure every Rust crate.
XML. Examples are Apache Hadoop and Apache Ant.

(Thanks to ChatGPT for providing some of the examples.)

I'd like to put the related formats into four categories, roughly based on the expressivity:

Lightweight configuration formats. Examples are JSON, INI, TOML, and XML. They are basically human-readable serializations of dictionaries/lists.
Medium-weight configuration formats. Examples are YAML and OmegaConf. These are more complex (in the sense that templating is possible to some extent) but not full-fledged languages.
Heavyweight configuration formats. Examples are Nickel, Dhall, Jsonnet, Pkl and RCL. They introduce variables and functions, becoming serious domain-specific languages. Some of them are even Turing-complete.
General programming languages. Examples are Python (JupyterHub), Lua (WezTerm, Neovim), VimL (Vim), Emacs Lisp (Emacs), and even Haskell (xmonad). What I've found is that these projects usually need to define functions/callbacks. I won't discuss them in this post.

Here is a comment from HackerNews, which has a similar classification:

Level 1 is just values in a file. The Linux kernel uses that.

Level 2 is a list of values, e.g. ini files.

Level 3 allows nesting. JSON, XML, and YAML are here.

Level 4 allows computation but limited. Dhall and Starlark are here.

Level 5 is a Turing-complete language. Python, Javascript, etc.

JSON and Its Derivatives (JSONC, Hjson, JSON5), XML, INI, TOML

JSON is a minimalist configuration format. There is plenty of criticism of JSON, but in fact, these decisions were made intentionally.

No comments, as it's a "dangerous practice" and introduces "unnecessary complexity".
Quoted keys. At that time, JavaScript did not allow reserved words as unquoted keys, so some keys had to be quoted. The minimalist design is to quote all keys.
No trailing commas for objects/arrays.
No multiline strings.

The simplicity of JSON makes it a true gold standard – just think about how many JSON parsers there are!
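The restrictions listed above are easy to observe with Python's standard `json` module. This is only a small sketch: the sample documents are made up for illustration, and `is_valid_json` is a helper defined here, not part of the library.

```python
import json

# A strict JSON document parses fine.
strict = '{"name": "demo", "ports": [80, 443]}'
print(json.loads(strict))

# Each of these uses one feature that JSON deliberately leaves out.
rejected = {
    "comment": '{"a": 1 // note\n}',
    "unquoted key": '{a: 1}',
    "trailing comma": '{"a": [1, 2,]}',
    "multiline string": '{"a": "line one\nline two"}',
}


def is_valid_json(text: str) -> bool:
    """Return True if the standard-library parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


for feature, text in rejected.items():
    verdict = "accepted" if is_valid_json(text) else "rejected"
    print(f"{feature}: {verdict}")
```

All four samples are rejected by the strict parser, which is exactly the gap that the JSON derivatives try to fill.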

Due to its simplicity, there are a lot of derivatives of JSON, including JSONC, Hjson and JSON5. They are often supersets of JSON that address the issues above, but unfortunately they are not as widely accepted as JSON.

Here is an example of JSON5.

{
  // comments
  unquoted: 'and you can quote me on that',
  singleQuotes: 'I can use "double quotes" here',
  lineBreaks: "Look, Mom! \
No \\n's!",
  hexadecimal: 0xdecaf,
  leadingDecimalPoint: .8675309, andTrailing: 8675309.,
  positiveSign: +1,
  trailingComma: 'in objects', andIn: ['arrays',],
  "backwardsCompatible": "with JSON",
}

XML. XML is also an ancient format. If you think JSON is verbose, you may want to take a look at XML – nothing beats XML in terms of verbosity. You always need to write each tag name twice to enclose the value. Are the names really so important that you have to emphasize them again?

<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

INI. INI files are ubiquitous on Windows. However, I couldn't find a spec for INI files, which is kind of weird…

TOML. TOML was created as a minimal language, and is mainly suitable for small configurations. It supports datetime as a first-class type. TOML looks pretty similar to INI, but TOML has a spec and allows nested mappings.

A common criticism of TOML concerns its support for hierarchy. Consider the following example:

[[fruit]]
  name = "apple"
  [[fruit.color]]
    name = "red"
  [[fruit.color]]
    name = "green"
[[fruit]]
  name = "banana"
  [[fruit.color]]
    name = "yellow"

It's hard to see the structure. Moreover, what if we want to add one more layer of hierarchy on top of fruit? We have to change every fruit.color! Also, fruit is repeated 5 times, and repetition is error-prone.
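To make the hidden structure explicit, here is a small sketch that compares the TOML text with the nested data it describes. It assumes Python 3.11 or later for the standard-library `tomllib` module and skips the parsing step on older versions; the `EXPECTED` literal is my own transcription of the example.

```python
try:
    import tomllib  # in the standard library since Python 3.11
except ModuleNotFoundError:
    tomllib = None

TOML_TEXT = """
[[fruit]]
  name = "apple"
  [[fruit.color]]
    name = "red"
  [[fruit.color]]
    name = "green"
[[fruit]]
  name = "banana"
  [[fruit.color]]
    name = "yellow"
"""

# The same data as nested literals: the hierarchy is visible at a glance.
EXPECTED = {
    "fruit": [
        {"name": "apple", "color": [{"name": "red"}, {"name": "green"}]},
        {"name": "banana", "color": [{"name": "yellow"}]},
    ]
}

# Count how often the top-level key has to be spelled out in the TOML text.
repetitions = TOML_TEXT.count("fruit")
print(repetitions)

parsed = tomllib.loads(TOML_TEXT) if tomllib is not None else None
if parsed is not None:
    print(parsed == EXPECTED)
```

In the nested literal, `fruit` appears once; in the TOML text every table header has to repeat the full path.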

YAML and Its Derivatives (StrictYAML, OmegaConf)

YAML

YAML. YAML is a very popular configuration language. It has a succinct, indentation-based syntax, so it often needs fewer characters than many other formats, and it's also very easy to learn.

YAML has more features than JSON/TOML. For example, it has sort of templating capabilities using anchors:

some_mapping: &mapping_anchor
  key1: value1
  key2: value2

other_mapping:
  <<: *mapping_anchor
  key1: value3

But unfortunately YAML has many footguns, as others have pointed out ¹ ² ³ ⁴ ⁵. Here are a few examples:

An astonishing problem is the Norway problem, which basically means that the list [us, uk, no, sw] will be parsed as ["us", "uk", false, "sw"].
Explicit tags might be exploited to execute arbitrary code. Because of this, TensorFlow was advised to remove its support for YAML files (and it did).
Different implementations have different behaviors.

It's worth noting that YAML actually has two main versions (v1.1 and v1.2). In the Python ecosystem, there are two parsers: PyYAML and ruamel.yaml. The former uses v1.1 and the latter uses v1.2. YAML v1.2 removes many features that people complained about (e.g., the Norway problem above), but also some other features (e.g., <<, which is convenient for merging dicts, is no longer supported).
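The version difference behind the Norway problem can be sketched without any YAML library. The word lists below follow the boolean type of YAML 1.1 and the core schema of YAML 1.2; `resolve_plain_scalar` is a hypothetical helper written for this illustration and is not how any real parser is structured (real parsers also differ in how much of the 1.1 list they implement).

```python
# Unquoted scalars that resolve to booleans in each YAML version.
YAML_1_1_BOOLS = {
    **dict.fromkeys(
        ["y", "Y", "yes", "Yes", "YES", "true", "True", "TRUE", "on", "On", "ON"],
        True,
    ),
    **dict.fromkeys(
        ["n", "N", "no", "No", "NO", "false", "False", "FALSE", "off", "Off", "OFF"],
        False,
    ),
}
YAML_1_2_BOOLS = {
    **dict.fromkeys(["true", "True", "TRUE"], True),
    **dict.fromkeys(["false", "False", "FALSE"], False),
}
SCHEMAS = {"1.1": YAML_1_1_BOOLS, "1.2": YAML_1_2_BOOLS}


def resolve_plain_scalar(text: str, version: str):
    """Return a bool if the unquoted scalar is a boolean in that version, else the string."""
    return SCHEMAS[version].get(text, text)


countries = ["us", "uk", "no", "sw"]
v11 = [resolve_plain_scalar(c, "1.1") for c in countries]
v12 = [resolve_plain_scalar(c, "1.2") for c in countries]
print(v11)
print(v12)
```

Under the 1.1 rules the third entry becomes `False`; under the 1.2 rules the list of strings survives unchanged.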

OmegaConf

OmegaConf. OmegaConf is the configuration library used by Hydra, which is popular in machine learning. OmegaConf works as a parsing library but extends YAML's resolvers so that it can parse its own magic (e.g., variable interpolation and reading environment variables). As the extended syntax can simply be treated as strings during parsing, OmegaConf files are just regular YAML files.

user:
  name: ${oc.env:USER}
  home: /home/${oc.env:USER}
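Because the interpolations are ordinary strings to the YAML parser, resolving them can be a separate pass over the parsed data. The sketch below is a minimal stand-in for that idea, not OmegaConf's implementation: it handles only the `${oc.env:NAME}` form, resolves eagerly rather than on access, and `resolve_env` and `resolve_config` are names invented here.

```python
import os
import re

# Matches ${oc.env:NAME}; only this one resolver is handled in the sketch.
ENV_PATTERN = re.compile(r"\$\{oc\.env:([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value: str, env=os.environ) -> str:
    """Replace every ${oc.env:NAME} in value with env[NAME]."""
    return ENV_PATTERN.sub(lambda match: env[match.group(1)], value)


def resolve_config(node, env=os.environ):
    """Walk nested dicts/lists and resolve interpolations inside strings."""
    if isinstance(node, dict):
        return {key: resolve_config(value, env) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_config(value, env) for value in node]
    if isinstance(node, str):
        return resolve_env(node, env)
    return node


# What a plain YAML parser returns for the file above: ordinary strings.
raw = {"user": {"name": "${oc.env:USER}", "home": "/home/${oc.env:USER}"}}
resolved = resolve_config(raw, env={"USER": "alice"})
print(resolved)
```

Passing an explicit `env` mapping keeps the example reproducible; by default the helpers read from the real environment.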

StrictYAML. StrictYAML aims to solve YAML's issues by removing features from YAML and defining schemas. I'll talk about it in the next section.

Jsonnet, Dhall, Starlark, Cue, Nickel, Pkl and RCL

YAML Extensions

There are a few YAML extensions for even more powerful templating capabilities (e.g., Yglu, YAMLScript and rimu). Unfortunately I don't have the time to evaluate all of them, or even summarize their methods. Among those, it seems that YAMLScript is under the yaml organization, and the main contributor, ingydotnet, is in the YAML Language Development Team.

Jsonnet

Jsonnet extends JSON by introducing variables and functions.

There is no f-string-like interpolation, only pattern % val.
I like the control over nested merging: use key: value for non-nested merging and key+: value for nested merging.
I don't like :, ::, ::: for visibility.
Lazy evaluation. It's very natural to have lazy evaluation, as Jsonnet allows expressions like this:

local obj = {
  name: "Alice",
  greeting: "Hello, " + obj.name,
};
obj

Essentially every value is an implicit function.
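The "every value is an implicit function" view can be imitated in Python by storing each field as a function of the object and calling it only on access. This is a rough sketch of the idea, not of how Jsonnet is implemented; `LazyObject` is a made-up class.

```python
class LazyObject:
    """An object whose fields are functions of the object, evaluated on access."""

    def __init__(self, **fields):
        # Each field takes the object itself, like `obj`/`self` in Jsonnet.
        self._fields = fields

    def __getattr__(self, name):
        try:
            field = self._fields[name]
        except KeyError:
            raise AttributeError(name) from None
        return field(self)


obj = LazyObject(
    name=lambda self: "Alice",
    greeting=lambda self: "Hello, " + self.name,
    unused=lambda self: 1 // 0,  # never accessed, so never evaluated
)
print(obj.greeting)
```

The `greeting` field can refer to `name` regardless of definition order, and the broken `unused` field causes no error because nothing asks for it.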

Speaking of JavaScript, there are a few additions from ES6 which I feel are especially useful for configuration languages:

Template literals allow string interpolation: `string text ${expression} string text`.
Spread syntax (...) allows merging two objects by {...a, ...b}.
Object initializer shorthand allows {a, b}, which is syntactic sugar for {a: a, b: b}. Jsonnet can use self. when constructing an object, so it's fine here.
Arrow function expressions allow (x) => x + 1. I like arrow functions more than regular definitions; functions themselves are also objects, right?

However, Jsonnet was started before ES6 was finally released, so unfortunately Jsonnet was unable to adapt to these updates.

The official C++ implementation of Jsonnet is slow for two reasons: (1) many functions of the stdlib are written in Jsonnet itself; (2) the lazy evaluation is not cached in some cases, so in extreme cases the complexity can be exponential. Databricks later released its own implementation to solve these issues.
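Why missing caching can lead to exponential cost is easy to reproduce in miniature. In the sketch below, each "field" refers to the previous one twice; the functions are hypothetical stand-ins for lazily evaluated fields and say nothing about Jsonnet's actual evaluator.

```python
from functools import lru_cache

calls = 0


def uncached(n: int) -> int:
    """Field n refers to field n-1 twice and re-evaluates it each time."""
    global calls
    calls += 1
    if n == 0:
        return 1
    return uncached(n - 1) + uncached(n - 1)


@lru_cache(maxsize=None)
def cached(n: int) -> int:
    """Same definition, but each field is evaluated at most once."""
    if n == 0:
        return 1
    return cached(n - 1) + cached(n - 1)


uncached_result = uncached(15)
cached_result = cached(15)
print(calls)                       # evaluations without caching
print(cached.cache_info().misses)  # evaluations with caching
```

Both versions compute the same value, but the uncached one needs 2^16 - 1 evaluations for a chain of 16 fields, while the cached one needs 16.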

Dhall

My first impression of Dhall is that it looks pretty similar to Haskell: the let .. in binding, {- -} for comments (quite like an emoji), \(user: Text) -> for anonymous functions, and the leading comma…

One very cool feature of Dhall is semantic hashing, which can hash even functions. Well, that's very interesting because it means the functions that can be expressed in Dhall are quite limited. Just imagine how simple it is to write a program which checks if Goldbach's conjecture is true: if such a program could be hashed by its meaning, comparing hashes would settle the conjecture.

However, Dhall is verbose. For example, Dhall supports generics, but the generic type must be specified whenever the function is invoked, like the User in Prelude.List.length User users. Most programming languages with generics have type inference, but Dhall is an exception. It also surprised me to learn that Dhall doesn't support multiple function arguments. The official solution is to use either currying or records.

let users: List User
    = [ { name = "John Doe", account = "john", age = 23 }
      , { name = "Jane Smith", account = "jane", age = 29 }
      , { name = "William Allen", account = "bill", age = 41 }
      ]

let companySize = Prelude.List.length User users
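The two workarounds for single-argument functions mentioned above, currying and records, look roughly like this when transcribed to Python. The function names and the `GreetingArgs` type are invented for the illustration and have no counterpart in the Dhall Prelude.

```python
from typing import Callable, TypedDict


# Currying: a two-argument function written as a chain of one-argument functions.
def make_greeting(greeting: str) -> Callable[[str], str]:
    def with_name(name: str) -> str:
        return f"{greeting}, {name}!"

    return with_name


# Record: a single argument that bundles the named fields.
class GreetingArgs(TypedDict):
    greeting: str
    name: str


def make_greeting_from_record(args: GreetingArgs) -> str:
    return f"{args['greeting']}, {args['name']}!"


curried = make_greeting("Hello")("John Doe")
from_record = make_greeting_from_record({"greeting": "Hello", "name": "John Doe"})
print(curried)
print(from_record)
```

Currying keeps partial application cheap, while the record form gives every argument a name at the call site.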

Starlark

Starlark uses a Python-like syntax but adds a few restrictions. There is a summary of the differences between Starlark and Python.

Honestly, I don't think Python is a good base language for configuration: The syntax for dict, the most used data structure, is as verbose as JSON. Therefore, I haven't looked into Starlark seriously.

The importing scheme in Starlark sounds a little bit weird to me:

load("module.sky", "x", y2="y", "z") # assigns x, y2, and z

Well, the order of y2="y" and "z" makes me a little bit uncomfortable. Moreover, what is wrong with import? Why can't we simply use the same syntax as in regular Python?

from module.sky import x, z
from module.sky import y as y2

Cue

Cue focuses more on data validation than on expressiveness.

Arguably, validation should be the foremost task of any configuration language. Most configuration languages, however, focus on boilerplate removal. CUE is different in that it takes the validation first stance.

One fascinating idea in Cue is that types are values, and I strongly recommend the official explanation of Cue's core concepts. The theory behind it is beautiful: a type, as well as a constraint, is a set of possible values. To check subsets efficiently, Cue requires that all values form a lattice, in which values are partially ordered and every two values have a greatest lower bound and a least upper bound (much like GCD and LCM).
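The "types are sets of values" idea can be played with using plain Python sets over a tiny universe. This is only a toy model under that assumption: Cue works symbolically rather than by enumerating values, and the names below are chosen for the illustration.

```python
# A tiny finite universe keeps every set enumerable.
UNIVERSE = frozenset(range(-5, 6)) | {"a", "b"}

INT = frozenset(v for v in UNIVERSE if isinstance(v, int))     # a type
STRING = frozenset(v for v in UNIVERSE if isinstance(v, str))  # a type
POSITIVE = frozenset(v for v in INT if v > 0)                  # a constraint
THREE = frozenset({3})                                         # a concrete value
BOTTOM = frozenset()                                           # nothing fits: an error


def unify(a: frozenset, b: frozenset) -> frozenset:
    """Greatest lower bound: the values allowed by both a and b."""
    return a & b


def disjoin(a: frozenset, b: frozenset) -> frozenset:
    """Least upper bound: the values allowed by a or by b."""
    return a | b


def is_instance(a: frozenset, b: frozenset) -> bool:
    """a is an instance of b when every value of a also belongs to b."""
    return a <= b


print(is_instance(THREE, POSITIVE), is_instance(POSITIVE, INT))
print(unify(INT, POSITIVE) == POSITIVE)
print(unify(INT, STRING) == BOTTOM)
```

In this model a concrete value, a constraint, and a type are the same kind of thing, and checking a value against a type is just a subset test.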

Cue has also compared itself with other configuration languages, including GCL/Jsonnet/HCL. It raises a very interesting point:

Inheritance, is not commutative and idempotent in the general case. In other words, order matters. This makes it hard to track where values are coming from.

When I first saw it, I was very surprised. Consider merging two dictionaries A = {a: 1} and B = {a: 2}. It's clear that merge(A, B) != merge(B, A). So the point actually means that overwriting is impossible in Cue. One advantage is that whenever you see an assignment, its value will never be changed elsewhere, so what you read holds everywhere. I haven't evaluated Cue very seriously, so it's hard to say whether this is good or too restrictive. Nevertheless, I'll definitely consider Cue if verification is a top priority.
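The difference between an overriding merge and an order-independent combination can be shown on flat dictionaries. This sketch only illustrates the property discussed above; `unify` here is a simplified helper of my own and is much weaker than Cue's unification.

```python
def merge(a: dict, b: dict) -> dict:
    """Override-style merge: on conflict, the right-hand side wins."""
    return {**a, **b}


def unify(a: dict, b: dict) -> dict:
    """Order-independent combination: conflicting values are an error."""
    result = dict(a)
    for key, value in b.items():
        if key in result and result[key] != value:
            raise ValueError(
                f"conflicting values for {key!r}: {result[key]!r} and {value!r}"
            )
        result[key] = value
    return result


A = {"a": 1}
B = {"a": 2}
C = {"b": 3}

print(merge(A, B), merge(B, A))   # order matters
print(unify(A, C) == unify(C, A)) # order does not matter
try:
    unify(A, B)
except ValueError as error:
    print(error)
```

Once overwriting is an error instead of a silent win for one side, the result no longer depends on the order in which the pieces are combined.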

Nickel

Nickel is from the Nix community, where the Nix language is used for configuration. Given the complexity of configuring a whole system, I have no doubt about Nickel's expressiveness. Nickel also compares itself with many other options on GitHub.

I'd like to quote an interesting comment from HackerNews:

The documentation for Nix… is special. I’ve never seen any configuration need to explain the fixed-point [1] in order to do something which should be pretty simple.

[1] https://nixos.org/guides/nix-pills/nixpkgs-overriding-packages.html

TIL fixed point + lazy evaluation can be used to evaluate Jsonnet-like expressions! Fortunately, we don't need to learn about fixed points in Nickel :)
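For the curious, the fixed-point trick can be sketched in a few lines of Python, with thunks standing in for lazy evaluation. This is a simplified illustration of the idea only: `Attrs`, `fix` and `extend` are written for this post, and the overlay here receives just the final set, without the separate "previous layer" argument that Nix overlays get.

```python
class Attrs:
    """A lazy attribute set: each field is a thunk, evaluated on access."""

    def __init__(self, thunks):
        self._thunks = thunks

    def __getitem__(self, key):
        return self._thunks[key]()


def fix(f):
    """Return the attribute set x that satisfies x = f(x)."""
    result = Attrs({})
    # f only captures `result` inside thunks, so it is safe to pass it in
    # before its fields exist.
    result._thunks = f(result)
    return result


def extend(f, overlay):
    """Layer overlay on top of f; both see the final, combined set."""
    return lambda self: {**f(self), **overlay(self)}


def base(self):
    return {
        "name": lambda: "Alice",
        "greeting": lambda: "Hello, " + self["name"],
    }


def override(self):
    return {"name": lambda: "Bob"}


print(fix(base)["greeting"])
print(fix(extend(base, override))["greeting"])
```

Because `greeting` looks up `name` through the final set, overriding `name` changes the greeting as well, which is the behavior that overriding needs.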

Another feature of Nickel is that it also validates data through Contracts.

Pkl

Before this post was finished, Apple released Pkl. As someone who pickles Python objects into .pkl files, I'd say that this is an unfortunate coincidence. At first glance, Pkl seems to be a mature project:

It has many batteries included. It has builtin support for regular expressions, glob patterns, reflection, testing, benchmarking, and renderers for Jsonnet/protobuf/YAML/XML.
It has bindings for Java/Kotlin/Swift/Go, as well as editor integration in IntelliJ/VSCode/Vim.
It even comes with a style guide, which is great for large team projects.
It supports type annotation for verifying data.

However, some of Pkl's choices seem weird to me:

Lists are enclosed by {}, which makes me feel a little bit uncomfortable. Perhaps I'm too used to Python's set? Or do the authors want to use dicts to represent lists, like Lua?
Mixed properties/elements/entries.
There is no = in object definition (e.g., bird {name = "x"}), but there is when amending (e.g., parrot = (bird) {name = "y"}).
I'm not comfortable with the parentheses in parrot = (bird) {name = "y"} either, as parentheses often mean that the content inside is secondary to the content outside, while here bird is obviously the primary content.
In the Pkl tutorial, there are no functions but only methods, which reminds me of Java…

RCL

After I added the content for Pkl, another configuration language was posted on HackerNews: RCL. At the very beginning I didn't notice the author, but later, when I was reading through the tutorial, I became so interested in the language that I had to look up the author. It surprised me at first, but then I realized it's very reasonable: the author is Ruud van Asseldonk, who wrote the aforementioned post, The yaml document from hell. He also published a blog post on his rationale behind the new language.

If-else is now an expression rather than a statement.
The record syntax is cleaner than the dict syntax.
String interpolation. The syntax is not important as long as we have it.
Anonymous functions. Similar to TypeScript, and they look much better than Python's lambda.
Comprehensions. RCL has the best comprehensions. It supports for, if and let. Moreover, I like that the loop comes at the beginning instead of at the end, which is more ergonomic. IMO, Python's comprehension breaks the rule of definition first, usage later for variables.
String.ends_with instead of String.endswith. Absolutely personal preference.

I have a few concerns right now:

I'm not sure if Set is necessary, as I don't use it often. As mentioned, I like the object initializer shorthand {a, b}, so I wonder whether it could be adopted here.
Currently it doesn't support floats.
Dicts are not merged recursively.

The indentation that the formatter produces for comprehensions looks a little bit weird.

let labels = {
  for server in servers:
  let all_server_labels = server_labels[server] | default_labels;
  for label in all_server_labels:
  if not excluded_labels.contains(label):
  label
};

My preference is

let labels = {
  for server in servers:
    let all_server_labels = server_labels[server] | default_labels;
    for label in all_server_labels:
      if not excluded_labels.contains(label):
        label
};

I'm not sure whether we should allow statements in a comprehension (i.e., for …: let …; for …), as it may break the linearity of comprehension, that is, no block has sibling blocks. Actually, without the let statement here, the original indentation looks ok.
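For comparison, here is roughly the same comprehension in Python, which shows the "usage before definition" ordering mentioned earlier. The data is made up to match the names in the RCL snippet, and the let binding is folded into the inner loop because Python comprehensions have no direct equivalent.

```python
# Hypothetical data, shaped like the names in the RCL snippet.
servers = ["web", "db"]
server_labels = {"web": {"http", "public"}, "db": {"postgres", "internal"}}
default_labels = {"monitored", "internal"}
excluded_labels = {"internal"}

labels = {
    label  # used here, two lines before it is introduced
    for server in servers
    for label in server_labels[server] | default_labels
    if label not in excluded_labels
}
print(sorted(labels))
```

Reading top to bottom, `label` shows up before the loop that binds it, whereas the RCL version introduces every name before using it.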

It's a hobby project without a stability guarantee. But I really appreciate that the author is upfront about this.

RCL is also thinking about typechecking.

¹ The yaml document from hell
² YAML: probably not so great after all
³ What YAML features does StrictYAML remove? - HitchDev
⁴ GitHub - cblp/yaml-sucks: YAML sucks.
⁵ 🚨🚨 That's a lot of YAML 🚨🚨