News and views

Is Bayesian inference subjective?

Published on Tuesday, 04 January, 2022
Tags: community detection, Bayes, compression, MDL

In a previous blog post (covering [peixoto_descriptive_2021]) I discussed how inferential approaches to community detection are based on the formulation of generative models, via the definition of a likelihood \(P({\boldsymbol A}|{\boldsymbol b})\) for the network \({\boldsymbol A}\) conditioned on a partition \({\boldsymbol b}\). With this at hand, we find the best partition of the network according to the posterior distribution, using Bayes' rule, i.e.

\begin{equation*} P({\boldsymbol b}|{\boldsymbol A}) = \frac{P({\boldsymbol A}|{\boldsymbol b})P({\boldsymbol b})}{P({\boldsymbol A})}, \end{equation*}

where \(P({\boldsymbol b})\) is the prior probability for a partition \({\boldsymbol b}\).

In Bayesian statistics, probabilities are often described as representing a state of knowledge or even as quantification of a "personal belief" [de_finetti]. Does that mean that the results that we obtain using this method are subjective, and depend arbitrarily on how we choose our models and priors over parameters?

I think it is not very difficult to argue that answer to the above question is "no". There are different ways to go about this. For example, we can argue "a la Jaynes" that a "state of knowledge" is not something subjective, since it can be quantified and always needs to be substantiated [jaynes]. An alternative way of demonstrating this, and which I find the most compelling, is via the equivalence between inference and compression. Namely, we can write the numerator of the posterior distribution of eq:bayes as

\begin{equation*} P({\boldsymbol A}|{\boldsymbol b})P({\boldsymbol b}) = 2^{-\Sigma({\boldsymbol A},{\boldsymbol b})}, \end{equation*}

where the quantity \(\Sigma({\boldsymbol A},{\boldsymbol b})\) is known as the description length [grunwald_minimum_2007] of the network. It is computed as:

\begin{equation*} \Sigma({\boldsymbol A},{\boldsymbol b}) = \underset{\mathcal{D}({\boldsymbol A}|{\boldsymbol b})}{\underbrace{-\log_2P({\boldsymbol A}|{\boldsymbol b})}}\, \underset{\mathcal{M}({\boldsymbol b})}{\underbrace{- \log_2P({\boldsymbol b})}}. \end{equation*}

The second term \(\mathcal{M}({\boldsymbol b})\) in the above equation quantifies the amount of information in bits necessary to encode the parameters of the model, while the the first term \(\mathcal{D}({\boldsymbol A}|{\boldsymbol b})\) determines how many bits are necessary to encode the network itself, once the model parameters are known. Therefore, finding the most likely network partition is equivalent to finding the one that most compresses it — giving us a compelling implementation of Occam's razor.

The description length is not subjective in any way; in fact it is quite concrete. It literally means that if we infer the most likely model, it gives us a way of storing the data in a hard drive using \(\Sigma({\boldsymbol A},{\boldsymbol b})\) bits! As we know from our daily computer usage, compression is not a trivial matter, otherwise we would never run out of disk space, we would be able to download files instantly, etc. If we accept that Bayesian statistics is subjective, then we need also to accept that these technical obstacles we face are also subjective in nature, which is a rather distasteful proposition.

As I discussed previously, seeking compression avoids overfitting the data since it's not possible (asymptotically) to compress noise.

However, the concept of compression is more generally useful than just avoiding overfitting within a class of models. In fact, the description length gives us a model-agnostic objective criterion to compare different hypotheses for the data generating process according to their plausibility — in a manner that is not subjective. Namely, since Shannon's theorem tells us that the best compression can be achieved asymptotically only with the true data generating model, then if we are able to find a description length for a network using a particular model, regardless of how it is parametrized, this also means that we have automatically found an upper bound on the optimal compression achievable. By formulating different generative models and computing their description length, we have not only an objective criterion to compare them against each other, but we also have a way to limit further what can be obtained with any other model. The result is an overall scale on which different models can be compared, as we move closer to the limit of what can be uncovered for a particular data at hand.

In the figure below we show the description length values with some models obtained for a protein-protein interaction network for the organism Meleagris gallopavo (wild turkey).

/static/blog/figs/objective/compressed.png

Compression points towards the true model. Top: Protein-protein interaction network for the organism Meleagris gallopavo. The node colors indicate the best partition found with the DC-SBM/TC (there are more groups than colors, so some colors are repeated), and the edge colors indicate if they are attributed to triadic closure (red) or the DC-SBM (black). Bottom: Description length values according to different models. The unknown true model must yield a description length value smaller than the DC-SBM/TC, and no other model should be able to provide a superior compression that is statistically significant.

In particular, we can see that with the degree-corrected stochastic block model with triadic closure (DC-SBM/TC) [peixoto_disentangling_2021] we can achieve a description length that is far smaller than what would be possible with networks sampled from either the Erdős–Rényi, configuration, or planted partition (a SBM with strictly assortative communities [zhang_statistical_2020]) models, meaning that the inferred model is much closer to the true process that actually generated this network than the alternatives. Naturally, the actual process that generated this network is different from the DC-SBM/TC, and it likely involves, for example, mechanisms of node duplication which are not incorporated into this rather simple model. However, to the extent that the true process leaves statistically significant traces in the network structure [1], computing the description length according to it should provide further compression when compared to the alternatives. Therefore, we can try to extend or reformulate our models to incorporate features that we hypothesize to be more realistic, and then verify if this in fact the case, knowing that whenever we find a more compressive model, it is moving closer to the true one — or at least to what remains detectable from it for the finite data.

The discussion above glosses over some important technical aspects. For example, it is possible for two (or, in fact, many) models to have the same or very similar description length values. In this case, Occam's razor fails as a criterion to select between them, and we need to consider them collectively as equally valid hypotheses. This means, for example, that we would need to average over them when making specific inferential statements [peixoto_revealing_2021] — selecting between them arbitrarily can be interpreted as a form of overfitting. Furthermore, there is obviously no guarantee that the true model can actually be found for any particular data. This is only possible in the asymptotic limit of "sufficient data", which will vary depending on the actual model. Outside of this limit (which is the typical case in empirical settings, in particular when dealing with sparse networks), fundamental limits to inference are unavoidable, which means in practice that we will always have limited accuracy and some amount of error in our conclusions. However, when employing compression, these potential errors tend towards overly simple explanations, rather than overly complex ones. Whenever perfect accuracy is not possible, it is difficult to argue in favor of a bias in the opposite direction.

I emphasize that it is not possible to "cheat" when doing compression. For any particular model, the description length will have the same form

\begin{equation*} \Sigma(\boldsymbol A,\boldsymbol \theta) = \mathcal{D}(\boldsymbol A|\boldsymbol\theta) + \mathcal{M}(\boldsymbol \theta), \end{equation*}

where \(\boldsymbol \theta\) is some arbitrary set of parameters. If we constrain the model such that it becomes possible to describe the data with a number of bits \(\mathcal{D}(\boldsymbol A|\boldsymbol \theta)\) that is very small, this can only be achieved, in general, by increasing the number of parameters \(\boldsymbol\theta\), such that the number of bits \(\mathcal{M}(\boldsymbol\theta)\) required to describe them will also increase. Therefore, there is no generic way to achieve compression that bypasses actually formulating a meaningful hypothesis that matches statistically significant patterns seen in the data. One may wonder, therefore, if there is an automatized way of searching for hypotheses in a manner that guarantees optimal compression. The most fundamental way to formulate this question is to generalize the concept of minimum description length as follows: for any binary string \(\boldsymbol x\) (representing any measurable data), we define \(L(\boldsymbol x)\) as the length in bits of the shortest computer program that yields \(L(\boldsymbol x)\) as an output. The quantity \(L(\boldsymbol x)\) is know as Kolmogorov complexity, and if we would be able to compute it for a binary string representing an observed network, we would be able to determine the "true model" value in fig:compressed, and hence know how far we are from the optimum [2].

Unfortunately, an important result in information theory is that \(L(\boldsymbol x)\) is not computable. This means that it is strictly impossible to write a computer program that computes \(L(\boldsymbol x)\) for any string \(\boldsymbol x\) [3]. This does not invalidate using the description length as a criterion to select among alternative models, but it dashes any hope of fully automatizing the discovery of optimal hypotheses. (The upside to this is that scientists will never run out of things to do!)

EDIT (5/1/2022): Added the DC-SBM/TC as an additional model to the figure above.

[peixoto_descriptive_2021]Tiago P. Peixoto, "Descriptive vs. inferential community detection: pitfalls, myths and half-truths", arXiv: 2112.00183
[de_finetti]Bruno de Finetti, "Theory of Probability: A critical introductory treatment.", Chichester: John Wiley & Sons Ltd. (2017)
[jaynes]Edwin Thompson Jaynes, "Probability Theory: The Logic of Science", Cambridge University Press, (2003).
[peixoto_disentangling_2021]Tiago P. Peixoto, "Disentangling homophily, community structure and triadic closure in networks", arXiv: 2101.02510
[zhang_statistical_2020]Lizhi Zhang and Tiago P. Peixoto, "Statistical inference of assortative community structures." Physical Review Research 2, 043271 (2020). DOI: 10.1103/PhysRevResearch.2.043271
[peixoto_revealing_2021]Tiago P. Peixoto, "Revealing Consensus and Dissensus between Network Partitions", Physical Review X 11, 021003 (2021). DOI: 10.1103/PhysRevX.11.021003
[grunwald_minimum_2007]Peter D. Grünwald, The Minimum Description Length Principle (The MIT Press, 2007).
[1]Visually inspecting fig:compressed reveals what seems to be local symmetries in the network structure, presumably due to gene duplication. These patterns are not exploited by the SBM description, and points indeed to a possible path for further compression.
[2]As mentioned before, this would not necessarily mean that we would be able to find the actual true model in a practical setting with perfect accuracy, since for a finite \(\boldsymbol x\) there could be many programs of the same minimal length (or close) that generate it.
[3]There are two famous ways to prove this. One is by contradiction: if we assume that we have a program that computes \(L(\boldsymbol x)\), then we could use it as subroutine to write another program that outputs \(\boldsymbol x\) with a length smaller than \(L(\boldsymbol x)\). The other involves undecidabilty: if we enumerate all possible computer programs in order of increasing length and check if their outputs match \(\boldsymbol x\), we will eventually find programs that loop indefinitely. Deciding whether a program finishes in finite time is known as the "halting problem", which has been proved to be impossible to solve. In general, it cannot be determined if a program reaches an infinite loop in a manner that avoids actually running the program and waiting for it to finish. Therefore, this rather intuitive algorithm to determine \(L(\boldsymbol x)\) will not necessarily finish for any given string \(\boldsymbol x\). For more details the wikipedia page has a good overview.

Comments

Older posts

Inferring, explaining, and compressing

Published on Friday, 03 December, 2021