Hidden models and latent compression in community detection
This is an overdue post about a paper published last year with Alec Kirkley, titled “Implicit Models, Latent Compression, Intrinsic Biases, and Cheap Lunches in Community Detection” [1].
\[ \newcommand{\b}{\boldsymbol b} \newcommand{\A}{\boldsymbol A} \]
There are essentially two kinds of community detection methods (or data analysis procedures in general): descriptive and inferential [2]. Inferential methods attempt to fit a generative model to data, reject a null model, or in some other way articulate mechanisms of network formation, or more formally, a population of potential observations from which the observed data is only one possibility—so, for example, a network with assortative community structure becomes one where there’s a higher probability of an edge being formed between two nodes of the same group. A descriptive method, on the other hand, doesn’t explicitly invoke an edge formation mechanism, and relies instead on simply describing the structural patterns seen (e.g. a community is a group of nodes with sufficiently more internal than external observed edges). In either case we typically wish to find the best partition \(\hat\b(\A)\) of a network \(\A\). In the case of descriptive methods this is usually obtained via the optimization of a non-probabilistic quality function \(W(\A,\b)\), \[ \hat\b(\A) = \underset{\b}{\operatorname{arg max}}\; W(\A,\b). \]

Some of the most serious problems with descriptive methods arise when their results are interpreted in an inferential way, e.g. when the communities found are ascribed to homophily or other mixing preferences. Many practitioners are surprised and concerned when these descriptive methods find nontrivial clusters in random graphs where every edge occurs with the exact same probability [3], in this way betraying a clear inferential goal, in contradiction with what the method being used can actually accomplish—a goal perhaps shared, inadvertently, even by the designers of the method.
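This phenomenon is easy to reproduce. Here’s a minimal sketch using networkx’s greedy modularity heuristic (the graph size, edge probability, and seed are arbitrary choices for illustration):

```python
# Minimal sketch: modularity maximization "finds" communities in a
# maximally random graph, where every edge has the same probability.
# The parameters (n, p, seed) are arbitrary illustrative choices.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.gnp_random_graph(1000, 0.005, seed=42)  # Erdős–Rényi G(n, p)

comms = greedy_modularity_communities(G)
print("communities found:", len(comms))
print("modularity:", modularity(G, comms))  # typically well above zero
```

Despite the total absence of planted structure, the reported modularity is substantially positive: exactly the kind of output that invites an unwarranted inferential reading.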
Despite its importance, this issue hasn’t yet prevented compromised methods such as modularity maximization from being employed en masse—although it’s unclear if most users even realize the problem. In any case, this is not the only instance of specious statistical practice permeating vast areas of science [4,5]. Evidently, pontification and explanation only go so far.1
1 “If you’re explaining, you’re losing.” — Ronald Reagan
In an effort to understand a bit better the consequences of interpreting descriptive methods in an inferential way, Alec and I tried to reverse engineer them. In other words, we asked the question:
If we interpret the results of a community detection algorithm as inferential, to what generative model does it correspond?
We managed to go surprisingly far in answering this question!
Every method is inferential when the model is bad enough
Algorithmically, inferential methods are not so different from descriptive ones; in fact, at least at first,2 it all boils down to choosing a different quality function \[ W(\A,\b) = P(\b | \A), \] where \[ P(\b | \A) = \frac{P(\A | \b)P(\b)}{P(\A)} \tag{1}\] is the posterior probability of a partition \(\b\) given an observed network \(\A\). This isn’t a trivial difference, however, since defining the quality function in this way amounts precisely to reasoning about possible generative mechanisms and our prior assumptions about their parameters, as discussed previously.
2 At first, there’s no algorithmic difference. However, in an inferential setting, we quickly realize that there’s no reason to focus only on the best partition, and we might choose instead to characterize the whole posterior distribution \(P(\b | \A)\), and average over it, so we can quantify uncertainties, competing hypotheses, and everything else that is essential to a serious scientific pipeline.
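In practice, characterizing the posterior looks roughly like the following sketch, which uses graph-tool’s documented SBM machinery (the dataset and the numbers of sweeps are arbitrary placeholder choices):

```python
# Sketch of footnote 2: characterize the posterior P(b | A) by sampling,
# instead of keeping only the single best partition.
import graph_tool.all as gt

g = gt.collection.data["polbooks"]  # any network; arbitrary example
state = gt.BlockState(g)            # SBM posterior over partitions

# equilibrate the Markov chain targeting P(b | A)
gt.mcmc_equilibrate(state, wait=1000, mcmc_args=dict(niter=10))

# collect samples and summarize, e.g., the posterior number of groups
Bs = []
gt.mcmc_equilibrate(state, force_niter=1000, mcmc_args=dict(niter=10),
                    callback=lambda s: Bs.append(s.get_nonempty_B()))
print("posterior mean number of groups:", sum(Bs) / len(Bs))
```

Any other posterior summary (marginal group memberships, uncertainties, competing hypotheses) can be collected from the same samples.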
3 Interestingly, as we show in the paper, the precise shape of \(g(x)\) has asymptotically no effect whatsoever on the resulting generative model, up to a multiplicative “inverse temperature” constant.
The idea we explored in [1] is that we can essentially reverse Bayes’ formula in Equation 1, and obtain an implicit generative model that is compatible with any given quality function \(W(\A,\b)\): \[ P(\A,\b) = \frac{\mathrm{e}^{g(W(\A,\b))}}{Z}, \tag{2}\] where \(g(x)\) is any non-decreasing invertible function,3 and \(Z=\sum_{\A,\b} \mathrm{e}^{g(W(\A,\b))}\) is a normalization constant. In order to understand what that means, consider the illustration in Figure 1.
The networks generated by the implicit model in panel (b) of Figure 1 are markedly different from the original network of case 2 in panel (a), which would be generated only with very low probability under this model. This happens because the mixing between groups tends to be homogeneous for networks sampled from the model, whereas in the case 2 network the groups connect preferentially to a central group (in blue) and have more heterogeneous densities. This mismatch indicates that the underlying model is in fact a poor representation of the network structure — which would be impossible to determine from the results of panel (a) alone. Therefore, characterizing the implicit models hidden behind community detection methods allows us to systematically evaluate their ability to faithfully capture network structure, and to reveal their intrinsic biases towards particular kinds of structure.
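To make Equation 2 fully concrete, here is a brute-force sketch for tiny graphs, taking \(W\) to be modularity and \(g(x)=\beta x\) (both choices, like the graph size, are arbitrary illustrations): it enumerates every pair \((\A,\b)\), computes \(Z\) exactly, and evaluates the joint probability of one particular pair.

```python
# Brute-force illustration of Eq. 2: enumerate every (A, b) pair on a
# tiny node set, with W = modularity and g(x) = beta * x (both choices
# are arbitrary, made here only for illustration).
import itertools
import math
import networkx as nx
from networkx.algorithms.community import modularity

n = 4        # small enough that full enumeration is feasible
beta = 5.0   # "inverse temperature" multiplying the quality function
pairs = list(itertools.combinations(range(n), 2))

def all_graphs():
    # every labeled simple graph on n nodes
    for mask in itertools.product([0, 1], repeat=len(pairs)):
        G = nx.empty_graph(n)
        G.add_edges_from(p for p, keep in zip(pairs, mask) if keep)
        yield G

def all_partitions(nodes):
    # every partition of `nodes` into nonempty groups
    if not nodes:
        yield []
        return
    head, *rest = nodes
    for part in all_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [part[i] | {head}] + part[i + 1:]
        yield part + [{head}]

def W(G, part):
    # convention: zero quality for the edgeless graph, where modularity
    # is undefined
    return modularity(G, part) if G.number_of_edges() > 0 else 0.0

# the normalization constant of Eq. 2, summed over all pairs (A, b)
Z = sum(math.exp(beta * W(G, part))
        for G in all_graphs() for part in all_partitions(list(range(n))))

# joint probability of one particular pair, per Eq. 2
G = nx.empty_graph(n)
G.add_edges_from([(0, 1), (2, 3)])
b = [{0, 1}, {2, 3}]
print("P(A, b) =", math.exp(beta * W(G, b)) / Z)
```

Sampling \((\A,\b)\) pairs from this distribution is, in spirit, what produces panels like (b) of Figure 1. At realistic sizes this enumeration is of course hopeless, which is why computing \(Z\) for a wide class of quality functions is the technical heart of the paper.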
Furthermore, this reverse engineering allows us to compare different community detection methods on equal grounds; in particular, it allows us to compute their description length, \[ \Sigma (\A,\b) = -\log P(\A,\b) = -g(W(\A,\b)) + \log Z, \]
i.e. the amount of information required to describe both the data and the model parameters — which is a universal model selection criterion that removes the need for “ground truth” labels for method comparison. The “partition function” \(Z\) is not trivial to compute, but we show how it can be done for a wide class of quality functions.4
4 The computation of \(Z\) can in fact be seen as the main technical contribution of our work.
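For those who want to try this at home, here is a minimal sketch of such a comparison, assuming a recent graph-tool version where gt.ModularityState is available (it evaluates a description length for modularity along these lines); the dataset and number of sweeps are again arbitrary choices:

```python
# Sketch: description-length comparison between an inferential method
# (the SBM) and modularity. entropy() returns nats, so lower values
# mean better compression. Assumes a recent graph-tool with
# gt.ModularityState.
import numpy as np
import graph_tool.all as gt

g = gt.collection.data["polbooks"]  # arbitrary empirical example

# inferential: fit an SBM by minimizing its description length
state_sbm = gt.minimize_blockmodel_dl(g)
print("SBM:       ", state_sbm.entropy())

# descriptive: maximize modularity, then evaluate the description
# length of its implicit model
state_mod = gt.ModularityState(g)
for _ in range(100):
    state_mod.multiflip_mcmc_sweep(beta=np.inf, niter=10)
print("modularity:", state_mod.entropy())
```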
Does that mean that those who propose or use a specific quality function \(W(\A,\b)\) advocate for, or even accept, the implicit generative model of Equation 2 as a realistic explanation of their data? Well, what goes on in people’s minds is not our primary concern. The relevant point here is that they certainly cannot coherently object to this generative model while at the same time interpreting the results of their algorithm in an inferential manner.
No “benign overfitting” in community detection
Well, how do the descriptive methods fare when their implicit generative models come to light? We compared the descriptive methods modularity maximization and Infomap with a variety of inferential ones on a corpus of over 500 empirical networks, and the results are striking:
Infomap provides a compression inferior to the Erdős–Rényi model—the most random model of them all, and hence, one would expect, the least compressive—for 38% of the networks! The configuration model beats both methods in 45% (modularity) and 60% (Infomap) of the instances. Conversely, inferential methods based on variations of the stochastic block model (SBM) provide a better compression for 97% (Infomap) and 96% (modularity) of the data.5 In other words, modularity and Infomap, two very popular descriptive methods, massively overfit empirical data, and should not be used to aid inferential analyses.
5 Revealingly, those few instances where modularity and Infomap yield (usually marginally) superior compression, albeit with answers compatible with the inferential methods, are often for “toy” networks such as US college football, commonly used as simple examples in community detection papers. These networks have patterns that modularity and Infomap implicitly expect: uniformly assortative communities with equal size and density. Given that their unrepresentative character is in diametrical opposition to how often they are used as illustrations, maybe such “toy” examples should be retired.
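The Erdős–Rényi and configuration model baselines are easy to evaluate with the same tools, continuing the sketch above: a single-block SBM, with and without degree correction, stands in for them (again an assumption of graph-tool’s interface, hedged as before).

```python
# Baselines for the comparison above: a single-block SBM with and
# without degree correction stands in for the configuration model and
# the Erdős–Rényi model, respectively.
state_er = gt.BlockState(g, B=1, deg_corr=False)
state_cm = gt.BlockState(g, B=1, deg_corr=True)
print("Erdős–Rényi:  ", state_er.entropy())
print("configuration:", state_cm.entropy())
# if a method's description length exceeds these, its "communities"
# compress the data worse than assuming no structure at all
```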
So, we’re not as lucky as with some kinds of statistical models that get regularization for free as the result of an apparent act of divine benevolence… In community detection we have to work for it, so that you (the user) do not have to.
There are a lot of other goodies in the paper, but I do not want to spoil the pleasure of reading its 28 pages and 14 figures! Here’s just a sneak peek:
We show how computing the description length of modularity can be used to select the most compressive value of the resolution parameter \(\gamma\), and in this way solve not only its resolution limit [6], but also its overfitting problem [3]. To those who care: you’re welcome.
We show that a broad class of quality functions—which includes modularity and Infomap—amounts to a planted partition SBM likelihood with very specific priors on the strength of assortativity and the number of groups (see the sketch after this list). It seems very difficult to make assumptions (explicit or otherwise) about community structure that evade what the SBM articulates!
We analyze the implicit priors of modularity maximization and Infomap, and they’re crazy: they show abrupt transitions in their expected values, forbid entire regions of the parameter space, show nontrivial scaling with system size, etc. It’s ironic that people sometimes complain that the SBM embodies unrealistic assumptions about the data (not untrue—it’s a network histogram, nothing more), while having been unknowingly injecting lethal doses of toxic, unjustifiable assumptions into their analyses this whole time.
We explore Bayes-optimal instances of the community detection problem—i.e. instances for which a given method is guaranteed to behave optimally—and show that there’s a fundamental asymmetry between methods: a more general model such as the nested SBM (NSBM) performs virtually just as well on instances that are optimal for modularity, but the opposite is far from true: on instances where the NSBM is optimal, modularity maximization performs abysmally.
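To give a flavor of the correspondence mentioned above for modularity, here is a sketch of the classic maximum-likelihood equivalence (due to Newman; our derivation adds the explicit priors on top). Writing \(\omega_{\text{in}}\) and \(\omega_{\text{out}}\) for the internal and external edge propensities of the planted partition model, \(d_i\) for the degrees, and \(E\) for the number of edges, maximizing the degree-corrected planted partition likelihood with fixed \(\omega_{\text{in}} > \omega_{\text{out}}\) is equivalent to maximizing \[ \sum_{i<j}\left(A_{ij} - \gamma\frac{d_id_j}{2E}\right)\delta_{b_i,b_j}, \qquad \gamma = \frac{\omega_{\text{in}}-\omega_{\text{out}}}{\ln\omega_{\text{in}} - \ln\omega_{\text{out}}}, \] i.e. generalized modularity at a particular resolution. The “very specific priors” enter once \(\omega_{\text{in}}\), \(\omega_{\text{out}}\), and the number of groups are no longer held fixed.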
This runs against what is arguably one of the most vacuous statements in the field: the “no free lunch” theorem for community detection [7], which states that every community detection algorithm shows exactly the same performance6 when averaged over “all problem instances”—taken to be fully uniform matchings between networks and partitions, which formally define a maximally random generative model. As previously discussed, this uniform model generates problem instances that are strictly incompressible, and therefore utterly unrepresentative of anything we want to consider either in theory or in practice.
6 The “same performance” that all algorithms exhibit amounts to an asymptotic accuracy of exactly zero, since in those uninformative instances an algorithm can do no better than guessing at random.
What we show for community detection in this work is valid much more broadly: one of the biggest lies ever told is that there are “model free” methods of data analysis. Not possible! The most that can be done is to hide the models from plain sight. But with sufficiently careful reverse engineering, it might still be possible to reveal the horrors that lie beneath such obfuscations.
References