In a previous blog post (covering [peixoto_descriptive_2021]) I discussed how inferential approaches to community detection are based on the formulation of generative models, via the definition of a likelihood $P(A|b)$ for the network $A$ conditioned on a partition $b$. With this at hand, we find the best partition of the network according to the posterior distribution, using Bayes' rule, i.e.

$$P(b|A)=\frac{P(A|b)P(b)}{P(A)},$$

where $P(b)$ is the prior probability for a partition $b$.
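To make the mechanics concrete, here is a minimal numerical sketch of Bayes' rule with entirely made-up likelihood and prior values for two hypothetical candidate partitions, `b1` and `b2`:

```python
# Toy posterior computation via Bayes' rule. The numbers below are
# hypothetical, chosen only to illustrate the normalization.
likelihood = {"b1": 1e-12, "b2": 4e-12}   # P(A|b), made-up values
prior      = {"b1": 0.5,   "b2": 0.5}     # P(b), uniform over candidates

joint = {b: likelihood[b] * prior[b] for b in prior}
evidence = sum(joint.values())            # P(A), summed over the candidates
posterior = {b: joint[b] / evidence for b in joint}

print(posterior)  # b2 is four times more likely a posteriori than b1
```

The evidence $P(A)$ only normalizes the joint probabilities, so what matters for comparing partitions is the ratio of $P(A|b)P(b)$ between candidates.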

In Bayesian statistics, probabilities are often described as representing a state of knowledge or even as quantification of a “personal belief” [de_finetti]. Does that mean that the results that we obtain using this method are “subjective,” and depend arbitrarily on how we choose our models and priors over parameters?

I think it is not very difficult to argue that the answer to the above question is “no” — at least when operating with the colloquial meaning of “subjective” as something that is based on or influenced by personal feelings, tastes, or opinions.

First, it is important to distinguish between the colloquial and a more technical definition of what constitutes a “subjective” statement. According to the more technical definition, we say that a statement is subjective when its veracity is conditioned on the subject that makes the statement, without necessarily meaning that the subject is free to decide on its veracity. A good example of this type of subjectivity is time in Einstein's theory of relativity: it has a subjective nature, since it is experienced differently depending on the frame of reference of the observer. Nevertheless, this does not mean that time can be freely determined by any observer, nor that it will be influenced by her personal feelings, tastes, or opinions. In other words, a subjective statement is not the same as an arbitrary statement. In this sense (and only in this sense), Bayesian statistics is indeed subjective, since an inferential conclusion will depend on the data observed and set of hypotheses considered by an individual. However, given the same data and set of hypotheses, two subjects must agree on the conclusion — it is not an arbitrary decision. In the colloquial sense of the term, Bayesian inference is not subjective.

There are different ways to demonstrate this more concretely. For example, we can argue “à la Jaynes” that a “state of knowledge” is not something arbitrary, since it can be quantified and always needs to be substantiated [jaynes]. An alternative way of showing this, which I find the most compelling, is via the equivalence between inference and compression. Namely, we can write the numerator of the posterior distribution of eq:bayes as

$$P(A|b)P(b)={2}^{-\Sigma (A,b)},$$

where the quantity $\Sigma (A,b)$ is known as the description length [grunwald_minimum_2007] of the network. It is computed as:

$$\Sigma(A,b)=\underbrace{-\log_2 P(A|b)}_{\mathcal{D}(A|b)}\;\underbrace{-\log_2 P(b)}_{\mathcal{M}(b)}.$$

The second term $\mathcal{M}(b)$ in the above equation quantifies the amount of information in bits necessary to encode the parameters of the model, while the first term $\mathcal{D}(A|b)$ determines how many bits are necessary to encode the network itself, once the model parameters are known. Therefore, finding the most likely network partition is equivalent to finding the one that most compresses it — giving us a compelling implementation of Occam's razor.

The description length is not arbitrary in any way; in fact it says something almost physical about the data: if we infer the most likely model, it gives us a way of storing the data on a hard drive using $\Sigma (A,b)$ bits! As we know from our daily computer usage, compression is not an arbitrary decision, nor is it influenced by our personal feelings, tastes, or opinions — otherwise we would never run out of disk space, we would be able to download files instantly, etc. If we accept that Bayesian statistics is arbitrary, then we need also to accept that these technical obstacles we face are also arbitrary in nature, which is a rather absurd proposition.
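As an illustration, here is a minimal sketch of such a two-part code. The encoding choices here (node labels at $\log_2 B$ bits each, and each block pair coded at its empirical Bernoulli rate) are my own naive simplifications, not the actual SBM description length:

```python
import math

def two_part_description_length(A, b):
    """A naive two-part code for a network A (adjacency matrix) given a
    node partition b: bits to state each node's group label, plus bits to
    state the edges under a Bernoulli block model whose densities are
    estimated from the data. A toy stand-in for the SBM description length."""
    N = len(A)
    B = max(b) + 1
    M = N * math.log2(B)  # M(b): the group label of each node
    # tally edges and node pairs between every pair of groups
    edges, pairs = {}, {}
    for i in range(N):
        for j in range(i + 1, N):
            key = (min(b[i], b[j]), max(b[i], b[j]))
            pairs[key] = pairs.get(key, 0) + 1
            edges[key] = edges.get(key, 0) + A[i][j]
    # D(A|b): optimal code length for each block pair at its empirical density
    D = 0.0
    for key, n in pairs.items():
        e = edges[key]
        p = e / n
        if 0 < p < 1:
            D -= e * math.log2(p) + (n - e) * math.log2(1 - p)
        # p == 0 or p == 1: the block pair is fully predictable, zero bits
    return D + M

# two cliques of four nodes, with no edges in between
N = 8
A = [[0] * N for _ in range(N)]
for group in ([0, 1, 2, 3], [4, 5, 6, 7]):
    for i in group:
        for j in group:
            if i != j:
                A[i][j] = 1

print(two_part_description_length(A, [0, 0, 0, 0, 1, 1, 1, 1]))  # 8.0 bits
print(two_part_description_length(A, [0] * 8))                   # ~27.6 bits
```

With the true two-group partition the network costs 8 bits (only the labels, since every block pair is fully predictable), while lumping all nodes together costs about 27.6 bits: the better hypothesis is precisely the one that compresses more.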

As I discussed previously, seeking compression avoids overfitting the data since it's not possible (asymptotically) to compress noise.

However, the concept of compression is more generally useful than just avoiding overfitting within a class of models. In fact, the description length gives us a model-agnostic objective criterion to compare different hypotheses for the data generating process according to their plausibility — in a manner that is not only not arbitrary but also not subjective. Namely, since Shannon's theorem tells us that the best compression can be achieved asymptotically only with the true data generating model, then if we are able to find a description length for a network using a particular model, regardless of how it is parametrized, this also means that we have automatically found an upper bound on the optimal compression achievable. By formulating different generative models and computing their description length, we have not only an objective criterion to compare them against each other, but we also have a way to limit further what can be obtained with any other model. The result is an overall scale on which different models can be compared, as we move closer to the limit of what can be uncovered for a particular data at hand.

In the figure below we show the description length values obtained with several models for a protein-protein interaction network of the organism *Meleagris gallopavo* (wild turkey).

In particular, we can see that with the degree-corrected stochastic block model with triadic closure (DC-SBM/TC) [peixoto_disentangling_2021] we can achieve a description length that is far smaller than what would be possible with networks sampled from either the Erdős–Rényi, configuration, or planted partition (an SBM with strictly assortative communities [zhang_statistical_2020]) models, meaning that the inferred model is much closer to the true process that actually generated this network than the alternatives. Naturally, the actual process that generated this network is different from the DC-SBM/TC, and it likely involves, for example, mechanisms of node duplication which are not incorporated into this rather simple model. However, to the extent that the true process leaves statistically significant traces in the network structure [1], computing the description length according to it should provide further compression when compared to the alternatives. Therefore, we can try to extend or reformulate our models to incorporate features that we hypothesize to be more realistic, and then verify if this is in fact the case, knowing that whenever we find a more compressive model, it is moving closer to the true one — or at least to what remains detectable from it for the finite data.

The discussion above glosses over some important technical aspects. For example, it is possible for two (or, in fact, many) models to have the same or very similar description length values. In this case, Occam's razor fails as a criterion to select between them, and we need to consider them collectively as equally valid hypotheses. This means, for example, that we would need to average over them when making specific inferential statements [peixoto_revealing_2021] — selecting between them arbitrarily can be interpreted as a form of overfitting. Furthermore, there is obviously no guarantee that the true model can actually be found for any particular data. This is only possible in the asymptotic limit of “sufficient data”, which will vary depending on the actual model. Outside of this limit (which is the typical case in empirical settings, in particular when dealing with sparse networks), fundamental limits to inference are unavoidable, which means in practice that we will always have limited accuracy and some amount of error in our conclusions. However, when employing compression, these potential errors tend towards overly simple explanations, rather than overly complex ones. Whenever perfect accuracy is not possible, it is difficult to argue in favor of a bias in the opposite direction.

I emphasize that it is not possible to “cheat” when doing compression. For any particular model, the description length will have the same form

$$\Sigma (A,\theta )=\mathcal{D}(A|\theta )+\mathcal{M}(\theta ),$$

where $\theta $ is some arbitrary set of parameters. If we constrain the model such that it becomes possible to describe the data with a number of bits $\mathcal{D}(A|\theta )$ that is very small, this can only be achieved, in general, by increasing the number of parameters $\theta $, such that the number of bits $\mathcal{M}(\theta )$ required to describe them will also increase. Therefore, there is no generic way to achieve compression that bypasses actually formulating a meaningful hypothesis that matches statistically significant patterns seen in the data. One may wonder, therefore, if there is an automated way of searching for hypotheses in a manner that guarantees optimal compression. The most fundamental way to formulate this question is to generalize the concept of minimum description length as follows: for any binary string $x$ (representing any measurable data), we define $L(x)$ as the length in bits of the shortest computer program that yields $x$ as an output. The quantity $L(x)$ is known as the Kolmogorov complexity, and if we were able to compute it for a binary string representing an observed network, we would be able to determine the “true model” value in fig:compressed, and hence know how far we are from the optimum [2].

Unfortunately, an important result in information theory is that $L(x)$ is not computable. This means that it is strictly impossible to write a computer program that computes $L(x)$ for any string $x$ [3]. This does not invalidate using the description length as a criterion to select among alternative models, but it dashes any hope of fully automating the discovery of optimal hypotheses. (The upside to this is that scientists will never run out of things to do!)
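We can, however, always bound $L(x)$ from above: a compressed string plus a fixed-size decompressor is itself a program that outputs $x$. A small sketch using zlib, with string sizes chosen arbitrarily for illustration:

```python
import os
import zlib

# len(zlib.compress(x)) plus the constant-size decompressor upper-bounds
# L(x); we can always bound Kolmogorov complexity from above, we just
# cannot compute it exactly.
structured = b"0110" * 25_000      # a highly regular 100 kB string
noise = os.urandom(100_000)        # incompressible with high probability

print(len(zlib.compress(structured, 9)))  # a few hundred bytes
print(len(zlib.compress(noise, 9)))       # about 100 kB: no compression
```

The regular string compresses by orders of magnitude, while the random one does not: exactly the asymmetry that makes compression a meaningful criterion, and the reason noise cannot be overfit by a compressive model.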

**EDIT (5/1/2022):** Added the DC-SBM/TC as an additional model to the
figure above.

[peixoto_descriptive_2021] | Tiago P. Peixoto, “Descriptive vs. inferential community detection: pitfalls, myths and half-truths”, arXiv: 2112.00183 |

[de_finetti] | Bruno de Finetti, “Theory of Probability: A critical introductory treatment.”, Chichester: John Wiley & Sons Ltd. (2017) |

[jaynes] | Edwin Thompson Jaynes, “Probability Theory: The Logic of Science”, Cambridge University Press, (2003). |

[peixoto_disentangling_2021] | Tiago P. Peixoto, “Disentangling homophily, community structure and triadic closure in networks”, arXiv: 2101.02510 |

[zhang_statistical_2020] | Lizhi Zhang and Tiago P. Peixoto, “Statistical inference of assortative community structures.” Physical Review Research 2, 043271 (2020). DOI: 10.1103/PhysRevResearch.2.043271 |

[peixoto_revealing_2021] | Tiago P. Peixoto, “Revealing Consensus and Dissensus between Network Partitions”, Physical Review X 11, 021003 (2021). DOI: 10.1103/PhysRevX.11.021003 |

[grunwald_minimum_2007] | Peter D. Grünwald, The Minimum Description Length Principle (The MIT Press, 2007). |

[1] | Visually inspecting fig:compressed reveals what seem to be local symmetries in the network structure, presumably due to gene duplication. These patterns are not exploited by the SBM description, and indeed point to a possible path for further compression. |

[2] | As mentioned before, this would not necessarily mean that we would be able to find the actual true model in a practical setting with perfect accuracy, since for a finite $x$ there could be many programs of the same minimal length (or close) that generate it. |

[3] | There are two famous ways to prove this. One is by contradiction: if we assume that we have a program that computes $L(x)$, then we could use it as a subroutine to write another program that outputs $x$ with a length smaller than $L(x)$. The other involves undecidability: if we enumerate all possible computer programs in order of increasing length and check if their outputs match $x$, we will eventually find programs that loop indefinitely. Deciding whether a program finishes in finite time is known as the “halting problem”, which has been proved to be impossible to solve. In general, it cannot be determined if a program reaches an infinite loop in a manner that avoids actually running the program and waiting for it to finish. Therefore, this rather intuitive algorithm to determine $L(x)$ will not necessarily finish for any given string $x$. For more details, the Wikipedia page has a good overview. |

(This is a slightly modified version of Sec. IVC in [peixoto_descriptive_2021].)

In a previous blog post I explained how modularity maximization tends to overfit and find spurious community structure even in random graphs.

Sometimes practitioners are indeed aware that such non-inferential methods can find communities that are not supported by statistical evidence. In an attempt to extract an inferential conclusion from their results in spite of this, they compare the value of the quality function with a randomized version of the network — and if a significant discrepancy is found, they conclude that the community structure is statistically meaningful. Unfortunately, this approach is as fundamentally flawed as it is straightforward to implement.

The reason the test fails is that, in reality, it answers a question different from the one intended. When we compare the value of the quality function (or any other test statistic) obtained from a network and its randomized counterpart, we can use this information to answer only the following question:

“Can we reject the hypothesis that the observed network was sampled from a random null model?”

No other information can be obtained from this test, including whether the network partition we obtained is significant. All we can determine is if the optimized value of the quality function is significant or not. The distinction between the significance of the quality function value and the network partition itself is subtle but crucial.

We illustrate the above difference with an example in fig:modularity_null (b). This network is created by starting with a fully random Erdős–Rényi (ER) network, and adding to it a few more edges so that it has an embedded clique of six nodes. The occurrence of such a clique from an ER model is very unlikely, so if we perform a statistical test on this network that is powerful enough, we should be able to rule out that it came from the ER model with good confidence. Indeed, if we use the value of maximum modularity for this test, and compare with the values obtained for the ER model with the same number of nodes and edges (see fig:modularity_null (a)), we are able to reach the correct conclusion that the null model should be rejected, since the optimized value of modularity is significantly higher for the observed network.

Should we conclude therefore that the communities found in the network are significant? If we inspect fig:modularity_null (b), we see that the maximum value of modularity indeed corresponds to a more-or-less decent detection of the planted clique. However, it also finds another seven completely spurious communities in the random part of the network. What is happening is clear — the planted clique is enough to increase the value of $Q$ such that it becomes a suitable test to reject the null model [1], but the test is not powerful enough to verify that the communities themselves are statistically meaningful. In short, the following two statements are not synonymous:

- The maximum value of $Q$ is significant.
- The corresponding network partition is significant.

Conflating the two will lead to the wrong conclusion about the significance of the communities uncovered.
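The effect is easy to reproduce. The sketch below uses networkx, with graph sizes chosen arbitrarily for illustration: a sparse Erdős–Rényi graph with a small clique added on top.

```python
import networkx as nx
from networkx.algorithms import community

# Sparse random graph with a planted clique (sizes are illustrative only)
G = nx.gnm_random_graph(100, 200, seed=0)
clique = range(6)
G.add_edges_from((i, j) for i in clique for j in clique if i < j)

# Optimized modularity and partition of the observed network
parts = community.greedy_modularity_communities(G)
Q_obs = community.modularity(G, parts)

# Null distribution: randomized graphs with the same number of nodes and edges
Q_null = []
for s in range(1, 21):
    R = nx.gnm_random_graph(G.number_of_nodes(), G.number_of_edges(), seed=s)
    Q_null.append(community.modularity(R, community.greedy_modularity_communities(R)))

# Even if Q_obs stands out against Q_null, the partition itself still
# splits the purely random part of the graph into many groups:
print(Q_obs, max(Q_null), len(parts))
```

Whatever the outcome of the comparison between `Q_obs` and the null values, the returned partition contains far more than the one planted clique, i.e. the significance test on $Q$ says nothing about the significance of the individual communities.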

In fig:modularity_null (c) we show the result of a more appropriate inferential approach, based on Bayesian inference as described in a previous blog post, that attempts to answer a much more relevant question: “which partition of the network into groups is more likely?” The result is able to cleanly separate the planted clique from the rest of the network, which is grouped into a single community.

This example also shows how the task of rejecting a null model is very oblique to Bayesian inference of generative models. The former attempts to determine what the network is not, while the latter what it is. The first task tends to be easy — we usually do not need very sophisticated approaches to determine that our data did not come from a null model, especially if our data is complex. On the other hand, even if approximate, the second task is far more revealing, constructive, and arguably more useful in general.

[peixoto_descriptive_2021] | Tiago P. Peixoto, “Descriptive vs. inferential community detection: pitfalls, myths and half-truths”, arXiv: 2112.00183 |

[1] | Note that it is possible to construct alternative examples, where instead of planting a clique, we introduce the placement of triangles, or other features that are known to increase the value of modularity, but that do not correspond to an actual community structure. |

(This is a slightly modified version of Sec. IVH in [peixoto_descriptive_2021].)

In two previous blog posts (first and second) I advocated for the use of statistical inference for community detection in networks, whenever our objective is of an inferential nature.

One possible objection to the use of statistical inference is that the generative models on which it is based may be considered unrealistic for a particular kind of network. Although this type of consideration is ultimately important, it is not necessarily an obstacle. First we need to remember that realism is a matter of degree, not kind, since no model can be fully realistic, and therefore we should never be fully committed to “believing” any particular model. Because of this, an inferential approach can be used to target a particular kind of structure, with the corresponding model formulated with this in mind, but without the need to describe other properties of the data. The stochastic block model (SBM) is a good example of this, since it is often used with the objective of finding communities, rather than any other kind of network structure. A model like the SBM is a good way to separate the regularities that relate to the community structure from the irregularities present in real networks, without requiring us to believe that it in fact generated the network.

Furthermore, certain kinds of models are flexible enough so that they can approximate other models. For example, a good analogy with fitting the SBM to network data is to fit a histogram to numerical data, with the node partitioning being analogous to the data binning. Although a piecewise constant model is almost never the true underlying distribution, it provides a reasonable approximation in a tractable, nonparametric manner. Because of its capacity to approximate a wide class of distributions, we certainly do not need to believe that a histogram is the true data generating process to extract meaningful inferences from it. In fact, the same can be said of the SBM in its capacity to approximate a wide class of network models [olhede_network_2014].
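The histogram analogy can be made quantitative with a toy description-length criterion for choosing the number of bins. The per-bin parameter cost of $\log_2(n+1)$ bits used here is a simplifying assumption of mine, not a canonical MDL formula:

```python
import math
import random

def histogram_dl(data, n_bins, lo=0.0, hi=1.0):
    """Toy two-part description length of samples in [lo, hi] under an
    equal-width histogram: bits to state the bin counts (the model) plus
    bits to state the samples given the piecewise-constant density.
    The cost of log2(n+1) bits per bin count is a simplifying assumption."""
    n = len(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        counts[min(int((x - lo) / width), n_bins - 1)] += 1
    M = n_bins * math.log2(n + 1)
    D = -sum(c * math.log2((c / n) / width) for c in counts if c > 0)
    return D + M

random.seed(1)
data = [random.betavariate(2, 5) for _ in range(2000)]

for B in (1, 10, 100):
    print(B, histogram_dl(data, B))
```

A moderate number of bins compresses best: too few bins underfit the density, while too many pay more for the bin counts than they save, mirroring the roles of $\mathcal{D}(A|b)$ and $\mathcal{M}(b)$ for the SBM.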

This means that we can extract useful, statistically meaningful information from data even if the models we use are misspecified. For example, if a network is generated by a latent space model [hoff_latent_2002], and we fit a SBM to it, the communities that are obtained in this manner are not quite meaningless: they will correspond to discrete spatial regions. Hence, the inference would yield a caricature of the underlying latent space, amounting to a discretization of the true model — indeed, much like a histogram. This is very different, say, from finding communities in an Erdős–Rényi graph, which bear no relation to the true underlying model, and would be just overfitting the data. In contrast, the SBM fit to a spatial network would be approximately capturing the true model structure, in a manner that could be used to compress the data and make predictions (although not optimally).

Furthermore, the associated description length of a network model is a good criterion to tell whether the patterns we have found are actually simplifying our network description, without requiring the underlying model to be perfect. This happens in the same way as using software such as gzip makes our files smaller, without requiring us to believe that they were in fact generated by the Markov chain used by the underlying Lempel-Ziv algorithm.

Of course, realism is important as soon as we demand more from the point of view of interpretation and prediction. Are the observed community structures due to homophily or triadic closure [peixoto_disentangling_2021]? Or are they due to spatial embedding [hoff_latent_2002]? What models are capable of reproducing other network descriptors, together with the community structure? Which models can better reconstruct incomplete networks [guimera_missing_2009] [peixoto_reconstructing_2018]? When answering these questions, we are forced to consider more detailed generative processes, and compare them. However, we are never required to believe them — models are always tentative, and should always be replaced by superior alternatives when these are found. Indeed, criteria such as minimum description length serve precisely to implement such a comparison between models, following the principle of Occam's razor. Therefore, the lack of realism of any particular model cannot be used to dismiss statistical inference as an underlying methodology.

It should be emphasized that, fundamentally, there is no alternative. Rejecting an inferential approach based on the SBM on the grounds that it is an unrealistic model (e.g. because of the conditional independence of the edges being placed, or some other unpalatable assumption), while preferring some other non-inferential community detection method, is incoherent: as we discussed previously, every descriptive method can be mapped to an inferential analogue, with implicit assumptions that are hidden from view. Unless one can establish that the implicit assumptions are in fact more realistic, the comparison cannot be justified. Unrealistic assumptions should be replaced by more realistic ones, not by burying one's head in the sand.

[peixoto_descriptive_2021] | Tiago P. Peixoto, “Descriptive vs. inferential community detection: pitfalls, myths and half-truths”, arXiv: 2112.00183 |

[olhede_network_2014] | Sofia C. Olhede and Patrick J. Wolfe, “Network histograms and universality of blockmodel approximation”, Proceedings of the National Academy of Sciences 111, 14722–14727 (2014). DOI: 10.1073/pnas.1400374111 |

[hoff_latent_2002] | Peter D Hoff, Adrian E Raftery, and Mark S Handcock, “Latent Space Approaches to Social Network Analysis,” Journal of the American Statistical Association 97, 1090–1098 (2002). DOI: 10.1198/016214502388618906 |

[peixoto_disentangling_2021] | Tiago P. Peixoto, “Disentangling homophily, community structure and triadic closure in networks”, arXiv: 2101.02510 |

[guimera_missing_2009] | Roger Guimerà and Marta Sales-Pardo, “Missing and spurious interactions and the reconstruction of complex networks”, Proceedings of the National Academy of Sciences 106, 22073 –22078 (2009). DOI: 10.1073/pnas.0908366106 |

[peixoto_reconstructing_2018] | Tiago P. Peixoto, “Reconstructing Networks with Unknown and Heterogeneous Errors”, Physical Review X 8, 041011 (2018). DOI: 10.1103/PhysRevX.8.041011 |

(This is a slightly modified version of Sec. IVG in [peixoto_descriptive_2021].)

For a wide class of optimization and learning problems there exist so-called “no-free-lunch” (NFL) theorems, which broadly state that, when averaged over all possible problem instances, all algorithms show equivalent performance [wolpert_no_1995] [wolpert_lack_1996] [wolpert_no_1997]. Peel et al. [peel_ground_2017] have proved that this is also valid for the problem of community detection, meaning that no single method can perform systematically better than any other, when averaged over “all community detection problems.” This has been occasionally interpreted as a reason to reject the claim that we should prefer certain classes of algorithms over others. This is, however, a misinterpretation of the theorem, as we will now discuss.

The NFL theorem for community detection is easy to state. Let us consider a generic community detection algorithm indexed by $f$, defined by the function $\hat{b}_f(A)$, which ascribes a single partition to a network $A$. Peel et al. [peel_ground_2017] consider an instance of the community detection problem to be an arbitrary pair $(A,b)$ composed of a network $A$ and the correct partition $b$ that one wants to find from $A$. We can evaluate the accuracy of the algorithm $f$ via an error (or “loss”) function

$$\epsilon(b,\hat{b}_f(A))$$

which should take the smallest possible value if $\hat{b}_f(A)=b$. If the error function does not have an inherent preference for any partition (it's “homogeneous”), then the NFL theorem states [wolpert_lack_1996] [peel_ground_2017]

$$\sum_{(A,b)}\epsilon(b,\hat{b}_f(A))=\Lambda(\epsilon),$$

where $\Lambda(\epsilon)$ is a value that depends only on the error function chosen, but not on the community detection algorithm $f$. In other words, when averaged over all problem instances, all algorithms have the same accuracy. This implies, therefore, that in order for one class of algorithms to perform systematically better than another, we need to restrict the universe of problems to a particular subset. This is a seemingly straightforward result, but one which is unfortunately very susceptible to misinterpretation and overstatement.
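The theorem can be checked by brute force on a tiny universe: all $2^3=8$ graphs and all $5$ partitions of three nodes, with a 0/1 error function. This toy instantiation is my own, not the general proof:

```python
from itertools import combinations, product

nodes = range(3)
pairs = list(combinations(nodes, 2))

# all 8 networks on 3 nodes, represented as sets of edges
graphs = [frozenset(c) for r in range(4) for c in combinations(pairs, r)]
# all 5 partitions of 3 nodes, as canonical label tuples
partitions = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (0, 1, 2)]

def loss(b, bhat):
    """0/1 error function, with no inherent preference for any partition."""
    return 0 if b == bhat else 1

def components(edges):
    """A 'meaningful' algorithm: group nodes by connected component."""
    lab = list(nodes)
    for i, j in edges:
        old, new = lab[j], lab[i]
        lab = [new if l == old else l for l in lab]
    seen = {}
    return tuple(seen.setdefault(l, len(seen)) for l in lab)

def constant(edges):
    """A trivial algorithm: always return the same guess."""
    return (0, 0, 0)

def total_error(f):
    """Sum the error over every problem instance (A, b), as in eq:nfl."""
    return sum(loss(b, f(A)) for A, b in product(graphs, partitions))

print(total_error(components), total_error(constant))  # both 32
```

Both the "meaningful" algorithm and the constant guesser accumulate the same total error of 32: for each of the 8 graphs, exactly one of the 5 candidate partitions matches the algorithm's output, whatever that output is.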

A common criticism of this kind of NFL theorem is that it is a poor representation of the typical problems we may encounter in real domains of application, which are unlikely to be uniformly distributed across the entire problem space. Therefore, as soon as we constrain ourselves to a subset of problems that are relevant to a particular domain, this will favor some algorithms over others — but then no algorithm will be superior for all domains. But since we are typically only interested in some domains, the NFL theorem is then arguably “theoretically sound, but practically irrelevant” [schaffer_conservation_1994]. Although indeed correct, in the case of community detection this logic is arguably an understatement. This is because as soon as we restrict our domain to community detection problems that reveal something informative about the network structure, we are out of reach of the NFL theorem, and some algorithms will do better than others, without invoking any particular domain of application. We demonstrate this in the following.

The framework of the NFL theorem operates on a liberal notion of what constitutes a community detection problem and its solution, which amounts to choosing, for an arbitrary pair $(A,b)$, the right $f$ such that $\hat{b}_f(A)=b$. Under this framework, algorithms are just arbitrary mappings from network to partition, and there is no necessity to articulate more specifically how they relate to the structure of the network — community detection just becomes an arbitrary game of “guess the hidden node labels.” This contrasts with how actual community detection algorithms are proposed, which attempt to match the node partitions to patterns in the network, e.g. assortativity, general connection preferences between groups, etc. Although the large variety of algorithms proposed for this task already reveals a lack of consensus on how to precisely define it, few would consider it meaningful to leave the class of community detection problems so wide open as to accept any matching between an arbitrary network and an arbitrary partition as a valid instance.

Even though we can accommodate any (deterministic) algorithm deemed valid according to any criterion under the NFL framework, most algorithms in this broader class do something else altogether. In fact, the vast majority of them correspond to a maximally random matching between network and partition, which amounts to little more than randomly guessing a partition for any given network, i.e. they return widely different partitions for inputs that are very similar, and overall point to no correlation between input and output [1]. It is not difficult to accept that these random algorithms perform equally “well” for any particular problem, or even all problems, but the NFL theorem says that they have equivalent performance even to algorithms that we may deem more meaningful. How do we make a formal distinction between algorithms that are just randomly guessing and those that are doing something coherent, which depends on discovering actual network patterns? As it turns out, there is an answer to this question that does not depend on particular domains of application: we require the solutions found to be *structured* and *compressive of the network*.

In order to interpret the statement of the NFL theorem in this vein, it is useful to re-write eq:nfl using an equivalent probabilistic language,

$$\sum_{A,b}P(A,b)\,\epsilon(b,\hat{b}_f(A))=\Lambda'(\epsilon),$$

where $\Lambda'(\epsilon)\propto\Lambda(\epsilon)$, and $P(A,b)\propto 1$ is the uniform probability of encountering a problem instance. When writing the theorem statement in this way, we notice immediately that instead of being agnostic about problem instances, it implies a *very specific* network generative model, which assumes a complete independence between network and partition. Namely, if we restrict ourselves to networks of $N$ nodes, we have then

$$P(A,b)=P(A)P(b),$$

with $P(A)$ and $P(b)$ uniform over all networks and all partitions of $N$ nodes, respectively.

Therefore, the NFL theorem states simply that if we sample networks and partitions from a maximally random generative model, then all algorithms will have the same average accuracy at inferring the partition from the network. This is hardly a spectacular result — indeed the Bayes-optimal algorithm in this case, i.e. the one derived from the posterior distribution of the true generative model and which guarantees the best accuracy on average, consists of simply guessing partitions uniformly at random, ignoring the network structure altogether.

The probabilistic interpretation reveals that the NFL theorem makes a very specific assumption about what kind of community detection problem we are expecting, namely one where both the network and partition are sampled independently and uniformly at random. It is important to remember that it is not possible to make “no assumption” about a problem; we are always forced to make some assumption, which even if implicit does not exempt it from justification, and the uniform assumption of eq:uniform is no exception. In fig:nfl (a) we show a typical sample from this ensemble of community detection problems.

Figure fig:nfl: In (a), the description length of the network conditioned on the true partition is close to the maximum possible number of bits, and therefore the partition is not learnable from the network alone with any inferential algorithm. We show also the description length of the SBM conditioned on the true partition, ${\Sigma}_{\text{SBM}}(A|b)$, as a reference. In (b) we show an example of a community detection problem that is solvable, at least in principle, since ${\Sigma}_{\text{SBM}}(A|b)<{\Sigma}_{\text{min}}(A|b)$. In this case, the partition can be used to inform the network structure, and potentially vice versa. This class of problem instances has a negligible contribution to the sum in the NFL theorem in eq:nfl, since it occurs only with an extremely small probability when sampled from the uniform model of eq:uniform. It is therefore more reasonable to state that the network in example (b) has an actual community structure, while the one in (a) does not.
In a very concrete sense, we can state that such problem instances contain no learnable community structure, or in fact no learnable network structure at all. We say that a community structure is learnable if the knowledge of the partition $b$ can be used to compress the network $A$, i.e. there exists an encoding $\mathcal{H}$ (i.e. a generative model) such that

where $\Sigma (A|b,\mathcal{H})=-{\mathrm{log}}_{2}P(A|b,\mathcal{H})$ is the description length of $A$ according to model $\mathcal{H}$, conditioned on the partition being known. However, it is a direct consequence of Shannon's source coding theorem [shannon_mathematical_1948], that for the vast majority of networks sampled from the model of eq:uniform the inequality above cannot be fulfilled as $N\to \infty $, i.e. the networks are incompressible [2]. This means that the true partition $b$ carries no information about the network structure, and vice versa, i.e. the partition is not learnable from the network. In view of this, the common interpretation of the NFL theorem as “all algorithms perform equally well” is in fact somewhat misleading, and can be more accurately phrased as “all algorithms perform equally poorly”, since no inferential algorithm can uncover the true community structure in most cases, at least no better than by chance alone. In other words, the universe of community detection problems considered in the NFL theorem is composed overwhelmingly of problems for which compression and explanation are not possible [4]. This uniformity between instances also reveals that there is no meaningful trade-off between algorithms for most instances, since all algorithms will yield the same negligible asymptotic performance, with an accuracy tending asymptotically towards zero as the number of nodes increases. In this setting, there is not only no free lunch, but in fact there is no lunch at all (see fig:nfl_trade_off).
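The incompressibility of typical samples is easy to check directly. The sketch below (my own illustration, using zlib as an off-the-shelf encoder rather than an optimal one) compresses the adjacency bits of (a) a maximally random network and (b) a network with a planted two-block structure of the same expected density; only the latter can be compressed appreciably:

```python
# Illustration: a uniformly random adjacency matrix is incompressible, while
# one with planted block structure (same expected density of 1/2) is not.
import random
import zlib

random.seed(0)
N = 256
pairs = [(i, j) for i in range(N) for j in range(i)]  # M = N(N-1)/2 node pairs

def pack(bits):
    # pack a list of 0/1 values into bytes (len(bits) is divisible by 8 here)
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)

# (a) uniform model: every edge present independently with probability 1/2
uniform_bits = [random.randint(0, 1) for _ in pairs]

# (b) planted two-block structure, dense within groups and sparse between,
#     chosen so the overall expected density is still 1/2
b = [i // (N // 2) for i in range(N)]
blocky_bits = [
    1 if random.random() < (0.95 if b[i] == b[j] else 0.05) else 0
    for i, j in pairs
]

len_uniform = len(zlib.compress(pack(uniform_bits), 9))
len_blocky = len(zlib.compress(pack(blocky_bits), 9))
# the uniform sample stays near the raw M/8 bytes; the structured one shrinks
```

A general-purpose encoder is of course far from the optimal model-based code, but the qualitative gap already illustrates the point: knowing $b$ (here, the block structure) buys compression only when the instance is not a typical sample of the uniform model.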

If we were to restrict the space of possible community detection algorithms to those that provide actual explanations, then by definition this would imply a positive correlation between network and partition (eq:informative), i.e. [3]

$$P(A,b)\ne P(A)P(b).$$

Not only does this imply a specific generative model but, as a consequence, also an optimal community detection algorithm, which operates based on the posterior distribution

$$P(b|A)=\frac{P(A|b)P(b)}{P(A)}.$$

Therefore, learnable community detection problems are invariably tied to an optimal class of algorithms, undermining to a substantial degree the relevance of the NFL theorem in practice. In other words, whenever there is an actual community structure in the network being considered — i.e. due to a systematic correlation between $A$ and $b$, such that $P(A,b)\ne P(A)P(b)$ — there will be algorithms that can exploit this correlation better than others (see fig:nfl (b) for an example of a learnable community detection problem). Importantly, the set of learnable problems forms only an infinitesimal fraction of all problem instances, with a measure that tends to zero as the number of nodes increases, and hence remains firmly outside the scope of the NFL theorem. This observation has been made before, and is equally valid, in the wider context of NFL theorems beyond community detection [streeter_two_2003] [mcgregor_no_2006] [everitt_universal_2013] [lattimore_no_2013].

Note that since there are many ways to choose a nonuniform model according to eq:informative, the optimal algorithms will still depend on the particular assumptions made via the choice of $P(A,b)$. However, this does not imply that all algorithms have equal performance on compressible problem instances. If we sample a problem from the universe ${\mathcal{H}}_{1}$, with $P(A,b|{\mathcal{H}}_{1})$, but use instead two algorithms that are optimal in ${\mathcal{H}}_{2}$ and ${\mathcal{H}}_{3}$, respectively, their relative performances will depend on how close each of these universes is to ${\mathcal{H}}_{1}$, and hence will not in general be the same. In fact, if our space of universes is finite, we can compose them into a single unified universe [jaynes_probability_2003] according to

$$P(A,b)=\sum _{i=1}^{M}P(A,b|{\mathcal{H}}_{i})P({\mathcal{H}}_{i}),$$

which will incur a compression penalty of at most ${\mathrm{log}}_{2}M$ bits added to the description length of the optimal algorithm. This gives us a path, based on hierarchical Bayesian models and minimum description length, to achieve optimal or near-optimal performance on instances of the community detection problem that are actually solvable, simply by progressively expanding our set of hypotheses.
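This penalty bound can be verified numerically. The sketch below (my own illustration with made-up description-length values, not data from the text) computes the description length under the uniform mixture of $M$ models and checks that it exceeds the best individual model by at most ${\mathrm{log}}_{2}M$ bits:

```python
# Illustration: the description length of the uniform mixture of M models
# is at most log2(M) bits above that of the best individual model.
import math

def mixture_dl(dls):
    # DL (in bits) under the uniform mixture: -log2( (1/M) * sum_i 2^(-dl_i) ),
    # computed in a numerically stable way by factoring out the minimum
    m = min(dls)
    s = sum(2.0 ** (m - dl) for dl in dls)  # = 2^m * sum_i 2^(-dl_i)
    return m - math.log2(s) + math.log2(len(dls))

dls = [1500.0, 1210.5, 2043.0, 1999.0]  # hypothetical per-model values
dl_mix = mixture_dl(dls)
# min(dls) <= dl_mix <= min(dls) + log2(4) = min(dls) + 2
```

Since $\sum_i 2^{-\Sigma_i}$ is dominated by the smallest description length, the mixture pays essentially only the $\log_2 M$ bits needed to identify the winning hypothesis.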

The idea that we can use compression as an inference criterion has been formalized by Solomonoff's theory of inductive inference, which forms a rigorous induction framework based on the principle of Occam's razor. Importantly, the expected prediction errors achieved under this framework are provably upper-bounded by the Kolmogorov complexity of the data-generating process [hutter_universal_2007], making the induction framework consistent. The Kolmogorov complexity is a generalization of the description length we have been using, and is defined as the length of the shortest binary program that generates the data. The only major limitation of Solomonoff's framework is its uncomputability, i.e. the impossibility of determining the Kolmogorov complexity with any algorithm. However, this impossibility does not invalidate the framework; it only means that induction cannot be fully automated: we have a consistent criterion to compare hypotheses, but no deterministic mechanism to produce the best hypothesis directly. There are open philosophical questions regarding the universality of this inductive framework [hutter_open_2009] [montanez_why_2017], but whatever fundamental limitations it may have do not follow directly from NFL theorems such as the one from [peel_ground_2017]. In fact, as mentioned in footnote [4], it is a rather simple task to use compression to reject the uniform hypothesis forming the basis of the NFL theorem for almost any network data.

Since compressive community detection problems are out of the scope of the NFL theorem, it is not meaningful to use it to justify avoiding comparisons between algorithms, on the grounds that all choices must be equally “good” in a fundamental sense. In fact, we do not need much sophistication to reject this line of argument, since the NFL theorem applies also when we are considering trivially inane algorithms, e.g. one that always returns the same partition for every network. The only domain where such an algorithm is as good as any other is when we have no community structure to begin with, which is precisely what the NFL theorem relies on.

Nevertheless, there are some lessons we can draw from the NFL theorem. It makes clear that the performance of an algorithm is tied directly to the inductive bias adopted, which should always be made explicit. The superficial interpretation of the NFL theorem as an inherent equity between all algorithms stems from the assumption that considering all problem instances uniformly is equivalent to being free of an inductive bias, but that is not possible. The uniform assumption is itself an inductive bias, and one that is hard to justify in virtually any context, since it involves almost exclusively unsolvable problems (from the point of view of compressibility). In contrast, considering only compressible problem instances is also an inductive bias, but one that relies only on Occam's razor as a guiding principle. The advantage of the latter is that it is independent of the domain of application, i.e. we are making a statement only about whether a partition can help explain the network, without having to specify how *a priori*.

In view of the above observations, it becomes easier to understand
results such as those of Ghasemian et al. [ghasemian_evaluating_2019], who
found that compressive inferential community detection methods tend to
systematically outperform descriptive methods in empirical settings,
when these are employed for the task of edge prediction. Even though
edge prediction and community detection are not the same task, and using
the former to evaluate the latter can lead in some cases to overfitting
[valles-catala_consistencies_2018], typically the most compressive
models will also lead to the best generalization. Therefore, the
superior performance of the inferential methods is understandable, even
though Ghasemian et al also found a minority of instances where some
descriptive methods can outperform inferential ones. To the extent that
these minority results cannot be attributed to overfitting, or technical
issues such as insufficient MCMC equilibration, it could simply mean
that the structure of these networks falls sufficiently outside of what
is assumed by the inferential methods, but without this being a necessary
trade-off that comes as a consequence of the NFL theorem — after all,
under the uniform assumption, edge prediction is also strictly
impossible, just like community detection. In other words, these
results do not rule out the existence of an algorithm that works better
in all cases considered, at least if their number is not too large
[5]. In fact, this is precisely what is achieved in
[ghasemian_stacking_2020] via model stacking,
i.e. a combination of several predictors into a meta-predictor that
achieves systematically superior performance. This points indeed to the
possibility of using universal methods to discover the latent
**compressive** modular structure of networks, without any tension
with the NFL theorem.

**EDIT** 10/1/2022: Added reference to Solomonoff's theory of induction.

**EDIT** 17/10/2022: Added fig:nfl.

[peixoto_descriptive_2021] | Tiago P. Peixoto, “Descriptive vs. inferential community detection: pitfalls, myths and half-truths”, arXiv: 2112.00183 |

[wolpert_no_1995] | David H. Wolpert and William G. Macready, No free lunch theorems for search, Tech. Rep. (Technical Report SFI-TR-95-02-010, Santa Fe Institute, 1995). |

[wolpert_lack_1996] | (1, 2) David H. Wolpert, “The Lack of A Priori
Distinctions Between Learning Algorithms”, Neural Computation 8,
1341–1390 (1996). DOI: 10.1162/neco.1996.8.7.1341 |

[wolpert_no_1997] | David H. Wolpert and William G. Macready, “No free lunch theorems for optimization”, IEEE transactions on evolutionary computation 1, 67–82 (1997). DOI: 10.1109/4235.585893 |

[peel_ground_2017] | (1, 2, 3, 4) Leto Peel, Daniel B. Larremore, and Aaron Clauset,
“The ground truth about metadata and community detection in
networks”, Science Advances 3, e1602548 (2017). DOI: 10.1126/sciadv.1602548 |

[shannon_mathematical_1948] | C. E Shannon, “A mathematical theory of communication”, Bell Syst Tech. J 27, 623 (1948). |

[schaffer_conservation_1994] | Cullen Schaffer, “A Conservation Law for Generalization Performance”, in Machine Learning Proceedings 1994, edited by William W. Cohen and Haym Hirsh (Morgan Kaufmann, San Francisco (CA), 1994) pp. 259–265. DOI: 10.1016/B978-1-55860-335-6.50039-8 |

[streeter_two_2003] | Matthew J. Streeter, “Two Broad Classes of Functions for Which a No Free Lunch Result Does Not Hold,” in Genetic and Evolutionary Computation — GECCO 2003, Lecture Notes in Computer Science, pp. 1418–1430. DOI: 10.1007/3-540-45110-2_15 |

[mcgregor_no_2006] | Simon McGregor, “No free lunch and algorithmic randomness,” in GECCO, Vol. 6 (2006) pp. 2–4 http://gpbib.pmacs.upenn.edu/gecco2006etc/papers/lbp124.pdf |

[everitt_universal_2013] | Tom Everitt, “Universal induction and optimisation: No free lunch?” (2013) https://www.diva-portal.org/smash/get/diva2:780784/FULLTEXT01.pdf |

[lattimore_no_2013] | Tor Lattimore and Marcus Hutter, “No Free Lunch versus Occam’s Razor in Supervised Learning,” in Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence: Papers from the Ray Solomonoff 85th Memorial Conference, Melbourne, VIC, Australia, November 30 – December 2, 2011, Lecture Notes in Computer Science, edited by David L. Dowe (Springer, Berlin, Heidelberg, 2013) pp. 223–235. DOI: 10.1007/978-3-642-44958-1_17 |

[jaynes_probability_2003] | E. T. Jaynes, “Probability Theory: The Logic of Science”, edited by G. Larry Bretthorst (Cambridge University Press, Cambridge, UK; New York, NY, 2003). |

[hutter_universal_2007] | Marcus Hutter, “On universal prediction and Bayesian confirmation”, Theoretical Computer Science Theory and Applications of Models of Computation, 384, 33–48 (2007). DOI: 10.1016/j.tcs.2007.05.016 |

[hutter_open_2009] | Marcus Hutter, “Open Problems in Universal Induction & Intelligence,” Algorithms 2, 879–906 (2009). DOI: 10.3390/a2030879 |

[montanez_why_2017] | George D. Montanez, “Why machine learning works”, (2017), https://www.cs.cmu.edu/~gmontane/montanez_dissertation.pdf |

[ghasemian_evaluating_2019] | Amir Ghasemian, Homa Hosseinmardi, and Aaron Clauset, “Evaluating Overfit and Underfit in Models of Network Community Structure”, IEEE Transactions on Knowledge and Data Engineering, 1–1 (2019). DOI: 10.1109/TKDE.2019.2911585 |

[valles-catala_consistencies_2018] | Toni Vallès-Català, Tiago P. Peixoto, Marta Sales-Pardo, and Roger Guimerà, “Consistencies and inconsistencies between model selection and link prediction in networks,” Physical Review E 97,062316 (2018). DOI: 10.1103/PhysRevE.97.062316 |

[ghasemian_stacking_2020] | Amir Ghasemian, Homa Hosseinmardi, Aram Galstyan, Edoardo M. Airoldi, and Aaron Clauset, “Stacking models for nearly optimal link prediction in complex networks”, Proceedings of the National Academy of Sciences 117, 23393–23400 (2020), DOI: 10.1073/pnas.1914950117 |

[decelle_asymptotic_2011] | Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová, “Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications”, Physical Review E 84, 066106 (2011). DOI: 10.1103/PhysRevE.84.066106 |

[1] | An interesting exercise is to count how many such algorithms exist. A given community detection algorithm $f$ needs to map each of the $\Omega (N)$ possible networks of $N$ nodes to one of the $\Xi (N)$ possible labeled partitions of its nodes. Therefore, if we restrict ourselves to a single value of $N$, the total number of input-output tables is $\Xi (N{)}^{\Omega (N)}$. If we sample one such table uniformly at random, it will be asymptotically impossible to compress it using fewer than $\Omega (N){\mathrm{log}}_{2}\Xi (N)$ bits — a number that grows super-exponentially with $N$. As an illustration, a random community detection algorithm that works only with $N=100$ nodes would already need ${10}^{1479}$ terabytes of storage. Therefore, simply considering algorithms that humans can write and use (together with their expected inputs and outputs) already pulls us very far away from the general scenario considered by the NFL theorem. |

[2] | For finite networks a positive compression might be achievable with small probability, but due to chance alone, and not in a manner that makes its structure learnable. |

[3] | Note that eq:informative is a necessary but not sufficient condition for the community detection problem to be solvable. An example of this are networks generated by the SBM, which are solvable only if the strength of the community structure exceeds a detectability threshold [decelle_asymptotic_2011], even if eq:informative is fulfilled. |

[4] | (1, 2) One could argue that such a uniform model is justified by the principle of maximum entropy, which states that in the absence of prior knowledge about which problem instances are more likely, we should assume they are all equally likely a priori. This argument fails precisely because we do have sufficient prior knowledge that empirical networks are not maximally random — especially those possessing community structure, according to any meaningful definition of the term. Furthermore, it is easy to verify for each particular problem instance that the uniform assumption does not hold: either by compressing an observed network using any generative model (which should be asymptotically impossible under the uniform assumption), or by performing a statistical test designed to reject the uniform null model. It is exceedingly difficult to find an empirical network for which the uniform model cannot be rejected with near-absolute confidence. |

[5] | It is important to distinguish the actual statement of the NFL theorem — “all algorithms perform equally well when averaged over all problem instances” — from the alternative statement: “No single algorithm exhibits strictly better performance than all others over all instances.” Although the latter is a corollary of the former, it can also be true when the former is false. In other words, a particular algorithm can be better on average over relevant problem instances, but still underperform for some of them. In fact, it would only be possible for an algorithm to strictly dominate all others if it can always achieve perfect accuracy for every instance. Otherwise, there will be at least one algorithm (e.g. one that always returns the same partition) that can achieve perfect accuracy for a single network where the optimal algorithm does not (“even a broken clock is right twice a day”). Therefore, sub-optimal algorithms can eventually outperform optimal ones by chance when a sufficiently large number of instances is encountered, even when the NFL theorem is not applicable (and therefore this fact is not necessarily a direct consequence of it). |

(This is a continuation of the previous two posts, and a slightly modified version of chapter III in [peixoto_descriptive_2021].)

The most widespread method for community detection is modularity maximization [newman_modularity_2006], which happens also to be one of the most problematic. This method is based on the modularity function (eq:Q),

$$Q(A,b)=\frac{1}{2E}\sum _{ij}({A}_{ij}-\frac{{k}_{i}{k}_{j}}{2E}){\delta}_{{b}_{i},{b}_{j}}$$

where ${A}_{ij}\in \{0,1\}$ is an entry of the adjacency matrix, ${k}_{i}=\sum _{j}{A}_{ij}$ is the degree of node $i$, ${b}_{i}$ is the group membership of node $i$, and $E$ is the total number of edges. The method consists in finding the partition $\hat{b}$ that maximizes $Q(A,b)$ (eq:qmax),

$$\hat{b}=\underset{b}{\mathrm{argmax}}\, Q(A,b).$$

The motivation behind the modularity function is that it compares the existence of an edge $(i,j)$ to the probability of it existing according to a null model, ${P}_{ij}={k}_{i}{k}_{j}/2E$, namely that of the configuration model [fosdick_configuring_2018] (or more precisely, the Chung-Lu model [chung_connected_2002]). The intuition for this method is that we should consider a partition of the network meaningful if the occurrence of edges between nodes of the same group exceeds what we would expect with a random null model without communities.
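For concreteness, eq:Q can be implemented in a few lines. The sketch below (my own illustration, not code from the text) evaluates $Q$ on a toy graph of two triangles joined by a single edge, split into its two natural groups:

```python
# Direct implementation of the modularity function Q(A, b) of eq:Q,
# for an undirected simple graph given as a 0/1 adjacency matrix.
def modularity(adj, b):
    n = len(adj)
    k = [sum(row) for row in adj]  # node degrees
    two_E = sum(k)                 # 2E, twice the number of edges
    return sum(
        adj[i][j] - k[i] * k[j] / two_E
        for i in range(n) for j in range(n)
        if b[i] == b[j]
    ) / two_E

# two triangles joined by a single edge, split into their natural groups
adj = [[0] * 6 for _ in range(6)]
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i][j] = adj[j][i] = 1
q = modularity(adj, [0, 0, 0, 1, 1, 1])  # = 5/14, about 0.357
```

Note that the sum runs over all ordered pairs, including $i=j$, exactly as in eq:Q; the diagonal contributes only the null-model term $-k_i^2/2E$ since $A_{ii}=0$ for simple graphs.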

Despite its widespread adoption, this approach suffers from a variety of serious conceptual and practical flaws, which have been documented extensively [guimera_modularity_2004] [fortunato_resolution_2007] [good_performance_2010] [fortunato_community_2010] [fortunato_community_2016]. The most problematic one is that it purports to use an inferential criterion — a deviation from a null generative model — but is in fact merely descriptive. As has been recognized very early, this method categorically fails in its own stated goal, since it always finds high-scoring partitions in networks sampled from its own null model [guimera_modularity_2004].

The reason for this failure is that the method does not take into account the deviation from the null model in a statistically consistent manner. The modularity function is just a re-scaled version of the assortativity coefficient [newman_mixing_2003], a correlation measure of the community assignments seen at the endpoints of edges in the network. We should expect such a correlation value to be close to zero for a partition that is determined before the edges of the network are placed according to the null model, or equivalently, for a partition chosen at random. However, it is quite a different matter to find a partition that optimizes the value of $Q(A,b)$, after the network is observed. The deviation from a null model computed in eq:Q completely ignores the optimization step of eq:qmax, although it is a crucial part of the algorithm. As a result, the method of modularity maximization tends to massively overfit, and find spurious communities even in networks sampled from its null model. We are searching for patterns of correlations in a random graph, and most of the time we will find them. This is a pitfall known as “data dredging” or “p-hacking”, where one searches exhaustively for different patterns in the same data and reports only those that are deemed significant, according to a criterion that does not take into account the fact that we are doing this search in the first place.

We demonstrate this problem in fig:randomQ, where we show the distribution of modularity values obtained with a uniform configuration model with ${k}_{i}=5$ for every node $i$, considering both a random partition and the one that maximizes $Q(A,b)$. While for a random partition we find what we would expect, i.e. a value of $Q(A,b)$ close to zero, for the optimized partition the value is substantially larger. Inspecting the optimized partition in fig:randomQ (c), we see that it corresponds indeed to 15 seemingly clear assortative communities — which by construction bear no relevance to how the network was generated. They have been dredged out of randomness by the optimization procedure.
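This dredging effect is easy to reproduce. The sketch below is an illustration in the spirit of fig:randomQ, though with simplifications of my own: it uses an Erdős–Rényi graph of mean degree 5 instead of the strict configuration model, and a simple greedy label-moving heuristic rather than any particular published optimizer. It still finds high-modularity partitions in a graph that is random by construction:

```python
# Illustration: optimizing Q on a fully random graph dredges up "communities".
import random

random.seed(1)
N, kavg = 200, 5

# a fully random graph with mean degree ~5 (no community structure)
nbrs = [set() for _ in range(N)]
for i in range(N):
    for j in range(i):
        if random.random() < kavg / (N - 1):
            nbrs[i].add(j)
            nbrs[j].add(i)
deg = [len(s) for s in nbrs]
E = sum(deg) / 2

def modularity(groups):
    # Q via per-group internal edge counts L_r and degree sums D_r
    B = max(groups) + 1
    L, D = [0.0] * B, [0.0] * B
    for i in range(N):
        D[groups[i]] += deg[i]
        L[groups[i]] += sum(groups[j] == groups[i] for j in nbrs[i]) / 2
    return sum(L[r] / E - (D[r] / (2 * E)) ** 2 for r in range(B))

def greedy_local_moves(groups, max_sweeps=100):
    # repeatedly move nodes to the neighboring group that most increases Q
    D = [0.0] * (max(groups) + 1)
    for i in range(N):
        D[groups[i]] += deg[i]
    for _ in range(max_sweeps):
        moved = False
        for i in range(N):
            a = groups[i]
            links = {}  # edges from i to each neighboring group
            for j in nbrs[i]:
                links[groups[j]] = links.get(groups[j], 0) + 1
            best, best_dq = a, 0.0
            for r, l in links.items():
                if r == a:
                    continue
                # exact change in Q when moving node i from group a to r
                dq = (l - links.get(a, 0)) / E \
                     - deg[i] * (D[r] - D[a] + deg[i]) / (2 * E * E)
                if dq > best_dq + 1e-12:
                    best, best_dq = r, dq
            if best != a:
                D[a] -= deg[i]
                D[best] += deg[i]
                groups[i] = best
                moved = True
        if not moved:
            break
    return groups

b_rand = [random.randrange(15) for _ in range(N)]
q_rand = modularity(b_rand)                           # close to zero
q_opt = modularity(greedy_local_moves(list(b_rand)))  # substantially larger
```

Exactly as in the figure: a partition fixed before looking at the network scores near zero, while an optimized one scores high, despite the graph containing no planted structure whatsoever.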

Somewhat paradoxically, another problem with modularity maximization is that in addition to systematically overfitting, it also systematically underfits. This occurs via the so-called resolution limit: in a connected network [1] the method cannot find more than $\sqrt{2E}$ communities [fortunato_resolution_2007], even if they seem intuitive or can be found by other methods. An example of this is shown in fig:resolution, where for a network generated with the SBM containing 30 communities, modularity maximization finds only 18, while an inferential approach has no problems finding the true structure. There are attempts to counteract the resolution limit by introducing a “resolution parameter” to the modularity function, but they are in general ineffective (see [peixoto_descriptive_2021]).
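The resolution limit can also be checked by hand in the classic example of [fortunato_resolution_2007]: a ring of $c$ cliques of size $m$, with adjacent cliques joined by single edges. The closed-form sketch below (my own arithmetic, not code from the text, and a different example network from the SBM one in fig:resolution) shows that once $c$ is large enough, merging pairs of adjacent cliques yields a higher modularity than the planted one-group-per-clique partition:

```python
# Resolution limit on a ring of c cliques of size m, where adjacent cliques
# are joined by single edges, evaluated in closed form per group.
def ring_of_cliques_Q(c, m, merged):
    intra = m * (m - 1) // 2           # edges inside one clique
    E = c * (intra + 1)                # total edges, including ring edges
    if merged:
        # adjacent cliques merged in pairs (c must be even): each group gains
        # the connecting ring edge; degree sum doubles
        L, D, B = 2 * intra + 1, 2 * (2 * intra + 2), c // 2
    else:
        # one group per clique: intra edges inside, two ring stubs outside
        L, D, B = intra, 2 * intra + 2, c
    return B * (L / E - (D / (2 * E)) ** 2)

# with 30 cliques of K_5 (so sqrt(2E) ~ 25.7 < 30), merging wins:
q30_true = ring_of_cliques_Q(30, 5, merged=False)
q30_merged = ring_of_cliques_Q(30, 5, merged=True)
# with only 10 cliques, the true partition still has the highest score:
q10_true = ring_of_cliques_Q(10, 5, merged=False)
q10_merged = ring_of_cliques_Q(10, 5, merged=True)
```

For $K_5$ cliques the two scores work out to $10/11 - 1/c$ (true partition) and $21/22 - 2/c$ (merged pairs), so the merged partition wins exactly when $c > 22$, consistent with the $\sqrt{2E}$ bound.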

These two problems — overfitting and underfitting — can occur in tandem, such that portions of the network dominated by randomness are spuriously revealed to contain communities, whereas other portions with clear modular structure can have those obstructed. The result is a very unreliable method to capture the structure of heterogeneous networks. We demonstrate this in fig:resolution (c) and (d).

In addition to these major problems, modularity maximization also often possesses a degenerate landscape of solutions, with very different partitions having similar values of $Q(A,b)$ [good_performance_2010]. In these situations the partition with maximum value of modularity can be a poor representative of the entire set of high-scoring solutions and depend on idiosyncratic details of the data rather than general patterns — which can be interpreted as a different kind of overfitting.

The combined effects of underfitting and overfitting can make the results obtained with the method unreliable and difficult to interpret. As a demonstration of the systematic nature of the problem, in fig:Qrand (a) we show the number of communities obtained using modularity maximization for 263 empirical networks of various sizes and belonging to different domains, obtained from the Netzschleuder catalogue. Since the networks considered are all connected, the values are always below $\sqrt{2E}$, due to the resolution limit; but otherwise they are well distributed over the allowed range. However, in fig:Qrand (b) we show the same analysis, but for a version of each network that is fully randomized, while preserving the degree sequence. In this case, the number of groups remains distributed in the same range (sometimes even exceeding the resolution limit, because the randomized versions can end up disconnected). As fig:Qrand (c) shows, the number of groups found for the randomized networks is strongly correlated with the original ones, despite the fact that the former have no latent community structure. This is a strong indication of the substantial amount of noise that is incorporated into the partitions found with the method.

The systematic overfitting of modularity maximization — as well as of other descriptive methods such as Infomap — has also been demonstrated recently in [ghasemian_evaluating_2019], from the point of view of edge prediction, on a separate empirical dataset of 572 networks from various domains.

Although many of the problems with modularity maximization were long known, for some time there were no principled solutions to them, but this is no longer the case. In the table below we summarize some of the main problems with modularity and how they are solved with inferential approaches.

| Problem | Principled solution via inference |
| --- | --- |
| Modularity maximization overfits, and finds modules in fully random networks [guimera_modularity_2004]. | Bayesian inference of the SBM is designed from the ground up to avoid this problem in a principled way, and systematically succeeds [peixoto_bayesian_2019]. |
| Modularity maximization has a resolution limit, and finds at most $\sqrt{2E}$ groups in connected networks [fortunato_resolution_2007]. | Inferential approaches with hierarchical priors [peixoto_hierarchical_2014] [peixoto_nonparametric_2017] or strictly assortative structures [zhang_statistical_2020] do not have any appreciable resolution limit, and can find a maximum number of groups that scales as $O(N/\mathrm{log}N)$. Importantly, this is achieved without sacrificing the robustness against overfitting. |
| Modularity maximization has a characteristic scale, and tends to find communities of similar size; in particular with the same sum of degrees. | Hierarchical priors can be specifically chosen to be a priori agnostic about characteristic sizes, densities of groups, and degree sequences [peixoto_nonparametric_2017], such that these are not imposed, but instead obtained from inference, in an unbiased way. |
| Modularity maximization can only find strictly assortative communities. | Inferential approaches can be based on any generative model. The general SBM will find any kind of mixing pattern in an unbiased way, and has no problems identifying modular structure in bipartite networks, core-periphery networks, and any mixture of these or other patterns. There are also specialized versions for bipartite [larremore_efficiently_2014], core-periphery [zhang_identification_2015], and assortative patterns [zhang_statistical_2020], if these are being searched exclusively. |
| The solution landscape of modularity maximization is often degenerate, with many different partitions having close to the same modularity value [good_performance_2010], and with no clear way to select between them. | Inferential methods are characterized by a posterior distribution of partitions. The consensus or dissensus between the different solutions [peixoto_revealing_2021] can be used to determine how many cohesive hypotheses can be extracted from inference, and to what extent the model being used is a poor or a good fit for the network. |

Because of the above problems, the use of modularity maximization should be discouraged, since it is demonstrably not fit for purpose as an inferential method. As a consequence, the use of modularity maximization in any recent network analysis can be arguably considered a “red flag” that strongly indicates methodological carelessness. In the absence of secondary evidence supporting the alleged community structures found, or extreme care to counteract the several limitations of the method, the safest assumption is that the results obtained with that method tend to contain a substantial amount of noise, rendering any inferential conclusion derived from them highly suspicious.

As a final note, we focus on modularity here not only because of its widespread adoption but also because of its emblematic character. At a fundamental level, all of its shortcomings are shared with every descriptive method in the literature — to varying but always non-negligible degrees.

[peixoto_descriptive_2021] | (1, 2) Tiago P. Peixoto, “Descriptive
vs. inferential community detection: pitfalls, myths and half-truths”,
arXiv: 2112.00183 |

[newman_modularity_2006] | M. E. J. Newman, “Modularity and community structure in networks,” Proceedings of the National Academy of Sciences 103, 8577–8582 (2006). DOI: 10.1073/pnas.0601602103 |

[fosdick_configuring_2018] | B. Fosdick, D. Larremore, J. Nishimura, and J. Ugander, “Configuring Random Graph Models with Fixed Degree Sequences,” SIAM Review 60, 315–355 (2018). DOI: 10.1137/16M1087175 |

[chung_connected_2002] | Fan Chung and Linyuan Lu, “Connected Components in Random Graphs with Given Expected Degree Sequences,” Annals of Combinatorics 6, 125–145 (2002). DOI: 10.1007/PL00012580 |

[guimera_modularity_2004] | (1, 2, 3) Roger Guimerà, Marta Sales-Pardo, and Luís A. Nunes Amaral,
“Modularity from fluctuations in random graphs and complex networks,”
Physical Review E 70, 025101 (2004). DOI: 10.1103/PhysRevE.70.025101 |

[fortunato_resolution_2007] | (1, 2, 3) Santo Fortunato and Marc Barthélemy,
“Resolution limit in community detection”, Proceedings of the National
Academy of Sciences 104, 36–41 (2007). DOI: 10.1073/pnas.0605965104 |

[good_performance_2010] | (1, 2, 3) Benjamin H. Good, Yves-Alexandre de Montjoye,
and Aaron Clauset, “Performance of modularity maximization in practical
contexts”, Physical Review E 81, 046106 (2010). DOI: 10.1103/PhysRevE.81.046106 |

[fortunato_community_2010] | Santo Fortunato, “Community detection in graphs”, Physics Reports 486, 75–174 (2010). DOI: 10.1016/j.physrep.2009.11.002 |

[fortunato_community_2016] | Santo Fortunato and Darko Hric, “Community detection in networks: A user guide”, Physics Reports(2016), DOI: 10.1016/j.physrep.2016.09.002 |

[newman_mixing_2003] | M. E. J. Newman, “Mixing patterns in networks”, Phys. Rev. E 67, 026126 (2003). DOI: 10.1103/PhysRevE.67.026126 |

[ghasemian_evaluating_2019] | Amir Ghasemian, Homa Hosseinmardi, and Aaron Clauset, “Evaluating Overfit and Underfit in Models of Network Community Structure,” IEEE Transactions on Knowledge and Data Engineering, 1–1 (2019). DOI: 10.1109/TKDE.2019.2911585 |

[peixoto_hierarchical_2014] | Tiago P. Peixoto, “Hierarchical Block Structures and High-Resolution Model Selection in Large Networks”, Physical Review X 4, 011047 (2014). DOI: 10.1103/PhysRevX.4.011047 |

[peixoto_nonparametric_2017] | (1, 2) Tiago P. Peixoto, “Nonparametric Bayesian inference of the
microcanonical stochastic block model”, Physical Review E 95, 012317 (2017).
DOI: 10.1103/PhysRevE.95.012317 |

[zhang_statistical_2020] | (1, 2, 3) Lizhi Zhang and Tiago P. Peixoto,
“Statistical inference of assortative community structures.” Physical
Review Research 2, 043271 (2020). DOI: 10.1103/PhysRevResearch.2.043271 |

[larremore_efficiently_2014] | Daniel B. Larremore, Aaron Clauset, and Abigail Z. Jacobs, “Efficiently inferring community structure in bipartite networks”, Physical Review E 90, 012805 (2014). DOI: 10.1103/PhysRevE.90.012805 |

[zhang_identification_2015] | Xiao Zhang, Travis Martin, and M. E. J. Newman, “Identification of core-periphery structure in networks,” Physical Review E 91, 032803 (2015). DOI: 10.1103/PhysRevE.91.032803 |

[peixoto_revealing_2021] | Tiago P. Peixoto, “Revealing Consensus and Dissensus between Network Partitions”, Physical Review X 11, 021003 (2021). DOI: 10.1103/PhysRevX.11.021003 |

[peixoto_bayesian_2019] | Tiago P. Peixoto, “Bayesian Stochastic Blockmodeling”, in Advances in Network Clustering and Blockmodeling (John Wiley & Sons, Ltd, 2019) pp. 289–332. DOI: 10.1002/9781119483298.ch11 |

[1] | Modularity maximization, like many descriptive community detection methods, will always place connected components in different communities. This is another clear distinction with inferential approaches, since fully random models — without latent community structure — can generate disconnected networks if they are sufficiently sparse. From an inferential point of view, it is therefore incorrect to assume that every connected component must belong to a different community. |

(This is a continuation of the previous blog post, and slightly modified version of chapter II in [peixoto_descriptive_2021])

Inferential approaches to community detection (see [peixoto_bayesian_2019] for a detailed introduction) are designed to provide explanations for network data in a principled manner. They are based on the formulation of generative models that include the notion of community structure in the rules of how the edges are placed. More formally, they are based on the definition of a likelihood $P(A|b)$ for the network $A$ conditioned on a partition $b$, and the inference is obtained via the posterior distribution, according to Bayes' rule, i.e.

$$P(b|A)=\frac{P(A|b)P(b)}{P(A)},$$

where $P(b)$ is the prior probability for a partition $b$. Overwhelmingly, the models used for this purpose are variations of the stochastic block model (SBM) [holland_stochastic_1983], which, in addition to the node partition, takes the probabilities of edges being placed between the different groups as a further set of parameters. A particularly expressive variation is the degree-corrected SBM (DC-SBM) [karrer_stochastic_2011], with a marginal likelihood given by [peixoto_nonparametric_2017]

$$P(A|b)=\sum _{e,k}P(A|k,e,b)P(k|e,b)P(e|b),$$

where $e=\left\{{e}_{rs}\right\}$ is a matrix with elements ${e}_{rs}$ specifying how many edges go between groups $r$ and $s$, and $k=\left\{{k}_{i}\right\}$ are the degrees of the nodes. Therefore, this model specifies that, conditioned on a partition $b$, first the edge counts $e$ are sampled from a prior distribution $P(e|b)$, followed by the degrees from the prior $P(k|e,b)$, and finally the network is wired together according to the probability $P(A|k,e,b)$, which respects the constraints given by $k$, $e$, and $b$. See fig:dcsbm (a) for an illustration of this process.
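To make this generative process concrete, here is a minimal pure-Python sketch of the final wiring stage $P(A|k,e,b)$: the pairing of half-edge “stubs” constrained by $k$, $e$, and $b$. The function name and toy parameters are purely illustrative (this is not graph-tool code); self-loops and multi-edges can occur, as in the microcanonical model.

```python
import random

def sample_dcsbm_microcanonical(b, k, e, seed=42):
    """Wire a network given a partition b, degrees k, and edge counts e
    between groups, by pairing half-edge "stubs" uniformly at random.
    This sketches only the last stage, P(A|k,e,b); self-loops and
    multi-edges are allowed, as in the microcanonical model."""
    rng = random.Random(seed)
    B = len(e)
    # one stub per half-edge, grouped by the node's community
    stubs = [[] for _ in range(B)]
    for i, (r, deg) in enumerate(zip(b, k)):
        stubs[r].extend([i] * deg)
    for group in stubs:
        rng.shuffle(group)
    edges = []
    for r in range(B):
        for s in range(r, B):
            # e[r][r] counts every internal edge twice (one per endpoint)
            n_edges = e[r][s] // 2 if r == s else e[r][s]
            for _ in range(n_edges):
                edges.append((stubs[r].pop(), stubs[s].pop()))
    return edges

b = [0, 0, 0, 1, 1, 1]   # partition into two groups
k = [2, 2, 2, 2, 2, 2]   # degree sequence
e = [[4, 2], [2, 4]]     # 2 internal edges per group, 2 between groups
edges = sample_dcsbm_microcanonical(b, k, e)
print(edges)             # 6 edges in total, respecting k, e and b
```

By construction, every node ends up with exactly its prescribed degree, and exactly $e_{rs}$ half-edge pairs connect groups $r$ and $s$.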

This model formulation includes fully random networks as the special case when we have a single group. Together with the Bayesian approach, the use of this model will inherently favor a more parsimonious account of the data whenever a more complex description is not warranted — amounting to a formal implementation of Occam's razor. This is best seen by making a formal connection with information theory, and noticing that we can write the numerator of eq:bayes as

$$P(A|b)P(b)={2}^{-\Sigma (A,b)},$$

where the quantity $\Sigma (A,b)$ is known as the description length [grunwald_minimum_2007] of the network. It is computed as [1]:

$$\Sigma (A,b)=\underbrace{-{\mathrm{log}}_{2}P(A|k,e,b)}_{\mathcal{D}(A|k,e,b)}\,\underbrace{-{\mathrm{log}}_{2}P(k|e,b)-{\mathrm{log}}_{2}P(e|b)-{\mathrm{log}}_{2}P(b)}_{\mathcal{M}(k,e,b)}.$$

The second set of terms $\mathcal{M}(k,e,b)$ in the above equation quantifies the amount of information in bits necessary to encode the parameters of the model [2]. The first term $\mathcal{D}(A|k,e,b)$ determines how many bits are necessary to encode the network itself, once the model parameters are known. This means that if Bob wants to communicate to Alice the structure of a network $A$, he first needs to transmit $\mathcal{M}(k,e,b)$ bits of information to describe the parameters $b$, $e$, and $k$, and then finally transmit the remaining $\mathcal{D}(A|k,e,b)$ bits to describe the network itself. Then, Alice will be able to understand the message by first decoding the parameters $(k,e,b)$ from the first part of the message, and using that knowledge to obtain the network $A$ from the second part, without any errors.
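As a toy illustration of such a two-part code, consider Bob describing a binary string to Alice: he first transmits the number of ones (the “model” part, playing the role of $\mathcal{M}$), and then which particular string with that many ones it is (the “data” part, playing the role of $\mathcal{D}$). The helper below is a hypothetical sketch for this toy setting, not code from any reference:

```python
from math import comb, log2

def two_part_dl(bits):
    """Two-part description length of a binary string: first transmit
    the number m of ones (the "model", log2(N+1) bits), then which of
    the C(N, m) strings with that many ones it is (the "data")."""
    N, m = len(bits), sum(bits)
    model = log2(N + 1)       # M: cost of transmitting the parameter m
    data = log2(comb(N, m))   # D: cost of the string, given m
    return model + data

structured = [1] * 90 + [0] * 10   # highly skewed: the model helps
balanced = [1, 0] * 50             # balanced: essentially incompressible

print(two_part_dl(structured))   # ≈ 50.6 bits, well below a direct 100-bit encoding
print(two_part_dl(balanced))     # ≈ 103 bits: no compression over a direct encoding
```

The skewed string compresses because the model part is cheap while the data part shrinks dramatically; for the balanced string the parameter is pure overhead, mirroring how a fully random network gains nothing from extra groups.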

What the above connection shows is that there is a formal equivalence between inferring the communities of a network and compressing it. This happens because finding the most likely partition $b$ from the posterior $P(b|A)$ is equivalent to minimizing the description length $\Sigma (A,b)$ used by Bob to transmit a message to Alice containing the whole network.

Data compression amounts to formal implementation of Occam's razor because it penalizes models that are too complicated: if we want to describe a network using many communities, then the model part of the description length $\mathcal{M}(k,e,b)$ will be large, and Bob will need many bits to transmit the model parameters to Alice. However, increasing the complexity of the model will also reduce the first term $\mathcal{D}(A|k,e,b)$, since there are fewer networks that are compatible with the bigger set of constraints, and hence Bob will need a shorter second part of the message to convey the network itself once the parameters are known. Compression (and hence inference), therefore, is a balancing act between model complexity and quality of fit, where an increase in the former is only justified when it results in an even larger increase in the latter, such that the total description length is minimized.

The compression approach avoids overfitting the data thanks to a powerful fact from information theory, known as Shannon's source coding theorem [shannon_mathematical_1948], which states that it is impossible to compress data sampled from a distribution $P(x)$ using fewer bits per symbol than the entropy of the distribution, $H=-\sum _{x}P(x){\mathrm{log}}_{2}P(x)$. In our context, this means that it is impossible, for example, to compress a fully random network using a SBM with more than one group [3]. Consequently, when confronted with an example like the one considered in the previous blog post, inferential methods will detect a single community comprising all nodes in the network, since any further division provides no increased compression, or equivalently, no added explanatory power. From the inferential point of view, a partition like in the previous figure (b) overfits the data, since it incorporates irrelevant random features — a.k.a. “noise” — into its description.
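A quick numerical look at the entropy bound, using a simple helper (standard library only):

```python
from math import log2

def entropy(p):
    """Shannon entropy, in bits per symbol, of a discrete distribution."""
    return sum(-pi * log2(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))   # 1.0: a fair coin cannot be compressed below 1 bit/symbol
print(entropy([0.9, 0.1]))   # ≈ 0.469: a biased coin is compressible
print(entropy([1.0]))        # 0.0: a deterministic source needs no bits at all
```

A fully random network plays the role of the fair coin here: its entropy is already saturated, so no partition, however elaborate, can encode it with fewer bits.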

In fig:inferential (a) we show an example of the results obtained with an inferential community detection algorithm, for a network sampled from the SBM. As shown in fig:inferential (b), the obtained partition remains valid when carried over to an independent sample of the model, because the algorithm is capable of separating the general underlying pattern from the random fluctuations. As a consequence of this separability, this kind of algorithm does not find communities in fully random networks, which are composed only of “noise.”

Inferential approaches based on the SBM have a long history, having been introduced for the study of social networks in the early 80's [holland_stochastic_1983]. But despite their age, and having appeared repeatedly in the literature over the years (also under different names in other contexts), they entered the mainstream community detection literature rather late, arguably only after the influential 2011 paper by Karrer and Newman that introduced the DC-SBM [karrer_stochastic_2011], at a point where descriptive approaches were already dominant. However, despite the dominance of descriptive methods, inferential criteria had long been noticeable in the background. In fact, in a well-known attempt to systematically compare the quality of a variety of descriptive community detection methods, the authors of [lancichinetti_benchmark_2008] proposed the now so-called LFR benchmark, offered as a more realistic alternative to the simpler Newman-Girvan benchmark [girvan_community_2002] introduced earlier. Both are in fact generative models, essentially particular cases of the DC-SBM, containing a “ground truth” community label assignment, against which the results of various algorithms are supposed to be compared. Clearly, this is an inferential evaluation criterion, although, historically, virtually all of the methods compared against these benchmarks are descriptive in nature [lancichinetti_community_2009] (these studies were conducted mostly before inferential approaches had gained more traction). The use of such a criterion already betrays that the answer to the litmus test considered in the previous post would be “yes,” and therefore that descriptive approaches are fundamentally unsuitable for the task.
In contrast, methods based on statistical inference are not only more principled, but in fact provably optimal in the inferential scenario, in the sense that all conceivable algorithms can obtain either equal or worse performance, but none can do better [decelle_asymptotic_2011].

The conflation one often finds between descriptive and inferential goals in the literature of community detection likely stems from the fact that while it is easy to define benchmarks in the inferential setting, it is substantially more difficult to do so in a descriptive setting. Given any descriptive method (modularity maximization, Infomap, Markov stability, etc.) it is usually problematic to determine for which network these methods are optimal (or even if one exists), and what would be a canonical output that would be unambiguously correct. In fact, the difficulty in establishing these fundamental references already serves as evidence that the task itself is ill-defined. On the other hand, taking an inferential route forces one to start with the right answer, via a well-specified generative model that articulates what the communities actually mean with respect to the network structure. Based on this precise definition, one then derives the optimal detection method by employing Bayes' rule.

It is also useful to observe that inferential analyses of aspects of the network other than directly its structure might still be only descriptive of the structure itself. A good example of this is the modelling of dynamics that take place on a network, such as a random walk. This is precisely the case of the Infomap method, which models a simulated teleporting random walk on a network in an inferential manner, using a division of the network into groups to do so. While this approach can be considered inferential with respect to an artificial dynamics, it is still only descriptive when it comes to the actual network structure (and will suffer the same problems, such as finding communities in fully random networks). Communities found in this way could be useful for particular tasks, such as to identify groups of nodes that would be similarly affected by a diffusion process. This could be used, for example, to prevent or facilitate the diffusion by removing or adding edges between the identified groups. In this setting, the answer to the litmus test would also be “no”, since what is important is how the network “is” (i.e. how a random walk behaves on it), not how it came to be, or if its features are there by chance alone. Once more, the important issue to remember is that the groups identified in this manner cannot be interpreted as having any explanatory power about the network structure itself, and cannot be used reliably to extract inferential conclusions from it. We are firmly in a descriptive, not inferential setting with respect to the network structure.

Another important difference between inferential and descriptive approaches is worth mentioning. Descriptive approaches are tied to very particular contexts, and cannot be directly compared to one another. This has caused great consternation in the literature, since there is a vast number of such methods, and little robust methodology on how to compare them. Indeed, why should we expect that the modules found by optimizing task scheduling should be comparable to those that optimize the description of a dynamics? In contrast, inferential approaches all share the same underlying context: they attempt to explain the network structure; they vary only in how this is done. They are, therefore, amenable to principled model selection procedures, designed to evaluate which is the most appropriate fit for any particular network, even if the models used operate with very different parametrizations. In this situation, the multiplicity of different models available becomes a boon rather than a hindrance, since they all contribute to a bigger toolbox we have at our disposal when trying to understand empirical observations.

Finally, inferential approaches offer additional advantages that make them more suitable as part of a scientific pipeline. In particular, they can be naturally extended to accommodate measurement uncertainties [peixoto_reconstructing_2018] — an unavoidable property of empirical data, which descriptive methods almost universally fail to consider. This information can be used not only to propagate the uncertainties to the community assignments [peixoto_revealing_2021] but also to reconstruct the missing or noisy measurements of the network itself [guimera_missing_2009]. Furthermore, inferential approaches can be coupled with even more indirect observations such as time-series on the nodes [hoffmann_community_2020], instead of a direct measurement of the edges of the network, such that the network itself is reconstructed, not only the community structure [peixoto_network_2019]. All these extensions are possible because inferential approaches give us more than just a division of the network into groups; they give us a model estimate of the network, containing insights about its formation mechanism.

From a purely mathematical perspective, there is actually no formal distinction between descriptive and inferential methods, because every descriptive method can be mapped to an inferential one, according to some implicit model. Therefore, whenever we are attempting to interpret the results of a descriptive community detection method in an inferential way — i.e. make a statement about how the network came to be — we cannot in fact avoid making implicit assumptions about the model generating process that lies behind it. (At first this statement seems to undermine the distinction we have been making between descriptive and inferential methods, but in fact this is not the case, as we will see below.)

It is not difficult to demonstrate that it is possible to formulate any conceivable community detection method as a particular inferential method. Let us consider an arbitrary quality function

$$W(A,b)\in \mathbb{R}$$

which is used to perform community detection via the optimization

$${b}^{*}=\underset{b}{\mathrm{argmax}}\phantom{\rule{0.2778em}{0ex}}W(A,b).$$

We can then interpret the quality function $W(A,b)$ as the “Hamiltonian” of a posterior distribution

$$P(b|A)=\frac{{\mathrm{e}}^{\beta W(A,b)}}{Z(A)},$$

with normalization $Z(A)=\sum _{b}{\mathrm{e}}^{\beta W(A,b)}$. By making $\beta \to \infty $ we recover the optimization of eq:opt, or we may simply try to find the most likely partition according to the posterior, in which case $\beta >0$ remains an arbitrary parameter. Therefore, employing Bayes' rule in the opposite direction, we obtain the following effective generative model:

$$P(A|b)=\frac{P(b|A)P(A)}{P(b)}=\frac{{\mathrm{e}}^{\beta W(A,b)}}{Z(A)}\,\frac{P(A)}{P(b)},$$

where $P(A)=\sum _{b}P(A|b)P(b)$ is the marginal distribution over networks, and $P(b)$ is the prior distribution for the partition. Due to the normalization of $P(A|b)$, the following constraint needs to be fulfilled:

$$\sum _{A}\frac{{\mathrm{e}}^{\beta W(A,b)}}{Z(A)}P(A)=P(b).$$

Therefore, not all choices of $P(A)$ and $P(b)$ are compatible with the posterior distribution and the exact possibilities will depend on the actual shape of $W(A,b)$. However, one choice that is always possible is

$$P(A)=\frac{Z(A)}{\Xi},\phantom{\rule{2em}{0ex}}P(b)=\frac{\Omega (b)}{\Xi},$$

with $\Omega (b)=\sum _{A}{\mathrm{e}}^{\beta W(A,b)}$ and $\Xi =\sum _{A,b}{\mathrm{e}}^{\beta W(A,b)}$. Taking this choice leads to the effective generative model

$$P(A|b)=\frac{{\mathrm{e}}^{\beta W(A,b)}}{\Omega (b)}.$$

Therefore, inferentially interpreting a community detection algorithm with a quality function $W(A,b)$ is equivalent to assuming the generative model $P(A|b)$ and prior $P(b)$ above. Furthermore, this also means that any arbitrary community detection algorithm implies a description length [4] given (in nats) by

$$\Sigma (A,b)=-\beta W(A,b)+\mathrm{ln}\sum _{A\text{'},b\text{'}}{\mathrm{e}}^{\beta W(A\text{'},b\text{'})}.$$
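This mapping can be checked numerically on a toy example. The sketch below chooses an arbitrary quality function $W(A,b)$ over a small discrete space (the values are purely illustrative) and verifies that the effective likelihood ${\mathrm{e}}^{\beta W(A,b)}/\Omega (b)$ is normalized, and that ${\mathrm{e}}^{-\Sigma (A,b)}$ recovers the joint $P(A|b)P(b)$ with $P(b)=\Omega (b)/\Xi $ (working in nats, hence base $\mathrm{e}$):

```python
from math import exp, log
from itertools import product

# A toy space of three "networks" A and two "partitions" b, with an
# arbitrary quality function W(A, b); the values are purely illustrative.
beta = 1.0
W = {(A, b): 0.5 * (A + 1) * (b + 1) for A, b in product(range(3), range(2))}

Z = {A: sum(exp(beta * W[A, b]) for b in range(2)) for A in range(3)}      # Z(A)
Omega = {b: sum(exp(beta * W[A, b]) for A in range(3)) for b in range(2)}  # Ω(b)
Xi = sum(exp(beta * W[A, b]) for A, b in product(range(3), range(2)))      # Ξ

def P_A_given_b(A, b):
    """Effective generative model P(A|b) = e^{βW(A,b)} / Ω(b)."""
    return exp(beta * W[A, b]) / Omega[b]

def description_length(A, b):
    """Implied description length Σ(A,b) = -βW(A,b) + ln Ξ, in nats."""
    return -beta * W[A, b] + log(Xi)

# the effective likelihood is properly normalized over networks
print(sum(P_A_given_b(A, 0) for A in range(3)))   # ≈ 1.0
# and e^{-Σ(A,b)} recovers the joint P(A|b) P(b), with P(b) = Ω(b)/Ξ
A, b = 2, 1
print(exp(-description_length(A, b)), P_A_given_b(A, b) * Omega[b] / Xi)  # both ≈ 0.514
```

The agreement is just the identity ${\mathrm{e}}^{-\Sigma }={\mathrm{e}}^{\beta W}/\Xi =({\mathrm{e}}^{\beta W}/\Omega (b))\cdot (\Omega (b)/\Xi )$, confirming the construction above on a concrete instance.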

What the above shows is that **there is no such thing as a “model-free”
community detection method**, since they are all equivalent to the
inference of some generative model. The only difference from a direct
inferential method is that in that case the modelling assumptions are
made explicitly, inviting rather than preventing scrutiny. Most often,
the effective model and prior that are equivalent to an ad hoc
community detection method will be difficult to interpret, justify, or
even compute.

Furthermore, there is no guarantee that the description length of eq:dl_W will yield a competitive or even meaningful compression. In particular, there is no guarantee that this effective inference will not overfit the data. Although we mentioned in the previous section that inference and compression are equivalent, the compression achieved when considering a particular generative model is constrained by the assumptions encoded in its likelihood and prior. If these are poorly chosen, no actual compression might be achieved, for example relative to that obtained with a fully random model. This is precisely what happens with descriptive community detection methods: they overfit because their implicit modelling assumptions do not accommodate the possibility that a network may be fully random, or contain a balanced mixture of structure and randomness.

Since we can always interpret any community detection method as inferential, is it still meaningful to categorize some methods as descriptive? Arguably yes, because directly inferential approaches make their generative models and priors explicit, while for a descriptive method we need to extract them via reverse-engineering. Explicit modelling allows us to make judicious choices about the model and prior that reflect the kinds of structures we want to detect, relevant scales or lack thereof, and many other aspects that improve their performance in practice, and our understanding of the results. With implicit assumptions we are “flying blind”, relying substantially on serendipity and trial-and-error — not always with great success.

It is not uncommon to find criticisms of inferential methods due to a perceived implausibility of the generative models used — such as the conditional independence of the placement of the edges present in the SBM — although these assumptions are also present, but only implicitly, in other methods, like modularity maximization (see [peixoto_descriptive_2021]).

The above inferential interpretation is not specific to community detection, but is in fact valid for any learning problem. The set of explicit or implicit assumptions that must come with any learning algorithm is called an “inductive bias”. An algorithm is expected to function optimally only if its inductive bias agrees with the actual instances of the problems encountered. It is important to emphasize that no algorithm can be free of an inductive bias; we can only choose which intrinsic assumptions we make about how likely we are to encounter a particular kind of data, not whether we are making an assumption at all. Therefore, it is particularly problematic when a method does not articulate explicitly what these assumptions are, since even if they are hidden from view, they exist nonetheless, and still need to be scrutinized and justified. This means we should be particularly skeptical of the impossible claim that a learning method is “model free”, since this denomination is more likely to signal an unwillingness to expose the underlying modelling assumptions, which could potentially be revealed as unappealing and fragile when eventually forced to come under scrutiny.

[peixoto_descriptive_2021] | Tiago P. Peixoto, “Descriptive vs. inferential community detection: pitfalls, myths and half-truths”, arXiv: 2112.00183 |

[holland_stochastic_1983] | Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt, “Stochastic blockmodels: First steps”, Social Networks 5, 109–137 (1983). DOI: 10.1016/0378-8733(83)90021-7 |

[karrer_stochastic_2011] | Brian Karrer and M. E. J. Newman, “Stochastic blockmodels and community structure in networks”, Physical Review E 83, 016107 (2011). DOI: 10.1103/PhysRevE.83.016107 |

[peixoto_nonparametric_2017] | Tiago P. Peixoto, “Nonparametric Bayesian inference of the microcanonical stochastic block model,” Physical Review E 95, 012317 (2017). DOI: 10.1103/PhysRevE.95.012317 |

[grunwald_minimum_2007] | Peter D. Grünwald, The Minimum Description Length Principle (The MIT Press, 2007). |

[shannon_mathematical_1948] | C. E Shannon, “A mathematical theory of communication”, Bell Syst Tech. J 27, 623 (1948). |

[lancichinetti_benchmark_2008] | Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi, “Benchmark graphs for testing community detection algorithms”, Physical Review E 78, 046110 (2008). DOI: 10.1103/PhysRevE.78.046110 |

[girvan_community_2002] | M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences 99, 7821–7826 (2002). DOI: 10.1073/pnas.122653799 |

[lancichinetti_community_2009] | Andrea Lancichinetti and Santo Fortunato, “Community detection algorithms: A comparative analysis”, Physical Review E 80, 056117 (2009). DOI: 10.1103/PhysRevE.80.056117 |

[decelle_asymptotic_2011] | Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová, “Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications”, Physical Review E 84, 066106 (2011). DOI: 10.1103/PhysRevE.84.066106 |

[peixoto_reconstructing_2018] | Tiago P. Peixoto, “Reconstructing Networks with Unknown and Heterogeneous Errors”, Physical Review X 8, 041011 (2018). DOI: 10.1103/PhysRevX.8.041011 |

[peixoto_revealing_2021] | Tiago P. Peixoto, “Revealing Consensus and Dissensus between Network Partitions”, Physical Review X 11, 021003 (2021). DOI: 10.1103/PhysRevX.11.021003 |

[guimera_missing_2009] | Roger Guimerà and Marta Sales-Pardo, “Missing and spurious interactions and the reconstruction of complex networks”, Proceedings of the National Academy of Sciences 106, 22073–22078 (2009). DOI: 10.1073/pnas.0908366106 |

[hoffmann_community_2020] | Till Hoffmann, Leto Peel, Renaud Lambiotte, and Nick S. Jones, “Community detection in networks without observing edges”, Science Advances 6, eaav1478 (2020). DOI: 10.1126/sciadv.aav1478 |

[peixoto_network_2019] | Tiago P. Peixoto, “Network Reconstruction and Community Detection from Dynamics”, Physical Review Letters 123, 128301 (2019). DOI: 10.1103/PhysRevLett.123.128301 |

[peixoto_bayesian_2019] | Tiago P. Peixoto, “Bayesian Stochastic Blockmodeling”, in Advances in Network Clustering and Blockmodeling (John Wiley & Sons, Ltd, 2019) pp. 289–332. DOI: 10.1002/9781119483298.ch11 |

[mackay_information_2003] | David J. C. MacKay, “Information Theory, Inference and Learning Algorithms”, first edition (Cambridge University Press, 2003). |

[1] | Note that the sum in eq:dcsbm-marginal collapses to a single term, because only one term is non-zero for a fixed network $A$. |

[2] | If a value $x$ occurs with probability $P(x)$, this means that in order to transmit it in a communication channel we need to answer at least $-{\mathrm{log}}_{2}P(x)$ yes-or-no questions to decode its value exactly. Therefore we need to answer one yes-or-no question for a value with $P(x)=1/2$, zero questions for $P(x)=1$, and ${\mathrm{log}}_{2}N$ questions for uniformly distributed values with $P(x)=1/N$. This value is called “information content”, and essentially measures the degree of “surprise” when encountering a value sampled from a distribution. See [mackay_information_2003] for a thorough but accessible introduction to information theory and its relation to inference. |

[3] | More accurately, compression becomes strictly impossible only in the asymptotic limit of infinite networks; for finite networks the probability of achieving compression is vanishingly small. |

[4] | The description length of eq:dl_W is only valid if there are no further parameters in the quality function $W(A,b)$ other than $b$ that are being optimized. |

(This is a slightly modified version of chapter II in [peixoto_descriptive_2021])

Community detection is the task of dividing a network — typically one which is large — into many smaller groups of nodes that have a similar contribution to the overall network structure. With such a division, we can better summarize the large-scale structure of a network by describing how these groups are connected, instead of each individual node. This simplified description can be used to digest an otherwise intractable representation of a large system, providing insight into its most important patterns, how they relate to its function, and the underlying mechanisms responsible for its formation.

At a very fundamental level, community detection methods can be divided into two main categories: “descriptive” and “inferential.”

**Descriptive methods** attempt to find communities according to
some context-dependent notion of a good division of the network into
groups. These notions are based on the patterns that can be identified
in the network via an exhaustive algorithm, but without taking into
consideration the possible rules that were used to create them. These
patterns are used only to describe the network, not to explain
it. Usually, these approaches do not articulate precisely what
constitutes community structure to begin with, focusing instead only
on how to detect such patterns. For this kind of method, concepts of
statistical significance, parsimony and generalizability are usually
not invoked.

**Inferential methods**, on the other hand, start with an explicit
definition of what constitutes community structure, via a generative
model for the network. This model describes how a latent
(i.e. not observed) partition of the nodes would affect the placement of
the edges. The inference consists in reversing this procedure to
determine which node partitions are more likely to have been responsible
for the observed network. The result of this is a “fit” of a model to
data, which can be used as a tentative explanation of how it came to
be. The concepts of statistical significance, parsimony and
generalizability arise naturally and can be quantitatively assessed in
this context. See e.g. [peixoto_bayesian_2019].

Descriptive community detection methods are by far the most numerous, and those that are in most widespread use. However, this contrasts with the current state-of-the-art, which is composed in large part of inferential approaches. Here we point out the major differences between them and discuss how to decide which is more appropriate, and also why one should in general favor the inferential varieties whenever the objective is to derive interpretations from data.

We begin by observing that descriptive clustering approaches are the method of choice in certain contexts. For instance, such approaches arise naturally when the objective is to divide a network into two or more parts as a means to solve a variety of optimization problems. Arguably, the most classic example of this is the design of Very Large Scale Integrated Circuits (VLSI). The task is to combine millions of transistors into a single physical microprocessor chip. Transistors that connect to each other must be placed together to take less space, consume less power, reduce latency, and reduce the risk of cross-talk with other nearby connections. To achieve this, the initial stage of a VLSI process involves the partitioning of the circuit into many smaller modules with few connections between them, in a manner that enables their efficient spatial placement, i.e. by positioning the transistors in each module close together and those in different modules farther apart.

Another notable example is parallel task scheduling, a problem that appears in computer science and operations research. The objective is to distribute processes (i.e. programs, or tasks in general) between different processors, so they can run at the same time. Since processes depend on the partial results of other processes, this forms a dependency network, which then needs to be divided such that the number of dependencies across processors is minimized. The optimal division is the one where all tasks are able to finish in the shortest time possible.

Both examples above, and others, have motivated a large literature on “graph partitioning” dating back to the 70s, which covers a family of problems that play an important role in computer science and algorithmic complexity theory.
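To make the partitioning objective in these examples concrete, here is a toy brute-force balanced bisection that minimizes the cut. It is purely illustrative: exhaustive search is exponential, and real VLSI or scheduling tools rely on heuristics such as Kernighan–Lin or multilevel partitioning.

```python
from itertools import combinations

def min_cut_bisection(nodes, edges):
    """Brute-force balanced bisection minimizing the cut, i.e. the
    number of edges between the two halves (the graph-partitioning
    objective; only feasible for tiny graphs)."""
    best, best_cut = None, float("inf")
    for half in combinations(nodes, len(nodes) // 2):
        part = set(half)
        cut = sum((u in part) != (v in part) for u, v in edges)
        if cut < best_cut:
            best, best_cut = part, cut
    return best, best_cut

# two triangles joined by a single bridge edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part, cut = min_cut_bisection(range(6), edges)
print(sorted(part), cut)   # → [0, 1, 2] 1: one triangle per side, only the bridge is cut
```

Note that the objective here is purely operational: the optimal cut is useful regardless of whether the graph has any statistically meaningful structure.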

Although reminiscent of graph partitioning, and sharing with it many algorithmic similarities, community detection is used more broadly with a different goal [fortunato_community_2016]. Namely, the objective is to perform data analysis, where one wants to extract scientific understanding from empirical observations. The communities identified are usually directly used for representation and/or interpretation of the data, rather than as a mere device to solve a particular optimization problem. In this context, a merely descriptive approach will fail at giving us a meaningful insight into the data, and can be misleading, as we will discuss in the following.

We illustrate the difference between descriptive and inferential approaches in fig:infvsdesc. We first make an analogy with the famous “face” seen on images of the Cydonia Mensae region of the planet Mars. A merely descriptive account of the image can be made by identifying the facial features seen, which most people immediately recognize. However, an inferential description of the same image would seek instead to explain what is being seen. The process of explanation must invariably involve at its core an application of the law of parsimony, or Occam's razor. This principle dictates that when considering two hypotheses compatible with an observation, the simplest one must prevail. Employing this logic leads to the conclusion that what we are seeing is in fact a regular mountain; it does look like a face in that picture, but only accidentally. In other words, the “facial” description is not useful as an explanation, as it emerges out of random features rather than exposing any underlying mechanism.

Leaving the analogy and returning to the problem of community detection, at the bottom of fig:infvsdesc we see a descriptive and an inferential account of an example network. The descriptive one is a division of the nodes into 13 assortative communities, which would be identified with many descriptive community detection methods available in the literature. Indeed, we can inspect visually that these groups form assortative communities, and most people would agree that these communities are really there, according to most definitions in use: these are groups of nodes with many more internal edges than external ones. However, an inferential account of the same network would reveal something else altogether. Specifically, it would explain this network as the outcome of a process where the edges are placed at random, without the existence of any communities. The communities that we see in fig:infvsdesc (a) are just a byproduct of this random process, and therefore carry no explanatory power. In fact, this is exactly how the network in this example was generated, i.e. by choosing a specific degree sequence and connecting the edges uniformly at random.

In fig:generation (a) we illustrate in more detail how the network in fig:infvsdesc was generated: the degrees of the nodes are fixed, forming “stubs” or “half-edges”, which are then paired uniformly at random, forming the edges of the network. In fig:generation (b), as in fig:infvsdesc, the node colors show the partition found with descriptive community detection methods. However, this network division carries no explanatory power beyond what is contained in the degree sequence of the network, since the network is otherwise generated uniformly at random. This becomes evident in fig:generation (c), where we show another network sampled from the same generative process, i.e. another random pairing, but partitioned according to the same division as in fig:generation (b). Since the nodes are paired uniformly at random, constrained only by their degrees, each new sample creates new apparent “communities” that are uncorrelated with the previous ones. Like the “face” on Mars, they can be seen and described, but they cannot explain.
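The generative process of fig:generation (a) can be sketched in a few lines of Python. This is a minimal configuration-model sampler; the function name and the toy degree sequence are illustrative, not taken from any particular library:

```python
import random

def configuration_model(degrees, seed=0):
    """Pair "stubs" (half-edges) uniformly at random.

    Each node contributes as many stubs as its degree; a uniformly random
    pairing of all stubs then forms the edges.  Self-loops and multi-edges
    are allowed, as in the standard configuration model.
    """
    rng = random.Random(seed)
    stubs = [v for v, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)  # a uniform shuffle induces a uniform pairing
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]

# the degree sum must be even for a pairing to exist
edges = configuration_model([3, 2, 2, 2, 1])
```

Any apparent communities in a network generated this way are, by construction, byproducts of the random pairing, since nothing besides the degrees enters the model.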

We emphasize that the communities found in fig:generation (b) are indeed really there from a descriptive point of view, and they can in fact be useful for a variety of tasks. For example, the cut given by the partition, i.e. the number of edges that go between different groups, is only 13, which means that we need to remove only this many edges to break the network into (in this case) 13 smaller components. Depending on the context, this kind of information can be used to prevent a widespread epidemic, hinder undesired communication, or, as we have already discussed, distribute tasks among processors and design a microchip. However, what these communities cannot be used for is to explain the data. In particular, it would be completely incorrect to conclude that nodes belonging to the same group have a larger probability of being connected to each other. As shown in fig:generation (a), this is clearly not the case, as the observed “communities” arise by pure chance, without any preference between the nodes.
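The cut mentioned above is straightforward to compute from an edge list and a partition. A small sketch, with an illustrative toy graph rather than the network of fig:generation:

```python
def cut_size(edges, partition):
    """Count the edges whose endpoints lie in different groups."""
    return sum(1 for u, v in edges if partition[u] != partition[v])

# toy example: two triangles joined by a single bridge edge
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
partition = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
cut_size(edges, partition)  # → 1: removing one edge splits the network in two
```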

Given the above differences, and the fact that both inferential and descriptive approaches have their uses depending on context, we are left with the question: which approach is more appropriate for the task at hand? To help answer this question, independently of the particular context, it is useful to consider the following “litmus test”:

Q: “Would the usefulness of our conclusions change if we learn, after obtaining the communities, that the network being analyzed is completely random?”

If the answer is “yes”, then an inferential approach is needed.

If the answer is “no”, then an inferential approach is not required.

In the first case, the conclusions depend on an interpretation of how the data were generated, and hence a generative model is required. In the second, considerations about the generative process are not relevant, and a purely descriptive approach may be appropriate.

It is important to understand that the relevant question in this context is not whether the network being analyzed is actually fully random [1], since this is rarely the case for empirical networks. Instead, this hypothetical scenario serves as a test of whether our task requires us to separate actual latent community structure, i.e. structure that is responsible for the network formation, from structure that arises entirely out of random fluctuations and hence carries no explanatory power. Furthermore, most empirical networks, even if not fully random, are, like most interesting data, better explained by a mixture of structure and randomness, and a method that cannot tell the two apart cannot be used for inferential purposes.

Returning to the VLSI and task scheduling examples we considered in the previous section, it is clear that the answer to the litmus test above would be “no”, since it hardly matters how the network was generated and how we should interpret the partition found, as long as the integrated circuit can be manufactured and function efficiently, or the tasks finish in the minimal time. Interpretation and explanations are simply not the primary goals in these cases [2].

However, it is safe to say that in network data analyses the answer to the question above would very often be “yes.” Typically, community detection methods are used to understand the overall large-scale network structure, determine the prevalent mixing patterns, and make simplifications and generalizations, all in a manner that relies on statements about what lies behind the data, e.g. whether two nodes were more or less likely to be connected to begin with. Most such conclusions would be severely undermined by the discovery that the underlying network is in fact fully random. This means that these analyses are in grave peril when using purely descriptive methods, since they are likely to be overfitting the data, i.e. confusing randomness with underlying structure.

[peixoto_descriptive_2021] | Tiago P. Peixoto, “Descriptive vs. inferential community detection in networks: pitfalls, myths and half-truths”, arXiv: 2112.00183 |

[fortunato_community_2016] | Santo Fortunato and Darko Hric, “Community detection in networks: A user guide”, Physics Reports (2016), DOI: 10.1016/j.physrep.2016.09.002 |

[peixoto_bayesian_2019] | Tiago P. Peixoto, “Bayesian Stochastic Blockmodeling”, in Advances in Network Clustering and Blockmodeling (John Wiley & Sons, Ltd, 2019) pp. 289–332. DOI: 10.1002/9781119483298.ch11 |

[1] | “Fully random” here means sampled from a random graph model, like the Erdős-Rényi model, the configuration model, or some other null model where whatever communities we may ascribe to the nodes play no role in the placement of the edges. |

[2] | Although this is certainly true at a first instance, we can also argue that properly understanding why a certain partition was possible in the first place would be useful for reproducibility and to aid the design of future instances of the problem. For these purposes, an inferential approach would be more appropriate. |

I have been getting some questions about a 2018 paper [chang_estimation_2018] by Jinyuan Chang, Eric D. Kolaczyk, and Qiwei Yao that deals with reconstruction of noisy networks, i.e. networks that are measured with uncertainty, so that true edges may not be observed or fake ones may be spuriously introduced. Among other things, they state:

Under a simple model of network error, we show that consistent estimation of [subgraph] densities is impossible when the rates of error are unknown and only a single network is observed.

This seems to contradict a paper of mine [peixoto_reconstructing_2018], where I presented a method to do precisely what is considered impossible in the above statement: reconstruct networks from single measurements, when the error rates are unknown. So, where does the problem lie?

Let us begin by defining the reconstruction scenario, which is fairly simple. Suppose we observe a noisy network $X$, obtained by measuring a true network $A$, subject to the error rates $p$ and $q$, such that

$$P(X_{ij}|A_{ij},p,q)=\begin{cases}p^{1-X_{ij}}(1-p)^{X_{ij}},&\text{if } A_{ij}=1,\\ q^{X_{ij}}(1-q)^{1-X_{ij}},&\text{if } A_{ij}=0.\end{cases}$$

In other words, $p$ is the probability that a true edge goes unobserved (a missing edge), and $q$ is the probability that a non-edge is observed as an edge (a spurious edge).
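The measurement model above is easy to simulate. A minimal sketch for an undirected network given as an adjacency matrix (the function name is illustrative):

```python
import random

def measure(A, p, q, seed=0):
    """Sample X from P(X|A, p, q): each true edge is missed with
    probability p, and each non-edge appears spuriously with probability q."""
    rng = random.Random(seed)
    N = len(A)
    X = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(i + 1, N):
            if A[i][j] == 1:
                X[i][j] = X[j][i] = 0 if rng.random() < p else 1
            else:
                X[i][j] = X[j][i] = 1 if rng.random() < q else 0
    return X
```

The reconstruction task is then the inverse problem: recover $A$ (and, implicitly, $p$ and $q$) from a single such sample $X$.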

The reconstruction task is to obtain an estimate of $A$ based only on $X$, without knowing either $p$ or $q$. (Note that this reconstruction would also inherently give us an estimate for $p$ and $q$.)

Chang et al. consider estimators of subgraph densities that operate on the observed network $X$, in a manner that makes no explicit assumption about how the data are generated. Essentially, they claim that, without knowing the true values of $A$, $p$, and $q$, it is impossible to say anything about any of these quantities from $X$ alone.

It is important to understand that it is not in fact possible to make “no assumptions” about how data are generated. Assumptions are always made; they can only be implicit or explicit. Implicit assumptions, i.e. those that are hidden from view, are not exempt from justification. So-called “frequentist” estimators that make no explicit reference to a prior distribution are in fact formally equivalent to Bayesian estimators with a uniform prior, i.e. one that assumes all parameter values are equally likely. In the case where the parameter is a graph, a uniform prior means that each of the possible edges is independently equally likely to be present or absent, so that our prior expectation is that $A$ is not only fully random, but in fact also dense, i.e. with a mean degree $\langle k\rangle=(N-1)/2$, where $N$ is the number of nodes. Is this a reasonable assumption?

In [peixoto_reconstructing_2018] we take instead a Bayesian approach, where we are explicit about our assumptions, yielding a posterior distribution for the reconstruction,

$$P(A|X)=\frac{P(X|A)P(A)}{P(X)}.$$

In this setting, we can recover the “impossibility” result of Chang et al. by choosing the prior $P(A)$ as a constant. But this is not what should be done; instead, we should choose a prior $P(A)$ that makes as little commitment as possible about the network structure before we see any data. Note that this is very different from choosing a uniform prior! A uniform prior would in fact be a very strong commitment, one that is overwhelmingly likely to be wrong in almost every empirical setting. Instead, we need a nonparametric hierarchical model that includes everything from fully random to very structured networks as special cases, in a manner that encapsulates the kinds of data that we are likely to encounter.
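To see concretely how the choice of $P(A)$ enters, consider the simplest possible illustration, where the prior places each edge independently with probability $\rho$. This is only a toy calculation, not the hierarchical model of [peixoto_reconstructing_2018]:

```python
def edge_posterior(x, rho, p, q):
    """P(A_ij = 1 | X_ij = x) under an independent-edge prior
    P(A_ij = 1) = rho and the noise model defined above
    (p: probability of a missed edge, q: of a spurious one)."""
    like_1 = (1 - p) if x == 1 else p   # P(X_ij = x | A_ij = 1)
    like_0 = q if x == 1 else (1 - q)   # P(X_ij = x | A_ij = 0)
    return like_1 * rho / (like_1 * rho + like_0 * (1 - rho))

# the uniform prior rho = 1/2 takes every observed edge largely at face value:
edge_posterior(1, 0.5, 0.1, 0.1)   # ≈ 0.9
# a sparser prior discounts observed edges far more aggressively:
edge_posterior(1, 0.01, 0.1, 0.1)  # ≈ 0.083
```

A structured prior like the SBM, instead of fixing these edge probabilities a priori, lets them be informed by the patterns found in the network itself.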

In order to illustrate intuitively why this makes sense, let us consider a particular instance of the problem. Suppose that, without knowing the true network $A$ and the noise magnitudes $p$ and $q$, we observe the following noisy network $X$:

Purely from intuition, when observing the above network, we would immediately like to claim that it is close to the true network, and that the noise magnitudes are low. Why? Because we know that a perfect lattice is unlikely to be formed by chance alone (i.e. from a uniform prior). Our intuitive prior incorporates the knowledge that such things as lattices exist, and that when they occur, they look exactly like the figure above. Moreover, if the true network were a lattice, a high value of either $p$ or $q$ would have destroyed its pristine structure. The final conclusion is that the reconstruction of this network is not only possible, but in fact not very difficult.

The work [peixoto_reconstructing_2018] puts the above intuition on firmer terms by choosing the prior $P(A)$ to match an unknown stochastic block model (SBM). This model includes the “fully random” assumption as a special case, but is also capable of modelling a wide variety of structural patterns. Is this a realistic assumption? As it turns out, it is sufficiently generic to make the reconstruction possible in many cases, even when the model is not fully realistic.
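To make the model class concrete, here is a minimal Bernoulli-SBM sampler (the names and toy parameters are illustrative, and this omits the degree correction and hierarchy of the actual model). Note how the “fully random” case corresponds to a constant connection probability:

```python
import random

def sample_sbm(b, P, seed=0):
    """Sample a graph where node i belongs to group b[i] and each pair
    (i, j) is connected independently with probability P[b[i]][b[j]]."""
    rng = random.Random(seed)
    N = len(b)
    return [(i, j) for i in range(N) for j in range(i + 1, N)
            if rng.random() < P[b[i]][b[j]]]

# two assortative groups; setting all entries of P equal recovers a
# fully random (Erdős-Rényi-like) graph as a special case
b = [0, 0, 0, 1, 1, 1]
P = [[0.9, 0.1], [0.1, 0.9]]
edges = sample_sbm(b, P)
```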

As an example, we can return to the perfect lattice shown above. Below is the reconstructed version of this presumed noisy network, according to the method of [peixoto_reconstructing_2018] (see the HOWTO):

The thickness of each edge corresponds to its marginal posterior probability. Essentially, we conclude that the observed network is perfectly accurate, conforming to our intuition. The colors of the nodes correspond to the partition found with the SBM. Note that the SBM is a very coarse, and arguably displeasing, generative model for this network, which it would generate only with a very low probability. Nevertheless, even with such a misspecification of the prior, the model is enough to detect that the underlying network is far from random, and to enable its reconstruction. The posterior estimates for the noise magnitudes are $p=0.002(2)$ and $q=3(3)\times 10^{-7}$, indeed quite small. Not bad!

Of course, in [peixoto_reconstructing_2018] we consider situations where reconstruction is made for higher noise magnitudes, and also for real networks. But the above already serves to show that reconstruction from single measurements is indeed possible.

This should not be an earth-shattering conclusion. After all, single-measurement reconstructions of noisy images, time-series, and other high-dimensional objects are commonplace. Why not of networks? The key here is to abandon the idea that a network (like an image or a time-series) is a “singleton” $N=1$ object, and instead view it as a heterogeneous population of objects — namely the individual edges and nodes. And we should make assumptions that, while being agnostic about which kinds of pattern there should be, also allow for them to be detected in the first place.

[chang_estimation_2018] | Jinyuan Chang, Eric D. Kolaczyk, Qiwei Yao, “Estimation of subgraph density in noisy networks”, arXiv: 1803.02488 |

[peixoto_reconstructing_2018] | Tiago P. Peixoto, “Reconstructing networks with unknown and heterogeneous errors”, Physical Review X 8, 041011 (2018), DOI: 10.1103/PhysRevX.8.041011 |