Tiago P. Peixoto

Untangling the hairball using statistical inference

Tiago P. Peixoto — Sun, 19 May 2024 22:00:00 GMT

TL;DR — Network visualization is often done in the wrong way: first a network layout is produced using a heuristic, and then follow-up analyses are performed and evaluated according to the layout. This is an inversion of priorities that subjugates algorithmic analyses to unreliable and often misleading visualization heuristics. Here I show how this inversion can be fixed: by performing the more meaningful algorithmic analysis first—like clustering or ordering—and then subjugating the visualization heuristics to it.

The seductive futility of network visualization

Network visualization is a ubiquitous task in data analysis. We can’t seem to resist trying to just see the network structure with our own eyes. Our visual cognitive abilities can be a powerful tool, since we can effortlessly identify some kinds of patterns that would otherwise be difficult to detect. It would be wasteful not to harness it.

Unfortunately, network data suffers from a fundamental problem of representation. Namely, networks are usually not low-dimensional objects to begin with, so we cannot directly inspect them in their “natural” space¹—unlike clouds, rocks, trees, insects, etc. Instead, we need first to project them into a low-dimensional representation (usually in 2D) that is digestible to our visual apparatus. This is a rather violent act that invariably distorts the structure in the data in important and unavoidable ways. With network data, not only “is the map not the territory”, but these two live in radically different universes. A faithful and universal representation of arbitrary networks in 2D (or even 3D) is simply impossible, and therefore is not a sensible goal.²

¹ Notable exceptions are spatial networks common in transportation systems, such as roads, railways, subways, etc.

² Faithful projections are not even always possible between spaces with the same number of dimensions, such as from the surface of a sphere to a plane.

In addition to these unavoidable distortions, data visualization in general is a double-edged sword—our sensitive pattern detection abilities mean that we can also see structures in the data that are not really there, or at least not in a statistically meaningful way.

The Danish power socket is the friendliest power socket.

This phenomenon is called pareidolia, which is an instance of a more general cognitive bias called apophenia—the tendency to perceive meaningful connections between unrelated things. Everyday examples of this include seeing common shapes in clouds, and faces in inanimate objects.

The combination of these unavoidable distortions with our cognitive bias to identify spurious patterns is not a good one for network visualization.

Let us inspect some typical instances:

Ceci n’est pas un réseau.
(C’est un ridiculogramme!)

Ceci n’est pas non plus un réseau.
(Credit: Mathieu Jacomy)

It’s easy to ridicule the visualization on the left: it’s the typical hairball that conveys little useful information. It overloads our pattern recognition abilities, and the whole thing just looks like a confusing mess. However, the visualization on the right is arguably also problematic. Despite showing a seemingly clearer picture—we can see a more obvious modular pattern—how can we be sure it’s not an illusion, caused by the algorithmic distortion, our cognitive bias,³ or both?

³ In fact, one particularly important instance of apophenia is the so-called clustering illusion: The tendency to see clusters in data which cannot be statistically justified. Some community detection algorithms suffer from the same problem.

Force-directed layouts only see assortativity

It’s important to observe that both drawings above have been obtained with the same algorithm: a force-directed layout.

Force-directed layout algorithms try to make the edge lengths as small as possible, representing them as attractive forces between the nodes at their endpoints, which are then compensated by an overall repulsive force between every pair of nodes. The final node positions are the ones that balance these competing forces. The usual interpretation of these layouts is that nodes that are close in the projected space are also close in the native “unprojected” structure of the network, and therefore we should be able to observe modules and other kinds of structural patterns visually.
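The balance of forces described above can be sketched in a few lines. This is a minimal toy version, not any particular library's algorithm: the force laws (attraction proportional to edge length, repulsion proportional to inverse distance), the step size, and the displacement cap are all assumptions chosen for simplicity.

```python
import math
import random

def force_directed_layout(n, edges, iterations=200, step=0.05, max_move=0.1, seed=0):
    """Toy spring layout: edges attract, every pair of nodes repels."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(iterations):
        force = [[0.0, 0.0] for _ in range(n)]
        # Repulsion between every pair of nodes, with magnitude 1/distance.
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d2 = dx * dx + dy * dy + 1e-9
                force[i][0] += dx / d2
                force[i][1] += dy / d2
                force[j][0] -= dx / d2
                force[j][1] -= dy / d2
        # Attraction along each edge, with magnitude equal to its length.
        for i, j in edges:
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            force[i][0] += dx
            force[i][1] += dy
            force[j][0] -= dx
            force[j][1] -= dy
        # Move each node along its net force, capping the displacement.
        for i in range(n):
            fx, fy = force[i]
            norm = math.hypot(fx, fy)
            if norm > 0:
                move = min(step * norm, max_move)
                pos[i][0] += fx / norm * move
                pos[i][1] += fy / norm * move
    return pos

def distance(pos, i, j):
    """Euclidean distance between nodes i and j in the layout."""
    return math.hypot(pos[i][0] - pos[j][0], pos[i][1] - pos[j][1])

# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
pos = force_directed_layout(6, edges)
```

Nothing in this procedure knows about groups, orderings, or any other structure: the only signal it receives is which pairs of nodes share an edge, which is why the resulting picture can only reflect patterns that happen to align with short edges.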

However, this interpretation is not quite true in general, or even often. Such projections are mostly distortions, and while they are informed by the large-scale network structure, they cannot encapsulate it even in situations where we know the network is embedded in a well-defined metric space of a higher dimension [1]—and much less so in the more general case when it’s not. Networks are rich high-dimensional objects that diffuse blobs drawn on a 2D surface cannot really capture; at least not if we are not carefully specific about what patterns we want to extract.

Some might argue that force-directed visualizations are good enough when the structure in the data is very strong, and this is maybe what we should care about in most cases. Although it’s true that sometimes the genuine patterns in the data survive our callous abuse, there are simple scenarios with strong structure where the approach completely fails. For example, the figure below shows a random bipartite graph visualized using a force-directed layout: there are two types of nodes, white and black, such that an edge can only connect a white node to a black node,⁴ but the edges are otherwise placed uniformly at random.

⁴ We observe this kind of bipartite pattern in heterosexual relations, for example.
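A random bipartite graph of this kind is easy to generate. The sketch below uses only the standard library; the function name, node numbering, and sizes are illustrative assumptions rather than the code behind the original figure.

```python
import random

def random_bipartite_graph(n_white, n_black, n_edges, seed=0):
    """Sample n_edges distinct white-black pairs uniformly at random.

    White nodes are numbered 0..n_white-1 and black nodes
    n_white..n_white+n_black-1 (an arbitrary labeling convention).
    """
    rng = random.Random(seed)
    all_pairs = [(w, n_white + b) for w in range(n_white) for b in range(n_black)]
    return rng.sample(all_pairs, n_edges)

bipartite_edges = random_bipartite_graph(n_white=50, n_black=50, n_edges=200)

# No edge ever connects two nodes of the same type.
same_type = [(u, v) for u, v in bipartite_edges if (u < 50) == (v < 50)]
```

The structure here is as strong as it gets (`same_type` is always empty), yet it is purely disassortative: nodes of the same type are never adjacent, so a procedure that places adjacent nodes close together has no reason to gather each type into its own region.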


Hidden models and latent compression in community detection

Tiago P. Peixoto — Sun, 31 Mar 2024 22:00:00 GMT

This is an overdue post on a paper published last year with Alec Kirkley, named “Implicit Models, Latent Compression, Intrinsic Biases, and Cheap Lunches in Community Detection” [1].

There are essentially two kinds of community detection methods (or data analysis procedures in general): descriptive and inferential [2]. Inferential methods attempt to fit a generative model to data, reject a null model, or in some other way articulate mechanisms of network formation, or more formally, a population of potential observations from which the observed data is only one possibility. For example, a network with assortative community structure becomes one where there’s a higher probability of an edge being formed between two nodes of the same group. A descriptive method, on the other hand, doesn’t explicitly invoke an edge formation mechanism, and relies instead on simply describing the structural patterns seen (e.g. a community is a group of nodes with sufficiently more internal than external observed edges). In either case we typically wish to find the best partition $\hat{\boldsymbol b}$ of a network $\boldsymbol A$. In the case of descriptive methods this is usually obtained via the optimization of a non-probabilistic quality function $W(\boldsymbol A, \boldsymbol b)$,

$$\hat{\boldsymbol b} = \underset{\boldsymbol b}{\operatorname{argmax}}\; W(\boldsymbol A, \boldsymbol b).$$

Some of the most serious problems of descriptive methods arise when their results are interpreted in an inferential way, e.g. ascribing the communities found to homophily or other mixing preferences. Many practitioners are surprised and concerned when these descriptive methods find nontrivial clusters in random graphs where the presence of any edge in the network has the exact same probability [3]. This reaction betrays a clear inferential goal, in contradiction with what the method being used is actually expected to accomplish; a goal perhaps shared even with the designers of the method, who acted inadvertently.
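As a concrete instance of a non-probabilistic quality function, here is a minimal sketch of Newman-Girvan modularity, one common choice among descriptive methods. The function name and the edge-list representation are assumptions made for illustration; `gamma` is the usual resolution parameter.

```python
def modularity(edges, b, gamma=1.0):
    """Modularity of partition b (b[i] is the group label of node i).

    Uses Q = sum_r [ e_rr / E - gamma * (d_r / 2E)^2 ], where e_rr is the
    number of edges inside group r, d_r is the sum of degrees in group r,
    and E is the total number of edges.
    """
    E = len(edges)
    if E == 0:
        return 0.0
    internal, degree_sum = {}, {}
    for i, j in edges:
        degree_sum[b[i]] = degree_sum.get(b[i], 0) + 1
        degree_sum[b[j]] = degree_sum.get(b[j], 0) + 1
        if b[i] == b[j]:
            internal[b[i]] = internal.get(b[i], 0) + 1
    return sum(internal.get(r, 0) / E - gamma * (d / (2 * E)) ** 2
               for r, d in degree_sum.items())

# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(edges, [0, 0, 0, 1, 1, 1])   # one group per triangle
q_single = modularity(edges, [0, 0, 0, 0, 0, 0])  # everything in one group
```

The function only counts observed edges inside and between groups. It makes no explicit statement about how the edges came to be, which is what makes it descriptive.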

Despite its importance, this issue hasn’t yet prevented compromised methods such as modularity maximization from being employed en masse, although it’s unclear if most users even realize the problem. In any case, this is not the only instance of specious statistical practice permeating through vast areas of science [4,5]. Evidently, pontification and explanation only go so far.¹

¹ “If you’re explaining, you’re losing.” — Ronald Reagan

In an effort to understand a bit better the consequences of interpreting descriptive methods in an inferential way, Alec and I tried to reverse engineer them. In other words, we asked the question:

If we interpret the results of a community detection algorithm as inferential, to what generative model does it correspond?

We managed to go surprisingly far in answering this question!

Every method is inferential when the model is bad enough

Algorithmically, inferential methods are not so different from descriptive ones; in fact, at least at first,² it all boils down to choosing a different quality function $W(\boldsymbol A, \boldsymbol b) = P(\boldsymbol b | \boldsymbol A)$, where

$$P(\boldsymbol b | \boldsymbol A) = \frac{P(\boldsymbol A | \boldsymbol b)\, P(\boldsymbol b)}{P(\boldsymbol A)} \tag{1}$$

is the posterior probability of a partition $\boldsymbol b$ given an observed network $\boldsymbol A$. This isn’t a trivial difference, however, since defining the quality function in this way amounts precisely to reasoning about possible generative mechanisms and our prior assumptions over their parameters, as discussed previously.

² At first, there’s no algorithmic difference. However, in an inferential setting, we quickly realize that there’s no reason to focus only on the best partition, and we might choose instead to characterize the whole posterior distribution $P(\boldsymbol b | \boldsymbol A)$, and average over it, so we can quantify uncertainties, competing hypotheses, and everything else that is essential to a serious scientific pipeline.

³ Interestingly, as we show in the paper, the precise shape of $g(x)$ has asymptotically no effect whatsoever in the resulting generative model, up to a multiplicative “inverse temperature” constant.

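The idea we explored in [1] is that we can essentially reverse Bayes’ formula in Equation 1, and obtain an implicit generative model that is compatible with any given quality function $W(\boldsymbol A, \boldsymbol b)$:

$$P(\boldsymbol A | \boldsymbol b) = \frac{e^{g(W(\boldsymbol A, \boldsymbol b))}}{Z(\boldsymbol b)}, \tag{2}$$

where $g(x)$ is any non-decreasing invertible function,³ and $Z(\boldsymbol b) = \sum_{\boldsymbol A} e^{g(W(\boldsymbol A, \boldsymbol b))}$ is a normalization constant. In order to understand what that means, consider the illustration in Figure 1.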

Figure 1: Diagrammatic illustration of the inverse problem we consider in Ref. [1]. (a) A community detection algorithm provides a mapping $\hat{\boldsymbol b}(\boldsymbol A)$ of a network $\boldsymbol A$ to a partition $\boldsymbol b$ of its nodes. (b) This mapping can always be inverted, such that for any given partition $\boldsymbol b$ of the nodes we can consider the set of all networks $\boldsymbol A$ such that $\hat{\boldsymbol b}(\boldsymbol A) = \boldsymbol b$. This set of networks reveals the implicit generative model compatible with the community detection algorithm under consideration (we show three independent samples drawn uniformly from this set). The arrow from case 2 in panel (a) to (b) indicates that the same partition is considered in both examples.
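For very small networks, an implicit model of this kind can be written down by brute force. The sketch below is only a toy illustration and not the construction used in the paper: it assumes modularity as the quality function, a linear $g(x) = \beta x$ with an arbitrary `beta`, and an unconstrained sum over all simple graphs on `n` nodes to obtain the normalization constant.

```python
import itertools
import math

def modularity(edges, b):
    """Newman-Girvan modularity of partition b (0.0 for an empty graph)."""
    E = len(edges)
    if E == 0:
        return 0.0
    internal, degree_sum = {}, {}
    for i, j in edges:
        degree_sum[b[i]] = degree_sum.get(b[i], 0) + 1
        degree_sum[b[j]] = degree_sum.get(b[j], 0) + 1
        if b[i] == b[j]:
            internal[b[i]] = internal.get(b[i], 0) + 1
    return sum(internal.get(r, 0) / E - (d / (2 * E)) ** 2
               for r, d in degree_sum.items())

def implicit_model(n, b, beta=1.0):
    """Return {graph: probability} over all simple graphs on n nodes,
    with probability proportional to exp(beta * modularity(graph, b))."""
    pairs = list(itertools.combinations(range(n), 2))
    weights = {}
    for mask in itertools.product((0, 1), repeat=len(pairs)):
        A = tuple(p for p, m in zip(pairs, mask) if m)
        weights[A] = math.exp(beta * modularity(A, b))
    Z = sum(weights.values())  # normalization constant for this partition
    return {A: w / Z for A, w in weights.items()}

b = [0, 0, 1, 1]
P = implicit_model(4, b, beta=5.0)
assortative = ((0, 1), (2, 3))     # both edges inside groups
disassortative = ((0, 2), (1, 3))  # both edges between groups
ratio = P[assortative] / P[disassortative]
```

With four nodes there are only 64 graphs, so the sum is trivial. The number of graphs doubles with every additional node pair, which is why the normalization constant cannot be computed this way for networks of realistic size.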

The networks generated by the implicit model in panel (b) of Figure 1 are markedly different from the original network in case 2 in (a), which would be generated only with a very low probability under this model. This happens because the mixing between groups tends to be homogeneous for networks sampled from the model, whereas in the network in case 2 in (a) the groups connect preferentially to a central group (in blue) and they have more heterogeneous densities. This mismatch indicates that the underlying model is in fact a poor representation of the network structure — which would be impossible to determine from the results of panel (a) alone. Therefore, characterizing the implicit models hidden behind community detection methods allows us to evaluate their ability to faithfully capture network structure in a systematic manner and reveal their intrinsic biases towards particular kinds of structure.

Furthermore, this reverse engineering allows us to compare different community detection methods on equal grounds; in particular it allows us to compute their description length,

$$\Sigma(\boldsymbol A, \boldsymbol b) = -\log_2 P(\boldsymbol A | \boldsymbol b) - \log_2 P(\boldsymbol b),$$

i.e. the amount of information required to describe both the data and the model parameters, which is a universal model selection criterion that removes the need for “ground truth” labels for method comparison. The “partition function” $Z(\boldsymbol b)$ is not trivial to compute, but we show how it can be done for a wide class of quality functions.⁴
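To make the notion of description length concrete, here is a sketch for the simplest baseline, a fully random graph with a given number of edges. The two-part code used (first transmit the number of edges, then which of the equally likely graphs with that many edges was observed) is an assumption made for illustration and may differ in detail from the encoding used in the paper.

```python
import math

def er_description_length(n_nodes, n_edges):
    """Bits needed to describe a simple graph under a two-part random-graph code."""
    n_pairs = n_nodes * (n_nodes - 1) // 2
    # Model part: the edge count, assumed uniform on 0..n_pairs.
    model_bits = math.log2(n_pairs + 1)
    # Data part: which of the C(n_pairs, n_edges) graphs was observed.
    data_bits = math.log2(math.comb(n_pairs, n_edges))
    return data_bits + model_bits

sigma = er_description_length(n_nodes=100, n_edges=300)
```

Any other model, including the implicit one behind a descriptive method, can be placed on the same scale in bits and compared against this baseline.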

⁴ The computation of $Z(\boldsymbol b)$ can in fact be seen as the main technical contribution of our work.

Does that mean that those that propose or use a specific quality function advocate for or even accept the implicit generative model of Equation 2 as a realistic explanation of their data? Well, what goes on in peoples’ minds is not our primary concern. The relevant point here is that they certainly cannot coherently object to this generative model, while at the same time interpreting the results of their algorithm in an inferential manner.

No “benign overfitting” in community detection

Well, and how do the descriptive results fare, when their implicit generative models come to light? We compared the descriptive methods modularity maximization and Infomap with a variety of inferential ones on a corpus of over 500 empirical networks, and the results are striking:

Fraction of networks in our corpus where a given model (vertical axis) achieves equal or better compression than the alternative model (horizontal axis).

Infomap provides an inferior compression to the Erdős–Rényi model—the most random model of them all, and hence, one could expect, the least compressive—for 38% of the networks! The configuration model beats both methods in 45% (modularity) and 60% (Infomap) of the instances. Conversely, inferential methods based on variations of the stochastic block model (SBM) provide a better compression for 97% (Infomap) and 96% (modularity) of the data.⁵ In other words, modularity and Infomap, two very popular descriptive methods, massively overfit empirical data, and should not be used to aid inferential analyses.

⁵ Revealingly, those few instances where modularity and Infomap yield (usually marginally) superior compression, albeit with answers compatible with the inferential methods, are often for “toy” networks such as US college football, commonly used as simple examples in community detection papers. These networks have patterns that modularity and Infomap implicitly expect: uniformly assortative communities with equal size and density. Given that their unrepresentative character is in diametrical opposition to how often they are used as illustrations, maybe such “toy” examples should be retired.

So, we’re not so lucky as with some kinds of statistical models that get regularization for free as the result of an apparent act of divine benevolence… In community detection we have to work for it, so that you (the user) do not have to.

There are a lot of other goodies in the paper, but I do not want to spoil the pleasure of reading its 28 pages and 14 figures! Here’s just a sneak peek:

We show how computing the description length of modularity can be used to select the most compressive value of the resolution parameter $\gamma$, and in this way solve not only its resolution limit [6], but also its overfitting problem [3]. To those that care: you’re welcome.
We show that a broad class of quality functions, which includes modularity and Infomap, amounts to a planted partition SBM likelihood with very specific priors on the strength of assortativity and the number of groups. It seems very difficult to make assumptions (explicit or otherwise) about community structure that evade what the SBM articulates!
We analyze the implicit priors of modularity maximization and Infomap and they’re crazy: they show abrupt transitions in their expected values, forbid entire regions of the parameter space, show nontrivial scaling with system size, etc. It’s ironic that sometimes people complain that the SBM embodies unrealistic assumptions about the data (not untrue—it’s a network histogram, nothing more), but have been unknowingly ingesting lethal doses of toxic, unjustifiable assumptions into their analyses this whole time.
We explore Bayes optimal instances of the community detection problem (i.e. instances where each corresponding method must behave optimally), and show that there’s a fundamental asymmetry between methods: a more general model such as the Nested SBM (NSBM) will behave virtually just as well for instances of the problem that are optimal for modularity, but the opposite is far from being true: for instances where the NSBM is optimal, modularity maximization performs abysmally.

This runs against what is arguably one of the most vacuous statements in the field: the “no free lunch” theorem for community detection [7], which states that every community detection algorithm shows exactly the same performance⁶ when averaged over “all problem instances”, considered to be fully uniform matchings between network and partition, formally defining a maximally random generative model. As previously discussed, this uniform model generates problem instances that are strictly incompressible, and therefore utterly unrepresentative of anything we want to consider either in theory or in practice.

⁶ The “same performance” that all algorithms exhibit amounts to an asymptotic accuracy of exactly zero, since in those uninformative instances an algorithm can do no better than guessing at random.

What we show for community detection in this work is valid much more broadly: One of the biggest lies ever told is that there are “model free” methods of data analysis. Not possible! The most that can be done is to hide models from plain sight. But with sufficiently careful reverse engineering, it might still be possible to reveal the horrors that lie beneath such obstructions.

References

[1] T. P. Peixoto and A. Kirkley, Implicit Models, Latent Compression, Intrinsic Biases, and Cheap Lunches in Community Detection, Physical Review E 108, 024309 (2023).

[2] T. P. Peixoto, Descriptive Vs. Inferential Community Detection in Networks: Pitfalls, Myths and Half-Truths, Elements in the Structure and Dynamics of Complex Networks (2023).

[3] R. Guimerà, M. Sales-Pardo, and L. A. N. Amaral, Modularity from Fluctuations in Random Graphs and Complex Networks, Physical Review E 70, 025101 (2004).

[4] T. Chari and L. Pachter, The Specious Art of Single-Cell Genomics, PLOS Computational Biology 19, e1011288 (2023).

[5] R. L. Wasserstein and N. A. Lazar, The ASA Statement on p-Values: Context, Process, and Purpose, The American Statistician 70, 129 (2016).

[6] S. Fortunato and M. Barthélemy, Resolution Limit in Community Detection, Proceedings of the National Academy of Sciences 104, 36 (2007).

[7] L. Peel, D. B. Larremore, and A. Clauset, The Ground Truth about Metadata and Community Detection in Networks, Science Advances 3, e1602548 (2017).


Is Bayesian inference subjective?

Tiago P. Peixoto — Mon, 03 Jan 2022 23:00:00 GMT

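In a previous blog post (covering [1]) I discussed how inferential approaches to community detection are based on the formulation of generative models, via the definition of a likelihood $P(\boldsymbol A | \boldsymbol b)$ for the network $\boldsymbol A$ conditioned on a partition $\boldsymbol b$. With this at hand, we find the best partition of the network according to the posterior distribution, using Bayes' rule, i.e.

$$P(\boldsymbol b | \boldsymbol A) = \frac{P(\boldsymbol A | \boldsymbol b)\, P(\boldsymbol b)}{P(\boldsymbol A)}, \tag{1}$$

where $P(\boldsymbol b)$ is the prior probability for a partition $\boldsymbol b$.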

In Bayesian statistics, probabilities are often described as representing a state of knowledge or even as quantification of a “personal belief” [2]. Does that mean that the results that we obtain using this method are “subjective,” and depend arbitrarily on how we choose our models and priors over parameters?

There’s an old debate in the statistics and philosophy literatures between “subjective” and “objective” Bayesianism, concerning primarily whether universal “non-informative” priors exist that perfectly represent the notion of maximum ignorance before any data is seen.¹

¹ The “subjective” position, as defended e.g. by de Finetti and Savage, argues simply that the postulates of inductive logic make no requirement on the prior distribution, which therefore needs to be chosen according to other criteria. The “objective” camp, represented most famously by Jeffreys and Jaynes, argues in favor of additional postulates that enable the choice of universal “non-informative” priors. In the case of Jeffreys these are distributions invariant to reparametrization, and in Jaynes’ those derived according to the maximum entropy principle. Whether truly “non-informative” priors are indeed possible (they are not: the concept of “ignorance” does not exist in a vacuum and always requires initial choices on parametrization and constraints [3]) is not the main issue addressed in this post. See also footnote 3.

Regardless of the outcome of this debate, I think it is not very difficult to argue that the answer to the above question is “no” — at least when operating with the colloquial meaning of “subjective” as something that is based on or influenced by personal feelings, tastes, or opinions.

First, it is important to distinguish between the colloquial and a more technical definition of what constitutes a “subjective” statement. According to the more technical definition, we say that a statement is subjective when its veracity is conditioned on the subject that makes the statement, without necessarily meaning that the subject is free to decide on its veracity. A good example of this type of subjectivity is time in Einstein's theory of relativity: it has a subjective nature since it will be experienced differently depending on the frame of reference of the observer. Nevertheless, this does not mean that time can be freely determined by any observer, nor that it will be influenced by her personal feelings, tastes, or opinions. In other words, a subjective statement is not the same as an arbitrary statement. In this sense (and only in this sense), Bayesian statistics is indeed subjective, since an inferential conclusion will depend on the data observed and set of hypotheses considered by an individual. However, given the same data and set of hypotheses, two subjects must agree on the conclusion — it is not quite an arbitrary decision. In the colloquial sense of the term, Bayesian inference is not subjective.

There are different ways to demonstrate this more concretely. For example, we can argue “à la Jaynes” that a “state of knowledge” is not something arbitrary, since it can be quantified and always needs to be substantiated [4]. An alternative way of showing this, and which I find the most compelling, is via the equivalence between inference and compression. Namely, we can write the numerator of the posterior distribution of Equation 1 as

$$P(\boldsymbol A | \boldsymbol b)\, P(\boldsymbol b) = 2^{-\Sigma(\boldsymbol A, \boldsymbol b)},$$

where the quantity $\Sigma(\boldsymbol A, \boldsymbol b)$ is known as the description length [5] of the network. It is computed as:

$$\Sigma(\boldsymbol A, \boldsymbol b) = -\log_2 P(\boldsymbol A | \boldsymbol b) - \log_2 P(\boldsymbol b). \tag{2}$$

The second term in the above equation quantifies the amount of information in bits necessary to encode the parameters of the model, while the first term determines how many bits are necessary to encode the data (the observed network itself), once the model parameters are known. Therefore, finding the most likely network partition is equivalent to finding the one that most compresses it — giving us a compelling implementation of Occam's razor.²
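The equivalence can be spelled out in a few steps. Writing $P(\boldsymbol b \mid \boldsymbol A)$ for the posterior, $P(\boldsymbol A \mid \boldsymbol b)$ for the likelihood, $P(\boldsymbol b)$ for the prior, and $\Sigma(\boldsymbol A, \boldsymbol b) = -\log_2 P(\boldsymbol A \mid \boldsymbol b) - \log_2 P(\boldsymbol b)$ for the description length, the most likely partition $\hat{\boldsymbol b}$ satisfies:

```latex
\begin{align}
\hat{\boldsymbol b}
  &= \underset{\boldsymbol b}{\operatorname{argmax}}\; P(\boldsymbol b \mid \boldsymbol A) \\
  &= \underset{\boldsymbol b}{\operatorname{argmax}}\; P(\boldsymbol A \mid \boldsymbol b)\, P(\boldsymbol b)
     && \text{since } P(\boldsymbol A) \text{ does not depend on } \boldsymbol b \\
  &= \underset{\boldsymbol b}{\operatorname{argmax}}\; 2^{-\Sigma(\boldsymbol A, \boldsymbol b)} \\
  &= \underset{\boldsymbol b}{\operatorname{argmin}}\; \Sigma(\boldsymbol A, \boldsymbol b)
     && \text{since } 2^{-x} \text{ is decreasing in } x
\end{align}
```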

² An important technical remark is that Equation 2 only corresponds to an actual description length if the quantities involved are probability mass functions, i.e. the set of possible data and parameters are discrete. This is not a limitation of the framework, it only embodies the unavoidable fact that we can only extract finite information from data. Hypotheses defined over continuous parameters need either to be marginalized or discretized within a finite precision in order to be converted into specific inferential statements.

The description length is not arbitrary in any way; in fact it says something almost physical about the data. It means that if we infer the most likely model, it gives us a way of storing the data in a hard drive using $\Sigma(\boldsymbol A, \boldsymbol b)$ bits! As we know from our daily computer usage, compression is not an arbitrary decision, nor is it influenced by our personal feelings, tastes, or opinions; otherwise we would never run out of disk space, we would be able to download files instantly, etc. In other words, we cannot arbitrarily choose which model best fits the data, in the same manner that we cannot choose to make our computer files arbitrarily small. If we accepted that Bayesian statistics is arbitrary, then we would also need to accept that these physical obstacles we face are arbitrary in nature, which is a rather absurd proposition, and demonstrably incorrect.

As I discussed previously, seeking compression avoids overfitting the data since it’s not possible (asymptotically) to compress statistical noise.

However, the concept of compression is more generally useful than just avoiding overfitting within a class of models. In fact, the description length gives us a model-agnostic objective criterion to compare different hypotheses for the data generating process according to their plausibility, in a manner that is not only not arbitrary but also not subjective. Namely, since Shannon's theorem tells us that the best compression can be achieved asymptotically only with the true data generating model, if we are able to find a description length for a network using a particular model, regardless of how it is parametrized, we have automatically found an upper bound on the optimal compression achievable. By formulating different generative models and computing their description length, we have not only an objective criterion to compare them against each other, but we also have a way to limit further what can be obtained with any other model. The result is a universal scale on which different models can be compared, as we move closer to the limit of what can be uncovered for the particular data at hand.³

³ Note that the existence of this universal scale is orthogonal to the debate on whether it is possible to define universal “non-informative” priors. Even if the choice of model and prior is arbitrary, its resulting compression of the data will not be.

In the figure below we show the description length values with some models obtained for a protein-protein interaction network for the organism Meleagris gallopavo (wild turkey).

Figure 1: Compression points towards the true model. Top: Protein-protein interaction network for the organism Meleagris gallopavo. The node colors indicate the best partition found with the DC-SBM/TC (there are more groups than colors, so some colors are repeated), and the edge colors indicate if they are attributed to triadic closure (red) or the DC-SBM (black). Bottom: Description length values according to different models. The unknown⁴ true model must yield a description length value smaller than the DC-SBM/TC, and no other model should be able to provide a superior compression that is statistically significant.

⁴ In an absolute sense, the true model is not only unknown, but also unknowable. What Figure 1 shows is that our task is to approximate it asymptotically.

In particular, we can see that with the degree-corrected stochastic block model with triadic closure (DC-SBM/TC) [6] we can achieve a description length that is far smaller than what would be possible with networks sampled from either the Erdős–Rényi, configuration, or planted partition (an SBM with strictly assortative communities [7]) models, meaning that the inferred model is much closer to the true process that actually generated this network than the alternatives. Naturally, the actual process that generated this network is different from the DC-SBM/TC, and it likely involves, for example, mechanisms of node duplication which are not incorporated into this rather simple model. However, to the extent that the true process leaves statistically significant traces in the network structure,⁵ computing the description length according to it should provide further compression when compared to the alternatives. Therefore, we can try to extend or reformulate our models to incorporate features that we hypothesize to be more realistic, and then verify if this is in fact the case, knowing that whenever we find a more compressive model, it is moving closer to the true one, or at least to what remains detectable from it for the finite data.

⁵ Visually inspecting Figure 1 reveals what seem to be local symmetries in the network structure, presumably due to gene duplication. These patterns are not exploited by the SBM description, and indeed point to a possible path for further compression.

The discussion above glosses over some important technical aspects. For example, it is possible for two (or, in fact, many) models to have the same or very similar description length values. In this case, Occam's razor fails as a criterion to select between them, and we need to consider them collectively as equally valid hypotheses. This means, for example, that we would need to average over them when making specific inferential statements [8] — selecting between them arbitrarily can be interpreted as a form of overfitting. Furthermore, there is obviously no guarantee that the true model can actually be found for any particular data. This is only possible in the asymptotic limit of “sufficient data”, which will vary depending on the actual model. Outside of this limit (which is the typical case in empirical settings, in particular when dealing with sparse networks), fundamental limits to inference are unavoidable, which means in practice that we will always have limited accuracy and some amount of error in our conclusions. However, when employing compression, these potential errors tend towards overly simple explanations, rather than overly complex ones. Whenever perfect accuracy is not possible, it is difficult to argue in favor of a bias in the opposite direction.
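Since the joint probability of data and model is $2^{-\Sigma}$, a set of description lengths translates directly into relative weights for averaging. The sketch below assumes the candidate hypotheses are the only ones under consideration and that each description length already includes any prior over the hypotheses themselves. The model names and bit counts are made-up numbers for illustration.

```python
def posterior_weights(description_lengths):
    """Turn description lengths (in bits) into normalized weights ~ 2^(-Sigma)."""
    best = min(description_lengths.values())
    # Subtract the minimum first so that 2**(-x) does not underflow.
    unnormalized = {m: 2.0 ** -(s - best) for m, s in description_lengths.items()}
    total = sum(unnormalized.values())
    return {m: w / total for m, w in unnormalized.items()}

# Hypothetical values: two nearly tied models and one clearly worse.
weights = posterior_weights({"model_a": 1000.0, "model_b": 1001.0, "model_c": 1030.0})
```

A difference of a single bit leaves two hypotheses with weights of the same order (a factor of two), whereas a difference of 30 bits makes the worse one negligible. This is why near-ties should be averaged over rather than broken arbitrarily.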

I emphasize that it is not possible to “cheat” when doing compression. For any particular model, the description length will have the same form

$$\Sigma(\boldsymbol A, \boldsymbol\theta) = -\log_2 P(\boldsymbol A | \boldsymbol\theta) - \log_2 P(\boldsymbol\theta),$$

where $\boldsymbol\theta$ is some arbitrary set of parameters. If we constrain the model such that it becomes possible to describe the data with a number of bits that is very small, this can only be achieved, in general, by increasing the number of parameters $\boldsymbol\theta$, such that the number of bits required to describe them will also increase. Therefore, there is no generic way to achieve compression that bypasses actually formulating a meaningful hypothesis that matches statistically significant patterns seen in the data.
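The trade-off can be seen on a toy example with a binary sequence instead of a network. The two encodings below are assumptions made for illustration: a one-parameter model that transmits the number of ones and then which arrangement was observed, and a “memorizing” model whose parameters are the sequence itself, with a uniform prior over all sequences of that length.

```python
import math

def bernoulli_bits(seq):
    """(data_bits, parameter_bits) for a model with a single count parameter."""
    n, k = len(seq), sum(seq)
    parameter_bits = math.log2(n + 1)         # the count k, uniform on 0..n
    data_bits = math.log2(math.comb(n, k))    # which arrangement with k ones
    return data_bits, parameter_bits

def memorizing_bits(seq):
    """(data_bits, parameter_bits) for a model that stores the data as parameters."""
    return 0.0, float(len(seq))               # one bit per symbol, none for the data

seq = [1] * 5 + [0] * 95
simple_total = sum(bernoulli_bits(seq))
memorized_total = sum(memorizing_bits(seq))
```

The memorizing model describes the data with zero bits, but only because all 100 bits have moved into the parameters. The simple model wins here because the sequence really does have a pattern (very few ones) that a single parameter can capture.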

One may wonder, therefore, if there is an automated way of searching for hypotheses in a manner that guarantees optimal compression. The most fundamental way to formulate this question is to generalize the concept of minimum description length as follows: for any binary string $\boldsymbol x$ (representing any measurable data), we define $K(\boldsymbol x)$ as the length in bits of the shortest computer program that yields $\boldsymbol x$ as an output. The quantity $K(\boldsymbol x)$ is known as Kolmogorov complexity, and if we were able to compute it for a binary string representing an observed network, we would be able to determine the “true model” value in Figure 1, and hence know how far we are from the optimum.⁶

⁶ As mentioned before, this would not necessarily mean that we would be able to find the actual true model in a practical setting with perfect accuracy, since for a finite $\boldsymbol x$ there could be many programs of the same minimal length (or close) that generate it.

⁷ There are two famous ways to prove this. One is by contradiction: if we assume that we have a program that computes $K(\boldsymbol x)$, then we could use it as a subroutine to write another program that outputs $\boldsymbol x$ with a length smaller than $K(\boldsymbol x)$. The other involves undecidability: if we enumerate all possible computer programs in order of increasing length and check if their outputs match $\boldsymbol x$, we will eventually find programs that loop indefinitely. Deciding whether a program finishes in finite time is known as the “halting problem”, which has been proved to be impossible to solve. In general, it cannot be determined if a program reaches an infinite loop in a manner that avoids actually running the program and waiting for it to finish. Therefore, this rather intuitive algorithm to determine $K(\boldsymbol x)$ will not necessarily finish for any given string $\boldsymbol x$. For more details the Wikipedia page has a good overview.

Unfortunately, an important result in information theory is that is not computable. This means that it is strictly impossible to write a computer program that computes for any string .⁷ This is a rather counterintuitive and frustrating fact that means that while we can keep trying, and maybe eventually even succeeding in compressing some data better than our last attempt, whether we have at any point achieved the optimal compression will remain forever unknowable.

Some interpret the uncomputability of Kolmogorov complexity as an invalidation of the overall compression approach to model selection, but I think this is fundamentally mistaken. This fact simply means that we cannot automate the discovery of optimal hypotheses, or know where the “finish line” is. However, compression is still a perfectly valid and objective criterion to judge the relative plausibility of competing hypotheses. This is really the best we can hope for, and there’s an upside: It means that scientists will never run out of things to do, and their capacity for creativity in formulating new hypotheses will never become obsolete!
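As a rough illustration of compression as a relative criterion, the sketch below uses three general-purpose compressors from the Python standard library as stand-ins for competing "hypotheses". This is only an analogy: the compressors, the test strings, and the helper name `description_length_bits` are assumptions made for the example, not part of the network models discussed in this post. Each compressed size is an upper bound on the description length of the data, and none of them can certify that a shorter description does not exist.

```python
import bz2
import lzma
import random
import zlib


def description_length_bits(data: bytes) -> dict:
    """Upper bounds (in bits) on the description length of `data`,
    one per off-the-shelf compressor (hypothetical helper)."""
    return {
        "raw": 8 * len(data),
        "zlib": 8 * len(zlib.compress(data, 9)),
        "bz2": 8 * len(bz2.compress(data, 9)),
        "lzma": 8 * len(lzma.compress(data)),
    }


rng = random.Random(42)
structured = b"0110" * 5000                              # strongly patterned data
noise = bytes(rng.getrandbits(8) for _ in range(20000))  # no pattern to exploit

bounds_structured = description_length_bits(structured)
bounds_noise = description_length_bits(noise)

for name, bounds in [("structured", bounds_structured), ("noise", bounds_noise)]:
    best = min(bounds, key=bounds.get)
    print(f"{name}: best upper bound {bounds[best]} bits ({best})")
```

The smallest value found is the best hypothesis among those tried, which is exactly the kind of relative comparison that remains valid even though the true minimum is uncomputable.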

References

[1]

T. P. Peixoto, Descriptive Vs. Inferential Community Detection in Networks: Pitfalls, Myths and Half-Truths, Elements in the Structure and Dynamics of Complex Networks (2023).

[2]

B. De Finetti, Theory of Probability: A Critical Introductory Treatment, Vol. 6 (John Wiley & Sons, 2017).

[3]

T. Seidenfeld, Why I Am Not an Objective Bayesian; Some Reflections Prompted by Rosenkrantz, Theory and Decision 11, 413 (1979).

[4]

E. T. Jaynes, Probability Theory: The Logic of Science (Cambridge University Press, Cambridge, UK ; New York, NY, 2003).

[5]

P. D. Grünwald, The Minimum Description Length Principle (The MIT Press, 2007).

[6]

T. P. Peixoto, Disentangling Homophily, Community Structure, and Triadic Closure in Networks, Physical Review X 12, 011004 (2022).

[7]

L. Zhang and T. P. Peixoto, Statistical Inference of Assortative Community Structures, Physical Review Research 2, 043271 (2020).

[8]

T. P. Peixoto, Revealing Consensus and Dissensus Between Network Partitions, Physical Review X 11, 021003 (2021).

Significant community structure via statistical tests?

Tiago P. Peixoto — Sun, 12 Dec 2021 23:00:00 GMT

This post is a slightly modified version of Sec. IVC in [1].

In a previous blog post I explained how modularity maximization tends to overfit and find spurious community structure even in random graphs.

Sometimes practitioners are indeed aware that such non-inferential methods can find communities that are not supported by statistical evidence. In an attempt to extract an inferential conclusion from their results in spite of this, they compare the value of the quality function with a randomized version of the network — and if a significant discrepancy is found, they conclude that the community structure is statistically meaningful. Unfortunately, this approach is as fundamentally flawed as it is straightforward to implement.

The test fails because it answers a question that is different from the one intended. When we compare the value of the quality function (or any other test statistic) obtained from a network and its randomized counterpart, we can use this information to answer only the following question:

“Can we reject the hypothesis that the observed network was sampled from a random null model?”

No other information can be obtained from this test, including whether the network partition we obtained is significant. All we can determine is if the optimized value of the quality function is significant or not. The distinction between the significance of the quality function value and the network partition itself is subtle but crucial.

We illustrate the above difference with an example in Figure 1 (b). This network is created by starting with a fully random Erdős-Rényi (ER) network, and adding to it a few more edges so that it has an embedded clique of six nodes. The occurrence of such a clique from an ER model is very unlikely, so if we perform a sufficiently powerful statistical test on this network, we should be able to rule out that it came from the ER model with good confidence. Indeed, if we use the value of maximum modularity for this test, and compare with the values obtained for the ER model with the same number of nodes and edges (see Figure 1 (a)), we are able to reach the correct conclusion that the null model should be rejected, since the optimized value of modularity is significantly higher for the observed network.
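The logic of this kind of test can be sketched in a few lines of standard-library Python. To keep the example self-contained, the test statistic used here is the number of triangles rather than the maximum modularity (the text notes that any test statistic leads to the same kind of conclusion), and the network size, edge count, seed, and function names are all assumptions chosen for illustration.

```python
import itertools
import random


def random_graph(n, m, rng):
    """Sample m distinct edges uniformly among all pairs of n nodes."""
    pairs = list(itertools.combinations(range(n), 2))
    return set(rng.sample(pairs, m))


def count_triangles(n, edges):
    """Number of triangles; edges are tuples (u, v) with u < v."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Each triangle u < v < w is counted once, from its edge (u, v).
    return sum(1 for u, v in edges for w in adj[u] & adj[v] if w > v)


rng = random.Random(1)
n, m, clique = 100, 150, range(6)

# Observed network: a random graph plus a planted clique of six nodes.
observed = random_graph(n, m, rng) | set(itertools.combinations(clique, 2))
t_obs = count_triangles(n, observed)

# Null distribution: random graphs with the same number of nodes and edges.
null = [count_triangles(n, random_graph(n, len(observed), rng)) for _ in range(200)]
p_value = (1 + sum(t >= t_obs for t in null)) / (1 + len(null))

print(f"observed triangles: {t_obs}, p-value: {p_value:.3f}")
```

A small p-value here only says that the network is unlikely to have come from the null model. Nothing in the computation refers to a partition of the nodes, so it cannot say anything about whether a particular partition is meaningful.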

Figure 1: The statistical significance of the maximum modularity value is not informative of the significance of the community structure. In (a) we show the distribution of optimized values of modularity for networks sampled from the Erdős-Rényi (ER) model with the same number of nodes and edges as the network shown in (b) and (c). The vertical line shows the value obtained for the partition shown in (b), indicating that the network is very unlikely to have been sampled from the ER model (). However, what sets this network apart from typical samples is the existence of a small clique of six nodes that would not occur in the ER model. The remaining communities found in (b) are entirely meaningless. In (c) we show the result of inferring the stochastic block model on this network, which perfectly identifies the planted clique without overfitting the rest of the network.

Should we conclude therefore that the communities found in the network are significant? If we inspect Figure 1 (b), we see that the maximum value of modularity indeed corresponds to a more-or-less decent detection of the planted clique. However, it also finds another seven completely spurious communities in the random part of the network. What is happening is clear: the planted clique is enough to increase the value of modularity such that it becomes a suitable test to reject the null model¹, but the test is not powerful enough to verify that the communities themselves are statistically meaningful. In short, the following two statements are not synonymous:

¹ Note that it is possible to construct alternative examples, where instead of planting a clique, we introduce the placement of triangles, or other features that are known to increase the value of modularity, but that do not correspond to an actual community structure.

The maximum value of modularity is significant.
The corresponding network partition is significant.

Conflating the two will lead to the wrong conclusion about the significance of the communities uncovered.

In Figure 1 (c) we show the result of a more appropriate inferential approach, based on Bayesian inference as described in a previous blog post, that attempts to answer a much more relevant question: “which partition of the network into groups is more likely?” The result is able to cleanly separate the planted clique from the rest of the network, which is grouped into a single community.

This example also shows how the task of rejecting a null model is very oblique to Bayesian inference of generative models. The former attempts to determine what the network is not, while the latter what it is. The first task tends to be easy: we usually do not need very sophisticated approaches to determine that our data did not come from a null model, especially if our data is complex. On the other hand, even if approximate, the second task is far more revealing, constructive, and arguably more useful in general.

References

[1]

T. P. Peixoto, Descriptive Vs. Inferential Community Detection in Networks: Pitfalls, Myths and Half-Truths, Elements in the Structure and Dynamics of Complex Networks (2023).

Do we need to believe in generative models?

Tiago P. Peixoto — Tue, 07 Dec 2021 23:00:00 GMT

This post is a slightly modified version of Sec. IVH in [1].

In two previous blog posts (first and second) I advocated for the use of statistical inference for community detection in networks, whenever our objective is of an inferential nature.

One possible objection to the use of statistical inference is when the generative models on which they are based are considered unrealistic for a particular kind of network. Although this type of consideration is ultimately important, it is not necessarily an obstacle. First we need to remember that realism is a matter of degree, not kind, since no model can be fully realistic, and therefore we should never be fully committed to “believe” any particular model. Because of this, an inferential approach can be used to target a particular kind of structure, and the corresponding model is formulated with this in mind, but without the need to describe other properties of the data. The stochastic block model (SBM) is a good example of this, since it is often used with the objective of finding communities, rather than any kind of network structure. A model like the SBM is a good way to offset the regularities that relate to the community structure with the irregularities present in real networks, without requiring us to believe that in fact it generated the network.

Furthermore, certain kinds of models are flexible enough so that they can approximate other models. For example, a good analogy with fitting the SBM to network data is to fit a histogram to numerical data, with the node partitioning being analogous to the data binning. Although a piecewise constant model is almost never the true underlying distribution, it provides a reasonable approximation in a tractable, nonparametric manner. Because of its capacity to approximate a wide class of distributions, we certainly do not need to believe that a histogram is the true data generating process to extract meaningful inferences from it. In fact, the same can be said of the SBM in its capacity to approximate a wide class of network models [2].

This means that we can extract useful, statistically meaningful information from data even if the models we use are misspecified. For example, if a network is generated by a latent space model [3], and we fit an SBM to it, the communities that are obtained in this manner are not quite meaningless: they will correspond to discrete spatial regions. Hence, the inference would yield a caricature of the underlying latent space, amounting to a discretization of the true model, indeed much like a histogram. This is very different, say, from finding communities in an Erdős–Rényi graph, which bear no relation to the true underlying model, and would be just overfitting the data. In contrast, the SBM fit to a spatial network would be approximately capturing the true model structure, in a manner that could be used to compress the data and make predictions (although not optimally).

Furthermore, the associated description length of a network model is a good criterion to tell whether the patterns we have found are actually simplifying our network description, without requiring the underlying model to be perfect. This happens in the same way that software such as gzip makes our files smaller, without requiring us to believe that they are in fact generated by the Markov chain used by the underlying Lempel-Ziv algorithm.

Of course, realism is important as soon as we demand more from the point of view of interpretation and prediction. Are the observed community structures due to homophily or triadic closure [4]? Or are they due to spatial embedding [3]? What models are capable of reproducing other network descriptors, together with the community structure? Which models can better reconstruct incomplete networks [5,6]?

When answering these questions, we are forced to consider more detailed generative processes, and compare them. However, we are never required to believe them — models are always tentative, approximative, and should always be replaced by superior alternatives when these are found. Indeed, criteria such as minimum description length serve precisely to implement such a comparison between models, following the principle of Occam's razor. Therefore, the lack of realism of any particular model cannot be used to dismiss statistical inference as an underlying methodology.

It should be emphasized that, fundamentally, there is no alternative. Rejecting an inferential approach based on the SBM on the grounds that it is an unrealistic model (e.g. because of the conditional independence of the edges being placed, or some other unpalatable assumption), but instead preferring some other non-inferential community detection method is incoherent: As we discussed previously, every descriptive method can be mapped to an inferential analogue, with implicit assumptions that are hidden from view. Unless one can establish that the implicit assumptions are in fact more realistic, then the comparison cannot be justified. Unrealistic assumptions should be replaced by more realistic ones, not by burying one’s head in the sand.

References

[1]

T. P. Peixoto, Descriptive Vs. Inferential Community Detection in Networks: Pitfalls, Myths and Half-Truths, Elements in the Structure and Dynamics of Complex Networks (2023).

[2]

S. C. Olhede and P. J. Wolfe, Network Histograms and Universality of Blockmodel Approximation, Proceedings of the National Academy of Sciences 111, 14722 (2014).

[3]

P. D. Hoff, A. E. Raftery, and M. S. Handcock, Latent Space Approaches to Social Network Analysis, Journal of the American Statistical Association 97, 1090 (2002).

[4]

T. P. Peixoto, Disentangling Homophily, Community Structure, and Triadic Closure in Networks, Physical Review X 12, 011004 (2022).

[5]

R. Guimerà and M. Sales-Pardo, Missing and Spurious Interactions and the Reconstruction of Complex Networks, Proceedings of the National Academy of Sciences 106, 22073 (2009).

[6]

T. P. Peixoto, Reconstructing Networks with Unknown and Heterogeneous Errors, Physical Review X 8, 041011 (2018).

No free lunch in community detection?

Tiago P. Peixoto — Mon, 06 Dec 2021 23:00:00 GMT

This post is a slightly modified version of Sec. IVG in [1].

For a wide class of optimization and learning problems there exist so-called “no-free-lunch” (NFL) theorems, which broadly state that when averaged over all possible problem instances, all algorithms show equivalent performance [2–4]. Peel et al [5] have proved that this is also valid for the problem of community detection, meaning that no single method can perform systematically better than any other, when averaged over “all community detection problems.” This has been occasionally interpreted as a reason to reject the claim that we should prefer certain classes of algorithms over others. This is, however, a misinterpretation of the theorem, as we will now discuss.

The NFL theorem for community detection is easy to state. Let us consider a generic community detection algorithm indexed by $f$, defined by the function $\hat{\boldsymbol b}_f(\boldsymbol A)$, which ascribes a single partition to a network $\boldsymbol A$. Peel et al. [5] consider an instance of the community detection problem to be an arbitrary pair $(\boldsymbol A, \boldsymbol b)$ composed of a network $\boldsymbol A$ and the correct partition $\boldsymbol b$ that one wants to find from $\boldsymbol A$. We can evaluate the accuracy of the algorithm $f$ via an error (or “loss”) function

$$\epsilon(\boldsymbol b, \hat{\boldsymbol b}_f(\boldsymbol A)),$$

which should take the smallest possible value if $\hat{\boldsymbol b}_f(\boldsymbol A) = \boldsymbol b$. If the error function does not have an inherent preference for any partition (it is “homogeneous”), then the NFL theorem states [3,5]

$$\sum_{(\boldsymbol A, \boldsymbol b)} \epsilon(\boldsymbol b, \hat{\boldsymbol b}_f(\boldsymbol A)) = \Lambda(\epsilon), \tag{1}$$

where $\Lambda(\epsilon)$ is a value that depends only on the error function chosen, but not on the community detection algorithm $f$. In other words, when averaged over all problem instances, all algorithms have the same accuracy. This implies, therefore, that in order for one class of algorithms to perform systematically better than another, we need to restrict the universe of problems to a particular subset. This is a seemingly straightforward result, but one that is unfortunately very susceptible to misinterpretation and overstatement.
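The statement of the theorem can be checked by brute force for a very small number of nodes. The sketch below enumerates every network and every labeled partition of three nodes, treats an "algorithm" as an arbitrary lookup table from network to partition, and sums a 0-1 loss over all problem instances. The choice of loss, the value of `N`, the seed, and all function names are assumptions made for this illustration.

```python
import itertools
import random


def all_graphs(n):
    """Every simple undirected graph on n nodes, as a frozenset of edges."""
    pairs = list(itertools.combinations(range(n), 2))
    for mask in itertools.product([0, 1], repeat=len(pairs)):
        yield frozenset(p for p, bit in zip(pairs, mask) if bit)


def all_labeled_partitions(n):
    """Every label vector that uses exactly the labels 0..B-1, for B = 1..n."""
    for B in range(1, n + 1):
        for b in itertools.product(range(B), repeat=n):
            if len(set(b)) == B:
                yield b


def total_error(algorithm, graphs, partitions):
    """0-1 loss summed over every (network, partition) problem instance."""
    return sum(algorithm[A] != b for A in graphs for b in partitions)


N = 3
graphs = list(all_graphs(N))
partitions = list(all_labeled_partitions(N))

rng = random.Random(0)
# Five arbitrary lookup tables, plus one that ignores the network entirely.
random_algorithms = [{A: rng.choice(partitions) for A in graphs} for _ in range(5)]
constant_algorithm = {A: partitions[0] for A in graphs}

errors = [total_error(f, graphs, partitions)
          for f in random_algorithms + [constant_algorithm]]
print(len(graphs), len(partitions), errors)
```

Every lookup table is wrong on all but one partition for each network, so the summed error is the same for all of them, including the table that returns the same partition regardless of its input.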

A common criticism of this kind of NFL theorem is that it is a poor representation of the typical problems we may encounter in real domains of application, which are unlikely to be uniformly distributed across the entire problem space. Therefore, as soon as we constrain ourselves to a subset of problems that are relevant to a particular domain, this will favor some algorithms over others, but then no algorithm will be superior for all domains. But since we are typically only interested in some domains, the NFL theorem is then arguably “theoretically sound, but practically irrelevant” [6]. Although indeed correct, in the case of community detection this logic is arguably an understatement. This is because as soon as we restrict our domain to community detection problems that reveal something informative about the network structure, then we are out of reach of the NFL theorem, and some algorithms will do better than others, without invoking any particular domain of application. We demonstrate this in the following.

The framework of the NFL theorem operates on a liberal notion of what constitutes a community detection problem and its solution, which means for an arbitrary pair $(\boldsymbol A, \boldsymbol b)$ choosing the right $f$ such that $\hat{\boldsymbol b}_f(\boldsymbol A) = \boldsymbol b$. Under this framework, algorithms are just arbitrary mappings from network to partition, and there is no necessity to articulate more specifically how they relate to the structure of the network; community detection just becomes an arbitrary game of “guess the hidden node labels.” This contrasts with how actual community detection algorithms are proposed, which attempt to match the node partitions to patterns in the network, e.g. assortativity, general connection preferences between groups, etc. Although the large variety of algorithms proposed for this task already reveals a lack of consensus on how to precisely define it, few would consider it meaningful to leave the class of community detection problems so wide open as to accept any matching between an arbitrary network and an arbitrary partition as a valid instance.

Even though we can accommodate any (deterministic) algorithm deemed valid according to any criterion under the NFL framework, most algorithms in this broader class do something else altogether. In fact, the absolute vast majority of them correspond to a maximally random matching between network and partition, which amounts to little more than just randomly guessing a partition for any given network, i.e. they return widely different partitions for inputs that are very similar, and overall point to no correlation between input and output.¹ It is not difficult to accept that these random algorithms perform equally “well” for any particular problem, or even all problems, but the NFL theorem says that they have equivalent performance even to algorithms that we may deem more meaningful. How do we make a formal distinction between algorithms that are just randomly guessing from those that are doing something coherent, that depends on discovering actual network patterns? As it turns out, there is an answer to this question that does not depend on particular domains of application: we require the solutions found to be structured and compressive of the network.

¹ An interesting exercise is to count how many such algorithms exist. A given community detection algorithm needs to map each of all $\Omega(N) = 2^{\binom{N}{2}}$ networks of $N$ nodes to one of $\Xi(N) = \sum_{B=1}^{N} \left\{{N \atop B}\right\} B!$ labeled partitions of its nodes. Therefore, if we restrict ourselves to a single value of $N$, the total number of input-output tables is $\Xi(N)^{\Omega(N)}$. If we sample one such table uniformly at random, it will be asymptotically impossible to compress it using fewer than $\Omega(N)\log_2\Xi(N)$ bits, a number that grows super-exponentially with $N$. As an illustration, a random community detection algorithm that works only with nodes would already need terabytes of storage. Therefore, simply considering algorithms that humans can write and use (together with their expected inputs and outputs) already pulls us very far away from the general scenario considered by the NFL theorem.
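The counting exercise in this footnote can be carried out directly. The sketch below assumes simple undirected networks and counts labeled partitions as ordered set partitions (the ordered Bell, or Fubini, numbers); both choices, as well as the function names, are assumptions of the example rather than something stated in the text.

```python
import math


def num_networks(n):
    """Number of simple undirected networks on n labeled nodes."""
    return 2 ** math.comb(n, 2)


def num_labeled_partitions(n):
    """Ordered Bell (Fubini) number via a(n) = sum_k C(n, k) * a(n - k)."""
    a = [1]
    for i in range(1, n + 1):
        a.append(sum(math.comb(i, k) * a[i - k] for k in range(1, i + 1)))
    return a[n]


def table_bits(n):
    """Bits needed to store one uniformly random network -> partition table."""
    return num_networks(n) * math.log2(num_labeled_partitions(n))


for n in range(3, 11):
    print(f"N = {n:2d}: {table_bits(n) / 8:.3g} bytes")
```

Because the number of networks doubles with every additional node pair, the storage requirement of a random lookup table explodes long before the networks reach any size of practical interest.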

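In order to interpret the statement of the NFL theorem in this vein, it is useful to rewrite Equation 1 using an equivalent probabilistic language,

$$\sum_{(\boldsymbol A, \boldsymbol b)} P(\boldsymbol A, \boldsymbol b)\,\epsilon(\boldsymbol b, \hat{\boldsymbol b}_f(\boldsymbol A)) = \Lambda'(\epsilon), \tag{2}$$

where $\Lambda'(\epsilon) \propto \Lambda(\epsilon)$, and $P(\boldsymbol A, \boldsymbol b) \propto 1$ is the uniform probability of encountering a problem instance. When writing the theorem statement in this way, we notice immediately that instead of being agnostic about problem instances, it implies a very specific network generative model, which assumes a complete independence between network and partition. Namely, if we restrict ourselves to networks of $N$ nodes, we have then (with $\left\{{N \atop B}\right\}$ denoting the Stirling numbers of the second kind):

$$P(\boldsymbol A, \boldsymbol b) = P(\boldsymbol A)P(\boldsymbol b), \qquad P(\boldsymbol A) = 2^{-\binom{N}{2}}, \qquad P(\boldsymbol b) = \left[\sum_{B=1}^{N} \left\{{N \atop B}\right\} B!\right]^{-1}. \tag{3}$$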

Therefore, the NFL theorem states simply that if we sample networks and partitions from a maximally random generative model, then all algorithms will have the same average accuracy at inferring the partition from the network. This is hardly a spectacular result — indeed the Bayes-optimal algorithm in this case, i.e. the one derived from the posterior distribution of the true generative model and which guarantees the best accuracy on average, consists of simply guessing partitions uniformly at random, ignoring the network structure altogether.

The probabilistic interpretation reveals that the NFL theorem makes a very specific assumption about what kind of community detection problem we are expecting, namely one where both the network and partition are sampled independently and uniformly at random. It is important to remember that it is not possible to make “no assumption” about a problem; we are always forced to make some assumption, which even if implicit does not exempt it from justification, and the uniform assumption of Equation 3 is no exception. In Figure 1 (a) we show a typical sample from this ensemble of community detection problems.

Figure 1: The NFL theorem involves predominantly instances of the community detection problem that are strictly incompressible, i.e. the true partitions cannot be used to explain the network. In (a) we show a typical sample of the uniform problem space given by Equation 3, for nodes, which yields a dense fully random network, randomly divided into groups. It is asymptotically impossible to use this partition to compress this network into fewer than $-\log_2 P(\boldsymbol A) = \binom{N}{2}$ bits, and therefore the partition is not learnable from the network alone with any inferential algorithm. We show also the description length of the SBM conditioned on the true partition, $\Sigma(\boldsymbol A|\boldsymbol b)$, as a reference. In (b) we show an example of a community detection problem that is solvable, at least in principle, since $\Sigma(\boldsymbol A|\boldsymbol b) < -\log_2 P(\boldsymbol A)$. In this case, the partition can be used to inform the network structure, and potentially vice versa. This class of problem instance has a negligible contribution to the sum in the NFL theorem, since it occurs only with an extremely small probability when sampled from the uniform model of Equation 3. It is therefore more reasonable to state that the network in example (b) has an actual community structure, while the one in (a) does not.

In a very concrete sense, we can state that such problem instances contain no learnable community structure, or in fact no learnable network structure at all. We say that a community structure is learnable if the knowledge of the partition $\boldsymbol b$ can be used to compress the network $\boldsymbol A$, i.e. there exists an encoding $\mathcal H$ (i.e. a generative model) such that

$$\Sigma(\boldsymbol A|\boldsymbol b, \mathcal H) < -\log_2 P(\boldsymbol A) = \binom{N}{2},$$

where $\Sigma(\boldsymbol A|\boldsymbol b, \mathcal H) = -\log_2 P(\boldsymbol A|\boldsymbol b, \mathcal H)$ is the description length of $\boldsymbol A$ according to model $\mathcal H$, conditioned on the partition being known. However, it is a direct consequence of Shannon's source coding theorem [7] that for the vast majority of networks sampled from the model of Equation 3 the inequality above cannot be fulfilled as $N\to\infty$, i.e. the networks are incompressible.² This means that the true partition carries no information about the network structure, and vice versa, i.e. the partition is not learnable from the network. In view of this, the common interpretation of the NFL theorem as “all algorithms perform equally well” is in fact somewhat misleading, and can be more accurately phrased as “all algorithms perform equally poorly”, since no inferential algorithm can uncover the true community structure in most cases, at least no better than by chance alone. In other words, the universe of community detection problems considered in the NFL theorem is composed overwhelmingly of problems for which compression and explanation are not possible.³ This uniformity between instances also reveals that there is no meaningful trade-off between algorithms for most instances, since all algorithms will yield the same negligible asymptotic performance, with an accuracy tending asymptotically towards zero as the number of nodes increases. In this setting, there is not only no free lunch, but in fact there is no lunch at all (see Figure 2).
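The learnability criterion can be made concrete with a simple encoding. The sketch below uses a two-part code as a simplified stand-in for the SBM description length: for each pair of groups it transmits the number of edges and then which node pairs are occupied. It is not the nonparametric SBM used elsewhere in these posts, and the network sizes, seed, and function name are assumptions of the example.

```python
import itertools
import math
import random


def conditional_description_length(n, edges, b):
    """Two-part code length (bits) for a network given its partition b.

    For each pair of groups we first transmit the number of edges between
    them, then which of the possible node pairs are the occupied ones.
    """
    possible, present = {}, {}
    for u, v in itertools.combinations(range(n), 2):
        block = tuple(sorted((b[u], b[v])))
        possible[block] = possible.get(block, 0) + 1
        present[block] = present.get(block, 0) + ((u, v) in edges)
    return sum(
        math.log2(possible[k] + 1) + math.log2(math.comb(possible[k], present[k]))
        for k in possible
    )


N = 40
pairs = list(itertools.combinations(range(N), 2))
uniform_bits = math.comb(N, 2)  # -log2 P(A) for a fully random network

# (a) Network and partition drawn independently at random.
rng = random.Random(7)
random_edges = {p for p in pairs if rng.random() < 0.5}
random_b = [rng.randrange(3) for _ in range(N)]

# (b) Two disjoint cliques, together with the partition that matches them.
planted_b = [0] * 20 + [1] * 20
planted_edges = {(u, v) for u, v in pairs if planted_b[u] == planted_b[v]}

dl_random = conditional_description_length(N, random_edges, random_b)
dl_planted = conditional_description_length(N, planted_edges, planted_b)
print(f"uniform: {uniform_bits} bits, random: {dl_random:.1f}, planted: {dl_planted:.1f}")
```

In case (b) knowing the partition shortens the description dramatically, whereas in case (a) it brings the description length no meaningful distance below the uniform baseline, which is the sense in which such an instance has no learnable structure.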

² For finite networks a positive compression might be achievable with small probability, but due to chance alone, and not in a manner that makes its structure learnable.

³ One could argue that such a uniform model is justified by the principle of maximum entropy, which states that in the absence of prior knowledge about which problem instances are more likely, we should assume they are all equally likely a priori. This argument fails precisely because we do have sufficient prior knowledge that empirical networks are not maximally random, especially those possessing community structure, according to any meaningful definition of the term. Furthermore, it is easy to verify for each particular problem instance that the uniform assumption does not hold, either by compressing an observed network using any generative model (which should be asymptotically impossible under the uniform assumption), or by performing a statistical test designed to reject the uniform null model. It is exceedingly difficult to find an empirical network for which the uniform model cannot be rejected with near-absolute confidence.

Figure 2: A common interpretation of the NFL theorem for community detection is that it reveals a necessary trade-off between algorithms: since they all have the same average performance, if one algorithm does better than another in one set of instances, it must do worse on an equal number of different instances, as depicted in panel (a). However, in the actual setting considered by the NFL theorem there is no meaningful trade-off: asymptotically, all algorithms perform maximally poorly for the vast majority of instances, as depicted in panel (b), since in these cases the network structure is uninformative of the partition. If we constrain ourselves to informative problem instances (which compose only an infinitesimal fraction of all instances), the NFL theorem is no longer applicable.

If we were to restrict the space of possible community detection algorithms to those that provide actual explanations, then by definition this would imply a positive correlation between network and partition, i.e.⁴

$$P(\boldsymbol A, \boldsymbol b) \neq P(\boldsymbol A)P(\boldsymbol b). \tag{4}$$

⁴ Note that Equation 4 is a necessary but not sufficient condition for the community detection problem to be solvable. An example of this are networks generated by the SBM, which are solvable only if the strength of the community structure exceeds a detectability threshold [8], even if Equation 4 is fulfilled.

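Not only does this imply a specific generative model but, as a consequence, also an optimal community detection algorithm, which operates based on the posterior distribution

$$P(\boldsymbol b|\boldsymbol A) = \frac{P(\boldsymbol A|\boldsymbol b)P(\boldsymbol b)}{P(\boldsymbol A)}.$$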

Therefore, learnable community detection problems are invariably tied to an optimal class of algorithms, undermining to a substantial degree the relevance of the NFL theorem in practice. In other words, whenever there is an actual community structure in the network being considered (i.e. due to a systematic correlation between $\boldsymbol A$ and $\boldsymbol b$, such that $P(\boldsymbol A, \boldsymbol b) \neq P(\boldsymbol A)P(\boldsymbol b)$), there will be algorithms that can exploit this correlation better than others (see Figure 1 (b) for an example of a learnable community detection problem). Importantly, the set of learnable problems forms only an infinitesimal fraction of all problem instances, with a measure that tends to zero as the number of nodes increases, and hence remains firmly out of scope of the NFL theorem. This observation has been made before, and is equally valid, in the wider context of NFL theorems beyond community detection [9–14].

Note that since there are many ways to choose a nonuniform model according to Equation 4, the optimal algorithms will still depend on the particular assumptions made via the choice of $P(\boldsymbol A, \boldsymbol b)$. However, this does not imply that all algorithms have equal performance on compressible problem instances. If we sample a problem from the universe $\mathcal H_1$, with $P(\boldsymbol A, \boldsymbol b|\mathcal H_1)$, but use instead two algorithms optimal in $\mathcal H_2$ and $\mathcal H_3$, respectively, their relative performances will depend on how close each of these universes is to $\mathcal H_1$, and hence will not be in general the same. In fact, if our space of universes is finite, we can compose them into a single unified universe [15] according to

$$P(\boldsymbol A, \boldsymbol b) = \sum_{i=1}^{M} P(\boldsymbol A, \boldsymbol b|\mathcal H_i)P(\mathcal H_i), \qquad P(\mathcal H_i) = \frac{1}{M},$$

which will incur a compression penalty of at most $\log_2 M$ bits added to the description length of the optimal algorithm. This gives us a path, based on hierarchical Bayesian models and minimum description length, to achieve optimal or near-optimal performance on instances of the community detection problem that are actually solvable, simply by progressively expanding our set of hypotheses.
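The bound on the compression penalty can be verified numerically in a toy setting. In the sketch below the "universes" are a handful of Bernoulli coin models and the data is a short 0/1 sequence, which is only a stand-in for a network and partition pair; the data, the candidate parameters, and the function names are assumptions of the example. The hypotheses are combined with equal weight, and the resulting code length is compared with that of the best single hypothesis.

```python
import math


def code_length(data, p):
    """-log2 probability of a 0/1 sequence under a Bernoulli(p) model."""
    ones = sum(data)
    zeros = len(data) - ones
    return -(ones * math.log2(p) + zeros * math.log2(1 - p))


def mixture_code_length(data, ps):
    """-log2 of the equal-weight mixture (1/M) * sum_i P(data | H_i)."""
    lengths = [code_length(data, p) for p in ps]
    best = min(lengths)
    # Base-2 log-sum-exp, to avoid underflow for long sequences.
    log_sum = math.log2(sum(2 ** (best - l) for l in lengths))
    return best - log_sum + math.log2(len(ps))


data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1]
hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]  # M = 5 candidate models

best_single = min(code_length(data, p) for p in hypotheses)
mixture = mixture_code_length(data, hypotheses)
penalty = mixture - best_single
print(f"best single: {best_single:.2f} bits, mixture: {mixture:.2f} bits, "
      f"penalty: {penalty:.2f} <= {math.log2(len(hypotheses)):.2f}")
```

The mixture never costs more than the best hypothesis plus the number of bits needed to say which hypothesis was used, so enlarging the set of hypotheses is cheap relative to the gain from including a better one.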

The idea that we can use compression as an inference criterion has been formalized by Solomonoff's theory of inductive inference, which forms a rigorous induction framework based on the principle of Occam's razor. Importantly, the expected errors of predictions achieved under this framework are provably upper-bounded by the Kolmogorov complexity of the data generating process [16], making the induction framework consistent. The Kolmogorov complexity is a generalization of the description length we have been using, and it is defined by the length of the shortest binary program that generates the data. The only major limitation of Solomonoff's framework is its uncomputability, i.e. the impossibility of determining the Kolmogorov complexity with any algorithm. However, this impossibility does not invalidate the framework; it only means that induction cannot be fully automated: we have a consistent criterion to compare hypotheses, but no deterministic mechanism to produce directly the best hypothesis. There are open philosophical questions regarding the universality of this inductive framework [18], but whatever fundamental limitations it may have do not follow directly from NFL theorems such as the one from [5]. In fact, as mentioned in the footnote above, it is a rather simple task to use compression to reject the uniform hypothesis forming the basis of the NFL theorem for almost any network data.

Since compressive community detection problems are out of the scope of the NFL theorem, it is not meaningful to use it to justify avoiding comparisons between algorithms, on the grounds that all choices must be equally “good” in a fundamental sense. In fact, we do not need much sophistication to reject this line of argument, since the NFL theorem applies also when we are considering trivially inane algorithms, e.g. one that always returns the same partition for every network. The only domain where such an algorithm is as good as any other is when we have no community structure to begin with, which is precisely what the NFL theorem relies on.

Nevertheless, there are some lessons we can draw from the NFL theorem. It makes it clear that the performance of algorithms is tied directly to the inductive bias adopted, which should always be made explicit. The superficial interpretation of the NFL theorem as an inherent equity between all algorithms stems from the assumption that considering all problem instances uniformly is equivalent to being free of an inductive bias, but that is not possible. The uniform assumption is itself an inductive bias, and one that is hard to justify in virtually any context, since it involves almost exclusively unsolvable problems (from the point of view of compressibility). In contrast, considering only compressible problem instances is also an inductive bias, but one that relies only on Occam's razor as a guiding principle. The advantage of the latter is that it is independent of domain of application, i.e. we are making a statement only about whether a partition can help explain the network, without having to specify how a priori.

In view of the above observations, it becomes easier to understand results such as those of Ghasemian et al. [19], who found that compressive inferential community detection methods tend to systematically outperform descriptive methods in empirical settings when these are employed for the task of edge prediction. Even though edge prediction and community detection are not the same task, and using the former to evaluate the latter can in some cases lead to overfitting [20], typically the most compressive models will also lead to the best generalization. Therefore, the superior performance of the inferential methods is understandable, even though Ghasemian et al. also found a minority of instances where some descriptive methods can outperform inferential ones. To the extent that these minority results cannot be attributed to overfitting, or to technical issues such as insufficient MCMC equilibration, it could simply mean that the structure of these networks falls sufficiently outside of what is assumed by the inferential methods, but without this being a necessary trade-off that comes as a consequence of the NFL theorem. After all, under the uniform assumption, edge prediction is also strictly impossible, just like community detection. In other words, these results do not rule out the existence of an algorithm that works better in all cases considered, at least if their number is not too large.⁵ In fact, this is precisely what is achieved in [21] via model stacking, i.e. a combination of several predictors into a meta-predictor that achieves systematically superior performance. This points indeed to the possibility of using universal methods to discover the latent compressive modular structure of networks, without any tension with the NFL theorem.

⁵ It is important to distinguish the actual statement of the NFL theorem — “all algorithms perform equally well when averaged over all problem instances” — from the alternative statement: “No single algorithm exhibits strictly better performance than all others over all instances.” Although the latter is a corollary of the former, it can also be true when the former is false. In other words, a particular algorithm can be better on average over relevant problem instances, but still underperform for some of them. In fact, it would only be possible for an algorithm to strictly dominate all others if it can always achieve perfect accuracy for every instance. Otherwise, there will be at least one algorithm (e.g. one that always returns the same partition) that can achieve perfect accuracy for a single network where the optimal algorithm does not (“even a broken clock is right twice a day”). Therefore, sub-optimal algorithms can eventually outperform optimal ones by chance when a sufficiently large number of instances is encountered, even when the NFL theorem is not applicable (and therefore this fact is not necessarily a direct consequence of it).

References

[1]

T. P. Peixoto, Descriptive Vs. Inferential Community Detection in Networks: Pitfalls, Myths and Half-Truths, Elements in the Structure and Dynamics of Complex Networks (2023).

[2]

D. H. Wolpert and W. G. Macready, No Free Lunch Theorems for Search, Technical Report SFI-TR-95-02-010, Santa Fe Institute, 1995.

[3]

D. H. Wolpert, The Lack of A Priori Distinctions Between Learning Algorithms, Neural Computation 8, 1341 (1996).

[4]

D. H. Wolpert and W. G. Macready, No Free Lunch Theorems for Optimization, IEEE Transactions on Evolutionary Computation 1, 67 (1997).

[5]

L. Peel, D. B. Larremore, and A. Clauset, The Ground Truth about Metadata and Community Detection in Networks, Science Advances 3, e1602548 (2017).

[6]

C. Schaffer, A Conservation Law for Generalization Performance, in Machine Learning Proceedings 1994, edited by W. W. Cohen and H. Hirsh (Morgan Kaufmann, San Francisco (CA), 1994), pp. 259–265.

[7]

C. E. Shannon, A Mathematical Theory of Communication, Bell Syst. Tech. J. 27, 623 (1948).

[8]

A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Asymptotic Analysis of the Stochastic Block Model for Modular Networks and Its Algorithmic Applications, Physical Review E 84, 066106 (2011).

[9]

M. J. Streeter, Two Broad Classes of Functions for Which a No Free Lunch Result Does Not Hold, in Genetic and Evolutionary Computation — GECCO 2003, edited by E. Cantú-Paz, J. A. Foster, K. Deb, L. D. Davis, R. Roy, U.-M. O’Reilly, H.-G. Beyer, R. Standish, G. Kendall, S. Wilson, M. Harman, J. Wegener, D. Dasgupta, M. A. Potter, A. C. Schultz, K. A. Dowsland, N. Jonoska, and J. Miller (Springer, Berlin, Heidelberg, 2003), pp. 1418–1430.

[10]

S. McGregor, No Free Lunch and Algorithmic Randomness, in GECCO, Vol. 6 (2006), pp. 2–4.

[11]

T. Everitt, Universal Induction and Optimisation: No Free Lunch?, (2013).

[12]

T. Lattimore and M. Hutter, No Free Lunch Versus Occam’s Razor in Supervised Learning, in Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence: Papers from the Ray Solomonoff 85th Memorial Conference, Melbourne, VIC, Australia, November 30 – December 2, 2011, edited by D. L. Dowe (Springer, Berlin, Heidelberg, 2013), pp. 223–235.

[13]

T. Everitt, T. Lattimore, and M. Hutter, Free Lunch for Optimisation Under the Universal Distribution, in 2014 IEEE Congress on Evolutionary Computation (CEC) (2014), pp. 167–174.

[14]

G. Schurz, Hume’s Problem Solved: The Optimality of Meta-Induction, Illustrated edition (The MIT Press, Cambridge, Massachusetts, 2019).

[15]

E. T. Jaynes, Probability Theory: The Logic of Science (Cambridge University Press, Cambridge, UK ; New York, NY, 2003).

[16]

M. Hutter, On Universal Prediction and Bayesian Confirmation, Theoretical Computer Science 384, 33 (2007).

[17]

M. Hutter, Open Problems in Universal Induction & Intelligence, Algorithms 2, 879 (2009).

[18]

G. D. Montanez, Why Machine Learning Works, (2017).

[19]

A. Ghasemian, H. Hosseinmardi, and A. Clauset, Evaluating Overfit and Underfit in Models of Network Community Structure, IEEE Transactions on Knowledge and Data Engineering 1 (2019).

[20]

T. Vallès-Català, T. P. Peixoto, M. Sales-Pardo, and R. Guimerà, Consistencies and Inconsistencies Between Model Selection and Link Prediction in Networks, Physical Review E 97, 062316 (2018).

[21]

A. Ghasemian, H. Hosseinmardi, A. Galstyan, E. M. Airoldi, and A. Clauset, Stacking Models for Nearly Optimal Link Prediction in Complex Networks, Proceedings of the National Academy of Sciences 117, 23393 (2020).

Modularity maximization considered harmful

Tiago P. Peixoto — Sun, 05 Dec 2021 23:00:00 GMT

This post is a continuation of the previous two posts, and a slightly modified version of chapter III in [1].

The most widespread method for community detection is modularity maximization [2], which also happens to be one of the most problematic. This method is based on the modularity function,

$$Q(\boldsymbol A, \boldsymbol b) = \frac{1}{2E}\sum_{ij}\left(A_{ij} - \frac{k_i k_j}{2E}\right)\delta_{b_i, b_j}, \tag{1}$$

where $A_{ij}\in\{0,1\}$ is an entry of the adjacency matrix, $k_i=\sum_j A_{ij}$ is the degree of node $i$, $b_i$ is the group membership of node $i$, and $E$ is the total number of edges. The method consists in finding the partition $\hat{\boldsymbol b}$ that maximizes $Q(\boldsymbol A, \boldsymbol b)$,

$$\hat{\boldsymbol b} = \underset{\boldsymbol b}{\operatorname{argmax}}\; Q(\boldsymbol A, \boldsymbol b). \tag{2}$$

The motivation behind the modularity function is that it compares the existence of an edge $(i,j)$ to the probability of it existing according to a null model, $P_{ij} = k_i k_j/2E$, namely that of the configuration model [3] (or more precisely, the Chung-Lu model [4]). The intuition for this method is that we should consider a partition of the network meaningful if the occurrence of edges between nodes of the same group exceeds what we would expect with a random null model without communities.
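As a concrete reading of the modularity function, the following minimal sketch evaluates it for a given partition directly from the double sum over node pairs. The function name `modularity`, the edge-list input format, and the toy network (two triangles joined by a single edge) are illustrative assumptions.

```python
from itertools import product


def modularity(edges, b):
    """Q = 1/(2E) * sum_ij (A_ij - k_i k_j / 2E) * delta(b_i, b_j)
    for an undirected simple graph given as a list of (u, v) pairs."""
    n = len(b)
    E = len(edges)
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] += 1
        A[v][u] += 1
    k = [sum(row) for row in A]  # node degrees
    total = 0.0
    for i, j in product(range(n), repeat=2):
        if b[i] == b[j]:  # the Kronecker delta keeps same-group pairs only
            total += A[i][j] - k[i] * k[j] / (2 * E)
    return total / (2 * E)


# Two triangles (nodes 0-2 and 3-5) joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

q_two = modularity(edges, [0, 0, 0, 1, 1, 1])  # one group per triangle
q_one = modularity(edges, [0, 0, 0, 0, 0, 0])  # everything in one group
print(q_two, q_one)
```

Placing each triangle in its own group gives $Q = 5/14 \approx 0.357$, while a single group always gives $Q = 0$, since the observed and expected edge counts then cancel exactly.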

Despite its widespread adoption, this approach suffers from a variety of serious conceptual and practical flaws, which have been documented extensively [9]. The most problematic one is that it purports to use an inferential criterion (a deviation from a null generative model) but is in fact merely descriptive. As was recognized very early, this method categorically fails at its own stated goal, since it always finds high-scoring partitions in networks sampled from its own null model [5].

The reason for this failure is that the method does not take into account the deviation from the null model in a statistically consistent manner. The modularity function is just a re-scaled version of the assortativity coefficient [10], a correlation measure of the community assignments seen at the endpoints of edges in the network. We should expect such a correlation value to be close to zero for a partition that is determined before the edges of the network are placed according to the null model, or equivalently, for a partition chosen at random. However, it is quite a different matter to find a partition that optimizes the value of $Q$ after the network is observed. The deviation from a null model computed in Equation 1 completely ignores the optimization step of Equation 2, although it is a crucial part of the algorithm. As a result, the method of modularity maximization tends to massively overfit, and find spurious communities even in networks sampled from its null model. We are searching for patterns of correlations in a random graph, and most of the time we will find them. This is a pitfall known as “data dredging” or “p-hacking”, where one searches exhaustively for different patterns in the same data and reports only those that are deemed significant, according to a criterion that does not take into account the fact that we are doing this search in the first place.

We demonstrate this problem in Figure 1, where we show the distribution of modularity values obtained with a uniform configuration model, in which every node $i$ has the same degree $k_i$, considering both a random partition and the one that maximizes $Q$. While for a random partition we find what we would expect, i.e. a value of $Q$ close to zero, for the optimized partition the value is substantially larger. Inspecting the optimized partition in Figure 1 (c), we see that it corresponds indeed to 15 seemingly clear assortative communities, which by construction bear no relevance to how the network was generated. They have been dredged out of randomness by the optimization procedure.

Figure 1: Modularity maximization systematically overfits, and finds spurious structures even in its own null model. In this example we consider a random network model with $N$ nodes, with every node having the same degree $k$. (a) Distribution of modularity values for a partition into 15 groups chosen at random, and for the optimized value of modularity, for networks sampled from the same model. (b) Adjacency matrix of a sample from the model, with the nodes ordered according to a random partition. (c) Same as (b), but with the nodes ordered according to the partition that maximizes modularity.
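The dredging effect is easy to reproduce at a small scale. The sketch below is not the experiment of Figure 1: as a simplifying assumption it uses a random graph with a fixed number of edges instead of the uniform-degree configuration model, and a greedy single-node update instead of a proper maximization of modularity, so the optimized value it reports is only a lower bound on what a better optimizer would find. All function names and parameter values are illustrative.

```python
import random
from collections import Counter
from itertools import combinations


def modularity(edges, b):
    """Q = sum_r [m_rr / E - (e_r / 2E)^2], equivalent to the double-sum form."""
    E = len(edges)
    internal = sum(1 for u, v in edges if b[u] == b[v])
    deg_sum = Counter()
    for u, v in edges:
        deg_sum[b[u]] += 1
        deg_sum[b[v]] += 1
    return internal / E - sum((e / (2 * E)) ** 2 for e in deg_sum.values())


def greedy_maximize(n, edges, b, B, rng, max_sweeps=100):
    """Move one node at a time to the group that most increases Q."""
    E = len(edges)
    nbrs = [[] for _ in range(n)]
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    b = list(b)
    deg_sum = [0] * B  # sum of degrees in each group
    for i in range(n):
        deg_sum[b[i]] += len(nbrs[i])
    for _ in range(max_sweeps):
        moved = False
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            k = len(nbrs[i])
            deg_sum[b[i]] -= k  # temporarily take node i out of its group
            links = Counter(b[j] for j in nbrs[i])
            # Change in Q (up to a constant) from placing node i in group r.
            gain = [links[r] / E - k * deg_sum[r] / (2 * E * E) for r in range(B)]
            best = max(range(B), key=gain.__getitem__)
            if gain[best] > gain[b[i]] + 1e-12:
                b[i] = best
                moved = True
            deg_sum[b[i]] += k
        if not moved:
            break
    return b


rng = random.Random(1)
n, E, B = 200, 500, 15
# A fully random graph: E distinct node pairs chosen uniformly, no planted communities.
edges = rng.sample(list(combinations(range(n), 2)), E)

b_random = [rng.randrange(B) for _ in range(n)]
b_opt = greedy_maximize(n, edges, b_random, B, rng)

q_random = modularity(edges, b_random)
q_opt = modularity(edges, b_opt)
print(f"random partition: Q = {q_random:.3f}, optimized partition: Q = {q_opt:.3f}")
```

Even this crude optimizer pulls a clearly positive modularity value out of a network that has no community structure by construction.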

Somewhat paradoxically, another problem with modularity maximization is that in addition to systematically overfitting, it also systematically underfits. This occurs via the so-called “resolution limit”: in a connected network¹ the method cannot find more than $\sqrt{2E}$ communities [6], even if they seem intuitive or can be found by other methods. An example of this is shown in Figure 2, where for a network generated with the SBM containing 30 communities, modularity maximization finds only 18, while an inferential approach has no problems finding the true structure. There are attempts to counteract the resolution limit by introducing a “resolution parameter” to the modularity function, but they are in general ineffective (see [1]).

¹ Modularity maximization, like many descriptive community detection methods, will always place connected components in different communities. This is another clear distinction with inferential approaches, since fully random models—without latent community structure—can generate disconnected networks if they are sufficiently sparse. From an inferential point of view, it is therefore incorrect to assume that every connected component must belong to a different community.

Figure 2: The resolution limit of modularity maximization prevents small communities from being identified, even if there is sufficient statistical evidence to support them. Panel (a) shows a network with $B=30$ communities sampled from an assortative SBM parametrization. The colors indicate the communities found with modularity maximization, where several pairs of true communities are merged together. Panel (b) shows the inference result of an assortative SBM [11], recovering the true communities with perfect accuracy. Panels (c) and (d) show the results for a similar model where a larger community has been introduced. In (c) we see the results of modularity maximization, which not only merges the smaller communities together, but also splits the larger community into several spurious ones, thus both underfitting and overfitting different parts of the network at the same time. In (d) we see the result obtained by inferring the SBM, which once again finds the correct answer.
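The resolution limit can be checked by hand on a classic toy example, which is an assumption of this sketch and not the network of Figure 2: a ring of cliques of five nodes each, with adjacent cliques joined by a single edge. The code compares the modularity of the intuitive partition (one group per clique) with that of a partition that merges adjacent cliques in pairs.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def ring_of_cliques(n_cliques, size):
    """Cliques of `size` nodes on a ring, adjacent cliques joined by one edge."""
    edges = []
    for c in range(n_cliques):
        nodes = range(c * size, (c + 1) * size)
        edges.extend(combinations(nodes, 2))
        first_of_next = ((c + 1) % n_cliques) * size
        edges.append((c * size + size - 1, first_of_next))
    return edges


def modularity(edges, b):
    """Q = sum_r [m_rr / E - (e_r / 2E)^2]."""
    E = len(edges)
    internal = sum(1 for u, v in edges if b[u] == b[v])
    deg_sum = Counter()
    for u, v in edges:
        deg_sum[b[u]] += 1
        deg_sum[b[v]] += 1
    return internal / E - sum((e / (2 * E)) ** 2 for e in deg_sum.values())


def compare(n_cliques, size=5):
    """Return E, Q with one group per clique, and Q with cliques merged in pairs."""
    edges = ring_of_cliques(n_cliques, size)
    n = n_cliques * size
    single = [v // size for v in range(n)]
    pairs = [v // (2 * size) for v in range(n)]
    return len(edges), modularity(edges, single), modularity(edges, pairs)


for n_cliques in (10, 30):
    E, q_single, q_pairs = compare(n_cliques)
    print(n_cliques, round(sqrt(2 * E), 1), round(q_single, 4), round(q_pairs, 4))
```

With 10 cliques ($E = 110$, $\sqrt{2E} \approx 14.8$) the intuitive partition has the higher modularity (about 0.809 against 0.755). With 30 cliques ($E = 330$, $\sqrt{2E} \approx 25.7$) the number of cliques exceeds the bound, and merging them in pairs scores higher (about 0.888 against 0.876), even though the cliques themselves have not changed.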

These two problems, overfitting and underfitting, can occur in tandem, such that portions of the network dominated by randomness are spuriously revealed to contain communities, whereas other portions with clear modular structure can have it obscured. The result is a very unreliable method to capture the structure of heterogeneous networks. We demonstrate this in Figure 2 (c) and (d).

In addition to these major problems, modularity maximization also often possesses a degenerate landscape of solutions, with very different partitions having similar values of $Q$ [7]. In these situations the partition with maximum value of modularity can be a poor representative of the entire set of high-scoring solutions and depend on idiosyncratic details of the data rather than general patterns, which can be interpreted as a different kind of overfitting.

The combined effects of underfitting and overfitting can make the results obtained with the method unreliable and difficult to interpret. As a demonstration of the systematic nature of the problem, in Figure 3 (a) we show the number of communities obtained using modularity maximization for 263 empirical networks of various sizes and belonging to different domains, obtained from the Netzschleuder repository. Since the networks considered are all connected, the values are always below $\sqrt{2E}$, due to the resolution limit; but otherwise they are well distributed over the allowed range. However, in Figure 3 (b) we show the same analysis, but for a version of each network that is fully randomized, while preserving the degree sequence. In this case, the number of groups remains distributed in the same range (sometimes even exceeding the resolution limit, because the randomized versions can end up disconnected). As Figure 3 (c) shows, the number of groups found for the randomized networks is strongly correlated with the original ones, despite the fact that the former have no latent community structure. This is a strong indication of the substantial amount of noise that is incorporated into the partitions found with the method.

Figure 3: Modularity maximization incorporates a substantial amount of noise into its results. (a) Number of groups found using modularity maximization for 263 empirical networks as a function of the number of edges. The dashed line corresponds to the upper bound due to the resolution limit. (b) The same as in (a) but with randomized versions of each network. (c) Correspondence between the number of groups of the original and randomized network. The dashed line shows the diagonal.
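The randomization used for this kind of comparison can be sketched with repeated double-edge swaps, which rewire a network while keeping every node's degree fixed. This is a minimal illustration under stated assumptions (simple undirected graphs, a hypothetical function name, an arbitrary number of swaps), not the procedure used to produce Figure 3.

```python
import random
from collections import Counter


def degrees(edges):
    """Degree of every node that appears in the edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Randomize a simple graph with double-edge swaps:
    (a, b), (c, d) -> (a, d), (c, b), rejecting self-loops and multi-edges."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:  # pick the orientation of the second edge at random
            c, d = d, c
        if a == d or c == b:
            continue  # the swap would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # the swap would create a multi-edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


# Toy input: a ring of 20 nodes, each linked to its next two neighbours.
original = [(i, (i + d) % 20) for i in range(20) for d in (1, 2)]
shuffled = degree_preserving_shuffle(original, n_swaps=200, seed=1)
print(degrees(original) == degrees(shuffled))
```

Whatever structure a community detection method reports on the shuffled network cannot reflect anything beyond the degree sequence, which is what makes the comparison in Figure 3 informative.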

The systematic overfitting of modularity maximization (as well as of other descriptive methods such as Infomap) has also been demonstrated recently in [12], from the point of view of edge prediction, on a separate empirical dataset of 572 networks from various domains.

Although many of the problems with modularity maximization were long known, for some time there were no principled solutions to them, but this is no longer the case. In the table below we summarize some of the main problems with modularity and how they are solved with inferential approaches.

Problem	Principled solution via inference
Modularity maximization overfits, and finds modules in fully random networks [5].	Bayesian inference of the SBM is designed from the ground up to avoid this problem in a principled way and systematically succeeds [13].
Modularity maximization has a resolution limit, and finds at most $\sqrt{2E}$ groups in connected networks [6].	Inferential approaches with hierarchical priors [14] [15] or strictly assortative structures [11] do not have any appreciable resolution limit, and can find a maximum number of groups that scales as $O(N/\log N)$. Importantly, this is achieved without sacrificing the robustness against overfitting.
Modularity maximization has a characteristic scale, and tends to find communities of similar size; in particular with the same sum of degrees.	Hierarchical priors can be specifically chosen to be a priori agnostic about characteristic sizes, densities of groups and degree sequences [15], such that these are not imposed, but instead obtained from inference, in an unbiased way.
Modularity maximization can only find strictly assortative communities.	Inferential approaches can be based on any generative model. The general SBM will find any kind of mixing pattern in an unbiased way, and has no problems identifying modular structure in bipartite networks, core-periphery networks, and any mixture of these or other patterns. There are also specialized versions for bipartite [16], core-periphery [17], and assortative patterns [11], if these are being searched exclusively.
The solution landscape of modularity maximization is often degenerate, with many different solutions with close to the same modularity value [7], and with no clear way to select between them.	Inferential methods are characterized by a posterior distribution of partitions. The consensus or dissensus between the different solutions [18] can be used to determine how many cohesive hypotheses can be extracted from inference, and to what extent the model being used is a poor or a good fit for the network.

Because of the above problems, the use of modularity maximization should be discouraged, since it is demonstrably not fit for purpose as an inferential method. As a consequence, the use of modularity maximization in any recent network analysis can arguably be considered a “red flag” that strongly indicates methodological carelessness. In the absence of secondary evidence supporting the alleged community structures found, or extreme care to counteract the several limitations of the method, the safest assumption is that the results obtained with that method tend to contain a substantial amount of noise, rendering any inferential conclusion derived from them highly suspect.

As a final note, we focus on modularity here not only for its widespread adoption but also because of its emblematic character. At a fundamental level, all of its shortcomings are shared with any descriptive method in the literature, to varied but always non-negligible degrees.

References

[1]

T. P. Peixoto, Descriptive Vs. Inferential Community Detection in Networks: Pitfalls, Myths and Half-Truths, Elements in the Structure and Dynamics of Complex Networks (2023).

[2]

M. E. J. Newman, Modularity and Community Structure in Networks, Proceedings of the National Academy of Sciences 103, 8577 (2006).

[3]

B. Fosdick, D. Larremore, J. Nishimura, and J. Ugander, Configuring Random Graph Models with Fixed Degree Sequences, SIAM Review 60, 315 (2018).

[4]

F. Chung and L. Lu, Connected Components in Random Graphs with Given Expected Degree Sequences, Annals of Combinatorics 6, 125 (2002).

[5]

R. Guimerà, M. Sales-Pardo, and L. A. N. Amaral, Modularity from Fluctuations in Random Graphs and Complex Networks, Physical Review E 70, 025101 (2004).

[6]

S. Fortunato and M. Barthélemy, Resolution Limit in Community Detection, Proceedings of the National Academy of Sciences 104, 36 (2007).

[7]

B. H. Good, Y.-A. de Montjoye, and A. Clauset, Performance of Modularity Maximization in Practical Contexts, Physical Review E 81, 046106 (2010).

[8]

S. Fortunato, Community Detection in Graphs, Physics Reports 486, 75 (2010).

[9]

S. Fortunato and D. Hric, Community Detection in Networks: A User Guide, Physics Reports (2016).

[10]

M. E. J. Newman, Mixing Patterns in Networks, Phys. Rev. E 67, 026126 (2003).

[11]

L. Zhang and T. P. Peixoto, Statistical Inference of Assortative Community Structures, Physical Review Research 2, 043271 (2020).

[12]

A. Ghasemian, H. Hosseinmardi, and A. Clauset, Evaluating Overfit and Underfit in Models of Network Community Structure, IEEE Transactions on Knowledge and Data Engineering 1 (2019).

[13]

T. P. Peixoto, Bayesian Stochastic Blockmodeling, in Advances in Network Clustering and Blockmodeling (John Wiley & Sons, Ltd, 2019), pp. 289–332.

[14]

T. P. Peixoto, Hierarchical Block Structures and High-Resolution Model Selection in Large Networks, Physical Review X 4, 011047 (2014).

[15]

T. P. Peixoto, Nonparametric Bayesian Inference of the Microcanonical Stochastic Block Model, Physical Review E 95, 012317 (2017).

[16]

D. B. Larremore, A. Clauset, and A. Z. Jacobs, Efficiently Inferring Community Structure in Bipartite Networks, Physical Review E 90, 012805 (2014).

[17]

X. Zhang, T. Martin, and M. E. J. Newman, Identification of Core-Periphery Structure in Networks, Physical Review E 91, 032803 (2015).

[18]

T. P. Peixoto, Revealing Consensus and Dissensus Between Network Partitions, Physical Review X 11, 021003 (2021).

Inferring, explaining, and compressing

Tiago P. Peixoto — Thu, 02 Dec 2021 23:00:00 GMT

This is a continuation of the previous blog post, and a slightly modified version of chapter II in [1].

Inferential approaches to community detection (see [2] for a detailed introduction) are designed to provide explanations for network data in a principled manner. They are based on the formulation of generative models that include the notion of community structure in the rules of how the edges are placed. More formally, they are based on the definition of a likelihood $P(\boldsymbol A|\boldsymbol b)$ for the network $\boldsymbol A$ conditioned on a partition $\boldsymbol b$, and the inference is obtained via the posterior distribution, according to Bayes' rule, i.e.

$$P(\boldsymbol b|\boldsymbol A) = \frac{P(\boldsymbol A|\boldsymbol b)P(\boldsymbol b)}{P(\boldsymbol A)}, \tag{1}$$

where $P(\boldsymbol b)$ is the prior probability for a partition $\boldsymbol b$. Overwhelmingly, the models used for this purpose are variations of the stochastic block model (SBM) [3], which, in addition to the node partition, takes the probability of edges being placed between the different groups as an additional set of parameters. A particularly expressive variation is the degree-corrected SBM (DC-SBM) [4], with a marginal likelihood given by [5]

$$P(\boldsymbol A|\boldsymbol b) = \sum_{\boldsymbol e, \boldsymbol k} P(\boldsymbol A|\boldsymbol k, \boldsymbol e, \boldsymbol b)P(\boldsymbol k|\boldsymbol e, \boldsymbol b)P(\boldsymbol e|\boldsymbol b), \tag{2}$$

where $\boldsymbol e$ is a matrix with elements $e_{rs}$ specifying how many edges go between groups $r$ and $s$, and $\boldsymbol k$ are the degrees of the nodes. Therefore, this model specifies that, conditioned on a partition $\boldsymbol b$, first the edge counts $\boldsymbol e$ are sampled from a prior distribution $P(\boldsymbol e|\boldsymbol b)$, followed by the degrees from the prior $P(\boldsymbol k|\boldsymbol e, \boldsymbol b)$, and finally the network is wired together according to the probability $P(\boldsymbol A|\boldsymbol k, \boldsymbol e, \boldsymbol b)$, which respects the constraints given by $\boldsymbol k$, $\boldsymbol e$, and $\boldsymbol b$. See Figure 1 (a) for an illustration of this process.

Figure 1: Inferential community detection considers a generative process (a), where the unobserved model parameters are sampled from prior distributions. In the case of the DC-SBM, these are the priors for the partition $P(\boldsymbol b)$, the number of edges between groups $P(\boldsymbol e|\boldsymbol b)$, and the node degrees, $P(\boldsymbol k|\boldsymbol e, \boldsymbol b)$. Finally, the network itself is sampled from its model, $P(\boldsymbol A|\boldsymbol k, \boldsymbol e, \boldsymbol b)$. The inference procedure (b) consists in inverting the generative process given an observed network $\boldsymbol A$, corresponding to a posterior distribution $P(\boldsymbol b|\boldsymbol A)$, which can then be summarized by a marginal probability that a node belongs to a given group (represented as pie charts on the nodes).
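The generative direction of this process can be illustrated with a much simpler relative of the DC-SBM. The sketch below assumes a plain planted-partition SBM, in which each pair of nodes is connected independently with one probability inside groups and another between groups. It has no degree correction and no priors on the parameters, so it is a stand-in for the idea and not the model described above; the names `sample_sbm`, `p_in` and `p_out` are hypothetical.

```python
import random
from itertools import combinations


def sample_sbm(b, p_in, p_out, seed=0):
    """Sample a simple undirected graph given a partition b: each node pair
    is connected with probability p_in (same group) or p_out (different groups)."""
    rng = random.Random(seed)
    edges = []
    for i, j in combinations(range(len(b)), 2):
        p = p_in if b[i] == b[j] else p_out
        if rng.random() < p:
            edges.append((i, j))
    return edges


b = [i // 20 for i in range(60)]  # planted partition: three groups of 20 nodes
edges = sample_sbm(b, p_in=0.5, p_out=0.02, seed=7)

within = sum(b[i] == b[j] for i, j in edges)
between = len(edges) - within
print(len(edges), within, between)
```

Inference runs this process in reverse: given only `edges`, the task is to recover a partition like `b` that explains where the edges were placed.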

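This model formulation includes fully random networks as the special case when we have a single group. Together with the Bayesian approach, the use of this model will inherently favor a more parsimonious account of the data whenever it does not warrant a more complex description, amounting to a formal implementation of Occam's razor. This is best seen by making a formal connection with information theory, and noticing that we can write the numerator of Equation 1 as

$$P(\boldsymbol A|\boldsymbol b)P(\boldsymbol b) = 2^{-\Sigma(\boldsymbol A, \boldsymbol b)}, \tag{3}$$

where the quantity $\Sigma(\boldsymbol A, \boldsymbol b)$ is known as the description length [6] of the network. It is computed as:¹

$$\Sigma(\boldsymbol A, \boldsymbol b) = -\log_2 P(\boldsymbol A|\boldsymbol k, \boldsymbol e, \boldsymbol b) - \log_2 P(\boldsymbol k, \boldsymbol e, \boldsymbol b). \tag{4}$$

¹ Note that the sum in Equation 2 vanishes because only one term is non-zero given a fixed network $\boldsymbol A$.

The second term in the above equation, with $P(\boldsymbol k, \boldsymbol e, \boldsymbol b) = P(\boldsymbol k|\boldsymbol e, \boldsymbol b)P(\boldsymbol e|\boldsymbol b)P(\boldsymbol b)$, quantifies the amount of information in bits necessary to encode the parameters of the model.² The first term determines how many bits are necessary to encode the network itself, once the model parameters are known. This means that if Bob wants to communicate to Alice the structure of a network $\boldsymbol A$, he first needs to transmit $-\log_2 P(\boldsymbol k, \boldsymbol e, \boldsymbol b)$ bits of information to describe the parameters $\boldsymbol b$, $\boldsymbol e$, and $\boldsymbol k$ (the partition, the edge counts between groups, and the degrees), and then finally transmit the remaining $-\log_2 P(\boldsymbol A|\boldsymbol k, \boldsymbol e, \boldsymbol b)$ bits to describe the network itself. Then, Alice will be able to understand the message by first decoding the parameters from the first part of the message, and using that knowledge to obtain the network from the second part, without any errors.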

² If a value occurs with probability $p$, this means that in order to transmit it in a communication channel we need to answer at least $-\log_2 p$ yes-or-no questions to decode its value exactly. Therefore we need to answer one yes-or-no question for a value with $p = 1/2$, zero questions for $p = 1$, and $\log_2 N$ questions for uniformly distributed values with $p = 1/N$. This value is called “information content”, and essentially measures the degree of “surprise” when encountering a value sampled from a distribution. See [7] for a thorough but accessible introduction to information theory and its relation to inference.
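The counting of yes-or-no questions in the footnote amounts to a one-line function. The name `information_content` is an illustrative choice.

```python
from math import log2


def information_content(p):
    """Number of bits (yes-or-no questions) needed to encode a value
    that occurs with probability p, i.e. -log2(p)."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in the interval (0, 1]")
    return log2(1 / p)


print(information_content(1 / 2))     # a fair coin flip: one question
print(information_content(1))         # a certain outcome: no questions
print(information_content(1 / 1024))  # one of 1024 equally likely values
```

The three calls correspond to the three cases in the footnote, giving 1, 0 and 10 bits respectively.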

What the above connection shows is that there is a formal equivalence between inferring the communities of a network and compressing it. This happens because finding the most likely partition from the posterior is equivalent to minimizing the description length used by Bob to transmit a message to Alice containing the whole network.

Data compression amounts to a formal implementation of Occam's razor because it penalizes models that are too complicated: if we want to describe a network using many communities, then the model part of the description length will be large, and Bob will need many bits to transmit the model parameters to Alice. However, increasing the complexity of the model will also reduce the first term, $-\log_2 P(\boldsymbol A|\boldsymbol k, \boldsymbol e, \boldsymbol b)$, since there are fewer networks that are compatible with the bigger set of constraints, and hence Bob will need a shorter second part of the message to convey the network itself once the parameters are known. Compression (and hence inference), therefore, is a balancing act between model complexity and quality of fit, where an increase in the former is only justified when it results in an even larger increase of the latter, such that the total description length is minimized.
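This balancing act can be made concrete with a simplified two-part code. The sketch below assumes a microcanonical SBM without degree correction and with flat priors on the edge counts, the number of groups, the group sizes and the partition. These choices are assumptions made to keep the example short, and the result is not the description length of the DC-SBM discussed above. The function name and the two toy networks are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations
from math import comb, factorial, log2


def description_length(n, edges, b):
    """Return (data_bits, model_bits) for a simple graph under a microcanonical
    SBM without degree correction. Group labels must be 0, ..., B-1."""
    B = len(set(b))
    E = len(edges)
    sizes = Counter(b)
    e = Counter()  # e[r, s]: number of edges between groups r <= s
    for u, v in edges:
        r, s = sorted((b[u], b[v]))
        e[r, s] += 1
    # Data part: log2 of the number of graphs compatible with the edge counts.
    data = 0.0
    for r in range(B):
        for s in range(r, B):
            pairs = comb(sizes[r], 2) if r == s else sizes[r] * sizes[s]
            data += log2(comb(pairs, e[r, s]))
    # Model part: edge counts, number of groups, group sizes, and the partition.
    model = log2(comb(B * (B + 1) // 2 + E - 1, E))
    model += log2(n) + log2(comb(n - 1, B - 1))
    model += log2(factorial(n)) - sum(log2(factorial(m)) for m in sizes.values())
    return data, model


n = 20
one_group = [0] * n
two_groups = [0] * 10 + [1] * 10

# Network with clear structure: two cliques of 10 nodes joined by one edge.
planted = [p for p in combinations(range(n), 2) if (p[0] < 10) == (p[1] < 10)]
planted.append((9, 10))

# Fully random network with the same number of nodes and edges.
rng = random.Random(3)
noise = rng.sample(list(combinations(range(n), 2)), len(planted))

for name, edges in (("planted", planted), ("random", noise)):
    for label, b in (("B=1", one_group), ("B=2", two_groups)):
        data, model = description_length(n, edges, b)
        print(f"{name:8s} {label}: data={data:6.1f} model={model:5.1f} total={data + model:6.1f}")
```

For the structured network the extra bits spent on the two-group model are more than repaid by a much shorter data part. For the random network the same division barely shortens the data part, so the total grows and the single group is preferred.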

The compression approach avoids overfitting the data because of a powerful fact from information theory, known as Shannon's source coding theorem [8], which states that it is impossible to compress data sampled from a distribution $P(x)$ using fewer bits per symbol than the entropy of the distribution, $H = -\sum_x P(x)\log_2 P(x)$. In our context, this means that it is impossible, for example, to compress a fully random network using an SBM with more than one group.³ Consequently, when faced with an example like the one in the figure we considered in the previous blog post, inferential methods will detect a single community comprising all nodes in the network, since any further division does not provide any increased compression, or equivalently, no augmented explanatory power. From the inferential point of view, a partition like the one in panel (b) of that figure overfits the data, since it incorporates irrelevant random features (a.k.a. “noise”) into its description.

³ More accurately, this becomes impossible only when the network becomes asymptotically infinite; for finite networks the probability of compression is only vanishingly small.

Figure 2 (a) shows an example of the results obtained with an inferential community detection algorithm, for a network sampled from the SBM. As shown in Figure 2 (b), the obtained partitions are still valid when carried over to an independent sample of the model, because the algorithm is capable of separating the general underlying pattern from the random fluctuations. As a consequence of this separability, this kind of algorithm does not find communities in fully random networks, which are composed only of “noise.”

Figure 2: Inferential community detection aims to find a partition of the network according to a fit of a generative model that can explain its structure. Panel (a) shows a network sampled from a stochastic block model (SBM) with 6 groups, where the group assignments were hidden from view. The node colors show the groups found via Bayesian inference of the SBM. Panel (b) shows another network sampled from the same SBM, together with the same partition found in (a), showing that it carries substantial explanatory power.

Role of inferential approaches in community detection

Inferential approaches based on the SBM have a long history, and were introduced for the study of social networks in the early 1980s [3]. But despite their age, and despite having appeared repeatedly in the literature over the years (also under different names in other contexts), they entered the mainstream community detection literature rather late, arguably after the influential paper by Karrer and Newman that introduced the DC-SBM [4] in 2011, at a point where descriptive approaches were already dominating. However, despite the dominance of descriptive methods, the existence of inferential criteria was already long noticeable. In fact, in a well-known attempt to systematically compare the quality of a variety of descriptive community detection methods, the authors of [9] proposed what is now known as the LFR benchmark, offered as a more realistic alternative to the simpler Newman-Girvan benchmark [10] introduced earlier. Both are in fact generative models, essentially particular cases of the DC-SBM, containing a “ground truth” community label assignment, against which the results of various algorithms are supposed to be compared. Clearly, this is an inferential evaluation criterion, although, historically, virtually all of the methods compared against that benchmark are descriptive in nature [11] (these studies were conducted mostly before inferential approaches had gained more traction). The use of such a criterion already betrays that the answer to the litmus test considered in the previous post would be “yes,” and therefore descriptive approaches are fundamentally unsuitable for the task. In contrast, methods based on statistical inference are not only more principled, but in fact provably optimal in the inferential scenario, in the sense that all conceivable algorithms can obtain either equal or worse performance, but none can do better [12].

The conflation one often finds between descriptive and inferential goals in the literature of community detection likely stems from the fact that while it is easy to define benchmarks in the inferential setting, it is substantially more difficult to do so in a descriptive setting. Given any descriptive method (modularity maximization, Infomap, Markov stability, etc.) it is usually problematic to determine for which network the method is optimal (or even if such a network exists), and what would be a canonical output that would be unambiguously correct. In fact, the difficulty of establishing these fundamental references already serves as evidence that the task itself is ill-defined. On the other hand, taking an inferential route forces one to start with the right answer, via a well-specified generative model that articulates what the communities actually mean with respect to the network structure. Based on this precise definition, one then derives the optimal detection method by employing Bayes' rule.

It is also useful to observe that inferential analyses of aspects of the network other than directly its structure might still be only descriptive of the structure itself. A good example of this is the modelling of dynamics that take place on a network, such as a random walk. This is precisely the case of the Infomap method, which models a simulated teleporting random walk on a network in an inferential manner, using for that a division of the network into groups. While this approach can be considered inferential with respect to an artificial dynamics, it is still only descriptive when it comes to the actual network structure (and will suffer the same problems, such as finding communities in fully random networks). Communities found in this way could be useful for particular tasks, such as to identify groups of nodes that would be similarly affected by a diffusion process. This could be used, for example, to prevent or facilitate the diffusion by removing or adding edges between the identified groups. In this setting, the answer to the litmus test would also be “no”, since what is important is how the network “is” (i.e. how a random walk behaves on it), not how it came to be, or if its features are there by chance alone. Once more, the important issue to remember is that the groups identified in this manner cannot be interpreted as having any explanatory power about the network structure itself, and cannot be used reliably to extract inferential conclusions from it. We are firmly in a descriptive, not inferential, setting with respect to the network structure.

Another important difference between inferential and descriptive approaches is worth mentioning. Descriptive approaches are tied to very particular contexts, and cannot be directly compared to one another. This has caused great consternation in the literature, since there is a vast number of such methods, and little robust methodology on how to compare them. Indeed, why should we expect that the modules found by optimizing task scheduling should be comparable to those that optimize the description of a dynamics? In contrast, inferential approaches all share the same underlying context: they attempt to explain the network structure; they vary only in how this is done. They are, therefore, amenable to principled model selection procedures, designed to evaluate which is the most appropriate fit for any particular network, even if the models used operate with very different parametrizations. In this situation, the multiplicity of different models available becomes a boon rather than a hindrance, since they all contribute to a bigger toolbox we have at our disposal when trying to understand empirical observations.

Finally, inferential approaches offer additional advantages that make them more suitable as part of a scientific pipeline. In particular, they can be naturally extended to accommodate measurement uncertainties [13], an unavoidable property of empirical data that descriptive methods almost universally fail to consider. This information can be used not only to propagate the uncertainties to the community assignments [14] but also to reconstruct the missing or noisy measurements of the network itself [15]. Furthermore, inferential approaches can be coupled with even more indirect observations such as time-series on the nodes [16], instead of a direct measurement of the edges of the network, such that the network itself is reconstructed, not only the community structure [17]. All these extensions are possible because inferential approaches give us more than just a division of the network into groups; they give us a model estimate of the network, containing insights about its formation mechanism.

Behind every description there is an implicit generative model

From a purely mathematical perspective, there is actually no formal distinction between descriptive and inferential methods, because every descriptive method can be mapped to an inferential one, according to some implicit model. Therefore, whenever we are attempting to interpret the results of a descriptive community detection method in an inferential way — i.e. make a statement about how the network came to be — we cannot in fact avoid making implicit assumptions about the model generating process that lies behind it. (At first this statement seems to undermine the distinction we have been making between descriptive and inferential methods, but in fact this is not the case, as we will see below.)

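It is not difficult to demonstrate that it is possible to formulate any conceivable community detection method as a particular inferential method. Let us consider an arbitrary quality function of a network $\boldsymbol A$ and a node partition $\boldsymbol b$,

$$W(\boldsymbol A, \boldsymbol b) \in \mathbb{R},$$

which is used to perform community detection via the optimization

$$\boldsymbol b^{*} = \underset{\boldsymbol b}{\operatorname{argmax}}\; W(\boldsymbol A, \boldsymbol b). \tag{4}$$

We can then interpret the quality function $W(\boldsymbol A, \boldsymbol b)$ as the “Hamiltonian” of a posterior distribution

$$P(\boldsymbol b \mid \boldsymbol A) = \frac{e^{\beta W(\boldsymbol A, \boldsymbol b)}}{Z(\boldsymbol A)},$$

with normalization $Z(\boldsymbol A) = \sum_{\boldsymbol b} e^{\beta W(\boldsymbol A, \boldsymbol b)}$. By making $\beta \to \infty$ we recover the optimization of Equation 4, or we may simply try to find the most likely partition according to the posterior, in which case $\beta > 0$ remains an arbitrary parameter. Therefore, employing Bayes' rule in the opposite direction, we obtain the following effective generative model:

$$P(\boldsymbol A \mid \boldsymbol b) = \frac{P(\boldsymbol b \mid \boldsymbol A)\, P(\boldsymbol A)}{P(\boldsymbol b)} = \frac{e^{\beta W(\boldsymbol A, \boldsymbol b)}}{Z(\boldsymbol A)} \frac{P(\boldsymbol A)}{P(\boldsymbol b)},$$

where $P(\boldsymbol A) = \sum_{\boldsymbol b} P(\boldsymbol A \mid \boldsymbol b) P(\boldsymbol b)$ is the marginal distribution over networks, and $P(\boldsymbol b)$ is the prior distribution for the partition. Due to the normalization of $P(\boldsymbol A \mid \boldsymbol b)$ we have the following constraint that needs to be fulfilled:

$$\sum_{\boldsymbol A} \frac{e^{\beta W(\boldsymbol A, \boldsymbol b)}}{Z(\boldsymbol A)} P(\boldsymbol A) = P(\boldsymbol b).$$

Therefore, not all choices of $P(\boldsymbol A)$ and $P(\boldsymbol b)$ are compatible with the posterior distribution and the exact possibilities will depend on the actual shape of $W(\boldsymbol A, \boldsymbol b)$. However, one choice that is always possible is

$$P(\boldsymbol A) = \frac{Z(\boldsymbol A)}{\Xi}, \qquad P(\boldsymbol b) = \frac{Z(\boldsymbol b)}{\Xi},$$

with $Z(\boldsymbol b) = \sum_{\boldsymbol A} e^{\beta W(\boldsymbol A, \boldsymbol b)}$ and $\Xi = \sum_{\boldsymbol A, \boldsymbol b} e^{\beta W(\boldsymbol A, \boldsymbol b)}$. Taking this choice leads to the effective generative model

$$P(\boldsymbol A \mid \boldsymbol b) = \frac{e^{\beta W(\boldsymbol A, \boldsymbol b)}}{Z(\boldsymbol b)}.$$

Therefore, inferentially interpreting a community detection algorithm with a quality function $W(\boldsymbol A, \boldsymbol b)$ is equivalent to assuming the generative model $P(\boldsymbol A \mid \boldsymbol b)$ and prior $P(\boldsymbol b)$ above. Furthermore, this also means that any arbitrary community detection algorithm implies a description length⁴ given (in nats) by

$$\Sigma(\boldsymbol A, \boldsymbol b) = -\ln P(\boldsymbol A \mid \boldsymbol b) P(\boldsymbol b) = -\beta W(\boldsymbol A, \boldsymbol b) + \ln \sum_{\boldsymbol A', \boldsymbol b'} e^{\beta W(\boldsymbol A', \boldsymbol b')}. \tag{6}$$

⁴ The description length of Equation 6 is only valid if there are no further parameters in the quality function $W(\boldsymbol A, \boldsymbol b)$ other than $\boldsymbol b$ that are being optimized.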
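The construction above can be checked by brute force on a network small enough to enumerate. The following is only a minimal sketch: the quality function `quality` is a hypothetical toy (internal edges minus half the internal node pairs) chosen for illustration, and the size `N = 3` and the value of `BETA` are arbitrary assumptions. It builds the implied likelihood, prior, posterior and description length from the quality function alone.

```python
import itertools
import math

N = 3        # toy size, so that all graphs and partitions can be enumerated
BETA = 1.0   # arbitrary positive "inverse temperature"
PAIRS = list(itertools.combinations(range(N), 2))


def all_graphs():
    """Yield every simple undirected graph on N nodes as a frozenset of edges."""
    for mask in itertools.product([0, 1], repeat=len(PAIRS)):
        yield frozenset(pair for pair, bit in zip(PAIRS, mask) if bit)


def all_partitions():
    """Yield every partition of N nodes as a restricted-growth label tuple."""
    def grow(prefix):
        if len(prefix) == N:
            yield tuple(prefix)
            return
        for label in range(max(prefix, default=-1) + 2):
            yield from grow(prefix + [label])
    yield from grow([])


def quality(A, b):
    """Hypothetical toy quality: internal edges minus half the internal pairs."""
    internal_edges = sum(1 for i, j in A if b[i] == b[j])
    internal_pairs = sum(1 for i, j in PAIRS if b[i] == b[j])
    return internal_edges - 0.5 * internal_pairs


GRAPHS = list(all_graphs())
PARTS = list(all_partitions())
weight = {(A, b): math.exp(BETA * quality(A, b)) for A in GRAPHS for b in PARTS}

Z_A = {A: sum(weight[A, b] for b in PARTS) for A in GRAPHS}  # normalizes P(b|A)
Z_b = {b: sum(weight[A, b] for A in GRAPHS) for b in PARTS}  # normalizes P(A|b)
Xi = sum(weight.values())                                    # sum over all (A, b)


def posterior(b, A):
    return weight[A, b] / Z_A[A]


def likelihood(A, b):
    return weight[A, b] / Z_b[b]


def prior(b):
    return Z_b[b] / Xi


def marginal(A):
    return Z_A[A] / Xi


def description_length(A, b):
    """Implied description length in nats: -beta * W(A, b) + ln(Xi)."""
    return -BETA * quality(A, b) + math.log(Xi)
```

For any positive `BETA`, the partition that maximizes `posterior(b, A)` for a fixed graph is the same one that maximizes `quality(A, b)`, since the exponential is monotonic. The enumeration grows too quickly to be useful beyond a handful of nodes; it is meant only to make the bookkeeping concrete.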

What the above shows is that there is no such thing as a “model-free” community detection method, since they are all equivalent to the inference of some generative model. The only difference to a direct inferential method is that in that case the modelling assumptions are made explicitly, inviting rather than preventing scrutiny. Most often, the effective model and prior that are equivalent to an ad hoc community detection method will be difficult to interpret, justify, or even compute.

Furthermore there is no guarantee that the obtained description length of Equation 6 will yield a competitive or even meaningful compression. In particular, there is no guarantee that this effective inference will not overfit the data. Although we mentioned in the previous section that inference and compression are equivalent, the compression achieved when considering a particular generative model is constrained by the assumptions encoded in its likelihood and prior. If these are poorly chosen, no actual compression might be achieved, for example when comparing to the one obtained with a fully random model. This is precisely what happens with descriptive community detection methods: they overfit because their implicit modelling assumptions do not accommodate the possibility that a network may be fully random, or contain a balanced mixture of structure and randomness.

Since we can always interpret any community detection method as inferential, is it still meaningful to categorize some methods as descriptive? Arguably yes, because directly inferential approaches make their generative models and priors explicit, while for a descriptive method we need to extract them by reverse engineering. Explicit modelling allows us to make judicious choices about the model and prior that reflect the kinds of structures we want to detect, relevant scales or lack thereof, and many other aspects that improve their performance in practice, and our understanding of the results. With implicit assumptions we are “flying blind”, relying substantially on serendipity and trial-and-error, not always with great success.

It is not uncommon to find criticisms of inferential methods due to a perceived implausibility of the generative models used — such as the conditional independence of the placement of the edges present in the SBM — although these assumptions are also present, but only implicitly, in other methods, like modularity maximization (see [1]).

The above inferential interpretation is not specific to community detection, but is in fact valid for any learning problem. The set of explicit or implicit assumptions that must come with any learning algorithm is called an “inductive bias”. An algorithm is expected to function optimally only if its inductive bias agrees with the actual instances of the problems encountered. It is important to emphasize that no algorithm can be free of an inductive bias: we can only choose which intrinsic assumptions we make about how likely we are to encounter a particular kind of data, not whether we are making an assumption. Therefore, it is particularly problematic when a method does not articulate explicitly what these assumptions are, since even if they are hidden from view, they exist nonetheless, and still need to be scrutinized and justified. This means we should be particularly skeptical of the impossible claim that a learning method is “model free”, since this denomination is more likely to signal an unwillingness to expose the underlying modelling assumptions, which could potentially be revealed as unappealing and fragile when eventually forced to come under scrutiny.

References

[1]

T. P. Peixoto, Descriptive Vs. Inferential Community Detection in Networks: Pitfalls, Myths and Half-Truths, Elements in the Structure and Dynamics of Complex Networks (2023).

[2]

T. P. Peixoto, Bayesian Stochastic Blockmodeling, in Advances in Network Clustering and Blockmodeling (John Wiley & Sons, Ltd, 2019), pp. 289–332.

[3]

P. W. Holland, K. B. Laskey, and S. Leinhardt, Stochastic Blockmodels: First Steps, Social Networks 5, 109 (1983).

[4]

B. Karrer and M. E. J. Newman, Stochastic Blockmodels and Community Structure in Networks, Physical Review E 83, 016107 (2011).

[5]

T. P. Peixoto, Nonparametric Bayesian Inference of the Microcanonical Stochastic Block Model, Physical Review E 95, 012317 (2017).

[6]

P. D. Grünwald, The Minimum Description Length Principle (The MIT Press, 2007).

[7]

D. J. C. MacKay, Information Theory, Inference and Learning Algorithms, First Edition (Cambridge University Press, 2003).

[8]

C. E. Shannon, A Mathematical Theory of Communication, Bell System Technical Journal 27, 623 (1948).

[9]

A. Lancichinetti, S. Fortunato, and F. Radicchi, Benchmark Graphs for Testing Community Detection Algorithms, Physical Review E 78, 046110 (2008).

[10]

M. Girvan and M. E. J. Newman, Community Structure in Social and Biological Networks, Proceedings of the National Academy of Sciences 99, 7821 (2002).

[11]

A. Lancichinetti and S. Fortunato, Community Detection Algorithms: A Comparative Analysis, Physical Review E 80, 056117 (2009).

[12]

A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Asymptotic Analysis of the Stochastic Block Model for Modular Networks and Its Algorithmic Applications, Physical Review E 84, 066106 (2011).

[13]

T. P. Peixoto, Reconstructing Networks with Unknown and Heterogeneous Errors, Physical Review X 8, 041011 (2018).

[14]

T. P. Peixoto, Revealing Consensus and Dissensus Between Network Partitions, Physical Review X 11, 021003 (2021).

[15]

R. Guimerà and M. Sales-Pardo, Missing and Spurious Interactions and the Reconstruction of Complex Networks, Proceedings of the National Academy of Sciences 106, 22073 (2009).

[16]

T. Hoffmann, L. Peel, R. Lambiotte, and N. S. Jones, Community Detection in Networks Without Observing Edges, Science Advances 6, eaav1478 (2020).

[17]

T. P. Peixoto, Network Reconstruction and Community Detection from Dynamics, Physical Review Letters 123, 128301 (2019).


Descriptive vs. inferential community detection

Tiago P. Peixoto — Wed, 01 Dec 2021 23:00:00 GMT

This post is a slightly modified version of chapter II in [1].

Community detection is the task of dividing a network — typically one which is large — into many smaller groups of nodes that have a similar contribution to the overall network structure. With such a division, we can better summarize the large-scale structure of a network by describing how these groups are connected, instead of each individual node. This simplified description can be used to digest an otherwise intractable representation of a large system, providing insight into its most important patterns, how they relate to its function, and the underlying mechanisms responsible for its formation.

At a very fundamental level, community detection methods can be divided into two main categories: “descriptive” and “inferential.”

Descriptive methods attempt to find communities according to some context-dependent notion of a good division of the network into groups. These notions are based on the patterns that can be identified in the network via an exhaustive algorithm, but without taking into consideration the possible rules that were used to create them. These patterns are used only to describe the network, not to explain it. Usually, these approaches do not articulate precisely what constitutes community structure to begin with, and focus instead only on how to detect it. For this kind of method, concepts of statistical significance, parsimony and generalizability are usually not invoked.

Inferential methods, on the other hand, start with an explicit definition of what constitutes community structure, via a generative model for the network. This model describes how a latent (i.e. not observed) partition of the nodes would affect the placement of the edges. The inference consists in reversing this procedure to determine which node partitions are more likely to have been responsible for the observed network. The result of this is a “fit” of a model to data, which can be used as a tentative explanation of how it came to be. The concepts of statistical significance, parsimony and generalizability arise naturally and can be quantitatively assessed in this context. See e.g. [2].

Descriptive community detection methods are by far the most numerous, and those that are in most widespread use. However, this contrasts with the current state-of-the-art, which is composed in large part of inferential approaches. Here we point out the major differences between them and discuss how to decide which is more appropriate, and also why one should in general favor the inferential varieties whenever the objective is to derive interpretations from data.

Describing vs. explaining

We begin by observing that descriptive clustering approaches are the method of choice in certain contexts. For instance, such approaches arise naturally when the objective is to divide a network into two or more parts as a means to solve a variety of optimization problems. Arguably, the most classic example of this is the design of Very Large Scale Integrated Circuits (VLSI). The task is to combine millions of transistors into a single physical microprocessor chip. Transistors that connect to each other must be placed together to take less space, consume less power, reduce latency, and reduce the risk of cross-talk with other nearby connections. To achieve this, the initial stage of a VLSI process involves the partitioning of the circuit into many smaller modules with few connections between them, in a manner that enables their efficient spatial placement, i.e. by positioning the transistors in each module close together and those in different modules farther apart.

Another notable example is parallel task scheduling, a problem that appears in computer science and operations research. The objective is to distribute processes (i.e. programs, or tasks in general) between different processors, so they can run at the same time. Since processes depend on the partial results of other processes, this forms a dependency network, which then needs to be divided such that the number of dependencies across processors is minimized. The optimal division is the one where all tasks are able to finish in the shortest time possible.

Both examples above, and others, have motivated a large literature on “graph partitioning” dating back to the 70s, which covers a family of problems that play an important role in computer science and algorithmic complexity theory.

Although reminiscent of graph partitioning, and sharing with it many algorithmic similarities, community detection is used more broadly with a different goal [3]. Namely, the objective is to perform data analysis, where one wants to extract scientific understanding from empirical observations. The communities identified are usually directly used for representation and/or interpretation of the data, rather than as a mere device to solve a particular optimization problem. In this context, a merely descriptive approach will fail at giving us a meaningful insight into the data, and can be misleading, as we will discuss in the following.

We illustrate the difference between descriptive and inferential approaches in Figure 1. We first make an analogy with the famous “face” seen on images of the Cydonia Mensae region of the planet Mars. A merely descriptive account of the image can be made by identifying the facial features seen, which most people immediately recognize. However, an inferential description of the same image would seek instead to explain what is being seen. The process of explanation must invariably involve at its core an application of the law of parsimony, or Occam's razor. This principle states that when considering two hypotheses compatible with an observation, the simpler one must prevail. Employing this logic results in the conclusion that what we are seeing is in fact a regular mountain, without denying that it looks like a face in that picture, but just accidentally. In other words, the “facial” description is not useful as an explanation, as it emerges out of random features rather than exposing any underlying mechanism.

Figure 1: Difference between descriptive and inferential approaches to data analysis. As an analogy, on the top row we see two representations of the Cydonia Mensae region on Mars. On the top left is a descriptive account of what we see in the picture, namely a face. On the top right is an inferential representation of what lies behind it, namely a mountain. (We show a more recent image of the same region with a higher resolution to represent an inferential interpretation of the figure on the left.) More concretely, on the bottom row we see two representations of the same network. On the bottom left we see a descriptive division into 13 assortative communities. On the bottom right we see an inferential representation as a fully random network, with no communities, since this is a more likely model of how this network was formed (see Figure 2).

Going out of the analogy and back to the problem of community detection, in the bottom of Figure 1 we see a descriptive and an inferential account of an example network. The descriptive one is a division of the nodes into 13 assortative communities, which would be identified with many descriptive community detection methods available in the literature. Indeed, we can inspect visually that these groups form assortative communities, and most people would agree that these communities are really there, according to most definitions in use: these are groups of nodes with many more internal edges than external ones. However, an inferential account of the same network would reveal something else altogether. Specifically, it would explain this network as the outcome of a process where the edges are placed at random, without the existence of any communities. The communities that we see in Figure 1 (a) are just a byproduct of this random process, and therefore carry no explanatory power. In fact, this is exactly how the network in this example was generated, i.e. by choosing a specific degree sequence and connecting the edges uniformly at random.

In Figure 2 (a) we illustrate in more detail how the network in Figure 1 was generated: The degrees of the nodes are fixed, forming “stubs” or “half-edges”, which are then paired uniformly at random forming the edges of the network. In Figure 2 (b), like in Figure 1, the node colors show the partition found with descriptive community detection methods. However, this network division carries no explanatory power beyond what is contained in the degree sequence of the network, since it is generated otherwise uniformly at random. This becomes evident in Figure 2 (c), where we show another network sampled from the same generative process, i.e. another random pairing, but partitioned according to the same division as in Figure 2 (b). Since the nodes are paired uniformly at random, constrained only by their degree, this will create new apparent “communities” that are always uncorrelated with one another. Like the “face” on Mars, they can be seen and described, but they cannot explain.

Figure 2: Descriptive community detection finds a partition of the network according to an arbitrary criterion that bears in general no relation to the rules that were used to generate it. In (a) is shown the generative model we consider, where first a degree sequence is given to the nodes (forming “stubs”, or “half-edges”) which then are paired uniformly at random, forming a graph. In (b) is shown a realization of this model. The node colors show the partition found with virtually any descriptive community detection method. In (c) is shown another network sampled from the same model, together with the same partition found in (b), which is completely uncorrelated with the new apparent communities seen, since they are the mere byproduct of the random placement of the edges. An inferential approach would find only a single community in both (b) and (c), since no partition of the nodes is relevant for the underlying generative model.
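The stub-pairing process of Figure 2 (a) is simple to sketch in code. The example below is an illustration under stated assumptions, not the code used to produce the figures: the degree sequence, the seed and the fixed `partition` labels are hypothetical, and no community detection is performed. It draws two independent networks from the same degree sequence and measures the cut of one fixed partition on each.

```python
import random


def configuration_model(degrees, rng):
    """Pair half-edges ("stubs") uniformly at random and return the edge list.

    Self-loops and parallel edges are allowed, as in plain stub matching.
    """
    if sum(degrees) % 2:
        raise ValueError("the degrees must sum to an even number")
    # One stub per unit of degree, labelled by the node it belongs to.
    stubs = [node for node, k in enumerate(degrees) for _ in range(k)]
    # A uniform shuffle followed by pairing neighbours is a uniform matching.
    rng.shuffle(stubs)
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]


def cut_size(edges, partition):
    """Number of edges whose endpoints lie in different groups."""
    return sum(1 for u, v in edges if partition[u] != partition[v])


rng = random.Random(42)
degrees = [3] * 60                            # hypothetical degree sequence
partition = [node % 3 for node in range(60)]  # hypothetical fixed group labels

sample_b = configuration_model(degrees, rng)  # one realization
sample_c = configuration_model(degrees, rng)  # an independent realization

print(cut_size(sample_b, partition), cut_size(sample_c, partition))
```

Because the pairing ignores the group labels entirely, any partition that happens to give a small cut on one sample has no reason to do so on the next one, which is the point made by panel (c).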

We emphasize that the communities found in Figure 2 (b) are indeed really there from a descriptive point of view, and they can in fact be useful for a variety of tasks. For example, the cut given by the partition, i.e. the number of edges that go between different groups, is only 13, which means that we need only to remove this number of edges to break the network into (in this case) 13 smaller components. Depending on context, this kind of information can be used to prevent a widespread epidemic, hinder undesired communication, or, as we have already discussed, distribute tasks among processors and design a microchip. However, what these communities cannot be used for is to explain the data. In particular, a conclusion that would be completely incorrect is that the nodes that belong to the same group would have a larger probability of being connected between themselves. As shown in Figure 2 (a), this is clearly not the case, as the observed “communities” arise by pure chance, without any preference between the nodes.

To infer or to describe? A litmus test

Given the above differences, and the fact that both inferential and descriptive approaches have their uses depending on context, we are left with the question: Which approach is more appropriate for a given task at hand? In order to help answer this question, independent of the particular context, it is useful to consider the following “litmus test”:

Q: “Would the usefulness of our conclusions change if we learn, after obtaining the communities, that the network being analyzed is completely random?”

If the answer is “yes”, then an inferential approach is needed.

If the answer is “no”, then an inferential approach is not required.

If the answer to the above question is “yes”, then an inferential approach is warranted, since the conclusions depend on an interpretation of how the data were generated. Otherwise, a purely descriptive approach may be appropriate since considerations about generative processes are not relevant.

It is important to understand that the relevant question in this context is not whether the network being analyzed is actually fully random,¹ since this is rarely the case for empirical networks. Instead, considering this hypothetical scenario serves as a test to evaluate if our task requires us to separate actual latent community structures (i.e. those that are responsible for the network formation) from those that arise completely out of random fluctuations, and hence carry no explanatory power. Furthermore, most empirical networks, even if not fully random, like most interesting data, are better explained by a mixture of structure and randomness, and a method that cannot tell those apart cannot be used for inferential purposes.

¹ “Fully random” here means sampled from a random graph model, like the Erdős-Rényi model, the configuration model, or some other null model where whatever communities we may ascribe to the nodes play no role in the placement of the edges.

² Although this is certainly true at a first instance, we can also argue that properly understanding why a certain partition was possible in the first place would be useful for reproducibility and to aid the design of future instances of the problem. For these purposes, an inferential approach would be more appropriate.

Returning to the VLSI and task scheduling examples we considered in the previous section, it is clear that the answer to the litmus test above would be “no”, since it hardly matters how the network was generated and how we should interpret the partition found, as long as the integrated circuit can be manufactured and function efficiently, or the tasks finish in the minimal time. Interpretation and explanations are simply not the primary goals in these cases.²

However, it is safe to say that in network data analyses very often the answer to the question above would be “yes.” Typically, community detection methods are used to try to understand the overall large-scale network structure, determine the prevalent mixing patterns, make simplifications and generalizations, all in a manner that relies on statements about what lies behind the data, e.g. whether nodes were more or less likely to be connected to begin with. A majority of conclusions reached would be severely undermined if one were to discover that the underlying network is in fact fully random. This means that these analyses are in grave peril when using purely descriptive methods, since they are likely to be overfitting the data, i.e. confusing randomness with underlying structure.

References

[1]

T. P. Peixoto, Descriptive Vs. Inferential Community Detection in Networks: Pitfalls, Myths and Half-Truths, Elements in the Structure and Dynamics of Complex Networks (2023).

[2]

T. P. Peixoto, Bayesian Stochastic Blockmodeling, in Advances in Network Clustering and Blockmodeling (John Wiley & Sons, Ltd, 2019), pp. 289–332.

[3]

S. Fortunato and D. Hric, Community Detection in Networks: A User Guide, Physics Reports (2016).


Is network reconstruction impossible?

Tiago P. Peixoto — Thu, 28 Oct 2021 22:00:00 GMT

I have been getting some questions about a 2018 paper [1] by Jinyuan Chang, Eric D. Kolaczyk, and Qiwei Yao that deals with reconstruction of noisy networks, i.e. networks that are measured with uncertainty, so that true edges may not be observed or fake ones may be spuriously introduced. Among other things, they state:

“Under a simple model of network error, we show that consistent estimation of [subgraph] densities is impossible when the rates of error are unknown and only a single network is observed.”

This seems like a contradiction of a paper of mine [2] where I presented a method to do precisely what is considered impossible in the above statement: reconstruct networks from single measurements, when the error rates are unknown. So, where lies the problem?

Let us begin by defining the reconstruction scenario, which is fairly simple. Suppose we observe a noisy network $\boldsymbol G$, obtained by measuring a true network $\boldsymbol A$, subject to the error rates $p$ and $q$, such that

$$P(G_{ij} = 0 \mid A_{ij} = 1) = p, \qquad P(G_{ij} = 1 \mid A_{ij} = 0) = q.$$

In other words, $p$ is the probability of observing a missing edge, and $q$ is the probability of observing a spurious edge.

The reconstruction task is to obtain an estimate of $\boldsymbol A$ based only on $\boldsymbol G$, without knowing either $p$ or $q$. (Note that this reconstruction would also inherently give us an estimate for $p$ and $q$.)

Chang et al. consider estimators of subgraph densities that operate on the observed network $\boldsymbol G$, in a manner that makes no explicit assumption about how the data are generated. Essentially they claim that not knowing the true values of $\boldsymbol A$, $p$, and $q$, it is impossible to say anything about any of these values from $\boldsymbol G$ alone.
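To make the measurement scenario concrete, here is a minimal simulation sketch. The function names, the ring lattice used as the true network, and the chosen error rates are all assumptions made for illustration: `p` stands for the probability that a true edge goes unobserved, and `q` for the probability that a non-edge is reported as an edge.

```python
import itertools
import random


def ring_lattice(n):
    """True network for the example: node i is linked to node (i + 1) mod n."""
    return {tuple(sorted((i, (i + 1) % n))) for i in range(n)}


def measure(true_edges, n, p, q, rng):
    """Return one noisy observation of a simple graph on n nodes.

    Each true edge is missed with probability p, and each non-edge is
    reported as a spurious edge with probability q, independently.
    """
    observed = set()
    for pair in itertools.combinations(range(n), 2):
        if pair in true_edges:
            if rng.random() >= p:      # edge survives the measurement
                observed.add(pair)
        elif rng.random() < q:         # non-edge is spuriously recorded
            observed.add(pair)
    return observed


rng = random.Random(1)
truth = ring_lattice(20)
noisy = measure(truth, 20, p=0.1, q=0.01, rng=rng)

missing = truth - noisy     # true edges that were not observed
spurious = noisy - truth    # observed edges that are not real
print(len(missing), len(spurious))
```

The reconstruction problem is the reverse direction: only `noisy` is available, and `truth`, `p` and `q` all have to be estimated from it.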

It is important to understand that it is not in fact possible to make “no assumptions” about how data are generated. Assumptions are always made; they can only be implicit or explicit. Implicit assumptions, i.e. those that are hidden from view, are not exempt from justification. So-called “frequentist” estimators that make no explicit reference to a prior distribution are in fact formally equivalent to Bayesian estimators with a uniform prior, i.e. assuming that all parameter values are equally likely. In the case where the parameter is a graph, this means that our prior expectation is that the true network $\boldsymbol A$ is not only fully random, but in fact also dense, i.e. with a mean degree $\langle k \rangle = (N-1)/2$, where $N$ is the number of nodes. Is this a reasonable assumption?
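The density implied by a uniform prior over graphs can be verified by direct enumeration for small sizes. The sketch below assumes simple undirected graphs without self-loops; the function name is hypothetical. Under a uniform distribution each of the possible edges is present in exactly half of all graphs, so the expected mean degree is (N-1)/2.

```python
import itertools


def mean_degree_under_uniform_prior(n):
    """Average the mean degree 2E/n over all simple graphs on n nodes."""
    pairs = list(itertools.combinations(range(n), 2))
    total = 0.0
    count = 0
    for mask in itertools.product([0, 1], repeat=len(pairs)):
        total += 2 * sum(mask) / n   # mean degree of this particular graph
        count += 1
    return total / count


for n in range(2, 6):
    print(n, mean_degree_under_uniform_prior(n), (n - 1) / 2)
```

The expected mean degree grows linearly with the number of nodes, whereas most empirical networks are sparse, which is why the uniform prior amounts to a strong and usually wrong commitment.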

In [2] we take instead a Bayesian approach, where we are explicit about our assumptions, yielding a posterior distribution for the reconstruction of the true network $\boldsymbol A$ from the observed network $\boldsymbol G$,

$$P(\boldsymbol A \mid \boldsymbol G) = \frac{P(\boldsymbol G \mid \boldsymbol A)\, P(\boldsymbol A)}{P(\boldsymbol G)}.$$

In this setting, we can recover the “impossibility” result of Chang et al. by choosing the prior $P(\boldsymbol A)$ as a constant. But this is not what should be done; instead we should choose a prior that makes as little commitment as possible about the network structure before we see any data. Note that this is very different from choosing a uniform prior! A uniform prior would in fact be a very strong commitment, one that is overwhelmingly likely to be wrong in almost every empirical setting. Instead, we need a nonparametric hierarchical model that includes everything from fully random to very structured networks as special cases, in a manner that encapsulates the kinds of data that we are likely to find.

In order to illustrate intuitively why this makes sense, let us consider a particular instance of the problem. Suppose that, without knowing the true network $\boldsymbol A$ and the noise magnitudes $p$ and $q$, we observe the following noisy network $\boldsymbol G$:

A mysterious noisy network. What lies behind it?

From pure intuition, when observing the above network, we would like to immediately claim that it is close to the true network, and that the noise magnitudes are low. Why? Because we know that a perfect lattice is unlikely to be formed by chance alone (i.e. from a uniform prior). Our intuitive prior knows that such things called lattices exist, and that when they occur, they look exactly like the figure above. And also, when the true network is a lattice, a high value of either the missing-edge probability $p$ or the spurious-edge probability $q$ would destroy its pristine structure. The final conclusion is that the reconstruction of this network is not only possible, but in fact not very difficult.

The work [2] puts the above intuition on firmer terms by choosing the prior to match an unknown stochastic block model (SBM) [3]. This model includes the “fully random” assumption as a special case, but is also capable of modelling a wide variety of structural patterns. Is this a realistic assumption? As it turns out, it is sufficiently generic to make the reconstruction possible in many cases, even when the model is not fully realistic.

As an example, we can return to the perfect lattice above. Below is the reconstructed version of this presumed noisy network, according to the method of [2] (see the HOWTO):

A mystery revealed.

The thickness of each edge corresponds to its marginal posterior probability. Essentially, we conclude that the observed network is perfectly accurate, in agreement with our intuition. The colors of the nodes correspond to the node partition found with the SBM. Note that the SBM is a very coarse and arguably displeasing generative model for this network, which it would generate only with very low probability. Nevertheless, even with such a misspecified prior, the model is enough to detect that the underlying network is far from random, and to enable reconstruction. The posterior estimates for the noise magnitudes are indeed quite small. Not bad!
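The marginal posterior probability of an edge is typically estimated as the fraction of posterior samples of the network in which that edge appears. A minimal sketch, assuming the samples are already available as lists of edges (the three samples below are made up for illustration and are not output of the method in [2]):

```python
from collections import Counter

def edge_marginals(samples):
    """Estimate marginal edge probabilities from sampled edge lists.

    Each sample is an iterable of undirected edges (u, v); an edge's
    marginal is the fraction of samples that contain it.
    """
    counts = Counter()
    for edges in samples:
        # Normalise orientation and drop duplicates within a sample.
        counts.update({tuple(sorted(e)) for e in edges})
    return {e: c / len(samples) for e, c in counts.items()}

# Hypothetical posterior samples of a 3-node network.
samples = [
    [(0, 1), (1, 2)],
    [(1, 0), (1, 2)],
    [(0, 1), (0, 2)],
]
marginals = edge_marginals(samples)
print(marginals)
```

Edges present in every sample get a marginal of 1 and would be drawn at full thickness; edges that appear only occasionally would be drawn thinner.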

Of course, in [2] we consider situations where reconstruction is made for higher noise magnitudes, and also for real networks. But the above already serves to show that reconstruction from single measurements is indeed possible.

This should not be an earth-shattering conclusion. After all, single-measurement reconstructions of noisy images, time-series, and other high-dimensional objects are commonplace. Why not of networks? The key here is to abandon the idea that a network (like an image or a time-series) is a “singleton” object, and instead view it as a heterogeneous population of objects — namely the individual edges and nodes. And we should make assumptions that, while being agnostic about which kinds of pattern there should be, also allow for them to be detected in the first place.

References

[1] J. Chang, E. D. Kolaczyk, and Q. Yao, Estimation of Subgraph Density in Noisy Networks, arXiv:1803.02488 [Stat] (2020).

[2] T. P. Peixoto, Reconstructing Networks with Unknown and Heterogeneous Errors, Physical Review X 8, 041011 (2018).

[3] T. P. Peixoto, Bayesian Stochastic Blockmodeling, in Advances in Network Clustering and Blockmodeling (John Wiley & Sons, Ltd, 2019), pp. 289–332.

