The perplexing “connected cluster axiom”
TL;DR — There’s no good reason to always expect communities in networks to form internally connected subgraphs. On the contrary, requiring this leads to conceptual inconsistencies and overfitting.
There’s a meme1 in the network science literature, in which authors declare that a desirable property of a network clustering algorithm is for it to find communities that are strictly connected, i.e. there is an internal path between any two nodes that belong to the same cluster. When provided, this statement is virtually never followed by any justification, as if it were completely self-evident — an entirely obvious axiom, innately intuitive to any sentient being. This is particularly common in older papers, from the wild west era of the 2000s, when authors were trying every heuristic under the sun to solve the community detection problem, without first agreeing on what the problem actually is, or should be. This has changed substantially in the last 20 years, and in particular in the last 15 years or so, during which a principled theoretical framework for community detection has been established, based predominantly on statistical inference and information theory. But alas, in some corners, the “connected cluster axiom” persists. It is interesting how this expectation seems intuitive to some, despite being inconsistent with other assumptions held simultaneously.
1 In the original sense of the word. I’m not talking about images in social media.
For example, the most widely (mis-)used and infamous community detection method, modularity maximization, relies on the notion of a null model: a statistical generative model for networks that have no meaningful community structure. The most common generative model employed for this purpose is the configuration model, where nodes are connected at random, while the expected degree of each node is fixed to match its value in the network being clustered. This choice raises the question: “Are networks generated by the null model typically connected?” The answer is actually “no,” since networks sampled from this null model are guaranteed to be connected only in the unrealistic asymptotic regime where every degree is sufficiently large. In most cases, null models matched to empirical networks do not result in connected networks. Therefore, a sane person would not simultaneously hold the conviction that clusters must be connected and the belief that the configuration model is a suitable null model for community detection. Yet, bizarrely, when some authors notice that modularity maximization always puts disconnected components in their own separate communities, they think all is well with the universe, since the “connected cluster axiom” is satisfied, even though the premise of the method has actually been completely subverted: it finds “meaningful” structures in its own null model! Instead of rendering the method unfit for purpose, this major inconsistency is often simply ignored.
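This is easy to check numerically. Below is a minimal sketch using networkx, with a synthetic power-law degree sequence standing in for an empirical one (which suffices to make the point): at realistic sparsity, samples from the configuration model are routinely disconnected.

```python
import random
import networkx as nx

# A synthetic power-law degree sequence, standing in for the degrees of a
# typical sparse empirical network (many low-degree nodes, a few hubs).
random.seed(0)
degrees = [int(random.paretovariate(2.0)) for _ in range(1000)]
if sum(degrees) % 2:          # the configuration model needs an even degree sum
    degrees[0] += 1

# A single sample from the configuration model with these degrees.
G = nx.Graph(nx.configuration_model(degrees, seed=0))   # collapse parallel edges
G.remove_edges_from(list(nx.selfloop_edges(G)))          # drop self-loops

print(nx.is_connected(G))                   # typically False
print(nx.number_connected_components(G))    # typically far more than 1
```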
Equally perplexing is when a method happens to produce disconnected clusters due to a mathematical artifact or lack of statistical evidence (for example, the resolution limit of modularity), and authors then choose to simply force the communities to be connected, once more ignoring the internal methodological inconsistency this entails. In network science, one can become very popular by disseminating such ideas [1].
The conceptual inconsistency of the “connected cluster axiom” is also evident when benchmarks are used to compare different methods. A prominent choice is the LFR benchmark [2], which consists of networks sampled from the degree-corrected stochastic block model (SBM) with particular parameter choices (assortative communities, with sizes and node degrees following power-law distributions). Is the benchmark guaranteed to produce connected clusters? No. Yet this contradiction (validating methods that demand connected clusters against a benchmark whose planted clusters need not be connected) seems to go largely unnoticed by proponents of the “connected cluster axiom.”
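This too is simple to verify. The following sketch uses networkx’s LFR generator, with parameter values chosen only for illustration, and counts how many planted communities fail to induce a connected subgraph; nothing in the generator enforces internal connectivity, and depending on the mixing parameter and community sizes the count is routinely nonzero.

```python
import networkx as nx

# Sample an LFR benchmark network (parameters are for illustration only).
G = nx.LFR_benchmark_graph(
    n=250, tau1=3, tau2=1.5, mu=0.1,
    average_degree=5, min_community=20, seed=10,
)

# Each node carries its planted community as a set of member nodes.
communities = {frozenset(G.nodes[v]["community"]) for v in G}

# Count planted communities whose induced subgraph is not internally connected.
disconnected = sum(not nx.is_connected(G.subgraph(c)) for c in communities)
print(f"{disconnected} of {len(communities)} planted communities are disconnected")
```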
A lot of this can be explained simply by carelessness. Some researchers seem to want only a quick and simple tool that seems to do the job, and are not interested in learning how it works (or doesn’t). Everyone else uses the same flawed method anyway, so who cares, especially if the analysis being done is superfluous after all — an unavoidable premise of this rationale.2
2 As I mentioned on another occasion, the reliance on heuristics such as modularity maximization for community detection (with its many documented flaws), but also on the UMAP method for dimensionality reduction, TDA, and so on, is a red flag indicating that the authors have given up on a minimally rigorous understanding of their analysis. This does not necessarily mean that the analysis is bad as a whole, but whatever quality it has is achieved in spite of its methodological choices, rather than because of them.
At the risk of giving them undue attention, I just came across two recent papers [3,4] that heroically managed to employ the “connected cluster axiom” as a criterion to “improve” SBM inference. It seems the authors used graph-tool to find clusters in networks, and found that it yielded disconnected clusters for them. How truly embarrassing! Generously, the authors went to great lengths to “fix” the problem, and provided much-needed “improvements” to the inference algorithm to prevent this calamity from occurring: by suitably post-processing the data (always a good idea!), and forcing disconnected clusters inside each group to belong to different communities.
(Since this is the internet, I should disclose that the above paragraph contains sarcasm.)
What is particularly egregious about the works mentioned above is that the authors manage to completely miss the point of the whole inferential approach based on SBMs. Because of the “connected cluster axiom,” they interpret what is an explicit strength of the approach — namely its ability to distinguish structure from randomness — as a defect. And therefore, the supposed “fix” they propose actually breaks the method. When we infer an SBM given some observed network, we try to find the groups of nodes that have the same (latent) probability of connecting to other nodes in the network. According to this model, in the assortative case,3 i.e. when the connection probability is larger inside groups than across them, nodes belonging to the same group can still end up being disconnected. In other words, the generative model used does not attempt to obey the “connected cluster axiom.” Furthermore, trying to “fix” this, by either constraining the algorithm or modifying the data, means that the algorithm can no longer always find a single group in a maximally random network, such as one sampled from the Erdős–Rényi or configuration models. In other words, this inept “fix” just causes the approach to overfit in trivial scenarios, entirely nullifying its raison d’être. And frankly, that’s just rude.
3 I mention only in passing that when the group mixing is disassortative, the expectation that nodes of the same group should be connected to each other is outright self-contradictory, rather than merely excessive. Sadly, this is a relevant point for the papers in question, since they consider empirical networks, which are rarely purely assortative [5].
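To make the above concrete, here is a minimal sketch using graph-tool. The parameter values are arbitrary, and the post-processing at the end is only a generic illustration of “splitting groups into their connected components,” not the cited authors’ actual code. On a sparse, maximally random Erdős–Rényi graph, which is typically disconnected, the SBM inference correctly finds a single group, whereas the “fixed” version reports one group per connected component, i.e. it manufactures structure out of pure randomness.

```python
import numpy as np
import graph_tool.all as gt

# A sparse Erdős–Rényi graph with mean degree ~1.5: maximally random, and
# typically split into many connected components.
N, k_mean = 1000, 1.5
rng = np.random.default_rng(42)
gt.seed_rng(42)
src, tgt = np.triu_indices(N, k=1)
keep = rng.random(len(src)) < k_mean / (N - 1)
g = gt.Graph(directed=False)
g.add_vertex(N)
g.add_edge_list(np.column_stack((src[keep], tgt[keep])))

comp, hist = gt.label_components(g)
print("connected components:", len(hist))          # typically in the hundreds

# SBM inference correctly reports the absence of structure: a single group.
state = gt.minimize_blockmodel_dl(g)
print("inferred groups:", state.get_nonempty_B())   # typically 1

# The kind of "fix" criticized above: split every inferred group into its
# connected components, each becoming a separate community.
b = state.get_blocks()
new_b = g.new_vertex_property("int")
next_label = 0
for r in np.unique(b.a):
    mask = g.new_vertex_property("bool", vals=(b.a == r))
    sub = gt.GraphView(g, vfilt=mask)
    c, h = gt.label_components(sub)
    for v in sub.vertices():
        new_b[v] = next_label + c[v]
    next_label += len(h)

print("groups after the 'fix':", next_label)        # typically one per component: overfitting
```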
Needless to say, I expressly recommend that users of graph-tool avoid such alleged “improvements” at all costs.
Detecting planted community structure in sparse graphs is a challenging task. The associated optimization problems are NP-hard in general, and the outcome will depend not only on the quality function being used, but also on the optimization or sampling algorithms chosen. The SBM parametrization and choice of priors can also have an important effect, as discussed in the literature [6]. Elaborating on alternative model choices that work better in particular scenarios is a perfectly valid endeavor. What is invalid is to bypass explicit modeling choices in favor of a heuristic kludge with no actual justification.
Where does the “connected cluster axiom” meme come from? Is it simply a conflation between “connected component” and “cluster”? Are people inherently thinking of descriptive clusters that help explain some dynamics on a network, such as a flow or a spreading process, where connectedness plays a deciding role? If adherents of the “connected cluster axiom” would only actually try to explain their rationale, instead of taking it as given, it could be addressed. But I’m not too optimistic, since academics don’t usually miss the opportunity to explain a conviction when they can, and hence, due to Bayes’ theorem, we can conclude that they probably can’t explain it when they don’t.
References