(This is a continuations of the previous two posts, and a slightly modified version
of chapter III in [peixoto_descriptive_2021].)
The most widespread method for community detection is modularity
maximization [newman_modularity_2006], which happens also to be one
of the most problematic. This method is based on the modularity function,
\begin{equation*}
Q(\boldsymbol A,\boldsymbol b) = \frac{1}{2E}\sum_{ij}\left(A_{ij} - \frac{k_ik_j}{2E}\right)\delta_{b_i,b_j}
\end{equation*}
where \(A_{ij}\in\{0,1\}\) is an entry of the adjacency matrix,
\(k_i=\sum_jA_{ij}\) is the degree of node \(i\), \(b_i\) is
the group membership of node \(i\), and \(E\) is the total number of
edges. The method consists in finding the partition \(\hat{\boldsymbol b}\) that
maximizes \(Q({\boldsymbol A},{\boldsymbol b})\),
\begin{equation*}
\hat{\boldsymbol b} = \underset{\boldsymbol b}{\operatorname{argmax}}\; Q(\boldsymbol A,\boldsymbol b).
\end{equation*}
The motivation behind the modularity function is that it compares the
existence of an edge \((i,j)\) to the probability of it existing according
to a null model, \(P_{ij} = k_ik_j/2E\), namely that of the configuration
model [fosdick_configuring_2018] (or more precisely, the Chung-Lu
model [chung_connected_2002]). The intuition for this method is
that we should consider a partition of the network meaningful if the
occurrence of edges between nodes of the same group exceeds what we
would expect with a random null model without communities.
Despite its widespread adoption, this approach suffers from a variety of
serious conceptual and practical flaws, which have been documented
extensively [guimera_modularity_2004] [fortunato_resolution_2007]
[good_performance_2010] [fortunato_community_2010]
[fortunato_community_2016]. The most problematic one is that it
purports to use an inferential criterion — a deviation from a
null generative model — but is in fact merely descriptive. As has been
recognized very early, this method categorically fails in its own stated
goal, since it always finds high-scoring partitions in networks sampled
from its own null model [guimera_modularity_2004].
The reason for this failure is that the method does not take into
account the deviation from the null model in a statistically consistent
manner. The modularity function is just a re-scaled version of the
assortativity coefficient [newman_mixing_2003], a correlation measure
of the community assignments seen at the endpoints of edges in the
network. We should expect such a correlation value to be close to zero
for a partition that is determined before the edges of the network are
placed according to the null model, or equivalently, for a partition
chosen at random. However, it is quite a different matter to find a
partition that optimizes the value of \(Q({\boldsymbol
A},{\boldsymbol b})\), after the network is observed. The deviation from
a null model computed in eq:Q completely ignores the optimization step
of eq:qmax, although it is a crucial part of the algorithm. As a
result, the method of modularity maximization tends to massively
overfit, and find spurious communities even in networks sampled from its
null model. We are searching for patterns of correlations in a random
graph, and most of the time we will find them. This is a pitfall known
as “data dredging” or
“p-hacking”, where one
searches exhaustively for different patterns in the same data and
reports only those that are deemed significant, according to a criterion
that does not take into account the fact that we are doing this search
in the first place.
We demonstrate this problem in fig:randomQ, where we show the
distribution of modularity values obtained with a uniform configuration
model with \(k_i=5\) for every node \(i\), considering both a
random partition and the one that maximizes \(Q({\boldsymbol
A},{\boldsymbol b})\). While for a random partition we find what we would
expect, i.e. a value of \(Q({\boldsymbol A},{\boldsymbol b})\) close
to zero, for the optimized partition the value is substantially
larger. Inspecting the optimized partition in fig:randomQ (c),
we see that it corresponds indeed to 15 seemingly clear assortative
communities — which by construction bear no relevance to how the network
was generated. They have been dredged out of randomness by the
optimization procedure.
Somewhat paradoxically, another problem with modularity maximization is
that in addition to systematically overfitting, it also systematically
underfits. This occurs via the so-called resolution limit: in a
connected network the method cannot find more than
\(\sqrt{2E}\) communities [fortunato_resolution_2007], even if they
seem intuitive or can be found by other methods. An example of this is
shown in fig:resolution, where for a network generated with the SBM
containing 30 communities, modularity maximization finds only 18, while
an inferential approach has no problems finding the true
structure. There are attempts to counteract the resolution limit by
introducing a “resolution parameter” to the modularity function, but
they are in general ineffective (see [peixoto_descriptive_2021]).
These two problems — overfitting and underfitting — can occur in tandem,
such that portions of the network dominated by randomness are spuriously
revealed to contain communities, whereas other portions with clear
modular structure can have those obstructed. The result is a very
unreliable method to capture the structure of heterogeneous networks. We
demonstrate this in fig:resolution (c) and (d)
In addition to these major problems, modularity maximization also often
possesses a degenerate landscape of solutions, with very different
partitions having similar values of \(Q({\boldsymbol
A},{\boldsymbol b})\) [good_performance_2010]. In these situations the
partition with maximum value of modularity can be a poor representative
of the entire set of high-scoring solutions and depend on idiosyncratic
details of the data rather than general patterns — which can be
interpreted as a different kind of overfitting.
The combined effects of underfitting and overfitting can make the
results obtained with the method unreliable and difficult to
interpret. As a demonstration of the systematic nature of the problem,
in fig:Qrand (a) we show the number of communities obtained using
modularity maximization for 263 empirical networks of various sizes and
belonging to different domains, obtained from the Netzschleuder
catalogue. Since the networks considered
are all connected, the values are always below \(\sqrt{2E}\), due to
the resolution limit; but otherwise they are well distributed over the
allowed range. However, in fig:Qrand (b) we show the same
analysis, but for a version of each network that is fully randomized,
while preserving the degree sequence. In this case, the number of groups
remains distributed in the same range (sometimes even exceeding the
resolution limit, because the randomized versions can end up
disconnected). As fig:Qrand (c) shows, the number of groups
found for the randomized networks is strongly correlated with the
original ones, despite the fact that the former have no latent community
structure. This is a strong indication of the substantial amount of
noise that is incorporated into the partitions found with the method.
The systematic overfitting of modularity maximization — as well as other
descriptive methods such as Infomap — has been also demonstrated
recently in [ghasemian_evaluating_2019], from the point of view of edge
prediction, on a separate empirical dataset of 572 networks from various
domains.
Although many of the problems with modularity maximization were long
known, for some time there were no principled solutions to them, but
this is no longer the case. In the table below we summarize some of the
main problems with modularity and how they are solved with inferential
approaches.
Problem |
Principled solution via inference |
Modularity maximization overfits, and finds modules in fully random
networks. [guimera_modularity_2004] |
Bayesian inference of the SBM is designed from the ground to avoid
this problem in a principled way and systematically succeeds [peixoto_bayesian_2019]. |
Modularity maximization has a resolution limit, and finds at most
\(\sqrt{2E}\) groups in connected networks [fortunato_resolution_2007]. |
Inferential approaches with hierarchical priors [peixoto_hierarchical_2014]
[peixoto_nonparametric_2017] or strictly assortative structures
[zhang_statistical_2020] do not have any appreciable resolution limit, and can find
a maximum number of groups that scales as \(O(N/\log N)\). Importantly, this is
achieved without sacrificing the robustness against overfitting. |
Modularity maximization has a characteristic scale, and tends to find
communities of similar size; in particular with the same sum of degrees. |
Hierarchical priors can be specifically chosen to be a priori agnostic about
characteristic sizes, densities of groups and degree sequences
[peixoto_nonparametric_2017], such that these are not imposed, but instead obtained
from inference, in an unbiased way. |
Modularity maximization can only find strictly assortative communities. |
Inferential approaches can be based on any generative model. The general SBM will
find any kind of mixing pattern in an unbiased way, and has no problems identifying
modular structure in bipartite networks, core-periphery networks, and any mixture of
these or other patterns. There are also specialized versions for bipartite
[larremore_efficiently_2014], core-periphery [zhang_identification_2015], and
assortative patterns [zhang_statistical_2020], if these are being searched
exclusively. |
The solution landscape of modularity maximization is often degenerate,
with many different solutions with close to the same modularity value
[good_performance_2010], and with no clear way of how to select between
them. |
Inferential methods are characterized by a posterior distribution of partitions.
The consensus or dissensus between the different solutions [peixoto_revealing_2021]
can be used to determine how many cohesive hypotheses can be extracted from
inference, and to what extent is the model being used a poor or a good fit for the network. |
Because of the above problems, the use of modularity maximization should
be discouraged, since it is demonstrably not fit for purpose as an
inferential method. As a consequence, the use of modularity maximization
in any recent network analysis can be arguably considered a “red flag”
that strongly indicates methodological carelessness. In the absence of
secondary evidence supporting the alleged community structures found, or
extreme care to counteract the several limitations of the method, the
safest assumption is that the results obtained with that method tend to
contain a substantial amount of noise, rendering any inferential
conclusion derived from them highly suspicious.
As a final note, we focus on modularity here not only for its widespread
adoption but also because of its emblematic character. At a fundamental
level, all of its shortcoming are shared with any descriptive method in
the literature — to varied but always non-negligible degrees.
[peixoto_descriptive_2021] | (1, 2) Tiago P. Peixoto, “Descriptive
vs. inferential community detection: pitfalls, myths and half-truths”,
arXiv: 2112.00183 |
[guimera_modularity_2004] | (1, 2, 3) Roger Guimerà, Marta Sales-Pardo, and Luís A. Nunes Amaral,
“Modularity from fluctuations in random graphs and complex networks,”
Physical Review E 70, 025101 (2004). DOI: 10.1103/PhysRevE.70.025101 |
[fortunato_resolution_2007] | (1, 2, 3) Santo Fortunato and Marc Barthélemy,
“Resolution limit in community detection”, Proceedings of the National
Academy of Sciences 104, 36–41 (2007). DOI: 10.1073/pnas.0605965104 |
[peixoto_nonparametric_2017] | (1, 2) Tiago P. Peixoto, “Nonparametric Bayesian inference of the
microcanonical stochastic block model”, Physical Review E 95, 012317 (2017).
DOI: 10.1103/PhysRevE.95.012317 |
[zhang_statistical_2020] | (1, 2, 3) Lizhi Zhang and Tiago P. Peixoto,
“Statistical inference of assortative community structures.” Physical
Review Research 2, 043271 (2020). DOI: 10.1103/PhysRevResearch.2.043271 |
You can also use your Mastodon account to comment on this post by replying to this thread. To see the existing comments, press the button below.