Do we need to believe in generative models?

community detection, Bayes, compression, MDL

Author: Tiago P. Peixoto

Published: December 8, 2021

This post is a slightly modified version of Sec. IVH in [1].

In two previous blog posts (first and second) I advocated for the use of statistical inference for community detection in networks, whenever our objective is of an inferential nature.

One possible objection to the use of statistical inference arises when the generative models on which it is based are considered unrealistic for a particular kind of network. Although this type of consideration is ultimately important, it is not necessarily an obstacle. First, we need to remember that realism is a matter of degree, not kind: no model can be fully realistic, and therefore we should never be fully committed to “believing” any particular model. Because of this, an inferential approach can be used to target a particular kind of structure, with the corresponding model formulated with this in mind, but without the need to describe other properties of the data. The stochastic block model (SBM) is a good example of this, since it is often used with the objective of finding communities, rather than any kind of network structure. A model like the SBM is a good way to weigh the regularities that relate to the community structure against the irregularities present in real networks, without requiring us to believe that it in fact generated the network.

Furthermore, certain kinds of models are flexible enough to approximate other models. For example, a good analogy for fitting the SBM to network data is fitting a histogram to numerical data, with the node partitioning being analogous to the data binning. Although a piecewise constant model is almost never the true underlying distribution, it provides a reasonable approximation in a tractable, nonparametric manner. Because of its capacity to approximate a wide class of distributions, we certainly do not need to believe that a histogram is the true data generating process to extract meaningful inferences from it. In fact, the same can be said of the SBM in its capacity to approximate a wide class of network models [2].
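The histogram side of this analogy can be sketched in a few lines. The example below is only an illustration under assumed settings (a standard normal as the "true" distribution, 20000 samples, 16 equal-width bins on [-4, 4)); the function names are hypothetical and nothing here comes from the original post.

```python
import math
import random


def histogram_density(samples, lo, hi, n_bins):
    """Piecewise-constant density estimate on [lo, hi) with equal-width bins."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        if lo <= x < hi:  # the few samples outside the range are ignored
            counts[int((x - lo) / width)] += 1
    n = len(samples)
    return [c / (n * width) for c in counts], width


def normal_pdf(x):
    """Density of the standard normal, the assumed 'true' model here."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]

density, width = histogram_density(samples, -4.0, 4.0, 16)
midpoints = [-4.0 + (i + 0.5) * width for i in range(16)]

# Largest gap between the piecewise-constant fit and the true density,
# evaluated at the bin midpoints.
max_error = max(abs(d - normal_pdf(m)) for d, m in zip(density, midpoints))
print(f"largest midpoint error: {max_error:.4f}")
```

The histogram is obviously not the process that generated the samples, yet its bin heights track the true density closely. In the analogy, the bins play the role of the node partition and the bin heights the role of the edge densities between groups.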

This means that we can extract useful, statistically meaningful information from data even if the models we use are misspecified. For example, if a network is generated by a latent space model [3], and we fit an SBM to it, the communities that are obtained in this manner are not quite meaningless: they will correspond to discrete spatial regions. Hence, the inference would yield a caricature of the underlying latent space, amounting to a discretization of the true model, much like a histogram. This is very different, say, from finding communities in an Erdős–Rényi graph, which would bear no relation to the true underlying model and would amount to overfitting the data. In contrast, the SBM fit to a spatial network would approximately capture the true model structure, in a manner that could be used to compress the data and make predictions (although not optimally).
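The discretization argument can be made concrete with a toy example. The sketch below makes several assumptions that are not in the original post: a one-dimensional latent space on [0, 1), an edge probability that decays exponentially with distance (a simplification, not the specific model of [3]), and a partition obtained by binning the known latent coordinates rather than by actually inferring an SBM. All function names are hypothetical.

```python
import math
import random


def latent_space_graph(n, scale, rng):
    """Place n nodes uniformly on [0, 1); connect pairs with probability exp(-d / scale)."""
    positions = [rng.random() for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            distance = abs(positions[i] - positions[j])
            if rng.random() < math.exp(-distance / scale):
                edges.append((i, j))
    return positions, edges


def block_densities(labels, edges, n_groups):
    """Fraction of possible edges that exist, for every pair of groups (r <= s)."""
    sizes = [labels.count(r) for r in range(n_groups)]
    counts = [[0] * n_groups for _ in range(n_groups)]
    for i, j in edges:
        r, s = sorted((labels[i], labels[j]))
        counts[r][s] += 1
    density = {}
    for r in range(n_groups):
        for s in range(r, n_groups):
            pairs = sizes[r] * (sizes[r] - 1) // 2 if r == s else sizes[r] * sizes[s]
            density[(r, s)] = counts[r][s] / pairs
    return density


rng = random.Random(0)
positions, edges = latent_space_graph(200, 0.05, rng)

# Discretize the latent space into four contiguous regions, like histogram bins.
labels = [int(x * 4) for x in positions]
density = block_densities(labels, edges, 4)

within = [density[(r, r)] for r in range(4)]
between = [d for (r, s), d in density.items() if r != s]
print("within-region densities: ", [round(d, 3) for d in within])
print("between-region densities:", [round(d, 3) for d in between])
```

Each group is a contiguous spatial region, and the block densities form a coarse, piecewise-constant picture of the distance-dependent connection probability. An actual inference would have to recover such a partition from the edges alone, which this sketch does not attempt.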

Furthermore, the associated description length of a network model is a good criterion to tell whether the patterns we have found are actually simplifying our network description, without requiring the underlying model to be perfect. This works in the same way that software like gzip makes our files smaller, without requiring us to believe that they were in fact generated by the Markov chain used by the underlying Lempel-Ziv algorithm.
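The compression analogy is easy to try out. The sketch below uses Python's `zlib` module, whose DEFLATE format combines LZ77 with Huffman coding and is the same format gzip relies on. The two inputs (a repeated phrase and pseudo-random bytes) are arbitrary choices made for illustration.

```python
import random
import zlib

rng = random.Random(0)

# Highly regular data: the same short phrase repeated many times.
structured = b"the stochastic block model " * 1000

# Data with no regularities to exploit: pseudo-random bytes of equal length.
noise = bytes(rng.getrandbits(8) for _ in range(len(structured)))

# Compressed size relative to the original size.
structured_ratio = len(zlib.compress(structured, 9)) / len(structured)
noise_ratio = len(zlib.compress(noise, 9)) / len(noise)

print(f"structured: {structured_ratio:.3f}, noise: {noise_ratio:.3f}")
```

The compressor shrinks the regular input substantially and fails to shrink the noise, and it does so without either input having been produced by the model implicit in the algorithm. The compressed size alone tells us whether exploitable regularities were found.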

Of course, realism is important as soon as we demand more from the point of view of interpretation and prediction. Are the observed community structures due to homophily or triadic closure [4]? Or are they due to spatial embedding [3]? What models are capable of reproducing other network descriptors, together with the community structure? Which models can better reconstruct incomplete networks [5,6]?

When answering these questions, we are forced to consider more detailed generative processes, and compare them. However, we are never required to believe them: models are always tentative and approximate, and should be replaced by superior alternatives when these are found. Indeed, criteria such as minimum description length serve precisely to implement such a comparison between models, following the principle of Occam's razor. Therefore, the lack of realism of any particular model cannot be used to dismiss statistical inference as an underlying methodology.
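A model comparison of this kind can be sketched with a deliberately simplified two-part code. The code length below is an assumption made for illustration, not the nonparametric SBM formulation of [1]: it spends log N nats on the number of groups, N log B on the partition, log(n_rs + 1) on each block edge count, and the log of a binomial coefficient on placing the edges inside each block. With a single group it reduces to a code for an Erdős–Rényi graph. All names and parameter values are hypothetical.

```python
import math
import random
from itertools import combinations


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def description_length(n_nodes, edges, labels):
    """Two-part code length (in nats) of a simple graph under a toy SBM."""
    groups = sorted(set(labels))
    sizes = {r: labels.count(r) for r in groups}
    counts = {}
    for i, j in edges:
        key = tuple(sorted((labels[i], labels[j])))
        counts[key] = counts.get(key, 0) + 1
    # Cost of the number of groups plus the partition itself.
    dl = math.log(n_nodes) + n_nodes * math.log(len(groups))
    for a, r in enumerate(groups):
        for s in groups[a:]:
            pairs = sizes[r] * (sizes[r] - 1) // 2 if r == s else sizes[r] * sizes[s]
            e = counts.get((r, s), 0)
            # Cost of the edge count, then of the edge placement in this block.
            dl += math.log(pairs + 1) + log_binom(pairs, e)
    return dl


def sample_graph(labels, p_in, p_out, rng):
    """Sample edges independently: p_in within groups, p_out between groups."""
    edges = []
    for i, j in combinations(range(len(labels)), 2):
        p = p_in if labels[i] == labels[j] else p_out
        if rng.random() < p:
            edges.append((i, j))
    return edges


rng = random.Random(1)
three_groups = [i // 40 for i in range(120)]
one_group = [0] * 120

planted = sample_graph(three_groups, 0.3, 0.02, rng)   # has community structure
uniform = sample_graph(three_groups, 0.11, 0.11, rng)  # Erdős–Rényi graph

dl_planted_sbm = description_length(120, planted, three_groups)
dl_planted_er = description_length(120, planted, one_group)
dl_uniform_sbm = description_length(120, uniform, three_groups)
dl_uniform_er = description_length(120, uniform, one_group)

print(f"planted graph: 3 groups {dl_planted_sbm:.0f} vs 1 group {dl_planted_er:.0f}")
print(f"uniform graph: 3 groups {dl_uniform_sbm:.0f} vs 1 group {dl_uniform_er:.0f}")
```

On the graph with planted structure, the three-group description is shorter despite the extra cost of encoding the partition. On the uniform graph the same partition only adds overhead, so the simpler model wins. Neither comparison requires believing that either model is the true generative process.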

It should be emphasized that, fundamentally, there is no alternative. Rejecting an inferential approach based on the SBM on the grounds that it is an unrealistic model (e.g. because it assumes that edges are placed conditionally independently, or because of some other unpalatable assumption), while preferring some other non-inferential community detection method, is incoherent: as we discussed previously, every descriptive method can be mapped to an inferential analogue, with implicit assumptions that are hidden from view. Unless one can establish that the implicit assumptions are in fact more realistic, the comparison cannot be justified. Unrealistic assumptions should be replaced by more realistic ones, not by burying one’s head in the sand.

References

[1] T. P. Peixoto, Descriptive Vs. Inferential Community Detection in Networks: Pitfalls, Myths and Half-Truths, Elements in the Structure and Dynamics of Complex Networks (2023).
[2] S. C. Olhede and P. J. Wolfe, Network Histograms and Universality of Blockmodel Approximation, Proceedings of the National Academy of Sciences 111, 14722 (2014).
[3] P. D. Hoff, A. E. Raftery, and M. S. Handcock, Latent Space Approaches to Social Network Analysis, Journal of the American Statistical Association 97, 1090 (2002).
[4] T. P. Peixoto, Disentangling Homophily, Community Structure, and Triadic Closure in Networks, Physical Review X 12, 011004 (2022).
[5] R. Guimerà and M. Sales-Pardo, Missing and Spurious Interactions and the Reconstruction of Complex Networks, Proceedings of the National Academy of Sciences 106, 22073 (2009).
[6] T. P. Peixoto, Reconstructing Networks with Unknown and Heterogeneous Errors, Physical Review X 8, 041011 (2018).
