I have a connected graph of 15000 nodes and would like to sample subgraphs of size 500. What options are available within the igraph system to generate subgraphs of a given size?
One technique that is sometimes used is to perform a random walk, and take the subgraph induced by the vertices traversed by the walk.
Nice idea. But Peranti should clarify the requirements for the sample. Random walk would give samples that are connected and are “communities” under some definitions of communities, so it is not a random sample of nodes or edges. If it is more important to have a random sample of nodes regardless of connectivity, use “sample” on vertex (node) ids and then induce a graph of the edges that connect them with induced_subgraph. It may not be a connected graph. Or if random sample of edges is what matters, sample the edge ids and use the nodes they connect.
Yes, @dan_suthers. Indeed, I am looking for a subgraph of 500 that is connected or the only subgraph component. However, I understand what you have mentioned that the resultant subgraph is no more random.
If sample the vertex ids and then induce the graph, it necessarily need not result in a single component subgraph. Isn’t it?
My main point is that one should first define the kind of sample one wants, and then identify a method that meets those requirements. If the most important criteria is a random sample of nodes, a random walk won’t do it. But if you want to randomly choose a connected component, then it seems to me that a random walk starting at a randomly chosen node would do it. However, I am not a statistician: there may be some bias in that method I am not aware of.
A further question: do you just want one randomly chosen connected component, or do you want to repeatedly sample the graph and get a distribution of components that is randomly distributed across all possible samples of components? Again, I’m not going to claim to have expertise about sampling, but it seems important to clarify your objectives.
@dan_suthers I am afraid, I do not have an answer to your question. I would think and come back to you.
@peranti Why do you want to take a sample? The usual motivation is that the original graph is too large to analyse in a reasonable amount of compute time. Thus we want a proxy which is small enough to work with, yet still representative of the original network. If this is what you are looking for, you need to consider whether the sampling method you chose preserves those properties of the larger network which you care about. Random walk based sampling is probably the simplest attempt at preserving local structure.
I am not deeply familiar with this topic, so I am reluctant to give more specific suggestions. However, the problem of taking a representative sample has been studied for a long time, thus you will find many papers on it. If I had to do this, I would look at a few recent papers and look for a review of existing methods, along with their tradeoffs (biases? what properties do they / do they not preserve?).
I agree that the objective should guide what type of sampling strategy is being used.
For a starting point on some problems of estimation and sampling networks, perhaps this would be a useful reference:
Kolaczyk E.D. (2009) Sampling and Estimation in Network Graphs. In: Statistical Analysis of Network Data. Springer Series in Statistics. Springer, New York, NY. https://doi.org/10.1007/978-0-387-88146-1_5