Cocitation() Memory Management

I have a graph object with 447,999 edges.

object.size(c)

25476096 bytes

I need a cocitation analysis.

When I run cocitation(c), R closes after a few seconds.

This is probably due to high system resource usage.

Is there any suggestion?

my problem is very similar to

The crucial point is the number of nodes $n$, not the number of edges. The matrix that is allocated is of size $n \times n$, which may quickly be too big. For example, with only 25000 nodes you may already need 4G of memory. However, you should be getting an error instead of R simply closing. Can you confirm that you at least get the warning?
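To make the size estimate concrete, here is a rough back-of-the-envelope sketch in R. It assumes the dense result stores one 8-byte double per entry and ignores overhead and temporary copies, so real usage may well be higher. `dense_matrix_gib` is a hypothetical helper name, not an igraph function, and n = 56000 is only an illustration of a larger network.

```r
# Approximate memory (in GiB) of a dense n x n matrix of doubles.
# Assumption: 8 bytes per entry, no overhead or temporary copies.
dense_matrix_gib <- function(n, bytes_per_entry = 8) {
  as.numeric(n)^2 * bytes_per_entry / 2^30
}

dense_matrix_gib(25000)  # about 4.7 GiB
dense_matrix_gib(56000)  # about 23.4 GiB
```

The estimate grows quadratically with the number of nodes, which is why reducing the node count helps far more than reducing the edge count.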

It might be an idea to have an implementation of cocitation (and bibliographic coupling) that consumes less memory. @tamas, is there any specific reason that a full matrix is allocated? Perhaps we can open an issue for that at GitHub?

I think it is fairly pointless to have “cocitation coupling” and “bibliographic coupling” implemented in the C core. The best implementation will work with sparse matrices. High-level interfaces will then need to translate these sparse matrices to their own format.

Would it not be simpler to do the computation directly in the high-level interfaces, using their native sparse matrix datatype? After all, it’s a trivial matrix multiplication. @vtraag what do you think?

@Rafet_IRMAK If the adjacency matrix of a directed graph is A, then cocitation coupling is

$$A^T \cdot A$$

and bibliographic coupling is

$$A \cdot A^T$$

with the diagonal of the result needing to be replaced by all zeros.
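As a small illustration of these two formulas, the following sketch uses a made-up 4-paper citation graph and base R dense matrices only. It assumes the convention that entry (i, j) of the adjacency matrix is 1 when paper i cites paper j.

```r
# Toy directed citation graph on 4 papers: 1 -> 3, 1 -> 4, 2 -> 3.
# A[i, j] = 1 means paper i cites paper j.
A <- matrix(0, nrow = 4, ncol = 4)
A[1, 3] <- 1
A[1, 4] <- 1
A[2, 3] <- 1

cocit <- t(A) %*% A      # entry (i, j): number of papers citing both i and j
diag(cocit) <- 0

bibcoup <- A %*% t(A)    # entry (i, j): number of papers cited by both i and j
diag(bibcoup) <- 0

cocit[3, 4]    # 1: papers 3 and 4 are both cited by paper 1
bibcoup[1, 2]  # 1: papers 1 and 2 both cite paper 3
```

For real data the same products would be taken on a sparse matrix rather than on a dense one.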

Maybe it’s still not a bad idea to have an implementation in the C core.

Perhaps we could update the functions in the C core to have an extra argument that controls whether the result should be sparse or not? If we unconditionally switch to sparse, it may be a pain for existing users of the C core to adapt. A flag that controls the output format will make for a smoother transition.

Many Thanks

I have found a solution after some optimization.

  1. R does not give any warning before it closes.

2a) R 64 bit - 4.0
Windows 10

2b) memory.limit(3*1024*1024), physical RAM 8 GB

2c) I have optimised the network by decreasing the number of nodes and edges.

2d) The highest memory use I managed to reach is 25389.37

  3. There is no specific reason in my application for full matrix allocation. If there is a way to work on a citation network on a cluster, I would prefer it. Gephi stores networks as an edge list, but igraph stores networks as an adjacency matrix. I perform the co-citation calculations in igraph and export to Gephi.

What I do is
cc <- cocitation(c)
cc <- graph_from_adjacency_matrix(cc)
el <- as_edgelist(cc)  # keep the edge list; the graph object itself cannot be written as CSV
write.csv2(el)

Many thanks.

For now I have found a solution by optimizing the network. But I work on PubMed data sets and I will need more power in the future. In this sample I work on 56K articles.

Of course your suggestion is a solution, but I need more experience. I am a physiotherapist (PhDT)

As usual, the reasons are historical :wink: Cocitation and bibliographic coupling were added fairly early to the library (Gabor was working on a project for which he needed these measures), and we had no sparse matrix data types around at that time. If we were to implement it today from scratch, we would have used sparse matrices for sure.

I’m also not sure what to do about it without breaking the API. Maybe we could just keep the existing versions and provide a sparse implementation in the C layer. The higher-level interfaces could then have a sparse=... keyword argument that selects between the two.

Dear Tamas

I use co-citation networks to identify the knowledge base in collections of medical articles, and bibliographic coupling for scientific frontiers.

For Web of Science and Scopus you can perform these analyses directly with the bibliometrix package for R. But bibliometrix does not support PubMed.

Also, citation reports are not directly available in PubMed as they are in Web of Science and Scopus.
You have to write your own script. My early scripts were in VBA, and I performed the co-citation and bibliographic coupling analysis in Sci2. PubMed updated its API and the VBA scripts did not work properly after that update.

I have migrated my scripts to R. I found igraph very useful for co-citation analysis. All software has limitations in bibliographic coupling analysis. Historical or significant articles have very dense citation relations. A hundred important articles in the orthopedic literature published in the 1980s yielded hundreds of millions of edges in bibliographic coupling. Simplifying this network is not easy. However, co-citation analysis on this article set can be performed after some optimization.

PubMed is important because it is non-commercial, whereas Web of Science and Scopus are commercial. If you don't work in a university hospital, or if you work in private practice, PubMed becomes the only information portal.

The PMC subset of PubMed consists of 100% free full-text articles.

I especially prefer to study PubMed to identify the information flow between scientists and medical workers in private practice and in developing countries.

In the medical literature, studies of this kind on PubMed are very rare. Studies on Web of Science and Scopus are not rare.

For this reason igraph may play an important role in PubMed studies.

I am not sure what a practical solution could be. Big data packages? A new, small additional package with sparse matrices for PubMed? A function which supports working on a cluster?

You can find early findings of one of my co-citation studies on COVID-19 (to be published in Journal of WORK in June) https://www.linkedin.com/in/rafet-irmak-4211bb70/detail/recent-activity/posts/

I can send the full study with raw data. More details on cocitation studies on Pubmed

@Rafet_IRMAK Just use the matrix multiplication formulas I mentioned above.

EDIT:

This took me a bit of googling, as I don’t normally use R. Assuming your graph is called g,

library(Matrix)
am <- as_adj(g)
cocit <- t(am) %*% am
diag(cocit) <- 0

The adjacency matrix returned by R/igraph should be sparse by default (depending on the setting igraph_options('sparsematrices')), thus this will be a memory efficient way to obtain the cocitation matrix.
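One way to gain confidence in this approach is to compare it with the built-in function on a graph small enough for the dense result to fit in memory. The sketch below is only a sanity check on a made-up 5-vertex graph. It uses `as_adjacency_matrix()`, the full name of `as_adj()`, and requests a sparse result explicitly instead of relying on the option setting.

```r
library(igraph)
library(Matrix)

# Small directed test graph: 1 -> 3, 1 -> 4, 2 -> 3, 2 -> 4, 2 -> 5.
g <- make_graph(c(1, 3, 1, 4, 2, 3, 2, 4, 2, 5), directed = TRUE)

am <- as_adjacency_matrix(g, sparse = TRUE)  # sparse Matrix object
cocit <- t(am) %*% am
diag(cocit) <- 0

# Compare with igraph's dense result, ignoring dimnames.
dense <- cocitation(g)
all(unname(as.matrix(cocit)) == unname(dense))
```

If the two results agree on small graphs, the sparse product can then be used on the large graph where the dense version does not fit.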

Perhaps we can just make an API-breaking change in the next version, and keep the current implementation until then?

Is there any reason why we would need to output a sparse matrix instead of a graph? I would think that returning an actual igraph_t object would be most useful to most users, also in the higher-level languages.

You mean a weighted graph, i.e. return both an igraph_t and an igraph_vector_(int)_t?

The issue I’m concerned about is not breaking the API, but making it hard for existing users of the C API to adapt. I think we should make it easy to obtain the very same dense matrix output as before, and optionally allow obtaining something else (sparse matrix or weighted graph). Whether this is done with a new function (API not broken) or an extra flag in the existing function which controls what is being returned, I do not mind. I think a new function is just simpler because a single function with a flag effectively behaves like two separate functions anyway.

I am working on a similar problem. How do you restore the original vertex ids at the end of the process?

I use igraph for citation and co-citation analysis only. I use Gephi for visualisation.
My algorithm is

edge list --> adjacency matrix --> co-citation analysis --> edge list --> Gephi

This algorithm preserves the original ids.
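A minimal sketch of this pipeline with the sparse matrix product from earlier in the thread might look as follows. The labels `p1`, `p2`, `a`, `b` and the output file name are made up for illustration. The sketch assumes the vertices carry a `name` attribute, which is what lets the original ids survive each step.

```r
library(igraph)
library(Matrix)

# Hypothetical edge list: citing paper -> cited paper, with made-up labels.
edges <- data.frame(from = c("p1", "p1", "p2", "p2"),
                    to   = c("a",  "b",  "a",  "b"))
g <- graph_from_data_frame(edges, directed = TRUE)

am <- as_adjacency_matrix(g, sparse = TRUE)  # dimnames come from V(g)$name
cocit <- t(am) %*% am
diag(cocit) <- 0
cocit <- drop0(cocit)                        # remove any explicitly stored zeros

# Weighted undirected co-citation graph; vertex names are taken from the dimnames.
cg <- graph_from_adjacency_matrix(cocit, mode = "undirected", weighted = TRUE)

# Edge list with the original labels and the co-citation counts as weights.
el <- as_data_frame(cg, what = "edges")
el
# write.csv2(el, "cocitation_edges.csv", row.names = FALSE)  # hypothetical file name
```

Here the only co-cited pair is `a` and `b`, both cited by `p1` and `p2`, so the edge list has a single row carrying their labels rather than numeric vertex indices.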