Interest in GSoC project idea: Interface to network data repositories

I am Aditya Chaubey, a B.Tech student at IIT Madras. I came across igraph’s GSoC project ideas and found the "Interface to network data repositories" project very interesting. I would like to discuss a few ideas of my own on the topic. Can someone please confirm whether this is the right place to do so?

It’s great to see that you’re interested in this project. Yes, this is indeed the right place. We’ll get back to you with more details after the weekend.

Thank you for your prompt reply!
My initial thought was that if we could somehow import a dataset from Netzschleuder in GML or CSV form, we could then load it into a graph. But it looks like it is not that direct; some tuning needs to be done. I would like to work further on this. Are there any other ideas from your end that should be explored?

Hello @szhorvat Here is some very basic code I wrote for creating a graph from Netzschleuder. It asks for the link, downloads the file temporarily and creates the graph out of it.
I shall use the Netzschleuder API so that only the dataset name needs to be provided, instead of the full link.
I think we can create similar functions for other sites as well. Any reviews?

import os
import sys
import tempfile

import requests
import zstandard as zstd
import igraph as ig
import matplotlib.pyplot as plt

# URL of the compressed file
url = "https://networks.skewed.de/net/bison/files/bison.gml.zst"

# Download the file to a temporary location
try:
    with tempfile.NamedTemporaryFile(delete=False, suffix=".zst") as tmp_zst_file:
        headers = {"User-Agent": "Mozilla/5.0"}
        response = requests.get(url, headers=headers, stream=True)

        if response.status_code == 200:
            # Write the response in chunks so large files are not held in memory
            for chunk in response.iter_content(chunk_size=1 << 16):
                tmp_zst_file.write(chunk)
            tmp_zst_path = tmp_zst_file.name
        else:
            print(f"Failed to download file. HTTP status: {response.status_code}")
            sys.exit(1)

except requests.RequestException as e:
    print(f"Error downloading file: {e}")
    sys.exit(1)

# Verify the file format by checking the Zstandard magic number
with open(tmp_zst_path, "rb") as f:
    if f.read(4) != b"\x28\xb5\x2f\xfd":
        print("Error: the downloaded file is not a valid Zstandard (.zst) file!")
        sys.exit(1)

try:
    dctx = zstd.ZstdDecompressor()

    with tempfile.NamedTemporaryFile(delete=False, suffix=".gml") as tmp_gml_file:
        with open(tmp_zst_path, "rb") as compressed:
            dctx.copy_stream(compressed, tmp_gml_file)  # streamed decompression
        tmp_gml_path = tmp_gml_file.name  # save the temp file path before closing

except zstd.ZstdError as e:
    print(f"Zstandard decompression error: {e}")
    sys.exit(1)

try:
    g = ig.Graph.Read_GML(tmp_gml_path)  # load from the temp file
    print(g.summary())
except Exception as e:
    print(f"Error loading GML file: {e}")
    sys.exit(1)

fig, ax = plt.subplots(figsize=(8, 8))
ig.plot(
    g,
    target=ax,
    layout="auto",
    vertex_size=10,
    vertex_color="red",
    #vertex_label=g.vs["name"] if "name" in g.vertex_attributes() else None,
    edge_width=0.5,
    edge_color="#AAA",
)
plt.show()

# Clean up: Delete temporary files
os.remove(tmp_zst_path)
os.remove(tmp_gml_path)

print("Graph visualization complete!")

I am unclear about where to add the final modified code for the PR.
Should it be added under io/libraries.py, or should I create a new file for handling data repositories, where we can add other sites as well?

My initial thought on this was if somehow we could import the dataset from Netzschleuder in GML or CSV form then we could load it into the graph.

Ideally we should try making use of the native format of Netzschleuder (i.e. the .gt format). CSV is complicated as the data is spread over several files and you would need to re-assemble them. GML could work but I’m not sure how the file format handles vertex or edge attributes that contain lists of values – most likely they are converted to strings in GML. Try to find a few datasets where there are vertex or edge attributes and check how they are handled in GML in Netzschleuder.
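To make the concern above concrete, here is a hypothetical GML excerpt (not taken from a real Netzschleuder file) showing the two ways a list-valued attribute could end up in the file: GML itself supports composite (nested) attributes, but a writer may instead flatten the list into a string, which is what we would want to check for.

```
node [
  id 0
  # a composite attribute: a proper nested GML record
  pos [ x 0.5 y 1.0 ]
  # the same kind of data flattened into a plain string by the writer
  categories "[1, 2, 3]"
]
```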

If we decide to use the .gt format, then we have two options:

  • the quick one is to depend on graph-tool itself and use it to load the graph from the .gt file into a graph-tool graph, then convert it to igraph with Graph.from_graph_tool(). The downside is the extra dependency – graph-tool is not on PyPI, so it is not pip-installable.
  • a better solution (i.e. more sustainable in the long term) is to implement a parser for the .gt format that produces an igraph graph directly.
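As a starting point for the second option, here is a minimal sketch that parses only the fixed header of a .gt file, based on my reading of the format description in graph-tool’s documentation (magic bytes, version, endianness, comment, directedness, vertex count). The adjacency lists and property maps that follow the header would still need to be handled, and the field names below are my own, not part of any spec.

```python
import io
import struct

GT_MAGIC = "\u26fe gt".encode("utf-8")  # the 6-byte magic string "⛾ gt"

def read_gt_header(f):
    """Parse the fixed header of a .gt file and return its fields as a dict."""
    if f.read(6) != GT_MAGIC:
        raise ValueError("not a .gt file (bad magic bytes)")
    version = f.read(1)[0]                 # format version byte
    big_endian = bool(f.read(1)[0])        # endianness flag for all later ints
    endian = ">" if big_endian else "<"
    (comment_len,) = struct.unpack(endian + "Q", f.read(8))
    comment = f.read(comment_len).decode("utf-8")
    directed = bool(f.read(1)[0])          # whether the graph is directed
    (n_vertices,) = struct.unpack(endian + "Q", f.read(8))
    return {
        "version": version,
        "big_endian": big_endian,
        "comment": comment,
        "directed": directed,
        "n_vertices": n_vertices,
    }

# Round-trip check against a hand-built header: version 1, little-endian,
# empty comment, directed graph with 3 vertices.
sample = GT_MAGIC + bytes([1, 0]) + struct.pack("<Q", 0) + bytes([1]) + struct.pack("<Q", 3)
header = read_gt_header(io.BytesIO(sample))
print(header["directed"], header["n_vertices"])  # prints: True 3
```

The hand-built sample at the end is only a smoke test of the header layout; real files from Netzschleuder should of course be used to validate a full parser.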

Should it be added under the IO/libraries.py or should I create a new file for handling data repositories where we can add other sites as well?

I would probably add it to src/igraph/io/repositories.py as a separate function or class-with-methods, depending on how you imagine the API. Come up with a proposal for the API and then we can continue from there. Note that Netzschleuder datasets may contain multiple graphs, but most datasets contain only one. I believe that the ideal API should let the user just specify the name of the dataset and get the single network in the dataset as a result if there is only one network in the dataset – but we should resist the temptation to guess which network the user meant if there are multiple networks in a dataset.
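To illustrate the "no guessing" rule described above, a hypothetical helper (the names here are my own invention, not an agreed API) could resolve which network to load like this, with the list of available networks coming from the repository’s metadata:

```python
def resolve_network(dataset, available, net=None):
    """Pick the network to load from a Netzschleuder dataset.

    `available` is the list of network names in the dataset; this helper
    only demonstrates the proposed selection rules.
    """
    if net is not None:
        if net not in available:
            raise ValueError(f"dataset {dataset!r} has no network {net!r}")
        return net
    if len(available) == 1:
        # a single network: the user may omit `net`
        return available[0]
    # multiple networks: refuse to guess which one the user meant
    raise ValueError(
        f"dataset {dataset!r} contains {len(available)} networks "
        f"({', '.join(sorted(available))}); please specify one explicitly"
    )

print(resolve_network("bison", ["bison"]))          # prints: bison
print(resolve_network("foo", ["a", "b"], net="b"))  # prints: b
```

Calling `resolve_network("foo", ["a", "b"])` without `net` would raise a `ValueError` listing the available networks, which matches the behaviour suggested above.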

At this time igraph’s importer includes composite GML attributes with a warning. However, Netzschleuder doesn’t seem to make proper use of composite attributes for encoding lists—it seems to serialize lists into strings.

I see, I shall try to implement a parser for the .gt format to produce an igraph graph directly.
For now, should I make a PR for a function that builds a graph from a compressed GML file from Netzschleuder?
The following is the code snippet I am planning to implement.
It asks for the dataset name, and for the sub-network name only if the dataset contains multiple graphs. It then downloads (temporarily) and decompresses the GML file and returns the graph as a result.

import os
import tempfile
from typing import Optional

import requests
import zstandard as zstd
import igraph as ig


def load_graph_from_netzschleuder(name: str, net: Optional[str] = None) -> ig.Graph:
    """
    Download, decompress and load a graph from a GML .zst file on Netzschleuder.

    Parameters:
        name (str): The name of the dataset (e.g. "bison").
        net (str, optional): The specific network within the dataset.
            Defaults to `name` if None.

    Returns:
        igraph.Graph: The loaded graph.
    """
    base_url = "https://networks.skewed.de/net"
    headers = {"User-Agent": "Mozilla/5.0"}

    net = net or name

    # Check that the dataset and the sub-network exist
    dataset_url = f"{base_url}/{name}"
    response = requests.head(dataset_url, headers=headers, timeout=5)
    if response.status_code != 200:
        raise ValueError(f"Dataset '{name}' does not exist at {dataset_url}.")

    file_url = f"{dataset_url}/files/{net}.gml.zst"
    response = requests.head(file_url, headers=headers, timeout=5)
    if response.status_code != 200:
        raise ValueError(f"Network file '{net}.gml.zst' does not exist at {file_url}.")

    # Initialize the temp file paths so the cleanup in `finally` works
    # even if an exception is raised before they are assigned.
    tmp_zst_path = None
    tmp_gml_path = None

    try:
        # Download the compressed file in chunks
        with tempfile.NamedTemporaryFile(delete=False, suffix=".zst") as tmp_zst_file:
            tmp_zst_path = tmp_zst_file.name
            response = requests.get(file_url, headers=headers, stream=True, timeout=10)
            if response.status_code != 200:
                raise ValueError(
                    f"Failed to download the file. HTTP status: {response.status_code}"
                )
            for chunk in response.iter_content(chunk_size=1 << 16):
                tmp_zst_file.write(chunk)

        # Decompress with streaming to avoid loading the whole file into memory
        dctx = zstd.ZstdDecompressor()
        with tempfile.NamedTemporaryFile(delete=False, suffix=".gml") as tmp_gml_file:
            tmp_gml_path = tmp_gml_file.name
            with open(tmp_zst_path, "rb") as compressed:
                dctx.copy_stream(compressed, tmp_gml_file)

        g = ig.Graph.Read_GML(tmp_gml_path)

    except requests.RequestException as e:
        raise RuntimeError(f"Network error: {e}") from e
    except zstd.ZstdError as e:
        raise RuntimeError(f"Decompression error: {e}") from e
    finally:
        # Clean up temporary files
        for path in (tmp_zst_path, tmp_gml_path):
            if path is not None and os.path.exists(path):
                os.remove(path)

    return g

I shall add test cases for it as well. This could be one of the functions in repositories.py, and we can add functions for other file types as well. Or, once the code for .gt files is ready, we can replace it.

Hello!
I have already noticed some issues with it; maybe another way around them would be to change how we read GML files.

Can you please suggest some resources on how to implement a parser for the .gt format?

Hello @tamas, is there a specified format for the proposal?

Nope, we are flexible. Just write up something informal, with examples of the planned API of any new functions or classes you propose for addition, and then we can proceed from there.

Hello @tamas, here is the proposal I have drafted. Please let me know if I missed anything: GSoC 2025 Proposal - Google Docs

Greetings @tamas and @szhorvat! Please let me know your feedback on the proposal, or whether something needs to be clarified further.
Or should I go forward with submitting it as is?
Proposal doc

Greetings!
With about a week to the final deadline, I humbly request you to share your feedback on the proposal above.
Any positive or negative remarks from your end would be greatly appreciated.

I have submitted the proposal on the GSoC site as mentioned above and am now awaiting your feedback and suggestions for any necessary revisions.
With kind regards.