Spaces:
Sleeping
Sleeping
""" | |
Graph summarization finds smaller representations of graphs resulting in faster | |
runtime of algorithms, reduced storage needs, and noise reduction. | |
Summarization has applications in areas such as visualization, pattern mining, | |
clustering and community detection, and more. Core graph summarization | |
techniques are grouping/aggregation, bit-compression, | |
simplification/sparsification, and influence based. Graph summarization | |
algorithms often produce either summary graphs in the form of supergraphs or | |
sparsified graphs, or a list of independent structures. Supergraphs are the | |
most common product, which consist of supernodes and original nodes and are | |
connected by edges and superedges, which represent aggregate edges between | |
nodes and supernodes. | |
Grouping/aggregation based techniques compress graphs by representing | |
close/connected nodes and edges in a graph by a single node/edge in a | |
supergraph. Nodes can be grouped together into supernodes based on their | |
structural similarities or proximity within a graph to reduce the total number | |
of nodes in a graph. Edge-grouping techniques group edges into lossy/lossless | |
nodes called compressor or virtual nodes to reduce the total number of edges in | |
a graph. Edge-grouping techniques can be lossless, meaning that they can be | |
used to re-create the original graph, or techniques can be lossy, requiring | |
less space to store the summary graph, but at the expense of lower | |
reconstruction accuracy of the original graph. | |
Bit-compression techniques minimize the amount of information needed to | |
describe the original graph, while revealing structural patterns in the | |
original graph. The two-part minimum description length (MDL) is often used to | |
represent the model and the original graph in terms of the model. A key | |
difference between graph compression and graph summarization is that graph | |
summarization focuses on finding structural patterns within the original graph, | |
whereas graph compression focuses on compressions the original graph to be as | |
small as possible. **NOTE**: Some bit-compression methods exist solely to | |
compress a graph without creating a summary graph or finding comprehensible | |
structural patterns. | |
Simplification/Sparsification techniques attempt to create a sparse | |
representation of a graph by removing unimportant nodes and edges from the | |
graph. Sparsified graphs differ from supergraphs created by | |
grouping/aggregation by only containing a subset of the original nodes and | |
edges of the original graph. | |
Influence based techniques aim to find a high-level description of influence | |
propagation in a large graph. These methods are scarce and have been mostly | |
applied to social graphs. | |
*dedensification* is a grouping/aggregation based technique to compress the | |
neighborhoods around high-degree nodes in unweighted graphs by adding | |
compressor nodes that summarize multiple edges of the same type to | |
high-degree nodes (nodes with a degree greater than a given threshold). | |
Dedensification was developed for the purpose of increasing performance of | |
query processing around high-degree nodes in graph databases and enables direct | |
operations on the compressed graph. The structural patterns surrounding | |
high-degree nodes in the original is preserved while using fewer edges and | |
adding a small number of compressor nodes. The degree of nodes present in the | |
original graph is also preserved. The current implementation of dedensification | |
supports graphs with one edge type. | |
For more information on graph summarization, see `Graph Summarization Methods | |
and Applications: A Survey <https://dl.acm.org/doi/abs/10.1145/3186727>`_ | |
""" | |
from collections import Counter, defaultdict | |
import networkx as nx | |
__all__ = ["dedensify", "snap_aggregation"] | |
def dedensify(G, threshold, prefix=None, copy=True): | |
"""Compresses neighborhoods around high-degree nodes | |
Reduces the number of edges to high-degree nodes by adding compressor nodes | |
that summarize multiple edges of the same type to high-degree nodes (nodes | |
with a degree greater than a given threshold). Dedensification also has | |
the added benefit of reducing the number of edges around high-degree nodes. | |
The implementation currently supports graphs with a single edge type. | |
Parameters | |
---------- | |
G: graph | |
A networkx graph | |
threshold: int | |
Minimum degree threshold of a node to be considered a high degree node. | |
The threshold must be greater than or equal to 2. | |
prefix: str or None, optional (default: None) | |
An optional prefix for denoting compressor nodes | |
copy: bool, optional (default: True) | |
Indicates if dedensification should be done inplace | |
Returns | |
------- | |
dedensified networkx graph : (graph, set) | |
2-tuple of the dedensified graph and set of compressor nodes | |
Notes | |
----- | |
According to the algorithm in [1]_, removes edges in a graph by | |
compressing/decompressing the neighborhoods around high degree nodes by | |
adding compressor nodes that summarize multiple edges of the same type | |
to high-degree nodes. Dedensification will only add a compressor node when | |
doing so will reduce the total number of edges in the given graph. This | |
implementation currently supports graphs with a single edge type. | |
Examples | |
-------- | |
Dedensification will only add compressor nodes when doing so would result | |
in fewer edges:: | |
>>> original_graph = nx.DiGraph() | |
>>> original_graph.add_nodes_from( | |
... ["1", "2", "3", "4", "5", "6", "A", "B", "C"] | |
... ) | |
>>> original_graph.add_edges_from( | |
... [ | |
... ("1", "C"), ("1", "B"), | |
... ("2", "C"), ("2", "B"), ("2", "A"), | |
... ("3", "B"), ("3", "A"), ("3", "6"), | |
... ("4", "C"), ("4", "B"), ("4", "A"), | |
... ("5", "B"), ("5", "A"), | |
... ("6", "5"), | |
... ("A", "6") | |
... ] | |
... ) | |
>>> c_graph, c_nodes = nx.dedensify(original_graph, threshold=2) | |
>>> original_graph.number_of_edges() | |
15 | |
>>> c_graph.number_of_edges() | |
14 | |
A dedensified, directed graph can be "densified" to reconstruct the | |
original graph:: | |
>>> original_graph = nx.DiGraph() | |
>>> original_graph.add_nodes_from( | |
... ["1", "2", "3", "4", "5", "6", "A", "B", "C"] | |
... ) | |
>>> original_graph.add_edges_from( | |
... [ | |
... ("1", "C"), ("1", "B"), | |
... ("2", "C"), ("2", "B"), ("2", "A"), | |
... ("3", "B"), ("3", "A"), ("3", "6"), | |
... ("4", "C"), ("4", "B"), ("4", "A"), | |
... ("5", "B"), ("5", "A"), | |
... ("6", "5"), | |
... ("A", "6") | |
... ] | |
... ) | |
>>> c_graph, c_nodes = nx.dedensify(original_graph, threshold=2) | |
>>> # re-densifies the compressed graph into the original graph | |
>>> for c_node in c_nodes: | |
... all_neighbors = set(nx.all_neighbors(c_graph, c_node)) | |
... out_neighbors = set(c_graph.neighbors(c_node)) | |
... for out_neighbor in out_neighbors: | |
... c_graph.remove_edge(c_node, out_neighbor) | |
... in_neighbors = all_neighbors - out_neighbors | |
... for in_neighbor in in_neighbors: | |
... c_graph.remove_edge(in_neighbor, c_node) | |
... for out_neighbor in out_neighbors: | |
... c_graph.add_edge(in_neighbor, out_neighbor) | |
... c_graph.remove_node(c_node) | |
... | |
>>> nx.is_isomorphic(original_graph, c_graph) | |
True | |
References | |
---------- | |
.. [1] Maccioni, A., & Abadi, D. J. (2016, August). | |
Scalable pattern matching over compressed graphs via dedensification. | |
In Proceedings of the 22nd ACM SIGKDD International Conference on | |
Knowledge Discovery and Data Mining (pp. 1755-1764). | |
http://www.cs.umd.edu/~abadi/papers/graph-dedense.pdf | |
""" | |
if threshold < 2: | |
raise nx.NetworkXError("The degree threshold must be >= 2") | |
degrees = G.in_degree if G.is_directed() else G.degree | |
# Group nodes based on degree threshold | |
high_degree_nodes = {n for n, d in degrees if d > threshold} | |
low_degree_nodes = G.nodes() - high_degree_nodes | |
auxiliary = {} | |
for node in G: | |
high_degree_neighbors = frozenset(high_degree_nodes & set(G[node])) | |
if high_degree_neighbors: | |
if high_degree_neighbors in auxiliary: | |
auxiliary[high_degree_neighbors].add(node) | |
else: | |
auxiliary[high_degree_neighbors] = {node} | |
if copy: | |
G = G.copy() | |
compressor_nodes = set() | |
for index, (high_degree_nodes, low_degree_nodes) in enumerate(auxiliary.items()): | |
low_degree_node_count = len(low_degree_nodes) | |
high_degree_node_count = len(high_degree_nodes) | |
old_edges = high_degree_node_count * low_degree_node_count | |
new_edges = high_degree_node_count + low_degree_node_count | |
if old_edges <= new_edges: | |
continue | |
compression_node = "".join(str(node) for node in high_degree_nodes) | |
if prefix: | |
compression_node = str(prefix) + compression_node | |
for node in low_degree_nodes: | |
for high_node in high_degree_nodes: | |
if G.has_edge(node, high_node): | |
G.remove_edge(node, high_node) | |
G.add_edge(node, compression_node) | |
for node in high_degree_nodes: | |
G.add_edge(compression_node, node) | |
compressor_nodes.add(compression_node) | |
return G, compressor_nodes | |
def _snap_build_graph( | |
G, | |
groups, | |
node_attributes, | |
edge_attributes, | |
neighbor_info, | |
edge_types, | |
prefix, | |
supernode_attribute, | |
superedge_attribute, | |
): | |
""" | |
Build the summary graph from the data structures produced in the SNAP aggregation algorithm | |
Used in the SNAP aggregation algorithm to build the output summary graph and supernode | |
lookup dictionary. This process uses the original graph and the data structures to | |
create the supernodes with the correct node attributes, and the superedges with the correct | |
edge attributes | |
Parameters | |
---------- | |
G: networkx.Graph | |
the original graph to be summarized | |
groups: dict | |
A dictionary of unique group IDs and their corresponding node groups | |
node_attributes: iterable | |
An iterable of the node attributes considered in the summarization process | |
edge_attributes: iterable | |
An iterable of the edge attributes considered in the summarization process | |
neighbor_info: dict | |
A data structure indicating the number of edges a node has with the | |
groups in the current summarization of each edge type | |
edge_types: dict | |
dictionary of edges in the graph and their corresponding attributes recognized | |
in the summarization | |
prefix: string | |
The prefix to be added to all supernodes | |
supernode_attribute: str | |
The node attribute for recording the supernode groupings of nodes | |
superedge_attribute: str | |
The edge attribute for recording the edge types represented by superedges | |
Returns | |
------- | |
summary graph: Networkx graph | |
""" | |
output = G.__class__() | |
node_label_lookup = {} | |
for index, group_id in enumerate(groups): | |
group_set = groups[group_id] | |
supernode = f"{prefix}{index}" | |
node_label_lookup[group_id] = supernode | |
supernode_attributes = { | |
attr: G.nodes[next(iter(group_set))][attr] for attr in node_attributes | |
} | |
supernode_attributes[supernode_attribute] = group_set | |
output.add_node(supernode, **supernode_attributes) | |
for group_id in groups: | |
group_set = groups[group_id] | |
source_supernode = node_label_lookup[group_id] | |
for other_group, group_edge_types in neighbor_info[ | |
next(iter(group_set)) | |
].items(): | |
if group_edge_types: | |
target_supernode = node_label_lookup[other_group] | |
summary_graph_edge = (source_supernode, target_supernode) | |
edge_types = [ | |
dict(zip(edge_attributes, edge_type)) | |
for edge_type in group_edge_types | |
] | |
has_edge = output.has_edge(*summary_graph_edge) | |
if output.is_multigraph(): | |
if not has_edge: | |
for edge_type in edge_types: | |
output.add_edge(*summary_graph_edge, **edge_type) | |
elif not output.is_directed(): | |
existing_edge_data = output.get_edge_data(*summary_graph_edge) | |
for edge_type in edge_types: | |
if edge_type not in existing_edge_data.values(): | |
output.add_edge(*summary_graph_edge, **edge_type) | |
else: | |
superedge_attributes = {superedge_attribute: edge_types} | |
output.add_edge(*summary_graph_edge, **superedge_attributes) | |
return output | |
def _snap_eligible_group(G, groups, group_lookup, edge_types): | |
""" | |
Determines if a group is eligible to be split. | |
A group is eligible to be split if all nodes in the group have edges of the same type(s) | |
with the same other groups. | |
Parameters | |
---------- | |
G: graph | |
graph to be summarized | |
groups: dict | |
A dictionary of unique group IDs and their corresponding node groups | |
group_lookup: dict | |
dictionary of nodes and their current corresponding group ID | |
edge_types: dict | |
dictionary of edges in the graph and their corresponding attributes recognized | |
in the summarization | |
Returns | |
------- | |
tuple: group ID to split, and neighbor-groups participation_counts data structure | |
""" | |
neighbor_info = {node: {gid: Counter() for gid in groups} for node in group_lookup} | |
for group_id in groups: | |
current_group = groups[group_id] | |
# build neighbor_info for nodes in group | |
for node in current_group: | |
neighbor_info[node] = {group_id: Counter() for group_id in groups} | |
edges = G.edges(node, keys=True) if G.is_multigraph() else G.edges(node) | |
for edge in edges: | |
neighbor = edge[1] | |
edge_type = edge_types[edge] | |
neighbor_group_id = group_lookup[neighbor] | |
neighbor_info[node][neighbor_group_id][edge_type] += 1 | |
# check if group_id is eligible to be split | |
group_size = len(current_group) | |
for other_group_id in groups: | |
edge_counts = Counter() | |
for node in current_group: | |
edge_counts.update(neighbor_info[node][other_group_id].keys()) | |
if not all(count == group_size for count in edge_counts.values()): | |
# only the neighbor_info of the returned group_id is required for handling group splits | |
return group_id, neighbor_info | |
# if no eligible groups, complete neighbor_info is calculated | |
return None, neighbor_info | |
def _snap_split(groups, neighbor_info, group_lookup, group_id): | |
""" | |
Splits a group based on edge types and updates the groups accordingly | |
Splits the group with the given group_id based on the edge types | |
of the nodes so that each new grouping will all have the same | |
edges with other nodes. | |
Parameters | |
---------- | |
groups: dict | |
A dictionary of unique group IDs and their corresponding node groups | |
neighbor_info: dict | |
A data structure indicating the number of edges a node has with the | |
groups in the current summarization of each edge type | |
edge_types: dict | |
dictionary of edges in the graph and their corresponding attributes recognized | |
in the summarization | |
group_lookup: dict | |
dictionary of nodes and their current corresponding group ID | |
group_id: object | |
ID of group to be split | |
Returns | |
------- | |
dict | |
The updated groups based on the split | |
""" | |
new_group_mappings = defaultdict(set) | |
for node in groups[group_id]: | |
signature = tuple( | |
frozenset(edge_types) for edge_types in neighbor_info[node].values() | |
) | |
new_group_mappings[signature].add(node) | |
# leave the biggest new_group as the original group | |
new_groups = sorted(new_group_mappings.values(), key=len) | |
for new_group in new_groups[:-1]: | |
# Assign unused integer as the new_group_id | |
# ids are tuples, so will not interact with the original group_ids | |
new_group_id = len(groups) | |
groups[new_group_id] = new_group | |
groups[group_id] -= new_group | |
for node in new_group: | |
group_lookup[node] = new_group_id | |
return groups | |
def snap_aggregation( | |
G, | |
node_attributes, | |
edge_attributes=(), | |
prefix="Supernode-", | |
supernode_attribute="group", | |
superedge_attribute="types", | |
): | |
"""Creates a summary graph based on attributes and connectivity. | |
This function uses the Summarization by Grouping Nodes on Attributes | |
and Pairwise edges (SNAP) algorithm for summarizing a given | |
graph by grouping nodes by node attributes and their edge attributes | |
into supernodes in a summary graph. This name SNAP should not be | |
confused with the Stanford Network Analysis Project (SNAP). | |
Here is a high-level view of how this algorithm works: | |
1) Group nodes by node attribute values. | |
2) Iteratively split groups until all nodes in each group have edges | |
to nodes in the same groups. That is, until all the groups are homogeneous | |
in their member nodes' edges to other groups. For example, | |
if all the nodes in group A only have edge to nodes in group B, then the | |
group is homogeneous and does not need to be split. If all nodes in group B | |
have edges with nodes in groups {A, C}, but some also have edges with other | |
nodes in B, then group B is not homogeneous and needs to be split into | |
groups have edges with {A, C} and a group of nodes having | |
edges with {A, B, C}. This way, viewers of the summary graph can | |
assume that all nodes in the group have the exact same node attributes and | |
the exact same edges. | |
3) Build the output summary graph, where the groups are represented by | |
super-nodes. Edges represent the edges shared between all the nodes in each | |
respective groups. | |
A SNAP summary graph can be used to visualize graphs that are too large to display | |
or visually analyze, or to efficiently identify sets of similar nodes with similar connectivity | |
patterns to other sets of similar nodes based on specified node and/or edge attributes in a graph. | |
Parameters | |
---------- | |
G: graph | |
Networkx Graph to be summarized | |
node_attributes: iterable, required | |
An iterable of the node attributes used to group nodes in the summarization process. Nodes | |
with the same values for these attributes will be grouped together in the summary graph. | |
edge_attributes: iterable, optional | |
An iterable of the edge attributes considered in the summarization process. If provided, unique | |
combinations of the attribute values found in the graph are used to | |
determine the edge types in the graph. If not provided, all edges | |
are considered to be of the same type. | |
prefix: str | |
The prefix used to denote supernodes in the summary graph. Defaults to 'Supernode-'. | |
supernode_attribute: str | |
The node attribute for recording the supernode groupings of nodes. Defaults to 'group'. | |
superedge_attribute: str | |
The edge attribute for recording the edge types of multiple edges. Defaults to 'types'. | |
Returns | |
------- | |
networkx.Graph: summary graph | |
Examples | |
-------- | |
SNAP aggregation takes a graph and summarizes it in the context of user-provided | |
node and edge attributes such that a viewer can more easily extract and | |
analyze the information represented by the graph | |
>>> nodes = { | |
... "A": dict(color="Red"), | |
... "B": dict(color="Red"), | |
... "C": dict(color="Red"), | |
... "D": dict(color="Red"), | |
... "E": dict(color="Blue"), | |
... "F": dict(color="Blue"), | |
... } | |
>>> edges = [ | |
... ("A", "E", "Strong"), | |
... ("B", "F", "Strong"), | |
... ("C", "E", "Weak"), | |
... ("D", "F", "Weak"), | |
... ] | |
>>> G = nx.Graph() | |
>>> for node in nodes: | |
... attributes = nodes[node] | |
... G.add_node(node, **attributes) | |
... | |
>>> for source, target, type in edges: | |
... G.add_edge(source, target, type=type) | |
... | |
>>> node_attributes = ('color', ) | |
>>> edge_attributes = ('type', ) | |
>>> summary_graph = nx.snap_aggregation(G, node_attributes=node_attributes, edge_attributes=edge_attributes) | |
Notes | |
----- | |
The summary graph produced is called a maximum Attribute-edge | |
compatible (AR-compatible) grouping. According to [1]_, an | |
AR-compatible grouping means that all nodes in each group have the same | |
exact node attribute values and the same exact edges and | |
edge types to one or more nodes in the same groups. The maximal | |
AR-compatible grouping is the grouping with the minimal cardinality. | |
The AR-compatible grouping is the most detailed grouping provided by | |
any of the SNAP algorithms. | |
References | |
---------- | |
.. [1] Y. Tian, R. A. Hankins, and J. M. Patel. Efficient aggregation | |
for graph summarization. In Proc. 2008 ACM-SIGMOD Int. Conf. | |
Management of Data (SIGMOD’08), pages 567–580, Vancouver, Canada, | |
June 2008. | |
""" | |
edge_types = { | |
edge: tuple(attrs.get(attr) for attr in edge_attributes) | |
for edge, attrs in G.edges.items() | |
} | |
if not G.is_directed(): | |
if G.is_multigraph(): | |
# list is needed to avoid mutating while iterating | |
edges = [((v, u, k), etype) for (u, v, k), etype in edge_types.items()] | |
else: | |
# list is needed to avoid mutating while iterating | |
edges = [((v, u), etype) for (u, v), etype in edge_types.items()] | |
edge_types.update(edges) | |
group_lookup = { | |
node: tuple(attrs[attr] for attr in node_attributes) | |
for node, attrs in G.nodes.items() | |
} | |
groups = defaultdict(set) | |
for node, node_type in group_lookup.items(): | |
groups[node_type].add(node) | |
eligible_group_id, neighbor_info = _snap_eligible_group( | |
G, groups, group_lookup, edge_types | |
) | |
while eligible_group_id: | |
groups = _snap_split(groups, neighbor_info, group_lookup, eligible_group_id) | |
eligible_group_id, neighbor_info = _snap_eligible_group( | |
G, groups, group_lookup, edge_types | |
) | |
return _snap_build_graph( | |
G, | |
groups, | |
node_attributes, | |
edge_attributes, | |
neighbor_info, | |
edge_types, | |
prefix, | |
supernode_attribute, | |
superedge_attribute, | |
) | |