Chapter
1
Iterative Vector Diffusion for the Detection of Modularity in Large Networks
Jennifer
Hallinan & Guy Smith
Institute for
Molecular Biosciences & School of ITEE
The University of
Queensland
J.Hallinan@imb.uq.edu.au
Systems biology is an emerging discipline which attempts to understand complex biological systems as coherent systems, rather than as collections of components. Complex biological systems can often be viewed as networks for this purpose, with the nodes of the network representing entities such as genes, proteins, or individual organisms, and the edges between nodes representing interactions between those entities.
Many of the biological systems
which have been recast as networks have proven to be scale free and/or small
world networks (Watts & Strogatz, 1998; Wagner & Fell, 2000; Jeong et al., 2000; Wagner, 2001), a finding
which leads to the hypothesis that such an organization may be more or less
ubiquitous in natural systems.
The importance of a modular organization in the evolution and development of complex organisms is widely accepted. Modularity is evident at many scales, from cascades of gene activation, to the existence of developmental modules, to the presence of functional modularity in systems such as the brain. Biological modules have been argued to be fundamental units of selection; modules may be recombined in different ways to produce quite different functions, permitting the evolution and development of complex anatomies without a corresponding genomic complexity.
It is a moot question, however, whether this functional modularity is reflected in a corresponding structural modularity detectable in the type of interaction network discussed above. These networks are generally sparsely connected, with a scale-free pattern of connectivity ; is such an organization compatible with the existance of structural modules within the network ? If so, do these structural modules reflect functional modularity ? If this was in fact the case, the detection of structural modularity in an biological interaction network would permit the constuction of hypothesis with regard to underlying functional modularity, which could guide further research in vivo. The ability to computationally identify functionally important modules and the genes or proteins having the greatest importance to such modules would be invaluable for our understanding of cellular processes in health and disease.
There is some evidence for structural modularity in the most complete intracellular interaction network known to date, the protein-protein interaction network of the yeast Saccharomyces cerevisiae. Maslov & Sneppen (2002) note that the highly connected hubs in the yeast network tend to be connected to nodes of low connectivity, and suggest that this pattern is consistent with the hypothesis that the proteins form functional modules organized around individual hubs. Further, von Meering et al. (2002) suggest that most yeast proteins interact with others having the same cellular role and subcellular localization. If this is, in fact, the case, it implies that structural modularity may reflect, at least to some extent, an underlying functional modularity. However, this is only weak and indirect evidence for the existance of modularity in subcellular interaction networks.
The first step towards answering some of the questions regarding biological network modularity is the development of objective algorithms for the identification of modules in large networks. In this paper we describe such an algorithm, using an iterative vector diffusion approach. The performance of the algorithm is tested upon simulated data and a series of social networks with visually apparent modularity. It is then applied to the yeast protein-protein interaction dataset.
Network analysis is widely
applied to the study of social networks, and there are a number of standard
algorithms applied to the task of identifying interesting subsets of nodes within
a graph. Of these, closest to modularity detection is the clique. A clique is a subgraph in which
every possible pair of nodes is directly connected and the clique is not
contained in any other clique. Since the concept of a clique is widely used,
and relatively close to that of a module, we used p-cliques of networks as a basis for comparison with our
algorithm.
The iterative vector diffusion
algorithm operates in the context of a graph G
consisting of a vertex set V(G) =
{v1…vn} and en edge set E(G) = {e1…em} where
each edge consists of two vertices. The algorithm is initialized by assigning
to each vertex a binary vector of length n,
initialized to
![]()
The algorithm then proceeds iteratively. At each iteration an edge is selected at random and the vectors associated with each of its vertices are moved towards each other by a small amount, d. This diffusion process is interated until a stopping criterion is met. The number of iterations performed, n, is dependant upon both the number of connections in the network, c, and the size of d, such that
,
where a is the average amount by which a vector is changed in the course of the run. A value for a of 0.1 was selected empirically in trials on artificially generated networks.
The final set of vectors is then subjected to a
standard clustering algorithm. The clustering algorithms used was Kohonen’s
Self-Organizing Map (SOM) (Kohonen, 1996).
Since the objective of this
algorithm is the detection of structural modularity in relatively large
networks, a network with strong and visually apparent clustering was required
for testing purposes. Although biological networks tend to sparsely connected,
making the visual determination of topological structure difficult, social
networks generally have relatively high connectivity. The algorithm was
therefore tested on a large social network, derived from Internet Relay Chat
(IRC) data, before being applied to a more challenging biological network, the Saccharomyces cerevisiae protein-protein
interaction network.
Internet Relay Chat (IRC) is a
widely-used, internet-based discussion forum. The supporting software is free,
widely available, easy to use and is available for all operating systems; IRC
is thus used by thousands of individuals of wide-ranging geographic, social and
technical backgrounds. On the day in which we collected data, the DALnet IRC
network consisted of 29 servers supporting 97,100 users on 35,362 channels. A
channel is a discussion forum. Anyone can start a channel, a single individual
may be on many channels simultaneously, and anyone can join most channels. When
a channel is empty it ceases to exist. The system is therefore large, highly
dynamic and self-organizing. Social relations on IRC can be considered as a
network, the nodes of which are channels. A link between two channels occurs
when one or more individuals are on both channels simultaneously.
We collected two datasets. One
set, the small network, consisted of the 1-neighbourhood of a single channel.
The “large network” comprised the 2-neighbourhood of the same channel,
collected at a different time on the same day.
Table 1 shows the
statistics of the IRC networks. The diameter of a network is the average of the
longest of the shortest paths between every pair of nodes; the cluster
coefficient is a measure of the degree of local connectivity of the network.
The cluster coefficient is defined by Watts (1999) as
, where
is the number of
edges in the neighbourhood of vertex v,
and
is the total number of possible edges in Gv. The numbers in brackets beside the diameter
and cluster coefficient of the network are the equivalent values for a randomly
connected network with the same number of nodes and edges.
Table 1. Statistics measured for the IRC networks
|
Metric |
Small Network |
Large Network |
|
Number of nodes |
23 |
641 |
|
Number of links |
44 |
2,891 |
|
Average connectivity |
1.91 |
4.51 |
|
Diameter |
2 (4) |
5 (6) |
|
Cluster coefficient |
0.727 (0.114) |
0.805 (0.013) |
The small network is clearly
far more structured than the equivalent random network. Figure 1 shows the
results of modularity detection in this network using both the 0.5 p-cliques and the iterative vector
diffusion algorithm, followed by clustering using a SOM. Both algorithms
perform well on this network. The only significant difference between the two
is the large cluster to the top left of the network, which is considered a
single module under the p-cliques
algorithm, but becomes two distinct clusters under iterative vector diffusion.

Figure 1. Detecting clusters in the small IRC network. a) The 0.5 p-cliques of the network. b) Iterative vector diffusion followed by a SOM clustering algorithm.
The large network is
considerably more complex than the small one, both in terms of number of nodes
and degree of connectivity. Although
there is clearly a significant degree of structure in this network compared
with the equivalent random network, it is much more complex than the small
network. Figure 2 shows the modularity found in this network. In this case p-cliques do not capture the local
structural information in which we are interested. Iterative vector diffusion
finds a larger number of clusters, many of which correspond to the visually
apparent clusters in the network.

Figure 2. a) 0.5 p-cliques of the large IRC network.

Figure 2 b) Modules detected by iterative vector diffusion on the large IRC network.
The yeast two-hybrid screening
procedure (Fields & Song, 1989) is a high-throughput technique for the
detection of interactions between pairs of proteins. The complete genome of the
brewer’s yeast Saccharomyces. cerevisiae
has been sequenced, and all the proteins it encodes can be synthesized. It is
therefore possible to screen every protein produced by the organism against all
others, and theoretically acquire a complete picture of the protein-protein
interaction network. In practice the data collected are noisy and incomplete
(see von Meering et al. for a
systematic review of the data). Despite this, the yeast network is perhaps the
most complete and well-characterized biological network currently available.
For these
experiments we used a network consisting of the largest single connected
component in the original yeast protein-protein interaction data collected by
Uetz (2000). This network consists of 473 nodes (proteins) and
562 edges (interactions between proteins). The average connectivity of the
network is thus 1.19, considerably lower than the connectivity of the IRC
networks.
As well as data on
the interactions between proteins, information of biologically important
protein characteristics such as cellular rôle and subcellular localization is
known for some of the proteins. The Yeast Protein Database classifies the
proteins into 28 subcellular localizations. Since von Meering et al. (2002) argue that most proteins
will interact with other proteins from the same subcellular location, it is
feasible to hope that any clusters in the network will correspond largely with
protein subcellular localization. Figure 3 shows the protein-protein
interaction network, color-coded in three different ways. In part a) the nodes
are coded-coded by subcellular localization; dark blue represents “Unknown”. In
part b) the 0.5 p-cliques are
shown, and in part c) the modularity found by iterative vector diffusion.

Figure 3. The yeast protein-protein interaction network. a) Nodes coded by subcellular localization; dark blue indicates the localization is unknown. b) 0.5 p-cliques of the network. c) Iterative vector diffusion on the network.
On the simple IRC network,
modularity is relatively simple to detect. Both the p-clique algorithm and the iterative vector diffusion
algorithm identify the clusters of nodes which are visually apparent. For the
more complex networks, however, the p-cliques
do not correspond with the type of clusters likely to correspond to structural
or functional modules; indeed, there is no reason to believe that cliques and
locally coherent structural modules should coincide. There are consistently
more iterative vector diffusion clusters than cliques, and the IVD clusters are
more local. In a complex network, cliques tend to have a network-wide
membership.
The modules identified in the
yeast protein-protein interaction network are interesting because of their
potential biological meaning. Two thirds of the proteins produced by Saccharomyces cerevisiae are
uncharacterized with respect to subcellular localization. Of the 4,313 proteins
with an identified location in the YPD as of May 2002, 962 (22%) occur in more
than one subcellular compartment. In the network used, proteins with more than
one location were classed as belonging to one of their identified locations at
random. This simplification is likely to distort the clustering to some extent.
Despite this complication, the modularity detection may still be useful.
Inspection of Figure 4c) indicates that long, sparsely branching chains of
interactions are characteristic of this network. Most of these form individual
modules under IVD. Although the subcellular localization data is incomplete for
all of these chains, combining the modularity assignment with the principle put
forward by von Meering et al.
(2002) that most proteins interact with others from the same subcellular
compartment permits the identification of a set of proteins likely to be of
interest to a researcher working on one of the set. This type of module-based
analysis can be extended; the manner in which that can be done is the subject
of ongoing work in our laboratory.
Many naturally arising
networks can be seen to have considerable amounts of organization. In order to
determine whether this topological organization reflects structural and/or
functional modularity in the underlying system, it is necessary to develop
algorithms for the objective detection of modularly linked clusters of nodes.
We describe a very simple algorithm, the iterative vector diffusion algorithm,
which is able to detect visually apparent clusters of nodes in networks of
varying sizes and degrees of complexity. Modularity detection is the first step
towards the objective quantification of the structure of the complex networks
which are ubiquitous in the natural world, with the eventual aim of
understanding the relationship between network structure, function and
dynamics.
Brooks, R., & Maes, P. (ed.), 1994, Artificial Life IV, MIT Press (Cambridge).
Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95: 14863 - 14868.
Fields, S. & Song, O., 1989, A novel genetic system to detect protein-protein interactions. Nature 340: 245 - 246.
Hartman, J., & Wernecke, J., 1996, The VRML 2.0 Handbook - Building Moving Worlds on the Web, Addison-Wesley Developers Press (Reading).
Heudin, J.C. (ed.), 1998, Virtual Worlds - Proceedings of the First Int. Conf. on Virtual Worlds, Springer-Verlag Lecture Notes in Computer Science (Berlin), 1434, 5.
Husbands, P., & Harvey, I. (ed.), 1997, Fourth European Conference on Artificial Life, MIT Press (Cambridge).
Kohonen, T., Hynninen, J., Kangas, J. & Laaksonen, J. (1996). SOM_PAK: The Self-Organizing Map Program Package. Technical Report A31, Helsinki University of Technology.
Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. & Barabasi, A.-L., 2000, The large-scale organization of metabolic networks. Nature 407: 651 - 654.
Krueger, M.W., 1991, Artificial Reality II, Addison-Wesley (Reading).
Langton, C.G. (ed.), 1994, Artificial Life III, SFI Studies in the Sciences of Complexity, Addison-Wesley (Reading), 17.
Langton, C.G., 1988, Artificial Life, in Artificial Life, edited by C.G. Langton, SFI Studies in the Sciences of Complexity, Addison-Wesley (Reading), 6, 1.
Maslov, S. & Sneppen, K., 2002, Specificity and stability in topology of protein networks. Science 296: 910 - 913.
Sanchez, E., & Tomassini, M. (ed.), 1996, Towards Evolvable Hardware - The Evolutionary Engineering Approach, , Springer-Verlag Lecture Notes in Computer Science (Berlin), 1062.
Sutherland, I., 1965, The Ultimate Display, Proceedings IFIP Congress, 506.
Uetz, P., Glot, L., Cagney, G., Mansfield, T. A., Judson, R. S.,Knight, J. R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., Qureshi-Emili, A., Li, Y., Godwin, B., Conover, D., Kalbfleisch, T., Vijayadamodar, G., Yang, M., Johnston, M., Fields, S. & Rothberg, J. M., 2000, A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403: 623 - 631.
Verna, D., & Grumbach, A., 1998, Can we define Virtual Reality? The MRIC Model, in Virtual Worlds, edited by J.C. Heudin, Springer-Verlag Lecture Notes in Computer Science (Berlin), 1434, 41.
von Meering, C., Krause, R., Snel, B., Cornell, M., Oliver, S. G., Fields, S. & Bork, P., 2002, Comparative assessment of large-scale data sets of protein-protein interactions Nature 417: 399 - 403.
Wagner, A., 2001, The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Santa Fe Institute Working Paper 01-04-022.
Watts, D. J. & Strogatz, S. H., 1998, Collective dynamics of 'small-world' networks. Nature 393: 440 - 442.
Watts, D. J. (1999). Small Worlds: The Dynamics of Networks between Order and Randomness. Princeton, NJ: Princeton University Press.