Chapter 1

Iterative Vector Diffusion for the Detection of Modularity in Large Networks

 

Jennifer Hallinan & Guy Smith

Institute for Molecular Biosciences & School of ITEE

The University of Queensland

J.Hallinan@imb.uq.edu.au

1.1.   Introduction: Systems as Networks

Systems biology is an emerging discipline which attempts to understand complex biological systems as coherent systems, rather than as collections of components. Complex biological systems can often be viewed as networks for this purpose, with the nodes of the network representing entities such as genes, proteins, or individual organisms, and the edges between nodes representing interactions between those entities.

Many of the biological systems which have been recast as networks have proven to be scale free and/or small world networks (Watts & Strogatz, 1998; Wagner & Fell, 2000; Jeong et al., 2000; Wagner, 2001), a finding which leads to the hypothesis that such an organization may be more or less ubiquitous in natural systems.

1.1.1.   Modularity in Biological Networks

The importance of a modular organization in the evolution and development of complex organisms is widely accepted. Modularity is evident at many scales, from cascades of gene activation, to the existence of developmental modules, to the presence of functional modularity in systems such as the brain. Biological modules have been argued to be fundamental units of selection; modules may be recombined in different ways to produce quite different functions, permitting the evolution and development of complex anatomies without a corresponding genomic complexity.

It is a moot question, however, whether this functional modularity is reflected in a corresponding structural modularity detectable in the type of interaction network discussed above. These networks are generally sparsely connected, with a scale-free pattern of connectivity ; is such an organization compatible with the existance of structural modules within the network ? If so, do these structural modules reflect functional modularity ? If this was in fact the case, the detection of structural modularity in an biological interaction network would permit the constuction of hypothesis with regard to underlying functional modularity, which could guide further research in vivo. The ability to computationally identify functionally important modules and the genes or proteins having the greatest importance to such modules would be invaluable for our understanding of cellular processes in health and disease.

There is some evidence for structural modularity in the most complete intracellular interaction network known to date, the protein-protein interaction network of the yeast Saccharomyces cerevisiae. Maslov & Sneppen (2002) note that the highly connected hubs in the yeast network tend to be connected to nodes of low connectivity, and suggest that this pattern is consistent with the hypothesis that the proteins form functional modules organized around individual hubs. Further, von Meering et al. (2002) suggest that most yeast proteins interact with others having the same cellular role and subcellular localization. If this is, in fact, the case, it implies that structural modularity may reflect, at least to some extent, an underlying functional modularity. However, this is only weak and indirect evidence for the existance of modularity in subcellular interaction networks.

The first step towards answering some of the questions regarding biological network modularity is the development of objective algorithms for the identification of modules in large networks. In this paper we describe such an algorithm, using an iterative vector diffusion approach. The performance of the algorithm is tested upon simulated data and a series of social networks with visually apparent modularity. It is then applied to the yeast protein-protein interaction dataset.

1.1.2   Detecting Modularity

Network analysis is widely applied to the study of social networks, and there are a number of standard algorithms applied to the task of identifying interesting subsets of nodes within a graph. Of these, closest to modularity detection is the clique. A clique is a subgraph in which every possible pair of nodes is directly connected and the clique is not contained in any other clique. Since the concept of a clique is widely used, and relatively close to that of a module, we used p-cliques of networks as a basis for comparison with our algorithm.

1.2.   The Iterative Vector Diffusion Algorithm

The iterative vector diffusion algorithm operates in the context of a graph G consisting of a vertex set V(G) = {v1…vn} and en edge set E(G) = {e1…em} where each edge consists of two vertices. The algorithm is initialized by assigning to each vertex a binary vector of length n, initialized to

The algorithm then proceeds iteratively. At each iteration an edge is selected at random and the vectors associated with each of its vertices are moved towards each other by a small amount, d. This diffusion process is interated until a stopping criterion is met. The number of iterations performed, n, is dependant upon both the number of connections in the network, c, and the size of d, such that

,

where a is the average amount by which a vector is changed in the course of the run. A value for a of 0.1 was selected empirically in trials on artificially generated networks.

The final set of vectors is then subjected to a standard clustering algorithm. The clustering algorithms used was Kohonen’s Self-Organizing Map (SOM) (Kohonen, 1996).

1.3.   Modularity in the IRC Networks

Since the objective of this algorithm is the detection of structural modularity in relatively large networks, a network with strong and visually apparent clustering was required for testing purposes. Although biological networks tend to sparsely connected, making the visual determination of topological structure difficult, social networks generally have relatively high connectivity. The algorithm was therefore tested on a large social network, derived from Internet Relay Chat (IRC) data, before being applied to a more challenging biological network, the Saccharomyces cerevisiae protein-protein interaction network.

1.3.1.   Data Sets

Internet Relay Chat (IRC) is a widely-used, internet-based discussion forum. The supporting software is free, widely available, easy to use and is available for all operating systems; IRC is thus used by thousands of individuals of wide-ranging geographic, social and technical backgrounds. On the day in which we collected data, the DALnet IRC network consisted of 29 servers supporting 97,100 users on 35,362 channels. A channel is a discussion forum. Anyone can start a channel, a single individual may be on many channels simultaneously, and anyone can join most channels. When a channel is empty it ceases to exist. The system is therefore large, highly dynamic and self-organizing. Social relations on IRC can be considered as a network, the nodes of which are channels. A link between two channels occurs when one or more individuals are on both channels simultaneously.

We collected two datasets. One set, the small network, consisted of the 1-neighbourhood of a single channel. The “large network” comprised the 2-neighbourhood of the same channel, collected at a different time on the same day.

1.3.2.   Results

Table 1 shows the statistics of the IRC networks. The diameter of a network is the average of the longest of the shortest paths between every pair of nodes; the cluster coefficient is a measure of the degree of local connectivity of the network. The cluster coefficient is defined by Watts (1999) as

, where  is the number of edges in the neighbourhood of vertex v, and is the total number of possible edges in Gv. The numbers in brackets beside the diameter and cluster coefficient of the network are the equivalent values for a randomly connected network with the same number of nodes and edges.

Table 1. Statistics measured for the IRC networks

Metric

Small Network

Large Network

Number of nodes

23

641

Number of links

44

2,891

Average connectivity

1.91

4.51

Diameter

2 (4)

5 (6)

Cluster coefficient

0.727 (0.114)

0.805 (0.013)

 

The small network is clearly far more structured than the equivalent random network. Figure 1 shows the results of modularity detection in this network using both the 0.5 p-cliques and the iterative vector diffusion algorithm, followed by clustering using a SOM. Both algorithms perform well on this network. The only significant difference between the two is the large cluster to the top left of the network, which is considered a single module under the p-cliques algorithm, but becomes two distinct clusters under iterative vector diffusion.

Figure 1. Detecting clusters in the small IRC network. a) The 0.5 p-cliques of the network. b) Iterative vector diffusion followed by a SOM clustering algorithm.

 

The large network is considerably more complex than the small one, both in terms of number of nodes and degree of connectivity.  Although there is clearly a significant degree of structure in this network compared with the equivalent random network, it is much more complex than the small network. Figure 2 shows the modularity found in this network. In this case p-cliques do not capture the local structural information in which we are interested. Iterative vector diffusion finds a larger number of clusters, many of which correspond to the visually apparent clusters in the network.

Figure 2.  a) 0.5 p-cliques of the large IRC network.

 

 

Figure 2 b) Modules detected by iterative vector diffusion on the large IRC network.

1.4   Modularity in the Yeast Protein-Protein Interaction Network

The yeast two-hybrid screening procedure (Fields & Song, 1989) is a high-throughput technique for the detection of interactions between pairs of proteins. The complete genome of the brewer’s yeast Saccharomyces. cerevisiae has been sequenced, and all the proteins it encodes can be synthesized. It is therefore possible to screen every protein produced by the organism against all others, and theoretically acquire a complete picture of the protein-protein interaction network. In practice the data collected are noisy and incomplete (see von Meering et al. for a systematic review of the data). Despite this, the yeast network is perhaps the most complete and well-characterized biological network currently available.

For these experiments we used a network consisting of the largest single connected component in the original yeast protein-protein interaction data collected by Uetz (2000). This network consists of 473 nodes (proteins) and 562 edges (interactions between proteins). The average connectivity of the network is thus 1.19, considerably lower than the connectivity of the IRC networks.

As well as data on the interactions between proteins, information of biologically important protein characteristics such as cellular rôle and subcellular localization is known for some of the proteins. The Yeast Protein Database classifies the proteins into 28 subcellular localizations. Since von Meering et al. (2002) argue that most proteins will interact with other proteins from the same subcellular location, it is feasible to hope that any clusters in the network will correspond largely with protein subcellular localization. Figure 3 shows the protein-protein interaction network, color-coded in three different ways. In part a) the nodes are coded-coded by subcellular localization; dark blue represents “Unknown”. In part b) the 0.5 p-cliques are shown, and in part c) the modularity found by iterative vector diffusion.

Figure 3. The yeast protein-protein interaction network. a) Nodes coded by subcellular localization; dark blue indicates the localization is unknown. b) 0.5 p-cliques of the network. c) Iterative vector diffusion on the network.

1.5   Discussion and Conclusions

On the simple IRC network, modularity is relatively simple to detect. Both the p-clique algorithm and the iterative vector diffusion algorithm identify the clusters of nodes which are visually apparent. For the more complex networks, however, the p-cliques do not correspond with the type of clusters likely to correspond to structural or functional modules; indeed, there is no reason to believe that cliques and locally coherent structural modules should coincide. There are consistently more iterative vector diffusion clusters than cliques, and the IVD clusters are more local. In a complex network, cliques tend to have a network-wide membership.

The modules identified in the yeast protein-protein interaction network are interesting because of their potential biological meaning. Two thirds of the proteins produced by Saccharomyces cerevisiae are uncharacterized with respect to subcellular localization. Of the 4,313 proteins with an identified location in the YPD as of May 2002, 962 (22%) occur in more than one subcellular compartment. In the network used, proteins with more than one location were classed as belonging to one of their identified locations at random. This simplification is likely to distort the clustering to some extent. Despite this complication, the modularity detection may still be useful. Inspection of Figure 4c) indicates that long, sparsely branching chains of interactions are characteristic of this network. Most of these form individual modules under IVD. Although the subcellular localization data is incomplete for all of these chains, combining the modularity assignment with the principle put forward by von Meering et al. (2002) that most proteins interact with others from the same subcellular compartment permits the identification of a set of proteins likely to be of interest to a researcher working on one of the set. This type of module-based analysis can be extended; the manner in which that can be done is the subject of ongoing work in our laboratory.

Many naturally arising networks can be seen to have considerable amounts of organization. In order to determine whether this topological organization reflects structural and/or functional modularity in the underlying system, it is necessary to develop algorithms for the objective detection of modularly linked clusters of nodes. We describe a very simple algorithm, the iterative vector diffusion algorithm, which is able to detect visually apparent clusters of nodes in networks of varying sizes and degrees of complexity. Modularity detection is the first step towards the objective quantification of the structure of the complex networks which are ubiquitous in the natural world, with the eventual aim of understanding the relationship between network structure, function and dynamics.

References

Brooks, R., & Maes, P. (ed.), 1994, Artificial Life IV, MIT Press (Cambridge).

Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95: 14863 - 14868.

Fields, S. & Song, O., 1989, A novel genetic system to detect protein-protein interactions. Nature 340: 245 - 246.

Hartman, J., & Wernecke, J., 1996, The VRML 2.0 Handbook - Building Moving Worlds on the Web, Addison-Wesley Developers Press (Reading).

Heudin, J.C. (ed.), 1998, Virtual Worlds - Proceedings of the First Int. Conf. on Virtual Worlds, Springer-Verlag Lecture Notes in Computer Science (Berlin), 1434, 5.

Husbands, P., & Harvey, I. (ed.), 1997, Fourth European Conference on Artificial Life, MIT Press (Cambridge).

Kohonen, T., Hynninen, J., Kangas, J. & Laaksonen, J. (1996). SOM_PAK: The Self-Organizing Map Program Package. Technical Report A31, Helsinki University of Technology.

Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. & Barabasi, A.-L., 2000, The large-scale organization of metabolic networks. Nature 407: 651 - 654.

Krueger, M.W., 1991, Artificial Reality II, Addison-Wesley (Reading).

Langton, C.G. (ed.), 1994, Artificial Life III, SFI Studies in the Sciences of Complexity, Addison-Wesley (Reading), 17.

Langton, C.G., 1988, Artificial Life, in Artificial Life, edited by C.G. Langton, SFI Studies in the Sciences of Complexity, Addison-Wesley (Reading), 6, 1.

Maslov, S. & Sneppen, K., 2002, Specificity and stability in topology of protein networks. Science 296: 910 - 913.

Sanchez, E., & Tomassini, M. (ed.), 1996, Towards Evolvable Hardware - The Evolutionary Engineering Approach, , Springer-Verlag Lecture Notes in Computer Science (Berlin), 1062.

Sutherland, I., 1965, The Ultimate Display, Proceedings IFIP Congress, 506.

Uetz, P., Glot, L., Cagney, G., Mansfield, T. A., Judson, R. S.,Knight, J. R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., Qureshi-Emili, A., Li, Y., Godwin, B., Conover, D., Kalbfleisch, T., Vijayadamodar, G., Yang, M., Johnston, M., Fields, S. & Rothberg, J. M., 2000, A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403: 623 - 631.

Verna, D., & Grumbach, A., 1998, Can we define Virtual Reality? The MRIC Model, in Virtual Worlds, edited by J.C. Heudin, Springer-Verlag Lecture Notes in Computer Science (Berlin), 1434, 41.

von Meering, C., Krause, R., Snel, B., Cornell, M., Oliver, S. G., Fields, S. & Bork, P., 2002, Comparative assessment of large-scale data sets of protein-protein interactions Nature 417: 399 - 403.

Wagner, A., 2001, The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Santa Fe Institute Working Paper 01-04-022.

Watts, D. J. & Strogatz, S. H., 1998, Collective dynamics of 'small-world' networks. Nature 393: 440 - 442.

Watts, D. J. (1999). Small Worlds: The Dynamics of Networks between Order and Randomness. Princeton, NJ: Princeton University Press.