Router-level community structure of the Internet Autonomous Systems

In a first stage, we analysed the community structure of the Internet router-level graph by applying several community detection methods:

Infomap, by Rosvall and Bergstrom, based on the description length [21].

LPM, the Label Propagation Method by Raghavan et al. [22], which performs a diffusion process on the graph.

Deltacom, an efficient greedy algorithm for modularity optimization, introduced in this work.

Louvain, a fast modularity optimization algorithm [23].

CommUGP, a local community detection method [24].

In the following subsection we shall introduce Deltacom, an algorithm for hierarchical partitioning based on modularity maximization. Later on, we shall evaluate the results using a similarity metric.

3.1 A hierarchical partitioning method based on multiresolution modularity

Deltacom is based on the optimization of modularity. We recall that modularity was introduced by Newman in [25], [26] and, since then, several methods for its maximization have been proposed, like the one by Guimerà and Amaral based on simulated annealing [27], the extremal optimization method by Duch and Arenas [28], the fast greedy algorithm by Blondel et al. [23], or the multilevel algorithm by Noack and Rotta [29].

The following two results about modularity are fundamental to our present work: in [30] Fortunato and Barthélemy showed that modularity has a resolution limit, i.e., its maximization tends to merge small communities that are connected to each other; in [31] Reichardt and Bornholdt observed that modularity can be understood as the Hamiltonian of a ferromagnetic Potts model. Maximizing the modularity thus amounts to finding the ground state of this model, and the authors developed a simulated annealing procedure for doing so.

Deltacom considers the optimization of the modularity $Q$ as a particular case of the optimization of a more general functional $Q_t$ with a resolution parameter $t$, in which modularity corresponds to a normalized resolution, i.e., $t = 1$. The source of our idea can be traced back to the γ resolution parameter in [31], and has also been followed by Lancichinetti and Fortunato in [32]. In [31], the maximization is based on simulated annealing, it must run at one single resolution each time, and its computational complexity is very high. The advantage of our method is that the resolution evolves dynamically, so that all the partitions at different resolutions can be produced during one single run, with a low computational complexity.

Deltacom is based on an agglomerative greedy algorithm; each step of the agglomerative process is a local maximum for $Q_t$, the generalized modularity, at some particular resolution $t$. In other words, for each $t$-value a community partition is found which is locally $t$-optimal in some sense. As these communities are joined, the resolution shrinks and $t$ gets smaller. For $t = 1$, the structure is a local maximum for the classical modularity. In other words, we observe the graph as if through a magnifying glass, obtaining partitions at every resolution level. The result is a hierarchical structure in which partitions at higher resolution levels are always refinements (in a mathematical sense) of those at lower ones.

Here we present a brief description of how the algorithm works. Further mathematical details on the properties of $Q_t$ and on our maximization algorithm can be found in [33].

3.1.1 Newman’s modularity

Given a graph $G = (V, E)$ and a partition $\mathcal{C}$ of it, the modularity of the partition is defined as [26]:

$$Q(\mathcal{C}) = \frac{1}{2m} \sum_{i,j \in V} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(i,j), \qquad (1)$$

where $m$ is the number of edges, $A = (A_{ij})$, $i, j \in V$, is the adjacency matrix of the graph, $k_i$ is the degree of node $i$, and $\delta(i,j)$ is the indicator function telling whether nodes $i$ and $j$ belong to the same community in the partition $\mathcal{C}$.

The expression on the right side of Eq. 1 compares the internal connections in the communities (represented by the $A_{ij} \cdot \delta(i,j)$ product) with the expected number of connections under a random graph with the same expected node degrees and the same communities. It can be restated as a sum over the communities, in the following way:

$$Q(\mathcal{C}) = \sum_{C \in \mathcal{C}} \left[ \frac{e_C}{2m} - \left( \frac{k_C}{2m} \right)^2 \right], \qquad (2)$$

where $k_C$ is the sum of degrees of the nodes in each community $C$, and $e_C$ is twice the number of internal edges of $C$.
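As a concrete illustration, Eq. 2 can be evaluated directly from an edge list. The following sketch (a helper of ours, not part of the paper's code) accumulates $k_C$ and $e_C$ per community:

```python
from collections import defaultdict

def modularity(edges, labels):
    """Q via Eq. 2: sum over communities of e_C/(2m) - (k_C/(2m))^2,
    where e_C is twice the number of internal edges of community C
    and k_C is the sum of the degrees of its nodes."""
    m = len(edges)
    k_c = defaultdict(int)   # k_C: total degree per community
    e_c = defaultdict(int)   # e_C: twice the internal edges per community
    for u, v in edges:
        k_c[labels[u]] += 1
        k_c[labels[v]] += 1
        if labels[u] == labels[v]:
            e_c[labels[u]] += 2
    return sum(e_c[c] / (2 * m) - (k_c[c] / (2 * m)) ** 2 for c in k_c)

# Two triangles joined by a bridge, split at the bridge: Q = 5/14.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'}
print(modularity(edges, labels))   # 0.35714...
```

Putting all six nodes into a single community makes the internal-edge term and the null-model term cancel, giving $Q = 0$.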

Now, in many agglomerative clustering methods, communities are joined in pairs at each step, until a local maximum is found. Joining two communities $C$ and $C'$ during the process produces the following variation in the $Q$ functional:

$$\Delta Q = \frac{e(C,C')}{m} - \frac{k_C \, k_{C'}}{2m^2}, \qquad (3)$$

where $e(C,C')$ is the number of edges between $C$ and $C'$.

3.1.2 Introducing a resolution parameter

We introduce a resolution parameter $t$ into this functional as

$$Q_t(\mathcal{C}) = \sum_{C \in \mathcal{C}} \left[ \frac{e_C}{2m} - t \left( \frac{k_C}{2m} \right)^2 \right]. \qquad (4)$$

Now the functional variation after joining two communities $C$ and $C'$ becomes:

$$\Delta Q_t = \frac{e(C,C')}{m} - t \, \frac{k_C \, k_{C'}}{2m^2}. \qquad (5)$$

It is clear that a positive $\Delta Q_t$ value for a particular $t$ also implies a positive (and even higher) $\Delta Q_{t'}$ for any resolution $t' < t$. In other words, any agglomerative process which monotonically increases $Q_t$ also serves as a process monotonically increasing $Q_{t'}$ for any $t' < t$.

It is also immediate that a very large $t$ value discourages any join, because $\Delta Q_t$ is negative for every pair of communities. It can also be shown that for a large enough $t$ the global maximum of $Q_t$ has each node isolated in its own community.

Thus, what we propose is to start with a large enough $t$ (so that the optimal initial partition $\mathcal{C}_t$ has as many communities as nodes) and then to decrease $t$ as little as possible so that a join can be made, i.e., so that some $\Delta Q_t$ becomes positive. This $t$ value is just

$$t(\mathcal{C}) = \max_{C, C' \in \mathcal{C}} \left\{ \frac{e(C,C') \cdot 2m}{k_C \cdot k_{C'}} \right\}. \qquad (6)$$

When a local maximum is found for some $t$ (i.e., no pair of communities can be joined without decreasing $Q_t$), we decrease the resolution to a $t'$ such that $\Delta Q_{t'}$ becomes positive for some pair of communities, following the previous formula. The agglomerative process continues until there are as many communities as connected components in the graph. The result is a set of locally optimal partitions $\mathcal{C}_t$, one for every resolution $t$, such that the finer partitions are refinements of the coarser-grained ones. Moreover, these partitions are what we call weakly optimal partitions, in the sense that not only would the join of any pair of communities $C$, $C'$ decrease $Q_t$, but any coarser partition (i.e., one obtained by joining its communities in any way) would also decrease it.

The code for the Deltacom algorithm is freely available at SourceForge [34]; given a graph as a list of edges, it produces a set of community partitions for every resolution value.
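For intuition, the agglomerative loop can be sketched in a few lines. This is a naive rendering of the process under the definitions above, not the released implementation: it scans every connected community pair, merges the one with the largest joining threshold given by Eq. 6, and records each partition together with the resolution at which it is abandoned:

```python
from collections import defaultdict

def deltacom_sketch(edges):
    """Naive sketch of the multiresolution agglomeration (Eq. 6)."""
    m = len(edges)
    comm = {}                     # node -> community id
    k = defaultdict(int)          # k_C: total degree per community
    cut = defaultdict(int)        # e(C,C'): edges between two communities
    for u, v in edges:
        comm.setdefault(u, u)
        comm.setdefault(v, v)
        k[u] += 1
        k[v] += 1
        if u != v:
            cut[tuple(sorted((u, v)))] += 1
    levels = []                   # (t, partition) pairs, coarser as t decreases
    while cut:
        # the pair whose joining threshold t* = e(C,C')*2m/(k_C*k_C') is largest
        (a, b), e_ab = max(cut.items(),
                           key=lambda it: it[1] / (k[it[0][0]] * k[it[0][1]]))
        t = 2 * m * e_ab / (k[a] * k[b])
        levels.append((t, dict(comm)))          # partition just before the join
        for n, c in comm.items():               # merge community b into a
            if c == b:
                comm[n] = a
        k[a] += k.pop(b)
        merged = defaultdict(int)
        for (x, y), w in cut.items():           # rebuild the cut sizes
            x, y = (a if x == b else x), (a if y == b else y)
            if x != y:
                merged[tuple(sorted((x, y)))] += w
        cut = merged
    return levels
```

On the two-triangles-plus-bridge graph used above, the last recorded level is the two-triangle partition at resolution $t = 2m \cdot 1 / (7 \cdot 7) = 2/7$, after which the bridge merge collapses everything into one community.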

The formalization of the process is contained in Algorithm 1. This algorithm produces weakly optimal partitions for the generalized modularity $Q_t$. In particular, $\mathcal{C}_1$ is a weakly optimal partition for the classical modularity $Q$. The notation $\mathcal{C}_{(a,b]}$ indicates that the partition is weakly optimal for every resolution in the interval $(a, b]$. For the complexity analysis, we shall consider keeping a set of lists $L_C$, one for each community, containing its neighbour communities $C'$ and the edge cut with each of them. The lists are ordered by community identifier. The initial construction of these structures has time complexity $O(m \log m)$ (as each list has size $s_C$, with $\sum_C s_C = m$ and $\sum_C s_C \log s_C \le m \log m$). The while loop starting at line 1.6 runs at most $n$ times; each iteration has three steps: (a) finding a pair of communities with $\Delta Q_t = 0$; (b) joining the communities; and (c) decreasing $t$. The first step implies traversing all the lists; each $\Delta Q_t$ computation is $O(1)$, so this step can be bounded by $O(m)$. Joining two communities implies merging their lists into one, and also updating the lists of their neighbour communities; as all the lists are ordered, this can be done by traversing them, which is again bounded by $O(m)$. Finally, decreasing $t$ implies traversing all the lists and computing $e(C,C') \cdot 2m / (k_C \cdot k_{C'})$ for each element, which is once more $O(m)$. Putting it all together, we have at most $n$ iterations of cost $O(m)$ each, so the final complexity is $O(nm)$.

While $O(nm)$ is a theoretical upper bound for a general graph, the practical running time on sparse graphs can be greatly reduced with two observations. First, step (c) gives all the information needed for the next step (a), because the community pair which maximized $t'$ in line 1.11 is the same one that will have $\Delta Q_t = 0$ in the next iteration. Second, by keeping a search structure such as an ordered tree holding all the pairs of connected communities $C$, $C'$, sorted decreasingly by $e(C,C') \cdot 2m / (k_C \cdot k_{C'})$, there is no need to traverse all the lists in each iteration: we only update the modified values in this tree and take the community pair at its head as the one maximizing $t'$. For sparse graphs, this process has an upper complexity bound of $O(m \log m \cdot s_C^{\max})$, where $s_C^{\max}$ is the maximum number of neighbour communities that a community may have at any time during the process.
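One common way to realize such a priority structure is a lazy-deletion binary heap; the class below is a hypothetical helper of ours (the paper describes an ordered tree, which serves the same role). Pairs that became stale after a merge are detected through per-community version counters and skipped on extraction:

```python
import heapq

class PairQueue:
    """Lazy max-heap over community pairs, keyed by the joining threshold
    t* = e(C,C')*2m/(k_C*k_C'). Python's heapq is a min-heap, so -t* is
    stored; a version counter per community invalidates stale entries."""
    def __init__(self):
        self.heap = []
        self.version = {}            # community id -> version number

    def push(self, t_star, a, b):
        va = self.version.setdefault(a, 0)
        vb = self.version.setdefault(b, 0)
        heapq.heappush(self.heap, (-t_star, a, va, b, vb))

    def invalidate(self, c):
        """Call after community c is merged away: its old pairs go stale."""
        self.version[c] = self.version.get(c, 0) + 1

    def pop(self):
        """Return (t*, a, b) for the best still-valid pair, or None."""
        while self.heap:
            neg_t, a, va, b, vb = heapq.heappop(self.heap)
            if self.version.get(a) == va and self.version.get(b) == vb:
                return (-neg_t, a, b)
        return None
```

Each push and pop is $O(\log)$ in the number of live entries, which is what removes the full list traversal from every iteration.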

3.2 Finding ASes through similarity maximization

We applied the five aforementioned algorithms (Infomap, LPM, Deltacom at $t = 1$, Louvain and CommUGP) to the CAIDA dataset. Each of these methods produces a partition $\mathcal{C}$ of the graph. In order to determine whether the communities in these partitions are related to the ASes, we shall use the following criterion: for each AS in the graph, and given a partition $\mathcal{C}$, we find the community $C \in \mathcal{C}$ which maximizes the Jaccard similarity:

$$J(C, AS) = \frac{|C \cap AS|}{|C \cup AS|}.$$

This most similar community shall be called $C_{AS}^{(1)}$:

$$C_{AS}^{(1)} = \operatorname*{arg\,max}_{C \in \mathcal{C}} J(C, AS),$$

and this maximum Jaccard similarity value will be called the recall score of the AS (following [35]) and denoted $R_1(AS)$:

$$R_1(AS) = J(C_{AS}^{(1)}, AS). \qquad (7)$$

In this way, we can evaluate each of the methods by observing the empirical cumulative distribution of the recall scores $R_1(AS)$ (Eq. 7) over the ASes. Methods with left-skewed cumulative distributions (i.e., recall values closer to 1.00) tend to better reproduce the Autonomous System structure. The cumulative distributions for the different methods are shown in Figure 2: blue for Infomap, yellow for LPM, red for Deltacom, green for CommUGP and pink for Louvain. At the top we plot the cumulative distribution over all the ASes, and at the bottom we consider only ASes with at least 100 nodes (i.e., 639 ASes, which cover 90% of the network).
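This matching criterion is straightforward to express in code; a minimal sketch (function names are ours):

```python
def jaccard(c, as_nodes):
    """J(C, AS) = |C ∩ AS| / |C ∪ AS|."""
    c, as_nodes = set(c), set(as_nodes)
    return len(c & as_nodes) / len(c | as_nodes)

def recall_r1(partition, as_nodes):
    """R_1(AS) (Eq. 7): Jaccard similarity between the AS and the most
    similar community of a single partition (a list of node sets)."""
    return max(jaccard(c, as_nodes) for c in partition)

partition = [{0, 1, 2}, {3, 4, 5}]
print(recall_r1(partition, {0, 1, 3}))   # 0.5, best match is {0, 1, 2}
```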

Figure 2. Cumulative distribution of the AS recall score. (Upper) The curves represent the cumulative distribution of the recall scores over all the ASes for different community discovery methods: $R_1(AS)$ is shown for Deltacom at $t = 1$ (i.e., modularity maximization) (red), LPM (yellow), Louvain modularity maximization (pink), CommUGP (green) and Infomap (blue). If we find the best resolution level for each AS using the multiresolution versions of Deltacom, Infomap and CPM, we get the black, orange and purple curves instead. The brown curve represents Multiresolution Deltacom using the resolution value $\tilde{t}$ predicted by the regression line for each AS, and then computing $R_1(AS)$ in $\mathcal{C}_{\tilde{t}}$. Finally, if we use 15% of the AS routers when matching the partition $\mathcal{C}_{\tilde{t}}$ for each AS, we get the light blue curve. (Lower) The same methods as in the upper picture, restricted to ASes containing at least 100 nodes.

We can clearly see that the five methods fail to reproduce the AS structure. Even for the best performing one, Infomap, only 50% of the large ASes reach a recall of 0.4, which is still a low value. A premature conclusion might be that the Internet graph has no communities at the router level, or at least that they are not related to the Autonomous Systems. But we will show that this is not the case, and that the failure of the methods is partly due to the highly variable sizes, internal structures and functions of the Internet ASes. The structure of the Internet graph at the router level can be better captured by using multiresolution methods.

In order to assess our hypothesis we shall exploit the resolution parameter $t$ of Deltacom, which provides a new degree of freedom for matching the ASes in the hierarchical community tree. For each $t$ in the range $[t_{\min}, t_{\max}]$ we have a partition $\mathcal{C}_t$. Thus, we define a new recall score $R_2(AS)$ based on exploring a parameter space $\Theta$ with different partitions $\mathcal{C}_\theta$, $\theta \in \Theta$, associated to it, such that the most similar community for an Autonomous System, $C_{AS}^{(2)}$, is:

$$C_{AS}^{\theta} = \operatorname*{arg\,max}_{C \in \mathcal{C}_\theta} J(C, AS), \quad \theta \in \Theta, \qquad C_{AS}^{(2)} = \operatorname*{arg\,max}_{\{C_{AS}^{\theta} : \theta \in \Theta\}} J(C_{AS}^{\theta}, AS),$$

and the recall score as:

$$R_2(AS) = J(C_{AS}^{(2)}, AS). \qquad (8)$$

In the case of Deltacom, $\Theta = [t_{\min}, t_{\max}]$ and $t$ plays the role of $\theta$. In this way, the Jaccard similarity allows us to capture the moment at which each AS is formed. With these adjustments, the results clearly improve: the recall score $R_2(AS)$ (Eq. 8) over all the ASes is 0.87 (and 0.67 over the ASes of at least 100 nodes). The new cumulative distribution is presented as the black curve in Figure 2; we shall call this variant Multiresolution Deltacom, as we are exploring all the communities at all possible resolutions.
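In code, $R_2$ is simply the same matching applied across the whole family of partitions; a sketch with our own names:

```python
def recall_r2(partitions, as_nodes):
    """R_2(AS) (Eq. 8): Jaccard similarity with the most similar
    community over a family of partitions {C_theta : theta in Theta}."""
    s = set(as_nodes)
    return max(len(s & set(c)) / len(s | set(c))
               for p in partitions for c in p)

fine = [{0, 1}, {2}, {3, 4, 5}]       # high-resolution partition
coarse = [{0, 1, 2}, {3, 4, 5}]       # low-resolution partition
print(recall_r2([fine, coarse], {0, 1, 2}))   # 1.0, found at the coarse level
```

Here the group {0, 1, 2} is invisible at the fine resolution (best Jaccard 2/3) but is matched exactly at the coarse one, which is precisely the effect the multiresolution score is designed to capture.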

The observed improvements tell us that each Autonomous System has a natural resolution at which it can be best observed. There is a strong correlation between this resolution and the size of the AS, as the left picture in Figure 3 shows. The right picture in Figure 3 shows the evolution of the classical modularity $Q$ for the Internet graph as a function of the resolution $t$, i.e., as the algorithm evolves (from right to left). As one of the properties of our algorithm, $Q_t$ is at a local maximum at resolution $t$ for every possible $t$, so that the classical $Q$ is maximal at $t = 1$.

Figure 3. Correlation between the AS size and resolution in Deltacom and CPM, and evolution of Q in Deltacom. The left picture plots the correlation between the AS size (in terms of the number of routers) and the resolution $t$ at which it is best detected by Multiresolution Deltacom. The correlation coefficient between size and resolution in log scale is −0.93. A linear regression model between them gives the line $y = 4.09 - 0.60x$, shown in blue, with a coefficient of determination $r^2 = 0.87$. The central picture does the same for the best resolution $\gamma$ in the Constant Potts Model (CPM), with a correlation of −0.65. The right picture shows the evolution of the classical modularity ($Q_1$) as a function of $t$ during the agglomerative process of Deltacom.

We stress that several multiresolution methods exist in the literature; the advantage of ours is that it can generate a hierarchical partition covering all the possible resolution levels in one single run. Nevertheless, we compared our results with those of two other multiresolution methods, namely multiresolution Infomap and the Constant Potts Model:

The Infomap algorithm has a multiresolution variant [36], which generates a community tree with several levels $k \in \{1, \ldots, k_{\max}\}$. We matched each of the ASes into this structure in the same way as we did with Multiresolution Deltacom: we find the best level for each AS, and then we compute the recall score (now $\Theta = \{1, \ldots, k_{\max}\}$ and $\theta = k$). Here we found a clear improvement too, shown in the orange curve of Figure 2, confirming that it is important to find the correct resolution for each AS.

From the general schema described by Reichardt and Bornholdt in [31], several multiresolution approaches have been proposed. We used the Constant Potts Model by Traag et al. [37], which removes the null model from the Hamiltonian to cancel global dependences. For this algorithm we tested different values of $\gamma \in [10^{-8}, 1]$. We observed that this interval covers all the interesting situations, as for $\gamma = 10^{-8}$ we obtain a partition with 10 communities, and for $\gamma = 1$ all the nodes are in different communities. We sampled the interval by dividing it into 4, 8, 16 and 32 pieces, equidistant on a logarithmic scale. Dividing it into $2^k$ pieces yields $2^k + 1$ values of $\gamma$, which we use as the parameter space $\Theta = \{\gamma_0, \gamma_1, \ldots, \gamma_{2^k}\}$ in order to match each AS against its most similar community (according to the recall score $R_2(AS)$) among the $2^k + 1$ different partitions. We chose to stop the exploration at $k = 5$ (33 values of $\gamma$) because the observed convergence suggested that refining the parameter space would not produce much better results. The curve for $k = 5$ representing CPM is shown in purple in Figure 2. Finally, the central picture in Figure 3 shows the best $\gamma$ value (from among the 33 considered) for each AS, as a function of its size. Compared to Deltacom, the correlation between resolution and size in CPM is not as strong.

From this analysis we conclude that there is no single resolution at which we can find most of the ASes. This had already been observed in [32] for modularity, and is partly due to the fact that modularity is affected by resolution limit issues [38]. Similar theoretical analyses have been provided for Infomap's map equation [39] and for some methods based on Hamiltonians of Potts models [40]. Despite these difficulties, here we show that the ASes do exist as communities at different levels of the hierarchical structure.

3.3 Detecting ASes from samples

The previous results show that different ASes have different resolutions, and that we cannot expect to recover all the ASes with classical community detection methods or at one particular resolution value. However, our results with the multiresolution methods are ideal, in the sense that we used the information of the AS structure in order to know at which resolution to stop for each AS. Now we wonder whether we might retrieve the ASes using a minimal amount of information: their size, and a small sample of their nodes. We observed a strong dependence between the AS size and the resolution at which the AS is best identified by our algorithm, so we shall use the linear regression in Figure 3 to approximate the best resolution for each AS, $\tilde{t}$. If we look for the most similar community to each AS restricted to the partition produced by Deltacom at that approximate resolution, $\mathcal{C}_{\tilde{t}}$, then we obtain the brown dashed curve in Figure 2.
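The text reports the fitted line $y = 4.09 - 0.60x$ in log scale without stating the log base; assuming base-10 logarithms of resolution ($y$) against AS size ($x$), the prediction can be sketched as:

```python
import math

def predicted_resolution(as_size, intercept=4.09, slope=-0.60):
    """Approximate best resolution t~ for an AS of a given size, from the
    regression of Figure 3. We assume log10(t~) = 4.09 - 0.60*log10(size);
    the base-10 logs are our assumption, not stated in the text."""
    return 10 ** (intercept + slope * math.log10(as_size))

print(predicted_resolution(100))   # ~776.2: a 100-router AS forms early
```

The negative slope reproduces the observed anticorrelation: the larger the AS, the smaller the resolution at which it appears as a single community.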

Lastly, we shall consider that we only know 15% of the nodes of each AS, chosen at random. Suppose that $AS_{\text{sample}} \subset AS$ is the known subset for each AS. We estimate the most similar community using $AS_{\text{sample}}$ and the partition at resolution $\tilde{t}$:

$$\hat{C}_{AS}^{(2)} = \operatorname*{arg\,max}_{C \in \mathcal{C}_{\tilde{t}}} J(C, AS_{\text{sample}}), \qquad R_3(AS) = \hat{R}_2(AS) = J(\hat{C}_{AS}^{(2)}, AS), \qquad (9)$$

i.e., we only consider the nodes of the AS which are part of our sample when choosing the most similar community. For computing the recall score $R_3(AS)$ (Eq. 9), instead, we consider all the nodes of the AS, so that the method can be compared against the others. The results for $R_3(AS)$ are shown as the light blue dashed curve in Figure 2.

The closeness between the brown and light blue curves shows that we do not need to know the whole AS in order to identify it as a community at a certain step of the modularity optimization process: 15% of the nodes is enough to obtain results as good as if we knew all the AS nodes. On the other hand, the distance between the brown and black curves tells us that the linear regression between size and resolution is only a rough approximation. We also point out that we tested sample sizes between 5% and 25% and the performance was similar (average recalls between 0.57 and 0.59), mainly because, ideally, the probability of misclassifying $n$ random nodes of an AS decreases exponentially with $n$, so that even for a small sample the majority of the nodes should be correctly classified if the AS exists as a community at that resolution (e.g., with a recall of at least 0.5). In conclusion, the sample-based method is very successful, and is only limited by the error of the linear regression. We can thus identify many of the ASes knowing just their size and a small sample of their nodes. This method can also be extended to other networks, even if the relation between resolution and size is not known exactly: in that case, one could match the most similar community at each resolution, and then choose the resolution for which the most similar community is closest in size to the AS.
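The two-stage matching of Eq. 9 can be sketched as follows (names ours): the community is chosen using only the sampled nodes, but scored against the full AS.

```python
def recall_r3(partition_at_t, as_nodes, sample):
    """R_3(AS) (Eq. 9): pick the community most similar to the sampled
    nodes, then score that community against the full AS."""
    s = set(sample)
    best = max(partition_at_t,
               key=lambda c: len(set(c) & s) / len(set(c) | s))
    a = set(as_nodes)
    return len(set(best) & a) / len(set(best) | a)

partition = [{0, 1, 2, 3}, {4, 5}]
print(recall_r3(partition, {0, 1, 2, 3, 4}, sample={0, 1}))   # 0.8
```

In the toy example, a two-node sample already selects the right community, and the final score 0.8 reflects only the node (4) that genuinely falls outside it.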



