Graph of EU-2010

    Hosts and hostgraph (191k hosts)

    The hostgraph summarizes the URL-to-URL links by collapsing all links between pages on two different hosts into a single weighted link between those hosts. The collection contains 191,388 different hosts, listed in the following file and numbered from 0 to 191,387. Note that due to redirections, alternate names, and subdomains, the host boundaries are fuzzy; read this description about handling redirects.
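
    The aggregation described above can be sketched roughly as follows. This is only an illustration, not the procedure actually used to build the collection: the function names (host_of, build_hostgraph) are hypothetical, taking the host from urlsplit().hostname is an assumption, and so are numbering hosts in sorted order and dropping links that stay within one host.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the host part of a URL (assumption: the hostname defines the host)."""
    return urlsplit(url).hostname


def build_hostgraph(urls, edges):
    """Collapse page-to-page edges into weighted host-to-host edges.

    urls  -- list of URLs; the list index is the page id
    edges -- iterable of (source_page_id, destination_page_id) pairs
    Returns (hosts, weights), where weights maps (src_host_id, dst_host_id)
    to the number of page-to-page links between the two hosts.
    """
    # Assumption: host ids follow the sorted order of host names.
    hosts = sorted({host_of(u) for u in urls})
    host_id = {h: i for i, h in enumerate(hosts)}

    weights = Counter()
    for src, dst in edges:
        a = host_id[host_of(urls[src])]
        b = host_id[host_of(urls[dst])]
        if a != b:  # assumption: links inside a single host are not counted
            weights[(a, b)] += 1
    return hosts, weights
```

    With three pages on two hosts, two links from the first host to the second become one host-level link of weight 2.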

    The graph is formatted as follows: the first line contains the number of hosts. The second line contains the out-links of host 0, the third line the out-links of host 1, and so on. Each line is of the form "dest1:nlinks1 dest2:nlinks2, ..., destk:nlinksk", in which dest is the destination host id, and nlinks the number of page-to-page links between the two hosts.
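
    A minimal sketch of a reader for this format is given below. The function name parse_hostgraph is hypothetical, and because the format string above mixes spaces and commas, the sketch assumes that either may separate the dest:nlinks pairs. It also assumes that a host without out-links is represented by an empty line.

```python
def parse_hostgraph(lines):
    """Parse hostgraph lines into (n_hosts, adjacency).

    adjacency[i] is a dict mapping each destination host id to the number
    of page-to-page links from host i to that destination.
    """
    it = iter(lines)
    n_hosts = int(next(it).strip())  # first line: number of hosts

    adjacency = []
    for line in it:
        if len(adjacency) == n_hosts:
            break  # ignore anything after the last host line
        out_links = {}
        # Assumption: pairs are separated by whitespace and/or commas.
        for token in line.replace(",", " ").split():
            dest, nlinks = token.split(":")
            out_links[int(dest)] = int(nlinks)
        adjacency.append(out_links)
    return n_hosts, adjacency
```

    Since any iterable of lines is accepted, an open file object can be passed directly, for example parse_hostgraph(open(path)) with path pointing at the downloaded hostgraph file.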

    URLs and webgraph (23M URLs)

    The Web graph of this collection consists of 23,808,829 nodes representing pages, connected by approximately 600 million edges representing hyperlinks.
    Redirections are handled by rewriting each URL with status HTTP 3xx to its redirected location. URLs reported to have HTML mime-type but ending in media types such as .jpg, .gif, .jpeg, .png, .favicon.ico, etc. were also removed if the URL didn't contain query parameters. Extracted links were also rewritten using the information obtained from redirections.
    The file with the URLs contains one URL per line, starting at URL number 0 and ending at URL number 23,808,828. The URLs are sorted lexicographically to increase the compression ratio when using the Boldi-Vigna (BV) compression technique.

    The graph is provided as plain text. The file format is as follows: each line holds the edge list of one node, and the lines appear in increasing order of node id. On each line, the first number specifies the source node i, the second number gives the number of edges k, and the k successors of node i then follow in increasing order. All numbers are separated by a single space. The nodes are numbered from 0 to n-1 according to the above URL list.
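
    The line format described above could be read with something like the following sketch. The function names (parse_webgraph_line, iter_webgraph) are hypothetical. The check that the declared edge count k matches the number of successors is an added safeguard, and skipping blank lines is an assumption.

```python
def parse_webgraph_line(line):
    """Parse one line "i k s1 s2 ... sk" into (i, [s1, ..., sk])."""
    fields = line.split()
    source, k = int(fields[0]), int(fields[1])
    successors = [int(x) for x in fields[2:]]
    if len(successors) != k:
        raise ValueError(
            f"node {source}: expected {k} successors, found {len(successors)}"
        )
    return source, successors


def iter_webgraph(lines):
    """Yield (source, successors) pairs, one per non-empty line."""
    for line in lines:
        if line.strip():  # assumption: blank lines carry no data
            yield parse_webgraph_line(line)
```

    Because iter_webgraph is a generator, a graph of this size can be streamed line by line rather than loaded into memory at once. The node ids it yields are positions in the URL list, so a list of URLs read from the URL file can be indexed with them.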

    For inquiries please contact Miklós Erdélyi.

    Last updated: 18 May, 2010.