Pre-computed Web spam feature sets for EU-2010

    These per-host feature sets are provided to encourage participation in the ECML/PKDD 2010 Discovery Challenge.
    In the data, host IDs are assigned in the same order as in the
    DiscoveryChallenge2010.hostnames.txt.gz file. The collection contains pre-computed features only for the English, French and German hosts among the first 99,349 hosts.
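
    Because host IDs follow the ordering of the hostnames file, a feature row can be joined back to its hostname by position. The following is a minimal sketch, not part of the official distribution. It assumes the file holds one hostname per line and that IDs start at 0; both are assumptions to check against the actual data. The function name load_hostnames and the first_id parameter are hypothetical.

```python
import gzip


def load_hostnames(path, first_id=0):
    """Map host ID -> hostname by line position.

    Assumption: one hostname per line, IDs assigned in file order.
    Assumption: the first ID is `first_id` (0 here; pass 1 if the
    feature files turn out to be one-based).
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # No lines are skipped, so the position-to-ID mapping stays intact.
        return {first_id + i: line.strip() for i, line in enumerate(handle)}


# Example use (path as named in the text above):
# hosts = load_hostnames("DiscoveryChallenge2010.hostnames.txt.gz")
# print(hosts[0])
```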

    Feature set 1: link-based and transformed link-based features

    Computed from the graph files. Contains link-based features for the hosts, measured on both the home page and the page with the maximum PageRank in each host. Includes in-degree, out-degree, PageRank, edge reciprocity, assortativity coefficient, TrustRank, Truncated PageRank, estimation of supporters, etc.
    See description.
    Also contains simple numeric transformations of the link-based features for the hosts. These transformations were found to work better for classification in practice than the raw link-based features. They are mostly ratios between features, such as Indegree/PageRank or TrustRank/PageRank, and log(.) of several features.
    See description.
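
    To illustrate the kind of transformation described above, here is a small sketch that derives ratio and log features from raw link-based values. It is not the code used to build the feature set. The dictionary keys, the function name transform_link_features, and the smoothing constant eps (used to avoid division by zero and log of zero) are all assumptions made for the example.

```python
import math


def transform_link_features(raw, eps=1e-12):
    """Derive ratio and log features from raw link-based features.

    `raw` is a dict with hypothetical keys "indegree", "pagerank"
    and "trustrank". `eps` is an assumed smoothing term, not taken
    from the dataset description.
    """
    indegree = raw["indegree"]
    pagerank = raw["pagerank"]
    trustrank = raw["trustrank"]
    return {
        # log(1 + x) keeps hosts with zero in-links well defined.
        "log_indegree": math.log1p(indegree),
        "log_pagerank": math.log(pagerank + eps),
        "indegree_over_pagerank": indegree / (pagerank + eps),
        "trustrank_over_pagerank": trustrank / (pagerank + eps),
    }
```

    Ratios such as TrustRank/PageRank compare how much of a host's rank comes from trusted sources, while the logarithm compresses the heavy-tailed range typical of degree and PageRank values.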

    A list of the URL IDs of the home page and of the page with the maximum PageRank of each host is also available here:

    Feature set 2: content-based features

    Computed from the full version of the contents. These features include number of words in the home page, average word length, average length of the title, etc. for a sample of pages on each host. See description.
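
    As a rough illustration of features of this kind, the sketch below computes the number of words, the average word length and the title length for single pages, then averages them over a sample of pages from one host. The tokenization rule (runs of word characters), measuring title length in words, and the names content_features and host_content_features are assumptions for the example; the actual feature definitions are given in the description.

```python
import re

# Assumption: a "word" is a maximal run of word characters.
WORD_RE = re.compile(r"\w+")


def content_features(title, body):
    """Compute simple content features for a single page."""
    words = WORD_RE.findall(body)
    total_chars = sum(len(word) for word in words)
    return {
        "num_words": len(words),
        "avg_word_length": total_chars / len(words) if words else 0.0,
        # Assumption: title length is counted in words.
        "title_length": len(WORD_RE.findall(title)),
    }


def host_content_features(pages):
    """Average per-page features over a sample of (title, body) pages."""
    per_page = [content_features(title, body) for title, body in pages]
    if not per_page:
        return {}
    return {
        key: sum(page[key] for page in per_page) / len(per_page)
        for key in per_page[0]
    }
```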

    Last updated: 17 May, 2010.