Pre-computed Web spam feature sets for EU-2010

    These per-host feature sets are provided to encourage participation in the ECML/PKDD 2010 Discovery Challenge.
    In the data, host IDs are assigned in the same order as the hosts appear in the
    DiscoveryChallenge2010.hostnames.txt.gz file. The collection contains pre-computed features only for the English, French and German hosts among the first 99,349 hosts.
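
    The ID-to-hostname correspondence described above can be sketched as follows. This is a minimal illustration, not the organizers' tooling: it assumes the hostnames file holds one hostname per line and that IDs start at 0, neither of which is stated here. The function name load_hostnames and the sample hostnames are hypothetical.

```python
import gzip
import io


def load_hostnames(fileobj):
    """Return hostnames in file order; the list index is used as the host ID.

    Assumption: one hostname per line (not confirmed by the dataset page).
    """
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


# Tiny in-memory stand-in for DiscoveryChallenge2010.hostnames.txt.gz
sample = gzip.compress(b"example.de\nexample.fr\nexample.co.uk\n")
hostnames = load_hostnames(io.BytesIO(sample))

# Assumption: IDs are zero-based; use enumerate(hostnames, start=1) otherwise.
host_by_id = dict(enumerate(hostnames))
```

    With the real file, open("DiscoveryChallenge2010.hostnames.txt.gz", "rb") could be passed in place of the in-memory sample. Check the first few IDs in a feature file against the result before relying on the zero-based assumption.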

    Feature set 1: link-based and transformed link-based features

    Computed from the graph files. Contains link-based features for the hosts, measured for both the home page and the page with the maximum PageRank in each host. Includes in-degree, out-degree, PageRank, edge reciprocity, assortativity coefficient, TrustRank, Truncated PageRank, estimation of supporters, etc.
    See description.
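
    As a rough illustration of the simplest features in this set, the sketch below derives in-degree, out-degree and edge reciprocity from a list of directed edges. It is not the code used to produce the feature files. Reciprocity is taken here as the fraction of a node's out-links that are also in-links, which is one common definition and an assumption on our part. The function name degree_features and the toy edge list are hypothetical.

```python
from collections import defaultdict


def degree_features(edges):
    """Compute per-node degree features from (source, target) pairs."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)

    features = {}
    for node in set(out_links) | set(in_links):
        outs = out_links.get(node, set())
        ins = in_links.get(node, set())
        # Assumed definition: share of out-links that are reciprocated.
        reciprocity = len(outs & ins) / len(outs) if outs else 0.0
        features[node] = {
            "indegree": len(ins),
            "outdegree": len(outs),
            "reciprocity": reciprocity,
        }
    return features


# Toy graph: 0 <-> 1, 0 -> 2, 2 -> 1
edges = [(0, 1), (1, 0), (0, 2), (2, 1)]
features = degree_features(edges)
```

    Duplicate edges are collapsed by the sets, so repeated links between the same pair count once. PageRank, TrustRank and the other listed features need iterative computation and are not covered by this sketch.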
    Also contains simple numeric transformations of the link-based features for the hosts. These transformations were found to work better for classification in practice than the raw link-based features. They consist mostly of ratios between features, such as Indegree/PageRank or TrustRank/PageRank, and log(.) of several features.
    See description.
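
    The kind of transformation mentioned above (ratios and logarithms of raw features) can be sketched as below. The two ratios are the ones named in the text. The choice of which features to log-transform, the output names, and the small constant eps that guards against division by zero and log(0) are all assumptions made for this example, not the provided feature definitions.

```python
import math


def transform_features(indegree, pagerank, trustrank, eps=1e-12):
    """Derive ratio and log features from raw link-based values.

    eps is a hypothetical smoothing constant, not part of the dataset.
    """
    return {
        "indegree_over_pagerank": indegree / (pagerank + eps),
        "trustrank_over_pagerank": trustrank / (pagerank + eps),
        "log_indegree": math.log1p(indegree),  # log(1 + x), safe at x = 0
        "log_pagerank": math.log(pagerank + eps),
    }


example = transform_features(indegree=10, pagerank=0.002, trustrank=0.001)
```

    Ratios of this kind relate a host's raw link counts or trust score to its PageRank, and the logarithm compresses the heavy-tailed range typical of degree and PageRank values.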

    The list of the url-id of the home page and the page with the maximum PageRank of each host is also available here:

    Feature set 2: content-based features

    Computed from the full version of the contents. These features include number of words in the home page, average word length, average length of the title, etc. for a sample of pages on each host. See description.

    See also

    1. Training labels
    2. URLs and hyperlinks
    3. Term frequencies
    4. Natural Language Processing features

    For inquiries, please contact Miklós Erdélyi.
    Last updated: 17 May, 2010.