ECML/PKDD 2010 Discovery Challenge Data Set

    The feature sets were accidentally produced on a smaller set of hosts, so features are missing for approximately 300 domains in the training set. The feature files are being replaced by new ones with a v2- prefix. Note that certain sites have no features due to redirects.
    Our apologies.

    The Web Quality datasets on this site are provided to advance research on Web document classification. The labels are intended for research purposes only. We advise you not to use them directly for search engine ranking or filtering.
    See licensing information.

    If you use this data, we strongly recommend that you subscribe to the discovery-challenge-2010-announces mailing list. New datasets, errata about the current datasets, and challenges and conferences related to Web quality are posted to this low-volume, announcements-only list. There is also a discussion list, discovery-challenge-2010-discuss. Also read the frequently asked questions.


    This is a large collection of annotated Web hosts labeled by the Hungarian Academy of Sciences (English), the European Archive Foundation (French), and L3S Hannover (German); see credits. The base data is a set of 23M pages in 99K hosts in the .EU domain.
    These are the guidelines that were given for the assessment.
    The data was downloaded in February-March 2010 by the European Archive Foundation, with the support of the LiWA (Living Web Archives) project. Assessment was supported by the Hungarian national project NKFP-07-A2 TEXTREND. If you use the EU2010 collection, you should acknowledge the source by citing it as:

    András A. Benczúr, Carlos Castillo, Miklós Erdélyi, Zoltán Gyöngyi, Julien Masanes, Michael Matthews. ECML/PKDD 2010 Discovery Challenge Data Set. Crawled by the European Archive Foundation.


    Check and evaluate your submission on your held-out set using this set of scripts.
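    If you have not yet set aside a held-out set, one simple option is a seeded random split of the labeled hosts. The sketch below is only an illustration: the function name split_heldout, the default fraction, and the representation of hosts as plain identifiers are assumptions made here, not part of the challenge scripts or of the label file format.

```python
import random


def split_heldout(host_ids, heldout_fraction=0.2, seed=0):
    """Split host identifiers into (train, heldout) lists.

    Hypothetical helper: host_ids is any iterable of sortable identifiers
    taken from the training labels; the real label format is not assumed.
    """
    if not 0.0 < heldout_fraction < 1.0:
        raise ValueError("heldout_fraction must be strictly between 0 and 1")
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(host_ids)
    # A private Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_heldout = max(1, int(len(shuffled) * heldout_fraction))
    return shuffled[n_heldout:], shuffled[:n_heldout]
```

    Keeping the seed fixed means that repeated runs evaluate against the same held-out hosts, which makes results comparable across experiments.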


    The EU2010 data set is composed of five parts:

    1. Training labels
    2. URLs and hyperlinks
    3. Content-based and link-based Web spam features
    4. Term frequencies
    5. Natural Language Processing features

    All in one: v2-all_in_one.tgz (1.9 GB)
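    Because the combined archive is large, it can be useful to inspect its contents before unpacking everything. The following is a minimal sketch using Python's standard tarfile module; the helper name list_archive_members is made up for illustration, and nothing is assumed about the file names or layout inside the archive.

```python
import tarfile


def list_archive_members(archive_path):
    """Return the sorted names of the regular files in a gzip-compressed tar archive."""
    # "r:gz" opens the archive for reading with gzip decompression.
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return sorted(member.name for member in archive.getmembers() if member.isfile())
```

    For example, list_archive_members("v2-all_in_one.tgz") would return the file names in the downloaded archive, assuming it has been saved under that name in the current directory.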

    Licensing of this data set.

    For inquiries about the Challenge, please contact András Benczúr.

    Last updated: 27 June 2010.