ECML/PKDD 2010 Discovery Challenge Training Labels

    Training labels

    The V2 English training labels are below (if you are new to this page, read further for the description along with the V1 version):

    • English labels,
      and the host names.
    • The unified judgement aggregating multiple assessments, as described in the
      rules page. Columns are www-hostID; www-less hostID; Spam; News/Editorial; Commercial; Educational/Research; Discussion; Personal/Leisure; Non-Neutral; Biased; UnTrusted (all with 1=yes, 0=no); and Utility (on a 0-9 scale).
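
    As a rough illustration of the column layout above, the following sketch turns one row of the unified judgement file into a dictionary. The short column names and the function parse_unified_row are hypothetical, and the whitespace separator is an assumption; check the actual delimiter and any header line in the file before relying on it.

```python
from typing import Dict, List, Optional

# Column order as listed above; the short names are illustrative, not official.
UNIFIED_COLUMNS: List[str] = [
    "www_host_id", "wwwless_host_id", "spam", "news_editorial", "commercial",
    "educational_research", "discussion", "personal_leisure",
    "non_neutral", "biased", "untrusted", "utility",
]
# Every column between the two IDs and Utility is a 1=yes / 0=no flag.
BINARY_COLUMNS: List[str] = UNIFIED_COLUMNS[2:11]


def parse_unified_row(line: str, sep: Optional[str] = None) -> Dict[str, int]:
    """Parse one data row; sep=None splits on any whitespace (an assumption)."""
    fields = line.strip().split(sep)
    if len(fields) != len(UNIFIED_COLUMNS):
        raise ValueError(
            f"expected {len(UNIFIED_COLUMNS)} fields, got {len(fields)}"
        )
    row = {name: int(value) for name, value in zip(UNIFIED_COLUMNS, fields)}
    for name in BINARY_COLUMNS:
        if row[name] not in (0, 1):
            raise ValueError(f"{name} should be 0 or 1, got {row[name]}")
    if not 0 <= row["utility"] <= 9:
        raise ValueError(f"utility should be in 0..9, got {row['utility']}")
    return row
```

    The strict range checks are a design choice for catching a wrong delimiter or a shifted column early; relax them if the real file encodes missing values differently.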

    Compared to the V1 files,

    • Hosts that have no meaningful labels (too little text, misdetected language, etc.) are not included again - if you find this information useful, you can take it from the V1 labels (below).
    • There are two additional ID columns, one for the www and another for the www-less version of the domain, so that you can use features for both.
      Be aware that the assessors handled redirections poorly!

    To encourage language-independent and cross-lingual processing, only a small number of non-English training labels are released, all in V1 below.
    A random sample of approximately half of the labels is released; the rest is kept for
    testing. The training and testing sets were created so that no IP or second-level domain is split across both sets. A few large domains have outlier distributions and were manually removed from both sets.
    See the description.
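
    If you want to hold out part of the training labels for validation under the same constraint (no IP and no second-level domain shared between the two parts), one possible approach is to group hosts that are connected through a common IP or domain and assign whole groups to one side. The sketch below is not the organizers' procedure; the function grouped_split and its parameters are hypothetical, and it expects the second-level domain to be supplied by the caller.

```python
import random
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def grouped_split(
    hosts: Iterable[Tuple[int, str, str]],
    test_fraction: float = 0.5,
    seed: int = 0,
) -> Tuple[Set[int], Set[int]]:
    """Split (host_id, ip, domain) triples so no IP or domain spans both sets."""
    parent: Dict[Hashable, Hashable] = {}

    def find(x: Hashable) -> Hashable:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: Hashable, b: Hashable) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    host_list = list(hosts)
    # Link every host to its IP node and its domain node.
    for host_id, ip, domain in host_list:
        union(("host", host_id), ("ip", ip))
        union(("host", host_id), ("domain", domain))

    # Collect hosts by connected component.
    components: Dict[Hashable, List[int]] = {}
    for host_id, _, _ in host_list:
        components.setdefault(find(("host", host_id)), []).append(host_id)

    groups = list(components.values())
    random.Random(seed).shuffle(groups)

    # Fill the test side group by group until it reaches the target size.
    train: Set[int] = set()
    test: Set[int] = set()
    target = test_fraction * len(host_list)
    for group in groups:
        (test if len(test) < target else train).update(group)
    return train, test
```

    Because whole groups are assigned at once, the resulting sizes only approximate test_fraction, especially when a few groups are very large.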

    Files are split into an (ID, name, cname, IP) part and a part with an ID followed by the labels, so that host names do not appear alongside comments.
    The ID is the same as in the links/DiscoveryChallenge2010.hostnames.txt.gz (1.7 MB).
    The full set includes all hosts in the training set that are not labeled (cannot be viewed or accessed, not in the assessor's language, adult content, too little text, etc.). The useful labels are also included in a separate file.

    The format of the labels file, also explained in its header line, is

    1. ID: as in links/DiscoveryChallenge2010.hostnames.txt.gz
    2. UserID: the assessor
    3. Date of assessment
    4. Hosting Type
    5. Language, as possibly corrected by the assessor
    6. Adult Content
    7. Other Problem: in this case there must be a comment in the other file.
    8. Web Spam
    9. News/Editorial
    10. Commercial
    11. Educational/Research
    12. Discussion
    13. Personal/Leisure
    14. Media
    15. Database
    16. Readability-Vis: 1 flags significant visibility problems (too rare)
    17. Readability-Lang: 1 flags significant readability problems (too rare)
    18. Neutrality: from 3 (normal) to 1 (problematic)
    19. Bias: 1 flags significant problems
    20. Trustiness: from 3 (normal) to 1 (problematic)
    21. Assessor confidence: sure or unsure
    22. The autodetected language as in features/v2-host_autolang.csv.gz
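
    The 22 fields above can be read into named records along the following lines. This is only a sketch: the field names in LABEL_FIELDS and the function read_assessments are hypothetical, the tab delimiter is an assumption, and values are kept as strings because their exact encoding should be taken from the file's header line.

```python
import csv
from typing import Dict, Iterator, List, TextIO

# One illustrative name per field, in the order listed above.
LABEL_FIELDS: List[str] = [
    "id", "user_id", "date", "hosting_type", "language", "adult_content",
    "other_problem", "web_spam", "news_editorial", "commercial",
    "educational_research", "discussion", "personal_leisure", "media",
    "database", "readability_vis", "readability_lang", "neutrality",
    "bias", "trustiness", "confidence", "autodetected_language",
]


def read_assessments(
    handle: TextIO, delimiter: str = "\t", has_header: bool = True
) -> Iterator[Dict[str, str]]:
    """Yield one dict per assessment row; the delimiter is an assumption."""
    reader = csv.reader(handle, delimiter=delimiter)
    if has_header:
        next(reader, None)  # skip the header line that explains the format
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) != len(LABEL_FIELDS):
            raise ValueError(
                f"expected {len(LABEL_FIELDS)} fields, got {len(row)}"
            )
        yield dict(zip(LABEL_FIELDS, row))
```

    Since a host can have several assessments, the records would typically be grouped by the id field before any merging.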

    You should also consult the assessment guidelines.
    For evaluation, we consider multiple assessments and three-level scales as follows; see the rules page for more detail.

    • A useless assessment (does not display correctly, timeout, etc.) is discarded in favor of the useful one;
    • In case of multiple assessments, the labels are merged to favor the rare one,
      e.g. a Spam and a NonSpam would be considered Spam;
    • In case of a three-level scale (opinion, trust) the lower values are merged.

    For example, under these rules, if a test site has three assessments, one stating "mixed language", one stating "I trust it (3)", and the third "I trust this marginally (2)", then the site receives a negative preference for trust. In other words, a host is demoted if at least one of the assessors had doubts.
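
    The merging rules and the example above can be sketched as follows. This is an interpretation, not the official evaluation code: the field names and the function merge_assessments are hypothetical, a useless assessment is represented as None, and the positive value (1) is assumed to be the rare one for binary labels. The rules page remains the authoritative description.

```python
from typing import Dict, List, Optional

Assessment = Dict[str, int]

# Binary labels where the positive value (1) is assumed to be the rare one.
BINARY_FIELDS = ("spam", "bias")
# Three-level scales (3 = normal, 1 = problematic) and the flag they map to.
SCALE_FIELDS = {"neutrality": "non_neutral", "trust": "untrusted"}


def merge_assessments(
    assessments: List[Optional[Assessment]],
) -> Optional[Assessment]:
    """Merge several assessments of one host into a single judgement."""
    # Rule 1: useless assessments (None) are discarded in favor of useful ones.
    useful = [a for a in assessments if a is not None]
    if not useful:
        return None
    merged: Assessment = {}
    # Rule 2: favor the rare label, so one positive vote is enough.
    for field in BINARY_FIELDS:
        merged[field] = int(any(a[field] == 1 for a in useful))
    # Rule 3: values 1 and 2 are merged; any value below 3 raises the flag.
    for field, flag in SCALE_FIELDS.items():
        merged[flag] = int(min(a[field] for a in useful) < 3)
    return merged


# The triple assessment from the example: one useless, one trust=3, one trust=2.
example = [
    None,
    {"spam": 0, "bias": 0, "neutrality": 3, "trust": 3},
    {"spam": 0, "bias": 0, "neutrality": 3, "trust": 2},
]
print(merge_assessments(example))
# prints {'spam': 0, 'bias': 0, 'non_neutral': 0, 'untrusted': 1}
```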

    For inquiries please contact András Benczúr.

    Last updated: 7 June, 2010.