Rules: ECML/PKDD 2010 Discovery Challenge

    Contents of the Discovery Challenge rules

    1. Tasks
    2. Submission format
    3. Leaderboard
    4. Evaluation

    Tasks

    See the description of the training labels and of the test set.

    Task 1: Classification

    A ranked list (see format) is required for the English test set for each of the following categories:

    • Web Spam
    • News/Editorial
    • Commercial
    • Educational/Research
    • Discussion
    • Personal/Leisure
    • Neutrality: from 3 (normal) to 1 (problematic)
    • Bias: 1 flags significant problems
    • Trustiness: from 3 (normal) to 1 (problematic)

    Evaluation is in terms of average NDCG as defined below.

    Task 2: Quality (English)

    Quality is measured as an aggregate function of genre, trust, factuality and bias; spam has the lowest quality (0).
    A single ranked list (see format) is required for the English test set and is evaluated in terms of NDCG as defined below.
    The utility value is defined based on genre, trust, factuality and bias, with spam having utility 0. Based on the needs of a fictional Internet archive, News/Editorial and Educational sites are worth the most (5), Discussion sites are worth 4, and all others are worth 3. Since multiple genres are allowed, we take the maximum. We give a bonus of 2 points each for facts and for trust, and we subtract 2 for bias.
    The utility score is given in column 12 of the training label files.

    The pseudocode below explains the rule. See the
    awk script here.

    value = 0;
    if (News-Edit OR Educational) {
        value = 5;
    } else if (Discussion) {
        value = 4;
    } else if (Commercial OR Personal-Leisure) {
        value = 3;
    }
    if (fact == 3)  value += 2;  // facts bonus
    if (bias == 1)  value -= 2;  // bias penalty
    if (trust == 3) value += 2;  // trust bonus
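
    For concreteness, here is a minimal Python sketch of the same rule; the function and argument names (genres, fact, bias, trust, spam) are illustrative assumptions, and the linked awk script remains the reference implementation.

    def utility(genres, fact, bias, trust, spam=False):
        """Aggregate utility score of a host for the Quality tasks (sketch)."""
        if spam:                    # spam has utility 0
            return 0
        value = 0
        if genres & {"News-Edit", "Educational"}:
            value = 5
        elif "Discussion" in genres:
            value = 4
        elif genres & {"Commercial", "Personal-Leisure"}:
            value = 3
        if fact == 3:
            value += 2              # facts bonus
        if bias == 1:
            value -= 2              # bias penalty
        if trust == 3:
            value += 2              # trust bonus
        return value

    # e.g. an educational, factual, trusted host that is not flagged as biased
    # (any bias value other than 1) scores 5 + 2 + 2 = 9
    print(utility({"Educational"}, fact=3, bias=0, trust=3))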

    Task 3: Multilingual Quality (German and French)

    Quality predictions for the German and French test sets (see format) are required as in Task 2. To emphasize cross-lingual methods, only a very small label sample is given for these languages.
    Evaluation is in terms of average NDCG as defined below.

    Submission format

    The first column contains the host ID from the appropriate test set (English, French and German, respectively), and each of the following columns must contain a permutation of 1...N, where N is the number of hosts in the test set. Every rank position should appear exactly once; otherwise the submission is rejected, and we will manually notify the sender about the error as soon as we can. The column separator is a space ' '.
    For Tasks 2 and 3, separate files with names
    submission_quality_en.txt
    submission_quality_de.txt
    submission_quality_fr.txt
    are required, with two columns each. For Task 1, a single file with the name
    submission_task1.txt is required. Its columns, in order, are

    1. host ID
    2. Web Spam rank (spam should come first in the list)
    3. News/Editorial rank
    4. Commercial rank
    5. Educational/Research rank
    6. Discussion rank
    7. Personal/Leisure rank
    8. Neutrality rank (neutral should come first in the list)
    9. Bias rank (biased should come first in the list)
    10. Trustiness rank (trusted should come first in the list)

    Lines starting with '#' are ignored, so you may add a header line; this helps us give more specific feedback in case of a formatting error.
    Here is a BASH / AWK script to check the format.
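
    If you want to validate your files locally before submitting, the following is a rough Python sketch of such a check (an illustration under the assumptions above, not the official script; file name and column count are supplied by the caller):

    import sys

    def check_submission(path, expected_columns):
        """Rough check: every non-comment line has the expected number of
        space-separated columns, and each rank column is a permutation of 1..N."""
        rows = []
        with open(path) as f:
            for line in f:
                if line.startswith('#') or not line.strip():
                    continue
                fields = line.split()
                if len(fields) != expected_columns:
                    sys.exit("wrong number of columns: " + line.rstrip())
                rows.append(fields)
        n = len(rows)
        for col in range(1, expected_columns):    # column 0 is the host ID
            ranks = sorted(int(row[col]) for row in rows)
            if ranks != list(range(1, n + 1)):
                sys.exit("column %d is not a permutation of 1..%d" % (col + 1, n))
        print("format looks OK: %d hosts" % n)

    if __name__ == "__main__":
        # e.g. python check_format.py submission_quality_en.txt 2
        #      python check_format.py submission_task1.txt 10
        check_submission(sys.argv[1], int(sys.argv[2]))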

    Leaderboard

    Every team is allowed to submit one run per week per task that gets on the leaderboard. The submissions are handled manually, based on a fixed 25% random sample of the testing labels. Please indicate your team name and institution in a form that allows us to identify the team; you may choose a team name that keeps you anonymous to other teams. Please also indicate in your submission whether you want the run to appear on the leaderboard. Additional submissions beyond the first valid one per week will be checked, but no feedback will be given other than possible formatting errors. We are willing to check the format at any time but will give performance feedback only via the leaderboard, so please do not ask questions about your improvement.
    Submission address: discoverychallenge at ilab dot sztaki dot hu.
    When the submission deadline passes, each team's last valid submission per task (with no formatting errors) will be considered its Challenge entry.

    Evaluation

    For evaluation, we handle multiple assessments and three-level scales as follows.

    • A useless assessment (does not display correctly, timeout, etc.) is discarded in favor of the useful one;
    • In case of multiple assessments, the labels are merged in favor of the rarer one, e.g. a Spam and a NonSpam label together would be considered Spam;
    • In case of a three-level scale (opinion, trust), the lower values are merged.

    Evaluation is in terms of NDCG with the following utility and discount functions; see e.g. the description in Wikipedia.

    • For utility we use 1 for a "yes" label and 0 for a "no" label: the best "Spam" list starts with the spam hosts, followed by the nonspam ones; the best "Educational" list starts with the educational sites, etc. In the Quality Tasks 2 and 3, the utility is the aggregate function defined above.
    • To emphasize performance over the entire list, the discount function is the linear function 1 - i/N, where N is the size of the test set. We will tell you the value of N but not the host IDs, to discourage manual assessment. To justify the discount function, assume that you produce the list for an Internet archive that may crawl 50% or even more of all the seeds it identifies.

    The script to compute NDCG is a modification of the Python script used by the Learning to Rank Challenge 2010.
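
    As a rough illustration only (not the official script), NDCG with this linear discount could be computed as in the sketch below. The names ranking and utilities are assumptions, and whether positions are counted from 0 or 1 is a detail of the official script; the sketch uses 0-based positions.

    def dcg(utilities_in_rank_order):
        """DCG with the linear discount 1 - i/N (i = 0-based rank position,
        N = number of hosts in the test set)."""
        n = len(utilities_in_rank_order)
        return sum(u * (1.0 - i / float(n))
                   for i, u in enumerate(utilities_in_rank_order))

    def ndcg(ranking, utilities):
        """ranking: host indices in the submitted order;
        utilities: utility of each host (0/1 for Task 1 categories,
        the aggregate score for Tasks 2 and 3)."""
        submitted = [utilities[h] for h in ranking]
        ideal = sorted(utilities, reverse=True)
        return dcg(submitted) / dcg(ideal)

    # A perfect ranking (hosts ordered by decreasing utility) scores 1.0:
    print(ndcg([2, 0, 1], [1, 0, 3]))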

    For inquiries about the Challenge, please contact András Benczúr.

    Last updated: 25 June 2010.
