Redirects, site boundaries, multiple assessment: more on the ECML/PKDD 2010 Discovery Challenge

    We treat the www and non-www versions of a host as identical for evaluation purposes. Nevertheless, the two URLs were processed separately, so you will find separate features for each version.
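    A minimal sketch of how two such versions could be mapped to one evaluation key. The function name evaluation_key and the rule of dropping a single leading "www." are assumptions for illustration, not the organizers' actual procedure; the example hostnames are made up.

```python
def evaluation_key(host: str) -> str:
    """Map a hostname to a key for comparing labels.

    Assumed rule (not taken from the challenge): lowercase the name,
    drop a trailing dot, and strip one leading 'www.' label.
    """
    host = host.strip().lower().rstrip(".")
    if host.startswith("www."):
        return host[len("www."):]
    return host


# Hypothetical hostnames, for illustration only.
hosts = ["www.example.org", "example.org", "WWW.Example.org", "news.example.org"]
keys = {evaluation_key(h) for h in hosts}
```

    Features would still be looked up under the original hostname, since they are provided separately for the two versions.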
    Although hosts sharing a top-level domain, or even an IP address, may have very similar content, we treat them as separate hosts. However, we made sure that all hosts from the same IP or top-level domain fall either entirely in the training set or entirely in the test set.
    There is considerable confusion about redirects. The assessor guidelines emphasize that an assessment must correspond to pages within the assessed domain, and assessors had to confirm a message whenever they left the target domain; even so, they still labeled hosts based on the redirection target. As far as we could, we identified these hosts and removed the corresponding labels. There are 63 such sites in the training set that we have already released. Also see the ID-redirection target pairs.
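    One way to obtain a split with this property is to link hosts that share an IP or a domain and assign each linked group to a single side. The sketch below uses a union-find structure for that; split_by_group, the (ip, domain) input format, test_fraction and seed are all hypothetical, and the actual split procedure used for the challenge is not described here.

```python
import random
from collections import defaultdict


def split_by_group(host_info, test_fraction=0.5, seed=0):
    """host_info: dict host -> (ip, domain). Returns (train, test) host sets.

    Hosts sharing an IP or a domain end up in one group, and each
    group is placed entirely in train or entirely in test.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Link every host to its IP node and its domain node.
    for host, (ip, domain) in host_info.items():
        union(("host", host), ("ip", ip))
        union(("host", host), ("domain", domain))

    groups = defaultdict(list)
    for host in host_info:
        groups[find(("host", host))].append(host)

    # Deterministic order before shuffling, so the seed fixes the result.
    ordered = sorted(sorted(g) for g in groups.values())
    random.Random(seed).shuffle(ordered)

    target = test_fraction * len(host_info)
    train, test = set(), set()
    for group in ordered:
        (test if len(test) < target else train).update(group)
    return train, test


# Made-up hosts: the first three are linked through a shared domain or IP.
example = {
    "a.example": ("10.0.0.1", "example"),
    "b.example": ("10.0.0.2", "example"),
    "c.other": ("10.0.0.2", "other"),
    "d.solo": ("10.0.0.3", "solo"),
}
train, test = split_by_group(example)
```

    Because whole groups are assigned at once, the resulting test share only approximates test_fraction.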
    We compiled the list of redirects.

    In the graph we removed redirects, i.e. rewrote each URL with status HTTP 3xx to its redirected location.
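    The rewriting step can be sketched as follows, assuming the graph is an edge list of URL pairs and the redirects are available as a source-to-target mapping. The names resolve and rewrite_edges, the decision to follow chains of redirects, and the max_hops limit are assumptions of this sketch rather than a description of the released graph.

```python
def resolve(url, redirects, max_hops=10):
    """Follow the redirect map from url.

    Stops at a URL with no redirect entry, at a loop, or after
    max_hops steps (chain following is an assumption of this sketch).
    """
    seen = {url}
    for _ in range(max_hops):
        target = redirects.get(url)
        if target is None or target in seen:
            break
        seen.add(target)
        url = target
    return url


def rewrite_edges(edges, redirects):
    """Replace both endpoints of every edge with their resolved location."""
    return [(resolve(s, redirects), resolve(t, redirects)) for s, t in edges]


# Made-up URLs: /old redirects to /new, which redirects to another host.
redirects = {
    "http://a.example/old": "http://a.example/new",
    "http://a.example/new": "http://b.example/",
}
edges = [("http://c.example/", "http://a.example/old")]
rewritten = rewrite_edges(edges, redirects)
```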
    Some sites were assessed twice in order to measure interrater reliability (a very few sites have more than two assessments due to accidental errors in the interface). The multiple assessments have two sources:

    • random, by assigning both the www and the non-www version of a host to assessors
    • systematic selection of "interesting" examples that had trust, factuality, readability or bias problems.
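    For doubly assessed sites, one common agreement statistic is Cohen's kappa. The text does not say which measure was used for the challenge, so the sketch below is only an illustration; the function name cohen_kappa and the labels "ok" and "problem" are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for four doubly assessed sites.
rater_1 = ["problem", "ok", "ok", "problem"]
rater_2 = ["problem", "ok", "problem", "problem"]
kappa = cohen_kappa(rater_1, rater_2)
```

    Note that the systematically selected "interesting" examples are not a random sample, so agreement computed on them may differ from agreement on the randomly assigned pairs.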

    Also read the frequently asked questions.