Frequently Asked Questions: ECML/PKDD 2010 Discovery Challenge

    1. Why is the site such a mess, with broken links and some data marked V2 while other data is not?

      As of May 18, we are still in the process of publishing more data, the rules, and the evaluation tools. Thank you for your patience.

    2. I am confused about the use of redirects. In the case of eu -> eu redirects, which host should we consider: 'TO' or 'FROM'? If we consider 'TO', should we then ignore 'FROM'?

      Just as on the real Web, redirects caused us much trouble and confusion. First, we accidentally included hosts that contain only redirects; we should have discarded them in the first place. Second, the assessors should have marked these hosts as "too few text" and discarded them; instead, quite a few of them labeled the target (TO).
      We tried to make the situation as clean as possible for the challenge, but some problems have already slipped through. In the test set, no sites are supposed to be redirect-only; in fact, all test sites should have all features available.
      In any case, we have compiled the redirection data sets. These might be useful, but a label is meant for the host itself and not for a redirect, and especially not for a redirection target outside the .eu domain. See the page about redirects.
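
      As an illustration only, the guidance above could be applied to a training set roughly as in the sketch below. It assumes you have already loaded the labels and the redirection data into plain Python dictionaries; the function names, the dictionary layout and the example hosts are hypothetical and do not reflect the actual file formats of the challenge data.

```python
from typing import Dict, List, Set, Tuple


def is_eu_host(host: str) -> bool:
    """True if the host name lies in the .eu top-level domain."""
    return host.lower().rstrip(".").endswith(".eu")


def split_training_hosts(
    labels: Dict[str, str], redirect_only: Set[str]
) -> Tuple[Dict[str, str], List[str]]:
    """Separate labeled hosts into usable ones and redirect-only ones.

    labels        -- hypothetical mapping: host -> assessor label
    redirect_only -- hypothetical set of hosts that contain redirects only
    A label stays attached to the host it was given for; it is never
    moved to the redirect target.
    """
    kept = {h: lab for h, lab in labels.items() if h not in redirect_only}
    dropped = sorted(h for h in labels if h in redirect_only)
    return kept, dropped


def external_targets(redirects: Dict[str, str]) -> Dict[str, str]:
    """Return the FROM -> TO redirects whose target is outside .eu.

    redirects -- hypothetical mapping: FROM host -> TO host
    """
    return {src: dst for src, dst in redirects.items() if not is_eu_host(dst)}


if __name__ == "__main__":
    # Made-up hosts and labels, for demonstration only.
    labels = {"a.example.eu": "label_1", "b.example.eu": "label_2"}
    redirects = {"b.example.eu": "c.example.com", "d.example.eu": "e.example.eu"}
    kept, dropped = split_training_hosts(labels, {"b.example.eu"})
    print("kept:", kept)
    print("dropped (redirect-only):", dropped)
    print("targets outside .eu:", external_targets(redirects))
```

      Whether to drop redirect-only training hosts or handle them some other way is a modelling choice left to each participant; the sketch shows only one conservative option.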

    3. For some tasks we need the original content, not just the tf, df and feature values. How can I obtain the raw data?

      For this challenge (ECML/PKDD Discovery Challenge) we made a large set of features available to encourage a broad range of participants to enter, including many who do not usually work with large text corpora. For this reason, the crawl is not available for now; we will make it available after the challenge. If you are interested, please contact Andras Benczur for the details, since the entire data set can only be shipped on hard disks.