NLP features: ECML/PKDD 2010 Discovery Challenge Data Set

    This page contains the Natural Language Processing (NLP) features of the ECML/PKDD 2010 Discovery Challenge Data Set, provided courtesy of the LivingKnowledge project. Unlike the rest of the data set, these features are given per URL, and include:

    • URL
    • language extracted using Nutch (which may differ from the language estimated for the host overall)
    • sentence, token, and character counts
    • counts of various POS tags, as described here
    • the twenty most common bigrams of the above tags, with corresponding counts
    • counts of certain chunk tags, as output by the OpenNLP English chunker
    • histogram of sentence lengths in tokens
    • counts of tags based on BBN Pronoun Coreference and Entity Type Corpus as output from SuperSense Tagger
    • counts of more precise organizations, people, and locations derived from the named entities above (e.g. people are required to match the form Aaaa Aaaaaa, etc.); in our application, we found the raw tagger output too noisy to display to the user, so we use these fields instead
    • counts of tokens of different character use (upper, lower, mixed etc.)
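
    As a rough illustration of how a few of the features listed above could be computed, here is a minimal Python sketch. It is not the LivingKnowledge pipeline: the function names, the case-class labels, and the regular expression for the "Aaaa Aaaaaa" person shape are all assumptions made for this example, and the input is assumed to be already tokenized and tagged.

```python
import re
from collections import Counter


def case_class(token):
    """Classify a token by character use (class names are hypothetical)."""
    if token.isupper():
        return "upper"
    if token.islower():
        return "lower"
    if token.isalpha():
        return "mixed"
    return "other"


def top_tag_bigrams(tags, n=20):
    """Return the n most common adjacent tag pairs with their counts."""
    return Counter(zip(tags, tags[1:])).most_common(n)


def sentence_length_histogram(sentences):
    """Map sentence length in tokens to the number of such sentences."""
    return Counter(len(sentence) for sentence in sentences)


# Assumed shape filter: exactly two capitalized words, e.g. "Aaaa Aaaaaa".
PERSON_SHAPE = re.compile(r"^[A-Z][a-z]+ [A-Z][a-z]+$")


def precise_people(entities):
    """Keep only person entities that match the assumed name shape."""
    return [entity for entity in entities if PERSON_SHAPE.match(entity)]
```

    For example, precise_people(["Michael Matthews", "IBM", "john smith"]) keeps only "Michael Matthews", which shows how a strict shape filter trades recall for cleaner output.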

    See the field description.

    See also

    1. Training labels
    2. URLs and hyperlinks
    3. Content-based and link-based Web spam features
    4. Term frequencies

    For inquiries about the NLP features, please contact Michael Matthews.

    Last updated: 17 May, 2010.