Web spam classification: a few features worth more

    In this paper we investigate how much various classes of Web spam features, some requiring very high computational effort, add to classification accuracy. We find that advances in machine learning, an area that has received less attention in the adversarial IR community, yield more improvement than new features and result in low-cost yet accurate spam filters. Our original contributions are as follows:

    • We collect and handle a large number of features based on recent advances in Web spam filtering.

    • We show that machine learning techniques including ensemble selection, LogitBoost and Random Forest significantly improve accuracy.

    • We conclude that, with appropriate learning techniques, a small and computationally inexpensive feature subset outperforms all previously published results on our data set, and that computationally expensive features improve on it only slightly.

    • We test our method on two major publicly available data sets, the Web Spam Challenge 2008 data set WEB-SPAM-UK2007 and the ECML/PKDD Discovery Challenge data set DC2010.

    Our classifier ensemble improves AUC by 5% over the best Web Spam Challenge 2008 result; more importantly, the improvement is 3.5% when based solely on fewer than 100 inexpensive content features, and 5% if a small-vocabulary bag of words representation is included. For DC2010 we improve over the best achieved NDCG for spam by 7.5%, and by over 5% using only inexpensive content features and a small bag of words representation.
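
    To make the evaluation measure and the ensemble idea concrete, the following is a minimal sketch, not the implementation used in this work. It computes AUC by pairwise comparison and performs a simple greedy forward ensemble selection (with replacement) over per-model spam scores on a validation set. The model names "content" and "link", the scores, and the labels are hypothetical toy values chosen for illustration only; they are not results from WEB-SPAM-UK2007 or DC2010.

```python
from typing import Dict, List, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (spam, non-spam) pairs ranked correctly.

    Ties between a spam and a non-spam score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average(score_lists: Sequence[Sequence[float]]) -> List[float]:
    """Per-host mean of the scores produced by the selected models."""
    return [sum(col) / len(col) for col in zip(*score_lists)]


def greedy_ensemble_selection(
    labels: Sequence[int],
    model_scores: Dict[str, List[float]],
    max_rounds: int = 10,
) -> List[str]:
    """Greedy forward selection with replacement, maximizing validation AUC.

    In each round, add the model whose inclusion raises the AUC of the
    averaged scores the most; stop when no model gives a strict improvement.
    """
    chosen: List[str] = []
    best_auc = -1.0
    for _ in range(max_rounds):
        round_best_name, round_best_auc = None, best_auc
        for name in sorted(model_scores):  # sorted for deterministic ties
            candidate = [model_scores[m] for m in chosen + [name]]
            a = auc(labels, average(candidate))
            if a > round_best_auc:
                round_best_name, round_best_auc = name, a
        if round_best_name is None:
            break
        chosen.append(round_best_name)
        best_auc = round_best_auc
    return chosen


# Hypothetical toy validation set: 1 = spam host, 0 = non-spam host.
labels = [1, 1, 0, 0, 1, 0]
# Hypothetical per-model spam scores for the same six hosts.
model_scores = {
    "content": [0.9, 0.4, 0.5, 0.2, 0.7, 0.3],
    "link": [0.6, 0.8, 0.3, 0.6, 0.5, 0.1],
}

selected = greedy_ensemble_selection(labels, model_scores)
ensemble_auc = auc(labels, average([model_scores[m] for m in selected]))
print(selected, ensemble_auc)
```

    In this toy setting neither model ranks all spam hosts above all non-spam hosts on its own, while the average of the two does, which illustrates why combining classifiers can help even when no new features are added.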

    Miklós Erdélyi, András Garzó, András A. Benczúr
    WebQuality '11