Text Classification Kernels for Quality Prediction over the C3 Data Set

    We compare machine learning methods for predicting quality aspects of the C3 dataset, collected as part of the Reconcile project. We give methods for automatically assessing credibility, presentation, knowledge, intention and completeness by extending the attributes in the C3 dataset with the textual content of the pages.
    We use Gradient Boosted Trees and recommender methods over the (evaluator, site, evaluation) triplets and their metadata, and combine them with text classifiers.
    In our experiments, the best results are achieved by the theoretically justified normalized SVM kernel. The normalization can be derived from the Fisher information matrix of the text content. As our main contribution, we describe the theory of the Fisher matrix and show that SVM may be particularly suitable for difficult text classification tasks.

    Bálint Daróczy, Dávid Siklósi, Róbert Pálovics, András A. Benczúr
    The 5th International Workshop on Web Quality, Proceedings of the 24th International Conference on World Wide Web Companion, Florence, Italy, 2015