Cross-lingual web spam classification

    While English-language training data exists for several Web classification tasks, most notably Web spam detection, classifying Web domains in a language other than English requires an expensive human labeling procedure.

    In this paper we give an overview of how existing content- and link-based classification techniques work, how models can be "translated" from English into another language, and how language-dependent and language-independent methods can be combined.

    In particular, we show that simple bag-of-words translation works very well. This procedure can also rely on mixed-language Web hosts, i.e., those that contain an English translation of part of the local-language text.
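
    As an illustration only, the bag-of-words translation idea can be sketched as follows. This is a minimal sketch under stated assumptions, not the method used in the paper: the dictionary `pt_en`, the term weights, the even split of counts among multiple translations, and the linear scoring function are all hypothetical choices made for the example.

```python
from collections import Counter

def translate_bag_of_words(term_counts, dictionary):
    """Map source-language term counts onto English terms.

    Assumption for this sketch: a term's count is split evenly among
    its dictionary translations; terms without a translation are dropped.
    """
    translated = Counter()
    for term, count in term_counts.items():
        targets = dictionary.get(term, [])
        for target in targets:
            translated[target] += count / len(targets)
    return translated

def spam_score(bag, weights, bias=0.0):
    """Score a bag of words with a linear model trained on English terms."""
    return bias + sum(weights.get(term, 0.0) * count for term, count in bag.items())

# Hypothetical Portuguese-to-English dictionary (illustrative only).
pt_en = {
    "grátis": ["free"],
    "dinheiro": ["money", "cash"],
    "notícias": ["news"],
}

# Hypothetical weights of an English-trained linear spam model.
weights = {"free": 1.5, "money": 1.0, "cash": 1.0, "news": -2.0}

# Term frequencies of a hypothetical Portuguese host.
host_terms = Counter({"grátis": 2, "dinheiro": 4, "tempo": 1})

english_bag = translate_bag_of_words(host_terms, pt_en)
score = spam_score(english_bag, weights)
print(dict(english_bag), score)
```

    The translated term frequencies can then be fed to a classifier trained on English data without labeling any hosts in the target language; how ambiguous or missing translations are handled is a design choice left open here.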

    Our experiments use the ClueWeb09 corpus as the English training collection and a large Portuguese crawl from the Portuguese Web Archive as the target-language collection.
    To foster further research, we provide labels and precomputed term frequencies, as well as content- and link-based features, for both ClueWeb09 and the Portuguese data.

    András Garzó, Bálint Daróczy, Tamás Kiss, Dávid Siklósi, András A. Benczúr
    The 3rd Joint WICOW/AIRWeb Workshop on Web Quality, Rio de Janeiro, Brazil, May 13, 2013. Proceedings of the 22nd International Conference on World Wide Web Companion.