Assessment Guidelines

    As an overall guideline, please follow the steps below. You may return to any step later to change your original decision. Note that some of the early decisions disable the later ones: all properties of a host that is excluded (adult, several sites hosted under one domain, etc.) turn grey and can no longer be assessed.

    Take your time: examine the Web sites carefully and check different aspects before making a decision, particularly for sites that are hard to classify. Look at the in- and out-links, both within and outside of the data set. Use Google to find information about the present state of the site. Look at the present version: although the crawl was done in February, things may have evolved since then.

    There are known limitations in the iframe mechanism of the assessment interface:

    • You cannot manually enter a URL; you will always be taken back to your session. The session remembers your original assessment task and gives you a return link in the top left corner. You may still browse along links and view host properties.
    • Some pages force the browser to display them on top of the interface. For certain hosts we see no way of using the interface for this reason. Please send whatever information you can gather about these hosts manually by email. So far it seems this is not a widespread problem.
    • A wide monitor is sometimes very useful :(

    The interface warns you whenever you are viewing a site other than the one being assessed. Each site should be assessed based on its own pages and not its external content. Note that even www.site.eu and site.eu are different hosts, so you get the warning when moving between them. Typically one of these variants is a redirect to the canonical name, and you should only assess the non-redirecting version. Hosts that consist only of redirects fall into the "too few text" language category.
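    As a minimal sketch of the host comparison described above (not the interface's actual code), the check below treats two URLs as belonging to the same site only when their host names match exactly. The function names `host_of` and `is_assessed_host` are hypothetical, and URLs are assumed to carry a scheme such as http://.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased host name of a URL that includes a scheme."""
    return (urlsplit(url).hostname or "").lower()


def is_assessed_host(url: str, assessed_host: str) -> bool:
    """True only if the URL lives on exactly the assessed host.

    No "www." stripping is done on purpose: www.site.eu and site.eu
    count as different hosts.
    """
    return host_of(url) == assessed_host.lower()


if __name__ == "__main__":
    print(is_assessed_host("http://site.eu/about", "site.eu"))
    print(is_assessed_host("http://www.site.eu/about", "site.eu"))
```

    The same comparison also separates en.website.eu from de.website.eu, while www.website.eu/en/ and www.website.eu/de/ stay on one host.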

    1. First check some obvious reasons why the host may not be included in the sample at all. Mark language-independent issues first: if a page is in a language other than your selected one, it will probably be passed to another assessor even if it is obviously useless unless you mark the reason.

    You will not have to assess the host (host is EXCLUDEd) if

    • The host contains adult content (porn) - please mark the corresponding flag;
    • Mixed: multiple unrelated sites in the same host, several sites of different type under the same host name - please mark mixed;
    • The language is not your selected one or mixed over the site. The language is autodetected but there may be errors. In this case please also mark the correct language. Remember assessment is on the site level, hence a Web site in multiple languages is mixed if it has structure www.website.eu/en/, www.website.eu/de/ etc.; but en.website.eu/ is (most likely) in English, de.website.eu is in German etc. since they are different hosts. Check carefully these options when a site offers language selection. Also note that some sites personalize their language by the location of the browser. In this case the language of the archived version in the interface counts even if you view the live version in another language.
    • Too few text (language label): there are fewer than 10 pages on the site that contain text, or most of the pages have just a couple of words; in general, the whole text over the site is too short. Hosts with only redirects fall in this category although they should have been excluded from the sample.
    • If there is another serious reason why the site should not be labeled, mark "Other problem". In this case labels are not stored but a comment explaining the reason is compulsory.
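    The exclusion rules above can be summarized in a short sketch. This is an illustration under assumptions, not the interface's implementation: the `HostFlags` class, its field names, and the reason strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostFlags:
    """Hypothetical record of the exclusion flags an assessor can set."""
    adult: bool = False
    mixed: bool = False
    wrong_or_mixed_language: bool = False
    too_few_text: bool = False
    other_problem: bool = False
    comment: str = ""


def exclusion_reasons(flags: HostFlags) -> List[str]:
    """List every reason the host is excluded; an empty list means assess it."""
    if flags.other_problem and not flags.comment.strip():
        # "Other problem" requires a comment explaining the reason.
        raise ValueError('"Other problem" needs an explanatory comment')
    reasons = []
    if flags.adult:
        reasons.append("adult")
    if flags.mixed:
        reasons.append("mixed")
    if flags.wrong_or_mixed_language:
        reasons.append("language")
    if flags.too_few_text:
        reasons.append("too few text")
    if flags.other_problem:
        reasons.append("other problem")
    return reasons


if __name__ == "__main__":
    print(exclusion_reasons(HostFlags(adult=True, too_few_text=True)))
```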

    If you face problems in assessing any of the labels below, mark "Unsure" at any time. Problems may be caused either by the labels we provide being inappropriate (we may not have thought of some issues, or simply could not cover all aspects with the limited set of labels, etc.) or by a lack of domain knowledge or background information (you have no knowledge of the topic).

    Even if you mark a host "Unsure", give labels to the best of your knowledge. If you mark "Unsure", or if you can fully complete the labeling of the site but feel uncomfortable about some of your decisions, please add a short comment describing your problem. This helps in improving the interface, the label set itself, and the reliability of the labels by revising them. You are not allowed to mark "Unsure" with an empty comment.
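    The comment rule can be expressed as a one-line check. The sketch below is only an illustration; the function name `unsure_comment_ok` is hypothetical, and treating a whitespace-only comment as empty is an assumption.

```python
def unsure_comment_ok(unsure: bool, comment: str) -> bool:
    """Return True if the submission satisfies the comment rule.

    A host marked "Unsure" must carry a non-empty comment; a whitespace-only
    comment is treated as empty here (an assumption of this sketch).
    """
    return (not unsure) or bool(comment.strip())


if __name__ == "__main__":
    print(unsure_comment_ok(True, ""))
    print(unsure_comment_ok(True, "No knowledge of the topic"))
```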

    2. Next check genre. The labels are not exclusive: a host may for example be educational and database at the same time.
    2a. First check if the site is spam. This is the most important and most tricky one. If the site is spam, the remaining labels do not have to be filled in since they are unreliable.

    We use the guidelines from http://barcelona.research.yahoo.net/webspam/datasets/uk2007/guidelines/. We give a short description here. If in doubt, consult the full guidelines with examples. Note that we do not use "borderline" categories. If you are in doubt, please try to verify and, as a last resort, mark "Unsure" as in the overall guidelines.

    General definition of Web spam: «any deliberate action that is meant to trigger an unjustifiably favorable [ranking], considering the page's true value» (Gyöngyi and García Molina 2005). Look for aspects of the host that exist mostly to attract and/or redirect traffic.

    Sites that do Web Spam:

    • Include aspects designed to attract/redirect traffic.
    • Almost always have commercial intent.
    • Rarely offer relevant content for users browsing them.

    Typical Web Spam Aspects:

    • Include many unrelated keywords and links.
    • Use many keywords and punctuation marks such as dashes in the URL.
    • Redirect the user to another (usually unrelated) page.
    • Create many copies with substantially duplicate content.
    • Hide text by writing in the same color as the background of the page.

    Pages that are only advertising, with very little content, are spam. This includes automatically generated pages designed to sell advertising and sites that offer catalogs of products that actually redirect to other merchants without providing extra value.

    Pages that do not use Web spam tricks should not be labeled spam regardless of their quality. Normal pages can be high-quality or low-quality resources - other aspects of quality are addressed by other labels.

    2b. Guideline for other genres. If in doubt, please mark "Unsure". Mark at least one genre (including spam and adult) but possibly more. You will not be allowed to continue if no genre is selected.

    • Editorial or news content: posts disclosing, announcing, disseminating news. Factual texts reporting on a state of affairs, like newswires (including sport) and police reports. Posts discussing, analyzing, advocating about a specific social/environmental/technological/economic issue, including propaganda adverts, political pamphlets.
    • Commercial content: product reviews, product shopping, on-line store, product catalogue, service catalogue, product related how-tos, FAQs, tutorials.
    • Educational and research content: tutorials, guidebooks, how-to guides, instructional material, educational material. Research papers, books. Catalogues, glossaries. Conferences, institutions, project pages. Health also belongs here.
    • Discussion spaces: includes dedicated forums, chat spaces, blogs, etc. Standard comment forms do not count.
    • Personal/Leisure: arts, music, home, family, kids, games, horoscopes etc. A personal blog for example belongs both here and to "discussion".
    • Media: video, audio, ... In general a site where the main content is not text but media. For example a site about music is probably leisure and not media.
    • Database: a "deep web" site whose content can be retrieved only by querying a database. Sites offering forms fall in this category.
    • Adult: porn (will be discarded from sample)

    Also mark unknown for hosts with little or no running text, like forms for queries, logins, download pages, flash animation, samples of source code, etc; one important subcategory here is index, i.e., portals, sitemaps, other lists of links (mostly containing incomplete or isolated sentences).
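    The genre rules of step 2 (labels are not exclusive, at least one must be selected, and spam makes the remaining labels unnecessary) can be sketched as follows. The label strings are shorthand chosen for this sketch, not the interface's identifiers, and treating "unknown" as a selectable label alongside the genres is an assumption.

```python
from typing import Iterable

# Shorthand label names for this sketch only; "unknown" is included by assumption.
GENRES = frozenset({
    "spam", "news", "commercial", "educational", "discussion",
    "personal", "media", "database", "adult", "unknown",
})


def can_continue(selected: Iterable[str]) -> bool:
    """At least one recognised label must be selected; several may be combined."""
    chosen = set(selected)
    unrecognised = chosen - GENRES
    if unrecognised:
        raise ValueError(f"unrecognised labels: {sorted(unrecognised)}")
    return len(chosen) >= 1


def needs_remaining_labels(selected: Iterable[str]) -> bool:
    """For spam hosts the remaining labels do not have to be filled in."""
    return "spam" not in set(selected)


if __name__ == "__main__":
    print(can_continue(["educational", "database"]))
    print(needs_remaining_labels(["spam"]))
```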

    3. Flag any serious readability issues you find in the two categories below. Decide at the host level, by viewing a sufficient number of sample pages. A site that contains posts, forums, etc. related to its main content should be assessed based on its main content.

    • Serious perceptual issues: contrasting color, layout, etc. that make the text hard to read.
    • Serious linguistic correctness issues -- the text is poorly written: incorrect style, abundant grammar and spelling errors.

    4. Trustworthiness

    This measure applies only to hosts of the following type:

    • News
    • Commercial
    • Educational
    • Media
    • Database

    We assess sites based on viewing a sample of pages. We drop sites that contain a mix. A site that contains posts, forums etc related to its main content should be assessed based on its main content; these sites are not considered a mix.

    1. I do not trust this. There are aspects of the site that make me distrust this source.
    2. I trust this marginally. Looks like an authoritative source but its ownership is unclear.
    3. I trust this fully. This is a famous authoritative source (a famous newspaper, company, organization)
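    A minimal sketch of the trust scale and its applicability rule is given below. The names `Trust`, `TRUST_GENRES`, and `trust_applies` are hypothetical, and the reading that trust applies as soon as a host carries any one of the listed genres is an assumption, since the guideline does not say how hosts with several genres are handled.

```python
from enum import IntEnum
from typing import Iterable


class Trust(IntEnum):
    """The three-point trust scale, numbered as in the list above."""
    DISTRUST = 1   # "I do not trust this"
    MARGINAL = 2   # "I trust this marginally"
    FULL = 3       # "I trust this fully"


# Shorthand genre names for this sketch only.
TRUST_GENRES = frozenset({"news", "commercial", "educational", "media", "database"})


def trust_applies(genres: Iterable[str]) -> bool:
    """Assumption: trust is assessed if any selected genre is in the list."""
    return bool(TRUST_GENRES & set(genres))


if __name__ == "__main__":
    print(trust_applies(["discussion"]))
    print(trust_applies(["personal", "news"]))
```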

    5. Neutrality.
    First check if the site presents information as facts or opinions. Labels are:

    • Facts: I think these are mostly facts
    • Fact/Opinion: I think these are opinions and facts; facts are included in the site or referenced from external sources.
    • Opinion: I think this is mostly an opinion that may or may not be supported by facts, but little or no facts are included or referenced.

    Next we flag biased content. We adapt the definition from Wikipedia (http://en.wikipedia.org/wiki/NPOV): "The neutral point of view is a means of dealing with conflicting perspectives on a topic as evidenced by reliable sources. It requires that all majority- and significant-minority views be presented fairly, in a disinterested tone, and in rough proportion to their prevalence within the source material. The neutral point of view neither sympathizes with nor disparages its subject, nor does it endorse or oppose specific viewpoints. It is not a lack of viewpoint, but is rather a specific, editorially neutral, point of view. An article should clearly describe, represent, and characterize all the disputes within a topic, but should not endorse any particular point of view. It should explain who believes what, and why, and which points of view are most common. It may contain critical evaluations of particular viewpoints based on reliable sources, but even text explaining sourced criticisms of a particular view must avoid taking sides."

    We assess neutrality only for hosts of the following type:

    • News
    • Educational
    • Media
    • Database
    • Commercial, with the remark below.

    For commercial sites this measure indicates no bias towards an undeclared target, e.g. review sites with a clear bias towards or against a certain brand or product qualify as biased. A company home page advertising its own product qualifies as neutral unless it includes biased opinion about competitors.

    We assess sites based on viewing a sample of pages. We drop sites that contain a mix. A site that contains posts, forums etc related to its main content should be assessed based on its main content; these sites are not considered a mix.

    Biased flag: This source seems very biased to me: it promotes a particular religion, ideology, philosophy or political standpoint. Also flag flames, assaults, and dishonest opinion without reference to facts.

    Examples:

    For inquiries about the Challenge please contact András Benczúr.

    Last updated: 23 April, 2010.
