These features are the transformed link-based features used in:
Carlos Castillo, Debora Donato, Aristides Gionis,
Vanessa Murdock, Fabrizio Silvestri:
"Know your Neighbors: Web Spam Detection using the Web Topology".
In Proceedings of ACM SIGIR, pp. 423-430. Amsterdam, Netherlands, 2007. ACM Press.
http://doi.acm.org/10.1145/1277741.1277814
http://www.dcc.uchile.cl/~ccastill/papers/cdgms_2006_know_your_neighbors.pdf
Note: when computing these features, we used a few conventions
to avoid null values:
log(x) = -50 if x<=0
x/0 = 1 if x==0
x/0 = 0 if x!=0
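The three conventions above can be sketched as two small helpers. This is only an illustration, not the code used to build the files: the names safe_log and safe_div are hypothetical, and the natural logarithm is an assumption, since the base of the log is not stated here.

```python
import math

LOG_FLOOR = -50.0  # value substituted for log(x) when x <= 0


def safe_log(x: float) -> float:
    """Log of x (natural log assumed), or -50 when x is zero or negative."""
    return math.log(x) if x > 0 else LOG_FLOOR


def safe_div(x: float, y: float) -> float:
    """x / y, with 0/0 defined as 1 and x/0 defined as 0 for nonzero x."""
    if y == 0:
        return 1.0 if x == 0 else 0.0
    return x / y
```

With these helpers, a feature such as log(indegree/pagerank) would be safe_log(safe_div(indegree, pagerank)), so a missing PageRank yields -50 rather than a null.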
The URL identifiers of each host's home page and of its page with the
maximum PageRank are listed here:
http://datamining.sztaki.hu/files/DiscoveryChallenge/features/v2-DiscoveryChallenge2010.homepageuid_maxpruid.csv.gz
In both cases, the second field is a zero-based index into this list of
URLs:
http://datamining.sztaki.hu/files/DiscoveryChallenge/links/v2-DiscoveryChallenge2010.urls.txt.gz
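Resolving such a zero-based index against a gzip-compressed list could look like the sketch below. It assumes the list holds one entry per line in UTF-8, which is not confirmed here; the function name entry_at is hypothetical, and the whole line is returned unparsed.

```python
import gzip
from itertools import islice


def entry_at(path: str, index: int) -> str:
    """Return the line at zero-based position `index` in a gzip text file."""
    if index < 0:
        raise IndexError("index must be non-negative")
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # Skip `index` lines, then take the next one (None if the file is shorter).
        line = next(islice(handle, index, None), None)
    if line is None:
        raise IndexError(f"index {index} is past the end of {path}")
    return line.rstrip("\n")
```

Index 0 returns the first line of the file, matching the "starting from zero" convention.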
======================================================================
hostid
Identifier of the host in the hostgraph
log_indegree_hp
Log of indegree of home page (=hp) of the host
log_indegree_mp
Log of indegree of page with maximum pagerank (=mp) of the host
log_outdegree_hp
Log of outdegree of hp
log_outdegree_mp
Log of outdegree of mp
reciprocity_hp
Edge-reciprocity (fraction of out-links that are in-links) of hp
reciprocity_mp
Edge-reciprocity (fraction of out-links that are in-links) of mp
log_assortativity_hp
Log of assortativity (the page's degree divided by the average degree of its neighbors) of hp
log_assortativity_mp
Log of assortativity (the page's degree divided by the average degree of its neighbors) of mp
log_avgin_of_out_hp
Log of average indegree of outneighbors of hp
log_avgin_of_out_mp
Log of average indegree of outneighbors of mp
log(avgin_of_out*outdegree)_hp
Log of sum of the indegree of outneighbors of hp
log(avgin_of_out*outdegree)_mp
Log of sum of the indegree of outneighbors of mp
log_avgout_of_in_hp
Log of average outdegree of inneighbors of hp
log_avgout_of_in_mp
Log of average outdegree of inneighbors of mp
log(avgout_of_in*indegree)_hp
Log of the sum of the outdegree of inneighbors of hp
log(avgout_of_in*indegree)_mp
Log of the sum of the outdegree of inneighbors of mp
eq_hp_mp
Is the homepage the same as the page with max. pagerank? 0=no/1=yes
log_pagerank_hp
Log of pagerank of hp
log_pagerank_mp
Log of pagerank of mp
log(indegree/pagerank)_hp
Log of indegree/pagerank of hp
log(indegree/pagerank)_mp
Log of indegree/pagerank of mp
log(outdegree/pagerank)_hp
Log of outdegree/pagerank of hp
log(outdegree/pagerank)_mp
Log of outdegree/pagerank of mp
log_prsigma_hp
Log of standard deviation of PageRank of inneighbors of hp
log_prsigma_mp
Log of standard deviation of PageRank of inneighbors of mp
log(prsigma/pagerank)_hp
Log of (standard deviation of PageRank of inneighbors divided by PageRank) of hp
log(prsigma/pagerank)_mp
Log of (standard deviation of PageRank of inneighbors divided by PageRank) of mp
div_pagerank_mp_pagerank_hp
PageRank of mp divided by PageRank of hp
log_trustrank_hp
Log of TrustRank (using 3,800 trusted nodes from ODP) of hp
log_trustrank_mp
Log of TrustRank (using 3,800 trusted nodes from ODP) of mp
log(trustrank/pagerank)_hp
Log of TrustRank/PageRank of hp
log(trustrank/pagerank)_mp
Log of TrustRank/PageRank of mp
log(trustrank/indegree)_hp
Log of TrustRank/indegree of hp
log(trustrank/indegree)_mp
Log of TrustRank/indegree of mp
div_trustrank_mp_trustrank_hp
TrustRank of mp divided by TrustRank of hp
log_siteneighbors_1_hp
Log of number of different supporters (different sites) at distance 1 from hp
log_siteneighbors_2_hp
Log of number of different supporters (different sites) at distance 2 from hp
log_siteneighbors_3_hp
Log of number of different supporters (different sites) at distance 3 from hp
log_siteneighbors_4_hp
Log of number of different supporters (different sites) at distance 4 from hp
log_siteneighbors_1_mp
Log of number of different supporters (different sites) at distance 1 from mp
log_siteneighbors_2_mp
Log of number of different supporters (different sites) at distance 2 from mp
log_siteneighbors_3_mp
Log of number of different supporters (different sites) at distance 3 from mp
log_siteneighbors_4_mp
Log of number of different supporters (different sites) at distance 4 from mp
log(siteneighbors1/pagerank)_hp
Log of number of different supporters (different sites) at distance 1 from hp divided by PageRank
log(siteneighbors2/pagerank)_hp
Log of number of different supporters (different sites) at distance 2 from hp divided by PageRank
log(siteneighbors3/pagerank)_hp
Log of number of different supporters (different sites) at distance 3 from hp divided by PageRank
log(siteneighbors4/pagerank)_hp
Log of number of different supporters (different sites) at distance 4 from hp divided by PageRank
log(siteneighbors1/pagerank)_mp
Log of number of different supporters (different sites) at distance 1 from mp divided by PageRank
log(siteneighbors2/pagerank)_mp
Log of number of different supporters (different sites) at distance 2 from mp divided by PageRank
log(siteneighbors3/pagerank)_mp
Log of number of different supporters (different sites) at distance 3 from mp divided by PageRank
log(siteneighbors4/pagerank)_mp
Log of number of different supporters (different sites) at distance 4 from mp divided by PageRank
log(siteneighbors4/siteneighbors3)_hp
Log of number of different supporters (different sites) at distance 4 from hp divided by number of different supporters (different sites) at distance 3 from hp
log(siteneighbors4/siteneighbors3)_mp
Log of number of different supporters (different sites) at distance 4 from mp divided by number of different supporters (different sites) at distance 3 from mp
log(siteneighbors3/siteneighbors2)_hp
Log of number of different supporters (different sites) at distance 3 from hp divided by number of different supporters (different sites) at distance 2 from hp
log(siteneighbors3/siteneighbors2)_mp
Log of number of different supporters (different sites) at distance 3 from mp divided by number of different supporters (different sites) at distance 2 from mp
log(siteneighbors2/siteneighbors1)_hp
Log of number of different supporters (different sites) at distance 2 from hp divided by number of different supporters (different sites) at distance 1 from hp
log(siteneighbors2/siteneighbors1)_mp
Log of number of different supporters (different sites) at distance 2 from mp divided by number of different supporters (different sites) at distance 1 from mp
log_min_pairwise_ratio_siteneighbors_1_hp_siteneighbors_2_hp_siteneighbors_3_hp_siteneighbors_4_hp
Log of minimum change in the number of supporters (different sites) at distance i over distance i-1, for i=2..4, hp
log_min_pairwise_ratio_siteneighbors_1_mp_siteneighbors_2_mp_siteneighbors_3_mp_siteneighbors_4_mp
Log of minimum change in the number of supporters (different sites) at distance i over distance i-1, for i=2..4, mp
log_max_pairwise_ratio_siteneighbors_1_hp_siteneighbors_2_hp_siteneighbors_3_hp_siteneighbors_4_hp
Log of maximum change in the number of supporters (different sites) at distance i over distance i-1, for i=2..4, hp
log_max_pairwise_ratio_siteneighbors_1_mp_siteneighbors_2_mp_siteneighbors_3_mp_siteneighbors_4_mp
Log of maximum change in the number of supporters (different sites) at distance i over distance i-1, for i=2..4, mp
log_avg_pairwise_ratio_siteneighbors_1_hp_siteneighbors_2_hp_siteneighbors_3_hp_siteneighbors_4_hp
Log of average change in the number of supporters (different sites) at distance i over distance i-1, for i=2..4, hp
log_avg_pairwise_ratio_siteneighbors_1_mp_siteneighbors_2_mp_siteneighbors_3_mp_siteneighbors_4_mp
Log of average change in the number of supporters (different sites) at distance i over distance i-1, for i=2..4, mp
log_a_minus_b_over_c_siteneighbors_4_hp_siteneighbors_3_hp_pagerank_hp
Log of supporters at distance exactly 4 (different sites) divided by PageRank, hp
log_a_minus_b_over_c_siteneighbors_4_mp_siteneighbors_3_mp_pagerank_mp
Log of supporters at distance exactly 4 (different sites) divided by PageRank, mp
log_a_minus_b_over_c_siteneighbors_3_hp_siteneighbors_2_hp_pagerank_hp
Log of supporters at distance exactly 3 (different sites) divided by PageRank, hp
log_a_minus_b_over_c_siteneighbors_3_mp_siteneighbors_2_mp_pagerank_mp
Log of supporters at distance exactly 3 (different sites) divided by PageRank, mp
log_a_minus_b_over_c_siteneighbors_2_hp_siteneighbors_1_hp_pagerank_hp
Log of supporters at distance exactly 2 (different sites) divided by PageRank, hp
log_a_minus_b_over_c_siteneighbors_2_mp_siteneighbors_1_mp_pagerank_mp
Log of supporters at distance exactly 2 (different sites) divided by PageRank, mp
div_siteneighbors_1_hp_siteneighbors_1_mp
Supporters at distance 1 (different sites) of hp over mp
div_siteneighbors_2_hp_siteneighbors_2_mp
Supporters at distance 2 (different sites) of hp over mp
div_siteneighbors_3_hp_siteneighbors_3_mp
Supporters at distance 3 (different sites) of hp over mp
div_siteneighbors_4_hp_siteneighbors_4_mp
Supporters at distance 4 (different sites) of hp over mp
log_neighbors_2_hp
Log of supporters at distance 2, hp (note: supporters at distance 1 is indegree)
log_neighbors_3_hp
Log of supporters at distance 3, hp
log_neighbors_4_hp
Log of supporters at distance 4, hp
log_neighbors_2_mp
Log of supporters at distance 2, mp
log_neighbors_3_mp
Log of supporters at distance 3, mp
log_neighbors_4_mp
Log of supporters at distance 4, mp
log(neighbors2/pagerank)_hp
Log of supporters at distance 2 divided by PageRank, hp
log(neighbors3/pagerank)_hp
Log of supporters at distance 3 divided by PageRank, hp
log(neighbors4/pagerank)_hp
Log of supporters at distance 4 divided by PageRank, hp
log(neighbors2/pagerank)_mp
Log of supporters at distance 2 divided by PageRank, mp
log(neighbors3/pagerank)_mp
Log of supporters at distance 3 divided by PageRank, mp
log(neighbors4/pagerank)_mp
Log of supporters at distance 4 divided by PageRank, mp
log(neighbors4/neighbors3)_hp
Log of supporters at distance 4 divided by supporters at distance 3, hp
log(neighbors4/neighbors3)_mp
Log of supporters at distance 4 divided by supporters at distance 3, mp
log(neighbors3/neighbors2)_hp
Log of supporters at distance 3 divided by supporters at distance 2, hp
log(neighbors3/neighbors2)_mp
Log of supporters at distance 3 divided by supporters at distance 2, mp
log(neighbors2/indegree)_hp
Log of supporters at distance 2 divided by supporters at distance 1, hp
log(neighbors2/indegree)_mp
Log of supporters at distance 2 divided by supporters at distance 1, mp
log(min(neighbors2/indegree_neighbors3/neighbors2_neighbors4/neighbors3)))_hp
Log of minimum change of number of supporters at distance i over supporters at distance i-1, i=2..4, hp
log(min(neighbors2/indegree_neighbors3/neighbors2_neighbors4/neighbors3)))_mp
Log of minimum change of number of supporters at distance i over supporters at distance i-1, i=2..4, mp
log(max(neighbors2/indegree_neighbors3/neighbors2_neighbors4/neighbors3)))_hp
Log of maximum change of number of supporters at distance i over supporters at distance i-1, i=2..4, hp
log(max(neighbors2/indegree_neighbors3/neighbors2_neighbors4/neighbors3)))_mp
Log of maximum change of number of supporters at distance i over supporters at distance i-1, i=2..4, mp
log(avg(neighbors2/indegree_neighbors3/neighbors2_neighbors4/neighbors3)))_hp
Log of average change of number of supporters at distance i over supporters at distance i-1, i=2..4, hp
log(avg(neighbors2/indegree_neighbors3/neighbors2_neighbors4/neighbors3)))_mp
Log of average change of number of supporters at distance i over supporters at distance i-1, i=2..4, mp
log((neighbors4-neighbors3)/pagerank)_hp
Log of number of supporters at distance exactly 4 over pagerank, hp
log((neighbors4-neighbors3)/pagerank)_mp
Log of number of supporters at distance exactly 4 over pagerank, mp
log((neighbors3-neighbors2)/pagerank)_hp
Log of number of supporters at distance exactly 3 over pagerank, hp
log((neighbors3-neighbors2)/pagerank)_mp
Log of number of supporters at distance exactly 3 over pagerank, mp
log((neighbors2-indegree)/pagerank)_hp
Log of number of supporters at distance exactly 2 over pagerank, hp
log((neighbors2-indegree)/pagerank)_mp
Log of number of supporters at distance exactly 2 over pagerank, mp
div_neighbors_2_hp_neighbors_2_mp
Supporters at distance 2 of hp divided by supporters at distance 2 of mp
div_neighbors_3_hp_neighbors_3_mp
Supporters at distance 3 of hp divided by supporters at distance 3 of mp
div_neighbors_4_hp_neighbors_4_mp
Supporters at distance 4 of hp divided by supporters at distance 4 of mp
log_truncated_pagerank_1_hp
Log of TruncatedPageRank with T=1, hp
log_truncated_pagerank_2_hp
Log of TruncatedPageRank with T=2, hp
log_truncated_pagerank_3_hp
Log of TruncatedPageRank with T=3, hp
log_truncated_pagerank_4_hp
Log of TruncatedPageRank with T=4, hp
log_truncated_pagerank_1_mp
Log of TruncatedPageRank with T=1, mp
log_truncated_pagerank_2_mp
Log of TruncatedPageRank with T=2, mp
log_truncated_pagerank_3_mp
Log of TruncatedPageRank with T=3, mp
log_truncated_pagerank_4_mp
Log of TruncatedPageRank with T=4, mp
log(truncated_pagerank/pagerank)_hp
Log of TruncatedPageRank with T=1 divided by PageRank, hp
log(truncated_pagerank2/pagerank)_hp
Log of TruncatedPageRank with T=2 divided by PageRank, hp
log(truncated_pagerank3/pagerank)_hp
Log of TruncatedPageRank with T=3 divided by PageRank, hp
log(truncated_pagerank4/pagerank)_hp
Log of TruncatedPageRank with T=4 divided by PageRank, hp
log(truncated_pagerank/pagerank)_mp
Log of TruncatedPageRank with T=1 divided by PageRank, mp
log(truncated_pagerank2/pagerank)_mp
Log of TruncatedPageRank with T=2 divided by PageRank, mp
log(truncated_pagerank3/pagerank)_mp
Log of TruncatedPageRank with T=3 divided by PageRank, mp
log(truncated_pagerank4/pagerank)_mp
Log of TruncatedPageRank with T=4 divided by PageRank, mp
log(truncated_page_rank4/truncated_pagerank3)_hp
Log of TruncatedPageRank with T=4 divided by TruncatedPageRank with T=3, hp
log(truncated_page_rank4/truncated_pagerank3)_mp
Log of TruncatedPageRank with T=4 divided by TruncatedPageRank with T=3, mp
log(truncated_page_rank3/truncated_pagerank2)_hp
Log of TruncatedPageRank with T=3 divided by TruncatedPageRank with T=2, hp
log(truncated_page_rank3/truncated_pagerank2)_mp
Log of TruncatedPageRank with T=3 divided by TruncatedPageRank with T=2, mp
log(truncated_page_rank2/truncated_pagerank)_hp
Log of TruncatedPageRank with T=2 divided by TruncatedPageRank with T=1, hp
log(truncated_page_rank2/truncated_pagerank)_mp
Log of TruncatedPageRank with T=2 divided by TruncatedPageRank with T=1, mp
log(min(trpr2/trpr_trpr3/trpr2_trpr4/trpr3))_hp
Log of minimum of TruncatedPageRank with T=i over TruncatedPageRank with T=i-1, i=2..4, hp
log(min(trpr2/trpr_trpr3/trpr2_trpr4/trpr3))_mp
Log of minimum of TruncatedPageRank with T=i over TruncatedPageRank with T=i-1, i=2..4, mp
log(max(trpr2/trpr_trpr3/trpr2_trpr4/trpr3))_hp
Log of maximum of TruncatedPageRank with T=i over TruncatedPageRank with T=i-1, i=2..4, hp
log(max(trpr2/trpr_trpr3/trpr2_trpr4/trpr3))_mp
Log of maximum of TruncatedPageRank with T=i over TruncatedPageRank with T=i-1, i=2..4, mp
log(avg(trpr2/trpr_trpr3/trpr2_trpr4/trpr3))_hp
Log of average of TruncatedPageRank with T=i over TruncatedPageRank with T=i-1, i=2..4, hp
log(avg(trpr2/trpr_trpr3/trpr2_trpr4/trpr3))_mp
Log of average of TruncatedPageRank with T=i over TruncatedPageRank with T=i-1, i=2..4, mp
div_truncated_pagerank_1_mp_truncated_pagerank_1_hp
TruncatedPageRank with T=1 in mp, divided by TruncatedPageRank with T=1 in hp
div_truncated_pagerank_2_mp_truncated_pagerank_2_hp
TruncatedPageRank with T=2 in mp, divided by TruncatedPageRank with T=2 in hp
div_truncated_pagerank_3_mp_truncated_pagerank_3_hp
TruncatedPageRank with T=3 in mp, divided by TruncatedPageRank with T=3 in hp
div_truncated_pagerank_4_mp_truncated_pagerank_4_hp
TruncatedPageRank with T=4 in mp, divided by TruncatedPageRank with T=4 in hp
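
Several of the derived features above (the min/max/avg pairwise ratios and the "distance exactly i" features) follow the same pattern, which the following sketch illustrates. It is not the original feature-extraction code. All function names are hypothetical. It assumes the natural logarithm, an arithmetic mean for "average", and that the supporter count at distance i is cumulative, so that the difference of two consecutive counts gives the supporters at exactly distance i.

```python
import math


def safe_log(x):
    """Log with the convention log(x) = -50 if x <= 0 (natural log assumed)."""
    return math.log(x) if x > 0 else -50.0


def safe_div(x, y):
    """Division with the conventions 0/0 = 1 and x/0 = 0 for nonzero x."""
    if y == 0:
        return 1.0 if x == 0 else 0.0
    return x / y


def pairwise_ratios(counts):
    """Ratios counts[i] / counts[i-1] for consecutive distances."""
    return [safe_div(b, a) for a, b in zip(counts, counts[1:])]


def growth_features(counts):
    """Log of the min, max and (arithmetic) mean of the pairwise ratios."""
    ratios = pairwise_ratios(counts)
    return {
        "log_min": safe_log(min(ratios)),
        "log_max": safe_log(max(ratios)),
        "log_avg": safe_log(sum(ratios) / len(ratios)),
    }


def exactly_at_over_pagerank(count_d, count_d_minus_1, pagerank):
    """log((a - b) / c): supporters at exactly distance d, over PageRank."""
    return safe_log(safe_div(count_d - count_d_minus_1, pagerank))
```

For example, growth_features([10, 40, 80, 80]) works on the ratios 4, 2 and 1 for distances 2 to 4, and exactly_at_over_pagerank(80, 80, 0.5) falls back to -50 because no new supporters appear at the last distance.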