Feature Selection for Ranking

ABSTRACT
In practice, many feature selection methods developed for classification are applied directly to ranking. We argue that, because of the striking differences between ranking and classification, it is better to develop feature selection methods specifically for ranking. In this paper we propose a new feature selection method for ranking: we define the ranking accuracy, measured by a performance measure or a loss function, as the importance of a feature, and the correlation between the ranking results of two features as the similarity between them. We also demonstrate how to solve the resulting optimization problem in an efficient way. We have tested the effectiveness of our feature selection method on two information retrieval datasets and with two ranking models, Ranking SVM and RankNet.

1. INTRODUCTION
First, feature selection can help enhance accuracy in many machine learning problems, which strongly indicates that feature selection is also necessary for ranking.
Second, feature selection can also help improve the efficiency of training. In information retrieval, and especially in web search, the data size is usually very large, so training ranking models is computationally costly.

There have been no feature selection methods proposed specifically for ranking; most of the methods used in ranking were developed for classification. Feature selection methods in classification fall into three categories. In the first category, named filter, feature selection is defined as a preprocessing step and can be independent of learning; information gain (IG) and chi-square (CHI) are among the most effective filter methods for classification. The second category, referred to as wrapper, utilizes the learning system as a black box to score subsets of features. The third category, called the embedded method, performs feature selection within the process of training.

Several problems may arise when these feature selection methods are applied directly to ranking. First, existing feature selection methods were designed for classification, not for ranking: in ranking, a number of ordered categories are used, representing the ranking relationship between instances, while in classification the categories are “flat”. Second, the evaluation measures used in ranking (e.g. mean average precision (MAP) and normalized discounted cumulative gain (NDCG)) differ from those used in classification: 1) in ranking, precision is usually more important than recall, while in classification both precision and recall are important; 2) in ranking, correctly ranking the top-n instances is most critical, while in classification a correct decision is of equal significance for every instance.
Our novel method for feature selection in ranking has the following properties. 1) It makes use of ranking information, instead of simply viewing the ranks as flat categories. 2) Inspired by earlier work [11427], it considers the similarities between features and tries to avoid selecting redundant features. 3) It models feature selection for ranking as a multi-objective optimization problem; the final objective is to find a set of features with maximum importance and minimum similarity. 4) It provides a greedy search algorithm to solve the optimization problem, and the solution produced is proven to be equivalent to the optimal solution of the original problem under a certain condition.

2. FEATURE SELECTION METHOD
2.1 Overview
Suppose the goal is to select t (1 ≤ t ≤ m) features from the entire feature set {f_1, f_2, ..., f_m}. In our method we first define an importance score for each feature and a similarity score between any two features. We then employ an efficient algorithm to maximize the total importance and minimize the total similarity of the selected set of features.

2.2 Importance of a feature
We first assign an importance score to each feature. Specifically, we propose using either an evaluation measure such as MAP or NDCG, or a loss function (e.g. pairwise ranking errors [1013]), to compute the importance score. In the former case, we rank the instances using the feature alone, evaluate the performance in terms of the measure, and take the evaluation result as the importance score. In the latter case, we again rank the instances using the feature alone and take a score inversely proportional to the corresponding loss as the importance score.
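A minimal sketch of the evaluation-measure variant, assuming a simple data layout (one (features, labels) pair per query, with binary relevance labels); it ranks the documents of each query by a single feature and returns the resulting MAP as that feature's importance score:

def average_precision(relevances):
    # relevances: binary labels of the documents, already sorted by
    # descending value of the feature (the ranking the feature induces)
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank      # precision at this position
    return precision_sum / hits if hits else 0.0

def importance_by_map(queries, feature_idx):
    # queries: list of (features, labels) pairs, one per query, where
    # features[d][f] is the value of feature f for document d
    ap_values = []
    for features, labels in queries:
        order = sorted(range(len(labels)),
                       key=lambda d: features[d][feature_idx],
                       reverse=True)          # rank documents by the feature
        ap_values.append(average_precision([labels[d] for d in order]))
    return sum(ap_values) / len(ap_values)    # MAP = mean AP over queries

The loss-function variant is analogous: rank by the feature, compute the pairwise ranking loss, and use, for example, a score inversely proportional to that loss.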
2.3 Similarity between features
In this work, we measure the similarity between any two features on the basis of their ranking results. That is, we regard each feature as a ranking model, and the similarity between two features is represented by the similarity between the ranking results they produce. We choose Kendall's τ as an example: the Kendall's τ value between two features f_i and f_j on a query q can be calculated as described below.
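The standard definition of Kendall's τ for a query q, which is what we take the above to mean, compares how the two features order the document pairs of the query:

\tau_q(f_i, f_j) = \frac{\#\{\text{concordant pairs}\} - \#\{\text{discordant pairs}\}}{n_q (n_q - 1)/2}

where n_q is the number of documents associated with q, and a document pair is concordant if f_i and f_j place its two members in the same relative order and discordant otherwise. A single similarity score e_{i,j} can then be obtained by averaging \tau_q over all training queries (the averaging step is our assumption).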
2.4 Optimization formulation
We want to select the features with the largest total importance score and the smallest total similarity score. Mathematically, this can be represented as the multi-objective program (1), which has two objectives: to maximize the sum of the importance scores of the individual features, and to minimize the sum of the similarity scores between any two selected features. We take a common approach in optimization and convert the multi-objective program into the single-objective program (2) using a linear combination of the two objectives.
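The programs (1) and (2) are not written out above; a plausible reconstruction, with importance scores w_i, pairwise similarities e_{i,j}, selection indicators x_i \in \{0,1\}, and a combination coefficient c > 0, is:

(1)  \max_x \sum_{i=1}^{m} w_i x_i \quad\text{and}\quad \min_x \sum_{i<j} e_{i,j} x_i x_j,
     \quad \text{s.t. } \sum_{i=1}^{m} x_i = t,\; x_i \in \{0,1\}

(2)  \max_x \; \sum_{i=1}^{m} w_i x_i \;-\; c \sum_{i<j} e_{i,j} x_i x_j,
     \quad \text{s.t. } \sum_{i=1}^{m} x_i = t,\; x_i \in \{0,1\}

Here c expresses the tradeoff between total importance and total redundancy; the exact form in the original paper may differ in how the pairwise term is indexed.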
2.5 Solution to the optimization problem
The optimization in (2) is a typical 0-1 integer programming problem. One possible approach would be exhaustive search over all feature subsets of size t; however, its combinatorial time complexity is too high for real applications, so we need a more practical solution. In this work, we propose a greedy search algorithm for tackling the issue.

Algorithm GAS (Greedy search Algorithm of feature Selection)
1. Construct an undirected graph G_0 in which each node represents a feature, the weight of node f_i is its importance score w_i, and the weight of the edge between nodes f_i and f_j is their similarity e_{i,j}.
2. Construct a set S to contain the selected features; initially S_0 = ∅.
3. For i = 1, ..., t:
(1) Select the node with the largest weight; without loss of generality, suppose the selected node is f_{k_i}.
(2) Apply a punishment to all the other nodes according to their similarities with f_{k_i}; that is, decrease the weight of each remaining node in proportion to the weight of its edge to f_{k_i}.
(3) Add f_{k_i} to the set (S_i = S_{i-1} ∪ {f_{k_i}}) and remove it from the graph together with all edges connected to it.
4. Output S_t.
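A minimal Python sketch of GAS under the reconstructed single-objective form (2); the exact scaling of the punishment step (here c times the edge weight) is an assumption, since the update formula is not written out above:

def gas(w, e, t, c):
    # w: importance scores (node weights of G_0), one per feature
    # e: symmetric matrix of pairwise similarities (edge weights of G_0)
    # t: number of features to select; c: similarity-penalty coefficient
    w = list(w)                          # working copy of the node weights
    remaining = set(range(len(w)))       # nodes still present in the graph
    selected = []                        # the set S, in selection order
    for _ in range(t):
        # (1) pick the remaining node with the largest (punished) weight
        k = max(remaining, key=lambda j: w[j])
        # (2) punish all other nodes according to their similarity with k
        for j in remaining:
            if j != k:
                w[j] -= c * e[k][j]      # assumed scaling, matching (2)
        # (3) add k to S and remove it, with its edges, from the graph
        selected.append(k)
        remaining.remove(k)
    return selected

For example, gas(w, e, t=5, c=0.5) returns the indices of five selected features; the cost is O(t·m) weight updates plus the one-off cost of computing w and e.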
Theorem 1: With the greedy search algorithm (GAS) above, one can find the optimal solution to problem (2), provided that the optimal feature sets are nested, i.e. S*_t ⊂ S*_{t+1} for every t, where S*_t denotes the optimal feature set containing t features.

Proof (sketch): The condition indicates that when selecting the (t+1)-th feature we do not change the already-selected t features S_t = {f_{k_i} | i = 1, ..., t}, where f_{k_i} is the feature selected in the i-th iteration. The task therefore becomes that of finding the (t+1)-th feature such that the objective in (2), evaluated on S_t ∪ {f_j}, is maximized; this reduces to the greedy selection rule of GAS (see the sketch below), and the claim follows by induction from S_0 = ∅.
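A sketch of the key step, under the reconstructed form of (2) and its coefficient c (both assumptions carried over from above): since the contribution of the fixed set S_t to (2) is a constant, selecting the (t+1)-th feature reduces to

f_{k_{t+1}} = \arg\max_{f_j \notin S_t} \Big( w_j - c \sum_{f_i \in S_t} e_{i,j} \Big),

which is exactly the punished node weight maintained by step 3(2) of GAS.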
3. EXPERIMENT SETTINGS
3.1 Datasets
In our experiments, we used two benchmark datasets. 1) The first dataset is the .gov data, which was used in the topic distillation task of the Web track at TREC 2004 [28]. 2) The second dataset is the OHSUMED data [9], which has been used in many information retrieval experiments [610], including the TREC-9 filtering track [26].

3.2 Evaluation measures
3.2.1 Mean average precision (MAP)
MAP measures the precision of ranking results. It is assumed that there are two types of documents: positive and negative (relevant and irrelevant). Precision at n measures the accuracy of the top n results for a query, and the average precision of a query is calculated from precision at n, as given below.
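The standard definitions, which we take to be the ones intended, are:

P@n = \frac{\#\{\text{relevant documents in the top } n \text{ results}\}}{n}

AP(q) = \frac{1}{R_q} \sum_{n=1}^{N_q} P@n \cdot \mathrm{rel}(n),
\qquad
MAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q)

where N_q is the number of retrieved documents for query q, \mathrm{rel}(n) \in \{0,1\} indicates whether the document at position n is relevant, R_q is the number of relevant documents for q, and Q is the set of queries.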
3.2.2 Normalized discounted cumulative gain (NDCG)
NDCG is designed for measuring ranking accuracy when there are multiple levels of relevance judgment. Given a query, NDCG at position n is defined as given below.
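The usual form of the definition, which we assume is the one used, is:

NDCG@n = Z_n \sum_{j=1}^{n} \frac{2^{r(j)} - 1}{\log_2(1 + j)}

where r(j) is the relevance grade of the document at position j and Z_n is a normalization constant chosen so that a perfect ranking has NDCG@n = 1.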
3.3 Ranking models
3.3.1 Ranking SVM
Ranking SVM extends SVM to ranking: in contrast to traditional SVM, which works on individual instances, Ranking SVM utilizes instance pairs and their preference labels in training. The optimization formulation of Ranking SVM is given below.
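The standard Ranking SVM formulation over preference pairs, with a linear scoring function f(x) = \vec{w}^{\top} x, is:

\min_{\vec{w}, \xi} \; \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i} \xi_i
\quad \text{s.t.} \quad \vec{w}^{\top} x_i^{(1)} \ge \vec{w}^{\top} x_i^{(2)} + 1 - \xi_i,\; \xi_i \ge 0 \;\; \forall i

where each pair (x_i^{(1)}, x_i^{(2)}) records that the first document is preferred to the second, the slack variables \xi_i absorb violated constraints, and C trades off margin against training error.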
3.3.2 RankNet
RankNet also uses instance pairs in training. It employs a neural network as the ranking function and relative entropy as the loss function. Let P_{ij} be the posterior probability, estimated by the model, that instance i should be ranked ahead of instance j, and let P̄_{ij} be the corresponding “true” posterior probability. The loss for an instance pair in RankNet is defined as given below.
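In the original RankNet formulation (Burges et al.), with o_{ij} the difference of the network outputs for the two instances, the modeled probability and the cross-entropy (relative-entropy) loss are:

P_{ij} = \frac{e^{o_{ij}}}{1 + e^{o_{ij}}},
\qquad
C_{ij} = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log (1 - P_{ij})
       = -\bar{P}_{ij}\, o_{ij} + \log\!\big(1 + e^{o_{ij}}\big)

where \bar{P}_{ij} is the “true” posterior probability that instance i should be ranked ahead of instance j.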
3.4 Algorithms for comparison
Our proposed algorithm has two variants: GAS-E, which computes feature importance with an evaluation measure, and GAS-L, which computes it with a loss function (see Section 2.2). For comparison, we selected IG and CHI as baselines: IG measures the reduction in uncertainty (entropy) of the classification prediction obtained by knowing the feature, and CHI measures the lack of independence (i.e. the association) between the feature and the categories. We also used “With All Features (WAF)” as another baseline, in order to show the benefit of conducting feature selection.
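For reference, the standard text-categorization forms of the two criteria (the formulations of Yang and Pedersen, which we assume are the ones used) are, for a binary feature f and categories c_k:

IG(f) = -\sum_{k} P(c_k)\log P(c_k)
        + P(f)\sum_{k} P(c_k \mid f)\log P(c_k \mid f)
        + P(\bar{f})\sum_{k} P(c_k \mid \bar{f})\log P(c_k \mid \bar{f})

\chi^2(f, c) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}

where A, B, C, D are the cells of the 2×2 contingency table of feature occurrence versus membership in category c, and N = A + B + C + D.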
4. EXPERIMENTAL RESULTS
4.1 The .gov data
Fig. 2 shows the performance of the feature selection methods on the .gov dataset when they work as preprocessors of Ranking SVM, and Fig. 3 shows the performance when RankNet is used as the ranking model. In the figures, the x-axis represents the number of selected features. The experimental results indicate that in most cases GAS-L outperforms GAS-E, although not significantly. They also indicate that with GAS-L and GAS-E as feature selection methods, the ranking performance of Ranking SVM is more stable than with IG and CHI.

4.2 OHSUMED data
Fig. 4 shows the results of the different feature selection methods on the OHSUMED dataset when they work as preprocessors of Ranking SVM, and Fig. 5 shows the corresponding results for RankNet.

4.3 Discussions
From the results on the two datasets, we made the following observations: 1) feature selection improves ranking performance more significantly on the .gov dataset than on the OHSUMED dataset; 2) our proposed algorithms outperform IG and CHI more significantly on the .gov dataset than on the OHSUMED dataset. To figure out the reasons, we conducted the following additional experiments.
We first plotted the importance of each feature in the two datasets in Fig. 6; the x-axis represents features and the y-axis represents their MAP values when they are regarded as ranking models. We then plotted the similarity between any two features (in terms of Kendall's τ) in the two datasets in Fig. 7; here both the x-axis and the y-axis represent features, and the level of darkness represents the strength of similarity (the darker, the more similar). Based on the discussion above, we conclude that our method works very well when the effectiveness of individual features varies greatly and there are redundant features. When applying our method in practice, therefore, one can first test these two aspects.
5. CONCLUSIONS AND FUTURE WORK
In this paper, we have proposed an optimization method for feature selection in ranking. The contributions of this paper include the following points. 1) We have discussed the differences between classification and ranking, and made clear the limitations of the existing feature selection methods when applied to ranking. 2) We have proposed a novel method to select features for ranking, in which the problem is formalized as an optimization issue. 3) We have evaluated the proposed method on two public datasets, with two ranking models, and in terms of a number of evaluation measures. The experimental results have validated the effectiveness and efficiency of the proposed method.

Feature selection for ranking is an important research topic, and there are still many open problems to be addressed. 1) In this paper, we have used measures such as MAP and NDCG to compute the importance of a feature, and measures such as Kendall's τ to compute the similarity between features; other importance and similarity measures are worth investigating. 2) We have only given a greedy search algorithm for the optimization, which is guaranteed to find the optimal solution of the integer programming problem only under a certain condition; with a more powerful search algorithm, one can expect an improvement in ranking performance over the results reported in this paper. 3) There are two objectives in our optimization method for feature selection, and in this paper we have combined them linearly for simplicity; in principle, one could employ other ways to represent the tradeoff between the two objectives. 4) We have demonstrated the effectiveness of our method on two datasets and with a small number of manually extracted features; it is necessary to further conduct experiments on larger datasets and with more features.