CS276B Text Information Retrieval, Mining, and Exploitation
Lecture 5, 23 January 2003

Today's topics
- Feature selection for text classification
- Measuring classification performance
- Nearest neighbor categorization

Feature Selection: Why?
- Text collections have a large number of features
  - 10,000 to 1,000,000 unique words, and more
- Make using a particular classifier feasible
  - Some classifiers can't deal with 100,000s of features
- Reduce training time
  - Training time for some methods is quadratic or worse in the number of features (e.g., logistic regression)
- Improve generalization
  - Eliminate noise features
  - Avoid overfitting

Recap: Feature Reduction
- Standard ways of reducing the feature space for text
  - Stemming: laugh, laughs, laughing, laughed -> laugh
  - Stop word removal: e.g., eliminate all prepositions
  - Conversion to lower case
  - Tokenization: break on all special characters: fire-fighter -> fire, fighter

Feature Selection
- Yang and Pedersen 1997
- Comparison of different selection criteria
  - DF: document frequency
  - IG: information gain
  - MI: mutual information
  - CHI: chi-square
- Common strategy
  - Compute the statistic for each term
  - Keep the n terms with the highest value of this statistic (see the sketch below)

Information Gain

(Pointwise) Mutual Information
- MI(t, c) = log [ P(t, c) / (P(t) P(c)) ]

Chi-Square
- Contingency table for a term and a category:

                                          Term present   Term absent
  Document belongs to category                 A              B
  Document does not belong to category         C              D

- X^2 = N (AD - BC)^2 / [(A+B)(A+C)(B+D)(C+D)]
- Use either the maximum or the average X^2 over categories
- Value for complete independence?
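To make the selection strategy concrete, here is a minimal sketch (plain Python; the function names and the feature-set size are placeholders, not from the lecture) that scores each term with the X^2 statistic from the contingency counts above and keeps the n highest-scoring terms:

```python
from collections import Counter

def chi_square(A, B, C, D):
    """X^2 for one (term, category) pair from the 2x2 contingency counts:
    A = in-category docs containing the term, B = in-category docs without it,
    C = out-of-category docs containing the term, D = out-of-category docs without it."""
    N = A + B + C + D
    denom = (A + B) * (A + C) * (B + D) * (C + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

def select_top_terms(docs, labels, category, n=1000):
    """Keep the n terms with the highest X^2 for the given category.
    docs: list of token sets; labels: one category label per doc."""
    in_cat  = [d for d, y in zip(docs, labels) if y == category]
    out_cat = [d for d, y in zip(docs, labels) if y != category]
    df_in  = Counter(t for d in in_cat for t in d)    # document frequency inside the category
    df_out = Counter(t for d in out_cat for t in d)   # document frequency outside the category
    scores = {}
    for t in set(df_in) | set(df_out):
        A, C = df_in[t], df_out[t]
        B, D = len(in_cat) - A, len(out_cat) - C
        scores[t] = chi_square(A, B, C, D)
    return sorted(scores, key=scores.get, reverse=True)[:n]
```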
Document Frequency
- Number of documents a term occurs in
- Sometimes used for eliminating both very frequent and very infrequent terms
- How is the document frequency measure different from the other 3 measures?

Yang & Pedersen: Experiments
- Two classification methods
  - kNN (k nearest neighbors; more later)
  - Linear Least Squares Fit
    - Regression method
- Collections
  - Reuters-22173
    - 92 categories
    - 16,000 unique terms
  - Ohsumed: subset of MEDLINE
    - 14,000 categories
    - 72,000 unique terms
- ltc term weighting

Yang & Pedersen: Experiments
- Choose feature set size
- Preprocess collection, discarding non-selected features/words
- Apply term weighting -> feature vector for each document
- Train classifier on training set
- Evaluate classifier on test set
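As a rough illustration of this procedure (not the exact Yang & Pedersen setup: the corpus, feature-set size, and weighting here are placeholders), the sketch below uses scikit-learn to select features by chi-square, weight terms with tf-idf, train a kNN classifier, and evaluate it on held-out documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

# Placeholder corpus; in the experiments this would be Reuters or Ohsumed documents.
train_docs, train_labels = ["wheat harvest falls", "interest rates rise"], ["wheat", "interest"]
test_docs, test_labels = ["poor wheat crop reported"], ["wheat"]

pipeline = make_pipeline(
    TfidfVectorizer(),                    # term weighting -> feature vector per document
    SelectKBest(chi2, k=2),               # keep the k highest-scoring terms (chosen feature set size)
    KNeighborsClassifier(n_neighbors=1),  # kNN classifier
)
pipeline.fit(train_docs, train_labels)                    # train on the training set
predicted = pipeline.predict(test_docs)                   # classify the test set
print(f1_score(test_labels, predicted, average="micro"))  # evaluate
```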
Discussion
- You can eliminate 90% of features for IG, DF, and CHI without decreasing performance.
- In fact, performance increases with fewer features for IG, DF, and CHI.
- Mutual information is very sensitive to small counts.
- IG does best with the smallest number of features.
- Document frequency is close to optimal. By far the simplest feature selection method.
- Similar results for LLSF (regression).

Results
- Why is selecting common terms a good strategy?
- IG, DF, and CHI are correlated.

Information Gain vs Mutual Information
- Information gain is similar to MI for random variables
  - Independence?
- In contrast, pointwise MI ignores non-occurrence of terms
  - E.g., for complete dependence you get P(A,B) / (P(A)P(B)) = 1/P(A): larger for rare terms than for frequent terms
- Yang & Pedersen: pointwise MI favors rare terms

Feature Selection: Other Considerations
- Generic vs class-specific
  - Completely generic (class-independent)
  - Separate feature set for each class
  - Mixed (à la Yang & Pedersen)
- Maintainability over time
  - Is aggressive feature selection good or bad for robustness over time?
- Ideal: optimal features selected as part of training

Yang & Pedersen: Limitations
- Don't look at class-specific feature selection
- Don't look at methods that can't handle high-dimensional spaces
- Evaluate category ranking (as opposed to classification accuracy)

Feature Selection: Other Methods
- Stepwise term selection
  - Forward
  - Backward
  - Expensive: need to do n^2 iterations of training
- Term clustering
- Dimension reduction: PCA/SVD

Word Rep. vs. Dimension Reduction
- Word representations: one dimension for each word (binary, count, or weight)
- Dimension reduction: each dimension is a unique linear combination of all words (linear case)
- Dimension reduction is good for generic topics ("politics"), bad for specific classes ("Rwanda"). Why?
- SVD/PCA is computationally expensive
- Higher complexity in implementation
- No clear examples of higher performance through dimension reduction
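For reference, a minimal sketch of SVD-based dimension reduction with scikit-learn; the corpus and the number of dimensions are placeholders, not values from the lecture:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["wheat harvest in turkey", "interest rates and the bundesbank", "wheat prices fall"]

X = TfidfVectorizer().fit_transform(docs)  # one dimension per word
svd = TruncatedSVD(n_components=2)         # each new dimension is a linear combination of all words
X_reduced = svd.fit_transform(X)           # docs now live in a 2-dimensional space
print(X_reduced.shape)                     # (3, 2)
```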
Measuring Classification: Figures of Merit
- Accuracy of classification
  - Main evaluation criterion in academia
  - More in a moment
- Speed of training statistical classifier
- Speed of classification (docs/hour)
  - No big differences for most algorithms
  - Exceptions: kNN, complex preprocessing requirements
- Effort in creating training set (human hours/topic)
  - More on this in Lecture 9 (Active Learning)

Measures of Accuracy
- Error rate
  - Not a good measure for small classes. Why?
- Precision/recall for classification decisions
  - F1 measure: 1/F1 = ½ (1/P + 1/R)
  - Breakeven point
- Correct estimate of size of category
  - Why is this different?
- Precision/recall for ranking classes
- Stability over time / concept drift
- Utility

Precision/Recall for Ranking Classes
- Example: "Bad wheat harvest in Turkey"
- True categories
  - Wheat
  - Turkey
- Ranked category list
  - 0.9: turkey
  - 0.7: poultry
  - 0.5: armenia
  - 0.4: barley
  - 0.3: georgia
- Precision at 5: 0.2, Recall at 5: 0.5
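A small sketch of how precision and recall at rank k are computed, using the example above:

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall at rank k for a ranked list of categories."""
    hits = sum(1 for c in ranked[:k] if c in relevant)  # true categories found in the top k
    return hits / k, hits / len(relevant)

ranked = ["turkey", "poultry", "armenia", "barley", "georgia"]
relevant = {"wheat", "turkey"}
print(precision_recall_at_k(ranked, relevant, 5))  # (0.2, 0.5)
```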
Precision/Recall for Ranking Classes
- Consider problems with many categories (> 10)
- Use a method returning scores comparable across categories (not: Naive Bayes)
- Rank categories and compute average precision/recall (or another measure characterizing the precision/recall curve)
- Good measure for interactive support of human categorization
- Useless for an "autonomous" system (e.g. a filter on a stream of newswire stories)

Concept Drift
- Categories change over time
- Example: "president of the united states"
  - 1999: clinton is a great feature
  - 2002: clinton is a bad feature
- One measure of a text classification system is how well it protects against concept drift.
- Feature selection: good or bad to protect against concept drift?

Micro- vs. Macro-Averaging
- If we have more than one class, how do we combine multiple performance measures into one quantity?
- Macroaveraging: compute performance for each class, then average.
- Microaveraging: collect decisions for all classes, compute contingency table, evaluate.

Micro- vs. Macro-Averaging: Example

  Class 1              Truth: yes   Truth: no
  Classifier: yes          10           10
  Classifier: no           10          970

  Class 2              Truth: yes   Truth: no
  Classifier: yes          90           10
  Classifier: no           10          890

  Micro-averaged table   Truth: yes   Truth: no
  Classifier: yes           100          20
  Classifier: no             20        1860

- Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
- Microaveraged precision: 100/120 = 0.83
- Why this difference?
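A minimal sketch of the two averaging schemes applied to the per-class contingency tables above (counts ordered as true positives, false positives, false negatives, true negatives):

```python
# Per-class contingency counts (TP, FP, FN, TN), matching the two tables above.
class_tables = [(10, 10, 10, 970), (90, 10, 10, 890)]

# Macroaveraging: precision per class, then the mean of those values.
macro = sum(tp / (tp + fp) for tp, fp, _, _ in class_tables) / len(class_tables)

# Microaveraging: pool the counts into one table, then compute precision once.
TP = sum(t[0] for t in class_tables)
FP = sum(t[1] for t in class_tables)
micro = TP / (TP + FP)

print(macro, micro)  # 0.7  0.833...
```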
Reuters 1
- Newswire text
- Statistics (vary according to the version used)
  - Training set: 9,610
  - Test set: 3,662
  - 50% of documents have no category assigned
  - Average document length: 90.6
  - Number of classes: 92
  - Example classes: currency exchange, wheat, gold
  - Max classes assigned: 14
  - Average number of classes assigned
    - 1.24 for docs with at least one category

Reuters 1
- Only about 10 out of the 92 categories are large
- Microaveraging measures performance on large categories.

Factors Affecting Measures
- Variability of data
  - Document size/length
  - Quality/style of authorship
  - Uniformity of vocabulary
- Variability of "truth" / gold standard
  - Need a definitive judgement on which topic(s) a doc belongs to
  - Usually human
  - Ideally: consistent judgements

Accuracy measurement
- Confusion matrix (figure: rows are actual topics, columns are topics assigned by the classifier; a highlighted entry of 53 in cell (i, j) means 53 of the docs actually in topic i were put in topic j by the classifier)

Confusion matrix
- Function of classifier, topics and test docs.
- For a perfect classifier, all off-diagonal entries should be zero.
- For a perfect classifier, if there are n docs in category j then entry (j, j) should be n.
- Straightforward when there is 1 category per document.
- Can be extended to n categories per document.

Confusion measures (1 class/doc)
- Recall: fraction of docs in topic i classified correctly: c_ii / Σ_j c_ij
- Precision: fraction of docs assigned topic i that are actually about topic i: c_ii / Σ_j c_ji
- "Correct rate" (1 - error rate): fraction of docs classified correctly: Σ_i c_ii / Σ_ij c_ij
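A minimal sketch computing these three quantities from a confusion matrix C where C[i, j] counts docs actually in topic i that the classifier put in topic j (the matrix values are placeholders):

```python
import numpy as np

# C[i, j] = number of docs actually in topic i that were assigned to topic j.
C = np.array([[53,  5,  2],
              [ 4, 40,  6],
              [ 1,  3, 30]])

recall       = np.diag(C) / C.sum(axis=1)  # c_ii / sum_j c_ij, per topic
precision    = np.diag(C) / C.sum(axis=0)  # c_ii / sum_j c_ji, per topic
correct_rate = np.trace(C) / C.sum()       # sum_i c_ii / sum_ij c_ij

print(recall, precision, correct_rate)
```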
Integrated Evaluation/Optimization
- Principled approach to training
- Optimize the measure that performance is measured with
- s: vector of classifier decisions, z: vector of true classes
- h(s, z) = cost of making decisions s for true assignments z

Utility/Cost
- One cost function h is based on the contingency table.
- Assume identical cost for all false positives, etc.
- Cost C = λ11·A + λ12·B + λ21·C + λ22·D
- For this cost C, we get the following optimality criterion:

                     Truth: yes           Truth: no
  Classifier: yes    cost λ11, count A    cost λ12, count B
  Classifier: no     cost λ21, count C    cost λ22, count D

Utility/Cost

                     Truth: yes   Truth: no
  Classifier: yes       λ11          λ12
  Classifier: no        λ21          λ22

- Most common cost: 1 for error, 0 for correct. p_i?
- Product cross-sale: high cost for false positive, low cost for false negative.
- Patent search: low cost for false positive, high cost for false negative.
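A minimal sketch (the λ values are hypothetical) of turning such a cost matrix into a decision rule: say "yes" exactly when the expected cost of assigning the document, given the estimated membership probability p, is lower than the expected cost of not assigning it.

```python
def decide(p, cost_tp, cost_fp, cost_fn, cost_tn):
    """Return True ('classifier: yes') if the expected cost of yes is lower than of no.
    p is the estimated probability that the document belongs to the class;
    cost_tp=λ11, cost_fp=λ12, cost_fn=λ21, cost_tn=λ22 as in the table above."""
    expected_yes = p * cost_tp + (1 - p) * cost_fp  # λ11·P(yes) + λ12·P(no)
    expected_no  = p * cost_fn + (1 - p) * cost_tn  # λ21·P(yes) + λ22·P(no)
    return expected_yes < expected_no

# 0/1 cost (1 for error, 0 for correct): decide 'yes' iff p > 0.5.
print(decide(0.6, cost_tp=0, cost_fp=1, cost_fn=1, cost_tn=0))   # True

# Product cross-sale: false positives are expensive, so a much higher p is required.
print(decide(0.6, cost_tp=0, cost_fp=10, cost_fn=1, cost_tn=0))  # False
```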
Are All Optimal Rules of the Form p > θ?
- In the above examples, all you need to do is estimate the probability of class membership.
- Can all problems be solved like this?
- No!
- Probability is often not sufficient
- User decision depends on the distribution of relevance
- Example: information filter for terrorism

Naive Bayes / Vector Space Classification / Nearest Neighbor Classification

Recall Vector Space Representation
- Each doc j is a vector, one component for each term (= word).
- Normalize to unit length.
- Have a vector space
  - Terms are axes
  - n docs live in this space
  - Even with stemming, may have 10,000+ dimensions, or even 1,000,000+

Classification Using Vector Spaces
- Each training doc is a point (vector) labeled by its topic (= class)
- Hypothesis: docs of the same topic form a contiguous region of space
- Define surfaces to delineate topics in space

Topics in a vector space (figure: Government, Science, Arts regions)

Given a test doc
- Figure out which region it lies in
- Assign the corresponding class

Test doc = Government (figure: Government, Science, Arts regions)

Binary Classification
- Consider 2-class problems
- How do we define (and find) the separating surface?
- How do we test which region a test doc is in?

Separation by Hyperplanes
- Assume linear separability for now:
  - In 2 dimensions, can separate by a line
  - In higher dimensions, need hyperplanes
- Can find a separating hyperplane by linear programming (e.g. perceptron):
  - The separator can be expressed as ax + by = c

Linear programming / Perceptron
- Find a, b, c such that
  - ax + by ≥ c for red points
  - ax + by ≤ c for green points
- Relationship to Naive Bayes?
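A minimal perceptron sketch in 2 dimensions, assuming linearly separable data (the sample points are placeholders); it finds a, b, c such that ax + by > c for one class and ax + by < c for the other:

```python
import numpy as np

def perceptron(points, labels, epochs=100, lr=1.0):
    """points: (n, 2) array; labels: +1 ('red') or -1 ('green').
    Returns (a, b, c) with a*x + b*y > c for +1 points and < c for -1 points."""
    a, b, c = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for (x, y), t in zip(points, labels):
            if t * (a * x + b * y - c) <= 0:  # misclassified (or on the boundary)
                a += lr * t * x               # move the separator toward the point
                b += lr * t * y
                c -= lr * t
    return a, b, c

points = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
labels = np.array([1, 1, -1, -1])
a, b, c = perceptron(points, labels)
print(a, b, c)  # a*x + b*y - c now has the sign of the label for each training point
```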
Linear Classifiers
- Many common text classifiers are linear classifiers
- Despite this similarity, large performance differences
- For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?
- What to do for non-separable problems?

Which hyperplane?
- In general, lots of possible solutions for a, b, c.

Support Vector Machine (SVM)
- Maximize the margin; the training points closest to the separator are the support vectors
- Quadratic programming problem
- The decision function is fully specified by a subset of the training samples, the support vectors.
- Text classification method du jour
- Topic of lecture 9

Category: Interest
- Example SVM features (weight w_i, term t_i):

  w_i    t_i              w_i     t_i
  0.70   prime           -0.71   dlrs
  0.67   rate            -0.35   world
  0.63   interest        -0.33   sees
  0.60   rates           -0.25   year
  0.46   discount        -0.24   group
  0.43   bundesbank      -0.24   dlr
  0.43   baker           -0.24   january
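A small sketch of how such a per-term weight list can be produced: train a linear SVM on tf-idf vectors and read off the most positive and most negative coefficients (scikit-learn; the toy corpus is a placeholder, not the Reuters "interest" data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["prime rate raised, discount rate up",
        "bundesbank leaves interest rates unchanged",
        "dlrs fell as the world group sees a weak year",
        "january dlr report from the group"]
labels = [1, 1, 0, 0]  # 1 = interest, 0 = not interest

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
clf = LinearSVC().fit(X, labels)

terms = vec.get_feature_names_out()
order = np.argsort(clf.coef_[0])  # coefficients from most negative to most positive
print([(round(clf.coef_[0][i], 2), terms[i]) for i in order[-3:]])  # strongest positive terms
print([(round(clf.coef_[0][i], 2), terms[i]) for i in order[:3]])   # strongest negative terms
```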
More Than Two Classes
- Any-of or multiclass classification
  - For n classes, decompose into n binary problems
- One-of classification: each document belongs to exactly one class
  - How do we compose separating surfaces into regions?
- Centroid classification
- k nearest neighbor classification

Composing Surfaces: Issues?

Separating Multiple Topics
- Build a separator between each topic and its complementary set (docs from all other topics).
- Given a test doc, evaluate it for membership in each topic.
- Declare membership in topics
  - One-of classification: for the class with maximum score/confidence/probability
  - Multiclass classification: for classes above a threshold

Negative examples
- Formulate as above, except negative examples for a topic are added to its complementary set.
- (figure: positive vs. negative examples)

Centroid Classification
- Given training docs for a topic, compute their centroid
- Now have a centroid for each topic
- Given a query doc, assign it to the topic whose centroid is nearest.
- Exercise: compare to Rocchio
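A minimal centroid-classifier sketch over document vectors (numpy; the vectors and labels are placeholders):

```python
import numpy as np

def train_centroids(X, labels):
    """Compute one centroid per topic from the document vectors X."""
    return {c: X[np.array(labels) == c].mean(axis=0) for c in set(labels)}

def classify(centroids, query):
    """Assign the query doc to the topic whose centroid is nearest (smallest Euclidean distance)."""
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - query))

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
labels = ["government", "government", "arts", "arts"]
centroids = train_centroids(X, labels)
print(classify(centroids, np.array([0.8, 0.2])))  # government
```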
Example (figure: Government, Science, Arts centroids)

k Nearest Neighbor Classification
- To classify document d into class c
  - Define the k-neighborhood N as the k nearest neighbors of d
  - Count the number of documents l in N that belong to c
  - Estimate P(c|d) as l/k
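A minimal sketch of this estimate (numpy, dot-product similarity, which is cosine similarity if the rows are unit length; the data is a placeholder):

```python
import numpy as np

def knn_estimate(X, labels, d, c, k=3):
    """Estimate P(c|d) as l/k, where l is the number of the k nearest neighbors of d in class c."""
    sims = X @ d                          # similarity of d to every training doc
    neighborhood = np.argsort(sims)[-k:]  # indices of the k nearest neighbors
    l = sum(1 for i in neighborhood if labels[i] == c)
    return l / k

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
labels = ["government", "government", "arts", "arts"]
print(knn_estimate(X, labels, d=np.array([0.8, 0.2]), c="government", k=3))  # 0.666...
```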
Cover and Hart 1967
- Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate.
- Assume that the query point coincides with a training point.
- Both the query point and the training point contribute error -> 2 times the Bayes rate

kNN vs. Regression
- kNN has high variance and low bias.
- Linear regression has low variance and high bias.

kNN: Discussion
- Classification time is linear in the size of the training set
- Training set generation
  - An incompletely judged set can be problematic for multiclass problems
- No feature selection necessary
- Scales well with a large number of categories
  - Don't need to train n classifiers for n classes
- Categories can influence each other
  - Small changes to one category can have a ripple effect
- Scores can be hard to convert to probabilities
- No training necessary
  - Actually: not true. Why?

Number of neighbors (figure)

References
- Yiming Yang, Jan O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. Proceedings of ICML-97, 14th International Conference on Machine Learning, 1997.
- David Lewis. Evaluating and Optimizing Autonomous Text Classification Systems. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995.
- Manning and Schuetze. Foundations of Statistical Natural Language Processing. Chapter 16. MIT Press.
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York.

Kappa Measure
- Kappa measures
  - Agreement among coders
  - Designed for categorical judgments
  - Corrects for chance agreement
- Kappa = [P(A) - P(E)] / [1 - P(E)]
  - P(A): proportion of the time the coders agree
  - P(E): what agreement would be by chance
- Kappa = 0 for chance agreement, 1 for total agreement.
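A minimal sketch of the kappa computation for two coders making yes/no judgments (the judgment lists are placeholders):

```python
def kappa(coder1, coder2):
    """Kappa = (P(A) - P(E)) / (1 - P(E)) for two coders' categorical judgments."""
    n = len(coder1)
    p_a = sum(a == b for a, b in zip(coder1, coder2)) / n    # observed agreement
    categories = set(coder1) | set(coder2)
    p_e = sum((coder1.count(c) / n) * (coder2.count(c) / n)  # agreement expected by chance
              for c in categories)
    return (p_a - p_e) / (1 - p_e)

coder1 = ["yes", "yes", "no", "no", "yes", "no"]
coder2 = ["yes", "no", "no", "no", "yes", "no"]
print(kappa(coder1, coder2))  # 0.666...
```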