
Artificial Intelligence and Data Mining, Lecture 3 (l3.ppt)
Chapter 3 Basic Data Mining Techniques
3.1 Decision Trees (for classification)

Introduction: Classification, a Two-Step Process
• 1. Model construction: build a model that can describe a set of predetermined classes
– Preparation: each tuple/sample is assumed to belong to a predefined class, given by the output attribute (class label attribute)
– The set of examples used for model construction is the training set
– The model can be represented as classification rules, decision trees, or mathematical formulae
– Estimate the accuracy of the model:
• The known label of each test sample is compared with the classification result produced by the model
• The accuracy rate is the percentage of test-set samples correctly classified by the model
• Note: the test set must be independent of the training set, otherwise over-fitting will occur
• 2. Model usage: use the model to classify future or unknown objects

Classification Process (1): Model Construction
• Training data are fed to a classification algorithm, which outputs a classifier (the model), for example the rule:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction
• The classifier is checked on testing data and then applied to unseen data, e.g. (Jeff, Professor, 4) -> Tenured?

1 Example (1): Training Dataset
• An example from Quinlan's ID3 (1986); the training table carries the class label buys_computer

1 Example (2): Output: A Decision Tree for "buys_computer"
• Root test: age?
– age <= 30: test student? (no -> buys_computer = no; yes -> buys_computer = yes)
– age 31..40: buys_computer = yes
– age > 40: test credit_rating? (excellent -> buys_computer = yes; fair -> buys_computer = no)

2 Algorithm for Decision Tree Building
• Basic algorithm (a greedy algorithm)
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root
– Attributes are categorical (continuous-valued attributes are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples at a given node belong to the same class
– There are no remaining attributes for further partitioning (majority voting is used to label the leaf)
– There are no samples left
– The pre-set accuracy has been reached

Information Gain (信息增益) (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n elements of class N
– The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as
I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Information Gain in Decision Tree Building
• Assume that using attribute A, the set S is partitioned into subsets {S1, S2, ..., Sv}
– If Si contains pi examples of P and ni examples of N, the entropy (熵), i.e. the expected information needed to classify objects across all subsets Si, is
E(A) = sum over i = 1..v of ((pi + ni)/(p + n)) * I(pi, ni)
• The encoding information that would be gained by branching on A is
Gain(A) = I(p, n) - E(A)

Attribute Selection by Information Gain Computation
• Class P: buys_computer = "yes"
• Class N: buys_computer = "no"
• Compute I(p, n) for the whole training set, then the entropy E(age) over the age partitions, and hence Gain(age); similarly compute the gain for the remaining attributes and split on the attribute with the highest gain, which is age (the root of the tree above)
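A minimal runnable sketch in Python of the information-gain computation on the preceding slide. The class counts per age group are assumed from the standard 14-example buys_computer illustration (9 "yes", 5 "no" overall) and should be replaced by the counts in the actual training table if they differ:

import math

def info(p, n):
    # I(p, n): bits needed to decide the class of an example drawn from a
    # set containing p examples of class P and n examples of class N.
    total = p + n
    bits = 0.0
    for count in (p, n):
        if count:
            frac = count / total
            bits -= frac * math.log2(frac)
    return bits

# Assumed (yes, no) counts per value of age; illustrative values only.
age_counts = {"<=30": (2, 3), "31..40": (4, 0), ">40": (3, 2)}
p, n = 9, 5

e_age = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in age_counts.values())
gain_age = info(p, n) - e_age

print(f"I(9, 5)   = {info(p, n):.3f}")   # about 0.940
print(f"E(age)    = {e_age:.3f}")        # about 0.694
print(f"Gain(age) = {gain_age:.3f}")     # about 0.246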
3. Decision Tree Rules
• Automate rule creation
• Rule simplification and elimination
• A default rule is chosen

3.1 Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Rules are easier for humans to understand
• Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31..40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"

3.2 Rule Simplification and Elimination
• Original rule (Figure 3.4):
IF Age <= 43 & Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No (accuracy = 75%)
• A simplified rule obtained by removing the attribute Age (Figure 3.5):
IF Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No (accuracy = 83.3%, 5/6)
• Figure 3.5: a two-node decision tree for the credit card database

4. Further Discussion
• Attributes with many values tend to yield many splits and deceptively high accuracy; the gain ratio compensates for this: GainRatio(A) = Gain(A) / SplitInfo(A)
• Numerical attributes: handled with a binary split
• Stopping conditions
• More than two values
• Other methods for building decision trees: ID3, C4.5, CART, CHAID

5. General Considerations: Advantages of Decision Trees
• Easy to understand.
• Map nicely to a set of production rules.
• Have been applied to real problems.
• Make no prior assumptions about the data.
• Able to process both numerical and categorical data.

Disadvantages of Decision Trees
• The output attribute must be categorical.
• Limited to one output attribute.
• Decision tree algorithms are unstable.
• Trees created from numeric datasets can be complex.

Decision Tree Attribute Selection (Appendix C)
• Computing Info(I): Info(I) = -sum over classes c of (|Ic|/|I|) * log2(|Ic|/|I|), the entropy of the class distribution in the instance set I
• Computing Info(I, A): Info(I, A) = sum over j = 1..v of (|Ij|/|I|) * Info(Ij), where attribute A partitions I into subsets I1, ..., Iv
• Computing Gain(A): Gain(A) = Info(I) - Info(I, A)
• Computing Split Info(A): SplitInfo(A) = -sum over j = 1..v of (|Ij|/|I|) * log2(|Ij|/|I|)
• Computing Gain Ratio: GainRatio(A) = Gain(A) / SplitInfo(A)
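A minimal numeric sketch of the Appendix C quantities in Python, reusing the assumed buys_computer counts from the earlier sketch (illustrative values, not taken from this deck's training table):

import math

def entropy(counts):
    # -sum p * log2(p) over the non-zero proportions in counts.
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain_ratio(class_counts, branches):
    # class_counts: per-class counts for the whole set I, e.g. [9, 5].
    # branches: per-branch class counts after splitting on A, e.g. [(2, 3), (4, 0), (3, 2)].
    total = sum(class_counts)
    info_i = entropy(class_counts)                                 # Info(I)
    info_i_a = sum(sum(b) / total * entropy(b) for b in branches)  # Info(I, A)
    gain = info_i - info_i_a                                       # Gain(A)
    split_info = entropy([sum(b) for b in branches])               # SplitInfo(A)
    return gain / split_info if split_info else 0.0                # GainRatio(A)

# Assumed buys_computer counts for the age split (illustrative only).
print(round(gain_ratio([9, 5], [(2, 3), (4, 0), (3, 2)]), 3))      # about 0.156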

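For completeness, a minimal runnable sketch in Python of the greedy top-down procedure outlined in Section 2, using information gain to select the test attribute. All names and the tiny data set are illustrative, and attributes are assumed categorical, as on the slides:

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Gain(attr) = entropy before the split - expected entropy after splitting on attr.
    branches = {}
    for row, y in zip(rows, labels):
        branches.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys) for ys in branches.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes):
    # Stopping conditions from Section 2: pure node, or no attributes left
    # (in which case the leaf is labeled by majority vote).
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    node = {"attribute": best, "branches": {}}
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[best], ([], []))
        groups[row[best]][0].append(row)
        groups[row[best]][1].append(y)
    remaining = [a for a in attributes if a != best]
    for value, (sub_rows, sub_labels) in groups.items():
        node["branches"][value] = build_tree(sub_rows, sub_labels, remaining)
    return node

# Tiny illustrative call (rows are dicts of categorical attribute values):
rows = [{"age": "<=30", "student": "no"}, {"age": "<=30", "student": "yes"},
        {"age": "31..40", "student": "no"}]
labels = ["no", "yes", "yes"]
print(build_tree(rows, labels, ["age", "student"]))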

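A small companion sketch for Section 3.1: turning each root-to-leaf path of the nested-dict tree from the previous sketch into an IF-THEN rule (that tree representation is the illustrative one used there, not part of the original slides):

def extract_rules(tree, conditions=()):
    # One IF-THEN rule per root-to-leaf path, as described in Section 3.1.
    if not isinstance(tree, dict):                        # leaf: a class label
        body = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f'IF {body} THEN class = "{tree}"' if body else f'class = "{tree}"']
    rules = []
    for value, subtree in tree["branches"].items():
        rules.extend(extract_rules(subtree, conditions + ((tree["attribute"], value),)))
    return rules

# Applied to the tree built in the previous sketch:
# for rule in extract_rules(build_tree(rows, labels, ["age", "student"])):
#     print(rule)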









