
主成分分析4,双语.ppt
Chapter 5  Principal Components Analysis (PCA)

Presentation Outline
- What is PCA?
- Geometrical approach to PCA
- Analytical approach to PCA
- Properties of PCA
- How to determine the number of PCs?
- How to interpret the PCs?
- Use of PC scores

5.1 Reasons for using principal components analysis

Too many variables. Stone analysed US data for 1929-1938 consisting of 17 variables describing income and expenditure. Using principal components analysis he obtained three new variables: F1, total income; F2, the rate of change of total income; and F3, the trend of economic growth or decline. These three components correspond closely to three directly measurable variables: total income (I), its rate of change, and time (t).

Solutions
- Eliminate some redundant variables. This may lose important information that was uniquely reflected in the eliminated variables.
- Create composite scores from the variables (sum or average). Variability among the variables is lost, and multiple scale scores may still be collinear.
- Create weighted linear combinations of the variables while retaining most of the variability in the data. This gives fewer variables, little or no lost variation, and no collinear scales.

An easy choice: to retain most of the information in the data while reducing the number of variables you must deal with, try principal components analysis. Most of the variability in the original data can be retained, but the components may not be directly interpretable.

What is PCA?
- PCA is a technique for forming new variables that are linear composites of the original variables.
- The new variables are called principal components (PRIN's).
- The maximum number of PRIN's that can be formed equals the number of original variables.
- Usually the first few PRIN's represent most of the information in the original variables, so they can replace the original variables and thereby achieve data reduction, which is the main objective of PCA.
- The PRIN's are uncorrelated among themselves and can therefore be used in regression.

In other words, PCA is a dimension-reduction method that creates variables called principal components, producing as many components as there are input variables. Principal components
- are weighted linear combinations of the input variables,
- are orthogonal to and independent of the other components, and
- are generated so that the first component accounts for the most variation in the X's, followed by the second component, and so on.

Geometrical approach: translating and rotating the coordinate axes.
[Figures: scatter plots illustrating how the coordinate axes are translated and rotated so that the new axes align with the directions of greatest scatter in the data.]

Uses of PCA:
1. Data screening: locating possible outliers in the data.
2. Variable reduction: the new variables can be used in clustering, discriminant analysis, or regression.
3. Detecting whether multicollinearity occurs among the predictor variables.

5.2 Objectives of PCA

1. Reduce the dimensionality of the data set with as little loss of information as possible. The smaller set of variables can then be used in ensuing analyses.
2. Identify new, meaningful underlying variables. The new variables are useful for a variety of purposes, including data screening, assumption checking, and cluster verification.

5.3 PCA on the variance-covariance matrix

Analytical approach to PCA. Assume that there are p variables X_1, ..., X_p. We are interested in forming the following p principal components:

PRIN_i = w_{i1} X_1 + w_{i2} X_2 + ... + w_{ip} X_p,   i = 1, 2, ..., p,

where w_{ij} is the weight of the jth variable in the ith principal component.

The weights w_{ij} are estimated such that:
1. The first principal component, PRIN1, accounts for the maximum variance in the data; the second principal component, PRIN2, accounts for the maximum variance not already accounted for by the first; and so on.
2. The squared weights of each component sum to one: w_{i1}^2 + w_{i2}^2 + ... + w_{ip}^2 = 1.
3. The weight vectors of different components are orthogonal: w_{i1} w_{k1} + ... + w_{ip} w_{kp} = 0 for i ≠ k.
4. As a result, the principal components are uncorrelated: Cov(PRIN_i, PRIN_k) = 0 for i ≠ k.

Properties of principal components (illustrated numerically in the sketch below):
1. Var(PRIN_i) = λ_i, the ith largest eigenvalue of the covariance (or correlation) matrix.
2. The total variance of the components equals the total variance of the original variables, λ_1 + ... + λ_p = Var(X_1) + ... + Var(X_p), which is equal to p if the X's have been standardized.
3. The proportion of the total variance explained by the first k PRIN's is (λ_1 + ... + λ_k) / (λ_1 + ... + λ_p). If this proportion is close to one, the first k PRIN's can replace the original p variables without much loss of information.
4. Corr(PRIN_i, X_j) = w_{ij} √λ_i when the X's are standardized. It is called a loading and plays a big role in interpreting the meaning of PRIN_i. If the data are only mean-corrected, replace λ_i in this formula by λ_i / Var(X_j).
5. The R-square of (PRIN_i, X_j) is the squared loading, w_{ij}^2 λ_i. It can be interpreted in two ways: as the proportion of the variance of X_j explained by PRIN_i, or as the contribution (importance) of X_j to PRIN_i. If the data are only mean-corrected, replace λ_i in this formula by λ_i / Var(X_j).
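The analytical approach and the properties above can be verified numerically. The following is a minimal sketch (not taken from the slides) in Python/NumPy; the synthetic data set and all variable names are illustrative assumptions.

```python
# Minimal sketch (not from the slides): PCA on the variance-covariance matrix
# via an eigendecomposition of S, checking the properties listed above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))  # toy data: n = 100, p = 4

Xc = X - X.mean(axis=0)               # mean-correct the data
S = np.cov(Xc, rowvar=False)          # p x p variance-covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]                  # lam[i-1] = Var(PRIN_i)
W = eigvecs[:, order]                 # column i-1 holds the weights w_{i1}, ..., w_{ip}

PRIN = Xc @ W                         # principal-component scores

# Property 1: Var(PRIN_i) equals the ith largest eigenvalue.
print(np.allclose(PRIN.var(axis=0, ddof=1), lam))
# Property 2: total variance is preserved (sum of eigenvalues = trace of S).
print(np.isclose(lam.sum(), np.trace(S)))
# Uncorrelatedness: the covariance matrix of the scores is diagonal.
print(np.allclose(np.cov(PRIN, rowvar=False), np.diag(lam)))
# Property 3: proportion of total variance explained by the first k PRIN's.
k = 2
print(lam[:k].sum() / lam.sum())
# Property 4 (mean-corrected data): loading = Corr(PRIN_i, X_j) = w_ij * sqrt(lam_i / Var(X_j)).
loadings = W * np.sqrt(lam) / np.sqrt(np.diag(S))[:, None]
print(np.allclose(loadings, np.corrcoef(np.hstack([PRIN, Xc]), rowvar=False)[4:, :4]))
```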
5.4 PCA on the correlation matrix

When the variables are measured on very different scales or have very different variances, PCA is usually performed on the correlation matrix, i.e. on the standardized variables; the eigenvalues then sum to p.

5.5 Determining the number of principal components

- Percentage-of-variance criterion: use enough PRIN's to explain 75-80% of the total variance (this and the latent-root criterion are applied in the sketch following this list).
- Latent-root criterion: a threshold value is specified for the eigenvalues of the derived PRIN's. If the variables are standardized, only PRIN's with an eigenvalue greater than 1 are considered significant and extracted.
- Scree-test criterion: a scree plot is obtained by plotting the eigenvalue of each PRIN against its order of extraction. The point where the line becomes horizontal (the elbow) indicates the appropriate number of PRIN's.
- Horn's "parallel procedure", a formalized version of the scree plot (sketched at the end of this section).
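The percentage-of-variance and latent-root criteria can be applied directly to the eigenvalues. Below is a minimal sketch (illustrative, not from the slides) in Python/NumPy; the 80% target and the toy data are assumptions.

```python
# Minimal sketch (not from the slides): percentage-of-variance and
# latent-root (eigenvalue > 1) criteria applied to PCA on the correlation matrix.
import numpy as np

def suggested_n_components(X, var_target=0.80):
    """Eigenvalues of the correlation matrix and the number of PCs suggested by each rule."""
    R = np.corrcoef(X, rowvar=False)              # correlation matrix (standardized variables)
    lam = np.sort(np.linalg.eigvalsh(R))[::-1]    # eigenvalues, largest first
    cum_prop = np.cumsum(lam) / lam.sum()         # cumulative proportion of total variance
    k_variance = int(np.searchsorted(cum_prop, var_target) + 1)  # smallest k reaching the target
    k_latent_root = int(np.sum(lam > 1.0))        # latent-root criterion
    return lam, k_variance, k_latent_root

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))  # toy data: n = 200, p = 6
lam, k_var, k_root = suggested_n_components(X)
print("eigenvalues:", np.round(lam, 2))
print("PCs for >= 80% of variance:", k_var, "| PCs with eigenvalue > 1:", k_root)
# For the scree test, plot lam against 1..p and look for the elbow.
```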

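Horn's parallel procedure compares the observed eigenvalues with those of random, uncorrelated data of the same dimensions; a component is retained only while its eigenvalue exceeds the random benchmark. The sketch below is illustrative and not from the slides; using the 95th percentile of the simulated eigenvalues as the benchmark is one common choice, assumed here.

```python
# Minimal sketch (not from the slides): Horn's parallel procedure.
import numpy as np

def parallel_analysis(X, n_sim=200, quantile=0.95, seed=0):
    """Number of PCs whose eigenvalues exceed those of simulated uncorrelated data."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    sim = np.empty((n_sim, p))
    for s in range(n_sim):
        Z = rng.normal(size=(n, p))   # uncorrelated random data of the same size
        sim[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
    benchmark = np.quantile(sim, quantile, axis=0)       # per-component benchmark
    keep = obs > benchmark
    return p if keep.all() else int(np.argmin(keep))     # count of leading components kept

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 5)) @ rng.normal(size=(5, 5))  # toy data: n = 150, p = 5
print("components retained by parallel analysis:", parallel_analysis(X))
```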