
4 Dataframes

Learning how to handle your data, how to enter them into the computer, and how to read them into R are among the most important topics you will need to master. R handles data in objects known as dataframes. A dataframe is an object with rows and columns (a bit like a matrix). The rows contain different observations from your study, or measurements from your experiment (these are sometimes called cases). The columns contain the values of different variables (these are often called fields). The values in the body of a matrix must all be of one type (typically numbers), whereas the columns of a dataframe can differ in type: they can be numbers, but they could also be text (e.g. the names of factor levels for categorical variables, like male or female in a variable called gender), calendar dates (e.g. 23/5/04), or logical variables (TRUE or FALSE). Here is a spreadsheet in the form of a dataframe with seven variables, the leftmost of which comprises the row names; the other variables are numeric (Area, Slope, Soil pH and Worm Density), categorical (Field Name and Vegetation) or logical (Damp is either true = T or false = F).

Field Name         Area  Slope  Vegetation  Soil pH  Damp  Worm Density
Nash’s Field        3.6     11  Grassland       4.1  F                4
Silwood Bottom      5.1      2  Arable          5.2  F                7
Nursery Field       2.8      3  Grassland       4.3  F                2
Rush Meadow         2.4      5  Meadow          4.9  T                5
Gunness’ Thicket    3.8      0  Scrub           4.2  F                6
Oak Mead            3.1      2  Grassland       3.9  F                2
Church Field        3.5      3  Grassland       4.2  F                3
Ashurst             2.1      0  Arable          4.8  F                4
The Orchard         1.9      0  Orchard         5.7  F                9
Rookery Slope       1.5      4  Grassland       5    T                7
Garden Wood         2.9     10  Scrub           5.2  F                8
North Gravel        3.3      1  Grassland       4.1  F                1
South Gravel        3.7      2  Grassland       4    F                2
Observatory Ridge   1.8      6  Grassland       3.8  F                0
Pond Field          4.1      0  Meadow          5    T                6
Water Meadow        3.9      0  Meadow          4.9  T                8
Cheapside           2.2      8  Scrub           4.7  T                4
(Continued)

The R Book, Second Edition. Michael J. Crawley. © 2013 John Wiley [...]

[...] see p. 26). The logic for the selection of rows can refer to values (and functions of values) in more than one column.
Suppose that we wanted the data from the fields where worm density was higher than the median (>median(Worm.density)) and soil pH was less than 5.2. In R, the logical operator for AND is the & symbol. [...]

- aggregate  create a table after the fashion of tapply;
- by         perform functions for each level of specified factors.

Use of summary and by with the worms database was described on p. 163. The aggregate function is used like tapply to apply a function (mean in this case) to the levels of a specified categorical variable (Vegetation in this case) for a specified range of variables (Area, Slope, Soil.pH and Worm.density) which are specified using their subscripts as a column index, worms[,c(2,3,5,7)]:

aggregate(worms[,c(2,3,5,7)],by=list(veg=Vegetation),mean)

        veg     Area    Slope  Soil.pH Worm.density
1    Arable 3.866667 1.333333 4.833333     5.333333
2 Grassland 2.911111 3.666667 4.100000     2.444444
3    Meadow 3.466667 1.666667 4.933333     6.333333
4   Orchard 1.900000 0.000000 5.700000     9.000000
5     Scrub 2.425000 7.000000 4.800000     5.250000

The by argument needs to be a list even if, as here, we have only one classifying factor. Here are the aggregated summaries cross-classified by Vegetation and Damp:

aggregate(worms[,c(2,3,5,7)],by=list(veg=Vegetation,d=Damp),mean)

        veg     d     Area    Slope  Soil.pH Worm.density
1    Arable FALSE 3.866667 1.333333 4.833333     5.333333
2 Grassland FALSE 3.087500 3.625000 3.987500     1.875000
3   Orchard FALSE 1.900000 0.000000 5.700000     9.000000
4     Scrub FALSE 3.350000 5.000000 4.700000     7.000000
5 Grassland  TRUE 1.500000 4.000000 5.000000     7.000000
6    Meadow  TRUE 3.466667 1.666667 4.933333     6.333333
7     Scrub  TRUE 1.500000 9.000000 4.900000     3.500000

Note that this summary is unbalanced because there were no damp arable or orchard sites and no dry meadows.
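The row selection and the aggregate calls discussed in this section can be tried out with a minimal sketch like the one below. It rests on several assumptions: only the 17 rows visible in the table are entered (the printed table is marked as continued, so this is not the full worms data, and group means for some vegetation types may therefore differ from the printed output); the field names are stored as row names, so the numeric columns are picked by name rather than by the subscripts c(2,3,5,7); and columns are referenced as worms$... instead of relying on an attached dataframe. The object names high_worm_low_ph, by_veg and by_veg_damp are illustrative only.

```r
# Partial reconstruction: only the 17 rows visible in the printed table.
worms <- data.frame(
  Area = c(3.6, 5.1, 2.8, 2.4, 3.8, 3.1, 3.5, 2.1, 1.9,
           1.5, 2.9, 3.3, 3.7, 1.8, 4.1, 3.9, 2.2),
  Slope = c(11, 2, 3, 5, 0, 2, 3, 0, 0, 4, 10, 1, 2, 6, 0, 0, 8),
  Vegetation = factor(c("Grassland", "Arable", "Grassland", "Meadow",
                        "Scrub", "Grassland", "Grassland", "Arable",
                        "Orchard", "Grassland", "Scrub", "Grassland",
                        "Grassland", "Grassland", "Meadow", "Meadow",
                        "Scrub")),
  Soil.pH = c(4.1, 5.2, 4.3, 4.9, 4.2, 3.9, 4.2, 4.8, 5.7,
              5.0, 5.2, 4.1, 4.0, 3.8, 5.0, 4.9, 4.7),
  Damp = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,
           TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE),
  Worm.density = c(4, 7, 2, 5, 6, 2, 3, 4, 9, 7, 8, 1, 2, 0, 6, 8, 4),
  row.names = c("Nash's Field", "Silwood Bottom", "Nursery Field",
                "Rush Meadow", "Gunness' Thicket", "Oak Mead",
                "Church Field", "Ashurst", "The Orchard", "Rookery Slope",
                "Garden Wood", "North Gravel", "South Gravel",
                "Observatory Ridge", "Pond Field", "Water Meadow",
                "Cheapside")
)

# Row selection on two columns at once: & combines the two logical
# vectors element by element, and the empty slot after the comma
# keeps all columns.
high_worm_low_ph <- worms[worms$Worm.density > median(worms$Worm.density) &
                            worms$Soil.pH < 5.2, ]

# Numeric columns chosen by name (the positions differ from c(2,3,5,7)
# because the field names are row names here, not a column).
numeric_cols <- c("Area", "Slope", "Soil.pH", "Worm.density")

# Mean of each numeric column per vegetation type; 'by' must be a list
# even with a single classifying factor.
by_veg <- aggregate(worms[, numeric_cols],
                    by = list(veg = worms$Vegetation), FUN = mean)

# Cross-classified by vegetation and dampness; only combinations that
# actually occur in the data produce a row.
by_veg_damp <- aggregate(worms[, numeric_cols],
                         by = list(veg = worms$Vegetation, d = worms$Damp),
                         FUN = mean)

print(high_worm_low_ph)
print(by_veg)
print(by_veg_damp)
```

Selecting columns by name is a deliberate choice in this sketch: it keeps the aggregate call valid regardless of whether the field names are held as a column or as row names.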
