

People remain confused when it comes to summarizing data quickly in R. There are several ways to do it, but which one is the best? I've answered this question below.

People who transition from SAS or SQL are used to writing simple queries in these languages to summarize data sets. For such an audience, the biggest concern is how to do the same thing in R. In this article I will cover the primary ways to summarize data sets. Hopefully this will make your journey much easier than it looks.

Generally, summarizing data means finding statistical figures such as the mean or median, often alongside plots such as a box plot. If you understand data better through scatter plots and histograms, you can refer to the guide on data visualization in R.

The "apply" function returns a vector, array or list of values obtained by applying a function to either the rows or the columns of a matrix. This is the simplest of all the functions that can do this job. However, it is very specific to collapsing either rows or columns.

"lapply" returns "a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X."

    l <- list(a = 1:10, b = 11:20)
    # mean of each list element, returned as a list
    lapply(l, mean)

"sapply" does the same thing as lapply but returns a vector or matrix where possible.

Until now, none of the functions we discussed can do what SQL can achieve. "tapply" is the function that completes the palette for R. Usage is "tapply(X, INDEX, FUN = NULL, …, simplify = TRUE)", where X is "an atomic object, typically a vector" and INDEX is "a list of factors, each of same length as X". Here is an example which will make the usage clear.

    # mean petal length by species
    tapply(iris$Petal.Length, iris$Species, mean)

Now comes a slightly more complicated function. The function "by" is an object-oriented wrapper for "tapply" applied to data frames. Hopefully the example will make it clearer.

    attach(iris)
    # column means of the four numeric columns, computed per species
    # (colMeans fails on the factor column Species, so it is left out)
    by(iris[, 1:4], Species, colMeans)

The printed result starts with a block labelled "Species: setosa", followed by one block for each remaining species. What did the function do? It simply splits the data by a class variable, which in this case is the species, and then creates a summary at this level. In other words, it applies the function to each of the split data frames.

If you found any of the above statements difficult, don't panic. I bring you a lifeline which you can use anytime: the sqldf package, which lets you write the summary as an SQL query.

    library(sqldf)
    # column names containing a dot are quoted; avg() is the SQL mean
    Summarization <- sqldf('select Species, avg("Petal.Length") as "Petal.Length_mean" from iris where Species is not null group by Species')

"ddply" is the fastest of all the methods we discussed. Let's do exactly what we did in the tapply section.

    library(plyr)
    # mean petal length by species
    ddply(iris, "Species", summarise, Petal.Length_mean = mean(Petal.Length))

Additional Notes: You can also use packages such as dplyr and data.table to summarize data. Here's a complete tutorial on useful packages for data manipulation in R: Faster Data Manipulation with these 7 R Packages.

In general, if you are trying to add this summarization step in the middle of a process and need a table as output, you need to go for sqldf or ddply. "ddply" is faster in these cases but will not give you options beyond grouping. "sqldf" has all the features you need to summarize the data in SQL statements. In case you are interested in functions similar to pivot tables or in transposing tables, you can consider using "reshape". We have covered a few examples of the same in our article: comprehensive guide for data exploration in R.

Challenge: Here is a simple problem you can attempt to solve using all the methods we have discussed. You have a table of marks for all school kids in a particular city. Write code to find the mean marks of each school for both class 1 and class 2, for students with a roll number less than 6. Then print only the class whose mean score comes out higher for the school. For instance, if school A has a mean score of 6 for class 1 and 4 for class 2, you will reject class 2 and take only the class 1 mean score for the school. In case of a tie, you can make a random choice.
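One possible way to approach the challenge is sketched below in base R with tapply. The marks table from the original problem is not reproduced in this article, so the data frame here is made up purely for illustration, and the column names (school, class, roll_no, marks) are assumptions. Treat it as a starting point rather than the intended solution.

```r
# Hypothetical marks table: names and values are invented for illustration.
marks <- data.frame(
  school  = c("A", "A", "A", "A", "B", "B", "B", "B", "A", "B"),
  class   = c(1, 1, 2, 2, 1, 1, 2, 2, 1, 2),
  roll_no = c(1, 2, 3, 4, 1, 2, 3, 5, 7, 9),
  marks   = c(5, 7, 3, 5, 2, 4, 8, 6, 10, 1)
)

# Keep only students with a roll number below 6
eligible <- marks[marks$roll_no < 6, ]

# Mean marks per school (rows) and class (columns)
means <- tapply(eligible$marks, list(eligible$school, eligible$class), mean)

# For each school, pick the column (class) with the higher mean.
# which.max() returns the first maximum, so a tie goes to the lower
# class number instead of a random pick.
best <- apply(means, 1, which.max)

result <- data.frame(
  school     = rownames(means),
  class      = as.integer(colnames(means)[best]),
  mean_marks = means[cbind(seq_len(nrow(means)), best)]
)
print(result)
```

With this toy data, school A has a mean of 6 for class 1 and 4 for class 2, so only class 1 is kept for it, mirroring the example in the challenge text. The same result could also be reached with sqldf or ddply; tapply is used here only because it needs no extra packages.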

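The additional notes mention dplyr as another option for summarizing data. As a minimal sketch, assuming the dplyr package is installed (it is not used in any of the article's own examples), the mean petal length by species could be computed like this:

```r
# Assumption: dplyr is installed; it is not part of the article's own code.
library(dplyr)

# Group iris by species and compute the mean petal length per group
species_means <- iris %>%
  group_by(Species) %>%
  summarise(Petal.Length_mean = mean(Petal.Length))

print(species_means)
```

The result is a table with one row per species, which makes it comparable to the output of the sqldf and ddply examples.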