Question



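Suppose I have two columns of data. The first contains categories such as "First", "Second", "Third", etc. The second contains numbers representing how many times I saw the corresponding category.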


For example:


Category     Frequency
First        10
First        15
First        5
Second       2
Third        14
Third        20
Second       3


I want to sort the data by Category and sum the Frequency values:


Category     Frequency
First        30
Second       5
Third        34


How would I do this in R?

Best answer


Using aggregate:


aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
  Category  x
1    First 30
2   Second  5
3    Third 34





(Incorporating @thelatemail's comment:) aggregate also has a formula interface:


aggregate(Frequency ~ Category, x, sum)


Or, if you want to aggregate multiple columns, you can use the . notation (this also works for a single column):


aggregate(. ~ Category, x, sum)
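As a minimal sketch of the . notation with more than one value column, assume a hypothetical second numeric column, Weight, which is not part of the original question. Every column not named on the right-hand side of the formula is then summed per category:

```r
# Hypothetical data: Weight is an invented second column, for illustration only.
x2 <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Weight    = c(1, 2, 3, 4, 5, 6, 7)
)

# "." stands for all columns that do not appear on the right-hand side.
totals <- aggregate(. ~ Category, data = x2, FUN = sum)
print(totals)
```

Note that the formula interface drops rows containing NA by default (na.action = na.omit), which can matter when several columns are aggregated at once.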





Or tapply:


tapply(x$Frequency, x$Category, FUN=sum)
 First Second  Third 
    30      5     34 





Using this data:


x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                      "Third", "Third", "Second")), 
                    Frequency=c(10,15,5,2,14,20,3))
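Unlike aggregate, tapply returns a named array rather than a data frame. If a data frame is needed, one possible sketch (the variable names sums and result are placeholders, not from the original answer) rebuilds it from the names and values:

```r
x <- data.frame(
  Category  = factor(c("First", "First", "First", "Second",
                       "Third", "Third", "Second")),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# tapply() returns a named 1-d array with one element per factor level.
sums <- tapply(x$Frequency, x$Category, FUN = sum)

# Rebuild a two-column data frame from the names and the values.
result <- data.frame(
  Category  = names(sums),
  Frequency = as.vector(sums)
)
print(result)
```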

Other answer 1


More recently, you can also use the dplyr package for this purpose:


library(dplyr)
x %>% 
  group_by(Category) %>% 
  summarise(Frequency = sum(Frequency))

#Source: local data frame [3 x 2]
#
#  Category Frequency
#1    First        30
#2   Second         5
#3    Third        34


Or, for multiple summary columns (this also works with a single column):


x %>% 
  group_by(Category) %>% 
  summarise_each(funs(sum))


Update for dplyr >= 0.5: summarise_each has been replaced by the summarise_all, summarise_at and summarise_if family of functions in dplyr.
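In dplyr 1.0.0 and later, those scoped variants are in turn superseded by across(). A minimal sketch, assuming a recent dplyr is installed and using the same example data (the name result is a placeholder):

```r
library(dplyr)

x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# across(everything(), sum) applies sum() to every non-grouping column.
result <- x %>%
  group_by(Category) %>%
  summarise(across(everything(), sum))
print(result)
```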


Or, if you have multiple columns to group by, you can specify all of them in group_by, separated by commas:


mtcars %>% 
  group_by(cyl, gear) %>%                            # multiple group columns
  summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns
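For comparison, a rough base R equivalent of the grouped summary above can be sketched with aggregate and merge. Because aggregate applies a single FUN per call, this sketch assumes one call per summary function; the names max_hp_df, mean_mpg_df and result are placeholders:

```r
# mtcars ships with R, so no extra data is needed.
max_hp_df   <- aggregate(hp ~ cyl + gear, data = mtcars, FUN = max)
mean_mpg_df <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = mean)

# Join the two summaries on the grouping columns and rename the value columns.
result <- merge(max_hp_df, mean_mpg_df, by = c("cyl", "gear"))
names(result) <- c("cyl", "gear", "max_hp", "mean_mpg")
print(result)
```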


For more information, including on the %>% operator, see the introduction to dplyr.

Other answer 2


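The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost, there is a faster alternative: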


library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"), 
                  Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
#    Category V1
# 1:    First 30
# 2:   Second  5
# 3:    Third 34
system.time(data[, sum(Frequency), by = Category] )
# user    system   elapsed 
# 0.008     0.001     0.009 


Let's compare that with the same operation on a data.frame, using the aggregate approach from above:


data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
                  Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user    system   elapsed 
# 0.008     0.000     0.015 


And if you want to keep the column name, this is the syntax:


data[,list(Frequency=sum(Frequency)),by=Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34
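Inside a data.table call, .() is an alias for list(), so the same named aggregation is often written more compactly. A small sketch, assuming the data.table package is installed (dt and result are placeholder names):

```r
library(data.table)

dt <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# .() is shorthand for list(); naming the element keeps the column name.
result <- dt[, .(Frequency = sum(Frequency)), by = Category]
print(result)
```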


For larger datasets the difference becomes more pronounced, as the code below shows:


data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
                  Frequency=rnorm(300000))  # one value per Category entry (3 * 100000)
system.time( data[,sum(Frequency),by=Category] )
# user    system   elapsed 
# 0.055     0.004     0.059 
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000), 
                  Frequency=rnorm(300000))  # one value per Category entry (3 * 100000)
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user    system   elapsed 
# 0.287     0.010     0.296 





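For multiple aggregations, you can combine lapply and .SD as follows (here data is again the small data.table from the first example, since the benchmark above reassigned it):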


data[, lapply(.SD, sum), by = Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34
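When only some columns should be aggregated, .SDcols restricts what .SD contains. The sketch below assumes the data.table package is installed and uses a hypothetical extra column, Weight, plus a Note column that should be left out; neither is in the original data:

```r
library(data.table)

# Hypothetical data: Weight and Note are invented columns for illustration.
dt <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Weight    = c(1, 2, 3, 4, 5, 6, 7),
  Note      = letters[1:7]
)

# .SDcols limits .SD to the numeric columns that should be summed.
result <- dt[, lapply(.SD, sum), by = Category, .SDcols = c("Frequency", "Weight")]
print(result)
```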

Other answer 3


This is somewhat related to this question.


You can also just use the by() function:


x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind,as.list(x2))
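The rbind call above produces a one-column matrix. If a data frame is preferred, one possible sketch (result is a placeholder name) builds it directly from the names and values of the by object:

```r
x <- data.frame(
  Category  = factor(c("First", "First", "First", "Second",
                       "Third", "Third", "Second")),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

x2 <- by(x$Frequency, x$Category, sum)

# names() gives the group labels; as.numeric() drops the "by" attributes.
result <- data.frame(
  Category  = names(x2),
  Frequency = as.numeric(x2)
)
print(result)
```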


Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it is worth being familiar with by() since it is a base function.

Other answer 4


library(plyr)
# tbl is assumed to be the data frame holding Category and Frequency
ddply(tbl, .(Category), summarise, sum = sum(Frequency))

Other answer 5


Several years later, just to add another simple base R solution that for some reason is not mentioned here: xtabs


xtabs(Frequency ~ Category, df)
# Category
# First Second  Third 
#    30      5     34 


Or, if you want a data.frame back:


as.data.frame(xtabs(Frequency ~ Category, df))
#   Category Freq
# 1    First   30
# 2   Second    5
# 3    Third   34
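As the output shows, the value column comes back named Freq. A small sketch, using x in place of df as the data frame name, passes responseName to as.data.frame to keep the original column name:

```r
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# responseName sets the name of the value column (the default is "Freq").
result <- as.data.frame(xtabs(Frequency ~ Category, data = x),
                        responseName = "Frequency")
print(result)
```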

Other answer 6


Just to add a third option:


require(doBy)
summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)


Edit: this is a very old answer. I would now recommend using group_by and summarise from dplyr, as in @docendo's answer.

Other answer 7


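While I have recently become a convert to dplyr for most of these types of operations, the sqldf package is still really nice (and, IMHO, more readable) for some things.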


Here is an example of how this question can be answered with sqldf:


library(sqldf)

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                  "Third", "Third", "Second")), 
                Frequency=c(10,15,5,2,14,20,3))

sqldf("select 
          Category
          ,sum(Frequency) as Frequency 
       from x 
       group by 
          Category")

##   Category Frequency
## 1    First        30
## 2   Second         5
## 3    Third        34

Other answer 8


If x is a data frame containing your data, then the following will do what you want:


require(reshape)
recast(x, Category ~ ., fun.aggregate=sum)

Other answer 9


Using cast instead of recast (note that 'Frequency' is now 'value'):


df  <- data.frame(Category = c("First","First","First","Second","Third","Third","Second")
                  , value = c(10,15,5,2,14,20,3))

install.packages("reshape")  # only needed once
library(reshape)             # cast() comes from the reshape package

result<-cast(df, Category ~ . ,fun.aggregate=sum)


This gives:


Category (all)
First     30
Second    5
Third     34