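Question
Let's say I have two columns of data. The first contains categories such as "First", "Second", "Third", etc. The second contains numbers representing how many times I saw the corresponding category.
For example: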
Category Frequency
First 10
First 15
First 5
Second 2
Third 14
Third 20
Second 3
I want to sort the data by Category and sum the Frequency values within each category:
Category Frequency
First 30
Second 5
Third 34
How would I do this in R?
Best answer
Using aggregate:
aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
Category x
1 First 30
2 Second 5
3 Third 34
(Embedding @thelatemail's comment) aggregate also has a formula interface:
aggregate(Frequency ~ Category, x, sum)
Or, if you want to aggregate multiple columns, you can use the . notation (it works for a single column too):
aggregate(. ~ Category, x, sum)
Or tapply:
tapply(x$Frequency, x$Category, FUN=sum)
First Second Third
30 5 34
Using this data:
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
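To make the . notation concrete, here is a minimal sketch with a second numeric column. The Weight column and the names x_multi, sums and freq_by_cat are hypothetical additions for illustration; they are not part of the question's data.

```r
# Sample data from the answer, plus a hypothetical second numeric column (Weight).
x_multi <- data.frame(
  Category  = factor(c("First", "First", "First", "Second",
                       "Third", "Third", "Second")),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Weight    = c(1, 2, 3, 4, 5, 6, 7)   # assumed values, for illustration only
)

# "." stands for every column not named on the right-hand side of the formula,
# so both Frequency and Weight are summed within each Category.
sums <- aggregate(. ~ Category, data = x_multi, FUN = sum)
print(sums)

# tapply returns the per-category totals of a single column as a named vector.
freq_by_cat <- tapply(x_multi$Frequency, x_multi$Category, FUN = sum)
print(freq_by_cat)
```

aggregate returns a data.frame, which is usually easier to feed into later steps, while tapply returns a named vector that is handy for quick lookups by category name.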
Other answer 1
More recently, you can also use the dplyr package for this purpose:
library(dplyr)
x %>%
group_by(Category) %>%
summarise(Frequency = sum(Frequency))
#Source: local data frame [3 x 2]
#
# Category Frequency
#1 First 30
#2 Second 5
#3 Third 34
Or, for multiple summary columns (it works for a single column too):
x %>%
group_by(Category) %>%
summarise_each(funs(sum))
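Update for dplyr >= 0.5: summarise_each has been replaced by the summarise_all, summarise_at and summarise_if family of functions in dplyr.
Or, if you have multiple columns to group by, you can list them all in group_by, separated by commas:
mtcars %>%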
group_by(cyl, gear) %>% # multiple group columns
summarise(max_hp = max(hp), mean_mpg = mean(mpg)) # multiple summary columns
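As a hedged sketch of the multi-column case in more recent dplyr: assuming dplyr 1.0.0 or later is installed, across() can be used inside summarise() to apply one function to several columns. The Weight column and the names x_multi and totals are hypothetical and only serve the illustration.

```r
library(dplyr)

# Question's data plus a hypothetical second numeric column (Weight).
x_multi <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Weight    = c(1, 2, 3, 4, 5, 6, 7)   # assumed values, for illustration only
)

# Inside a grouped summarise(), everything() selects all non-grouping columns,
# so sum() is applied to Frequency and Weight within each Category.
totals <- x_multi %>%
  group_by(Category) %>%
  summarise(across(everything(), sum))
print(totals)
```

To restrict the summary to particular columns, the selection passed to across() can be narrowed, for example across(c(Frequency, Weight), sum).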
For more information, including the %>% operator, see the introduction to dplyr.
Other answer 2
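The answer provided by rcs works and is simple. However, if you are handling larger datasets and need better performance, there is a faster alternative: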
library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
# Category V1
# 1: First 30
# 2: Second 5
# 3: Third 34
system.time(data[, sum(Frequency), by = Category] )
# user system elapsed
# 0.008 0.001 0.009
Let's compare that with the same operation on a data.frame, using the aggregate approach from above:
data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user system elapsed
# 0.008 0.000 0.015
And if you want to keep the column name, this is the syntax:
data[,list(Frequency=sum(Frequency)),by=Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
The difference becomes more noticeable with larger datasets, as the code below shows:
data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time( data[,sum(Frequency),by=Category] )
# user system elapsed
# 0.055 0.004 0.059
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user system elapsed
# 0.287 0.010 0.296
For multiple aggregations, you can combine lapply and .SD as follows:
data[, lapply(.SD, sum), by = Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
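The lapply/.SD idiom is easier to see with more than one value column. Below is a minimal sketch; the Weight column and the names dt and totals are hypothetical, and .SDcols is used to state explicitly which columns get summed.

```r
library(data.table)

# Question's data plus a hypothetical second numeric column (Weight).
dt <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Weight    = c(1, 2, 3, 4, 5, 6, 7)   # assumed values, for illustration only
)

# .SD is the per-group subset of the columns named in .SDcols;
# lapply() applies sum() to each of those columns within every Category.
totals <- dt[, lapply(.SD, sum), by = Category,
             .SDcols = c("Frequency", "Weight")]
print(totals)
```

Without .SDcols, .SD contains every column except the grouping ones, which fails if any of them is non-numeric; naming the columns avoids that.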
Other answer 3
This is somewhat related to this question.
You can also just use the by() function:
x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind,as.list(x2))
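Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it is worth being familiar with by() since it is a base function.
Other answer 4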
library(plyr)
ddply(tbl, .(Category), summarise, sum = sum(Frequency))
Other answer 5
Several years later, just to add another simple base R solution that is missing here for some reason: xtabs
xtabs(Frequency ~ Category, df)
# Category
# First Second Third
# 30 5 34
Or, if you want a data.frame back:
as.data.frame(xtabs(Frequency ~ Category, df))
# Category Freq
# 1 First 30
# 2 Second 5
# 3 Third 34
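The snippets above refer to a data frame named df that this answer does not define. Here is a self-contained sketch, assuming df holds the question's data. It also uses the responseName argument of as.data.frame for tables so that the summed column keeps the name Frequency instead of the default Freq; the name res is hypothetical.

```r
# Assumption: df contains the question's sample data.
df <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# xtabs() sums the left-hand side of the formula within each level of Category.
tab <- xtabs(Frequency ~ Category, data = df)
print(tab)

# Convert the table to a data.frame; responseName overrides the default "Freq".
res <- as.data.frame(tab, responseName = "Frequency")
print(res)
```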
Other answer 6
Just to add a third option:
require(doBy)
summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)
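EDIT: this is a very old answer. Now I would recommend using group_by and summarise from dplyr, as in @docendo's answer.
Other answer 7
While I have recently become a convert to dplyr for most of these types of operations, the sqldf package is still really nice (and IMHO more readable) for some things. Here is an example of how this question can be answered with sqldf:
library(sqldf)
x <- data.frame(Category=factor(c("First", "First", "First", "Second",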
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
sqldf("select
Category
,sum(Frequency) as Frequency
from x
group by
Category")
## Category Frequency
## 1 First 30
## 2 Second 5
## 3 Third 34
Other answer 8
If x is a data frame containing your data, then the following will do what you want:
require(reshape)
recast(x, Category ~ ., fun.aggregate=sum)
Other answer 9
Using cast instead of recast (note that 'Frequency' is now 'value'):
df <- data.frame(Category = c("First","First","First","Second","Third","Third","Second")
, value = c(10,15,5,2,14,20,3))
install.packages("reshape")
library(reshape)
result<-cast(df, Category ~ . ,fun.aggregate=sum)
To get:
Category (all)
First 30
Second 5
Third 34