提问



给出两个数据框:


df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))

df1
#  CustomerId Product
#           1 Toaster
#           2 Toaster
#           3 Toaster
#           4   Radio
#           5   Radio
#           6   Radio

df2
#  CustomerId   State
#           2 Alabama
#           4 Alabama
#           6    Ohio


我怎样才能做数据库样式,即sql样式,加入?也就是说,我怎么得到:[94]



  • df1df2的内连接:

    仅返回左表在右表中具有匹配键的行。

  • df1df2的外连接:

    返回两个表中的所有行,从左侧连接记录,右表中具有匹配的键。

  • df1df2
    的左外连接(或简称左连接)
    返回左表中的所有行,以及右表中具有匹配键的所有行。

  • df1df2
    的右外连接
    返回右表中的所有行,以及左表中具有匹配键的所有行。



额外学分:[95] [96] [97] [98]


如何进行SQL样式选择语句?

最佳参考


通过使用merge函数及其可选参数:


内部联接: merge(df1, df2)将适用于这些示例,因为R会自动通过公共变量名称加入帧,但您很可能希望指定merge(df1, df2, by = "CustomerId")]]确保您只匹配所需的字段。如果匹配变量在不同数据帧中具有不同的名称,也可以使用by.xby.y参数。


外部联接: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)


左外: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)


右外: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)


交叉加入: merge(x = df1, y = df2, by = NULL)


就像内部联接一样,您可能希望将CustomerId显式传递给R作为匹配变量。我认为最好明确说明要合并的标识符;如果输入data.frames意外更改并且以后更容易阅读,则更安全。


您可以通过赋予by一个向量来合并多个列,例如by = c("CustomerId", "OrderId")


如果要合并的列名不相同,则可以指定,例如by.x = "CustomerId_in_df1", by.y =CustomerId_in_df2where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2`是列中的名称第二个数据框。 (如果需要在多个列上合并,这些也可以是向量。)

其它参考1


我建议查看Gabor Grothendieck的sqldf包,它允许你用SQL表达这些操作。[99]


library(sqldf)

## inner join
df3 <- sqldf("SELECT CustomerId, Product, State 
              FROM df1
              JOIN df2 USING(CustomerID)")

## left join (substitute 'right' for right join)
df4 <- sqldf("SELECT CustomerId, Product, State 
              FROM df1
              LEFT JOIN df2 USING(CustomerID)")


我发现SQL语法比它的R等价物更简单,更自然(但这可能只反映了我的RDBMS偏见)。


有关连接的更多信息,请参阅Gabor的sqldf GitHub。[100]

其它参考2


内联接有 data.table 方法,这非常节省时间和内存(对于一些较大的data.frames来说是必需的):


library(data.table)

dt1 <- data.table(df1, key = "CustomerId") 
dt2 <- data.table(df2, key = "CustomerId")

joined.dt1.dt.2 <- dt1[dt2]


merge也适用于data.tables(因为它是通用的并且调用merge.data.table)


merge(dt1, dt2)


stackoverflow上记录的data.table:

如何进行data.table合并操作

将外键上的SQL连接转换为R data.table语法

合并更大数据的有效替代方案。框架R

如何在R中使用data.table进行基本的左外连接?


另一种选择是 plyr [105]中的join函数。


library(plyr)

join(df1, df2,
     type = "inner")

#   CustomerId Product   State
# 1          2 Toaster Alabama
# 2          4   Radio Alabama
# 3          6   Radio    Ohio


type的选项:innerleftrightfull


?join:与merge不同,[[join]]保留x的顺序,无论使用何种连接类型。

其它参考3


你也可以使用Hadley Wickham令人敬畏的dplyr包来加入。[106]


library(dplyr)

#make sure that CustomerId cols are both type numeric
#they ARE not using the provided code in question and dplyr will complain
df1$CustomerId <- as.numeric(df1$CustomerId)
df2$CustomerId <- as.numeric(df2$CustomerId)


变异连接:使用df2

中的匹配将列添加到df1

#inner
inner_join(df1, df2)

#left outer
left_join(df1, df2)

#right outer
right_join(df1, df2)

#alternate right outer
left_join(df2, df1)

#full join
full_join(df1, df2)


过滤联接:过滤掉df1中的行,不要修改列



semi_join(df1, df2) #keep only observations in df1 that match in df2.
anti_join(df1, df2) #drops all observations in df1 that match in df2.

其它参考4


在R Wiki上有一些很好的例子。我会在这里偷一对:[107]


合并方法


由于您的密钥命名相同,因此内部连接的简短方法是merge():


merge(df1,df2)


可以使用all关键字创建完整的内部联接(来自两个表的所有记录):


merge(df1,df2, all=TRUE)


df1和df2的左外连接:


merge(df1,df2, all.x=TRUE)


df1和df2的右外连接:


merge(df1,df2, all.y=TRUE)


你可以翻转em,slapem并将它们擦以获得你问过的另外两个外连接:)


下标方法


使用下标方法在左侧使用df1的左外连接将是:


df1[,"State"]<-df2[df1[ ,"Product"], "State"]


可以通过对左外连接下标示例进行mungling来创建外连接的其他组合。 (是的,我知道这相当于说我会把它留给读者练习......)

其它参考5


2014年新增内容:


特别是如果你一般也对数据操作感兴趣(包括排序,过滤,子集,总结等),你一定要看看dplyr,它带有各种功能,旨在方便你的它专门用于数据框架和某些其他数据库类型。它甚至提供了相当精细的SQL接口,甚至还有一个将(大多数)SQL代码直接转换为R的函数。


dplyr包中的四个与连接相关的函数是(引用):



  • inner_join(x, y, by = NULL, copy = FALSE, ...):返回所有行
    x其中y中存在匹配值,x和y中存在所有列

  • left_join(x, y, by = NULL, copy = FALSE, ...):返回x中的所有行,以及x和y中的所有列

  • semi_join(x, y, by = NULL, copy = FALSE, ...):返回x中匹配值的所有行
    y,只保留x的列。

  • anti_join(x, y, by = NULL, copy = FALSE, ...):从x返回所有行
    y中没有匹配值的地方,只保留x
  • 中的列


这一切都非常详细。[108]


选择列可以通过select(df,"column")完成。如果这对你来说不够SQL,那么就有了sql()函数,你可以按原样输入SQL代码,它会像你在R中编写的那样执行你指定的操作。一直以来(有关更多信息,请参阅dplyr/databases插图)。例如,如果正确应用,sql("SELECT * FROM hflights")将从hflightsdplyr表中选择所有列(tbl)。[109]

其它参考6


更新data.table方法以加入数据集。请参阅以下每种联接类型的示例。有两种方法,一种来自[.data.table,当传递第二个data.table作为第一个参数到子集时,另一种方法是使用merge函数调度到快速data.table方法。


2016-04-01更新 - 这不是愚人节的笑话!

在1.9.7版本的data.table连接现在能够使用现有索引,这极大地减少了连接的时间。 以下代码和基准测试不会在连接时使用data.table索引。如果您正在寻找近实时连接,则应使用data.table索引。





df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2L, 4L, 7L), State = c(rep("Alabama", 2), rep("Ohio", 1))) # one value changed to show full outer join

library(data.table)

dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
setkey(dt1, CustomerId)
setkey(dt2, CustomerId)
# right outer join keyed data.tables
dt1[dt2]

setkey(dt1, NULL)
setkey(dt2, NULL)
# right outer join unkeyed data.tables - use `on` argument
dt1[dt2, on = "CustomerId"]

# left outer join - swap dt1 with dt2
dt2[dt1, on = "CustomerId"]

# inner join - use `nomatch` argument
dt1[dt2, nomatch=0L, on = "CustomerId"]

# anti join - use `!` operator
dt1[!dt2, on = "CustomerId"]

# inner join
merge(dt1, dt2, by = "CustomerId")

# full outer join
merge(dt1, dt2, by = "CustomerId", all = TRUE)

# see ?merge.data.table arguments for other cases


基准测试基础R,sqldf,dplyr和data.table。

基准测试未加密/未加索引的数据集。如果在data.tables或使用sqldf的索引上使用键,则可以获得更好的性能。 Base R和dplyr没有索引或键,因此我没有在基准测试中包含该场景
基准测试是在5M-1行数据集上执行的,在连接列上有5M-2个常用值,因此可以测试每个场景(左,右,全,内),并且连接仍然不容易执行。





library(microbenchmark)
library(sqldf)
library(dplyr)
library(data.table)

n = 5e6
set.seed(123)
df1 = data.frame(x=sample(n,n-1L), y1=rnorm(n-1L))
df2 = data.frame(x=sample(n,n-1L), y2=rnorm(n-1L))
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)

# inner join
microbenchmark(times = 10L,
               base = merge(df1, df2, by = "x"),
               sqldf = sqldf("SELECT * FROM df1 INNER JOIN df2 ON df1.x = df2.x"),
               dplyr = inner_join(df1, df2, by = "x"),
               data.table = dt1[dt2, nomatch = 0L, on = "x"])
#Unit: milliseconds
#       expr        min         lq      mean     median        uq       max neval
#       base 15546.0097 16083.4915 16687.117 16539.0148 17388.290 18513.216    10
#      sqldf 44392.6685 44709.7128 45096.401 45067.7461 45504.376 45563.472    10
#      dplyr  4124.0068  4248.7758  4281.122  4272.3619  4342.829  4411.388    10
# data.table   937.2461   946.0227  1053.411   973.0805  1214.300  1281.958    10

# left outer join
microbenchmark(times = 10L,
               base = merge(df1, df2, by = "x", all.x = TRUE),
               sqldf = sqldf("SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.x = df2.x"),
               dplyr = left_join(df1, df2, by = c("x"="x")),
               data.table = dt2[dt1, on = "x"])
#Unit: milliseconds
#       expr       min         lq       mean     median         uq       max neval
#       base 16140.791 17107.7366 17441.9538 17414.6263 17821.9035 19453.034    10
#      sqldf 43656.633 44141.9186 44777.1872 44498.7191 45288.7406 47108.900    10
#      dplyr  4062.153  4352.8021  4780.3221  4409.1186  4450.9301  8385.050    10
# data.table   823.218   823.5557   901.0383   837.9206   883.3292  1277.239    10

# right outer join
microbenchmark(times = 10L,
               base = merge(df1, df2, by = "x", all.y = TRUE),
               sqldf = sqldf("SELECT * FROM df2 LEFT OUTER JOIN df1 ON df2.x = df1.x"),
               dplyr = right_join(df1, df2, by = "x"),
               data.table = dt1[dt2, on = "x"])
#Unit: milliseconds
#       expr        min         lq       mean     median        uq       max neval
#       base 15821.3351 15954.9927 16347.3093 16044.3500 16621.887 17604.794    10
#      sqldf 43635.5308 43761.3532 43984.3682 43969.0081 44044.461 44499.891    10
#      dplyr  3936.0329  4028.1239  4102.4167  4045.0854  4219.958  4307.350    10
# data.table   820.8535   835.9101   918.5243   887.0207  1005.721  1068.919    10

# full outer join
microbenchmark(times = 10L,
               base = merge(df1, df2, by = "x", all = TRUE),
               #sqldf = sqldf("SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.x = df2.x"), # not supported
               dplyr = full_join(df1, df2, by = "x"),
               data.table = merge(dt1, dt2, by = "x", all = TRUE))
#Unit: seconds
#       expr       min        lq      mean    median        uq       max neval
#       base 16.176423 16.908908 17.485457 17.364857 18.271790 18.626762    10
#      dplyr  7.610498  7.666426  7.745850  7.710638  7.832125  7.951426    10
# data.table  2.052590  2.130317  2.352626  2.208913  2.470721  2.951948    10

其它参考7


dplyr自0.4以来实现了所有这些连接,包括outer_join,但值得注意的是,用于前几个版本它不提供outer_join,因此有很多非常糟糕的hacky变通方法用户代码在一段时间(你仍然可以在那段时间内找到SO和Kaggle的答案)。


与加入相关的发布亮点:[110]


v0.5 (6/2016)[111]



  • 处理POSIXct类型,时区,重复项,不同因子级别。更好的错误和警告。

  • 用于控制后缀重复变量名称接收的新后缀参数(#1296)



v0.4.0 (1/2015)[112]



  • 实施正确加入和外部联接(#96)

  • 变异连接,它将新变量从另一个表中的匹配行添加到一个表中。过滤联接,根据是否与另一个表中的观察值匹配来过滤来自一个表的观察结果。



v0.3 (10/2014)[113] [114]



  • 现在可以通过每个表中的不同变量left_join:df1%>%left_join(df2,c(var1=var2))



v0.2 (5/2014)[115]



  • * _ join()不再重新排序列名(#324)



v0.1.3 (4/2014)



  • inner_join,left_join,semi_join,anti_join

  • outer_join 尚未实现,回退是使用base :: merge()(或plyr :: join())

  • 还没有实现right_join和 outer_join

  • 哈德利在这里提到其他优势

  • 目前有一个小功能合并,dplyr不具备单独的by.x,by.y列,例如Python pandas。



每个hadley在该问题上的评论的解决方法:[116] [117] [118]



  • right_join (x,y)就行而言与left_join(y,x)相同,只是列将是不同的订单。使用select(new_column_order)
  • 轻松解决
  • outer_join 基本上是union(left_join(x,y),right_join(x,y)) - 即保留两个数据框中的所有行。


其它参考8


在连接两个数据帧时,每个行有大约100万行,一个有2列,另一个有~20,我惊奇地发现merge(..., all.x = TRUE, all.y = TRUE)dplyr::full_join()更快。这是用dplyr v0.4


合并需要大约17秒,full_join大约需要65秒。


虽然有些食物,因为我通常默认使用dplyr进行操作任务。

其它参考9


对于具有0..*:0..1基数的左连接或具有0..1:0..*基数的右连接的情况,可以从连接器(0..1表中就地分配单侧列)直接到joinee(0..*表),从而避免创建一个全新的数据表。这需要将来自细节的键列匹配到连接器中并且索引+相应地对连接器的行进行排序以进行分配。


如果键是单列,那么我们可以使用match()的单个调用来进行匹配。这是我将在这个答案中涵盖的情况。[119]


这是一个基于OP的示例,除了我已经在df2中添加了一个id为7的额外行来测试joiner中非匹配键的情况。这实际上是df1左连接df2:


df1 <- data.frame(CustomerId=1:6,Product=c(rep('Toaster',3L),rep('Radio',3L)));
df2 <- data.frame(CustomerId=c(2L,4L,6L,7L),State=c(rep('Alabama',2L),'Ohio','Texas'));
df1[names(df2)[-1L**] <- df2[match(df1[,1L],df2[,1L]),-1L];
df1;
##   CustomerId Product   State
## 1          1 Toaster    <NA>
## 2          2 Toaster Alabama
## 3          3 Toaster    <NA>
## 4          4   Radio Alabama
## 5          5   Radio    <NA>
## 6          6   Radio    Ohio


在上面我硬编码了一个假设,即键列是两个输入表的第一列。我认为,一般来说,这不是一个不合理的假设,因为如果你有一个带有键列的data.frame,如果它还没有被设置为data.frame的第一列,那就太奇怪了。一开始。并且您可以随时重新排序列以实现它。这个假设的一个有利结果是,键列的名称不必是硬编码的,尽管我认为它只是用另一个假设替换一个假设。简洁是整数索引的另一个优点,也是速度。下面的基准我将改变实现以使用字符串名称索引来匹配竞争实现。


我认为这是一个特别合适的解决方案,如果你有几个表想要连接对一个大表。为每次合并重复重建整个表将是不必要且低效的。


另一方面,如果您因任何原因需要通过此操作保持不变,则不能使用此解决方案,因为它直接修改了joinee。虽然在这种情况下您可以简单地复制并在副本上执行就地分配。





作为旁注,我简要介绍了多列密钥的可能匹配解决方案。不幸的是,我找到的唯一匹配解决方案是:



  • 低效的连接。例如match(interaction(df1$a,df1$b),interaction(df2$a,df2$b)),或与paste()相同的想法。

  • 效率低下的笛卡尔连词,例如outer(df1$a,df2$a,`==`) & outer(df1$b,df2$b,`==`)

  • 基础R merge()和等效的基于包的合并函数,它总是分配一个新表来返回合并结果,因此不适合基于内部分配的解决方案。



例如,请参阅在不同数据框上匹配多个列并获取其他列作为结果,将两列与另外两列匹配,在多列上匹配,以及我最初提出的就地解决方案的问题,组合R中具有不同行数的两个数据帧。 [122]





基准



我决定进行自己的基准测试,以了解就地分配方法与此问题中提供的其他解决方案的比较。


测试代码:


library(microbenchmark);
library(data.table);
library(sqldf);
library(plyr);
library(dplyr);

solSpecs <- list(
    merge=list(testFuncs=list(
        inner=function(df1,df2,key) merge(df1,df2,key),
        left =function(df1,df2,key) merge(df1,df2,key,all.x=T),
        right=function(df1,df2,key) merge(df1,df2,key,all.y=T),
        full =function(df1,df2,key) merge(df1,df2,key,all=T)
    )),
    data.table.unkeyed=list(argSpec='data.table.unkeyed',testFuncs=list(
        inner=function(dt1,dt2,key) dt1[dt2,on=key,nomatch=0L,allow.cartesian=T],
        left =function(dt1,dt2,key) dt2[dt1,on=key,allow.cartesian=T],
        right=function(dt1,dt2,key) dt1[dt2,on=key,allow.cartesian=T],
        full =function(dt1,dt2,key) merge(dt1,dt2,key,all=T,allow.cartesian=T) ## calls merge.data.table()
    )),
    data.table.keyed=list(argSpec='data.table.keyed',testFuncs=list(
        inner=function(dt1,dt2) dt1[dt2,nomatch=0L,allow.cartesian=T],
        left =function(dt1,dt2) dt2[dt1,allow.cartesian=T],
        right=function(dt1,dt2) dt1[dt2,allow.cartesian=T],
        full =function(dt1,dt2) merge(dt1,dt2,all=T,allow.cartesian=T) ## calls merge.data.table()
    )),
    sqldf.unindexed=list(testFuncs=list( ## note: must pass connection=NULL to avoid running against the live DB connection, which would result in collisions with the residual tables from the last query upload
        inner=function(df1,df2,key) sqldf(paste0('select * from df1 inner join df2 using(',paste(collapse=',',key),')'),connection=NULL),
        left =function(df1,df2,key) sqldf(paste0('select * from df1 left join df2 using(',paste(collapse=',',key),')'),connection=NULL),
        right=function(df1,df2,key) sqldf(paste0('select * from df2 left join df1 using(',paste(collapse=',',key),')'),connection=NULL) ## can't do right join proper, not yet supported; inverted left join is equivalent
        ##full =function(df1,df2,key) sqldf(paste0('select * from df1 full join df2 using(',paste(collapse=',',key),')'),connection=NULL) ## can't do full join proper, not yet supported; possible to hack it with a union of left joins, but too unreasonable to include in testing
    )),
    sqldf.indexed=list(testFuncs=list( ## important: requires an active DB connection with preindexed main.df1 and main.df2 ready to go; arguments are actually ignored
        inner=function(df1,df2,key) sqldf(paste0('select * from main.df1 inner join main.df2 using(',paste(collapse=',',key),')')),
        left =function(df1,df2,key) sqldf(paste0('select * from main.df1 left join main.df2 using(',paste(collapse=',',key),')')),
        right=function(df1,df2,key) sqldf(paste0('select * from main.df2 left join main.df1 using(',paste(collapse=',',key),')')) ## can't do right join proper, not yet supported; inverted left join is equivalent
        ##full =function(df1,df2,key) sqldf(paste0('select * from main.df1 full join main.df2 using(',paste(collapse=',',key),')')) ## can't do full join proper, not yet supported; possible to hack it with a union of left joins, but too unreasonable to include in testing
    )),
    plyr=list(testFuncs=list(
        inner=function(df1,df2,key) join(df1,df2,key,'inner'),
        left =function(df1,df2,key) join(df1,df2,key,'left'),
        right=function(df1,df2,key) join(df1,df2,key,'right'),
        full =function(df1,df2,key) join(df1,df2,key,'full')
    )),
    dplyr=list(testFuncs=list(
        inner=function(df1,df2,key) inner_join(df1,df2,key),
        left =function(df1,df2,key) left_join(df1,df2,key),
        right=function(df1,df2,key) right_join(df1,df2,key),
        full =function(df1,df2,key) full_join(df1,df2,key)
    )),
    in.place=list(testFuncs=list(
        left =function(df1,df2,key) { cns <- setdiff(names(df2),key); df1[cns] <- df2[match(df1[,key],df2[,key]),cns]; df1; },
        right=function(df1,df2,key) { cns <- setdiff(names(df1),key); df2[cns] <- df1[match(df2[,key],df1[,key]),cns]; df2; }
    ))
);

getSolTypes <- function() names(solSpecs);
getJoinTypes <- function() unique(unlist(lapply(solSpecs,function(x) names(x$testFuncs))));
getArgSpec <- function(argSpecs,key=NULL) if (is.null(key)) argSpecs$default else argSpecs[**key**];

initSqldf <- function() {
    sqldf(); ## creates sqlite connection on first run, cleans up and closes existing connection otherwise
    if (exists('sqldfInitFlag',envir=globalenv(),inherits=F) && sqldfInitFlag) { ## false only on first run
        sqldf(); ## creates a new connection
    } else {
        assign('sqldfInitFlag',T,envir=globalenv()); ## set to true for the one and only time
    }; ## end if
    invisible();
}; ## end initSqldf()

setUpBenchmarkCall <- function(argSpecs,joinType,solTypes=getSolTypes(),env=parent.frame()) {
    ## builds and returns a list of expressions suitable for passing to the list argument of microbenchmark(), and assigns variables to resolve symbol references in those expressions
    callExpressions <- list();
    nms <- character();
    for (solType in solTypes) {
        testFunc <- solSpecs[**solType**]$testFuncs[**joinType**];
        if (is.null(testFunc)) next; ## this join type is not defined for this solution type
        testFuncName <- paste0('tf.',solType);
        assign(testFuncName,testFunc,envir=env);
        argSpecKey <- solSpecs[**solType**]$argSpec;
        argSpec <- getArgSpec(argSpecs,argSpecKey);
        argList <- setNames(nm=names(argSpec$args),vector('list',length(argSpec$args)));
        for (i in seq_along(argSpec$args)) {
            argName <- paste0('tfa.',argSpecKey,i);
            assign(argName,argSpec$args[**i**],envir=env);
            argList[**i**] <- if (i%in%argSpec$copySpec) call('copy',as.symbol(argName)) else as.symbol(argName);
        }; ## end for
        callExpressions[**length(callExpressions)+1L**] <- do.call(call,c(list(testFuncName),argList),quote=T);
        nms[length(nms)+1L] <- solType;
    }; ## end for
    names(callExpressions) <- nms;
    callExpressions;
}; ## end setUpBenchmarkCall()

harmonize <- function(res) {
    res <- as.data.frame(res); ## coerce to data.frame
    for (ci in which(sapply(res,is.factor))) res[**ci**] <- as.character(res[**ci**]); ## coerce factor columns to character
    for (ci in which(sapply(res,is.logical))) res[**ci**] <- as.integer(res[**ci**]); ## coerce logical columns to integer (works around sqldf quirk of munging logicals to integers)
    ##for (ci in which(sapply(res,inherits,'POSIXct'))) res[**ci**] <- as.double(res[**ci**]); ## coerce POSIXct columns to double (works around sqldf quirk of losing POSIXct class) ----- POSIXct doesn't work at all in sqldf.indexed
    res <- res[order(names(res))]; ## order columns
    res <- res[do.call(order,res),]; ## order rows
    res;
}; ## end harmonize()

checkIdentical <- function(argSpecs,solTypes=getSolTypes()) {
    for (joinType in getJoinTypes()) {
        callExpressions <- setUpBenchmarkCall(argSpecs,joinType,solTypes);
        if (length(callExpressions)<2L) next;
        ex <- harmonize(eval(callExpressions[**1L**]));
        for (i in seq(2L,len=length(callExpressions)-1L)) {
            y <- harmonize(eval(callExpressions[**i**]));
            if (!isTRUE(all.equal(ex,y,check.attributes=F))) {
                ex <<- ex;
                y <<- y;
                solType <- names(callExpressions)[i];
                stop(paste0('non-identical: ',solType,' ',joinType,'.'));
            }; ## end if
        }; ## end for
    }; ## end for
    invisible();
}; ## end checkIdentical()

testJoinType <- function(argSpecs,joinType,solTypes=getSolTypes(),metric=NULL,times=100L) {
    callExpressions <- setUpBenchmarkCall(argSpecs,joinType,solTypes);
    bm <- microbenchmark(list=callExpressions,times=times);
    if (is.null(metric)) return(bm);
    bm <- summary(bm);
    res <- setNames(nm=names(callExpressions),bm[**metric**]);
    attr(res,'unit') <- attr(bm,'unit');
    res;
}; ## end testJoinType()

testAllJoinTypes <- function(argSpecs,solTypes=getSolTypes(),metric=NULL,times=100L) {
    joinTypes <- getJoinTypes();
    resList <- setNames(nm=joinTypes,lapply(joinTypes,function(joinType) testJoinType(argSpecs,joinType,solTypes,metric,times)));
    if (is.null(metric)) return(resList);
    units <- unname(unlist(lapply(resList,attr,'unit')));
    res <- do.call(data.frame,c(list(join=joinTypes),setNames(nm=solTypes,rep(list(rep(NA_real_,length(joinTypes))),length(solTypes))),list(unit=units,stringsAsFactors=F)));
    for (i in seq_along(resList)) res[i,match(names(resList[**i**]),names(res))] <- resList[**i**];
    res;
}; ## end testAllJoinTypes()

testGrid <- function(makeArgSpecsFunc,sizes,overlaps,solTypes=getSolTypes(),joinTypes=getJoinTypes(),metric='median',times=100L) {

    res <- expand.grid(size=sizes,overlap=overlaps,joinType=joinTypes,stringsAsFactors=F);
    res[solTypes] <- NA_real_;
    res$unit <- NA_character_;
    for (ri in seq_len(nrow(res))) {

        size <- res$size[ri];
        overlap <- res$overlap[ri];
        joinType <- res$joinType[ri];

        argSpecs <- makeArgSpecsFunc(size,overlap);

        checkIdentical(argSpecs,solTypes);

        cur <- testJoinType(argSpecs,joinType,solTypes,metric,times);
        res[ri,match(names(cur),names(res))] <- cur;
        res$unit[ri] <- attr(cur,'unit');

    }; ## end for

    res;

}; ## end testGrid()





这是我之前演示的基于OP的示例的基准:


## OP's example, supplemented with a non-matching row in df2
argSpecs <- list(
    default=list(copySpec=1:2,args=list(
        df1 <- data.frame(CustomerId=1:6,Product=c(rep('Toaster',3L),rep('Radio',3L))),
        df2 <- data.frame(CustomerId=c(2L,4L,6L,7L),State=c(rep('Alabama',2L),'Ohio','Texas')),
        'CustomerId'
    )),
    data.table.unkeyed=list(copySpec=1:2,args=list(
        as.data.table(df1),
        as.data.table(df2),
        'CustomerId'
    )),
    data.table.keyed=list(copySpec=1:2,args=list(
        setkey(as.data.table(df1),CustomerId),
        setkey(as.data.table(df2),CustomerId)
    ))
);
## prepare sqldf
initSqldf();
sqldf('create index df1_key on df1(CustomerId);'); ## upload and create an sqlite index on df1
sqldf('create index df2_key on df2(CustomerId);'); ## upload and create an sqlite index on df2

checkIdentical(argSpecs);

testAllJoinTypes(argSpecs,metric='median');
##    join    merge data.table.unkeyed data.table.keyed sqldf.unindexed sqldf.indexed      plyr    dplyr in.place         unit
## 1 inner  644.259           861.9345          923.516        9157.752      1580.390  959.2250 270.9190       NA microseconds
## 2  left  713.539           888.0205          910.045        8820.334      1529.714  968.4195 270.9185 224.3045 microseconds
## 3 right 1221.804           909.1900          923.944        8930.668      1533.135 1063.7860 269.8495 218.1035 microseconds
## 4  full 1302.203          3107.5380         3184.729              NA            NA 1593.6475 270.7055       NA microseconds





在这里,我对随机输入数据进行基准测试,尝试不同的比例和两个输入表之间的键重叠的不同模式。此基准仍限于单列整数键的情况。同样,为了确保就地解决方案适用于同一表的左右连接,所有随机测试数据都使用0..1:0..1基数。这是在生成第二个data.frame的键列时通过采样而不替换第一个data.frame的键列来实现的。


makeArgSpecs.singleIntegerKey.optionalOneToOne <- function(size,overlap) {

    com <- as.integer(size*overlap);

    argSpecs <- list(
        default=list(copySpec=1:2,args=list(
            df1 <- data.frame(id=sample(size),y1=rnorm(size),y2=rnorm(size)),
            df2 <- data.frame(id=sample(c(if (com>0L) sample(df1$id,com) else integer(),seq(size+1L,len=size-com))),y3=rnorm(size),y4=rnorm(size)),
            'id'
        )),
        data.table.unkeyed=list(copySpec=1:2,args=list(
            as.data.table(df1),
            as.data.table(df2),
            'id'
        )),
        data.table.keyed=list(copySpec=1:2,args=list(
            setkey(as.data.table(df1),id),
            setkey(as.data.table(df2),id)
        ))
    );
    ## prepare sqldf
    initSqldf();
    sqldf('create index df1_key on df1(id);'); ## upload and create an sqlite index on df1
    sqldf('create index df2_key on df2(id);'); ## upload and create an sqlite index on df2

    argSpecs;

}; ## end makeArgSpecs.singleIntegerKey.optionalOneToOne()

## cross of various input sizes and key overlaps
sizes <- c(1e1L,1e3L,1e6L);
overlaps <- c(0.99,0.5,0.01);
system.time({ res <- testGrid(makeArgSpecs.singleIntegerKey.optionalOneToOne,sizes,overlaps); });
##     user   system  elapsed
## 22024.65 12308.63 34493.19


我写了一些代码来创建上述结果的日志 - 日志图。我为每个重叠百分比生成了一个单独的图。它有点混乱,但我喜欢在同一个图中表示所有解决方案类型和连接类型。


我使用样条插值来显示每个解决方案/连接类型组合的平滑曲线,使用单独的pch符号绘制。连接类型由pch符号捕获,左侧和右侧使用内部,左右角括号的点,以及完整的菱形。解决方案类型由颜色捕获,如图例中所示。


plotRes <- function(res,titleFunc,useFloor=F) {
    solTypes <- setdiff(names(res),c('size','overlap','joinType','unit')); ## derive from res
    normMult <- c(microseconds=1e-3,milliseconds=1); ## normalize to milliseconds
    joinTypes <- getJoinTypes();
    cols <- c(merge='purple',data.table.unkeyed='blue',data.table.keyed='#00DDDD',sqldf.unindexed='brown',sqldf.indexed='orange',plyr='red',dplyr='#00BB00',in.place='magenta');
    pchs <- list(inner=20L,left='<',right='>',full=23L);
    cexs <- c(inner=0.7,left=1,right=1,full=0.7);
    NP <- 60L;
    ord <- order(decreasing=T,colMeans(res[res$size==max(res$size),solTypes],na.rm=T));
    ymajors <- data.frame(y=c(1,1e3),label=c('1ms','1s'),stringsAsFactors=F);
    for (overlap in unique(res$overlap)) {
        x1 <- res[res$overlap==overlap,];
        x1[solTypes] <- x1[solTypes]*normMult[x1$unit]; x1$unit <- NULL;
        xlim <- c(1e1,max(x1$size));
        xticks <- 10^seq(log10(xlim[1L]),log10(xlim[2L]));
        ylim <- c(1e-1,10^((if (useFloor) floor else ceiling)(log10(max(x1[solTypes],na.rm=T))))); ## use floor() to zoom in a little more, only sqldf.unindexed will break above, but xpd=NA will keep it visible
        yticks <- 10^seq(log10(ylim[1L]),log10(ylim[2L]));
        yticks.minor <- rep(yticks[-length(yticks)],each=9L)*1:9;
        plot(NA,xlim=xlim,ylim=ylim,xaxs='i',yaxs='i',axes=F,xlab='size (rows)',ylab='time (ms)',log='xy');
        abline(v=xticks,col='lightgrey');
        abline(h=yticks.minor,col='lightgrey',lty=3L);
        abline(h=yticks,col='lightgrey');
        axis(1L,xticks,parse(text=sprintf('10^%d',as.integer(log10(xticks)))));
        axis(2L,yticks,parse(text=sprintf('10^%d',as.integer(log10(yticks)))),las=1L);
        axis(4L,ymajors$y,ymajors$label,las=1L,tick=F,cex.axis=0.7,hadj=0.5);
        for (joinType in rev(joinTypes)) { ## reverse to draw full first, since it's larger and would be more obtrusive if drawn last
            x2 <- x1[x1$joinType==joinType,];
            for (solType in solTypes) {
                if (any(!is.na(x2[**solType**]))) {
                    xy <- spline(x2$size,x2[**solType**],xout=10^(seq(log10(x2$size[1L]),log10(x2$size[nrow(x2)]),len=NP)));
                    points(xy$x,xy$y,pch=pchs[**joinType**],col=cols[solType],cex=cexs[joinType],xpd=NA);
                }; ## end if
            }; ## end for
        }; ## end for
        ## custom legend
        ## due to logarithmic skew, must do all distance calcs in inches, and convert to user coords afterward
        ## the bottom-left corner of the legend will be defined in normalized figure coords, although we can convert to inches immediately
        leg.cex <- 0.7;
        leg.x.in <- grconvertX(0.275,'nfc','in');
        leg.y.in <- grconvertY(0.6,'nfc','in');
        leg.x.user <- grconvertX(leg.x.in,'in');
        leg.y.user <- grconvertY(leg.y.in,'in');
        leg.outpad.w.in <- 0.1;
        leg.outpad.h.in <- 0.1;
        leg.midpad.w.in <- 0.1;
        leg.midpad.h.in <- 0.1;
        leg.sol.w.in <- max(strwidth(solTypes,'in',leg.cex));
        leg.sol.h.in <- max(strheight(solTypes,'in',leg.cex))*1.5; ## multiplication factor for greater line height
        leg.join.w.in <- max(strheight(joinTypes,'in',leg.cex))*1.5; ## ditto
        leg.join.h.in <- max(strwidth(joinTypes,'in',leg.cex));
        leg.main.w.in <- leg.join.w.in*length(joinTypes);
        leg.main.h.in <- leg.sol.h.in*length(solTypes);
        leg.x2.user <- grconvertX(leg.x.in+leg.outpad.w.in*2+leg.main.w.in+leg.midpad.w.in+leg.sol.w.in,'in');
        leg.y2.user <- grconvertY(leg.y.in+leg.outpad.h.in*2+leg.main.h.in+leg.midpad.h.in+leg.join.h.in,'in');
        leg.cols.x.user <- grconvertX(leg.x.in+leg.outpad.w.in+leg.join.w.in*(0.5+seq(0L,length(joinTypes)-1L)),'in');
        leg.lines.y.user <- grconvertY(leg.y.in+leg.outpad.h.in+leg.main.h.in-leg.sol.h.in*(0.5+seq(0L,length(solTypes)-1L)),'in');
        leg.sol.x.user <- grconvertX(leg.x.in+leg.outpad.w.in+leg.main.w.in+leg.midpad.w.in,'in');
        leg.join.y.user <- grconvertY(leg.y.in+leg.outpad.h.in+leg.main.h.in+leg.midpad.h.in,'in');
        rect(leg.x.user,leg.y.user,leg.x2.user,leg.y2.user,col='white');
        text(leg.sol.x.user,leg.lines.y.user,solTypes[ord],cex=leg.cex,pos=4L,offset=0);
        text(leg.cols.x.user,leg.join.y.user,joinTypes,cex=leg.cex,pos=4L,offset=0,srt=90); ## srt rotation applies *after* pos/offset positioning
        for (i in seq_along(joinTypes)) {
            joinType <- joinTypes[i];
            points(rep(leg.cols.x.user[i],length(solTypes)),ifelse(colSums(!is.na(x1[x1$joinType==joinType,solTypes[ord**]))==0L,NA,leg.lines.y.user),pch=pchs[**joinType**],col=cols[solTypes[ord**]);
        }; ## end for
        title(titleFunc(overlap));
        readline(sprintf('overlap %.02f',overlap));
    }; ## end for
}; ## end plotRes()

titleFunc <- function(overlap) sprintf('R merge solutions: single-column integer key, 0..1:0..1 cardinality, %d%% overlap',as.integer(overlap*100));
plotRes(res,titleFunc,T);


[[[124]


[[[125]


[[[126]





这是第二个大型基准测试,在关键列的数量和类型以及基数方面更加重要。对于此基准测试,我使用三个关键列:一个字符,一个整数和一个逻辑,对基数没有限制(即0..*:0..*)。 (一般来说,由于浮点比较的复杂性,不建议使用双值或复数值来定义关键列,并且基本上没有人使用原始类型,更不用说关键列了,所以我没有包含这些类型关键字。另外,为了便于说明,我最初试图通过包含一个POSIXct键列来使用四个关键列,但由于某种原因,POSIXct类型与sqldf.indexed解决方案并不合适,可能是到浮点比较异常,所以我删除了它。)


makeArgSpecs.assortedKey.optionalManyToMany <- function(size,overlap,uniquePct=75) {

    ## number of unique keys in df1
    u1Size <- as.integer(size*uniquePct/100);

    ## (roughly) divide u1Size into bases, so we can use expand.grid() to produce the required number of unique key values with repetitions within individual key columns
    ## use ceiling() to ensure we cover u1Size; will truncate afterward
    u1SizePerKeyColumn <- as.integer(ceiling(u1Size^(1/3)));

    ## generate the unique key values for df1
    keys1 <- expand.grid(stringsAsFactors=F,
        idCharacter=replicate(u1SizePerKeyColumn,paste(collapse='',sample(letters,sample(4:12,1L),T))),
        idInteger=sample(u1SizePerKeyColumn),
        idLogical=sample(c(F,T),u1SizePerKeyColumn,T)
        ##idPOSIXct=as.POSIXct('2016-01-01 00:00:00','UTC')+sample(u1SizePerKeyColumn)
    )[seq_len(u1Size),];

    ## rbind some repetitions of the unique keys; this will prepare one side of the many-to-many relationship
    ## also scramble the order afterward
    keys1 <- rbind(keys1,keys1[sample(nrow(keys1),size-u1Size,T),])[sample(size),];

    ## common and unilateral key counts
    com <- as.integer(size*overlap);
    uni <- size-com;

    ## generate some unilateral keys for df2 by synthesizing outside of the idInteger range of df1
    keys2 <- data.frame(stringsAsFactors=F,
        idCharacter=replicate(uni,paste(collapse='',sample(letters,sample(4:12,1L),T))),
        idInteger=u1SizePerKeyColumn+sample(uni),
        idLogical=sample(c(F,T),uni,T)
        ##idPOSIXct=as.POSIXct('2016-01-01 00:00:00','UTC')+u1SizePerKeyColumn+sample(uni)
    );

    ## rbind random keys from df1; this will complete the many-to-many relationship
    ## also scramble the order afterward
    keys2 <- rbind(keys2,keys1[sample(nrow(keys1),com,T),])[sample(size),];

    ##keyNames <- c('idCharacter','idInteger','idLogical','idPOSIXct');
    keyNames <- c('idCharacter','idInteger','idLogical');
    ## note: was going to use raw and complex type for two of the non-key columns, but data.table doesn't seem to fully support them
    argSpecs <- list(
        default=list(copySpec=1:2,args=list(
            df1 <- cbind(stringsAsFactors=F,keys1,y1=sample(c(F,T),size,T),y2=sample(size),y3=rnorm(size),y4=replicate(size,paste(collapse='',sample(letters,sample(4:12,1L),T)))),
            df2 <- cbind(stringsAsFactors=F,keys2,y5=sample(c(F,T),size,T),y6=sample(size),y7=rnorm(size),y8=replicate(size,paste(collapse='',sample(letters,sample(4:12,1L),T)))),
            keyNames
        )),
        data.table.unkeyed=list(copySpec=1:2,args=list(
            as.data.table(df1),
            as.data.table(df2),
            keyNames
        )),
        data.table.keyed=list(copySpec=1:2,args=list(
            setkeyv(as.data.table(df1),keyNames),
            setkeyv(as.data.table(df2),keyNames)
        ))
    );
    ## prepare sqldf
    initSqldf();
    sqldf(paste0('create index df1_key on df1(',paste(collapse=',',keyNames),');')); ## upload and create an sqlite index on df1
    sqldf(paste0('create index df2_key on df2(',paste(collapse=',',keyNames),');')); ## upload and create an sqlite index on df2

    argSpecs;

}; ## end makeArgSpecs.assortedKey.optionalManyToMany()

sizes <- c(1e1L,1e3L,1e5L); ## 1e5L instead of 1e6L to respect more heavy-duty inputs
overlaps <- c(0.99,0.5,0.01);
solTypes <- setdiff(getSolTypes(),'in.place');
system.time({ res <- testGrid(makeArgSpecs.assortedKey.optionalManyToMany,sizes,overlaps,solTypes); });
##     user   system  elapsed
## 38895.50   784.19 39745.53


生成的图,使用上面给出的相同绘图代码:


titleFunc <- function(overlap) sprintf('R merge solutions: character/integer/logical key, 0..*:0..* cardinality, %d%% overlap',as.integer(overlap*100));
plotRes(res,titleFunc,F);


[127]


[128]


[129]

其它参考10



  1. 使用merge函数,我们可以选择左表或右表的变量,就像我们熟悉SQL中的select语句一样(EX:选择a。* ...或选择b。* from)。 ....)

  2. 我们必须添加额外的代码,这些代码将从新连接的表中进行子集化。



    • SQL: - select a.* from df1 a inner join df2 b on a.CustomerId=b.CustomerId

    • R: - merge(df1, df2, by.x = "CustomerId", by.y = "CustomerId")[,names(df1)]




同样的方式



  • SQL: - select b.* from df1 a inner join df2 b on a.CustomerId=b.CustomerId

  • R: - merge(df1, df2, by.x = "CustomerId", by.y = "CustomerId")[,names(df2)]


其它参考11


对于所有列的内部联接,您还可以使用 data.table -package中的fintersect或 dplyr -package中的intersect作为merge的替代方案,但没有指定by - 列。这将给出两个数据帧之间相等的行:


merge(df1, df2)
#   V1 V2
# 1  B  2
# 2  C  3
dplyr::intersect(df1, df2)
#   V1 V2
# 1  B  2
# 2  C  3
data.table::fintersect(setDT(df1), setDT(df2))
#    V1 V2
# 1:  B  2
# 2:  C  3





示例数据:


df1 <- data.frame(V1 = LETTERS[1:4], V2 = 1:4)
df2 <- data.frame(V1 = LETTERS[2:3], V2 = 2:3)