# An example of base::split() for looping through groups

This article was first published on **Very statisticious**, and kindly contributed to R-bloggers.

I recently had a question from a client about the simplest way to subset a data.frame and apply a function to each subset. “Simplest” could mean many things, of course, since what is simple for one person could appear very difficult to another. In this specific case I suggested using `base::split()` as a possible option since it is one I find fairly approachable.

It turns out I don’t have a go-to example for how to get started with a `split()` approach. So here’s a quick blog post about it!

# Load R packages

I’ll load **purrr** for looping through lists.

```r
library(purrr) # 0.3.3
```

# A dataset with groups

I made a small dataset to use with `split()`. The `id` variable contains the group information. There are three groups, a, b, and c, with 10 observations per group. There are also two numeric variables, `var1` and `var2`.

```r
dat = structure(list(
  id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                   2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
                   3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
                 .Label = c("a", "b", "c"), class = "factor"),
  var1 = c(4, 2.7, 3.4, 2.7, 4.6, 2.9, 2.2, 4.5, 4.6, 2.4,
           3, 3.8, 2.5, 4, 3.6, 2.7, 4.5, 4.1, 4.2, 2.2,
           4.9, 4.4, 3.6, 3.3, 2.7, 3.9, 4.9, 4.9, 4.3, 3.4),
  var2 = c(6, 22.3, 19.4, 22.8, 18.6, 14.2, 10.9, 22.7, 22.4, 11.7,
           6, 13.3, 12.5, 6.3, 13.6, 20.5, 23.6, 10.9, 8.9, 20.9,
           23.7, 15.9, 22.1, 11.6, 22, 17.7, 21, 20.8, 16.7, 21.4)),
  class = "data.frame", row.names = c(NA, -30L))

head(dat)
#   id var1 var2
# 1  a  4.0  6.0
# 2  a  2.7 22.3
# 3  a  3.4 19.4
# 4  a  2.7 22.8
# 5  a  4.6 18.6
# 6  a  2.9 14.2
```

# Create separate data.frames per group

If the goal is to apply a function to each dataset in each group, we need to pull out a dataset for each `id`. One approach to do this is to make a subset for each group and then apply the function of interest to the subset. A classic approach would be to do the subsetting within a `for()` loop.
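To make the comparison concrete, here is a rough sketch of what that classic `for()` loop might look like. The `toy` data.frame is a stand-in I made from the first four rows of each group of `dat`, and taking the mean of `var1` is just a placeholder for whatever function is of interest; the names `toy`, `groups`, and `loop_means` are only for illustration.

```r
# Stand-in data: the first four rows of each group from `dat`
toy = data.frame(
  id = factor(rep(c("a", "b", "c"), each = 4)),
  var1 = c(4.0, 2.7, 3.4, 2.7, 3.0, 3.8, 2.5, 4.0, 4.9, 4.4, 3.6, 3.3),
  var2 = c(6.0, 22.3, 19.4, 22.8, 6.0, 13.3, 12.5, 6.3, 23.7, 15.9, 22.1, 11.6)
)

groups = levels(toy$id)

# Pre-allocate a named list to hold one result per group
loop_means = vector("list", length(groups))
names(loop_means) = groups

for (g in groups) {
  # Manual subsetting: keep only the rows for the current group
  sub = toy[toy$id == g, ]
  # Placeholder for the function of interest
  loop_means[[g]] = mean(sub$var1)
}

loop_means
```

This works fine, but the subsetting, the storage list, and the names all have to be managed by hand.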

This is a situation where I find `split()` to be really convenient. It splits the data by a defined group variable so we don’t have to subset things manually.

The output from `split()` is a list. If I split a dataset by groups, each element of the list will be a data.frame for one of the groups. Note the group values are used as the names of the list elements. I find the list-naming aspect of `split()` handy for keeping track of groups in subsequent steps.

Here’s an example, where I split `dat` by the `id` variable.

```r
dat_list = split(dat, dat$id)
dat_list
# $a
#    id var1 var2
# 1   a  4.0  6.0
# 2   a  2.7 22.3
# 3   a  3.4 19.4
# 4   a  2.7 22.8
# 5   a  4.6 18.6
# 6   a  2.9 14.2
# 7   a  2.2 10.9
# 8   a  4.5 22.7
# 9   a  4.6 22.4
# 10  a  2.4 11.7
#
# $b
#    id var1 var2
# 11  b  3.0  6.0
# 12  b  3.8 13.3
# 13  b  2.5 12.5
# 14  b  4.0  6.3
# 15  b  3.6 13.6
# 16  b  2.7 20.5
# 17  b  4.5 23.6
# 18  b  4.1 10.9
# 19  b  4.2  8.9
# 20  b  2.2 20.9
#
# $c
#    id var1 var2
# 21  c  4.9 23.7
# 22  c  4.4 15.9
# 23  c  3.6 22.1
# 24  c  3.3 11.6
# 25  c  2.7 22.0
# 26  c  3.9 17.7
# 27  c  4.9 21.0
# 28  c  4.9 20.8
# 29  c  4.3 16.7
# 30  c  3.4 21.4
```
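Since the list names carry the group information, it can help to poke at the split list a bit before looping. Here is a small sketch of that, using the built-in `iris` dataset as a stand-in so it runs on its own (`Species` plays the role of `id`; the name `iris_list` is just for illustration).

```r
# Built-in `iris` stands in for `dat`; Species is the grouping variable
iris_list = split(iris, iris$Species)

names(iris_list)         # list names come from the levels of the grouping variable
class(iris_list)         # the result of split() is a plain list
sapply(iris_list, nrow)  # number of rows in each group's data.frame

# Pull out a single group's data.frame by name
head(iris_list[["setosa"]], 3)
head(iris_list$virginica, 3)
```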

# Looping through the list

Once the data are split into separate data.frames per group, we can loop through the list and apply a function to each one using whatever looping approach we prefer.

For example, if I want to fit a linear model of `var1` vs `var2` for each group I might do the looping with `purrr::map()` or `lapply()`.

Each element of the new list still has the grouping information attached via the list names.

```r
map(dat_list, ~lm(var1 ~ var2, data = .x) )
# $a
#
# Call:
# lm(formula = var1 ~ var2, data = .x)
#
# Coefficients:
# (Intercept)         var2
#     2.64826      0.04396
#
#
# $b
#
# Call:
# lm(formula = var1 ~ var2, data = .x)
#
# Coefficients:
# (Intercept)         var2
#     3.80822     -0.02551
#
#
# $c
#
# Call:
# lm(formula = var1 ~ var2, data = .x)
#
# Coefficients:
# (Intercept)         var2
#     3.35241      0.03513
```
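For anyone who prefers to stay in base R, the same loop could be sketched with `lapply()` instead of `map()`. I again use the built-in `iris` data as a stand-in so the example is self-contained; the model of `Sepal.Length` on `Sepal.Width` is only an illustration, as are the names `fits` and `coefs`.

```r
# Built-in `iris` stands in for `dat`; Species plays the role of `id`
iris_list = split(iris, iris$Species)

# Same idea as map(), written with an ordinary anonymous function
fits = lapply(iris_list, function(d) lm(Sepal.Length ~ Sepal.Width, data = d))

# The list of models keeps the group names
names(fits)

# Each model has two coefficients, so vapply() returns a matrix
# with one column per group
coefs = vapply(fits, coef, numeric(2))
coefs
```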

I could also create a function that fits a model and then returns model output. For example, maybe what I really want to do is fit a linear model and extract \(R^2\) for each group's model fit.

```r
r2 = function(data) {
  fit = lm(var1 ~ var2, data = data)
  broom::glance(fit)
}
```

The output of my `r2` function, which uses `broom::glance()`, is a data.frame.

```r
r2(data = dat)
# # A tibble: 1 x 11
#   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
#       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
# 1    0.0292      -0.00550 0.867     0.841   0.367     2  -37.3  80.5  84.7
# # ... with 2 more variables: deviance <dbl>, df.residual <int>
```

Since the function output is a data.frame, I can use `purrr::map_dfr()` to combine the output per group into a single data.frame. The `.id` argument creates a new variable to store the list names in the output.

```r
map_dfr(dat_list, r2, .id = "id")
# # A tibble: 3 x 12
#   id    r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
#   <chr>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl>
# 1 a        0.0775       -0.0378 0.968     0.672   0.436     2  -12.7  31.5
# 2 b        0.0387       -0.0815 0.832     0.322   0.586     2  -11.2  28.5
# 3 c        0.0285       -0.0930 0.808     0.235   0.641     2  -10.9  27.9
# # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
```
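If **purrr** and **broom** aren't available, a similar one-row-per-group table can be sketched in base R. This is not the same as the `map_dfr()` call above, just one possible alternative: `r2_base` is a hypothetical helper that returns only \(R^2\) from `summary()`, and the built-in `iris` data stands in for `dat`.

```r
# Built-in `iris` stands in for `dat`; Species plays the role of `id`
iris_list = split(iris, iris$Species)

# Hypothetical helper: fit the model, return a one-row data.frame with R-squared
r2_base = function(data) {
  fit = lm(Sepal.Length ~ Sepal.Width, data = data)
  data.frame(r.squared = summary(fit)$r.squared)
}

per_group = lapply(iris_list, r2_base)

# Stack the one-row data.frames; the list names become the row names
r2_tab = do.call(rbind, per_group)

# Move the group names into a column, similar in spirit to the `.id` argument
r2_tab = data.frame(id = rownames(r2_tab), r2_tab, row.names = NULL)
r2_tab
```

The extra step to recover the group names is the part that `.id` handles for us in `map_dfr()`.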

# Splitting by multiple groups

It is possible to split data by multiple grouping variables in the `split()` function. The grouping variables must be passed as a list.

Here’s an example, using the built-in `mtcars` dataset. I show only the first two list elements to demonstrate that the list names are now based on a combination of the values for the two groups. By default these values are separated by a `.` (but see the `sep` argument to control this).

```r
mtcars_cylam = split(mtcars, list(mtcars$cyl, mtcars$am) )
mtcars_cylam[1:2]
# $`4.0`
#                mpg cyl  disp hp drat    wt  qsec vs am gear carb
# Merc 240D     24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
# Merc 230      22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2
# Toyota Corona 21.5   4 120.1 97 3.70 2.465 20.01  1  0    3    1
#
# $`6.0`
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
# Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
```
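As a quick illustration of the `sep` argument, here is a sketch that swaps the default `.` for an underscore. The object name `by_cyl_am` is just for this example.

```r
# Same split as above, but with "_" between the two group values
by_cyl_am = split(mtcars, list(mtcars$cyl, mtcars$am), sep = "_")

names(by_cyl_am)
# "4_0" "6_0" "8_0" "4_1" "6_1" "8_1"
```

A separator other than `.` can be handy if the names need to be pulled apart again later, since the group values themselves might contain a `.`.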

If not all combinations of the groups are present in the data, the `drop` argument in `split()` allows us to drop the missing combinations. By default (`drop = FALSE`), combinations that aren’t present are kept as 0-row data.frames.
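All combinations of `cyl` and `am` are present in `mtcars`, so to sketch what `drop` does I use `cyl` and `gear` instead, where there are no 8-cylinder cars with 4 gears. The object names here are only for illustration.

```r
grp = list(mtcars$cyl, mtcars$gear)

# Default: the missing combination is kept as a data.frame with 0 rows
keep_all = split(mtcars, grp)
length(keep_all)
# 9
nrow(keep_all[["8.4"]])
# 0

# drop = TRUE removes combinations that don't occur in the data
dropped = split(mtcars, grp, drop = TRUE)
length(dropped)
# 8
"8.4" %in% names(dropped)
# FALSE
```

Dropping the empty elements can save trouble later, since many functions (like `lm()`) will throw an error when handed a data.frame with no rows.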

# Other thoughts on split()

I feel like `split()` was a gateway function for me to get started working with lists and associated convenience functions like `lapply()` and `purrr::map()` for looping through lists. I think learning to work with lists and “list loops” also made the learning curve for list-columns in data.frames and the `nest()`/`unnest()` approach of analysis-by-groups a little less steep for me.

# Just the code, please

Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code from here.

```r
library(purrr) # 0.3.3

dat = structure(list(
  id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                   2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
                   3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
                 .Label = c("a", "b", "c"), class = "factor"),
  var1 = c(4, 2.7, 3.4, 2.7, 4.6, 2.9, 2.2, 4.5, 4.6, 2.4,
           3, 3.8, 2.5, 4, 3.6, 2.7, 4.5, 4.1, 4.2, 2.2,
           4.9, 4.4, 3.6, 3.3, 2.7, 3.9, 4.9, 4.9, 4.3, 3.4),
  var2 = c(6, 22.3, 19.4, 22.8, 18.6, 14.2, 10.9, 22.7, 22.4, 11.7,
           6, 13.3, 12.5, 6.3, 13.6, 20.5, 23.6, 10.9, 8.9, 20.9,
           23.7, 15.9, 22.1, 11.6, 22, 17.7, 21, 20.8, 16.7, 21.4)),
  class = "data.frame", row.names = c(NA, -30L))
head(dat)

dat_list = split(dat, dat$id)
dat_list

map(dat_list, ~lm(var1 ~ var2, data = .x) )

r2 = function(data) {
  fit = lm(var1 ~ var2, data = data)
  broom::glance(fit)
}

r2(data = dat)

map_dfr(dat_list, r2, .id = "id")

mtcars_cylam = split(mtcars, list(mtcars$cyl, mtcars$am) )
mtcars_cylam[1:2]
```
