vignettes/brief-introduction-to-select-syntax.Rmd
Version Note: Up-to-date with v0.3.0
This article briefly introduces the usage of dplyr::select and how it is applied in this package. In the first section, I will briefly describe the dplyr::select syntax for R beginners. If you are already familiar with the dplyr::select syntax, you can skip to the next section, where I describe how to apply the syntax in this package.
dplyr::select (abbreviated as select hereafter) is an extremely powerful function in R. It allows you to subset columns with a set of syntax that is also known as the select syntax / semantics in the R community. A side note here: with the introduction of the dplyr::across function, the select syntax can also be applied to dplyr::mutate and dplyr::filter, which makes these two already powerful functions even more powerful. I will first introduce the usage of :, c() and -. Then, I will discuss how to use everything, starts_with, ends_with, contains, and where. This is not an exhaustive list of the select syntax, but these are the most relevant ones. If you want to learn more, I encourage you to check the dplyr vignettes or search online. There are plenty of articles that discuss this in detail.
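To make the side note about dplyr::across a bit more concrete, here is a minimal sketch, assuming dplyr 1.0.0 or later is installed. The variable name rounded is only illustrative.

```r
library(dplyr)

# Round every numeric column of iris to whole numbers.
# where(is.numeric) is the same selection helper that select() accepts.
rounded <- iris %>%
  mutate(across(where(is.numeric), round))

head(rounded, 3)
```

The Species column is left untouched because it is a factor, not a numeric column.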
I am going to use the iris dataset for the demonstration. Let’s take a quick peek at the dataset.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
If you want to select the first 3 columns, you can use : to do that.
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
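The two outputs above can be reproduced with something like the following sketch, once by column position and once by column name. The names by_position and by_name are only for illustration.

```r
library(dplyr)

# Select the first three columns by position
by_position <- iris %>% select(1:3) %>% head(1)

# Select the same range by column name
by_name <- iris %>% select(Sepal.Length:Petal.Length) %>% head(1)

by_position
by_name
```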
Next, if you want to combine selections, you can use c(). For example, to select the 1st, 3rd and 4th columns, you can do it like this:
Sepal.Length Petal.Length Petal.Width
1 5.1 1.4 0.2
Sepal.Length Petal.Length Petal.Width
1 5.1 1.4 0.2
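A sketch of how the outputs above can be produced with c(), again by position and by name (the object names are illustrative):

```r
library(dplyr)

# Combine individual column positions with c()
by_position <- iris %>% select(c(1, 3, 4)) %>% head(1)

# The same selection written with column names
by_name <- iris %>%
  select(c(Sepal.Length, Petal.Length, Petal.Width)) %>%
  head(1)

by_position
by_name
```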
Finally, if you want to drop a column from the selection, you can use -. For example, to select all columns except the 3rd column, you can do it like this:
Sepal.Length Sepal.Width Petal.Width Species
1 5.1 3.5 0.2 setosa
Sepal.Length Sepal.Width Petal.Width Species
1 5.1 3.5 0.2 setosa
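The outputs above correspond to something like this sketch, which drops the 3rd column (Petal.Length) by position and by name:

```r
library(dplyr)

# Drop the third column by position
by_position <- iris %>% select(-3) %>% head(1)

# Drop the same column by name
by_name <- iris %>% select(-Petal.Length) %>% head(1)

by_position
by_name
```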
OK, now you understand the basic usage. Let’s get to something a little more advanced. First, let’s talk about my favorite, which is everything. As the name implies, it selects all the variables in the data frame. Within the select function, it is usually used in combination with c(). However, it is very powerful in other use cases like the ones in this package. For example, if you want to fit a linear regression with all the variables, you can use everything (a more detailed discussion is presented in the next section).
# select all columns
iris %>% select(everything()) %>% head(1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
Sepal.Length Petal.Length Petal.Width Species
1 5.1 1.4 0.2 setosa
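The second output above, where Sepal.Width is missing, is what you get when everything() is combined with - inside c(). A minimal sketch (the object name all_but_one is illustrative):

```r
library(dplyr)

# Select all columns, then drop Sepal.Width from the selection
all_but_one <- iris %>%
  select(c(everything(), -Sepal.Width)) %>%
  head(1)

all_but_one
```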
Next, we can talk about starts_with. starts_with selects all columns whose names start with a specified string. For example, to select all columns that start with Sepal, we can do something like this:
iris %>% select(starts_with('Sepal')) %>% head(1)
Sepal.Length Sepal.Width
1 5.1 3.5
Similar to starts_with, ends_with selects all columns whose names end with a specified string. For example, we want to select all columns that end with Width.
Sepal.Width Petal.Width
1 3.5 0.2
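The output above can be reproduced with a call along these lines (width_cols is just an illustrative name):

```r
library(dplyr)

# Select all columns whose names end with "Width"
width_cols <- iris %>% select(ends_with('Width')) %>% head(1)

width_cols
```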
Next, we are going to talk about contains. As the name implies, it selects all columns whose names contain a specified string.
Sepal.Length Sepal.Width
1 5.1 3.5
Sepal.Width Petal.Width
1 3.5 0.2
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
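Here is a sketch of contains calls that give the three outputs above. The first two patterns follow directly from the column names shown; the third pattern ('al') is my own guess at one string that matches the four measurement columns but not Species, so treat it as an assumption.

```r
library(dplyr)

# Columns whose names contain "Sepal"
sepal_cols <- iris %>% select(contains('Sepal')) %>% head(1)

# Columns whose names contain "Width"
width_cols <- iris %>% select(contains('Width')) %>% head(1)

# Columns whose names contain "al" (assumed pattern: matches Sepal.* and Petal.*)
al_cols <- iris %>% select(contains('al')) %>% head(1)

sepal_cols
width_cols
al_cols
```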
Finally, we are going to conclude this section with where. where is not used alone. It is usually paired with a function that returns TRUE or FALSE. I think the most common use case for this package is pairing it with is.numeric. where(is.numeric) will select all numeric variables. A little tip: you need to pass is.numeric instead of is.numeric(). I will not go into the detail of why because it is outside the scope of this article. It requires a slightly more advanced understanding of how functions work in R. If you have that, you probably wouldn’t be reading this article anyway.
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
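The output above comes from a call like the one in this sketch (numeric_cols is an illustrative name). Note that the function is.numeric is passed without parentheses.

```r
library(dplyr)

# Select every column for which is.numeric() returns TRUE.
# Pass the function itself (is.numeric), not a call to it (is.numeric()).
numeric_cols <- iris %>% select(where(is.numeric)) %>% head(1)

numeric_cols
```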
First, I will demonstrate the usage with linear regression. I will first create a data frame. You don’t need to know anything about how this data frame is created. Just know that it has 1 DV / outcome / response variable (i.e., y) and 5 IV / predictor variables (i.e., x1 to x5).
set.seed(1)
test_data = data.frame(y = rnorm(n = 100, mean = 2, sd = 3),
                       x1 = rnorm(n = 100, mean = 1.5, sd = 4),
                       x2 = rnorm(n = 100, mean = 1.7, sd = 4),
                       x3 = rnorm(n = 100, mean = 1.5, sd = 4),
                       x4 = rnorm(n = 100, mean = 2, sd = 4),
                       x5 = rnorm(n = 100, mean = 1.5, sd = 4))
Ok, let’s fit that linear regression now.
# Without this package:
model1 = lm(data = test_data, formula = y ~ x1 + x2 + x3 + x4 + x5)
# With this package:
model2 = lm_model(data = test_data,
                  response_variable = y,
                  predictor_variable = c(everything(), -y))
Fitting Model with lm:
Formula = y ~ x1 + x2 + x3 + x4 + x5
This is already a step up from the basic lm() function. We can still make it even simpler by just passing everything(). The function is designed to remove the response variable from the predictor variables (if selected) automatically. The following model3 is the same as model2:
model3 = lm_model(data = test_data,
                  response_variable = y,
                  predictor_variable = everything())
Fitting Model with lm:
Formula = y ~ x1 + x2 + x3 + x4 + x5
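If you are curious how "all columns except the response" turns into the formula printed above, here is a rough base-R sketch of the idea. This is not the package's actual implementation, just an illustration under that assumption; the names predictors, model_formula and fit are hypothetical.

```r
set.seed(1)
test_data <- data.frame(y = rnorm(n = 100, mean = 2, sd = 3),
                        x1 = rnorm(n = 100, mean = 1.5, sd = 4),
                        x2 = rnorm(n = 100, mean = 1.7, sd = 4),
                        x3 = rnorm(n = 100, mean = 1.5, sd = 4),
                        x4 = rnorm(n = 100, mean = 2, sd = 4),
                        x5 = rnorm(n = 100, mean = 1.5, sd = 4))

# Take every column name, then remove the response variable
predictors <- setdiff(names(test_data), "y")

# Build the formula from the remaining names
model_formula <- reformulate(predictors, response = "y")
model_formula

# Fit the model with base lm()
fit <- lm(model_formula, data = test_data)
```

The point is only that the response is dropped from the selected columns before the formula is built, which is why passing everything() alone is enough.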
The same logic applies to all other functions in this package. Arguments that support dplyr::select syntax will end with “support dplyr::select syntax” in the description of the argument.
That’s it for this brief introduction. If you want to learn more about this package, I encourage you to check out this article or use vignette('quick-introduction') if you are in RStudio.