Overview
The tidyverse cheat sheet guides you through some general information on the tidyverse and then covers topics such as useful functions, loading your data, manipulating it with dplyr, and finally visualizing it with ggplot2. In short, everything you need to kickstart your data science learning with R! dplyr provides a grammar for manipulating tables in R. This cheat sheet will guide you through that grammar, reminding you how to select, filter, arrange, mutate, summarise, group, and join data frames and tibbles.
dplyr is an R package for working with structured data both in and outside of R. dplyr makes data manipulation for R users easy, consistent, and performant. With dplyr as an interface to manipulating Spark DataFrames, you can:
- Select, filter, and aggregate data
- Use window functions (e.g. for sampling)
- Perform joins on DataFrames
- Collect data from Spark into R
Statements in dplyr can be chained together using pipes defined by the magrittr R package. dplyr also supports non-standard evaluation of its arguments. For more information on dplyr, see the introduction, a guide for connecting to databases, and a variety of vignettes.
Reading Data
You can read data into Spark DataFrames using the following functions:
| Function | Description |
|---|---|
| spark_read_csv | Reads a CSV file and provides a data source compatible with dplyr |
| spark_read_json | Reads a JSON file and provides a data source compatible with dplyr |
| spark_read_parquet | Reads a Parquet file and provides a data source compatible with dplyr |
Regardless of the format of your data, Spark supports reading data from a variety of different data sources. These include data stored on HDFS (hdfs:// protocol), Amazon S3 (s3n:// protocol), or local files available to the Spark worker nodes (file:// protocol).
Each of these functions returns a reference to a Spark DataFrame which can be used as a dplyr table (tbl).
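For example, reading a CSV file might look like the following sketch (the connection object sc and the file path are placeholders, not from the original text):

```r
library(sparklyr)
library(dplyr)

# sc is an existing Spark connection created with spark_connect();
# the path below is a placeholder for your own data
flights_csv_tbl <- spark_read_csv(sc, name = "flights_csv",
                                  path = "hdfs://path/to/flights.csv")
```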
Flights Data
This guide will demonstrate some of the basic data manipulation verbs of dplyr by using data from the nycflights13 R package. This package contains data for all 336,776 flights departing New York City in 2013. It also includes useful metadata on airlines, airports, weather, and planes. The data comes from the US Bureau of Transportation Statistics, and is documented in ?nycflights13.
Connect to the cluster and copy the flights data using the copy_to function. Caveat: the flight data in nycflights13 is convenient for dplyr demonstrations because it is small, but in practice large data should rarely be copied directly from R objects.
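A minimal sketch of that workflow, assuming a local Spark connection:

```r
library(sparklyr)
library(dplyr)
library(nycflights13)

sc <- spark_connect(master = "local")

# Copy the flights data frame into Spark and keep a dplyr reference (tbl) to it
flights_tbl <- copy_to(sc, flights, "flights", overwrite = TRUE)
```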
dplyr Verbs
Verbs are dplyr commands for manipulating data. When connected to a Spark DataFrame, dplyr translates the commands into Spark SQL statements. Remote data sources use exactly the same five verbs as local data sources. Here are the five verbs with their corresponding SQL commands:
| dplyr verb | SQL equivalent |
|---|---|
| select | SELECT |
| filter | WHERE |
| arrange | ORDER |
| summarise | aggregators: sum, min, sd, etc. |
| mutate | operators: +, *, log, etc. |
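As an illustrative sketch (not from the original text), the five verbs can be combined on the copied flights table, with column names taken from nycflights13:

```r
# Departure/arrival delay gain for long flights, sorted from best to worst
flights_tbl %>%
  select(carrier, dep_delay, arr_delay, distance) %>%
  filter(distance > 1000) %>%
  mutate(gain = dep_delay - arr_delay) %>%
  arrange(desc(gain))

# A single-row aggregate
flights_tbl %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE))
```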
Laziness
When working with databases, dplyr tries to be as lazy as possible:
- It never pulls data into R unless you explicitly ask for it.
- It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.
For example, take the following code:
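A representative sketch of such a chain (column names from nycflights13; flights_tbl is the Spark reference created earlier):

```r
c1 <- filter(flights_tbl, day == 17, month == 5,
             carrier %in% c("UA", "WN", "AA", "DL"))
c2 <- select(c1, year, month, day, carrier, dep_delay, air_time, distance)
c3 <- mutate(c2, air_time_hours = air_time / 60)
c4 <- arrange(c3, year, month, day, carrier)
```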
This sequence of operations never actually touches the database. It's not until you ask for the data (e.g. by printing c4) that dplyr requests the results from the database.
Piping
You can use magrittr pipes to write cleaner syntax. Using the same example from above, you can write a much cleaner version like this:
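A piped sketch of the same chain:

```r
c4 <- flights_tbl %>%
  filter(month == 5, day == 17, carrier %in% c("UA", "WN", "AA", "DL")) %>%
  select(year, month, day, carrier, dep_delay, air_time, distance) %>%
  mutate(air_time_hours = air_time / 60) %>%
  arrange(year, month, day, carrier)
```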
Grouping
The group_by function corresponds to the GROUP BY statement in SQL.
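For example, a grouped aggregate over the flights reference (a sketch):

```r
flights_tbl %>%
  group_by(carrier) %>%
  summarise(count = n(),
            mean_dep_delay = mean(dep_delay, na.rm = TRUE))
```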
Collecting to R
You can copy data from Spark into R's memory by using collect(). collect() executes the Spark query and returns the results to R for further analysis and visualization.
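Continuing the earlier sketch, collecting the lazily defined c4 runs the query and returns a local data frame:

```r
local_c4 <- collect(c4)  # executes the Spark query; local_c4 is an ordinary R data frame
```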
SQL Translation
It's relatively straightforward to translate R code to SQL (or indeed to any programming language) when doing simple mathematical operations of the form you normally use when filtering, mutating, and summarizing. dplyr knows how to convert many common R functions (basic arithmetic operators, math functions, logical comparisons, and aggregations) to Spark SQL.
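You can inspect the SQL that a chain will generate with show_query(); for example (a sketch):

```r
flights_tbl %>%
  filter(dep_delay > 60) %>%
  summarise(n = n()) %>%
  show_query()
```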
Window Functions
dplyr supports Spark SQL window functions. Window functions are used in conjunction with mutate and filter to solve a wide range of problems. You can compare the dplyr syntax to the query it has generated by using dbplyr::sql_render().
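A sketch of that pattern, keeping each day's best and worst departure delays and rendering the windowed SQL (adapted from common sparklyr examples):

```r
bestworst <- flights_tbl %>%
  group_by(year, month, day) %>%
  select(dep_delay) %>%
  filter(dep_delay == min(dep_delay) || dep_delay == max(dep_delay))

dbplyr::sql_render(bestworst)
```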
Performing Joins
It's rare that a data analysis involves only a single table of data. In practice, you'll normally have many tables that contribute to an analysis, and you need flexible tools to combine them. In dplyr, there are three families of verbs that work with two tables at a time:
- Mutating joins, which add new variables to one table from matching rows in another.
- Filtering joins, which filter observations from one table based on whether or not they match an observation in the other table.
- Set operations, which combine the observations in the data sets as if they were set elements.
All two-table verbs work similarly. The first two arguments are x and y, and provide the tables to combine. The output is always a new table with the same type as x.
The following statements are equivalent:
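For example, assuming the airlines metadata has also been copied to Spark as airlines_tbl, these joins are equivalent because carrier is the only column the two tables share:

```r
flights_tbl %>% left_join(airlines_tbl)
flights_tbl %>% left_join(airlines_tbl, by = "carrier")
```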
Sampling
You can use sample_n() and sample_frac() to take a random sample of rows: use sample_n() for a fixed number and sample_frac() for a fixed fraction.
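For example (a sketch on the flights reference):

```r
sample_n(flights_tbl, 10)       # 10 random rows
sample_frac(flights_tbl, 0.01)  # a random 1% of the rows
```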
Writing Data
It is often useful to save the results of your analysis or the tables that you have generated on your Spark cluster into persistent storage. The best option in many scenarios is to write the table out to a Parquet file using the spark_write_parquet function. For example:
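A sketch, with a placeholder HDFS path:

```r
spark_write_parquet(tbl, "hdfs://path/to/data")
```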
This will write the Spark DataFrame referenced by the tbl R variable to the given HDFS path. You can use the spark_read_parquet function to read the same table back into a subsequent Spark session:
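Again with a placeholder path:

```r
tbl <- spark_read_parquet(sc, "data", "hdfs://path/to/data")
```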
You can also write data as CSV or JSON using the spark_write_csv and spark_write_json functions.
Hive Functions
Many of Hive's built-in functions (UDF) and built-in aggregate functions (UDAF) can be called inside dplyr's mutate and summarize. The Language Reference UDF page provides the list of available functions.
The following example uses the datediff and current_date Hive UDFs to compute the difference between the flight_date and the current system date:
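A sketch of that example, assuming the flights_tbl reference from earlier (flight_date is built on the fly from the year, month, and day columns):

```r
flights_tbl %>%
  mutate(flight_date = paste(year, month, day, sep = "-"),
         days_since = datediff(current_date(), flight_date)) %>%
  group_by(flight_date, days_since) %>%
  tally() %>%
  arrange(-days_since)
```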
Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. Enter dplyr. dplyr is a package for making data manipulation easier.
Packages in R are basically sets of additional functions that let you do more stuff. The functions we've been using so far, like str() or data.frame(), come built into R; packages give you access to more of them. Before you use a package for the first time you need to install it on your machine, and then you should import it in every subsequent R session when you need it.
While we’re installing stuff, let’s also install the ggplot2 package, which we’ll use next.
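A minimal sketch of the installation and loading steps:

```r
install.packages("dplyr")    # only needed once per machine
install.packages("ggplot2")

library(dplyr)               # needed in every session where you use it
```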
You might get asked to choose a CRAN mirror – this is basically asking you to choose a site to download the package from. The choice doesn’t matter too much; we recommend the RStudio mirror.
What is dplyr?
The package dplyr provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames. The thinking behind it was largely inspired by the package plyr, which has been in use for some time but suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query are returned.
This addresses a common problem with R: all operations are conducted in memory, so the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation, in that you can have a database of many hundreds of gigabytes, conduct queries on it directly, and pull back into R just what you need for analysis.
Selecting columns and filtering rows
We're going to learn some of the most common dplyr functions: select(), filter(), mutate(), group_by(), and summarize(). To select columns of a data frame, use select(). The first argument to this function is the data frame (surveys), and the subsequent arguments are the columns to keep.
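For example (a sketch; the column names assume the Portal surveys data used in this lesson):

```r
select(surveys, species_id, sex, weight)
```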
To choose rows, use filter():
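For instance, to keep only the rows from a given year (a sketch):

```r
filter(surveys, year == 1995)
```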
Pipes
The pipe operator (%>%) from the magrittr package makes it easy to chain these actions together: the output of one function becomes the input of the next.
Typing %>% is another cumbersome bit of typing. In RStudio, type Ctrl + Shift + M and the %>% operator will be inserted for you.
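The chained example discussed below might look like this (a sketch using the column names mentioned in the text):

```r
surveys %>%
  filter(wgt < 5) %>%
  select(species, sex)
```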
In the above we use the pipe to send the surveys data set first through filter, to keep rows where wgt was less than 5, and then through select to keep the species and sex columns. When the data frame is being passed to the filter() and select() functions through a pipe, we don't need to include it as an argument to these functions anymore.
If we wanted to create a new object with this smaller version of the data we could do so by assigning it a new name:
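For example (the name surveys_sml is just an illustration):

```r
surveys_sml <- surveys %>%
  filter(wgt < 5) %>%
  select(species, sex)
```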
Note that the final data frame is the leftmost part of this expression.
Challenge
Using pipes, subset the data to include individuals collected before 1995, and retain the columns year, sex, and weight.
Mutate
Frequently you'll want to create new columns based on the values in existing columns, for example to do unit conversions, or find the ratio of values in two columns. For this we'll use mutate().
To create a new column of weight in kg:
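A sketch, assuming the column is called weight and is recorded in grams (some versions of this dataset name it wgt):

```r
surveys %>%
  mutate(weight_kg = weight / 1000)
```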
If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head() of the data (pipes work with non-dplyr functions too, as long as the dplyr or magrittr packages are loaded).
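For example:

```r
surveys %>%
  mutate(weight_kg = weight / 1000) %>%
  head()
```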
The first few rows are full of NAs, so if we wanted to remove those we could insert a filter() in this chain:
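A sketch of that chain:

```r
surveys %>%
  filter(!is.na(weight)) %>%
  mutate(weight_kg = weight / 1000) %>%
  head()
```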
is.na() is a function that determines whether something is or is not an NA. The ! symbol negates it, so we're asking for everything that is not an NA.
Challenge
Create a new dataframe from the survey data that meets the following criteria: it contains only the species_id column and a column that contains values that are the square-root of hindfoot_length values (e.g. a new column hindfoot_sqrt). In this hindfoot_sqrt column, there are no NA values and all values are < 3.
Hint: think about how the commands should be ordered.
Split-apply-combine data analysis and the summarize() function
Many data analysis tasks can be approached using the "split-apply-combine" paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function. group_by() splits the data into groups upon which some operations can be run. For example, if we wanted to group by sex and find the number of rows of data for each sex, we would do:
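A sketch:

```r
surveys %>%
  group_by(sex) %>%
  tally()
```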
Here, tally() is the action applied to the groups created by group_by() and counts the total number of records for each category. group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. So to view mean weight by sex:
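A sketch (again assuming the weight column is named weight):

```r
surveys %>%
  group_by(sex) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))
```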
You can group by multiple columns too:
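For example, grouping by both sex and species_id (a sketch):

```r
surveys %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))
```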
It looks like most of these species were never weighed. We could then discard rows where mean_weight is NA with filter():
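Extending the previous sketch:

```r
surveys %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE)) %>%
  filter(!is.na(mean_weight))
```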
Another thing we might do here is sort rows by mean_weight, using arrange().
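Continuing the same chain (a sketch):

```r
surveys %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE)) %>%
  filter(!is.na(mean_weight)) %>%
  arrange(mean_weight)
```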
If you want them sorted from highest to lowest, use desc().
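For example:

```r
surveys %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE)) %>%
  filter(!is.na(mean_weight)) %>%
  arrange(desc(mean_weight))
```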
Also note that you can include multiple summaries.
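For instance (a sketch):

```r
surveys %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE),
            min_weight = min(weight, na.rm = TRUE),
            n_obs = n())
```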
Challenge
How many times was each plot_type surveyed?
Challenge
Use group_by() and summarize() to find the mean, min, and max hindfoot length for each species.
Challenge
What was the heaviest animal measured in each year? Return the columns year, genus, species, and weight.
Hint: Use filter() rather than summarize().
A bit of data cleaning
In preparation for the plotting, let's do a bit of data cleaning: remove rows with missing species_id, weight, hindfoot_length, or sex.
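A sketch of that cleaning step (the name surveys_complete is an assumption, and some versions of the dataset use empty strings rather than NA for missing species_id or sex):

```r
surveys_complete <- surveys %>%
  filter(!is.na(species_id),
         !is.na(weight),
         !is.na(hindfoot_length),
         !is.na(sex))
```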
There are a lot of species with low counts. Let's remove the species with fewer than 10 counts.
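A sketch of that filtering step, building on surveys_complete from above:

```r
# Count records per species, keep species with at least 10 records,
# then keep only the rows belonging to those species
species_counts <- surveys_complete %>%
  group_by(species_id) %>%
  tally() %>%
  filter(n >= 10)

surveys_complete <- surveys_complete %>%
  filter(species_id %in% species_counts$species_id)
```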