Use of this document

This is a study note on tibbles, data import, parsing vectors and files, writing to a file, and tidy data, which are all parts of data wrangling. For more details on the study material see https://r4ds.had.co.nz/wrangle-intro.html.

Prerequisites

library(dplyr) # data manipulation
library(tibble) # tibble form
library(readr) # read data
library(tidyr) # tidy messy data

1. tibble

The \(tibble\) package is a part of the core \(tidyverse\).

1.1 Tibbles vs. data.frame

  • It never changes the type of the inputs (e.g. it never converts strings to factors!).
  • It never changes the names of variables, and it never creates row names.
  • It’s possible for a tibble to have column names that are not valid R variable names.
  • Tibbles are much easier to work with for large data: their refined print method shows only the first 10 rows and the columns that fit on screen. Customize printing with print(n = 10, width = Inf).
  • Tibbles are more strict: they never do partial matching. [[ can extract by name or position; $ only extracts by name.
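A quick sketch of these differences (the exact munged name is an assumption; run this in your own session to confirm):

```r
library(tibble)

# data.frame() munges non-syntactic names; tibble() keeps them
df <- data.frame(`:)` = 1:3)
names(df) # munged to a syntactic name such as "X.."
tb <- tibble(`:)` = 1:3)
names(tb) # ":)" is kept as-is; access it with backticks: tb$`:)`

# $ on a data.frame does partial matching; tibbles never do
df2 <- data.frame(abc = 1)
df2$a # 1, via partial matching
tb2 <- tibble(abc = 1)
tb2$a # NULL, with a warning about an unknown column
```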

2. readr

2.1 Data Input

The \(readr\) package is part of the core \(tidyverse\).

  • read_csv(): reads comma delimited files.

    • use skip = n to skip the first n lines, or comment = "#" to drop all lines that start with (e.g.) #, e.g. read_csv(x, skip = 2).
    • use col_names = FALSE to tell read_csv() not to treat the first row as headings, e.g. read_csv(x, col_names = FALSE).
  • read_csv2(): reads semicolon separated files (common in countries where , is used as the decimal place),

  • read_tsv() reads tab delimited files, and read_delim() reads in files with any delimiter.

  • read_fwf() reads fixed width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions().

  • read_table() reads a common variation of fixed width files where columns are separated by white space.

  • read_log() reads Apache style log files. (But also check out webreadr which is built on top of read_log() and provides many more helpful tools.)

  • problems(x) gives access to the failures when parsing goes wrong; stop_for_problems(x) throws an error if there are any parsing problems.
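The skip, comment, and col_names options can be tried out on inline strings:

```r
library(readr)

# skip the first two metadata lines
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)

# drop lines starting with #
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")

# the first row is data, not headings; supply names yourself
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
```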

For more information, check out the Note-readFile within the Note folder.

2.1.1 read_csv() vs. read.csv()

  • They are typically much faster (~10x) than their base equivalents.
  • They produce tibbles, they don’t convert character vectors to factors, use row names, or munge the column names.
  • They are more reproducible. Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else’s.

2.2 Parsing a vector

The parse_*() functions take a character vector and return a more specialised vector such as a logical, integer, or date. Typical usage:

parse_logical(c("TRUE", "FALSE", "NA")) %>% str()
##  logi [1:3] TRUE FALSE NA
parse_integer(c("1", "2", "3")) %>% str()
##  int [1:3] 1 2 3
parse_date(c("2010-01-01", "1979-10-14")) %>% str()
##  Date[1:2], format: "2010-01-01" "1979-10-14"
parse_integer(c("123", "345", "abc", "123.45")) %>% problems()
## # A tibble: 2 x 4
##     row   col expected               actual
##   <int> <int> <chr>                  <chr> 
## 1     3    NA an integer             abc   
## 2     4    NA no trailing characters .45

Using parsers is mostly a matter of understanding what’s available and how they deal with different types of input. There are eight particularly important parsers:

2.2.1 Numbers

parse_double() is a strict numeric parser, and parse_number() is a flexible numeric parser. These are more complicated than you might expect because different parts of the world write numbers in different ways.

  • Issue 1: People write numbers differently in different parts of the world. For example, some countries use . between the integer and fractional parts of a real number, while others use ,. Solution:

    parse_double("1,23", locale = locale(decimal_mark = ","))
    ## [1] 1.23
  • Issue 2: Numbers are often surrounded by other characters that provide some context, like “$1000” or “10%”. Solution:

    parse_number("It cost $123.45")
    ## [1] 123.45
  • Issue 3: Numbers often contain “grouping” characters to make them easier to read, like “1,000,000”, and these grouping characters vary around the world. Solution:

    parse_number("$123,456,789") # default is the American grouping mark 
    ## [1] 123456789
    parse_number("123.456.789", locale = locale(grouping_mark = ".")) # Used in many parts of Europe
    ## [1] 123456789
    parse_number("123'456'789", locale = locale(grouping_mark = "'")) # Used in Switzerland
    ## [1] 123456789

2.2.2 Strings

parse_character() seems so simple that it shouldn’t be necessary. But one complication makes it quite important: character encodings. \(readr\) uses UTF-8 encoding as the default for reading and writing. If the data is not UTF-8 encoded, guess the encoding and pass it to the parser via locale():

x1 <- "El Ni\xf1o was particularly bad this year"
guess_encoding(charToRaw(x1)) # Latin1 (aka ISO-8859-1)
## # A tibble: 2 x 2
##   encoding   confidence
##   <chr>           <dbl>
## 1 ISO-8859-1       0.46
## 2 ISO-8859-9       0.23
x1 %>% parse_character(locale = locale(encoding = "Latin1"))
## [1] "El Niño was particularly bad this year"

2.2.3 Factors

parse_factor() creates factors, the data structure that R uses to represent categorical variables with fixed and known values. But if you have many problematic entries, it’s often easier to leave them as character vectors and then use the tools you’ll learn about in strings and factors to clean them up.

fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
## [1] apple  banana <NA>  
## attr(,"problems")
## # A tibble: 1 x 4
##     row   col expected           actual  
##   <int> <int> <chr>              <chr>   
## 1     3    NA value in level set bananana
## Levels: apple banana

2.2.4 Dates, date-times, and times

parse_datetime(), parse_date(), and parse_time() allow you to parse various date & time specifications. These are the most complicated because there are so many different ways of writing dates.

  • parse_datetime() expects an ISO8601 date-time, which is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.

    parse_datetime("2010-10-01T2010")
    ## [1] "2010-10-01 20:10:00 UTC"
    parse_datetime("20101010")
    ## [1] "2010-10-10 UTC"
  • parse_date() expects a four digit year, a - or /, the month, a - or /, then the day:

    parse_date("2010-10-01")
    ## [1] "2010-10-01"
  • parse_time() expects the hour, :, minutes, optionally : and seconds, and an optional am/pm specifier:

    library(hms)
    parse_time("01:10 am")
    ## 01:10:00
    parse_time("20:10:01")
    ## 20:10:01
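If these defaults don’t work for your data, you can supply your own format string built from pieces like %Y, %m, %d, %H, %M:

```r
parse_date("01/02/15", "%m/%d/%y")
## [1] "2015-01-02"
parse_date("01/02/15", "%d/%m/%y")
## [1] "2015-02-01"
# month names depend on the locale
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
## [1] "2015-01-01"
```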

2.2.5 Logical and Integer

parse_logical() and parse_integer() parse logicals and integers respectively. There’s basically nothing that can go wrong with these parsers so I won’t describe them here further.
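One option that does matter for all parsers is na, which specifies which strings should be treated as missing:

```r
parse_integer(c("1", "231", ".", "456"), na = ".")
## [1]   1 231  NA 456
```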

2.3 Parsing a file

Parsing a file means quickly identifying and assigning a type to each column (attribute) in the input data set.

2.3.1 Strategy

\(readr\) reads the first 1000 rows and uses some (moderately conservative) heuristics to figure out the type of each column. You can emulate this process with a character vector using guess_parser(), which returns readr’s best guess, and parse_guess(), which uses that guess to parse the column:

guess_parser("2010-10-01")
## [1] "date"
str(parse_guess("2010-10-10"))
##  Date[1:1], format: "2010-10-10"

The heuristic tries each of the following types, stopping when it finds a match:

  • logical: contains only “F”, “T”, “FALSE”, or “TRUE”.
  • integer: contains only numeric characters (and -).
  • double: contains only valid doubles (including numbers like 4.5e-5).
  • number: contains valid doubles with the grouping mark inside.
  • time: matches the default time_format.
  • date: matches the default date_format.
  • date-time: any ISO8601 date.

If none of these rules apply, then the column will stay as a vector of strings.
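You can probe the heuristic directly with guess_parser() (exact results may vary slightly between readr versions):

```r
guess_parser(c("TRUE", "FALSE"))
## [1] "logical"
guess_parser("1,234,566")
## [1] "number"
guess_parser("15:01")
## [1] "time"
```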
If you encounter parsing problems, check the link for more details: https://r4ds.had.co.nz/data-import.html. If reading a very large file, you might want to set n_max to a smallish number like 10,000 or 100,000. That will accelerate your iterations while you eliminate common problems.

challenge <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
## Parsed with column specification:
## cols(
##   x = col_double(),
##   y = col_date(format = "")
## )
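Once you know the right types, it is usually best to stop guessing and supply an explicit column specification:

```r
library(readr)

challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)
```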

2.4 Writing to a file

\(readr\) also comes with two useful functions for writing data back to disk:

  • write_csv() and write_tsv(): write data back to disk. To increase the chances of the output file being read back in correctly, they:
    • always encode strings in UTF-8, and
    • save dates and date-times in ISO8601 format so they are easily parsed elsewhere.
  • write_excel_csv() : this writes a special character (a “byte order mark”) at the start of the file which tells Excel that you’re using the UTF-8 encoding.
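For example, writing challenge back out and reading it in again forces readr to re-guess the column types from scratch (assuming challenge was created as above):

```r
write_csv(challenge, "challenge-2.csv")
# the column types must be re-guessed on read, and may not match the original spec
read_csv("challenge-2.csv")
```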

Note that the column type information is lost when you save to CSV. The feather package implements a fast binary file format that can be shared across programming languages:

library(feather)
write_feather(challenge, "challenge.feather")
read_feather("challenge.feather")
## # A tibble: 2,000 x 2
##        x y         
##    <dbl> <date>    
##  1   404 NA        
##  2  4172 NA        
##  3  3004 NA        
##  4   787 NA        
##  5    37 NA        
##  6  2332 NA        
##  7  2489 NA        
##  8  1449 NA        
##  9  3665 NA        
## 10  3863 NA        
## # … with 1,990 more rows

2.5. Other types of data

To get other types of data into R, we recommend starting with the tidyverse packages listed below. They’re certainly not perfect, but they are a good place to start. For rectangular data:

  • \(haven\) reads SPSS, Stata, and SAS files.

  • \(readxl\) reads Excel files (both .xls and .xlsx).

  • \(DBI\), along with a database specific backend (e.g. \(RMySQL\), \(RSQLite\), \(RPostgreSQL\) etc) allows you to run SQL queries against a database and return a data frame.

For hierarchical data: use \(jsonlite\) (by Jeroen Ooms) for json, and \(xml2\) for XML. Jenny Bryan has some excellent worked examples at https://jennybc.github.io/purrr-tutorial/.

3. tidyr

Tidy data is a consistent way to organise your data in R. Getting your data into this format requires some upfront work, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand. There are three interrelated rules which make a dataset tidy:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

These three rules are interrelated because it’s impossible to only satisfy two of the three. That interrelationship leads to an even simpler set of practical instructions: put each dataset in a tibble, and put each variable in a column.

3.1 Two main advantages of tidy data

\(dplyr\), \(ggplot2\), and all the other packages in the \(tidyverse\) are designed to work with tidy data.

  • There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
  • There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.
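For example, with the tidy table1 (built into tidyr), computing a rate per 10,000 people is a single vectorised mutate():

```r
library(dplyr)
library(tidyr)

table1 %>% 
  mutate(rate = cases / population * 10000)
```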

3.2 Spreading and gathering

To resolve one of two common problems:

  • One variable might be spread across multiple columns.
  • One observation might be scattered across multiple rows.

3.2.1 gathering

A common problem is a dataset where some of the column names are not names of variables, but values of a variable. gather() gathers those columns into a new pair of variables. You need three pieces of information:

  • Column names: the set of columns that represent values, not variables.
  • key: the name of the variable whose values form the column names.
  • value: the name of the variable whose values are spread over the cells.
table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4b
## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583
tidy4a <- table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>% 
  gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
## Joining, by = c("country", "year")
## # A tibble: 6 x 4
##   country     year   cases population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 1999     745   19987071
## 2 Brazil      1999   37737  172006362
## 3 China       1999  212258 1272915272
## 4 Afghanistan 2000    2666   20595360
## 5 Brazil      2000   80488  174504898
## 6 China       2000  213766 1280428583

3.2.2 Spreading

spread() is the opposite of gather(). You use it when an observation is scattered across multiple rows. You need two pieces of information:

  • key: The column that contains variable names.
  • value: The column that contains values from multiple variables.
table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table2 %>%
    spread(key = type, value = count)
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

3.3 Separating and uniting

3.3.1 Separating

separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears.

  • By default, separate() will split values wherever it sees a non-alphanumeric character. If you wish to use a specific character to separate a column, you can pass the character to the sep argument of separate().
  • Ask separate() to try to convert the new columns to better types using convert = TRUE.
table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
table3 %>% 
  separate(rate, into = c("cases", "population"), sep = ":") # ":" never appears in rate, so population is filled with NA
## # A tibble: 6 x 4
##   country      year cases             population
##   <chr>       <int> <chr>             <chr>     
## 1 Afghanistan  1999 745/19987071      <NA>      
## 2 Afghanistan  2000 2666/20595360     <NA>      
## 3 Brazil       1999 37737/172006362   <NA>      
## 4 Brazil       2000 80488/174504898   <NA>      
## 5 China        1999 212258/1272915272 <NA>      
## 6 China        2000 213766/1280428583 <NA>
table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE)
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
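You can also pass a vector of integers to sep, which separate() will interpret as positions to split at (this is how table5, used in the next section, was created):

```r
library(tidyr)

table3 %>% 
  separate(year, into = c("century", "year"), sep = 2)
```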

3.3.2 Uniting

unite() is the inverse of separate(): it combines multiple columns into a single column.

  • By default, unite() will place an underscore (_) between the values from different columns; use the sep argument to change this (here sep = "" to get no separator).
table5 
## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583
table5 %>% 
  unite(new, century, year, sep = "")
## # A tibble: 6 x 3
##   country     new   rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

3.4 Missing values

Changing the representation of a dataset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways:

  • Explicitly, i.e. flagged with NA.
  • Implicitly, i.e. simply not present in the data.
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

Make the implicit missing value explicit by putting years in the columns:

stocks
## # A tibble: 7 x 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4  NA   
## 5  2016     2   0.92
## 6  2016     3   0.17
## 7  2016     4   2.66
stocks %>% 
  spread(year, return)
## # A tibble: 4 x 3
##     qtr `2015` `2016`
##   <dbl>  <dbl>  <dbl>
## 1     1   1.88  NA   
## 2     2   0.59   0.92
## 3     3   0.35   0.17
## 4     4  NA      2.66

Set na.rm = TRUE in gather() to turn explicit missing values into implicit ones:

stocks %>% 
  spread(year, return) %>% 
  gather(year, return, `2015`:`2016`, na.rm = TRUE)
## # A tibble: 6 x 3
##     qtr year  return
##   <dbl> <chr>  <dbl>
## 1     1 2015    1.88
## 2     2 2015    0.59
## 3     3 2015    0.35
## 4     2 2016    0.92
## 5     3 2016    0.17
## 6     4 2016    2.66

complete() takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit NAs where necessary.

stocks %>% 
  complete(year, qtr)
## # A tibble: 8 x 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4  NA   
## 5  2016     1  NA   
## 6  2016     2   0.92
## 7  2016     3   0.17
## 8  2016     4   2.66

fill() takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).

treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)
treatment
## # A tibble: 4 x 3
##   person           treatment response
##   <chr>                <dbl>    <dbl>
## 1 Derrick Whitmore         1        7
## 2 <NA>                     2       10
## 3 <NA>                     3        9
## 4 Katherine Burke          1        4
treatment %>% 
  fill(person)
## # A tibble: 4 x 3
##   person           treatment response
##   <chr>                <dbl>    <dbl>
## 1 Derrick Whitmore         1        7
## 2 Derrick Whitmore         2       10
## 3 Derrick Whitmore         3        9
## 4 Katherine Burke          1        4